Recommended Workflow for High Availability

More than two PoolParty instances can form a high availability cluster. The cluster is centralized, with one master node and multiple slave nodes.

PoolParty includes four components:

  • Thesaurus Server
  • Extractor
  • GraphSearch
  • UnifiedViews

Thesaurus Server can only be clustered in Active/Passive mode; the other three components can be clustered in Active/Active mode.

Installation

Once the PoolParty installation on the master node is ready, install PoolParty on the slave nodes by following the guide for your operating system: Install and Configure Additional Extractor Instance for High Availability (Windows) or Install and Configure Additional Extractor Instance for High Availability (Linux).

Post Installation Configuration

When both the master and slave nodes are configured and online, follow this post-installation configuration process:

  1. Export all projects that the cluster should serve from the Thesaurus Server on the master node. On each slave node, create new projects in the Thesaurus Server and import the exported projects into them.
  2. (For the Semantic Integrator edition) Open the Admin Web UI of GraphSearch on each slave node and configure it to match the configuration on the master node.
  3. Regularly synchronize the projects in the Thesaurus Server on the slave nodes from the master node. This can be done by either:
    • Developing a custom cron task that uses the PoolParty project export and import APIs
    • (For the Semantic Integrator edition) Using the synchronization pipeline in UnifiedViews
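
The custom cron task option in step 3 could be structured roughly as follows. This is a minimal sketch, not a PoolParty-provided script: export_project and import_project are hypothetical callables that you would implement on top of the PoolParty project export and import APIs, and the project and slave identifiers are placeholders.

```python
from typing import Callable, Iterable, List, Tuple


def sync_projects(
    project_ids: Iterable[str],
    slaves: Iterable[str],
    export_project: Callable[[str], bytes],
    import_project: Callable[[str, str, bytes], None],
) -> List[Tuple[str, str, str]]:
    """Copy each project from the master to every slave.

    export_project(project_id) and import_project(slave, project_id, data)
    are hypothetical wrappers around the PoolParty export and import APIs.
    Returns a list of (project_id, node, error message) for failed steps.
    """
    slaves = list(slaves)
    failures: List[Tuple[str, str, str]] = []
    for project_id in project_ids:
        try:
            # Export once from the master, then reuse the data for all slaves.
            data = export_project(project_id)
        except Exception as exc:
            failures.append((project_id, "master", str(exc)))
            continue
        for slave in slaves:
            try:
                import_project(slave, project_id, data)
            except Exception as exc:
                # Keep going so one broken slave does not block the others.
                failures.append((project_id, slave, str(exc)))
    return failures
```

A cron entry would call sync_projects at the chosen synchronization interval and alert on a non-empty result. Exporting once per project keeps the load on the master node independent of the number of slave nodes.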

Extractor and GraphSearch depend on the PoolParty search index, which is managed by Solr. They are therefore clustered whenever Solr is clustered, and no additional configuration is required.

Cluster Behaviors

After the installation and configuration described above, the cluster is online and ready to use. The expected behaviors are listed below and should be verified before load balancing is set up:

  • All project updates on the master node are propagated to all slave nodes. Thesaurus Server thesaurus services (i.e., API request paths starting with /PoolParty/api/thesaurus) are accessible from slave nodes, with a lag of up to the project synchronization interval.
  • Extraction models calculated on the master node are propagated to all slave nodes. Extractor categorization and extraction services (i.e., API request paths starting with /extractor/api/{categorization|extract|annotate}) are accessible from slave nodes, with a lag of up to the time window defined in the Solr configuration (20 seconds by default).
  • Document updates to the search index on the master node are propagated to all slave nodes. GraphSearch search and suggestion services (i.e., API request paths starting with /GraphSearch/api/{search|suggest}) are accessible from slave nodes, with a lag of up to the time window defined in the Solr configuration (20 seconds by default).
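
One way to verify these behaviors is to change something on the master node and poll a slave node until it returns the same result. The sketch below assumes two hypothetical callables, read_master and read_slave, that each issue the same read request against the respective node and return a comparable value; the 20-second default mirrors the Solr default mentioned above and should be raised for the project synchronization interval.

```python
import time
from typing import Any, Callable


def wait_for_propagation(
    read_master: Callable[[], Any],
    read_slave: Callable[[], Any],
    timeout_s: float = 20.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the slave returns what the master returned.

    read_master and read_slave are hypothetical helpers wrapping the
    same API call on the master node and on one slave node.
    """
    expected = read_master()
    deadline = time.monotonic() + timeout_s
    while True:
        if read_slave() == expected:
            return True
        if time.monotonic() >= deadline:
            # The slave did not catch up within the expected window.
            return False
        sleep(poll_s)
```

Running this once per slave node and per service group (thesaurus, extraction, search) gives a simple pass/fail check before the load balancer is enabled.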

Recommended Load Balancer Configuration

Load balancers should be placed in front of the cluster and configured with the following rules:

  • All incoming GET requests on Thesaurus Server thesaurus services (i.e., API request paths starting with /PoolParty/api/thesaurus) can be load balanced on the entire cluster.
  • All incoming POST requests on Thesaurus Server thesaurus services (i.e., API request paths starting with /PoolParty/api/thesaurus) and GET and POST requests on other services of Thesaurus Server (i.e., API request paths starting with /PoolParty/api/{corpusmanagement|user|schema}) must only be routed to the Thesaurus Server on the master node.
  • All incoming GET and POST requests on Extractor categorization and extraction services (i.e., API request paths starting with /extractor/api/{categorization|extract|annotate}) can be load balanced on the entire cluster.
  • All incoming GET and POST requests on GraphSearch search, recommendation and suggestion services (i.e., API request paths starting with /GraphSearch/api/{search|suggest|recommend}) can be load balanced on the entire cluster. Requests on other services of GraphSearch must only be routed to the GraphSearch on the master node.
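
The rules above can be expressed as a small routing function, which is useful for testing a load balancer configuration or for implementing the rules in a custom proxy. This is a sketch under assumptions: the function name and return values are illustrative, the path is expected without its query string, a "/" is assumed between "api" and the service name, and anything the rules do not mention is conservatively sent to the master node.

```python
import re

MASTER = "master"    # route only to the master node
CLUSTER = "cluster"  # may be load balanced across all nodes

_THESAURUS = re.compile(r"^/PoolParty/api/thesaurus(/|$)")
_EXTRACTOR = re.compile(r"^/extractor/api/(categorization|extract|annotate)(/|$)")
_GRAPHSEARCH = re.compile(r"^/GraphSearch/api/(search|suggest|recommend)(/|$)")


def choose_backend(method: str, path: str) -> str:
    """Map an HTTP method and request path to MASTER or CLUSTER."""
    method = method.upper()
    if _THESAURUS.match(path):
        # Thesaurus services: only GET may be spread over the cluster.
        return CLUSTER if method == "GET" else MASTER
    if method in ("GET", "POST") and _EXTRACTOR.match(path):
        return CLUSTER
    if method in ("GET", "POST") and _GRAPHSEARCH.match(path):
        return CLUSTER
    # Other Thesaurus Server and GraphSearch services, plus anything
    # not covered by the rules (an assumption), go to the master node.
    return MASTER
```

For example, choose_backend("GET", "/PoolParty/api/thesaurus/example/concept") yields "cluster", while the same path with POST yields "master".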

When any component (Thesaurus Server, Extractor or GraphSearch) on the master node is down, the same component on all slave nodes becomes read-only, because write operations are only allowed on the master node.

Client System Configuration

PoolParty clusters serve external client systems that implement their own business logic, and most interactions are performed via APIs. All client requests should follow these rules:

  • Use the project's textual identifier instead of its UUID for Thesaurus Server services that require a project identifier as a path parameter. For example, for the Get Concept service, use "/PoolParty/api/thesaurus/example/concept" instead of "/PoolParty/api/thesaurus/12345678-1234-1234-1234-123456789012/concept" as the request path, where "example" is the textual identifier used as part of the concept URI and "12345678-1234-1234-1234-123456789012" is the UUID of the project. Although either can be used to reference a project, the UUID on a slave node may differ from the one on the master node if the project was imported without aligning the UUID.
  • Use the UUID of the project in the Thesaurus Server on the master node for Extractor and GraphSearch services that require a project identifier as a request parameter, since the search index is built from the Thesaurus Server on the master node.
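
On the client side, the two rules can be kept together in one small value object so that the right identifier is always used for the right component. The class and method names below are hypothetical and only illustrate the idea; the path layout follows the Get Concept example above.

```python
import re
from dataclasses import dataclass

_UUID_RE = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


@dataclass(frozen=True)
class ProjectRef:
    text_id: str      # textual identifier, identical on every node
    master_uuid: str  # UUID of the project on the master node

    def __post_init__(self) -> None:
        # Guard against accidentally swapping the two identifiers.
        if _UUID_RE.match(self.text_id):
            raise ValueError("text_id looks like a UUID")
        if not _UUID_RE.match(self.master_uuid):
            raise ValueError("master_uuid is not a UUID")

    def thesaurus_path(self, service: str) -> str:
        """Path for a Thesaurus Server service, using the textual identifier."""
        return f"/PoolParty/api/thesaurus/{self.text_id}/{service}"

    def index_project_id(self) -> str:
        """Project identifier to send to Extractor and GraphSearch services."""
        return self.master_uuid
```

With ProjectRef("example", "12345678-1234-1234-1234-123456789012"), thesaurus_path("concept") returns "/PoolParty/api/thesaurus/example/concept", and index_project_id() returns the master UUID for Extractor and GraphSearch requests.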