Extractor DPUs
Extractors establish the integration with external data sources.
A data processing pipeline usually starts with extractors, because the first step is to retrieve the input data that is processed afterwards. DPUs of this type produce at least one data output for other DPUs. They usually do not accept data inputs from other DPUs, because they retrieve data from external systems based on their configuration. However, they can accept a configuration input in order to consume dynamic configurations generated by other DPUs at runtime.
PoolParty UnifiedViews provides DPUs to fetch data from file systems, HTTP API endpoints, SQL and SPARQL endpoints, and more.
Please refer to the respective topics for detailed descriptions:
Files Download
This DPU downloads one or more files from the defined locations. The files to be downloaded may be located at HTTP URLs, on the local file system, at certain SFTP/FTP servers, etc.
Note
If you want the DPU to download files or directories from the local file system, the UnifiedViews administrator needs to specify the allowed directory the DPU can access first.
Individual files as well as whole directories may be downloaded. If a directory is provided, all files in it and in its subdirectories are extracted.
If an internal name (file name) is specified for the downloaded entry, this name is used as a symbolic name to identify the given file internally later in the pipeline.
If you specify a directory as an entry then this file name is used as a prefix for the individual files within that directory.
In cases where you just need to iterate and process each downloaded file in the same way, you do not need to specify a file name.
This DPU also sets virtual path metadata for each extracted file. For a file entry, it is equal to the file name (the local file name from the file path, e.g. example.txt from a/b/c/example.txt).
For a directory entry, the virtual path metadata of each extracted file is equal to its path relative to the original directory.
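The two virtual path rules above can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch assumes that the directory rule means "path relative to the configured directory"; the DPU itself may normalise paths differently.

```python
from pathlib import PurePosixPath


def virtual_path_for_file(file_path: str) -> str:
    """File entry: the virtual path is just the local file name."""
    return PurePosixPath(file_path).name


def virtual_path_in_directory(directory: str, file_path: str) -> str:
    """Directory entry (assumed rule): the virtual path is the file's
    path relative to the directory that was configured for download."""
    return str(PurePosixPath(file_path).relative_to(PurePosixPath(directory)))


print(virtual_path_for_file("a/b/c/example.txt"))
print(virtual_path_in_directory("a/b", "a/b/c/example.txt"))
```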
The URI of a file may contain the macro {{execId}}, which is replaced during pipeline execution with the actual pipeline execution ID.
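As a minimal sketch of what this substitution amounts to, the following Python function replaces the macro in a URI. The function name and the example path are hypothetical; only the macro name {{execId}} comes from this page.

```python
def expand_exec_id(uri: str, execution_id: int) -> str:
    """Replace every {{execId}} occurrence with the given execution ID."""
    return uri.replace("{{execId}}", str(execution_id))


# Hypothetical example path; the execution ID would come from the pipeline run.
print(expand_exec_id("/tmp/run-{{execId}}/input.csv", 42))
```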
Name | Description | Example |
---|---|---|
List of files and directories to download | List of files and directories to be downloaded. Each entry contains location from which the file should be obtained and optionally the internal file name. | |
URI | If it is a URI of a directory at the local file system, it has to be either absolute and point to the directory (or its subdirectories) or relative and start in the directory root. In either case, the UnifiedViews administrator needs to grant the DPU access to the directory first. | /tmp//Document.pdf |
Username | | admin |
Password | | <password> |
File name | | Document.pdf |
Default connection timeout (ms) | | 20,000 |
Ignore TLS/SSL errors | If checked, errors with server certificate are ignored when connecting using secure connection (SSL/TLS, URL starts with https://). Wrong host name in certificate is ignored, untrusted certificate issuers are accepted, self-signed certificates are accepted. This option causes the download to be vulnerable to man-in-the-middle attack. Use with caution, it neglects security provided by TLS/SSL connection. Connecting using this option is insecure! | false |
Soft failure | If checked, a warning is shown when there is a problem processing a VFS entry or file, but the execution of the DPU continues. If unchecked (default), any problem processing a VFS entry or file causes the execution to fail. | true |
Skip redundant input file entries | If checked, the DPU checks whether it is about to process the same file URI more than once (this may happen when the DPU is configured dynamically). If so, it skips the redundant entries and logs an info message. | true |
Wait between calls for (ms) | Number of milliseconds the DPU should wait between the HTTP calls (0 by default, thus no delays between calls) | 0 |
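The interplay of the last three options can be sketched in Python as follows. This is an illustration of the behaviour described in the table, not the DPU's implementation; `process_entries` and the `download` callback are hypothetical names.

```python
import time
from typing import Callable, Iterable, List


def process_entries(
    uris: Iterable[str],
    download: Callable[[str], None],
    soft_failure: bool = False,
    skip_redundant: bool = True,
    wait_ms: int = 0,
) -> List[str]:
    """Download each URI and return the URIs that were processed successfully."""
    seen = set()
    done = []
    for uri in uris:
        if skip_redundant and uri in seen:
            print(f"INFO: skipping redundant entry {uri}")
            continue
        seen.add(uri)
        try:
            download(uri)
            done.append(uri)
        except OSError as err:
            if not soft_failure:
                raise  # default: any problem fails the whole execution
            print(f"WARNING: could not process {uri}: {err}")
        if wait_ms > 0:
            time.sleep(wait_ms / 1000.0)  # delay between calls
    return done
```

With soft failure enabled, a failing entry only produces a warning and the remaining entries are still processed; with the default setting the first failure is re-raised.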
Name | Type | Data Unit | Description | Required |
---|---|---|---|---|
output | output | FilesDataUnit | Downloaded files | |
config | input | RdfDataUnit | Dynamic DPU configuration, see Advanced configuration |
It is also possible to dynamically configure the DPU over its input `config` data unit using RDF data.
Turtle
<http://localhost/resource/config>
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://unifiedviews.eu/ontology/dpu/filesDownload/Config>;
    <http://unifiedviews.eu/ontology/dpu/filesDownload/hasFile> <http://localhost/resource/file/0>.
<http://localhost/resource/file/0>
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://unifiedviews.eu/ontology/dpu/filesDownload/File>;
    <http://unifiedviews.eu/ontology/dpu/filesDownload/file/uri> "http://www.zmluvy.gov.sk/data/att/117597_dokument.pdf";
    <http://unifiedviews.eu/ontology/dpu/filesDownload/file/fileName> "zmluva.pdf".
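If such a configuration has to be produced for several files, it can be generated instead of written by hand. The Python sketch below emits the same vocabulary as the Turtle snippet above, one triple per line. It assumes that additional files are expressed as further hasFile links to numbered resources (file/0, file/1, ...), which this page does not state explicitly; the function name and the example.org URLs are hypothetical.

```python
from typing import Iterable, Optional, Tuple

NS = "http://unifiedviews.eu/ontology/dpu/filesDownload/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://localhost/resource/"


def _literal(value: str) -> str:
    """Quote a string as a Turtle literal (escapes backslash and quote only)."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def files_download_config(files: Iterable[Tuple[str, Optional[str]]]) -> str:
    """Build a Files Download configuration from (uri, file_name) pairs.

    file_name may be None, in which case no fileName triple is written.
    """
    lines = [f"<{BASE}config> <{RDF_TYPE}> <{NS}Config> ."]
    for i, (uri, file_name) in enumerate(files):
        node = f"<{BASE}file/{i}>"  # assumption: one numbered node per file
        lines.append(f"<{BASE}config> <{NS}hasFile> {node} .")
        lines.append(f"{node} <{RDF_TYPE}> <{NS}File> .")
        lines.append(f"{node} <{NS}file/uri> {_literal(uri)} .")
        if file_name is not None:
            lines.append(f"{node} <{NS}file/fileName> {_literal(file_name)} .")
    return "\n".join(lines)


print(files_download_config([("http://example.org/a.pdf", "a.pdf")]))
```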
The following image shows a fragment of a pipeline which downloads an Excel file from the tmp folder of the UnifiedViews server. The data of the Excel file is subsequently converted to RDF and loaded into a Virtuoso triple store. The DPU configuration is illustrated in the image below.
The following image shows the configuration for downloading multiple files at once.
The following image shows the configuration to download all files in a directory. If the directory contains subdirectories, their files are downloaded as well.
The following image shows a fragment of a pipeline which downloads an Excel file from the tmp folder of the UnifiedViews server. The data of the Excel file is subsequently converted to RDF and serves as input for a SPARQL Construct Query. The purpose of this query is to construct the configuration file of the second Files Download DPU. After the files are downloaded they are uploaded to the tmp folder of the UnifiedViews server using the Files Upload DPU. The DPU configuration is illustrated in the image below; it is empty as the configuration comes from the input RDF file.
The query used in this pipeline creates triples containing the download URI and the file name of the files that are to be downloaded. The query reads as follows:
CONSTRUCT {
    <http://localhost/resource/config>
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://unifiedviews.eu/ontology/dpu/filesDownload/Config>;
        <http://unifiedviews.eu/ontology/dpu/filesDownload/hasFile> <http://localhost/resource/file/0>.
    <http://localhost/resource/file/0>
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://unifiedviews.eu/ontology/dpu/filesDownload/File>;
        <http://unifiedviews.eu/ontology/dpu/filesDownload/file/uri> ?fileUri;
        <http://unifiedviews.eu/ontology/dpu/filesDownload/file/fileName> ?fileName.
}
WHERE {
    ?s <http://localhost/fileuri/fileName> ?fileName.
    ?s <http://localhost/fileUri> ?fileUri
}
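A CONSTRUCT template is instantiated once per query solution and the resulting triples are merged into one set. The Python sketch below imitates this for the variable part of the template above, without using an RDF library. It is only an illustration with hypothetical input values; it shows that, because the template uses the fixed resource file/0, all solutions attach their URI and file name to that same resource. How the DPU treats several values on one file resource is not described on this page.

```python
NS = "http://unifiedviews.eu/ontology/dpu/filesDownload/"
FILE_NODE = "http://localhost/resource/file/0"


def instantiate(bindings):
    """Instantiate the variable part of the template once per solution
    and collect the triples in a set, as CONSTRUCT does."""
    triples = set()
    for row in bindings:
        triples.add((FILE_NODE, NS + "file/uri", row["fileUri"]))
        triples.add((FILE_NODE, NS + "file/fileName", row["fileName"]))
    return triples


# Two hypothetical solutions of the WHERE clause.
rows = [
    {"fileUri": "http://example.org/a.pdf", "fileName": "a.pdf"},
    {"fileUri": "http://example.org/b.pdf", "fileName": "b.pdf"},
]
for triple in sorted(instantiate(rows)):
    print(triple)
```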
HTTP API Request
This DPU executes HTTP requests (GET, POST methods) against web services and passes the response on in the form of a files data unit.
Its purpose is to enable consuming web services, both REST and SOAP.
For POST HTTP requests, there are four possible modes (types of sent data):
multipart (form data) body
raw data (content type can be specified: XML, JSON, ...)
raw data with bodies from input file(s) - for each input file a separate HTTP request (raw data) is executed
multipart (form data) with bodies from an input RDF configuration - for each input set of form params a separate HTTP request is executed
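To make the difference between a single raw request and the "raw data with bodies from input file(s)" mode more concrete, here is a small Python sketch using the standard library. It only builds the request objects and does not send anything; the function names and the example.org URL are hypothetical and the DPU's own implementation may differ.

```python
from typing import Iterable, List
from urllib.request import Request


def raw_post(url: str, body: str, content_type: str, encoding: str = "utf-8") -> Request:
    """Build (but do not send) one raw-body POST request."""
    return Request(
        url,
        data=body.encode(encoding),
        headers={"Content-Type": content_type},
        method="POST",
    )


def raw_posts_from_files(url: str, bodies: Iterable[str], content_type: str) -> List[Request]:
    """Input-file mode: one separate raw request per input file body."""
    return [raw_post(url, body, content_type) for body in bodies]


requests = raw_posts_from_files("http://example.org/service", ["<a/>", "<b/>"], "application/xml")
print(len(requests), requests[0].get_method())
```

Each request object could then be sent with `urllib.request.urlopen`, and each response stored in its own output file.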
If the sent data is multipart or raw, this DPU offers the possibility to preview the HTTP response at design time.
The DPU also supports HTTPS requests.
Parameter | Description | Example |
---|---|---|
HTTP method | Supported HTTP request methods: GET, POST, PUT, DELETE. Based on the method additional configuration options are shown. | POST |
URL address | URL address of the target web service, where the HTTP or HTTPS request will be sent. | |
Target file name | Name of created file where the content of the HTTP response is stored. | response.json |
Target files suffix | (POST / file mode) Suffix of created files containing the content of HTTP responses. | 001_suffix, 002_suffix |
Basic authentication | Sets BASIC authentication (user name, password) for HTTP request | true |
User name | (if authentication is on) User name for basic authentication | admin |
Password | (if authentication is on) Password for basic authentication | <password> |
Data type | (only for POST HTTP method) Type of sent data in HTTP request: Raw body (text), Form-data body (multipart), Raw bodies from input files, Form-data bodies from input RDF configuration | Form-data bodies from input RDF configuration |
Content-type | (only for POST HTTP method) Type of sent raw data, set as HTTP header "Content-Type" (e.g. XML, JSON, SOAP, ...) | text/html |
Request body text encoding | (only for POST HTTP method) Encoding of HTTP request body text | This%20is%20some%20sample%20encoded%20text |
Request body | (only for POST HTTP method - raw body) Text sent in HTTP request body | "This is some sample text" |
Form data | (only for POST HTTP method - form-data body) Table of sent form data in the form of key - values |
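For reference, HTTP Basic authentication sends the user name and password as a Base64-encoded `Authorization` header. The following Python sketch shows how such a header is formed; the function name and the credentials are placeholders.

```python
import base64


def basic_auth_header(user_name: str, password: str) -> dict:
    """Return the Authorization header for HTTP Basic authentication."""
    token = base64.b64encode(f"{user_name}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials for illustration only.
print(basic_auth_header("admin", "secret"))
```

Because Base64 is an encoding and not encryption, Basic authentication should be combined with HTTPS.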
Name | Type | DataUnit | Description | Required |
---|---|---|---|---|
requestOutput | output | FilesDataUnit | File(s) containing HTTP response(s) | |
requestFilesConfig | input | FilesDataUnit | Files sent as content of raw HTTP POST request | |
rdfConfig | input | RDFDataUnit | RDF configuration used to configure form-data bodies |
It is also possible to dynamically configure the request body over the input config data unit using RDF data. This is available only for raw mode and you can configure only the request.
# to dynamically configure request URL and request body (raw data mode)
<http://localhost/resource/config>
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://unifiedviews.eu/ontology/dpu/httpRequest/Config>;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/requestBody> "..." ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/url> "http://semantic-web.com/service/x".
# two form-param bodies with the same set of three form params
<http://localhost/resource/config>
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://unifiedviews.eu/ontology/dpu/httpRequest/Config>;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParamBody> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParamBody/1> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParamBody> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParamBody/2> .
<http://unifiedviews.eu/ontology/dpu/httpRequest/FormParamBody/1>
    a <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParamBody> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParam> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam1> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParam> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam2> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParam> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam3> .
<http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam1>
    a <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/param> "corpusId" ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/value> "corpus:307b420d-43ad-4771-be41-308199da95b1" .
<http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam2>
    a <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/param> "text" ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/value> "Test" .
<http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam3>
    a <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/param> "title" ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/value> "Test title" .
<http://unifiedviews.eu/ontology/dpu/httpRequest/FormParamBody/2>
    a <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParamBody> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParam> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam1> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParam> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam2> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParam> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam3> .
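A form-data configuration like the one above can also be generated from plain key-value data. The Python sketch below writes one FormParamBody per dictionary and one FormParam per key, using the vocabulary shown above. The IRIs it mints for the individual parameters (`.../FormParamBody/1/FormParam/1` and so on) are an assumption made for this example, as is the function name; unlike the snippet above, each body gets its own parameter resources.

```python
from typing import Dict, List

NS = "http://unifiedviews.eu/ontology/dpu/httpRequest/"
CONFIG = "<http://localhost/resource/config>"


def _literal(value: str) -> str:
    """Quote a string as a Turtle literal (escapes backslash and quote only)."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def form_param_config(bodies: List[Dict[str, str]]) -> str:
    """Build an RDF configuration with one form-data body per dictionary."""
    out = [f"{CONFIG} a <{NS}Config> ."]
    for b, params in enumerate(bodies, start=1):
        body = f"<{NS}FormParamBody/{b}>"
        out.append(f"{CONFIG} <{NS}formParamBody> {body} .")
        out.append(f"{body} a <{NS}FormParamBody> .")
        for p, (name, value) in enumerate(params.items(), start=1):
            # Hypothetical IRI scheme for the parameter resources.
            param = f"<{NS}FormParamBody/{b}/FormParam/{p}>"
            out.append(f"{body} <{NS}formParam> {param} .")
            out.append(f"{param} a <{NS}FormParam> .")
            out.append(f"{param} <{NS}param> {_literal(name)} .")
            out.append(f"{param} <{NS}value> {_literal(value)} .")
    return "\n".join(out)


print(form_param_config([{"text": "Test", "title": "Test title"}]))
```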
This DPU is often used to get RDF data from PoolParty or triple stores using a SPARQL CONSTRUCT query. The following table lists the relevant endpoints:
Database | Path Variable | Service Path for SPARQL Query | Service Path for SPARQL Graph Store HTTP Protocol | Example |
---|---|---|---|---|
RDF4J | $REPOSITORY: RDF4J repository name | /$REPOSITORY | /$REPOSITORY_ID/rdf-graphs/service | http://db-rdf4j-oews-poc.semantic-web.at:8080/rdf4j-server/repositories/test?query=... |
Stardog | $DATABASE: Stardog database name | /$DATABASE/query | /$DATABASE | http://db-stardog-stardog-poc.semantic-web.at:5820/msmetrics/query?query=.... |
MarkLogic | None. "repository" is decided by port number | /v1/graphs/sparql | /v1/graphs | http://pp-sem-doc.demo.marklogic.com:8026/v1/graphs/sparql with Construct in Body (raw body, text (text/plain), UTF-8) |
AllegroGraph | $REPOSITORY: RDF4J repository name | /repositories/$REPOSITORY | Not supported | |
GraphDB | $REPOSITORY: GraphDB repository name | /$REPOSITORY | /$REPOSITORY_ID/rdf-graphs/service |
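The service paths in the table can be turned into a request URL by substituting the path variable and appending the URL-encoded query. The following Python sketch does this for GET-style query URLs; the function name, the base URLs and the repository names are hypothetical, and the base URL is assumed to end where the service path from the table begins.

```python
from string import Template
from urllib.parse import urlencode

# Service paths for SPARQL queries, taken from the table above.
QUERY_PATHS = {
    "RDF4J": "/$REPOSITORY",
    "Stardog": "/$DATABASE/query",
    "MarkLogic": "/v1/graphs/sparql",
    "AllegroGraph": "/repositories/$REPOSITORY",
    "GraphDB": "/$REPOSITORY",
}


def sparql_query_url(base_url: str, database: str, query: str, **path_vars: str) -> str:
    """Fill in the path variable and append the URL-encoded query parameter."""
    path = Template(QUERY_PATHS[database]).substitute(path_vars)
    return f"{base_url.rstrip('/')}{path}?{urlencode({'query': query})}"


print(sparql_query_url(
    "http://localhost:8080/rdf4j-server/repositories",
    "RDF4J",
    "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }",
    REPOSITORY="test",
))
```

For MarkLogic, the example in the table sends the CONSTRUCT query in the request body instead, so no query parameter would be needed there.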
The following image shows a fragment of a pipeline which first executes a SPARQL DELETE in a PoolParty project (SPARQL Endpoint Loader) and then creates a snapshot of the PoolParty project with an HTTP request. The configuration for the snapshot API call can be seen in the image below. For more information on the Web Service Method used, see here: https://help.poolparty.biz/x/ySiU
The following image shows another configuration example to make an API call to PoolParty. For this example, the Web Service Method Request Subtree of Concept or Concept Scheme is used. For more information on this Web Service Method see here: https://help.poolparty.biz/x/AimU
The following image shows a fragment of a pipeline which downloads a file and sends it to the PoolParty extractor to be annotated. From the response, the configuration is constructed that is sent as input to the HTTP API Request DPU, and finally the result is loaded into a SPARQL endpoint.
The SPARQL Construct configuration can be seen below. The result of this construct is passed into the HTTP API Request DPU as configuration parameters following a specific format. The construct creates a unique configuration for each subject, which is essentially the document URI. Attached to each configuration is the document text, encoded to remove whitespace.
The following image shows a fragment of a pipeline which is used to populate a GraphSearch search space. Not shown are the preceding steps, in which files are downloaded from a folder on the server and annotated using the PoolParty Concept Extractor. This fragment shows the start of the configuration based on the annotation results; the configuration is transferred to the PoolParty GraphSearch Content Indexing Request Constructor. After that, HTTP API Requests are sent to drop the current index, create the new content in the GraphSearch space, and finally refresh the index of the newly created content.
The following image shows a configuration for refreshing the GraphSearch search index. This request is usually attached to the pipeline where content is created (as seen above).
For further information about this API please check https://help.poolparty.biz/x/FyqU
PoolParty Concept Extractor
PoolParty Concept Extractor is a DPU / plugin for UnifiedViews to consume the Concept Extraction service provided by PoolParty Extractor. Given triples with string literal objects representing texts or files containing texts as input, this extractor annotates texts against a thesaurus project in PoolParty and produces annotations in RDF triples as output.
Please refer to the following documentation for more information about PoolParty Extractor.
Name | Description | Data Type | Example |
---|---|---|---|
Host | Resolvable host name or IP address of the target PoolParty server | String | |
Port | Port number of PoolParty server | Integer | 80 |
Extraction service path | PoolParty Concept Extraction service path relative to PoolParty service root URL | String | /extractor/api/annotate |
Project ID | Project identifier of the PoolParty thesaurus project to be extracted against | String | 12345678-1234-1234-1234-ABCDEF123456 |
Language code | Two-letter ISO 639-1 code of the source language of the texts to be extracted | String | en |
Username | Account name of a user for the target PoolParty thesaurus server | String | test |
Password | Password of a user for the target PoolParty thesaurus server | String | **** |
Corpus ID | Identifier of a corpus in the project used to adapt scores with corpus analysis | String | 12345678-1234-1234-1234-ABCDEF123456 |
Number of terms to return | Maximum number of terms to return | Integer | 0 |
Number of concepts to return | Maximum number of concepts to return | Integer | 50 |
useTransitiveBroaderConcepts | Retrieve transitive broader concepts of the extracted concepts | Boolean | false |
useTransitiveBroaderTopConcepts | Retrieve transitive broader top concepts of the extracted concepts | Boolean | false |
useRelatedConcepts | Retrieve related concepts of the extracted concepts | Boolean | false |
filterNestedConcepts | The nested concept filter removes concept matches which are contained within other matches | Boolean | true |
tfidfScoring | The scores of the concepts and terms are weighted by the TF-IDF (term frequency-inverse document frequency) formula | Boolean | false |
useTypes | Retrieve the custom types for concepts | Boolean | false |
Maximum retry times for failed extraction | Maximum retry times for failed extraction | Integer | 3 |
Use HTTPS | If checked, HTTPS is used for connecting to target PPX service (by default false) | Boolean | false |
Use only symbolic names when creating resulting URIs from input files | If checked, virtual path metadata is not used when forming URIs for outputted resources, but symbolic names are used | Boolean | false |
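Two of the settings above lend themselves to a short illustration: how host, port, service path and the HTTPS flag combine into a service URL, and what a retry limit for failed extractions means in practice. The Python sketch below uses hypothetical function names and a placeholder host; it assumes that the retry setting counts additional attempts after the first failure, which this page does not specify.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def service_url(host: str, port: int, path: str, use_https: bool = False) -> str:
    """Combine the connection settings into the extraction service URL."""
    scheme = "https" if use_https else "http"
    return f"{scheme}://{host}:{port}{path}"


def with_retries(call: Callable[[], T], max_retries: int = 3) -> T:
    """Run call(); after a failure, retry up to max_retries more times
    (assumed meaning of the retry setting)."""
    attempt = 0
    while True:
        try:
            return call()
        except Exception:
            if attempt >= max_retries:
                raise  # give up after the last allowed retry
            attempt += 1


# Placeholder host for illustration only.
print(service_url("poolparty.example.org", 80, "/extractor/api/annotate"))
```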