Extractor DPUs

Extractors establish the integration with external data sources.

A data processing pipeline usually starts with extractors, because the first step is to retrieve the input data that is processed afterwards. DPUs of this type produce at least one data output for other DPUs. They usually do not accept data inputs from other DPUs, because they retrieve data from external systems based on their configuration. However, they can accept a configuration input in order to consume dynamic configurations generated by other DPUs at runtime.

PoolParty UnifiedViews provides DPUs to fetch data from file systems, HTTP API endpoints, SQL and SPARQL endpoints, and more.

Please refer to the respective topics for detailed descriptions:

Files Download

Description

Files Download (uv-e-filesDownload):

This DPU downloads one or more files from the defined locations. The files to be downloaded may be located at HTTP URLs, on the local file system, at certain SFTP/FTP servers, etc.

Note

If you want the DPU to download files or directories from the local file system, the UnifiedViews administrator must first specify the directory the DPU is allowed to access.

Individual files as well as whole directories may be downloaded. If a directory is provided, all files in it are extracted, including files in its subdirectories.

If an internal name (file name) is specified for the downloaded entry, this name is used as a symbolic name to identify the given file internally later in the pipeline.

If you specify a directory as an entry, this file name is used as a prefix for the individual files within that directory.

In cases where you just need to iterate and process each downloaded file in the same way, you do not need to specify a file name.

This DPU also sets virtual path metadata for each extracted file. For a single file, the virtual path equals the file name (the local file name taken from the file path, e.g. example.txt from a/b/c/example.txt).

In case of directories, virtual path metadata for each extracted file is equal to the relative path of the original directory.
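
The two virtual path rules above can be pictured with a small Python sketch. It is an illustration only, not the DPU's implementation: the function names are hypothetical, and the directory case assumes that "relative path" means the path of each file relative to the downloaded directory.

```python
import posixpath


def virtual_path_for_file(file_path: str) -> str:
    """Virtual path of a single downloaded file: its local file name."""
    return posixpath.basename(file_path)


def virtual_path_in_directory(directory: str, file_path: str) -> str:
    """Virtual path of a file found under a downloaded directory.

    Assumption: the path is taken relative to the downloaded directory.
    """
    return posixpath.relpath(file_path, start=directory)


single = virtual_path_for_file("a/b/c/example.txt")
nested = virtual_path_in_directory("a/b", "a/b/c/example.txt")
print(single)  # example.txt
print(nested)  # c/example.txt
```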

The URI of a file may contain the macro {{execId}}, which is replaced during pipeline execution with the actual pipeline execution ID.
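
As a rough illustration of the macro substitution, the following Python sketch replaces {{execId}} in a URI. The function name, the example path and the execution ID 42 are hypothetical; the DPU performs the replacement itself at runtime.

```python
EXEC_ID_MACRO = "{{execId}}"


def expand_exec_id(uri: str, execution_id: int) -> str:
    """Replace every occurrence of the {{execId}} macro with the execution ID."""
    return uri.replace(EXEC_ID_MACRO, str(execution_id))


# Hypothetical URI and execution ID, for illustration only.
expanded = expand_exec_id("/tmp/{{execId}}/Document.pdf", 42)
print(expanded)  # /tmp/42/Document.pdf
```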

Configuration Parameters

| Name | Description | Example |
| --- | --- | --- |
| List of files and directories to download | List of files and directories to be downloaded. Each entry contains the location from which the file should be obtained and, optionally, the internal file name. URI: if it is the URI of a directory on the local file system, it has to be either absolute and point to the directory (or its subdirectories), or relative and start in the directory root. In either case, the UnifiedViews administrator needs to grant the DPU access to the directory first. | /tmp//Document.pdf |
| Username | | admin |
| Password | | <password> |
| File name | | Document.pdf |
| Default connection timeout (ms) | | 20,000 |
| Ignore TLS/SSL errors | If checked, server certificate errors are ignored when connecting over a secure connection (SSL/TLS, URL starts with https://): a wrong host name in the certificate is ignored, and untrusted certificate issuers and self-signed certificates are accepted. This option makes the download vulnerable to man-in-the-middle attacks. Use it with caution, because it negates the security provided by the TLS/SSL connection. Connecting with this option is insecure! | false |
| Soft failure | If checked, a warning is shown when there is a problem processing a VFS entry or file, but the execution of the DPU continues. If unchecked (default), the execution fails when there is a problem processing any VFS entry or file. | true |
| Skip redundant input file entries | If checked, the DPU checks whether it is about to process the same file URI more than once (this may happen when the DPU is configured dynamically). If so, it skips the redundant entries and logs an info message. | true |
| Wait between calls for (ms) | Number of milliseconds the DPU should wait between HTTP calls (0 by default, i.e. no delay between calls) | 0 |
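
The interplay of the "Soft failure", "Skip redundant input file entries" and "Wait between calls" options can be sketched in Python as follows. This is a simplified model under stated assumptions, not the DPU's code: `download_all`, `fake_fetch` and the log messages are hypothetical, and the actual transfer is left to a caller-supplied function.

```python
import time


def download_all(uris, fetch, soft_failure=False, skip_redundant=True, wait_ms=0):
    """Process entries in order and return (results, log messages)."""
    seen = set()
    results = []
    log = []
    calls_made = 0
    for uri in uris:
        if skip_redundant and uri in seen:
            log.append(f"info: skipped redundant entry {uri}")
            continue
        seen.add(uri)
        if calls_made > 0 and wait_ms > 0:
            time.sleep(wait_ms / 1000.0)  # pause between consecutive calls
        calls_made += 1
        try:
            results.append(fetch(uri))
        except Exception as error:
            if not soft_failure:
                raise  # default behaviour: the whole execution fails
            log.append(f"warning: could not process {uri}: {error}")
    return results, log


def fake_fetch(uri):
    """Stand-in for a real download; fails for one entry."""
    if uri.endswith("broken"):
        raise IOError("cannot read")
    return uri.upper()


results, log = download_all(["a", "a", "broken", "b"], fake_fetch, soft_failure=True)
print(results)
print(log)
```

With `soft_failure=False`, the same call would stop at the failing entry and raise the error instead of logging a warning.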

Inputs and Outputs

| Name | Type | Data Unit | Description | Required |
| --- | --- | --- | --- | --- |
| output | output | FilesDataUnit | Downloaded files | yes |
| config | input | RdfDataUnit | Dynamic DPU configuration, see Advanced configuration | no |

Notes

Advanced configuration

It is also possible to configure the DPU dynamically through its input `config` data unit using RDF data.

Configuration samples

Turtle

<http://localhost/resource/config>  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://unifiedviews.eu/ontology/dpu/filesDownload/Config>;
        <http://unifiedviews.eu/ontology/dpu/filesDownload/hasFile> <http://localhost/resource/file/0>.


<http://localhost/resource/file/0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://unifiedviews.eu/ontology/dpu/filesDownload/File>;
        <http://unifiedviews.eu/ontology/dpu/filesDownload/file/uri> "http://www.zmluvy.gov.sk/data/att/117597_dokument.pdf"; 
        <http://unifiedviews.eu/ontology/dpu/filesDownload/file/fileName> "zmluva.pdf".
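
A configuration in the shape shown above could also be generated programmatically. The following Python sketch emits one triple per line using the predicates from the sample. The function names are hypothetical, the example.org URIs are placeholders, and it is an assumption that several `hasFile` entries may be attached to one configuration resource.

```python
BASE = "http://unifiedviews.eu/ontology/dpu/filesDownload/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
CONFIG = "<http://localhost/resource/config>"


def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def files_download_config(entries):
    """Build Turtle for a list of (uri, file_name) pairs; file_name may be None."""
    lines = [f"{CONFIG} <{RDF_TYPE}> <{BASE}Config> ."]
    for index, (uri, file_name) in enumerate(entries):
        node = f"<http://localhost/resource/file/{index}>"
        lines.append(f"{CONFIG} <{BASE}hasFile> {node} .")
        lines.append(f"{node} <{RDF_TYPE}> <{BASE}File> .")
        lines.append(f"{node} <{BASE}file/uri> {turtle_literal(uri)} .")
        if file_name:
            lines.append(f"{node} <{BASE}file/fileName> {turtle_literal(file_name)} .")
    return "\n".join(lines)


config_turtle = files_download_config([
    ("http://example.org/a.pdf", "a.pdf"),
    ("http://example.org/b.csv", None),
])
print(config_turtle)
```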

Examples

Download an Excel file, convert the table data to RDF and load it to Virtuoso

The following image shows a fragment of a pipeline which downloads an Excel file from the tmp folder of the UnifiedViews server. The data of the Excel file is subsequently converted to RDF and loaded into a Virtuoso triple store. The DPU configuration is illustrated in the image below.

24577086.png
24577087.png
Download Multiple Files

The following image shows the configuration for downloading multiple files at once.

24577088.png

The following image shows the configuration for downloading all files in a directory. If the directory contains subdirectories, the files in them are downloaded as well.

24577089.png
Download an Excel File Containing Download Links, Convert It to RDF and Use It to Configure Another Files Download DPU

The following image shows a fragment of a pipeline which downloads an Excel file from the tmp folder of the UnifiedViews server. The data of the Excel file is subsequently converted to RDF and serves as input for a SPARQL Construct Query. The purpose of this query is to construct the configuration file of the second Files Download DPU. After the files are downloaded they are uploaded to the tmp folder of the UnifiedViews server using the Files Upload DPU. The DPU configuration is illustrated in the image below; it is empty as the configuration comes from the input RDF file.

24577090.png
24577091.png

The query used in this pipeline creates triples containing the download URI and the file name of the files that are to be downloaded. The query reads as follows:

CONSTRUCT {
<http://localhost/resource/config>  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://unifiedviews.eu/ontology/dpu/filesDownload/Config>;
        <http://unifiedviews.eu/ontology/dpu/filesDownload/hasFile> <http://localhost/resource/file/0>.

<http://localhost/resource/file/0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://unifiedviews.eu/ontology/dpu/filesDownload/File>;
        <http://unifiedviews.eu/ontology/dpu/filesDownload/file/uri> ?fileUri; 
        <http://unifiedviews.eu/ontology/dpu/filesDownload/file/fileName> ?fileName.
}
WHERE {
?s <http://localhost/fileuri/fileName> ?fileName.
?s <http://localhost/fileUri> ?fileUri
}

HTTP API Request

Description

HTTP API Request (uv-e-httpRequest):

This DPU executes HTTP requests (GET, POST methods) against web services and returns the response as a files data unit.

The DPU is intended for consuming web services, both REST and SOAP.

For POST HTTP requests, there are 4 possible modes (types of sent data):

  • multipart (form data) body

  • raw data (content type can be specified: XML, JSON, ...)

  • raw data with bodies from input file(s) - for each input file a separate HTTP request (raw data) is executed

  • multipart (form data) with bodies from an input RDF configuration - for each input set of form params a separate HTTP request is executed

If the sent data is multipart or raw, this DPU lets you preview the HTTP response at design time.

The DPU also supports HTTPS requests.
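
To make the raw data modes more concrete, here is a minimal Python sketch that builds (but does not send) POST requests with a raw body, one request per input body. It uses only the standard library; the function names, the JSON bodies and the way the charset is appended to the content type are illustrative assumptions, not the DPU's behaviour.

```python
import urllib.request


def build_raw_post(url, body, content_type="application/json", encoding="utf-8"):
    """Build a POST request whose body is the given text in the given encoding."""
    return urllib.request.Request(
        url,
        data=body.encode(encoding),
        method="POST",
        headers={"Content-Type": f"{content_type}; charset={encoding}"},
    )


def build_raw_posts(url, bodies, content_type="application/json", encoding="utf-8"):
    """One separate request per input body, as in the 'bodies from input files' mode."""
    return [build_raw_post(url, body, content_type, encoding) for body in bodies]


request = build_raw_post("http://localhost/PoolParty", '{"text": "Test"}')
requests_per_file = build_raw_posts(
    "http://localhost/PoolParty", ['{"text": "A"}', '{"text": "B"}']
)
print(request.get_method(), request.full_url)
```

Sending a request would be a matter of passing it to `urllib.request.urlopen`, which is omitted here so that the sketch runs without a server.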

Configuration Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| HTTP method | Supported HTTP request methods: GET, POST, PUT, DELETE. Based on the method additional configuration options are shown. | POST |
| URL address | URL address of the target web service, where the HTTP or HTTPS request will be sent. | http://localhost/PoolParty, https://vocabulary.semantic-web.at/PoolParty |
| Target file name | Name of created file where the content of the HTTP response is stored. | response.json |
| Target files suffix | (POST / file mode) Suffix of created files containing the content of HTTP responses. | 001_suffix, 002_suffix |
| Basic authentication | Sets BASIC authentication (user name, password) for HTTP request | true |
| User name | (if authentication is on) User name for basic authentication | admin |
| Password | (if authentication is on) Password for basic authentication | <password> |
| Data type | (only for POST HTTP method) Type of sent data in HTTP request: Raw body (text), Form-data body (multipart), Raw bodies from input files, Form-data bodies from input RDF configuration | Form-data bodies from input RDF configuration |
| Content-type | (only for POST HTTP method) Type of sent raw data, set as HTTP header "Content-Type" (e.g. XML, JSON, SOAP, ...) | text/html |
| Request body text encoding | (only for POST HTTP method) Encoding of HTTP request body text | This%20is%20some%20sample%20encoded%20text |
| Request body | (only for POST HTTP method - raw body) Text sent in HTTP request body | "This is some sample text" |
| Form data | (only for POST HTTP method - form-data body) Table of sent form data in the form of key - values | |

Inputs and Outputs

| Name | Type | DataUnit | Description | Required |
| --- | --- | --- | --- | --- |
| requestOutput | output | FilesDataUnit | File(s) containing HTTP response(s) | yes |
| requestFilesConfig | input | FilesDataUnit | Files sent as content of raw HTTP POST request | no |
| rdfConfig | input | RDFDataUnit | RDF configuration used to configure form-data bodies | no |

Notes

Advanced Configuration

It is also possible to configure the request body dynamically through the input config data unit using RDF data. This is available only for raw mode, and only the request can be configured.

Configuration samples
# to dynamically configure request URL and request body (raw data mode)
<http://localhost/resource/config>
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://unifiedviews.eu/ontology/dpu/httpRequest/Config>;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/requestBody> "..." ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/url> "http://semantic-web.com/service/x".
# two form-param bodies with the same set of three form params
<http://localhost/resource/config>
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://unifiedviews.eu/ontology/dpu/httpRequest/Config>;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParamBody> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParamBody/1> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParamBody> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParamBody/2> .

<http://unifiedviews.eu/ontology/dpu/httpRequest/FormParamBody/1>  a <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParamBody> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParam> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam1> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParam> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam2> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParam> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam3> .

<http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam1> a <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/param> "corpusId" ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/value>  "corpus:307b420d-43ad-4771-be41-308199da95b1" .

 <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam2> a <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/param> "text" ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/value>  "Test" .

 <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam3> a <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/param> "title" ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/value>  "Test title" .


 <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParamBody/2>  a <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParamBody> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParam> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam1> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParam> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam2> ;
    <http://unifiedviews.eu/ontology/dpu/httpRequest/formParam> <http://unifiedviews.eu/ontology/dpu/httpRequest/FormParam3> .
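
A form-param configuration like the one above can be produced from plain key-value data. The Python sketch below writes one triple per line with the predicates from the sample. The function names are hypothetical, and so is the naming of the generated FormParam resources (here each body gets its own parameter resources, which is an assumption about what the DPU accepts).

```python
NS = "http://unifiedviews.eu/ontology/dpu/httpRequest/"
CONFIG = "<http://localhost/resource/config>"


def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def form_param_config(bodies):
    """bodies: list of dicts (param name -> value); one HTTP request per dict."""
    lines = [f"{CONFIG} a <{NS}Config> ."]
    for body_no, params in enumerate(bodies, start=1):
        body = f"<{NS}FormParamBody/{body_no}>"
        lines.append(f"{CONFIG} <{NS}formParamBody> {body} .")
        lines.append(f"{body} a <{NS}FormParamBody> .")
        for param_no, (name, value) in enumerate(params.items(), start=1):
            # Hypothetical IRI scheme for the generated parameter resources.
            param = f"<{NS}FormParamBody/{body_no}/FormParam{param_no}>"
            lines.append(f"{body} <{NS}formParam> {param} .")
            lines.append(f"{param} a <{NS}FormParam> .")
            lines.append(f"{param} <{NS}param> {turtle_literal(name)} .")
            lines.append(f"{param} <{NS}value> {turtle_literal(value)} .")
    return "\n".join(lines)


form_turtle = form_param_config([
    {"text": "Test", "title": "Test title"},
    {"text": "Another text", "title": "Another title"},
])
print(form_turtle)
```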
Useful Endpoints for Graph Databases and PoolParty

This DPU is often used to retrieve RDF data from PoolParty or from triple stores using a SPARQL CONSTRUCT query. The following table lists the relevant endpoints:

| Database | Path Variable | Service Path for SPARQL Query | Service Path for SPARQL Graph Store HTTP Protocol | Example |
| --- | --- | --- | --- | --- |
| RDF4J | $REPOSITORY: RDF4J repository name | /$REPOSITORY | /$REPOSITORY_ID/rdf-graphs/service | http://db-rdf4j-oews-poc.semantic-web.at:8080/rdf4j-server/repositories/test?query=... |
| Stardog | $DATABASE: Stardog database name | /$DATABASE/query | /$DATABASE | http://db-stardog-stardog-poc.semantic-web.at:5820/msmetrics/query?query=.... |
| MarkLogic | None. "repository" is decided by port number | /v1/graphs/sparql | /v1/graphs | http://pp-sem-doc.demo.marklogic.com:8026/v1/graphs/sparql with Construct in Body (raw body, text (text/plain), UTF-8) |
| AllegroGraph | $REPOSITORY: RDF4J repository name | /repositories/$REPOSITORY | Not supported | https://db-allegrograph.poolparty.biz/repositories/tcg?query=Construct%20%7B%3Fs%20%3Fp%20%3Fo%7D%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D&queryLn=SPARQL |
| GraphDB | $REPOSITORY: GraphDB repository name | /$REPOSITORY | /$REPOSITORY_ID/rdf-graphs/service | http://172.28.9.18:7200/repositories/requisition?query=.... |
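
The example URLs carry the SPARQL query percent-encoded in the `query` parameter. A minimal Python sketch for building such a URL with the standard library is shown below; the function name is hypothetical, and the endpoint and query are taken from the AllegroGraph example above.

```python
from urllib.parse import quote, urlencode


def sparql_query_url(endpoint, query, **extra_params):
    """Append a percent-encoded SPARQL query (and optional extra parameters) to an endpoint."""
    params = {"query": query, **extra_params}
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return f"{endpoint}?{urlencode(params, quote_via=quote)}"


url = sparql_query_url(
    "https://db-allegrograph.poolparty.biz/repositories/tcg",
    "Construct {?s ?p ?o} WHERE {?s ?p ?o}",
    queryLn="SPARQL",
)
print(url)
```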

Examples

GET: Create PoolParty project snapshot

The following image shows a fragment of a pipeline which first executes a SPARQL DELETE in a PoolParty project (SPARQL Endpoint Loader) and then creates a snapshot of the PoolParty project with an HTTP request. The configuration for the snapshot API call can be seen in the image below. For more information on the Web Service Method used, see: https://help.poolparty.biz/x/ySiU

24577093.png
24577094.png
GET: Get Subtree Concepts from PoolParty

The following image shows another configuration example to make an API call to PoolParty. For this example, the Web Service Method Request Subtree of Concept or Concept Scheme is used. For more information on this Web Service Method see here: https://help.poolparty.biz/x/AimU

24577095.png
POST: With Config Input Example

The following image shows a fragment of a pipeline which downloads a file and sends it to the PoolParty extractor to be annotated. From the response, a configuration is constructed and sent as input to the HTTP API Request DPU. Finally, the result is loaded into a SPARQL endpoint.

The SPARQL Construct configuration can be seen below. The constructed triples are passed to the HTTP API Request DPU as configuration parameters and must follow a specific format. The construct creates a unique configuration for each subject, which is essentially the document URI. Attached to each configuration is the document text, encoded to remove whitespace.

24577096.png
24577097.png
POST: Simple GraphSearch Example

The following image shows a fragment of a pipeline which is used to populate a GraphSearch search space. Not shown are the preceding steps, in which files are downloaded from a folder on the server and annotated using the PoolParty Concept Extractor. The fragment starts with the configuration built from the annotation results, which is passed to the PoolParty GraphSearch Content Indexing Request Constructor. HTTP API requests are then sent to drop the current index, to create the new content in the GraphSearch search space, and finally to refresh the index of the newly created content.

24577098.png
POST: Refresh Search Index of GraphSearch Search Space

The following image shows the configuration for refreshing the search index of GraphSearch. This request is usually attached to the pipeline in which content is created (as seen above).

For further information about this API please check https://help.poolparty.biz/x/FyqU

24577099.png

PoolParty Concept Extractor

Description

PoolParty Concept Extractor (uv-t-poolpartyConceptExtractor):

PoolParty Concept Extractor is a DPU / plugin for UnifiedViews to consume the Concept Extraction service provided by PoolParty Extractor. Given triples with string literal objects representing texts or files containing texts as input, this extractor annotates texts against a thesaurus project in PoolParty and produces annotations in RDF triples as output.

Please refer to the following documentation for more information about PoolParty Extractor.

Configuration Parameters

| Name | Description | Data Type | Example |
| --- | --- | --- | --- |
| Host | Resolvable host name or IP address of the target PoolParty server | String | test.poolparty.biz |
| Port | Port number of PoolParty server | Integer | 80 |
| Extraction service path | PoolParty Concept Extraction service path relative to PoolParty service root URL | String | /extractor/api/annotate |
| Project ID | Project identifier of the PoolParty thesaurus project to be extracted against | String | 12345678-1234-1234-1234-ABCDEF123456 |
| Language code | Two-letter ISO 639-1 code of the source language of the texts to be extracted | String | en |
| Username | Account name of a user for the target PoolParty thesaurus server | String | test |
| Password | Password of a user for the target PoolParty thesaurus server | String | **** |
| Corpus ID | Identifier of a corpus in the project used to adapt scores with corpus analysis | String | 12345678-1234-1234-1234-ABCDEF123456 |
| Number of terms to return | Maximum number of terms to return | Integer | 0 |
| Number of concepts to return | Maximum number of concepts to return | Integer | 50 |
| useTransitiveBroaderConcepts | Retrieve transitive broader concepts of the extracted concepts | Boolean | false |
| useTransitiveBroaderTopConcepts | Retrieve transitive broader top concepts of the extracted concepts | Boolean | false |
| useRelatedConcepts | Retrieve related concepts of the extracted concepts | Boolean | false |
| filterNestedConcepts | The nested concept filter removes concept matches which are contained within other matches | Boolean | true |
| tfidfScoring | The scores of the concepts and terms are weighted by the tf-idf (term frequency-inverse document frequency) formula | Boolean | false |
| useTypes | Retrieve the custom types for concepts | Boolean | false |
| Maximum retry times for failed extraction | Maximum number of retries for a failed extraction | Integer | 3 |
| Use HTTPS | If checked, HTTPS is used for connecting to the target PPX service (false by default) | Boolean | false |
| Use only symbolic names when creating resulting URIs from input files | If checked, virtual path metadata is not used when forming URIs for output resources; symbolic names are used instead | Boolean | false |
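
How the Host, Port, Extraction service path and Use HTTPS parameters might combine into a service URL, and how a retry limit could be applied, is sketched below in Python. Both functions are hypothetical illustrations: the URL layout is an assumption based on the parameter descriptions, and the sketch assumes the retry limit counts additional attempts after the first failed call.

```python
def extractor_url(host, port, service_path, use_https=False):
    """Combine the connection parameters into one service URL (assumed layout)."""
    scheme = "https" if use_https else "http"
    return f"{scheme}://{host}:{port}{service_path}"


def call_with_retries(call, max_retries=3):
    """Run call(); on failure, retry up to max_retries more times, then re-raise."""
    last_error = None
    for _attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as error:
            last_error = error
    raise last_error


url = extractor_url("test.poolparty.biz", 80, "/extractor/api/annotate")

attempts = []


def flaky_extraction():
    """Stand-in for an extraction call that fails twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return "ok"


outcome = call_with_retries(flaky_extraction, max_retries=3)
print(url, outcome, len(attempts))
```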