DPU stands for data processing unit.
There are four DPU types:
- Extractor - should be used if the DPU extracts something from outside UnifiedViews, for example by downloading a file or querying a remote SPARQL endpoint.
- Transformer - used for DPUs that transform input data into output data, where both the input and the output data are stored within the UnifiedViews working store.
- Loader - the counterpart of Extractor. Should be used for DPUs that move data outside of UnifiedViews.
- Quality - used for a special type of DPU that assesses the quality of resources (this type is experimental).
Every data processing task is represented by a data processing pipeline (or simply pipeline) in UnifiedViews. A pipeline contains data processing units (DPUs) and data flows between these DPUs. A pipeline may be designed, executed, scheduled, and debugged.
A data unit is a container for data being exchanged between DPUs. We distinguish input and output data units: an output data unit contains data produced by the DPU, while an input data unit contains the data that serves as input to the DPU's execution.
There are currently three types of data unit (each can be used as an input or an output data unit):
- RDF data unit
- Files data unit
- Relational data unit
Every data unit holds entries of a certain type, depending on the type of the data unit. The supported entries are:
- RDF graphs (supported by RDF data unit)
- files (supported by Files data unit)
- relational tables (supported by Relational data unit)
A DPU developer does not have to work directly with data units: there are Java helpers that allow you to perform typical operations on top of data units, such as adding a new entry, getting all entries, or specifying certain metadata of an entry. Please see the Tutorial for details.
For each entry, the data unit holds basic metadata, such as the name of the entry, its location, etc. Metadata is represented in the form of RDF triples and stored in the UnifiedViews RDF working store. The following sections describe the main metadata associated with each entry (based on the type of the entry).
Main metadata held by all types of data unit entries are:
- symbolicName = identifier of the entry (file, RDF graph, relational table) in the data unit. This identifier may change as the entry is processed by the pipeline, so the DPU developer must not rely on the durability of the symbolic name: a symbolic name created by one DPU may be changed by the next DPU in the pipeline execution. For RDF data units, the symbolic name is typically taken from dataGraphURI (see below). For Files data units, the symbolic name is typically derived from
Main metadata held by Files data unit entries (files):
- fileURI = URI under which the file is stored by the UnifiedViews backend engine as it is processed in the given pipeline execution.
- virtualPath = path under which the given file should be stored when it is loaded outside of UnifiedViews to some target server. The value of the fileURI metadata cannot be used for this purpose, as it points to the internal storage of the file. For example, if the virtual path is x/y/data.ttl and a loader loads the data to target folder F, the file should be automatically placed at F/x/y/data.ttl. In most cases, the symbolicName of a file equals its virtualPath.
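The virtualPath rule described above can be illustrated with a minimal sketch. It is not part of the UnifiedViews API: the class `VirtualPathSketch` and the method `resolveTarget` are hypothetical names, and the check that rejects virtual paths escaping the target folder is an extra precaution assumed here, not something the text above prescribes.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: illustrates the virtualPath rule only, not UnifiedViews code.
class VirtualPathSketch {

    // Computes where a loader would place a file, given its target folder
    // and the virtualPath metadata value of the entry.
    static Path resolveTarget(Path targetFolder, String virtualPath) {
        Path base = targetFolder.normalize();
        Path resolved = base.resolve(virtualPath).normalize();
        // Assumed precaution: refuse virtual paths that leave the target folder.
        if (!resolved.startsWith(base)) {
            throw new IllegalArgumentException(
                    "virtualPath escapes the target folder: " + virtualPath);
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Target folder F and virtual path x/y/data.ttl, as in the example above.
        System.out.println(resolveTarget(Paths.get("F"), "x/y/data.ttl"));
    }
}
```

In a real loader, the virtual path would be read from the entry metadata through the Java helpers rather than passed in as a literal string.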
The sample below (in TriG syntax) shows metadata produced by e-filesDownload (as it is stored in the internal working RDF store of UnifiedViews). We also show the configuration of e-filesDownload below.
Line 8 defines the symbolic name, Line 9 defines the real working URI under which the file is stored in the UnifiedViews working store, and Line 10 defines virtualPath. All such data is stored in the context http://unifiedviews.eu/resource/internal/dataunit/exec/83/dpu/80/du/0, which is the unique RDF graph for the given data unit within the given DPU instance and within the given pipeline execution. All entries of that data unit appear in this graph.
Main metadata held by RDF data unit entries (RDF graphs):
- dataGraphURI = URI under which the RDF graph is stored by the UnifiedViews backend engine as it is processed in the given pipeline execution (so it is URI of the RDF graph in the UnifiedViews working RDF store). In case of RDF data unit entries, symbolic name for entries is typically set to be equal to
- virtualGraph = path under which the given RDF graph should be stored when it is loaded outside of UnifiedViews to some remote RDF store. The value of the dataGraphURI metadata cannot be used for this purpose, as it is an auto-generated internal value that is different for each pipeline execution and does not comply with any methodology for preparing good RDF graph URIs. For example, if the virtual graph is http://data.company.com/graphs/myGraph, then the data (triples) from this entry are automatically loaded to that graph on the remote RDF store.
The sample below shows metadata produced by the t-rdfGraphMerger DPU, which physically merges two or more RDF data units into one graph. The configuration of this t-rdfGraphMerger DPU is also below.
Line 8 defines the symbolic name, Line 9 defines the real working URI under which the RDF graph is stored in the UnifiedViews working store, and Line 10 defines virtualGraph. All such data is stored in the context http://unifiedviews.eu/resource/internal/dataunit/exec/80/dpu/54/du/1, which is the unique RDF graph for the given data unit within the given DPU instance and within the given pipeline execution. All entries of that data unit appear in this graph.
Lines 13 - 16 contain the definition of the graph http://unifiedviews.eu/resource/internal/dataunit/exec/80/dpu/54/du/1/entry/1, which contains the real data: the triples that were merged and are sent to the output. So in the case of an RDF data unit, the working RDF data graph contains not only metadata but also data. In the case of a Files data unit, data are stored on the file system under
Support for DPU developer to work with metadata
A DPU developer does not need to directly modify the RDF triples that hold entry metadata: there are Java helpers for this. For example, to define the virtualPath metadata value, one may use VirtualPathHelper from the uv-dataunit-helpers module.
Configuration, Migration, Versioning
Basically every DPU has some sort of configuration. For example, in the case of a file extractor, the configuration may consist of a path to the file being extracted. UnifiedViews represents configuration as a string. The string is given to the DPU so that it can configure itself before execution, and it can be modified through the DPU's configuration dialog.
Unfortunately, a string is not the best way to represent complex configurations with many options. To tackle this issue, UnifiedViews provides a way to use simple Java classes as a configuration. So instead of working with a string, a DPU can simply use a user-defined Java class as a configuration object.
During DPU development, the requirements and thus the configuration might change. New configuration options can easily be added to the existing configuration; however, more complex changes (a List is used instead of a Map, a string option is replaced with a boolean, several options are aggregated into one) may call for a new version of the whole configuration object.
UnifiedViews helpers provide support for configuration versioning out of the box. The basic idea is that each older version provides a function that converts it to a newer configuration object. The helpers then ensure that any older version of the configuration is converted to the latest version before it is used.
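The versioning idea can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the UnifiedViews helper API: the interface `UpgradableConfig`, the classes `FileConfig_V1` and `FileConfig_V2`, and the method `toLatest` are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: an older configuration version knows how to
// convert itself into the next version.
interface UpgradableConfig<NEXT> {
    NEXT toNextVersion();
}

// Version 1 (hypothetical): a single file path stored as a string.
class FileConfig_V1 implements UpgradableConfig<FileConfig_V2> {
    String path = "";

    @Override
    public FileConfig_V2 toNextVersion() {
        FileConfig_V2 next = new FileConfig_V2();
        if (!path.isEmpty()) {
            next.paths.add(path); // the single path becomes a one-element list
        }
        return next;
    }
}

// Version 2 (hypothetical, latest): a list of paths replaces the single string.
class FileConfig_V2 {
    List<String> paths = new ArrayList<>();
}

class ConfigMigrationSketch {

    // Applies toNextVersion() repeatedly until the object is no longer upgradable.
    static Object toLatest(Object config) {
        Object current = config;
        while (current instanceof UpgradableConfig) {
            current = ((UpgradableConfig<?>) current).toNextVersion();
        }
        return current;
    }

    public static void main(String[] args) {
        FileConfig_V1 old = new FileConfig_V1();
        old.path = "/data/in.csv";
        FileConfig_V2 latest = (FileConfig_V2) toLatest(old);
        System.out.println(latest.paths);
    }
}
```

Because each version only needs to know its direct successor, adding a third version would mean implementing one more conversion function on `FileConfig_V2`; the chain in `toLatest` would stay unchanged.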
DPU Contexts & ContextUtils
Every DPU has two types of contexts: DPU execution context and configuration dialog context. In both cases, context is accessible under
ctx property (provided by either
AbstractDialog class). Contexts provide access to additional resources and services, and provide a means of communicating with the user.
The intention is to keep user contexts small so that the DPU developer can easily find commonly used functions (localization, checking whether the DPU was cancelled) or access a more advanced context. The user contexts accessible via the ctx property are
The context accessible via the ctx property is typically operated on through the ContextUtils class. This class focuses mostly on communication with the user in the form of exceptions and messages (events).
ContextUtils also applies localization automatically and offers common functions for both the dialog and the execution context.
To get access to the working directory of the DPU, you may use:
ctx.getExecMasterContext().getDpuContext().getWorkingDir();
The working directory may be used to store temporary files needed during execution of the DPU instance.
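As an illustration of using the working directory for temporary files, here is a minimal sketch that relies only on java.nio. It assumes the working directory is available as a `java.nio.file.Path` (a `java.io.File` can be converted with `toPath()`). The class `WorkingDirSketch` and its methods are hypothetical, and `standInWorkingDir` merely simulates a working directory so that the example runs outside UnifiedViews.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: only java.nio is used; nothing here is UnifiedViews API.
class WorkingDirSketch {

    // Stand-in for the DPU working directory; in a real DPU the directory
    // would be obtained from the execution context instead.
    static Path standInWorkingDir() {
        try {
            return Files.createTempDirectory("uv-working-dir-");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stores intermediate content in a uniquely named temporary file
    // inside the given working directory.
    static Path writeTemp(Path workingDir, String content) {
        try {
            Path temp = Files.createTempFile(workingDir, "dpu-", ".tmp");
            Files.write(temp, content.getBytes(StandardCharsets.UTF_8));
            return temp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path workingDir = standInWorkingDir();
        Path temp = writeTemp(workingDir, "intermediate result");
        System.out.println("Temporary file: " + temp);
    }
}
```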
DPU Exceptions - how to inform that something was wrong?
There are two ways to communicate information to a user: events and logs.
Events are meant to deliver more important information than logs. There is no strict rule for when to use events and when to use logs. The idea is that, if everything goes well, the user should get from events all the information they need about the DPU execution, such as which mode was used and, optionally, which resources were processed (be careful here, as there might be a huge number of resources). Logs, on the other hand, should provide detailed information about the DPU's execution, which is useful mainly while the DPU is being debugged.
If the DPU developer wants to signal a fatal DPU failure, the DPU should throw DPUException. If possible, use this exception to report DPU failure (ContextUtils can be used to generate this exception and apply localisation to the given message). There might be situations where it is not that easy to throw DPUException, e.g., in a multi-threaded DPU. In that case, an ERROR event can be emitted. An ERROR event has the same effect as DPUException: the DPU is considered to have failed and the pipeline execution is stopped.
Summary on sending information to the user:
- Use events to report important information to the user.
- Use logs to log anything else.
- Use DPUException to report DPU failure.
Some helpers also throw DPUException in case of a fatal failure; in such a case, the exception should not be caught and the DPU execution should fail.
An Error event causes the DPU to end with Error. A Warning event, an Error log message, or a Warning log message causes the DPU to end with Warning.
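The rule for the final DPU status can be summarized in a small sketch. The enums `Kind` and `Outcome` and the method `outcomeOf` are hypothetical names used only to make the rule explicit; they do not correspond to UnifiedViews classes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of how reported events and log messages
// determine the final status of a DPU.
class DpuOutcomeSketch {

    enum Kind { INFO_EVENT, WARNING_EVENT, ERROR_EVENT, INFO_LOG, WARNING_LOG, ERROR_LOG }

    enum Outcome { SUCCESS, WARNING, ERROR }

    static Outcome outcomeOf(List<Kind> reported) {
        Outcome result = Outcome.SUCCESS;
        for (Kind kind : reported) {
            switch (kind) {
                case ERROR_EVENT:
                    // An Error event always means the DPU ends with Error.
                    return Outcome.ERROR;
                case WARNING_EVENT:
                case WARNING_LOG:
                case ERROR_LOG:
                    // These downgrade the result to Warning, but not to Error.
                    result = Outcome.WARNING;
                    break;
                default:
                    // Informational events and logs do not affect the outcome.
                    break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(outcomeOf(Arrays.asList(Kind.INFO_EVENT, Kind.ERROR_LOG)));
    }
}
```

Note that in this sketch an Error log message alone yields Warning, not Error, which mirrors the rule stated above.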
Labels in dialogs, options, messages, exceptions: all of these shall go through the component that enables localisation. See Overview: src/main/resources/resources.properties to get a general idea of the file where the default text shall be stored.
In order to properly support localization, each string that should be localized shall go through the localization function available in the user context as ctx.tr. There might be some exceptions in helpers; for example, ContextUtils calls this function on every given part of a message.
Text in DPUs should be referenced by the name of a property in the resources.properties file. See existing DPUs for examples. It is highly recommended to prefix each property name with the DPU name.
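The lookup of prefixed property names can be sketched with the standard `java.util.Properties` and `java.text.MessageFormat` classes. This is not the implementation behind ctx.tr: the class `LocalizationSketch`, the DPU name `MyDpu`, the property names, the `{0}` placeholder syntax, and the fallback to the key for missing properties are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.Properties;

// Hypothetical sketch of a localization function; not the UnifiedViews implementation.
class LocalizationSketch {

    private final Properties texts = new Properties();

    // Loads texts from content in the .properties format
    // (in a DPU, this content would live in a properties file).
    LocalizationSketch(String propertiesContent) {
        try {
            texts.load(new StringReader(propertiesContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Looks up the text for the given property name and fills in the arguments.
    // Assumed behaviour: an unknown property name is returned unchanged.
    String tr(String key, Object... args) {
        String pattern = texts.getProperty(key);
        if (pattern == null) {
            return key;
        }
        return MessageFormat.format(pattern, args);
    }

    public static void main(String[] args) {
        // Property names are prefixed with the (hypothetical) DPU name "MyDpu".
        LocalizationSketch localization = new LocalizationSketch(
                "MyDpu.dialog.uri=File URI\n"
                + "MyDpu.error.download=Cannot download {0}\n");
        System.out.println(localization.tr("MyDpu.dialog.uri"));
        System.out.println(localization.tr("MyDpu.error.download", "data.ttl"));
    }
}
```

Prefixing each property name with the DPU name keeps the names unique when texts from several DPUs end up side by side.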