
This tutorial explains how entries (files) may be obtained from Files input data units and also how entries (files) may be written to the Files output data unit.

For reading and writing files, you may use either the API classes (low-level approach) or helpers. Where possible, read and write files using the helper (FilesHelper).

For basic information about data units, please see the basic description of data units. To see the core data unit interfaces and how the particular types of data units (RDF, Files) extend them, please look at Core Data Unit Interfaces.


Reading files from input Files data unit

Please prepare DPU "MyDpu" as described in Tutorial: Creating new DPU. To read files from an input Files data unit, you first have to define the input Files data unit.

 @DataUnit.AsInput(name = "input")
 public FilesDataUnit input;

All data units must be public and properly annotated. The annotation must at least contain a name, which is the name shown to pipeline developers in the UnifiedViews administration interface. The code above goes into the main DPU class.

Using API classes

Next, we need to iterate over the input data unit to get access to the files that arrive through it. The code below goes into the innerExecute() method of the DPU.

Code 1 - Iterating over input files using API classes
Set<File> files = new HashSet<>();
FilesDataUnit.Iteration it = null;
try {
    it = input.getIteration();
    while (it.hasNext()) {
        final String filePathUri = it.next().getFileURIString();
        files.add(new File(java.net.URI.create(filePathUri)));
    }
} catch (DataUnitException ex) {
    throw ContextUtils.dpuException(ctx, ex, "dpuName.error");
} finally {
    if (it != null) {
        try {
            it.close();
        } catch (DataUnitException ex) {
            log.error("Error on close.", ex);
        }
    }
}
 


In Lines 5 - 8, we iterate over the entries in the input data unit. In Line 6, we get the URI (physical location) of the entry; in Line 7, we convert it to a File and add it to the set of files. Note: Storing the list of incoming files in a Java object is not suitable for a large number of input files; in that case it is better to process each file directly during the iteration.
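The note above can be illustrated with a minimal sketch that handles each file as soon as its URI is read, instead of collecting all files first. The class DirectFileProcessing and the method totalSize are hypothetical names, and a plain Iterator<String> of file URI strings stands in for FilesDataUnit.Iteration so that the fragment runs without UnifiedViews on the classpath. Summing file sizes is only a placeholder for real per-file processing.

```java
import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Collections;
import java.util.Iterator;

final class DirectFileProcessing {

    // Processes files one by one; only a running total is kept in memory.
    // The Iterator<String> is an assumed stand-in for FilesDataUnit.Iteration.
    static long totalSize(Iterator<String> fileUris) throws IOException {
        long total = 0;
        while (fileUris.hasNext()) {
            // Same URI-to-File conversion as in Code 1
            File file = new File(URI.create(fileUris.next()));
            total += Files.size(file.toPath());
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for an entry of the input data unit
        File tmp = File.createTempFile("entry", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "hello".getBytes(StandardCharsets.UTF_8));
        Iterator<String> uris = Collections.singletonList(tmp.toURI().toString()).iterator();
        System.out.println("Total size in bytes: " + totalSize(uris));
    }
}
```

In a DPU, the body of this loop would sit inside the while loop of Code 1, so that no set of File objects has to be built.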

 

As the iterator over files does not extend AutoCloseable, we need to take care of closing it at the end (Line 14). That is why we do all the work in a try-catch block (Lines 3 - 11) with a finally block (Lines 11 - 19). We also catch the DataUnitException, which may be thrown by the iterator, in Lines 9 - 11.
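If the manual finally block feels heavy, one possible alternative is a small adapter that lets try-with-resources do the closing. The sketch below is an illustration only: CloseAdapterSketch, ManuallyClosed and asAutoCloseable are hypothetical names, and ManuallyClosed is a stand-in for an iteration type that has a close() method but is not AutoCloseable. Whether this is worthwhile depends on the version of the data unit API you build against.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class CloseAdapterSketch {

    // Hypothetical stand-in for a type with close() that does not extend AutoCloseable
    interface ManuallyClosed {
        void close() throws Exception;
    }

    // Wraps the manual close() so that try-with-resources can call it
    static AutoCloseable asAutoCloseable(ManuallyClosed resource) {
        return resource::close;
    }

    public static void main(String[] args) throws Exception {
        AtomicBoolean closed = new AtomicBoolean(false);
        ManuallyClosed iteration = () -> closed.set(true);
        try (AutoCloseable ignored = asAutoCloseable(iteration)) {
            // iterate over the entries here
        }
        // close() has been called automatically when the try block was left
        System.out.println("closed = " + closed.get());
    }
}
```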

After executing Code 1, we have the set of files in the variable files. We may then work with the files as needed.

 

Using FilesHelper

The code introduced in Code 1 can be simplified by using a helper - eu.unifiedviews.helpers.dataunit.files.FilesHelper in uv-dataunit-helpers - which automatically stores all the entries in a set of data unit entries. In this case, the DPU developer does not need to handle the iteration manually. This approach should be used only for a smaller number of files, as all the entries (each carrying some metadata, such as the file URI and symbolic name) are copied to a Java object and kept in memory.

Code 2 - Iterating over input files using helper FilesHelper
try {
    Set<FilesDataUnit.Entry> files = FilesHelper.getFiles(input);
} catch (DataUnitException ex) {
    throw ContextUtils.dpuException(ctx, ex, "dpuName.error");
}

Line 2 returns the set of entries. If you would rather work with Java File objects when processing the entries, you may call public static File asFile(FilesDataUnit.Entry entry) of FilesHelper to convert an entry to an instance of File.

 

Writing files to output Files data unit

Please prepare DPU "MyDpu" as described in Tutorial: Creating new DPU. To write files to an output Files data unit, you first have to define the output Files data unit.

@DataUnit.AsOutput(name = "output")
public WritableFilesDataUnit output;

All data units must be public and properly annotated. The annotation must at least contain a name, which is the name shown to pipeline developers in the UnifiedViews administration interface. The code above goes into the main DPU class.

Using API classes

First, let's introduce how to add files using API classes.

There are two methods DPU developers may use to add files to the output Files data unit (using API classes):

  • String addNewFile(String symbolicName) throws DataUnitException;
    • This method creates a new empty file in the output data unit with the given symbolicName. For an explanation of symbolicNames and other metadata of entries in data units, please see Basic Concepts for DPU developers. The physical name of the created file is generated and the file is physically stored in the working directory of the given pipeline execution.
  • void addExistingFile(String symbolicName, String existingFileURIString) throws DataUnitException;
    • This method adds an existing file located at existingFileURIString to the output data unit. It automatically creates a new entry in the output data unit with the given symbolicName. For an explanation of symbolicNames and other metadata of entries in data units, please see Basic Concepts for DPU developers. In this case, the real location and the physical name of the file stay as they were when the file was created, before this method was called. Be careful: in this case, the file is not created in the working space of the given pipeline execution.

 

To add an existing file, the code below may be used in the innerExecute() method of the DPU.

Code 3 - Adding file to the output data unit using API classes
File file = ...
String fileName = ...
output.addExistingFile(fileName, file.toURI().toString());
MetadataUtils.set(output, fileName, FilesVocabulary.UV_VIRTUAL_PATH, fileName);

In Line 3, the new entry in the output data unit is created; for this entry, the symbolicName metadata is set equal to fileName and existingFileURIString is set to file.toURI().toString().

Line 4 then sets the virtualPath metadata for the entry. For an explanation of this metadata, please see Basic Concepts for DPU developers. Every DPU developer creating an output file should also set this metadata.
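The conversion between a File and its URI string, used when writing (file.toURI().toString() in Code 3) and when reading (java.net.URI.create in Code 1), can be checked in isolation. The following is a minimal sketch using only standard Java classes; FileUriRoundTrip, toUriString and fromUriString are hypothetical names, and the file name with a space is chosen only to show that the URI form encodes such characters.

```java
import java.io.File;
import java.net.URI;

final class FileUriRoundTrip {

    // Produces the string form that Code 3 passes as existingFileURIString
    static String toUriString(File file) {
        return file.toURI().toString();
    }

    // Reverses the conversion, as done when reading entries in Code 1
    static File fromUriString(String fileUri) {
        return new File(URI.create(fileUri));
    }

    public static void main(String[] args) {
        // The file does not need to exist for the conversion to work
        File original = new File(System.getProperty("java.io.tmpdir"), "my report.csv");
        String uri = toUriString(original);
        System.out.println(uri);
        // The round trip yields the absolute form of the original file
        System.out.println(fromUriString(uri).equals(original.getAbsoluteFile()));
    }
}
```

Passing a plain path instead of a URI string to java.net.URI.create would fail for names containing spaces, which is one reason to always go through File.toURI().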

Using FilesHelper

The code introduced in Code 3 can be simplified by using a helper - eu.unifiedviews.helpers.dataunit.files.FilesHelper in uv-dataunit-helpers. In general, DPU developers should use helpers where possible.
There are three methods DPU developers may use to add files to the output Files data unit (using the helper):
  • public static FilesDataUnit.Entry createFile(WritableFilesDataUnit filesDataUnit, final String filename) throws DataUnitException
    • This method creates a new empty file in the filesDataUnit data unit with the symbolicName and virtualPath metadata equal to filename. For an explanation of symbolicName, virtualPath and other metadata of entries in data units, please see Basic Concepts for DPU developers. The physical name of the created file is generated and the file is physically stored in the working directory of the given pipeline execution.
  • public static FilesDataUnit.Entry addFile(WritableFilesDataUnit filesDataUnit, final File file, final String filename) throws DataUnitException
    • This method adds an existing file to the filesDataUnit. It automatically creates a new entry in the output data unit with the symbolicName and virtualPath metadata equal to filename. For an explanation of symbolicName, virtualPath and other metadata of entries in data units, please see Basic Concepts for DPU developers. In this case, the real location and the physical name of the file stay as they were when the file was created, before this method was called. Be careful: in this case, the file is not created in the working space of the given pipeline execution, so it is not, e.g., automatically cleaned up.
  • public static FilesDataUnit.Entry addFile(WritableFilesDataUnit filesDataUnit, final File file) throws DataUnitException
    • The same as the method above. In this case, filename is automatically computed as file.getName().


As the methods above return a FilesDataUnit.Entry, you may also use the public static File asFile(FilesDataUnit.Entry entry) method to convert the returned entry to a standard Java File object.

Creating new (output) file

To create a new file in the working store of the DPU and then work with that file, you may use the following approach:

 

Code 4a
Entry newEntry = FilesHelper.createFile(output, "myFile");
File f = FilesHelper.asFile(newEntry);
//write to the file 

Line 1 creates a new empty file in the working space of the DPU. Line 2 obtains a File object for further work with the file.

Note: "myFile" in Line 1 is the symbolic name of the file; it is not the real name or location of the file.
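The "write to the file" step in Code 4a is ordinary Java I/O on the File object. As a minimal sketch, assuming text content in UTF-8, the fragment below writes lines to a given File; WriteToEntryFile and writeLines are hypothetical names, and a temporary file stands in for the File returned by FilesHelper.asFile so that the example runs on its own.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;

final class WriteToEntryFile {

    // Writes the given lines to the target file, replacing any previous content
    static void writeLines(File target, List<String> lines) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target.toPath(), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed stand-in for: File f = FilesHelper.asFile(newEntry);
        File f = File.createTempFile("myFile", ".tmp");
        f.deleteOnExit();
        writeLines(f, Arrays.asList("first line", "second line"));
        System.out.println(Files.readAllLines(f.toPath(), StandardCharsets.UTF_8));
    }
}
```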

Adding existing file to the output

To add an existing file, you may follow the example below. The fragment also shows how Code 3 (adding an existing file to the output Files data unit) may be simplified by using helpers.

 

Code 4
File file = ...
String fileName = ... 
FilesHelper.addFile(output, file, fileName); 

 

Note that it is suggested to copy the existing file to the working directory of the DPU (this has to be done manually using java.io classes) before you add it to the output, so that, e.g., the file is automatically deleted when the pipeline finishes. You may obtain the working directory by calling ctx.getExecMasterContext().getDpuContext().getWorkingDir(); in your main DPU class.
Note: To add a file to the output data unit, it is not enough to copy the file to the working directory; you also have to call FilesHelper.addFile or FilesHelper.createFile to let the output data unit know that there is a file which goes to the output.
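The manual copy step could look like the minimal sketch below. It uses java.nio.file.Files.copy from the standard library rather than java.io streams; WorkingDirCopy and copyInto are hypothetical names, and temporary locations stand in for the external file and for the directory returned by getWorkingDir(). Name clashes between files copied into the same directory are not handled here.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class WorkingDirCopy {

    // Copies source into workingDir under the same file name and returns the copy
    static File copyInto(File source, File workingDir) throws IOException {
        Files.createDirectories(workingDir.toPath());
        Path target = workingDir.toPath().resolve(source.getName());
        Files.copy(source.toPath(), target, StandardCopyOption.REPLACE_EXISTING);
        return target.toFile();
    }

    public static void main(String[] args) throws IOException {
        // Temporary locations stand in for an external file and the DPU working directory
        File source = File.createTempFile("external", ".txt");
        source.deleteOnExit();
        Files.write(source.toPath(), "data".getBytes(StandardCharsets.UTF_8));
        File workingDir = Files.createTempDirectory("dpu-working-dir").toFile();
        File copy = copyInto(source, workingDir);
        System.out.println("Copied to " + copy);
        // In a DPU, the copy would then be registered with the output data unit,
        // e.g. FilesHelper.addFile(output, copy, fileName);
    }
}
```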

Summary on using helpers

The advantage of the helpers is that the code is cleaner and easier to use. Compare Code 3, which needs two lines to add an existing file to the output Files data unit using only API classes, with the single line needed when the helper is used. Further, when the helper is not used, the DPU developer must be aware of the virtualPath metadata and must know that the recommended practice is to set virtualPath = symbolicName.

 

Using WritableSimpleFiles DPU extension

Apart from FilesHelper, there is also WritableSimpleFiles, which is not a data unit helper but a DPU extension. This extension may be used to write files into an output data unit. Its advantages are:

  • The methods for creating new file entries and adding existing entries are a bit simpler, as they do not take the data unit as a parameter. WritableSimpleFiles is bound to a certain data unit at the beginning, when the extension is initialized.
  • It automatically uses the FaultTolerant extension, if it is allowed for the DPU. So if you make your DPUs fault tolerant, you should consider using the WritableSimpleFiles extension, as it hides the fault tolerance calls smoothly.
  • It is more efficient for larger numbers of files: it does not use static helper methods for adding files, which means that adding a large amount of data using WritableSimpleFiles uses one connection to the underlying RDF working store with file entries instead of one connection per file (as FilesHelper does).

For details about the WritableSimpleFiles extension, please see its documentation.

 

 

