Working With the Files Data Unit

This section contains a short guide on how entries (files) may be obtained from input Files data units or written to output Files data units.

Note

For reading and writing files, you may use either API classes (the low-level approach) or helpers. If possible, read and write files using helpers (FilesHelper).

For basic information about data units, please see the basic description of data units.

Reading Files From the Input Files Data Unit

Prepare DPU 'MyDpu' as described in Working With the Files Data Unit. To read files from the input Files data unit, you have to define the input Files data unit.

@DataUnit.AsInput(name = "input")
public FilesDataUnit input;

All data units must be public with the proper annotation: they must at least contain a name, which will be the name visible in the UnifiedViews administration interface for pipeline developers. The code above goes to the Main DPU class.

Using API Classes

Next, we need to iterate over the input data unit to access the files that arrive through it.

The code below goes to the innerExecute() method of the DPU.

Code 1 - Iterating over input files using API classes

Set<File> files = new HashSet<>();
FilesDataUnit.Iteration it = null;
try {
    it = input.getIteration();
    while (it.hasNext()) {
        final String filePathUri = it.next().getFileURIString();
        files.add(new File(java.net.URI.create(filePathUri)));
    }
} catch (DataUnitException ex) {
    throw ContextUtils.dpuException(ctx, ex, "dpuName.error");
} finally {
    if (it != null) {
        try {
            it.close();
        } catch (DataUnitException ex) {
            log.error("Error on close.", ex);
        }
    }
}
 
  • In Lines 5 - 8, we iterate over the entries in the input data unit.

  • In Line 6, we get the URI (physical location) of the entry.

  • In Line 7, we convert that URI to a File and add it to the set of files.

    Note

    This approach, in which we store the list of incoming files in a Java object, is not suitable for a large number of input files. In such cases, it is better to process each file directly.

  • As the iterator over files does not extend AutoCloseable, we need to close it ourselves at the end (Line 14). That is why we do all the work in a try-catch block (Lines 3 - 11) with a finally statement (Lines 11 - 19). We also catch the DataUnitException, which may be thrown by the iterator, in Lines 9 - 11.

So after executing Code 1, we have a set of files within the variable files. We may then work with the files as needed.
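The conversion in Lines 6 - 7 of Code 1, and the advice to process files directly rather than collect them, can be tried outside a DPU. The following is a minimal sketch using only standard Java classes: the class UriToFileSketch, its methods, and the sample URIs are hypothetical, and a plain Iterable of URI strings stands in for the values that getFileURIString() would return.

```java
import java.io.File;
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-alone sketch; it does not use the UnifiedViews API.
class UriToFileSketch {

    // Converts a file URI string into a java.io.File, as Lines 6 - 7 of Code 1 do.
    static File toFile(String fileUriString) {
        return new File(URI.create(fileUriString));
    }

    // Hands each file to 'action' as soon as its URI is read, instead of
    // first collecting all files into a set. Returns the number of files handled.
    static int processEach(Iterable<String> fileUriStrings, Consumer<File> action) {
        int count = 0;
        for (String uri : fileUriStrings) {
            action.accept(toFile(uri));
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample URIs are made up for illustration.
        List<String> uris = Arrays.asList("file:/tmp/uv/a.csv", "file:/tmp/uv/b.csv");
        List<String> names = new ArrayList<>();
        int handled = processEach(uris, file -> names.add(file.getName()));
        System.out.println(handled + " files: " + names);
    }
}
```

In a real DPU, the body of the while loop in Code 1 would play the role of processEach, with the per-file work done inside the loop.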

Using FilesHelper

The code introduced in Code 1 can be simplified.

The helper eu.unifiedviews.helpers.dataunit.files.FilesHelper in uv-dataunit-helpers automatically stores all the entries in a set of data unit entries, so the DPU developer does not need to handle the iteration manually.

Note

This approach should be used only for a small number of files, since all the entries (each carrying a few metadata items, such as the file URI and the symbolic name) are copied to the Java object and held in memory.

Code 2 - Iterating over input files using helper FilesHelper

try {
    Set<FilesDataUnit.Entry> files = FilesHelper.getFiles(input);
} catch (DataUnitException ex) {
    throw ContextUtils.dpuException(ctx, ex, "dpuName.error");
}

  • Line 2 returns the set of entries. If you would rather work with Java File objects when processing the entries, you may call public static File asFile(FilesDataUnit.Entry entry) to convert an entry to an instance of File.

Writing Files to Output Files Data Unit

Prepare DPU 'MyDpu' as described in Working With the Files Data Unit. To write files to the output Files data unit, you have to define the output Files data unit.

@DataUnit.AsOutput(name = "output")
public WritableFilesDataUnit output;

All data units must be public with the proper annotation: they must at least contain a name, which will be the name visible in the UnifiedViews administration interface for pipeline developers. The code above goes to the Main DPU class.

Using API Classes

First, let's look at how to add files using API classes.

There are two methods DPU developers may use to add files to the output Files data unit (using the API classes):

  • String addNewFile(String symbolicName) throws DataUnitException;

    • This method creates a new empty file in the output data unit with the given symbolicName. For an explanation of symbolicNames and other metadata of entries in data units, please see Basic Concepts for DPU Developers. The physical name of the created file is generated, and the file is physically stored in the working directory of the given pipeline execution.

  • void addExistingFile(String symbolicName, String existingFileURIString) throws DataUnitException;

    • This method adds an existing file located at existingFileURIString to the output data unit. It automatically creates a new entry in the output data unit with the given symbolicName. For an explanation of symbolicNames and other metadata of entries in data units, please see Basic Concepts for DPU Developers. In this case, the real location and the physical name of the file remain as they were when the file was created, before this method was called. Be careful: in this case, the file is not created in the working space of the given pipeline execution.

To add an existing file, the code below may be used in the innerExecute() method of the DPU.

Code 3 - Adding file to the output data unit using API classes

File file = ...
String fileName = ...
output.addExistingFile(fileName, file.toURI().toString());
MetadataUtils.set(output, fileName, FilesVocabulary.UV_VIRTUAL_PATH, fileName);

  • In Line 3, a new entry is created in the output data unit; for this entry, the symbolicName metadata is set to fileName and existingFileURIString is set to file.toURI().toString().

  • Line 4 then sets the virtualPath metadata for the entry. For an explanation of this metadata, please see Basic Concepts for DPU Developers. Every DPU developer creating an output file should also set this metadata.
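To see what kind of string file.toURI().toString() passes as existingFileURIString, the conversion can be run on its own. This is a minimal sketch using only standard Java classes; the class FileUriStringSketch and its method are hypothetical and not part of the UnifiedViews API.

```java
import java.io.File;
import java.io.IOException;
import java.net.URI;

// Hypothetical stand-alone sketch; it does not use the UnifiedViews API.
class FileUriStringSketch {

    // Produces the URI string form of a file, as done in Line 3 of Code 3.
    static String toUriString(File file) {
        return file.toURI().toString();
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("report", ".csv");
        file.deleteOnExit();

        String uriString = toUriString(file);
        // The result is an absolute URI with the "file" scheme.
        System.out.println(uriString);

        // The string can be turned back into a File, which is what a
        // consuming DPU does when reading the entry (see Code 1).
        File roundTrip = new File(URI.create(uriString));
        System.out.println(roundTrip.equals(file.getAbsoluteFile()));
    }
}
```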

Using FilesHelper

The code introduced in Code 3 can be simplified by using the helper eu.unifiedviews.helpers.dataunit.files.FilesHelper in uv-dataunit-helpers.

In general, as a DPU developer you should use helpers, if possible.

DPU developers may use the following methods to add files to the output Files data unit (using the helper):

  • public static FilesDataUnit.Entry createFile(WritableFilesDataUnit filesDataUnit, final String filename) throws DataUnitException

    • This method creates a new empty file in the filesDataUnit data unit with the symbolicName and virtualPath metadata equal to filename. For an explanation of symbolicName, virtualPath and other metadata of entries in data units, please see Basic Concepts for DPU Developers. The physical name of the created file is generated, and the file is physically stored in the working directory of the given pipeline execution.

  • public static FilesDataUnit.Entry addFile(WritableFilesDataUnit filesDataUnit, final File file, final String filename) throws DataUnitException

    • This method adds an existing file to the filesDataUnit. It automatically creates a new entry in the output data unit with the symbolicName and virtualPath metadata equal to filename. For an explanation of symbolicName, virtualPath and other metadata of entries in data units, please see Basic Concepts for DPU Developers. In this case, the real location and the physical name of the file remain as they were when the file was created, before this method was called. Be careful: in this case, the file is not created in the working space of the given pipeline execution, so it is not, for example, cleaned up automatically.

  • public static FilesDataUnit.Entry addFile(WritableFilesDataUnit filesDataUnit, final File file) throws DataUnitException

    • The same as the method above. In this case, filename is automatically computed as file.getName().

As the methods above return a FilesDataUnit.Entry, you may also use the public static File asFile(FilesDataUnit.Entry entry) method to convert the returned entry to a standard Java File object.

Creating New (Output) File

To create a new file in the working store of the DPU and then work with that file, you may use the following approach:

Code 4a

FilesDataUnit.Entry newEntry = FilesHelper.createFile(output, "myFile");
File f = FilesHelper.asFile(newEntry);
// write to the file

  • Line 1 creates a new empty file in the working space of the DPU. Line 2 obtains a File object for further work with the file.

Note

'myFile' in Line 1 is the symbolic name of the file; it is not the real name or location of the file.
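The "write to the file" step in Code 4a is ordinary Java file I/O. The following is a minimal sketch of one way to do it, assuming the content is UTF-8 text: the class WriteToOutputFileSketch and its method are hypothetical, and a temporary file stands in for the File returned by FilesHelper.asFile(newEntry).

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical stand-alone sketch; it does not use the UnifiedViews API.
class WriteToOutputFileSketch {

    // Writes 'content' as UTF-8 text to 'f', replacing any previous content.
    static void writeText(File f, String content) {
        try (BufferedWriter writer = Files.newBufferedWriter(f.toPath(), StandardCharsets.UTF_8)) {
            writer.write(content);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for: File f = FilesHelper.asFile(newEntry);
        File f = File.createTempFile("myFile", ".tmp");
        f.deleteOnExit();

        writeText(f, "first line of output\n");
        System.out.print(new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8));
    }
}
```

Inside a DPU, an I/O failure would more likely be reported through the DPU's own exception handling, as in Code 1, than through an unchecked exception.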

Adding Existing File to the Output

To add an existing file, you may follow the example below. The fragment also shows how Code 3 (adding an existing file to the output Files data unit) may be simplified by using helpers.

Code 4b

File file = ...
String fileName = ... 
FilesHelper.addFile(output, file, fileName); 

Note

It is best practice to copy the existing file to the working directory of the DPU (this has to be done manually using java.io classes) before you add it to the output. That way, for example, the file is deleted automatically when the pipeline finishes.

You may obtain the working directory by calling ctx.getExecMasterContext().getDpuContext().getWorkingDir(); in your main DPU class.

To add a file to the output data unit, it is not enough to copy the file to the working directory; you also have to call FilesHelper.addFile or FilesHelper.createFile to let the output data unit know that there is a file which goes to the output.
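The manual copy step mentioned in this note can be sketched as follows. This is a minimal sketch, not the project's own code: the class WorkingDirCopySketch and its method are hypothetical, a temporary directory stands in for the directory returned by getWorkingDir(), and java.nio.file.Files is used here in place of plain java.io streams.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-alone sketch; it does not use the UnifiedViews API.
class WorkingDirCopySketch {

    // Copies 'source' into 'workingDir' under the same file name and returns the copy.
    static File copyToWorkingDir(File source, File workingDir) {
        try {
            Files.createDirectories(workingDir.toPath());
            Path target = workingDir.toPath().resolve(source.getName());
            Files.copy(source.toPath(), target, StandardCopyOption.REPLACE_EXISTING);
            return target.toFile();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) throws IOException {
        File source = File.createTempFile("existing", ".txt");
        source.deleteOnExit();
        Files.write(source.toPath(), "hello".getBytes(StandardCharsets.UTF_8));

        // Stand-in for the DPU working directory.
        File workingDir = Files.createTempDirectory("dpu-working-dir").toFile();

        File copy = copyToWorkingDir(source, workingDir);
        // In a DPU, the copy would now be registered with the output data unit,
        // for example: FilesHelper.addFile(output, copy, fileName);
        System.out.println(copy.getAbsolutePath());
    }
}
```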

Summary on Using Helpers

The advantage of the helpers is cleaner, easier-to-use code. Compare the code needed to add an existing file to the output Files data unit with the helper and with only API classes (Code 3):

  • with helper: one line

  • without helper: two lines

Further, when the helper is not used, you as a DPU developer must be aware of the virtualPath metadata and must know that the best practice is to set virtualPath = symbolicName.

Using the WritableSimpleFiles DPU Extension

Apart from the FilesHelper, there is also WritableSimpleFiles, which is not a data unit helper, but a DPU extension. Such an extension may be used to write files into the output data unit.

The advantage of such an extension is this:

  • The methods for creating new file entries and adding existing entries are a bit simpler, as they do not take the data unit as a parameter. WritableSimpleFiles is bound to a certain data unit when the extension is initialized.

For details about the WritableSimpleFiles extension, please see Working With the Files Data Unit