Working With the Files Data Unit
This section contains a short guide on how entries (files) may be read from the input Files data unit or written to the output Files data unit.
Note
For reading and writing files, you may use either the API classes (low-level approach) or helpers. If possible, read and write files using helpers (FilesHelper).
For basic information about data units, please see the basic description of data units.
Reading Files From the Input Files Data Unit
Prepare DPU 'MyDpu' as described in Working With the Files Data Unit. To read files from the input Files data unit, you have to define the input Files data unit.
@DataUnit.AsInput(name = "input") public FilesDataUnit input;
All data units must be public with the proper annotation: they must at least contain a name, which will be the name visible in the UnifiedViews administration interface for pipeline developers. The code above goes to the Main DPU class.
Further, we need to iterate over the input data unit in order to get access to files which come through the input files data unit.
The code below goes to the innerExecute() method of the DPU.
Code 1 - Iterating over input files using API classes
Set<File> files = new HashSet<>();
FilesDataUnit.Iteration it = null;
try {
    it = input.getIteration();
    while (it.hasNext()) {
        final String filePathUri = it.next().getFileURIString();
        files.add(new File(java.net.URI.create(filePathUri)));
    }
} catch (DataUnitException ex) {
    throw ContextUtils.dpuException(ctx, ex, "dpuName.error");
} finally {
    if (it != null) {
        try {
            it.close();
        } catch (DataUnitException ex) {
            log.error("Error on close.", ex);
        }
    }
}
In Lines 5 - 8, we iterate over the entries in the input data unit.
In Line 6, we get the URI (physical location) of the entry.
In Line 7, we convert that URI to a File and add it to the set of files.
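The conversion done in Line 7 uses only standard Java classes, so it can be tried in isolation. The following is a minimal sketch; the class name UriToFileSketch, the method toFile and the sample path are made up for illustration and are not part of UnifiedViews.

```java
import java.io.File;
import java.net.URI;

class UriToFileSketch {

    // Converts a file URI string (such as one returned by getFileURIString())
    // into a java.io.File, the same way Line 7 of Code 1 does.
    static File toFile(String fileUriString) {
        return new File(URI.create(fileUriString));
    }

    public static void main(String[] args) {
        // Illustrative path only; the file does not have to exist.
        File original = new File("data/input file.txt").getAbsoluteFile();
        String uriString = original.toURI().toString();

        File restored = toFile(uriString);
        System.out.println(uriString);
        System.out.println(restored.equals(original));
    }
}
```

Note that the URI string is percent-encoded (a space becomes %20), which is why the string should go through java.net.URI rather than being passed to new File(String) directly.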
Note
This approach, in which we store the list of incoming files in a Java object, is not suitable for a large number of input files. In such cases it is better to process each file directly as it is read.
As the iterator over files does not extend AutoCloseable, we need to take care of closing it at the end (Line 14). That is why we do all the work in a try-catch block (Lines 3 - 11) with a finally statement (Lines 11 - 19). We also catch (Lines 9 - 11) the DataUnitException which may be thrown by the iterator.
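For a larger number of input files, the same try/finally pattern can be combined with processing each file as soon as it is read. The sketch below shows this under stated assumptions: DataUnitException, Entry, Iteration and ListIteration are hypothetical stand-ins reduced to the calls used in Code 1, so that the example compiles without UnifiedViews on the classpath, and the per-file work is a placeholder. Unlike Code 1, the sketch lets an exception from close() propagate instead of logging it.

```java
import java.io.File;
import java.net.URI;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class DirectProcessingSketch {

    // Hypothetical stand-ins for the UnifiedViews types used in Code 1.
    static class DataUnitException extends Exception {
        DataUnitException(String message) { super(message); }
    }

    interface Entry {
        String getFileURIString() throws DataUnitException;
    }

    interface Iteration {
        boolean hasNext() throws DataUnitException;
        Entry next() throws DataUnitException;
        void close() throws DataUnitException;
    }

    // In-memory iteration, used only to exercise the sketch.
    static class ListIteration implements Iteration {
        private final Iterator<String> uris;
        boolean closed = false;

        ListIteration(List<String> uris) { this.uris = uris.iterator(); }

        public boolean hasNext() { return uris.hasNext(); }

        public Entry next() {
            final String uri = uris.next();
            return () -> uri;
        }

        public void close() { closed = true; }
    }

    // Handles each file right away instead of collecting all files in a set first.
    static long totalNameLength(Iteration it) throws DataUnitException {
        long total = 0;
        try {
            while (it.hasNext()) {
                File file = new File(URI.create(it.next().getFileURIString()));
                total += file.getName().length(); // placeholder for real per-file work
            }
        } finally {
            it.close(); // always release the iteration, even if processing failed
        }
        return total;
    }

    public static void main(String[] args) throws DataUnitException {
        ListIteration it = new ListIteration(Arrays.asList("file:/tmp/a.txt", "file:/tmp/bb.csv"));
        System.out.println(totalNameLength(it));
        System.out.println(it.closed);
    }
}
```

Only one File object is alive at a time here, so memory use does not grow with the number of input entries.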
So after executing Code 1, we have a set of files within the variable files.
We may then work with the files as needed.
The code introduced in Code 1 can be simplified by using the helper eu.unifiedviews.helpers.dataunit.files.FilesHelper in uv-dataunit-helpers, which automatically stores all the entries to a set of data unit entries. In this case, the DPU developer does not need to manually handle the iteration.
Note
This approach should be used only for a small number of files, since all the entries (each carrying a few metadata items, such as the file URI and the symbolic name) are copied to a Java object and kept in memory.
Code 2 - Iterating over input files using helper FilesHelper
try {
    Set<FilesDataUnit.Entry> files = FilesHelper.getFiles(input);
} catch (DataUnitException ex) {
    throw ContextUtils.dpuException(ctx, ex, "dpuName.error");
}
Line 2 returns a set of entries. When processing entries, if you would rather work with Java File objects, you may call
public static File asFile(FilesDataUnit.Entry entry)
to convert an entry to an instance of File.
Writing Files to Output Files Data Unit
Prepare DPU 'MyDpu' as described in Working With the Files Data Unit. To write files to the output Files data unit, you have to define the output Files data unit.
@DataUnit.AsOutput(name = "output") public WritableFilesDataUnit output;
All data units must be public with proper annotation: they must at least contain a name, which will be the name visible in the UnifiedViews administration interface for pipeline developers. The code above goes to the Main DPU class.
First, let's look at how to add files using API classes.
There are two methods of WritableFilesDataUnit that DPU developers may use to add files to the output Files data unit (using API classes):
String addNewFile(String symbolicName) throws DataUnitException;
This method creates a new empty file in the output data unit with the given symbolicName. For an explanation of symbolicNames and other metadata of entries in data units, please see Basic Concepts for DPU Developers. The physical name of the created file is generated and the file is physically stored in the working directory of the given pipeline execution.
void addExistingFile(String symbolicName, String existingFileURIString) throws DataUnitException;
This method adds an existing file located at existingFileURIString to the output data unit. It automatically creates a new entry in the output data unit with the given symbolicName. For an explanation of symbolicNames and other metadata of entries in data units, please see Basic Concepts for DPU Developers. In this case, the real location and the physical name of the file stay as they were when the file was created, before this method was called. Be careful: in this case, the file is not created in the working space of the given pipeline execution.
To add an existing file, the code below may be used in the innerExecute() method of the DPU.
Code 3 - Adding file to the output data unit using API classes
File file = ...
String fileName = ...
output.addExistingFile(fileName, file.toURI().toString());
MetadataUtils.set(output, fileName, FilesVocabulary.UV_VIRTUAL_PATH, fileName);
In Line 3, a new entry is created in the output data unit; its symbolicName metadata is set to fileName and its existingFileURIString to file.toURI().toString(). Line 4 then sets the virtualPath metadata for the entry. For an explanation of this metadata, please see Basic Concepts for DPU Developers. Every DPU developer creating an output file should also set this metadata.
The code introduced in Code 3 can be simplified by using the helper eu.unifiedviews.helpers.dataunit.files.FilesHelper in uv-dataunit-helpers.
In general, as a DPU developer you should use helpers, if possible.
The helper provides the following methods that DPU developers may use to add files to the output Files data unit:
public static FilesDataUnit.Entry createFile(WritableFilesDataUnit filesDataUnit, final String filename) throws DataUnitException
This method creates a new empty file in the filesDataUnit data unit with the symbolicName and virtualPath metadata equal to filename. For an explanation of symbolicName, virtualPath and other metadata of entries in data units, please see Basic Concepts for DPU Developers. The physical name of the created file is generated and the file is physically stored in the working directory of the given pipeline execution.
public static FilesDataUnit.Entry addFile(WritableFilesDataUnit filesDataUnit, final File file, final String filename) throws DataUnitException
This method adds an existing file to the filesDataUnit. It automatically creates a new entry in the output data unit with the symbolicName and virtualPath metadata equal to filename. For an explanation of symbolicName, virtualPath and other metadata of entries in data units, please see Basic Concepts for DPU Developers. In this case, the real location and the physical name of the file stay as they were when the file was created, before this method was called. Be careful: in this case, the file is not created in the working space of the given pipeline execution, so it is not, for example, cleaned up automatically.
public static FilesDataUnit.Entry addFile(WritableFilesDataUnit filesDataUnit, final File file) throws DataUnitException
The same as the method above. In this case, filename is automatically computed as file.getName().
As the methods above return a FilesDataUnit.Entry as a result, you may also use the public static File asFile(FilesDataUnit.Entry entry) method to convert the returned entry to a standard Java File object.
To create a new file in the working directory of the DPU and then work with that file, you may use the following approach:
Code 4a
FilesDataUnit.Entry newEntry = FilesHelper.createFile(output, "myFile");
File f = FilesHelper.asFile(newEntry);
// write to the file
Line 1 creates a new empty file in the working space of the DPU. Line 2 obtains a File object to further work with the file.
Note
'myFile' in Line 1 is the symbolic name of the file; it is not the real name or location of the file.
To add an existing file, you may follow the example below. The fragment also shows how Code 3 (adding an existing file to the output Files data unit) may be simplified by using helpers.
Code 4b
File file = ...
String fileName = ...
FilesHelper.addFile(output, file, fileName);
Note
It is best practice to copy the existing file to the working directory of the DPU (this has to be done manually using java.io classes) before you add it to the output. That way, for example, the file is deleted automatically when the pipeline finishes.
You may obtain the working directory by calling ctx.getExecMasterContext().getDpuContext().getWorkingDir(); in your main DPU class.
To add a file to the output data unit, it is not enough to copy the file to the working directory; you also have to call FilesHelper.addFile or FilesHelper.createFile to let the output data unit know that there is a file which goes to the output.
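The manual copy into the working directory can be sketched with standard Java classes only. The example below is a minimal sketch, not UnifiedViews code: the class CopyToWorkingDirSketch and the method copyInto are hypothetical names, the working directory is passed in as a plain File (in a DPU it would come from the getWorkingDir() call mentioned above), and java.nio.file.Files.copy is used for the copy itself.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CopyToWorkingDirSketch {

    // Copies an existing file into the given working directory, keeping its name,
    // and returns the copy. An existing file with the same name is overwritten.
    static File copyInto(File workingDir, File existing) throws IOException {
        Path target = workingDir.toPath().resolve(existing.getName());
        Files.copy(existing.toPath(), target, StandardCopyOption.REPLACE_EXISTING);
        return target.toFile();
    }

    public static void main(String[] args) throws IOException {
        // Temporary directories stand in for the source location and the DPU working directory.
        Path sourceDir = Files.createTempDirectory("source");
        Path workingDir = Files.createTempDirectory("working");
        Path source = sourceDir.resolve("report.csv");
        Files.write(source, "a,b\n1,2\n".getBytes(StandardCharsets.UTF_8));

        File copy = copyInto(workingDir.toFile(), source.toFile());
        System.out.println(copy.getAbsolutePath());
    }
}
```

In a DPU, the returned copy would then be registered with the output, for example with FilesHelper.addFile(output, copy, fileName).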
The advantage of the helpers is code that is cleaner and easier to use. Compare the code needed to add an existing file to the output Files data unit:
with helper: one line
without helper, using only API classes (Code 3): two lines
Further, when the helper is not used, you as a DPU developer must be aware of the virtualPath metadata and must know that the best practice is to set virtualPath = symbolicName.
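The relationship between the two approaches can be illustrated with a small model. The sketch below is not the FilesHelper implementation; InMemoryOutput is a hypothetical in-memory stand-in for an output Files data unit, and the two addFile methods only mirror the behaviour described above: one call registers the file and sets both symbolicName and virtualPath to the same value.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

class AddFileSketch {

    // Hypothetical in-memory stand-in for an output Files data unit.
    static class InMemoryOutput {
        final Map<String, String> fileUris = new LinkedHashMap<>();     // symbolicName -> file URI
        final Map<String, String> virtualPaths = new LinkedHashMap<>(); // symbolicName -> virtualPath

        void addExistingFile(String symbolicName, String existingFileURIString) {
            fileUris.put(symbolicName, existingFileURIString);
        }

        void setVirtualPath(String symbolicName, String virtualPath) {
            virtualPaths.put(symbolicName, virtualPath);
        }
    }

    // Models the helper: the two API-level steps of Code 3 behind a single call,
    // with virtualPath set equal to symbolicName.
    static void addFile(InMemoryOutput output, File file, String filename) {
        output.addExistingFile(filename, file.toURI().toString());
        output.setVirtualPath(filename, filename);
    }

    // Models the overload without a filename: the name is taken from the file itself.
    static void addFile(InMemoryOutput output, File file) {
        addFile(output, file, file.getName());
    }

    public static void main(String[] args) {
        InMemoryOutput output = new InMemoryOutput();
        addFile(output, new File("/tmp/report.csv"));
        System.out.println(output.virtualPaths);
    }
}
```

Keeping the virtualPath assignment inside one method means the caller cannot forget it, which is the main reason to prefer the helper over the raw API calls.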
Apart from the FilesHelper, there is also WritableSimpleFiles, which is not a data unit helper, but a DPU extension. Such an extension may be used to write files into the output data unit.
The advantage of such an extension is that the methods for creating new file entries and adding existing entries are a bit simpler, as they do not take the data unit as a parameter. The WritableSimpleFiles extension is bound to a certain data unit when it is initialized.
For details about the WritableSimpleFiles extension, please see Working With the Files Data Unit.