Please see below some data needs we are seeing in the current gateways. Some of these 
are already handled, but several require additional development, integration, and 
operational changes.

Specific use cases could/should be documented as well. This may not cover all 
unmet needs, and others are encouraged to add to this as we embark on providing 
first-class data management in Apache Airavata.

Airavata Data Requirements


A. Data Ingestion

Data for input can be of different types and hierarchies (see the sketch after this list):

  1. individual parameters,
  2. name lists in a simple/small file (KBs), typically instructions for the execution,
  3. data files which could be large (up to 20 GB),
  4. directories each containing multiple files (100s of GB).
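
As a rough illustration of how these levels might be represented uniformly, here is a 
minimal sketch in Python; the class and field names are hypothetical, not existing 
Airavata types.

  from dataclasses import dataclass, field
  from enum import Enum
  from pathlib import Path
  from typing import List

  class InputKind(Enum):
      PARAMETER = "parameter"    # 1. individual scalar/string value
      NAME_LIST = "name_list"    # 2. small instruction file (KBs)
      DATA_FILE = "data_file"    # 3. large data file (up to ~20 GB)
      DIRECTORY = "directory"    # 4. directory with many files (100s of GB)

  @dataclass
  class ExperimentInput:
      name: str
      kind: InputKind
      value: str = ""                                   # inline value for PARAMETER inputs
      paths: List[Path] = field(default_factory=list)   # files/directories for the other kinds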

These files can come in many forms: ASCII, binary, compressed (zip, tar), etc.

There may be data from databases that need to be extracted and presented, 
potentially for the user to choose or modify and use further as input to an 
experiment (e.g., Supercrtbl).

Data from a previous execution (result/restart data) may need to be used to 
restart an experiment, possibly along with modified inputs (this routinely happens in 
SEAGrid). In such cases a way to refer to the previous job/experiment and/or its 
data locality is needed.
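
One possible way to express such a reference to a previous job's data, as a minimal 
sketch (the names and layout are assumptions, not existing Airavata structures):

  from dataclasses import dataclass, field
  from pathlib import Path
  from typing import List

  @dataclass
  class RestartReference:
      previous_experiment_id: str            # experiment whose outputs seed the restart
      restart_files: List[str] = field(default_factory=list)  # checkpoint/result files to reuse
      reuse_in_place: bool = True            # data locality: reuse files already on the HPC host

  def resolve_restart_inputs(ref: RestartReference, storage_root: Path) -> List[Path]:
      """Locate the requested restart files under the previous experiment's output directory."""
      out_dir = storage_root / ref.previous_experiment_id / "outputs"
      return [out_dir / name for name in ref.restart_files if (out_dir / name).exists()]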

In the case of workflows with multiple tasks, the independent input data for 
different tasks in the workflow may have to be uploaded upfront and thus need to 
be labelled appropriately.

In the case of job arrays, data for each of the independent tasks may be 
presented in different hierarchies as folders or compressed sets.

There is a use case where an input segment/field may have multiple 
files/parameters (file arrays, parameter arrays) associated with it, while 
others may have different types (AMPGateway BSR3 application, stage 1).

Some data may be pre-staged on the remote HPC system (Future water) or brought 
from third party locations/services (Box, Data Repos, Instruments) and 
associated with the experiment.

The web, session and other timeouts need to be tuned to make sure all the 
needed data is transferred in usable condition.
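
If the portal front end is Django-based (as the Airavata Django portal is), part of this 
tuning lives in Django settings and part in the reverse proxy in front of it. A minimal 
sketch of the Django side; the values are illustrative assumptions to be tuned per gateway:

  # settings.py -- illustrative values only
  FILE_UPLOAD_MAX_MEMORY_SIZE = 50 * 1024 * 1024   # uploads above 50 MB spill to temp files
  DATA_UPLOAD_MAX_MEMORY_SIZE = 50 * 1024 * 1024   # cap for non-file request bodies
  SESSION_COOKIE_AGE = 8 * 60 * 60                 # keep sessions alive for long transfers (8 h)
  SESSION_SAVE_EVERY_REQUEST = True                # refresh session expiry on every request

The reverse proxy and WSGI server in front of the portal have their own body-size limits 
and read timeouts that need matching adjustments.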

B. Data Validation and Handling

There needs to be a way to validate that all required inputs are available 
before an experiment/workflow is scheduled. Files transferred should 
be checked for completeness by checksum or other validation. The data need to 
be uploaded and organized appropriately for execution on the remote host and 
even in intermediate staging areas. If data need to be staged from third 
party locations, or pre-staged data need to be used, a way to verify data 
accessibility needs to be provided. Restart data can be checked to confirm they 
contain the right data for a restart. The remote hosts may have quotas, and the 
validation should consider whether there is sufficient space to move the data 
before scheduling the experiment.
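
A minimal sketch of the kind of pre-scheduling checks meant here, using only Python 
standard library calls; the checksum map and the local disk_usage stand-in for a remote 
quota query are assumptions:

  import hashlib
  import shutil
  from pathlib import Path

  def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
      """Checksum a file in chunks so large inputs do not exhaust memory."""
      digest = hashlib.sha256()
      with path.open("rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              digest.update(chunk)
      return digest.hexdigest()

  def validate_inputs(inputs: dict, staging_dir: Path) -> list:
      """Return a list of problems; an empty list means the experiment can be scheduled.

      `inputs` maps input file paths to expected sha256 checksums
      (empty string when no reference checksum is available).
      """
      problems = []
      total_bytes = 0
      for path, expected in inputs.items():
          path = Path(path)
          if not path.exists():
              problems.append(f"missing input: {path}")
              continue
          total_bytes += path.stat().st_size
          if expected and sha256sum(path) != expected:
              problems.append(f"checksum mismatch: {path}")
      free = shutil.disk_usage(staging_dir).free   # stand-in for a remote quota/space query
      if total_bytes > free:
          problems.append(f"insufficient space: need {total_bytes} bytes, {free} free")
      return problems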

C. Data Processing
In some cases, data need to be processed before being used in an experiment. 
Uncompressing a zip/tar needs to be handled. In some cases, specific 
preprocessing routines may need to run for the data to be prepared. In other 
cases the data need to be organized for machine learning. A way to extend 
the extraction of critical attributes from the inputs, experiments and results 
for learning may be very useful.
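
A hedged sketch of the uncompress-and-prepare step, using only standard library calls; 
the hook for an application-specific preprocessing routine is an assumption:

  import shutil
  from pathlib import Path

  def prepare_input(archive: Path, work_dir: Path, preprocess=None) -> Path:
      """Unpack a zip/tar input into the experiment work directory, then optionally
      run an application-specific preprocessing callable on it."""
      work_dir.mkdir(parents=True, exist_ok=True)
      shutil.unpack_archive(str(archive), extract_dir=str(work_dir))  # handles zip, tar, tar.gz, ...
      if preprocess is not None:
          preprocess(work_dir)   # e.g. reorganize files or extract attributes for learning
      return work_dir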

D. Data Dissemination

Data need to be provided for users to monitor personally, or for automatic 
validators and parsers. Once the experiment completes, the parsers should be 
able to pick up the output and complete a post-processing step. Output data 
could be large (10s of GB), and a failsafe way to provide it will be needed. 
Data may need to be compressed and organized for the additional post-processing 
steps. Users need a way to extract (output) data from multiple experiments in 
bulk to process it through external programs and scripts. This requires a way 
to select a set of experiments and extract their logs/outputs, with sufficient 
warning regarding the size of the resulting download.
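
A minimal sketch of the bulk-extraction idea: estimate the total size of the selected 
experiments' outputs first, warn if it exceeds a threshold, then package them. The 
directory layout and the threshold are assumptions:

  import zipfile
  from pathlib import Path

  SIZE_WARNING_BYTES = 5 * 1024**3   # warn above ~5 GB (illustrative)

  def estimate_size(experiment_dirs: list) -> int:
      """Total size in bytes of all files under the selected experiment directories."""
      return sum(f.stat().st_size
                 for d in experiment_dirs
                 for f in Path(d).rglob("*") if f.is_file())

  def bundle_outputs(experiment_dirs: list, target: Path) -> Path:
      """Package the logs/outputs of several experiments into one downloadable zip."""
      total = estimate_size(experiment_dirs)
      if total > SIZE_WARNING_BYTES:
          print(f"warning: download will be about {total / 1024**3:.1f} GB")
      with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
          for d in experiment_dirs:
              d = Path(d)
              for f in d.rglob("*"):
                  if f.is_file():
                      zf.write(f, arcname=str(f.relative_to(d.parent)))  # keep experiment name in paths
      return target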

E. Data Storage

Data need to be stored for immediate consumption and potential reuse in the 
gateway and/or other systems.

F. Data Archival and Retrieval

Data need to be archived to a tertiary storage device so the primary storage 
service can be reused for newer data/experiments. But a way to retrieve the data 
from the archive when needed should be in place.
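
A minimal sketch of the archive/retrieve pair; the archive root is an assumed mount 
point for the tertiary store, not an existing Airavata configuration:

  import shutil
  from pathlib import Path

  ARCHIVE_ROOT = Path("/archive/airavata")   # assumed mount point for tertiary storage

  def archive_experiment(experiment_dir: Path) -> Path:
      """Pack an experiment directory into the archive and free the primary copy."""
      ARCHIVE_ROOT.mkdir(parents=True, exist_ok=True)
      archive = shutil.make_archive(str(ARCHIVE_ROOT / experiment_dir.name), "gztar",
                                    root_dir=str(experiment_dir))
      shutil.rmtree(experiment_dir)           # reclaim primary storage
      return Path(archive)

  def retrieve_experiment(archive_path: Path, restore_dir: Path) -> Path:
      """Bring an archived experiment back to primary storage when needed."""
      restore_dir.mkdir(parents=True, exist_ok=True)
      shutil.unpack_archive(str(archive_path), extract_dir=str(restore_dir))
      return restore_dir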

G. Data Deletion/Hiding
Some data (erroneous, unwanted) need to be deleted so they do not interfere 
with new experiments or processing. A way to hide/delete data based on user 
choice would be useful. Sometimes restart data get corrupted if a fixed 
checkpoint file is specified, and this needs to be deleted or replaced with the 
immediately previous good copy.
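
A hedged sketch of the hide/delete and checkpoint-rollback ideas; the hidden-file naming 
and the numbered-backup convention for checkpoints are assumptions:

  from pathlib import Path

  def hide(path: Path) -> Path:
      """Soft-delete: rename rather than remove, so the data no longer interferes
      but can still be recovered or purged later."""
      hidden = path.with_name("." + path.name + ".hidden")
      path.rename(hidden)
      return hidden

  def rollback_checkpoint(checkpoint: Path) -> bool:
      """Replace a corrupted fixed-name checkpoint with the most recent good copy,
      assuming good copies are kept as e.g. restart.chk.1, restart.chk.2, ..."""
      backups = sorted(checkpoint.parent.glob(checkpoint.name + ".*"),
                       key=lambda p: p.stat().st_mtime, reverse=True)
      if not backups:
          return False
      checkpoint.unlink(missing_ok=True)      # drop the corrupted copy
      backups[0].replace(checkpoint)          # promote the latest good backup
      return True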

Thanks for your attention.
Sudhakar.
