[Taverna-hackers] Suggested workflow data format: Data bundle

Stian Soiland-Reyes Fri, 15 Oct 2010 08:42:29 -0700

Hi!

I've written a suggestion for how we can make a new format to replace
the old Baclava data format from Taverna 1. This format is used for
storing a set of workflow inputs/outputs - and might include lists,
lists of lists, etc. for different port bindings.


The main ideas of the new data format are to:
  * Avoid embedding the data, encoded, in the XML
  * Be simpler to use
  * Be extensible
  * Give identifiers, and a place to hook in more provenance information

The data format should not only be used for the current "load/save
values" in the workbench, but should also allow third-party clients to
send and receive data for multiple ports through the Taverna Server.
(Currently they need to upload value by value.)

Longer term we also want to provide an 'archive' format which could
also include auxiliary information, such as which workflow produced
these data - run with which inputs, etc. - moving towards a 'research
object' you could share together with your paper about the results.


The data bundle structure is based on experiences with building the
SCUFL2 format. The file itself is a ZIP file with an internal folder
structure. There's META-INF/manifest.xml that lists files, their sizes
and mime types. This format, a kind of 'enhanced JAR file', is
inspired by the OpenDocument package format (used by OpenOffice), and
should also be compatible with Adobe UCF, used by OEPBC and Adobe Mars.

In addition to the manifest, there's typically a folder 'outputs',
whose contents map to individual workflow ports. So
'./outputs/fish/' is the output of the workflow output port 'fish'. If
a list is returned, then numbered files or folders within it
represent the list items, like ./outputs/fish/0.txt.
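
A minimal sketch of how a client might read this layout back into
nested lists. It assumes that a single (non-list) value is stored as a
file named after the port, and that list items are numbered from 0
with an arbitrary file extension. The function names are made up for
illustration and are not part of any Taverna API.

```python
import io
import zipfile


def _nest(entries):
    """Turn {(0,): a, (1,): b} or {(0, 0): a, ...} into nested lists."""
    if () in entries:
        # No list positions left: this is a single value
        return entries[()]
    groups = {}
    for (head, *rest), value in entries.items():
        groups.setdefault(head, {})[tuple(rest)] = value
    return [_nest(groups[i]) for i in sorted(groups)]


def read_port_values(bundle, folder="outputs"):
    """Map each port name to its value (bytes) or a nested list of values."""
    collected = {}
    with zipfile.ZipFile(bundle) as zf:
        for name in zf.namelist():
            parts = name.strip("/").split("/")
            # Skip directory entries and anything outside e.g. outputs/
            if name.endswith("/") or len(parts) < 2 or parts[0] != folder:
                continue
            if len(parts) == 2:
                # Assumed: single value stored as <folder>/<port>.<ext>
                port = parts[1].rsplit(".", 1)[0]
            else:
                port = parts[1]
            # Remaining path components are list positions: "0.txt" -> 0
            indexes = tuple(int(p.split(".", 1)[0]) for p in parts[2:])
            collected.setdefault(port, {})[indexes] = zf.read(name)
    return {port: _nest(entries) for port, entries in collected.items()}
```

Calling read_port_values("outputs.zip") on a bundle containing
outputs/fish/0.txt and outputs/fish/1.txt would give a dictionary with
a two-item list under the key "fish".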


Additional metadata can optionally be provided in outputs.rdf - in
RDF/XML format, but with a fairly simple XML schema so that you can
parse the file either as plain old XML, or using RDF tools. This
metadata file connects global identifiers with the data items, and can
also say which workflow run produced the value.

For instance:

    <list rdf:about="outputs/fish/">
        <rdf:type rdf:resource="http://ns.taverna.org.uk/2010/data/workflowOutput"/>
        <depth>1</depth>
        <hasListEntry rdf:parseType="Resource">
            <entry rdf:resource="outputs/fish/0.txt"/>
            <listPosition rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">0</listPosition>
        </hasListEntry>
        <hasListEntry rdf:parseType="Resource">
            <entry rdf:resource="outputs/fish/1.uri"/>
            <listPosition rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1</listPosition>
        </hasListEntry>
        <outputFrom rdf:resource="http://ns.taverna.org.uk/2010/workflowBundle/00626652-55ae-4a9e-80d4-c8e9ac84e2ca/workflow/HelloWorld/out/fish"/>
        <producedBy rdf:resource="http://ns.taverna.org.uk/2010/run/b9455363-5624-4744-901b-3d6c7ec273d7"/>
        <owl:sameAs rdf:resource="http://ns.taverna.org.uk/2010/data/list/45b29774-8927-4e9e-8961-6137cb95ef69"/>
    </list>
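
To illustrate the "plain old XML" route, here is a rough sketch that
pulls the ordered list entries out of such a file with Python's
standard ElementTree parser, without any RDF tooling. The snippet
above does not show its namespace declarations, so the sketch matches
elements by local name only; only the rdf: namespace URI is taken as
known. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def list_entries(rdf_xml):
    """Map each <list> rdf:about value to its entry paths, in list order."""
    root = ET.fromstring(rdf_xml)
    result = {}
    for elem in root.iter():
        if local(elem.tag) != "list":
            continue
        items = []
        for has_entry in elem:
            if local(has_entry.tag) != "hasListEntry":
                continue
            path = position = None
            for child in has_entry:
                if local(child.tag) == "entry":
                    path = child.get(f"{{{RDF}}}resource")
                elif local(child.tag) == "listPosition":
                    position = int(child.text)
            # Ignore incomplete entries rather than guessing a position
            if path is not None and position is not None:
                items.append((position, path))
        result[elem.get(f"{{{RDF}}}about")] = [p for _, p in sorted(items)]
    return result
```

Sorting on listPosition rather than document order means the result
stays correct even if a writer emits the entries out of sequence.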

For the Taverna Server you would also be able to browse this structure
remotely, so you could check out only the manifest and the metadata,
to get a view of the list structures and get an idea of file sizes and
mimetypes before deciding whether to fetch the data. (A web interface,
say, might want to create <img> tags for image/* data without
fetching the images itself.)


The manifests and metadata are all optional, so that for both parsing
and creating inputs, you can start with something as simple as:

inputs.zip/
     mimetype
     inputs/
     inputs/portA.txt
     inputs/portB/0.txt
     inputs/portB/1.txt

and add the other things as needed. The Taverna tools will try to fill
in as many blanks as possible, though. There's also talk about having
elements of your workflow automatically annotate data as they are
produced - these annotations would naturally also be added to this
data bundle.
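
As a rough sketch, a third-party client could produce such a minimal
inputs bundle with nothing more than Python's zipfile module. The
content of the 'mimetype' file is not specified in this mail, so the
value below is only a placeholder. Writing it first and uncompressed
follows the OpenDocument convention, which I'm assuming applies here
too. The function names are made up for illustration.

```python
import io
import zipfile

# Placeholder only: the real media type for data bundles is not decided here
PLACEHOLDER_MIMETYPE = "application/x-example-data-bundle"


def _write_value(zf, base, value):
    """Write a value as <base>.txt, or a list as numbered entries under <base>/."""
    if isinstance(value, list):
        for position, item in enumerate(value):
            _write_value(zf, f"{base}/{position}", item)
    else:
        zf.writestr(f"{base}.txt", value)


def write_inputs_bundle(target, ports, mimetype=PLACEHOLDER_MIMETYPE):
    """Write {port name: str or (nested) list of str} as a minimal bundle."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        # OpenDocument-style packages put 'mimetype' first, uncompressed
        zf.writestr("mimetype", mimetype, compress_type=zipfile.ZIP_STORED)
        for port, value in ports.items():
            _write_value(zf, f"inputs/{port}", value)
```

For example, write_inputs_bundle("inputs.zip", {"portA": "hello",
"portB": ["x", "y"]}) writes inputs/portA.txt, inputs/portB/0.txt and
inputs/portB/1.txt. Note that an empty list leaves no trace in this
sketch, since no explicit directory entries are written.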


See http://www.mygrid.org.uk/dev/wiki/display/developer/Data+bundle for details.

I'm hoping for some good feedback on this approach, before we start
implementing it. Feel free to discuss this in this thread.


The SCUFL2 format will follow a similar structure, but will be a
workflow bundle. The formats are mixable - so in theory you could have
a workflow bundle that is also a data bundle (say, example inputs) or
the opposite - but that's more exotic and not the main point now.


I'm also not quite sure which file extension to go for.
Especially for a data bundle it's interesting to keep the .zip
extension so people realize they can open it - while for the SCUFL2
workflow bundle this is not normally desired.

-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
