On Thu, Apr 30, 2009 at 15:42, Alan Williams <[email protected]> wrote:
>> And one more question: Will You change Input/Output data format during
>> development of T2?
> I do not know of any plans to do with data formats.  Perhaps other
> people have suggestions.

The Baclava format is perhaps not very scalable; for instance, values
are base64-encoded inside the XML. Until recently the string encoding
of these base64-encoded bytes was not enforced. I believe that from
1.7 it is always UTF-8, though.
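
To make the scalability point a bit more concrete, here is a rough
Python sketch (not Taverna code; the helper name base64_size is made up
for illustration). It shows that base64 output is about a third larger
than the raw bytes, and why the character encoding has to be agreed on
when the encoded bytes are really text.

```python
import base64


def base64_size(n_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of n_bytes bytes."""
    return 4 * ((n_bytes + 2) // 3)


text = "blåbær"
raw = text.encode("utf-8")          # the writer has to pick an encoding
encoded = base64.b64encode(raw)     # this is what would end up inside the XML

# Every 3 raw bytes become 4 base64 characters (plus padding).
assert len(encoded) == base64_size(len(raw))

# A reader that assumes the same encoding gets the original text back...
assert base64.b64decode(encoded).decode("utf-8") == text
# ...while one that guesses a different encoding silently gets other text.
assert base64.b64decode(encoded).decode("latin-1") != text
```

By the same arithmetic, a 2 GiB value needs roughly 2.67 GiB of
base64 text before any XML overhead is counted.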

The XML format itself might seem a bit cryptic, although it's not too
hard to parse and generate from other languages. See for example
these Python and Ruby implementations:

http://taverna.cvs.sourceforge.net/viewvc/taverna/taverna-service/taverna-rest-client-python/src/tavernaclient/baclava.py?revision=1.4&view=markup

http://taverna.cvs.sourceforge.net/viewvc/taverna/taverna-service/taverna-rest-client-ruby/lib/baclava/reader.rb?revision=1.2&view=markup

http://taverna.cvs.sourceforge.net/viewvc/taverna/taverna-service/taverna-rest-client-ruby/lib/baclava/writer.rb?revision=1.2&view=markup


It should be possible to use the parsers of Taverna 1 and then
register each item with the reference manager of T2.


A future serialized data format should:

  * Support references (file, http-URIs, gridFTP, etc)
  * Support error documents
  * Support multiple uses of the same value/reference in several
lists/list positions
  * Support bundling larger textual/binary data (at least 2 GB) outside
the actual structure
  * Be able to include a cached copy of reference values
    (So for instance if http://www.google.com/ is included as a
HTTP-reference, it should also be possible to include a cached copy of
what http://www.google.com/ was dereferenced as at the time)
  * Be Research Object (RO?) compatible


One way of doing this, inspired by early thought experiments done for
Research Objects (perhaps Jits could say something about this), is to
make the data format a structured ZIP file with a manifest. The
manifest is an RDF/XML file that describes the content of the research
object, as well as who/where it came from. The other files in the ZIP
file would be the values and the cached copies of dereferenced (i.e.
downloaded) references, possibly as binary data.

One could for example do something like:

mydata-2009-05-01.t2data   (zip-format)
    content.rdf        (the research object manifest)
    t2references.rdf   (list structures, error documents, references,
                        references to data/.*)
    data/              (data cache)
        4C2/           (prefix directories to avoid more than
                        16**3 = 4096 entries in data/)
            4C2D26E0-3580-4909-BA02-77FF24259DB3.bin   (binary data, unknown type)
        C56/
            C560F5FB-382E-4BFF-ABEC-C37ED4FB586B.jpg   (binary data, JPEG format)
        F7C/
            F7C336EA-D529-4C2C-BEA5-1D32E1BA1E7A.txt   (text, UTF-8)
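
As a rough illustration of the layout above, this Python sketch writes
such a bundle with the standard zipfile module. The function names
data_path and write_bundle are hypothetical, and the two RDF files are
only placeholders, since their vocabulary has not been designed.

```python
import io
import uuid
import zipfile


def data_path(entry_id: uuid.UUID, extension: str) -> str:
    """Path inside the bundle: data/<first 3 hex digits>/<UUID>.<extension>."""
    name = str(entry_id).upper()
    return f"data/{name[:3]}/{name}.{extension}"


def write_bundle(target, entries) -> None:
    """Write a .t2data-style zip; entries maps (UUID, extension) to bytes."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as bundle:
        # Placeholders only: the real manifest and reference files are undefined.
        bundle.writestr("content.rdf", "<!-- manifest placeholder -->")
        bundle.writestr("t2references.rdf", "<!-- references placeholder -->")
        for (entry_id, extension), payload in entries.items():
            bundle.writestr(data_path(entry_id, extension), payload)


buffer = io.BytesIO()
text_id = uuid.UUID("F7C336EA-D529-4C2C-BEA5-1D32E1BA1E7A")
write_bundle(buffer, {(text_id, "txt"): "hello".encode("utf-8")})

with zipfile.ZipFile(buffer) as bundle:
    names = bundle.namelist()
    stored = bundle.read(data_path(text_id, "txt"))
```

Three hex digits give at most 16**3 = 4096 prefix directories, which
is where the number in the sketch above comes from.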


I'm not sure what to do about larger data. Files larger than 2 GB
inherently don't work very well on some operating systems and/or file
systems, but you could easily make .t2data a directory structure
instead of a zip file (like an application bundle in Mac OS X). At
least this would avoid the 2 GB limit on the total size.

To support individual data entries larger than 2 GB you could split
them into parts such as
4C2D26E0-3580-4909-BA02-77FF24259DB3.part0000.bin, or make a
block-based structure where you have
4C2D26E0-3580-4909-BA02-77FF24259DB3.bin.blocks, whose content is just
a list of the blocks it is made of, for instance:

4C2D26E0-3580-4909-BA02-77FF24259DB3.bin.blocks:

DB84EBD2-34E5-4E5C-8CDE-D3087A591AAA.bin.block
A1C0F080-31D8-4503-8C42-9E1A760EC837.bin.block
E376DB43-20A6-46DB-9D5F-F01E25990049.bin.block
4393D17D-A477-4EF0-9296-A6DA7573FA3E.bin.block


DB84EBD2-34E5-4E5C-8CDE-D3087A591AAA.bin.block etc. are also just
stored in the data/ structure.
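
Reading such a value back could look roughly like this Python sketch.
It assumes one block name per line in the .blocks file, and that the
index and the blocks each live under the prefix directory of their own
name; the helper names shard, store and read_blocks are invented here.

```python
import tempfile
from pathlib import Path


def shard(name: str) -> str:
    """Relative path of a file under data/, using a 3-character prefix directory."""
    return f"{name[:3]}/{name}"


def store(data_dir: Path, name: str, payload: bytes) -> None:
    """Write payload to its prefix directory, creating the directory if needed."""
    path = data_dir / shard(name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)


def read_blocks(data_dir: Path, index_name: str):
    """Yield the content of each block listed in the .blocks index, in order."""
    index = data_dir / shard(index_name)
    for line in index.read_text(encoding="utf-8").splitlines():
        block_name = line.strip()
        if block_name:  # skip blank lines
            yield (data_dir / shard(block_name)).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    store(data_dir, "DB84EBD2-34E5-4E5C-8CDE-D3087A591AAA.bin.block", b"first ")
    store(data_dir, "A1C0F080-31D8-4503-8C42-9E1A760EC837.bin.block", b"second")
    index_text = (
        "DB84EBD2-34E5-4E5C-8CDE-D3087A591AAA.bin.block\n"
        "A1C0F080-31D8-4503-8C42-9E1A760EC837.bin.block\n"
    )
    store(data_dir, "4C2D26E0-3580-4909-BA02-77FF24259DB3.bin.blocks",
          index_text.encode("utf-8"))
    reassembled = b"".join(
        read_blocks(data_dir, "4C2D26E0-3580-4909-BA02-77FF24259DB3.bin.blocks"))
```

Because read_blocks is a generator, a consumer can stream a large
value block by block instead of holding all of it in memory.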


If you make the filenames SHA-1 checksums instead of UUIDs, it would
be easy to verify that the content is correct. In addition, storing
the same content twice would no longer be a problem. You should
probably drop the file extension then, though, and record the type as
part of the reference/link to the data instead.


         09a/
            09a9202377d81198d409391ca54376d9c3eaadf2
         569/
            569a8a9e78ea5c78dea5016fc0a2395cc8ab7038
         af7/
            af72e43fb1bf29c144b90283ea641e6591b14727
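
A small Python sketch of the checksum idea, with a plain dict standing
in for the data/ directory (the names content_path, store and verify
are made up for illustration):

```python
import hashlib


def content_path(payload: bytes) -> str:
    """Path under data/ derived from the SHA-1 checksum of the content."""
    digest = hashlib.sha1(payload).hexdigest()
    return f"data/{digest[:3]}/{digest}"


def store(bundle: dict, payload: bytes) -> str:
    """Store payload under its checksum path; identical content is kept once."""
    path = content_path(payload)
    bundle.setdefault(path, payload)
    return path


def verify(path: str, payload: bytes) -> bool:
    """True if the payload still matches the checksum encoded in its path."""
    return content_path(payload) == path


bundle = {}
first = store(bundle, b"same bytes")
second = store(bundle, b"same bytes")   # the same value used in a second list
assert first == second and len(bundle) == 1
assert verify(first, bundle[first])
assert not verify(first, b"corrupted bytes")
```

Two references to the same value then resolve to the same path, which
also covers the requirement of using one value in several list
positions.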


This is only theoretical, but something like it is key to being able
to build a scaled-up version of the Execute Workflow Remotely service,
for instance. It should be easy to build a closely matching REST
interface on top of such a structure.


-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester

_______________________________________________
taverna-hackers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/taverna-hackers
Developers Guide: http://www.mygrid.org.uk/usermanual1.7/dev_guide.html
FAQ: http://www.mygrid.org.uk/wiki/Mygrid/TavernaFaq
