On Thu, Apr 30, 2009 at 15:42, Alan Williams <[email protected]> wrote:

>> And one more question: Will You change Input/Output data format during
>> development of T2?
> I do not know of any plans to do with data formats. Perhaps other
> people have suggestions.

The Baclava format is perhaps not very scalable; for instance, values are
base64-encoded inside the XML. Until recently the string encoding of these
base64-encoded bytes was not enforced; I believe that from 1.7 it is always
UTF-8, though.

The XML format itself might seem a bit cryptic, although it's not too hard
to parse and generate from other languages. See for example, for Ruby and
Python:

http://taverna.cvs.sourceforge.net/viewvc/taverna/taverna-service/taverna-rest-client-python/src/tavernaclient/baclava.py?revision=1.4&view=markup
http://taverna.cvs.sourceforge.net/viewvc/taverna/taverna-service/taverna-rest-client-ruby/lib/baclava/reader.rb?revision=1.2&view=markup
http://taverna.cvs.sourceforge.net/viewvc/taverna/taverna-service/taverna-rest-client-ruby/lib/baclava/writer.rb?revision=1.2&view=markup

It should be possible to use the parsers of Taverna 1 and then register
each item with the reference manager of T2.

A future serialized data format should:

* Support references (file, http-URIs, gridFTP, etc.)
* Support error documents
* Support multiple uses of the same value/reference in several lists/list
  positions
* Support bundling larger textual/binary data (at least 2 GB) outside the
  actual structure
* Be able to include a cached copy of reference values (so for instance if
  http://www.google.com/ is included as an HTTP reference, it should also be
  possible to include a cached copy of what http://www.google.com/ was
  dereferenced as at the time)
* Be Research Object (RO?) compatible

One way of doing this, inspired by early thought experiments done for
Research Objects (perhaps Jits could say something about this), is to make
the data format a structured ZIP file with a manifest. The manifest is an
RDF/XML file that describes the content of the research object, as well as
who/where it came from. The other files in the ZIP file would be the values
and the cached dereferenced (i.e. downloaded), possibly binary, data.

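As a rough illustration of the ZIP-with-manifest idea, here is a minimal
Python sketch. Everything in it is an assumption made for the example and
not a proposed format: the entry name `manifest.rdf`, the helper names
`manifest_rdf` and `write_bundle`, and the choice of Dublin Core `dc:format`
to record each value's media type.

```python
import io
import zipfile
from xml.sax.saxutils import escape, quoteattr

# Hypothetical manifest entry name, chosen only for this sketch.
MANIFEST_NAME = "manifest.rdf"


def manifest_rdf(entries):
    """Build a tiny RDF/XML manifest naming each entry and its media type."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"',
        '         xmlns:dc="http://purl.org/dc/elements/1.1/">',
    ]
    for name, (media_type, _payload) in sorted(entries.items()):
        lines.append(f"  <rdf:Description rdf:about={quoteattr(name)}>")
        lines.append(f"    <dc:format>{escape(media_type)}</dc:format>")
        lines.append("  </rdf:Description>")
    lines.append("</rdf:RDF>")
    return "\n".join(lines) + "\n"


def write_bundle(target, entries):
    """Write the manifest first, then every value, into one ZIP archive.

    `entries` maps an archive path to a (media_type, bytes) pair.
    allowZip64 lets the archive and its members grow past the classic
    ZIP size limits.
    """
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED,
                         allowZip64=True) as bundle:
        bundle.writestr(MANIFEST_NAME, manifest_rdf(entries))
        for name, (_media_type, payload) in entries.items():
            bundle.writestr(name, payload)


if __name__ == "__main__":
    buffer = io.BytesIO()
    write_bundle(buffer, {"data/example.txt": ("text/plain", b"hello")})
    with zipfile.ZipFile(buffer) as bundle:
        print(bundle.namelist())
```

Writing the manifest as the first entry means a reader can find it without
scanning the whole archive. A real manifest would also need to describe
provenance, list structures and error documents, which this sketch leaves
out.
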
One could for example do something like:

    mydata-2009-05-01.t2data    (zip-format)
        content.rdf             (the research object manifest)
        t2references.rdf        (list structures, error documents,
                                 references, references to data/.*)
        data/                   (data cache)
            4C2/                (prefix directories to avoid more than
                                 16**3 = 4096 entries in data/)
                4C2D26E0-3580-4909-BA02-77FF24259DB3.bin  (binary data, unknown type)
            C56/
                C560F5FB-382E-4BFF-ABEC-C37ED4FB586B.jpg  (binary data, JPEG format)
            F7C/
                F7C336EA-D529-4C2C-BEA5-1D32E1BA1E7A.txt  (text, UTF-8)

I'm not sure about larger data. Files larger than 2 GB inherently don't
work very well on some operating systems and/or file systems, but you could
easily make .t2data a directory structure instead of a ZIP file (like an
application bundle in Mac OS X). At least this will avoid the total 2 GB
problem.

To support individual data entries larger than 2 GB you could have
4C2D26E0-3580-4909-BA02-77FF24259DB3.part0000.bin or something, or make a
block-based structure where you have
4C2D26E0-3580-4909-BA02-77FF24259DB3.bin.blocks, and the content of that
file is just a list of the blocks it contains, for instance:

    4C2D26E0-3580-4909-BA02-77FF24259DB3.bin.blocks:
        DB84EBD2-34E5-4E5C-8CDE-D3087A591AAA.bin.block
        A1C0F080-31D8-4503-8C42-9E1A760EC837.bin.block
        E376DB43-20A6-46DB-9D5F-F01E25990049.bin.block
        4393D17D-A477-4EF0-9296-A6DA7573FA3E.bin.block

DB84EBD2-34E5-4E5C-8CDE-D3087A591AAA.bin.block etc. are also just stored in
the data/ structure.

If you make the filenames SHA-1 checksums instead of UUIDs, then it would
be easy to verify that the content is correct, and storing the same content
twice would no longer be a problem. You probably should not keep the
extension then, though, but put it in the reference/link to the data
instead.

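A minimal Python sketch of the checksum-named store, assuming a plain
directory on disk and three-character prefix directories. The helper names
`path_for`, `store` and `verify` are made up for this example; nothing like
them exists in Taverna.

```python
import hashlib
from pathlib import Path

# Three hex characters give at most 16**3 = 4096 prefix directories.
PREFIX_LEN = 3


def path_for(data_dir, digest):
    """Return the path data/<first 3 hex chars>/<full digest>."""
    return Path(data_dir) / digest[:PREFIX_LEN] / digest


def store(data_dir, payload):
    """Store bytes under their SHA-1 name and return the hex digest.

    Identical content maps to the same path, so it is written only once.
    """
    digest = hashlib.sha1(payload).hexdigest()
    target = path_for(data_dir, digest)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)
    return digest


def verify(data_dir, digest, chunk_size=1 << 20):
    """Re-hash the stored file in chunks and compare with its name."""
    target = path_for(data_dir, digest)
    if not target.is_file():
        return False
    hasher = hashlib.sha1()
    with target.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == digest
```

`store` takes the whole value in memory, which is only reasonable for small
entries; a block-based variant would call it once per block and keep the
list of returned digests as the block index.
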
With checksum names, the data/ directory would look like:

    09a/
        09a9202377d81198d409391ca54376d9c3eaadf2
    569/
        569a8a9e78ea5c78dea5016fc0a2395cc8ab7038
    af7/
        af72e43fb1bf29c144b90283ea641e6591b14727

This is only theoretical, but it is key to being able to build a scaled-up
version of the Execute Workflow Remotely service, for instance. It should
be easy to build a closely matching REST interface on top of such a
structure.

-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester

_______________________________________________
taverna-hackers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/taverna-hackers
Developers Guide: http://www.mygrid.org.uk/usermanual1.7/dev_guide.html
FAQ: http://www.mygrid.org.uk/wiki/Mygrid/TavernaFaq
