Hi Stian!! Thank you very much for the detailed explanation! I'm very interested in the results of the provenance queries and the OPM graphs. Are those results in RDF? You sent me a reference to a sample query, but are there any docs I could read about it? How does Taverna support OPM?
Thanks for everything!

Regards,
Guzmán

----- Original Message -----
From: "Stian Soiland-Reyes" <[email protected]>
To: "List for general discussion and hacking of the Taverna project" <[email protected]>; "Paolo Missier" <[email protected]>
Sent: Wednesday, May 19, 2010 4:27 AM
Subject: Re: [Taverna-hackers] Provenance model docs

> On Wed, May 19, 2010 at 03:35, Guzman Llambias - INCO
> <[email protected]> wrote:
>
>> I've been looking for some provenance model docs for T2, without
>> luck. Could you please guide me a bit in order to find some?
>
> Hi!
>
> I'll try to write this up as a wiki page.. but here goes a quick draft:
>
> The provenance model for Taverna 2 is quite different from the model
> of Taverna 1, as we now focus on the lineage/origin of data. So we
> want to easily check which inputs caused a given output, which
> upstream outputs gave those inputs, and so on.
>
> It might be easier to understand what we're capturing by looking at
> how we store the provenance. This is done internally in a Derby
> database, but it can also be configured to store in a MySQL database.
>
> Here's the current database schema for provenance as of Taverna 2.1:
>
> http://www.mygrid.org.uk/dev/wiki/display/developer/Provenance+schema+in+2.1.2
>
> However, this schema does not capture all aspects of workflow
> executions, so I'm in the process of refactoring the database schema
> to:
>
> http://www.mygrid.org.uk/dev/wiki/display/developer/Provenance+schema+in+2.2.0
>
> .. I'll update this page to reflect reality once that's done.
>
> Note that this database is not meant to be exposed directly, but it's
> possible to query the database using a 'lineage query', and export the
> provenance as an OPM (Open Provenance Model) graph.
>
> See
> http://code.google.com/p/mygrid-labs/source/browse/provenance-client/trunk/src/main/resources/testQuery1.xml
> for an example of a query. This will select runs of workflow
> ac41d494-f77c-4dd5-919c-47272aa6a848 (the dataflow identifier found
> inside the .t2flow file), and in particular it will select the run
> identified as ae1e2b6b-3bc5-4c93-a250-c4dd0210c3b3, in addition to any
> runs since 2009-10-08.
>
> The result graph will contain the details of the origin of whatever
> the <select> element selects; in this case that is the output port
> "value" on the processor "String_constant" inside the nested workflow
> "Nested_workflow", the workflow output port "out", and all output
> ports of the service "Beanshell".
>
> The <focus> element, specified in a similar fashion, selects which of
> the details leading to the <select> outputs you want to look up.
>
> Paolo Missier (copied) should be able to fill in with details on how
> to run such queries.
>
> The rough way provenance works internally in Taverna is this:
>
> When running a workflow, the WorkflowInstanceFacade will trundle
> through the workflow's processors, and insert a new dispatch stack
> layer, IntermediateProvenance. This is placed almost at the top, below
> Parallelize, but above ErrorBounce, meaning that it should see the
> actual inputs from the processor input ports, and the actual outputs
> delivered to the processor output ports, at the time the execution is
> finished.
>
> When a job is received (i.e. all data is available on the processor
> input ports and an available thread has been identified by
> Parallelize), IntermediateProvenance records the input data and (soon)
> the execution start time. Similarly on the way up, it will record the
> output data (which might be the error document registered or bounced
> by the ErrorBounce layer), all stored in a hashmap per iteration.
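
The layer ordering described above can be illustrated with a small toy model. This is only a sketch in Python, not Taverna's actual Java code: the layer names Parallelize, IntermediateProvenance and ErrorBounce come from the message, while Invoke, ProvenanceRecord, handle() and run() are hypothetical names made up for illustration, and jobs are run one after another instead of on threads.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """Hypothetical per-iteration record; not Taverna's real schema."""
    processor: str
    iteration: tuple
    inputs: dict
    outputs: dict = field(default_factory=dict)


class Invoke:
    """Stand-in for the bottom of the stack: calls the service function."""
    def __init__(self, service):
        self.service = service

    def handle(self, processor, iteration, inputs):
        return self.service(inputs)


class ErrorBounce:
    """Turns a failure below it into an error document instead of raising."""
    def __init__(self, below):
        self.below = below

    def handle(self, processor, iteration, inputs):
        try:
            return self.below.handle(processor, iteration, inputs)
        except Exception as exc:
            return {"error": str(exc)}


class IntermediateProvenance:
    """Records inputs on the way down and outputs on the way up."""
    def __init__(self, below):
        self.below = below
        self.records = {}  # one record per (processor, iteration)

    def handle(self, processor, iteration, inputs):
        record = ProvenanceRecord(processor, iteration, dict(inputs))
        self.records[(processor, iteration)] = record
        outputs = self.below.handle(processor, iteration, inputs)
        record.outputs = dict(outputs)  # may be an error document
        return outputs


class Parallelize:
    """Top layer: hands each job to the layer below (sequentially here)."""
    def __init__(self, below):
        self.below = below

    def run(self, processor, jobs):
        return [self.below.handle(processor, (i,), inputs)
                for i, inputs in enumerate(jobs)]


def concat(inputs):
    # Toy service: fails with KeyError if input "b" is missing.
    return {"value": inputs["a"] + inputs["b"]}


provenance = IntermediateProvenance(ErrorBounce(Invoke(concat)))
stack = Parallelize(provenance)
results = stack.run("Beanshell", [{"a": "x", "b": "y"}, {"a": "x"}])
```

The second job is missing an input on purpose: because the provenance layer sits above ErrorBounce in this sketch, it records an error document for that iteration and never sees the exception itself.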
>
> On the way up, this bean of provenance information is sent to the
> provenance database, where it is stored in the tables as explained in
> the schema above.
>
> There is then a ProvenanceAccess layer, where one might query
> different aspects of the stored provenance, like which input and
> output values an intermediate processor dealt with. You can also ask
> for the 'lineage' of a data value, which should give you a trace of
> which input values it depends on throughout the workflow, or export
> the whole thing (or the selection made by such a query) as an OPM
> graph.
>
> In order to populate the new table ServiceInvocation there will be a
> new, lightweight provenance layer inserted between each of the deeper
> dispatch layers; it will then be able to record individual retries,
> failovers and looping.
>
> --
> Stian Soiland-Reyes, myGrid team
> School of Computer Science
> The University of Manchester
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> taverna-hackers mailing list
> [email protected]
> Web site: http://www.taverna.org.uk
> Mailing lists: http://www.taverna.org.uk/about/contact-us/
> Developers Guide: http://www.taverna.org.uk/developers/
