On Wed, May 19, 2010 at 03:35, Guzman Llambias - INCO
<[email protected]> wrote:

> I've been looking for some provenance model docs for T2, without
> luck. Could you please guide me a bit in order to find some?

Hi!

I'll try to write this up as a wiki page, but here is a quick draft:


The provenance model for Taverna 2 is quite different from the model
of Taverna 1, as we now focus on the lineage/origin of data. So we
want to easily check which inputs caused a given output, which
upstream outputs gave those inputs, and so on.
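
As a rough illustration of what "lineage" means here (a toy sketch,
not Taverna code): the port names are borrowed from the query example
further down, but the wiring between them and the derived_from table
are made up for this example.

```python
from collections import deque

# Hypothetical wiring: each port maps to the ports its value was derived from.
derived_from = {
    "workflow:out": ["Beanshell:out"],
    "Beanshell:out": ["Beanshell:in1", "Beanshell:in2"],
    "Beanshell:in1": ["String_constant:value"],
    "Beanshell:in2": ["workflow:in"],
}


def lineage(port):
    """Return every upstream port that `port` depends on, nearest first."""
    seen, order = set(), []
    queue = deque(derived_from.get(port, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        # Keep walking upstream from the port we just visited.
        queue.extend(derived_from.get(current, []))
    return order


print(lineage("workflow:out"))
```

Asking for the lineage of the workflow output walks back through the
Beanshell ports to the constant and the workflow input.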


It might be easiest to understand what we're capturing by looking at
how we store the provenance. This is stored internally in a Derby
database, but Taverna can also be configured to use a MySQL database.


Here's the current database schema for provenance as of Taverna 2.1:

 http://www.mygrid.org.uk/dev/wiki/display/developer/Provenance+schema+in+2.1.2


However, this schema does not capture all aspects of workflow
executions, so I'm in the process of refactoring the database schema
to:

 http://www.mygrid.org.uk/dev/wiki/display/developer/Provenance+schema+in+2.2.0

.. I'll update this page to reflect reality once that's done.


Note that this database is not meant to be exposed directly, but it's
possible to query the database using a 'lineage query', and export the
provenance as an OPM (Open Provenance Model) graph.


See
http://code.google.com/p/mygrid-labs/source/browse/provenance-client/trunk/src/main/resources/testQuery1.xml
for an example of a query. It selects runs of workflow
ac41d494-f77c-4dd5-919c-47272aa6a848 (the dataflow identifier found
inside the .t2flow file): specifically the run identified as
ae1e2b6b-3bc5-4c93-a250-c4dd0210c3b3, plus any runs since 2009-10-08.
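
The run selection described above can be sketched roughly like this.
This is only an illustration of the logic, not the query engine: the
Run class, select_runs, the extra run ids and all start dates are
invented, and I'm assuming the "since" date is inclusive.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Run:
    run_id: str
    workflow_id: str
    started: date


def select_runs(runs, workflow_id, run_ids=(), since=None):
    """Keep runs of one workflow that are named explicitly or recent enough."""
    selected = []
    for run in runs:
        if run.workflow_id != workflow_id:
            continue
        named = run.run_id in run_ids
        recent = since is not None and run.started >= since  # assumed inclusive
        if named or recent:
            selected.append(run)
    return selected


WORKFLOW = "ac41d494-f77c-4dd5-919c-47272aa6a848"
NAMED_RUN = "ae1e2b6b-3bc5-4c93-a250-c4dd0210c3b3"

# Made-up runs; only the two identifiers above come from the example query.
runs = [
    Run(NAMED_RUN, WORKFLOW, date(2009, 9, 1)),       # old, but named explicitly
    Run("run-b", WORKFLOW, date(2009, 10, 20)),       # after the cut-off date
    Run("run-c", WORKFLOW, date(2009, 9, 15)),        # neither named nor recent
    Run("run-d", "other-workflow", date(2009, 11, 1)),  # wrong workflow
]

picked = select_runs(runs, WORKFLOW, run_ids={NAMED_RUN}, since=date(2009, 10, 8))
print([run.run_id for run in picked])
```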

The result graph will contain the origin of each port listed in the
<select> element. In this case that is the output port "value" on the
processor "String_constant" inside the nested workflow
"Nested_workflow", the workflow output port "out", and all output
ports of the service "Beanshell".

The <focus> element, specified in a similar fashion, picks which of
the upstream steps leading to the <select> outputs you want details
for.

Paolo Missier (copied) should be able to fill in with details on how
to run such queries.



The rough way Provenance works internally in Taverna is this:

When running a workflow, the WorkflowInstanceFacade will trundle
through the workflow's processors, and insert a new Dispatch stack
layer, IntermediateProvenance, into each. This is placed near the top
of the stack, below Parallelize but above ErrorBounce, meaning that it
should see the actual inputs from the processor input ports, and the
actual output delivered to the processor output ports, at the time the
execution is finished.

When a job is received (i.e. all data is available on the
processor input ports and an available thread has been identified by
Parallelize), IntermediateProvenance records the input data and
(soon) execution start time. Similarly on the way up, it will record
the output data (which might be the error document registered or
bounced by the ErrorBounce layer), all stored in a hashmap per
iteration.

On the way up, this bean of provenance information is sent to the
provenance database, where it is stored in the tables as explained in
the schema above.
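
To make the down/up flow a bit more concrete, here is a minimal sketch
of the idea. It is not the Taverna API: the layer names are reused only
for orientation, Parallelize is left out, ErrorBounce is reduced to
"turn an exception into an error document", and a plain list stands in
for the provenance database.

```python
class Layer:
    """A dispatch layer passes jobs down the stack and results back up."""

    def __init__(self, below=None):
        self.below = below

    def receive_job(self, inputs):
        return self.below.receive_job(inputs)


class Invoke(Layer):
    """Bottom layer: calls the actual service (here a plain function)."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def receive_job(self, inputs):
        return self.service(inputs)


class ErrorBounce(Layer):
    """Simplified: turns a failure below into an error document."""

    def receive_job(self, inputs):
        try:
            return self.below.receive_job(inputs)
        except Exception as exc:
            return {"error": str(exc)}


class IntermediateProvenance(Layer):
    """Records inputs on the way down and outputs on the way up."""

    def __init__(self, below, database):
        super().__init__(below)
        self.database = database

    def receive_job(self, inputs):
        record = {"inputs": dict(inputs)}         # on the way down
        outputs = self.below.receive_job(inputs)
        record["outputs"] = dict(outputs)         # on the way up
        self.database.append(record)              # stand-in for the database
        return outputs


database = []
service = lambda inputs: {"out": inputs["in"].upper()}
stack = IntermediateProvenance(ErrorBounce(Invoke(service)), database)

stack.receive_job({"in": "hello"})  # normal invocation
stack.receive_job({})               # missing input, becomes an error document
print(database)
```

Because the recording layer sits above ErrorBounce, the second record
holds the error document rather than an exception.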

There is then a ProvenanceAccess layer, where one might query
different aspects of the stored provenance, like which input and
output values an intermediate processor dealt with. You can also ask
for the 'lineage' of a data value, which should give you a trace of
which input values it depends on throughout the workflow, or export
the whole thing (or the selection from such a query) as an OPM graph.


In order to populate the new table ServiceInvocation there will be a
new, lightweight provenance layer that will be inserted between each
of the deeper dispatch layers. It will then be able to record
individual retries, failovers and looping.
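
Since that layer doesn't exist yet, the following is only a guess at
the shape of the idea: a thin recorder placed underneath a retry layer
sees every individual attempt, not just the final outcome. All names
here (AttemptRecorder, Retry, make_flaky_service) are hypothetical.

```python
class AttemptRecorder:
    """Hypothetical lightweight layer: logs every call passing through it."""

    def __init__(self, below, log):
        self.below = below
        self.log = log

    def __call__(self, inputs):
        try:
            outputs = self.below(inputs)
        except Exception as exc:
            self.log.append({"inputs": inputs, "failed": str(exc)})
            raise  # let the layer above decide whether to retry
        self.log.append({"inputs": inputs, "outputs": outputs})
        return outputs


class Retry:
    """Calls the layer below until it succeeds or attempts run out."""

    def __init__(self, below, max_attempts):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self.below = below
        self.max_attempts = max_attempts

    def __call__(self, inputs):
        last_error = None
        for _ in range(self.max_attempts):
            try:
                return self.below(inputs)
            except Exception as exc:
                last_error = exc
        raise last_error


def make_flaky_service(failures):
    """Build a service that fails `failures` times before succeeding."""
    remaining = [failures]

    def service(inputs):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("service unavailable")
        return {"out": inputs["in"] * 2}

    return service


log = []
stack = Retry(AttemptRecorder(make_flaky_service(2), log), max_attempts=3)
result = stack({"in": 21})
print(result, len(log))
```

A recorder placed above Retry would only have seen one successful
invocation; placed below it, the log holds two failures and a success.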



-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester

------------------------------------------------------------------------------

_______________________________________________
taverna-hackers mailing list
[email protected]
Web site: http://www.taverna.org.uk
Mailing lists: http://www.taverna.org.uk/about/contact-us/
Developers Guide: http://www.taverna.org.uk/developers/
