I think you have described almost every bioinformatics genomics workflow.

You start with a small selection, search and expand out to lots of
candidates, then filter and narrow down the search space before
finding out more about those prioritised/matched items.

The workflow designer must understand where the large data sizes and
long processing times are before they can determine the order of many
of these operations, so the first workflows are probably very
inefficient compared to the later ones, once it becomes clearer where
one needs to filter and which services can be deferred ('fill in
details' services which don't contribute to filtering).
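
As a rough illustration of why the order matters, here is a minimal
Python sketch that estimates the total time of a chain of filtering
steps and picks the cheapest order by brute force. The step names,
per-item costs and pass rates are made-up numbers, not measurements
from any real service, and the model assumes the steps filter
independently of each other.

```python
from itertools import permutations

# Hypothetical steps: name -> (seconds per item, fraction of items that pass).
STEPS = {
    "cheap_lookup": (0.1, 0.50),
    "score_cutoff": (0.5, 0.20),
    "slow_search": (2.0, 0.10),
}

def expected_cost(order, n_items, steps=STEPS):
    """Estimated total time when each step only sees the survivors so far."""
    total = 0.0
    remaining = float(n_items)
    for name in order:
        cost, pass_rate = steps[name]
        total += remaining * cost   # every surviving item is processed once
        remaining *= pass_rate      # only items that pass move on
    return total

def best_order(n_items, steps=STEPS):
    """Try every ordering and return the one with the lowest estimate."""
    return min(permutations(steps),
               key=lambda order: expected_cost(order, n_items, steps))

if __name__ == "__main__":
    for order in permutations(STEPS):
        print(order, round(expected_cost(order, 1000), 1))
    print("cheapest:", best_order(1000))
```

Brute force is only practical for a handful of steps, but it is enough
to see how much the estimate changes between orderings once rough
costs and pass rates are known.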

However, we've often seen cases where involving a computer scientist
in reviewing the workflow can provide further optimisation. For
instance, in one case we realised that a service unofficially
supported searching for multiple identifiers at once, given as a
comma-separated list. That meant we could move from 40,000 individual
service calls to 1000 grouped calls (the service still fell over if
you gave it too many in that list!).
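
The grouping itself is simple; a minimal Python sketch, assuming the
identifiers are plain strings and the service accepts them joined by
commas, could look like this. The batch size of 40 is just what evenly
sized batches would give for 40,000 identifiers in 1000 calls, and the
identifier names are invented for the example.

```python
def chunked(ids, size):
    """Yield consecutive slices of ids with at most `size` entries each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def grouped_queries(ids, size=40):
    """Build one comma-separated query string per batch of identifiers."""
    return [",".join(batch) for batch in chunked(ids, size)]

if __name__ == "__main__":
    # Hypothetical identifiers standing in for the real search terms.
    ids = [f"ID{i}" for i in range(40000)]
    queries = grouped_queries(ids, size=40)
    print(len(queries), "calls instead of", len(ids))
    print(queries[0][:60], "...")
```

Keeping the batch size as a parameter makes it easy to lower it again
if the service starts failing on long lists.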


On Thu, Jul 7, 2011 at 13:01, Efthymia Tsamoura <[email protected]> wrote:
> Hello
> I am a PhD student and am currently working on workflow
> optimization problems in distributed environments. I would like to
> ask whether there exist any cases where changing the order of task
> invocation in a scientific workflow changes its performance
> without, however, affecting the produced results. In the
> following, I present a small use case of the problem I am interested in:
>
> Suppose that a company wants to obtain a list of email addresses of
> potential customers selecting only those who have a good payment
> history for at least one card and a credit rating above some
> threshold. The company has the right to use the following web services
>
> WS1 : SSN id (ssn, threshold) -> credit rating (cr)
> WS2 : SSN id (ssn) -> credit card numbers (ccn)
> WS3 : card number (ccn, good) -> good history (gph)
> WS4 : SSN id (ssn) -> email addresses (ea)
>
> The input data containing customer identifiers (ssn) and other
> relevant information is stored in a local data resource. Two possible
> web service linear workflows that can be formed to process the input
> data using the above services are C1 = WS2,WS3,WS1,WS4 and C2 =
> WS1,WS2,WS3,WS4. In the first workflow, the customers having a
> good payment history are selected first (WS2,WS3), and then the
> remaining customers whose credit rating is below the threshold are
> filtered out (through WS1). The C2 workflow performs the same tasks in
> the reverse order. The above linear workflows may have different
> performance; if WS3 filters out more data than WS1, then it will be
> more beneficial to invoke WS3 before WS1 in order for the subsequent
> web services in the workflow to process less data.
>
> It would be very useful to know if there exist similar scientific
> workflow examples (where users have many options for ordering the
> workflow tasks but cannot decide which task ordering to use, while the
> workflow performance depends on the workflow task invocation order)
> and if you are interested in using optimizers for such types of
> workflows.
>
> I am asking because I have recently developed an optimization
> algorithm for this problem and I would like to test its performance in
> a real-world workflow management system with real-world workflows.
>
> P.S.: references to publications or any other information dealing with
> scientific workflows of the above rationale will be extremely useful.
>
> Thank you very much for your time
>
>
>
>
> ------------------------------------------------------------------------------
> All of the data generated in your IT infrastructure is seriously valuable.
> Why? It contains a definitive record of application performance, security
> threats, fraudulent activity, and more. Splunk takes this data and makes
> sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-d2d-c2
> _______________________________________________
> taverna-users mailing list
> [email protected]
> [email protected]
> Web site: http://www.taverna.org.uk
> Mailing lists: http://www.taverna.org.uk/about/contact-us/
>



-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
