Re: [Taverna-hackers] Handling Documents

Stian Soiland-Reyes Wed, 08 Jul 2009 08:03:40 -0700

We have not yet exposed pipelining to the interface used by the
Beanshell scripts.


It is possible to do what you want by implementing your own subclass
of Activity - you might want to look at the BiomartActivity which does
this kind of pipelining.

Basically you are able to return several times through the callback
object in the Activity - you would return with indexes, and in the end
return the full list.

From an Activity you will also be able to interface with the reference
manager, so that you can register the data values and get a reference
back - these are the ones returned and collected in the full list -
and they should have a smaller memory footprint.

Such an activity would have a granular depth that is lower (say 0)
than the actual output depth (1) - so the end result has depth
1, but the activity outputs the items one by one at depth 0.


I tried making a workflow which implemented its own java.util.List
subclass and returned a fancy Iterator (which returned new values with
a 10% chance of reaching end of list), but as the beanshell script
still has granular output depth 1 no pipelining would occur in the
workflow before the iterator was finished.
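
For illustration only, a list of that kind could look roughly like the
sketch below. This is a hypothetical reconstruction, not the code from
the experiment: the class name LazyRandomList, the seed parameter and the
"itemN" values are all made up here. It assumes a single iterator per
list instance.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Random;

/** Hypothetical list whose iterator makes up values on demand and stops at random. */
class LazyRandomList extends AbstractList<String> {
    private final Random random;
    private final List<String> produced = new ArrayList<String>();

    LazyRandomList(long seed) {
        this.random = new Random(seed);
    }

    @Override
    public String get(int index) {
        // Only values that the iterator has already produced are available.
        return produced.get(index);
    }

    @Override
    public int size() {
        // Grows while the list is being iterated.
        return produced.size();
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int nextIndex = 0;
            private boolean decided = false;
            private boolean finished = false;

            @Override
            public boolean hasNext() {
                if (!decided) {
                    // 10% chance that the list ends at this position.
                    finished = random.nextInt(10) == 0;
                    decided = true;
                }
                return !finished;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                decided = false;
                String value = "item" + nextIndex++;
                produced.add(value);
                return value;
            }
        };
    }
}
```

Nothing downstream benefits from the laziness here: with granular
output depth 1 the whole list is consumed before anything is passed on.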

see 
http://taverna.googlecode.com/svn/taverna/engine/net.sf.taverna.t2.activities/tags/activities-0.8/biomart-activity/src/main/java/net/sf/taverna/t2/activities/biomart/BiomartActivity.java
for an activity that currently does this (because it is working with an
HTTP-based protocol where database rows are sent back tab-separated, it
can return items even before the full HTTP transfer is finished)

As you see it's slightly trickier than normal because you will have to
keep track of the list, but the key lines are:



// Register value
T2Reference data = referenceService.register(resultLine[i],
        outputDepth - 1, true, callback.getContext());

// Populate output map for all ports for this given index
partialOutputData.put(outputName, data);
// Keep track of values so far
outputLists.get(outputName).add((int) index, data);

// Partial results
callback.receiveResult(partialOutputData, new int[] { (int) index });


..

// Finally return the full list (of references)
outputData = new HashMap<String, T2Reference>();
outputData.put(outputName,
        referenceService.register(outputLists.get(outputName),
                outputDepth, true, callback.getContext()));
callback.receiveResult(outputData, new int[0]);
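
To see the order of the calls end to end without Taverna on the
classpath, here is a minimal self-contained sketch of the same pattern.
The ResultCallback interface and the StreamingSketch class are
hypothetical stand-ins invented for this example, not Taverna classes,
and plain objects take the place of the T2Reference values that the
reference service would return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the activity callback; not the real Taverna interface. */
interface ResultCallback {
    void receiveResult(Map<String, Object> data, int[] index);
}

class StreamingSketch {

    /** Sends every item at its own index, then the full list with an empty index. */
    static void stream(List<String> items, String outputName, ResultCallback callback) {
        List<Object> collected = new ArrayList<Object>();
        for (int i = 0; i < items.size(); i++) {
            // In a real activity this would be the reference returned by register().
            Object item = items.get(i);
            collected.add(item);

            // Partial result: one item at depth 0, tagged with its position.
            Map<String, Object> partial = new HashMap<String, Object>();
            partial.put(outputName, item);
            callback.receiveResult(partial, new int[] { i });
        }

        // Final result: the complete depth 1 list, with an empty index.
        Map<String, Object> full = new HashMap<String, Object>();
        full.put(outputName, collected);
        callback.receiveResult(full, new int[0]);
    }

    /** Helper for inspection: records the index length of every callback call. */
    static List<Integer> indexLengths(List<String> items) {
        final List<Integer> lengths = new ArrayList<Integer>();
        stream(items, "out", (data, index) -> lengths.add(index.length));
        return lengths;
    }

    public static void main(String[] args) {
        stream(Arrays.asList("row0", "row1", "row2"), "out",
                (data, index) -> System.out.println(Arrays.toString(index) + " " + data));
    }
}
```

The partial calls carry an index of length 1 and the closing call an
index of length 0, which matches the granular depth 0 / output depth 1
split described above.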




On Wed, Jul 8, 2009 at 08:53, Yoshinobu Kano<[email protected]> wrote:
> Hi,
>
> Thanks to all of your kind help, I have resolved many of the issues I had,
> but another issue has arisen regarding list generation.
> May I ask for your help again?
> I have read the Taverna2-helpset.pdf but could not find a solution.
>
> I am trying to create a local worker, which essentially outputs a list
> (depth 1) without input.
> However, since the data size could be quite large, I would like to
> produce this output in a streaming manner using Taverna's built-in
> behaviour,
> to avoid loading everything into memory at the same time.
>
> What I thought is to make this component
> dummy-single-value-in/single-value-out,
> then feed a dummy list to its input to make use of the Taverna
> built-in iterator.
> The problem is that the size of the output list is unknown until the
> whole process is done,
> so I need to change the size of the dummy input list dynamically,
> depending on the output signal (boolean: end of the process or not) of
> the component.
>
> Since the list seems to be represented as java.util.List,
> it might be possible, depending on the internal implementation of Taverna
> -- is it possible to add a new element to the input list dynamically
> (i.e. during the iteration over that very input list)?
>
> Is there any other solution to this problem?
>
> Thank you very much in advance,
>
> -Yoshinobu
>
> On Thu, Jun 11, 2009 at 9:36 AM, Stian
> Soiland-Reyes<[email protected]> wrote:
>> On Thu, Jun 11, 2009 at 06:52, Yoshinobu Kano<[email protected]> wrote:
>>
>>
>>> Since I also cannot imagine that a normal NLP tool does not require
>>> the actual text,
>>> and the annotations added by the tools tend to be larger than the raw
>>> text data,
>>> passing URLs would not be a good option for the connection between
>>> text mining components.
>>> However for the Taverna-UCompare/UIMA interface, URLs would make sense
>>> when the input is a URL referred document.
>>
>> Note that URIs could be any URI or another kind of reference, it
>> doesn't have to be a world wide accessible HTTP-based URL - it could
>> be as simple as urn:uuid:9321d5b1-8904-43a5-8a21-f92bae6d9fa7
>>
>> The main point is if you want to avoid sending large documents from a
>> service, to Taverna, and then just upload it again to the next
>> service, when those two services could exchange the documents in a
>> more efficient manner (and to lower Taverna's memory footprint), then
>> using references like URIs would make this possible - and if you did
>> go for HTTP-urls (it could be links to stuff within the service) those
>> would also be accessible for outside services.
>>
>>
>>
>>> Well that is my question for this Taverna/Bio* community.
>>> Probably we can assume that the normal input is document based - an
>>> abstract or a full text of an academic paper.
>>
>> I guess it would come down to what you decide to do in your workflow,
>> and what you want to do in your service code. :-)
>>
>> I would guess that the things that you are
>> going to play around with, such as deciding which algorithms to use,
>> which databases to fetch from, etc, should be done or initiated by the
>> workflow. The boring number crunching and analysis should be done by
>> the services.
>>
>> Another thing is if you want to use external services, then obviously
>> it would be great if your services played on the same 'level' so you
>> could make two versions of the same workflow, where one uses your
>> service, and another a similar service provided by some Japanese
>> university.
>>
>> So it comes down to the actual research that you are planning to do,
>> really.. :-)
>>
>>
>>
>>> A good news! This strategy would resolve my concern.
>>> How many users use 1.7/2.0/2.1b - how much is the backward compatibility?
>>> Would it be fine to make everything on 2.1b?
>>
>> Not sure about the usage numbers, 2.1b1 is still quite fresh.
>>
>> 2.x workflows should be compatible with each other, and 2.x can open
>> 1.x workflows. However, you can't open a 2.x workflow in 1.x.
>>
>> Based on the feedback we have received so far, I would recommend
>> looking at 2.1b1.
>>
>> However, if you are developing your own extensions to Taverna, do note
>> that many of the APIs have changed between 1.x and 2.x - so you have
>> to decide early. Unfortunately the developer documentation for 2.x is
>> not very complete yet, but of course you are free to look at existing
>> source code. You can also use this list to ask for pointers as to what
>> APIs it would make sense to use - depending on what extension you are
>> doing.
>>
>>
>>> Since UIMA/U-Compare has their own workflow system,
>>> and they have many functionalities including batch processing,
>>> I need to send a signal to the UIMA side workflow that the (list of)
>>> input has finished, when the Taverna side workflow finishes
>>> everything.
>>
>> OK, so you need to communicate with the UIMA side that you are now
>> 'finished'. Then I would use a second processor and a control link, as
>> I specified earlier.
>>
>> You don't specifically need the last item of the list - you just need
>> to know that all the items have been sent individually to UIMA?
>>
>>
>>> This is due to some of the text mining components are
>>
>> .. are..? :-)
>>
>>> Is there any way to notice the end of the list in the BeanShell, say
>>> some special variable which has such a status?
>>
>> No. As I said before, the individual services don't have access to
>> 'where' in the iterations they are.
>>
>>
>>> # I used bsh.shared name space for my implementation, is it a safe
>>> thing in Taverna?
>>
>> I doubt that would be very safe. I'm not sure if you would get
>> interferences with different workflow runs or different beanshells in
>> the same workflow - but that should be easy to test.
>>
>>
>>
>>
>> --
>> Stian Soiland-Reyes, myGrid team
>> School of Computer Science
>> The University of Manchester
>>
>
>
>
> --
> Yoshinobu Kano (Given/Family)
> [email protected]
> Project Research Associate, the University of Tokyo / U-Compare Project Lead
> http://www-tsujii.is.s.u-tokyo.ac.jp/ http://u-compare.org/kano/
>



-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester

_______________________________________________
taverna-hackers mailing list
[email protected]
Web site: http://www.taverna.org.uk
Mailing lists: http://www.taverna.org.uk/taverna-mailing-lists/
Developers Guide: http://www.mygrid.org.uk/tools/developer-information
