On Tue, Jun 9, 2009 at 18:23, Yoshinobu Kano <[email protected]> wrote:
> I am sorry that was a bug in my code... newline is shown in the result tab.
> I am currently using 2.0.
> No option to wrap lines in that view?

Thanks for the valuable suggestion.

I've noted this as a feature request:

 http://www.mygrid.org.uk/dev/issues/browse/T2-633



> Unfortunately our system is for text mining/NLP,
> URLs without actual document text does not make sense since we have to
> process the document text itself...

The idea Alan suggested about using URLs relates to passing URLs to
the services, and letting the services themselves download the URLs as
needed. This makes sense in some cases, for instance when you are
passing large images/scans/datasets between services, the workflow
itself does nothing with the data locally, and the services are close
to each other network-wise, or at least have higher bandwidth between
them than up and down to your machine.

(Imagine running a workflow over your ADSL line - it would be good to
avoid downloading 100 x 20 MB and then re-uploading each of those 20 MB
for each invocation of each service - in particular if the services are
on the same network as where the data came from!)

However, this would require changing the services to deal with
referenced data instead of direct values. For (outside) services in
your workflow that don't deal with references, or for just inspecting
the documents, you can insert a tiny shim beanshell script that does
something like:

  URL output = new URL(input);

This would change the input string (which is a URL) into a reference,
which Taverna would dereference (download) when needed. On the server
side the code would need to do something similar - although it can be
clever and recognize that the URL is
http://myown.host.com:8080/myservice/something and look for the file
"something" in a local directory. (If you do this, remember to check
that the file really is inside that directory, otherwise evil people
could use "../../../../../../../../etc/passwd" instead of
"something"!)
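
As a rough sketch of that server-side check (the "/myservice/" prefix,
'inputUrl' and 'dataDir' below are just assumptions for illustration,
not an existing API), it could look something like:

  import java.io.File;
  import java.io.IOException;
  import java.net.URL;

  File resolveLocalFile(String inputUrl, File dataDir) throws IOException {
      URL url = new URL(inputUrl);
      // e.g. "/myservice/something" -> "something"
      String name = url.getPath().replaceFirst("^/myservice/", "");
      File candidate = new File(dataDir, name);
      // Reject anything (e.g. "../../../etc/passwd") that resolves
      // outside the data directory
      if (!candidate.getCanonicalPath().startsWith(
              dataDir.getCanonicalPath() + File.separator)) {
          throw new IOException("Refusing to read outside " + dataDir);
      }
      return candidate;
  }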


> So again, is there any way to iterate the workflow passing and loading
> input data one by one?

First you would need to define what 'one' is. Is 'one' one sentence,
one page or one document? Then, if the source databases don't provide
the data at this granularity, you would need to either find or create a
service that splits the data down to this size - in the simplest case
this could be the 'Split by regular expression' local worker, say
splitting a sentence by spaces and punctuation - but such a local
worker would need the full input document in memory, and roughly the
same again for the output list.
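
A minimal beanshell equivalent might look like this sketch - assuming
an input port 'document' (a string) and an output port 'sentences' (a
list of strings), both made up for illustration:

  import java.util.ArrayList;
  import java.util.Arrays;

  // Naive split on whitespace that follows '.', '!' or '?' - note that
  // both the whole document and the resulting list are held in memory
  List sentences = new ArrayList(Arrays.asList(
      document.split("(?<=[.!?])\\s+")));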

The best would be if your services were able to give you the data at
the 'correct' granularity in the first place - but they might be out of
your control.


If you're going for the 'document' level, I believe Taverna should
handle this if documents are roughly 20 MB or so each. The documents
(elements of the list) will not all be kept in memory at once if you
are using Taverna 2.1 b1 or later; it will store the content of the
list to disk, and load it up again when it's needed for a particular
invocation.

However, if there are 4 concurrent invocations, that would require at
least 4x the document size in memory - say 4 x 20 MB = 80 MB for the
inputs alone - in addition to the produced results.

The default maximum memory allocated to Taverna is 300 MB - you can
increase this if necessary, and if you have enough memory installed on
your machine, by editing run.bat / run.sh and changing the line that
contains -Xmx300m to say -Xmx600m for 600 MB.


> The depth-as-list strategy seems like loading the whole input at
> startup - or am I misunderstanding how the list is represented (not a
> Java object)?

It is no longer (as in Taverna 1) a naive Java object; it's a more
structured data structure whose elements are references. The values of
the references themselves, in addition to the structures, are stored in
a local disk-based database, but with a cache layer in front to avoid
the disk slowing down workflow execution. (A traditional database won't
let you continue before the value is written to disk, but the cache
avoids this problem by writing to the disk in the background. You can
change this behaviour from Preferences in case the cache itself starts
using too much memory - but this would be a trade-off against slower
speed.)
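
Just to illustrate the write-behind idea in general - this is NOT
Taverna's actual code, only a toy sketch - the cache answers puts and
gets from an in-memory map straight away, while a background thread
persists values to disk, so callers never wait for the disk write:

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  public class WriteBehindCache {
      private final Map cache = new ConcurrentHashMap();
      private final ExecutorService writer = Executors.newSingleThreadExecutor();

      public void put(final String key, final byte[] value) {
          cache.put(key, value);          // immediately visible to readers
          writer.submit(new Runnable() {  // disk write happens later
              public void run() {
                  writeToDisk(key, value);
              }
          });
      }

      public byte[] get(String key) {
          return (byte[]) cache.get(key); // a real cache would fall back to disk
      }

      private void writeToDisk(String key, byte[] value) {
          // ... serialize to the database / file system ...
      }
  }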


> I don't have any preference in the Taverna version, I would use any
> version of Taverna if any solution exists.

If you are going for large data I would really recommend using 2.1b1
[see http://www.myexperiment.org/packs/60] - but do notice that for
beta 1 the installation procedure is a bit more... 'hackish' - there's
no installation wizard for Windows or application bundle for OS X, and
you might have to download and configure Graphviz/dot separately.

We'll fix this for the upcoming 2.1b2, which should be out in about two
weeks' time.


> I am sorry for the vague question, my question is that,
> are there any way to notice in the BeanShell script code whether the
> workflow has recieved the end of the batch set (if such a batch
> iteration is possible as above).

If you have a beanshell script and you have set its input port to take
individual items (depth 0), but you connect the input port to a
processor that outputs a list, then implicit iteration will iterate
over each of the elements of the list, calling the beanshell script
once for each input as it becomes available. The outputs of the
beanshell are similarly wrapped in a new list, one item for each
invocation. (This means that workflow-wise the output from this
beanshell would be a list, even if the output port returns a single
value - which again could trigger another implicit iteration further
down the line.)

The beanshell script itself does not get any indication of which part
of the iteration it's involved in. If you need this, the simplest way
is to instead change the beanshell input and output ports to receive a
list (depth +1) instead of a single value, and to deal with the
iteration(s) inside the beanshell script. This would unfortunately add
a bit of boilerplate dealing with lists and iterations inside the
script - including a decision on how you do error handling in the
middle of the list.
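
As a rough sketch - the port names 'items' (input, depth 1) and
'results' (output, depth 1) are made up for illustration - such a
script could look like:

  import java.util.ArrayList;
  import java.util.List;

  List results = new ArrayList();
  for (int i = 0; i < items.size(); i++) {
      String item = (String) items.get(i);
      boolean isLast = (i == items.size() - 1);

      // ... per-item processing as before ...
      results.add(item);

      if (isLast) {
          // ... whatever should only happen once the whole batch is done ...
      }
  }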



The downside of this is that all the values would have to be in memory
at once (as beanshells currently can't deal with references), and that
the beanshell invocation won't start until the full input list is
ready. (Normally 'pipelining' would be in effect, so that downstream
processors doing implicit iteration start processing the elements of
the list that have been received, even if the full upstream list is not
yet complete.)


What do you need to do specially on the last item of the list? Perhaps
you could have a different processor in parallel that receives the full
list - this would be invoked when the full list (ie. including the last
item) has been received - however it would keep all the elements of
that list in memory. (Which you can avoid by having a secondary output
from the first processor.)

Note that if you do it this way, you are not guaranteed that the other
processor has finished dealing with the last element. If you need that
guarantee, you can connect to an output from the other processor
instead.

Or, if you just want to be sure that this second process happens after
the parallel beanshell has dealt with *all* individual items, you could
just add a control link ("Run after"). This forces the controlled
processor to run after the beanshell has fully finished all its
iterations, and you would no longer need the list input. However, in
some cases this is not what you want - say you have lists of lists, and
you want to invoke the controlled processor once for each element of
that outer list!


.. excited to hear more about what your workflows would look like

-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
