Hi Stian, Dmitry,

Thank you very much for your detailed explanations. It seems that I should have explained much more about our U-Compare system and the UIMA framework on which it is based. I would like to explain some of the basic concepts below to avoid further misunderstandings - sorry for the long reply.
# However, Apache UIMA itself is a large API - the official documentation runs to hundreds of pages (http://incubator.apache.org/uima/) - and U-Compare (http://u-compare.org/) is itself a system with a large set of functionalities, so it would be almost impossible to explain either one fully over e-mail...

First of all, our current work is trying to link two different large systems from two different communities, so there are many differences implicitly assumed in each community. For example, it is rather rare to see/use web services in NLP research, probably because the processes we need sometimes consume quite large computational resources. Actually, other NLP researchers have asked me every time whether the web service UIMA components we provide could be run locally or not. In addition, since the U-Compare system itself runs locally, I have implemented code to call our system as a Taverna local component.

As for the URL issue, my "whole PubMed" example was not so good. In most cases we cannot have any public URL for the input document - for example, e-mail communications, written documents not available online, a manually annotated corpus drawn from part of some set of documents, etc. Since I also cannot imagine a normal NLP tool that does not require the actual text, and since the annotations added by the tools tend to be larger than the raw text data, passing URLs would not be a good option for the connection between text mining components. However, for the Taverna-UCompare/UIMA interface, URLs would make sense when the input is a document referred to by URL.

# I myself am a Java programmer and U-Compare/UIMA are implemented in Java, so implementation is not the problem in this case. This is a design issue for interoperability. But thanks for your detailed explanations.

>> So again, is there any way to iterate the workflow passing and loading
>> input data one by one?
>
> First you would need to define what is 'one'. Is 'one' one sentence,
> one page or one document?
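To make the size point above concrete, here is a toy sketch. It uses a simple hypothetical XML-ish standoff format (not UIMA's actual CAS/XMI serialization), and shows that even trivial tokenisation already produces more annotation bytes than the raw text itself:

```java
import java.util.ArrayList;
import java.util.List;

public class AnnotationSize {
    // One standoff annotation serialized in a simple XML-ish form
    // (a hypothetical format, not UIMA's actual CAS/XMI serialization).
    static String ann(String type, int begin, int end) {
        return "<" + type + " begin=\"" + begin + "\" end=\"" + end + "\"/>";
    }

    public static void main(String[] args) {
        String text = "p53 activates transcription.";
        List<String> anns = new ArrayList<>();
        // One Token annotation per whitespace-separated word...
        int pos = 0;
        for (String tok : text.split(" ")) {
            anns.add(ann("Token", pos, pos + tok.length()));
            pos += tok.length() + 1;
        }
        // ...plus one Sentence annotation over the whole text.
        anns.add(ann("Sentence", 0, text.length()));

        int annBytes = 0;
        for (String a : anns) annBytes += a.length();
        System.out.println("text bytes: " + text.length());
        System.out.println("annotation bytes: " + annBytes);
        // The annotation output already exceeds the raw text here,
        // so passing only URLs between components would save little.
    }
}
```

Real pipelines add part-of-speech tags, parse trees and named entities on top of this, so the ratio only gets worse.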
Well, that is my question for this Taverna/Bio* community. Probably we can assume that the normal input is document based - an abstract or the full text of an academic paper.

> Then, if the source databases don't provide
> the data in this granularity, you would need to either find or create
> a service that splits the data down into this size - in the simplest
> case this could be the 'Split by regular expression' local worker, say
> splitting a sentence by space and punctuation - but such a local
> worker would need the full input document in memory, and roughly the
> same again for the output list.

I would prefer to handle this sort of NLP task inside U-Compare/UIMA, not on the Taverna side. From the NLP point of view, the smallest text unit should be a document which has no dependencies to/from other documents, where the dependencies could be syntactic/semantic/discourse, depending on the purpose.

# I have to say that sentence splitting is not so easy a task, in contrast to what most people expect; even a 1% failure rate is quite significant for the later tools in the pipeline, since sentence splitting is the first step! Actually, we have collected, and are still collecting, several sentence splitting tools.

> If you're going for the 'Document' level, I believe Taverna should
> handle this if documents are roughly 20 MB or so each. All the
> documents (elements of the list) will not be kept in memory at once if
> you are using Taverna 2.1 b1 or later; it will store the content of
> the list to disk, and load it up again when it's needed for a
> particular invocation.

Good news! This strategy would resolve my concern. How many users use 1.7/2.0/2.1b - how much backward compatibility is there? Would it be fine to build everything on 2.1b?

> The downside of this is that all the values would have to be in memory
> at once (as the beanshells currently can't deal with references), and
> that the beanshell invocation won't start until the full input list is
> ready.
> (normally 'pipelining' would be in effect, so that downstream
> processors that are doing implicit iteration would start processing
> those elements of the list that are received, even if the full
> upstream list is not yet complete)

Then our solution will be to handle lists on the Taverna system side, not in the BeanShell script. Since UIMA/U-Compare has its own workflow system, with many functionalities including batch processing, I need to send a signal to the UIMA-side workflow that the (list of) input has finished, once the Taverna-side workflow finishes everything. This is because some of the text mining components ... Is there any way to notice the end of the list in a BeanShell, say some special variable which holds such a status?

# I used the bsh.shared namespace for my implementation - is that a safe thing to do in Taverna?

I hope I have explained our current issues. Thanks a lot for your help again!

-Yoshinobu

--
Yoshinobu Kano (Given/Family) [email protected]
Project Research Associate, the University of Tokyo / U-Compare Project Lead
http://www-tsujii.is.s.u-tokyo.ac.jp/
http://u-compare.org/kano/

_______________________________________________
taverna-hackers mailing list
[email protected]
Web site: http://www.taverna.org.uk
Mailing lists: http://www.taverna.org.uk/taverna-mailing-lists/
Developers Guide: http://www.mygrid.org.uk/tools/developer-information
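P.S. The idea of my current bsh.shared workaround, written out as a plain-Java sketch (all names are hypothetical; the shared counter stands in for state kept in bsh.shared across the per-element script invocations). Note its limitation: the expected total must be known and wired in up front, which is exactly why a real end-of-list flag from Taverna would help:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class EndOfListSignal {
    // Plays the role of state kept in BeanShell's bsh.shared namespace:
    // it survives across the per-element script invocations.
    static final AtomicInteger processed = new AtomicInteger();

    // Called once per list element, as Taverna's implicit iteration would
    // call the script. Returns true only on the final element - the point
    // where a "list finished" signal could be sent to the UIMA side.
    static boolean processOne(String doc, int expectedTotal) {
        // ... run the real per-document processing here ...
        return processed.incrementAndGet() == expectedTotal;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("doc1", "doc2", "doc3");
        for (String d : docs) {
            if (processOne(d, docs.size())) {
                System.out.println("end of list reached after " + d);
            }
        }
    }
}
```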
