Re: [Taverna-hackers] Handling Documents

Ian Dunlop Wed, 15 Jul 2009 02:58:40 -0700

Hello,

I have had a chat with Kano and I think I now have a better understanding of
the problem which I shall share with you:


The text mining 'services' are currently exposed through a Java application
known as U-compare (available over web start from
http://u-compare.org/index.html ).  This application allows you to create
text mining workflows which consist of a 'Collection Reader' component (e.g. a
file reader) which passes its input to an 'Analysis Engine' or 'Gas Reader'
component.  The user selects these components from a tree and drops them
onto a workflow.  These analysis components can themselves consist of
multiple steps and may involve web service calls themselves.  The user can
then run this workflow, and the results are presented in a graphically rich
way.

So, how can Taverna use these services?

1) Access the analysis components as web services.  Taverna has many local
workers for reading and manipulating files, and another local worker can be
created for any specialised case.  The analysis components could be accessed
directly over a web service exposed as a WSDL endpoint.

2) Use the U-compare application wrapped inside an activity.  Create a new
activity exposing the functionality as a local worker type of service.  This
will probably require the U-compare application to be 'mavenised' and deployed
as an artifact somewhere.

3) Expose the U-compare application as pre-canned text-mining workflows.
Similar to 2 but the text-mining workflows would be the components rather
than the Collection Reader/Analysis components. Similar issues to 2.

There are probably other options and hybrids of these.

1) is probably the easiest.  2 and 3 are of a similar level of complexity.

The results also pose a small problem.  A renderer would have to be created
for the text/xml mime type which could render the specific text mining
results.  There may be other problems if the results renderer is actually
dependent on the components used; in that case there may have to be multiple
renderers for U-compare.

I think this is all achievable but there would need to be a fair bit of help
provided since the T2 developer docs are good but not quite there yet for
custom components.

Providing text-mining services would open up Taverna to new communities and
fields so it seems like a good thing to do.

Cheers,

Ian

Ian Dunlop
myGrid team
School of Computer Science
University of Manchester

2009/7/10 Stian Soiland-Reyes <[email protected]>

> Likewise as with the Beanshell, the API consumer activity does not
> expose the reference manager capability.
>
> You will have to create a new activity implementation.
>
> There's unfortunately not much documentation out there for Taverna 2
> except the existing code - perhaps some of the other plugin developers
> could help..
>
> We have documentation about making an activity for the T2 platform
> API, but we are not yet using the platform APIs from the workbench, so
> if you follow that documentation you would not be able to appear in
> the workbench.
>
> I would suggest checking out a few of the current activity types to
> have a look - check out with Subversion:
>
>
> http://taverna.googlecode.com/svn/taverna/engine/net.sf.taverna.t2.activities/tags/activities-0.8/stringconstant-activity/
>  (simplest activity, everything hardcoded, configuration is the string
> to return)
>
>
> http://taverna.googlecode.com/svn/taverna/engine/net.sf.taverna.t2.activities/tags/activities-0.8/soaplab-activity/
> (more typical - configuration says web service locations, discovers
> ports dynamically through a few calls)
>
>
> http://taverna.googlecode.com/svn/taverna/engine/net.sf.taverna.t2.activities/tags/activities-0.8/biomart-activity/
> (advanced - but look for how streaming is done. Configuration is
> specialised XML)
>
> Note that some of the Maven dependencies use ${properties} - have a
> look at the parent (like
>
> http://taverna.googlecode.com/svn/taverna/engine/net.sf.taverna.t2.activities/tags/activities-0.8/pom.xml
> ) to see what versions are used.
>
>
> In addition you would need to implement a few interfaces to appear
> graphically in the workbench. You can see how this is done by looking
> at:
>
>
> http://taverna.googlecode.com/svn/taverna/ui/net.sf.taverna.t2.ui-activities/tags/ui-activities-0.12/localworker-activity-ui/
>  (template service, only appears once in Available Services)
>
>
> http://taverna.googlecode.com/svn/taverna/ui/net.sf.taverna.t2.ui-activities/tags/ui-activities-0.12/soaplab-activity-ui/
> (default location, dialogue for adding service from different
> location, populates a full folder in Available services)
>
> The parent POM file with version properties can be found at
>
> http://taverna.googlecode.com/svn/taverna/ui/net.sf.taverna.t2.ui-activities/tags/ui-activities-0.12/pom.xml
>
> To build these you would need Maven 2.0.10 and Java 5. See
> http://www.mygrid.org.uk/dev/wiki/display/developer/Taverna+source+code
> for more information. The easiest would be to build the full
> net.sf.taverna.t2.activities and net.sf.taverna.t2.ui-activities
> modules - but remember to check them out by the latest tag (as used by
> 2.1 beta 2).
>
>
>
> To be discovered you will need to list your implementation in an SPI
> file in the magical folder META-INF/services on the classpath - for
> example see:
>
>
> http://taverna.googlecode.com/svn/taverna/ui/net.sf.taverna.t2.ui-activities/tags/ui-activities-0.12/soaplab-activity-ui/src/main/resources/META-INF/services/
>
> As a minimum, I think you need to be listed in
> ServiceDescriptionProvider (so you can appear in Available Services)
> and ContextualViewFactory (so there's details under 'Details') -
> ActivityIconSPI allows you to get an icon in Workflow explorer, and
> the MenuComponent is just to get a right-click.
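
To illustrate the META-INF/services mechanism described above, here is a
minimal sketch. It does not use any Taverna class: DescriptionProvider and
MyActivityProvider are hypothetical stand-ins, and discovery is done with the
JDK's own java.util.ServiceLoader rather than Taverna's registry, which may
behave differently. It also assumes Java 8 or later, not the Java 5 mentioned
above. The sketch writes the SPI file into a temporary directory, puts that
directory on a class loader, and lists what is found.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

final class SpiSketch {

    /** Hypothetical extension point; a stand-in, not a Taverna interface. */
    public interface DescriptionProvider {
        String name();
    }

    /** Hypothetical implementation that a plugin module would ship. */
    public static final class MyActivityProvider implements DescriptionProvider {
        public String name() {
            return "my-activity";
        }
    }

    /** Writes the SPI file, then discovers the implementations it lists. */
    static List<String> discover() {
        try {
            // The SPI file is named after the interface and lists one
            // implementation class name per line.
            Path root = Files.createTempDirectory("spi-sketch");
            Path services = Files.createDirectories(
                    root.resolve("META-INF").resolve("services"));
            Files.write(services.resolve(DescriptionProvider.class.getName()),
                    MyActivityProvider.class.getName().getBytes(StandardCharsets.UTF_8));

            List<String> found = new ArrayList<String>();
            // Make the temporary directory visible as a classpath root.
            URL[] urls = { root.toUri().toURL() };
            try (URLClassLoader loader =
                    new URLClassLoader(urls, SpiSketch.class.getClassLoader())) {
                for (DescriptionProvider provider
                        : ServiceLoader.load(DescriptionProvider.class, loader)) {
                    found.add(provider.name());
                }
            }
            return found;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discover());
    }
}
```

In a real Maven module the file would simply live under
src/main/resources/META-INF/services/ instead of being written at run time.
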
>
>
> You would need to use Maven also for your new activity - we normally
> recommend keeping the activity itself separate from the GUI, so two
> modules. That way you can run the workflow server side without
> requiring the GUI bits.
>
> Remember to use your own domain name for <groupId> in the pom.xml
> files and, to avoid confusion, use the same name as the package name, for
> instance package jp.ac.utokyo.kano.cool
>
> To build with Taverna dependencies, you would need to add this section
> to the pom.xml (after </dependencies> ):
>
>        <repositories>
>                <repository>
>                        <releases />
>                        <snapshots>
>                                <enabled>false</enabled>
>                        </snapshots>
>                        <id>mygrid-repository</id>
>                        <name>myGrid Repository</name>
>                        <url>http://www.mygrid.org.uk/maven/repository</url>
>                </repository>
>        </repositories>
>
>
>
> To activate the plugin into Taverna you would need to make a plugin
> description. If you have a look at plugins/plugins.xml you will see
> what it looks like, each <plugin> defines the plugin itself and which
> dependencies it has - in your case you would list the myactivity-ui
> and myactivity artifacts.
>
> You would need to have the modules available in a Maven repository,
> while developing you can just do mvn install and use
> file:///home/blah/.m2/repository/ as a repository - some fun required
> in order to get this file URL to work on Windows though - I seem to
> have cheated and used file:/Users/stain/.m2/repository/ assuming you
> are running from C:
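
One way to avoid hand-writing the file URL is to let Java derive it. The
sketch below is only an illustration: LocalRepoUrl and forHome are made-up
names, and it assumes the default local repository location of .m2/repository
under the home directory. Whether the plugin manager accepts the exact form
produced here is not something this sketch verifies.

```java
import java.io.File;

final class LocalRepoUrl {

    /** Returns the file: URL of .m2/repository under the given home directory. */
    static String forHome(String home) {
        File repository = new File(new File(home, ".m2"), "repository");
        // File.toURI() produces the single-slash form, e.g. file:/Users/...,
        // and appends a trailing slash only if the directory exists.
        return repository.getAbsoluteFile().toURI().toString();
    }

    public static void main(String[] args) {
        System.out.println(forHome(System.getProperty("user.home")));
    }
}
```
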
>
> Example:
>
> <plugin>
>        <provider>cagrid.taverna.sf.net</provider>
>        <identifier>net.sf.taverna.cagrid.cagrid-plugin</identifier>
>        <version>0.5-SNAPSHOT</version>
>        <name>CaGrid Activity Plugin</name>
>        <description>Plugin for invoking caGrid services (including
> secure ones)</description>
>        <enabled>true</enabled>
>        <repositories>
>            <repository>file:/Users/stain/.m2/repository/</repository>
>        </repositories>
>        <profile>
>            <dependency>
>                <groupId>net.sf.taverna.cagrid</groupId>
>                <artifactId>cagrid-activity-ui</artifactId>
>                <version>0.5-SNAPSHOT</version>
>            </dependency>
>        </profile>
>        <compatibility>
>            <application>
>                <version>2.1-beta-2</version>
>            </application>
>        </compatibility>
>    </plugin>
>
>
>
>
> On Thu, Jul 9, 2009 at 14:55, Yoshinobu Kano<[email protected]>
> wrote:
> > Hi Stian,
> >
> > Thank you very much for your help again.
> >
> > I would like to follow your advice -- as far as I understand, that means
> > making an APIConsumer by modifying BiomartActivity.java.
> >
> > May I have a pointer to any document which describes creating
> > APIConsumer code in general,
> > or information on which *.jar files I need on the classpath
> > when I create and compile my own activity Java code?
> >
> > Thanks,
> >
> > -Yoshinobu
> >
> > On Wed, Jul 8, 2009 at 3:09 PM, Stian
> > Soiland-Reyes<[email protected]> wrote:
> >> We have not yet exposed pipelining to the interface used by the
> >> Beanshell scripts.
> >>
> >> It is possible to do what you want by implementing your own subclass
> >> of Activity - you might want to look at the BiomartActivity which does
> >> this kind of pipelining.
> >>
> >> Basically you are able to return several times through the callback
> >> object in the Activity - you would return with indexes, and in the end
> >> return the full list.
> >>
> >> From an Activity you will also be able to interface with the reference
> >> manager, so that you can register the data values and get a  reference
> >> back - these are the ones returned and collected in the full list -
> >> and they should have a smaller memory footprint.
> >>
> >> Such an activity would have a granular depth that is lower (say 0)
> >> than the actual output depth (1) - so the end result is at depth 1,
> >> but the activity outputs one item at a time at depth 0.
> >>
> >>
> >> I tried making a workflow which implemented its own java.util.List
> >> subclass and returned a fancy Iterator (which returned new values with
> >> a 10% chance of reaching end of list), but as the beanshell script
> >> still has granular output depth 1 no pipelining would occur in the
> >> workflow before the iterator was finished.
> >>
> >> see
> >> http://taverna.googlecode.com/svn/taverna/engine/net.sf.taverna.t2.activities/tags/activities-0.8/biomart-activity/src/main/java/net/sf/taverna/t2/activities/biomart/BiomartActivity.java
> >> for an activity that does this currently (because it's working with a
> >> HTTP-based protocol with database rows sent back tab-separated - it
> >> can return items even before the full HTTP transfer is finished)
> >>
> >> As you see it's slightly trickier than normal because you will have to
> >> keep track of the list, but the key lines are:
> >>
> >>
> >>
> >> // Register value
> >> T2Reference data = referenceService.register(resultLine[i],
> >> outputDepth - 1, true, callback.getContext());
> >>
> >> // Populate output map for all ports for this given index
> >> partialOutputData.put(outputName, data);
> >> // Keep track of values so far
> >> outputLists.get(outputName).add((int) index, data);
> >>
> >> // Partial results
> >> callback.receiveResult(partialOutputData, new int[] { (int) index });
> >>
> >>
> >> ..
> >>
> >> // Finally return the full list (of references)
> >> outputData = new HashMap();
> >> outputData.put(outputName,
> >>         referenceService.register(outputLists.get(outputName),
> >>                 outputDepth, true, callback.getContext()));
> >> callback.receiveResult(outputData, new int[0]);
> >>
> >>
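
To make the partial-result pattern in the quoted lines concrete, here is a
minimal, self-contained sketch. It deliberately does not use the Taverna API:
ResultCallback, produce and trace are hypothetical names, and plain strings
stand in for the T2Reference values that the real code registers with the
reference service. What it keeps is the shape of the calls: one item per
index first, then the full list with an empty index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StreamingSketch {

    /** Hypothetical stand-in for the activity callback; not the Taverna interface. */
    interface ResultCallback {
        void receiveResult(Object data, int[] index);
    }

    /** Emits each item at its own index as soon as it exists, then the full list. */
    static void produce(Iterable<String> source, ResultCallback callback) {
        List<String> soFar = new ArrayList<String>();
        int index = 0;
        for (String item : source) {
            soFar.add(item);
            // Partial result: a single item at depth 0, addressed by position
            callback.receiveResult(item, new int[] { index });
            index++;
        }
        // Final result: the whole depth 1 list, addressed by an empty index
        callback.receiveResult(soFar, new int[0]);
    }

    /** Runs produce() and records every callback as "index -> data". */
    static List<String> trace(List<String> rows) {
        final List<String> events = new ArrayList<String>();
        produce(rows, new ResultCallback() {
            public void receiveResult(Object data, int[] index) {
                events.add(Arrays.toString(index) + " -> " + data);
            }
        });
        return events;
    }

    public static void main(String[] args) {
        for (String event : trace(Arrays.asList("row A", "row B"))) {
            System.out.println(event);
        }
    }
}
```

Run with two rows, this reports "[0] -> row A" and "[1] -> row B" before the
final "[] -> [row A, row B]", so a downstream consumer could start on row A
before row B has been produced.
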
> >>
> >>
> >>> On Wed, Jul 8, 2009 at 08:53, Yoshinobu Kano<[email protected]> wrote:
> >>> Hi,
> >>>
> >>> Thanks to all of your kind help, I have resolved many of the issues,
> >>> but another issue has arisen regarding list generation.
> >>> May I ask your help again?
> >>> I have read the Taverna2-helpset.pdf but could not find a solution.
> >>>
> >>> I am trying to create a local worker, which essentially outputs a list
> >>> (depth 1) without input.
> >>> However, since the data size could be quite large, I would like to
> >>> produce this output in a streaming manner using the Taverna built-in
> >>> behaviour,
> >>> to avoid loading everything into memory at the same time.
> >>>
> >>> What I thought is to make this component
> >>> dummy-single-value-in/single-value-out,
> >>> then feed a dummy list to its input to make use of the Taverna
> >>> built-in iterator.
> >>> The problem is that the size of the output list is unknown until all
> >>> of the process is done,
> >>> so I need to change the size of the dummy-input-list dynamically,
> >>> depending on the output signal (boolean, end of the process or not) of
> >>> the component.
> >>>
> >>> Since the list seems to be represented as java.util.List,
> >>> it might be possible, but that is up to the internal implementation of Taverna
> >>> -- is it possible to add a new element to the input list dynamically
> >>> (i.e. during the iteration of the very input list itself)?
> >>>
> >>> Is there any other solution to this problem?
> >>>
> >>> Thank you very much in advance,
> >>>
> >>> -Yoshinobu
> >>>
> >>> On Thu, Jun 11, 2009 at 9:36 AM, Stian
> >>> Soiland-Reyes<[email protected]> wrote:
> >>>> On Thu, Jun 11, 2009 at 06:52, Yoshinobu Kano<[email protected]> wrote:
> >>>>
> >>>>
> >>>>> Since I also cannot imagine that a normal NLP tool does not require
> >>>>> the actual text,
> >>>>> and the annotations added by the tools tend to be larger than the raw
> >>>>> text data,
> >>>>> passing URLs would not be a good option for the connection between
> >>>>> text mining components.
> >>>>> However for the Taverna-UCompare/UIMA interface, URLs would make sense
> >>>>> when the input is a URL referred document.
> >>>>
> >>>> Note that URIs could be any URI or another kind of reference, it
> >>>> doesn't have to be a world wide accessible HTTP-based URL - it could
> >>>> be as simple as urn:uuid:9321d5b1-8904-43a5-8a21-f92bae6d9fa7
> >>>>
> >>>> The main point is if you want to avoid sending large documents from a
> >>>> service, to Taverna, and then just upload it again to the next
> >>>> service, when those two services could exchange the documents in a
> >>>> more efficient manner (and to lower Taverna's memory footprint), then
> >>>> using references like URIs would make this possible - and if you did
> >>>> go for HTTP-urls (it could be links to stuff within the service) those
> >>>> would also be accessible for outside services.
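
The idea of passing references instead of documents can be sketched in a few
lines. Everything here is hypothetical: DocumentStore, register and resolve
are illustrative names, not part of Taverna or U-Compare, and an in-memory map
stands in for whatever storage a real service would share. The point is only
that the workflow carries a small urn:uuid reference while the large text
stays on the service side.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

final class DocumentStore {

    private final Map<URI, String> documents = new HashMap<URI, String>();

    /** Stores the text and returns a small reference to hand to the workflow. */
    URI register(String text) {
        URI reference = URI.create("urn:uuid:" + UUID.randomUUID());
        documents.put(reference, text);
        return reference;
    }

    /** Looks the text up again, e.g. inside the next service in the chain. */
    String resolve(URI reference) {
        String text = documents.get(reference);
        if (text == null) {
            throw new IllegalArgumentException("Unknown reference: " + reference);
        }
        return text;
    }

    public static void main(String[] args) {
        DocumentStore store = new DocumentStore();
        URI reference = store.register("Full text of a long paper ...");
        // Only the reference needs to travel through the workflow
        System.out.println(reference);
        System.out.println(store.resolve(reference).length() + " characters resolved");
    }
}
```
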
> >>>>
> >>>>
> >>>>
> >>>>> Well that is my question for this Taverna/Bio* community.
> >>>>> Probably we can assume that the normal input is document based - an
> >>>>> abstract or a full text of an academic paper.
> >>>>
> >>>> I guess it would come down to what you decide to do in your workflow,
> >>>> and what you want to do in your service code. :-)
> >>>>
> >>>> I would guess that the things you are going to play around with,
> >>>> such as deciding which algorithms to use, which databases to fetch
> >>>> from, etc., should be done or initiated by the workflow. The boring
> >>>> number crunching and analysis should be done by the services.
> >>>>
> >>>> Another thing is if you want to use external services, then obviously
> >>>> it would be great if your services played on the same 'level' so you
> >>>> could make two versions of the same workflow, where one uses your
> >>>> service, and another a similar service provided by some Japanese
> >>>> university.
> >>>>
> >>>> So it comes down to the actual research that you are planning to do,
> >>>> really.. :-)
> >>>>
> >>>>
> >>>>
> >>>>> Good news! This strategy would resolve my concern.
> >>>>> How many users use 1.7/2.0/2.1b - how much backward compatibility is there?
> >>>>> Would it be fine to make everything on 2.1b?
> >>>>
> >>>> Not sure about the usage numbers, 2.1b1 is still quite fresh.
> >>>>
> >>>> 2.x workflows should be compatible with each other, and 2.x can open
> >>>> 1.x workflows. However, you can't open a 2.x workflow in 1.x.
> >>>>
> >>>> Based on the feedback we have received so far, I would recommend
> >>>> looking at 2.1b1.
> >>>>
> >>>> However, if you are developing your own extensions to Taverna, do note
> >>>> that many of the APIs have changed between 1.x and 2.x - so you have
> >>>> to decide early. Unfortunately the developer documentation for 2.x is
> >>>> not very complete yet, but of course you are free to look at existing
> >>>> source code. You can also use this list to ask for pointers as to what
> >>>> APIs it would make sense to use - depending on what extension you are
> >>>> doing.
> >>>>
> >>>>
> >>>>> Since UIMA/U-Compare has their own workflow system,
> >>>>> and they have many functionalities including batch processing,
> >>>>> I need to send a signal to the UIMA side workflow that the (list of)
> >>>>> input has finished, when the Taverna side workflow finishes
> >>>>> everything.
> >>>>
> >>>> OK, so you need to communicate with the UIMA side that you are now
> >>>> 'finished'. Then I would use a second processor and a control link, as
> >>>> I specified earlier.
> >>>>
> >>>> You don't specifically need the last item of the list - you just need
> >>>> to know that all the items have been sent individually to UIMA?
> >>>>
> >>>>
> >>>>> This is due to some of the text mining components are
> >>>>
> >>>> .. are..? :-)
> >>>>
> >>>>> Is there any way to notice the end of the list in the BeanShell, say
> >>>>> some special variable which has such a status?
> >>>>
> >>>> No. As I said before, the individual services don't have access to
> >>>> 'where' in the iterations they are.
> >>>>
> >>>>
> >>>>> # I used bsh.shared name space for my implementation, is it a safe
> >>>>> thing in Taverna?
> >>>>
> >>>> I doubt that would be very safe. I'm not sure if you would get
> >>>> interferences with different workflow runs or different beanshells in
> >>>> the same workflow - but that should be easy to test.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Stian Soiland-Reyes, myGrid team
> >>>> School of Computer Science
> >>>> The University of Manchester
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Yoshinobu Kano (Given/Family)
> >>> [email protected]
> >>> Project Research Associate, the University of Tokyo / U-Compare Project Lead
> >>> http://www-tsujii.is.s.u-tokyo.ac.jp/ http://u-compare.org/kano/
> >>>
> >>
> >>
> >>
> >> --
> >> Stian Soiland-Reyes, myGrid team
> >> School of Computer Science
> >> The University of Manchester
> >>
> >
> >
> >
> > --
> > Yoshinobu Kano (Given/Family)
> > [email protected]
> > Project Research Associate, the University of Tokyo / U-Compare Project Lead
> > http://www-tsujii.is.s.u-tokyo.ac.jp/ http://u-compare.org/kano/
> >
>
>
>
> --
> Stian Soiland-Reyes, myGrid team
> School of Computer Science
> The University of Manchester
>
>
> _______________________________________________
> taverna-hackers mailing list
> [email protected]
> Web site: http://www.taverna.org.uk
> Mailing lists: http://www.taverna.org.uk/taverna-mailing-lists/
> Developers Guide: http://www.mygrid.org.uk/tools/developer-information
>
