Hello, sorry if I'm missing a point, but why not use XOP for the URLs?
I mean, it's a standard way of doing binary optimization in SOAP, but it doesn't have to be used the way SOAP does it...

<some_object>
  <xop:Include xmlns:xop="http://www.w3.org/2004/08/xop/include"
               href="http://some_site/big_file.bin"/>
</some_object>

It's very easy to parse, and when using JAXB as the parser (for Java) you don't break anything:

*******************************************
import java.net.URL;
import javax.activation.DataHandler;
import javax.activation.URLDataSource;
import javax.xml.bind.annotation.XmlMimeType;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlValue;

@XmlRootElement(name = "some_object")
public class SomeObject {

    private DataHandler handler;

    public SomeObject() {}

    public SomeObject(URL url) {
        handler = new DataHandler(new URLDataSource(url));
    }

    @XmlValue
    @XmlMimeType("application/octet-stream")
    public DataHandler getValue() {
        return handler;
    }

    public void setValue(DataHandler handler) {
        this.handler = handler;
    }
}
*******************************************

Java already has URLDataSource; the only thing left to do is enable XOP in JAXB by providing an attachment marshaller/unmarshaller. Here is an example of such encoding/decoding (with streaming):

*******************************************
// (imports: javax.xml.bind.*, javax.xml.bind.attachment.*, javax.activation.*,
//  javax.xml.transform.stream.StreamSource, java.io.*, java.net.*)

private static void decode(InputStream in) throws Exception {
    JAXBContext ctx = JAXBContext.newInstance(SomeObject.class);
    Unmarshaller u = ctx.createUnmarshaller();
    u.setAttachmentUnmarshaller(new AttachmentUnmarshaller() {
        @Override
        public DataHandler getAttachmentAsDataHandler(String cid) {
            try {
                return new DataHandler(new URLDataSource(new URL(cid)));
            } catch (MalformedURLException ex) {
                ex.printStackTrace();
            }
            return null;
        }

        @Override
        public byte[] getAttachmentAsByteArray(String cid) {
            throw new UnsupportedOperationException("Not supported yet.");
        }

        @Override
        public boolean isXOPPackage() {
            return true;
        }
    });
    SomeObject some_object = (SomeObject) u.unmarshal(new StreamSource(in));

    // That's the interesting part: the handler gives us a stream that is
    // loaded from somewhere other than the document we just parsed...
    DataHandler dh = some_object.getValue();
    InputStream stream = dh.getInputStream();
    byte[] buf = new byte[1024];
    int read;
    while ((read = stream.read(buf)) >= 0) {
        System.out.println(new String(buf, 0, read));
    }
}

private static void encode(OutputStream out) throws Exception {
    SomeObject some_object = new SomeObject(new URL("http://some_site/big_file.bin"));
    JAXBContext ctx = JAXBContext.newInstance(SomeObject.class);
    Marshaller m = ctx.createMarshaller();
    m.setAttachmentMarshaller(new AttachmentMarshaller() {
        @Override
        public String addMtomAttachment(DataHandler data, String elementNamespace,
                                        String elementLocalName) {
            DataSource ds = data.getDataSource();
            if (ds instanceof URLDataSource) {
                // Instead of attaching the data, just embed the URL to it
                URLDataSource urlDS = (URLDataSource) ds;
                return urlDS.getURL().toExternalForm();
            }
            return null;
        }

        @Override
        public String addMtomAttachment(byte[] data, int offset, int length,
                                        String mimeType, String elementNamespace,
                                        String elementLocalName) {
            throw new UnsupportedOperationException("Not supported yet.");
        }

        @Override
        public String addSwaRefAttachment(DataHandler data) {
            throw new UnsupportedOperationException("Not supported yet.");
        }

        @Override
        public boolean isXOPPackage() {
            return true;
        }
    });
    m.marshal(some_object, out);
}

public static void main(String[] args) throws Exception {
    //encode(new FileOutputStream("c:/xop.xml")); // encode SomeObject into a file
    //decode(new FileInputStream("c:/xop.xml")); // restore SomeObject from a file
}
*******************************************

Regards,
Dmitry

Stian Soiland-Reyes wrote:
> On Tue, Jun 9, 2009 at 18:23, Yoshinobu Kano <[email protected]> wrote:
>
>> I am sorry, that was a bug in my code... the newline is shown in the result tab.
>> I am currently using 2.0.
>> Is there no option to wrap lines in that view?
>
> Thanks for a valuable suggestion.
>
> I've noted this as a feature request:
> http://www.mygrid.org.uk/dev/issues/browse/T2-633
>
>> Unfortunately our system is for text mining/NLP;
>> URLs without the actual document text do not make sense, since we have
>> to process the document text itself...
>
> The idea about using URLs that Alan suggested relates to passing URLs
> to services, while the services themselves download the URLs as
> needed. This makes sense in some cases, for instance where you are
> passing around large images/scans/datasets between services, the
> workflow locally doesn't do anything with the data, and the services
> are located network-wise close to each other, or with higher bandwidth
> between them than the route up and down through your machine.
>
> (Imagine running a workflow over your ADSL line - it would be good to
> avoid downloading 100x20 MB and then re-uploading each of these 20 MB
> files to each invocation of each service - in particular if the
> services are on the same network as where the data came from!)
>
> However, this would require changing the services to deal with
> referenced data instead of direct values. For (outside) services in
> your workflow that don't deal with references, or for just inspecting
> the documents, you can insert a tiny shim beanshell script that does
> something like:
>
> URL output = new URL(input);
>
> This would change the input string (which is a URL) into a reference,
> which Taverna would dereference (download) when needed. On the
> server side the code would need to do something similar - although it
> can be clever and recognize that the URL is
> http://myown.host.com:8080/myservice/something and look for the file
> "something" in a local directory. (If you do this, remember to check
> that the file really is in that subdirectory - otherwise evil people
> could use "../../../../../../../../etc/passwd" instead of "something"!)
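As a rough sketch of the subdirectory check mentioned above (the class and method names here are made up for illustration): canonicalising both paths collapses any "../" segments before the comparison, so a traversal attempt cannot escape the base directory.

```java
import java.io.File;
import java.io.IOException;

public class SafePathCheck {

    /**
     * Returns true only if "name" resolves to a location inside baseDir.
     * getCanonicalPath() collapses "../" segments and resolves symlinks,
     * so the startsWith comparison is done on the real paths.
     */
    public static boolean isInsideBaseDir(File baseDir, String name) throws IOException {
        String base = baseDir.getCanonicalPath() + File.separator;
        String requested = new File(baseDir, name).getCanonicalPath();
        return requested.startsWith(base);
    }

    public static void main(String[] args) throws IOException {
        File base = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(isInsideBaseDir(base, "something"));                  // true
        System.out.println(isInsideBaseDir(base, "../../../../etc/passwd"));     // false
    }
}
```

Note that the file does not have to exist for the check to work; the comparison is purely on the canonicalised path strings.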
>
>> So again, is there any way to iterate the workflow, passing and
>> loading input data one by one?
>
> First you would need to define what 'one' is. Is 'one' one sentence,
> one page or one document? Then, if the source databases don't provide
> the data at this granularity, you would need to either find or create
> a service that splits the data down to this size - in the simplest
> case this could be the 'Split by regular expression' local worker,
> say splitting a sentence by space and punctuation - but such a local
> worker would need the full input document in memory, and roughly the
> same again for the output list.
>
> The best would be if your services were able to give you the data at
> the 'correct' granularity in the first place - but they might be out
> of your control.
>
> If you're going for the 'document' level, I believe Taverna should
> handle this if documents are roughly 20 MB or so each. If you are
> using Taverna 2.1b1 or later, all the documents (elements of the
> list) will not be kept in memory at once: it will store the content
> of the list to disk, and load it up again when it's needed for a
> particular invocation.
>
> However, if there are 4 concurrent invocations, that means at least
> 4x the document size in memory, in addition to the produced results.
>
> The default maximum memory allocated to Taverna is 300 MB - you can
> increase this if necessary, and if you have enough memory installed
> on your machine, by editing run.bat / run.sh and changing the line
> that contains -Xmx300m to say -Xmx600m for 600 MB.
>
>> The depth-as-list strategy seems like loading the whole input at
>> startup - or am I misunderstanding how the list is represented (not a
>> Java object)?
>
> It is no longer (as in Taverna 1) a naive Java object; it's a more
> structured data structure whose elements are references.
> The values of the references themselves, in addition to the
> structures, are stored in a local disk-based database, but with a
> cache layer in front to avoid the disk slowing down workflow
> execution. (A traditional database won't let you continue before the
> value is written to disk, but the cache avoids this problem by
> writing to the disk in the background. You can change this behaviour
> from Preferences in case the cache itself starts using too much
> memory - but this would be a trade-off against slower speed.)
>
>> I don't have any preference in the Taverna version; I would use any
>> version of Taverna if any solution exists.
>
> If you are going for large data I would really recommend using 2.1b1
> [see http://www.myexperiment.org/packs/60 ] - but do note that for
> beta 1 the installation procedure is a bit 'hackish' - there's no
> installation wizard for Windows or application bundle for OS X - and
> you might have to download and configure Graphviz/dot separately.
>
> We'll fix this for the upcoming 2.1b2, which should be out in about
> two weeks' time.
>
>> I am sorry for the vague question; my question is,
>> is there any way to notice in the BeanShell script code whether the
>> workflow has received the end of the batch set (if such a batch
>> iteration is possible as above)?
>
> If you have a beanshell script, and you have set its input port to
> take individual items (depth 0), but you connect the input port to a
> processor that outputs a list, then implicit iteration will iterate
> over each of the elements of the list, calling the beanshell script
> once for each input as they become available. The outputs of the
> beanshell are similarly wrapped in a new list, one item for each
> invocation.
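The implicit-iteration semantics just described can be sketched roughly in plain Java (everything here is illustrative, not Taverna's actual API - in particular, a `Function` stands in for the beanshell, and the real engine works on references rather than in-memory values):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class ImplicitIteration {

    /**
     * Calls the per-item "processor" (standing in for a beanshell with a
     * depth-0 input port) once for every element of the input list, and
     * wraps the results in a new output list - one item per invocation.
     */
    public static <I, O> List<O> iterate(List<I> inputs, Function<I, O> processor) {
        List<O> outputs = new ArrayList<>();
        for (I item : inputs) {
            outputs.add(processor.apply(item));
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("doc1", "doc2", "doc3");
        // The "script" only ever sees one item at a time, never the whole list:
        List<Integer> lengths = iterate(docs, String::length);
        System.out.println(lengths); // [4, 4, 4]
    }
}
```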
> (This means that workflow-wise the output from this beanshell would
> be a list, even if the output port returns a single value - which
> again could trigger another implicit iteration down the line.)
>
> The beanshell script itself does not get any indication as to what
> part of the iteration it's involved in. If you need this, the
> simplest way is to instead change the beanshell input and output
> ports to receive a list (depth +1) instead of a single value, and to
> deal with the iteration(s) inside the beanshell script. This would
> unfortunately add a bit of boilerplate for dealing with lists and
> iterations inside the script - including a decision on how you do
> error handling in the middle of the list.
>
> The downside of this is that all the values would have to be in
> memory at once (as beanshells currently can't deal with references),
> and that the beanshell invocation won't start until the full input
> list is ready. (Normally 'pipelining' would be in effect, so that
> downstream processors doing implicit iteration would start processing
> those elements of the list that have been received, even if the full
> upstream list is not yet complete.)
>
> What do you need to do specially on the last item of the list?
> Perhaps you could have a different processor in parallel that
> receives the full list - this would be invoked when the full list
> (i.e. including the last item) has been received - however it would
> keep all the elements of that list in memory. (Which you can avoid
> by having a secondary output from the first processor.)
>
> Note that if you do it this way, you are not guaranteed that the
> other processor has finished dealing with the last element. If you
> want that, you can connect to an output from the other processor
> instead.
>
> Or, if you just want to be sure that this second process happens
> after the parallel beanshell has dealt with *all* individual items,
> you could just make a control link ("Run after").
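As a rough sketch of the depth +1 approach described above - taking the whole list into the script and deciding yourself how to handle an error in the middle of it (plain Java standing in for Beanshell; the names and the error-handling policy here are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class InScriptIteration {

    /**
     * The whole input list arrives at once (so it must fit in memory),
     * and the script owns the mid-list error-handling decision: here a
     * failing element is replaced by an error marker and the iteration
     * carries on, instead of aborting the whole invocation.
     */
    public static List<String> process(List<String> inputs) {
        List<String> outputs = new ArrayList<>();
        for (String item : inputs) {
            try {
                outputs.add(item.trim().toUpperCase());
            } catch (RuntimeException e) {
                outputs.add("ERROR: " + e);
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        System.out.println(process(List.of(" foo ", "bar"))); // [FOO, BAR]
    }
}
```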
> Such a control link forces the controlled processor to run after the
> beanshell is fully finished with all its iterations, and you would no
> longer need the list input. However, in some cases this is not what
> you want - say you have lists of lists, and you want to invoke the
> controlled processor once for each element of that outer list!
>
> .. excited to hear more about what your workflows will look like

_______________________________________________
taverna-hackers mailing list
[email protected]
Web site: http://www.taverna.org.uk
Mailing lists: http://www.taverna.org.uk/taverna-mailing-lists/
Developers Guide: http://www.mygrid.org.uk/tools/developer-information
