Hello, sorry if I'm missing a point, but why not use XOP for the URLs?
I mean, it's a standard way of doing binary optimization in SOAP, but it doesn't have to be used the way SOAP does it...

<some_object>
  <xop:Include xmlns:xop="http://www.w3.org/2004/08/xop/include"
               href="http://some_site/big_file.bin"/>
</some_object>

It's very easy to parse, and when using JAXB as the parser (for Java) you don't break anything:

*******************************************
import java.net.URL;
import javax.activation.DataHandler;
import javax.activation.URLDataSource;
import javax.xml.bind.annotation.XmlMimeType;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlValue;

@XmlRootElement(name = "some_object")
public class SomeObject {

    private DataHandler handler;

    public SomeObject() {}

    public SomeObject(URL url) {
        handler = new DataHandler(new URLDataSource(url));
    }

    @XmlValue
    @XmlMimeType("application/octet-stream")
    public DataHandler getValue() {
        return handler;
    }

    public void setValue(DataHandler handler) {
        this.handler = handler;
    }
}
*******************************************

Java already has URLDataSource; the only thing left to do is enable XOP in JAXB by providing an attachment marshaller/unmarshaller. Here is an example of such encoding/decoding (with streaming):

*******************************************
// (imports: javax.xml.bind.*, javax.xml.bind.attachment.*, javax.activation.*,
//  javax.xml.transform.stream.StreamSource, java.io.*, java.net.*)

private static void decode(InputStream in) throws Exception {
    JAXBContext ctx = JAXBContext.newInstance(SomeObject.class);
    Unmarshaller u = ctx.createUnmarshaller();
    u.setAttachmentUnmarshaller(new AttachmentUnmarshaller() {
        @Override
        public DataHandler getAttachmentAsDataHandler(String cid) {
            try {
                return new DataHandler(new URLDataSource(new URL(cid)));
            } catch (MalformedURLException ex) {
                ex.printStackTrace();
            }
            return null;
        }

        @Override
        public byte[] getAttachmentAsByteArray(String cid) {
            throw new UnsupportedOperationException("Not supported yet.");
        }

        @Override
        public boolean isXOPPackage() {
            return true;
        }
    });
    SomeObject some_object = (SomeObject) u.unmarshal(new StreamSource(in));

    // That's the interesting part: the handler gives us a stream that is
    // loaded from somewhere other than the document we just parsed...
    DataHandler dh = some_object.getValue();
    InputStream stream = dh.getInputStream();
    byte[] buf = new byte[1024];
    int read;
    while ((read = stream.read(buf)) >= 0) {
        System.out.println(new String(buf, 0, read));
    }
}

private static void encode(OutputStream out) throws Exception {
    SomeObject some_object = new SomeObject(new URL("http://some_site/big_file.bin"));
    JAXBContext ctx = JAXBContext.newInstance(SomeObject.class);
    Marshaller m = ctx.createMarshaller();
    m.setAttachmentMarshaller(new AttachmentMarshaller() {
        @Override
        public String addMtomAttachment(DataHandler data, String elementNamespace,
                                        String elementLocalName) {
            DataSource ds = data.getDataSource();
            if (ds instanceof URLDataSource) {
                // Instead of attaching the data, just embed the URL to it
                URLDataSource urlDS = (URLDataSource) ds;
                return urlDS.getURL().toExternalForm();
            }
            return null;
        }

        @Override
        public String addMtomAttachment(byte[] data, int offset, int length,
                                        String mimeType, String elementNamespace,
                                        String elementLocalName) {
            throw new UnsupportedOperationException("Not supported yet.");
        }

        @Override
        public String addSwaRefAttachment(DataHandler data) {
            throw new UnsupportedOperationException("Not supported yet.");
        }

        @Override
        public boolean isXOPPackage() {
            return true;
        }
    });
    m.marshal(some_object, out);
}

public static void main(String[] args) throws Exception {
    //encode(new FileOutputStream("c:/xop.xml")); // encode SomeObject into a file
    //decode(new FileInputStream("c:/xop.xml")); // restore SomeObject from a file
}
*******************************************

Regards,
Dmitry

Stian Soiland-Reyes wrote:
> On Tue, Jun 9, 2009 at 18:23, Yoshinobu Kano <[email protected]> wrote:
>
>> I am sorry, that was a bug in my code... the newline is shown in the result tab.
>> I am currently using 2.0.
>> Is there no option to wrap lines in that view?
>
> Thanks for a valuable suggestion.
>
> I've noted this as a feature request:
> http://www.mygrid.org.uk/dev/issues/browse/T2-633
>
>> Unfortunately our system is for text mining/NLP;
>> URLs without the actual document text do not make sense, since we have
>> to process the document text itself...
>
> The idea about using URLs that Alan suggested relates to passing URLs
> to services, while the services themselves download the URLs as
> needed. This makes sense in some cases, for instance where you are
> passing around large images/scans/datasets between services, the
> workflow locally doesn't do anything with the data, and the services
> are located network-wise close to each other, or with higher bandwidth
> between them than the route up and down through your machine.
>
> (Imagine running a workflow over your ADSL line - it would be good to
> avoid downloading 100x20 MB and then re-uploading each of these 20 MB
> files to each invocation of each service - in particular if the
> services are on the same network as where the data came from!)
>
> However, this would require changing the services to deal with
> referenced data instead of direct values. For (outside) services in
> your workflow that don't deal with references, or for just inspecting
> the documents, you can insert a tiny shim beanshell script that does
> something like:
>
> URL output = new URL(input);
>
> This would change the input string (which is a URL) into a reference,
> which Taverna would dereference (download) when needed. On the
> server side the code would need to do something similar - although it
> can be clever and recognize that the URL is
> http://myown.host.com:8080/myservice/something and look for the file
> "something" in a local directory. (If you do this, remember to check
> that the file really is in that subdirectory - otherwise evil people
> could use "../../../../../../../../etc/passwd" instead of "something"!)
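As a rough sketch of the subdirectory check mentioned above (the class and method names here are made up for illustration): canonicalising both paths collapses any "../" segments before the comparison, so a traversal attempt cannot escape the base directory.

```java
import java.io.File;
import java.io.IOException;

public class SafePathCheck {

    /**
     * Returns true only if "name" resolves to a location inside baseDir.
     * getCanonicalPath() collapses "../" segments and resolves symlinks,
     * so the startsWith comparison is done on the real paths.
     */
    public static boolean isInsideBaseDir(File baseDir, String name) throws IOException {
        String base = baseDir.getCanonicalPath() + File.separator;
        String requested = new File(baseDir, name).getCanonicalPath();
        return requested.startsWith(base);
    }

    public static void main(String[] args) throws IOException {
        File base = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(isInsideBaseDir(base, "something"));                  // true
        System.out.println(isInsideBaseDir(base, "../../../../etc/passwd"));     // false
    }
}
```

Note that the file does not have to exist for the check to work; the comparison is purely on the canonicalised path strings.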
>
>> So again, is there any way to iterate the workflow, passing and
>> loading input data one by one?
>
> First you would need to define what 'one' is. Is 'one' one sentence,
> one page or one document? Then, if the source databases don't provide
> the data at this granularity, you would need to either find or create
> a service that splits the data down to this size - in the simplest
> case this could be the 'Split by regular expression' local worker,
> say splitting a sentence by space and punctuation - but such a local
> worker would need the full input document in memory, and roughly the
> same again for the output list.
>
> The best would be if your services were able to give you the data at
> the 'correct' granularity in the first place - but they might be out
> of your control.
>
> If you're going for the 'document' level, I believe Taverna should
> handle this if documents are roughly 20 MB or so each. If you are
> using Taverna 2.1b1 or later, all the documents (elements of the
> list) will not be kept in memory at once: it will store the content
> of the list to disk, and load it up again when it's needed for a
> particular invocation.
>
> However, if there are 4 concurrent invocations, that means at least
> 4x the document size in memory, in addition to the produced results.
>
> The default maximum memory allocated to Taverna is 300 MB - you can
> increase this if necessary, and if you have enough memory installed
> on your machine, by editing run.bat / run.sh and changing the line
> that contains -Xmx300m to say -Xmx600m for 600 MB.
>
>> The depth-as-list strategy seems like loading the whole input at
>> startup - or am I misunderstanding how the list is represented (not a
>> Java object)?
>
> It is no longer (as in Taverna 1) a naive Java object; it's a more
> structured data structure whose elements are references.
> The values of the references themselves, in addition to the
> structures, are stored in a local disk-based database, but with a
> cache layer in front to avoid the disk slowing down workflow
> execution. (A traditional database won't let you continue before the
> value is written to disk, but the cache avoids this problem by
> writing to the disk in the background. You can change this behaviour
> from Preferences in case the cache itself starts using too much
> memory - but this would be a trade-off against slower speed.)
>
>> I don't have any preference in the Taverna version; I would use any
>> version of Taverna if any solution exists.
>
> If you are going for large data I would really recommend using 2.1b1
> [see http://www.myexperiment.org/packs/60 ] - but do note that for
> beta 1 the installation procedure is a bit 'hackish' - there's no
> installation wizard for Windows or application bundle for OS X - and
> you might have to download and configure Graphviz/dot separately.
>
> We'll fix this for the upcoming 2.1b2, which should be out in about
> two weeks' time.
>
>> I am sorry for the vague question; my question is,
>> is there any way to notice in the BeanShell script code whether the
>> workflow has received the end of the batch set (if such a batch
>> iteration is possible as above)?
>
> If you have a beanshell script, and you have set its input port to
> take individual items (depth 0), but you connect the input port to a
> processor that outputs a list, then implicit iteration will iterate
> over each of the elements of the list, calling the beanshell script
> once for each input as they become available. The outputs of the
> beanshell are similarly wrapped in a new list, one item for each
> invocation.
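The implicit-iteration semantics just described can be sketched roughly in plain Java (everything here is illustrative, not Taverna's actual API - in particular, a `Function` stands in for the beanshell, and the real engine works on references rather than in-memory values):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class ImplicitIteration {

    /**
     * Calls the per-item "processor" (standing in for a beanshell with a
     * depth-0 input port) once for every element of the input list, and
     * wraps the results in a new output list - one item per invocation.
     */
    public static <I, O> List<O> iterate(List<I> inputs, Function<I, O> processor) {
        List<O> outputs = new ArrayList<>();
        for (I item : inputs) {
            outputs.add(processor.apply(item));
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("doc1", "doc2", "doc3");
        // The "script" only ever sees one item at a time, never the whole list:
        List<Integer> lengths = iterate(docs, String::length);
        System.out.println(lengths); // [4, 4, 4]
    }
}
```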
> (This means that workflow-wise the output from this beanshell would
> be a list, even if the output port returns a single value - which
> again could trigger another implicit iteration down the line.)
>
> The beanshell script itself does not get any indication as to what
> part of the iteration it's involved in. If you need this, the
> simplest way is to instead change the beanshell input and output
> ports to receive a list (depth +1) instead of a single value, and to
> deal with the iteration(s) inside the beanshell script. This would
> unfortunately add a bit of boilerplate for dealing with lists and
> iterations inside the script - including a decision on how you do
> error handling in the middle of the list.
>
> The downside of this is that all the values would have to be in
> memory at once (as beanshells currently can't deal with references),
> and that the beanshell invocation won't start until the full input
> list is ready. (Normally 'pipelining' would be in effect, so that
> downstream processors doing implicit iteration would start processing
> those elements of the list that have been received, even if the full
> upstream list is not yet complete.)
>
> What do you need to do specially on the last item of the list?
> Perhaps you could have a different processor in parallel that
> receives the full list - this would be invoked when the full list
> (i.e. including the last item) has been received - however it would
> keep all the elements of that list in memory. (Which you can avoid
> by having a secondary output from the first processor.)
>
> Note that if you do it this way, you are not guaranteed that the
> other processor has finished dealing with the last element. If you
> want that, you can connect to an output from the other processor
> instead.
>
> Or, if you just want to be sure that this second process happens
> after the parallel beanshell has dealt with *all* individual items,
> you could just make a control link ("Run after").
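As a rough sketch of the depth +1 approach described above - taking the whole list into the script and deciding yourself how to handle an error in the middle of it (plain Java standing in for Beanshell; the names and the error-handling policy here are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class InScriptIteration {

    /**
     * The whole input list arrives at once (so it must fit in memory),
     * and the script owns the mid-list error-handling decision: here a
     * failing element is replaced by an error marker and the iteration
     * carries on, instead of aborting the whole invocation.
     */
    public static List<String> process(List<String> inputs) {
        List<String> outputs = new ArrayList<>();
        for (String item : inputs) {
            try {
                outputs.add(item.trim().toUpperCase());
            } catch (RuntimeException e) {
                outputs.add("ERROR: " + e);
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        System.out.println(process(List.of(" foo ", "bar"))); // [FOO, BAR]
    }
}
```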
> Such a control link forces the controlled processor to run after the
> beanshell is fully finished with all its iterations, and you would no
> longer need the list input. However, in some cases this is not what
> you want - say you have lists of lists, and you want to invoke the
> controlled processor once for each element of that outer list!
>
> .. excited to hear more about what your workflows will look like

_______________________________________________
taverna-hackers mailing list
[email protected]
Web site: http://www.taverna.org.uk
Mailing lists: http://www.taverna.org.uk/taverna-mailing-lists/
Developers Guide: http://www.mygrid.org.uk/tools/developer-information
