On Tue, Jun 9, 2009 at 18:23, Yoshinobu Kano <[email protected]> wrote:

> I am sorry, that was a bug in my code... newline is shown in the result tab.
> I am currently using 2.0.
> No option to wrap lines in that view?
Thanks for a valuable suggestion. I've noted this as a feature request:
http://www.mygrid.org.uk/dev/issues/browse/T2-633

> Unfortunately our system is for text mining/NLP,
> URLs without actual document text do not make sense since we have to
> process the document text itself...

The idea about using URLs that Alan suggested is to pass URLs to the services, so that the services themselves download the URLs as needed. This makes sense in some cases, for instance when you are passing large images/scans/datasets between services, the workflow itself doesn't do anything with the data locally, and the services are close to each other network-wise or have higher bandwidth between them than up and down to your machine. (Imagine running a workflow over an ADSL line - it would be good to avoid downloading 100 x 20 MB and then re-uploading each of those 20 MB for every invocation of every service - in particular if the services are on the same network as where the data came from!)

However, this would require changing the services to deal with referenced data instead of direct values.

For (outside) services in your workflow that don't deal with references, or just for inspecting the documents, you can insert a tiny shim Beanshell script that does something like:

  URL output = new URL(input);

This turns the input string (which is a URL) into a reference - which Taverna will dereference (download) when needed.

On the server side the code would need to do something similar - although it can be clever and recognize that the URL is http://myown.host.com:8080/myservice/something and look for the file "something" in a local directory. (If you do this, remember to check that the file really is in that subdirectory, otherwise evil people could use "../../../../../../../../etc/passwd" instead of "something"! See the sketch further below for one way to do that check.)

> So again, is there any way to iterate the workflow, passing and loading
> input data one by one?

First you would need to define what 'one' is. Is 'one' one sentence, one page or one document?

Then, if the source databases don't provide the data at this granularity, you would need to either find or create a service that splits the data down to this size. In the simplest case this could be the 'Split by regular expression' local worker, say splitting a sentence by space and punctuation - but such a local worker needs the full input document in memory, and roughly the same again for the output list. The best would be if your services could give you the data at the 'correct' granularity in the first place - but they might be out of your control.

If you are going for the 'Document' level, I believe Taverna should handle this if documents are roughly 20 MB or so each. The documents (elements of the list) will not all be kept in memory at once if you are using Taverna 2.1b1 or later; it will store the content of the list to disk and load it up again when it is needed for a particular invocation. However, if there are 4 concurrent invocations, that means at least 4x the document size in memory, in addition to the produced results.

The default maximum memory allocated to Taverna is 300 MB - you can increase this, if necessary and if you have enough memory installed on your machine, by editing run.bat / run.sh and changing the line that contains -Xmx300m to say -Xmx600m for 600 MB.
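To expand on that server-side check mentioned above: below is a rough sketch in plain Java of refusing names that escape the service's base directory. The class name, the base directory and the way 'name' is extracted from the URL are all made up for illustration - adapt it to however your service is actually implemented.

  import java.io.File;
  import java.io.IOException;

  public class LocalFileResolver {
      // The only directory the service is allowed to serve files from
      private final File baseDir;

      public LocalFileResolver(File baseDir) throws IOException {
          this.baseDir = baseDir.getCanonicalFile();
      }

      // Resolve 'name' (the last part of the URL) to a local file, refusing
      // anything that escapes baseDir, e.g. "../../../../etc/passwd"
      public File resolve(String name) throws IOException {
          File candidate = new File(baseDir, name).getCanonicalFile();
          if (!candidate.getPath().startsWith(baseDir.getPath() + File.separator)) {
              throw new SecurityException("Refusing path outside " + baseDir + ": " + name);
          }
          return candidate;
      }
  }

Because getCanonicalFile() resolves ".." (and symlinks) before the check, a request for "../../etc/passwd" canonicalises to a path outside the base directory and is rejected rather than read.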
> The depth-as-list strategy seems like loading the whole input at
> startup - or am I misunderstanding how the list is represented (not a
> Java object)?

It is no longer (as in Taverna 1) a naive Java object; it's a more structured data structure whose elements are references. The values of the references themselves, in addition to the structures, are stored in a local disk-based database, but with a cache layer in front to avoid the disk slowing down workflow execution. (A traditional database won't let you continue before the value is written to disk, but the cache avoids this problem by writing to the disk in the background. You can change this behaviour from Preferences in case the cache itself starts using too much memory - but that would be a trade-off against speed.)

> I don't have any preference in the Taverna version, I would use any
> version of Taverna if any solution exists.

If you are going for large data I would really recommend using 2.1b1 [see http://www.myexperiment.org/packs/60 ] - but do note that for beta 1 the installation procedure is a bit 'hackish' - there's no installation wizard for Windows or application bundle for OS X - and you might have to download and configure Graphviz/dot separately. We'll fix this for the upcoming 2.1b2, which should be out in about 2 weeks' time.

> I am sorry for the vague question, my question is that,
> is there any way to notice in the BeanShell script code whether the
> workflow has received the end of the batch set (if such a batch
> iteration is possible as above).

If you have a Beanshell script, you have set its input port to take individual items (depth 0), and you connect that input port to a processor that outputs a list, then implicit iteration will iterate over each element of the list, calling the Beanshell script once for each input as they become available. The outputs of the Beanshell are similarly wrapped in a new list, one item per invocation. (This means that workflow-wise the output from this Beanshell is a list, even if the output port returns a single value - which again could trigger another implicit iteration further down the line.)

The Beanshell script itself does not get any indication of which part of the iteration it is involved in. If you need this, the simplest way is to instead change the Beanshell input and output ports to receive a list (depth +1) rather than a single value, and to deal with the iteration(s) inside the Beanshell script - see the sketch further below. This unfortunately adds a bit of boilerplate for dealing with lists and iterations inside the script, including a decision on how you handle errors in the middle of the list.

The downside of this is that all the values would have to be in memory at once (as Beanshells currently can't deal with references), and that the Beanshell invocation won't start until the full input list is ready. (Normally 'pipelining' is in effect, so that downstream processors doing implicit iteration start processing the elements of the list as they are received, even if the full upstream list is not yet complete.)

What special handling do you need for the last item of the list? Perhaps you could have a different processor in parallel that receives the full list - this would be invoked when the full list (i.e. including the last item) has been received - however it would keep all the elements of that list in memory (which you can avoid by having a secondary output from the first processor).
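To make the list-input alternative above a bit more concrete, here is a minimal Beanshell sketch. It assumes an input port called 'documents' configured with depth 1 (arriving as a java.util.List of Strings) and an output port called 'results', also depth 1 - the port names and the 'processing' line are made up, so replace them with whatever your script actually needs to do:

  import java.util.ArrayList;
  import java.util.List;

  // 'documents' is the depth-1 input port: a List of Strings
  List results = new ArrayList();
  for (int i = 0; i < documents.size(); i++) {
      String doc = (String) documents.get(i);
      boolean isLast = (i == documents.size() - 1);
      // ... process 'doc' here; do any end-of-batch work when isLast is true ...
      results.add("processed " + doc.length() + " characters");
  }
  // 'results' is picked up as the depth-1 output port

With this setup the script is invoked once for the whole list, so it can tell which element is the last one - at the cost of holding every element in memory, as noted above.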
Note that if you use the parallel processor approach, you are not guaranteed that the other processor has finished dealing with the last element. If you want that, you can connect to an output from the other processor instead. Or, if you just want to be sure that this second process happens after the parallel Beanshell has dealt with *all* the individual items, you could simply add a control link ("Run after"). This forces the controlled processor to run after the Beanshell has fully finished all its iterations, and you would no longer need the list input. However, in some cases this is not what you want, say if you have lists of lists and you want to invoke the controlled processor once for each element of that outer list!

.. excited to hear more about what your workflows would look like!

--
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
