> I have this particular requirement for an API that we wrap over a UIMA
> pipeline.
>
> public List<String> analyse(String inputFolderPath, String modelName);
>
> This method is supposed to accept a collection of files (residing in the
> inputFolderPath), run the files (as CAS) through a pipeline of UIMA AEs, and
> return the results (one String per CAS).
>
> To return the strings, I will need to somehow access the CAS after the AEs
> have finished their job and transform/extract whatever is inside the CAS into
> the string that I will return to the caller of this method.
>
> But if I run the AEs using SimplePipeline.runPipeline(),
> how can I get hold of the CASes that are coming out of the AEs?
> Do I attach a CAS Consumer at the tail of the pipeline and read the CAS
> contents at that point? Then I append each result to the List<String> that I
> constructed at the beginning.

You should take a look at JCasIterable (cf. [1]; the example there is in
Groovy, but JCasIterable is a Java class and works nicely in Java too, I just
have no Java example at hand). JCasIterable basically allows you to iterate
over the CASes produced by your pipeline. In such a loop, you can extract and
collect the data you need from the CASes, e.g. putting it into a List<String>
and returning it.

Mind that you should *not* try to keep hold of full CASes or FeatureStructures
(including Annotations and the like). You need to copy the data out of the CAS
while you are still inside the loop; the CAS is reused for the next document,
so anything that still points into it will be corrupted.

> If so, is this scalable?

Well… up to a point, but not in general.

> If I have thousands of files in the inputFolderPath, and if the strings are
> very large, would I run out of memory soon?
> Is there a more scalable way to do this?

You could write your strings to a file and then return an implementation of
List<String> which reads directly from that file.

Depending on how much you want to scale, you'll have to look into different
solutions. The easiest would be to buy more memory; the most complex would
probably be porting your stuff to some kind of cluster. The latter will most
likely require a change of API, possibly even of the whole processing
paradigm. List<String> most probably won't do then ;)

Cheers,

-- Richard

[1] http://code.google.com/p/dkpro-core-asl/wiki/GroovyRecipies#OpenNLP_Part-of-speech_tagging_pipeline_using_JCasIterable_and_c
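
P.S. To make the "List<String> which directly accesses the file" idea a bit more concrete, here is a minimal sketch in plain Java. It is only an illustration under assumptions, not something taken from UIMA or uimaFIT: the class name FileBackedStringList and its record layout (a 4-byte length followed by UTF-8 bytes) are made up for this example. Only one byte offset per result is kept in memory; the strings themselves stay on disk and are read back on get().

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of UIMA): a List<String> whose strings live in a file. */
final class FileBackedStringList extends AbstractList<String> implements AutoCloseable {
    private final RandomAccessFile file;
    // Only the byte offset of each record is held in memory, not the string itself.
    private final List<Long> offsets = new ArrayList<>();

    FileBackedStringList(Path path) {
        try {
            this.file = new RandomAccessFile(path.toFile(), "rw");
            this.file.setLength(0); // assumption: always start from an empty file
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends one result as a length-prefixed UTF-8 record at the end of the file. */
    @Override
    public boolean add(String value) {
        try {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            long offset = file.length();
            file.seek(offset);
            file.writeInt(bytes.length);
            file.write(bytes);
            offsets.add(offset);
            modCount++; // lets iterators detect concurrent modification
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the requested record back from disk. */
    @Override
    public String get(int index) {
        long offset = offsets.get(index); // IndexOutOfBoundsException for a bad index
        try {
            file.seek(offset);
            byte[] bytes = new byte[file.readInt()];
            file.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public int size() {
        return offsets.size();
    }

    @Override
    public void close() {
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Inside the JCasIterable loop you would call add() with the string you extracted from the current CAS, and analyse() would return the list instead of an ArrayList<String>. Note that the sketch is not thread-safe, a single string is limited to what fits in one byte array, and the caller has to close the list (and delete the file) when done, which List<String> alone does not express.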
