Hi Mateusz,

> what is the best practice to process CASes resulting from multiple and
> different AAEs? Assume I have an AAE1 processing document collection 1 and an
> AAE2 processing document collection 2. Now I would like to compare each tuple
> of the resulting CASes of AAE1 and AAE2 using a third AAE. My underlying
> intention is to "compare" each document in collection 1 with each document in
> collection 2 using different preprocessing pipelines.
I'd say there are roughly three general strategies:

1) Use a reader which reads both sets of data and passes all combinations in
   views, plus an annotator which compares the views.
2) Use a reader which reads one set of data (initial view) and an annotator
   which reads the other set of data and does the comparison.
3) Do the comparison outside UIMA.

With 1) you can run your pre-processing as part of the pipeline. Mind that you
repeatedly pre-process all documents in your two sets, which is terribly
inefficient.

With 2) at least one of the sets should already have been pre-processed,
namely the one that is read by the annotator.

With 3) you can use the same (or different) pre-processing pipelines for both
sets and have a writer at the end which either writes XMI (or something alike)
or already writes an extract, e.g. feature vectors extracted from the
documents. Then you load the XMIs or feature vectors in a separate little
program and compare them to each other.

> Is there some abstraction in UIMA to perform this? I saw a solution
> performing an n-cross-m comparison using two views where each view
> represented a different document. This works but seems a bit inflexible,
> assuming you need to configure different processing pipelines for each
> collection type.

I've seen people use 1) a lot. I personally prefer 3), sometimes 2).

With UIMA alone, this is not very straightforward to realize. It would involve
writing a custom FlowController. What I did instead was create DKPro Lab [1],
a lightweight Java framework for building such multi-pipeline workflows.
Within the framework, you'd have several "Tasks", e.g.:

- PreprocessingTask1 - pre-processes the first set of data using a certain
  UIMA pipeline and writes the results out as XMI or something else
- PreprocessingTask2 - pre-processes the second set of data using a certain
  UIMA pipeline and writes the results out as XMI or something else
- ComparisonTask - loads the data produced by the two previous tasks, compares
  it, and writes the results out. It's most likely not written as a UIMA
  component, but rather as a simple piece of code which uses the
  XmiCasDeserializer to load the XMIs (or something else), extracts the
  information to compare from the CAS, and compares it - see the rough sketch
  below.
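Just to make 3) / the ComparisonTask a bit more concrete, here is a minimal
sketch. The directory names, the XmiComparison class and the compare() method
are only placeholders, and I'm assuming uimaFIT's JCasFactory to create a CAS
with the type system taken from the classpath:

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.InputStream;

  import org.apache.uima.cas.impl.XmiCasDeserializer;
  import org.apache.uima.fit.factory.JCasFactory;
  import org.apache.uima.jcas.JCas;

  public class XmiComparison {

      public static void main(String[] args) throws Exception {
          // XMI output of the two pre-processing pipelines (placeholder paths)
          File[] set1 = new File("output/set1").listFiles();
          File[] set2 = new File("output/set2").listFiles();

          // n-cross-m comparison of the two pre-processed collections
          for (File f1 : set1) {
              JCas jcas1 = load(f1);
              for (File f2 : set2) {
                  JCas jcas2 = load(f2);
                  System.out.printf("%s vs %s: %.3f%n", f1.getName(),
                          f2.getName(), compare(jcas1, jcas2));
              }
          }
      }

      private static JCas load(File xmiFile) throws Exception {
          // uimaFIT detects the type system on the classpath
          JCas jcas = JCasFactory.createJCas();
          InputStream is = new FileInputStream(xmiFile);
          try {
              XmiCasDeserializer.deserialize(is, jcas.getCas());
          }
          finally {
              is.close();
          }
          return jcas;
      }

      private static double compare(JCas jcas1, JCas jcas2) {
          // Placeholder: extract the annotations created during
          // pre-processing (tokens, lemmas, ...) and compute a real
          // similarity measure here
          return jcas1.getDocumentText().equals(jcas2.getDocumentText())
                  ? 1.0 : 0.0;
      }
  }

In a DKPro Lab setup, something along these lines would live inside the
ComparisonTask, with the two directories pointing to the output locations of
the two preprocessing tasks.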
I'm afraid there's not much documentation available for DKPro Lab yet, but
there is an example project in the Git repository. The example also consists
of three tasks: one extracting part-of-speech features from a document set, a
second training a classifier on these, and a third using the classifier to tag
a new text.

We have additional projects on Google Code that use DKPro Lab and may also
serve as examples [2], in particular DKPro Spelling [3]. Another new project
based on DKPro Lab is upcoming ;)

Note that DKPro Lab has special convenience support for UIMA, but is in
general agnostic of it.

Cheers,

-- Richard

[1] http://code.google.com/p/dkpro-lab/
[2] http://code.ohloh.net/search?s=%22dkpro.lab%22&browser=Default&mp=1&ml=1&me=1&md=1&filterChecked=true
[3] http://code.google.com/p/dkpro-spelling-asl/