Hi Mateusz,

> what is the best practice to process CASes resulting from multiple and 
> different AAEs? Assume I have an AAE1 processing document collection 1 and an 
> AAE2 processing document collection 2. Now I would like to compare each tuple 
> of the resulting CASes of AAE1 and AAE2 using a third AAE. My underlying 
> intention is to "compare" each document in collection 1 with each document in 
> collection 2 using different preprocessing pipelines.

I'd say there are roughly three general strategies:

1) Use a reader which reads both sets of data and passes each combination as a 
pair of views, plus an annotator which compares the two views
2) Use a reader which reads one set of data (into the initial view) and an 
annotator which itself reads the other set of data and does the comparison
3) Do the comparison outside UIMA

With 1) you have the ability to run your pre-processing as part of the 
pipeline. Mind, however, that each document is pre-processed once per pair it 
appears in, so you repeatedly pre-process all documents in your two sets, 
which is terribly inefficient.
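The pairing a strategy-1 reader would perform can be sketched in plain Java 
(the class and method names here are made up for illustration; a real reader 
would emit one CAS with two views per pair rather than a list):

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the cross-product pairing behind strategy 1:
// every document of set 1 is combined with every document of set 2,
// which is why each document ends up being read (and pre-processed)
// many times over.
class CrossProduct {
    static List<String[]> pairs(List<String> set1, List<String> set2) {
        List<String[]> result = new ArrayList<>();
        for (String d1 : set1)
            for (String d2 : set2)
                result.add(new String[] { d1, d2 }); // -> one CAS, two views
        return result;
    }
}
```

For n and m documents this yields n*m pairs, so each document of set 1 is 
pre-processed m times and vice versa.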

With 2) at least one of the sets should already have been pre-processed, namely 
the one that is read by the annotator.

With 3) you can use the same (or different) pre-processing pipelines for both 
sets and have a writer at the end which writes either XMI (or something 
similar) or already writes an extract, e.g. feature vectors extracted from the 
documents. Then you load the XMIs or feature vectors in a separate little 
program and compare them to each other.
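The "separate little program" can be quite small. Here is a minimal plain-Java 
sketch, assuming both pipelines have already written out one feature vector 
per document (the class, method names, and cosine similarity measure are my 
own choices for illustration, not something prescribed by UIMA):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Strategy 3 sketch: compare every document of set 1 with every document
// of set 2 outside UIMA, working on previously extracted feature vectors.
public class PairwiseComparison {

    // Cosine similarity between two equal-length feature vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Compare each vector from set1 with each vector from set2 (n x m pairs),
    // keyed by the document IDs of the pair.
    static Map<String, Double> compareAll(Map<String, double[]> set1,
                                          Map<String, double[]> set2) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> d1 : set1.entrySet())
            for (Map.Entry<String, double[]> d2 : set2.entrySet())
                scores.put(d1.getKey() + " vs " + d2.getKey(),
                           cosine(d1.getValue(), d2.getValue()));
        return scores;
    }
}
```

The nice thing here is that the expensive pre-processing runs exactly once per 
document, and only the cheap vector comparison is done n*m times.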

> Is there some abstraction in UIMA to perform this? I saw a solution 
> performing an n-cross-m comparison using two views where each view 
> represented a different document. This works but seems a bit inflexible, 
> assuming you need to configure different processing pipelines for each 
> collection type.

I've seen people use 1) a lot. I personally prefer 3), sometimes 2). With UIMA 
alone, this is not very straightforward to realize; it would involve writing a 
custom FlowController.

What I did instead was create DKPro Lab [1]. This is a lightweight Java 
framework for building such multi-pipeline workflows. Within the framework, 
you'd have several "Tasks", e.g.:

- PreprocessingTask1 - pre-processes the first set of data using a certain UIMA 
pipeline and writes the results out as XMI or something else

- PreprocessingTask2 - pre-processes the second set of data using a certain 
UIMA pipeline and writes the results out as XMI or something else

- ComparisonTask - loads the data produced by the two previous tasks, compares 
them, and writes the results out. It's most likely not written as a UIMA 
component, but rather as a simple piece of code using the XmiCasDeserializer to 
load the XMIs (or something else), then extracting the information to compare 
from the CASes and comparing it.

I'm afraid there's not much documentation available for DKPro Lab yet, but 
there is an example project in the Git repository. The example also consists of 
three tasks: one extracts part-of-speech features from a document set, a 
second trains a classifier on these, and a third uses the classifier to tag 
a new text.

We have additional projects on Google Code that use DKPro Lab and may also 
serve as examples [2], in particular DKPro Spelling [3]. Another new project 
based on DKPro Lab is upcoming ;)

Note that DKPro Lab has special convenience support for UIMA, but is in general 
agnostic of it.

Cheers,

-- Richard

[1] http://code.google.com/p/dkpro-lab/
[2] 
http://code.ohloh.net/search?s=%22dkpro.lab%22&browser=Default&mp=1&ml=1&me=1&md=1&filterChecked=true
[3] http://code.google.com/p/dkpro-spelling-asl/
