I think a good solution to this would include facilities to

1) reliably determine "ordering" of the sub-part CASes at some point down-stream, and

2) reliably determine "end of sub-parts"

The use cases:

I might need to examine things that occur across a boundary between CASes;

I might be collecting statistics per pre-split CAS document, and want to output them once processing of that pre-split CAS has finished.

By "reliably" I mean a mechanism that still works when an error is thrown (perhaps on the last CAS of the segment, causing it to be "dropped").

One approach may be to record, in each sub-part CAS, information about the collection of CASes it belongs to, and to have some special flow control that handles errors so as to ensure that CASes (even the ones with errors) get routed to the final consumers.
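
A minimal sketch of the bookkeeping this approach implies, in plain Java with no UIMA dependencies. It assumes the splitter stamps each sub-part CAS with a document id, a 0-based index, and an is-last flag, and that the error-handling flow still delivers a failed segment (marked as failed) instead of dropping it. SegmentTracker, Segment, and all field names are hypothetical and are not part of UIMA.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: restores segment order and detects "end of sub-parts". */
public class SegmentTracker {

    /** Metadata assumed to be recorded in every sub-part CAS (illustrative names). */
    public static final class Segment {
        public final String docId;   // id of the pre-split document
        public final int index;      // 0-based position within that document
        public final boolean last;   // true only on the final segment
        public final boolean failed; // true if an error occurred on this segment

        public Segment(String docId, int index, boolean last, boolean failed) {
            this.docId = docId;
            this.index = index;
            this.last = last;
            this.failed = failed;
        }
    }

    /** Per-document state: segments that arrived early, and the next index to release. */
    private static final class DocState {
        final Map<Integer, Segment> pending = new HashMap<>();
        int nextIndex = 0;
    }

    private final Map<String, DocState> docs = new HashMap<>();
    private final List<String> completed = new ArrayList<>();

    /**
     * Accepts a segment in any arrival order and returns the segments that can
     * now be handed on in document order (possibly none, possibly several).
     */
    public List<Segment> accept(Segment s) {
        DocState state = docs.computeIfAbsent(s.docId, k -> new DocState());
        state.pending.put(s.index, s);

        List<Segment> ready = new ArrayList<>();
        Segment next;
        while ((next = state.pending.remove(state.nextIndex)) != null) {
            ready.add(next);
            state.nextIndex++;
            if (next.last) {
                // The last segment has been released in order: the document is done.
                docs.remove(s.docId);
                completed.add(s.docId);
                break;
            }
        }
        return ready;
    }

    /** Ids of documents whose segments have all been released, in completion order. */
    public List<String> completedDocuments() {
        return completed;
    }
}
```

A final consumer would call accept() for every arriving segment, process whatever it returns in order (which is where it can look across segment boundaries), and write out per-document statistics once the document id shows up in completedDocuments(). Because failed segments are delivered as placeholders, the end of a document is still detected when the error hits the last segment.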

By the way, something like this is already implemented for the CPM: it has a built-in facility for its pipelines to support splitting. The splitting is done in the Collection Reader, which generates multiple CASes right off the bat for a particular item. There is special CAS Consumer queue logic which can be activated (see the docs: http://incubator.apache.org/uima/downloads/releaseDocs/2.2.2-incubating/docs/html/references/references.html#ugr.ref.xml.cpe_descriptor.descriptor.operational_parameters ) by plugging in an Output Queue implementation named org.apache.uima.collection.impl.cpm.engine.SequencedQueue, to ensure that the chunks arrive in the proper order at the CAS Consumer stage.

However, I think that this may not address the real issues of Olivier's use case involving memory limits, etc., because the chunks are still CASes held in JVM memory.

-Marshall

Burn Lewis wrote:
Olivier,

No, you cannot use a CM directly in a CPE .... but you can wrap it in an
aggregate and use that in a CPE.  The CPE could consist of CollectionReader
+ Aggregate + CasConsumer, where the Aggregate has the splitting CM +
Annotators + merging CM.  (For a CPE the aggregate must have outputsNewCASes
= false.)  Or you could put all of these into an aggregate and run as a
single AE, but you wouldn't have the error handling provided by the CPE.
The almost-released new UIMA-AS provides error handling as well as scaleout
that could allow parallel processing of your document segments and so
improve throughput.

If the size of the merged CAS is of concern, you may be able to do some
consuming before the merge, since there is nothing special about
CasConsumers.  If there are no downstream analytics that need the full
document you could omit the 2nd CM and let the aggregate end with a
CasConsumer, discarding the segmented CASes, returning just the input CAS to
the CPE which would not need a CC.

Burn.
