Re: UIMA chunking

Marshall Schor Wed, 23 Jul 2008 06:32:58 -0700

I think a good solution to this would include facilities to

1) reliably determine "ordering" of the sub-part CASes at some point downstream, and


2) reliably determine "end of sub-parts"

The use cases:

I might need to examine things which occur across a boundary (between CASes);

I might be collecting statistics per pre-split CAS document, and want to output these when finished processing the pre-split CAS.

By "reliably" I mean some mechanism which works, even when an error is thrown (perhaps on the last CAS of the segment, causing it to be "dropped").

One approach may be to record in the sub-part CASes information about the collection of CASes, and to have some special flow control to handle errors to ensure CASes (even error ones) get routed to the final consumers.
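To make that approach concrete, here is a minimal sketch in plain Java. It does not use the UIMA API: the Segment class, its fields (docId, index, total, failed) and the SegmentCollector are all hypothetical names invented for illustration. The assumption is that the splitter stamps every sub-part with its position and the total count, and that the flow control still routes a failed sub-part downstream with a "failed" marker, so the consumer can both restore order and detect the end of the sub-parts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SegmentCollector {

    /** Hypothetical metadata a splitter could record on each sub-part. */
    public static final class Segment {
        public final String docId;   // identifies the pre-split document
        public final int index;      // 0-based position within the document
        public final int total;      // number of sub-parts the document was split into
        public final boolean failed; // true if processing this sub-part threw an error
        public final String payload; // stand-in for the CAS contents (may be null)

        public Segment(String docId, int index, int total, boolean failed, String payload) {
            this.docId = docId;
            this.index = index;
            this.total = total;
            this.failed = failed;
            this.payload = payload;
        }
    }

    // Sub-parts received so far, keyed by document and sorted by index.
    private final Map<String, TreeMap<Integer, Segment>> pending = new HashMap<>();

    /**
     * Accepts one sub-part in any arrival order. Returns all sub-parts of the
     * document in index order once the last one has arrived, otherwise null.
     * Failed sub-parts count toward completion, so an error on the final
     * sub-part does not leave the document open forever.
     */
    public List<Segment> accept(Segment s) {
        TreeMap<Integer, Segment> parts =
                pending.computeIfAbsent(s.docId, k -> new TreeMap<>());
        parts.put(s.index, s);
        if (parts.size() < s.total) {
            return null; // still waiting for more sub-parts
        }
        pending.remove(s.docId);
        return new ArrayList<>(parts.values());
    }
}
```

The point of the sketch is only that completion is decided from the recorded count rather than from seeing a "last" CAS, which is why it still works when the last sub-part is the one that failed.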

By the way, something like this is already implemented for the CPM - it has some built-in facility for its pipelines to support splitting. The splitting is done in the Collection Reader, and multiple CASes are generated right off the bat for a particular item. There is a special CAS consumer queue logic which can be activated (see the docs: http://incubator.apache.org/uima/downloads/releaseDocs/2.2.2-incubating/docs/html/references/references.html#ugr.ref.xml.cpe_descriptor.descriptor.operational_parameters) by plugging in an Output Queue implementation named org.apache.uima.collection.impl.cpm.engine.SequencedQueue, to ensure that the chunks arrive in the proper order at the CAS Consumer stage.

However, I think that this may not address the real issues of Olivier's use case involving memory limits, etc., because the chunks are CASes in the JVM memory.


-Marshall

Burn Lewis wrote:

Olivier,

No, you cannot use a CM directly in a CPE .... but you can wrap it in an
aggregate and use that in a CPE.  The CPE could consist of CollectionReader
+ Aggregate + CasConsumer, where the Aggregate has the splitting CM +
Annotators + merging CM.  (For a CPE the aggregate must have outputsNewCASes
= false.)  Or you could put all of these into an aggregate and run as a
single AE, but you wouldn't have the error handling provided by the CPE.
The almost-released new UIMA-AS provides error handling as well as scaleout
that could allow parallel processing of your document segments and so
improve throughput.
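
As a rough illustration of the splitting CM + Annotators + merging CM shape described above, here is a sketch in plain Java. It is not UIMA code: split, annotate and process are hypothetical stand-ins, the fixed-size split and the "find every 'a'" annotator are assumptions chosen only to keep the example small. It shows the one thing the merging step has to get right, which is shifting per-segment offsets back into whole-document coordinates.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitMergeSketch {

    /** Stand-in for the splitting CM: cut the text into fixed-size segments. */
    public static List<String> split(String doc, int size) {
        List<String> segments = new ArrayList<>();
        for (int i = 0; i < doc.length(); i += size) {
            segments.add(doc.substring(i, Math.min(doc.length(), i + size)));
        }
        return segments;
    }

    /** Stand-in for an annotator: segment-relative offsets of every 'a'. */
    public static List<Integer> annotate(String segment) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < segment.length(); i++) {
            if (segment.charAt(i) == 'a') {
                offsets.add(i);
            }
        }
        return offsets;
    }

    /** Stand-in for the merging CM: shift offsets back to document coordinates. */
    public static List<Integer> process(String doc, int size) {
        List<Integer> merged = new ArrayList<>();
        int base = 0; // document offset at which the current segment starts
        for (String segment : split(doc, size)) {
            for (int offset : annotate(segment)) {
                merged.add(base + offset);
            }
            base += segment.length();
        }
        return merged;
    }
}
```

Because each segment is annotated independently of the others, this is also the step that could be run in parallel when scaleout is available.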

If the size of the merged CAS is of concern, you may be able to do some
consuming before the merge, since there is nothing special about
CasConsumers.  If there are no downstream analytics that need the full
document you could omit the 2nd CM and let the aggregate end with a
CasConsumer, discarding the segmented CASes, returning just the input CAS to
the CPE which would not need a CC.

Burn.
