Thanks so much! That does help - I'm still fiddling with making sure my
various Sofas are getting through alright, but this gets me in the
right direction.
-Matt
Marshall Schor wrote:
Matthew Campbell wrote:
Hey folks:
I'm looking at a process that runs each document through a bunch
of annotators to tag up various information, then I need to do some
processing/manipulation of those documents based on the information held
in the whole collection. I've been reading up on the CPE, but it
looks like it's primarily for running a collection of documents
through an AE. I was hoping someone could point me in the right
direction for doing the collection-wide processing portion of my
process.
I had started out by defining the process as one large aggregate
AE and running each document through it, but I don't see a way to go
through that initial tagging process for all documents and then move
on to the next phase.
I then switched gears and tried splitting up each phase into its
own AE, but then I lose the complex Sofa mappings I had put together
for the previous attempt. So I guess this could be solved in two
ways - one would be that the CPE has some sort of built-in method for
doing collection-wide processing and manipulation (i.e., "first
identify all location names in all documents, then replace each with
a new name, but make sure the new name doesn't appear in any other
document"). The other would be to somehow run through the first
phase to identify everything, do processing using the resulting
collection of JCas objects, then pump each JCas into a second AE for doing
post-processing stuff. Somewhere in there would have to be some
dynamically-mapped Sofas from the phase 1 AE to the phase 2 AE.
I hope that described my goal well enough, and thanks ahead of
time for any pointers you guys can throw my way.
The way many do things like this is to have a singleton Annotator at
the end of the pipeline, which sees all of the CASes being processed
after they've been "tagged" by earlier annotators. This annotator
would have some persistent Java object(s) that accumulated information
across the entire document collection, and would have a
collection-processing-complete method which it would register with the
CPM so it could be called at the end of processing the collection.
This method would then use the accumulated information to do whatever
processing you wanted to do at that point.
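A rough sketch of that pattern, written in plain Java with no UIMA dependency so it can be run standalone. The class name LocationAccumulator, its method signatures, and the "LOC_n" naming scheme are all made up for illustration. In a real pipeline the same logic would presumably sit in the last annotator's process(JCas) and collectionProcessComplete() methods, with the tokens and locations read from the CAS rather than passed in as lists.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a last-in-pipeline annotator that keeps
// state across documents. Not part of the UIMA API.
class LocationAccumulator {
    // Persistent state that survives across process() calls.
    private final Set<String> seenLocations = new LinkedHashSet<>();
    private final Set<String> allTokens = new HashSet<>();

    // Called once per document, after the earlier annotators have tagged it.
    // Only accumulates; no document is modified here.
    void process(List<String> tokens, List<String> locations) {
        allTokens.addAll(tokens);
        seenLocations.addAll(locations);
    }

    // Called once after the whole collection has been seen. Picks a
    // replacement for each location that occurs in no document and is
    // not reused for another location.
    Map<String, String> collectionProcessComplete() {
        Map<String, String> replacements = new LinkedHashMap<>();
        int counter = 0;
        for (String location : seenLocations) {
            String candidate;
            do {
                candidate = "LOC_" + counter++;
            } while (allTokens.contains(candidate));
            replacements.put(location, candidate);
        }
        return replacements;
    }
}
```

The map returned at the end is what a second pass over the documents would need in order to do the actual replacement; how that second pass is driven is left open here.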
Would that work?
-Marshall