Re: Multi-Document Processing

Matthew Campbell Wed, 22 Aug 2007 06:43:57 -0700

Thanks so much! That does help - I'm still fiddling with making sure my various Sofas are getting through alright, but this gets me in the right direction.


-Matt

Marshall Schor wrote:

Matthew Campbell wrote:
Hey folks:
I'm looking at a process that runs each document through a bunch of annotators to tag up various information, then I need to do some processing/manipulation of those documents based on the information held in the whole collection. I've been reading up on the CPE, but it looks like it's primarily for running a collection of documents through an AE. I was hoping someone could point me in the right direction for doing the collection-wide processing portion of my process.
I had started out by defining the process as one large aggregate AE and running each document through it, but I don't see a way to go through that initial tagging process for all documents and then move on to the next phase.
I then switched gears and tried splitting up each phase into its own AE, but then I lose the complex Sofa mappings I had put together for the previous attempt. So I guess this could be solved in two ways - one would be that the CPE has some sort of built-in method for doing collection-wide processing and manipulation (i.e., "first identify all location names in all documents, then replace each with a new name, but make sure the new name doesn't appear in any other document"). The other would be to somehow run through the first phase to identify everything, do processing using the resulting collection of JCas objects, then pump each JCas into a second AE for doing post-processing stuff. Somewhere in there would have to be some dynamically-mapped Sofas from the phase 1 AE to the phase 2 AE.
I hope that described my goal well enough, and thanks ahead of time for any pointers you guys can throw my way.
The way many do things like this is to have a singleton Annotator at the end of the pipeline, which sees all of the CASes being processed after they've been "tagged" by earlier annotators. This annotator would have some persistent Java object(s) that accumulated information across the entire document collection, and would have a collection-processing-complete method which it would register with the CPM so it could be called at the end of processing the collection. This method would then use the accumulated information to do whatever processing you wanted to do at that point.
Would that work?
-Marshall
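
For anyone reading this thread later, here is a rough sketch of the accumulate-then-finish pattern described above, applied to the location-renaming example from the original question. It is plain Java with no UIMA dependency, so none of it is the UIMA API: TaggedDoc, CollectionAccumulator, and CollectionDemo are hypothetical names made up for illustration, and the "Place<n>" naming scheme is an assumption. In a real pipeline the two methods would presumably correspond to the annotator's per-CAS process call and its end-of-collection callback.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for one tagged document (a CAS/JCas in a real pipeline).
final class TaggedDoc {
    final String id;
    final String text;
    final List<String> locations; // names found by the earlier tagging annotators

    TaggedDoc(String id, String text, List<String> locations) {
        this.id = id;
        this.text = text;
        this.locations = locations;
    }
}

// Hypothetical stand-in for the single annotator instance at the end of the pipeline.
final class CollectionAccumulator {
    // Persistent state that survives across documents.
    private final List<TaggedDoc> seen = new ArrayList<>();
    private final Set<String> allLocations = new LinkedHashSet<>();

    // Called once per document, after the earlier annotators have run.
    void process(TaggedDoc doc) {
        seen.add(doc);
        allLocations.addAll(doc.locations);
    }

    // Called once after the last document; returns the rewritten text per document id.
    Map<String, String> collectionProcessComplete() {
        // Choose a replacement for each location that occurs in no original document text.
        Map<String, String> replacement = new LinkedHashMap<>();
        int counter = 0;
        for (String location : allLocations) {
            String candidate;
            do {
                candidate = "Place" + (++counter);
            } while (appearsAnywhere(candidate));
            replacement.put(location, candidate);
        }
        // Apply the replacements document by document (naive substring replace).
        Map<String, String> rewritten = new LinkedHashMap<>();
        for (TaggedDoc doc : seen) {
            String text = doc.text;
            for (String location : doc.locations) {
                text = text.replace(location, replacement.get(location));
            }
            rewritten.put(doc.id, text);
        }
        return rewritten;
    }

    private boolean appearsAnywhere(String candidate) {
        for (TaggedDoc doc : seen) {
            if (doc.text.contains(candidate)) {
                return true;
            }
        }
        return false;
    }
}

final class CollectionDemo {
    public static void main(String[] args) {
        CollectionAccumulator acc = new CollectionAccumulator();
        acc.process(new TaggedDoc("d1", "Flights from Paris to Place1.",
                Arrays.asList("Paris")));
        acc.process(new TaggedDoc("d2", "Paris and Oslo are capitals.",
                Arrays.asList("Paris", "Oslo")));
        // Prints:
        // d1: Flights from Place2 to Place1.
        // d2: Place2 and Place3 are capitals.
        acc.collectionProcessComplete()
           .forEach((id, text) -> System.out.println(id + ": " + text));
    }
}
```

Note that "Place1" is skipped as a replacement because it already occurs in d1, which is the "new name doesn't appear in any other document" check. The sketch keeps every document in memory until the end, which may not scale to a large collection; storing only the accumulated names and doing the rewrite in a second pass would be one alternative. It also does not touch the Sofa-mapping question at all.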
