On 14.11.2013, at 18:19, Nicolas Hernandez <[email protected]> wrote:

> Dear All
> 
> Let's say I want to count the occurrences of each word in a document
> collection and to use these counters (possibly in the same workflow).
> I am in the situation where I have a CAS per document and I want to
> scale out the workflow.

How do you scale it out?

> To scale out the workflow I use a resource to store the counters of
> each word. The resource is written to by several instances of an
> annotator that process distinct CASes in parallel.

What kind of resource do you use?

> Here are my questions :
> * I believe I cannot be sure that, when a later annotator in the
> same workflow uses the resource, the resource will not still be
> modified afterwards (by counter annotators that are still
> processing the remaining CASes). Right? In other words, I have no way
> to run (or delay the run of) an annotator depending on the state of a
> resource?

You can customize the flow by writing your own flow controller.
Whether that is supported depends on how you do your scaling.

> * So, I may use two workflows: one to build the resource, the other one
> to use it. But how can I export/save the resource? I cannot access
> the resource in the collectionProcessComplete method of an AE, can I?

I would personally use the two workflows. Why do you believe that you cannot
access the resource in collectionProcessComplete?

> The solution I have in mind was inspired by the use of a CAS multiplier
> to merge CASes. It is to use two workflows, with one of them dedicated to
> building the resource. In this workflow, I define an annotator (without
> scaling out, so a CAS consumer). In that annotator, I check the
> SourceDocumentInformation feature structure in the CAS to see if its
> lastSegment feature is set to true; in that case I can export the
> resource. I know this is not a guarantee that all CASes have been
> processed. I may also have a special counter resource in that
> annotator to count the processed CASes and eventually export the desired
> resource once all CASes have been processed. In that case, I would
> need a way to tell the "exporter" annotator the number of
> CASes that will be processed... This is not the main problem.
> 
> After writing that, I realize that to do it in a single workflow, I
> could have written a CAS multiplier to save each CAS until all have
> been processed, then create again as many CAS as the ones saved...
> 
> These solutions are very complex...
> 
> Any suggestions... ? A uimaFIT trick =) ?

Well, to do small-scale scaling using a CPE, I'd do this:

- build an aggregate which generates the word counts
- use a custom shared resource to do the counting
- in collectionProcessComplete, call a synchronized "save" method on the
resource
- if "save" is called a second time, it does nothing

- build an aggregate which uses the word counts

Run both workflows, one after the other using the CpePipeline of uimaFIT.
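
The counting resource with its "save once" behaviour could be sketched
roughly as below. This is only a minimal sketch in plain Java with no UIMA
dependencies: the names WordCountResource, increment, count and save are
made up for illustration, and the tab-separated output format is an
assumption. In a real pipeline the class would additionally have to be
registered as a shared resource (for example by implementing
SharedResourceObject), and each annotator instance would call save from its
own collectionProcessComplete, which is why the second and later calls must
be no-ops.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical shared counter; not part of UIMA or uimaFIT.
class WordCountResource {
    // Thread-safe map so parallel annotator instances can count concurrently.
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();
    private boolean saved = false;

    // Called by each annotator instance for every token it sees.
    void increment(String word) {
        counts.computeIfAbsent(word, w -> new LongAdder()).increment();
    }

    long count(String word) {
        LongAdder adder = counts.get(word);
        return adder == null ? 0 : adder.sum();
    }

    // Writes "word<TAB>count" lines in alphabetical order, but only on the
    // first call. Returns true if this call did the writing.
    synchronized boolean save(Writer out) {
        if (saved) {
            return false; // a previous caller already saved the counts
        }
        try {
            for (Map.Entry<String, LongAdder> e : new TreeMap<>(counts).entrySet()) {
                out.write(e.getKey() + "\t" + e.getValue().sum() + "\n");
            }
            out.flush();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        saved = true;
        return true;
    }
}

class WordCountDemo {
    public static void main(String[] args) throws InterruptedException {
        WordCountResource resource = new WordCountResource();
        // Two threads stand in for two parallel annotator instances.
        Runnable annotator = () -> {
            for (String w : "to be or not to be".split(" ")) {
                resource.increment(w);
            }
        };
        Thread t1 = new Thread(annotator);
        Thread t2 = new Thread(annotator);
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        StringWriter out = new StringWriter();
        System.out.println(resource.save(out)); // first call writes the counts
        System.out.println(resource.save(out)); // later calls do nothing
        System.out.print(out);
    }
}
```

The second workflow would then read the saved file back when it
initializes its own resource.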

-- Richard
