On 14.11.2013, at 18:19, Nicolas Hernandez <[email protected]> wrote:
> Dear All
>
> Let's say I want to count the occurrences of each word in a document
> collection and to use these counters (possibly in the same workflow).
> I am in the situation where I have a CAS per document and I want to
> scale out the workflow.

How do you scale it out?

> To scale out the workflow I use a resource to store the counters of
> each word. The resource is accessed in writing mode by several
> instances of an annotator which process distinct CASes in parallel.

What kind of resource do you use?

> Here are my questions:
> * I believe I cannot be sure that when a successive annotator in the
> same workflow uses the resource, the resource will not still be
> modified after that (by running counter annotators which are still
> processing remaining CASes). Right? In other words, I do not have a way
> to run (to delay the run of) an annotator depending on the state of a
> resource?

You can customize the flow by writing your own workflow controller.
Whether that is supported depends on how you do your scaling.

> * So, I may use two workflows: one to build the resource, the other one
> to use it. But how can I export/save the resource? I cannot access
> the resource in the collectionProcessComplete method of an AE, can I?

I would personally use the two workflows. Why do you believe that you
cannot access the resource in collectionProcessComplete?

> The solution I imagined was inspired by the use of the CAS multiplier
> to merge CASes. It is to use two workflows with one of them dedicated to
> building the resource. In this workflow, I define an annotator (without
> scaling out, so a CAS consumer). In that annotator, I check the
> SourceDocumentInformation Feature Structure in the CAS to see if its
> lastSegment feature is set to true; in that case I can export the
> resource. I know this is not a guarantee that all CASes have been
> processed. I may also have a special counter resource in that
> annotator to count the processed CASes and eventually export the desired
> resource when all CASes have been processed. In that case, I would
> need a way to communicate to the "exporter" annotator the number of
> CASes which will be processed... This is not the main problem.
>
> After writing that, I realize that to do it in a single workflow, I
> could have written a CAS multiplier to save each CAS until all have
> been processed, then create again as many CASes as the ones saved...
>
> These solutions are very complex...
>
> Any suggestions... ? A uimaFIT trick =) ?

Well, to do small-scale scaling using a CPE, I'd do this:

- build an aggregate which generates the word counts
- use a custom shared resource to do the counting
- in collectionProcessComplete, call some synchronized "save" method
  on the resource
- if "save" is called a second time, it does nothing
- build an aggregate which uses the word counts

Run both workflows, one after the other, using the CpePipeline of uimaFIT.

-- Richard
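
A minimal sketch of the counting resource with the idempotent, synchronized
"save" described above. It is plain Java with no UIMA dependency so that it
runs on its own. The class name WordCountResource and the methods increment,
count and save are hypothetical names chosen for illustration. In a real
pipeline the class would presumably also implement UIMA's
SharedResourceObject and be bound to the annotators as an external resource;
that wiring is not shown here. The sketch also assumes, as the suggestion
does, that collectionProcessComplete is only reached once all CASes have
been counted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical shared word counter (illustrative name, not a UIMA class).
final class WordCountResource {
    // Concurrent map so that parallel annotator instances can count safely.
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Guarded by the monitor of "this"; set once the counts have been written.
    private boolean saved = false;

    // Intended to be called from process() of each counting annotator instance.
    void increment(String word) {
        counts.computeIfAbsent(word, w -> new LongAdder()).increment();
    }

    // Current count for a word, or 0 if the word was never seen.
    long count(String word) {
        LongAdder adder = counts.get(word);
        return adder == null ? 0L : adder.sum();
    }

    // Intended to be called from collectionProcessComplete() of every instance.
    // Only the first call writes the file; later calls return false and do nothing.
    synchronized boolean save(Path target) {
        if (saved) {
            return false;
        }
        // Sort by word so the output file is deterministic.
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, LongAdder> e : new TreeMap<>(counts).entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue().sum());
        }
        try {
            Files.write(target, lines, StandardCharsets.UTF_8);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        saved = true;
        return true;
    }
}
```

Each annotator instance would call increment for every token it sees and
call save once in collectionProcessComplete. Whichever instance gets there
first writes the file, and the others see false. The second workflow can
then read the tab-separated file back in.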
