On Tue, 30 Nov 2010 13:23:27 -0800, Eddie Epstein <[email protected]>
wrote:
I agree with Jerry that there is no code in UIMA packages explicitly
for this. I'd suggest looking at
examples/src/org/apache/uima/examples/casMultiplier/SimpleTextSegmenter.java
for an example CasMultiplier that can easily be adapted. Another
suggestion is to assemble and test the aggregate before deploying it
as a service. Much easier to debug.
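For anyone who hasn't looked at that example: the core idea of a text-segmenting CasMultiplier is to carve one large input document into many smaller segments, each emitted as its own output CAS so downstream annotators can be scaled out over segments. A minimal sketch of just that segmentation loop, as plain self-contained Java (in a real CasMultiplier this logic would sit inside process()/hasNext()/next() of a class extending JCasMultiplier_ImplBase; the class and method names below are illustrative, not copied from the actual example):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split a large document into segments of roughly
// 'segmentSize' characters, preferring to break at a delimiter so segments
// end on natural boundaries. Each returned string would become the
// document text of one output CAS in a CasMultiplier.
public class SegmenterSketch {

    static List<String> segment(String text, int segmentSize, String delimiter) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + segmentSize, text.length());
            if (end < text.length()) {
                // Back up to the last delimiter inside the window, if any,
                // so we don't cut a sentence in half.
                int cut = text.lastIndexOf(delimiter, end);
                if (cut > start) {
                    end = cut + delimiter.length();
                }
            }
            segments.add(text.substring(start, end));
            start = end;
        }
        return segments;
    }

    public static void main(String[] args) {
        String doc = "First sentence. Second sentence. Third sentence.";
        for (String s : segment(doc, 20, ". ")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Running this splits the sample document into three segments, each ending at a sentence boundary. In the UIMA version, process() would stash the input text and reset the offset, hasNext() would report whether any text remains, and next() would call getEmptyJCas(), set the segment as its document text, and return it.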
OK, so, bottom line: if we want to process terabytes of text contained in
millions of files, and we want to do it on a cluster of hundreds of
machines, and we want that cluster to scale linearly and indefinitely
without bottlenecks, and we want to use UIMA-AS to do it, then we've got
a lot of work ahead of us? There are no existing example configurations
or code showing how to do this?
If we did do that work, are you confident that AS doesn't have any
inherent bottlenecks that would prevent scaling to that level? Was it
designed to do that kind of thing? The multiple Collection Reader idea
wouldn't really be able to do that, would it?
What if there's no obvious way to partition the file set? Say, for
example, we're crawling a web site, like amazon.com?
What if the file set is not known (and so can't be partitioned), such as
if we have an on-demand service that is receiving a steady series of
random job submissions from different clients, each wanting to process
different doc sets from different repositories? How could AS be
configured to ensure efficient use of the hardware (load balanced, all CPU
cores at 100%), and fairness to the competing clients?
The AS architecture has always been a bit fuzzy to me. Any insights on
how to achieve extreme scalability with AS would be appreciated.
Greg Holmberg