On Tue, 30 Nov 2010 13:23:27 -0800, Eddie Epstein <[email protected]> wrote:

I agree with Jerry that there is no code in UIMA packages explicitly
for this. I'd suggest looking at
examples/src/org/apache/uima/examples/casMultiplier/SimpleTextSegmenter.java
for an example CasMultiplier that can easily be adapted. Another
suggestion is to assemble and test the aggregate before deploying it
as a service. Much easier to debug.
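(For anyone following along: the pattern SimpleTextSegmenter implements is a small state machine — process() stows the incoming document, then hasNext()/next() hand back one fixed-size segment at a time. Below is a plain-Java sketch of just that state machine so it runs standalone; the real class extends org.apache.uima.analysis_component.JCasMultiplier_ImplBase and returns fresh CASes from next() rather than Strings. The class name and segment size here are made up for illustration.)

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the segmentation logic inside a CAS multiplier
// such as SimpleTextSegmenter. A real UIMA CAS multiplier would extend
// JCasMultiplier_ImplBase, take a JCas in process(), and emit new JCases
// (via getEmptyJCas() + setDocumentText()) from next().
public class SegmenterSketch {
  private final int segmentSize; // assumed chunk size, not from UIMA
  private String text;
  private int pos;

  public SegmenterSketch(int segmentSize) {
    this.segmentSize = segmentSize;
  }

  // Analogous to process(JCas): remember the document, reset the cursor.
  public void process(String documentText) {
    text = documentText;
    pos = 0;
  }

  // Analogous to hasNext(): more segments remain?
  public boolean hasNext() {
    return text != null && pos < text.length();
  }

  // Analogous to next(): hand out the next fixed-size segment.
  public String next() {
    int end = Math.min(pos + segmentSize, text.length());
    String segment = text.substring(pos, end);
    pos = end;
    return segment;
  }

  public static void main(String[] args) {
    SegmenterSketch s = new SegmenterSketch(4);
    s.process("abcdefghij");
    List<String> segs = new ArrayList<>();
    while (s.hasNext()) {
      segs.add(s.next());
    }
    System.out.println(segs); // prints [abcd, efgh, ij]
  }
}
```

In the aggregate, each segment the multiplier emits flows downstream as its own CAS, which is what lets UIMA-AS fan the work out across service instances.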


OK, so, bottom line: if we want to process terabytes of text contained in millions of files, do it on a cluster of hundreds of machines, have that cluster scale linearly and indefinitely without bottlenecks, and use UIMA-AS to do it, then we've got a lot of work ahead of us? There are no existing example configurations or code that show how to do this?

If we did do that work, are you confident that AS doesn't have any inherent bottlenecks that would prevent scaling to that level? Was it designed for that kind of thing? The multiple-Collection-Reader idea wouldn't really get us there, would it?

What if there's no obvious way to partition the file set? Say, for example, we're crawling a web site, like amazon.com?

What if the file set is not known in advance (and so can't be partitioned), such as when we have an on-demand service receiving a steady stream of random job submissions from different clients, each wanting to process different doc sets from different repositories? How could AS be configured to ensure efficient use of the hardware (load balanced, all CPU cores at 100%)? And fairness to the competing clients?

The AS architecture has always been a bit fuzzy to me. Any insights on how to achieve extreme scalability with AS would be appreciated.

Greg Holmberg
