On Tue, 30 Nov 2010 13:23:27 -0800, Eddie Epstein <[email protected]>
wrote:
I agree with Jerry that there is no code in UIMA packages explicitly
for this. I'd suggest looking at
examples/src/org/apache/uima/examples/casMultiplier/SimpleTextSegmenter.java
for an example CasMultiplier that can easily be adapted. Another
suggestion is to assemble and test the aggregate before deploying it
as a service. Much easier to debug.
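For anyone who hasn't looked at that example: the core idea of a text-segmenting CasMultiplier is to carve one large input document into many smaller segments, each emitted as its own output CAS so downstream annotators can be scaled out over segments. A minimal sketch of just that segmentation loop, as plain self-contained Java (in a real CasMultiplier this logic would sit inside process()/hasNext()/next() of a class extending JCasMultiplier_ImplBase; the class and method names below are illustrative, not copied from the actual example):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split a large document into segments of roughly
// 'segmentSize' characters, preferring to break at a delimiter so segments
// end on natural boundaries. Each returned string would become the
// document text of one output CAS in a CasMultiplier.
public class SegmenterSketch {

    static List<String> segment(String text, int segmentSize, String delimiter) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + segmentSize, text.length());
            if (end < text.length()) {
                // Back up to the last delimiter inside the window, if any,
                // so we don't cut a sentence in half.
                int cut = text.lastIndexOf(delimiter, end);
                if (cut > start) {
                    end = cut + delimiter.length();
                }
            }
            segments.add(text.substring(start, end));
            start = end;
        }
        return segments;
    }

    public static void main(String[] args) {
        String doc = "First sentence. Second sentence. Third sentence.";
        for (String s : segment(doc, 20, ". ")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Running this splits the sample document into three segments, each ending at a sentence boundary. In the UIMA version, process() would stash the input text and reset the offset, hasNext() would report whether any text remains, and next() would call getEmptyJCas(), set the segment as its document text, and return it.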
OK, so, bottom line: if we want to process terabytes of text contained in
millions of files, and we want to do it on a cluster of hundreds of
machines, and we want that cluster to scale linearly and indefinitely
without bottlenecks, and we want to use UIMA-AS to do it, then we've got
a lot of work ahead of us? There are no existing example configurations
or code showing how to do this?
If we did do that work, are you confident that AS doesn't have any
inherent bottlenecks that would prevent scaling to that level? Was it
designed to do that kind of thing? The multiple Collection Reader idea
wouldn't really be able to do that, would it?
What if there's no obvious way to partition the file set? Say, for
example, we're crawling a web site, like amazon.com?
What if the file set is not known (and so can't be partitioned), such as
if we have an on-demand service that is receiving a steady series of
random job submissions from different clients, each wanting to process
different doc sets from different repositories? How could AS be
configured to ensure efficient use of the hardware (load balanced, all CPU
cores at 100%), and fairness to the competing clients?
The AS architecture has always been a bit fuzzy to me. Any insights on
how to achieve extreme scalability with AS would be appreciated.
Greg Holmberg