On 26/09/11 22:31, Greg Holmberg wrote:
> Arun--
> 
> 
> I don't know what the cause of your specific technical issue is, but in my
> opinion, there's a better way to slice the problem.
> 
> What you're doing is taking each step in your analysis engine and running it
> on one or more machines.  This creates two problems.
> 
> One, it's a lot of network overhead.  You're moving each document across the
> network many times.  You can easily spend more time just moving the data
> around than actually processing.  It also creates a low ceiling to
> scalability, since you chew up a lot of network bandwidth.
> 
> Two, in order to use your hardware efficiently, you have to get the right
> ratio of machines/CPUs for each step.  Some steps use more cycles than
> others.  For example, you might find that for a given configuration and set
> of documents the ratio of CPU usage for steps A, B, and C is 1:5:2.  Now you
> need to instantiate A, B, and C services to use cores in that ratio.  Then,
> suppose you want to add more machines--how should you allocate them to A, B,
> and C?  It will always be lumpy, with some cores not being used much.  But
> worse, with a different configuration (different dictionaries, for example),
> or with different documents (longer vs. shorter, for example), the ratios
> will change, and you will have to reconfigure your machines again.  It's
> never-ending, and it's never completely right.
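
The lumpiness is easy to see with a little arithmetic.  A minimal sketch (the
1:5:2 ratio comes from the example above; the 16- and 20-core cluster sizes
are hypothetical):

```java
public class LumpyAllocation {
    // Print the ideal (fractional) core share per step for a given cluster size.
    static void show(int cores, int[] ratio) {
        int total = 0;
        for (int r : ratio) total += r;
        for (int i = 0; i < ratio.length; i++) {
            double ideal = (double) cores * ratio[i] / total;
            System.out.println(cores + " cores, step " + (char) ('A' + i) + ": " + ideal);
        }
    }

    public static void main(String[] args) {
        int[] ratio = {1, 5, 2};  // the 1:5:2 CPU-usage ratio from the example
        show(16, ratio);          // divides evenly: 2.0 / 10.0 / 4.0
        show(20, ratio);          // add one 4-core box: 2.5 / 12.5 / 5.0
    }
}
```

With 16 cores the ratio happens to divide evenly, but adding one more 4-core
machine gives ideal shares of 2.5 / 12.5 / 5.0--no integer allocation of cores
matches the ratio, so something is always underused.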
> 
> So it would be much easier to manage, more efficient, and more scalable if
> you just run your analysis engine self-contained in a single process, and
> then replicate the engine over your machines/CPUs.  You slice by document,
> not by service--send each document to a different analysis engine instance.
> This makes your life easier, keeps the CPUs at 100%, and scales
> indefinitely.  Just add more machines and it goes faster.
> 
> This is what I'm doing.  I use JavaSpaces (producer/consumer queue), but I'm
> sure you can get the same effect with UIMA AS and ActiveMQ.

Or Hadoop.
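
Whatever the transport, the slice-by-document approach can be sketched with a
plain producer/consumer queue.  A minimal in-process sketch using
java.util.concurrent rather than JavaSpaces, UIMA AS, or Hadoop (the
analyze() stand-in and the document strings are hypothetical):

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DocumentPipeline {
    private static final String POISON = "POISON"; // shutdown sentinel

    // Stand-in for a self-contained analysis engine: all steps (A, B, C)
    // run in-process on one document, so the document is moved only once.
    static String analyze(String doc) {
        return doc.toUpperCase();
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger done = new AtomicInteger();

        // Consumers: one engine replica per worker, each pulling whole
        // documents off the shared queue.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    for (String doc; !(doc = queue.take()).equals(POISON); ) {
                        analyze(doc);
                        done.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: slice by document--each document goes to whichever
        // replica is free, so CPUs stay busy regardless of step ratios.
        for (String d : List.of("doc1", "doc2", "doc3", "doc4")) queue.put(d);
        for (int i = 0; i < workers; i++) queue.put(POISON);

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("processed " + done.get() + " documents");
    }
}
```

Replacing the in-memory queue with a JavaSpace, a JMS queue on ActiveMQ, or a
Hadoop input split distributes the same pattern across machines.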

> 
> 
> Greg
