On 26/09/11 22:31, Greg Holmberg wrote:
> Arun--
>
> I don't know what the cause of your specific technical issue is, but in my
> opinion, there's a better way to slice the problem.
>
> What you're doing is taking each step in your analysis engine and running
> it on one or more machines. That creates two problems.
>
> One, it's a lot of network overhead. You're moving each document across the
> network many times. You can easily spend more time just moving the data
> around than actually processing it. It also puts a low ceiling on
> scalability, since you chew up a lot of network bandwidth.
>
> Two, in order to use your hardware efficiently, you have to get the right
> ratio of machines/CPUs for each step. Some steps use more cycles than
> others. For example, you might find that, for a given configuration and set
> of documents, the ratio of CPU usage for steps A, B, and C is 1:5:2. Now
> you need to instantiate A, B, and C services to use cores in that ratio.
> Then, suppose you want to add more machines--how should you allocate them
> to A, B, and C? It will always be lumpy, with some cores not being used
> much. But worse, with a different configuration (different dictionaries,
> for example), or with different documents (longer vs. shorter, for
> example), the ratios will change, and you will have to reconfigure your
> machines again. It's never-ending, and it's never completely right.
>
> So it would be much easier to manage, more efficient, and more scalable if
> you just run your analysis engine self-contained in a single process and
> then replicate that engine over your machines/CPUs. You slice by document,
> not by service--send each document to a different analysis engine instance.
> This makes your life easier, always runs the CPUs at 100%, and scales
> indefinitely: just add more machines and it goes faster.
>
> This is what I'm doing. I use JavaSpaces (a producer/consumer queue), but
> I'm sure you can get the same effect with UIMA AS and ActiveMQ.
Or Hadoop.

> Greg
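
For what it's worth, a minimal sketch of that slice-by-document layout in
plain UIMA Java could look like the code below. The descriptor name
(MyAggregateAE.xml) is just a placeholder, and the local BlockingQueue only
stands in for whatever distribution mechanism you pick (JavaSpaces, a
UIMA AS/ActiveMQ queue, Hadoop input splits); in a real deployment the
workers would be separate JVMs spread across machines rather than threads
in one process.

// Sketch only: one self-contained aggregate AE per worker, whole documents
// fanned out over a queue. "MyAggregateAE.xml" is a placeholder descriptor;
// the in-memory BlockingQueue stands in for a distributed queue.
import java.io.File;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class ReplicatedPipeline {

    private static final String POISON = "";   // empty text = shut down
    private static final int WORKERS =
            Runtime.getRuntime().availableProcessors();

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> docs = new LinkedBlockingQueue<>();

        // One worker per core; each worker owns a complete copy of the
        // aggregate engine, so no step-to-step traffic ever hits the network.
        Thread[] workers = new Thread[WORKERS];
        for (int i = 0; i < WORKERS; i++) {
            workers[i] = new Thread(() -> runWorker(docs));
            workers[i].start();
        }

        // Producer side: push whole documents, not intermediate results.
        docs.put("First document text ...");
        docs.put("Second document text ...");

        for (int i = 0; i < WORKERS; i++) docs.put(POISON);
        for (Thread w : workers) w.join();
    }

    private static void runWorker(BlockingQueue<String> docs) {
        try {
            // Each worker builds its own engine instance from the descriptor.
            ResourceSpecifier spec = UIMAFramework.getXMLParser()
                    .parseResourceSpecifier(
                            new XMLInputSource(new File("MyAggregateAE.xml")));
            AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec);
            CAS cas = ae.newCAS();

            while (true) {
                String text = docs.take();
                if (text.equals(POISON)) break;
                cas.setDocumentText(text);
                ae.process(cas);     // every step runs in-process
                // ... read results out of the CAS here ...
                cas.reset();         // reuse the CAS for the next document
            }
            ae.destroy();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The point is simply that each worker holds the whole aggregate, so a document
crosses the wire once, runs through every step in-process, and only the final
output (if anything) goes back.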
