Steve--
In my mind, there are a couple of potential pitfalls with text analytics in a cluster in general (UIMA or otherwise) that one should think very carefully about. I speak from some experience here.

1. Use of network bandwidth. If you're really scaling up (hundreds of cores), then at some point you will saturate the network. The best you can theoretically do is to have each document on the network only once (from its source container to the CPU where it will be processed), and the results on the network only once (from the CPU where they were processed to the destination container). The effective TCP/IP throughput of a gigabit network card is about 57 MB/sec (try measuring yours some time with something like Sandra from SiSoftware--the Lite version is free). Depending on your content format (Word, PDF, HTML, plain text), you might find your network saturated sooner than you expected; there are some rough numbers in the sketch at the end of this note. If your architecture puts the data on the network more than once, it will happen much sooner. Also be very careful about which protocol you use--something like SOAP burns incredible amounts of both network bandwidth and CPU; RMI or Jini would be much better. If you do saturate the network, your choices are expensive 10-gigabit networking (fiber optics?) or redesigning the system. You will of course want a dedicated switch and isolation from other traffic (i.e., a gateway).

2. Multi-processor boxes may be easier to manage, but they don't scale as well as, say, a set of 2- or 4-core boxes with the same total number of cores. Why? Because the larger boxes have shared system components that become bottlenecks. Sixteen cores in a box sharing one disk system, one bus, one or two NICs, a RAM bank, and one front-side bus--and, for multi-core chips, some sharing of the L2 cache--will have problems. Much of the NLP software I work with uses large amounts of RAM, both for the documents and for data structures (name catalogs, taxonomies, etc.). So in a big box, when multiple CPUs get going on multiple documents (either threaded in a single process or in multiple processes), there are precious few memory pages being requested in common, because each thread is executing different code or accessing different data. Since the code and data are so large, the L2 cache hits drop to near nothing, and the CPU spends most of its time in wait states while the FSB goes out to RAM. Now the FSB is the bottleneck and the machine won't go above 50% CPU utilization (I've seen it). The more CPUs you put in a box, the worse it gets.

So I recommend a larger number of smaller boxes, to increase the ratio of CPUs to FSBs, RAM, buses, and NICs. They're cheaper anyway. Text analytics isn't like an Oracle cluster with a shared SGA and shared disk storage, which runs great on a single large machine. Text analytics is a set of independent processes with little or no local disk I/O. So put them on separate, cheap machines.

No need for those fancy blades either. They make it easier to build a shared file system (NAS or SAN plugging into the cage), but you should avoid an architecture that uses a shared file system anyway. Just use HTTP or FTP to transfer the documents to the machine via the NIC: send the worker a URL and have the worker pull the document over directly from the source (there's a little sketch of this below, too). Blades are fairly expensive, and shared file systems are hard to manage.

So if your architecture is right, then those Dell rack-mountables should be the fastest, the cheapest, and the most scalable choice. But that's a big if.

Hope this helps.
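Here's the back-of-envelope I had in mind for point 1. All the numbers are assumptions (57 MB/sec effective throughput, a 100 KB average document, and results roughly the same size as the input), so plug in your own measurements:

    // Rough whole-cluster throughput ceiling from network bandwidth alone.
    public class BandwidthBudget {
        public static void main(String[] args) {
            double effectiveMBperSec = 57.0;  // measured effective gigabit TCP/IP throughput
            double avgDocKB = 100.0;          // assumed average document size
            int copiesOnTheWire = 2;          // document in once, results out once (best case)

            double docsPerSec = (effectiveMBperSec * 1024.0) / (avgDocKB * copiesOnTheWire);
            System.out.printf("Whole-cluster ceiling: about %.0f docs/sec%n", docsPerSec);
            // ~292 docs/sec with these numbers, no matter how many cores you add.
            // Every extra copy of the data on the wire divides that ceiling again.
        }
    }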
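And a minimal sketch of the "send the worker a URL, let it pull the document itself" idea--plain java.net, nothing fancy. The URL is made up, and a real worker would feed the bytes into its CAS rather than just counting them:

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.URL;

    public class PullWorker {
        /** Fetch one document over HTTP; the document crosses the network once. */
        static byte[] fetch(String docUrl) throws Exception {
            InputStream in = new URL(docUrl).openStream();
            try {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return buf.toByteArray();
            } finally {
                in.close();
            }
        }

        public static void main(String[] args) throws Exception {
            // The queue hands the worker a URL, not the document itself.
            byte[] doc = fetch("http://content-server/docs/12345.txt"); // hypothetical URL
            System.out.println("Pulled " + doc.length + " bytes");
        }
    }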
Greg Holmberg

-------------- Original message ----------------------
From: Steve Suppe <[EMAIL PROTECTED]>

> Hi All,
>
> I had a few general questions for those of you in a cluster
> environment. First of all, has anyone actually tried using UIMA on
> Solaris? How about x86 Sun hardware running Linux? How about instead of
> many nodes in a cluster, just having a few more powerful machines that
> support much larger multithreading environments?
>
> We're in a general Linux cluster (mostly 1U Dell rack-mountables with a few
> more powerful machines for our DB, etc.) environment right now, and are
> doing well with it. However, we're speccing out new hardware and thought we
> would put all options on the table. Just curious how everyone is building
> their environments.
>
> Thanks!
>
> Steve Suppe
>
> Steve
