Steve--

In my mind, there are a couple of potential pitfalls with text analytics in any 
cluster (UIMA or otherwise) that one should think very carefully about.  I 
speak from some experience here.


1. Use of network bandwidth.  If you're really scaling up (hundreds of cores), 
then at some point you will saturate the network.  The theoretical best you can 
do is to put each document on the network only once (from its source container 
to the CPU where it will be processed), and then the results on the network 
only once (from the CPU where they were processed to the destination 
container).

The effective TCP/IP throughput of a gigabit network card is about 57 MB/sec 
(try measuring yours some time with something like Sandra from SiSoftware--the 
Lite version is free).  Depending on your content format (Word, PDF, HTML, 
plain text), you might find your network saturated sooner than you expected.  
If your architecture puts the data on the network more than once, it will 
happen much sooner.  Also be very careful about which protocol you use--
something like SOAP uses incredible amounts of both network bandwidth and CPU.  
RMI or Jini would be much better.  If you saturate the network, your choices 
are expensive 10 Gb networking (fiber optics?) or redesigning the system.
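
To make that concrete, here's a quick back-of-envelope sketch (the document 
size, docs-per-core rate, and trip count below are made-up assumptions--plug 
in your own numbers):

    // Rough saturation estimate for a single gigabit NIC.  Every number
    // here is an illustrative assumption, not a measurement.
    public class NetworkBudget {
        public static void main(String[] args) {
            double nicThroughputMBps = 57.0;  // effective TCP/IP rate, 1 GbE
            double avgDocSizeMB      = 0.25;  // assume ~250 KB per document
            double docsPerCorePerSec = 2.0;   // assume 2 docs/sec per core
            int    trips             = 2;     // doc in + results out

            double maxDocsPerSec  = nicThroughputMBps / (avgDocSizeMB * trips);
            double coresSupported = maxDocsPerSec / docsPerCorePerSec;

            System.out.printf("Wire handles ~%.0f docs/sec -> ~%.0f busy cores "
                    + "before the NIC is the bottleneck%n",
                    maxDocsPerSec, coresSupported);
        }
    }

With those made-up numbers, one gigabit pipe tops out somewhere around 50-60 
busy cores; an architecture that copies each document an extra time cuts that 
roughly in half.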

You will of course want a dedicated switch and isolation from other traffic 
(i.e. a gateway).


2. Multi-processor boxes may be easier to manage, but they don't scale as well 
as, say, a set of 2- or 4-core boxes with the same total number of cores.  
Why?  Because the larger boxes have shared system components that become 
bottlenecks.  Sixteen cores in a box sharing one disk system, one bus, one or 
two NICs, one bank of RAM, one front-side bus, and (with multi-core chips) 
parts of the L2 cache will have problems.

Much of the NLP software I work with uses large amounts of RAM, both for the 
documents and for data structures (name catalogs, taxonomies, etc.).  So in a 
big box, when multiple CPUs get going on multiple documents (either threaded in 
a single process or in multiple processes), there are precious few memory pages 
being requested in common, because each thread is executing different code or 
accessing different data.  Since the code and data are so large, the L2 cache 
hit rate drops to near nothing, and the CPU spends most of its time in wait 
states while the FSB goes out to RAM.  Now the FSB is the bottleneck, and the 
machine won't go above 50% CPU utilization (I've seen it).

The more CPUs you put in a box, the worse it gets.  So I recommend a larger 
number of smaller boxes, which gives you more FSBs, RAM banks, buses, and NICs 
per CPU.  They're cheaper anyway.  Text analytics isn't like an Oracle cluster 
with a shared SGA and shared disk storage, which runs great on a single large 
machine.  Text analytics is a set of independent processes with little or no 
local disk I/O.  So put them on separate, cheap machines.

No need for those fancy blades either.  They make it easier to build a shared 
file system (NAS or SAN plugging into the cage), but you should avoid an 
architecture that uses a shared file system anyway.  Just use HTTP or FTP to 
transfer the documents to the machine via the NIC.  Send the worker a URL and 
have the worker pull the document over directly from the source.  Blades are 
fairly expensive, and shared file systems are hard to manage.
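
The worker side of that is tiny--something like this (just a sketch; in a real 
UIMA deployment this would sit inside your CollectionReader or work-item 
handler, and the URL would arrive on whatever queue you use):

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.URL;

    // Minimal "pull by URL" step: the worker is handed a URL and fetches
    // the raw document bytes straight from the source container.
    public class DocumentFetcher {
        public static byte[] fetch(String docUrl) throws Exception {
            InputStream in = new URL(docUrl).openStream();
            try {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                for (int n = in.read(chunk); n != -1; n = in.read(chunk)) {
                    buf.write(chunk, 0, n);
                }
                return buf.toByteArray();
            } finally {
                in.close();
            }
        }

        public static void main(String[] args) throws Exception {
            byte[] doc = fetch(args[0]);
            System.out.println("Fetched " + doc.length + " bytes");
        }
    }

No shared file system, no NFS mounts to keep in sync--just the NIC.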

So if your architecture is right, then those Dell rack-mountables should be the 
fastest, the cheapest, and the most scalable choice.  But that's a big if.

Hope this helps.


Greg Holmberg

 -------------- Original message ----------------------
From: Steve Suppe <[EMAIL PROTECTED]>
> Hi All,
> 
> I had a few general questions for those of you in a cluster 
> environment.  First of all, has anyone actually tried using UIMA on 
> Solaris?  How about x86 Sun hardware running Linux?  How about instead of 
> many nodes in a cluster, just having a few more powerful machines that 
> support much larger multithreading environments?
> 
> We're in a general Linux cluster environment right now (mostly 1U Dell 
> rack-mountables with a few more powerful machines for our DB, etc.), and are 
> doing well with it.  However, we're speccing out new hardware and thought we 
> would put all options on the table.  Just curious how everyone is building 
> their environments.
> 
> Thanks!
> 
> Steve Suppe
> 
> Steve
