Greg --
Thanks so much for the response. Reply is below...
At 04:26 PM 12/17/2007, [EMAIL PROTECTED] wrote:
1. Use of network bandwidth. If you're really scaling up (hundreds of
cores), then at some point you will saturate the network. The
theoretical best you can do is to have each document cross the network only
once (from its source container to the CPU where it will be processed),
and then the results on the network only once (from the CPU where
processed to the destination container).
...
You will of course want a dedicated switch and isolation from other
traffic (i.e. a gateway).
This is very sage advice, and we are definitely working towards it! We're
nowhere near the hundreds-of-cores level yet (10s of cores!!!), but that's
been a priority from the beginning. It just seemed obvious. We've been
using aggregate engines to ensure that each document is, as you said, sent
from the reader to the 'worker node' to our destination (in most cases, an
Oracle database) with no stops in between. We're also learning (with our
DBA) the art of Oracle optimization, and have learned a lot as we go. One
thing we've learned is that Oracle has actually been our biggest
problem. It's still by and large the weakest link (but, as I said, we only
have 10s of cores at the moment).
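To make the bandwidth ceiling concrete, here's a quick back-of-envelope (all the numbers are illustrative assumptions, not measurements from our cluster):

```python
# Back-of-envelope: how many documents/second before a GigE link saturates?
# Every constant below is an illustrative assumption.

LINK_BYTES_PER_SEC = 125_000_000   # 1 Gb/s ~= 125 MB/s, ignoring protocol overhead
AVG_DOC_BYTES = 250_000            # assumed average document size
AVG_RESULT_BYTES = 50_000          # assumed average result size

# Best case (document on the wire once in, result once out):
bytes_per_doc = AVG_DOC_BYTES + AVG_RESULT_BYTES
docs_per_sec = LINK_BYTES_PER_SEC / bytes_per_doc

print(f"~{docs_per_sec:.0f} docs/sec before the link saturates")
```

Any extra hop (e.g. a stop at a shared file server) doubles the per-document byte count and halves that ceiling, which is why the one-pass-in, one-pass-out rule matters so much.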
I've also ensured that our reader does as little as possible (there's been
a push to have it sneak in some pre-processing). As the reader is a serial
operation (unless I'm missing something), it seems best to have it focus
solely on 'multiplexing' the document set out to the worker nodes.
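The shape I mean, as a minimal single-process sketch (read_documents() and the worker body are stand-ins for our real components, and the threads stand in for worker nodes):

```python
# Sketch: a serial reader that only dispatches work, never processes it.
import queue
import threading

def read_documents():
    # Stand-in for the serial reader: it enumerates work items
    # (fake document IDs here) and does no pre-processing at all.
    for doc_id in range(10):
        yield f"doc-{doc_id}"

def worker(inbox, results):
    while True:
        doc = inbox.get()
        if doc is None:                 # sentinel: no more work
            inbox.task_done()
            return
        results.append(doc.upper())     # stand-in for real processing
        inbox.task_done()

inbox, results = queue.Queue(), []
threads = [threading.Thread(target=worker, args=(inbox, results))
           for _ in range(4)]
for t in threads:
    t.start()

for doc in read_documents():            # the reader stays a pure dispatcher
    inbox.put(doc)
for _ in threads:
    inbox.put(None)
for t in threads:
    t.join()
```

Keeping the loop that feeds the queue this dumb is the whole point: the moment pre-processing sneaks into it, the serial reader becomes the bottleneck for every worker downstream.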
2. Multi-processor boxes may be easier to manage, but they don't scale as
well as, say, a set of 2- or 4-core boxes with the same total
cores. Why? Because the larger boxes have shared system components that
become bottlenecks. Sixteen cores in a box sharing one disk system, one bus,
one or two NICs, a RAM bank, and one front-side bus (and, for multi-core,
some sharing of the L2 cache) will run into contention problems.
[... cut out very useful advice in the interest of brevity ... ]
Ah, this is EXACTLY the kind of thing I was hoping to hear. We have been
going back and forth on this, unsure whether relying on a GigE network
(an isolated one, with each box having multiple NICs) like we have been
doing, or relying on inter-process communication, was the better way to
go. Thanks for the confirmation - we had our suspicions, but this makes us
feel much better.
No need for those fancy blades either. Those make it easier to build a
shared file system (NAS or SAN plugging into the cage), but you should
avoid an architecture that uses a shared file system anyway. Just use
HTTP or FTP to transfer the documents to the machine via the NIC. Send
the worker a URL, have the worker pull the document over directly from the
source. Blades are fairly expensive, and shared file systems are hard to manage.
OK, this makes sense. Our infrastructure guys are really pushing us to use
the SAN, as we have Fibre Channel connectivity and they swear it will be
fine. I've had some reservations, but we are also limited in our budget, so
I think we may just end up relying on it for this first iteration. But
still, very high value, thank you.
So if your architecture is right, then those Dell rack-mountables should
be the fastest, the cheapest, and the most scalable choice. But that's a
big if.
I hear you! I'm by no means an expert (obviously), but a lot of what
you've said has confirmed my suspicions, and it's nice to know that my
general approach thus far has been correct. I've been building this with
an engineering approach as much as possible, and it seems like every week I
get closer and closer to the hardware level. We started out just worrying
about our code, but it's amazing how all-encompassing this has been. But
it's been great.
Your experience is invaluable, once again I thank you!
Steve