Greg --

Thanks so much for the response.  Reply is below...

At 04:26 PM 12/17/2007, [EMAIL PROTECTED] wrote:

1. Use of network bandwidth. If you're really scaling up (hundreds of cores), then at some point you will saturate the network. The theoretical best you can do is to have each document cross the network only once (from its source container to the CPU where it will be processed), and then have the results cross the network only once (from the CPU where processed to the destination container).
...
You will of course want a dedicated switch and isolation from other traffic (i.e. a gateway).

This is very sage advice, and we are definitely working towards it! We're nowhere near the hundreds-of-cores level yet (10s of cores!!!) but that's been a priority from the beginning. It just seemed obvious. We've been using aggregate engines to ensure that each document is, as you said, sent from the reader to the 'worker node' to our destination (in most cases, an Oracle database) with no stops in between. We're also learning (with our DBA) the art of Oracle optimization, and have learned a lot as we go. One thing we've learned is that Oracle has actually been our biggest problem. It's still by and large the weakest link (but, as I said, we only have 10s of cores at the moment).

I've also ensured that our reader does as little as possible (there's been a push to have it sneak in some pre-processing). Since the reader is a serial operation (unless I'm missing something), it seems best to have it focus purely on 'multiplexing' the document set out to the worker nodes.
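To make the 'multiplexing' idea concrete, here's a minimal sketch (not our actual code; all names here are made up) of a dumb reader that just hands out document references to a pool of workers, so each document crosses the network once:

```python
import queue
import threading

def reader(doc_urls, work_queue):
    # The reader stays 'dumb': no pre-processing, just hand out
    # references so workers can pull each document directly.
    for url in doc_urls:
        work_queue.put(url)
    work_queue.put(None)  # sentinel: no more work

def worker(work_queue, results):
    # A worker pulls its own document (upper-casing the URL here is
    # just a stand-in for real fetching and processing).
    while True:
        url = work_queue.get()
        if url is None:
            work_queue.put(None)  # re-post sentinel for sibling workers
            break
        results.append(url.upper())

work_queue = queue.Queue()
results = []
docs = ["http://source/doc1", "http://source/doc2", "http://source/doc3"]

threads = [threading.Thread(target=worker, args=(work_queue, results))
           for _ in range(2)]
for t in threads:
    t.start()
reader(docs, work_queue)
for t in threads:
    t.join()

print(sorted(results))
```

In a real deployment the workers would of course write straight to the destination (the database) rather than a shared list, keeping the reader entirely out of the data path.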


2. Multi-processor boxes may be easier to manage, but they don't scale as well as, say, a set of 2- or 4-core boxes with the same total cores. Why? Because the larger boxes have shared system components that become bottlenecks. Sixteen cores in a box sharing one disk system, one or two NICs, a RAM bank, one front-side bus, and (for multi-core chips) parts of the L2 cache will have some problems.
[... cut out very useful advice in the interest of brevity ... ]

Ah, this is EXACTLY the kind of thing I was hoping to hear. We have been going back and forth on this, unsure whether relying on a GigE network (an isolated one, with each box having multiple NICs) as we have been doing, or relying on inter-process communication, was the better way to go. Thanks for the confirmation - we had our suspicions, but this makes us feel much better.


No need for those fancy blades either. Those make it easier to build a shared file system (NAS or SAN plugging into the cage), but you should avoid an architecture that uses a shared file system anyway. Just use HTTP or FTP to transfer the documents to the machine via the NIC. Send the worker a URL, and have the worker pull the document over directly from the source. Blades are fairly expensive, and shared file systems are hard to manage.
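The 'send a URL, let the worker pull' pattern above is simple to sketch with nothing but the Python standard library. The tiny in-process HTTP server here just stands in for the source container; the document body and URL are made up for the example:

```python
import http.server
import threading
import urllib.request

def fetch_document(doc_url):
    # The worker pulls its document straight from the source over HTTP,
    # so no shared file system is involved.
    with urllib.request.urlopen(doc_url) as resp:
        return resp.read()

# Tiny stand-in "source container" serving one document.
class DocHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"example document body"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), DocHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

doc = fetch_document(f"http://127.0.0.1:{server.server_port}/doc1")
server.shutdown()
print(doc)
```

Since the worker fetches over its own NIC, adding worker boxes scales the aggregate transfer bandwidth instead of funneling everything through one shared mount point.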

OK, this makes sense. Our infrastructure guys are really pushing us to use the SAN, as we have fibre-channel connectivity and they swear it will be fine. I've had my doubts, but we're also limited in budget, so I think we may just end up relying on it for this first iteration. But still, very high-value advice, thank you.

So if your architecture is right, then those Dell rack-mountables should be the fastest, the cheapest, and the most scalable choice. But that's a big if.

I hear you! I'm by no means an expert (obviously), but a lot of what you've said has confirmed my suspicions, and it's nice to know that my general approach thus far has been correct. I've been building this with an engineering approach as much as possible, and it seems like every week I get closer and closer to the hardware level. We started out just worrying about our code, but it's amazing how all-encompassing this has been. But it's been great.

Your experience is invaluable; once again, I thank you!

Steve
