I was using a nightly build that Pitor had given me, nutch-nightly.jar
(actually it was nutch-dev0.7.jar or something of that nature). I tested it
on the Windows platform with 5 machines: 2 quad P3 Xeons on 100 Mbit, 1
Pentium 4 3 GHz with hyperthreading, 1 AMD Athlon XP 2600+, and 1 Athlon 64
3500+. All have 1 GB or more of RAM. Now I have my big server, and if you
have worked on NDFS since the beginning of July I'll test it again; my big
server's HD array is very fast, 200+ MB/s, so it will be able to saturate
gigabit better.

Anyway, the P4 and the 2 AMD machines are hooked into the switch at gigabit,
and the 2 Xeons are on my other switch at 100 Mbit, which has a gigabit
uplink to my gigabit switch. So both Xeons would constantly be saturated at
11 MB/s, while the P4 could reach higher speeds of 50-60 MB/s with its
internal RAID 0 array (dual 120 GB drives). My main PC (the Athlon 64 3500+)
was the namenode, a datanode, and also the NDFS client.

I could not get Nutch to work properly with NDFS. It was set up correctly,
and it kinda worked, but it would crash the namenode when I tried to fetch
segments in the NDFS filesystem, index them, or do much of anything. So I
copied all my segment directories, indexes, content, whatever (it was 1.8
GB), and some DVD images onto NDFS. My primary machine runs Nutch off 10k
rpm disks in RAID 0 (2x36 GB Raptors); they can output about 120 MB/s
sustained.

Here is what I found out (in Windows): if I don't start a datanode on the
namenode machine, with its conf pointing to 127.0.0.1 instead of its outside
IP, the namenode will not copy data to the other machines. If I am running a
datanode on the namenode machine, data will replicate from that datanode to
the other 3 datanodes. I tried this a hundred ways to make it work with an
independent namenode, without luck.

The way I saw data go across my network: I would put data into NDFS, the
namenode would request a datanode, find the internal one, and copy data to
it. Only then, while the datanode was still copying data from my other HDs
into chunks on the RAID array, would it replicate to the P4 via gigabit at
50-60 MB/s, and then replicate from the P4 to the Xeons, kinda alternating
between them, since I only had replication at the default of 2. I had about
100 GB to copy in, so the copy onto the internal RAID array finished fairly
quickly, then replication to the P4 finished, and the Xeons got a little bit
of data, but nowhere near as much as the P4. My guess is that it only needs
2 copies: the first copy was the datanode on the internal machine, and the
second was the P4 datanode. The Xeons only had the smaller connection, so
they didn't receive chunks as fast as the P4 could, and the P4 had enough
space for all the data, so it worked out. I should have set replication
to 4.

The Athlon XP 1900+ was running SUSE Linux 9.3, and it would crash the
namenode on Windows if I connected it as a datanode, so that one didn't get
tested. But I was able to push 50-60 MB/s to 1 machine; it did not seem to
replicate data to multiple machines at the same time. I would have thought
it would output to the Xeons at the same time as the P4, give the Xeons 20%
of the data and the P4 80% or something of that nature, but it could be that
they just aren't fast enough to request data before the P4 has received its
32 MB chunks every half second?
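To show what I mean about the P4 soaking up almost all the second copies, here is a toy model of the behavior I think I was seeing. This is just my guess at the scheduling, not actual NDFS code: the first replica lands on the local datanode, and each 32 MB chunk's second replica goes to whichever remote datanode frees up first, using rough link speeds from my network.

```python
# Toy model (NOT real NDFS logic): the first replica always lands on the
# local datanode; the second replica of each 32 MB chunk goes to whichever
# remote datanode would finish receiving it soonest. Bandwidths are the
# rough MB/s numbers from my network.

CHUNK_MB = 32
remotes = {"p4": 55.0, "xeon1": 11.0, "xeon2": 11.0}  # link speed, MB/s

def place_second_replicas(total_mb):
    busy_until = {name: 0.0 for name in remotes}  # when each node frees up
    counts = {name: 0 for name in remotes}        # chunks received per node
    for _ in range(total_mb // CHUNK_MB):
        # pick the node that would finish receiving this chunk soonest
        target = min(remotes, key=lambda n: busy_until[n] + CHUNK_MB / remotes[n])
        busy_until[target] += CHUNK_MB / remotes[target]
        counts[target] += 1
    return counts

print(place_second_replicas(100 * 1024))  # about 100 GB, like my copy
```

In this toy model the gigabit node ends up with roughly 70% of the chunks and the two 100 Mbit boxes split the rest, which lines up with what I saw on the wire.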
The good news: CPU usage was only at 50% on my Athlon 64 3500+, and that was
while it was copying data to the internal datanode from the NDFS client off
another internal HD, while also running the namenode and the internal
datanode. Does it work with a separate namenode now?

I'm getting ready to run Nutch on Linux full time, if I can ever get the
damn driver for my HighPoint 2220 RAID card to work with SUSE, any SUSE; the
drivers don't work with dual-core CPUs or something??? They are working on
it, and I'm stuck with Fedora 4 until they fix it. So it's not ready for
testing yet. I'll let you know when I can test it in a full Linux
environment.
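By the way, on bumping replication to 4: if I remember the 0.7-era config right, it's just a property you override in nutch-site.xml. I'm recalling the property name from memory, so check it against the nutch-default.xml that ships with your build:

```xml
<!-- nutch-site.xml override; property name recalled from memory,
     verify against nutch-default.xml in your nightly -->
<property>
  <name>ndfs.replication</name>
  <value>4</value>
</property>
```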
wow that was a long one!!!
-Jay