Re: near-term plan
I was using a nightly build that Piotr had given me, nutch-nightly.jar (actually it was nutch-dev0.7.jar or something of that nature). I tested it on the Windows platform with 5 machines: 2 quad P3 Xeons at 100 Mbit, 1 Pentium 4 3 GHz with Hyper-Threading, 1 AMD Athlon XP 2600+, and 1 Athlon 64 3500+, all with 1 GB or more of RAM. Now I have my big server, and if you have worked on NDFS since the beginning of July I'll test it again; my big server's HD array is very fast (200+ MB/sec), so it will be able to saturate gigabit better.

The P4 and the 2 AMD machines are hooked into the switch at gigabit, and the 2 Xeons are hooked into my other switch at 100 Mbit, which has a gigabit uplink to my gigabit switch, so both Xeons would be constantly saturated at 11 MB/sec, while the P4 was able to reach higher speeds of 50-60 MB/sec with its internal RAID 0 array (dual 120 GB drives). My main PC (the Athlon 64 3500+) was the namenode, a datanode, and also the NDFS client. I could not get Nutch to work properly with NDFS. It was set up correctly and it kind of worked, but the namenode would crash when I tried to fetch segments into the NDFS filesystem, index them, or do much of anything. So I copied all my segment directories, indexes, content, whatever it was (1.8 GB) and some DVD images onto NDFS. My primary machine runs Nutch off 10k RPM disks in RAID 0 (2x36 GB Raptors); they can output about 120 MB/sec sustained.

Here is what I found out (in Windows): if I don't start a datanode on the namenode machine, with the conf pointing to 127.0.0.1 instead of its outside IP, the namenode will not copy data to the other machines. If I am running a datanode on the namenode, data will replicate from that datanode to the other 3 datanodes. I tried this a hundred ways to try to make it work with an independent namenode, without luck.
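For reference, the settings involved in Jay's 127.0.0.1 workaround would live in nutch-site.xml. The property names below are a sketch from memory of the mapred-branch configuration and may not match that build exactly; verify against nutch-default.xml:

```xml
<!-- Hypothetical sketch; check property names against nutch-default.xml. -->
<property>
  <name>fs.default.name</name>
  <!-- Jay's observation: replication outward only worked when a local
       datanode was started with the conf pointing at 127.0.0.1 rather
       than the machine's outside IP. Port number is illustrative. -->
  <value>127.0.0.1:9000</value>
</property>
<property>
  <name>ndfs.replication</name>
  <!-- Default was 2; Jay notes below he should have set it to 4. -->
  <value>2</value>
</property>
```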
But here is how I saw data go across my network. I would put data into NDFS; the namenode would request a datanode, find the internal one, and copy data to it. While that datanode was still copying data from my other HDs into chunks on the RAID array, it would replicate to the P4 via gigabit at 50-60 MB/sec, and then from the P4 to the Xeons, kind of alternating them. I only had replication at the default of 2, and I had about 100 GB to copy in, so the copy onto the internal RAID array finished fairly quickly, then replication to the P4 finished, and the Xeons got a little data, but nowhere near as much as the P4. My guess is it only needs 2 copies: the first copy was the datanode on the internal machine, and the second was the P4 datanode. The Xeons only had a smaller connection, so they didn't receive chunks as fast as the P4 could, and the P4 had enough space for all the data, so it worked out. I should have set replication to 4.

The Athlon XP 1900+ was running Linux (SuSE 9.3), and it would crash the namenode on Windows when I connected it as a datanode, so that one didn't get tested. I was able to put out 50-60 MB/sec to 1 machine, but it did not seem to replicate data to multiple machines at the same time. I would have thought it would output to the Xeons at the same time as the P4, giving the Xeons 20% of the data and the P4 80%, or something of that nature; but it could be that they just weren't fast enough to request data before the P4 received its 32 MB chunks every half second? The good news is CPU usage was only at 50% on my Athlon 64 3500+, and that was while it was copying data from another internal HD through the NDFS client to the internal datanode, running the namenode and the datanode internally. Does it now work with a separate namenode?
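The placement pattern Jay describes (client writes land on the local datanode first, then get forwarded to peers until the replication factor is met) can be modeled with a toy sketch in Python. This is an illustration of the observed behavior, not NDFS code; the node and function names are invented:

```python
# Toy model of block placement: each chunk is stored on the first
# `replication` nodes in turn, starting with the local datanode.
def replicate(chunk, nodes, replication=2):
    """Store `chunk` on the first `replication` nodes; return their names."""
    stored = []
    for node in nodes[:replication]:
        node.setdefault("chunks", []).append(chunk)
        stored.append(node["name"])
    return stored

nodes = [{"name": "local"}, {"name": "p4"}, {"name": "xeon1"}, {"name": "xeon2"}]
# With replication=2, only the local datanode and the fastest peer get
# each chunk -- matching what Jay saw: the Xeons received little data.
placed = replicate("chunk-0001", nodes)
print(placed)  # ['local', 'p4']
```

With replication raised to 4, every node in the list would receive a copy, which is why Jay concludes he should have set replication to 4.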
I'm getting ready to run Nutch on Linux full time, if I can ever get the damn driver for my HighPoint 2220 RAID card to work with SuSE, any SuSE; the drivers don't work with dual-core CPUs or something??? They are working on it, and I'm stuck with Fedora 4 until they fix it, so it's not ready for testing yet. I'll let you know when I can test it in a full Linux environment. Wow, that was a long one!!! -Jay
Re: near-term plan
Hi Doug, The slides from my talk yesterday at OSCON give some hints on how to get started. We need a MapReduce tutorial. http://wiki.apache.org/nutch/Presentations Can you explain what this means on page 20: "Scheduling is bottleneck, not disk, network or CPU"? Thanks. Stefan
Re: near-term plan
Stefan Groschupf wrote: http://wiki.apache.org/nutch/Presentations Can you explain what this means on page 20: "Scheduling is bottleneck, not disk, network or CPU"?

I mean that neither the CPUs, disks nor network are at 100% of capacity. Disks are running around 50% busy, CPUs a bit higher, and the network switch has lots of bandwidth left. (Although, if we used multiple racks connected with gigabit links, those inter-rack links would already be near capacity.) So sometimes the CPU is busy generating random data and stuffing it into a buffer, and sometimes the disk is busy writing data, but we're not keeping both busy at the same time all the time. Perhaps more threads/processes and/or bigger buffers would increase the utilization; I have not tried to tune things for this benchmark. But I am not disappointed with this performance. Rather, I think it is fast enough that with real applications, with non-trivial map and reduce functions, NDFS will not be a bottleneck. Doug
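Doug's point about keeping CPU and disk busy at the same time can be illustrated with a small producer/consumer sketch (hypothetical, not the benchmark code): one thread generates data while another drains it through a bounded buffer, so neither stage sits idle waiting for the other:

```python
import queue
import threading

def generate(buf, blocks, size=4):
    # CPU-bound stage: produce data blocks and hand them to the buffer.
    for i in range(blocks):
        buf.put(bytes([i % 256]) * size)
    buf.put(None)  # sentinel: no more data

def write(buf, out):
    # I/O-bound stage: drain the buffer as blocks arrive.
    while (block := buf.get()) is not None:
        out.append(block)

buf = queue.Queue(maxsize=8)   # bounded buffer decouples the two stages
out = []
t = threading.Thread(target=generate, args=(buf, 16))
t.start()                      # producer runs while the consumer drains
write(buf, out)
t.join()
print(len(out))  # 16
```

A bigger `maxsize` lets the producer run further ahead of the consumer, which is the "bigger buffers" tuning Doug mentions.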
Re: near-term plan
Hello, I think it is a good idea to release ASAP. I wanted to contribute my code for fault-tolerant searching; it is taking more time than I expected because, as some of you know, in the meantime I became a father. But I hope I will be able to send something for comments early next week. I will look at Jira to check if some more bugs can be fixed before the deadline proposed by Andrzej. Regards, Piotr

Andrzej Bialecki wrote: Doug Cutting wrote: Here's a near-term plan for Nutch. 1. Release Nutch 0.7, based on current trunk. We should do this ASAP. Are there bugs in trunk that we need to fix before this can be done? The trunk will be copied to a 0.7 release branch.

I'll be back from vacation in 3-4 days, and I hope I can do some work in the meantime; I'd like to close some bugs marked Major (e.g. the multi-line protocol properties), and perhaps integrate the RSS parser before the release. Other than that I think we should do it ASAP. So I would propose a deadline of Aug 8 for the last commits, and then perhaps Aug 15 for the release?

2. Merge the mapred branch to trunk. 3. Move the packages org.apache.nutch.{io,ipc,fs,ndfs,mapred} into a separate project for distributed computing tools. If the Lucene PMC approves this, it would be a new Lucene sub-project, a Nutch sibling.

I concur. They are very useful at times in unrelated projects.
Re: near-term plan
Doug, I also ran into this when I was testing NDFS: the system would have to wait for the namenode to tell the datanodes what data to receive and which data to replicate. I'm currently setting up Lustre to see how it works; it operates at the kernel level. Do you think that if the namenode were not Java it would perform better? I plan on running a system where the namenode (metadata) server will have to perform thousands of I/Os a second: concurrently updating indexes of multiple segments simultaneously, updating the db on one machine, and fetching multiple segments on multiple machines, all accessing the same logical filesystem at the same time. The way the namenode responded, it took a few seconds to replicate data to other datanodes, and it took time to start copying data. When writing an index, imagine having to wait 1-10 seconds per file to be written (if queued); that will cause serious problems. Also, I was able to saturate gigabit with NDFS (well, about 50-60 MB/sec; it's hard to get better than that with copper), it just took a few seconds to ramp up to speed, and that includes file copying and replication. -Jay

PS: Where can I find out about MapReduce? I read the presentations, but I don't get the core concept of it.

PPS: Via chips aren't very FPU-powerful; try an Opteron for your namenode. I bet you will see a huge improvement in speed, even over Xeons, P4s, etc. I was only able to test 5 machines, but I was able to saturate 50-60 MB/sec to each (mainly replication throughput running level 1).

- Original Message - From: Doug Cutting [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Thursday, August 04, 2005 3:54 PM Subject: Re: near-term plan

Stefan Groschupf wrote: http://wiki.apache.org/nutch/Presentations Can you explain what this means on page 20: "Scheduling is bottleneck, not disk, network or CPU"? I mean that neither the CPUs, disks nor network are at 100% of capacity.
Re: near-term plan
Jay Pound wrote: Doug, I also ran into this when I was testing NDFS: the system would have to wait for the namenode to tell the datanodes what data to receive and which data to replicate.

When did you test this? Which version of Nutch? How many nodes? My benchmark results are from just a few days ago. There have been a lot of fixes in the past week, and NDFS now works much better.

I'm currently setting up Lustre to see how it works; it operates at the kernel level. Do you think that if the namenode were not Java it would perform better? I plan on running a system where the namenode (metadata) server will have to perform thousands of I/Os a second, concurrently updating indexes of multiple segments simultaneously, updating the db on one machine, and fetching multiple segments on multiple machines, all accessing the same logical filesystem at the same time.

While running the benchmark the namenode was typically using only 2% of its 1 GHz CPU.

PS: Where can I find out about MapReduce? I read the presentations, but I don't get the core concept of it.

http://labs.google.com/papers/mapreduce.html

PPS: Via chips aren't very FPU-powerful; try an Opteron for your namenode. I bet you will see a huge improvement in speed, even over Xeons, P4s, etc. I was only able to test 5 machines, but I was able to saturate 50-60 MB/sec to each (mainly replication throughput running level 1).

Via is not my first choice of CPU; it's simply what the Internet Archive has given me to use. With hundreds of datanodes a Via-based namenode could become a bottleneck. Right now it is not. Doug
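For readers who, like Jay, want the core concept without the paper: map emits key/value pairs, the framework groups values by key, and reduce folds each group into a result. A minimal word-count sketch in Python (a toy of the model, not Nutch's mapred code):

```python
from collections import defaultdict

def map_fn(doc):
    # map: emit (word, 1) for every word in the document
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce: fold all counts for one word into a total
    return word, sum(counts)

def mapreduce(docs):
    groups = defaultdict(list)
    for doc in docs:                       # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)      # shuffle: group by key
    return dict(reduce_fn(k, v) for k, v in groups.items())  # reduce phase

print(mapreduce(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

The real framework distributes the map and reduce phases across machines and stores intermediate data on the filesystem, but the data flow is exactly this.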
Detecting unmodified content patches (Re: near-term plan)
Doug Cutting wrote: Andrzej Bialecki wrote: So, I would propose a deadline of Aug 8 for the last commits, and then perhaps Aug 15 for the release? Sounds good to me. Thanks for helping with this!

Unfortunately, the patches related to detecting unmodified content will have to wait until after the release. Here's the problem: it's quite easy to add this checking and recording capability to all fetcher plugins, fetchlist generation and the db update tools, and I've done this in my local patches. However, after a while I discovered a serious problem in the way Nutch currently manages phasing out old segment data.

If we assume that we always refresh after some fixed interval (30 days, or whatever), then we can safely delete segments older than 30 days. If the interval varies, then we could potentially be stuck with segments holding very old (but still valid) data. This is very inefficient, because after a while a given segment might contain only a couple of such pages, and the rest would have to be removed again and again by deduplication because newer copies exist in newer segments. Moreover (and this is the worst problem), if such segments are lost, the information in the webdb must be updated in a way that forces refetching, even though If-Modified-Since or the MD5 indicates that the page is unchanged since the last fetch. Currently the only way to do this is to add days, but if we use a variable refetch interval that doesn't make much sense.

I think we need a better way to track which pages are missing from the segments and have to be re-fetched, or a better DB update mechanism for when we lose some segments. Perhaps we should extend Page to record which segment holds the latest version of the page? But segments don't have unique IDs now (a directory name is too fragile and too easily changed)...

Related question: in the FetchListEntry we have a fetch flag.
I think that after minor modifications to the FetchListTool (to generate only entries which we are supposed to fetch) we could get rid of this flag, or change its semantics to mean "unconditionally fetch, even if unmodified". Any comments?

-- Best regards, Andrzej Bialecki. Information Retrieval, Semantic Web; Embedded Unix, System Integration. http://www.sigram.com Contact: info at sigram dot com
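Andrzej's suggestion of recording which segment holds the latest version of a page could be sketched as follows. The field and function names here are hypothetical (Page has no segment field today, and segment IDs would first need to be made stable); the point is that a lost segment then forces a refetch even when the content is unchanged:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    url: str
    md5: str
    segment_id: Optional[str] = None  # hypothetical: segment with latest copy

# Set of segment IDs that still exist on disk (illustrative values).
live_segments = {"seg-20050801", "seg-20050815"}

def needs_refetch(page, new_md5):
    # Refetch if the segment that held the last good copy was deleted,
    # even though If-Modified-Since / MD5 say the page is unchanged...
    if page.segment_id not in live_segments:
        return True
    # ...otherwise refetch only when the content actually changed.
    return page.md5 != new_md5

p = Page("http://example.com/", "abc", "seg-20050701")  # segment was lost
print(needs_refetch(p, "abc"))  # True
```

This replaces the current "add days to force refetching" workaround, which breaks down once the refetch interval is variable.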