Re: near-term plan

2005-08-05 Thread webmaster
I was using a nightly build that Piotr had given me, nutch-nightly.jar
(actually it was nutch-dev0.7.jar or something of that nature). I tested it
on Windows with 5 machines: 2 quad P3 Xeons at 100 Mbit, 1 Pentium 4 3 GHz
with Hyper-Threading, 1 AMD Athlon XP 2600+, and 1 Athlon 64 3500+, all with
1 GB or more of RAM. Now I have my big server, and if you have worked on
NDFS since the beginning of July I'll test it again; my big server's HD
array is very fast (200+ MB/s), so it will be able to saturate gigabit
better.

The P4 and the 2 AMD machines are hooked into the switch at gigabit, and the
2 Xeons are hooked into my other switch at 100 Mbit, which has a gigabit
uplink to my gigabit switch, so both Xeons were constantly saturated at
11 MB/s, while the P4 was able to reach higher speeds of 50-60 MB/s with its
internal RAID 0 array (dual 120 GB drives). My main PC (the Athlon 64 3500+)
was the namenode, a datanode, and also the NDFS client.

I could not get Nutch to work properly with NDFS. It was set up correctly,
and it kind of worked, but it would crash the namenode when I tried to fetch
segments in the NDFS filesystem, or index them, or do much of anything. So I
copied all my segment directories, indexes, content, whatever (it was
1.8 GB) plus some DVD images onto NDFS. My primary machine runs Nutch off
10k RPM disks in RAID 0 (2x36 GB Raptors) that can output about 120 MB/s
sustained.

Here is what I found out (on Windows): if I don't start a datanode on the
namenode machine, with its conf pointing to 127.0.0.1 instead of its outside
IP, the namenode will not copy data to the other machines. If I do run a
datanode on the namenode, data will replicate from that datanode to the
other 3 datanodes. I tried a hundred ways to make it work with an
independent namenode, without luck.

The way I saw data go across my network: when I put data into NDFS, the
namenode would look for a datanode, find the internal one, and copy data to
it. While the datanode was still copying data from my other HDs into chunks
on the RAID array, it would replicate to the P4 via gigabit at 50-60 MB/s,
and then replicate from the P4 to the Xeons, kind of alternating between
them. I only had replication at the default of 2, and I had about 100 GB to
copy in, so the copy onto the internal RAID array finished fairly quickly,
then replication to the P4 finished, and the Xeons got a little bit of data,
but not nearly as much as the P4. My guess is that it only needs 2 copies:
the first copy was the datanode on the internal machine, and the second was
the P4 datanode. The Xeons only had a smaller connection, so they didn't
receive chunks as fast as the P4 could, and the P4 had enough space for all
the data, so it worked out. I should have set replication to 4.

The AMD Athlon XP 1900+ was running Linux (SuSE 9.3), and it would crash the
namenode on Windows if I connected it as a datanode, so that one didn't get
tested. But I was able to put out 50-60 MB/s to 1 machine; it would not
replicate data to multiple machines at the same time, it seemed. I would
have thought it would output to the Xeons at the same time as the P4 (give
the Xeons 20% of the data and the P4 80%, or something of that nature), but
it could be that they just aren't fast enough to request data before the P4
received its 32 MB chunks every half second?

The good news: CPU usage was only at 50% on my AMD 3500+, and that was while
it was copying data to the internal datanode from the NDFS client off
another internal HD, while also running the namenode and the internal
datanode. Does it now work with a separate namenode?

I'm getting ready to run Nutch on Linux full time, if I can ever get the
damn driver for my HighPoint 2220 RAID card to work with SuSE, any SuSE; the
drivers don't work with dual-core CPUs or something??? They are working on
it, and for now I'm stuck with Fedora 4 until they fix it. So it's not ready
for testing yet. I'll let you know when I can test it in a full Linux
environment.
wow that was a long one!!!
-Jay
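The replica-placement behavior Jay describes can be sketched roughly as
follows. This is a simplified model, not actual NDFS code: it assumes the
first replica always lands on the client's local datanode and the second
chases the fastest remote link, which matches what was observed (node names
and link speeds are taken from the mail; with the default replication of 2,
the slow Xeons end up with almost nothing).

```python
# Rough sketch (not actual NDFS code) of the replica placement described
# above: the client's local datanode gets the first copy of every chunk,
# and the second copy goes to the fastest-responding remote node.

CHUNK_MB = 32          # NDFS chunk size mentioned in the mail
REPLICATION = 2        # default replication factor

def place_chunks(total_mb, local, remotes):
    """remotes: list of (name, link_mb_per_s); fastest link wins each chunk."""
    counts = {local: 0}
    for name, _ in remotes:
        counts[name] = 0
    fastest = max(remotes, key=lambda r: r[1])[0]
    for _ in range(total_mb // CHUNK_MB):
        counts[local] += 1            # first replica is always local
        counts[fastest] += 1          # second replica chases the fast link
    return counts

# ~100 GB copied with the topology from the mail
counts = place_chunks(100_000, "athlon64",
                      [("p4", 110), ("xeon1", 11), ("xeon2", 11)])
print(counts)
```

In this model the Xeons receive nothing at all; in practice they picked up
some chunks whenever the P4 was busy, but the overall skew is the same.
Setting replication to 4 would force copies onto every node.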


Re: near-term plan

2005-08-04 Thread Stefan Groschupf

Hi Doug,
The slides from my talk yesterday at OSCON give some hints on how  
to get started.  We need a MapReduce tutorial.


http://wiki.apache.org/nutch/Presentations


Can you explain what this means on page 20:
"Scheduling is bottleneck, not disk, network or CPU"?

Thanks.
Stefan 

Re: near-term plan

2005-08-04 Thread Doug Cutting

Stefan Groschupf wrote:

http://wiki.apache.org/nutch/Presentations


Can you explain what this means on page 20:
"Scheduling is bottleneck, not disk, network or CPU"?


I mean that none of the CPUs, disks, or network are at 100% of capacity.
Disks are running around 50% busy, CPUs a bit higher, and the network
switch has lots of bandwidth left.  (Although, if we used multiple racks
connected with gigabit links, those inter-rack links would already be
near capacity.)  So sometimes the CPU is busy generating random data and
stuffing it in a buffer, and sometimes the disk is busy writing data,
but we're not keeping both busy at the same time all the time.  Perhaps
more threads/processes and/or bigger buffers would increase the
utilization; I have not tried to tune things for this benchmark.  But I
am not disappointed with this performance.  Rather, I think it is fast
enough that with real applications, with non-trivial map and reduce
functions, NDFS will not be a bottleneck.
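The tuning idea Doug mentions (more threads and bigger buffers to keep CPU
and disk busy simultaneously) can be sketched as a producer/consumer pair
with a bounded queue between them; this is an illustrative model, not the
benchmark's actual code, and the buffer and block sizes are made up.

```python
# Minimal sketch of overlapping "CPU busy" (generating data) with "disk
# busy" (writing it): a producer thread fills a bounded queue while a
# writer drains it, instead of the two phases alternating.
import io
import queue
import threading

def run(buffers=4, blocks=8, block_size=1 << 16):
    q = queue.Queue(maxsize=buffers)   # more buffers -> more overlap
    sink = io.BytesIO()                # stands in for the disk

    def produce():                     # "CPU": generate data
        for i in range(blocks):
            q.put(bytes([i % 256]) * block_size)
        q.put(None)                    # sentinel: no more data

    def write():                       # "disk": drain the queue
        while (buf := q.get()) is not None:
            sink.write(buf)

    t = threading.Thread(target=produce)
    t.start()
    write()
    t.join()
    return sink.getbuffer().nbytes

print(run())
```

With `maxsize=1` the two sides mostly take turns; raising `buffers` lets
the producer run ahead while a write is in flight, which is the
utilization gain being discussed.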


Doug


Re: near-term plan

2005-08-04 Thread Piotr Kosiorowski

Hello,
I think it is a good idea to release ASAP. I wanted to contribute my code
for fault-tolerant searching - it is taking more time than I expected
because, as some of you know, in the meantime I became a father. But I hope
I will be able to send something for comments early next week. I will look
at Jira to check if some more bugs can be fixed before the deadline
proposed by Andrzej.

Regards
Piotr


Andrzej Bialecki wrote:

Doug Cutting wrote:


Here's a near-term plan for Nutch.

1. Release Nutch 0.7, based on current trunk.  We should do this ASAP. 
Are there bugs in trunk that we need to fix before this can be done? 
The trunk will be copied to a 0.7 release branch.




I'll be back from vacation in 3-4 days, I hope I can do some work in the 
meantime; I'd like to close some bugs marked with Major (e.g. the 
multi-line protocol properties), and perhaps integrate the RSS parser 
before the release. Other than that I think we should do it ASAP. So, I 
would propose a deadline of Aug 8 for the last commits, and then perhaps 
Aug 15 for the release?



2. Merge the mapred branch to trunk.

3. Move the packages org.apache.nutch.{io,ipc,fs,ndfs,mapred} into a 
separate project for distributed computing tools.  If the Lucene PMC 
approves this, it would be a new Lucene sub-project, a Nutch sibling.



I concur. They are very useful at times in unrelated projects.






Re: near-term plan

2005-08-04 Thread Jay Pound
Doug, I also ran into this when I was testing NDFS: the system would have to
wait for the namenode to tell the datanodes what data to receive and which
data to replicate. I'm currently setting up Lustre to see how it works; it
operates at the kernel level. Do you think the namenode would perform better
if it were not written in Java? I plan on running a system where the
namenode (metadata) server will have to perform thousands of I/Os a second:
concurrently updating indexes of multiple segments simultaneously, updating
the db on one machine, and fetching multiple segments on multiple machines,
all accessing the same logical filesystem at the same time. The way the
namenode responded, it took a few seconds to replicate data to other
datanodes, and it took time to start copying data. When writing an index,
imagine having to wait 1-10 seconds per file to be written (if queued); that
will cause serious problems. Also, I was able to saturate gigabit with NDFS
(well, about 50-60 MB/s; it's hard to do better than that over copper), it
just took a few seconds to ramp up to speed, and that includes file copying
and replication.
-Jay
PS: Where can I find out about MapReduce? I read the presentations, but I
don't get the core concept of it.

PPS: VIA chips aren't very powerful in the FPU; try an Opteron for your
namenode. I bet you will see a huge improvement in speed, even over Xeons,
P4s, etc. I was only able to test 5 machines, but I was able to saturate
50-60 MB/s to each (mainly replication throughput running replication
level 1).


Re: near-term plan

2005-08-04 Thread Doug Cutting

Jay Pound wrote:

Doug, I also ran into this when I was testing NDFS: the system would have to
wait for the namenode to tell the datanodes what data to receive and which
data to replicate


When did you test this?  Which version of Nutch?  How many nodes?  My 
benchmark results are from just a few days ago.  There have been a lot of 
fixes in the past week, and NDFS now works much better.



I'm currently setting up Lustre to see how it works; it operates at the
kernel level. Do you think the namenode would perform better if it were not
written in Java? I plan on running a system where the namenode (metadata)
server will have to perform thousands of I/Os a second: concurrently
updating indexes of multiple segments simultaneously, updating the db on one
machine, and fetching multiple segments on multiple machines, all accessing
the same logical filesystem at the same time.


While running the benchmark, the namenode was typically using only 2% of 
its 1 GHz CPU.



PS: Where can I find out about MapReduce? I read the presentations, but I
don't get the core concept of it.


http://labs.google.com/papers/mapreduce.html
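The core concept from that paper can be boiled down to a few lines: a map
function turns each input record into (key, value) pairs, the framework
groups the values by key, and a reduce function folds each group. The
word-count sketch below is the paper's own canonical example; everything
else in a real system (splitting input, shuffling between machines,
retrying failed tasks) is plumbing around this.

```python
# Single-machine sketch of the MapReduce programming model.
from collections import defaultdict

def map_fn(doc):                 # map: one record -> (key, value) pairs
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):     # reduce: one key plus all its values
    return word, sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in inputs:        # "shuffle": group emitted values by key
        for k, v in map_fn(record):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(mapreduce(["crawl the web", "index the web"], map_fn, reduce_fn))
```

Because map runs independently per record and reduce independently per
key, both phases parallelize across machines; that is the whole trick.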


PPS: VIA chips aren't very powerful in the FPU; try an Opteron for your
namenode. I bet you will see a huge improvement in speed, even over Xeons,
P4s, etc. I was only able to test 5 machines, but I was able to saturate
50-60 MB/s to each (mainly replication throughput running replication
level 1)


Via is not my first choice of CPU, it's simply what the Internet Archive 
has given me to use.  With hundreds of datanodes a Via-based namenode 
could become a bottleneck.  Right now it is not.


Doug


Detecting unmodified content patches (Re: near-term plan)

2005-08-04 Thread Andrzej Bialecki

Doug Cutting wrote:

Andrzej Bialecki wrote:

So, I would propose a deadline of Aug 8 for the last commits, and then 
perhaps Aug 15 for the release?



Sounds good to me.  Thanks for helping with this!


Unfortunately, the patches related to detecting the unmodified content 
will have to wait until after the release.


Here's the problem: It's quite easy to add this checking and recording 
capability to all fetcher plugins, fetchlist generation and db update 
tools, and I've done this in my local patches. However, after a while I 
discovered a serious problem in the way Nutch currently manages phasing 
out of old segment data. If we assume that we always refresh after some 
fixed interval (30 days, or whatever), then we can safely delete 
segments older than 30 days. If the interval varies, then potentially we 
could be stuck with some segments with very old (but still valid) data. 
This is very inefficient, because in a single given segment there might 
be only a couple of such pages left after a while, and the rest of them 
would have to be removed again and again by deduplication because newer 
pages would exist in newer segments.


Moreover (and this is the worst problem), if such segments are lost, the 
information in the webdb must be updated in a way that forces refetching, 
even though If-Modified-Since or the MD5 indicates that the page is still 
unchanged since the last fetch. Currently the only way to do this is to 
"add days", but if we use a variable refetch interval then that doesn't 
make much sense. I think we need a better way to track which pages are 
missing from the segments and have to be re-fetched, or a better DB update 
mechanism for when we lose some segments.


Perhaps we should extend the Page to record which segment holds the 
latest version of the page? But segments don't have unique IDs now (a 
directory name is too fragile and too easily changed) ...


Related question: in the FetchListEntry we have a fetch flag. I think 
that after minor modifications to the FetchListTool (to generate only 
entries that we are supposed to fetch) we could get rid of this flag, 
or change its semantics to mean "unconditionally fetch, even if unmodified".
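The "record which segment holds the latest version" idea could look
something like the sketch below. This is purely hypothetical (the `Page`
fields, `needs_fetch`, and the segment-ID strings are all made up for
illustration and are not existing Nutch code): a lost segment forces a
refetch even when the unmodified check says the content hasn't changed.

```python
# Hypothetical sketch of the proposal above, not existing Nutch code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:                     # field names are illustrative
    url: str
    md5: str
    segment_id: Optional[str]   # segment holding the latest fetched copy

def needs_fetch(page, live_segments, content_changed):
    if page.segment_id not in live_segments:
        return True             # segment lost: refetch regardless of MD5
    return content_changed      # otherwise the unmodified check decides

live = {"seg-20050801"}
p = Page("http://example.org/", "abc123", "seg-20050615")
print(needs_fetch(p, live, content_changed=False))  # True: segment was lost
```

This sidesteps the "add days" hack: the db update consults segment
liveness directly instead of faking the page's age, so it works with any
refetch interval.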


Any comments?

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com