Re: New to nutch, seem to be problems

2007-08-30 Thread misc


Hello-

   One more important piece of data about the problems that I am having. 
After waiting a really long time, I learned that fetch is not hung up, it 
was just really slow.  It took only a few hours to go through all the 
urls (the corresponding line for each url appears in hadoop.log, and 
all the content was loaded).  Then it took 24 hours of waiting before the 
phrase "fetcher done" appeared.  Then fetch returned.  Why would fetch 
hang around after the crawling was done, before returning?


   Looking at the code, it would seem that some of the fetcher threads must 
be stuck for a long time.  Don't these time out?  (A generic sketch of the 
kind of timeout I mean follows.)
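
   Something like this, in plain Java (an illustration only, not Nutch's 
actual code): join each worker with a timeout instead of waiting on it 
forever, and interrupt it if it is still alive.

public class JoinWithTimeout {
    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                try { Thread.sleep(60000); }       // simulate a stuck fetch
                catch (InterruptedException e) { } // told to give up
            }
        });
        worker.start();
        worker.join(5000);                         // wait at most 5 seconds
        if (worker.isAlive()) {
            System.out.println("worker still stuck, interrupting");
            worker.interrupt();                    // force it to wind down
        }
    }
}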


   thanks


- Original Message - 
From: misc [EMAIL PROTECTED]

To: nutch-agent@lucene.apache.org
Sent: Wednesday, August 29, 2007 5:31 PM
Subject: Re: New to nutch, seem to be problems




Hello-

   I will reply to my own post with new findings and observations.  About 
the slowness of generate, I just don't believe that it should take many 
hours to generate a list of any size from a crawldb containing a couple 
million entries.  I could do the equivalent on plain text lists using grep, 
sort, and uniq in just minutes.  I *must* be doing something wrong.


   I dug into it today.  Could someone correct me if I am wrong on any of 
this?  I couldn't find any written information about this anywhere.


   1. The generate seems to be broken into three phases, each a separate 
mapreduce job.  The first phase runs through all the urls in the 
crawldb and throws out any that aren't yet eligible for crawling (by 
crawl date).


   2. The second phase partitions the urls by hostname.  It also cuts out 
repeat requests to a host if their number is too high (set by a parameter), 
and then sorts the urls by frequency.


   3. The third phase updates the database with the information that the 
url is being crawled and should not be handed out to anyone else.  (A rough 
sketch of all three phases, as I understand them, follows this list.)
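
   In plain single-machine Java, my mental model of the three phases is 
roughly the following.  The names (UrlEntry, maxPerHost, and so on) are 
mine, not Nutch's actual classes, so treat this as a sketch of my 
understanding:

import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UrlEntry {
    String url;
    long nextFetchTime;   // when this url becomes eligible to fetch again
    float score;          // whatever ranking the second phase uses
    UrlEntry(String url, long nextFetchTime, float score) {
        this.url = url; this.nextFetchTime = nextFetchTime; this.score = score;
    }
    String host() { return URI.create(url).getHost(); }
}

class GenerateSketch {
    static List<UrlEntry> generate(List<UrlEntry> crawldb, long now,
                                   int maxPerHost, int topN) {
        // Phase 1: scan the whole crawldb, dropping urls whose fetch
        // time has not arrived yet.
        List<UrlEntry> eligible = new ArrayList<UrlEntry>();
        for (UrlEntry e : crawldb) {
            if (e.nextFetchTime <= now) eligible.add(e);
        }

        // Phase 2: rank, then cap how many urls any one host contributes.
        Collections.sort(eligible, new Comparator<UrlEntry>() {
            public int compare(UrlEntry a, UrlEntry b) {
                return Float.compare(b.score, a.score);   // best first
            }
        });
        Map<String, Integer> perHost = new HashMap<String, Integer>();
        List<UrlEntry> selected = new ArrayList<UrlEntry>();
        for (UrlEntry e : eligible) {
            Integer seen = perHost.get(e.host());
            int count = (seen == null) ? 1 : seen.intValue() + 1;
            perHost.put(e.host(), count);
            if (count <= maxPerHost) selected.add(e);
            if (selected.size() >= topN) break;
        }

        // Phase 3: write back to the crawldb, marking each selected url
        // as handed out so it isn't generated again (omitted here).
        return selected;
    }
}

   Called as generate(db, System.currentTimeMillis(), 100, 1000000), that 
is one linear scan plus a sort, which is why I can't see where hours would 
go.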


   By observing what was going on, I could see that the first phase alone 
seems to take a couple of hours.  I can change the log level of nutch to 
debug and watch all the rejected urls being logged, and it does seem to be 
slow, a couple per second (my db has about 200k crawled things and about 
2 million uncrawled, so about 1 in 10 should be rejected).  How can nutch 
be going at a rate of only about 20 urls per second?  This is way too slow.


   I also looked to see if DNS lookups were slowing me down, but as far as 
I can tell they are not: first, because the first phase doesn't even do DNS, 
yet is slow, and second, because I used Wireshark to look for DNS lookups 
and found none.  (A quick latency check in Java follows.)
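
   For what it's worth, this is the quick check I would use to measure DNS 
latency from Java directly (www.example.com is just a placeholder):

import java.net.InetAddress;

public class DnsCheck {
    public static void main(String[] args) throws Exception {
        String host = (args.length > 0) ? args[0] : "www.example.com";
        long start = System.currentTimeMillis();
        InetAddress addr = InetAddress.getByName(host);  // does the lookup
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(host + " -> " + addr.getHostAddress()
                           + " in " + elapsed + " ms");
    }
}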


   Can someone tell me the expected time for generate to run?  6 hours is 
too long!


   thanks
   -J


- Original Message - 
From: misc [EMAIL PROTECTED]

To: nutch-agent@lucene.apache.org
Sent: Tuesday, August 28, 2007 6:27 PM
Subject: New to nutch, seem to be problems


Hello-

   My configuration and stats are at the end of this email.  I have set up 
nutch to crawl 100,000 urls.  The first pass (of 100,000 items) went well, 
but problems started after this.


   1. Generate takes many hours to complete.  It doesn't matter whether I 
generate 1 million or 1000 items, it takes about 5 hours to complete.  Is 
this normal?


   2. Fetch works great, until it is done.  It then freezes up 
indefinitely.  It can fetch 100,000 pages in about 12 hours, and all the 
fetched content is in /tmp, but then it just sits there, never returning to 
the command line.  I have let it sit for about 12 hours and eventually 
broke down and cancelled it.  If I then try to update the database, it of 
course fails.


   3. Fetch2 runs very slowly.  Even though I am using 80 threads, I only 
download about one object every few seconds (1 every 5 or 10 seconds).  From 
the log, I can see that almost always 79 or 80 threads are spinWaiting 
(see my guess at an explanation after this list).


   4. I can't tell if fetch2 freezes like fetch does, as I haven't been 
able to wait the many days it will take to go through a full fetch with 
fetch2.
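
   One possible explanation for the spinWaiting counts (a guess on my part, 
not something I have confirmed in the code): if fetch2 enforces a politeness 
delay between requests to the same host, then throughput is bounded by the 
number of hosts with urls ready divided by that delay, no matter how many 
threads run.  The numbers below (one busy host, 5 second delay) are made up 
for illustration, but they land right at the rate I am seeing:

public class PolitenessBound {
    public static void main(String[] args) {
        int hostsReady = 1;          // assumed: urls clustered on one host
        double delaySeconds = 5.0;   // assumed per-host politeness delay
        int threads = 80;
        double maxPagesPerSec = hostsReady / delaySeconds;
        System.out.println("upper bound: " + maxPagesPerSec
                + " pages/sec; at most " + hostsReady + " of "
                + threads + " threads can ever be fetching");
    }
}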


Configuration:

   Core Duo 2.4 GHz, 1 GB RAM, 750 GB hard drive.

   The ethernet connection is a dedicated 1 Gb/s link to the web, so 
certainly that isn't the problem.


   I have tested on nutch 0.9 and the newest daily build from 2007-08-28.

   I seeded with urls from the opendirectory, 100,000 of them.  I first ran 
a pass to load all 100,000, then took topN=1million (10 times larger than 
the first set of urls).  The first pass had no problem; the second pass (and 
beyond) is where the problems began.






