Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Florent Gluck
Hi Mike,

I finally got everything working properly!
What I did was to switch to /protocol-http/ and move the following from
/nutch-site.xml/ to /mapred-default.xml/:

<property>
  <name>mapred.map.tasks</name>
  <value>100</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is local.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>40</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is local.
  </description>
</property>

I then injected 100'000 urls and grepped the logs on my 4 slaves to see
whether the sum of all the fetched urls added up to 100'000.  It did :)
In the end, there was no need to comment out line 211 of /Generator.java/.

Hope it helps,
--Flo

Mike Smith wrote:

Hi Florent

Thanks for the inquiry and reply. I did some more tests based on your
suggestion.
Using the old protocol-http, the problem is solved for a single machine. But
when I have datanodes running on two other machines, the problem still
exists, though the number of unfetched pages is less than before. These are
my tests:

Injected URLs: 8
only one machine is a datanode: 7 fetched pages
map tasks: 3
reduce tasks: 3
threads: 250

Injected URLs: 8
3 machines are datanodes; all machines participated in the fetching,
judging by the task tracker logs on the three machines: 2 fetched pages
map tasks: 12
reduce tasks: 6
threads: 250

Injected URLs: 5000
3 machines are datanodes; all machines participated in the fetching,
judging by the task tracker logs on the three machines: 1200 fetched pages
map tasks: 12
reduce tasks: 6
threads: 250

Injected URLs: 1000
3 machines are datanodes; all machines participated in the fetching,
judging by the task tracker logs on the three machines: 240 fetched pages

Injected URLs: 1000
only one machine is a datanode: 800 fetched pages
map tasks: 3
reduce tasks: 3
threads: 250

I also commented out line 211 of Generator.java, but it didn't change the
situation.

I'll try to do some more testing.

Thanks, Mike

On 1/19/06, Doug Cutting [EMAIL PROTECTED] wrote:
  

Florent Gluck wrote:


I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead of protocol-httpclient
With the old protocol I got 5 as expected.
  

There have been a number of complaints about unreliable fetching with
protocol-httpclient, so I've switched the default back to protocol-http.

Doug




  




Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Andrzej Bialecki

Florent Gluck wrote:

Hi Mike,

I finally got everything working properly!
What I did was to switch to /protocol-http/ and move the following from
/nutch-site.xml/ to /mapred-default.xml/:
  


Could you please check (on a smaller sample ;-) ) which of these two 
changes was necessary? The first, the second, or both? I suspect only the 
second change was really needed, i.e. the change in the config files, and 
not the switch from protocol-httpclient to protocol-http ... It would be 
very helpful if you could confirm or deny this.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Florent Gluck
Andrzej,

I ran 2 crawls of 1 pass each, injecting 100'000 urls.
Here is the output of /readdb -stats/ when crawling with /protocol-http/:

060123 162250 TOTAL urls:   119221
060123 162250 avg score:    1.023
060123 162250 max score:    240.666
060123 162250 min score:    1.0
060123 162250 retry 0:      56648
060123 162250 retry 1:      62573
060123 162250 status 1 (DB_unfetched):  89068
060123 162250 status 2 (DB_fetched):    27513
060123 162250 status 3 (DB_gone):       2640

And here is the output when crawling with /protocol-httpclient/:

060123 180243 TOTAL urls:   117451
060123 180243 avg score:    1.021
060123 180243 max score:    194.0
060123 180243 min score:    1.0
060123 180243 retry 0:      52273
060123 180243 retry 1:      65178
060123 180243 status 1 (DB_unfetched):  89670
060123 180243 status 2 (DB_fetched):    26066
060123 180243 status 3 (DB_gone):       1715

Both return more or less the same results (with a difference of ~1.5% in
the number of fetches, which is not surprising on a 100k set).
I checked the logs and in both cases I see exactly 100'000 fetch attempts.
You were right, it actually makes sense that the settings in
/mapred-default.xml/ would affect the local crawl as well, since they
have nothing to do with ndfs.
It therefore seems that /protocol-httpclient/ is reliable enough to be
used (well, at least in my case).
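
For anyone who wants to flip between the two plugins, the protocol is
selected via the /plugin.includes/ property.  Here is a minimal sketch of an
override in /nutch-site.xml/; the other plugin names in the value are only
the usual defaults and may differ in your install, so keep whatever your
nutch-default.xml already lists and swap just the protocol entry:

<property>
  <name>plugin.includes</name>
  <!-- Replace protocol-http with protocol-httpclient here to test the
       httpclient-based plugin instead.  The remaining entries are
       illustrative defaults; keep the ones from your own config. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>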

--Flo

Florent Gluck wrote:

Andrzej Bialecki wrote:

  

Could you please check (on a smaller sample ;-) ) which of these two
changes was necessary? The first, the second, or both? I suspect only the
second change was really needed, i.e. the change in the config files, and
not the switch from protocol-httpclient to protocol-http ... It would be
very helpful if you could confirm or deny this.



Well, I'm pretty sure protocol-httpclient is part of the problem.
Earlier last week, I was trying to figure out what the problem was and I
ran some crawls on a single machine, using the local filesystem.  Here
were my previous observations (from an older message):

I injected 5 urls and got 2315 urls fetched.  I couldn't find any
trace of most of the urls in the logs.
I noticed that if I put a counter at the beginning of the
/while(true)/ loop in the /run/ method of /Fetcher.java/, I don't
end up with 5!
After some poking around, I noticed that if I comment out the line doing
the page fetch, /ProtocolOutput output = protocol.getProtocolOutput(key,
datum);/, then I get 5.
There seems to be something really wrong with that.  It seems to mean
that some threads are dying without notification in the http protocol
code (if that makes any sense).
I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead of protocol-httpclient.
With the old protocol I got 5 as expected.


So to me it seems protocol-httpclient is buggy.  I'll still run a test
with my current config and protocol-httpclient and let you know.
-Flo

  




Re: So many Unfetched Pages using MapReduce

2006-01-22 Thread Mike Smith
Hi Florent

Thanks for the inquiry and reply. I did some more tests based on your
suggestion.
Using the old protocol-http, the problem is solved for a single machine. But
when I have datanodes running on two other machines, the problem still
exists, though the number of unfetched pages is less than before. These are
my tests:

Injected URLs: 8
only one machine is a datanode: 7 fetched pages
map tasks: 3
reduce tasks: 3
threads: 250

Injected URLs: 8
3 machines are datanodes; all machines participated in the fetching,
judging by the task tracker logs on the three machines: 2 fetched pages
map tasks: 12
reduce tasks: 6
threads: 250

Injected URLs: 5000
3 machines are datanodes; all machines participated in the fetching,
judging by the task tracker logs on the three machines: 1200 fetched pages
map tasks: 12
reduce tasks: 6
threads: 250

Injected URLs: 1000
3 machines are datanodes; all machines participated in the fetching,
judging by the task tracker logs on the three machines: 240 fetched pages

Injected URLs: 1000
only one machine is a datanode: 800 fetched pages
map tasks: 3
reduce tasks: 3
threads: 250

I also commented out line 211 of Generator.java, but it didn't change the
situation.

I'll try to do some more testing.

Thanks, Mike

On 1/19/06, Doug Cutting [EMAIL PROTECTED] wrote:

 Florent Gluck wrote:
  I then decided to switch to using the old http protocol plugin:
  protocol-http (in nutch-default.xml) instead of protocol-httpclient
  With the old protocol I got 5 as expected.

 There have been a number of complaints about unreliable fetching with
 protocol-httpclient, so I've switched the default back to protocol-http.

 Doug



Re: So many Unfetched Pages using MapReduce

2006-01-19 Thread Mike Smith
Hi Florent

I did some more tests. Here are the results:

I have 3 machines, P4 with 1 GB of RAM each. All three are datanodes and one
is the namenode. I started from 8 seed urls and tried to see the effect of a
depth-1 crawl with different configurations.

The number of unfetched pages changes with different configurations:

--Configuration 1
Number of map tasks: 3
Number of reduce tasks: 3
Number of fetch threads: 40
Number of threads per host: 2
http.timeout: 10 sec
---
6700 pages fetched

--Configuration 2
Number of map tasks: 12
Number of reduce tasks: 6
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
---
18000 pages fetched

--Configuration 3
Number of map tasks: 40
Number of reduce tasks: 20
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
---
37000 pages fetched

--Configuration 4
Number of map tasks: 100
Number of reduce tasks: 20
Number of fetch threads: 100
Number of threads per host: 20
http.timeout: 10 sec
---
34000 pages fetched


--Configuration 5
Number of map tasks: 50
Number of reduce tasks: 50
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
---
52000 pages fetched

--Configuration 6
Number of map tasks: 50
Number of reduce tasks: 100
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
---
57000 pages fetched

--Configuration 7
Number of map tasks: 50
Number of reduce tasks: 120
Number of fetch threads: 250
Number of threads per host: 20
http.timeout: 20 sec
---
6 pages fetched
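
For anyone reproducing these runs, the knobs above map onto configuration
properties roughly as follows. This is only a sketch assuming the standard
Nutch/Hadoop property names (check your nutch-default.xml for the exact
names in your version); the values shown are Configuration 5's, and note
that http.timeout is given in milliseconds:

<property>
  <name>mapred.map.tasks</name>
  <value>50</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>50</value>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>40</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>100</value>
</property>

<property>
  <name>http.timeout</name>
  <!-- in milliseconds, so 20 sec = 20000 -->
  <value>20000</value>
</property>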



Do you have any idea why pages go missing from the fetcher without any log
messages or exceptions? It seems to really depend on the number of reduce
tasks!
Thanks, Mike



On 1/17/06, Mike Smith [EMAIL PROTECTED] wrote:

 I've experienced the same effect. When I decrease the number of map/reduce
 tasks, I can fetch more web pages, but increasing them increases the number
 of unfetched pages. I also get some java.net.SocketTimeoutException: Read
 timed out exceptions in my datanode log files. But those timeout problems
 couldn't cause this many missing pages!! I agree the problem should be
 somewhere in the fetcher.

 Mike


 On 1/17/06, Florent Gluck [EMAIL PROTECTED] wrote:
 
  I'm having the exact same problem.
  I noticed that changing the number of map/reduce tasks gives me
  different DB_fetched results.
  Looking at the logs, a lot of urls are actually missing.  I can't find
  their trace *anywhere* in the logs (whether on the slaves or the
  master).  I'm puzzled.  Currently I'm trying to debug the code to see
  what's going on.
  So far, I noticed the generator is fine, so the issue must lie further
  in the pipeline (fetcher?).
 
  Let me know if you find anything regarding this issue. Thanks.
 
  --Flo
 
  Mike Smith wrote:
 
  Hi,
  
  I have set up four boxes using MapReduce and everything goes smoothly. I
  fed it about 8 seed urls to begin with and crawled to depth 2.
  Only 1900 pages (about 300 MB) of data were fetched and the rest are
  marked as DB_unfetched.
  Does anyone know what could be wrong?
  
  This is the output of (bin/nutch readdb h2/crawldb -stats):
  
  060115 171625 Statistics for CrawlDb: h2/crawldb
  060115 171625 TOTAL urls:   99403
  060115 171625 avg score:    1.01
  060115 171625 max score:    7.382
  060115 171625 min score:    1.0
  060115 171625 retry 0:      99403
  060115 171625 status 1 (DB_unfetched):  97470
  060115 171625 status 2 (DB_fetched):    1933
  060115 171625 CrawlDb statistics: done
  
  Thanks,
  Mike
  
  
  
 
 



Re: So many Unfetched Pages using MapReduce

2006-01-19 Thread Florent Gluck
Hi Mike,

Your different tests are really interesting, thanks for sharing!
I didn't do as many tests. I changed the number of fetch threads and the
number of map and reduce tasks and noticed that it gave me quite
different results in terms of pages fetched.
Then I wanted to see if this issue would still happen when running the
crawl (single pass) on a single machine, running everything locally
without ndfs.
So I injected 5 urls and got 2315 urls fetched.  I couldn't find any
trace of most of the urls in the logs.
I noticed that if I put a counter at the beginning of the
/while(true)/ loop in the /run/ method of /Fetcher.java/, I don't
end up with 5!
After some poking around, I noticed that if I comment out the line doing
the page fetch, /ProtocolOutput output = protocol.getProtocolOutput(key,
datum);/, then I get 5.
There seems to be something really wrong with that.  It seems to mean
that some threads are dying without notification in the http protocol
code (if that makes any sense).
I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead of protocol-httpclient.
With the old protocol I got 5 as expected.

The following bug seems to be very similar to what we are encountering:
http://issues.apache.org/jira/browse/NUTCH-136
Check out the latest comment.  I'm gonna remove line 211 and run some
tests to see how it behaves (with protocol-http and protocol-httpclient).

I'll let you know what I find out,
--Florent

Mike Smith wrote:

Hi Florent

I did some more tests. Here are the results:

I have 3 machines, P4 with 1 GB of RAM each. All three are datanodes and one
is the namenode. I started from 8 seed urls and tried to see the effect of a
depth-1 crawl with different configurations.

The number of unfetched pages changes with different configurations:

--Configuration 1
Number of map tasks: 3
Number of reduce tasks: 3
Number of fetch threads: 40
Number of threads per host: 2
http.timeout: 10 sec
---
6700 pages fetched

--Configuration 2
Number of map tasks: 12
Number of reduce tasks: 6
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
---
18000 pages fetched

--Configuration 3
Number of map tasks: 40
Number of reduce tasks: 20
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
---
37000 pages fetched

--Configuration 4
Number of map tasks: 100
Number of reduce tasks: 20
Number of fetch threads: 100
Number of threads per host: 20
http.timeout: 10 sec
---
34000 pages fetched


--Configuration 5
Number of map tasks: 50
Number of reduce tasks: 50
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
---
52000 pages fetched

--Configuration 6
Number of map tasks: 50
Number of reduce tasks: 100
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
---
57000 pages fetched

--Configuration 7
Number of map tasks: 50
Number of reduce tasks: 120
Number of fetch threads: 250
Number of threads per host: 20
http.timeout: 20 sec
---
6 pages fetched



Do you have any idea why pages go missing from the fetcher without any log
messages or exceptions? It seems to really depend on the number of reduce
tasks!
Thanks, Mike
  




Re: So many Unfetched Pages using MapReduce

2006-01-19 Thread Doug Cutting

Florent Gluck wrote:

I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead of protocol-httpclient
With the old protocol I got 5 as expected.


There have been a number of complaints about unreliable fetching with 
protocol-httpclient, so I've switched the default back to protocol-http.


Doug


Re: So many Unfetched Pages using MapReduce

2006-01-17 Thread Florent Gluck
I'm having the exact same problem.
I noticed that changing the number of map/reduce tasks gives me
different DB_fetched results.
Looking at the logs, a lot of urls are actually missing.  I can't find
their trace *anywhere* in the logs (whether on the slaves or the
master).  I'm puzzled.  Currently I'm trying to debug the code to see
what's going on.
So far, I noticed the generator is fine, so the issue must lie further
in the pipeline (fetcher?).

Let me know if you find anything regarding this issue. Thanks.

--Flo

Mike Smith wrote:

Hi,

I have set up four boxes using MapReduce and everything goes smoothly. I
fed it about 8 seed urls to begin with and crawled to depth 2.
Only 1900 pages (about 300 MB) of data were fetched and the rest are marked
as DB_unfetched.
Does anyone know what could be wrong?

This is the output of (bin/nutch readdb h2/crawldb -stats):

060115 171625 Statistics for CrawlDb: h2/crawldb
060115 171625 TOTAL urls:   99403
060115 171625 avg score:    1.01
060115 171625 max score:    7.382
060115 171625 min score:    1.0
060115 171625 retry 0:      99403
060115 171625 status 1 (DB_unfetched):  97470
060115 171625 status 2 (DB_fetched):    1933
060115 171625 CrawlDb statistics: done

Thanks,
Mike

  




So many Unfetched Pages using MapReduce

2006-01-15 Thread Mike Smith
Hi,

I have set up four boxes using MapReduce and everything goes smoothly. I
fed it about 8 seed urls to begin with and crawled to depth 2.
Only 1900 pages (about 300 MB) of data were fetched and the rest are marked
as DB_unfetched.
Does anyone know what could be wrong?

This is the output of (bin/nutch readdb h2/crawldb -stats):

060115 171625 Statistics for CrawlDb: h2/crawldb
060115 171625 TOTAL urls:   99403
060115 171625 avg score:    1.01
060115 171625 max score:    7.382
060115 171625 min score:    1.0
060115 171625 retry 0:      99403
060115 171625 status 1 (DB_unfetched):  97470
060115 171625 status 2 (DB_fetched):    1933
060115 171625 CrawlDb statistics: done

Thanks,
Mike