Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Andrzej Bialecki
Florent Gluck wrote: Both return more or less the same results (w/ a difference of ~1.5% in the #fetches which is not surprising on a 100k set). I checked the logs and in the 2 cases, I see exactly 100'000 fetch attempts. You were right, it actually makes sense that the settings in /mapred-defaul

Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Florent Gluck
Andrzej, I ran 2 crawls of 1 pass each, injecting 100'000 urls. Here is the output of /readdb -stats/ when crawling with /protocol-http/: 060123 162250 TOTAL urls: 119221 060123 162250 avg score:1.023 060123 162250 max score:240.666 060123 162250 min score:1.0 060123

Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Florent Gluck
Andrzej Bialecki wrote: > Florent Gluck wrote: > >> Hi Mike, >> >> I finally got everything working properly! >> What I did was to switch to /protocol-http/ and move the following from >> /nutch-site.xml/ to /mapred-default.xml/: >> > > > Could you please check (on a smaller sample ;-) ) which

Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Andrzej Bialecki
Florent Gluck wrote: Hi Mike, I finally got everything working properly! What I did was to switch to /protocol-http/ and move the following from /nutch-site.xml/ to /mapred-default.xml/: Could you please check (on a smaller sample ;-) ) which of these two changes was necessary? First, seco

Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Florent Gluck
Hi Mike, I finally got everything working properly! What I did was to switch to /protocol-http/ and move the following from /nutch-site.xml/ to /mapred-default.xml/: / mapred.map.tasks 100 The default number of map tasks per job. Typically set to a prime several times greater than number
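
For readers unfamiliar with the format, the flattened setting quoted above would look roughly like this as a property entry in /mapred-default.xml/. This is only a sketch: the enclosing root element and the description text are left out because the excerpt is truncated, and only the name and value shown in the message are used.

```xml
<!-- Sketch of the property moved from nutch-site.xml to mapred-default.xml.
     The surrounding root element of the file is omitted (assumption: it is
     whatever the existing file already uses). -->
<property>
  <name>mapred.map.tasks</name>
  <value>100</value>
</property>
```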

Re: So many Unfetched Pages using MapReduce

2006-01-22 Thread Mike Smith
Hi Florent, Thanks for the inquiry and reply. I did some more tests based on your suggestion. Using the old protocol-http, the problem is solved for a single machine. But when I have datanodes running on two other machines, the problem still exists, although the number of unfetched pages is lower than before.

Re: So many Unfetched Pages using MapReduce

2006-01-19 Thread Doug Cutting
Florent Gluck wrote: I then decided to switch to using the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient With the old protocol I got 5 as expected. There have been a number of complaints about unreliable fetching with protocol-httpclient, so

Re: So many Unfetched Pages using MapReduce

2006-01-19 Thread Florent Gluck
Hi Mike, Your different tests are really interesting, thanks for sharing! I didn't do as many tests. I changed the number of fetch threads and the number of map and reduce tasks and noticed that it gave me quite different results in terms of pages fetched. Then, I wanted to see if this issue woul

Re: So many Unfetched Pages using MapReduce

2006-01-19 Thread Mike Smith
Hi Florent, I did some more testing. Here are the results: I have 3 machines, P4 with 1 GB RAM. All three are data nodes and one is the namenode. I started from 8 seed urls and tried to see the effect of a depth 1 crawl for different configurations. The number of unfetched pages changes with different configu

Re: So many Unfetched Pages using MapReduce

2006-01-17 Thread Mike Smith
I've experienced the same effect. When I decrease the number of map/reduce tasks, I can fetch more web pages, but increasing it increases the number of unfetched pages. I also get some "java.net.SocketTimeoutException: Read timed out" exceptions in my datanode log files. But those timeout problems couldn't cause

Re: So many Unfetched Pages using MapReduce

2006-01-17 Thread Florent Gluck
I'm having the exact same problem. I noticed that changing the number of map/reduce tasks gives me different DB_fetched results. Looking at the logs, a lot of urls are actually missing. I can't find their trace *anywhere* in the logs (whether on the slaves or the master). I'm puzzled. Currently

So many Unfetched Pages using MapReduce

2006-01-15 Thread Mike Smith
Hi, I have set up four boxes using MapReduce and everything goes smoothly. I fed in about 8 seed nodes to begin with and crawled to depth 2. Only 1900 pages (about 300 MB of data) were fetched and the rest are marked as db unfetched. Does anyone know what could be wrong? This is the output of (bin/nutc