Re: So many Unfetched Pages using MapReduce
Hi Mike,

I finally got everything working properly! What I did was to switch to protocol-http and move the following from nutch-site.xml to mapred-default.xml:

<property>
  <name>mapred.map.tasks</name>
  <value>100</value>
  <description>The default number of map tasks per job. Typically set to a prime several times greater than the number of available hosts. Ignored when mapred.job.tracker is local.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>40</value>
  <description>The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is local.</description>
</property>

I then injected 100'000 urls and grepped the logs on my 4 slaves to see if the sum of all the fetched urls adds up to 100'000. It did :) (A rough sketch of how that tally can be done is at the end of this message.) In the end there was no need to comment out line 211 of Generator.java.

Hope it helps,
--Flo

Mike Smith wrote:

Hi Florent,

Thanks for the inquiry and reply. I did some more tests based on your suggestion. Using the old protocol-http, the problem is solved for a single machine. But when I have datanodes running on two other machines, the problem still exists, although the number of unfetched pages is smaller than before. These are my tests:

Injected URLs: 8, only one machine is a datanode: 7 fetched pages. Map tasks: 3, reduce tasks: 3, threads: 250.

Injected URLs: 8, 3 machines are datanodes and all of them participated in the fetching (per the task tracker logs on the three machines): 2 fetched pages. Map tasks: 12, reduce tasks: 6, threads: 250.

Injected URLs: 5000, 3 machines are datanodes and all of them participated in the fetching: 1200 fetched pages. Map tasks: 12, reduce tasks: 6, threads: 250.

Injected URLs: 1000, 3 machines are datanodes and all of them participated in the fetching: 240 fetched pages.

Injected URLs: 1000, only one machine is a datanode: 800 fetched pages. Map tasks: 3, reduce tasks: 3, threads: 250.

I also commented out line 211 of Generator.java, but it didn't change the situation. I'll try to do some more testing.

Thanks, Mike

On 1/19/06, Doug Cutting [EMAIL PROTECTED] wrote:

Florent Gluck wrote: I then decided to switch to using the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient. With the old protocol I got 5 as expected.

There have been a number of complaints about unreliable fetching with protocol-httpclient, so I've switched the default back to protocol-http.

Doug
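A rough sketch of the tally mentioned above: it just sums, over whatever log files you pass in, the lines that record a fetch, and compares the total against the number of injected urls. The "fetching " substring is an assumption about the Fetcher's log message format and may differ between Nutch versions, so adjust it to whatever your logs actually print.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CountFetches {
  public static void main(String[] args) throws IOException {
    long total = 0;
    // Pass the collected slave task logs as command-line arguments.
    for (String logFile : args) {
      long count = 0;
      try (BufferedReader in = new BufferedReader(new FileReader(logFile))) {
        String line;
        while ((line = in.readLine()) != null) {
          if (line.contains("fetching ")) { // assumed per-URL fetch log message
            count++;
          }
        }
      }
      System.out.println(logFile + ": " + count);
      total += count;
    }
    // This total should match the number of injected urls.
    System.out.println("total fetch attempts across all logs: " + total);
  }
}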
Re: So many Unfetched Pages using MapReduce
Florent Gluck wrote:

Hi Mike, I finally got everything working properly! What I did was to switch to protocol-http and move the following from nutch-site.xml to mapred-default.xml:

Could you please check (on a smaller sample ;-) ) which of these two changes was necessary? First, second, or both? I suspect only the second change was really needed, i.e. the change in config files, and not the change of protocol-httpclient -> protocol-http ... It would be very helpful if you could confirm/deny this.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: So many Unfetched Pages using MapReduce
Andrzej,

I ran 2 crawls of 1 pass each, injecting 100'000 urls.

Here is the output of readdb -stats when crawling with protocol-http:

060123 162250 TOTAL urls: 119221
060123 162250 avg score: 1.023
060123 162250 max score: 240.666
060123 162250 min score: 1.0
060123 162250 retry 0: 56648
060123 162250 retry 1: 62573
060123 162250 status 1 (DB_unfetched): 89068
060123 162250 status 2 (DB_fetched): 27513
060123 162250 status 3 (DB_gone): 2640

And here is the output when crawling with protocol-httpclient:

060123 180243 TOTAL urls: 117451
060123 180243 avg score: 1.021
060123 180243 max score: 194.0
060123 180243 min score: 1.0
060123 180243 retry 0: 52273
060123 180243 retry 1: 65178
060123 180243 status 1 (DB_unfetched): 89670
060123 180243 status 2 (DB_fetched): 26066
060123 180243 status 3 (DB_gone): 1715

Both return more or less the same results (with a difference of ~1.5% in the number of fetches, which is not surprising on a 100k set). I checked the logs and in both cases I see exactly 100'000 fetch attempts.

You were right; it actually makes sense that the settings in mapred-default.xml would affect the local crawl as well, since they have nothing to do with NDFS. It therefore seems that protocol-httpclient is reliable enough to be used (well, at least in my case).

--Flo

Florent Gluck wrote:

Andrzej Bialecki wrote: Could you please check (on a smaller sample ;-) ) which of these two changes was necessary? First, second, or both? I suspect only the second change was really needed, i.e. the change in config files, and not the change of protocol-httpclient -> protocol-http ... It would be very helpful if you could confirm/deny this.

Well, I'm pretty sure protocol-httpclient is part of the problem. Earlier last week, I was trying to figure out what the problem was and I ran some crawls on a single machine, using the local filesystem. Here were my previous observations (from an older message):

I injected 5 urls and got 2315 urls fetched. I couldn't find a trace of most of the urls in the logs. I noticed that if I put a counter at the beginning of the while(true) loop in the run method of Fetcher.java, I don't end up with 5! After some poking around, I noticed that if I comment out the line doing the page fetch, ProtocolOutput output = protocol.getProtocolOutput(key, datum);, then I get 5. There seems to be something really wrong with that. It seems to mean that some threads are dying without notification in the http protocol code (if it makes any sense). I then decided to switch to using the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient. With the old protocol I got 5 as expected.

So to me it seems protocol-httpclient is buggy. I'll still run a test with my current config and protocol-httpclient and let you know.

-Flo
Re: So many Unfetched Pages using MapReduce
Hi Florent,

Thanks for the inquiry and reply. I did some more tests based on your suggestion. Using the old protocol-http, the problem is solved for a single machine. But when I have datanodes running on two other machines, the problem still exists, although the number of unfetched pages is smaller than before. These are my tests:

Injected URLs: 8, only one machine is a datanode: 7 fetched pages. Map tasks: 3, reduce tasks: 3, threads: 250.

Injected URLs: 8, 3 machines are datanodes and all of them participated in the fetching (per the task tracker logs on the three machines): 2 fetched pages. Map tasks: 12, reduce tasks: 6, threads: 250.

Injected URLs: 5000, 3 machines are datanodes and all of them participated in the fetching: 1200 fetched pages. Map tasks: 12, reduce tasks: 6, threads: 250.

Injected URLs: 1000, 3 machines are datanodes and all of them participated in the fetching: 240 fetched pages.

Injected URLs: 1000, only one machine is a datanode: 800 fetched pages. Map tasks: 3, reduce tasks: 3, threads: 250.

I also commented out line 211 of Generator.java, but it didn't change the situation. I'll try to do some more testing.

Thanks, Mike

On 1/19/06, Doug Cutting [EMAIL PROTECTED] wrote:

Florent Gluck wrote: I then decided to switch to using the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient. With the old protocol I got 5 as expected.

There have been a number of complaints about unreliable fetching with protocol-httpclient, so I've switched the default back to protocol-http.

Doug
Re: So many Unfetched Pages using MapReduce
Hi Florent,

I did some more tests. Here are the results. I have 3 machines, P4 with 1 GB of RAM each. All three are datanodes and one is the namenode. I started from 8 seed urls and tried to see the effect of a depth 1 crawl for different configurations. The number of unfetched pages changes with different configurations:

--Configuration 1
Number of map tasks: 3
Number of reduce tasks: 3
Number of fetch threads: 40
Number of threads per host: 2
http.timeout: 10 sec
---> 6700 pages fetched

--Configuration 2
Number of map tasks: 12
Number of reduce tasks: 6
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
---> 18000 pages fetched

--Configuration 3
Number of map tasks: 40
Number of reduce tasks: 20
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
---> 37000 pages fetched

--Configuration 4
Number of map tasks: 100
Number of reduce tasks: 20
Number of fetch threads: 100
Number of threads per host: 20
http.timeout: 10 sec
---> 34000 pages fetched

--Configuration 5
Number of map tasks: 50
Number of reduce tasks: 50
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
---> 52000 pages fetched

--Configuration 6
Number of map tasks: 50
Number of reduce tasks: 100
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
---> 57000 pages fetched

--Configuration 7
Number of map tasks: 50
Number of reduce tasks: 120
Number of fetch threads: 250
Number of threads per host: 20
http.timeout: 20 sec
---> 6 pages fetched

Do you have any idea why pages go missing from the fetcher without any log entries or exceptions? It seems to really depend on the number of reduce tasks!

Thanks, Mike

On 1/17/06, Mike Smith [EMAIL PROTECTED] wrote:

I've experienced the same effect. When I decrease the number of map/reduce tasks, I can fetch more web pages, but increasing them increases the number of unfetched pages. I also get some java.net.SocketTimeoutException: Read timed out exceptions in my datanode log files, but those timeouts couldn't cause this many missing pages!! I agree the problem should be somewhere in the fetcher.

Mike

On 1/17/06, Florent Gluck [EMAIL PROTECTED] wrote:

I'm having the exact same problem. I noticed that changing the number of map/reduce tasks gives me different DB_fetched results. Looking at the logs, a lot of urls are actually missing. I can't find their trace *anywhere* in the logs (whether on the slaves or the master). I'm puzzled. Currently I'm trying to debug the code to see what's going on. So far, I noticed the generator is fine, so the issue must lie further down the pipeline (fetcher?). Let me know if you find anything regarding this issue. Thanks.

--Flo

Mike Smith wrote:

Hi, I have set up four boxes using MapReduce and everything goes smoothly. I fed in about 8 seed urls to begin with and crawled to depth 2. Only 1900 pages (about 300 MB of data) were fetched and the rest are marked in the db as unfetched. Does anyone know what could be wrong? This is the output of (bin/nutch readdb h2/crawldb -stats):

060115 171625 Statistics for CrawlDb: h2/crawldb
060115 171625 TOTAL urls: 99403
060115 171625 avg score: 1.01
060115 171625 max score: 7.382
060115 171625 min score: 1.0
060115 171625 retry 0: 99403
060115 171625 status 1 (DB_unfetched): 97470
060115 171625 status 2 (DB_fetched): 1933
060115 171625 CrawlDb statistics: done

Thanks, Mike
Re: So many Unfetched Pages using MapReduce
Hi Mike,

Your different tests are really interesting, thanks for sharing! I didn't do as many tests. I changed the number of fetch threads and the number of map and reduce tasks and noticed that it gave me quite different results in terms of pages fetched.

Then, I wanted to see if this issue would still happen when running the crawl (single pass) on one single machine running everything locally, without NDFS. So I injected 5 urls and got 2315 urls fetched. I couldn't find a trace of most of the urls in the logs. I noticed that if I put a counter at the beginning of the while(true) loop in the run method of Fetcher.java, I don't end up with 5! After some poking around, I noticed that if I comment out the line doing the page fetch, ProtocolOutput output = protocol.getProtocolOutput(key, datum);, then I get 5. There seems to be something really wrong with that. It seems to mean that some threads are dying without notification in the http protocol code (if it makes any sense). (A small standalone sketch of this counter trick is at the end of this message.) I then decided to switch to using the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient. With the old protocol I got 5 as expected.

The following bug seems to be very similar to what we are encountering:
http://issues.apache.org/jira/browse/NUTCH-136
Check out the latest comment. I'm going to remove line 211 and run some tests to see how it behaves (with protocol-http and protocol-httpclient).

I'll let you know what I find out,
--Florent

Mike Smith wrote:

Hi Florent,

I did some more tests. Here are the results. I have 3 machines, P4 with 1 GB of RAM each. All three are datanodes and one is the namenode. I started from 8 seed urls and tried to see the effect of a depth 1 crawl for different configurations. The number of unfetched pages changes with different configurations:

--Configuration 1
Number of map tasks: 3
Number of reduce tasks: 3
Number of fetch threads: 40
Number of threads per host: 2
http.timeout: 10 sec
---> 6700 pages fetched

--Configuration 2
Number of map tasks: 12
Number of reduce tasks: 6
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
---> 18000 pages fetched

--Configuration 3
Number of map tasks: 40
Number of reduce tasks: 20
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
---> 37000 pages fetched

--Configuration 4
Number of map tasks: 100
Number of reduce tasks: 20
Number of fetch threads: 100
Number of threads per host: 20
http.timeout: 10 sec
---> 34000 pages fetched

--Configuration 5
Number of map tasks: 50
Number of reduce tasks: 50
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
---> 52000 pages fetched

--Configuration 6
Number of map tasks: 50
Number of reduce tasks: 100
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
---> 57000 pages fetched

--Configuration 7
Number of map tasks: 50
Number of reduce tasks: 120
Number of fetch threads: 250
Number of threads per host: 20
http.timeout: 20 sec
---> 6 pages fetched

Do you have any idea why pages go missing from the fetcher without any log entries or exceptions? It seems to really depend on the number of reduce tasks!

Thanks, Mike
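A standalone sketch of the counter trick. This is not the actual Nutch Fetcher code, just an illustration of the same idea: FetchLoopCounterDemo and fetchOne() are made-up names, with fetchOne() standing in for protocol.getProtocolOutput(key, datum). Each worker thread bumps a shared counter at the top of its fetch loop; if a thread dies on an uncaught exception inside the loop, the counter stops short of the number of queued urls and those urls are simply never fetched, with nothing in the logs.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class FetchLoopCounterDemo {
  // Shared counter, incremented at the top of every loop iteration.
  static final AtomicInteger loopEntries = new AtomicInteger(0);

  // Hypothetical stand-in for protocol.getProtocolOutput(key, datum):
  // a fetch that occasionally throws, like a flaky protocol plugin.
  static void fetchOne(String url) {
    if (url.hashCode() % 7 == 0) {
      throw new RuntimeException("simulated failure for " + url);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Queue<String> queue = new ConcurrentLinkedQueue<>();
    for (int i = 0; i < 1000; i++) {
      queue.add("http://example.com/page" + i);
    }

    Thread[] threads = new Thread[10];
    for (int t = 0; t < threads.length; t++) {
      threads[t] = new Thread(() -> {
        while (true) {
          loopEntries.incrementAndGet(); // the counter at the top of the loop
          String url = queue.poll();
          if (url == null) return;       // queue drained: normal thread exit
          try {
            fetchOne(url);
          } catch (Throwable e) {
            // Without this catch the thread would die here, silently, and every
            // url it would have taken is left unfetched, which is exactly the
            // symptom described above.
          }
        }
      });
      threads[t].start();
    }
    for (Thread t : threads) {
      t.join();
    }

    // With the catch in place the counter reaches at least 1000; if threads
    // were dying unnoticed it would stop well short of the injected count.
    System.out.println("loop entries: " + loopEntries.get() + " (urls queued: 1000)");
  }
}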
Re: So many Unfetched Pages using MapReduce
Florent Gluck wrote: I then decided to switch to using the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient. With the old protocol I got 5 as expected.

There have been a number of complaints about unreliable fetching with protocol-httpclient, so I've switched the default back to protocol-http.

Doug
Re: So many Unfetched Pages using MapReduce
I'm having the exact same problem. I noticed that changing the number of map/reduce tasks gives me different DB_fetched results. Looking at the logs, a lot of urls are actually missing. I can't find their trace *anywhere* in the logs (whether on the slaves or the master). I'm puzzled. Currently I'm trying to debug the code to see what's going on. So far, I noticed the generator is fine, so the issue must lie further down the pipeline (fetcher?). Let me know if you find anything regarding this issue. Thanks.

--Flo

Mike Smith wrote:

Hi, I have set up four boxes using MapReduce and everything goes smoothly. I fed in about 8 seed urls to begin with and crawled to depth 2. Only 1900 pages (about 300 MB of data) were fetched and the rest are marked in the db as unfetched. Does anyone know what could be wrong? This is the output of (bin/nutch readdb h2/crawldb -stats):

060115 171625 Statistics for CrawlDb: h2/crawldb
060115 171625 TOTAL urls: 99403
060115 171625 avg score: 1.01
060115 171625 max score: 7.382
060115 171625 min score: 1.0
060115 171625 retry 0: 99403
060115 171625 status 1 (DB_unfetched): 97470
060115 171625 status 2 (DB_fetched): 1933
060115 171625 CrawlDb statistics: done

Thanks, Mike
So many Unfetched Pages using MapReduce
Hi,

I have set up four boxes using MapReduce and everything goes smoothly. I fed in about 8 seed urls to begin with and crawled to depth 2. Only 1900 pages (about 300 MB of data) were fetched and the rest are marked in the db as unfetched. Does anyone know what could be wrong? This is the output of (bin/nutch readdb h2/crawldb -stats):

060115 171625 Statistics for CrawlDb: h2/crawldb
060115 171625 TOTAL urls: 99403
060115 171625 avg score: 1.01
060115 171625 max score: 7.382
060115 171625 min score: 1.0
060115 171625 retry 0: 99403
060115 171625 status 1 (DB_unfetched): 97470
060115 171625 status 2 (DB_fetched): 1933
060115 171625 CrawlDb statistics: done

Thanks, Mike