Re: readseg bug?
Thank you for the explanation. It was a bit confusing at first, but it actually makes sense.

Florent

Doğacan Güney wrote:

Hi,

On 5/17/07, Florent Gluck [EMAIL PROTECTED] wrote:

Hi all,

I've noticed that when doing a segment dump using readseg, several instances of the same CrawlDatum can be present in a given record. For example, I have a segment with one single url (http://www.moma.org); the dump is below. I ran the following command:

nutch readseg -dump segments/20070517113941 segdump -nocontent -noparsedata -noparsetext

With this command, readseg reads from crawl_{fetch,generate,parse}. Here is the first record:

Recno:: 0
URL:: http://www.moma.org/

CrawlDatum:: Version: 5 Status: 1 (db_unfetched) Fetch time: Thu May 17 11:39:34 EDT 2007 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 30.0 days Score: 1.0 Signature: null Metadata: _ngt_:1179416381663

This one is from crawl_generate; you can see that it contains a _ngt_ field. This datum is read by the fetcher.

CrawlDatum:: Version: 5 Status: 65 (signature) Fetch time: Thu May 17 11:39:51 EDT 2007 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 0.0 days Score: 1.0 Signature: fe47b3db7c988541287fc6412ce0b923 Metadata: null

This one is from crawl_parse. It contains the signature of the parse text, which is used to dedup after indexing.

CrawlDatum:: Version: 5 Status: 33 (fetch_success) Fetch time: Thu May 17 11:39:49 EDT 2007 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 30.0 days Score: 1.0 Signature: fe47b3db7c988541287fc6412ce0b923 Metadata: _ngt_:1179416381663 _pst_:success(1), lastModified=0

This is from crawl_fetch.

Why are there 3 CrawlDatum fields? I assumed there would be only one CrawlDatum with status 33 (fetch_success). What is the purpose of the other two?

Now, here is the 5th record:

Recno:: 5
URL:: http://www.moma.org/application/x-shockwave-flash

CrawlDatum:: Version: 5 Status: 67 (linked) Fetch time: Thu May 17 11:39:51 EDT 2007 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 30.0 days Score: 0.03846154 Signature: null Metadata: null

CrawlDatum:: Version: 5 Status: 67 (linked) Fetch time: Thu May 17 11:39:51 EDT 2007 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 30.0 days Score: 0.03846154 Signature: null Metadata: null

CrawlDatum:: Version: 5 Status: 67 (linked) Fetch time: Thu May 17 11:39:51 EDT 2007 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 30.0 days Score: 0.03846154 Signature: null Metadata: null

CrawlDatum:: Version: 5 Status: 67 (linked) Fetch time: Thu May 17 11:39:51 EDT 2007 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 30.0 days Score: 0.03846154 Signature: null Metadata: null

CrawlDatum:: Version: 5 Status: 67 (linked) Fetch time: Thu May 17 11:39:51 EDT 2007 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 30.0 days Score: 0.03846154 Signature: null Metadata: null

CrawlDatum:: Version: 5 Status: 67 (linked) Fetch time: Thu May 17 11:39:51 EDT 2007 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 30.0 days Score: 0.03846154 Signature: null Metadata: null

In this case, a linked status indicates an outlink. Most likely your url (http://www.moma.org) contains six distinct outlinks to http://www.moma.org/application/x-shockwave-flash. Each of them is put as a separate entry into crawl_parse. This is used in updatedb to (among other things) calculate the score.

There are 6 CrawlDatum fields and all of them are exactly identical. Is this a bug or am I missing something here? Any light on this matter would be greatly appreciated.

Thank you,
Florent
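(A side note for anyone replaying this: depending on the SegmentReader version, the -dump mode also understands flags along the lines of -nofetch and -nogenerate in addition to the ones used above, so a command roughly like the one below, flag names permitting, limits the dump to the crawl_parse entries only.)

  nutch readseg -dump segments/20070517113941 segdump \
      -nocontent -noparsedata -noparsetext -nofetch -nogenerate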
Re: Buggy fetchlist' urls
Hi Andrzej,

Well, I think for now I'll just disable the parse-js plugin since I don't really need it anyway. I'll let you know if I ever work on it (I may need it in the future).

Thanks,
--Flo

Andrzej Bialecki wrote:

Florent Gluck wrote:

Some urls are totally bogus. I didn't investigate what could be causing this yet, but it looks like it could be a parsing issue. Some urls contain some javascript code and others contain some html tags.

This is a side-effect of our primitive parse-js, which doesn't really parse anything, just uses some heuristic to extract possible URLs. Unfortunately, often as not the strings it extracts don't have anything to do with URLs. If you have suggestions on how to improve it I'm all ears.
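(For anyone wanting to do the same: disabling parse-js is normally just a matter of taking it out of the plugin.includes regex in nutch-site.xml. A sketch only; the value below is illustrative and your actual plugin list will differ.)

  <property>
    <name>plugin.includes</name>
    <!-- your existing list, with parse-js removed; illustrative value only -->
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>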
Buggy fetchlist' urls
Hi,

I'm using nutch revision 385671 from the trunk. I'm running it on a single machine using the local filesystem. I just started with a seed of one single url: http://www.osnews.com

Then I ran a crawl cycle of depth 2 (generate/fetch/updatedb) and dumped the crawl db. Here is where I got quite surprised:

[EMAIL PROTECTED]:~/tmp$ nutch readdb crawldb -dump dump
[EMAIL PROTECTED]:~/tmp$ grep ^http dump/part-0
http://a.ads.t-online.de/ Version: 4
http://a.as-eu.falkag.net/ Version: 4
http://a.as-rh4.falkag.net/ Version: 4
http://a.as-rh4.falkag.net/server/asldata.js Version: 4
http://a.as-test.falkag.net/ Version: 4
http://a.as-us.falkag.net/ Version: 4
http://a.as-us.falkag.net/dat/bfx/ Version: 4
http://a.as-us.falkag.net/dat/bgf/ Version: 4
http://a.as-us.falkag.net/dat/bgf/trpix.gif; Version: 4
http://a.as-us.falkag.net/dat/bjf/ Version: 4
http://a.as-us.falkag.net/dat/brf/ Version: 4
http://a.as-us.falkag.net/dat/cjf/ Version: 4
http://a.as-us.falkag.net/dat/cjf/00/13/60/94.js Version: 4
http://a.as-us.falkag.net/dat/cjf/00/13/60/96.js Version: 4
http://a.as-us.falkag.net/dat/dlv/);QQt.document.write( Version: 4
http://a.as-us.falkag.net/dat/dlv/);document.write( Version: 4
http://a.as-us.falkag.net/dat/dlv/+((QQPc-QQwA)/1000)+ Version: 4
http://a.as-us.falkag.net/dat/dlv/.ads.t-online.de Version: 4
http://a.as-us.falkag.net/dat/dlv/.as-eu.falkag.net Version: 4
http://a.as-us.falkag.net/dat/dlv/.as-rh4.falkag.net Version: 4
http://a.as-us.falkag.net/dat/dlv/.as-us.falkag.net Version: 4
http://a.as-us.falkag.net/dat/dlv/:// Version: 4
http://a.as-us.falkag.net/dat/dlv//bbr Version: 4
http://a.as-us.falkag.net/dat/dlv//big/bbr Version: 4
http://a.as-us.falkag.net/dat/dlv//center/td/tr/table/body/html Version: 4
http://a.as-us.falkag.net/dat/dlv//div Version: 4
http://a.as-us.falkag.net/dat/dlv/Banner-Typ/PopUp Version: 4
http://a.as-us.falkag.net/dat/dlv/ShockwaveFlash.ShockwaveFlash. Version: 4
http://a.as-us.falkag.net/dat/dlv/afxplay.js Version: 4
http://a.as-us.falkag.net/dat/dlv/application/x-shockwave-flash Version: 4
http://a.as-us.falkag.net/dat/dlv/aslmain.js Version: 4
http://a.as-us.falkag.net/dat/dlv/text/javascript Version: 4
http://a.as-us.falkag.net/dat/dlv/window.blur(); Version: 4
http://a.as-us.falkag.net/dat/njf/ Version: 4
http://bilbo.counted.com/0/42699/ Version: 4
http://bilbo.counted.com/7/42699/ Version: 4
http://bw.ads.t-online.de/ Version: 4
http://bw.as-eu.falkag.net/ Version: 4
http://bw.as-us.falkag.net/ Version: 4
http://data.as-us.falkag.net/server/asldata.js Version: 4
http://denux.org/ Version: 4
...

Some urls are totally bogus. I didn't investigate what could be causing this yet, but it looks like it could be a parsing issue. Some urls contain some javascript code and others contain some html tags. Is anyone aware of this? I can open a bug if needed.

Thanks,
--Flo
Re: Error while indexing (mapred)
Chris,

I bumped the maximum number of open file descriptors to 32k, but still no luck:

...
060214 062901 reduce 9%
060214 062905 reduce 10%
060214 062908 reduce 11%
060214 062911 reduce 12%
060214 062914 reduce 11%
060214 062917 reduce 10%
060214 062918 reduce 9%
060214 062919 reduce 10%
060214 062923 reduce 9%
060214 062924 reduce 10%
Exception in thread main java.io.IOException: Job failed!
 at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:310)
 at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:329)
 at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:349)

Exactly the same error messages as before. I guess I'll take my chances with the latest revision in trunk and try again :-/

--Florent

Chris Schneider wrote:

Florent,

You might want to try increasing the number of open files allowed on your master machine. We've increased this twice now, and each time it solved similar problems. We now have it at 16K. See my other post today (re: Corrupt NDFS?) for more details.

Good Luck,
- Chris

At 11:07 AM -0500 2/10/06, Florent Gluck wrote:

Hi,

I have 4 boxes (1 master, 3 slaves), about 33GB worth of segment data and 4.6M fetched urls in my crawldb. I'm using the mapred code from trunk (revision 374061, Wed, 01 Feb 2006). I was able to generate the indexes from the crawldb and linkdb, but I started to see this error recently while running a dedup on my indexes:

060210 061707 reduce 9%
060210 061710 reduce 10%
060210 061713 reduce 11%
060210 061717 reduce 12%
060210 061719 reduce 11%
060210 061723 reduce 10%
060210 061725 reduce 11%
060210 061726 reduce 10%
060210 061729 reduce 11%
060210 061730 reduce 9%
060210 061732 reduce 10%
060210 061736 reduce 11%
060210 061739 reduce 12%
060210 061742 reduce 10%
060210 061743 reduce 9%
060210 061745 reduce 10%
060210 061746 reduce 100%
Exception in thread main java.io.IOException: Job failed!
 at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:310)
 at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:329)
 at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:349)

I can see a lot of these messages in the jobtracker log on the master:

...
060210 061743 Task 'task_r_4t50k4' has been lost.
060210 061743 Task 'task_r_79vn7i' has been lost.
...

On every single slave, I get this file not found exception in the tasktracker log:

060210 061749 Server handler 0 on 50040 caught: java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_273opj/part-4.out
java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_273opj/part-4.out
 at org.apache.nutch.fs.LocalFileSystem.openRaw(LocalFileSystem.java:121)
 at org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
 at org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:226)
 at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
 at org.apache.nutch.mapred.MapOutputFile.write(MapOutputFile.java:93)
 at org.apache.nutch.io.ObjectWritable.writeObject(ObjectWritable.java:121)
 at org.apache.nutch.io.ObjectWritable.write(ObjectWritable.java:68)
 at org.apache.nutch.ipc.Server$Handler.run(Server.java:215)

I used to be able to complete the index dedupping successfully when my segments/crawldb was smaller, but I don't see why this would be related to the FileNotFoundException. I'm by far not running out of disk space and my hard discs work properly. Has anyone encountered a similar issue or has a clue about what's happening?

Thanks,
Florent
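(For reference, on most Linux systems the descriptor limit the daemons actually run with can be checked and raised roughly as below; the user name and values are placeholders, and the persistent config file varies by distribution.)

  # soft and hard limits for the current shell/user
  ulimit -Sn
  ulimit -Hn
  # raise the soft limit for this shell, up to the hard limit
  ulimit -n 32768
  # persistent change, e.g. in /etc/security/limits.conf:
  #   nutch  soft  nofile  32768
  #   nutch  hard  nofile  32768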
Error while indexing (mapred)
Hi,

I have 4 boxes (1 master, 3 slaves), about 33GB worth of segment data and 4.6M fetched urls in my crawldb. I'm using the mapred code from trunk (revision 374061, Wed, 01 Feb 2006). I was able to generate the indexes from the crawldb and linkdb, but I started to see this error recently while running a dedup on my indexes:

060210 061707 reduce 9%
060210 061710 reduce 10%
060210 061713 reduce 11%
060210 061717 reduce 12%
060210 061719 reduce 11%
060210 061723 reduce 10%
060210 061725 reduce 11%
060210 061726 reduce 10%
060210 061729 reduce 11%
060210 061730 reduce 9%
060210 061732 reduce 10%
060210 061736 reduce 11%
060210 061739 reduce 12%
060210 061742 reduce 10%
060210 061743 reduce 9%
060210 061745 reduce 10%
060210 061746 reduce 100%
Exception in thread main java.io.IOException: Job failed!
 at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:310)
 at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:329)
 at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:349)

I can see a lot of these messages in the jobtracker log on the master:

...
060210 061743 Task 'task_r_4t50k4' has been lost.
060210 061743 Task 'task_r_79vn7i' has been lost.
...

On every single slave, I get this file not found exception in the tasktracker log:

060210 061749 Server handler 0 on 50040 caught: java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_273opj/part-4.out
java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_273opj/part-4.out
 at org.apache.nutch.fs.LocalFileSystem.openRaw(LocalFileSystem.java:121)
 at org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
 at org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:226)
 at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
 at org.apache.nutch.mapred.MapOutputFile.write(MapOutputFile.java:93)
 at org.apache.nutch.io.ObjectWritable.writeObject(ObjectWritable.java:121)
 at org.apache.nutch.io.ObjectWritable.write(ObjectWritable.java:68)
 at org.apache.nutch.ipc.Server$Handler.run(Server.java:215)

I used to be able to complete the index dedupping successfully when my segments/crawldb was smaller, but I don't see why this would be related to the FileNotFoundException. I'm by far not running out of disk space and my hard discs work properly. Has anyone encountered a similar issue or has a clue about what's happening?

Thanks,
Florent
Re: So many Unfetched Pages using MapReduce
Hi Mike,

I finally got everything working properly! What I did was to switch to protocol-http and move the following from nutch-site.xml to mapred-default.xml:

<property>
  <name>mapred.map.tasks</name>
  <value>100</value>
  <description>The default number of map tasks per job. Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is local.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>40</value>
  <description>The default number of reduce tasks per job. Typically set
  to a prime close to the number of available hosts.
  Ignored when mapred.job.tracker is local.
  </description>
</property>

I then injected 100'000 urls and grepped the logs on my 4 slaves to see if the sum of all the fetched urls adds up to 100'000. It did :)

There was finally no need to comment out line 211 of Generator.java.

Hope it helps,
--Flo

Mike Smith wrote:

Hi Florent,

Thanks for the inquiry and reply. I did some more tests based on your suggestion. Using the old protocol-http, the problem is solved for a single machine. But when I have datanodes running on two other machines the problem still exists, though the number of unfetched pages is less than before. These are my tests:

Injected URL: 8
only one machine is a datanode: 7 fetched pages
map tasks: 3, reduce tasks: 3, threads: 250

Injected URL: 8
3 machines are datanodes; all machines participated in the fetching (judging by the tasktracker logs on the three machines): 2 fetched pages
map tasks: 12, reduce tasks: 6, threads: 250

Injected URL: 5000
3 machines are datanodes; all machines participated in the fetching (judging by the tasktracker logs on the three machines): 1200 fetched pages
map tasks: 12, reduce tasks: 6, threads: 250

Injected URL: 1000
3 machines are datanodes; all machines participated in the fetching (judging by the tasktracker logs on the three machines): 240 fetched pages

Injected URL: 1000
only one machine is a datanode: 800 fetched pages
map tasks: 3, reduce tasks: 3, threads: 250

I also commented out line 211 of Generator.java, but it didn't change the situation. I'll try to do some more testing.

Thanks,
Mike

On 1/19/06, Doug Cutting [EMAIL PROTECTED] wrote:

Florent Gluck wrote:

I then decided to switch to using the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient. With the old protocol I got 5 as expected.

There have been a number of complaints about unreliable fetching with protocol-httpclient, so I've switched the default back to protocol-http.

Doug
Re: So many Unfetched Pages using MapReduce
Andrzej,

I ran 2 crawls of 1 pass each, injecting 100'000 urls. Here is the output of readdb -stats when crawling with protocol-http:

060123 162250 TOTAL urls: 119221
060123 162250 avg score: 1.023
060123 162250 max score: 240.666
060123 162250 min score: 1.0
060123 162250 retry 0: 56648
060123 162250 retry 1: 62573
060123 162250 status 1 (DB_unfetched): 89068
060123 162250 status 2 (DB_fetched): 27513
060123 162250 status 3 (DB_gone): 2640

And here is the output when crawling with protocol-httpclient:

060123 180243 TOTAL urls: 117451
060123 180243 avg score: 1.021
060123 180243 max score: 194.0
060123 180243 min score: 1.0
060123 180243 retry 0: 52273
060123 180243 retry 1: 65178
060123 180243 status 1 (DB_unfetched): 89670
060123 180243 status 2 (DB_fetched): 26066
060123 180243 status 3 (DB_gone): 1715

Both return more or less the same results (with a difference of ~1.5% in the number of fetches, which is not surprising on a 100k set). I checked the logs and in both cases I see exactly 100'000 fetch attempts. You were right, it actually makes sense that the settings in mapred-default.xml would affect the local crawl as well since they have nothing to do with ndfs. It therefore seems that protocol-httpclient is reliable enough to be used (well, at least in my case).

--Flo

Florent Gluck wrote:

Andrzej Bialecki wrote:

Could you please check (on a smaller sample ;-) ) which of these two changes was necessary? First, second, or both? I suspect only the second change was really needed, i.e. the change in config files, and not the change of protocol-httpclient -> protocol-http ... It would be very helpful if you could confirm/deny this.

Well, I'm pretty much sure protocol-httpclient is part of the problem. Earlier last week, I was trying to figure out what the problem was and I ran some crawls on a single machine, using the local filesystem. Here were my previous observations (from an older message):

I injected 5 urls and got 2315 urls fetched. I couldn't find a trace in the logs of most of the urls. I noticed that if I put a counter at the beginning of the while(true) loop in the run method of Fetcher.java, I don't end up with 5! After some poking around, I noticed that if I comment out the line doing the page fetch (ProtocolOutput output = protocol.getProtocolOutput(key, datum);), then I get 5. There seems to be something really wrong with that. It seems to mean that some threads are dying without notification in the http protocol code (if it makes any sense). I then decided to switch to using the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient. With the old protocol I got 5 as expected.

So to me it seems protocol-httpclient is buggy. I'll still run a test with my current config and protocol-httpclient and let you know.

-Flo
Re: So many Unfetched Pages using MapReduce
Hi Mike,

Your different tests are really interesting, thanks for sharing! I didn't do as many tests. I changed the number of fetch threads and the number of map and reduce tasks and noticed that it gave me quite different results in terms of pages fetched.

Then, I wanted to see if this issue would still happen when running the crawl (single pass) on one single machine running everything locally, without ndfs. So I injected 5 urls and got 2315 urls fetched. I couldn't find a trace in the logs of most of the urls. I noticed that if I put a counter at the beginning of the while(true) loop in the run method of Fetcher.java, I don't end up with 5! After some poking around, I noticed that if I comment out the line doing the page fetch (ProtocolOutput output = protocol.getProtocolOutput(key, datum);), then I get 5. There seems to be something really wrong with that. It seems to mean that some threads are dying without notification in the http protocol code (if it makes any sense). I then decided to switch to using the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient. With the old protocol I got 5 as expected.

The following bug seems to be very similar to what we are encountering: http://issues.apache.org/jira/browse/NUTCH-136 Check out the latest comment. I'm gonna remove line 211 and run some tests to see how it behaves (with protocol-http and protocol-httpclient). I'll let you know what I find out,

--Florent

Mike Smith wrote:

Hi Florent,

I did some more testing. Here are the results: I have 3 machines, P4 with 1G RAM. All three are datanodes and one is the namenode. I started from 8 seed urls and tried to see the effect of a depth-1 crawl with different configurations. The number of unfetched pages changes with the configuration:

--Configuration 1
Number of map tasks: 3
Number of reduce tasks: 3
Number of fetch threads: 40
Number of threads per host: 2
http.timeout: 10 sec
--- 6700 pages fetched

--Configuration 2
Number of map tasks: 12
Number of reduce tasks: 6
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
--- 18000 pages fetched

--Configuration 3
Number of map tasks: 40
Number of reduce tasks: 20
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
--- 37000 pages fetched

--Configuration 4
Number of map tasks: 100
Number of reduce tasks: 20
Number of fetch threads: 100
Number of threads per host: 20
http.timeout: 10 sec
--- 34000 pages fetched

--Configuration 5
Number of map tasks: 50
Number of reduce tasks: 50
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
--- 52000 pages fetched

--Configuration 6
Number of map tasks: 50
Number of reduce tasks: 100
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
--- 57000 pages fetched

--Configuration 7
Number of map tasks: 50
Number of reduce tasks: 120
Number of fetch threads: 250
Number of threads per host: 20
http.timeout: 20 sec
--- 6 pages fetched

Do you have any idea why pages go missing from the fetcher without any log messages or exceptions? It seems it really depends on the number of reduce tasks!

Thanks,
Mike
Re: Error at end of MapReduce run with indexing
Ken Krugler wrote:

Hello fellow Nutchers,

I followed the steps described here by Doug: http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL PROTECTED] ...to start a test run of the new (0.8, as of 1/12/2006) version of Nutch. It ran for quite a while on my three machines - started at 111226, and died at 150937, so almost four hours. The error occurred during the Indexer phase:

060114 150937 Indexer: starting
060114 150937 Indexer: linkdb: crawl-20060114111226/linkdb
060114 150937 Indexer: adding segment: /user/crawler/crawl-20060114111226/segments/20060114111918
060114 150937 parsing file:/home/crawler/nutch/conf/nutch-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/crawl-tool.xml
060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060114 150937 Indexer: adding segment: /user/crawler/crawl-20060114111226/segments/20060114122751
060114 150937 Indexer: adding segment: /user/crawler/crawl-20060114111226/segments/20060114133620
Exception in thread main java.io.IOException: timed out waiting for response
 at org.apache.nutch.ipc.Client.call(Client.java:296)
 at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
 at $Proxy1.submitJob(Unknown Source)
 at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
 at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
 at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)

1. Any ideas what might have caused it to time out just now, when it had successfully run many jobs up to that point?

2. What cruft might I need to get rid of because it died? For example, I see a reference to /home/crawler/tmp/local/jobTracker/job_18cunz.xml now when I try to execute some Nutch commands.

I've had the same problem during the invertlinks step when dealing with a large number of urls. Increasing the ipc.client.timeout value from 6 to 10 (cf. nutch-default.xml) did the trick.

3. What's the best way to find out how many pages were actually crawled, how many links are in the DB, etc? The 0.7-era commands (readdb, segread, etc) don't seem to be working with the new NDFS setup.

The following gives you some stats about the crawl db (number of urls fetched, unfetched and dead ones): nutch readdb crawldb -stats

4. Any idea whether 4 hours is a reasonable amount of time for this test? It seemed long to me, given that I was starting with a single URL as the seed.

How many crawl passes did you do?

--Flo
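(The timeout override mentioned above goes into nutch-site.xml; a sketch only, with an illustrative value in milliseconds as I recall the property being interpreted, the actual number depending on your cluster and job sizes.)

  <property>
    <name>ipc.client.timeout</name>
    <value>120000</value>
    <description>Timeout for IPC calls; raised above the default for large invertlinks/index jobs (illustrative value).</description>
  </property>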
Re: So many Unfetched Pages using MapReduce
I'm having the exact same problem. I noticed that changing the number of map/reduce tasks gives me different DB_fetched results. Looking at the logs, a lot of urls are actually missing. I can't find their trace *anywhere* in the logs (whether on the slaves or the master). I'm puzzled. Currently I'm trying to debug the code to see what's going on. So far, I noticed the generator is fine, so the issue must lie further down the pipeline (fetcher?). Let me know if you find anything regarding this issue. Thanks.

--Flo

Mike Smith wrote:

Hi,

I have set up four boxes using MapReduce and everything goes smoothly. I fed about 8 seed nodes to begin with and crawled to depth 2. Only 1900 pages (about 300MB of data) were fetched and the rest is marked as db_unfetched. Does anyone know what could be wrong? This is the output of (bin/nutch readdb h2/crawldb -stats):

060115 171625 Statistics for CrawlDb: h2/crawldb
060115 171625 TOTAL urls: 99403
060115 171625 avg score: 1.01
060115 171625 max score: 7.382
060115 171625 min score: 1.0
060115 171625 retry 0: 99403
060115 171625 status 1 (DB_unfetched): 97470
060115 171625 status 2 (DB_fetched): 1933
060115 171625 CrawlDb statistics: done

Thanks,
Mike
mapred fetching weirdness
Hi,

I'm running nutch trunk as of today. I have 3 slaves and a master. I'm using mapred.map.tasks=20 and mapred.reduce.tasks=4.

There is something I'm really confused about. When I inject 25000 urls, fetch them (depth = 1) and do a readdb -stats, I get:

060110 171347 Statistics for CrawlDb: crawldb
060110 171347 TOTAL urls: 27939
060110 171347 avg score: 1.011
060110 171347 max score: 8.883
060110 171347 min score: 1.0
060110 171347 retry 0: 26429
060110 171347 retry 1: 1510
060110 171347 status 1 (DB_unfetched): 24248
060110 171347 status 2 (DB_fetched): 3390
060110 171347 status 3 (DB_gone): 301
060110 171347 CrawlDb statistics: done

There are several things that don't make sense to me and it would be great if someone could clear them up:

1. If I count the number of occurrences of "fetching" in all of my slaves' tasktracker logs, I get 6225. This number clearly doesn't match the DB_fetched of 3390 from the readdb output. Why is that? What happened to the 6225-3390=2835 missing urls?

2. Why is the TOTAL urls 27939 if I inject a file with 25000 entries? Why is it not 25000?

3. What is the meaning of DB_gone and DB_unfetched? I was assuming that if you inject a total of 25k urls where 5000 are fetchable ones, you would get something like: (DB_unfetched): 2 (DB_fetched): 5000. That's not the case, so I'd like to understand what exactly is going on here.

4. If I redo (starting from an empty crawldb, of course) the exact same inject + crawl with the same 25000 urls, but use the following mapred settings instead: mapred.map.tasks=200 and mapred.reduce.tasks=8, I get the following readdb output:

060110 162140 TOTAL urls: 33173
060110 162140 avg score: 1.026
060110 162140 max score: 22.083
060110 162140 min score: 1.0
060110 162140 retry 0: 28381
060110 162140 retry 1: 4792
060110 162140 status 1 (DB_unfetched): 23136
060110 162140 status 2 (DB_fetched): 9234
060110 162140 status 3 (DB_gone): 803
060110 162140 CrawlDb statistics: done

How come the DB_fetched is about 3x more and the TOTAL urls goes beyond 25000??? It doesn't make any sense. I'd expect to see similar results as before with the other mapred settings.

Thank you,
Florent
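(On question 1: a quick way to do that count, assuming the fetcher's per-url "fetching" log lines end up in the tasktracker logs; the log path and file name pattern below are placeholders and depend on how the trackers were started.)

  # run on each slave, then add up the numbers
  grep -c "fetching http" /path/to/nutch/logs/*tasktracker*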
Re: is nutch recrawl possible?
Pushpesh,

We extended nutch with a whitelist filter and you might find it useful. Check the comments from Matt Kangas here: http://issues.apache.org/jira/browse/NUTCH-87;jsessionid=6F6AD5423357184CF57B51B003201C49?page=all

--Flo

Pushpesh Kr. Rajwanshi wrote:

Hmmm... actually my requirement is a bit more complex than it seems, so url filters alone probably won't do. I am not filtering urls based only on some domain name; within a domain I want to discard some urls, and since they don't actually follow a pattern I can't use url filters, otherwise url filters would have done a great job. Thanks anyway.

Pushpesh

On 12/19/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote:

About this blocking, you can try to use the urlfilters and change the filter between each fetch/generate:

+^http://www.abc.com
-^http://www.bbc.co.uk

Pushpesh Kr. Rajwanshi wrote:

Oh, this is pretty good and quite the helpful material I wanted. Thanks Havard for this. Seems like this will help me write the code for the stuff I need :-)

Thanks and Regards,
Pushpesh

On 12/19/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote:

Try using the whole-web fetching method instead of the crawl method.
http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling
http://wiki.media-style.com/display/nutchDocu/quick+tutorial

Pushpesh Kr. Rajwanshi wrote:

Hi Stefan,

Thanks for the lightning-fast reply. I was amazed to see such a quick response, really appreciate it. Actually what I am really looking for is this: suppose I run a crawl for some sites, say 5, and for some depth, say 2. Then what I want is that next time I run a crawl, it should reuse the webdb contents it populated the first time. (Assuming a successful crawl. Yes, you are right, a suddenly broken-down crawl won't work as it has lost its data integrity.)

As you said, we can run the tools provided by nutch to do the step-by-step commands needed to crawl, but isn't there some way I can reuse the existing crawl data? Maybe it involves changing code, but that's ok. Just one more quick question: why does every crawl need a new directory, and isn't there an option to at least reuse the webdb? Maybe I am asking something silly but I am clueless :-(

Or, as you said, maybe what I can do is explore the steps you mentioned and get what I need.

Thanks again,
Pushpesh

On 12/19/05, Stefan Groschupf [EMAIL PROTECTED] wrote:

It is difficult to answer your question since the vocabulary used may be wrong. You can refetch pages, no problem. But you cannot continue a crashed fetch process. Nutch provides a tool that runs a set of steps like segment generation, fetching, db updating, etc. So first try to run these steps manually instead of using the crawl command. Then you may already get an idea of where you can jump in to grab the data you need.

Stefan

On 19.12.2005 at 14:46, Pushpesh Kr. Rajwanshi wrote:

Hi,

I am crawling some sites using nutch. My requirement is that when I run a nutch crawl, it should somehow be able to reuse the data in the webdb populated by a previous crawl. In other words, my question is: suppose my crawl is running and I cancel it somewhere in the middle, is there some way I can resume the crawl? I don't know if I can do this at all, but if there is some way, please throw some light on this.

TIA
Regards,
Pushpesh
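(On the pattern-based alternative discussed in the quoted exchange: the regex url filter reads a file of +/- patterns where the first matching line wins, so per-path exclusions inside a domain look roughly like this; the file is crawl-urlfilter.txt or regex-urlfilter.txt depending on whether you use the crawl command, and the patterns below are purely illustrative.)

  # drop specific paths inside the domain (hypothetical patterns)
  -^http://www\.example\.com/private/
  -^http://www\.example\.com/.*\?action=print
  # keep the rest of the domain
  +^http://www\.example\.com/
  # reject everything else
  -.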
java.io.IOException in dedup (map reduce)
Hi,

I'm using the map reduce branch, 1 master and 3 slaves, and they are configured the standard way (master as jobtracker + namenode). After having created an index, I run dedup on it, but I get an IOException. Here is an extract of the log:

051215 160733 Dedup: starting
051215 160733 Dedup: adding indexes in: index
051215 160734 parsing file:/home/epile/nutch-0.8-dev/conf/nutch-default.xml
051215 160734 parsing file:/home/epile/nutch-0.8-dev/conf/mapred-default.xml
051215 160734 parsing file:/home/epile/nutch-0.8-dev/conf/nutch-site.xml
051215 160734 parsing file:/home/epile/nutch-0.8-dev/conf/nutch-default.xml
051215 160734 parsing file:/home/epile/nutch-0.8-dev/conf/nutch-site.xml
051215 160734 Client connection to 127.0.0.1:4: starting
051215 160734 Client connection to 127.0.0.1:5: starting
051215 160735 Running job: job_a4k3en
051215 160736 map 0%
051215 160753 map 17%
051215 160803 reduce 100%
Exception in thread main java.io.IOException: Job failed!
 at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
 at org.apache.nutch.crawl.DeleteDuplicates.dedup(DeleteDuplicates.java:312)
 at org.apache.nutch.crawl.DeleteDuplicates.main(DeleteDuplicates.java:349)

I also ran the exact same crawl locally on one single machine without using ndfs and it works fine. Has anyone else encountered this exception?

Thanks,
--Flo
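(When a mapred job dies with nothing more than "Job failed!" on the client side, the real cause is usually in the slave-side logs; something along these lines, with the log directory as a placeholder, is a reasonable first look.)

  # on each slave, look for the task-level error that preceded the job failure
  grep -n -B 2 -A 8 "Exception" /path/to/nutch/logs/*tasktracker*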
nutch mapred + tomcat and a couple other questions
Sorry for such a basic question, but how do we run a search on a generated index? I read about how to set up tomcat w/ nutch 0.8 and you have to run it in the directory where the index resides (apparently it looks in the segments dir from where it's run). However this won't work w/ nutch 0.8 since the index is in ndfs. I then tried to copy it locally using ndfs -copyToLocal, but without any luck.

Also, how to use nutch search? Is NutchBean the closest thing to segread from nutch 0.8? Finally, what's the purpose of nutch parse and how to use it?

Thanks,
--Flo
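(For context, the two pieces involved are getting the crawl data out of NDFS and pointing the web app at the local copy; a hedged sketch only, since the exact command form and the searcher.dir handling depend on the revision, and the directory names below are placeholders.)

  # copy the crawl directory (crawldb, linkdb, segments, index) out of NDFS
  bin/nutch ndfs -copyToLocal crawl /local/nutch/crawl
  # then point the search webapp at it, e.g. in the webapp's nutch-site.xml:
  #   <property><name>searcher.dir</name><value>/local/nutch/crawl</value></property>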
Re: nutch mapred + tomcat and a couple other questions
My err, I meant nutch server, not nutch search.

--Flo

Florent Gluck wrote:

Sorry for such a basic question, but how do we run a search on a generated index? I read about how to set up tomcat w/ nutch 0.8 and you have to run it in the directory where the index resides (apparently it looks in the segments dir from where it's run). However this won't work w/ nutch 0.8 since the index is in ndfs. I then tried to copy it locally using ndfs -copyToLocal, but without any luck.

Also, how to use nutch search? Is NutchBean the closest thing to segread from nutch 0.8? Finally, what's the purpose of nutch parse and how to use it?

Thanks,
--Flo
Re: nutch mapred + tomcat and a couple other questions
Never mind, I got tomcat working. After looking at the code, it seems nutch parse does nothing yet. The last remaining thing is how to use NutchBean to output the segments' content.

Thanks,
--Flo

Florent Gluck wrote:

Sorry for such a basic question, but how do we run a search on a generated index? I read about how to set up tomcat w/ nutch 0.8 and you have to run it in the directory where the index resides (apparently it looks in the segments dir from where it's run). However this won't work w/ nutch 0.8 since the index is in ndfs. I then tried to copy it locally using ndfs -copyToLocal, but without any luck.

Also, how to use nutch search? Is NutchBean the closest thing to segread from nutch 0.8? Finally, what's the purpose of nutch parse and how to use it?

Thanks,
--Flo
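(For the record, the quickest way I know of to exercise NutchBean outside tomcat is its command-line main, assuming searcher.dir points at the crawl directory; "apache" below is just an example query term.)

  bin/nutch org.apache.nutch.searcher.NutchBean apache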
Incremental crawl w/ map reduce
Hi,

As a test, I recently did a quick incremental crawl. First, I did a crawl with 10 seed urls using 4 nodes (1 jobTracker/nameNode + 3 taskTrackers/dataNodes). So far, so good: the fetches were distributed among the 3 nodes (3/3/4) and a segment was generated. Running a quick -stats on the crawldb showed me the 10 links were there. I also did a dump and everything was fine.

Then, I injected a new url and crawled again, generating a second segment. While it was running, I looked at the logs expecting to only see the fetch of the new url I added, but instead I saw it was fetching all the previous urls again. Why is that? These were already fetched, and my understanding is that they should only be fetched again after 30 days (or whatever value is specified in nutch-site.xml). What am I missing here?

Thanks,
Flo
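(On the 30-day figure: it comes from the crawl db's default re-fetch interval, which can be overridden in nutch-site.xml; if I remember the 0.8-era name correctly it is db.default.fetch.interval, expressed in days. A sketch only.)

  <property>
    <name>db.default.fetch.interval</name>
    <value>30</value>
    <description>Default number of days between re-fetches of a page (illustrative).</description>
  </property>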