[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836918#action_12836918 ]

Euan Clark commented on NUTCH-719:

I notice the other addFetchItem methods of FetchItemQueues and FetchItemQueue in Fetcher.java. Should these also be synchronized?

fetchQueues.totalSize incorrect in Fetcher2

Key: NUTCH-719
URL: https://issues.apache.org/jira/browse/NUTCH-719
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.1

I had a look at the logs generated by Fetcher2 and found cases where there were no active fetchQueues but fetchQueues.totalSize was != 0:

fetcher.Fetcher2 - -activeThreads=200, spinWaiting=200, fetchQueues.totalSize=1, fetchQueues=0

Since the code relies on fetchQueues.totalSize to determine whether the work is finished, the task is blocked until the abort mechanism kicks in:

2009-03-12 09:27:38,977 WARN fetcher.Fetcher2 - Aborting with 200 hung threads.

Could that be a synchronisation issue? Any ideas?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-719.

Resolution: Fixed
Fix Version/s: 1.1

Committed revision 911905. Thanks to S. Dennis for investigating the issue + R. Schwab for testing it
[jira] Closed: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-719.
[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836125#action_12836125 ]

Hudson commented on NUTCH-719:

Integrated in Nutch-trunk #1074 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1074/])
fetchQueues.totalSize incorrect in Fetcher
[jira] Assigned: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche reassigned NUTCH-719:

Assignee: Julien Nioche
[jira] Closed: (NUTCH-679) Fetcher2 implementing Tool
[ https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki closed NUTCH-679.

Resolution: Fixed
Fix Version/s: 1.1
Assignee: Andrzej Bialecki

Fetcher2 implementing Tool

Key: NUTCH-679
URL: https://issues.apache.org/jira/browse/NUTCH-679
Project: Nutch
Issue Type: Improvement
Components: fetcher
Reporter: Julien Nioche
Assignee: Andrzej Bialecki
Priority: Minor
Fix For: 1.1
Attachments: Fetcher2.Tool.patch, NUTCH-679.patch

The patch attached makes Fetcher2 implement Tool. As a result we should be able to override parameters on the command line, e.g.

bin/nutch fetch2 -Dfetcher.server.min.delay=1.0 -Dmapred.reduce.tasks=4 segments/20090115072836

instead of having to modify the *-site.xml files in conf/
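
For readers unfamiliar with the mechanism: a class implementing Hadoop's Tool interface is normally launched through ToolRunner, which parses generic options such as -D before the tool sees its own arguments. The sketch below only illustrates the resulting precedence (command-line values win over values loaded from the *-site.xml files). It is a self-contained stand-in, not Hadoop's parser; the class name OverrideSketch, the method applyOverrides, and the starting values are assumptions made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Illustrative stand-in for generic option handling; NOT Hadoop's GenericOptionsParser. */
class OverrideSketch {

  /** Applies every -Dkey=value argument to conf and returns the remaining arguments. */
  public static List<String> applyOverrides(Properties conf, String[] args) {
    List<String> rest = new ArrayList<>();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      if (arg.startsWith("-D") && eq > 2) {
        // Command-line value replaces whatever was loaded from the site files.
        conf.setProperty(arg.substring(2, eq), arg.substring(eq + 1));
      } else {
        rest.add(arg);
      }
    }
    return rest;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // Hypothetical values standing in for what conf/*-site.xml would provide.
    conf.setProperty("fetcher.server.min.delay", "0.0");
    conf.setProperty("mapred.reduce.tasks", "1");

    String[] cmd = {
      "-Dfetcher.server.min.delay=1.0", "-Dmapred.reduce.tasks=4", "segments/20090115072836"
    };
    List<String> rest = applyOverrides(conf, cmd);

    System.out.println("fetcher.server.min.delay=" + conf.getProperty("fetcher.server.min.delay"));
    System.out.println("mapred.reduce.tasks=" + conf.getProperty("mapred.reduce.tasks"));
    System.out.println("tool arguments=" + rest);
  }
}
```

The tool itself then only has to deal with the leftover arguments (here the segment path), which is what makes the command line in the issue description possible without editing conf/.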
[jira] Commented: (NUTCH-679) Fetcher2 implementing Tool
[ https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764295#action_12764295 ]

Hudson commented on NUTCH-679:

Integrated in Nutch-trunk #959 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/959/])
Fetcher2 implementing Tool.
[jira] Closed: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-721.

Resolution: Fixed
Fix Version/s: 1.1
Assignee: Doğacan Güney

Code committed as of rev. 807485. I am closing this issue. Of course, there may be other reasons why Fetcher2 is slow, so feel free to create new issues if so.

Fetcher2 Slow

Key: NUTCH-721
URL: https://issues.apache.org/jira/browse/NUTCH-721
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.0.0
Environment: Fedora Core r6, Kernel 2.6.22-14, jdk1.6.0_12
Reporter: Roger Dunk
Assignee: Doğacan Güney
Fix For: 1.1
Attachments: crawl_generate.tar.gz, NUTCH-721.patch, nutch-site.xml

Fetcher2 fetches far more slowly than Fetcher1.

Config options:
fetcher.threads.fetch = 80
fetcher.threads.per.host = 80
fetcher.server.delay = 0
generate.max.per.host = 1

With a queue size of ~40,000, the result is:

activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0

with maybe a download of 1 page per second. Running with -noParse makes little difference. CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0.

Hosts already cached by the local caching NS appear to download quickly upon a re-fetch, so this is possibly an issue relating to NS lookups. However, all things being equal, Fetcher1 runs fast without pre-caching hosts.
[jira] Updated: (NUTCH-679) Fetcher2 implementing Tool
[ https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-679:

Attachment: NUTCH-679.patch

Updated version of the patch
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741225#action_12741225 ]

Doğacan Güney commented on NUTCH-721:

Thanks for the analysis, Julien! Can you make a patch for the conf changes so we can commit it with your name?
[jira] Updated: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-721:

Attachment: NUTCH-721.patch

Sets the default value for fetcher.threads.per.host.by.ip to false
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741082#action_12741082 ]

Julien Nioche commented on NUTCH-721:

I had another look at this issue after applying the patch from NUTCH-719. I can easily reproduce the situation from the original post by setting fetcher.threads.per.host.by.ip to true. The nutch-site file sent by Roger does not specify it, so it would rely on this value by default. Once it is set to false, all threads are active and the fetching is much faster.

I have used the first 5K URLs from the fetchlist sent by Roger and compared the performance with by.ip set to false:

OldFetcher : real 32m26.003s user 1m11.768s sys 0m10.337s
OldFetcher : real 30m52.965s user 1m10.696s sys 0m10.425s
Fetcher : real 31m21.924s user 1m12.725s sys 0m10.797s
Fetcher : real 30m3.017s user 1m15.509s sys 0m10.909s

I ran each step twice and, as we can see, the results are comparable.

This explanation is also consistent with Steven's observation that we get 5-7 times the rate, as we would hit the DNS cache for subsequent calls for URLs from non-unique sites. The IP resolution is done by the QueueFeeder, which explains why it slows the rate at which URLs become available for fetching.

I don't think the OldFetcher allows grouping URLs by IP for politeness, in which case why not make fetcher.threads.per.host.by.ip default to false in the new fetcher?
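
To make the by.ip effect concrete, here is a minimal sketch of how a feeder might derive a queue id for a URL. It is not taken from Fetcher.java: the class QueueIdSketch, the method queueId and the injected resolver are hypothetical, and the IP used is simply the one that appears alongside this host in the NUTCH-719 debug output. A real resolver would presumably call something like InetAddress.getByName, which is where the multi-second DNS waits would come from.

```java
import java.net.URI;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch of queue-id selection; not the Nutch implementation. */
class QueueIdSketch {

  /**
   * Builds a queue id of the form protocol://key, where key is the resolved
   * IP when byIp is true and the plain host name otherwise. The resolver is
   * only invoked in by-IP mode, so host mode never pays for a DNS lookup.
   */
  public static String queueId(String url, boolean byIp, Function<String, String> resolver) {
    URI u = URI.create(url);
    String host = u.getHost();
    String key = byIp ? resolver.apply(host) : host;
    return u.getScheme() + "://" + key;
  }

  public static void main(String[] args) {
    AtomicInteger lookups = new AtomicInteger();
    // Fake resolver standing in for a (possibly slow) DNS lookup.
    Function<String, String> resolver = host -> {
      lookups.incrementAndGet();
      return "125.168.254.20";
    };

    String url = "http://www.callidan.com/ma100.htm";
    System.out.println(queueId(url, false, resolver) + " lookups=" + lookups.get());
    System.out.println(queueId(url, true, resolver) + " lookups=" + lookups.get());
  }
}
```

With a single feeder thread, every lookup in by-IP mode sits on the critical path between the fetchlist and the queues, which would fit the observation that most fetcher threads spin-wait while URLs trickle in.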
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741092#action_12741092 ]

Andrzej Bialecki commented on NUTCH-721:

+1. Current defaults are sub-optimal due to backward-compatibility issues with early Nutch 0.8. This should be no longer a concern.
[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730239#action_12730239 ]

Steven Denny commented on NUTCH-719:

I'm not sure. As far as I can tell, the feeder has always finished feeding the URLs; it's just that a proportion are lost.

However, there are two things I've noted regarding performance (if you just look at URLs crawled per second):

1) When this situation arises, the fetcher will time out and abort with N hung threads. The timeout occurs after mapred.task.timeout/2 (default 5 mins), so any timing on a crawl that aborted will be extended by 5 mins. On a small crawl this could skew the figures.

2) DNS lookups can take a while. I know this has been noted before, but on my test system (admittedly only a VM on our network, with nothing special in terms of DNS), some of the lookups were taking 5-6 seconds.

This is possibly the wrong place to discuss it given NUTCH-721, but I put some debug around the feeder thread and got:

2009-07-10 04:01:35,296 INFO fetcher.Fetcher - Fed 500 urls in 186 secs = 2.7url/s
2009-07-10 04:04:18,343 INFO fetcher.Fetcher - Fed 499 urls in 163 secs = 3.1url/s
2009-07-10 04:06:57,109 INFO fetcher.Fetcher - Fed 498 urls in 158 secs = 3.2url/s
2009-07-10 04:10:38,282 INFO fetcher.Fetcher - Fed 499 urls in 221 secs = 2.3url/s
2009-07-10 04:12:58,371 INFO fetcher.Fetcher - Fed 498 urls in 140 secs = 3.6url/s
2009-07-10 04:16:12,275 INFO fetcher.Fetcher - Fed 499 urls in 193 secs = 2.6url/s
2009-07-10 04:19:20,162 INFO fetcher.Fetcher - Fed 499 urls in 187 secs = 2.7url/s
2009-07-10 04:21:25,846 INFO fetcher.Fetcher - Fed 499 urls in 125 secs = 4.0url/s
2009-07-10 04:24:16,049 INFO fetcher.Fetcher - Fed 495 urls in 170 secs = 2.9url/s
2009-07-10 04:27:01,944 INFO fetcher.Fetcher - Fed 499 urls in 165 secs = 3.0url/s
2009-07-10 04:29:26,247 INFO fetcher.Fetcher - Fed 499 urls in 144 secs = 3.5url/s
2009-07-10 04:32:02,590 INFO fetcher.Fetcher - Fed 499 urls in 156 secs = 3.2url/s
2009-07-10 04:34:49,985 INFO fetcher.Fetcher - Fed 498 urls in 167 secs = 3.0url/s
2009-07-10 04:37:28,367 INFO fetcher.Fetcher - Fed 498 urls in 158 secs = 3.2url/s
2009-07-10 04:40:09,865 INFO fetcher.Fetcher - Fed 499 urls in 161 secs = 3.1url/s
2009-07-10 04:42:55,203 INFO fetcher.Fetcher - Fed 499 urls in 165 secs = 3.0url/s

Obviously, when I'm only feeding 3-4 urls/sec, I'll only ever be able to fetch that.

That test was on a crawldb just initialised with 11,000 urls (unique sites). However, on the next iteration, where I'm feeding urls from non-unique sites, I see 5-7 times that rate.
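
The feed-rate figures in the debug lines are simply urls divided by elapsed seconds, rounded to one decimal. The sketch below reproduces that arithmetic for the first logged batch; the class FeedRateSketch and the method feedLine are hypothetical names, and the actual debug statement added to the feeder thread is not shown in the comment, so the message format here is only copied from the log output.

```java
import java.util.Locale;

/** Hypothetical helper reproducing the "Fed N urls in S secs" arithmetic. */
class FeedRateSketch {

  /** Formats one feeder progress line; rate is urls per second to one decimal. */
  public static String feedLine(int urls, long secs) {
    double rate = secs > 0 ? (double) urls / secs : 0.0;
    return String.format(Locale.ROOT, "Fed %d urls in %d secs = %.1furl/s", urls, secs, rate);
  }

  public static void main(String[] args) {
    // First batch from the log: 500 / 186 = 2.688..., shown as 2.7.
    System.out.println(feedLine(500, 186)); // prints "Fed 500 urls in 186 secs = 2.7url/s"
  }
}
```

Since fetcher threads can only consume what the feeder has queued, this rate is an upper bound on pages fetched per second for that batch.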
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730242#action_12730242 ]

Steven Denny commented on NUTCH-721:

I've done some testing on this and looked at the number of pages being fed, as this obviously limits the number of pages you can fetch:

2009-07-10 04:01:35,296 INFO fetcher.Fetcher - Fed 500 urls in 186 secs = 2.7url/s
2009-07-10 04:04:18,343 INFO fetcher.Fetcher - Fed 499 urls in 163 secs = 3.1url/s
2009-07-10 04:06:57,109 INFO fetcher.Fetcher - Fed 498 urls in 158 secs = 3.2url/s
2009-07-10 04:10:38,282 INFO fetcher.Fetcher - Fed 499 urls in 221 secs = 2.3url/s
2009-07-10 04:12:58,371 INFO fetcher.Fetcher - Fed 498 urls in 140 secs = 3.6url/s
2009-07-10 04:16:12,275 INFO fetcher.Fetcher - Fed 499 urls in 193 secs = 2.6url/s
2009-07-10 04:19:20,162 INFO fetcher.Fetcher - Fed 499 urls in 187 secs = 2.7url/s
2009-07-10 04:21:25,846 INFO fetcher.Fetcher - Fed 499 urls in 125 secs = 4.0url/s
2009-07-10 04:24:16,049 INFO fetcher.Fetcher - Fed 495 urls in 170 secs = 2.9url/s
2009-07-10 04:27:01,944 INFO fetcher.Fetcher - Fed 499 urls in 165 secs = 3.0url/s
2009-07-10 04:29:26,247 INFO fetcher.Fetcher - Fed 499 urls in 144 secs = 3.5url/s
2009-07-10 04:32:02,590 INFO fetcher.Fetcher - Fed 499 urls in 156 secs = 3.2url/s
2009-07-10 04:34:49,985 INFO fetcher.Fetcher - Fed 498 urls in 167 secs = 3.0url/s
2009-07-10 04:37:28,367 INFO fetcher.Fetcher - Fed 498 urls in 158 secs = 3.2url/s
2009-07-10 04:40:09,865 INFO fetcher.Fetcher - Fed 499 urls in 161 secs = 3.1url/s
2009-07-10 04:42:55,203 INFO fetcher.Fetcher - Fed 499 urls in 165 secs = 3.0url/s

That test was on a crawldb just initialised with 11,000 urls (unique sites). However, on the next iteration, where I'm feeding urls from non-unique sites, I see 5-7 times that rate.

(My test system is a VM on our network, with nothing special in terms of DNS. Some of the lookups were taking 5-6 seconds.)
[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730253#action_12730253 ]

Steven Denny commented on NUTCH-719:

Perhaps I spoke too soon:

10 threads, 15520 pages, 723 errors, 3.7 pages/s, 2972 kb/s,
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=0, fetchQueues.count=0
Aborting with 10 hung threads.
Unable to resolve: www.countryenergy.com.au, skipping.
Exception in thread QueueFeeder java.lang.NullPointerException
at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:41)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:206)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:238)
at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:177)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:111)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:1895)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1925)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.nutch.fetcher.Fetcher$QueueFeeder.run(Fetcher.java:418)

It appears that the feeder hung, but I'm not sure whether the exception raised is the cause or the effect (I suspect it's the effect of the thread aborting).

I'm also not sure if any of these issues are VM related. Hopefully our real hardware will turn up soon.
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730402#action_12730402 ]

Doğacan Güney commented on NUTCH-721:

Steven, if you have time/hardware, can you retry your use-case with OldFetcher in trunk?
[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729699#action_12729699 ]

Steven Denny commented on NUTCH-719:

I've changed line 324 of src/java/org/apache/nutch/fetcher/Fetcher.java to

public synchronized void addFetchItem(FetchItem it) {

(added the synchronized) and initial testing looks good.
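
A rough sketch of why adding synchronized helps, under the assumption that the add path has two steps (look up or create the queue, then enqueue and count) and that the consumer reaps empty queues under the same lock. The class below is a simplified stand-in written for illustration, not the Fetcher.java source; QueueRaceSketch, getOrCreateQueue, enqueue, poll and replayUnsynchronizedInterleaving are all hypothetical names. The bad interleaving is replayed step by step on one thread so the outcome is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified stand-in for the fetcher's per-host queues; NOT the Nutch source. */
class QueueRaceSketch {
  private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
  private final AtomicInteger totalSize = new AtomicInteger();

  /** Add, step 1: find the queue for an id, creating it if absent. */
  public synchronized Deque<String> getOrCreateQueue(String id) {
    return queues.computeIfAbsent(id, k -> new ArrayDeque<>());
  }

  /** Add, step 2: put the URL on the queue and count it. Unsafe on its own. */
  public void enqueue(Deque<String> queue, String url) {
    queue.add(url);
    totalSize.incrementAndGet();
  }

  /** Fixed add: both steps hold the same lock as poll(), so no reap can slip in between. */
  public synchronized void addFetchItem(String id, String url) {
    enqueue(getOrCreateQueue(id), url);
  }

  /** Consumer side: reaps empty queues, then returns the next URL or null. */
  public synchronized String poll() {
    Iterator<Deque<String>> it = queues.values().iterator();
    while (it.hasNext()) {
      Deque<String> q = it.next();
      if (q.isEmpty()) {
        it.remove(); // reap
        continue;
      }
      totalSize.decrementAndGet();
      return q.poll();
    }
    return null;
  }

  public int totalSize() { return totalSize.get(); }

  public synchronized int queueCount() { return queues.size(); }

  /** Replays the interleaving an unsynchronized add allows. */
  public static QueueRaceSketch replayUnsynchronizedInterleaving() {
    QueueRaceSketch s = new QueueRaceSketch();
    Deque<String> q = s.getOrCreateQueue("http://125.168.254.20"); // feeder: queue created
    s.poll();                                                       // fetcher: empty queue reaped
    s.enqueue(q, "http://www.callidan.com/ma100.htm");              // feeder: item goes to orphaned queue
    return s;
  }

  public static void main(String[] args) {
    QueueRaceSketch broken = replayUnsynchronizedInterleaving();
    // prints "totalSize=1, queues=0": the item is counted but can never be fetched
    System.out.println("totalSize=" + broken.totalSize() + ", queues=" + broken.queueCount());

    QueueRaceSketch fixed = new QueueRaceSketch();
    fixed.addFetchItem("http://125.168.254.20", "http://www.callidan.com/ma100.htm");
    System.out.println("polled=" + fixed.poll() + ", totalSize=" + fixed.totalSize());
  }
}
```

In the replayed case the counter stays above zero with no queue left to drain, which matches the "fetchQueues.totalSize=1, fetchQueues=0" state in the issue description; with the whole add under one lock, the reap can only happen before the queue is created or after the item is in it.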
[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729833#action_12729833 ]

Doğacan Güney commented on NUTCH-719:

Thanks for looking into this bug. I wonder if this is the cause of the performance problem so many people are facing with Fetcher in nutch-1.0. Can it be that QueueFeeder stops feeding new URLs into FetchQueues because of this bug?
[jira] Issue Comment Edited: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729254#action_12729254 ]

Steven Denny edited comment on NUTCH-719 at 7/9/09 6:17 AM:

I've done some investigation on this. It looks to me as if queues can get reaped too early. I've put in some debug and this is what I see:

2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - FetchItemQueue::getFetchItemQueue() id=http://125.168.254.20
2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - Created queue: http://125.168.254.20
2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - reaping: http://125.168.254.20 .
2009-07-09 04:39:50,705 DEBUG fetcher.Fetcher - addFetchItem: adding item - http://www.callidan.com/ma100.htm
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - totalSize++:2 http://125.168.254.20 http://www.callidan.com/ma100.htm queuesize: 1 queuecount: 11
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://61.9.216.193, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://216.184.34.250, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://139.146.150.23, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://203.29.78.68, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://150.101.91.39, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://209.212.110.211, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://123.176.112.44, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://117.104.160.130, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://196.25.73.205, size: 0
2009-07-09 04:39:50,884 DEBUG fetcher.Fetcher - * queue: http://202.53.7.145, size: 0
2009-07-09 04:39:50,884 DEBUG fetcher.Fetcher - * queue: http://202.60.67.145, size: 1

Note that the queue is created and then immediately reaped, and after totalSize is incremented, that queue does not appear in the list, even though it supposedly has the item added to it.

The upshot is that the URL is never fetched (as the queue has gone), so totalSize never reaches 0, and eventually the abort will happen.

In short I'd say this is a sync issue, but I'm not sure where the best place to lock would be. Any comments from the author?

was (Author: stevedenny):

I've done some investigation on this. It looks to me as if queues can get reaped too early. I've put in some debug and this is what I see:

2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - FetchItemQueue::getFetchItemQueue() id=http://125.168.254.20
2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - Created queue: http://125.168.254.20
2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - reaping: http://125.168.254.20 .
2009-07-09 04:39:50,705 DEBUG fetcher.Fetcher - addFetchItem: adding item - http://www.callidan.com/ma100.htm
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - totalSize++:2 http://125.168.254.20 http://www.callidan.com/ma100.htm queuesize: 1 queuecount: 11
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://61.9.216.193, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://216.184.34.250, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://139.146.150.23, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://203.29.78.68, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://150.101.91.39, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://209.212.110.211, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://123.176.112.44, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://117.104.160.130, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://196.25.73.205, size: 0
2009-07-09 04:39:50,884 DEBUG fetcher.Fetcher - * queue: http://202.53.7.145, size: 0
2009-07-09 04:39:50,884 DEBUG fetcher.Fetcher - * queue: http://202.60.67.145, size: 1

Note that the queue is created and then immediately reaped, and after totalSize is incremented, that queue does not appear in the list, even though it supposedly has the item added to it.

It looks as if, when items are fed, there's a possibility of the queue being reaped before the item is added to the queue. However, totalSize is still incremented.

The upshot is that the URL is never fetched (as the queue has gone), so totalSize never reaches 0, and eventually the abort will happen.

In short I'd say this is a sync issue, but I'm not sure where the best place to lock would be. Any comments from the author?

fetchQueues.totalSize incorrect in Fetcher2
---
Key: NUTCH-719
URL: https://issues.apache.org/jira/browse/NUTCH-719
Project: Nutch
Issue Type: Bug
Components
[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729254#action_12729254 ]

Steven Denny commented on NUTCH-719:

I've done some investigation on this. It looks to me as if queues can get reaped too early. I've put in some debug and this is what I see:

2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - FetchItemQueue::getFetchItemQueue() id=http://125.168.254.20
2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - Created queue: http://125.168.254.20
2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - reaping: http://125.168.254.20 .
2009-07-09 04:39:50,705 DEBUG fetcher.Fetcher - addFetchItem: adding item - http://www.callidan.com/ma100.htm
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - totalSize++:2 http://125.168.254.20 http://www.callidan.com/ma100.htm queuesize: 1 queuecount: 11
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://61.9.216.193, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://216.184.34.250, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://139.146.150.23, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://203.29.78.68, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://150.101.91.39, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://209.212.110.211, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://123.176.112.44, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://117.104.160.130, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://196.25.73.205, size: 0
2009-07-09 04:39:50,884 DEBUG fetcher.Fetcher - * queue: http://202.53.7.145, size: 0
2009-07-09 04:39:50,884 DEBUG fetcher.Fetcher - * queue: http://202.60.67.145, size: 1

Note that the queue is created and then immediately reaped, and after totalSize is incremented, that queue does not appear in the list, even though it supposedly has the item added to it.

It looks as if, when items are fed, there's a possibility of the queue being reaped before the item is added to the queue. However, totalSize is still incremented.

The upshot is that the URL is never fetched (as the queue has gone), so totalSize never reaches 0, and eventually the abort will happen.

In short I'd say this is a sync issue, but I'm not sure where the best place to lock would be. Any comments from the author?

fetchQueues.totalSize incorrect in Fetcher2
---
Key: NUTCH-719
URL: https://issues.apache.org/jira/browse/NUTCH-719
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche

I had a look at the logs generated by Fetcher2 and found cases where there were no active fetchQueues but fetchQueues.totalSize was != 0:

fetcher.Fetcher2 - -activeThreads=200, spinWaiting=200, fetchQueues.totalSize=1, fetchQueues=0

Since the code relies on fetchQueues.totalSize to determine whether the work is finished or not, the task is blocked until the abortion mechanism kicks in:

2009-03-12 09:27:38,977 WARN fetcher.Fetcher2 - Aborting with 200 hung threads.

Could that be a synchronisation issue? Any ideas?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
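A minimal Java sketch of the kind of locking the comment above asks about, assuming a much-simplified model of the fetch queues. The class SketchFetchQueues and all of its methods are hypothetical illustrations; this is not the Nutch source and not necessarily the fix that was committed for this issue. It shows queue creation, item insertion and the totalSize update sharing one monitor with the reaper, so an empty queue cannot be removed between its creation and the insertion of its first item.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified stand-in for per-host fetch queues (not the Nutch classes).
class SketchFetchQueues {
    private final Map<String, Queue<String>> queues = new HashMap<>();
    private int totalSize = 0;

    // Create-if-missing, add, and counter update happen under one lock,
    // so the reaper below can never run between "created" and "item added".
    synchronized void addFetchItem(String queueId, String url) {
        queues.computeIfAbsent(queueId, id -> new LinkedList<>()).add(url);
        totalSize++;
    }

    // Returns the next URL, reaping (removing) empty queues along the way.
    // Holds the same lock as addFetchItem, so totalSize stays consistent.
    synchronized String getFetchItem() {
        Iterator<Map.Entry<String, Queue<String>>> it = queues.entrySet().iterator();
        while (it.hasNext()) {
            Queue<String> queue = it.next().getValue();
            if (queue.isEmpty()) {
                it.remove(); // reap the empty queue
                continue;
            }
            totalSize--;
            return queue.poll();
        }
        return null; // nothing to fetch right now
    }

    synchronized int getTotalSize() {
        return totalSize;
    }

    synchronized int getQueueCount() {
        return queues.size();
    }
}
```

With this arrangement the counter can only differ from the number of queued items while a thread holds the lock, which is the invariant the "is the work finished" check depends on. A single coarse lock is the simplest choice for a sketch; finer-grained locking would need the same create-and-add atomicity.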
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712492#action_12712492 ]

Otis Gospodnetic commented on NUTCH-721:

Questions:
Has anyone tried profiling this? (may be relevant: http://markmail.org/message/4ixrnvfycpgmkdno )
Or maybe simply debugged/timed various blocks of code using something as simple as print statements and simple timers?
Or maybe running just a single thread and then doing kill -QUIT a number of times to simply try and spot the method where the code seems to spend a lot of its time?

Fetcher2 Slow
-
Key: NUTCH-721
URL: https://issues.apache.org/jira/browse/NUTCH-721
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.0.0
Environment: Fedora Core r6, Kernel 2.6.22-14, jdk1.6.0_12
Reporter: Roger Dunk
Attachments: crawl_generate.tar.gz, nutch-site.xml

Fetcher2 fetches far more slowly than Fetcher1.

Config options:
fetcher.threads.fetch = 80
fetcher.threads.per.host = 80
fetcher.server.delay = 0
generate.max.per.host = 1

With a queue size of ~40,000, the result is:
activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
with maybe a download of 1 page per second.

Running with -noParse makes little difference.

CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0.

Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
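The two low-tech approaches suggested in the comment above (simple timers around blocks of code, and repeated thread dumps) can be sketched in plain Java. This is only an illustration under assumptions: the class SketchProfiling and its method names are made up here, and nothing in it is Nutch code. Thread.getAllStackTraces() is a standard JDK call that gives roughly the same information in-process that kill -QUIT prints to the console.

```java
import java.util.Map;

// Hypothetical helper for ad hoc timing and in-process thread sampling.
class SketchProfiling {

    // Runs a block of code and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable block) {
        long start = System.nanoTime();
        block.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Returns one line per live thread showing the method currently at the top
    // of its stack. Sampling this several times hints at where time is spent.
    static String topFrames() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = e.getValue();
            sb.append(e.getKey().getName())
              .append(" -> ")
              .append(stack.length > 0 ? stack[0].toString() : "(no frames)")
              .append('\n');
        }
        return sb.toString();
    }
}
```

A call such as `SketchProfiling.timeMillis(() -> someBlock())` (where someBlock is whatever code is under suspicion) can be logged alongside a few `topFrames()` samples taken some seconds apart; if most threads show the same top frame in every sample, that method is a reasonable first suspect.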
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712494#action_12712494 ]

Otis Gospodnetic commented on NUTCH-721:

Ken's thoughts: http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712506#action_12712506 ]

Roger Dunk commented on NUTCH-721:
--
My tests were done on a segment with only 1 URL per host (generate.max.per.host = 1), so I don't believe what Ken has to say is the reason, at least in my case, for Fetcher2 performing slowly.
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695277#action_12695277 ]

Doğacan Güney commented on NUTCH-721:
-
Wow, 53 min vs 3 min!? Thanks a lot for testing, and that is indeed very worrying.

Which 5000 URL set did you use? I think the crawl_generate you attached to this issue has 13K URLs?

PS: One small thing: the new Fetcher requires fewer threads than OldFetcher. If you have time, can you try with a smaller number of threads (say, 15-20)?
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695298#action_12695298 ]

Roger Dunk commented on NUTCH-721:
--
I did a -topN 5000, so only a subset of the attached, but still only 1 URL per host.

The following is with 20 threads and also no parsing.

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655 -threads 20 -noParsing
[...]
Aborting with 20 hung threads.
Fetcher: done

real    60m14.926s
user    0m38.671s
sys     0m6.134s
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695394#action_12695394 ]

Julien Nioche commented on NUTCH-721:
-
The message about the aborted hung threads looks like what I described in https://issues.apache.org/jira/browse/NUTCH-719, except that in this case there are active queues but fetchQueues.totalSize=0.

Roger: can you confirm that the parameter fetcher.threads.per.host.by.ip is set to false?
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695600#action_12695600 ]

Roger Dunk commented on NUTCH-721:
--
Julien, yes, fetcher.threads.per.host.by.ip was set to false in the above tests. I have also tried it with true, which certainly didn't help the speed issue, but I can't comment on the hung threads as I didn't bother letting the fetch complete.

I'd say there are two, likely unrelated, problems with Fetcher2.
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694986#action_12694986 ]

Doğacan Güney commented on NUTCH-721:
-
I've committed nutch 0.9 fetcher as OldFetcher. So can you test with trunk and OldFetcher?
[jira] Issue Comment Edited: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694986#action_12694986 ]

Doğacan Güney edited comment on NUTCH-721 at 4/2/09 6:01 AM:
-
I've committed the nutch 0.9 fetcher as OldFetcher. So can you test with trunk and OldFetcher, so that we can find out if this is related to the new fetcher or is the side effect of some other change?

was (Author: dogacan):
I've committed nutch 0.9 fetcher as OldFetcher. So can you test with trunk and OldFetcher?
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695170#action_12695170 ]

Roger Dunk commented on NUTCH-721:
--
For the following tests I've used the same segment containing 5000 URLs. I cleaned the named cache before the first two tests.

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.OldFetcher newcrawl/segments/20090402130655/

real    3m38.084s
user    2m20.887s
sys     0m7.470s

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655/
[...]
Fetcher: done

real    53m44.800s
user    2m20.070s
sys     0m9.527s

For this next test, I used the same segment but didn't clear the named cache from the previous test, so all resolvable hosts should still be cached. This appeared to help greatly, as oftentimes out of 80 active threads only 60 were spinwaiting (as opposed to 79 in the non-cached test), but there were still plenty of times where at least 30 consecutive log entries showed 80 threads spinwaiting. And, as can clearly be seen from the times below, it is still nowhere in the league of OldFetcher.

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655/
[...]
Aborting with 80 hung threads.
Fetcher: done

real    22m5.420s
user    2m39.407s
sys     0m8.192s
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695233#action_12695233 ]

Hudson commented on NUTCH-721:
--
Integrated in Nutch-trunk #772 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/772/])
- Commit old fetcher as OldFetcher for now so that we can test Fetcher2 performance.
[jira] Commented: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694708#action_12694708 ]

Doğacan Güney commented on NUTCH-721:
-
OK, there is clearly a problem with the new fetcher.

First, let's make sure that there is indeed a problem with the new fetcher and this is not the side effect of some other code we introduced between 0.9 and 1.0. So I suggest that we re-commit the old fetcher back into trunk and do a side-by-side comparison to make sure that the problem is with the new fetcher.

If it is with the new fetcher, then we may try to salvage Todd's work (I remember that he said that his fetcher was faster, right?).
[jira] Created: (NUTCH-721) Fetcher2 Slow
Fetcher2 Slow
-
Key: NUTCH-721
URL: https://issues.apache.org/jira/browse/NUTCH-721
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.0.0
Environment: Fedora Core r6, Kernel 2.6.22-14, jdk1.6.0_12
Reporter: Roger Dunk

Fetcher2 fetches far more slowly than Fetcher1.

Config options:
fetcher.threads.fetch = 80
fetcher.threads.per.host = 80
fetcher.server.delay = 0
generate.max.per.host = 1

With a queue size of ~40,000, the result is:
activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
with maybe a download of 1 page per second.

Running with -noParse makes little difference.

CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0.

Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-721) Fetcher2 Slow
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roger Dunk updated NUTCH-721:
-
Attachment: nutch-site.xml
            crawl_generate.tar.gz
[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678573#action_12678573 ]

Hudson commented on NUTCH-669:
--
Integrated in Nutch-trunk #742 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/742/])

Consolidate code for Fetcher and Fetcher2
-
Key: NUTCH-669
URL: https://issues.apache.org/jira/browse/NUTCH-669
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Assignee: Sami Siren
Fix For: 1.0.0

I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java.

It seems to me like there are the following differences:
- Fetcher relies on the Protocol to obey robots.txt and crawl delay settings whereas Fetcher2 implements them itself
- Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it.

I've begun work on this but want to check with people on the following:
- What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality?
- Is it on the road map to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard?
- Any other improvements wanted for Fetcher while I am in and around the code?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
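The "queue per crawl host" model mentioned in the issue description above can be illustrated with a small Java sketch. It rests on assumptions: the class SketchHostQueue, its fields and its poll(nowMs) method are hypothetical, and the real Fetcher2 queues differ in detail (for example in how and when the next permitted fetch time is computed). The point is only that politeness can be enforced by the queue itself rather than by the protocol layer.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical per-host queue that enforces a crawl delay on its own.
class SketchHostQueue {
    private final Deque<String> urls = new ArrayDeque<>();
    private final long crawlDelayMs;
    private long nextFetchTimeMs = 0;

    SketchHostQueue(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    synchronized void add(String url) {
        urls.addLast(url);
    }

    // Returns the next URL for this host if the host may be contacted at nowMs,
    // otherwise null. A fetcher thread that gets null from every queue has
    // nothing to do and would wait and retry.
    synchronized String poll(long nowMs) {
        if (urls.isEmpty() || nowMs < nextFetchTimeMs) {
            return null;
        }
        nextFetchTimeMs = nowMs + crawlDelayMs; // earliest time for the next request
        return urls.pollFirst();
    }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic; a real fetcher would more likely read System.currentTimeMillis() and keep one such queue per host or IP in a map.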
[jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren resolved NUTCH-669.
--
Resolution: Fixed

replaced fetcher with fetcher2
Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
Sami Siren (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Sami Siren resolved NUTCH-669.
> --
> Resolution: Fixed
>
> replaced fetcher with fetcher2

I'm puzzled .. it seemed the goal was to integrate Todd's patch, which effectively replaces both Fetchers. Does this mean that Todd's version was not ready, or is the current code based on Todd's version?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
Andrzej Bialecki wrote: Sami Siren (JIRA) wrote: [ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-669. -- Resolution: Fixed replaced fetcher with fetcher2 I'm puzzled .. it seemed the goal was to integrate Todd's patch, which effectively replaces both Fetchers. Does this mean that Todd's version was not ready, or is the current code based on Todd's version? There was no patch from Todd that I could see; he never provided one even after being asked multiple times, first by you in Dec 2008, then by dogacan in Jan 2009 and finally by me last week. My motivation to get this fixed was, as I understood most of the developers thought too, to get rid of the burden of supporting two classes providing roughly the same piece of functionality. I opened a jira for this but closed it soon after as you told me it was a duplicate of this one. So, what I did was: replaced original Fetcher with Fetcher2. The Fetcher is still there to be improved by Todd and others at will. -- Sami Siren
Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
Sami Siren wrote: There was no patch from Todd that I could see; he never provided one even after being asked multiple times, first by you in Dec 2008, then by dogacan in Jan 2009 and finally by me last week. My motivation to get this fixed was, as I understood most of the developers thought too, to get rid of the burden of supporting two classes providing roughly the same piece of functionality. I opened a jira for this but closed it soon after as you told me it was a duplicate of this one. So, what I did was: replaced original Fetcher with Fetcher2. The Fetcher is still there to be improved by Todd and others at will. Ok, I understand now - given the circumstances I agree this was the right thing to do. -- Best regards, Andrzej Bialecki. Information Retrieval, Semantic Web; Embedded Unix, System Integration. http://www.sigram.com Contact: info at sigram dot com
Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
Hey guys, Sorry for the non-responsiveness here. I recently left my old employment and have been packing for a cross-country move. I agree that for 1.0 the best bet is what Sami has done. The code that I was working on is available here: http://github.com/toddlipcon/nutch/tree/nutch-669 But it is not production ready - notably there's a problem whereby it runs out of memory even with a reasonably large heap. I'm not sure if I'll be able to complete working on it, given the cluster (and workload) I was using to test were from my old job, but I'm happy to provide any assistance understanding the work I began if you'd like to try to integrate it for 1.1 -Todd On Mon, Mar 2, 2009 at 9:48 AM, Andrzej Bialecki a...@getopt.org wrote: Ok, I understand now - given the circumstances I agree this was the right thing to do.
[jira] Assigned: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren reassigned NUTCH-669: Assignee: Sami Siren
[jira] Created: (NUTCH-701) replace Fetcher with Fetcher2
replace Fetcher with Fetcher2 - Key: NUTCH-701 URL: https://issues.apache.org/jira/browse/NUTCH-701 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.0.0 Currently there are two fetcher implementations within Nutch, one too many. This task tracks the process of promoting Fetcher2. My plan is basically to - remove Fetcher altogether and rename Fetcher2 to Fetcher - fix the Crawl class so it works with the Fetcher2 API. If there are no objections I will proceed with this soon.
[jira] Updated: (NUTCH-701) Replace Fetcher with Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-701: - Summary: Replace Fetcher with Fetcher2 (was: replace Fetcher with Fetcher2)
[jira] Commented: (NUTCH-701) Replace Fetcher with Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676240#action_12676240 ] Andrzej Bialecki commented on NUTCH-701: - This is a duplicate of NUTCH-669. Please follow-up with Todd to finalize that issue instead.
[jira] Resolved: (NUTCH-701) Replace Fetcher with Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-701. -- Resolution: Duplicate
[jira] Updated: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-669: - Fix Version/s: 1.0.0 (was: 1.1) Moving this back to 1.0. Are you close with your patch? As discussed in this thread we should just replace Fetcher with Fetcher2, change the Crawl class and check that the tests pass. Other issues we can deal with in their own tickets. I can also help with this if you don't have the time.
[jira] Commented: (NUTCH-626) fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects
[ https://issues.apache.org/jira/browse/NUTCH-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676496#action_12676496 ] Hudson commented on NUTCH-626: -- Integrated in Nutch-trunk #735 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/735/]) - Fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects, contributed by Remco Verhoef, dogacan fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects -- Key: NUTCH-626 URL: https://issues.apache.org/jira/browse/NUTCH-626 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Linux Debian Reporter: Remco Verhoef Assignee: Sami Siren Fix For: 1.0.0 Attachments: fetcher2.diff, NUTCH-626_v2.patch Fetcher2 breaks out of the db.ignore.external.links directive when encountering a cross-domain redirect. The redirected URL is followed without checking db.ignore.external.links or whether the target is on another domain.
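The behaviour reported in NUTCH-626 can be illustrated with a small sketch. This is not the code from fetcher2.diff or NUTCH-626_v2.patch, whose contents are not shown in this thread. The class RedirectCheckSketch and the method shouldFollowRedirect are hypothetical names, and treating "external" as "different host name" is an assumption made only for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class RedirectCheckSketch {

  /** Returns the lower-cased host of a URL, or null if it cannot be determined. */
  static String hostOf(String url) {
    try {
      String host = new URI(url).getHost();
      return host == null ? null : host.toLowerCase(Locale.ROOT);
    } catch (URISyntaxException e) {
      return null;
    }
  }

  /**
   * Hypothetical check: when external links are to be ignored, a redirect is
   * followed only if it stays on the same host as the original URL.
   */
  static boolean shouldFollowRedirect(String fromUrl, String toUrl,
                                      boolean ignoreExternalLinks) {
    if (!ignoreExternalLinks) {
      return true; // no restriction configured
    }
    String fromHost = hostOf(fromUrl);
    String toHost = hostOf(toUrl);
    // Unparseable URLs are treated as external, the conservative choice.
    return fromHost != null && fromHost.equals(toHost);
  }

  public static void main(String[] args) {
    System.out.println(shouldFollowRedirect(
        "http://example.com/a", "http://example.com/b", true));
    System.out.println(shouldFollowRedirect(
        "http://example.com/a", "http://other.example.org/", true));
  }
}
```

The point of the sketch is only that the redirect target goes through the same filter as any other outlink before it is queued, which is the check the issue says was missing.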
[jira] Updated: (NUTCH-626) fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects
[ https://issues.apache.org/jira/browse/NUTCH-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-626: Attachment: NUTCH-626_v2.patch I updated your patch to apply and compile in latest trunk. I am not committing this patch since I don't want to mess with Todd's Fetcher work. For now :D
[jira] Commented: (NUTCH-679) Fetcher2 implementing Tool
[ https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665791#action_12665791 ] julien nioche commented on NUTCH-679: - I can send a modified version of it once Todd has finished working on the Fetchers. Same for https://issues.apache.org/jira/browse/NUTCH-658
[jira] Commented: (NUTCH-679) Fetcher2 implementing Tool
[ https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665482#action_12665482 ] Otis Gospodnetic commented on NUTCH-679: I'm not sure, but committing this may mess up Todd's work on merging Fetcher and Fetcher2.
[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665562#action_12665562 ] Doğacan Güney commented on NUTCH-669: - Hi Todd, Can you upload your work to JIRA now, so that we can review and merge it for 1.0?
[jira] Commented: (NUTCH-679) Fetcher2 implementing Tool
[ https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665125#action_12665125 ] Doğacan Güney commented on NUTCH-679: - Looks simple enough. I am going to commit it soon if no objections. Btw, please use 2-space tabs otherwise it messes up patches :)
[jira] Created: (NUTCH-679) Fetcher2 implementing Tool
Fetcher2 implementing Tool -- Key: NUTCH-679 URL: https://issues.apache.org/jira/browse/NUTCH-679 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: julien nioche Priority: Minor Attachments: Fetcher2.Tool.patch The patch attached makes Fetcher2 implement Tool. As a result we should be able to override parameters on the command line e.g. bin/nutch fetch2 -Dfetcher.server.min.delay=1.0 -Dmapred.reduce.tasks=4 segments/20090115072836 instead of having to modify the *-site.xml files in conf/
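For readers unfamiliar with what implementing Tool buys here: when a Hadoop Tool is launched through ToolRunner, generic options such as -Dname=value are applied to the configuration before the tool sees its remaining arguments. The sketch below is not the attached Fetcher2.Tool.patch and does not use Hadoop at all. It is a minimal stand-in, with the hypothetical names OverrideSketch, Parsed and split, that only mimics the separation of -Dname=value overrides from positional arguments (and handles only the single-token form used in the example command).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OverrideSketch {

  /** Holds the parsed overrides and the arguments left for the tool itself. */
  static final class Parsed {
    final Map<String, String> overrides = new LinkedHashMap<>();
    final List<String> remaining = new ArrayList<>();
  }

  /**
   * Hypothetical helper: splits "-Dname=value" tokens from everything else.
   * Real Hadoop option parsing supports more forms than this.
   */
  static Parsed split(String[] args) {
    Parsed parsed = new Parsed();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      if (arg.startsWith("-D") && eq > 2) {
        // Text between "-D" and "=" is the property name, the rest its value.
        parsed.overrides.put(arg.substring(2, eq), arg.substring(eq + 1));
      } else {
        parsed.remaining.add(arg);
      }
    }
    return parsed;
  }

  public static void main(String[] args) {
    Parsed parsed = split(new String[] {
        "-Dfetcher.server.min.delay=1.0",
        "-Dmapred.reduce.tasks=4",
        "segments/20090115072836"});
    System.out.println("overrides: " + parsed.overrides);
    System.out.println("remaining: " + parsed.remaining);
  }
}
```

In the real setup the overrides would end up in the job configuration on top of the values from the *-site.xml files, and only the segment path would be passed on to the fetcher.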
[jira] Updated: (NUTCH-679) Fetcher2 implementing Tool
[ https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] julien nioche updated NUTCH-679: Attachment: Fetcher2.Tool.patch Patch which makes Fetcher2 implement Tool interface
[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660382#action_12660382 ] Todd Lipcon commented on NUTCH-669: --- Here's a further report on my progress:
- It turns out the change in NUTCH-676 caused things to break: there's some behavior in Nutch's MapWritable that differs from Hadoop's, so it was spending all of its time in output.collect. I think the writables were accruing lots of key/value pairs that they weren't supposed to. So, this doesn't depend on NUTCH-676.
- I implemented adaptive crawl delay (NUTCH-475) in the new fetcher.
- Also implemented early termination as discussed in this mailing list thread: http://www.nabble.com/proposal:-fetcher-performance-improvements-td20939872.html
Results so far are looking good. I was able to run a 1M url fetch with 5000 urls per host at a sustained rate of 25 pages/second (total around 11 hours). About 60% of the URLs ended up parsed, which isn't significantly worse than I usually see without early termination, but past attempts to run 1M fetches have taken several days because of some slow hosts. I'm running a 2M+ URL fetch right now and have been sustaining 40-60mbit inbound from 8 fetchers for the last couple hours.
- I did experience one GC error - I think I need to add some cleanup of empty queues out of the FetchQueue structure when the number of unique hosts is very high.
Complete history is here: http://github.com/toddlipcon/nutch/tree/nutch-669
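The cleanup of empty queues mentioned in the report can be sketched as follows. This is not code from Todd's branch or from Nutch. HostQueuesSketch and its methods are hypothetical, and crawl delays and politeness are deliberately left out. The sketch only shows a queue-per-host map that drops a host's queue as soon as it is drained, so the map does not grow with every host ever seen, and that updates the total counter under the same lock as the map.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class HostQueuesSketch {

  // One FIFO queue of URLs per host (hypothetical structure for illustration).
  private final Map<String, Queue<String>> queues = new HashMap<>();
  private int totalSize = 0;

  /** Adds a URL to its host's queue, creating the queue on first use. */
  public synchronized void add(String host, String url) {
    queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    totalSize++;
  }

  /** Removes and returns the next URL for the host, or null if none is queued. */
  public synchronized String poll(String host) {
    Queue<String> queue = queues.get(host);
    if (queue == null) {
      return null;
    }
    String url = queue.poll();
    if (url != null) {
      totalSize--;
    }
    if (queue.isEmpty()) {
      queues.remove(host); // drop drained queues so the map stays small
    }
    return url;
  }

  public synchronized int totalSize() {
    return totalSize;
  }

  public synchronized int queueCount() {
    return queues.size();
  }

  public static void main(String[] args) {
    HostQueuesSketch q = new HostQueuesSketch();
    q.add("a.example", "http://a.example/1");
    q.add("b.example", "http://b.example/1");
    q.poll("b.example");
    System.out.println("queues=" + q.queueCount() + " totalSize=" + q.totalSize());
  }
}
```

Because every method that touches the map or the counter is synchronized on the same object, the two cannot drift apart in this sketch, at the cost of a single lock shared by all threads.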
[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660397#action_12660397 ] Otis Gospodnetic commented on NUTCH-669: Todd, and when you say sustained rate of 25 pages/second that means the final rate you see on one of the status screens? In other words, this is not a rate you see being steady while the fetch run is in the full swing (which could be a lot higher), but rather the final rate?
[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659857#action_12659857 ] Todd Lipcon commented on NUTCH-669: --- Hey guys, I tried it on production, but ran into an Exception of some sort that happened very rarely. Then I went on vacation for 2 weeks and came back to find the logs gone from my hadoop tracker, so I can't figure out what the Exception was ;-) I'll run another segment today hopefully and let you know the results. -Todd
[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659958#action_12659958 ] Todd Lipcon commented on NUTCH-669: --- Found the exception in a screen log:
{noformat}
java.lang.NullPointerException
    at org.apache.nutch.crawl.MapWritable$KeyValueEntry.access$102(MapWritable.java:469)
    at org.apache.nutch.crawl.MapWritable.readFields(MapWritable.java:362)
    at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:250)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
    at org.apache.nutch.fetcher.Fetcher$FetchMapper.run(Fetcher.java:399)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
{noformat}
I think NUTCH-676 may help this. Trying another run in a minute.
[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659343#action_12659343 ] Andrzej Bialecki commented on NUTCH-669: - Well ... have you tried it? How did it go? I think it's time to upload the patch to JIRA, so that we can decide what to do using a concrete snapshot of your work.
[jira] Updated: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-669: --- Priority: Major (was: Minor) Fix Version/s: 1.0.0 +1 -- people, vote for it. This could go in 1.0, right?
[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655491#action_12655491 ] Todd Lipcon commented on NUTCH-669: --- For those watching this issue: I pushed a couple more changes to the github repo linked above. I'm about to try it on production with a 100K url segment, 80 threads, limit by IP, 8 crawler nodes. We'll see how it goes.
[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653844#action_12653844 ] Todd Lipcon commented on NUTCH-669: --- Agreed on all fronts. I spent several hours yesterday refactoring/rewriting Fetcher2 to be a little cleaner . One of the changes was to factor out the queueing policies into a new class and replace the Thread-based model with one based on ExecutorServices. I may also try to factor out the actual fetching into a new class as well. I haven't gotten to testing the new version quite yet but hopefully should have a patch available next week, and perhaps some intermediate commits available on github this afternoon so people can see where I'm headed. Is there a unit (or functional) testing infrastructure I can use somewhere to test this? -Todd Consolidate code for Fetcher and Fetcher2 - Key: NUTCH-669 URL: https://issues.apache.org/jira/browse/NUTCH-669 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.9.0 Reporter: Todd Lipcon Priority: Minor I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java. It seems to me like there are the following differences: - Fetcher relies on the Protocol to obey robots.txt and crawl delay settings whereas Fetcher2 implements them itself - Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it. I've begun work on this but want to check with people on the following: - What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality? - Is it on the road map to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard? - Any other improvements wanted for Fetcher while I am in and around the code? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653940#action_12653940 ] Todd Lipcon commented on NUTCH-669: --- I've pushed the initial commit of this rewrite/refactor to github: http://github.com/toddlipcon/nutch/commit/5c9d99a856628c842b50b1d76f62b375f377bf95 Might be worth just reviewing it as if it were a new file rather than a diff: http://github.com/toddlipcon/nutch/tree/5c9d99a856628c842b50b1d76f62b375f377bf95/src/java/org/apache/nutch/fetcher/Fetcher.java Still have some more cleanup and revisions here, plus I want to test it on a real crawl or two from our cluster. It currently passes the TestFetcher unit test but I don't know what the coverage is on that. I'll attach a patch here before it's ready to be committed so I can check off the license grant checkbox, which I know is important for ASF. Consolidate code for Fetcher and Fetcher2 - Key: NUTCH-669 URL: https://issues.apache.org/jira/browse/NUTCH-669 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.9.0 Reporter: Todd Lipcon Priority: Minor I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java. It seems to me like there are the following differences: - Fetcher relies on the Protocol to obey robots.txt and crawl delay settings whereas Fetcher2 implements them itself - Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it. I've begun work on this but want to check with people on the following: - What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality? - Is it on the road map to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard? - Any other improvements wanted for Fetcher while I am in and around the code?
[jira] Created: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
Consolidate code for Fetcher and Fetcher2 - Key: NUTCH-669 URL: https://issues.apache.org/jira/browse/NUTCH-669 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.9.0 Reporter: Todd Lipcon Priority: Minor I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java. It seems to me like there are the following differences: - Fetcher relies on the Protocol to obey robots.txt and crawl delay settings whereas Fetcher2 implements them itself - Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it. I've begun work on this but want to check with people on the following: - What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality? - Is it on the road map to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard? - Any other improvements wanted for Fetcher while I am in and around the code? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
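The "queue per crawl host" model mentioned in the issue can be pictured with a small sketch. This is only an illustration under assumptions: the class name PerHostQueuesSketch and its methods are hypothetical and are not taken from Fetcher2; the sketch keeps one FIFO queue per host plus a running total, and it guards both with the same lock so the two cannot drift apart.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of a queue-per-host model; not the Fetcher2 implementation. */
final class PerHostQueuesSketch {
    // One FIFO queue of URLs per host, created on demand.
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    private int totalSize = 0;

    /** Adds a URL to the queue that belongs to its host. */
    synchronized void add(String url) {
        String host = URI.create(url).getHost();
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        totalSize++;
    }

    /** Removes and returns the next URL for a host, or null if nothing is queued for it. */
    synchronized String poll(String host) {
        Deque<String> queue = queues.get(host);
        if (queue == null) {
            return null;
        }
        String url = queue.poll(); // never null here: empty queues are removed below
        totalSize--;
        if (queue.isEmpty()) {
            queues.remove(host); // keep only hosts that still have work
        }
        return url;
    }

    synchronized int totalSize() {
        return totalSize;
    }

    synchronized int queueCount() {
        return queues.size();
    }

    public static void main(String[] args) {
        PerHostQueuesSketch q = new PerHostQueuesSketch();
        q.add("http://a.example/1");
        q.add("http://a.example/2");
        q.add("http://b.example/1");
        System.out.println(q.queueCount() + " queues, " + q.totalSize() + " items");
        System.out.println(q.poll("a.example"));
    }
}
```

Per-host limits (delay, thread count) would then be attached to each queue rather than enforced inside the protocol plugin, which is the division of duties the issue asks about.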
[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653487#action_12653487 ] Doğacan Güney commented on NUTCH-669: - * What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality? Agreed. We should just rename Fetcher2 to Fetcher and be done with it :D Consolidate code for Fetcher and Fetcher2 - Key: NUTCH-669 URL: https://issues.apache.org/jira/browse/NUTCH-669 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.9.0 Reporter: Todd Lipcon Priority: Minor I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java. It seems to me like there are the following differences: - Fetcher relies on the Protocol to obey robots.txt and crawl delay settings whereas Fetcher2 implements them itself - Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it. I've begun work on this but want to check with people on the following: - What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality? - Is it on the road map to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard? - Any other improvements wanted for Fetcher while I am in and around the code? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-626) fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects
[ https://issues.apache.org/jira/browse/NUTCH-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney reassigned NUTCH-626: --- Assignee: Doğacan Güney fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects -- Key: NUTCH-626 URL: https://issues.apache.org/jira/browse/NUTCH-626 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Linux Debian Reporter: Remco Verhoef Assignee: Doğacan Güney Attachments: fetcher2.diff Fetcher2 breaks out of the db.ignore.external.links directive when encountering a cross-domain redirect. The redirected URL is followed without checking for db.ignore.external.links and cross domain.
Fetcher2 Reduce Phase Question
Hi Folks, I was just wondering what computation really happens in the reduce phase for Fetcher2 ? I know that it is implemented as a MapRunnable -- but I see no explicit reducer being set for the job. Is the identity reducer being used ? Why can't we simply use job.setNumReduceTasks(0) ? Wouldn't this be faster? Sandeep
Re: Fetcher2 Reduce Phase Question
Sandeep Tata wrote:
> Hi Folks, I was just wondering what computation really happens in the reduce phase for Fetcher2?
If Fetcher was running in the parsing mode, then in the reduce phase Outlinks are separated from Parse output and stored in crawl_parse, and other data in parse_text and parse_data. This actually happens in FetcherOutputFormat / ParseOutputFormat, so there is no need for any Reduce apart from the IdentityReducer (default).
> I know that it is implemented as a MapRunnable -- but I see no explicit reducer being set for the job. Is the identity reducer being used? Why can't we simply use job.setNumReduceTasks(0)? Wouldn't this be faster?
First, when Fetcher / Fetcher2 were written there was no such option in Hadoop. Second, the meaning of this setting is that the output from maps becomes the final output - but this won't cut it, because map outputs are always simple SequenceFiles, whereas we need to split the FetcherOutput into a bunch of SequenceFiles and MapFiles (which have to be sorted) ...
-- Best regards, Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
[jira] Created: (NUTCH-626) fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects
fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects -- Key: NUTCH-626 URL: https://issues.apache.org/jira/browse/NUTCH-626 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Linux Debian Reporter: Remco Verhoef Fetcher2 breaks out of the db.ignore.external.links directive when encountering a cross-domain redirect. The redirected URL is followed without checking for db.ignore.external.links and cross domain.
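The missing check can be sketched roughly as follows. This is a minimal illustration, not the attached patch: the class and method names are hypothetical, and "external" is assumed here to mean "different host name", which may be stricter or looser than what Nutch actually applies.

```java
import java.net.URI;

/** Hypothetical redirect filter; illustrates the check the issue says is missing. */
final class ExternalRedirectSketch {
    /**
     * Decides whether a redirect should be followed.
     *
     * @param ignoreExternalLinks stands in for the db.ignore.external.links setting
     */
    static boolean shouldFollow(String fromUrl, String toUrl, boolean ignoreExternalLinks) {
        if (!ignoreExternalLinks) {
            return true; // no restriction configured
        }
        String fromHost = URI.create(fromUrl).getHost();
        String toHost = URI.create(toUrl).getHost();
        // Assumption: same host (case-insensitive) counts as internal.
        return fromHost != null && fromHost.equalsIgnoreCase(toHost);
    }

    public static void main(String[] args) {
        String from = "http://www.example.org/start";
        System.out.println(shouldFollow(from, "http://www.example.org/moved", true));
        System.out.println(shouldFollow(from, "http://other.example.net/", true));
    }
}
```

The point of the sketch is only that the redirect target goes through the same external-link test as any other outlink before it is queued.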
[jira] Updated: (NUTCH-626) fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects
[ https://issues.apache.org/jira/browse/NUTCH-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remco Verhoef updated NUTCH-626: Attachment: fetcher2.diff This patch also fixes another issue with redirects. fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects -- Key: NUTCH-626 URL: https://issues.apache.org/jira/browse/NUTCH-626 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Linux Debian Reporter: Remco Verhoef Attachments: fetcher2.diff Fetcher2 breaks out of the db.ignore.external.links directive when encountering a cross-domain redirect. The redirected URL is followed without checking for db.ignore.external.links and cross domain.
[jira] Closed: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
[ https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-592. --- Resolution: Duplicate Assignee: Andrzej Bialecki (was: Emmanuel Joke) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED - Key: NUTCH-592 URL: https://issues.apache.org/jira/browse/NUTCH-592 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Emmanuel Joke Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: patch.txt I get an NPE for pages with status ProtocolStatus.TEMP_MOVED. It seems the handleRedirect function can return null in a few cases, and this is not handled as it is for the ProtocolStatus.SUCCESS case.
[jira] Commented: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
[ https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12578786#action_12578786 ] Andrzej Bialecki commented on NUTCH-592: - Duplicate of NUTCH-597 and NUTCH-615. Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED - Key: NUTCH-592 URL: https://issues.apache.org/jira/browse/NUTCH-592 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Emmanuel Joke Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: patch.txt I get an NPE for pages with status ProtocolStatus.TEMP_MOVED. It seems the handleRedirect function can return null in a few cases, and this is not handled as it is for the ProtocolStatus.SUCCESS case.
[jira] Commented: (NUTCH-597) Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish.
[ https://issues.apache.org/jira/browse/NUTCH-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12559382#action_12559382 ] Hudson commented on NUTCH-597: -- Integrated in Nutch-Nightly #330 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/330/]) Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish. -- Key: NUTCH-597 URL: https://issues.apache.org/jira/browse/NUTCH-597 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Debian Reporter: Remco Verhoef Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: fetcher2.java.patch When fetcher.threads.per.host.by.ip is set to true the following exception is thrown when the host does not exist. FetchItem.create returns null when it is not able to resolve the host address when it is redirecting. 2007-12-30 15:34:42,720 WARN fetcher.Fetcher2 - Unable to resolve: {url} , skipping. 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - java.lang.NullPointerException 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:327) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:323) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:632) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - fetcher caught:java.lang..NullPointerException 2007-12-30 15:34:42,721 INFO fetcher.Fetcher2 - -finishing thread FetcherThread, activeThreads=49 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-597) Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish.
[ https://issues.apache.org/jira/browse/NUTCH-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-597. - Resolution: Fixed Fix Version/s: 1.0.0 Assignee: Andrzej Bialecki Patch applied in rev. 612264. Thank you! Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish. -- Key: NUTCH-597 URL: https://issues.apache.org/jira/browse/NUTCH-597 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Debian Reporter: Remco Verhoef Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: fetcher2.java.patch When fetcher.threads.per.host.by.ip is set to true the following exception is thrown when the host does not exist. FetchItem.create returns null when it is not able to resolve the host address when it is redirecting. 2007-12-30 15:34:42,720 WARN fetcher.Fetcher2 - Unable to resolve: {url} , skipping. 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - java.lang.NullPointerException 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:327) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:323) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:632) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - fetcher caught:java.lang..NullPointerException 2007-12-30 15:34:42,721 INFO fetcher.Fetcher2 - -finishing thread FetcherThread, activeThreads=49 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-597) Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish.
[ https://issues.apache.org/jira/browse/NUTCH-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-597. --- Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish. -- Key: NUTCH-597 URL: https://issues.apache.org/jira/browse/NUTCH-597 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Debian Reporter: Remco Verhoef Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: fetcher2.java.patch When fetcher.threads.per.host.by.ip is set to true the following exception is thrown when the host does not exist. FetchItem.create returns null when it is not able to resolve the host address when it is redirecting. 2007-12-30 15:34:42,720 WARN fetcher.Fetcher2 - Unable to resolve: {url} , skipping. 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - java.lang.NullPointerException 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:327) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:323) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:632) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - fetcher caught:java.lang..NullPointerException 2007-12-30 15:34:42,721 INFO fetcher.Fetcher2 - -finishing thread FetcherThread, activeThreads=49 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
[ https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12559272#action_12559272 ] Andrzej Bialecki commented on NUTCH-592: - This seems to be a duplicate of NUTCH-597. If you have no objections I will close this issue. Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED - Key: NUTCH-592 URL: https://issues.apache.org/jira/browse/NUTCH-592 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Emmanuel Joke Assignee: Emmanuel Joke Fix For: 1.0.0 Attachments: patch.txt I get an NPE for pages with status ProtocolStatus.TEMP_MOVED. It seems the handleRedirect function can return null in a few cases, and this is not handled as it is for the ProtocolStatus.SUCCESS case.
[jira] Created: (NUTCH-597) Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish.
Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish. -- Key: NUTCH-597 URL: https://issues.apache.org/jira/browse/NUTCH-597 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Debian Reporter: Remco Verhoef When fetcher.threads.per.host.by.ip is set to true the following exception is thrown when the host does not exist. FetchItem.create returns null when it is not able to resolve the host address when it is redirecting. 2007-12-30 15:34:42,720 WARN fetcher.Fetcher2 - Unable to resolve: {url} , skipping. 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - java.lang.NullPointerException 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:327) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:323) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:632) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - fetcher caught:java.lang..NullPointerException 2007-12-30 15:34:42,721 INFO fetcher.Fetcher2 - -finishing thread FetcherThread, activeThreads=49 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
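The failure mode in the report (a factory that returns null for an unresolvable redirect target, followed by code that uses the result unconditionally) can be sketched as below. This is a hedged illustration rather than the attached patch: Item, create and followRedirect are hypothetical names, and DNS resolution is replaced by a plain map so the example runs offline.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of guarding against a null fetch item after a redirect. */
final class RedirectNullGuardSketch {
    /** Stand-in for a fetch item keyed by the resolved address of its host. */
    static final class Item {
        final String url;
        final String queueId;

        Item(String url, String queueId) {
            this.url = url;
            this.queueId = queueId;
        }
    }

    /** Returns null when the host cannot be resolved, as the report says FetchItem.create does. */
    static Item create(String url, Map<String, String> resolver) {
        String host = URI.create(url).getHost();
        String address = (host == null) ? null : resolver.get(host);
        return (address == null) ? null : new Item(url, address);
    }

    /** Queues the redirect target if it resolves; returns false when it was skipped. */
    static boolean followRedirect(String newUrl, Map<String, String> resolver, List<Item> queue) {
        Item item = create(newUrl, resolver);
        if (item == null) {
            return false; // unresolvable target: skip it instead of passing null along
        }
        queue.add(item);
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> resolver = new HashMap<>();
        resolver.put("known.example", "192.0.2.10");
        List<Item> queue = new ArrayList<>();
        System.out.println(followRedirect("http://known.example/a", resolver, queue));
        System.out.println(followRedirect("http://missing.example/b", resolver, queue));
    }
}
```

With the guard in place, a failed lookup ends in a skipped URL rather than an exception that terminates the fetcher thread.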
[jira] Updated: (NUTCH-597) Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish.
[ https://issues.apache.org/jira/browse/NUTCH-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remco Verhoef updated NUTCH-597: Attachment: fetcher2.java.patch Contains the patch code for Fetcher2.java. Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish. -- Key: NUTCH-597 URL: https://issues.apache.org/jira/browse/NUTCH-597 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Debian Reporter: Remco Verhoef Attachments: fetcher2.java.patch When fetcher.threads.per.host.by.ip is set to true the following exception is thrown when the host does not exist. FetchItem.create returns null when it is not able to resolve the host address when it is redirecting. 2007-12-30 15:34:42,720 WARN fetcher.Fetcher2 - Unable to resolve: {url} , skipping. 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - java.lang.NullPointerException 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:327) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:323) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:632) 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - fetcher caught:java.lang..NullPointerException 2007-12-30 15:34:42,721 INFO fetcher.Fetcher2 - -finishing thread FetcherThread, activeThreads=49 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED - Key: NUTCH-592 URL: https://issues.apache.org/jira/browse/NUTCH-592 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Emmanuel Joke Assignee: Emmanuel Joke Fix For: 1.0.0 Attachments: patch.txt I get an NPE for pages with status ProtocolStatus.TEMP_MOVED. It seems the handleRedirect function can return null in a few cases, and this is not handled as it is for the ProtocolStatus.SUCCESS case.
[jira] Updated: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
[ https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-592: Attachment: patch.txt Patch provided. Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED - Key: NUTCH-592 URL: https://issues.apache.org/jira/browse/NUTCH-592 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Emmanuel Joke Assignee: Emmanuel Joke Fix For: 1.0.0 Attachments: patch.txt I get an NPE for pages with status ProtocolStatus.TEMP_MOVED. It seems the handleRedirect function can return null in a few cases, and this is not handled as it is for the ProtocolStatus.SUCCESS case.
[jira] Commented: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly
[ https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508747 ] Hudson commented on NUTCH-474: -- Integrated in Nutch-Nightly #131 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/]) Fetcher2 sets server-delay and blocking checks incorrectly -- Key: NUTCH-474 URL: https://issues.apache.org/jira/browse/NUTCH-474 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Doğacan Güney Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: fetcher2.patch 1) Fetcher2 sets server delay incorrectly. It sets the delay to minCrawlDelay if maxThreads == 1 and to crawlDelay otherwise. Correct behaviour should be the opposite. 2) Fetcher2 sets wrong configuration options so host blocking is still handled by the lib-http plugin (Fetcher2 is designed to handle blocking internally). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly
On 6/28/07, Hudson (JIRA) [EMAIL PROTECTED] wrote:
> [ https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508747 ] Hudson commented on NUTCH-474: -- Integrated in Nutch-Nightly #131 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/])
*sigh* I wrote NUTCH-474 instead of NUTCH-434 in svn log. Sorry everyone...
> Fetcher2 sets server-delay and blocking checks incorrectly -- Key: NUTCH-474 URL: https://issues.apache.org/jira/browse/NUTCH-474 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Doğacan Güney Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: fetcher2.patch 1) Fetcher2 sets server delay incorrectly. It sets the delay to minCrawlDelay if maxThreads == 1 and to crawlDelay otherwise. Correct behaviour should be the opposite. 2) Fetcher2 sets wrong configuration options so host blocking is still handled by the lib-http plugin (Fetcher2 is designed to handle blocking internally).
-- Doğacan Güney
[jira] Resolved: (NUTCH-495) Unnecessary delays in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney resolved NUTCH-495. - Resolution: Fixed Assignee: Doğacan Güney Committed in rev 547901. Unnecessary delays in Fetcher2 -- Key: NUTCH-495 URL: https://issues.apache.org/jira/browse/NUTCH-495 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Doğacan Güney Assignee: Doğacan Güney Priority: Minor Fix For: 1.0.0 Attachments: fetcher2_robots.patch Even if a URL is blocked by robots.txt (or has a crawl delay larger than max.crawl.delay), Fetcher2 still waits fetcher.server.delay before fetching another URL from the same host, which is not necessary, considering that Fetcher2 didn't make a request to the server anyway.
[jira] Created: (NUTCH-495) Unnecessary delays in Fetcher2
Unnecessary delays in Fetcher2 -- Key: NUTCH-495 URL: https://issues.apache.org/jira/browse/NUTCH-495 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Doğacan Güney Priority: Minor Fix For: 1.0.0 Even if a URL is blocked by robots.txt (or has a crawl delay larger than max.crawl.delay), Fetcher2 still waits fetcher.server.delay before fetching another URL from the same host, which is not necessary, considering that Fetcher2 didn't make a request to the server anyway.
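One way to picture the behaviour the issue asks for is a per-host timer that is only pushed forward when a request actually went out. The sketch below is an assumption-laden illustration, not the attached patch: HostDelaySketch, finish and isReady are hypothetical names, and serverDelayMs merely plays the role of fetcher.server.delay.

```java
/** Hypothetical per-host politeness timer; names are illustrative, not Fetcher2's. */
final class HostDelaySketch {
    private final long serverDelayMs;
    private long nextFetchTimeMs = 0L;

    HostDelaySketch(long serverDelayMs) {
        this.serverDelayMs = serverDelayMs;
    }

    /** True when the host may be contacted again at the given time. */
    boolean isReady(long nowMs) {
        return nowMs >= nextFetchTimeMs;
    }

    /**
     * Marks an item as finished. The delay is applied only if the server was
     * really contacted; a URL rejected by robots rules costs the host nothing.
     */
    void finish(long nowMs, boolean requestSent) {
        if (requestSent) {
            nextFetchTimeMs = nowMs + serverDelayMs;
        }
    }

    public static void main(String[] args) {
        HostDelaySketch host = new HostDelaySketch(5000L); // illustrative delay value
        host.finish(1000L, false); // denied by robots rules: nothing was requested
        System.out.println(host.isReady(1000L));
        host.finish(1000L, true);  // a real request was made
        System.out.println(host.isReady(1000L));
    }
}
```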
[jira] Updated: (NUTCH-495) Unnecessary delays in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-495: Attachment: fetcher2_robots.patch Unnecessary delays in Fetcher2 -- Key: NUTCH-495 URL: https://issues.apache.org/jira/browse/NUTCH-495 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Doğacan Güney Priority: Minor Fix For: 1.0.0 Attachments: fetcher2_robots.patch Even if a URL is blocked by robots.txt (or has a crawl delay larger than max.crawl.delay), Fetcher2 still waits fetcher.server.delay before fetching another URL from the same host, which is not necessary, considering that Fetcher2 didn't make a request to the server anyway.
Fetcher2's delay between successive requests
Hi all, I have been working on Fetcher2 code lately and I came across this particular code (in FetchItemQueue.getFetchItem) that I didn't quite understand: public FetchItem getFetchItem() { ... long last = endTime.get() + (maxThreads > 1 ? crawlDelay : minCrawlDelay); ... } Now, the 'default' politeness behaviour should be 1 thread per host and delaying n seconds between successive requests to that host, right? But won't this code wait only minCrawlDelay (which, by default, is 0) if maxThreads == 1? I also did not understand why there is a maxThreads check at all. Each individual thread should wait crawl delay before making another request to the same host. Am I missing something here? -- Doğacan Güney
Re: Fetcher2's delay between successive requests
I have discovered another bug in Fetcher2. Plugin lib-http checks Protocol.CHECK_{BLOCKING,ROBOTS}(which resolve to strings protocol.plugin.check.{blocking,robots}) to see if it should handle blocking or not. But fetcher2 sets http.plugin.check.{blocking,robots} (notice the protocol/http difference) to false to indicate lib-http shouldn't handle blocking internally. Because of this, when you use Fetcher2, lib-http still tries to block them which makes Fetcher2 much less useful. I am not sending a patch for this yet because I first want to get some feedback on the first bug. -- Doğacan Güney
Re: Fetcher2's delay between successive requests
Doğacan Güney wrote:
> Hi all, I have been working on Fetcher2 code lately and I came across this particular code (in FetchItemQueue.getFetchItem) that I didn't quite understand: public FetchItem getFetchItem() { ... long last = endTime.get() + (maxThreads > 1 ? crawlDelay : minCrawlDelay); ... } Now, the 'default' politeness behaviour should be 1 thread per host and delaying n seconds between successive requests to that host, right? But won't this code wait only minCrawlDelay (which, by default, is 0) if maxThreads == 1?
Yes, that was the intended behavior - normally, you should never use more than 1 thread per host, unless you have explicit permission to do so. If multiple threads make requests to the same host, then the crawl delay parameter loses its usual meaning - see the details of this in comments to NUTCH-385. However, the sensible thing to do is to still provide a way to limit the maximum rate of requests, and this is what the minCrawlDelay parameter is for.
> I also did not understand why there is a maxThreads check at all. Each individual thread should wait crawl delay before making another request to the same host. Am I missing something here?
See the ASCII-art graphs and comments in NUTCH-385 - this is likely not what is expected. Although this JIRA issue is still open, the Fetcher2 code tries to implement this middle ground solution.
-- Best regards, Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
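The two delay choices under discussion can be put side by side. The sketch below is not the Fetcher2 source: the class and method names are hypothetical, the first method follows the expression quoted in this thread, and the second follows what NUTCH-474 describes as the correct behaviour (crawlDelay for a single thread per host, minCrawlDelay as a rate cap when several threads share a host). The millisecond values in main are illustrative only.

```java
/** Hypothetical comparison of the two delay rules discussed in the thread. */
final class CrawlDelaySketch {
    /** Rule as quoted in the thread: one thread per host waits only minCrawlDelay. */
    static long nextFetchTimeAsQuoted(long endTimeMs, int maxThreads,
                                      long crawlDelayMs, long minCrawlDelayMs) {
        return endTimeMs + (maxThreads > 1 ? crawlDelayMs : minCrawlDelayMs);
    }

    /** Rule as NUTCH-474 describes the correct behaviour: the opposite selection. */
    static long nextFetchTimeCorrected(long endTimeMs, int maxThreads,
                                       long crawlDelayMs, long minCrawlDelayMs) {
        return endTimeMs + (maxThreads > 1 ? minCrawlDelayMs : crawlDelayMs);
    }

    public static void main(String[] args) {
        long endTimeMs = 1000L;       // when the previous request to the host finished
        long crawlDelayMs = 5000L;    // illustrative value
        long minCrawlDelayMs = 0L;    // the thread states this defaults to 0

        // With a single thread per host, the quoted rule allows an immediate refetch.
        System.out.println(nextFetchTimeAsQuoted(endTimeMs, 1, crawlDelayMs, minCrawlDelayMs));
        System.out.println(nextFetchTimeCorrected(endTimeMs, 1, crawlDelayMs, minCrawlDelayMs));
    }
}
```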
Re: Fetcher2's delay between successive requests
Doğacan Güney wrote:
> I have discovered another bug in Fetcher2. Plugin lib-http checks Protocol.CHECK_{BLOCKING,ROBOTS} (which resolve to the strings protocol.plugin.check.{blocking,robots}) to see if it should handle blocking or not. But Fetcher2 sets http.plugin.check.{blocking,robots} (notice the protocol/http difference) to false to indicate lib-http shouldn't handle blocking internally. Because of this, when you use Fetcher2, lib-http still tries to block them, which makes Fetcher2 much less useful.

This is definitely a bug.

--
Best regards,
Andrzej Bialecki
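The key mismatch described above can be illustrated with a small, self-contained sketch. It is not Nutch code: a plain `Map` stands in for the real configuration object, the class and method names are hypothetical, and the default of `true` for an unset key is an assumption inferred from the report that lib-http keeps blocking. Only the two key strings come from the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in only; names other than the two key strings are hypothetical.
final class CheckKeyMismatchDemo {
    // Key that lib-http reads, per the thread (Protocol.CHECK_BLOCKING).
    static final String CHECK_BLOCKING = "protocol.plugin.check.blocking";
    // Key that Fetcher2 was reported to set instead.
    static final String WRONG_KEY = "http.plugin.check.blocking";

    // Minimal boolean lookup with a default, mimicking a typical config API.
    static boolean getBoolean(Map<String, String> conf, String key, boolean defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    // The plugin's decision: handle blocking unless the key it reads says false.
    // The default of true is an assumption made for this sketch.
    static boolean pluginHandlesBlocking(Map<String, String> conf) {
        return getBoolean(conf, CHECK_BLOCKING, true);
    }

    public static void main(String[] args) {
        Map<String, String> buggy = new HashMap<>();
        buggy.put(WRONG_KEY, "false");      // the plugin never reads this key

        Map<String, String> fixed = new HashMap<>();
        fixed.put(CHECK_BLOCKING, "false"); // the key the plugin actually reads

        System.out.println("wrong key:   plugin blocks = " + pluginHandlesBlocking(buggy));
        System.out.println("correct key: plugin blocks = " + pluginHandlesBlocking(fixed));
    }
}
```

With the wrong key the setting is silently ignored, which matches the reported symptom: the plugin falls back to its default and keeps blocking even though the fetcher believes it has switched that off.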
[jira] Created: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly
Fetcher2 sets server-delay and blocking checks incorrectly
----------------------------------------------------------

Key: NUTCH-474
URL: https://issues.apache.org/jira/browse/NUTCH-474
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Fix For: 1.0.0

1) Fetcher2 sets server delay incorrectly. It sets the delay to minCrawlDelay if maxThreads == 1 and to crawlDelay otherwise. Correct behaviour should be the opposite.
2) Fetcher2 sets wrong configuration options so host blocking is still handled by the lib-http plugin (Fetcher2 is designed to handle blocking internally).
[jira] Updated: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly
[ https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-474:
    Attachment: fetcher2.patch
Re: Fetcher2's delay between successive requests
Doğacan Güney wrote:
> I don't get it. The code seems to do exactly the opposite of what you are saying. If maxThreads == 1 then maxThreads > 1 is false, thus the expression evaluates to minCrawlDelay, not crawlDelay. Shouldn't the expression be (maxThreads > 1 ? minCrawlDelay : crawlDelay)?

Yep, you're right - it's a bug. However, the reasoning that I presented still holds, it's just the implementation that doesn't get it ;)

--
Best regards,
Andrzej Bialecki
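To make the difference between the two expressions concrete, here is a minimal sketch that isolates the delay selection. It is not the Fetcher2 source: the class and method names are hypothetical, and the 5000 ms crawl delay is an arbitrary illustrative value (the thread only states that minCrawlDelay defaults to 0).

```java
// Illustrative sketch only; not taken from Fetcher2.
final class CrawlDelayDemo {
    // Selection as quoted in the thread: a single thread gets minCrawlDelay.
    static long buggyDelay(int maxThreads, long crawlDelay, long minCrawlDelay) {
        return maxThreads > 1 ? crawlDelay : minCrawlDelay;
    }

    // Selection as proposed in the thread: a single thread waits the full
    // crawlDelay; multiple threads are only rate-limited by minCrawlDelay.
    static long fixedDelay(int maxThreads, long crawlDelay, long minCrawlDelay) {
        return maxThreads > 1 ? minCrawlDelay : crawlDelay;
    }

    // Earliest time (ms) at which the next request to the same host may start,
    // given when the previous request ended.
    static long nextFetchTime(long endTime, int maxThreads, long crawlDelay, long minCrawlDelay) {
        return endTime + fixedDelay(maxThreads, crawlDelay, minCrawlDelay);
    }

    public static void main(String[] args) {
        long crawlDelay = 5000L;  // hypothetical value for illustration
        long minCrawlDelay = 0L;  // default mentioned in the thread

        System.out.println("buggy, 1 thread:  " + buggyDelay(1, crawlDelay, minCrawlDelay) + " ms");
        System.out.println("fixed, 1 thread:  " + fixedDelay(1, crawlDelay, minCrawlDelay) + " ms");
        System.out.println("fixed, 2 threads: " + fixedDelay(2, crawlDelay, minCrawlDelay) + " ms");
    }
}
```

Under these assumed values, the quoted expression yields no wait at all for the default one-thread-per-host case, whereas the proposed expression applies the full crawl delay there and falls back to minCrawlDelay only when several threads share a host.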
Re: Fetcher2's delay between successive requests
On 4/24/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
> Doğacan Güney wrote:
>> I don't get it. The code seems to do exactly the opposite of what you are saying. If maxThreads == 1 then maxThreads > 1 is false, thus the expression evaluates to minCrawlDelay, not crawlDelay. Shouldn't the expression be (maxThreads > 1 ? minCrawlDelay : crawlDelay)?
>
> Yep, you're right - it's a bug. However, the reasoning that I presented still holds, it's just the implementation that doesn't get it ;)

Heh, OK :). I opened an issue for these bugs (NUTCH-474) and attached a patch.

--
Doğacan Güney
[jira] Closed: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly
[ https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki closed NUTCH-474.
    Resolution: Fixed
    Assignee: Andrzej Bialecki

Fixed in rev. 532088. Thanks!
Re: Fetcher2
please give us the url, thx

On 1/25/07, chee wu [EMAIL PROTECTED] wrote:
> Just appended the portion for .81 to NUTCH-339

--
www.babatu.com
RE: Fetcher2
Kauu,

The URL for Fetcher2 is: https://issues.apache.org/jira/browse/NUTCH-339

Armel

Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com

-----Original Message-----
From: kauu [mailto:[EMAIL PROTECTED]
Sent: 25 January 2007 09:31
To: nutch-dev@lucene.apache.org
Subject: Re: Fetcher2

please give us the url, thx
RE: Fetcher2
Chee,

Can you make the code available through Jira?

Thanks,
Armel

-----Original Message-----
From: chee wu [mailto:[EMAIL PROTECTED]
Sent: 24 January 2007 03:59
To: nutch-dev@lucene.apache.org
Subject: Re: Fetcher2

Thanks! I successfully ported Fetcher2 to Nutch 0.8.1; it was pretty easy... I can share the code if anyone wants to use it.
Re: Fetcher2
Just appended the portion for 0.8.1 to NUTCH-339.

----- Original Message -----
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 8:06 AM
Subject: RE: Fetcher2

Chee,

Can you make the code available through Jira?

Thanks,
Armel
Re: Fetcher2
Thanks! I successfully ported Fetcher2 to Nutch 0.8.1; it was pretty easy... I can share the code if anyone wants to use it.

----- Original Message -----
From: Andrzej Bialecki [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 23, 2007 12:09 AM
Subject: Re: Fetcher2

chee wu wrote:
> Fetcher2 should be a great help for me, but it seems it can't be integrated with Nutch 0.8.1. Any advice on how to use it based on 0.8.1?

You would have to port it to Nutch 0.8.1 - e.g. change all Text occurrences to UTF8, and most likely make other changes too ...

--
Best regards,
Andrzej Bialecki