[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-22 Thread Euan Clark (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836918#action_12836918
 ] 

Euan Clark commented on NUTCH-719:
--

I notice the other addFetchItem methods of FetchItemQueues and FetchItemQueue 
in Fetcher.java. Should these also be synchronized?

 fetchQueues.totalSize incorrect in Fetcher2
 ---

 Key: NUTCH-719
 URL: https://issues.apache.org/jira/browse/NUTCH-719
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 I had a look at the logs generated by Fetcher2 and found cases where there 
 were no active fetchQueues but fetchQueues.totalSize was != 0:
 fetcher.Fetcher2 - -activeThreads=200, spinWaiting=200, 
 fetchQueues.totalSize=1, fetchQueues=0
 Since the code relies on fetchQueues.totalSize to determine whether the work 
 is finished, the task is blocked until the abort mechanism kicks in:
 2009-03-12 09:27:38,977 WARN  fetcher.Fetcher2 - Aborting with 200 hung 
 threads.
 Could that be a synchronisation issue? Any ideas?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-19 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-719.
-

   Resolution: Fixed
Fix Version/s: 1.1

Committed revision 911905.
Thanks to S. Dennis for investigating the issue + R. Schwab for testing it 

 fetchQueues.totalSize incorrect in Fetcher2
 ---

 Key: NUTCH-719
 URL: https://issues.apache.org/jira/browse/NUTCH-719
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1





[jira] Closed: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-19 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-719.
---


 fetchQueues.totalSize incorrect in Fetcher2
 ---

 Key: NUTCH-719
 URL: https://issues.apache.org/jira/browse/NUTCH-719
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1





[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836125#action_12836125
 ] 

Hudson commented on NUTCH-719:
--

Integrated in Nutch-trunk #1074 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1074/])
 fetchQueues.totalSize incorrect in Fetcher


 fetchQueues.totalSize incorrect in Fetcher2
 ---

 Key: NUTCH-719
 URL: https://issues.apache.org/jira/browse/NUTCH-719
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1





[jira] Assigned: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-01-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-719:
---

Assignee: Julien Nioche

 fetchQueues.totalSize incorrect in Fetcher2
 ---

 Key: NUTCH-719
 URL: https://issues.apache.org/jira/browse/NUTCH-719
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche




[jira] Closed: (NUTCH-679) Fetcher2 implementing Tool

2009-10-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-679.
---

   Resolution: Fixed
Fix Version/s: 1.1
 Assignee: Andrzej Bialecki 

 Fetcher2 implementing Tool
 --

 Key: NUTCH-679
 URL: https://issues.apache.org/jira/browse/NUTCH-679
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.1

 Attachments: Fetcher2.Tool.patch, NUTCH-679.patch


 The patch attached makes Fetcher2 implement Tool. As a result we should be 
 able to override parameters on the command line e.g. 
 bin/nutch fetch2 -Dfetcher.server.min.delay=1.0 -Dmapred.reduce.tasks=4 
 segments/20090115072836
 instead of having to modify the *-site.xml files in conf/
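
 The command-line override described above can be illustrated with a small, stdlib-only sketch. It is not the patch itself: in Hadoop the generic option handling behind Tool and ToolRunner does this work, and the class name DOptionSketch and its fields are hypothetical. The sketch assumes only the joined -Dkey=value form used in the example command and treats everything else as a positional argument.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; Hadoop's own option parsing is more general.
final class DOptionSketch {
    // Property overrides collected from -Dkey=value arguments, in order.
    final Map<String, String> overrides = new LinkedHashMap<>();
    // Everything that is not a -D override (e.g. the segment path).
    final List<String> remaining = new ArrayList<>();

    static DOptionSketch parse(String[] args) {
        DOptionSketch result = new DOptionSketch();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            // Require "-D", a non-empty key, and an '=' sign.
            if (arg.startsWith("-D") && eq > 2) {
                result.overrides.put(arg.substring(2, eq), arg.substring(eq + 1));
            } else {
                result.remaining.add(arg);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        DOptionSketch parsed = parse(new String[] {
            "-Dfetcher.server.min.delay=1.0",
            "-Dmapred.reduce.tasks=4",
            "segments/20090115072836"
        });
        System.out.println(parsed.overrides);
        System.out.println(parsed.remaining);
    }
}
```

 The overrides would then be applied on top of the values loaded from the *-site.xml files, which is why no edit in conf/ is needed.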




[jira] Commented: (NUTCH-679) Fetcher2 implementing Tool

2009-10-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764295#action_12764295
 ] 

Hudson commented on NUTCH-679:
--

Integrated in Nutch-trunk #959 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/959/])
 Fetcher2 implementing Tool.


 Fetcher2 implementing Tool
 --

 Key: NUTCH-679
 URL: https://issues.apache.org/jira/browse/NUTCH-679
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.1

 Attachments: Fetcher2.Tool.patch, NUTCH-679.patch





[jira] Closed: (NUTCH-721) Fetcher2 Slow

2009-08-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney closed NUTCH-721.
---

   Resolution: Fixed
Fix Version/s: 1.1
 Assignee: Doğacan Güney

Code committed as of rev. 807485.

I am closing this issue. Of course, there may be other reasons why Fetcher2 is 
slow, so feel free to create new issues if so.

 Fetcher2 Slow
 -

 Key: NUTCH-721
 URL: https://issues.apache.org/jira/browse/NUTCH-721
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
Reporter: Roger Dunk
Assignee: Doğacan Güney
 Fix For: 1.1

 Attachments: crawl_generate.tar.gz, NUTCH-721.patch, nutch-site.xml


 Fetcher2 fetches far more slowly than Fetcher1.
 Config options:
 fetcher.threads.fetch = 80
 fetcher.threads.per.host = 80
 fetcher.server.delay = 0
 generate.max.per.host = 1
 With a queue size of ~40,000, the result is:
 activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
 with maybe a download of 1 page per second.
 Running with -noParse makes little difference.
 CPU load average is around 0.2. With Fetcher1, CPU load is around 2.0 - 3.0.
 Hosts already cached by the local caching NS appear to download quickly upon a 
 re-fetch, so the issue is possibly related to NS lookups. However, all things 
 being equal, Fetcher1 runs fast without pre-caching hosts.




[jira] Updated: (NUTCH-679) Fetcher2 implementing Tool

2009-08-13 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-679:


Attachment: NUTCH-679.patch

Updated version of the patch

 Fetcher2 implementing Tool
 --

 Key: NUTCH-679
 URL: https://issues.apache.org/jira/browse/NUTCH-679
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Julien Nioche
Priority: Minor
 Attachments: Fetcher2.Tool.patch, NUTCH-679.patch





[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-08-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741225#action_12741225
 ] 

Doğacan Güney commented on NUTCH-721:
-

Thanks for the analysis, Julien! Can you make a patch for the conf changes so 
we can commit it with your name?

 Fetcher2 Slow
 -

 Key: NUTCH-721
 URL: https://issues.apache.org/jira/browse/NUTCH-721
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
Reporter: Roger Dunk
 Attachments: crawl_generate.tar.gz, nutch-site.xml





[jira] Updated: (NUTCH-721) Fetcher2 Slow

2009-08-10 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-721:


Attachment: NUTCH-721.patch

Sets the default value for fetcher.threads.per.host.by.ip to false

 Fetcher2 Slow
 -

 Key: NUTCH-721
 URL: https://issues.apache.org/jira/browse/NUTCH-721
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
Reporter: Roger Dunk
 Attachments: crawl_generate.tar.gz, NUTCH-721.patch, nutch-site.xml





[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-08-09 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741082#action_12741082
 ] 

Julien Nioche commented on NUTCH-721:
-

I had another look at this issue after applying the patch from NUTCH-719. I can 
easily reproduce the situation from the original post by setting 
fetcher.threads.per.host.by.ip to true. The nutch-site file sent by Roger does 
not specify it, so it would fall back to the default value, which is true. Once 
it is set to false, all threads are active and the fetching is much faster. 

I used the first 5K URLs from the fetchlist sent by Roger and compared 
the performance with by.ip set to false:

OldFetcher :  
real 32m26.003s
user 1m11.768s
sys 0m10.337s

OldFetcher :  
real 30m52.965s
user 1m10.696s
sys 0m10.425s

Fetcher :  
real 31m21.924s
user 1m12.725s
sys 0m10.797s

Fetcher :
real 30m3.017s
user 1m15.509s
sys 0m10.909s

I ran each step twice and as we can see the results are comparable.

This explanation is also consistent with Steven's observation that we get 5-7 
times the rate on later iterations, as we would hit the DNS cache on subsequent 
calls for URLs from non-unique sites. The IP resolution is done by the 
QueueFeeder, which explains why it slows down the rate at which URLs become 
available for fetching.

I don't think that the OldFetcher allows grouping URLs by IP for politeness, 
in which case why not make fetcher.threads.per.host.by.ip default to false in 
the new fetcher?
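
The cost difference between the two settings can be sketched as follows. This is an illustration, not the Fetcher code: the class QueueIdSketch and the method queueId are hypothetical names, and the queue ID format (protocol plus host or IP) is an assumption modelled on IDs such as http://125.168.254.20 seen in the debug logs of NUTCH-719. The point is only that the by-IP branch needs a name lookup for every URL before it can be queued, while the by-host branch does not.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical sketch of how a per-host queue ID could be derived.
final class QueueIdSketch {
    // Returns the queue ID for a URL, or null if the host cannot be resolved.
    static String queueId(String url, boolean byIp) {
        URI uri = URI.create(url);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        if (byIp) {
            try {
                // Potentially slow: a DNS lookup unless the name is cached
                // or the host is already an IP literal.
                host = InetAddress.getByName(host).getHostAddress();
            } catch (UnknownHostException e) {
                return null; // caller would skip the URL
            }
        }
        return uri.getScheme() + "://" + host;
    }

    public static void main(String[] args) {
        // No lookup needed when grouping by host name.
        System.out.println(queueId("http://WWW.Example.com/a.html", false));
        // An IP literal resolves without contacting a name server.
        System.out.println(queueId("http://127.0.0.1/index.html", true));
    }
}
```

With a fetchlist of mostly unique sites, every call on the by-IP branch is a cold lookup, which would match the low feed rates reported in this thread.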


 Fetcher2 Slow
 -

 Key: NUTCH-721
 URL: https://issues.apache.org/jira/browse/NUTCH-721
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
Reporter: Roger Dunk
 Attachments: crawl_generate.tar.gz, nutch-site.xml





[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-08-09 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741092#action_12741092
 ] 

Andrzej Bialecki  commented on NUTCH-721:
-

+1. Current defaults are sub-optimal due to backward-compatibility issues with 
early Nutch 0.8. This should be no longer a concern.

 Fetcher2 Slow
 -

 Key: NUTCH-721
 URL: https://issues.apache.org/jira/browse/NUTCH-721
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
Reporter: Roger Dunk
 Attachments: crawl_generate.tar.gz, nutch-site.xml





[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2009-07-13 Thread Steven Denny (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730239#action_12730239
 ] 

Steven Denny commented on NUTCH-719:


I'm not sure. As far as I can tell, the feeder has always finished feeding the 
URLs; it's just that a proportion are lost.

However, there are two things I've noted regarding performance (if you just 
look at URLs crawled per second):

1) When this situation arises, the fetcher will time out and abort with N 
hung threads. The timeout occurs after mapred.task.timeout/2 
(default 5 mins), so any timing on a crawl that aborted will be extended by 5 
mins. On a small crawl this could skew the figures.

2) DNS lookups can take a while. I know this has been noted before, but on my 
test system (admittedly only a VM on our network, with nothing special in terms 
of DNS), some of the lookups were taking 5-6 seconds. This is possibly the 
wrong place to discuss it given NUTCH-721, but I put some debug output around 
the feeder thread and got:

2009-07-10 04:01:35,296 INFO  fetcher.Fetcher - Fed 500 urls in 186 secs = 
2.7url/s
2009-07-10 04:04:18,343 INFO  fetcher.Fetcher - Fed 499 urls in 163 secs = 
3.1url/s
2009-07-10 04:06:57,109 INFO  fetcher.Fetcher - Fed 498 urls in 158 secs = 
3.2url/s
2009-07-10 04:10:38,282 INFO  fetcher.Fetcher - Fed 499 urls in 221 secs = 
2.3url/s
2009-07-10 04:12:58,371 INFO  fetcher.Fetcher - Fed 498 urls in 140 secs = 
3.6url/s
2009-07-10 04:16:12,275 INFO  fetcher.Fetcher - Fed 499 urls in 193 secs = 
2.6url/s
2009-07-10 04:19:20,162 INFO  fetcher.Fetcher - Fed 499 urls in 187 secs = 
2.7url/s
2009-07-10 04:21:25,846 INFO  fetcher.Fetcher - Fed 499 urls in 125 secs = 
4.0url/s
2009-07-10 04:24:16,049 INFO  fetcher.Fetcher - Fed 495 urls in 170 secs = 
2.9url/s
2009-07-10 04:27:01,944 INFO  fetcher.Fetcher - Fed 499 urls in 165 secs = 
3.0url/s
2009-07-10 04:29:26,247 INFO  fetcher.Fetcher - Fed 499 urls in 144 secs = 
3.5url/s
2009-07-10 04:32:02,590 INFO  fetcher.Fetcher - Fed 499 urls in 156 secs = 
3.2url/s
2009-07-10 04:34:49,985 INFO  fetcher.Fetcher - Fed 498 urls in 167 secs = 
3.0url/s
2009-07-10 04:37:28,367 INFO  fetcher.Fetcher - Fed 498 urls in 158 secs = 
3.2url/s
2009-07-10 04:40:09,865 INFO  fetcher.Fetcher - Fed 499 urls in 161 secs = 
3.1url/s
2009-07-10 04:42:55,203 INFO  fetcher.Fetcher - Fed 499 urls in 165 secs = 
3.0url/s

Obviously, when I'm only feeding 3-4 URLs/sec, I'll only ever be able to fetch 
that many. That test was on a crawldb just initialised with 11,000 URLs (unique 
sites).

However, on the next iteration, where I'm feeding URLs from non-unique sites, I 
see 5-7 times that rate.
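
The skew mentioned in point 1 can be put into numbers with a short sketch. The names AbortSkewSketch, abortWaitMillis and effectiveRate are hypothetical, and the inputs are assumptions: a task timeout of 600000 ms is used because half of it gives the 5 minutes quoted above, and the crawl size and duration are made-up figures chosen only to show the effect.

```java
// Hypothetical arithmetic sketch; none of these names come from Nutch.
final class AbortSkewSketch {
    // The wait before hung threads are aborted: half the task timeout,
    // as described in the comment above.
    static long abortWaitMillis(long taskTimeoutMillis) {
        return taskTimeoutMillis / 2;
    }

    // Pages per second once the abort wait is added to the crawl time.
    static double effectiveRate(long pages, long crawlSeconds, long abortWaitMillis) {
        return pages / (crawlSeconds + abortWaitMillis / 1000.0);
    }

    public static void main(String[] args) {
        long wait = abortWaitMillis(600_000L);   // assumed task timeout
        long pages = 1_000L;                     // made-up small crawl
        long crawlSeconds = 300L;                // made-up useful crawl time
        System.out.println("abort wait (ms): " + wait);
        System.out.println("rate without abort: " + effectiveRate(pages, crawlSeconds, 0L));
        System.out.println("rate with abort:    " + effectiveRate(pages, crawlSeconds, wait));
    }
}
```

For a crawl this small, the idle wait halves the apparent rate, so timings from aborted runs should be compared with care.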


 fetchQueues.totalSize incorrect in Fetcher2
 ---

 Key: NUTCH-719
 URL: https://issues.apache.org/jira/browse/NUTCH-719
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche




[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-07-13 Thread Steven Denny (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730242#action_12730242
 ] 

Steven Denny commented on NUTCH-721:


I've done some testing on this and looked at the number of pages being fed, as 
this obviously limits the number of pages you can fetch:

2009-07-10 04:01:35,296 INFO fetcher.Fetcher - Fed 500 urls in 186 secs = 
2.7url/s
2009-07-10 04:04:18,343 INFO fetcher.Fetcher - Fed 499 urls in 163 secs = 
3.1url/s
2009-07-10 04:06:57,109 INFO fetcher.Fetcher - Fed 498 urls in 158 secs = 
3.2url/s
2009-07-10 04:10:38,282 INFO fetcher.Fetcher - Fed 499 urls in 221 secs = 
2.3url/s
2009-07-10 04:12:58,371 INFO fetcher.Fetcher - Fed 498 urls in 140 secs = 
3.6url/s
2009-07-10 04:16:12,275 INFO fetcher.Fetcher - Fed 499 urls in 193 secs = 
2.6url/s
2009-07-10 04:19:20,162 INFO fetcher.Fetcher - Fed 499 urls in 187 secs = 
2.7url/s
2009-07-10 04:21:25,846 INFO fetcher.Fetcher - Fed 499 urls in 125 secs = 
4.0url/s
2009-07-10 04:24:16,049 INFO fetcher.Fetcher - Fed 495 urls in 170 secs = 
2.9url/s
2009-07-10 04:27:01,944 INFO fetcher.Fetcher - Fed 499 urls in 165 secs = 
3.0url/s
2009-07-10 04:29:26,247 INFO fetcher.Fetcher - Fed 499 urls in 144 secs = 
3.5url/s
2009-07-10 04:32:02,590 INFO fetcher.Fetcher - Fed 499 urls in 156 secs = 
3.2url/s
2009-07-10 04:34:49,985 INFO fetcher.Fetcher - Fed 498 urls in 167 secs = 
3.0url/s
2009-07-10 04:37:28,367 INFO fetcher.Fetcher - Fed 498 urls in 158 secs = 
3.2url/s
2009-07-10 04:40:09,865 INFO fetcher.Fetcher - Fed 499 urls in 161 secs = 
3.1url/s
2009-07-10 04:42:55,203 INFO fetcher.Fetcher - Fed 499 urls in 165 secs = 
3.0url/s

That test was on a crawldb just initialised with 11,000 URLs (unique sites).

However, on the next iteration, where I'm feeding URLs from non-unique sites, I 
see 5-7 times that rate. (My test system is a VM on our network, with nothing 
special in terms of DNS. Some of the lookups were taking 5-6 seconds.)

 Fetcher2 Slow
 -

 Key: NUTCH-721
 URL: https://issues.apache.org/jira/browse/NUTCH-721
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
Reporter: Roger Dunk
 Attachments: crawl_generate.tar.gz, nutch-site.xml





[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2009-07-13 Thread Steven Denny (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730253#action_12730253
 ] 

Steven Denny commented on NUTCH-719:


Perhaps I spoke too soon:

10 threads, 15520 pages, 723 errors, 3.7 pages/s, 2972 kb/s, 
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=0, fetchQueues.count=0
Aborting with 10 hung threads.
Unable to resolve: www.countryenergy.com.au, skipping.
Exception in thread QueueFeeder java.lang.NullPointerException
at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:41)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:206)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:238)
at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:177)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:111)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:1895)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1925)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.nutch.fetcher.Fetcher$QueueFeeder.run(Fetcher.java:418)


It appears that the feeder hung, but I'm not sure whether the exception raised 
is the cause or the effect (I suspect it's the effect of the thread aborting).

I'm also not sure if any of these issues are VM-related. Hopefully our real 
hardware will turn up soon.

 fetchQueues.totalSize incorrect in Fetcher2
 ---

 Key: NUTCH-719
 URL: https://issues.apache.org/jira/browse/NUTCH-719
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche




[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-07-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730402#action_12730402
 ] 

Doğacan Güney commented on NUTCH-721:
-

Steven, if you have time/hardware, can you retry your use-case with OldFetcher 
in trunk?

 Fetcher2 Slow
 -

 Key: NUTCH-721
 URL: https://issues.apache.org/jira/browse/NUTCH-721
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
Reporter: Roger Dunk
 Attachments: crawl_generate.tar.gz, nutch-site.xml





[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2009-07-10 Thread Steven Denny (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729699#action_12729699
 ] 

Steven Denny commented on NUTCH-719:


I've changed line 324 of src/java/org/apache/nutch/fetcher/Fetcher.java to 

public synchronized void addFetchItem(FetchItem it) {

(added the synchronized) and initial testing looks good.
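
Why the keyword helps can be sketched with a simplified stand-in for the queue bookkeeping. This is not the Nutch code: the class QueueCounters, its fields, and the demo method are hypothetical, and the sketch assumes that adding an item involves two steps (find or create the per-host queue, then add to it) while another thread may remove empty queues. Making both operations synchronized on the same object keeps the size counter in step with the queues.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified stand-in for the fetcher's queue bookkeeping.
final class QueueCounters {
    private final Map<String, Queue<String>> queues = new HashMap<>();
    private int totalSize = 0;

    // Find or create the queue and add the item as one atomic step, so a
    // concurrent reaper cannot drop the queue in between.
    public synchronized void addFetchItem(String queueId, String url) {
        queues.computeIfAbsent(queueId, id -> new ArrayDeque<>()).add(url);
        totalSize++;
    }

    // Take one item from the given queue and reap the queue once it is empty.
    public synchronized String getFetchItem(String queueId) {
        Queue<String> queue = queues.get(queueId);
        if (queue == null) {
            return null;
        }
        String url = queue.poll();
        if (url != null) {
            totalSize--;
        }
        if (queue.isEmpty()) {
            queues.remove(queueId);
        }
        return url;
    }

    public synchronized int getTotalSize() { return totalSize; }

    public synchronized int getQueueCount() { return queues.size(); }

    // Each thread repeatedly adds one item and then takes one from the same queue.
    static QueueCounters demo(int threads, int perThread) {
        QueueCounters counters = new QueueCounters();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    String host = "http://host" + (i % 5);
                    counters.addFetchItem(host, host + "/page" + id + "-" + i);
                    counters.getFetchItem(host);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counters;
    }

    public static void main(String[] args) {
        QueueCounters counters = demo(8, 2000);
        System.out.println("totalSize=" + counters.getTotalSize()
            + ", queues=" + counters.getQueueCount());
    }
}
```

With both methods synchronized, every add is matched by a successful take, so the run ends with no queued items and no queues. Removing the keyword from either method would allow the counter and the map to disagree, which is the kind of state reported in this issue.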



 fetchQueues.totalSize incorrect in Fetcher2
 ---

 Key: NUTCH-719
 URL: https://issues.apache.org/jira/browse/NUTCH-719
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche




[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2009-07-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729833#action_12729833
 ] 

Doğacan Güney commented on NUTCH-719:
-

Thanks for looking into this bug.

I wonder if this is the cause of the performance problem so many people are 
facing with Fetcher in nutch-1.0. Can it be that QueueFeeder stops feeding new 
URLs into FetchQueues because of this bug?

 fetchQueues.totalSize incorrect in Fetcher2
 ---

 Key: NUTCH-719
 URL: https://issues.apache.org/jira/browse/NUTCH-719
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche




[jira] Issue Comment Edited: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2009-07-09 Thread Steven Denny (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729254#action_12729254
 ] 

Steven Denny edited comment on NUTCH-719 at 7/9/09 6:17 AM:


I've done some investigation on this.

It looks to me as if queues can get reaped too early. I've put in some debug 
logging and this is what I see:

2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher -   
FetchItemQueue::getFetchItemQueue() id=http://125.168.254.20
2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - Created queue: 
http://125.168.254.20

2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - reaping: http://125.168.254.20
.
2009-07-09 04:39:50,705 DEBUG fetcher.Fetcher - addFetchItem: adding item - 
http://www.callidan.com/ma100.htm
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - totalSize++:2 
http://125.168.254.20 http://www.callidan.com/ma100.htm queuesize: 1 
queuecount: 11
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://61.9.216.193, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://216.184.34.250, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://139.146.150.23, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://203.29.78.68, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://150.101.91.39, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: 
http://209.212.110.211, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://123.176.112.44, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: 
http://117.104.160.130, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://196.25.73.205, 
size: 0
2009-07-09 04:39:50,884 DEBUG fetcher.Fetcher - * queue: http://202.53.7.145, 
size: 0
2009-07-09 04:39:50,884 DEBUG fetcher.Fetcher - * queue: http://202.60.67.145, 
size: 1

Note that the queue is created and then immediately reaped, and after totalSize 
is incremented, that queue does not appear in the list, even though it 
supposedly has the item added to it.

The upshot is that the URL is never fetched (as the queue has gone), so 
totalSize never reaches 0, and eventually the abort will happen.

In short I'd say this is a sync issue, but I'm not sure where the best place to 
lock would be.

Any comments from the author?


  was (Author: stevedenny):
I've done some investigation on this.

It looks to me as if queues can get reaped too early. I've put in some debug 
logging and this is what I see:

2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher -   
FetchItemQueue::getFetchItemQueue() id=http://125.168.254.20
2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - Created queue: 
http://125.168.254.20

2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - reaping: http://125.168.254.20
.
2009-07-09 04:39:50,705 DEBUG fetcher.Fetcher - addFetchItem: adding item - 
http://www.callidan.com/ma100.htm
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - totalSize++:2 
http://125.168.254.20 http://www.callidan.com/ma100.htm queuesize: 1 
queuecount: 11
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://61.9.216.193, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://216.184.34.250, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://139.146.150.23, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://203.29.78.68, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://150.101.91.39, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: 
http://209.212.110.211, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://123.176.112.44, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: 
http://117.104.160.130, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://196.25.73.205, 
size: 0
2009-07-09 04:39:50,884 DEBUG fetcher.Fetcher - * queue: http://202.53.7.145, 
size: 0
2009-07-09 04:39:50,884 DEBUG fetcher.Fetcher - * queue: http://202.60.67.145, 
size: 1

Note that the queue is created and then immediately reaped, and after totalSize 
is incremented, that queue does not appear in the list, even though it 
supposedly has the item added to it.
It looks as if when items are fed, there's a possibility of the queue being 
reaped before the item is added to the queue. However, totalSize is still 
incremented.

The upshot is that the URL is never fetched (as the queue has gone), so 
totalSize never reaches 0, and eventually the abort will happen.

In short I'd say this is a sync issue, but I'm not sure where the best place to 
lock would be.

Any comments from the author?

  

[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2009-07-09 Thread Steven Denny (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729254#action_12729254
 ] 

Steven Denny commented on NUTCH-719:


I've done some investigation on this.

It looks to me as if queues can get reaped too early. I've put in some debug 
logging and this is what I see:

2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher -   
FetchItemQueue::getFetchItemQueue() id=http://125.168.254.20
2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - Created queue: 
http://125.168.254.20

2009-07-09 04:39:50,704 DEBUG fetcher.Fetcher - reaping: http://125.168.254.20
.
2009-07-09 04:39:50,705 DEBUG fetcher.Fetcher - addFetchItem: adding item - 
http://www.callidan.com/ma100.htm
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - totalSize++:2 
http://125.168.254.20 http://www.callidan.com/ma100.htm queuesize: 1 
queuecount: 11
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://61.9.216.193, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://216.184.34.250, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://139.146.150.23, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://203.29.78.68, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://150.101.91.39, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: 
http://209.212.110.211, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://123.176.112.44, 
size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: 
http://117.104.160.130, size: 0
2009-07-09 04:39:50,883 DEBUG fetcher.Fetcher - * queue: http://196.25.73.205, 
size: 0
2009-07-09 04:39:50,884 DEBUG fetcher.Fetcher - * queue: http://202.53.7.145, 
size: 0
2009-07-09 04:39:50,884 DEBUG fetcher.Fetcher - * queue: http://202.60.67.145, 
size: 1

Note that the queue is created and then immediately reaped, and after totalSize 
is incremented, that queue does not appear in the list, even though it 
supposedly has the item added to it.
It looks as if when items are fed, there's a possibility of the queue being 
reaped before the item is added to the queue. However, totalSize is still 
incremented.

The upshot is that the URL is never fetched (as the queue has gone), so 
totalSize never reaches 0, and eventually the abort will happen.

In short I'd say this is a sync issue, but I'm not sure where the best place to 
lock would be.

Any comments from the author?
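
One possible place to lock, sketched under assumptions: the class 
FetchQueuesSketch and its method names below are hypothetical stand-ins, not 
the code in Fetcher.java. The idea is that creating the queue, adding the item, 
incrementing totalSize and reaping empty queues all take the same lock, so the 
reaper cannot remove a queue between its creation and the add.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified stand-in for the fetcher's queue bookkeeping;
// names are illustrative and not copied from Fetcher.java.
class FetchQueuesSketch {
    private final Map<String, Queue<String>> queues = new HashMap<>();
    private int totalSize = 0;

    // Create-or-get the queue, add the item and bump the counter under one
    // lock, so a reaper can never run between those steps.
    synchronized void addFetchItem(String queueId, String url) {
        queues.computeIfAbsent(queueId, id -> new LinkedList<>()).add(url);
        totalSize++;
    }

    // Take the next URL from a queue; the counter drops in the same critical section.
    synchronized String getFetchItem(String queueId) {
        Queue<String> q = queues.get(queueId);
        if (q == null || q.isEmpty()) {
            return null;
        }
        totalSize--;
        return q.poll();
    }

    // Remove only queues that are empty; holds the same lock as addFetchItem.
    // Returns the number of queues removed.
    synchronized int reapEmptyQueues() {
        int before = queues.size();
        queues.values().removeIf(Queue::isEmpty);
        return before - queues.size();
    }

    synchronized int getTotalSize() {
        return totalSize;
    }

    synchronized int getQueueCount() {
        return queues.size();
    }
}
```

With this arrangement totalSize should always equal the number of items held 
in the live queues. A single coarse lock is the simplest choice here; finer 
locking may be possible but is harder to get right.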





[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-05-23 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712492#action_12712492
 ] 

Otis Gospodnetic commented on NUTCH-721:


Questions:
Has anyone tried profiling this? (may be relevant: 
http://markmail.org/message/4ixrnvfycpgmkdno )

Or maybe simply debugged/timed various blocks of code using something as simple 
as print statements and simple timers?

Or maybe running just a single thread and then doing kill -QUIT a number of 
times to simply try and spot the method where the code seems to spend a lot of 
its time?
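
The repeated kill -QUIT idea can also be approximated from inside the JVM. The 
following is a minimal sketch, not part of Nutch: the class StackSampler and 
its sample method are hypothetical names. It takes a few snapshots with 
Thread.getAllStackTraces() and counts which method sits on top of each 
thread's stack.

```java
import java.util.HashMap;
import java.util.Map;

// Poor man's sampling profiler: an in-process analogue of sending
// kill -QUIT several times. Illustrative only; not part of Nutch.
class StackSampler {
    // Takes `samples` snapshots, `intervalMs` apart, and counts the top
    // frame (class.method) of every other live thread that has a stack.
    static Map<String, Integer> sample(int samples, long intervalMs) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e
                    : Thread.getAllStackTraces().entrySet()) {
                StackTraceElement[] stack = e.getValue();
                if (e.getKey() == Thread.currentThread() || stack.length == 0) {
                    continue; // skip the sampler itself and threads without frames
                }
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                break; // stop sampling if interrupted
            }
        }
        return hits;
    }
}
```

Methods with the highest counts are where threads were most often seen, which 
is a rough hint at where time is spent or where threads are waiting.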


 Fetcher2 Slow
 -

 Key: NUTCH-721
 URL: https://issues.apache.org/jira/browse/NUTCH-721
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
Reporter: Roger Dunk
 Attachments: crawl_generate.tar.gz, nutch-site.xml


 Fetcher2 fetches far more slowly than Fetcher1.
 Config options:
 fetcher.threads.fetch = 80
 fetcher.threads.per.host = 80
 fetcher.server.delay = 0
 generate.max.per.host = 1
 With a queue size of ~40,000, the result is:
 activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
 with maybe a download of 1 page per second.
 Running with -noParse makes little difference.
 CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
 Hosts already cached by local caching NS appear to download quickly upon a 
 re-fetch, so possible issue relating to NS lookups, however all things being 
 equal Fetcher1 runs fast without pre-caching hosts.




[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-05-23 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712494#action_12712494
 ] 

Otis Gospodnetic commented on NUTCH-721:


Ken's thoughts: 
http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/





[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-05-23 Thread Roger Dunk (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712506#action_12712506
 ] 

Roger Dunk commented on NUTCH-721:
--

My tests were done on a segment with only 1 URL per host (generate.max.per.host 
= 1), so I don't believe what Ken has to say is the reason, at least in my 
case, for Fetcher2 performing slowly.




[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695277#action_12695277
 ] 

Doğacan Güney commented on NUTCH-721:
-

Wow, 53 min vs 3 min !?

Thanks a lot for testing and that is indeed very worrying.

Which 5000 url set did you use? I think the crawl_generate you attached to this 
issue has 13K urls?

PS: One small thing: the new Fetcher requires fewer threads than OldFetcher. If 
you have time, can you try with a smaller number of threads (say, 15-20)?




[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-03 Thread Roger Dunk (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695298#action_12695298
 ] 

Roger Dunk commented on NUTCH-721:
--

I did a -topN 5000, so only a subset of the attached, but still only 1 URL 
per host. The following is with 20 threads and also no parsing.

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher 
newcrawl/segments/20090402130655 -threads 20 -noParsing

[...]

Aborting with 20 hung threads.
Fetcher: done

real    60m14.926s
user    0m38.671s
sys     0m6.134s




[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-03 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695394#action_12695394
 ] 

Julien Nioche commented on NUTCH-721:
-

The message about the Aborted hung threads looks like what I described in 
https://issues.apache.org/jira/browse/NUTCH-719 except that in this case there 
are active queues but fetchQueues.totalSize=0 

Roger : can you confirm that the parameter fetcher.threads.per.host.by.ip is 
set to false?
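
For context on why the by-IP setting matters here, a minimal sketch of how a 
queue id could be derived either from the host name or from its resolved 
address. This is an assumption-laden illustration, not the Nutch code: the 
class QueueIdSketch and the method queueId are hypothetical. The point is that 
the by-IP variant needs a name lookup for every URL, which may block.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical illustration of deriving a queue id by host name or by IP;
// not taken from the Nutch source.
class QueueIdSketch {
    static String queueId(String url, boolean byIp) {
        URI u = URI.create(url);
        String host = u.getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in " + url);
        }
        if (byIp) {
            try {
                // For a host name this performs a DNS lookup and may block.
                host = InetAddress.getByName(host).getHostAddress();
            } catch (UnknownHostException e) {
                throw new IllegalArgumentException("cannot resolve " + host, e);
            }
        }
        return u.getScheme() + "://" + host.toLowerCase(Locale.ROOT);
    }
}
```

With byIp set to false no lookup happens when the queue id is built, so a slow 
name server would not be expected to affect this step.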






[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-03 Thread Roger Dunk (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695600#action_12695600
 ] 

Roger Dunk commented on NUTCH-721:
--

Julien, yes, fetcher.threads.per.host.by.ip was set to false in the above 
tests. I have also tried it with true, which certainly didn't help the speed 
issue, but I can't comment on the hung threads as I didn't bother letting the 
fetch complete. I'd say there are two, likely unrelated problems with Fetcher2.




[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694986#action_12694986
 ] 

Doğacan Güney commented on NUTCH-721:
-

I've committed nutch 0.9 fetcher as OldFetcher. So can you test with trunk and 
OldFetcher?




[jira] Issue Comment Edited: (NUTCH-721) Fetcher2 Slow

2009-04-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694986#action_12694986
 ] 

Doğacan Güney edited comment on NUTCH-721 at 4/2/09 6:01 AM:
-

I've committed nutch 0.9 fetcher as OldFetcher. So can you test with trunk and 
OldFetcher so that we can find out if this is related to new fetcher or is the 
side effect of some other change?

  was (Author: dogacan):
I've committed nutch 0.9 fetcher as OldFetcher. So can you test with trunk 
and OldFetcher?
  



[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-02 Thread Roger Dunk (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695170#action_12695170
 ] 

Roger Dunk commented on NUTCH-721:
--

For the following tests I've used the same segment containing 5000 URLs. I 
cleaned the named cache before the first two tests.

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.OldFetcher 
newcrawl/segments/20090402130655/

real    3m38.084s
user    2m20.887s
sys     0m7.470s

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher 
newcrawl/segments/20090402130655/

[...]

Fetcher: done

real    53m44.800s
user    2m20.070s
sys     0m9.527s

For this next test, I used the same segment but didn't clear the named cache 
from the previous test, so all resolvable hosts should still be cached. This 
appeared to help greatly: out of 80 active threads, often only 60 were 
spinwaiting (as opposed to 79 in the non-cached test), but there were still 
plenty of times where at least 30 consecutive log entries showed all 80 threads 
spinwaiting. And as the times below show, it is still nowhere near the speed of 
OldFetcher.

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher 
newcrawl/segments/20090402130655/

[...]

Aborting with 80 hung threads.
Fetcher: done

real    22m5.420s
user    2m39.407s
sys     0m8.192s




[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695233#action_12695233
 ] 

Hudson commented on NUTCH-721:
--

Integrated in Nutch-trunk #772 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/772/])
 - Commit old fetcher as OldFetcher for now so that we can test Fetcher2 
performance.





[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694708#action_12694708
 ] 

Doğacan Güney commented on NUTCH-721:
-

OK, there is clearly a problem with the new fetcher. 

First, let's make sure that there is indeed a problem with the new fetcher and 
this is not the side effect of some other code we introduced between 0.9 and 
1.0. So I suggest that we re-commit old fetcher back into trunk and do a 
side-by-side comparison to make sure that the problem is with the new fetcher. 

If it is with the new fetcher, then we may try to salvage Todd's work (I 
remember that he said that his fetcher was faster, right?).




[jira] Created: (NUTCH-721) Fetcher2 Slow

2009-03-17 Thread Roger Dunk (JIRA)
Fetcher2 Slow
-

 Key: NUTCH-721
 URL: https://issues.apache.org/jira/browse/NUTCH-721
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
Reporter: Roger Dunk


Fetcher2 fetches far more slowly than Fetcher1.

Config options:
fetcher.threads.fetch = 80
fetcher.threads.per.host = 80
fetcher.server.delay = 0
generate.max.per.host = 1

With a queue size of ~40,000, the result is:

activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0

with maybe a download of 1 page per second.

Running with -noParse makes little difference.

CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0

Hosts already cached by local caching NS appear to download quickly upon a 
re-fetch, so possible issue relating to NS lookups, however all things being 
equal Fetcher1 runs fast without pre-caching hosts.





[jira] Updated: (NUTCH-721) Fetcher2 Slow

2009-03-17 Thread Roger Dunk (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roger Dunk updated NUTCH-721:
-

Attachment: nutch-site.xml
crawl_generate.tar.gz




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-03-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12678573#action_12678573
 ] 

Hudson commented on NUTCH-669:
--

Integrated in Nutch-trunk #742 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/742/])


 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Assignee: Sami Siren
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?
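
As a rough illustration of the "queue per crawl host" model described above, here is a minimal Java sketch. It is not the Fetcher2 code: the class name HostQueues, its methods, and the fixed crawl delay are assumptions made up for this example. Each host gets its own FIFO queue, and a host is skipped until its delay has elapsed, so the per-host limit is enforced by the queueing layer rather than by the Protocol.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: one FIFO queue per host plus a minimum delay between fetches. */
final class HostQueues {
  private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
  private final Map<String, Long> nextFetchTime = new LinkedHashMap<>();
  private final long crawlDelayMs;

  HostQueues(long crawlDelayMs) {
    this.crawlDelayMs = crawlDelayMs;
  }

  /** Queues a URL under its host name. */
  synchronized void add(String url) {
    String host = URI.create(url).getHost();
    queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
  }

  /** Returns a URL whose host may be fetched at nowMs, or null if every host must wait. */
  synchronized String poll(long nowMs) {
    for (Map.Entry<String, Deque<String>> entry : queues.entrySet()) {
      Deque<String> queue = entry.getValue();
      if (queue.isEmpty()) {
        continue; // nothing left for this host
      }
      if (nextFetchTime.getOrDefault(entry.getKey(), 0L) > nowMs) {
        continue; // host is still inside its crawl delay
      }
      nextFetchTime.put(entry.getKey(), nowMs + crawlDelayMs);
      return queue.poll();
    }
    return null;
  }

  /** Total number of URLs still queued across all hosts. */
  synchronized int totalSize() {
    int total = 0;
    for (Deque<String> queue : queues.values()) {
      total += queue.size();
    }
    return total;
  }
}
```

All access goes through synchronized methods so that the total stays consistent when several fetcher threads add and poll concurrently.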

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-03-02 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-669.
--

Resolution: Fixed

replaced fetcher with fetcher2

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Assignee: Sami Siren
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-03-02 Thread Andrzej Bialecki

Sami Siren (JIRA) wrote:

 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-669.
--

Resolution: Fixed

replaced fetcher with fetcher2


I'm puzzled ..  it seemed the goal was to integrate Todd's patch, which 
effectively replaces both Fetchers. Does this mean that Todd's version 
was not ready, or is the current code based on Todd's version?



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-03-02 Thread Sami Siren

Andrzej Bialecki wrote:

Sami Siren (JIRA) wrote:
 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel 
]


Sami Siren resolved NUTCH-669.
--

Resolution: Fixed

replaced fetcher with fetcher2


I'm puzzled ..  it seemed the goal was to integrate Todd's patch, 
which effectively replaces both Fetchers. Does this mean that Todd's 
version was not ready, or is the current code based on Todd's version?
There was no patch from Todd that I could see; he never provided one even 
after being asked multiple times, first by you in Dec 2008, then by dogacan 
in Jan 2009, and finally by me last week.


My motivation to get this fixed was, as I understood most of the 
developers thought too, to get rid of the burden of supporting two 
classes providing roughly the same piece of functionality. I opened a 
jira for this but closed it soon after as you told me it was a duplicate 
to this one.


So, what I did was: replaced original Fetcher with Fetcher2. The Fetcher 
is still there to be improved by Todd and others at will.


--
Sami Siren


Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-03-02 Thread Andrzej Bialecki

Sami Siren wrote:

Andrzej Bialecki wrote:

Sami Siren (JIRA) wrote:
 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel 
]


Sami Siren resolved NUTCH-669.
--

Resolution: Fixed

replaced fetcher with fetcher2


I'm puzzled ..  it seemed the goal was to integrate Todd's patch, 
which effectively replaces both Fetchers. Does this mean that Todd's 
version was not ready, or is the current code based on Todd's version?
There was no patch from Todd that I could see; he never provided one even 
after being asked multiple times, first by you in Dec 2008, then by dogacan 
in Jan 2009, and finally by me last week.


My motivation to get this fixed was, as I understood most of the 
developers thought too, to get rid of the burden of supporting two 
classes providing roughly the same piece of functionality. I opened a 
jira for this but closed it soon after as you told me it was a duplicate 
to this one.


So, what I did was: replaced original Fetcher with Fetcher2. The Fetcher 
is still there to be improved by Todd and others at will.


Ok, I understand now - given the circumstances I agree this was the 
right thing to do.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-03-02 Thread Todd Lipcon
Hey guys,
Sorry for the non-responsiveness here. I recently left my old employment and
have been packing for a cross-country move.

I agree that for 1.0 the best bet is what Sami has done. The code that I was
working on is available here:

http://github.com/toddlipcon/nutch/tree/nutch-669

But it is not production ready - notably there's a problem whereby it runs
out of memory even with a reasonably large heap.

I'm not sure if I'll be able to complete working on it, given the cluster
(and workload) I was using to test were from my old job, but I'm happy to
provide any assistance understanding the work I began if you'd like to try
to integrate it for 1.1

-Todd





[jira] Assigned: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-02-26 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren reassigned NUTCH-669:


Assignee: Sami Siren

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Assignee: Sami Siren
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-701) replace Fetcher with Fetcher2

2009-02-24 Thread Sami Siren (JIRA)
replace Fetcher with Fetcher2
-

 Key: NUTCH-701
 URL: https://issues.apache.org/jira/browse/NUTCH-701
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.0.0


Currently there are two fetcher implementations within Nutch, one too many. This 
task tracks the process of promoting Fetcher2.

My plan is basically to
- remove Fetcher altogether and rename Fetcher2 to Fetcher
- fix the Crawl class so it works with the Fetcher2 API.

If there are no objections I will proceed with this soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-701) Replace Fetcher with Fetcher2

2009-02-24 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-701:
-

Summary: Replace Fetcher with Fetcher2  (was: replace Fetcher with Fetcher2)

 Replace Fetcher with Fetcher2
 -

 Key: NUTCH-701
 URL: https://issues.apache.org/jira/browse/NUTCH-701
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.0.0


 Currently there are two fetcher implementations within Nutch, one too many. 
 This task tracks the process of promoting Fetcher2.
 My plan is basically to
 - remove Fetcher altogether and rename Fetcher2 to Fetcher
 - fix the Crawl class so it works with the Fetcher2 API.
 If there are no objections I will proceed with this soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-701) Replace Fetcher with Fetcher2

2009-02-24 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12676240#action_12676240
 ] 

Andrzej Bialecki  commented on NUTCH-701:
-

This is a duplicate of NUTCH-669. Please follow-up with Todd to finalize that 
issue instead.

 Replace Fetcher with Fetcher2
 -

 Key: NUTCH-701
 URL: https://issues.apache.org/jira/browse/NUTCH-701
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.0.0


 Currently there are two fetcher implementations within Nutch, one too many. 
 This task tracks the process of promoting Fetcher2.
 My plan is basically to
 - remove Fetcher altogether and rename Fetcher2 to Fetcher
 - fix the Crawl class so it works with the Fetcher2 API.
 If there are no objections I will proceed with this soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-701) Replace Fetcher with Fetcher2

2009-02-24 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-701.
--

Resolution: Duplicate

 Replace Fetcher with Fetcher2
 -

 Key: NUTCH-701
 URL: https://issues.apache.org/jira/browse/NUTCH-701
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.0.0


 Currently there are two fetcher implementations within Nutch, one too many. 
 This task tracks the process of promoting Fetcher2.
 My plan is basically to
 - remove Fetcher altogether and rename Fetcher2 to Fetcher
 - fix the Crawl class so it works with the Fetcher2 API.
 If there are no objections I will proceed with this soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-02-24 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-669:
-

Fix Version/s: (was: 1.1)
   1.0.0

Moving this back to 1.0

Are you close with your patch? As discussed in this thread we should just 
replace Fetcher with Fetcher2, change the Crawl class and check that the tests 
pass. Other issues we can deal with in their own tickets.

I can also help with this if you don't have the time.



 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-626) fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects

2009-02-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12676496#action_12676496
 ] 

Hudson commented on NUTCH-626:
--

Integrated in Nutch-trunk #735 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/735/])
 - Fetcher2 breaks out the domain with db.ignore.external.links set at 
cross domain redirects, contributed by Remco Verhoef, dogacan


 fetcher2 breaks out the domain with db.ignore.external.links set at cross 
 domain redirects
 --

 Key: NUTCH-626
 URL: https://issues.apache.org/jira/browse/NUTCH-626
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Linux Debian
Reporter: Remco Verhoef
Assignee: Sami Siren
 Fix For: 1.0.0

 Attachments: fetcher2.diff, NUTCH-626_v2.patch


 Fetcher2 breaks out of the db.ignore.external.links directive when 
 encountering a cross domain redirect. The redirected url is followed without 
 checking for db.ignore.external.links and cross domain. 
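
The missing check described above might look roughly like the following Java sketch. This is not the attached patch: the class RedirectFilter and the method shouldFollow are hypothetical names, and comparing host names is only an assumption about what "cross domain" means here. The idea is that, when db.ignore.external.links is set, a redirect target is followed only if it stays on the host of the URL that was originally fetched.

```java
import java.net.URI;

/** Hypothetical sketch of a redirect check honouring db.ignore.external.links. */
final class RedirectFilter {

  /**
   * @param fromUrl the URL that was fetched and answered with a redirect
   * @param toUrl the redirect target
   * @param ignoreExternalLinks value of db.ignore.external.links
   * @return true if the redirect target may be followed
   */
  static boolean shouldFollow(String fromUrl, String toUrl, boolean ignoreExternalLinks) {
    if (!ignoreExternalLinks) {
      return true; // external links are allowed, so any redirect may be followed
    }
    String fromHost = URI.create(fromUrl).getHost();
    String toHost = URI.create(toUrl).getHost();
    // Follow only when both URLs have the same (case-insensitive) host.
    return fromHost != null && fromHost.equalsIgnoreCase(toHost);
  }
}
```

For example, shouldFollow("http://a.example/x", "http://b.example/y", true) returns false, while the same call with the flag set to false returns true.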

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-626) fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects

2009-01-28 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-626:


Attachment: NUTCH-626_v2.patch

I updated your patch to apply and compile in latest trunk.

I am not committing this patch since I don't want to mess with Todd's
Fetcher work. For now :D

 fetcher2 breaks out the domain with db.ignore.external.links set at cross 
 domain redirects
 --

 Key: NUTCH-626
 URL: https://issues.apache.org/jira/browse/NUTCH-626
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Linux Debian
Reporter: Remco Verhoef
Assignee: Doğacan Güney
 Fix For: 1.0.0

 Attachments: fetcher2.diff, NUTCH-626_v2.patch


 Fetcher2 breaks out of the db.ignore.external.links directive when 
 encountering a cross domain redirect. The redirected url is followed without 
 checking for db.ignore.external.links and cross domain. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-679) Fetcher2 implementing Tool

2009-01-21 Thread julien nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12665791#action_12665791
 ] 

julien nioche commented on NUTCH-679:
-

I can send a modified version of it once Todd has finished working on the 
Fetchers. Same for https://issues.apache.org/jira/browse/NUTCH-658  

 Fetcher2 implementing Tool
 --

 Key: NUTCH-679
 URL: https://issues.apache.org/jira/browse/NUTCH-679
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: julien nioche
Priority: Minor
 Attachments: Fetcher2.Tool.patch


 The patch attached makes Fetcher2 implement Tool. As a result we should be 
 able to override parameters on the command line e.g. 
 bin/nutch fetch2 -Dfetcher.server.min.delay=1.0 -Dmapred.reduce.tasks=4 
 segments/20090115072836
 instead of having to modify the *-site.xml files in conf/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-679) Fetcher2 implementing Tool

2009-01-20 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12665482#action_12665482
 ] 

Otis Gospodnetic commented on NUTCH-679:


I'm not sure, but committing this may mess up Todd's work on merging Fetcher 
and Fetcher2.


 Fetcher2 implementing Tool
 --

 Key: NUTCH-679
 URL: https://issues.apache.org/jira/browse/NUTCH-679
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: julien nioche
Priority: Minor
 Attachments: Fetcher2.Tool.patch


 The patch attached makes Fetcher2 implement Tool. As a result we should be 
 able to override parameters on the command line e.g. 
 bin/nutch fetch2 -Dfetcher.server.min.delay=1.0 -Dmapred.reduce.tasks=4 
 segments/20090115072836
 instead of having to modify the *-site.xml files in conf/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-01-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12665562#action_12665562
 ] 

Doğacan Güney commented on NUTCH-669:
-

Hi Todd,

Can you upload your work to JIRA now, so that we can review and merge it for 
1.0?

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-679) Fetcher2 implementing Tool

2009-01-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12665125#action_12665125
 ] 

Doğacan Güney commented on NUTCH-679:
-

Looks simple enough. I am going to commit it soon if no objections.

Btw, please use 2-space tabs otherwise it messes up patches :)

 Fetcher2 implementing Tool
 --

 Key: NUTCH-679
 URL: https://issues.apache.org/jira/browse/NUTCH-679
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: julien nioche
Priority: Minor
 Attachments: Fetcher2.Tool.patch


 The patch attached makes Fetcher2 implement Tool. As a result we should be 
 able to override parameters on the command line e.g. 
 bin/nutch fetch2 -Dfetcher.server.min.delay=1.0 -Dmapred.reduce.tasks=4 
 segments/20090115072836
 instead of having to modify the *-site.xml files in conf/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-679) Fetcher2 implementing Tool

2009-01-15 Thread julien nioche (JIRA)
Fetcher2 implementing Tool
--

 Key: NUTCH-679
 URL: https://issues.apache.org/jira/browse/NUTCH-679
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: julien nioche
Priority: Minor
 Attachments: Fetcher2.Tool.patch

The patch attached makes Fetcher2 implement Tool. As a result we should be able 
to override parameters on the command line e.g. 
bin/nutch fetch2 -Dfetcher.server.min.delay=1.0 -Dmapred.reduce.tasks=4 
segments/20090115072836
instead of having to modify the *-site.xml files in conf/
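
To illustrate the effect of the command line above, here is a small stand-alone Java sketch. In Hadoop this work is done by ToolRunner together with the generic options parser, which apply -D settings to the Configuration before calling the tool's run method; the code below does not use Hadoop at all and only mimics the idea. The class DashDArgs and its fields are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: separates -Dkey=value overrides from the remaining arguments. */
final class DashDArgs {
  /** Property overrides given as -Dkey=value, in command-line order. */
  final Map<String, String> overrides = new LinkedHashMap<>();
  /** Everything else, e.g. the segment path, passed on to the tool itself. */
  final List<String> remaining = new ArrayList<>();

  static DashDArgs parse(String... args) {
    DashDArgs out = new DashDArgs();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      if (arg.startsWith("-D") && eq > 2) {
        // "-Dfetcher.server.min.delay=1.0" -> key "fetcher.server.min.delay", value "1.0"
        out.overrides.put(arg.substring(2, eq), arg.substring(eq + 1));
      } else {
        out.remaining.add(arg);
      }
    }
    return out;
  }
}
```

With the arguments from the example command, overrides would hold fetcher.server.min.delay and mapred.reduce.tasks, and remaining would hold only the segment path.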


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-679) Fetcher2 implementing Tool

2009-01-15 Thread julien nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

julien nioche updated NUTCH-679:


Attachment: Fetcher2.Tool.patch

Patch which makes Fetcher2 implement Tool interface

 Fetcher2 implementing Tool
 --

 Key: NUTCH-679
 URL: https://issues.apache.org/jira/browse/NUTCH-679
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: julien nioche
Priority: Minor
 Attachments: Fetcher2.Tool.patch


 The patch attached makes Fetcher2 implement Tool. As a result we should be 
 able to override parameters on the command line e.g. 
 bin/nutch fetch2 -Dfetcher.server.min.delay=1.0 -Dmapred.reduce.tasks=4 
 segments/20090115072836
 instead of having to modify the *-site.xml files in conf/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-01-02 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660382#action_12660382
 ] 

Todd Lipcon commented on NUTCH-669:
---

Here's a further report on my progress:

  - It turns out the change in NUTCH-676 caused things to break - there's some 
behavior in nutch's MapWritable that differs from Hadoop's, so it was spending 
all of its time in output.collect - I think the writables were accruing lots of 
key/value pairs that they weren't sposed to. So, this doesn't depend on 
NUTCH-676.

  - I implemented adaptive crawl delay (NUTCH-475) in the new fetcher.

  - Also implemented early termination as discussed in this mailing list 
thread: 
http://www.nabble.com/proposal:-fetcher-performance-improvements-td20939872.html

Results so far are looking good. I was able to run a 1M url fetch with 5000 
urls per host at a sustained rate of 25 pages/second (total around 11 hours). 
About 60% of the URLs ended up parsed, which isn't significantly worse than I 
usually see without early termination, but past attempts to run 1M fetches have 
taken several days because of some slow hosts.

I'm running a 2M+ URL fetch right now and have been sustaining 40-60mbit 
inbound from 8 fetchers for the last couple hours.

  - I did experience one GC error - I think I need to add some cleanup of empty 
queues out of the FetchQueue structure when the number of unique hosts is very 
high.
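
The kind of cleanup mentioned in the last point could be sketched as follows. This is only an illustration and not the code in the branch: PrunableQueues and its methods are hypothetical names. The point is that per-host queues which have been drained are removed from the map, so memory does not grow with the number of unique hosts seen so far.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: per-host queues whose empty entries can be dropped. */
final class PrunableQueues {
  private final Map<String, Deque<String>> byHost = new HashMap<>();

  synchronized void add(String host, String url) {
    byHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
  }

  /** Returns the next URL for the host, or null if none is queued. */
  synchronized String poll(String host) {
    Deque<String> queue = byHost.get(host);
    return queue == null ? null : queue.poll();
  }

  /** Removes every host whose queue is empty and returns how many were removed. */
  synchronized int pruneEmpty() {
    int before = byHost.size();
    byHost.values().removeIf(Deque::isEmpty);
    return before - byHost.size();
  }

  synchronized int hostCount() {
    return byHost.size();
  }
}
```

Calling pruneEmpty() periodically, or whenever a poll drains a queue, keeps the map bounded by the number of hosts that still have work.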

Complete history is here: http://github.com/toddlipcon/nutch/tree/nutch-669

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-01-02 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660397#action_12660397
 ] 

Otis Gospodnetic commented on NUTCH-669:


Todd, when you say a sustained rate of 25 pages/second, does that mean the final 
rate you see on one of the status screens?  In other words, this is not the rate 
you see holding steady while the fetch run is in full swing (which could be a 
lot higher), but rather the final rate?


 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12659857#action_12659857
 ] 

Todd Lipcon commented on NUTCH-669:
---

Hey guys,

I tried it on production, but ran into an Exception of some sort that happened 
very rarely. Then I went on vacation for 2 weeks and came back to find the logs 
gone from my hadoop tracker, so I can't figure out what the Exception was ;-) 
I'll run another segment today hopefully and let you know the results.

-Todd

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12659958#action_12659958
 ] 

Todd Lipcon commented on NUTCH-669:
---

Found the exception in a screen log:

{noformat}
java.lang.NullPointerException
    at org.apache.nutch.crawl.MapWritable$KeyValueEntry.access$102(MapWritable.java:469)
    at org.apache.nutch.crawl.MapWritable.readFields(MapWritable.java:362)
    at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:250)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
    at org.apache.nutch.fetcher.Fetcher$FetchMapper.run(Fetcher.java:399)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
{noformat}

I think NUTCH-676 may help this. Trying another run in a minute.

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-27 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12659343#action_12659343
 ] 

Andrzej Bialecki  commented on NUTCH-669:
-

Well ... have you tried it? How did it go?

I think it's time to upload the patch to JIRA, so that we can decide what to do 
using a concrete snapshot of your work.

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?




[jira] Updated: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-10 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated NUTCH-669:
---

 Priority: Major  (was: Minor)
Fix Version/s: 1.0.0

+1 -- people, vote for it.  This could go in 1.0, right?


 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-10 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12655491#action_12655491
 ] 

Todd Lipcon commented on NUTCH-669:
---

For those watching this issue: I pushed a couple more changes to the github 
repo linked above. I'm about to try it on production with a 100K url segment, 
80 threads, limit by IP, 8 crawler nodes. We'll see how it goes.

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-05 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653844#action_12653844
 ] 

Todd Lipcon commented on NUTCH-669:
---

Agreed on all fronts.

I spent several hours yesterday refactoring/rewriting Fetcher2 to be a little 
cleaner. One of the changes was to factor out the queueing policies into a new 
class and replace the Thread-based model with one based on ExecutorServices. I 
may also try to factor out the actual fetching into a new class as well.
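
As a rough illustration of the direction described above (not the code from the github branch), the sketch below puts the queueing policy behind its own interface and hands the work to an ExecutorService instead of hand-managed threads. QueuePolicy, ByHostPolicy, partition, fetchStub and fetchAll are hypothetical names chosen for this example, and the fetch itself is only a stub.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ExecutorFetchSketch {

  /** Hypothetical policy interface: decides which queue a URL belongs to. */
  interface QueuePolicy {
    String queueIdFor(String url);
  }

  /** Hypothetical policy: one queue per host name. */
  static class ByHostPolicy implements QueuePolicy {
    public String queueIdFor(String url) {
      String host = URI.create(url).getHost();
      return host == null ? "unknown" : host.toLowerCase();
    }
  }

  /** Groups URLs into queues according to the policy, keeping input order. */
  static Map<String, List<String>> partition(List<String> urls, QueuePolicy policy) {
    Map<String, List<String>> queues = new LinkedHashMap<>();
    for (String url : urls) {
      queues.computeIfAbsent(policy.queueIdFor(url), k -> new ArrayList<>()).add(url);
    }
    return queues;
  }

  /** Stub standing in for the real protocol call; counts one fetched page. */
  static int fetchStub(String url) {
    return 1;
  }

  /**
   * Submits one task per queue, so URLs in the same queue are processed
   * sequentially while different queues may run in parallel.
   */
  static int fetchAll(List<String> urls, QueuePolicy policy, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Integer>> results = new ArrayList<>();
      for (List<String> queue : partition(urls, policy).values()) {
        results.add(pool.submit(() -> {
          int done = 0;
          for (String url : queue) {
            done += fetchStub(url);
          }
          return done;
        }));
      }
      int total = 0;
      for (Future<Integer> f : results) {
        total += f.get();
      }
      return total;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    List<String> urls = List.of(
        "http://a.example/1", "http://a.example/2", "http://b.example/1");
    System.out.println(fetchAll(urls, new ByHostPolicy(), 4));
  }
}
```

Swapping ByHostPolicy for a policy keyed on IP address would change the limiting behaviour without touching the executor code, which is the point of factoring the policy out.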

I haven't gotten to testing the new version quite yet but hopefully should have 
a patch available next week, and perhaps some intermediate commits available on 
github this afternoon so people can see where I'm headed.

Is there a unit (or functional) testing infrastructure I can use somewhere to 
test this?

-Todd

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Priority: Minor

 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-05 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653940#action_12653940
 ] 

Todd Lipcon commented on NUTCH-669:
---

I've pushed the initial commit of this rewrite/refactor to github:

http://github.com/toddlipcon/nutch/commit/5c9d99a856628c842b50b1d76f62b375f377bf95

Might be worth just reviewing it as if it were a new file rather than a diff:

http://github.com/toddlipcon/nutch/tree/5c9d99a856628c842b50b1d76f62b375f377bf95/src/java/org/apache/nutch/fetcher/Fetcher.java

Still have some more cleanup and revisions here, plus I want to test it on a 
real crawl or two from our cluster. It currently passes the TestFetcher unit 
test but I don't know what the coverage is on that.

I'll attach a patch here before it's ready to be committed so I can check off 
the license grant checkbox, which I know is important for ASF.

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Priority: Minor

 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?




[jira] Created: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-04 Thread Todd Lipcon (JIRA)
Consolidate code for Fetcher and Fetcher2
-

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Priority: Minor


I'd like to consolidate a lot of the common code between Fetcher and 
Fetcher2.java.

It seems to me like there are the following differences:
  - Fetcher relies on the Protocol to obey robots.txt and crawl delay settings 
whereas Fetcher2 implements them itself
  - Fetcher2 uses a different queueing model (queue per crawl host) to 
accomplish the per-host limiting without making the Protocol do it.
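
To make the second difference concrete, here is a minimal sketch of a queue-per-host model in which the fetcher itself, rather than the Protocol, enforces the crawl delay. It is not taken from Fetcher2: HostQueueSketch, HostQueue, add and next are hypothetical names, and the delay is counted from the moment a URL is handed out, which is a simplification.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class HostQueueSketch {

  /** Hypothetical per-host queue that tracks when the next fetch is allowed. */
  static class HostQueue {
    final Deque<String> urls = new ArrayDeque<>();
    long nextFetchTime = 0L; // millis; 0 means "fetch immediately"
  }

  private final Map<String, HostQueue> queues = new HashMap<>();
  private final long crawlDelayMillis;

  HostQueueSketch(long crawlDelayMillis) {
    this.crawlDelayMillis = crawlDelayMillis;
  }

  void add(String host, String url) {
    queues.computeIfAbsent(host, h -> new HostQueue()).urls.add(url);
  }

  /**
   * Returns a URL from any host whose delay has elapsed, or null if every
   * non-empty queue is still waiting. The caller passes the current time so
   * the behaviour is easy to test.
   */
  String next(long nowMillis) {
    for (HostQueue q : queues.values()) {
      if (!q.urls.isEmpty() && nowMillis >= q.nextFetchTime) {
        q.nextFetchTime = nowMillis + crawlDelayMillis;
        return q.urls.poll();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    HostQueueSketch fetcher = new HostQueueSketch(5000);
    fetcher.add("a.example", "http://a.example/1");
    fetcher.add("a.example", "http://a.example/2");
    System.out.println(fetcher.next(0));    // first URL for the host
    System.out.println(fetcher.next(1000)); // same host, delay not yet elapsed
    System.out.println(fetcher.next(5000)); // delay elapsed
  }
}
```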

I've begun work on this but want to check with people on the following:

- What reason is there for Fetcher existing at all since Fetcher2 seems to be a 
superset of functionality?

- Is it on the road map to remove the robots/delay logic from the Http protocol 
and make Fetcher2's delegation of duties the standard?

- Any other improvements wanted for Fetcher while I am in and around the code?




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653487#action_12653487
 ] 

Doğacan Güney commented on NUTCH-669:
-

 *  What reason is there for Fetcher existing at all since Fetcher2 seems 
 to be a superset of functionality?

Agreed. We should just rename Fetcher2 to Fetcher and be done with it :D

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Priority: Minor

 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?




[jira] Assigned: (NUTCH-626) fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects

2008-10-01 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney reassigned NUTCH-626:
---

Assignee: Doğacan Güney

 fetcher2 breaks out the domain with db.ignore.external.links set at cross 
 domain redirects
 --

 Key: NUTCH-626
 URL: https://issues.apache.org/jira/browse/NUTCH-626
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Linux Debian
Reporter: Remco Verhoef
Assignee: Doğacan Güney
 Attachments: fetcher2.diff


 Fetcher2 breaks out of the db.ignore.external.links directive when 
 encountering a cross domain redirect. The redirected url is followed without 
 checking for db.ignore.external.links and cross domain. 




Fetcher2 Reduce Phase Question

2008-04-11 Thread Sandeep Tata
Hi Folks,

I was just wondering what computation really happens in the reduce
phase for Fetcher2 ?

I know that it is implemented as a MapRunnable -- but I see no
explicit reducer being set for the job. Is the identity reducer being
used ? Why can't we simply use job.setNumReduceTasks(0) ?
Wouldn't this be faster?

Sandeep


Re: Fetcher2 Reduce Phase Question

2008-04-11 Thread Andrzej Bialecki

Sandeep Tata wrote:

Hi Folks,

I was just wondering what computation really happens in the reduce
phase for Fetcher2 ?


If Fetcher was running in the parsing mode, then in the reduce phase 
Outlinks are separated from Parse output and stored in crawl_parse, and 
other data in parse_text and parse_data. This actually happens in 
FetcherOutputFormat / ParseOutputFormat, so there is no need for any 
Reduce apart from the IdentityReducer (the default).




I know that it is implemented as a MapRunnable -- but I see no
explicit reducer being set for the job. Is the identity reducer being
used ? Why can't we simply use job.setNumReduceTasks(0) ?
Wouldn't this be faster?


First, when Fetcher / Fetcher2 were written there was no such option in 
Hadoop. Second, the meaning of this setting is that the output from maps 
becomes the final output - but this won't cut it, because map outputs 
are always simple SequenceFile's, whereas we need to split the 
FetcherOutput into a bunch of Sequence and MapFile-s (which have to be 
sorted) ...
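
The sorting requirement can be illustrated without Hadoop. The sketch below is an assumption-laden stand-in, not Hadoop code: SortedWriter is a hypothetical class that mimics a MapFile-style writer by rejecting keys that arrive out of order, and canWrite is a helper invented for the example. Map output arrives in whatever order pages finish fetching, while the sort that precedes an identity reduce delivers keys in order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortedOutputSketch {

  /** Stand-in for a MapFile-style writer: rejects keys that arrive out of order. */
  static class SortedWriter {
    private String lastKey = null;
    final List<String> written = new ArrayList<>();

    void append(String key) {
      if (lastKey != null && key.compareTo(lastKey) < 0) {
        throw new IllegalStateException("key out of order: " + key + " after " + lastKey);
      }
      lastKey = key;
      written.add(key);
    }
  }

  /** Returns true if every key could be appended in the given order. */
  static boolean canWrite(Iterable<String> keys) {
    SortedWriter writer = new SortedWriter();
    try {
      for (String key : keys) {
        writer.append(key);
      }
      return true;
    } catch (IllegalStateException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Map output: order of fetch completion, effectively arbitrary.
    List<String> mapOutput =
        List.of("http://c.example/", "http://a.example/", "http://b.example/");
    System.out.println(canWrite(mapOutput));

    // What the reduce side sees after the framework has sorted by key.
    List<String> sorted = new ArrayList<>(mapOutput);
    Collections.sort(sorted);
    System.out.println(canWrite(sorted));
  }
}
```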



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Created: (NUTCH-626) fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects

2008-04-06 Thread Remco Verhoef (JIRA)
fetcher2 breaks out the domain with db.ignore.external.links set at cross 
domain redirects
--

 Key: NUTCH-626
 URL: https://issues.apache.org/jira/browse/NUTCH-626
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Linux Debian
Reporter: Remco Verhoef


Fetcher2 breaks out of the db.ignore.external.links directive when encountering 
a cross domain redirect. The redirected url is followed without checking for 
db.ignore.external.links and cross domain. 
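
The kind of check that is missing could look roughly like the sketch below. This is not the attached patch: RedirectFilterSketch, hostOf and shouldFollowRedirect are hypothetical names, the boolean parameter merely mirrors the db.ignore.external.links setting, and "external" is assumed here to mean "different host name".

```java
import java.net.URI;

class RedirectFilterSketch {

  /** Lower-cased host of a URL, or null if it has none. */
  static String hostOf(String url) {
    String host = URI.create(url).getHost();
    return host == null ? null : host.toLowerCase();
  }

  /**
   * Decides whether a redirect target may be followed. When
   * ignoreExternalLinks is true, a redirect to a different host is rejected.
   */
  static boolean shouldFollowRedirect(String fromUrl, String toUrl,
                                      boolean ignoreExternalLinks) {
    if (!ignoreExternalLinks) {
      return true;
    }
    String fromHost = hostOf(fromUrl);
    String toHost = hostOf(toUrl);
    return fromHost != null && fromHost.equals(toHost);
  }

  public static void main(String[] args) {
    // Same host: followed even when external links are ignored.
    System.out.println(shouldFollowRedirect("http://a.example/x", "http://a.example/y", true));
    // Different host with the setting on: rejected.
    System.out.println(shouldFollowRedirect("http://a.example/x", "http://b.example/y", true));
    // Different host with the setting off: followed.
    System.out.println(shouldFollowRedirect("http://a.example/x", "http://b.example/y", false));
  }
}
```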




[jira] Updated: (NUTCH-626) fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects

2008-04-06 Thread Remco Verhoef (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remco Verhoef updated NUTCH-626:


Attachment: fetcher2.diff

This patch also fixes another issue with redirects.

 fetcher2 breaks out the domain with db.ignore.external.links set at cross 
 domain redirects
 --

 Key: NUTCH-626
 URL: https://issues.apache.org/jira/browse/NUTCH-626
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Linux Debian
Reporter: Remco Verhoef
 Attachments: fetcher2.diff


 Fetcher2 breaks out of the db.ignore.external.links directive when 
 encountering a cross domain redirect. The redirected url is followed without 
 checking for db.ignore.external.links and cross domain. 




[jira] Closed: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED

2008-03-14 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-592.
---

Resolution: Duplicate
  Assignee: Andrzej Bialecki   (was: Emmanuel Joke)

 Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
 -

 Key: NUTCH-592
 URL: https://issues.apache.org/jira/browse/NUTCH-592
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Emmanuel Joke
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: patch.txt


 I get an NPE for pages with status ProtocolStatus.TEMP_MOVED. It seems the 
 handleRedirect function can return null in a few cases, and this case is not 
 handled as it is for ProtocolStatus.SUCCESS.




[jira] Commented: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED

2008-03-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12578786#action_12578786
 ] 

Andrzej Bialecki  commented on NUTCH-592:
-

Duplicate of NUTCH-597 and NUTCH-615.

 Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
 -

 Key: NUTCH-592
 URL: https://issues.apache.org/jira/browse/NUTCH-592
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Emmanuel Joke
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: patch.txt


 I get an NPE for pages with status ProtocolStatus.TEMP_MOVED. It seems the 
 handleRedirect function can return null in a few cases, and this case is not 
 handled as it is for ProtocolStatus.SUCCESS.




[jira] Commented: (NUTCH-597) Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish.

2008-01-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12559382#action_12559382
 ] 

Hudson commented on NUTCH-597:
--

Integrated in Nutch-Nightly #330 (See 
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/330/])

 Fetcher2 - java.lang.NullPointerException when host does not exist and 
 fetcher.threads.per.host.by.ip is set to true causes threads to finish.
 --

 Key: NUTCH-597
 URL: https://issues.apache.org/jira/browse/NUTCH-597
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Debian
Reporter: Remco Verhoef
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: fetcher2.java.patch


 When fetcher.threads.per.host.by.ip is set to true the following exception is 
 thrown when the host does not exist. FetchItem.create returns null when it is 
 not able to resolve the host address when it is redirecting.
 2007-12-30 15:34:42,720 WARN  fetcher.Fetcher2 - Unable to resolve: {url}  , skipping.
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - java.lang.NullPointerException
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:327)
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:323)
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:632)
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - fetcher caught:java.lang..NullPointerException
 2007-12-30 15:34:42,721 INFO  fetcher.Fetcher2 - -finishing thread FetcherThread, activeThreads=49




[jira] Resolved: (NUTCH-597) Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish.

2008-01-15 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-597.
-

   Resolution: Fixed
Fix Version/s: 1.0.0
 Assignee: Andrzej Bialecki 

Patch applied in rev. 612264. Thank you!

 Fetcher2 - java.lang.NullPointerException when host does not exist and 
 fetcher.threads.per.host.by.ip is set to true causes threads to finish.
 --

 Key: NUTCH-597
 URL: https://issues.apache.org/jira/browse/NUTCH-597
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Debian
Reporter: Remco Verhoef
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: fetcher2.java.patch


 When fetcher.threads.per.host.by.ip is set to true the following exception is 
 thrown when the host does not exist. FetchItem.create returns null when it is 
 not able to resolve the host address when it is redirecting.
 2007-12-30 15:34:42,720 WARN  fetcher.Fetcher2 - Unable to resolve: {url}  , skipping.
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - java.lang.NullPointerException
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:327)
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:323)
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:632)
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - fetcher caught:java.lang..NullPointerException
 2007-12-30 15:34:42,721 INFO  fetcher.Fetcher2 - -finishing thread FetcherThread, activeThreads=49




[jira] Closed: (NUTCH-597) Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish.

2008-01-15 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-597.
---


 Fetcher2 - java.lang.NullPointerException when host does not exist and 
 fetcher.threads.per.host.by.ip is set to true causes threads to finish.
 --

 Key: NUTCH-597
 URL: https://issues.apache.org/jira/browse/NUTCH-597
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Debian
Reporter: Remco Verhoef
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: fetcher2.java.patch


 When fetcher.threads.per.host.by.ip is set to true the following exception is 
 thrown when the host does not exist. FetchItem.create returns null when it is 
 not able to resolve the host address when it is redirecting.
 2007-12-30 15:34:42,720 WARN  fetcher.Fetcher2 - Unable to resolve: {url}  , skipping.
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - java.lang.NullPointerException
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:327)
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:323)
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:632)
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - fetcher caught:java.lang..NullPointerException
 2007-12-30 15:34:42,721 INFO  fetcher.Fetcher2 - -finishing thread FetcherThread, activeThreads=49




[jira] Commented: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED

2008-01-15 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12559272#action_12559272
 ] 

Andrzej Bialecki  commented on NUTCH-592:
-

This seems to be a duplicate of NUTCH-597. If you have no objections I will 
close this issue.

 Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
 -

 Key: NUTCH-592
 URL: https://issues.apache.org/jira/browse/NUTCH-592
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Emmanuel Joke
Assignee: Emmanuel Joke
 Fix For: 1.0.0

 Attachments: patch.txt


 I get an NPE for pages with status ProtocolStatus.TEMP_MOVED. It seems the 
 handleRedirect function can return null in a few cases, and this case is not 
 handled as it is for ProtocolStatus.SUCCESS.




[jira] Created: (NUTCH-597) Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish.

2007-12-30 Thread Remco Verhoef (JIRA)
Fetcher2 - java.lang.NullPointerException when host does not exist and 
fetcher.threads.per.host.by.ip is set to true causes threads to finish.
--

 Key: NUTCH-597
 URL: https://issues.apache.org/jira/browse/NUTCH-597
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Debian
Reporter: Remco Verhoef


When fetcher.threads.per.host.by.ip is set to true the following exception is 
thrown when the host does not exist. FetchItem.create returns null when it is 
not able to resolve the host address when it is redirecting.

2007-12-30 15:34:42,720 WARN  fetcher.Fetcher2 - Unable to resolve: {url}  , skipping.
2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - java.lang.NullPointerException
2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:327)
2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:323)
2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:632)
2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - fetcher caught:java.lang..NullPointerException
2007-12-30 15:34:42,721 INFO  fetcher.Fetcher2 - -finishing thread FetcherThread, activeThreads=49
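
The failure mode reported here, a creation method that returns null followed by an unguarded use of the result, can be sketched as follows. This is an illustration only and not the attached patch: Item, Resolver, create and finish are hypothetical stand-ins for FetchItem.create and finishFetchItem, and the resolver is a hook so the example runs without DNS.

```java
import java.net.URI;
import java.net.UnknownHostException;

class NullItemGuardSketch {

  /** Hypothetical stand-in for a fetch item keyed by queue id. */
  static class Item {
    final String url;
    final String queueId;

    Item(String url, String queueId) {
      this.url = url;
      this.queueId = queueId;
    }
  }

  /** Hypothetical resolver hook so the sketch can run without real DNS. */
  interface Resolver {
    String resolve(String host) throws UnknownHostException;
  }

  /** Returns null when the host cannot be resolved, as the report describes. */
  static Item create(String url, Resolver resolver) {
    String host = URI.create(url).getHost();
    try {
      return new Item(url, resolver.resolve(host));
    } catch (UnknownHostException e) {
      return null;
    }
  }

  /** Guarded variant: a null item is skipped instead of being dereferenced. */
  static boolean finish(Item item) {
    if (item == null) {
      return false; // nothing was queued for this redirect target
    }
    // Placeholder for the per-queue bookkeeping keyed by item.queueId.
    return true;
  }

  public static void main(String[] args) {
    Resolver resolver = host -> {
      if (host.startsWith("missing")) {
        throw new UnknownHostException(host);
      }
      return "192.0.2.1"; // documentation-range address used as a fake result
    };
    System.out.println(finish(create("http://ok.example/", resolver)));
    System.out.println(finish(create("http://missing.example/", resolver)));
  }
}
```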




[jira] Updated: (NUTCH-597) Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish.

2007-12-30 Thread Remco Verhoef (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remco Verhoef updated NUTCH-597:


Attachment: fetcher2.java.patch

Contains the patch code for Fetcher2.java.

 Fetcher2 - java.lang.NullPointerException when host does not exist and 
 fetcher.threads.per.host.by.ip is set to true causes threads to finish.
 --

 Key: NUTCH-597
 URL: https://issues.apache.org/jira/browse/NUTCH-597
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Debian
Reporter: Remco Verhoef
 Attachments: fetcher2.java.patch


 When fetcher.threads.per.host.by.ip is set to true the following exception is 
 thrown when the host does not exist. FetchItem.create returns null when it is 
 not able to resolve the host address when it is redirecting.
 2007-12-30 15:34:42,720 WARN  fetcher.Fetcher2 - Unable to resolve: {url}  , skipping.
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - java.lang.NullPointerException
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:327)
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.finishFetchItem(Fetcher2.java:323)
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:632)
 2007-12-30 15:34:42,721 FATAL fetcher.Fetcher2 - fetcher caught:java.lang..NullPointerException
 2007-12-30 15:34:42,721 INFO  fetcher.Fetcher2 - -finishing thread FetcherThread, activeThreads=49




[jira] Created: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED

2007-12-16 Thread Emmanuel Joke (JIRA)
Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
-

 Key: NUTCH-592
 URL: https://issues.apache.org/jira/browse/NUTCH-592
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Emmanuel Joke
Assignee: Emmanuel Joke
 Fix For: 1.0.0
 Attachments: patch.txt

I get an NPE for pages with status ProtocolStatus.TEMP_MOVED. It seems the 
handleRedirect function can return null in a few cases, and this has not been 
handled in the function as it has been for the ProtocolStatus.SUCCESS case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED

2007-12-16 Thread Emmanuel Joke (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emmanuel Joke updated NUTCH-592:


Attachment: patch.txt

Patch provided.

 Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
 -

 Key: NUTCH-592
 URL: https://issues.apache.org/jira/browse/NUTCH-592
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Emmanuel Joke
Assignee: Emmanuel Joke
 Fix For: 1.0.0

 Attachments: patch.txt


 I get an NPE for pages with status ProtocolStatus.TEMP_MOVED. It seems the 
 handleRedirect function can return null in a few cases, and this has not been 
 handled in the function as it has been for the ProtocolStatus.SUCCESS case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly

2007-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508747
 ] 

Hudson commented on NUTCH-474:
--

Integrated in Nutch-Nightly #131 (See 
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/])

 Fetcher2 sets server-delay and blocking checks incorrectly
 --

 Key: NUTCH-474
 URL: https://issues.apache.org/jira/browse/NUTCH-474
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: fetcher2.patch


 1) Fetcher2 sets server delay incorrectly. It sets the delay to minCrawlDelay 
 if maxThreads == 1 and to crawlDelay otherwise. Correct behaviour should be 
 the opposite.
 2) Fetcher2 sets wrong configuration options so host blocking is still 
 handled by the lib-http plugin (Fetcher2 is designed to handle blocking 
 internally).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly

2007-06-28 Thread Doğacan Güney

On 6/28/07, Hudson (JIRA) [EMAIL PROTECTED] wrote:


[ 
https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508747
 ]

Hudson commented on NUTCH-474:
--

Integrated in Nutch-Nightly #131 (See 
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/])


*sigh*

I wrote NUTCH-474 instead of NUTCH-434 in svn log. Sorry everyone...



 Fetcher2 sets server-delay and blocking checks incorrectly
 --

 Key: NUTCH-474
 URL: https://issues.apache.org/jira/browse/NUTCH-474
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Andrzej Bialecki
 Fix For: 1.0.0

 Attachments: fetcher2.patch


 1) Fetcher2 sets server delay incorrectly. It sets the delay to minCrawlDelay 
if maxThreads == 1 and to crawlDelay otherwise. Correct behaviour should be the 
opposite.
 2) Fetcher2 sets wrong configuration options so host blocking is still 
handled by the lib-http plugin (Fetcher2 is designed to handle blocking 
internally).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





--
Doğacan Güney


[jira] Resolved: (NUTCH-495) Unnecessary delays in Fetcher2

2007-06-16 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney resolved NUTCH-495.
-

Resolution: Fixed
  Assignee: Doğacan Güney

Committed in rev 547901.

 Unnecessary delays in Fetcher2
 --

 Key: NUTCH-495
 URL: https://issues.apache.org/jira/browse/NUTCH-495
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: fetcher2_robots.patch


 Even if a url is blocked by robots.txt (or has a crawl delay larger than 
 max.crawl.delay), Fetcher2 still waits fetcher.server.delay before fetching 
 another url from same host, which is not necessary, considering that Fetcher2 
 didn't make a request to server anyway. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-495) Unnecessary delays in Fetcher2

2007-05-31 Thread JIRA
Unnecessary delays in Fetcher2
--

 Key: NUTCH-495
 URL: https://issues.apache.org/jira/browse/NUTCH-495
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0


Even if a url is blocked by robots.txt (or has a crawl delay larger than 
max.crawl.delay), Fetcher2 still waits fetcher.server.delay before fetching 
another url from same host, which is not necessary, considering that Fetcher2 
didn't make a request to server anyway. 
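
The idea in this report can be sketched as below: the politeness clock for a host is only advanced when a request was actually sent, so an item rejected by robots.txt does not delay the next one. This is a sketch under assumptions and not the eventual patch; the class HostQueueSketch and its methods finish and readyAt are hypothetical names.

```java
// Illustrative sketch only; names are hypothetical and not taken from Fetcher2.
class HostQueueSketch {
    private final long crawlDelayMs;
    private long nextAllowedMs = 0L; // earliest time the next request to this host may start

    HostQueueSketch(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    /**
     * Marks an item as finished at nowMs. The delay window is only restarted
     * when a request really went out to the server.
     */
    void finish(long nowMs, boolean requestMade) {
        if (requestMade) {
            nextAllowedMs = nowMs + crawlDelayMs;
        }
    }

    /** True if another item for this host may be fetched at nowMs. */
    boolean readyAt(long nowMs) {
        return nowMs >= nextAllowedMs;
    }

    public static void main(String[] args) {
        HostQueueSketch q = new HostQueueSketch(5000L);
        q.finish(1000L, true);   // a real fetch: the next request waits until t = 6000
        System.out.println(q.readyAt(2000L));
        q.finish(6000L, false);  // denied by robots.txt: no request was made, so no new delay
        System.out.println(q.readyAt(6001L));
    }
}
```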

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-495) Unnecessary delays in Fetcher2

2007-05-31 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-495:


Attachment: fetcher2_robots.patch

 Unnecessary delays in Fetcher2
 --

 Key: NUTCH-495
 URL: https://issues.apache.org/jira/browse/NUTCH-495
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: fetcher2_robots.patch


 Even if a url is blocked by robots.txt (or has a crawl delay larger than 
 max.crawl.delay), Fetcher2 still waits fetcher.server.delay before fetching 
 another url from same host, which is not necessary, considering that Fetcher2 
 didn't make a request to server anyway. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Fetcher2's delay between successive requests

2007-04-24 Thread Doğacan Güney

Hi all,

I have been working on Fetcher2 code lately and I came across this
particular code (in FetchItemQueue.getFetchItem) that I didn't quite
understand:

public FetchItem getFetchItem() {
 ...
 long last = endTime.get() + (maxThreads > 1 ? crawlDelay : minCrawlDelay);
 ...
}

Now, the 'default' politeness behaviour should be 1 thread per host
and delaying n seconds between successive requests to that host,
right? But, won't this code wait only minCrawlDelay(which, by default,
is 0) if maxThreads == 1.

I also did not understand why there is a maxThread check at all. Each
individual thread should wait crawl delay before making another
request to the same host. Am I missing something here?

--
Doğacan Güney


Re: Fetcher2's delay between successive requests

2007-04-24 Thread Doğacan Güney

I have discovered another bug in Fetcher2. Plugin lib-http checks
Protocol.CHECK_{BLOCKING,ROBOTS}(which resolve to strings
protocol.plugin.check.{blocking,robots})  to see if it should handle
blocking or not.

But fetcher2 sets http.plugin.check.{blocking,robots} (notice the
protocol/http difference) to false to indicate lib-http shouldn't
handle blocking internally. Because of this, when you use Fetcher2,
lib-http still tries to block them which makes Fetcher2 much less
useful.

I am not sending a patch for this yet because I first want to get some
feedback on the first bug.
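
The key mismatch described above can be illustrated with a plain map in place of the real configuration object. The two key strings are the ones quoted in this message; the class name, the helper method, and the assumption that the plugin treats an unset key as true are mine and are not confirmed here.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; a Map stands in for the real configuration class.
class CheckKeySketch {
    /** Key the plugin is said to read (Protocol.CHECK_BLOCKING). */
    static final String CHECK_BLOCKING = "protocol.plugin.check.blocking";
    /** Key the fetcher is said to set instead. */
    static final String WRONG_KEY = "http.plugin.check.blocking";

    /** Plugin-side lookup; the default of true for an unset key is an assumption. */
    static boolean pluginHandlesBlocking(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(CHECK_BLOCKING, "true"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(WRONG_KEY, "false");       // has no effect on the plugin-side lookup
        System.out.println(pluginHandlesBlocking(conf));
        conf.put(CHECK_BLOCKING, "false");  // the key the plugin actually reads
        System.out.println(pluginHandlesBlocking(conf));
    }
}
```

Referring to the shared constant on both sides, rather than spelling the key out as a string literal, would avoid this kind of drift.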

--
Doğacan Güney


Re: Fetcher2's delay between successive requests

2007-04-24 Thread Andrzej Bialecki

Doğacan Güney wrote:

Hi all,

I have been working on Fetcher2 code lately and I came across this
particular code (in FetchItemQueue.getFetchItem) that I didn't quite
understand:

public FetchItem getFetchItem() {
 ...
 long last = endTime.get() + (maxThreads > 1 ? crawlDelay : minCrawlDelay);
 ...
}

Now, the 'default' politeness behaviour should be 1 thread per host
and delaying n seconds between successive requests to that host,
right? But, won't this code wait only minCrawlDelay(which, by default,
is 0) if maxThreads == 1.


Yes, that was the intended behavior - normally, you should never use 
more than 1 thread per host, unless you have an explicit permission to 
do so.


If multiple threads make requests to the same host, then the crawl delay 
parameter loses its usual meaning - see the details of this in comments 
to NUTCH-385. However, the sensible way to do is to still provide a way 
to limit the maximum rate of requests, and this is what the 
minCrawlDelay parameter is for.





I also did not understand why there is a maxThread check at all. Each
individual thread should wait crawl delay before making another
request to the same host. Am I missing something here?



See the ASCII-art graphs and comments in NUTCH-385 - this is likely not 
what is expected.


Although this JIRA issue is still open, the Fetcher2 code tries to 
implement this middle ground solution.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Fetcher2's delay between successive requests

2007-04-24 Thread Andrzej Bialecki

Doğacan Güney wrote:

I have discovered another bug in Fetcher2. Plugin lib-http checks
Protocol.CHECK_{BLOCKING,ROBOTS}(which resolve to strings
protocol.plugin.check.{blocking,robots})  to see if it should handle
blocking or not.

But fetcher2 sets http.plugin.check.{blocking,robots} (notice the
protocol/http difference) to false to indicate lib-http shouldn't
handle blocking internally. Because of this, when you use Fetcher2,
lib-http still tries to block them which makes Fetcher2 much less
useful.



This is definitely a bug.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Created: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly

2007-04-24 Thread JIRA
Fetcher2 sets server-delay and blocking checks incorrectly
--

 Key: NUTCH-474
 URL: https://issues.apache.org/jira/browse/NUTCH-474
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
 Fix For: 1.0.0


1) Fetcher2 sets server delay incorrectly. It sets the delay to minCrawlDelay 
if maxThreads == 1 and to crawlDelay otherwise. Correct behaviour should be the 
opposite.

2) Fetcher2 sets wrong configuration options so host blocking is still handled 
by the lib-http plugin (Fetcher2 is designed to handle blocking internally).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly

2007-04-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-474:


Attachment: fetcher2.patch

 Fetcher2 sets server-delay and blocking checks incorrectly
 --

 Key: NUTCH-474
 URL: https://issues.apache.org/jira/browse/NUTCH-474
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
 Fix For: 1.0.0

 Attachments: fetcher2.patch


 1) Fetcher2 sets server delay incorrectly. It sets the delay to minCrawlDelay 
 if maxThreads == 1 and to crawlDelay otherwise. Correct behaviour should be 
 the opposite.
 2) Fetcher2 sets wrong configuration options so host blocking is still 
 handled by the lib-http plugin (Fetcher2 is designed to handle blocking 
 internally).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Fetcher2's delay between successive requests

2007-04-24 Thread Andrzej Bialecki

Doğacan Güney wrote:


I don't get it. The code seems to do exactly the opposite of what you
are saying. If maxThreads == 1 then maxThreads > 1 is false thus the
expression evaluates to minCrawlDelay not crawlDelay. Shouldn't the
expression be (maxThreads > 1 ? minCrawlDelay : crawlDelay) ?


Yep, you're right - it's a bug. However, the reasoning that I presented 
still holds, it's just the implementation that doesn't get it ;)
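
The two variants of the expression discussed in this thread can be compared side by side. The variable names follow the quoted snippet; the wrapper class and method names are hypothetical, and the sketch does not reproduce the surrounding getFetchItem logic.

```java
// Illustrative sketch only; the class and method names are not from Fetcher2.
class CrawlDelaySketch {
    /** The expression as originally quoted: a single thread gets minCrawlDelay. */
    static long delayAsWritten(int maxThreads, long crawlDelay, long minCrawlDelay) {
        return maxThreads > 1 ? crawlDelay : minCrawlDelay;
    }

    /** The expression as proposed in the thread: a single thread waits the full crawlDelay. */
    static long delayAsProposed(int maxThreads, long crawlDelay, long minCrawlDelay) {
        return maxThreads > 1 ? minCrawlDelay : crawlDelay;
    }

    /** Earliest time at which the next item of the queue may be fetched. */
    static long nextFetchTime(long endTime, int maxThreads, long crawlDelay, long minCrawlDelay) {
        return endTime + delayAsProposed(maxThreads, crawlDelay, minCrawlDelay);
    }

    public static void main(String[] args) {
        long crawlDelay = 5000L;
        long minCrawlDelay = 0L;
        // One thread per host, the default politeness setting discussed above.
        System.out.println(delayAsWritten(1, crawlDelay, minCrawlDelay));
        System.out.println(delayAsProposed(1, crawlDelay, minCrawlDelay));
    }
}
```

With one thread per host and a minCrawlDelay of 0, the expression as written adds no wait at all between requests, which is the behaviour questioned at the start of the thread.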



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Fetcher2's delay between successive requests

2007-04-24 Thread Doğacan Güney

On 4/24/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Doğacan Güney wrote:

 I don't get it. The code seems to do exactly the opposite of what you
 are saying. If maxThreads == 1 then maxThreads > 1 is false thus the
 expression evaluates to minCrawlDelay not crawlDelay. Shouldn't the
 expression be (maxThreads > 1 ? minCrawlDelay : crawlDelay) ?

Yep, you're right - it's a bug. However, the reasoning that I presented
still holds, it's just the implementation that doesn't get it ;)



Heh, OK:). I opened an issue for these bugs (NUTCH-474)  and attached a patch.



--
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





--
Doğacan Güney


[jira] Closed: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly

2007-04-24 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-474.
---

Resolution: Fixed
  Assignee: Andrzej Bialecki 

Fixed in rev. 532088. Thanks!

 Fetcher2 sets server-delay and blocking checks incorrectly
 --

 Key: NUTCH-474
 URL: https://issues.apache.org/jira/browse/NUTCH-474
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
 Assigned To: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: fetcher2.patch


 1) Fetcher2 sets server delay incorrectly. It sets the delay to minCrawlDelay 
 if maxThreads == 1 and to crawlDelay otherwise. Correct behaviour should be 
 the opposite.
 2) Fetcher2 sets wrong configuration options so host blocking is still 
 handled by the lib-http plugin (Fetcher2 is designed to handle blocking 
 internally).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Fetcher2

2007-01-25 Thread kauu

please give us the url,thx

On 1/25/07, chee wu [EMAIL PROTECTED] wrote:


Just appended the portion for .81  to NUTCH-339

- Original Message -
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 8:06 AM
Subject: RE: Fetcher2


 Chee,

 Can you make the code available through Jira.

 Thanks,

 Armel

 -
 Armel T. Nene
 iDNA Solutions
 Tel: +44 (207) 257 6124
 Mobile: +44 (788) 695 0483
 http://blog.idna-solutions.com

 -Original Message-
 From: chee wu [mailto:[EMAIL PROTECTED]
 Sent: 24 January 2007 03:59
 To: nutch-dev@lucene.apache.org
 Subject: Re: Fetcher2

 Thanks! I successfully ported Fetcher2 to Nutch 0.8.1; it's pretty easy... I
 can share the code if anyone wants to use it.
 - Original Message -
 From: Andrzej Bialecki [EMAIL PROTECTED]
 To: nutch-dev@lucene.apache.org
 Sent: Tuesday, January 23, 2007 12:09 AM
 Subject: Re: Fetcher2


 chee wu wrote:
 Fetcher2 should be a great help for me, but it seems it can't be integrated
 with Nutch 0.8.1.
 Any advice on how to use it based on 0.8.1?


 You would have to port it to Nutch 0.8.1 - e.g. change all Text
 occurrences to UTF8, and most likely make other changes too ...

 --
 Best regards,
 Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com










--
www.babatu.com


RE: Fetcher2

2007-01-25 Thread Armel T. Nene
Kauu,

The url for Fetcher2 is: https://issues.apache.org/jira/browse/NUTCH-339

Armel

-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483 
http://blog.idna-solutions.com
-Original Message-
From: kauu [mailto:[EMAIL PROTECTED] 
Sent: 25 January 2007 09:31
To: nutch-dev@lucene.apache.org
Subject: Re: Fetcher2

please give us the url,thx

On 1/25/07, chee wu [EMAIL PROTECTED] wrote:

 Just appended the portion for .81  to NUTCH-339

 - Original Message -
 From: Armel T. Nene [EMAIL PROTECTED]
 To: nutch-dev@lucene.apache.org
 Sent: Thursday, January 25, 2007 8:06 AM
 Subject: RE: Fetcher2


  Chee,
 
  Can you make the code available through Jira.
 
  Thanks,
 
  Armel
 
  -
  Armel T. Nene
  iDNA Solutions
  Tel: +44 (207) 257 6124
  Mobile: +44 (788) 695 0483
  http://blog.idna-solutions.com
 
  -Original Message-
  From: chee wu [mailto:[EMAIL PROTECTED]
  Sent: 24 January 2007 03:59
  To: nutch-dev@lucene.apache.org
  Subject: Re: Fetcher2
 
  Thanks! I successfully ported Fetcher2 to Nutch 0.8.1; it's pretty easy... I
  can share the code if anyone wants to use it.
  - Original Message -
  From: Andrzej Bialecki [EMAIL PROTECTED]
  To: nutch-dev@lucene.apache.org
  Sent: Tuesday, January 23, 2007 12:09 AM
  Subject: Re: Fetcher2
 
 
  chee wu wrote:
  Fetcher2 should be a great help for me, but it seems it can't be integrated
  with Nutch 0.8.1.
  Any advice on how to use it based on 0.8.1?
 
 
  You would have to port it to Nutch 0.8.1 - e.g. change all Text
  occurrences to UTF8, and most likely make other changes too ...
 
  --
  Best regards,
  Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
  [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
  ___|||__||  \|  ||  |  Embedded Unix, System Integration
  http://www.sigram.com  Contact: info at sigram dot com
 
 
 
 
 




-- 
www.babatu.com



RE: Fetcher2

2007-01-24 Thread Armel T. Nene
Chee,

Can you make the code available through Jira.

Thanks,

Armel

-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483 
http://blog.idna-solutions.com

-Original Message-
From: chee wu [mailto:[EMAIL PROTECTED] 
Sent: 24 January 2007 03:59
To: nutch-dev@lucene.apache.org
Subject: Re: Fetcher2

Thanks! I successfully ported Fetcher2 to Nutch 0.8.1; it's pretty easy... I
can share the code if anyone wants to use it.
- Original Message - 
From: Andrzej Bialecki [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 23, 2007 12:09 AM
Subject: Re: Fetcher2


 chee wu wrote:
 Fetcher2 should be a great help for me, but it seems it can't be integrated
 with Nutch 0.8.1.
 Any advice on how to use it based on 0.8.1?
 
 
 You would have to port it to Nutch 0.8.1 - e.g. change all Text
 occurrences to UTF8, and most likely make other changes too ...
 
 -- 
 Best regards,
 Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
 
 




Re: Fetcher2

2007-01-24 Thread chee wu
Just appended the portion for .81  to NUTCH-339

- Original Message - 
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 8:06 AM
Subject: RE: Fetcher2


 Chee,
 
 Can you make the code available through Jira.
 
 Thanks,
 
 Armel
 
 -
 Armel T. Nene
 iDNA Solutions
 Tel: +44 (207) 257 6124
 Mobile: +44 (788) 695 0483 
 http://blog.idna-solutions.com
 
 -Original Message-
 From: chee wu [mailto:[EMAIL PROTECTED] 
 Sent: 24 January 2007 03:59
 To: nutch-dev@lucene.apache.org
 Subject: Re: Fetcher2
 
 Thanks! I successfully ported Fetcher2 to Nutch 0.8.1; it's pretty easy... I
 can share the code if anyone wants to use it.
 - Original Message - 
 From: Andrzej Bialecki [EMAIL PROTECTED]
 To: nutch-dev@lucene.apache.org
 Sent: Tuesday, January 23, 2007 12:09 AM
 Subject: Re: Fetcher2
 
 
 chee wu wrote:
 Fetcher2 should be a great help for me, but it seems it can't be integrated
 with Nutch 0.8.1.
 Any advice on how to use it based on 0.8.1?
 
 
 You would have to port it to Nutch 0.8.1 - e.g. change all Text
 occurrences to UTF8, and most likely make other changes too ...
 
 -- 
 Best regards,
 Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
 
 

 


Re: Fetcher2

2007-01-23 Thread chee wu
Thanks! I successfully ported Fetcher2 to Nutch 0.8.1; it's pretty easy... I can 
share the code if anyone wants to use it.
- Original Message - 
From: Andrzej Bialecki [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 23, 2007 12:09 AM
Subject: Re: Fetcher2


 chee wu wrote:
 Fetcher2 should be a great help for me, but it seems it can't be integrated
 with Nutch 0.8.1.
 Any advice on how to use it based on 0.8.1?
 
 
 You would have to port it to Nutch 0.8.1 - e.g. change all Text
 occurrences to UTF8, and most likely make other changes too ...
 
 -- 
 Best regards,
 Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
 
 

