[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-03-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678573#action_12678573
 ] 

Hudson commented on NUTCH-669:
--

Integrated in Nutch-trunk #742 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/742/])


 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Assignee: Sami Siren
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?
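The "queue per crawl host" model described above can be illustrated with a minimal sketch. This is not Fetcher2's actual code: the class name PerHostQueues, its methods, and the single fixed crawl delay are all assumptions made for illustration. The idea is only that each host has its own FIFO queue and its own earliest-next-fetch time, so politeness is enforced by the fetcher rather than by the Protocol.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: one FIFO queue per host, each with its own
// "earliest next fetch" timestamp. Not the real Fetcher2 data structure.
class PerHostQueues {
    private static final class HostQueue {
        final Queue<String> urls = new ArrayDeque<>();
        long nextFetchTimeMs = 0L; // host may be fetched at or after this time
    }

    // LinkedHashMap keeps hosts in insertion order, so polling is deterministic.
    private final Map<String, HostQueue> queues = new LinkedHashMap<>();
    private final long crawlDelayMs; // assumed: one fixed delay for all hosts

    PerHostQueues(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    synchronized void add(String host, String url) {
        queues.computeIfAbsent(host, h -> new HostQueue()).urls.add(url);
    }

    // Returns the next URL whose host is allowed to be fetched at nowMs,
    // or null if every non-empty queue is still inside its crawl delay.
    synchronized String poll(long nowMs) {
        for (HostQueue q : queues.values()) {
            if (!q.urls.isEmpty() && nowMs >= q.nextFetchTimeMs) {
                q.nextFetchTimeMs = nowMs + crawlDelayMs;
                return q.urls.remove();
            }
        }
        return null;
    }
}
```

Passing the current time into poll keeps the sketch easy to test; a fetcher thread would presumably call it with System.currentTimeMillis() and back off briefly when it returns null.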

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-01-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665562#action_12665562
 ] 

Doğacan Güney commented on NUTCH-669:
-

Hi Todd,

Can you upload your work to JIRA now, so that we can review and merge it for 
1.0?




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-01-02 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660382#action_12660382
 ] 

Todd Lipcon commented on NUTCH-669:
---

Here's a further report on my progress:

  - It turns out the change in NUTCH-676 caused things to break - there's some 
behavior in Nutch's MapWritable that differs from Hadoop's, so it was spending 
all of its time in output.collect - I think the writables were accruing lots of 
key/value pairs that they weren't supposed to. So, this doesn't depend on 
NUTCH-676.

  - I implemented adaptive crawl delay (NUTCH-475) in the new fetcher.

  - Also implemented early termination as discussed in this mailing list 
thread: 
http://www.nabble.com/proposal:-fetcher-performance-improvements-td20939872.html

Results so far are looking good. I was able to run a 1M-URL fetch with 5000 
URLs per host at a sustained rate of 25 pages/second (total around 11 hours). 
About 60% of the URLs ended up parsed, which isn't significantly worse than I 
usually see without early termination, but past attempts to run 1M fetches have 
taken several days because of some slow hosts.
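As a rough consistency check on these figures (a sketch only; the helper name hoursFor is made up for illustration), 1,000,000 URLs at 25 pages/second works out to 40,000 seconds, or about 11.1 hours, which matches the "around 11 hours" total if the rate is averaged over the whole run:

```java
// Hypothetical helper: wall-clock hours needed to fetch a number of URLs
// at a constant average rate.
class FetchRate {
    static double hoursFor(long urls, double pagesPerSecond) {
        double seconds = urls / pagesPerSecond;
        return seconds / 3600.0;
    }
}
```

For example, FetchRate.hoursFor(1_000_000, 25.0) evaluates to roughly 11.1.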

I'm running a 2M+ URL fetch right now and have been sustaining 40-60 Mbit/s 
inbound from 8 fetchers for the last couple of hours.

  - I did experience one GC error - I think I need to add some cleanup of empty 
queues out of the FetchQueue structure when the number of unique hosts is very 
high.
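One way such a cleanup could look is sketched below. The FetchQueue structure itself is not shown in this thread, so PruningHostQueues and its methods are purely hypothetical; the point is only that a host's queue is removed from the map as soon as it drains, so memory does not grow with the number of unique hosts seen.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: per-host queues that are dropped once they are empty.
class PruningHostQueues {
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();

    synchronized void add(String host, String url) {
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    // Takes one URL from the first host queue and removes that queue
    // from the map if it is now empty. Returns null when nothing is queued.
    synchronized String poll() {
        Iterator<Map.Entry<String, Queue<String>>> it = queues.entrySet().iterator();
        while (it.hasNext()) {
            Queue<String> q = it.next().getValue();
            String url = q.poll();
            if (q.isEmpty()) {
                it.remove(); // drop the drained queue so it can be collected
            }
            if (url != null) {
                return url;
            }
        }
        return null;
    }

    synchronized int hostCount() {
        return queues.size();
    }
}
```

Note that dropping a queue also discards any per-host state kept with it (for example a last-fetch time), so a real implementation would probably need to delay removal until the host's crawl delay has elapsed.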

Complete history is here: http://github.com/toddlipcon/nutch/tree/nutch-669




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-01-02 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660397#action_12660397
 ] 

Otis Gospodnetic commented on NUTCH-669:


Todd, when you say a sustained rate of 25 pages/second, do you mean the final 
rate you see on one of the status screens?  In other words, this is not the 
steady rate you see while the fetch run is in full swing (which could be a lot 
higher), but rather the final rate?





[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659857#action_12659857
 ] 

Todd Lipcon commented on NUTCH-669:
---

Hey guys,

I tried it on production, but ran into an Exception of some sort that happened 
very rarely. Then I went on vacation for 2 weeks and came back to find the logs 
gone from my hadoop tracker, so I can't figure out what the Exception was ;-) 
I'll run another segment today hopefully and let you know the results.

-Todd




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659958#action_12659958
 ] 

Todd Lipcon commented on NUTCH-669:
---

Found the exception in a screen log:

{noformat}
java.lang.NullPointerException
    at org.apache.nutch.crawl.MapWritable$KeyValueEntry.access$102(MapWritable.java:469)
    at org.apache.nutch.crawl.MapWritable.readFields(MapWritable.java:362)
    at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:250)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
    at org.apache.nutch.fetcher.Fetcher$FetchMapper.run(Fetcher.java:399)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
{noformat}

I think NUTCH-676 may help this. Trying another run in a minute.




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-27 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659343#action_12659343
 ] 

Andrzej Bialecki  commented on NUTCH-669:
-

Well ... have you tried it? How did it go?

I think it's time to upload the patch to JIRA, so that we can decide what to do 
using a concrete snapshot of your work.




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-10 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655491#action_12655491
 ] 

Todd Lipcon commented on NUTCH-669:
---

For those watching this issue: I pushed a couple more changes to the github 
repo linked above. I'm about to try it on production with a 100K-URL segment, 
80 threads, limit by IP, 8 crawler nodes. We'll see how it goes.




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-05 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653844#action_12653844
 ] 

Todd Lipcon commented on NUTCH-669:
---

Agreed on all fronts.

I spent several hours yesterday refactoring/rewriting Fetcher2 to be a little 
cleaner. One of the changes was to factor out the queueing policies into a new 
class and replace the Thread-based model with one based on ExecutorServices. I 
may also try to factor out the actual fetching into a new class as well.
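For readers unfamiliar with the ExecutorService approach, here is a minimal sketch of the general pattern. It is not the code from this rewrite: PooledFetcher, PageFetcher and fetchAll are hypothetical names, and PageFetcher merely stands in for whatever performs the actual protocol call. Instead of managing Thread objects directly, each fetch is submitted as a task to a fixed-size pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of a pool-based fetch loop.
class PooledFetcher {
    // Stand-in for the real protocol call; an assumption for illustration.
    interface PageFetcher {
        String fetch(String url) throws Exception;
    }

    // Fetches all URLs using at most `threads` worker threads and returns
    // the results in the same order as the input list.
    static List<String> fetchAll(List<String> urls, PageFetcher fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetcher.fetch(url));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until every task has completed.
            for (Future<String> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("fetch task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool takes over thread lifecycle and task hand-off, which leaves the queueing policy as the only part that needs to know about hosts and delays.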

I haven't gotten to testing the new version quite yet but hopefully should have 
a patch available next week, and perhaps some intermediate commits available on 
github this afternoon so people can see where I'm headed.

Is there a unit (or functional) testing infrastructure I can use somewhere to 
test this?

-Todd

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Priority: Minor




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-05 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653940#action_12653940
 ] 

Todd Lipcon commented on NUTCH-669:
---

I've pushed the initial commit of this rewrite/refactor to github:

http://github.com/toddlipcon/nutch/commit/5c9d99a856628c842b50b1d76f62b375f377bf95

Might be worth just reviewing it as if it were a new file rather than a diff:

http://github.com/toddlipcon/nutch/tree/5c9d99a856628c842b50b1d76f62b375f377bf95/src/java/org/apache/nutch/fetcher/Fetcher.java

Still have some more cleanup and revisions here, plus I want to test it on a 
real crawl or two from our cluster. It currently passes the TestFetcher unit 
test but I don't know what the coverage is on that.

I'll attach a patch here before it's ready to be committed so I can check off 
the license grant checkbox, which I know is important for the ASF.




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653487#action_12653487
 ] 

Doğacan Güney commented on NUTCH-669:
-

 *  What reason is there for Fetcher existing at all since Fetcher2 seems 
 to be a superset of functionality?

Agreed. We should just rename Fetcher2 to Fetcher and be done with it :D
