Re: Infinite loop bug in Nutch 0.9

2009-04-02 Thread Doğacan Güney
On Wed, Apr 1, 2009 at 13:29, George Herlin ghher...@gmail.com wrote:

 Sorry, forgot to say, there is an additional precondition for triggering the bug:

 The redirection has to be fetched before the page it redirects to... if
 not, there will be a pre-existing crawl datum with a reasonable
 refetch interval.


Maybe this is something that was fixed between 0.9 and 1.0, but I think
CrawlDbReducer fixes these datums, around line 147 (case
CrawlDatum.STATUS_LINKED). Have you ever actually gotten stuck in an
infinite loop because of it?




 2009/4/1 George Herlin ghher...@gmail.com

 Hello, there.

 I believe I may have found an infinite loop in Nutch 0.9.

 It happens when a site has a page that refers to itself through a
 redirection.

 The code in Fetcher.run(), around line 200 - sorry, my Fetcher has been a
 little modified, line numbers may vary a little - says, for that case:

 output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED);

 What that does is insert an extra (empty) crawl datum for the new url,
 with a refetch interval of 0.0.

 However (see Generator.Selector.map(), particularly lines 144-145), the
 non-refetch condition used seems to be last-fetch + refetch-interval > now
 ... which is always false if refetch-interval == 0.0!

 Now, if there is a new link to the new url in that page, that crawl datum
 is re-used, and the whole thing loops indefinitely.
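
 In pseudo-Java, the check amounts to something like this (my
 paraphrase, not the literal source; the interval is stored in days):

 // paraphrase of the non-refetch test in Generator.Selector.map()
 long nextFetch = datum.getFetchTime()
     + (long) (datum.getFetchInterval() * 24L * 3600L * 1000L);
 if (nextFetch > curTime)
   return;                  // not due yet, skip this url
 // with getFetchInterval() == 0.0f, nextFetch is never in the future,
 // so the url is generated again on every cycle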

 I've fixed that for myself by replacing the quoted line (in both places) with:

 output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, 30f), null, null,
 CrawlDatum.STATUS_LINKED);

 and that works. (By the way, the 30f should really be the value of
 db.default.fetch.interval, but I don't have the time right now to work
 out the details.) In fact, if my reading of the algorithm is right, the
 default constructor and the appropriate updater method should always
 enforce a positive refetch interval.
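
 For illustration, pulling the default from configuration would look
 something like this (a sketch; it assumes getConf() is available where
 output() is called):

 float defaultInterval =
     getConf().getFloat("db.default.fetch.interval", 30f);
 output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, defaultInterval),
     null, null, CrawlDatum.STATUS_LINKED);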

 Of course, another method could be used to remove this self-reference,
 but that could be complicated, since the cycle may run through a loop
 of two or more pages.

 Has that been fixed already, and by what method?

 Best regards

 George Herlin







-- 
Doğacan Güney


Re: Infinite loop bug in Nutch 0.9

2009-04-02 Thread George Herlin
Indeed I have... that's how I found out.

My test case: crawl

http://www.purdue.ca/research/research_clinical.asp

with crawl-urlfilter and regex-urlfilter ending with

#purdue
+^http://www.purdue.ca/research/
+^http://www.purdue.ca/pdf/

# reject anything else
-.

The site is very small (which helped in diagnosis).

Attached is the beginning of a run log, just in case.

brgds

George

LOG
Resource not found: commons-logging.properties
Resource not found: META-INF/services/org.apache.commons.logging.LogFactory
Resource not found: log4j.xml
Resource found: log4j.properties
Resource found: hadoop-default.xml
Resource found: hadoop-site.xml
Resource found: nutch-default.xml
Resource found: nutch-site.xml
Resource not found: crawl-tool.xml
Injector: starting
Injector: crawlDb: crawl-www.purdue.ca-20090402110952/crawldb
Injector: urlDir: conf/purdueHttp
Injector: Converting injected urls to crawl db entries.
Resource not found: META-INF/services/javax.xml.transform.TransformerFactory
Resource not found:
META-INF/services/com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager
Resource not found:
com/sun/org/apache/xml/internal/serializer/XMLEntities_en.properties
Resource not found:
com/sun/org/apache/xml/internal/serializer/XMLEntities_en_US.properties
Resource found: regex-normalize.xml
Resource found: regex-urlfilter.txt
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment:
crawl-www.purdue.ca-20090402110952/segments/20090402110955
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402110955
Fetcher: threads: 1
Resource found: parse-plugins.xml
fetching http://www.purdue.ca/research/research_clinical.asp
Resource found: mime-types.xml
Resource not found: META-INF/services/org.apache.xerces.impl.Version
Resource found: www.purdue.ca.html.parser-conf.properties
Resource found: www.purdue.ca.resultslist.html.parser-conf.properties
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
CrawlDb update: segments:
[crawl-www.purdue.ca-20090402110952/segments/20090402110955]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment:
crawl-www.purdue.ca-20090402110952/segments/20090402111003
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111003
Fetcher: threads: 1
fetching http://www.purdue.ca/research/
fetching http://www.purdue.ca/research/research_ongoing.asp
fetching http://www.purdue.ca/research/research_quality.asp
fetching http://www.purdue.ca/research/research_completed.asp
fetching http://www.purdue.ca/research/research_contin.asp
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
CrawlDb update: segments:
[crawl-www.purdue.ca-20090402110952/segments/20090402111003]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment:
crawl-www.purdue.ca-20090402110952/segments/20090402111024
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111024
Fetcher: threads: 1
fetching http://www.purdue.ca/research/research.asp
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
CrawlDb update: segments:
[crawl-www.purdue.ca-20090402110952/segments/20090402111024]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment:
crawl-www.purdue.ca-20090402110952/segments/20090402111031
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting

[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-04-02 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694942#action_12694942
 ] 

Julien Nioche commented on NUTCH-692:
-

As I pointed out in my previous message, the root of the problem in my case was 
related to some dodgy URLs coming from the Javascript parser, which put the 
basic normalizer into a spin. This would indeed repeat in subsequent attempts.

However, the AlreadyBeingCreatedException should not happen, and we should not 
have output files left open. If your patch fixes that, I am sure this will 
be a very welcome contribution.

 AlreadyBeingCreatedException with Hadoop 0.19
 -

 Key: NUTCH-692
 URL: https://issues.apache.org/jira/browse/NUTCH-692
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Julien Nioche

 I have been using the SVN version of Nutch on an EC2 cluster and got some 
 AlreadyBeingCreatedException during the reduce phase of a parse. For some 
 reason one of my tasks crashed and then I ran into this 
 AlreadyBeingCreatedException when other nodes tried to pick it up.
 There was recently a discussion on the Hadoop user list on similar issues 
 with Hadoop 0.19 (see 
 http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried 
 using 0.18.2 yet but will do if the problems persist with 0.19
 I was wondering whether anyone else had experienced the same problem. Do you 
 think 0.19 is stable enough to use it for Nutch 1.0?
 I will be running a crawl on a super large cluster in the next couple of 
 weeks and I will confirm this issue  
 J.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694986#action_12694986
 ] 

Doğacan Güney commented on NUTCH-721:
-

I've committed the Nutch 0.9 fetcher as OldFetcher. Can you test with trunk 
and OldFetcher?

 Fetcher2 Slow
 -

 Key: NUTCH-721
 URL: https://issues.apache.org/jira/browse/NUTCH-721
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
Reporter: Roger Dunk
 Attachments: crawl_generate.tar.gz, nutch-site.xml


 Fetcher2 fetches far more slowly than Fetcher1.
 Config options:
 fetcher.threads.fetch = 80
 fetcher.threads.per.host = 80
 fetcher.server.delay = 0
 generate.max.per.host = 1
 With a queue size of ~40,000, the result is:
 activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
 with maybe a download of 1 page per second.
 Running with -noParse makes little difference.
 CPU load average is around 0.2. With Fetcher1, CPU load is around 2.0 - 3.0.
 Hosts already cached by the local caching NS appear to download quickly upon a 
 re-fetch, so this is possibly an issue relating to NS lookups; however, all 
 things being equal, Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (NUTCH-721) Fetcher2 Slow

2009-04-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694986#action_12694986
 ] 

Doğacan Güney edited comment on NUTCH-721 at 4/2/09 6:01 AM:
-

I've committed the Nutch 0.9 fetcher as OldFetcher. Can you test with trunk and 
OldFetcher, so that we can find out whether this is related to the new fetcher 
or is a side effect of some other change?

  was (Author: dogacan):
I've committed the Nutch 0.9 fetcher as OldFetcher. Can you test with trunk 
and OldFetcher?
  
 Fetcher2 Slow
 -

 Key: NUTCH-721
 URL: https://issues.apache.org/jira/browse/NUTCH-721
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
Reporter: Roger Dunk
 Attachments: crawl_generate.tar.gz, nutch-site.xml


 Fetcher2 fetches far more slowly than Fetcher1.
 Config options:
 fetcher.threads.fetch = 80
 fetcher.threads.per.host = 80
 fetcher.server.delay = 0
 generate.max.per.host = 1
 With a queue size of ~40,000, the result is:
 activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
 with maybe a download of 1 page per second.
 Running with -noParse makes little difference.
 CPU load average is around 0.2. With Fetcher1, CPU load is around 2.0 - 3.0.
 Hosts already cached by the local caching NS appear to download quickly upon a 
 re-fetch, so this is possibly an issue relating to NS lookups; however, all 
 things being equal, Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Nutch Topical / Focused Crawl

2009-04-02 Thread MyD

Hi @ all,

I'd like to turn Nutch into a focused / topical crawler. It's part
of my final year thesis, and I'd like others to be able to build on
my work. I started to analyze the code and think that I found the
right piece of code; I just wanted to know if I am on the right track.
I think the right place to implement a decision about whether to fetch
further is in the output method of the Fetcher class, every time we
call the collect method of the OutputCollector object.


private ParseStatus output(Text key, CrawlDatum datum, Content content,
    ProtocolStatus pstatus, int status) {
  ...
  output.collect(...);
  ...
}
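
Concretely, the kind of decision I have in mind would look something
like this (just a sketch; isOnTopic() is a hypothetical predicate I
would implement):

// inside output(): drop off-topic pages so that their outlinks never
// reach the crawl db; isOnTopic() is hypothetical
if (content != null && !isOnTopic(content)) {
  return null;                // nothing collected, crawl pruned here
}
output.collect(...);          // on-topic: proceed as before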

Would you mind letting me know the best way to turn this decision
into a plugin? I was thinking of going a route similar to the scoring
filters. Thanks in advance.


Cheers,
MyD

Re: Nutch Topical / Focused Crawl

2009-04-02 Thread Ken Krugler

 Hi @ all,

 I'd like to turn Nutch into a focused / topical crawler. It's part
 of my final year thesis, and I'd like others to be able to build on
 my work. I started to analyze the code and think that I found the
 right piece of code; I just wanted to know if I am on the right track.
 I think the right place to implement a decision about whether to fetch
 further is in the output method of the Fetcher class, every time we
 call the collect method of the OutputCollector object.

 private ParseStatus output(Text key, CrawlDatum datum, Content content,
     ProtocolStatus pstatus, int status) {
   ...
   output.collect(...);
   ...
 }

 Would you mind letting me know the best way to turn this decision
 into a plugin? I was thinking of going a route similar to the scoring
 filters. Thanks in advance.


I don't have the code in front of me right now, but we did something
like this for a focused tech-pages crawl at Krugle a few years
back. Our goal was to influence the OPIC scores to ensure that pages
we thought were likely to be good technical pages got fetched
sooner.

Assuming you're using the scoring-opic plugin, you'd create a
custom ScoringFilter that gets executed after the scoring-opic plugin.

But the actual process of hooking everything up was pretty complicated
and error-prone, unfortunately. We had to define our own keys for
storing our custom scores inside the parse_data metadata, the content
metadata, and the CrawlDb metadata.


And we had to implement the following methods for our scoring plugin (a
rough sketch of one of them follows the list):

setConf()
injectScore()
initialScore()
generateSortValue()
passScoreBeforeParsing()
passScoreAfterParsing()
shouldHarvestOutlinks()
distributeScoreToOutlink()
updateDbScore()
indexerScore()
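
As a rough sketch of one of those hooks (signatures as I remember the
Nutch 1.0 ScoringFilter interface, so treat them as an assumption; the
class is left abstract because the other methods are omitted, and the
package name and term list are made up):

package com.example.scoring;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.scoring.ScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

public abstract class TopicScoringFilter extends Configured
    implements ScoringFilter {

  // made-up topic vocabulary for the example
  private static final String[] TOPIC_TERMS = {"compiler", "debugger"};

  // After parsing, stash a topical score in the parse metadata under a
  // custom key; a later hook (e.g. distributeScoreToOutlink) can read
  // it back to boost on-topic outlinks.
  public void passScoreAfterParsing(Text url, Content content, Parse parse)
      throws ScoringFilterException {
    String text = parse.getText().toLowerCase();
    int hits = 0;
    for (String term : TOPIC_TERMS) {
      if (text.indexOf(term) >= 0) hits++;
    }
    parse.getData().getParseMeta().set("topic.score",
        Float.toString((float) hits / TOPIC_TERMS.length));
  }
}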

-- Ken
--
Ken Krugler
+1 530-210-6378


Re: Infinite loop bug in Nutch 0.9

2009-04-02 Thread Julien Nioche
George,

Try using Nutch 1.0 instead. I have tested your example with the SVN version
and it did not run into the problem you described.

J.

2009/4/2 George Herlin ghher...@gmail.com

 Indeed I have... that's how I found out.

 My test case: crawl

 http://www.purdue.ca/research/research_clinical.asp

 with crawl-urlfilter and regex-urlfilter ending with

 #purdue
 +^http://www.purdue.ca/research/
 +^http://www.purdue.ca/pdf/

 # reject anything else
 -.

 The site is very small (which helped in diagnosis).

 Attached is the beginning of a run log, just in case.

 brgds

 George

 LOG
 Resource not found: commons-logging.properties
 Resource not found: META-INF/services/org.apache.commons.logging.LogFactory
 Resource not found: log4j.xml
 Resource found: log4j.properties
 Resource found: hadoop-default.xml
 Resource found: hadoop-site.xml
 Resource found: nutch-default.xml
 Resource found: nutch-site.xml
 Resource not found: crawl-tool.xml
 Injector: starting
 Injector: crawlDb: crawl-www.purdue.ca-20090402110952/crawldb
 Injector: urlDir: conf/purdueHttp
 Injector: Converting injected urls to crawl db entries.
 Resource not found:
 META-INF/services/javax.xml.transform.TransformerFactory
 Resource not found:

 META-INF/services/com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager
 Resource not found:
 com/sun/org/apache/xml/internal/serializer/XMLEntities_en.properties
 Resource not found:
 com/sun/org/apache/xml/internal/serializer/XMLEntities_en_US.properties
 Resource found: regex-normalize.xml
 Resource found: regex-urlfilter.txt
 Injector: Merging injected urls into crawl db.
 Injector: done
 Generator: Selecting best-scoring urls due for fetch.
 Generator: starting
 Generator: segment:
 crawl-www.purdue.ca-20090402110952/segments/20090402110955
 Generator: filtering: false
 Generator: topN: 2147483647
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls by host, for politeness.
 Generator: done.
 Fetcher: starting
 Fetcher: segment:
 crawl-www.purdue.ca-20090402110952/segments/20090402110955
 Fetcher: threads: 1
 Resource found: parse-plugins.xml
 fetching http://www.purdue.ca/research/research_clinical.asp
 Resource found: mime-types.xml
 Resource not found: META-INF/services/org.apache.xerces.impl.Version
 Resource found: www.purdue.ca.html.parser-conf.properties
 Resource found: www.purdue.ca.resultslist.html.parser-conf.properties
 Fetcher: done
 CrawlDb update: starting
 CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
 CrawlDb update: segments:
 [crawl-www.purdue.ca-20090402110952/segments/20090402110955]
 CrawlDb update: additions allowed: true
 CrawlDb update: URL normalizing: true
 CrawlDb update: URL filtering: true
 CrawlDb update: Merging segment data into db.
 CrawlDb update: done
 Generator: Selecting best-scoring urls due for fetch.
 Generator: starting
 Generator: segment:
 crawl-www.purdue.ca-20090402110952/segments/20090402111003
 Generator: filtering: false
 Generator: topN: 2147483647
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls by host, for politeness.
 Generator: done.
 Fetcher: starting
 Fetcher: segment:
 crawl-www.purdue.ca-20090402110952/segments/20090402111003
 Fetcher: threads: 1
 fetching http://www.purdue.ca/research/
 fetching http://www.purdue.ca/research/research_ongoing.asp
 fetching http://www.purdue.ca/research/research_quality.asp
 fetching http://www.purdue.ca/research/research_completed.asp
 fetching http://www.purdue.ca/research/research_contin.asp
 Fetcher: done
 CrawlDb update: starting
 CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
 CrawlDb update: segments:
 [crawl-www.purdue.ca-20090402110952/segments/20090402111003]
 CrawlDb update: additions allowed: true
 CrawlDb update: URL normalizing: true
 CrawlDb update: URL filtering: true
 CrawlDb update: Merging segment data into db.
 CrawlDb update: done
 Generator: Selecting best-scoring urls due for fetch.
 Generator: starting
 Generator: segment:
 crawl-www.purdue.ca-20090402110952/segments/20090402111024
 Generator: filtering: false
 Generator: topN: 2147483647
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls by host, for politeness.
 Generator: done.
 Fetcher: starting
 Fetcher: segment:
 crawl-www.purdue.ca-20090402110952/segments/20090402111024
 Fetcher: threads: 1
 fetching http://www.purdue.ca/research/research.asp
 Fetcher: done
 CrawlDb update: starting
 CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
 CrawlDb update: segments:
 [crawl-www.purdue.ca-20090402110952/segments/20090402111024]
 CrawlDb update: additions allowed: true
 CrawlDb update: URL normalizing: true
 CrawlDb update: URL filtering: true
 CrawlDb update: Merging segment data into db.
 CrawlDb update: done
 Generator: Selecting best-scoring urls due for fetch.
 Generator: 

Using keywords metatags

2009-04-02 Thread Rodrigo Reyes C.
Hi all. I would like to add keywords to the information that gets inserted
into the Lucene Indexes. I am thinking I need to insert them into the WebDB
and later on insert them into the Lucene indexes. Am I right? Which
extension points do I need to use?
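
For instance, I imagine an indexing filter would look roughly like this
(a sketch against the Nutch 0.9 IndexingFilter interface; I am assuming
the parser has already stored the keywords in the parse metadata under
the key "keywords", which may itself require a parser change):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class KeywordsIndexingFilter implements IndexingFilter {
  private Configuration conf;

  // Copy the keywords from the parse metadata into the Lucene document.
  public Document filter(Document doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String keywords = parse.getData().getParseMeta().get("keywords");
    if (keywords != null) {
      doc.add(new Field("keywords", keywords,
          Field.Store.YES, Field.Index.TOKENIZED));
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}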

Thanks in advance

-- 
Rodrigo Reyes


[jira] Updated: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-04-02 Thread Cosmin Lehene (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cosmin Lehene updated NUTCH-692:


Attachment: NUTCH-692.patch

This just checks for the existence of the destination file before attempting to 
create a new output MapFile for the reduce task in FetcherOutputFormat and 
ParseOutputFormat. If the destination files exist, it deletes them.
The AlreadyBeingCreatedException is thrown when a MapFile creation attempt 
tries to create the same file that a previous failed task left open.
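
In essence the idea is the following (a sketch, not the patch verbatim;
"out" and "conf" stand for the destination Path and job configuration
already in scope in those output formats):

// Remove anything a previous failed attempt left behind before
// creating the reduce task's MapFile.
FileSystem fs = out.getFileSystem(conf);
if (fs.exists(out)) {
  fs.delete(out, true);   // clear the half-written file
}
MapFile.Writer writer =
    new MapFile.Writer(conf, fs, out.toString(), Text.class, CrawlDatum.class);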


 AlreadyBeingCreatedException with Hadoop 0.19
 -

 Key: NUTCH-692
 URL: https://issues.apache.org/jira/browse/NUTCH-692
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Julien Nioche
 Attachments: NUTCH-692.patch


 I have been using the SVN version of Nutch on an EC2 cluster and got some 
 AlreadyBeingCreatedException during the reduce phase of a parse. For some 
 reason one of my tasks crashed and then I ran into this 
 AlreadyBeingCreatedException when other nodes tried to pick it up.
 There was recently a discussion on the Hadoop user list on similar issues 
 with Hadoop 0.19 (see 
 http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried 
 using 0.18.2 yet but will do if the problems persist with 0.19
 I was wondering whether anyone else had experienced the same problem. Do you 
 think 0.19 is stable enough to use it for Nutch 1.0?
 I will be running a crawl on a super large cluster in the next couple of 
 weeks and I will confirm this issue  
 J.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-04-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695122#action_12695122
 ] 

Doğacan Güney commented on NUTCH-692:
-

Thanks for the patch.

Patch looks good to me. Can you confirm that this fixes the problem (or tell me 
how to trigger the problem without the patch)?

 AlreadyBeingCreatedException with Hadoop 0.19
 -

 Key: NUTCH-692
 URL: https://issues.apache.org/jira/browse/NUTCH-692
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Julien Nioche
 Attachments: NUTCH-692.patch


 I have been using the SVN version of Nutch on an EC2 cluster and got some 
 AlreadyBeingCreatedException during the reduce phase of a parse. For some 
 reason one of my tasks crashed and then I ran into this 
 AlreadyBeingCreatedException when other nodes tried to pick it up.
 There was recently a discussion on the Hadoop user list on similar issues 
 with Hadoop 0.19 (see 
 http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried 
 using 0.18.2 yet but will do if the problems persist with 0.19
 I was wondering whether anyone else had experienced the same problem. Do you 
 think 0.19 is stable enough to use it for Nutch 1.0?
 I will be running a crawl on a super large cluster in the next couple of 
 weeks and I will confirm this issue  
 J.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-02 Thread Roger Dunk (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695170#action_12695170
 ] 

Roger Dunk commented on NUTCH-721:
--

For the following tests I've used the same segment containing 5000 URLs. I 
cleared the named (DNS) cache before the first two tests.

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.OldFetcher 
newcrawl/segments/20090402130655/

real    3m38.084s
user    2m20.887s
sys     0m7.470s

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher 
newcrawl/segments/20090402130655/

[...]

Fetcher: done

real    53m44.800s
user    2m20.070s
sys     0m9.527s

For this next test, I used the same segment but didn't clear the named cache 
from the previous test, so all resolvable hosts should still be cached. This 
appeared to help greatly: oftentimes, out of 80 active threads, only 60 were 
spinwaiting (as opposed to 79 in the non-cached test). But there were still 
plenty of times where at least 30 consecutive log entries showed 80 threads 
spinwaiting. And clearly, as can be seen from the times below, still nowhere 
near the league of OldFetcher.

[r...@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher 
newcrawl/segments/20090402130655/

[...]

Aborting with 80 hung threads.
Fetcher: done

real    22m5.420s
user    2m39.407s
sys     0m8.192s

 Fetcher2 Slow
 -

 Key: NUTCH-721
 URL: https://issues.apache.org/jira/browse/NUTCH-721
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
Reporter: Roger Dunk
 Attachments: crawl_generate.tar.gz, nutch-site.xml


 Fetcher2 fetches far more slowly than Fetcher1.
 Config options:
 fetcher.threads.fetch = 80
 fetcher.threads.per.host = 80
 fetcher.server.delay = 0
 generate.max.per.host = 1
 With a queue size of ~40,000, the result is:
 activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
 with maybe a download of 1 page per second.
 Running with -noParse makes little difference.
 CPU load average is around 0.2. With Fetcher1, CPU load is around 2.0 - 3.0.
 Hosts already cached by the local caching NS appear to download quickly upon a 
 re-fetch, so this is possibly an issue relating to NS lookups; however, all 
 things being equal, Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695233#action_12695233
 ] 

Hudson commented on NUTCH-721:
--

Integrated in Nutch-trunk #772 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/772/])
 - Commit old fetcher as OldFetcher for now so that we can test Fetcher2 
performance.


 Fetcher2 Slow
 -

 Key: NUTCH-721
 URL: https://issues.apache.org/jira/browse/NUTCH-721
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
Reporter: Roger Dunk
 Attachments: crawl_generate.tar.gz, nutch-site.xml


 Fetcher2 fetches far more slowly than Fetcher1.
 Config options:
 fetcher.threads.fetch = 80
 fetcher.threads.per.host = 80
 fetcher.server.delay = 0
 generate.max.per.host = 1
 With a queue size of ~40,000, the result is:
 activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
 with maybe a download of 1 page per second.
 Running with -noParse makes little difference.
 CPU load average is around 0.2. With Fetcher1, CPU load is around 2.0 - 3.0.
 Hosts already cached by the local caching NS appear to download quickly upon a 
 re-fetch, so this is possibly an issue relating to NS lookups; however, all 
 things being equal, Fetcher1 runs fast without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.