Re: Infinite loop bug in Nutch 0.9

2009-04-02 Thread Julien Nioche

Try using Nutch-1.0 instead. I have tested your example with the SVN version
and it did not run into the problem you described.


2009/4/2 George Herlin 

> Indeed I have... that's how I found out.
> My test case: crawl
> with crawl-urlfilter and regex-urlfilter ending with
> #purdue
> +^
> +^
> # reject anything else
> -.
> The site is very small (which helped in diagnosis).
> Attached the beginning of a run log, just in case
> brgds
> George
> Resource not found:
> Resource not found: META-INF/services/org.apache.commons.logging.LogFactory
> Resource not found: log4j.xml
> Resource found:
> Resource found: hadoop-default.xml
> Resource found: hadoop-site.xml
> Resource found: nutch-default.xml
> Resource found: nutch-site.xml
> Resource not found: crawl-tool.xml
> Injector: starting
> Injector: crawlDb:
> Injector: urlDir: conf/purdueHttp
> Injector: Converting injected urls to crawl db entries.
> Resource not found: META-INF/services/javax.xml.transform.TransformerFactory
> Resource not found: META-INF/services/
> Resource not found: com/sun/org/apache/xml/internal/serializer/
> Resource not found: com/sun/org/apache/xml/internal/serializer/
> Resource found: regex-normalize.xml
> Resource found: regex-urlfilter.txt
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment:
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment:
> Fetcher: threads: 1
> Resource found: parse-plugins.xml
> fetching
> Resource found: mime-types.xml
> Resource not found: META-INF/services/org.apache.xerces.impl.Version
> Resource found:
> Resource found:
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db:
> CrawlDb update: segments: []
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment:
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment:
> Fetcher: threads: 1
> fetching
> fetching
> fetching
> fetching
> fetching
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db:
> CrawlDb update: segments: []
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment:
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment:
> Fetcher: threads: 1
> fetching
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db:
> CrawlDb update: segments: []
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging seg

Re: Infinite loop bug in Nutch 0.9

2009-04-02 Thread George Herlin
Indeed I have... that's how I found out.

My test case: crawl

with crawl-urlfilter and regex-urlfilter ending with

#purdue
+^
+^
# reject anything else
-.
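
For readers unfamiliar with this kind of rule file, here is a rough sketch of how such a list is usually evaluated: rules are tried in order, the first pattern found in the URL decides, and a final "-." rejects everything else. This is an illustration under assumptions, not Nutch's own filter code. The class name FirstMatchFilter and the example.org pattern are hypothetical, since the real patterns did not survive in the archive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a first-match URL filter; not the Nutch class.
public final class FirstMatchFilter {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;
        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    public FirstMatchFilter(List<String> lines) {
        for (String line : lines) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = t.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + t);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(t.substring(1))));
        }
    }

    // The first rule whose pattern is found anywhere in the URL decides.
    public boolean accepts(String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.accept;
            }
        }
        return false; // no rule matched: treated as rejected in this sketch
    }

    public static void main(String[] args) {
        FirstMatchFilter f = new FirstMatchFilter(Arrays.asList(
            "# hypothetical site",
            "+^http://example\\.org/",
            "# reject anything else",
            "-."));
        System.out.println(f.accepts("http://example.org/index.html")); // true
        System.out.println(f.accepts("http://other.example.com/"));     // false
    }
}
```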

The site is very small (which helped in diagnosis).

Attached the beginning of a run log, just in case



Resource not found:
Resource not found: META-INF/services/org.apache.commons.logging.LogFactory
Resource not found: log4j.xml
Resource found:
Resource found: hadoop-default.xml
Resource found: hadoop-site.xml
Resource found: nutch-default.xml
Resource found: nutch-site.xml
Resource not found: crawl-tool.xml
Injector: starting
Injector: crawlDb:
Injector: urlDir: conf/purdueHttp
Injector: Converting injected urls to crawl db entries.
Resource not found: META-INF/services/javax.xml.transform.TransformerFactory
Resource not found:
Resource not found:
Resource not found:
Resource found: regex-normalize.xml
Resource found: regex-urlfilter.txt
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment:
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment:
Fetcher: threads: 1
Resource found: parse-plugins.xml
Resource found: mime-types.xml
Resource not found: META-INF/services/org.apache.xerces.impl.Version
Resource found:
Resource found:
Fetcher: done
CrawlDb update: starting
CrawlDb update: db:
CrawlDb update: segments:
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment:
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment:
Fetcher: threads: 1
Fetcher: done
CrawlDb update: starting
CrawlDb update: db:
CrawlDb update: segments:
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment:
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment:
Fetcher: threads: 1
Fetcher: done
CrawlDb update: starting
CrawlDb update: db:
CrawlDb update: segments:
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment:
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting

Re: Infinite loop bug in Nutch 0.9

2009-04-01 Thread Doğacan Güney
On Wed, Apr 1, 2009 at 13:29, George Herlin wrote:

> Sorry, forgot to say, there is an added precondition to causing the bug:
> The redirection has to be fetched before the page it redirects to... if
> not, there will be a pre-existing crawl datum with a reasonable
> refetch-interval.

Maybe this is something fixed between 0.9 and 1.0, but I think
CrawlDbReducer fixes these datums, around line 147 (case
CrawlDatum.STATUS_LINKED). Have you actually got stuck in an infinite loop
because of it?

> 2009/4/1 George Herlin 
> Hello, there.
>> I believe I may have found an infinite loop in Nutch 0.9.
>> It happens when a site has a page that refers to itself through a
>> redirection.
>> The code in, around line 200 - sorry, my Fetcher has been a
>> little modified, line numbers may vary a little - says, for that case:
>> output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED);
>> What that does is insert an extra (empty) crawl datum for the new url,
>> with a re-fetch interval of 0.0.
>> However, (see, particularly lines 144-145), the
>> non-refetch condition used seems to be last-fetch+refetch-interval>now ...
>> which is always false if refetch-interval==0.0!
>> Now, if there is a new link to the new url in that page, that crawl datum
>> is re-used, and the whole thing loops indefinitely.
>> I've fixed that for myself by replacing the quoted line (twice) with:
>> output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, 30f), null, null,
>> CrawlDatum.STATUS_LINKED);
>> and that works (btw the 30f should really be the value of
>> "db.default.fetch.interval", but I haven't the time right now to work out
>> the issues). In reality, the default constructor and the appropriate
>> updater method should, if I am right in analysing the algorithm, always
>> enforce a positive refetch interval.
>> Of course, another method could be used to remove this self-reference, but
>> that could be complicated, as that may happen through a loop (2 or more
>> pages etc..., you know what I mean).
>> Has that been fixed already, and by what method?
>> Best regards
>> George Herlin

Doğacan Güney

Re: Infinite loop bug in Nutch 0.9

2009-04-01 Thread George Herlin
Sorry, forgot to say, there is an added precondition to causing the bug:

The redirection has to be fetched before the page it redirects to... if not,
there will be a pre-existing crawl datum with a reasonable
refetch-interval.

2009/4/1 George Herlin 

> Hello, there.
> I believe I may have found an infinite loop in Nutch 0.9.
> It happens when a site has a page that refers to itself through a
> redirection.
> The code in, around line 200 - sorry, my Fetcher has been a
> little modified, line numbers may vary a little - says, for that case:
> output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED);
> What that does is insert an extra (empty) crawl datum for the new url,
> with a re-fetch interval of 0.0.
> However, (see, particularly lines 144-145), the
> non-refetch condition used seems to be last-fetch+refetch-interval>now ...
> which is always false if refetch-interval==0.0!
> Now, if there is a new link to the new url in that page, that crawl datum
> is re-used, and the whole thing loops indefinitely.
> I've fixed that for myself by replacing the quoted line (twice) with:
> output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, 30f), null, null,
> CrawlDatum.STATUS_LINKED);
> and that works (btw the 30f should really be the value of
> "db.default.fetch.interval", but I haven't the time right now to work out
> the issues). In reality, the default constructor and the appropriate
> updater method should, if I am right in analysing the algorithm, always
> enforce a positive refetch interval.
> Of course, another method could be used to remove this self-reference, but
> that could be complicated, as that may happen through a loop (2 or more
> pages etc..., you know what I mean).
> Has that been fixed already, and by what method?
> Best regards
> George Herlin

Infinite loop bug in Nutch 0.9

2009-04-01 Thread George Herlin
Hello, there.

I believe I may have found an infinite loop in Nutch 0.9.

It happens when a site has a page that refers to itself through a
redirection.

The code in, around line 200 - sorry, my Fetcher has been a
little modified, line numbers may vary a little - says, for that case:

output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED);

What that does is insert an extra (empty) crawl datum for the new url,
with a re-fetch interval of 0.0.

However, (see, particularly lines 144-145), the
non-refetch condition used seems to be last-fetch+refetch-interval>now ...
which is always false if refetch-interval==0.0!

Now, if there is a new link to the new url in that page, that crawl datum is
re-used, and the whole thing loops indefinitely.
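
To make the condition above concrete, here is a minimal sketch of the skip test as I describe it. It is not the actual Nutch code: the class RefetchCheck, the method shouldSkip and the millisecond arithmetic are assumptions made for illustration only.

```java
// Hypothetical illustration of the "non-refetch" test described above.
public final class RefetchCheck {
    private static final double MS_PER_DAY = 24.0 * 60.0 * 60.0 * 1000.0;

    // A URL is skipped while lastFetch + interval is still in the future.
    public static boolean shouldSkip(long lastFetchMs, float intervalDays, long nowMs) {
        long intervalMs = (long) (intervalDays * MS_PER_DAY);
        return lastFetchMs + intervalMs > nowMs;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long lastFetch = now - 1000; // fetched one second ago

        // With a zero interval the URL is never skipped, so it is fetched again.
        System.out.println(shouldSkip(lastFetch, 0.0f, now));  // false
        // With a 30-day interval the same URL is skipped.
        System.out.println(shouldSkip(lastFetch, 30.0f, now)); // true
    }
}
```

Since a last-fetch time can never lie in the future, a zero interval makes the test false on every round, which is the loop I am seeing.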

I've fixed that for myself by replacing the quoted line (twice) with:

output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, 30f), null, null,
CrawlDatum.STATUS_LINKED);

and that works (btw the 30f should really be the value of
"db.default.fetch.interval", but I haven't the time right now to work out
the issues). In reality, the default constructor and the appropriate
updater method should, if I am right in analysing the algorithm, always
enforce a positive refetch interval.
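
A minimal sketch of what "always enforce a positive refetch interval" could look like. LinkedDatum is a hypothetical stand-in and not Nutch's CrawlDatum, and the 30-day fallback is hard-coded here only for illustration; a real fix would presumably read it from "db.default.fetch.interval".

```java
// Hypothetical datum that refuses a zero or negative refetch interval.
public final class LinkedDatum {
    // Assumed fallback, in days; stands in for the configured default.
    public static final float DEFAULT_INTERVAL_DAYS = 30f;

    private float fetchIntervalDays;

    // The default constructor starts from the fallback instead of 0.0.
    public LinkedDatum() {
        this(DEFAULT_INTERVAL_DAYS);
    }

    public LinkedDatum(float fetchIntervalDays) {
        setFetchInterval(fetchIntervalDays);
    }

    // The updater clamps anything that is not strictly positive
    // (the comparison is also false for NaN, so NaN falls back too).
    public void setFetchInterval(float days) {
        this.fetchIntervalDays = (days > 0f) ? days : DEFAULT_INTERVAL_DAYS;
    }

    public float getFetchInterval() {
        return fetchIntervalDays;
    }

    public static void main(String[] args) {
        System.out.println(new LinkedDatum().getFetchInterval());   // 30.0
        System.out.println(new LinkedDatum(0f).getFetchInterval()); // 30.0
        System.out.println(new LinkedDatum(7f).getFetchInterval()); // 7.0
    }
}
```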

Of course, another method could be used to remove this self-reference, but
that could be complicated, as that may happen through a loop (2 or more
pages etc..., you know what I mean).
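
For what it is worth, detecting such a loop over a known set of hops is simple in isolation; the hard part is doing it inside a crawl spread over several rounds. A rough sketch, assuming a hypothetical map from each URL to the single URL it redirects (or links back) to; nothing like this exists in my Fetcher:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: follows a chain of hops and reports whether it loops.
public final class RedirectCycle {
    // Returns true if following nextHop from start revisits any URL.
    public static boolean hasCycle(String start, Map<String, String> nextHop) {
        Set<String> seen = new HashSet<>();
        String current = start;
        // Set.add returns false once a URL has been seen before.
        while (current != null && seen.add(current)) {
            current = nextHop.get(current);
        }
        return current != null; // non-null means the chain came back on itself
    }

    public static void main(String[] args) {
        Map<String, String> hops = new HashMap<>();
        hops.put("a", "b");
        hops.put("b", "a"); // two-page loop
        hops.put("c", "d"); // plain chain, d goes nowhere

        System.out.println(hasCycle("a", hops)); // true
        System.out.println(hasCycle("c", hops)); // false
    }
}
```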

Has that been fixed already, and by what method?

Best regards

George Herlin