Re: How does generate work ?

2009-12-03 Thread Andrzej Bialecki
MilleBii wrote: Oops, continuing the previous mail. So I wonder if there would be a better 'generate' algorithm which would maintain a constant rate of hosts per 100 URLs ... Below a certain threshold it stops, or better, starts including URLs of lower scores. That's exactly how the max.urls.per.host
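The per-host cap being discussed here is controlled in Nutch 1.x by the generate.max.per.host property (with generate.max.per.host.by.ip switching the grouping to resolved IPs, which comes up later in this thread). A hedged sketch of the nutch-site.xml fragment; the value 600 matches the figure MilleBii mentions below, so verify the property names against conf/nutch-default.xml in your own release:

```xml
<!-- Sketch of a nutch-site.xml fragment (Nutch 1.x property names;
     check conf/nutch-default.xml in your release before relying on them). -->
<property>
  <name>generate.max.per.host</name>
  <value>600</value>
  <description>Maximum number of URLs per host in a single fetchlist;
  -1 means unlimited.</description>
</property>
<property>
  <name>generate.max.per.host.by.ip</name>
  <value>false</value>
  <description>If true, group URLs by resolved IP address instead of
  by hostname when applying the cap above.</description>
</property>
```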

FATAL crawl.LinkDb - LinkDb: java.io.IOException: lock file crawl/linkdb/.locked already exists

2009-12-03 Thread BELLINI ADAM
Hi, I'm performing a RECRAWL using the recrawl.sh script, and I got this error when inverting the links: FATAL crawl.LinkDb - LinkDb: java.io.IOException: lock file crawl/linkdb/.locked already exists echo - Invert Links (Step 4 of $steps) - $NUTCH_HOME/bin/nutch invertlinks
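A common cause of this error is a previous invertlinks job that was killed or crashed without releasing its lock file. A hedged sketch of the usual cleanup, using a throwaway directory (demo_crawl) to stand in for the real crawl directory from recrawl.sh; before removing a real lock, make sure no other Nutch job is still running against that linkdb:

```shell
# Simulate the stale LinkDb lock a killed job leaves behind, then clear it.
# demo_crawl is a stand-in path; substitute your actual crawl directory.
CRAWL_DIR=demo_crawl
mkdir -p "$CRAWL_DIR/linkdb"
touch "$CRAWL_DIR/linkdb/.locked"   # what the dead job left behind

# Only delete the lock once you are sure no live job holds it.
if [ -f "$CRAWL_DIR/linkdb/.locked" ]; then
  rm "$CRAWL_DIR/linkdb/.locked"
  echo "removed stale lock"
fi
```

After clearing the lock, re-running the invertlinks step of recrawl.sh should proceed normally.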

Re: How does generate work ?

2009-12-03 Thread MilleBii
Hum... I use the max URLs setting and set it to 600... Because in the worst case you have 6 s (measured on logs) between URLs of the same host: so 6 x 600 = 3600 s = 1 hour. In the worst case the long tail shouldn't last longer than 1 hour... Unfortunately that is not what I see. I also tried the by.ip
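The worst-case estimate in the message above is simple arithmetic and can be sanity-checked directly (6 s is the per-host delay observed in the logs; 600 is the per-host URL cap being assumed):

```shell
# Worst-case long-tail duration: one slow host fetched serially.
delay_s=6         # observed gap between fetches to the same host
max_per_host=600  # per-host URL cap used in generate
tail_s=$((delay_s * max_per_host))
echo "worst-case tail: $tail_s s = $((tail_s / 3600)) h"
```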

Re: How does generate work ?

2009-12-03 Thread Julien Nioche
Hum... I use the max URLs setting and set it to 600... Because in the worst case you have 6 s (measured on logs) between URLs of the same host: so 6 x 600 = 3600 s = 1 hour. In the worst case the long tail shouldn't last longer than 1 hour... Unfortunately that is not what I see. That's assuming that all

db.fetch.interval.default

2009-12-03 Thread BELLINI ADAM
Hi, I'm crawling my intranet, and I have set db.fetch.interval.default to 5 hours, but it seems it doesn't work correctly:
<property>
  <name>db.fetch.interval.default</name>
  <value>18000</value>
  <description>The number of seconds between re-fetches of a page (5 hours).</description>
</property>

RE: db.fetch.interval.default

2009-12-03 Thread BELLINI ADAM
Hi, I dumped the database, and this is what I found:
Status: 1 (db_unfetched)
Fetch time: Thu Dec 03 15:53:24 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 2.0549393
Signature: null
Metadata:
So if meeting this url
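One thing this dump does confirm: the "Retry interval: 18000 seconds" line shows that the db.fetch.interval.default value from the earlier message was picked up by the CrawlDb, since 18000 s is exactly the configured 5 hours; the open question is only the db_unfetched status. A quick check of the conversion:

```shell
# Confirm the retry interval in the dump matches the configured value.
interval_s=18000                        # value from the crawldb dump
echo "$((interval_s / 3600)) hours"     # configured db.fetch.interval.default
```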

Why does a url with a fetch status of 'fetch_gone' show up as 'db_unfetched'?

2009-12-03 Thread J.G.Konrad
Why does a URL with a fetch status of 'fetch_gone' show up as 'db_unfetched'? Shouldn't the crawldb entry have a status of 'db_gone'? This is happening in nutch-1.0. Here is one example of what I'm talking about = [jkon...@rampage search]$ ./bin/nutch