MilleBii wrote:
Oops, continuing the previous mail.
So I wonder if there would be a better 'generate' algorithm which
would maintain a constant rate of hosts per 100 URLs... Below a certain
threshold it would stop, or better, start including URLs with lower scores.
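A minimal sketch of that idea, under assumed semantics (all names here are illustrative, not Nutch API): greedily take the highest-scoring URLs, but skip any URL that would drop the distinct-host rate below the threshold, which in effect pulls in lower-scored URLs from other hosts.

```python
from urllib.parse import urlparse

def generate(urls_by_score, min_hosts_per_100=10, limit=1000):
    """urls_by_score: iterable of (score, url); selects up to `limit` URLs
    while keeping at least `min_hosts_per_100` distinct hosts per 100 URLs."""
    selected = []
    hosts = set()
    for score, url in sorted(urls_by_score, reverse=True):
        candidate_hosts = hosts | {urlparse(url).netloc}
        # Would adding this URL drop the host rate below the threshold?
        if len(candidate_hosts) * 100 < min_hosts_per_100 * (len(selected) + 1):
            continue  # over-concentrated on few hosts; prefer other hosts' URLs
        selected.append(url)
        hosts = candidate_hosts
        if len(selected) >= limit:
            break
    return selected
```

With a strict threshold, a host that dominates the top of the score list contributes only a few URLs before the selector starts drawing from other hosts.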
That's exactly how the max.urls.per.host
hi,
I'm performing a recrawl using the recrawl.sh script, and I got this error when
inverting the links:
FATAL crawl.LinkDb - LinkDb: java.io.IOException: lock file
crawl/linkdb/.locked already exists
echo - Invert Links (Step 4 of $steps) -
$NUTCH_HOME/bin/nutch invertlinks
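That lock file is typically left behind when a previous LinkDb job was killed or crashed. A common workaround (a sketch, assuming a local filesystem layout like the error message shows, and assuming no other Nutch job is still writing to the linkdb) is to clear the stale lock before re-running the step:

```shell
# Clear a stale LinkDb lock left by a crashed job.
# Only do this when you are sure no other Nutch job is running.
LINKDB=crawl/linkdb
if [ -f "$LINKDB/.locked" ]; then
  rm -f "$LINKDB/.locked"
fi
# ...then re-run the invert step, e.g.:
# $NUTCH_HOME/bin/nutch invertlinks "$LINKDB" crawl/segments/*
```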
Hum... I use the max urls setting and set it to 600, because in the worst
case there are 6 s (measured in the logs) between URLs of the same host:
6 x 600 = 3600 s = 1 hour. So in the worst case the long tail shouldn't
last longer than 1 hour... Unfortunately that is not what I see.
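The back-of-the-envelope bound above can be written out explicitly (names are illustrative, not Nutch API): one slow host holding the per-host maximum, fetched serially with a fixed politeness delay between requests.

```python
def worst_case_tail_seconds(delay_s, max_urls_per_host):
    """Worst-case duration of the long tail: all of one host's URLs
    fetched back-to-back with a fixed delay between them."""
    return delay_s * max_urls_per_host

print(worst_case_tail_seconds(6, 600))  # 3600 s = 1 hour
```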
I also tried the by.ip
that's assuming that all
hi,
I'm crawling my intranet, and I have set db.fetch.interval.default to
5 hours, but it seems it doesn't work correctly:
<property>
  <name>db.fetch.interval.default</name>
  <value>18000</value>
  <description>The number of seconds between re-fetches of a page (5 hours).</description>
</property>
Hi,
I dumped the database, and this is what I found:
Status: 1 (db_unfetched)
Fetch time: Thu Dec 03 15:53:24 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 2.0549393
Signature: null
Metadata:
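One thing worth noting in the dump above: the "(0 days)" next to the 18000-second retry interval is likely just integer truncation when converting seconds to days, so by itself it does not mean the setting was ignored; the interval really is stored as 18000 seconds. A quick check of the arithmetic:

```python
interval_seconds = 18000      # the configured 5-hour interval
seconds_per_day = 86400
# Integer division truncates anything under a full day to 0,
# which is where the "(0 days)" in the dump comes from.
print(interval_seconds // seconds_per_day)  # -> 0
```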
so if meeting this url
Why does a URL with a fetch status of 'fetch_gone' show up as
'db_unfetched'? Shouldn't the crawldb entry have a status of
'db_gone'? This is happening in nutch-1.0.
Here is one example of what I'm talking about:
[jkon...@rampage search]$ ./bin/nutch