Re: How does generate work?
MilleBii wrote:
> Oops, continuing the previous mail. So I wonder if there would be a better
> 'generate' algorithm which would maintain a constant ratio of hosts per
> 100 URLs... Below a certain threshold it stops, or better, starts
> including URLs with lower scores.

That's exactly how the max.urls.per.host limit works. Using scores is
de-optimizing the fetching process...

> Having said that, I should first read the code and try to understand it.

That wouldn't hurt in any case ;) There is also a method in ScoringFilter-s
(e.g. the default scoring-opic) that determines the priority of a URL during
generation. See ScoringFilter.generatorSortValue(..); you can modify this
method in scoring-opic (or in your own scoring filter) to prioritize certain
URLs over others.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
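For reference, a minimal sketch of the kind of custom scoring filter Andrzej
describes, assuming the Nutch 1.0 scoring API; the class name, package, host
check, and boost factor are made-up illustrations, not anything shipped with
Nutch:

package org.example.scoring; // hypothetical package

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilterException;
import org.apache.nutch.scoring.opic.OPICScoringFilter;

public class PriorityScoringFilter extends OPICScoringFilter {

  // Higher sort values are generated (and therefore fetched) first.
  @Override
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    float sort = super.generatorSortValue(url, datum, initSort);
    if (url.toString().contains("example.com")) {
      sort *= 10.0f; // hypothetical boost for a preferred host
    }
    return sort;
  }
}

Such a filter would still have to be registered as a plugin in place of
scoring-opic for Nutch to pick it up.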
FATAL crawl.LinkDb - LinkDb: java.io.IOException: lock file crawl/linkdb/.locked already exists
hi,

I'm performing a recrawl using the recrawl.sh script, and I got this error
when inverting the links:

FATAL crawl.LinkDb - LinkDb: java.io.IOException: lock file crawl/linkdb/.locked already exists

echo "- Invert Links (Step 4 of $steps) -"
$NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*

I understand that the linkdb already exists (because of the last crawl). My
question is: should I delete or back up the old linkdb (at every recrawl)
before inverting the links?
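For what it's worth, the linkdb itself should not need deleting: invertlinks
merges newly inverted links into the existing linkdb, and the .locked file is
typically left behind by an earlier job that died. If you are sure no other
job is writing to the linkdb, removing the stale lock is enough. A minimal
sketch using the Hadoop FileSystem API (the lock path is taken from the error
above; verify nothing else is running first):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClearStaleLock {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path lock = new Path("crawl/linkdb/.locked");
    if (fs.exists(lock)) {
      fs.delete(lock, false); // delete only the lock file, not the linkdb
      System.out.println("Removed stale lock: " + lock);
    }
  }
}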
Re: How does generate work?
Hum... I use the max URLs setting and set it to 600, because in the worst
case you have 6 s (measured in the logs) between URLs of the same host:
6 x 600 = 3600 s = 1 hour. So in the worst case the long tail shouldn't last
longer than 1 hour... Unfortunately that is not what I see.

I also tried the by.ip option, because some blog sites allocate a different
domain name for each user... I saw no improvement.

I look at the time-limit feature as a workaround for this number-of-hosts
issue and was thinking that there could be a more structural way to solve it.

2009/12/3, Andrzej Bialecki a...@getopt.org:
> MilleBii wrote:
>> Oops, continuing the previous mail. So I wonder if there would be a
>> better 'generate' algorithm which would maintain a constant ratio of
>> hosts per 100 URLs... Below a certain threshold it stops, or better,
>> starts including URLs with lower scores.
>
> That's exactly how the max.urls.per.host limit works. Using scores is
> de-optimizing the fetching process...
>
> [...]

--
-MilleBii-
Re: How does generate work?
> Hum... I use the max URLs setting and set it to 600, because in the worst
> case you have 6 s (measured in the logs) between URLs of the same host:
> 6 x 600 = 3600 s = 1 hour. So in the worst case the long tail shouldn't
> last longer than 1 hour... Unfortunately that is not what I see.

That's assuming that all input URLs are read at once, put into their
corresponding queues, and ready to be fetched. In reality there is a cap on
the number of URLs stored in the queues (see fetchQueues.totalSize in the
logs), which is equal to 50 * the number of threads. The value of 50 is
fixed, but we could add a parameter to modify it. A workaround is simply to
use more threads, which increases the number of URLs stored in the queues.

If you look at the logs you'll see that there are often situations where
fetchQueues.totalSize is at the maximum allowed value but not all fetcher
threads are active, which means that one or more queues are preventing new
URLs from being queued, by being large and filling up fetchQueues.totalSize.

We can't read ahead the URL entries given to the mapper without having to
store them somewhere, so the easiest option is probably to allow a custom
multiplication factor for the fetchQueues.totalSize cap, so that it could be
more than 50. That would increase memory consumption a bit but definitely
make the fetching rate more constant. You can also simply use more threads,
but there would be a risk of timeouts if you specify too large a value.
Makes sense?

> I also tried the by.ip option, because some blog sites allocate a
> different domain name for each user... I saw no improvement.

IP resolution is quite slow because it is not multithreaded, so that would
not help anyway.

Julien

> I look at the time-limit feature as a workaround for this number-of-hosts
> issue and was thinking that there could be a more structural way to solve
> it.
>
> 2009/12/3, Andrzej Bialecki a...@getopt.org:
>> [...]

--
DigitalPebble Ltd
http://www.digitalpebble.com
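To put numbers on that, here is a toy model of the cap described above; only
the 50x multiplier comes from Julien's description, and the thread count and
queue sizes are invented for illustration:

public class QueueCapDemo {
  public static void main(String[] args) {
    int threads = 10;               // fetcher.threads.fetch (example value)
    int multiplier = 50;            // fixed cap multiplier per the thread above
    int cap = threads * multiplier; // max URLs held across all host queues: 500
    int slowHostQueue = 480;        // hypothetical oversized queue for one host
    int roomForOtherHosts = cap - slowHostQueue;
    System.out.println("queue cap = " + cap);
    System.out.println("room left for other hosts = " + roomForOtherHosts);
    // With only 20 slots left for every other host, most fetcher threads sit
    // idle while the big host is crawled politely: the long-tail effect.
  }
}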
db.fetch.interval.default
hi,

I'm crawling my intranet, and I have set db.fetch.interval.default to 5
hours, but it seems it doesn't work correctly:

<property>
  <name>db.fetch.interval.default</name>
  <value>18000</value>
  <description>The number of seconds between re-fetches of a page (5 hours).
  </description>
</property>

The first crawl, when the crawl directory $crawl doesn't exist yet, takes
just 2 hours (with depth=10). But when performing the recrawl with the
recrawl.sh script (with the crawldb full), it takes about 2 hours for each
depth! And when I checked the log file I found that a single URL is fetched
several times!

So does my 5-hour db.fetch.interval.default work correctly? Why is it
refetching the same URLs several times at each depth (depth=10)? I
understood that the timestamp of a page would not allow a refetch since the
5 hours have not elapsed yet.

Please can you explain how db.fetch.interval.default works? Should I use
only one depth since I have all the URLs in the crawldb?

I'm using this recrawl script:

depth=10

echo "- Inject (Step 1 of $steps) -"
$NUTCH_HOME/bin/nutch inject $crawl/crawldb urls

echo "- Generate, Fetch, Parse, Update (Step 2 of $steps) -"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d $crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  echo "- Updating Database (Step 2 of $steps) -"
  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
done

echo "- Merge Segments (Step 3 of $steps) -"
$NUTCH_HOME/bin/nutch mergesegs $crawl/MERGEDsegments $crawl/segments/*
rm $RMARGS $crawl/segments
mv $crawl/MERGEDsegments $crawl/segments

echo "- Invert Links (Step 4 of $steps) -"
$NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*
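Conceptually the interval is meant to work like this: updatedb stamps each
fetched page with a next fetch time of now plus db.fetch.interval.default,
and generate only selects entries whose fetch time has passed. A sketch of
that gate (intended behavior, not actual Nutch source):

import java.util.Date;

public class FetchGate {
  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long intervalSecs = 18000L;                      // db.fetch.interval.default
    long nextFetchTime = now + intervalSecs * 1000L; // set by updatedb
    boolean due = now >= nextFetchTime;              // generate's eligibility test
    System.out.println("next fetch: " + new Date(nextFetchTime)
        + ", due now: " + due);
    // A URL refetched before nextFetchTime points at something else being
    // wrong, e.g. a corrupted interval in the crawldb entry (see the
    // NUTCH-774 discussion in the next message).
  }
}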
RE: db.fetch.interval.default
hi,

I dumped the database, and this is what I found:

Status: 1 (db_unfetched)
Fetch time: Thu Dec 03 15:53:24 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 2.0549393
Signature: null
Metadata:

So if this URL is met several times within 2 hours, does the "(0 days)" mean
it is going to be fetched several times? Will it not look at the 18000
seconds?

thx

Date: Thu, 3 Dec 2009 22:39:29 +0100
From: reinhard.sch...@aon.at
To: nutch-user@lucene.apache.org
Subject: Re: db.fetch.interval.default

> hi,
>
> I have identified one source of such a problem and opened an issue in
> Jira. You can apply this patch and check whether it solves your problem:
>
> https://issues.apache.org/jira/browse/NUTCH-774
>
> By the way, you can also check your crawldb for such items - the retry
> interval is set to 0. Just dump the crawldb and search for it.
>
> regards,
> reinhard
>
> BELLINI ADAM schrieb:
>> hi, I'm crawling my intranet, and I have set db.fetch.interval.default
>> to 5 hours, but it seems it doesn't work correctly [...]
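A side note on reading that dump: the "(0 days)" is consistent with integer
division in the display rather than a zero interval; compare the 90-day and
81-day entries in the next message, which divide evenly by 86400. A quick
check (this is an inference from the output format, not quoted Nutch
source):

public class IntervalDisplay {
  public static void main(String[] args) {
    int[] intervals = {18000, 7776000, 6998400}; // values seen in this digest
    for (int secs : intervals) {
      int days = secs / (60 * 60 * 24); // integer division: 18000 -> 0 days
      System.out.println(secs + " seconds (" + days + " days)");
    }
  }
}

So "(0 days)" by itself does not prove the interval is zero; the NUTCH-774
case reinhard describes is one where the retry interval field really is 0.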
Why does a url with a fetch status of 'fetch_gone' show up as 'db_unfetched'?
Why does a URL with a fetch status of 'fetch_gone' show up as
'db_unfetched'? Shouldn't the crawldb entry have a status of 'db_gone'?
This is happening in nutch-1.0.

Here is one example of what I'm talking about:

=====
[jkon...@rampage search]$ ./bin/nutch readseg -get testParseSegment/20091202111849 "http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s"

Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Fri Nov 27 16:28:09 PST 2009
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 7776000 seconds (90 days)
Score: 7.535359E-10
Signature: null
Metadata: _ngt_: 1259781530311

Crawl Fetch::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Wed Dec 02 12:25:21 PST 2009
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 2.47059988E10
Signature: null
Metadata: _ngt_: 1259781530311 _pst_: notfound(14), lastModified=0: http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s

[jkon...@rampage search]$ ./bin/nutch readdb testParseSegment/c -url "http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s"

URL: http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Apr 03 01:25:21 PDT 2010
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 2.47059988E10
Signature: null
Metadata: _pst_: notfound(14), lastModified=0: http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s
=====

Thanks,
Jason
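For context: updatedb would normally be expected to turn fetch_gone into
db_gone, so a db_unfetched result suggests that either the segment never
went through updatedb or the status merge misbehaved. A paraphrased,
simplified sketch of the expected transition, from a reading of Nutch 1.0's
CrawlDbReducer (not verbatim source; it ignores retry counting and scoring):

import org.apache.nutch.crawl.CrawlDatum;

public class ExpectedTransition {
  // Simplified mapping from a segment's fetch status to the crawldb status.
  static byte expectedDbStatus(byte fetchStatus) {
    switch (fetchStatus) {
      case CrawlDatum.STATUS_FETCH_SUCCESS:
        return CrawlDatum.STATUS_DB_FETCHED;
      case CrawlDatum.STATUS_FETCH_GONE:
        return CrawlDatum.STATUS_DB_GONE;      // what the question expects
      case CrawlDatum.STATUS_FETCH_RETRY:
        return CrawlDatum.STATUS_DB_UNFETCHED; // retried on a later cycle
      default:
        return CrawlDatum.STATUS_DB_UNFETCHED;
    }
  }

  public static void main(String[] args) {
    byte status = expectedDbStatus(CrawlDatum.STATUS_FETCH_GONE);
    System.out.println("fetch_gone maps to status byte " + status); // db_gone
  }
}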