[Nutch Wiki] Update of NewScoringIndexingExample by DennisKubes
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NewScoringIndexingExample

The comment on the change is:
comment pointing out multiple segment flags

--
  = Example Running new Scoring and Indexing Systems =
  Below is an example of running the new scoring and indexing systems from start to finish. This was done with a sample of 1000 URLs over two fetch cycles: the first fetched the 1000 URLs and the second the top 2000 URLs. The Loops job is optional but included for completeness. In production we have actually removed that job. This was done with a clean pull from Nutch trunk as of 2009-03-06 (right before 1.0 is set to be released). If anybody has any problems running these commands or has questions, send me an email or send one to the nutch users or dev list and I will reply. Please send it to kubes at the apache address dot org.

+ {{{
  bin/nutch inject crawl/crawldb crawl/urls/
@@ -10, +11 @@
  bin/nutch fetch crawl/segments/20090306093949/
  bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306093949/
  bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb
+
  bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb/
@@ -55, +57 @@
  bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306100055/
  rm -fr crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090306093949/ -segment crawl/segments/20090306100055/ -webgraphdb crawl/webgraphdb
+ }}}
+
+ One thing that has been brought up is the -segment flag on WebGraph. If you have more than one segment, pass one -segment flag per segment, as shown above.
+
+ {{{
  bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb/
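
The repeated -segment flag noted in this change can get tedious to type once there are many segments. Below is a minimal sketch, not part of Nutch, of a shell helper that prints one -segment flag per segment directory. The function name segment_args is hypothetical, and the sketch assumes segment paths contain no whitespace.

```shell
#!/usr/bin/env bash
# segment_args: hypothetical helper (not part of Nutch).
# Prints one "-segment <dir>" line per subdirectory of the given segments directory.
segment_args() {
  local segments_dir="$1" seg
  for seg in "$segments_dir"/*/; do
    # When nothing matches, the glob stays literal; skip it.
    [ -d "$seg" ] || continue
    printf -- '-segment %s\n' "${seg%/}"
  done
}

# Example usage (left unquoted on purpose so the flags split into separate words):
#   bin/nutch org.apache.nutch.scoring.webgraph.WebGraph \
#     $(segment_args crawl/segments) -webgraphdb crawl/webgraphdb
```

The helper only prints flags, so the output can be inspected before it is passed to the WebGraph command.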
[Nutch Wiki] Update of NewScoringIndexingExample by DennisKubes
The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NewScoringIndexingExample

--
  bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306093949/
  bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb
+ }}}
+
+ One thing to point out here is that WebGraph is meant to be used on larger web crawls to create web graphs. By default it ignores outlinks to pages in the same domain, including subdomains, and to pages with the same hostname. It also keeps only a single outlink from each page to the same target page or the same target domain. All of these behaviors can be changed through the following configuration options:
+
+ {{{
+
+ <!-- linkrank scoring properties -->
+ <property>
+   <name>link.ignore.internal.host</name>
+   <value>true</value>
+   <description>Ignore outlinks to the same hostname.</description>
+ </property>
+
+ <property>
+   <name>link.ignore.internal.domain</name>
+   <value>true</value>
+   <description>Ignore outlinks to the same domain.</description>
+ </property>
+
+ <property>
+   <name>link.ignore.limit.page</name>
+   <value>true</value>
+   <description>Limit to only a single outlink to the same page.</description>
+ </property>
+
+ <property>
+   <name>link.ignore.limit.domain</name>
+   <value>true</value>
+   <description>Limit to only a single outlink to the same domain.</description>
+ </property>
+
+ }}}
+
+ With these defaults, if you are only crawling pages within a single domain or within a set of subdomains, all outlinks will be ignored and you will end up with an empty webgraph, which in turn causes the LinkRank job to throw an error. The flip side is that by NOT ignoring links to the same domain/host and by not limiting those links, the webgraph becomes much denser, so there are many more links to process, and those extra links probably won't affect relevancy as much.
+
+ {{{
  bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb/
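
Because a crawl confined to one host or domain yields an empty webgraph under the default settings, it can help to check the seed list before running WebGraph. The following is a rough sketch, not part of Nutch: distinct_hosts is a hypothetical helper and the seed file name crawl/urls/seed.txt is an assumption. It compares hostnames only, so it is an approximation: subdomains of one domain count as different hosts here, while link.ignore.internal.domain would still treat them as internal.

```shell
#!/usr/bin/env bash
# distinct_hosts: hypothetical pre-flight check (not part of Nutch).
# Prints the number of distinct hostnames in a seed file with one URL per line.
distinct_hosts() {
  # With "/" as the field separator, "http://host/path" puts the host in field 3.
  awk -F/ '/^https?:\/\// { print tolower($3) }' "$1" | sort -u | wc -l | tr -d ' '
}

# Example usage (crawl/urls/seed.txt is an assumed file name):
#   if [ "$(distinct_hosts crawl/urls/seed.txt)" -le 1 ]; then
#     echo "single-host crawl: consider setting link.ignore.internal.host to false" >&2
#   fi
```

A count of 1 suggests the link.ignore.* properties above need to be overridden before LinkRank will have any links to work with.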
[Nutch Wiki] Update of NewScoringIndexingExample by DennisKubes
The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NewScoringIndexingExample

New page:
= Example Running new Scoring and Indexing Systems =
Below is an example of running the new scoring and indexing systems from start to finish. This was done with a sample of 1000 URLs over two fetch cycles: the first fetched the 1000 URLs and the second the top 2000 URLs. The Loops job is optional but included for completeness. In production we have actually removed that job. This was done with a clean pull from Nutch trunk as of 2009-03-06 (right before 1.0 is set to be released). If anybody has any problems running these commands or has questions, send me an email or send one to the nutch users or dev list and I will reply. Please send it to kubes at the apache address dot org.

{{{
bin/nutch inject crawl/crawldb crawl/urls/
bin/nutch generate crawl/crawldb/ crawl/segments
bin/nutch fetch crawl/segments/20090306093949/
bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306093949/
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb
bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper -scores -topn 1000 -webgraphdb crawl/webgraphdb/ -output crawl/webgraphdb/dump/scores

more crawl/webgraphdb/dump/scores/part-0

http://validator.w3.org/check?uri=referer 0.4955311
http://www.adobe.com/go/getflashplayer 0.4060498
http://www.statcounter.com/ 0.4060498
http://www.liveinternet.ru/click 0.33680826
http://www.adobe.com/products/acrobat/readstep2.html 0.31656843
http://www.adobe.com/go/getflashplayer/ 0.30378538
http://www.bloomingbows.com/2003/scripts/sitemap.asp 0.27821928
http://www.misterping.com/ 0.27821928
...

bin/nutch readdb crawl/crawldb/ -stats

CrawlDb statistics start: crawl/crawldb/
Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls: 16711
retry 0: 16686
retry 1: 25
min score: 0.0
avg score: 0.022716654
max score: 0.495
status 1 (db_unfetched): 15739
status 2 (db_fetched): 677
status 3 (db_gone): 75
status 4 (db_redir_temp): 143
status 5 (db_redir_perm): 77
CrawlDb statistics: done

bin/nutch generate crawl/crawldb/ crawl/segments/ -topN 2000
bin/nutch fetch crawl/segments/20090306100055/
bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306100055/
rm -fr crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090306093949/ -segment crawl/segments/20090306100055/ -webgraphdb crawl/webgraphdb
bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb/

more crawl/webgraphdb/dump/scores/part-0

http://www.statcounter.com/ 1.7133079
http://www.morristownwebdesign.com/ 1.0093393
http://www.jdoqocy.com/click-3331968-10419685 0.87828785
http://www.anrdoezrs.net/click-3331968-10384568 0.87828785
http://www.sedo.com/main.php3?language=e 0.6565905
http://wetter.spiegel.de/spiegel/html/frankreich0.html 0.641775
http://www.kenwood.com/ 0.6084726
http://validator.w3.org/check?uri=referer 0.5605916
http://wetter.spiegel.de/spiegel/html/Italien0.html 0.5164927
http://www.youtube.com/?hl=entab=w1 0.50952965
http://www.addthis.com/bookmark.php 0.5013165
http://www.ptguide.com/ 0.49564213
http://www.adobe.com/go/getflashplayer 0.47368217
http://de.weather.yahoo.com/ITXX/ITXX0073/index_c.html 0.4657473
http://www.adobe.com/shockwave/download/download.cgi?P1_Prod_Version=ShockwaveFlashpromoid=BIOW 0.44376293
http://www.google.com/ 0.42282072
http://www.zajezdy.cz/ 0.41620353
http://www.intermarche.com/ 0.41489196
http://www.shipskill.com/7/ 0.4147887
http://www.statcounter.com/free_hit_counter.html 0.40928197
}}}
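
The commands above hard-code the timestamped segment names that generate happened to create on that run. A minimal sketch of how a script might pick up the newest segment instead is given here. It is not part of Nutch: latest_segment is a hypothetical helper, and it assumes segment directories are named by timestamp so that the last name in sorted order is the newest.

```shell
#!/usr/bin/env bash
# latest_segment: hypothetical helper (not part of Nutch).
# Prints the last subdirectory of the segments directory in sorted order,
# which is the newest one when names are timestamps such as 20090306100055.
latest_segment() {
  local segments_dir="$1" seg latest=""
  for seg in "$segments_dir"/*/; do
    [ -d "$seg" ] || continue
    latest="${seg%/}"   # glob results are sorted, so the final match wins
  done
  # Returns non-zero and prints nothing when there are no segments.
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Example usage with the commands from this page:
#   bin/nutch generate crawl/crawldb/ crawl/segments
#   seg=$(latest_segment crawl/segments)
#   bin/nutch fetch "$seg"
#   bin/nutch updatedb crawl/crawldb/ "$seg"
```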