[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FrankMcCown: http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0 The comment on the change is: UAC issue on Vista -- == Tested with == * Nutch release 1.0 - * Eclipse 3.3 - aka Europa, ganymede + * Eclipse 3.3 (Europa) and 3.4 (Ganymede) * Java 1.6 * Ubuntu (should work on most platforms though) - * Windows XP + * Windows XP and Vista == Before you start == @@ -21, +21 @@ === For Windows Users === - If you are running Windows (tested on Windows XP) you must first install cygwin + If you are running Windows (tested on Windows XP) you must first install cygwin. Download it from http://www.cygwin.com/setup.exe + Install cygwin and set the PATH environment variable for it. You can set it from the Control Panel, System, Advanced Tab, Environment Variables and edit/add PATH. - Download cygwin from http://www.cygwin.com/setup.exe - - Install cygwin and set PATH variable for it. - - It's in control panel, system, advanced tab, environment variables and edit/add PATH I have in PATH like: - + {{{ C:\Sun\SDK\bin;C:\cygwin\bin + }}} + If you run bash from the Windows command line (Start Run... cmd.exe) it should successfully run cygwin. - If you run bash in Start Run... cmd.exe it should work. + If you are running Eclipse on Vista, you will likely need to [http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/ turn off Vista's User Account Control (UAC)]. Otherwise Hadoop will likely complain that it cannot change a directory permission when you later run the crawler: + {{{ + org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions of ... 
Permission denied + }}} + See [http://markmail.org/message/ymgygimtvuksn2ic#query:Exception%20in%20thread%20main%20org.apache.hadoop.util.Shell%24ExitCodeException%3A%20chmod%3A%20changing%20permissions+page:1+mid:pj3spjhvdtjx736q+state:results this] for more information about the UAC issue. === Install Nutch === * Grab a [http://lucene.apache.org/nutch/version_control.html fresh release] of Nutch 1.0 or download and untar the [http://lucene.apache.org/nutch/release/ official 1.0 release].
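The Windows prerequisites above can be sanity-checked from a cygwin shell before running any crawl. A minimal sketch (not part of the wiki page's own instructions; the two Windows paths are the PATH examples given above, adjust them to your install):

```shell
# Sketch: verify that the tools Hadoop's shell helpers call (bash, chmod)
# are reachable on PATH. The cygwin paths below are the page's examples.
PATH="/cygdrive/c/Sun/SDK/bin:/cygdrive/c/cygwin/bin:$PATH"

missing=0
for tool in bash chmod; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING - check your PATH" >&2
    missing=1
  fi
done
```

If chmod resolves but the ExitCodeException above still appears on Vista, the UAC setting is the more likely culprit.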
[Nutch Wiki] Update of RunNutchInEclipse0.9 by BartoszGadzimski
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by BartoszGadzimski: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 The comment on the change is: Added java heap size solution -- - = RunNutchInEclipse = + = Run Nutch In Eclipse on Linux and Windows nutch version 0.9= This is a work in progress. If you find errors or would like to improve this page, just create an account [UserPreferences] and start editing this page :-) @@ -104, +104 @@ * click on Run * if all works, you should see Nutch getting busy at crawling :-) - == Debug Nutch in Eclipse (not yet tested for 0.9) == + == Java Heap Size problem == + + If you find in hadoop.log a line similar to this: + + {{{ + 2009-04-13 13:41:06,105 WARN mapred.LocalJobRunner - job_local_0001 + java.lang.OutOfMemoryError: Java heap space + }}} + + You should increase the amount of RAM available to applications run from Eclipse. + + Just set it in: + + Eclipse - Window - Preferences - Java - Installed JREs - edit - Default VM arguments + + I've set mine to + {{{ + -Xms5m -Xmx150m + }}} + because I have only about 200MB of RAM left after running all my apps. + + -Xms (minimum amount of memory for the application) + -Xmx (maximum amount) + + + == Debug Nutch in Eclipse == * Set breakpoints and debug a crawl * It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs. Here are a few good places to set breakpoints: {{{
[Nutch Wiki] Update of RunNutchInEclipse1.0 by BartoszGadzimski
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by BartoszGadzimski: http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0 The comment on the change is: Copied page for 1.0 release New page: = Run Nutch In Eclipse on Linux and Windows nutch version 1.0= This is a work in progress. If you find errors or would like to improve this page, just create an account [UserPreferences] and start editing this page :-) == Tested with == * Nutch release 1.0 * Eclipse 3.3 - aka Europa, ganymede * Java 1.6 * Ubuntu (should work on most platforms though) * Windows XP == Before you start == Setting up Nutch to run into Eclipse can be tricky, and most of the time you are much faster if you edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). However, it's very useful to be able to debug Nutch in Eclipse. But again you might be quicker by looking at the logs (logs/hadoop.log)... == Steps == === For Windows Users === If you are running Windows (tested on Windows XP) you must first install cygwin Download cygwin from http://www.cygwin.com/setup.exe Install cygwin and set PATH variable for it. It's in control panel, system, advanced tab, environment variables and edit/add PATH I have in PATH like: C:\Sun\SDK\bin;C:\cygwin\bin If you run bash in Start-RUN-cmd.exe it should work. Then you should install tools from Microsoft website (adding 'whoami' command). Example for Windows XP and sp2 http://www.microsoft.com/downloads/details.aspx?FamilyId=49AE8576-9BB9-4126-9761-BA8011FABF38displaylang=en Then you can follow rest of these steps === Install Nutch === * Grab a fresh release of Nutch 0.9 - http://lucene.apache.org/nutch/version_control.html * Do not build Nutch now. 
Make sure you have no .project and .classpath files in the Nutch directory === Create a new java project in Eclipse === * File New Project Java project click Next * Name the project (Nutch_Trunk for instance) * Select Create project from existing source and use the location where you downloaded Nutch * Click on Next, and wait while Eclipse is scanning the folders * Add the folder conf to the classpath (third tab and then add class folder) * Go to Order and Export tab, find the entry for added conf folder and move it to the top. It's required to make Eclipse take the config (nutch-default.xml, nutch-site.xml, etc.) resources from our conf folder and not from anywhere else. * Eclipse should have guessed all the java files that must be added on your classpath. If it's not the case, add src/java, src/test and all plugin src/java and src/test folders to your source folders. Also add all jars in lib and in the plugin lib folders to your libraries * Set output dir to tmp_build, create it if necessary * DO NOT add build to classpath === Configure Nutch === * See the [http://wiki.apache.org/nutch/NutchTutorial Tutorial] * Change the property plugin.folders to ./src/plugin in $NUTCH_HOME/conf/nutch-default.xml * Make sure Nutch is configured correctly before testing it in Eclipse ;-) === Missing org.farng and com.etranslate === Eclipse will complain about some import statements in parse-mp3 and parse-rtf plugins (30 errors in my case). Because of incompatibility with the Apache license, the .jar files that define the necessary classes were not included with the source code. Download them here: http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. Then add the jar files to the build path (First refresh the workspace by pressing F5. Then right-click the project folder Build Path Configure Build Path... 
Then select the Libraries tab, click Add Jars... and then add each .jar file individually). === Build Nutch === If you set up the project correctly, Eclipse will build Nutch for you into tmp_build. See below for problems you could run into. === Create Eclipse launcher === * Menu Run Run... * create New for Java Application * set in Main class {{{ org.apache.nutch.crawl.Crawl }}} * on tab Arguments, Program Arguments {{{ urls -dir crawl -depth 3 -topN 50 }}} * in VM arguments {{{ -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log }}} * click on Run * if all works, you should see Nutch getting busy at crawling :-) == Java Heap Size problem == If you find in hadoop.log a line similar to this: {{{ 2009-04-13 13:41:06,105 WARN mapred.LocalJobRunner - job_local_0001 java.lang.OutOfMemoryError: Java heap space }}} You should increase the amount of RAM available to applications run from Eclipse. Just set it in: Eclipse - Window - Preferences - Java - Installed JREs - edit - Default VM arguments I've set mine to {{{ -Xms5m -Xmx150m }}} because I have only about 200MB of RAM left after running all my apps -Xms
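The heap-size symptom described above can be confirmed with a quick grep of logs/hadoop.log before touching any Eclipse settings. A small sketch, using a sample file in place of a real crawl log (the log line is the one quoted on the page):

```shell
# Sketch: detect the OutOfMemoryError symptom in a Hadoop log.
# A sample file stands in for logs/hadoop.log here.
log=hadoop.log.sample
cat > "$log" <<'EOF'
2009-04-13 13:41:06,105 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space
EOF

hits=$(grep -c "OutOfMemoryError" "$log")
echo "OutOfMemoryError lines: $hits"
rm -f "$log"
```

A non-zero count is the cue to raise -Xmx in the launcher's VM arguments.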
[Nutch Wiki] Trivial Update of RunNutchInEclipse0.9 by BartoszGadzimski
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by BartoszGadzimski: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 -- - = Run Nutch In Eclipse on Linux and Windows nutch version 0.9= + = Run Nutch In Eclipse on Linux and Windows nutch version 0.9 = This is a work in progress. If you find errors or would like to improve this page, just create an account [UserPreferences] and start editing this page :-)
[Nutch Wiki] Trivial Update of RunNutchInEclipse1.0 by BartoszGadzimski
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by BartoszGadzimski: http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0 -- - = Run Nutch In Eclipse on Linux and Windows nutch version 1.0= + = Run Nutch In Eclipse on Linux and Windows nutch version 1.0 = This is a work in progress. If you find errors or would like to improve this page, just create an account [UserPreferences] and start editing this page :-)
[Nutch Wiki] Update of FrontPage by BartoszGadzimski
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by BartoszGadzimski: http://wiki.apache.org/nutch/FrontPage -- * UpgradeFrom07To08 * [Upgrading_from_0.8.x_to_0.9] * RunNutchInEclipse for v0.8 - * [RunNutchInEclipse0.9] for v0.9 + * [RunNutchInEclipse0.9] for v0.9 (Linux and Windows) + * [RunNutchInEclipse1.0] for v1.0 (Linux and Windows) * [Crawl] - script to crawl (and possible recrawl too) * IntranetRecrawl - script to recrawl a crawl * MergeCrawl - script to merge 2 (or more) crawls
[Nutch Wiki] Update of RunNutchInEclipse0.9 by BartoszGadzimski
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by BartoszGadzimski: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 -- This is a work in progress. If you find errors or would like to improve this page, just create an account [UserPreferences] and start editing this page :-) == Tested with == - * Nutch release 0.9 + * Nutch release 0.9 and 1.0 * Eclipse 3.3 - aka Europa * Java 1.6 * Ubuntu (should work on most platforms though) @@ -35, +35 @@ C:\Sun\SDK\bin;C:\cygwin\bin If you run bash in Start-RUN-cmd.exe it should work. + + + Then you should install tools from the Microsoft website (adding the 'whoami' command). + + Example for Windows XP SP2 + + http://www.microsoft.com/downloads/details.aspx?FamilyId=49AE8576-9BB9-4126-9761-BA8011FABF38&displaylang=en Then you can follow the rest of these steps
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes -- === Important Points === 1. For authscope tag, 'host' and 'port' attribute should always be specified. 'realm' and 'scheme' attributes may or may not be specified depending on your needs. If you are tempted to omit the 'host' and 'port' attribute, because you want the credentials to be used for any host and any port for that realm/scheme, please use the 'default' tag instead. That's what 'default' tag is meant for. 1. One authentication scope should not be defined twice as different authscope tags for different credentials tag. However, if this is done by mistake, the credentials for the last defined authscope tag would be used. This is because, the XML parsing code, reads the file from top to bottom and sets the credentials for authentication-scopes. If the same authentication scope is encountered once again, it will be overwritten with the new credentials. However, one should not rely on this behavior as this might change with further developments. - 1. Do not define multiple authscope tags with the same host, port but different realms if the server requires NTLM authentication. This can means there should not be multiple tags with same host, port, scheme=NTLM but different realms. If you are omitting the scheme attribute and the server requires NTLM authentication, then there should not be multiple tags with same host, port but different realms. This is discussed more in the next section. + 1. Do not define multiple authscope tags with the same host, port but different realms if the server requires NTLM authentication. This means there should not be multiple tags with same host, port, scheme=NTLM but different realms. 
If you are omitting the scheme attribute and the server requires NTLM authentication, then there should not be multiple tags with same host, port but different realms. This is discussed more in the next section. 1. If you are using NTLM scheme, you should also set the 'http.agent.host' property in conf/nutch-site.xml === A note on NTLM domains === @@ -104, +104 @@ 1. Do you see Nutch trying to fetch the pages you were expecting in 'logs/hadoop.log'. You should see some logs like fetching http://www.example.com/expectedpage.html; where the URL is the page you were expecting to be fetched. If you don't see such lines for the pages you were expecting, the error is outside the scope of this feature. This feature comes into action only when the crawler is fetching a page but the page requires authentication. 1. With debug logs enabled, check whether there are logs beginning with Credentials in 'logs/hadoop.log'. The lines would look like Credentials - username someuser; set For every entry in 'conf/httpclient-auth.xml' you should find a corresponding log. If they are absent, probably you haven't included 'plugin.includes'. In case you have manually patched Nutch 0.9 source code with the patch, this issue may be caused if you have not built the project. 1. Do you see logs like this: auth.!AuthChallengeProcessor - basic authentication scheme selected? Instead of the word 'basic', you might see 'digest' or 'NTLM' depending on the scheme supported by the page being fetched? If you do not see it at all, probably the web server or the page being fetched does not require authentication. In that case, the crawler would not try to authenticate. If you were expecting an authentication for the page, probably something needs to be fixed at the server side. - 1. You should also see some logs that begin with: Pre-configured credentials with scope. It is very unlikely that this should happen after you have ensured all the above points. 
If it happens, please let us know on the mailing list. Once you have checked the items listed above and are still unable to fix the problem, or are confused about any point listed above, please mail the issue with the following information: 1. Version of Nutch you are running. - 1. Did you get this feature directly from subversion or did you download the patch separately and apply it? + 1. Complete contents of the 'conf/httpclient-auth.xml' file. 1. Relevant portion of the 'logs/hadoop.log' file. If you are clueless, send the complete file.
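The authscope rules above ('host' and 'port' always set, 'realm' and 'scheme' optional, one scope per authscope tag) can be illustrated with a small sample file. The element nesting below is only an illustration built from the attributes this page names; the conf/httpclient-auth.xml shipped with your Nutch release is the authoritative reference for the exact schema:

```shell
# Illustrative credentials file (element layout is an assumption; the
# attribute names are the ones discussed on this page).
cat > httpclient-auth.xml.sample <<'EOF'
<auth-configuration>
  <credentials username="someuser" password="secret">
    <authscope host="intranet.example.com" port="80" realm="staff"/>
  </credentials>
</auth-configuration>
EOF

scopes=$(grep -c "<authscope" httpclient-auth.xml.sample)
echo "authscope entries: $scopes"
rm -f httpclient-auth.xml.sample
```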
[Nutch Wiki] Update of PublicServers by KevinReader
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KevinReader: http://wiki.apache.org/nutch/PublicServers -- * [http://campusgw.library.cornell.edu/ Cornell University Library] is collaborating with the research group of Thorsten Joachims to develop a learning search engine for library web pages based on Nutch. The nutch-based search engine is near the bottom of the page. * [http://search.creativecommons.org/ Creative Commons] is a search engine for creative commons licensed material. + + * [http://www.dadi360.com/ Dadi360] uses the Nutch search engine to provide search of Chinese-language websites in North America. * [http://www.ecolicommunity.org/Websearch Ecolhub Web Search] an E. coli specific search engine based on Nutch. EcoliHub WebSearch includes only those sites relevant to E. coli, thereby reducing the number of spurious hits. Searches can be optionally limited to your choice of resources. More than 110,000 pages to search. More resources are being added.
[Nutch Wiki] Update of NewScoringIndexingExample by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/NewScoringIndexingExample The comment on the change is: comment pointing out multiple segment flags -- = Example Running new Scoring and Indexing Systems = Below is an example of running the new scoring and indexing systems from start to finish. This was done with a sample of 1000 urls and I ran two different fetch cycles. The first being 1000 urls and the second being the top 2000 urls. The loops job is optional but included for completeness. In production we have actually removed that job. This was done with a clean pull from Nutch trunk as of 2009-03-06 (right before 1.0 is set to be released). If anybody has any problems running these commands or has questions send me an email or send one to the nutch users or dev list and I will reply. Please send it to kubes at the apache address dot org. + {{{ bin/nutch inject crawl/crawldb crawl/urls/ @@ -10, +11 @@ bin/nutch fetch crawl/segments/20090306093949/ bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306093949/ bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb + bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/ bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb/ bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb/ @@ -55, +57 @@ bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306100055/ rm -fr crawl/webgraphdb/ bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090306093949/ -segment crawl/segments/20090306100055/ -webgraphdb crawl/webgraphdb + }}} + + One thing that has been brought up is the -segment flag on webgraph. 
If you have more than one segment then you would have more than one segment flag as shown above. + + {{{ bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/ bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb/ bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb/
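The point above about repeating the -segment flag can be scripted rather than typed: build one flag per segment directory. A sketch (the timestamped directories are stand-ins created in a scratch directory, not a real crawl):

```shell
# Sketch: generate one -segment flag per segment directory, then print the
# WebGraph command that would be run. Timestamps are stand-ins.
tmp=$(mktemp -d)
mkdir -p "$tmp/crawl/segments/20090306093949" "$tmp/crawl/segments/20090306100055"

flags=""
for seg in "$tmp"/crawl/segments/*/; do
  flags="$flags -segment $seg"
done
echo "bin/nutch org.apache.nutch.scoring.webgraph.WebGraph$flags -webgraphdb crawl/webgraphdb"
rm -rf "$tmp"
```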
[Nutch Wiki] Update of NewScoringIndexingExample by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/NewScoringIndexingExample -- bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306093949/ bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb + }}} + + One thing to point out here is that WebGraph is meant to be used on larger web crawls to create web graphs. By default it ignores outlinks to pages in the same domain, including subdomains, and pages with the same hostname. It also limits a page to a single outlink to the same page or to the same domain. All of these options can be changed through the following configuration options: + + {{{ + + <!-- linkrank scoring properties --> + <property> + <name>link.ignore.internal.host</name> + <value>true</value> + <description>Ignore outlinks to the same hostname.</description> + </property> + + <property> + <name>link.ignore.internal.domain</name> + <value>true</value> + <description>Ignore outlinks to the same domain.</description> + </property> + + <property> + <name>link.ignore.limit.page</name> + <value>true</value> + <description>Limit to only a single outlink to the same page.</description> + </property> + + <property> + <name>link.ignore.limit.domain</name> + <value>true</value> + <description>Limit to only a single outlink to the same domain.</description> + </property> + + }}} + + But with these defaults, if you are only crawling pages within a domain or within a set of subdomains, all outlinks will be ignored and you will end up with an empty webgraph. This in turn will throw an error in the LinkRank job. The flip side is that by NOT ignoring links to the same domain/host and by not limiting those links, the webgraph becomes much, much denser, and hence there are far more links to process, which probably won't improve relevance by much. 
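Given those defaults, an intranet-only crawl needs the ignore properties overridden or the webgraph comes out empty. A sketch of the override, written to a sample file here rather than your real conf/nutch-site.xml (the property names are the ones listed above; the values flip the defaults):

```shell
# Sketch: override the link.ignore defaults for a single-domain crawl,
# as you would in conf/nutch-site.xml.
cat > nutch-site.xml.sample <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>link.ignore.internal.host</name>
    <value>false</value>
  </property>
  <property>
    <name>link.ignore.internal.domain</name>
    <value>false</value>
  </property>
</configuration>
EOF

overrides=$(grep -c "<value>false</value>" nutch-site.xml.sample)
echo "overridden properties: $overrides"
rm -f nutch-site.xml.sample
```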
+ + {{{ bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/ bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb/ bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb/
[Nutch Wiki] Update of HardwareRequirements by NycoNyco
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by NycoNyco: http://wiki.apache.org/nutch/HardwareRequirements The comment on the change is: title -- - = Hardware Requirements = In general, fetching and database updates require lots of disk, and searching is faster with more RAM. But the particulars depend on how big of an index you're trying to build and how much query traffic you expect. + + == Requirements for indexing == As a general rule, each page fetched requires around 10k of disk overall (for the page cache, its text, the index, db entries, etc.). So a terabyte of storage is required for every 100M pages.
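The 10k-per-page rule above works out as follows (integer arithmetic sketch, nothing Nutch-specific assumed):

```shell
# Worked version of the rule of thumb: ~10 KB of disk per fetched page.
pages=100000000        # 100M pages
kb_per_page=10
total_gb=$(( pages * kb_per_page / 1024 / 1024 ))
echo "~${total_gb} GB, i.e. about a terabyte, for ${pages} pages"
```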
[Nutch Wiki] Update of Features by NycoNyco
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by NycoNyco: http://wiki.apache.org/nutch/Features The comment on the change is: (non-exhaustive) tentative features list (please review) -- (Please reformat this text and divide into feature lists, questions and answers). == Features == + + * Fetching, parsing and indexing, in parallel and/or distributed + * Plugins + * Many formats: plain text, HTML, XML, ZIP, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, Powerpoint), PDF, JavaScript, RSS, RTF, MP3 (ID3 tags) + * Ontology + * Clustering + * MapReduce + * Distributed filesystem (via Hadoop) + * Link-graph database + * NTLM authentication == Questions and Answers ==
[Nutch Wiki] Update of RunNutchInEclipse0.9 by BartoszGadzimski
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by BartoszGadzimski: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 The comment on the change is: added description for Windows users -- * Eclipse 3.3 - aka Europa * Java 1.6 * Ubuntu (should work on most platforms though) + * Windows XP == Before you start == Setting up Nutch to run into Eclipse can be tricky, and most of the time you are much faster if you edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). However, it's very useful to be able to debug Nutch in Eclipse. But again you might be quicker by looking at the logs (logs/hadoop.log)... + == Steps == + + + === For Windows Users === + + If you are running Windows (tested on Windows XP) you must first install cygwin + + Download cygwin from http://www.cygwin.com/setup.exe + + Install cygwin and set PATH variable for it. + + It's in control panel, system, advanced tab, environment variables and edit/add PATH + + I have in PATH like: + + C:\Sun\SDK\bin;C:\cygwin\bin + + If you run bash in Start-RUN-cmd.exe it should work. + + Then you can follow rest of these steps === Install Nutch === * Grab a fresh release of Nutch 0.9 - http://lucene.apache.org/nutch/version_control.html * Do not build Nutch now. Make sure you have no .project and .classpath files in the Nutch directory + === Create a new java project in Eclipse === * File New Project Java project click Next
[Nutch Wiki] Update of NewScoringIndexingExample by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/NewScoringIndexingExample New page: = Example Running new Scoring and Indexing Systems = Below is an example of running the new scoring and indexing systems from start to finish. This was done with a sample of 1000 urls and I ran two different fetch cycles. The first being 1000 urls and the second being the top 2000 urls. The loops job is optional but included for completeness. In production we have actually removed that job. This was done with a clean pull from Nutch trunk as of 2009-03-06 (right before 1.0 is set to be released). If anybody has any problems running these commands or has questions send me an email or send one to the nutch users or dev list and I will reply. Please send it to kubes at the apache address dot org. {{{ bin/nutch inject crawl/crawldb crawl/urls/ bin/nutch generate crawl/crawldb/ crawl/segments bin/nutch fetch crawl/segments/20090306093949/ bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306093949/ bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/ bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb/ bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb/ bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper -scores -topn 1000 -webgraphdb crawl/webgraphdb/ -output crawl/webgraphdb/dump/scores more crawl/webgraphdb/dump/scores/part-0 http://validator.w3.org/check?uri=referer 0.4955311 http://www.adobe.com/go/getflashplayer 0.4060498 http://www.statcounter.com/ 0.4060498 http://www.liveinternet.ru/click0.33680826 http://www.adobe.com/products/acrobat/readstep2.html0.31656843 http://www.adobe.com/go/getflashplayer/ 
0.30378538 http://www.bloomingbows.com/2003/scripts/sitemap.asp0.27821928 http://www.misterping.com/ 0.27821928 ... bin/nutch readdb crawl/crawldb/ -stats CrawlDb statistics start: crawl/crawldb/ Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. Statistics for CrawlDb: crawl/crawldb/ TOTAL urls: 16711 retry 0:16686 retry 1:25 min score: 0.0 avg score: 0.022716654 max score: 0.495 status 1 (db_unfetched):15739 status 2 (db_fetched): 677 status 3 (db_gone): 75 status 4 (db_redir_temp): 143 status 5 (db_redir_perm): 77 CrawlDb statistics: done bin/nutch generate crawl/crawldb/ crawl/segments/ -topN 2000 bin/nutch fetch crawl/segments/20090306100055/ bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306100055/ rm -fr crawl/webgraphdb/ bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090306093949/ -segment crawl/segments/20090306100055/ -webgraphdb crawl/webgraphdb bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/ bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb/ bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb/ more crawl/webgraphdb/dump/scores/part-0 http://www.statcounter.com/ 1.7133079 http://www.morristownwebdesign.com/ 1.0093393 http://www.jdoqocy.com/click-3331968-10419685 0.87828785 http://www.anrdoezrs.net/click-3331968-10384568 0.87828785 http://www.sedo.com/main.php3?language=e0.6565905 http://wetter.spiegel.de/spiegel/html/frankreich0.html 0.641775 http://www.kenwood.com/ 0.6084726 http://validator.w3.org/check?uri=referer 0.5605916 http://wetter.spiegel.de/spiegel/html/Italien0.html 0.5164927 http://www.youtube.com/?hl=entab=w10.50952965 http://www.addthis.com/bookmark.php 0.5013165 http://www.ptguide.com/ 0.49564213 http://www.adobe.com/go/getflashplayer 0.47368217 http://de.weather.yahoo.com/ITXX/ITXX0073/index_c.html 0.4657473 
http://www.adobe.com/shockwave/download/download.cgi?P1_Prod_Version=ShockwaveFlashpromoid=BIOW 0.44376293 http://www.google.com/ 0.42282072 http://www.zajezdy.cz/ 0.41620353 http://www.intermarche.com/ 0.41489196 http://www.shipskill.com/7/ 0.4147887 http://www.statcounter.com/free_hit_counter.html0.40928197
[Nutch Wiki] Update of FrontPage by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/FrontPage The comment on the change is: Added example for new scoring and indexing systems -- == Nutch 2.0 == * [Nutch2Architecture] -- Discussions on the Nutch 2.0 architecture. * [NewScoring] -- New stable pagerank like webgraph and link-analysis jobs. + * [NewScoringIndexingExample] -- Two full fetch cycles of commands using new scoring and indexing systems. == Other Resources == * [http://nutch.sourceforge.net/blog/cutting.html Doug's Weblog] -- He's the one who originally wrote Lucene and Nutch.
[Nutch Wiki] Update of DownloadingNutch by BartoszGadzimski
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by BartoszGadzimski: http://wiki.apache.org/nutch/DownloadingNutch -- You have two choices in how to get Nutch: - 1. You can download a release from http://lucene.apache.org/nutch/release/. This will give you a relatively stable release. At the moment the latest release is 0.8. + 1. You can download a release from http://lucene.apache.org/nutch/release/. This will give you a relatively stable release. At the moment the latest release is 0.9. 2. Or, you can check out the latest source code from subversion and build it with Ant. This gets you closer to the bleeding edge of development. The 0.9 release should be relatively stable, but the trunk (from which the [http://lucene.apache.org/nutch/nightly.html nightly builds] are built) is under heavy development with bugs showing up and getting squashed fairly frequently. Note: As of 5/29/08 the Subversion trunk seems to be much better than the 0.9 release. If you have trouble with 0.9 your best bet is to try moving to trunk and see if the problems resolve themselves.
[Nutch Wiki] Update of SimpleMapReduceTutorial by BartoszGadzimski
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by BartoszGadzimski: http://wiki.apache.org/nutch/SimpleMapReduceTutorial The comment on the change is: It is not map reduce tutorial, it's only confusing people -- - This is the simplest map reduce example I could come up with. Local filesystem, just getting one segment indexed. I am running Ubuntu, on an Athlon 3200+ using a cable modem connection. + deleted - == Designate Url == - - Need to get to the right place - - {{{ - cd nutch/branches/mapred - }}} - - We need to make a directory that contains files, where each line of each file is a url. I choose http://lucene.apache.org/nutch/ - - {{{ - mkdir urls - echo http://lucene.apache.org/nutch/; urls/urls - }}} - - Also need to change the crawl filter to include this site - - {{{ - perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' conf/crawl-urlfilter.txt - }}} - - We walk through the following steps: crawl, generate, fetch, updatedb, invertlinks, index. - - == Crawl == - - We want to run crawl on the urls directory from above. - - {{{ - ./bin/nutch crawl urls - }}} - - Took me about ten minutes. Output included - - 051004 003916 178 pages, 17 errors, 0.4 pages/s, 48 kb/s - - The errors generally seemed to be timeouts. - - The rest of the commands are a bit more dynamic, relying on timestamp and the like. Environment variables help out. - - == Generate == - - Here we walk a segment dir from the crawl above. - - {{{ - CRAWLDB=`find crawl-2* -name crawldb` - SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments` - ./bin/nutch generate $CRAWLDB $SEGMENTS_DIR - }}} - - Took less than five seconds. - - == Fetch == - - {{{ - SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1` - ./bin/nutch fetch $SEGMENT - }}} - - Took about seven minutes, and output looked like - - 051004 004931 65 pages, 404 errors, 0.2 pages/s, 19 kb/s, - - Again, many timeouts. 
- - == UbdateDB == - - {{{ - ./bin/nutch updatedb $CRAWLDB $SEGMENT - }}} - - Took less than five seconds. - - == InvertLinks == - - {{{ - LINKDB=`find crawl-2* -name linkdb -maxdepth 1` - SEGMENTS=`find crawl-2* -name segments -maxdepth 1` - ./bin/nutch invertlinks $LINKDB $SEGMENTS - }}} - - Took less than five seconds. - - == Index == - - We need a place for our index, say myindex - - {{{ - mkdir myindex - }}} - - Now, let's index. - - {{{ - ./bin/nutch index myindex $LINKDB $SEGMENT - }}} - - Took less than ten seconds. - - == Test == - - The best test I have for the moment is - - {{{ - ls -alR myindex - }}} - - If you see several files, it at least did something. Happy nutching! - - Tutorial written by Earl Cahill, 2005. -
[Nutch Wiki] Trivial Update of FrontPage by BartoszGadzimski
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by BartoszGadzimski: http://wiki.apache.org/nutch/FrontPage -- * GettingNutchRunningWithDebian * GettingNutchRunningWithSocksProxy * ErrorMessages -- What they mean and suggestions for getting rid of them. - * SimpleMapReduceTutorial * SetupProxyForNutch - using Tinyproxy on Ubuntu * CreateNewFilter - for example to add a category metadata to your index and be able to search for it * UpgradeFrom07To08
[Nutch Wiki] Update of InstallingWeb2 by SamiSiren
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by SamiSiren: http://wiki.apache.org/nutch/InstallingWeb2 -- + == NOTE: Web2 module is no longer part of Nutch == + So these instructions no longer apply. + + + chris sleeman wrote: Hi,
[Nutch Wiki] Update of RunNutchInEclipse0.9 by FrankMcCown
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FrankMcCown: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 The comment on the change is: Corrected instruction -- http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. + Then add the jar files to the build path (First refresh the workspace by pressing F5. Then right-click the project folder -> Build Path -> Configure Build Path... Then select the Libraries tab, click Add Jars... and then add each .jar file individually). - Then add the jar files to the build path (First refresh the workspace. Then right-click on the source - folder Java Build Path Libraries Add Jars. In Eclipse version 3.4, right-click the project folder Build Path Configure Build Path... Then select the Libraries tab, click Add Jars... and then add each .jar file individually). === Build Nutch ===
[Nutch Wiki] Update of IntranetRecrawl by SAnand
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by SAnand: http://wiki.apache.org/nutch/IntranetRecrawl The comment on the change is: Suggested fix for index/merge-output already exists error when merging indices -- No changes necessary for this to run with Nutch 0.9.0. + However, if you get an error message indicating that the folder index/merge-output already exists, replace the index folder with the merge-output folder inside it. For example: + {{{ + mv $index_dir/merge-output /tmp + rm -rf $index_dir + mv /tmp/merge-output $index_dir + }}} === Code === {{{
[Nutch Wiki] Trivial Update of RunNutchInEclipse0.9 by FrankMcCown
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FrankMcCown: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 The comment on the change is: clarified -- http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. - Then add them to the libraries to the build path (First refresh the workspace. Then right-click on the source + Then add the jar files to the build path (First refresh the workspace. Then right-click on the source folder -> Java Build Path -> Libraries -> Add Jars. In Eclipse version 3.4, right-click the project folder -> Build Path -> Configure Build Path... Then select the Libraries tab, click Add Jars... and then add each .jar file individually).
[Nutch Wiki] Update of GettingNutchRunningWithWindows by FrankMcCown
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FrankMcCown: http://wiki.apache.org/nutch/GettingNutchRunningWithWindows The comment on the change is: Added some clarifications -- === Download === - [http://lucene.apache.org/nutch/release/ Download] the release and extract anywhere on your hard disk e.g. `c:\nutch-0.9` + [http://lucene.apache.org/nutch/release/ Download] the release and extract on your hard disk in a directory that ''does not'' contain a space in it (e.g., `c:\nutch-0.9`). If the directory does contain a space (e.g., `c:\my programs\nutch-0.9`), the Nutch scripts will not work properly. - Create an empty text file in your nutch directory e.g. `urls` and add the URLs of the sites you want to crawl. + Create an empty text file (use any name you wish) in your nutch directory (e.g., `urls`) and add the URLs of the sites you want to crawl. - Add your URLs to the `crawl-urlfilter.txt` (e.g. `C:\nutch-0.9\conf\crawl-urlfilter.txt`). An entry could look like this: + Add your URLs to the `crawl-urlfilter.txt` (e.g., `C:\nutch-0.9\conf\crawl-urlfilter.txt`). An entry could look like this: {{{ +^http://([a-z0-9]*\.)*apache.org/ }}} - Load up cygwin and naviagte to your nutch directory. When cygwin launches you'll usually find yourself in your user folder (e.g. `C:\Documents and Settings\username`). + Load up cygwin and navigate to your `nutch` directory. When cygwin launches, you'll usually find yourself in your user folder (e.g. `C:\Documents and Settings\username`). - If your workstation needs to go through a windows authentication proxy to get to the internet then you can use an application such as the [http://sourceforge.net/projects/ntlmaps/ NTLM Authorization Proxy Server] to get through it. You'll then need to edit the `nutch-site.xml` file to point to the port opened by the app. 
+ If your workstation needs to go through a Windows Authentication Proxy to get to the Internet (this is not common), then you can use an application such as the [http://sourceforge.net/projects/ntlmaps/ NTLM Authorization Proxy Server] to get through it. You'll then need to edit the `nutch-site.xml` file to point to the port opened by the app. == Intranet Crawling == @@ -48, +48 @@ {{{ bin/nutch crawl urls -dir crawl -depth 3 >& crawl.log }}} - then a folder called crawl/ is created in your nutch directory, along with the crawl.log file. Use this log file to debug any errors you might have. + then a folder called `crawl` is created in your `nutch` directory, along with the crawl.log file. Use this log file to debug any errors you might have. You'll need to delete or move the crawl directory before starting the crawl off again unless you specify another path on the command above.
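The `crawl-urlfilter.txt` entry shown earlier in this page can be sanity-checked from the cygwin shell with grep. This is only an approximation: Nutch applies Java regular expressions while `grep -E` uses POSIX extended syntax, and the test URLs here are examples.

```shell
# The filter pattern from crawl-urlfilter.txt (without the leading '+').
pattern='^http://([a-z0-9]*\.)*apache.org/'

# A URL under apache.org passes the filter (grep prints the matching line):
echo 'http://lucene.apache.org/nutch/' | grep -E "$pattern"

# Any other host finds no match, so the URL would be filtered out:
echo 'http://www.example.com/' | grep -E "$pattern" || echo 'filtered out'
```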
[Nutch Wiki] Update of RunNutchInEclipse0.9 by FrankMcCown
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FrankMcCown: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 The comment on the change is: Clarified some instructions and improved grammar -- * Nutch release 0.9 * Eclipse 3.3 - aka Europa * Java 1.6 - * Ubuntu (should work on most platform, though) + * Ubuntu (should work on most platforms though) == Before you start == @@ -34, +34 @@ === Configure Nutch === - * see the [http://wiki.apache.org/nutch/NutchTutorial Tutorial] + * See the [http://wiki.apache.org/nutch/NutchTutorial Tutorial] - * change the property plugin.folders to ./src/plugin on $NUTCH_HOME/conf/nutch-defaul.xml + * Change the property plugin.folders to ./src/plugin in $NUTCH_HOME/conf/nutch-default.xml - * make sure Nutch is configured correctly before testing it into Eclipse ;-) + * Make sure Nutch is configured correctly before testing it in Eclipse ;-) - === missing org.farng and com.etranslate === + === Missing org.farng and com.etranslate === - You will encounter problems with some imports in parse-mp3 and parse-rtf plugins (30 errors in my case). + Eclipse will complain about some import statements in parse-mp3 and parse-rtf plugins (30 errors in my case). - Because of incompatibility with Apache license they were left from sources. + Because of incompatibility with the Apache license, the .jar files that define the necessary classes were not included with the source code. + - You can download them here: + Download them here: http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. - Then add them to the libraries to the build path (First refresh the workspace.
Then Right click on the source + Then add them to the libraries to the build path (First refresh the workspace. Then right-click on the source - folder = Java Build Path = Libraries = Add Jars). + folder -> Java Build Path -> Libraries -> Add Jars. In Eclipse version 3.4, right-click the project folder -> Build Path -> Configure Build Path... Then select the Libraries tab, click Add Jars... and then add each .jar file individually). === Build Nutch === - * In case you setup the project correctly, Eclipse will build Nutch for you into tmp_build. + If you set up the project correctly, Eclipse will build Nutch for you into tmp_build. See below for problems you could run into. - -
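The plugin.folders change mentioned in this page's Configure Nutch step amounts to editing a property element in nutch-default.xml. A sketch of the relevant fragment (this is the standard Nutch/Hadoop XML configuration format; only the property shown is taken from the instructions above):

{{{
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
</property>
}}}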
[Nutch Wiki] Update of Mailing by GrantIngersoll
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by GrantIngersoll: http://wiki.apache.org/nutch/Mailing -- == List Archives == + [http://www.lucidimagination.com/search] - Search the Lucene ecosystem, including Nutch. Powered by Lucene/Solr. [http://www.mail-archive.com/index.php?hunt=nutch Searchable Nutch] list archives. [http://www.nabble.com/Nutch-f362.html nutch archives] nabble.com archives.
[Nutch Wiki] Update of Mailing by GrantIngersoll
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by GrantIngersoll: http://wiki.apache.org/nutch/Mailing -- == List Archives == - [http://www.lucidimagination.com/search] - Search the Lucene ecosystem, including Nutch. Powered by Lucene/Solr. + * [http://www.lucidimagination.com/search] - Search the Lucene ecosystem, including Nutch. Powered by Lucene/Solr. - [http://www.mail-archive.com/index.php?hunt=nutch Searchble Nutch] list archives. + * [http://www.mail-archive.com/index.php?hunt=nutch Searchable Nutch] list archives. - [http://www.nabble.com/Nutch-f362.html nutch archives] nabble.com archives. + * [http://www.nabble.com/Nutch-f362.html nutch archives] nabble.com archives.
[Nutch Wiki] Update of NewScoring by OtisGospodnetic
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by OtisGospodnetic: http://wiki.apache.org/nutch/NewScoring -- - This page describes the new scoring (i.e. WebGraph and Link Analysis) functionality in Nutch as of revision 723441. + This page describes the new scoring (i.e. !WebGraph and Link Analysis) functionality in Nutch as of revision 723441. == General Information == The new scoring functionality can be found in org.apache.nutch.scoring.webgraph. This package contains multiple programs that build web graphs, perform a stable convergent link-analysis, and update the crawldb with those scores. These programs assume that fetching cycles have already been completed and now the users want to build a global webgraph from those segments and from that webgraph perform link-analysis to get a single global relevancy score for each url. Building a webgraph assumes that all links are stored in the current segments to be processed. Links are not held over from one processing cycle to another. Global link-analysis scores are based on the current links available and scores will change as the link structure of the webgraph changes. @@ -8, +8 @@ Currently the scoring jobs are not integrated into the Nutch script as commands and must be run in the form bin/nutch org.apache.nutch.scoring.webgraph.. === WebGraph === - The WebGraph program is the first job that must be run once all segments are fetched and ready to be processed. WebGraph is found at org.apache.nutch.scoring.webgraph.WebGraph. Below is a printout of the programs usage. + The !WebGraph program is the first job that must be run once all segments are fetched and ready to be processed. !WebGraph is found at org.apache.nutch.scoring.webgraph.!WebGraph. Below is a printout of the programs usage. 
{{{ usage: WebGraph @@ -17, +17 @@ -webgraphdb webgraphdb the web graph database to use }}} - The WebGraph program can take multiple segments to process and requires an output directory in which to place the completed web graph components. The WebGraph creates three different components, and inlink database, an outlink database, and a node database. The inlink database is a listing of url and all of its inlinks. The outlink database is a listing of url and all of its outlinks. The node database is a listing of url with node meta information including the number of inlinks and outlinks, and eventually the score for that node. + The !WebGraph program can take multiple segments to process and requires an output directory in which to place the completed web graph components. The !WebGraph creates three different components: an inlink database, an outlink database, and a node database. The inlink database is a listing of url and all of its inlinks. The outlink database is a listing of url and all of its outlinks. The node database is a listing of url with node meta information including the number of inlinks and outlinks, and eventually the score for that node. === Loops === - Once the web graph is built we can begin the process of link analysis. Loops is an optional program that attempts to help weed out spam sites by determining link cycles in a web graph. An example of a link cycle would be sites A, B, C, and D where A links to B which links to C which links to D which links back to A. This program is computationally expensive and usually, due to time and space requirement, can't be run on more than a three or four level depth. While it does identify sites which appear to be spam and those links are then discounted in the later LinkRank program, its benefit to cost ratio is very low. It is included in this package for completeness and because their may be a better way to perform this function with a different algorithm. 
But on current production webgraphs, its use is discouraged. Loops is found at org.apache.nutch.scoring.webgraph.Loops. Below is a printout of the programs usage. + Once the web graph is built we can begin the process of link analysis. Loops is an optional program that attempts to help weed out spam sites by determining link cycles in a web graph. An example of a link cycle would be sites A, B, C, and D, where A links to B which links to C which links to D which links back to A. This program is computationally expensive and usually, due to time and space requirement, can't be run on more than a three or four level depth. While it does identify sites which appear to be spam and those links are then discounted in the later !LinkRank program, its benefit to cost ratio is very low. It is included in this package for completeness and because there may be a better way to perform this function with a different algorithm. But on current large production webgraphs, its
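The page notes that these scoring jobs are not yet Nutch script commands and must be launched by class name. A sketch of what that looks like, using the class names and flags from the usage printouts above; the webgraphdb path and segment name are hypothetical examples:

{{{
# Build the web graph from a fetched segment (example paths):
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -webgraphdb crawl/webgraphdb -segment crawl/segments/20081215102312

# Optionally look for link cycles in the same web graph:
bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb
}}}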
[Nutch Wiki] Update of NewPage by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/NewPage The comment on the change is: Beginning descriptions of how to use the new webgraph scoring system. -- - emptyemptyempty! + This page describes the new scoring (i.e. WebGraph and Link Analysis) functionality in Nutch as of revision 723441. + == General Information == + The new scoring functionality can be found in org.apache.nutch.scoring.webgraph. This package contains multiple programs that build web graphs, perform a stable convergent link-analysis, and update the crawldb with those scores. These programs assume that fetching cycles have already been completed and now the users want to build a global webgraph from those segments and from that webgraph perform link-analysis to get a single global relevancy score for each url. Building a webgraph assumes that all links are stored in the current segments to be processed. Links are not held over from one processing cycle to another. Global link-analysis scores are based on the current links available and scores will change as the link structure of the webgraph changes. + + Currently the scoring jobs are not integrated into the Nutch script as commands and must be run in the form bin/nutch org.apache.nutch.scoring.webgraph.. + + === WebGraph === + The WebGraph program is the first job that must be run once all segments are fetched and ready to be processed. WebGraph is found at org.apache.nutch.scoring.webgraph.WebGraph. Below is a printout of the programs usage. + + {{{ + usage: WebGraph + -help show this help message + -segment segment the segment(s) to use + -webgraphdb webgraphdb the web graph database to use + }}} + + The WebGraph program can take multiple segments to process and requires an output directory in which to place the completed web graph components. 
The WebGraph creates three different components, and inlink database, an outlink database, and a node database. The inlink database is a listing of url and all of its inlinks. The outlink database is a listing of url and all of its outlinks. The node database is a listing of url with node meta information including the number of inlinks and outlinks, and eventually the score for that node. + + === Loops === + Once the web graph is built we can begin the process of link analysis. Loops is an optional program that attempts to help weed out spam sites by determining link cycles in a web graph. An example of a link cycle would be sites A, B, C, and D where A links to B which links to C which links to D which links back to A. This program is computationally expensive and usually, due to time and space requirement, can't be run on more than a three or four level depth. While it does identify sites which appear to be spam and those links are then discounted in the later LinkRank program, its benefit to cost ratio is very low. It is included in this package for completeness and because their may be a better way to perform this function with a different algorithm. But on current production webgraphs, its use is discouraged. Loops is found at org.apache.nutch.scoring.webgraph.Loops. Below is a printout of the programs usage. + + {{{ + usage: Loops + -help show this help message + -webgraphdb webgraphdb the web graph database to use + }}} + + === LinkRank === + With the web graph built we can now run LinkRank to perform an iterative link analysis. LinkRank is a PageRank like link analysis program that converges to stable global scores for each url. Similar to PageRank, the LinkRank program starts with a common score for all urls. It then creates a global score for each url based on the number of incoming links and the scores for those link and the number of outgoing links from the page. The process is iterative and scores tend to converge after a given number of iterations. 
It is different from PageRank in that nepotistic links such as links internal to a website and reciprocal links between websites can be ignored. The number of iterations can also be configured, by default 10 iterations are performed. Unlike the previous OPIC scoring, the LinkRank program does not keep scores from one processing time to another. The web graph and the link scores are recreated at each processing run and so we don't have the problems of ev er increasing scores. LinkRank requires the WebGraph program to have completed successfully and it stores its output scores for each url in the node database of the webgraph. LinkRank is found at org.apache.nutch.scoring.webgraph.LinkRank. Below is a printout of the programs usage. + + {{{ + usage: LinkRank + -help show this help message +
[Nutch Wiki] Update of NewScoring by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/NewScoring New page: This page describes the new scoring (i.e. WebGraph and Link Analysis) functionality in Nutch as of revision 723441. == General Information == The new scoring functionality can be found in org.apache.nutch.scoring.webgraph. This package contains multiple programs that build web graphs, perform a stable convergent link-analysis, and update the crawldb with those scores. These programs assume that fetching cycles have already been completed and now the users want to build a global webgraph from those segments and from that webgraph perform link-analysis to get a single global relevancy score for each url. Building a webgraph assumes that all links are stored in the current segments to be processed. Links are not held over from one processing cycle to another. Global link-analysis scores are based on the current links available and scores will change as the link structure of the webgraph changes. Currently the scoring jobs are not integrated into the Nutch script as commands and must be run in the form bin/nutch org.apache.nutch.scoring.webgraph.. === WebGraph === The WebGraph program is the first job that must be run once all segments are fetched and ready to be processed. WebGraph is found at org.apache.nutch.scoring.webgraph.WebGraph. Below is a printout of the programs usage. {{{ usage: WebGraph -help show this help message -segment segment the segment(s) to use -webgraphdb webgraphdb the web graph database to use }}} The WebGraph program can take multiple segments to process and requires an output directory in which to place the completed web graph components. The WebGraph creates three different components, and inlink database, an outlink database, and a node database. The inlink database is a listing of url and all of its inlinks. 
The outlink database is a listing of url and all of its outlinks. The node database is a listing of url with node meta information including the number of inlinks and outlinks, and eventually the score for that node. === Loops === Once the web graph is built we can begin the process of link analysis. Loops is an optional program that attempts to help weed out spam sites by determining link cycles in a web graph. An example of a link cycle would be sites A, B, C, and D where A links to B which links to C which links to D which links back to A. This program is computationally expensive and usually, due to time and space requirement, can't be run on more than a three or four level depth. While it does identify sites which appear to be spam and those links are then discounted in the later LinkRank program, its benefit to cost ratio is very low. It is included in this package for completeness and because their may be a better way to perform this function with a different algorithm. But on current production webgraphs, its use is discouraged. Loops is found at org.apache.nutch.scoring.webgraph.Loops. Below is a printout of the programs usage. {{{ usage: Loops -help show this help message -webgraphdb webgraphdb the web graph database to use }}} === LinkRank === With the web graph built we can now run LinkRank to perform an iterative link analysis. LinkRank is a PageRank like link analysis program that converges to stable global scores for each url. Similar to PageRank, the LinkRank program starts with a common score for all urls. It then creates a global score for each url based on the number of incoming links and the scores for those link and the number of outgoing links from the page. The process is iterative and scores tend to converge after a given number of iterations. It is different from PageRank in that nepotistic links such as links internal to a website and reciprocal links between websites can be ignored. 
The number of iterations can also be configured, by default 10 iterations are performed. Unlike the previous OPIC scoring, the LinkRank program does not keep scores from one processing time to another. The web graph and the link scores are recreated at each processing run and so we don't have the problems of ever increasing scores. LinkRank requires the WebGraph program to have completed successfully and it stores its output scores for each url in the node database of the webgraph. LinkRank is found at org.apache.nutch.scoring.webgraph.LinkRank. Below is a printout of the programs usage. {{{ usage: LinkRank -help show this help message -webgraphdb webgraphdb the web graph db to use }}} === ScoreUpdater === Once the LinkRank program has been run and link analysis is completed, the scores must be updated into the crawl database to work with the current Nutch functionality. The
[Nutch Wiki] Update of NewPage by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/NewPage -- - This page describes the new scoring (i.e. WebGraph and Link Analysis) functionality in Nutch as of revision 723441. + empty - == General Information == - The new scoring functionality can be found in org.apache.nutch.scoring.webgraph. This package contains multiple programs that build web graphs, perform a stable convergent link-analysis, and update the crawldb with those scores. These programs assume that fetching cycles have already been completed and now the users want to build a global webgraph from those segments and from that webgraph perform link-analysis to get a single global relevancy score for each url. Building a webgraph assumes that all links are stored in the current segments to be processed. Links are not held over from one processing cycle to another. Global link-analysis scores are based on the current links available and scores will change as the link structure of the webgraph changes. - - Currently the scoring jobs are not integrated into the Nutch script as commands and must be run in the form bin/nutch org.apache.nutch.scoring.webgraph.. - - === WebGraph === - The WebGraph program is the first job that must be run once all segments are fetched and ready to be processed. WebGraph is found at org.apache.nutch.scoring.webgraph.WebGraph. Below is a printout of the programs usage. - - {{{ - usage: WebGraph - -help show this help message - -segment segment the segment(s) to use - -webgraphdb webgraphdb the web graph database to use - }}} - - The WebGraph program can take multiple segments to process and requires an output directory in which to place the completed web graph components. The WebGraph creates three different components, and inlink database, an outlink database, and a node database. 
The inlink database is a listing of url and all of its inlinks. The outlink database is a listing of url and all of its outlinks. The node database is a listing of url with node meta information including the number of inlinks and outlinks, and eventually the score for that node. - - === Loops === - Once the web graph is built we can begin the process of link analysis. Loops is an optional program that attempts to help weed out spam sites by determining link cycles in a web graph. An example of a link cycle would be sites A, B, C, and D where A links to B which links to C which links to D which links back to A. This program is computationally expensive and usually, due to time and space requirement, can't be run on more than a three or four level depth. While it does identify sites which appear to be spam and those links are then discounted in the later LinkRank program, its benefit to cost ratio is very low. It is included in this package for completeness and because their may be a better way to perform this function with a different algorithm. But on current production webgraphs, its use is discouraged. Loops is found at org.apache.nutch.scoring.webgraph.Loops. Below is a printout of the programs usage. - - {{{ - usage: Loops - -help show this help message - -webgraphdb webgraphdb the web graph database to use - }}} - - === LinkRank === - With the web graph built we can now run LinkRank to perform an iterative link analysis. LinkRank is a PageRank like link analysis program that converges to stable global scores for each url. Similar to PageRank, the LinkRank program starts with a common score for all urls. It then creates a global score for each url based on the number of incoming links and the scores for those link and the number of outgoing links from the page. The process is iterative and scores tend to converge after a given number of iterations. 
It is different from PageRank in that nepotistic links such as links internal to a website and reciprocal links between websites can be ignored. The number of iterations can also be configured, by default 10 iterations are performed. Unlike the previous OPIC scoring, the LinkRank program does not keep scores from one processing time to another. The web graph and the link scores are recreated at each processing run and so we don't have the problems of ev er increasing scores. LinkRank requires the WebGraph program to have completed successfully and it stores its output scores for each url in the node database of the webgraph. LinkRank is found at org.apache.nutch.scoring.webgraph.LinkRank. Below is a printout of the programs usage. - - {{{ - usage: LinkRank - -help show this help message - -webgraphdb webgraphdb the web graph db to use - }}} - - === ScoreUpdater === - Once the LinkRank program has
[Nutch Wiki] Update of FrontPage by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/FrontPage -- == Nutch 2.0 == * [Nutch2Architecture] -- Discussions on the Nutch 2.0 architecture. + * [NewScoring] -- New stable pagerank like webgraph and link-analysis jobs. == Other Resources == * [http://nutch.sourceforge.net/blog/cutting.html Doug's Weblog] -- He's the one who originally wrote Lucene and Nutch.
[Nutch Wiki] Trivial Update of Release HOWTO by SamiSiren
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by SamiSiren: http://wiki.apache.org/nutch/Release_HOWTO The comment on the change is: remember to update doap.rdf -- 1. Copy release tar file to {{{people.apache.org:/www/www.apache.org/dist/lucene/nutch}}}. 1. Wait 24 hours for release to propagate to mirrors. + + 1. Add the new release info to the [https://svn.apache.org/repos/asf/lucene/nutch/trunk/site/doap.rdf doap.rdf] file, and double check for any other updates that should be made to the doap file as well if it hasn't been updated in a while. + 1. Deploy new Nutch site (according to [Website Update HOWTO]). 1. Deploy new main Lucene site (according to [Website Update HOWTO] but modified for Lucene site - update is to be performed in {{{/www/lucene.apache.org}}} directory). 1. Update Javadoc in {{{people.apache.org:/www/lucene.apache.org/nutch/apidocs}}}.
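The release-copy step above can be done with scp. A sketch, assuming a hypothetical release file name; the destination path is the one given in the step:

{{{
# nutch-1.0.tar.gz is a placeholder for the actual release artifact name.
scp nutch-1.0.tar.gz people.apache.org:/www/www.apache.org/dist/lucene/nutch/
}}}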
[Nutch Wiki] Trivial Update of HttpPostAuthentication by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpPostAuthentication -- == Introduction == - Often, Nutch has to crawl websites with pages protected by authentication. Therefore, to crawl such web-pages, Nutch must authenticate itself to the website and then proceed with fetching the pages from it. Currently, the development version of Nutch can do Basic, Digest and NTLM] based authentication. This is documented in HttpAuthenticationSchemes. In this project, we would be adding HTTP POST based authentication, which is the most popular form of authentication on most websites. It should be possible to configure different credentials for different websites. + Often, Nutch has to crawl websites with pages protected by authentication. Therefore, to crawl such web-pages, Nutch must authenticate itself to the website and then proceed with fetching the pages from it. Currently, the development version of Nutch can do Basic, Digest and NTLM based authentication. This is documented in HttpAuthenticationSchemes. In this project, we would be adding HTTP POST based authentication, which is the most popular form of authentication on most websites. It should be possible to configure different credentials for different websites. == Configuration == A configuration file with a list of domains for which authentication should be done along with the login URL and POST data. If possible, the configuration should also allow the user to mention a session timeout value for websites as an optional parameter. This would be helpful if some website is known to timeout very quickly, or when the duration of the fetch cycle would be too long as compared to the session's life. @@ -23, +23 @@ 1. We use pattern matching to find out whether the contents of the page indicates it as an authentication failure page or not, for the website. 
But it is an unnecessary waste of time because for most cases the page wouldn't be an error page. 1. We perform an authentication by sending POST data to login URL every time we fetch a page from that domain. By this, we are almost doubling the bandwidth requirement to crawl that website. - 1. For those sites, where authentication failure page comes from a known URL, we can add which URLs mean authentication failure along with the login URL and POST data in the configuration file. There wouldn't be too many such URLs for a particular domain and so a regex match or a complete string match for the URLs after every response + 1. For those sites, where authentication failure page comes from a known URL, we can add which URLs mean authentication failure along with the login URL and POST data in the configuration file. There wouldn't be too many such URLs for a particular domain and so a regex match or a complete string match for the URLs after every response from that domain shouldn't consume much time. - from that domain shouldn't consume much time. However, even without taking care of these points, and simply getting the fetcher behavior right as discussed in the previous section, we'll have a solution that may be useful to many.
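Point 3 above suggests matching response URLs against a per-domain list of known authentication-failure URLs. A minimal sketch of that check (the table format, host, and patterns are invented for illustration; this is not a Nutch configuration format):

```python
import re
from urllib.parse import urlparse

# Hypothetical per-domain table mapping a host to URL patterns that
# signal an authentication-failure page (entries invented for the
# example).
AUTH_FAILURE_PATTERNS = {
    "www.example.com": [
        re.compile(r"/login\.jsp"),
        re.compile(r"/session-expired"),
    ],
}

def is_auth_failure(response_url):
    """Return True if the response URL matches a known failure URL
    for its domain, meaning the fetcher should re-authenticate."""
    host = urlparse(response_url).hostname
    patterns = AUTH_FAILURE_PATTERNS.get(host, [])
    return any(p.search(response_url) for p in patterns)
```

Since a domain would have only a handful of such patterns, this per-response check stays cheap, which is the argument made above.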
[Nutch Wiki] Update of RunNutchInEclipse0.9 by PiotrBazan
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by PiotrBazan: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 -- * Name the project (Nutch_Trunk for instance) * Select Create project from existing source and use the location where you downloaded Nutch * Click on Next, and wait while Eclipse is scanning the folders - * Add the folder conf to the classpath (third tab and then add class folder) + * Add the folder conf to the classpath (third tab and then add class folder) + * Go to the Order and Export tab, find the entry for the added conf folder and move it to the top. This is required to make Eclipse take config resources (nutch-default.xml, nutch-site.xml, etc.) from our conf folder and not from anywhere else. * Eclipse should have guessed all the java files that must be added on your classpath. If it's not the case, add src/java, src/test and all plugin src/java and src/test folders to your source folders. Also add all jars in lib and in the plugin lib folders to your libraries * Set output dir to tmp_build, create it if necessary * DO NOT add build to classpath @@ -34, +35 @@ === Configure Nutch === * see the [http://wiki.apache.org/nutch/NutchTutorial Tutorial] - * change the property plugin.folders to ./src/plugin on $NUTCH_HOME/conf/nutch-defaul.xml + * change the property plugin.folders to ./src/plugin in $NUTCH_HOME/conf/nutch-default.xml * make sure Nutch is configured correctly before testing it in Eclipse ;-) === missing org.farng and com.etranslate ===
[Nutch Wiki] Update of johnroman by johnroman
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by johnroman: http://wiki.apache.org/nutch/johnroman -- - John Roman is a sysadmin for the RD arm of lexmark international. + [http://nimbius.36bit.com/mered.jpg John Roman] is a sysadmin for the R&D arm of Lexmark International. Some of his contributions include bugfix documentation and troubleshooting, as well as an attempt to clean up a lot of the tutorials.
[Nutch Wiki] Update of PluginCentral by johnroman
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by johnroman: http://wiki.apache.org/nutch/PluginCentral -- * WritingPluginExample - A step-by-step example of how to write a plugin for the 0.7 branch. - updated by LucasBoullosa * [http://wiki.media-style.com/display/nutchDocu/Write+a+plugin Writing Plugins] - by Stefan - == Plugins that Come with Nutch (0.7) == + == Plugins that Come with Nutch (0.9) == In order to get Nutch to use any of these plugins, you just need to edit your conf/nutch-site.xml file and add the name of the plugin to the list of plugin.includes. @@ -24, +24 @@ * '''parse-html''' - Parses HTML documents * '''parse-js''' - Parses Java``Script * '''parse-mp3''' - Parses MP3s + * '''parse-zip''' - Parses ZIP archives + * '''parse-mspowerpoint''' - Parses Microsoft Powerpoint files * '''parse-msword''' - Parses MS Word documents + * '''parse-msexcel''' - Parses MS Excel documents * '''parse-pdf''' - Parses PDFs * '''parse-rss''' - Parses RSS feeds + * '''parse-oo''' - Parses OpenOffice files + * '''parse-swf''' - Parses Shockwave Flash * '''parse-rtf''' - Parses RTF files * '''parse-text''' - Parses text documents * '''protocol-file''' - Retrieves documents from the filesystem @@ -47, +52 @@ * '''lib-commons-httpclient''' * '''lib-http''' * '''lib-jakarta-poi''' - * '''lib-log4j''' + * '''lib-log4j''' - * '''lib-lucene-analyzers''' + * '''lib-lucene-analyzers''' - Lucene analyzers - * '''lib-nekohtml''' - * '''lib-parsems''' + * '''lib-nekohtml''' - automatic tag balancer + * '''lib-parsems''' - framework for parsing MS documents * '''parse-msexcel''' - Parses MS Excel documents * '''parse-mspowerpoint''' - Parses MS Powerpoint documents * '''parse-oo''' - Parses Open Office and Star Office documents (Extensions: ODT, OTT, ODH, ODM, ODS, OTS, ODP, OTP, SXW, STW, SXC, STC, SXI, STI)
[Nutch Wiki] Update of johnroman by johnroman
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by johnroman: http://wiki.apache.org/nutch/johnroman New page: John Roman is a sysadmin for the RD arm of lexmark international. some of his contributions include bugfix documentation and troubleshooting...as well as an attempt to clean up alot of the tutorials.
[Nutch Wiki] Update of Support by ThomasDelnoij
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ThomasDelnoij: http://wiki.apache.org/nutch/Support -- * [http://www.sigram.com Andrzej Bialecki] ab at sigram.com * CNLP http://www.cnlp.org/tech/lucene.asp * [http://www.doculibre.com/ Doculibre Inc.] Open source and information management consulting. (Lucene, Nutch, Hadoop, Solr, Lius etc.) info at doculibre.com - * [http://www.dsen.nl DSEN - Java | J2EE | Agile Development Consultancy] + * [http://www.dsen.nl Thomas Delnoij (DSEN) - Java | J2EE | Agile Development Consultancy] * eventax GmbH info at eventax.com * [http://www.foofactory.fi/ FooFactory] / Sami Siren info at foofactory dot fi * [http://www.lucene-consulting.com/ Lucene Consulting] - Nutch, Solr, Lucene, Hadoop consulting and development. Founded by Otis Gospodnetic, [http://www.amazon.com/Lucene-Action-Otis-Gospodnetic/dp/1932394281 Lucene in Action] co-author.
[Nutch Wiki] Update of HelpContents by FuminZHAO
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FuminZHAO: http://wiki.apache.org/nutch/HelpContents -- + deleted - ##language:en - == Help Contents == - Here is a tour of the most important help pages: - * HelpForBeginners - if you are new to wikis - * HelpOnNavigation - explains the navigational elements on a page - * HelpOnPageCreation - how to create a new page, and how to use page templates - * HelpOnUserPreferences - how to make yourself known to the wiki, and adapt default behaviour to your taste - * HelpOnEditing - how to edit a page - * HelpOnActions - tools that work on pages or the whole site - * HelpMiscellaneous - more details, and a FAQ section - - These pages contain information only important to wiki administrators and developers: - * HelpOnAdministration - how to maintain a MoinMoin wiki - * HelpOnInstalling - how to install a MoinMoin wiki - * HelpForDevelopers - how to add your own features by changing the MoinMoin code - - An automatically generated index of all help pages is on HelpIndex. See also HelpMiscellaneous/FrequentlyAskedQuestions for answers to frequently asked questions. - - If you find any errors on the help pages, describe them on MoinMoin:HelpErrata. - - ''[Please do not add redundant information on these pages (which has to be maintained at two places then), and follow the established structure of help pages. Also note that the master set of help pages is not public, that this very page you read and all other help pages may be overwritten when the wiki software is updated. So if you have major contributions that should not get lost, send an extra notification notice to the MoinMoin user mailing list.]'' -
[Nutch Wiki] Update of FindPage by FuminZHAO
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FuminZHAO: http://wiki.apache.org/nutch/FindPage -- + deleted - ##language:en - You can use this page to search all entries in this WikiWikiWeb. Searches are not case sensitive. - Good starting points to explore a wiki are: - * RecentChanges: see where people are currently working - * FindPage: search or browse the database in various ways - * TitleIndex: a list of all pages in the wiki - * WordIndex: a list of all words that are part of page title (thus, a list of the concepts in a wiki) - * WikiSandBox: feel free to change this page and experiment with editing - - Search '''wiki.apache.org''' using google: - - [[GoogleSearch]] - - Here's a title search. Try something like ''manager'': - - [[TitleSearch]] - - Here's a full-text search. - - [[FullSearch]] - - You can also use regular expressions, such as - - {{{seriali[sz]e}}} - - Or go direct to a page, or create a new page by entering its name here: - [[GoTo]] -
[Nutch Wiki] Update of PublicServers by Piratheep Mahenthiran
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Piratheep Mahenthiran: http://wiki.apache.org/nutch/PublicServers -- * [http://www.tokenizer.org Tokenizer] is an online shopping search engine partially powered by Nutch * [http://www.utilitysearch.info/ UtilitySearch] is a search engine for the regulated utility industries (Electricity, Water, Gas, and Telecommunications) in the United States and Canada. + * [http://search.tamilsweb.com/ TamilSWeb Search] is a search engine geared toward south asian web content.
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: troubleshooting tips and information to be provided while asking for help -- == Introduction == - 'protocol-httpclient' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server. + 'protocol-httpclient' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server. This feature cannot do POST-based authentication that depends on cookies. More information on this can be found at HttpPostAuthentication. == Necessity == - There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used to configure authentication. The author (Susam Pal) of these features has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well. + There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and the NTLM, Basic and Digest authentication schemes.
'protocol-httpclient' supported HTTPS and had code for NTLM authentication, but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy server, and better inline documentation for the properties to be used to configure authentication. == JIRA NUTCH-559 == These features were submitted as [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559]. If you have checked out the latest Nutch trunk, you don't need to apply the patches. These features were included in the Nutch subversion repository in [http://svn.apache.org/viewvc?view=rev&revision=608972 revision #608972] @@ -91, +91 @@ 'protocol-httpclient' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for authenticating users. Given that only one scheme may be used at a time, it must choose which scheme to use. To accomplish this, it uses an order of preference to select the correct authentication scheme. By default this order is: NTLM, Digest, Basic. For more information on the behavior during authentication, you might want to read the [http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html HttpClient Authentication Guide]. == Need Help? == - If you need help, please feel free to post your question to the [http://lucene.apache.org/nutch/mailing_lists.html#Users nutch-user mailing list]. + If you need help, please feel free to post your question to the [http://lucene.apache.org/nutch/mailing_lists.html#Users nutch-user mailing list]. The author of this work, Susam Pal, usually responds to mails related to authentication problems. The DEBUG logs may be required to troubleshoot the problem. You must enable the debug log for 'protocol-httpclient' before running the crawler.
To enable the debug log for 'protocol-httpclient', open 'conf/log4j.properties' and add the following line: + {{{ + log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout + }}} + It would be good to check the following things before asking for help. + + 1. Have you overridden the 'plugin.includes' property of 'conf/nutch-default.xml' with 'conf/nutch-site.xml' and replaced 'protocol-http' with 'protocol-httpclient'? + 1. If you patched Nutch 0.9 source code manually with this patch, did you build the project before running the crawler? + 1. Have you configured 'conf/httpclient-auth.xml'? + 1. Do you see Nutch trying to fetch the pages you were expecting in 'logs/hadoop.log'? You should see some log lines like 'fetching http://www.example.com/expectedpage.html', where the URL is the page you were expecting to be fetched. If you don't see such lines for the pages you were expecting, the error is outside the scope of this feature. This feature comes into action only when the
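Question 3 above asks whether 'conf/httpclient-auth.xml' is configured. As a hedged sketch (written from memory of the NUTCH-559 format; verify the exact schema against the comments in the file shipped with your Nutch version), a minimal configuration might look like:

```xml
<auth-configuration>
  <!-- Placeholder credentials: replace username, password, host
       and port with values for your own protected server. -->
  <credentials username="myuser" password="mypassword">
    <authscope host="intranet.example.com" port="80"/>
  </credentials>
</auth-configuration>
```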
[Nutch Wiki] Update of Nutch0.9-Hadoop0.10-Tutorial by MarcinOkraszewski
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by MarcinOkraszewski: http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial The comment on the change is: Troubleshooting entry. -- See [http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces] for more info about the number of map reduce tasks. + == Error when putting a file to DFS == + If you get an error similar to: + {{{ + put: java.io.IOException: failed to create file /user/nutch/.test.crc on client 127.0.0.1 because target-length is 0, below MIN_REPLICATION (1) + }}} + it may mean you do not have enough disk space. It happened to me with 90MB disk space available, Nutch 0.9/Hadoop 0.12.2. See also [http://www.mail-archive.com/[EMAIL PROTECTED]/msg09701.html mailing list message]. +
[Nutch Wiki] Update of RunningNutchAndSolr by PieterCoucke
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by PieterCoucke: http://wiki.apache.org/nutch/RunningNutchAndSolr The comment on the change is: svn path -- I'm posting it under Nutch rather than Solr on the presumption that people are more likely to be learning/using Solr first, then come here looking to combine it with Nutch. I'm going to skip over doing command by command for right now. I'm running/building on Ubuntu 7.10 using Java 1.6.0_05. I'm assuming that the Solr trunk code is checked out into solr-trunk and Nutch trunk code is checked out into nutch-trunk. - 1. Check out solr-trunk ( svn co http://svn.apache.org/repos/solr/ solr-trunk ) + 1. Check out solr-trunk ( svn co http://svn.apache.org/repos/asf/lucene/solr/ solr-trunk ) - 1. Check out nutch-trunk ( svn co http://svn.apache.org/repos/nutch/ nutch-trunk ) + 1. Check out nutch-trunk ( svn co http://svn.apache.org/repos/asf/lucene/nutch/ nutch-trunk ) 1. Go into the solr-trunk and run 'ant dist dist-solrj' 1. Copy apache-solr-solrj-1.3-dev.jar and apache-solr-common-1.3-dev.jar from solr-trunk/dist to nutch-trunk/lib 1. Apply patch from [http://www.foofactory.fi/files/nutch-solr/nutch_solr.patch FooFactory patch] to nutch-trunk (cd nutch-trunk; patch -p0 < nutch_solr.patch)
[Nutch Wiki] Trivial Update of RunningNutchAndSolr by PieterCoucke
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by PieterCoucke: http://wiki.apache.org/nutch/RunningNutchAndSolr -- * Edit the imports to pick up org.apache.hadoop.util.ToolRunner 1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java changing scope on LuceneDocumentWrapper from private to protected 1. Get the zip file from [http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html FooFactory] for SOLR-20 - 1. Unzip solr-client.zip somewhere, go into java/solr/src and run 'ant' + 1. Unzip solr-client.zip somewhere, go into java/solrj and run 'ant' 1. Copy solr-client.jar from dist to nutch-trunk/lib 1. Copy xpp3-1.1.3.4.0.jar from lib to nutch-trunk/lib 1. Configure nutch-trunk/conf/nutch-site.xml with *at least* settings for your site including a value for property indexer.solr.url (something like http://localhost:8983/solr/), but you should also have http.agent.name, http.agent.description, http.agent.url, and http.agent.email as well.
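A sketch of the nutch-site.xml described in the last step (only indexer.solr.url comes from the text above; the agent values are placeholders to replace with your own):

```xml
<configuration>
  <property>
    <name>indexer.solr.url</name>
    <value>http://localhost:8983/solr/</value>
  </property>
  <!-- Placeholder agent values: fill in your crawler's real details. -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Test crawler for Nutch + Solr</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/crawler.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>crawler@example.com</value>
  </property>
</configuration>
```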
[Nutch Wiki] Update of Nutch 0.9 Crawl Script Tutorial by AlessioTomasino
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by AlessioTomasino: http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial -- Please add comments / corrections to this document, 'cause I don't know what the heck I'm doing yet. :) One thing I want to figure out is if I can inject just a subset of urls of pages that I know have changed since the last crawl and refetch/index only those pages. I think there is a way to do this using the adddays parameter maybe? anyone have any insight?

+ == How to refetch/index a subset of urls ==
+ 
+ My solution to this common question is to use a filter on the URLs we want to refetch and have those expire using the -adddays option of the 'nutch generate' command.
+ In nutch-site.xml you should enable a filter plugin such as urlfilter-regex and specify the file which contains the regex filter rules:
+ 
+ {{{
+ <property>
+   <name>plugin.includes</name>
+   <value>protocol-http|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url|more)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed|urlfilter-regex</value>
+ </property>
+ 
+ <property>
+   <name>urlfilter.regex.file</name>
+   <value>regex-urlfilter.txt</value>
+ </property>
+ }}}
+ 
+ The file regex-urlfilter.txt can contain any regular expression, including one or more specific URLs we want to refetch/index, e.g.:
+ 
+ {{{
+ +http://myhostname/myurl.html
+ }}}
+ 
+ At this stage we can use the command $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments -adddays 31 to generate a segment; when the segment is fetched, the output should look like:
+ 
+ {{{
+ Fetcher: starting
+ Fetcher: segment: crawl/segments/20080518090826
+ Fetcher: threads: 50
+ fetching http://myhostname/myurl.html
+ redirectCount=0
+ }}}
+ 
+ Any comments/feedback welcome!
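The urlfilter-regex rules used above follow a simple convention: rules are tried top to bottom, a line starting with '+' accepts a matching URL, a line starting with '-' rejects it, and the first matching rule wins. A rough Python sketch of that matching logic (illustrative only, not Nutch's code):

```python
import re

def filter_url(url, rules):
    """Apply urlfilter-regex-style rules: '+pattern' accepts,
    '-pattern' rejects, first match wins, no match rejects."""
    for line in rules:
        sign, pattern = line[0], line[1:]
        if re.search(pattern, url):
            return url if sign == "+" else None
    return None

# Rules mirroring the tutorial's example: refetch only one page.
rules = [
    r"+http://myhostname/myurl\.html",
    r"-.",
]
```

With these rules, only the single listed URL survives filtering, which is what confines the generate/fetch cycle to the pages you want refreshed.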
[Nutch Wiki] Update of PublicServers by Finbar Dineen
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Finbar Dineen: http://wiki.apache.org/nutch/PublicServers -- * [http://www.bigsearch.ca/ Bigsearch.ca] uses nutch open source software to deliver its search results. * [http://busytonight.com/ BusyTonight]: Search for any event in the United States, by keyword, location, and date. Event listings are automatically crawled and updated from original source Web sites. + + * [http://www.centralbudapest.com/search Central Budapest Search] is a search engine for English language sites focussing on Budapest news, restaurants, accommodation, life and events. * [http://circuitscout.com Circuit Scout] is a search engine for electrical circuits.
[Nutch Wiki] Update of FetchCycleOverlap by OtisGospodnetic
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by OtisGospodnetic: http://wiki.apache.org/nutch/FetchCycleOverlap The comment on the change is: This won't work 100% correctly - removing it so I don't mislead people -- - Without overlapping jobs people running Nutch are likely not utilizing their clusters fully. Thus, here is a recipe for overlapping jobs: + deleted - 0. imagine a cluster with M max maps and R max reduces (say M=R=8) - - 1. run generate job with -numFetchers equal to M-2 - - 2. run a fetcher job (uses M-2 maps and later all R reduces) - - 3. at this point, while the fetch job is still running, there are 2 open map slots for something else to run, say the updatedb job for the previously fetched/parsed segment - - 4. when updatedb job is done the cluster can take on more jobs. Any completed tasks (C) from the running fetcher job represent open work slots - - 5. start another fetch job. This will be able to use only C tasks, but C will grow as the first job opens up more slots, eventually hitting M-2 open slots. - - 6. at some point, the fetch job from 2) above will complete, opening up 2 map slots, so updatedb can be run, even in the background, allowing the execution to go back to 1) - - Because a URL is locked out for 7 days after the generate step included it into a fetchlist, the above cycle needs to complete within 7 days. In more detail: - - Generate updates the CrawlDb so that urls selected - for the latest fetchlist become locked out for the next 7 days. This - means that you can happily generate multiple fetchlists, and fetch them - out of order, and then do the DB updates out of order, as you see fit, - so long as you make it within the 7 days of the lock out period. 
- - This means that it's practical to limit the numFetchers to a number - below your cluster capacity, because then you can run other maintenance - jobs in parallel with the currently running fetch job (such as updatedb - and generate of next fetchlists). -
[Nutch Wiki] Update of Nutch2Architecture by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/Nutch2Architecture The comment on the change is: Changed DI for configuration and reflection utils -- == Overview == * Reuse of existing Nutch codebase * While some things will change, this architecture is more of a refactor than a complete re-write. Much of the existing codebase, including plugin functionality, should be reused. - * Dependency Injection - * Remove the plugin framework and use a DI framework, Spring for example, to create mapper and reducer classes that are auto injected with dependencies. This will take modifications to the Hadoop codebase. + * Remove the plugin framework + * After some experimenting, DI using Spring or another similar framework presents problems. The good news is that we can achieve the same thing using the configuration objects from Hadoop along with creating new instances using ReflectionUtils. This is more service locator than dependency injection, but it still gives us the same benefits. + * Have the ability to change the job configuration settings for tools. This can be accomplished through some type of properties file on the classpath and would be useful for testing, for example the ability to switch out an OutputFormat to see the output in text format. * Have mock objects that make it easy to test jobs. == Data Structures ==
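The configuration-plus-ReflectionUtils idea described above can be sketched in miniature. Here is a toy Python analog (the config key and class names are invented for the example; Nutch/Hadoop would use Configuration and ReflectionUtils.newInstance) of looking an implementation class up in configuration and instantiating it reflectively:

```python
import importlib

# Toy service locator: the configuration maps a role to a fully
# qualified class name, and an instance is created by reflection.

def new_instance(conf, key, default_class):
    """Look up a class name in conf (falling back to default_class)
    and instantiate it reflectively."""
    qualified = conf.get(key, default_class)
    module_name, _, class_name = qualified.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls()

# Swapping implementations becomes a configuration change, e.g. an
# output format could be switched to a text-based one for testing.
conf = {"writer.class": "collections.OrderedDict"}
writer = new_instance(conf, "writer.class", "builtins.dict")
```

This is exactly the "service locator rather than dependency injection" trade-off the note describes: callers pull implementations out of configuration instead of having them pushed in.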
[Nutch Wiki] Update of FetchCycleOverlap by OtisGospodnetic
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by OtisGospodnetic: http://wiki.apache.org/nutch/FetchCycleOverlap -- 2. run a fetcher job (uses M-2 maps and later all R reduces) - 3. at this point there are 2 open map slots for something else to run, say the updatedb job for the previously fetched/parsed segment + 3. at this point, while the fetch job is still running, there are 2 open map slots for something else to run, say the updatedb job for the previously fetched/parsed segment 4. when updatedb job is done the cluster can take on more jobs. Any completed tasks (C) from the running fetcher job represent open work slots
[Nutch Wiki] Update of GettingNutchRunningWithDebian by StevenHayles
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by StevenHayles: http://wiki.apache.org/nutch/GettingNutchRunningWithDebian The comment on the change is: Added installation of tomcat5.5-webapps; without it the home page is blank -- ''export JAVA_HOME''[[BR]] == Install Tomcat5.5 and Verify that it is functioning == - ''# apt-get install tomcat5.5 libtomcat5.5-java tomcat5.5-admin ''[[BR]] + ''# apt-get install tomcat5.5 libtomcat5.5-java tomcat5.5-admin tomcat5.5-webapps''[[BR]] Verify Tomcat is running:[[BR]] ''# /etc/init.d/tomcat5.5 status''[[BR]] ''#Tomcat servlet engine is running with Java pid /var/lib/tomcat5.5/temp/tomcat5.5.pid''[[BR]]
[Nutch Wiki] Update of FetchCycleOverlap by OtisGospodnetic
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by OtisGospodnetic: http://wiki.apache.org/nutch/FetchCycleOverlap New page: Without overlapping jobs people running Nutch are likely not utilizing their clusters fully. Thus, here is a recipe for overlapping jobs: 0. imagine a cluster with M max maps and R max reduces (say M=R=8) 1. run generate job with -numFetchers equal to M-2 2. run a fetcher job (uses M-2 maps and later all R reduces) 3. at this point there are 2 open map slots for something else to run, say the updatedb job for the previously fetched/parsed segment 4. when updatedb job is done the cluster can take on more jobs. Any completed tasks (C) from the running fetcher job represent open work slots 5. start another fetch job. This will be able to use only C tasks, but C will grow as the first job opens up more slots, eventually hitting M-2 open slots. 6. at some point, the fetch job from 2) above will complete, opening up 2 map slots, so updatedb can be run, even in the background, allowing the execution to go back to 1) Because a URL is locked out for 7 days after the generate step included it into a fetchlist, the above cycle needs to complete within 7 days. In more detail: Generate updates the CrawlDb so that urls selected for the latest fetchlist become locked out for the next 7 days. This means that you can happily generate multiple fetchlists, and fetch them out of order, and then do the DB updates out of order, as you see fit, so long as you make it within the 7 days of the lock out period. This means that it's practical to limit the numFetchers to a number below your cluster capacity, because then you can run other maintenance jobs in parallel with the currently running fetch job (such as updatedb and generate of next fetchlists).