[Nutch Wiki] Trivial Update of RunningNutchAndSolr by NickTkach
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by NickTkach: http://wiki.apache.org/nutch/RunningNutchAndSolr The comment on the change is: Changed fields for copyField line to correct values -- * Add the fields that Nutch needs (url, content, segment, digest, host, site, anchor, title, tstamp, text--see [http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html FooFactory Article on Nutch + Solr]) * Change defaultSearchField to 'text' * Change defaultOperator to 'AND' -* Add lines to copyField section to copy cat name into the text field +* Add lines to copyField section to copy anchor, title, and content into the text field 1. Start the Solr you just made (cd /tmp/mysolr; java -jar start.jar) 1. Run a Nutch crawl using the bin/crawl.sh script.
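For reference, the copyField change described in the entry above would look roughly like this in the Solr schema.xml. This is a sketch based only on the field names listed above (anchor, title, content, text); the rest of the schema is not shown in this digest:

```xml
<!-- copy the Nutch fields into the catch-all 'text' field used for searching -->
<copyField source="anchor" dest="text"/>
<copyField source="title" dest="text"/>
<copyField source="content" dest="text"/>
```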
[Nutch Wiki] Update of Nutch2Architecture by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/Nutch2Architecture New page: = Nutch 2.0 Architecture = The purpose of this page is to discuss ideas for the architecture of the next generation of Nutch.
[Nutch Wiki] Update of RunningNutchAndSolr by uygar bayar
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by uygar bayar: http://wiki.apache.org/nutch/RunningNutchAndSolr -- at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.indexer.SolrIndexer.main(SolrIndexer.java:93) --- + Sorry, but nothing changed!! Same as below.. + ERROR I changed the lines and it worked. But this time it gave this error. I tried both private and protected scopes but nothing changed. I also replaced the line Document doc = (Document) ((ObjectWritable) value).get(); with Document doc = (Document) ((NutchWritable) value).get(); this time it gave a build error..
[Nutch Wiki] Update of RunningNutchAndSolr by uygar bayar
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by uygar bayar: http://wiki.apache.org/nutch/RunningNutchAndSolr -- '''Troubleshooting:''' * If you get errors about Type mismatch in value from map: (expected ObjectWritable, but received NutchWritable), then you likely are missing the two steps I just added in step 9 above. Sorry about that, I forgot about making the change there in SolrIndexer. --- + ERROR I did everything but I got this error, any idea?? 2008-04-03 15:42:28,009 WARN mapred.LocalJobRunner - job_local_1 @@ -46, +47 @@ at org.apache.nutch.indexer.SolrIndexer.run(SolrIndexer.java:111) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.indexer.SolrIndexer.main(SolrIndexer.java:93) + --- + ERROR + I changed the lines and it worked. But this time it gave this error. I tried both private and protected scopes but nothing changed. + I also replaced the line Document doc = (Document) ((ObjectWritable) value).get(); with Document doc = (Document) ((NutchWritable) value).get(); this time it gave a build error.. + 2008-04-04 10:41:48,490 WARN mapred.LocalJobRunner - job_local_1 + + java.lang.ClassCastException: org.apache.nutch.indexer.Indexer$LuceneDocumentWrapper + at org.apache.nutch.indexer.SolrIndexer$OutputFormat$1.write(SolrIndexer.java:135) + at org.apache.hadoop.mapred.ReduceTask$2.collect(ReduceTask.java:315) + at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:275) + at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52) + at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333) + at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164) + 2008-04-04 10:41:49,085 FATAL indexer.Indexer - SolrIndexer: java.io.IOException: Job failed! 
+ at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894) + at org.apache.nutch.indexer.SolrIndexer.index(SolrIndexer.java:87) + at org.apache.nutch.indexer.SolrIndexer.run(SolrIndexer.java:112) + at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) + at org.apache.nutch.indexer.SolrIndexer.main(SolrIndexer.java:94) +
[Nutch Wiki] Trivial Update of RunningNutchAndSolr by NickTkach
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by NickTkach: http://wiki.apache.org/nutch/RunningNutchAndSolr The comment on the change is: Corrected line 3 of instructions (should have been nutch-trunk) -- 1. Check out solr-trunk and nutch-trunk 1. Go into the solr-trunk and run 'ant dist dist-solrj' - 1. Get zip from [http://variogram.com/latest/SolrIndexer.zip Variogr.am] and unzip it to solr-trunk + 1. Get zip from [http://variogram.com/latest/SolrIndexer.zip Variogr.am] and unzip it to nutch-trunk. 1. Copy apache-solr-solrj-1.3-dev.jar and apache-solr-common-1.3-dev.jar to nutch-trunk/lib 1. Get the zip file from [http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html FooFactory] for SOLR-20 1. Unzip solr-client.zip somewhere, go into java/solr/src and run 'ant'
[Nutch Wiki] Update of RunningNutchAndSolr by NickTkach
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by NickTkach: http://wiki.apache.org/nutch/RunningNutchAndSolr -- '''Troubleshooting:''' * If you get errors about Type mismatch in value from map: (expected ObjectWritable, but received NutchWritable), then you likely are missing the two steps I just added in step 9 above. Sorry about that, I forgot about making the change there in SolrIndexer. + * Note that I've changed the steps above. I was mistaken; you don't need the zip from Variogr.am. You only need the files from FooFactory. If you take that part out then you shouldn't see errors about ClassCastExceptions any more. --- ERROR I did everything but I got this error, any idea??
[Nutch Wiki] Update of RunningNutchAndSolr by uygar bayar
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by uygar bayar: http://wiki.apache.org/nutch/RunningNutchAndSolr -- If you watch the output from your Solr instance (logs) you should see a bunch of messages scroll by when Nutch finishes crawling and posts new documents. If not, then you've got something not configured right. I'll try to add more notes here as people have questions/issues. + --- + I did everything but i got this error any idea?? + + 2008-04-03 15:42:28,009 WARN mapred.LocalJobRunner - job_local_1 + java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.ObjectWritable, recieved org.apache.nutch.crawl.NutchWritable + at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:369) + at org.apache.nutch.indexer.Indexer.map(Indexer.java:344) + at org.apache.nutch.indexer.Indexer.map(Indexer.java:52) + at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) + at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208) + at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132) + 2008-04-03 15:42:28,609 FATAL indexer.Indexer - SolrIndexer: java.io.IOException: Job failed! + at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894) + at org.apache.nutch.indexer.SolrIndexer.index(SolrIndexer.java:86) + at org.apache.nutch.indexer.SolrIndexer.run(SolrIndexer.java:111) + at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) + at org.apache.nutch.indexer.SolrIndexer.main(SolrIndexer.java:93) +
[Nutch Wiki] Trivial Update of RunningNutchAndSolr by NickTkach
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by NickTkach: http://wiki.apache.org/nutch/RunningNutchAndSolr -- If you watch the output from your Solr instance (logs) you should see a bunch of messages scroll by when Nutch finishes crawling and posts new documents. If not, then you've got something not configured right. I'll try to add more notes here as people have questions/issues. + '''Troubleshooting:''' + * If you get errors about Type mismatch in value from map: (expected ObjectWritable, but received NutchWritable), then you likely are missing the two steps I just added in step 9 above. Sorry about that, I forgot about making the change there in SolrIndexer. --- I did everything but i got this error any idea??
[Nutch Wiki] Update of FrontPage by IOrtega
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by IOrtega: http://wiki.apache.org/nutch/FrontPage -- * [IndexStructure] * [Getting Started] * JavaDemoApplication - A simple demonstration of how to use the Nutch API in a Java application - + * InstallingWeb2 == Other Resources == * [http://nutch.sourceforge.net/blog/cutting.html Doug's Weblog] -- He's the one who originally wrote Lucene and Nutch. * [http://wiki.media-style.com/display/nutchDocu/Home Stefan's Nutch Documentation]
[Nutch Wiki] Update of InstallingWeb2 by IOrtega
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by IOrtega: http://wiki.apache.org/nutch/InstallingWeb2 New page: chris sleeman wrote: Hi, Can anyone tell me how to use the spell-check query plugin available in the contrib/web2 dir (and the rest of the plugins too)? Is it similar to enabling the nutch-plugins? Following these steps should get you there: 1. compile nutch (in the top-level dir, run ant) 2. crawl your data (see tutorial) 3. edit your conf/nutch-site.xml so it contains the plugins web-query-propose-spellcheck and webui-extensionpoints 4. edit conf/nutch-site.xml so it contains the proper dir for plugins, as the plugins are not packaged inside the .war (something like <property><name>plugin.folders</name><value>path to plugins dir</value></property>) 5. compile the web2 plugins (in contrib/web2 run ant compile-plugins) 6. edit search.jsp so it contains the line <tiles:insert definition="propose" ignore="true"/> just before the second c:choose 7. create the web2 app (in contrib/web2 run ant war) 8. build your spell-check index (bin/nutch plugin web-query-propose-spellcheck org.apache.nutch.spell.NGramSpeller -i indexdir -f content -o spelling) 9. deploy the webapp to tomcat 10. start tomcat (from the dir where you have your crawl data and the ngram index generated in #8) 11. search for something that is spelled incorrectly Also how do we build the spelling index? Are these plugins still WIP? I see #8 above; the whole web2 is MWSN (More Work Still Needed:) I haven't been able to find any docs on these. That's because there currently is no documentation other than the readme at http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/README.txt?view=markup I should probably put some documentation on the wiki to gain more traction. fyi - I just committed a small fix to a bug that might prevent the spell-checking proposer from working. So if you have problems, check out the trunk or a nightly build tomorrow. -- Sami Siren
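The plugin.folders property mentioned in step 4 above would look roughly like this in conf/nutch-site.xml. The path below is a placeholder; point it at wherever the compiled web2 plugins actually live on your machine:

```xml
<property>
  <name>plugin.folders</name>
  <!-- placeholder path: the dir containing the compiled web2 plugins -->
  <value>/path/to/nutch/contrib/web2/plugins</value>
</property>
```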
[Nutch Wiki] Update of FAQ by MarkDeSpain
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by MarkDeSpain: http://wiki.apache.org/nutch/FAQ The comment on the change is: changed answer for authentication question from Unkown to existing Wiki link -- How can I fetch pages that require Authentication? - Unknown. + See HttpAuthenticationSchemes. === Updating ===
[Nutch Wiki] Update of FrontPage by DanFrost
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DanFrost: http://wiki.apache.org/nutch/FrontPage The comment on the change is: Sorry - our server was moved last night so the site was down. Pls view our site. -- * [http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E Tutorial Hadoop+Nutch 0.8 night build Roberto Navoni 24-07-06] * [http://blog.foofactory.fi/ FooFactory] Nutch and Hadoop related posts * [http://spinn3r.com Spinn3r] [http://spinn3r.com/opensource.php Open Source components] (our contribution to the crawling OSS community with more to come). + * [http://www.interadvertising.co.uk/blog/nutch_logos Larger / better quality Nutch logos] Re-created Nutch logos available in GIF, PNG EPS in resolutions up to 1200 x 449
[Nutch Wiki] Trivial Update of FrontPage by DanFrost
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DanFrost: http://wiki.apache.org/nutch/FrontPage The comment on the change is: How complicated can it be? I'm so dumb... put 2 x http:// in there -- * [http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E Tutorial Hadoop+Nutch 0.8 night build Roberto Navoni 24-07-06] * [http://blog.foofactory.fi/ FooFactory] Nutch and Hadoop related posts * [http://spinn3r.com Spinn3r] [http://spinn3r.com/opensource.php Open Source components] (our contribution to the crawling OSS community with more to come). - * [http://http://www.interadvertising.co.uk/blog/nutch_logos Larger / better quality Nutch logos] Re-created Nutch logos available in GIF, PNG EPS in resolutions up to 1200 x 449 + * [http://www.interadvertising.co.uk/blog/nutch_logos Larger / better quality Nutch logos] Re-created Nutch logos available in GIF, PNG EPS in resolutions up to 1200 x 449
[Nutch Wiki] Update of PublicServers by PeterRaines
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by PeterRaines: http://wiki.apache.org/nutch/PublicServers -- * [http://www.jcintersonic.com/ JC Intersonic] uses nutch as its search engine. + * [http://www.jumblefox.com.au/ Jumble Fox] - The Australian Search Engine + * [http://krugle.com Krugle] uses Nutch to crawl web pages for code, archives and technically-interesting content. We also use a modified version of Nutch to crawl CVS/Subversion repositories, and the NutchBean/distributed searcher support to search and generate hits for code and tech info queries. * [http://www.labforculture.org LabforCulture] - The essential tool for everyone in arts and culture who creates, collaborates, shares and produces across borders in Europe.
[Nutch Wiki] Update of WritingPluginExample by JasperKamperman
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JasperKamperman: http://wiki.apache.org/nutch/WritingPluginExample -- - ''This was written for the 0.7 branch. For an example using the 0.8 code, see [wiki:WritingPluginExample-0.8 this page]'' + ''This was written for the 0.7 branch. For an example using the 0.8 code, see [wiki:WritingPluginExample-0.8 this page]''. For an example using the 0.9 code, see [wiki:WritingPluginExample-0.9 this page]. == The Example ==
[Nutch Wiki] Trivial Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: fixed typo -- However, an important thing to note here is that if some page of example:8080 requires authentication in another realm, say, 'mail', authentication would not be done even though the second set of credentials is defined as default. Of course this doesn't affect authentication for other web servers and the default authscope would be used for other web-servers. This problem occurs only for those web-servers which have authentication scopes defined for a few selected realms/schemes. This is discussed in next section. === Catch-all Authentication Scope for a Web Server === - When one or more authentication scopes are defined for a particular web server (host:port), then the default credentials is ignored for that host:port combination. Therefore, an catch-all authentication scope to handle all other realms and scopes must be specified explicitly as shown below. + When one or more authentication scopes are defined for a particular web server (host:port), then the default credentials is ignored for that host:port combination. Therefore, a catch-all authentication scope to handle all other realms and scopes must be specified explicitly as shown below. {{{ credentials username=susam password=masus
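The configuration fragment quoted above is cut off in this digest. Pieced together from the surrounding description (credentials tag with authscope children), a realm-specific scope plus a catch-all scope for the same web server might look like the sketch below; the host, port, and realm values are placeholders, not values from the original page:

```xml
<credentials username="susam" password="masus">
  <!-- scope for one specific realm on this host:port -->
  <authscope host="example" port="8080" realm="sso"/>
  <!-- catch-all scope: no realm/scheme, so it handles all other
       realms and schemes on this host:port -->
  <authscope host="example" port="8080"/>
</credentials>
```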
[Nutch Wiki] Update of NutchTutorial by MarioMendez
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by MarioMendez: http://wiki.apache.org/nutch/NutchTutorial -- {{{ http://lucene.apache.org/nutch/ }}} - * Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read: + * Edit the file conf/crawl-urlfilter.txt (it worked for me when I used the file conf/regex-urlfilter.txt instead) and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read: {{{ +^http://([a-z0-9]*\.)*apache.org/ }}}
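A quick way to sanity-check the filter line from the step above is to try the same pattern with grep -E, whose extended-regex syntax behaves the same as the Java regex for this simple pattern. The URLs below are just examples:

```shell
# Pattern copied from the tutorial step above (the unescaped dot in
# "apache.org" matches any character, which is harmless here).
filter='^http://([a-z0-9]*\.)*apache.org/'

# A URL inside the apache.org domain is accepted...
echo "http://lucene.apache.org/nutch/" | grep -Eq "$filter" && echo "accepted"
# ...while one outside it is not.
echo "http://www.example.com/" | grep -Eq "$filter" || echo "rejected"
```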
[Nutch Wiki] Trivial Update of GettingNutchRunningOnCygwin by MatMcGowan
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by MatMcGowan: http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin -- A further problem is the path setting for the sshd service. After ssh'ing in, the Windows hostname is still used (although on my installation $PATH appeared correct, e.g. {{{ssh localhost echo $PATH; type hostname}}} showed a correct path with {{{/usr/bin:/bin}}} in the path before the Windows directories, yet type was finding the Windows version of the file.) - To fix the path setting, add the PATH environment variable to the sshd service. Under the key {{{HKLM\System\CurrentControlSet\Services\sshd\Parameters\Environment}}} create a new string value PATH, set to {{{/usr/bin:/usr/lib:/bin:$PATH}}}. After restarting the sshd service, return the tests above, and the {{{^M}}} should no longer be present. + To fix the path setting, add the PATH environment variable to the sshd service. Under the key {{{HKLM\System\CurrentControlSet\Services\sshd\Parameters\Environment}}} create a new string value PATH, set to {{{/usr/bin:/usr/lib:/bin:%PATH%}}}. This prepends /usr/bin etc. to the PATH defined in Windows. After restarting the sshd service, rerun the tests above, and the {{{^M}}} should no longer be present. See also http://www.cygwin.com/ml/cygwin/2007-07/msg00045.html
[Nutch Wiki] Update of GettingNutchRunningOnCygwin by MatMcGowan
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by MatMcGowan: http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin New page: = Problems and workarounds for running nutch on cygwin = I followed the NutchHadoopTutorial and encountered a few problems which, after looking for solutions, seem to have affected others trying to run nutch on cygwin. == Line Endings == The tutorial mentions using dos2unix for all commands, i.e. {{{ dos2unix /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch dos2unix /nutch/search/conf/*.sh }}} But this also applies to the slaves (and master) files, i.e. {{{ dos2unix /nutch/search/conf/slaves }}} Without this, the hostname passed to ssh will include a trailing '\r', producing a "no address associated with name" error. == Logging files == The hostname command is used to construct logfile names. This command is included in Windows and in cygwin. When the Windows version is used, an additional '\r' is included in the command output, causing the logfile name to be an invalid filename. Errors such as "Head: cannot open filename for reading: no such file or directory" occur, even though the name looks ok. You can see the problem first hand by running {{{ ssh localhost hostname | cat -v }}} The output will include {{{^M}}} after the hostname. The first cause of the problem is that hostname is not installed under cygwin. To get this, install coreutils from the base category. A further problem is the path setting for the sshd service. After ssh'ing in, the Windows hostname is still used (although on my installation $PATH appeared correct, e.g. {{{ssh localhost echo $PATH; type hostname}}} showed a correct path with {{{/usr/bin:/bin}}} in the path before the Windows directories, yet type was finding the Windows version of the file.) To fix the path setting, add the PATH environment variable to the sshd service. 
Under the key {{{HKLM\System\CurrentControlSet\Services\sshd\Parameters\Environment}}} create a new string value PATH, set to {{{/usr/bin:/usr/lib:/bin:$PATH}}}. After restarting the sshd service, rerun the tests above, and the {{{^M}}} should no longer be present. See also http://www.cygwin.com/ml/cygwin/2007-07/msg00045.html == Impersonated SSH Account == When using the cygwin sshd, it is necessary to first ssh in before running NDFS commands (e.g. bin/hadoop dfs -put urls urls.) This is to ensure the current user account is consistent with later ssh sessions. (Even if you ssh in as the same user you are running locally, the sshd service may use a different user account.) With my setup, I had a nutch shortcut to cygwin.bat that was started using runas.exe, to launch the nutch user. NDFS commands would then write files to /user/nutch. But, after running ssh, supposedly logging in as the same user, NDFS stores files under /user/sshd, because the current user account was in fact the sshd account. On Windows 2003 Server, cygwin sshd is not able to log in users under their actual account. If you ssh in and enter the command {{{ %SystemRoot%\System32\whoami.exe }}} it will display the account name running the sshd service, and not the user you expected (e.g. nutch.) If you simply type {{{ whoami }}} then it will print nutch or whichever user you ssh'ed in as. When HDFS runs, it is the native username that it sees, i.e. the account running the sshd service. Logging in first with ssh before doing anything with NDFS ensures that all files are created using the same account name in the HDFS hierarchy (the sshd service account.) Not doing this, the first files created by local commands (e.g. hadoop dfs -put urls) will go to the nutch user (or the user running the cygwin shell) while subsequent commands run on remote machines will go under the sshd account folder in HDFS. As a result of this, the sshd account also needs write access to the nutch folder.
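Condensed, the registry change described above amounts to the single value below, set with regedit. Note the follow-up fix earlier in this digest: the data should end in %PATH% rather than $PATH, so that the existing Windows PATH is appended:

```
Key:   HKLM\System\CurrentControlSet\Services\sshd\Parameters\Environment
Value: PATH  (REG_SZ)
Data:  /usr/bin:/usr/lib:/bin:%PATH%
```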
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: role of http.agent.host in NTLM and patch committed -- == Necessity == There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' did not support HTTP 1.1, HTTPS, or the NTLM, Basic, and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication, but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems and to provide additional features such as authentication support for proxy servers and better inline documentation for the properties used to configure authentication. The author (Susam Pal) of these features has tested them at Infosys Technologies Limited by crawling a corporate intranet requiring NTLM authentication, and this has been found to work well. - == Download == - Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk. The latest patch is named as [https://issues.apache.org/jira/secure/attachment/12370428/NUTCH-559v0.5.patch NUTCH-559v0.5.patch]. + == JIRA NUTCH-559 == + These features were submitted as [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559]. If you have checked out the latest Nutch trunk, you don't need to apply the patches. These features were included in the Nutch subversion repository in [http://svn.apache.org/viewvc?view=rev&revision=608972 revision #608972]. == Introduction to Authentication Scope == Different credentials for different authentication scopes can be configured in 'conf/httpclient-auth.xml'. If a set of credentials is configured for a particular authentication scope (i.e. 
particular host, port number, realm and/or scheme), then that set of credentials would be sent only to pages falling under the specified authentication scope. @@ -82, +82 @@ 1. For the authscope tag, the 'host' and 'port' attributes should always be specified. The 'realm' and 'scheme' attributes may or may not be specified depending on your needs. If you are tempted to omit the 'host' and 'port' attributes because you want the credentials to be used for any host and any port for that realm/scheme, please use the 'default' tag instead. That's what the 'default' tag is meant for. 1. One authentication scope should not be defined twice as different authscope tags for different credentials tags. However, if this is done by mistake, the credentials for the last defined authscope tag would be used. This is because the XML parsing code reads the file from top to bottom and sets the credentials for authentication scopes. If the same authentication scope is encountered once again, it will be overwritten with the new credentials. However, one should not rely on this behavior as it might change with further development. 1. Do not define multiple authscope tags with the same host and port but different realms if the server requires NTLM authentication. This means there should not be multiple tags with the same host, port, and scheme=NTLM but different realms. If you are omitting the scheme attribute and the server requires NTLM authentication, then there should not be multiple tags with the same host and port but different realms. This is discussed more in the next section. + 1. If you are using the NTLM scheme, you should also set the 'http.agent.host' property in conf/nutch-site.xml === A note on NTLM domains === - NTLM does not use the concept of realms. Therefore, multiple realms for a web-server can not be defined as different authentication scopes for the same web-server requiring NTLM authentication. 
There should be exactly one authscope tag for NTLM scheme authentication scope for a particular web-server. The authentication domain should be specified as the value of the 'realm' attribute. + NTLM does not use the concept of realms. Therefore, multiple realms for a web-server can not be defined as different authentication scopes for the same web-server requiring NTLM authentication. There should be exactly one authscope tag for NTLM scheme authentication scope for a particular web-server. The authentication domain should be specified as the value of the 'realm' attribute. NTLM authentication also requires the name or IP address of the host on which the crawler is running. Thus, 'http.agent.host' should be set properly. == Underlying HttpClient Library == 'protocol-httpclient' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for
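Putting the NTLM notes above together, a configuration sketch might look like the fragments below. The host, domain, and credentials are placeholders; the element and attribute names follow the description in the entry above (one authscope per server, NTLM domain in 'realm', plus http.agent.host in nutch-site.xml):

```xml
<!-- conf/httpclient-auth.xml: exactly one authscope tag per web server
     for NTLM, with the NTLM domain given as the 'realm' attribute -->
<credentials username="user" password="pass">
  <authscope host="intranet" port="80" realm="MYDOMAIN" scheme="NTLM"/>
</credentials>

<!-- conf/nutch-site.xml: name or IP address of the host running the crawler -->
<property>
  <name>http.agent.host</name>
  <value>crawler.example.com</value>
</property>
```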
[Nutch Wiki] Update of FrontPage by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/FrontPage The comment on the change is: Added Upgrading Hadoop in Nutch -- * [Nutch - The Java Search Engine] (Builds on the basic tutorials. Includes index maintenance scripts) * [:NutchHadoopTutorial:Nutch Hadoop Tutorial] - How to setup Nutch and Hadoop over a cluster of machines * [:Automating_Fetches_with_Python:Automating Fetches with Python] - How to automate the Nutch fetching process using Python + * [:Upgrading_Hadoop:Upgrading Hadoop Version in Nutch] - Basic steps for upgrading Hadoop in Nutch. * [FAQ] * [:CommandLineOptions:Commandline] options for 0.7.x * [:08CommandLineOptions:Commandline] options for version 0.8
[Nutch Wiki] Update of Upgrading Hadoop by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/Upgrading_Hadoop The comment on the change is: Added initial document New page: The purpose of this document is to show how to upgrade the version of Hadoop within Nutch. * Download the latest release version of Hadoop. It is preferred to download a release version instead of building from source because the release contains the Hadoop binaries and is built by Hadoop QA. If you build from source you will need to build the C libs. Also remember that the name used to build Hadoop will appear in the Hadoop admin screens. If you are upgrading Hadoop for the Nutch release, it is preferred to download the latest binary release of Hadoop. * Unzip the release and copy the lib/native/* directories into your clean Nutch trunk workspace under trunk/lib/native, where trunk is the root of the Nutch trunk. You will also want to copy the hadoop-core jar from the root of the Hadoop release into the trunk/lib directory. * Remove the *.la files from the trunk/lib/native/OS directories (ex. trunk/lib/native/Linux-i386-32/libhadoop.la). These are just script files and are not needed for the release. You will also want to remove any older versions of the hadoop-core jar from the trunk/lib directory. * If there are any errors or code that needs to be changed because of Hadoop API upgrades, that would need to happen here. * Do a full clean and build of Nutch through the ant clean and package targets. * Run the full test suite for Nutch using the ant test target. * It is best to run a few full fetches and indexes using the new Hadoop version. If this is not possible, see if you can build a drop and allow others to run some fetches. It is best to do this using Nutch in a distributed mode. * Once all tests have passed and a few fetch cycles have been run, post a patch with the relevant changes. 
Then, following the standard commit rules for wait time before commit, you can commit to the Nutch repository. Make sure to change the trunk/CHANGES.txt file to reflect the Hadoop upgrade and any significant Hadoop API changes that may have occurred.
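The copy-and-clean steps above can be sketched as the shell commands below. The paths and the Hadoop version are placeholders, and a throwaway mock layout is created first so the commands can be tried safely; in a real upgrade you would point the variables at the unpacked Hadoop release and your Nutch trunk checkout:

```shell
# Placeholder locations (mocked here so the commands are safe to run as-is)
HADOOP_RELEASE=/tmp/mock-hadoop-release
NUTCH_TRUNK=/tmp/mock-nutch-trunk
rm -rf "$HADOOP_RELEASE" "$NUTCH_TRUNK"
mkdir -p "$HADOOP_RELEASE/lib/native/Linux-i386-32" "$NUTCH_TRUNK/lib/native"
touch "$HADOOP_RELEASE/hadoop-0.15.3-core.jar" \
      "$HADOOP_RELEASE/lib/native/Linux-i386-32/libhadoop.so" \
      "$HADOOP_RELEASE/lib/native/Linux-i386-32/libhadoop.la"

# Copy the native lib directories and the hadoop-core jar into the trunk
cp -r "$HADOOP_RELEASE"/lib/native/* "$NUTCH_TRUNK/lib/native/"
cp "$HADOOP_RELEASE"/hadoop-*-core.jar "$NUTCH_TRUNK/lib/"

# Drop the libtool *.la script files; they are not needed for the release
rm -f "$NUTCH_TRUNK"/lib/native/*/*.la
```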
[Nutch Wiki] Update of PublicServers by FuadEfendi
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FuadEfendi: http://wiki.apache.org/nutch/PublicServers The comment on the change is: I used source code: HTTP protocol, URL filtering normalization, parser, robots -- * [http://www.synoo.com:8080 Synoo.com] is a small web search engine + * [http://www.tokenizer.org Tokenizer] is an online shopping search engine partially powered by Nutch + * [http://www.utilitysearch.info/ UtilitySearch] is a search engine for the regulated utility industries (Electricity, Water, Gas, and Telecommunications) in the United States and Canada.
[Nutch Wiki] Update of JavaApplication by ChazHickman
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ChazHickman: http://wiki.apache.org/nutch/JavaApplication -- = Integrating Nutch search functionality into a Java application = + + This example is the fruit of much searching of the nutch users mailing list in order to get a working application that used the Nutch APIs. I couldn't find all that was needed to provide a quick-start in one place, so this document was born... Using Nutch within an application is actually very simple; the requirements are merely the existence of a previously created crawl index, a couple of settings in a configuration file, and a handful of jars in your classpath. Nothing else is needed from the Nutch release that you can download. @@ -75, +77 @@ } }}} + Chaz Hickman (Jan 2008) +
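As a minimal sketch of the kind of code this page describes, here is the legacy Nutch 0.9 searcher API in use. It assumes the Nutch and Hadoop jars are on the classpath and that searcher.dir in your configuration points at the crawl directory; the query string and hit count are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class SearchSketch {
  public static void main(String[] args) throws Exception {
    // Reads nutch-default.xml / nutch-site.xml; searcher.dir must
    // point at the previously created crawl directory.
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);

    Query query = Query.parse("apache", conf);  // query string is illustrative
    Hits hits = bean.search(query, 10);         // top 10 hits

    for (int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      HitDetails details = bean.getDetails(hit);
      System.out.println(details.getValue("url") + " : "
          + details.getValue("title"));
    }
  }
}
```

This matches the 0.8/0.9-era searcher API referenced by this wiki page; later Nutch versions replaced it.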
[Nutch Wiki] Update of AboutPlugins by ViksitGaur
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ViksitGaur: http://wiki.apache.org/nutch/AboutPlugins The comment on the change is: Added a short note on pluginRepository -- In order to get Nutch to use a given plugin, you need to edit your conf/nutch-site.xml file and add the name of the plugin to the list of plugin.includes. + == Using a Plugin From The Command Line == + + Nutch ships with a number of plugins that include a main() method, and sample code to illustrate their use. These plugins can be used from the command line - a good way to start exploring the internal workings of each plugin. + + To do so, you need to use the bin/nutch script from the $NUTCH_HOME directory, + + $ bin/nutch plugin + Usage: PluginRepository pluginId className [arg1 arg2 ...] + + As an example, if you wanted to execute the parse-html plugin, + + $ bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser filename.html + + Here, pluginId is the name of the plugin itself, and className is the fully qualified name of the plugin class to run. + + See also: WritingPluginExample See also: HowToContribute
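For reference, the plugin.includes override mentioned above is an ordinary property block in conf/nutch-site.xml. The value below is only illustrative (it mirrors examples elsewhere on this wiki); your own list must name every plugin your crawl needs.

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```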
[Nutch Wiki] Update of Support by fl
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by fl: http://wiki.apache.org/nutch/Support -- * Stefan Groschupf sg at media-style.com * Michael Nebel mn at nebel.de (germany preferred) * Objects Search + * [http://www.ingate.de INGATE GmbH] * [http://www.intrafind.de IntraFind Software AG] * Michael Rosset mrosset at btmeta.com * Supreet Sethi supreet at linux-delhi.org (india preferred)
[Nutch Wiki] Update of luca facoetti by luca facoetti
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by luca facoetti: http://wiki.apache.org/nutch/luca_facoetti New page: #format wiki #language fr == Help page template == Text. === Example === {{{ xxx }}} === Display === xxx
[Nutch Wiki] Update of NonDefaultIntranetCrawlingOptions by JasonKull
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JasonKull: http://wiki.apache.org/nutch/NonDefaultIntranetCrawlingOptions New page: ##language:en == Options for intranet crawling that are not enabled by default == Here are some options, not enabled by default, that you might want to add to your conf/nutch-site.xml configuration file if you plan on crawling your local network intranet. === Enable additional parser plugins === {{{
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|rss|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
}}} This will enable the parser plugins for text, html, javascript, pdf, excel, powerpoint, word, rss and zip. There are additional parsers you can enable, which are listed in conf/parse-plugins.xml. If you have additional document types you wish to parse and they are listed in the parse-plugins file, just add them to the list. === Increase the file size fetch limit === {{{
<property>
  <name>http.content.limit</name>
  <value>2097152</value>
</property>
}}} This will increase the default file size fetching limit to 2 megabytes. If your documents are larger (such as PDFs) then increase the number appropriately.
[Nutch Wiki] Trivial Update of RunNutchInEclipse0.9 by JakeVanderdray
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JakeVanderdray: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 -- === Create a new java project in Eclipse === * File New Project Java project click Next + * Name the project (Nutch_Trunk for instance) - * select Create project from existing source and use the location where you downloaded Nutch + * Select Create project from existing source and use the location where you downloaded Nutch - * click on Next, and wait while Eclipse is scanning the folders + * Click on Next, and wait while Eclipse is scanning the folders - * add the folder conf to the classpath (third tab and then add class folder) + * Add the folder conf to the classpath (third tab and then add class folder) * Eclipse should have guessed all the java files that must be added on your classpath. If it's not the case, add src/java, src/test and all plugin src/java and src/test folders to your source folders. Also add all jars in lib and in the plugin lib folders to your libraries - * set output dir to tmp_build, create it if necessary + * Set output dir to tmp_build, create it if necessary * DO NOT add build to classpath
[Nutch Wiki] Update of Becoming A Nutch Developer by JakeVanderdray
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JakeVanderdray: http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer The comment on the change is: Small fixes (mostly spelling) -- Me }}} - With this type of question other users would have no idea what the problem is or how to help and therefore most simply ignore the question and move on. Other the other hand here is better example of asking questions. + With this type of question other users would have no idea what the problem is or how to help and therefore most simply ignore the question and move on. On the other hand, here is a better example of asking questions. __A Good Email__ {{{ @@ -97, +97 @@ * [http://www.mail-archive.com/index.php?hunt=nutch Nutch Mail Archive] * [http://www.nabble.com/forum/Search.jtp?query=nutch Nabble Nutch] - When searching the list for errors you have recieved it is good to search both by component, for example fetcher, and by the actual error recieved. If you are not finding the answers you are looking for on the list, you may want to move to the JIRA and search there for answers. + When searching the list for errors you have received it is good to search both by component, for example fetcher, and by the actual error received. If you are not finding the answers you are looking for on the list, you may want to move to the JIRA and search there for answers. - Here are some other important things to remember about the mailing lists. First, do not cross post questions. Find the best list for you question and post your it to that list only. Posting the same question to multiple lists (i.e. user and dev) tends to annoy the very people you are wanting to help you. Second, remember that developers and committers have day jobs and deadlines also and that being rude, offensive, or aggressive is a sure way to get your posting ignored if not flamed.
+ Here are some other important things to remember about the mailing lists. First, do not cross post questions. Find the best list for your question and post it to that list only. Posting the same question to multiple lists (i.e. user and dev) tends to annoy the very people you want help from. Second, remember that developers and committers have day jobs and deadlines also and that being rude, offensive, or aggressive is a sure way to get your posting ignored if not flamed. Most questions on the lists are answered within a day. If you ask a question and it is not answered for a couple of days, do not repost the same question. Instead, you may need to reword your question, provide more information, or give a better description in the subject. Step Two: Learning the Nutch Source Code I have found that when teaching new developers the basics of the Nutch source code it is easiest to first start with learning the operations of a full crawl from start to finish. - A word about Hadoop. As soon as you start looking into Nutch code (versions .8 or higher) you will be looking at code that uses and extends Hadoop APIs. Learning the Hadoop source code base is as big an endeavour as learning the Nutch codebase, but because of how much Nutch relies on Hadoop, anyone serious about Nutch develop will also need to learn the Hadoop codebase. + A word about Hadoop. As soon as you start looking into Nutch code (versions .8 or higher) you will be looking at code that uses and extends Hadoop APIs. Learning the Hadoop source code base is as big an endeavor as learning the Nutch codebase, but because of how much Nutch relies on Hadoop, anyone serious about Nutch development will also need to learn the Hadoop codebase. First, start by getting Nutch up and running and completing a full process of fetching through indexing. There are tutorials on this wiki that show how to do this. Second, get Nutch set up to run in an integrated development environment such as Eclipse.
There are also tutorials that show how to accomplish this. Once this is done you should be able to run individual Nutch components inside of a debugger. This is essential because probably the fastest way to learn the Nutch codebase is to step through different components in a debugger. @@ -177, +177 @@ Then please be patient and in the meantime start working on another issue. Committers are busy people too. If no one responds to your patch after a few days, please make friendly reminders to the dev mailing list. Please incorporate others' suggestions into your patch if you think they're reasonable. - Now here is the hard part. Even if you have completed your patch it may not make it into the final Nutch codebase. This could be for any number of reasons, but most often it is because the piece of functionality is not in line with
[Nutch Wiki] Update of NutchHadoopTutorial by WillPugh
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by WillPugh: http://wiki.apache.org/nutch/NutchHadoopTutorial -- * When you first start up hadoop, there's a warning in the namenode log, dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove e:/dev/nutch-0.8/filesystem/mapreduce/.system.crc because it does not exist - You can ignore that. * If you get errors like, failed to create file [...] on client [foo] because target-length is 0, below MIN_REPLICATION (1) this means a block could not be distributed. Most likely there is no datanode running, or the datanode has some severe problem (like the lock problem mentioned above). + + + * This tutorial worked well for me; however, I ran into a problem where my crawl wasn't working. It turned out it was because I needed to set the user agent and other properties for the crawl. If anyone is reading this and running into the same problem, look at the updated tutorial http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial?highlight=%28hadoop%29%7C%28tutorial%29 +
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: root element omitted - restructured the sentence -- When authentication is required to fetch a resource from a web-server, the authentication-scope is determined from the host and port obtained from the URL of the page. If it matches any 'authscope' in this configuration file, then the 'credentials' for that 'authscope' are used for authentication. == Configuration == - Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' is very brief, therefore this section would explain it in a little more detail. The root element is auth-configuration for all the examples below which has been omitted for the sake of clarity. + Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' are very brief, this section explains it in a little more detail. In all the examples below, the root element auth-configuration has been omitted for the sake of clarity. === Crawling an Intranet with Default Authentication Scope === Let's say all pages of an intranet are protected by basic, digest or ntlm authentication and there is only one set of credentials to be used for all web pages in the intranet; then a configuration as described below is enough. This is also the simplest configuration possible for authentication schemes.
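Putting the pieces of this section together, here is a sketch of a scope-specific entry in conf/httpclient-auth.xml. The host, port and realm values are made up for illustration, and the auth-configuration root element is omitted here just as in the page's other examples.

```xml
<credentials username="susam" password="masus">
  <authscope host="192.168.101.33" port="80" realm="intranet"/>
</credentials>
```

Requests to that host/port/realm would use these credentials; any other authentication scope falls through to whatever other credentials (or default scope) are configured.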
[Nutch Wiki] Update of FAQ by DanielNaber
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DanielNaber: http://wiki.apache.org/nutch/FAQ The comment on the change is: mention OPICScoringFilter -- You can tweak your conf/common-terms.utf8 file after creating an index through the following command: bin/nutch org.apache.nutch.indexer.HighFreqTerms -count 10 -nofreqs index - What ranking algorithm is used in searches ? Does Nutch use the [http://en.wikipedia.org/wiki/HITS_algorithm HITS algorithm] ? - - N/A yet - How is scoring done in Nutch? (Or, explain the explain page?) - Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does it. The formula Lucene uses scoring can be found at the head of the Lucene Similarity class in the [http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html Lucene Similarity Javadoc]. Roughly, the score for a particular document in a set of query results, score(q,d), is the sum of the score for each term of a query (t in q). A terms score in a document is itself the sum of the term run against each field that comprises a document (title is one field, url another. A document is a set of fields). Per field, the score is the product of the following factors: Its td (term freqency in the document), a score factor idf (usually a factor made up of frequency of term relative to amount of docs in index), an index-time boost, a normalization of count of terms found relative to size of document (lengthNorm), a similar normalization is done for the term in the query i tself (queryNorm), and finally, a factor with a weight for how many instances of the total amount of terms a particular document contains. Study the lucene javadoc to get more detail on each of the equation components and how they effect overall score. + Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does it. 
The formula Lucene uses for scoring can be found at the head of the Lucene Similarity class in the [http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html Lucene Similarity Javadoc]. Roughly, the score for a particular document in a set of query results, score(q,d), is the sum of the score for each term of a query (t in q). A term's score in a document is itself the sum of the term run against each field that comprises a document (title is one field, url another; a document is a set of fields). Per field, the score is the product of the following factors: its tf (term frequency in the document), a score factor idf (usually a factor made up of the frequency of the term relative to the number of docs in the index), an index-time boost, a normalization of the count of terms found relative to the size of the document (lengthNorm), a similar normalization done for the term in the query itself (queryNorm), and finally a factor weighting how many of the query's terms a particular document contains (coord). Study the Lucene javadoc to get more detail on each of the equation components and how they affect the overall score. Interpreting the Nutch explain.jsp, you need to keep the above-cited Lucene scoring equation in mind. First, notice how we move right as we move from the score total, to the score per query term, to the score per query document field (a document field is not shown if a term was not found in that field). Next, studying a particular field's scoring, it comprises a query component and then a field component. The query component includes the query-time -- as opposed to index-time -- boost, an idf that is the same for the query and field components, and then a queryNorm. Similar for the field component (fieldNorm is an aggregation of certain of the Lucene equation components). How can I influence Nutch scoring? + Scoring is implemented as a filter plugin, i.e. an implementation of the !ScoringFilter class.
By default, [http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/scoring/opic/OPICScoringFilter.html OPICScoringFilter] is used. + - The easiest way to influence scoring is to change query time boosts (Will require edit of nutch-site.xml and redeploy of the WAR file). Query-time boost by default looks like this:{{{ + However, the easiest way to influence scoring is to change query-time boosts (this requires editing nutch-site.xml and redeploying the WAR file). Query-time boost by default looks like this:{{{ query.url.boost, 4.0f query.anchor.boost, 2.0f query.title.boost, 1.5f
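The prose walk-through of Lucene scoring above can be condensed into the classic Lucene practical scoring function; this is a sketch of the factoring given in the Similarity javadoc linked above, with coord being the factor weighting how many of the query's terms a document matches:

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)
```

Here norm(t,d) rolls up the index-time boost and lengthNorm factors described in the FAQ answer; consult the javadoc for the exact grouping.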
[Nutch Wiki] Update of FAQ by DanielNaber
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DanielNaber: http://wiki.apache.org/nutch/FAQ The comment on the change is: small cleanup -- Are there any mailing lists available? - There's a user, developer, commits and agents lists, all available at http://lucene.apache.org/nutch/mailing_lists.html#Agents . + There's a user, developer, commits and agents lists, all available at http://lucene.apache.org/nutch/mailing_lists.html. - - Is there a mail archive? - - Yes: http://www.mail-archive.com/nutch-user%40lucene.apache.org/maillist.html or http://www.nabble.com/Nutch-f362.html . How can I stop Nutch from crawling my site? - Please visit öur [http://lucene.apache.org/nutch/bot.html webmaster info page] + Please visit our [http://lucene.apache.org/nutch/bot.html webmaster info page] Will Nutch be a distributed, P2P-based search engine? @@ -29, +25 @@ Will Nutch use a distributed crawler, like Grub? - Distributed crawling can save download bandwidth, but, in the long run, the savings is not significant. A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages, so making the crawler use less bandwidth does not reduce overall bandwidth requirements. The dominant expense of operating a search engine is not crawling, but searching. + Distributed crawling can save download bandwidth, but, in the long run, the savings is not significant. A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages, so making the crawler use less bandwidth does not reduce overall bandwidth requirements. The dominant expense of operating a large search engine is not crawling, but searching. Won't open source just make it easier for sites to manipulate rankings? @@ -58, +54 @@ nutch-site.xml is where you make the changes that override the default settings. 
The same goes for the servlet container application. - My system does not find the segments folder. Why? OR How do I tell the ''Nutch Servlet'' where the index file are located? + My system does not find the segments folder. Why? Or: How do I tell the ''Nutch Servlet'' where the index files are located? There are at least two choices to do that: @@ -95, +91 @@ What happens if I inject urls several times? - Urls, which are already in the database, won't be injected. + Urls which are already in the database won't be injected. === Fetching ===
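As a sketch of the override mechanism the FAQ answer above describes, a single property block in nutch-site.xml shadows the property of the same name in nutch-default.xml (the agent name value here is a placeholder):

```xml
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>
```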
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: re-writing document as per latest v0.5 patch -- 'protocol-httpclient' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server. == Necessity == - There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-httpclient' and use its authentication features. The author (Susam Pal) of these features has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well. + There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'httpclient-auth.xml' to enable 'protocol-httpclient' and use its authentication features. 
The author (Susam Pal) of these features has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well. == Download == - Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk. + Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk. The latest patch is [https://issues.apache.org/jira/secure/attachment/12370428/NUTCH-559v0.5.patch NUTCH-559v0.5.patch]. == Configuration == - This is an advanced feature that lets the user specify different credentials for different authentication scopes. This section does not describe the default configuration. Some parts of this section might be outdated. It is better to read the guidelines in 'conf/httpclient-auth.xml' because they are correct. This section will be improved later when time permits. + Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' are very brief, this section explains it in more detail. The section starts with a few very simple examples which should suffice for most real-life situations. Complex cases are described later in this article. The root element is auth-configuration for all the examples below; it has been omitted for the sake of clarity. + + === Crawling an intranet with default authentication scope === + Let's say all pages of an intranet are protected by basic, digest or ntlm authentication and there is only one set of credentials to be used for all web pages in the intranet; then a configuration as described below is enough. This is also the simplest configuration possible for authentication schemes.
+
+ {{{<credentials username="susam" password="masus">
+   <default/>
+ </credentials>}}}
+
+ The credentials specified above would be sent to any page requesting authentication. Though it is extremely simple, the default authentication scope should be used with caution: this set of credentials would be sent to any web page requesting authentication, and therefore a malicious user could steal the credentials used in the configuration by setting up a web page requiring Basic authentication. Therefore, we usually use credentials set apart for crawling only, so that even if a user steals them, he wouldn't be able to do anything harmful. If you are sure that all pages in the intranet use a particular authentication scheme, say, NTLM, then this situation can be improved a little in this manner.
+
+ {{{<credentials username="susam" password="masus">
+   <default scheme="ntlm"/>
+ </credentials>}}}
+
+ Thus, this set of credentials would be sent to pages
[Nutch Wiki] Update of Crawl by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/Crawl -- {{{ #!/bin/sh + # runbot script to run the Nutch bot for crawling and re-crawling. + # Usage: bin/runbot [safe] + # If executed in 'safe' mode, it doesn't delete the temporary + # directories generated during the crawl. This might be helpful for + # analysis and recovery in case a crawl fails. + # # Author: Susam Pal - # - # 'runbot' script to crawl and re-crawl using Nutch 0.9 and Nutch 1.0 - # - # Modify the values of the variables in the beginning to alter the - # behaviour of the script. The script accepts only one argument 'safe' - # to run the script in safe mode. e.g. bin/runbot safe - # Safe mode prevents deletion of temporary directories so that recovery - # action can be taken if anything goes wrong during the crawl. depth=2 threads=5
[Nutch Wiki] Update of FrontPage by peterpuwang
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by peterpuwang: http://wiki.apache.org/nutch/FrontPage -- == Nutch Administration == * DownloadingNutch * HardwareRequirements - * [http://peterpuwang.googlepages.com/NutchGuideForDummies.htm Tutorial] -- Latest step by Step Installation guide for dummies: Nutch 0.9. + * '''[http://peterpuwang.googlepages.com/NutchGuideForDummies.htm Tutorial] -- Latest step by Step Installation guide for dummies: Nutch 0.9.''' * [http://lucene.apache.org/nutch/tutorial.html Tutorial] -- A Step-by-Step guide to getting Nutch up and running. * NutchTutorial ''on the wiki'' * [Nutch - The Java Search Engine] (Builds on the basic tutorials. Includes index maintenance scripts)
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: removed conf/nutch-site.xml conf -- == Download == Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk. + == Configuration == + This is an advanced feature that lets the user specify different credentials for different authentication scopes. This section does not describe the default configuration. Some parts of this section might be outdated. It is better to read the guidelines in 'conf/httpclient-auth.xml' because they are correct. This section will be improved later when time permits. - == Common Credentials Configuration == - This is the simplest possible configuration, which involves setting just one set of credentials. It is useful in trusted intranets where all sites require the same username/password for authentication. - - === Quick Guide === - 1. Include 'protocol-httpclient' in 'plugin.includes'. - 1. For basic or digest authentication in proxy server, set 'http.proxy.username' and 'http.proxy.password'. Also, set 'http.proxy.realm' if you want to specify a realm as the authentication scope. - 1. For NTLM authentication in proxy server, set 'http.proxy.username', 'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running. - 1. For basic or digest authentication in web servers, set 'http.auth.username' and 'http.auth.password'. Also, set 'http.auth.realm' if you want to specify a realm as the authentication scope. - 1. For NTLM authentication in web servers, set 'http.auth.username', 'http.auth.password', 'http.auth.realm' and 'http.auth.host'.
'http.auth.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running. - - This is explained in detail in the following section. - - === Details === - To use 'protocol-httpclient', 'conf/nutch-site.xml' has to be edited to include some properties, which are explained in this section. First and foremost, to enable the plugin, it must be added to the 'plugin.includes' of 'nutch-site.xml'. So, this property would typically look like:- - - {{{<property> - <name>plugin.includes</name> - <value>protocol-httpclient|urlfilter-regex|...</value> - <description>...</description> - </property>}}} - - (... indicates a long line truncated) - - Next, if authentication is required for the proxy server, the following properties need to be set in 'conf/nutch-site.xml'. - - * http.proxy.username - * http.proxy.password - * http.proxy.realm (If a realm needs to be provided. In case of NTLM authentication, the domain name should be provided as its value.) - * http.auth.host (This is required in case of NTLM authentication only. This is the host where the crawler would be running.) - - If the web servers of the intranet are in a particular domain or realm and require authentication, these properties should be set in 'conf/nutch-site.xml'. - - * http.auth.username - * http.auth.password - * http.auth.realm - * http.auth.host - - The explanation for these properties is similar to that of the proxy authentication properties. As you might have noticed, 'http.auth.host' is used for proxy NTLM authentication as well as web server NTLM authentication. Since the host from which the HTTP requests originate is the same for both, the same property is used for both and two different properties were not created.
- - Even though the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set it in all cases because, in case the crawler comes across a server which requires NTLM authentication (which you might not have anticipated), the crawler can still fetch the page. - - == Authentication Scope Specific Credentials == - This is an advanced feature that lets the user specify different credentials for different authentication scopes. === Quick Guide === An example of 'conf/httpclient-auth.xml' configuration is provided below: @@ -98, +57 @@ The 'realm' attribute is optional in the authscope tag and can be omitted if you want the credentials to be used for all realms on a particular web server (or all remaining realms, as shown in the Quick Guide section above). One authentication scope should not be defined twice in different authscope tags for different credentials tags. However, if this is done by mistake, the credentials for the last defined authscope tag will be used. This is because the XML parsing code reads the file from top to bottom and sets the credentials for
[Nutch Wiki] Update of FrontPage by peterpuwang
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by peterpuwang: http://wiki.apache.org/nutch/FrontPage -- == Nutch Administration == * DownloadingNutch * HardwareRequirements - * [http://www.thechristianlife.com/z/NutchGuideForDummies.htm Tutorial] -- Latest step by Step Installation guide for dummies: Nutch 0.9. + * [http://thechristianlife.re-invent.net/z/NutchGuideForDummies.htm Tutorial] -- Latest step by Step Installation guide for dummies: Nutch 0.9. * [http://lucene.apache.org/nutch/tutorial.html Tutorial] -- A Step-by-Step guide to getting Nutch up and running. * NutchTutorial ''on the wiki'' * [Nutch - The Java Search Engine] (Builds on the basic tutorials. Includes index maintenance scripts)
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: typo fixes -- Even though the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set it in all cases, because if the crawler comes across a server which requires NTLM authentication (which you might not have anticipated), the crawler can still fetch the page. == Authentication Scope Specific Credentials == - This is an advanced feature that lets the user specify different credentials for different authentication scopes. After that you might want to try this out and appreciate the advantages. + This is an advanced feature that lets the user specify different credentials for different authentication scopes. === Quick Guide === An example of 'conf/httpclient-auth.xml' configuration is provided below: @@ -96, +96 @@ If a page, say, 'http://192.168.101.34/index.jsp', requires authentication, then the common credentials would be used since there is no credential defined for this scope. - The 'realm' attribute is optional in authscope tag and it can be omitted if you want the credentials to be used for all realms on a particular web-server (or all remaining realms as shown in the Quick Guide section above). One authentication scope should not be defined twice as different authscope tags for different credentials tag. However, if this is done by mistake, The credentials for the last defined authscope tag would be used. This is because, the XML parsing code, reads the file from top to bottom and sets the credentials for authentication-scopes. If the same authentication scope is encountered once again, it will be overwritten with the new credentials. However, one should not rely on this behavior as this might change with further developments.
+ The 'realm' attribute is optional in the authscope tag and can be omitted if you want the credentials to be used for all realms on a particular web server (or all remaining realms, as shown in the Quick Guide section above). One authentication scope should not be defined twice in different authscope tags under different credentials tags. However, if this is done by mistake, the credentials for the last defined authscope tag would be used. This is because the XML parsing code reads the file from top to bottom and sets the credentials for authentication scopes. If the same authentication scope is encountered again, it is overwritten with the new credentials. However, one should not rely on this behavior as it might change with further development. == Underlying HttpClient Library ==
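For reference, a 'conf/httpclient-auth.xml' along the lines described might look like the sketch below. The element and attribute names follow the tags mentioned in the text (credentials, authscope, realm); treat the exact shape as an assumption and check the page's own Quick Guide example for the authoritative format. Hosts, ports, and credentials are placeholders:

```xml
<!-- Hypothetical conf/httpclient-auth.xml sketch -->
<auth-configuration>
  <credentials username="crawluser" password="secret">
    <!-- No realm attribute: used for all realms on this server -->
    <authscope host="192.168.101.33" port="80"/>
    <!-- Realm given: used only for this realm on this server -->
    <authscope host="192.168.101.77" port="8080" realm="intranet"/>
  </credentials>
</auth-configuration>
```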
[Nutch Wiki] Update of FAQ by robotgenius
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by robotgenius: http://wiki.apache.org/nutch/FAQ -- Please have a look at PrefixURLFilter. Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively. + + Alternatively, you can set db.ignore.external.links to true, and inject seeds from the domains you wish to crawl (these seeds must link to all pages you wish to crawl, directly or indirectly). Doing this keeps the crawl within these domains, without following external links. Unfortunately there is no way to record external links encountered for future processing, although a very small patch to the generator code can allow you to log these links to hadoop.log. How can I recover an aborted fetch process?
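The db.ignore.external.links setting mentioned above is an ordinary entry in 'conf/nutch-site.xml'; a sketch (the description text here is illustrative, not quoted from nutch-default.xml):

```xml
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks pointing to hosts outside the
  injected domains are ignored, so the crawl stays within those
  domains.</description>
</property>
```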
[Nutch Wiki] Trivial Update of GettingNutchRunningWithDebian by Thomas R Bailey
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Thomas R Bailey: http://wiki.apache.org/nutch/GettingNutchRunningWithDebian The comment on the change is: File name in WEB-INF/classes directory requires modifying the nutch-default.xml -- #cd site2; jar xvf nutch-0.8.1.war; rm nutch-0.8.1.war; cd .. }}} === Configure the site1,site2 webapps === - Edit the site1/WEB-INF/classes/nutch-site.xml file for the searcher.dir parameter, so that it points back to your crawl directory under /usr/local/nutch:[[BR]] + Edit the site1/WEB-INF/classes/nutch-default.xml file for the searcher.dir parameter, so that it points back to your crawl directory under /usr/local/nutch and save it as nutch-site.xml after making the following changes:[[BR]] {{{<name>searcher.dir</name> <value>/usr/local/nutch/crawls/site1</value> }}}
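Written out as a complete property element, the searcher.dir setting above would look like this (the description text is illustrative):

```xml
<property>
  <name>searcher.dir</name>
  <value>/usr/local/nutch/crawls/site1</value>
  <description>Crawl directory that this search webapp serves.
  </description>
</property>
```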
[Nutch Wiki] Trivial Update of GettingNutchRunningWithDebian by Thomas R Bailey
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Thomas R Bailey: http://wiki.apache.org/nutch/GettingNutchRunningWithDebian The comment on the change is: removed extra '' in Catalina configuration -- Under Debian Etch, the Catalina configuration files are located under '''/etc/tomcat5.5/policy.d'''. At runtime they are combined into a single file, ''/usr/share/tomcat5.5/conf/catalina.policy''. Do not edit the latter, as it will be overwritten.[[BR]] At the end of /etc/tomcat5.5/policy.d/04webapps.policy include the following code:[[BR]] - {{{grant codeBase file:/usr/share/tomcat5.5-webapps/-\'' { + {{{grant codeBase file:/usr/share/tomcat5.5-webapps/-\ { permission java.util.PropertyPermission user.dir, read; permission java.util.PropertyPermission java.io.tmpdir, read,write; permission java.util.PropertyPermission org.apache.*, read,execute;
[Nutch Wiki] Update of FrontPage by peterpuwang
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by peterpuwang: http://wiki.apache.org/nutch/FrontPage -- == Nutch Administration == * DownloadingNutch * HardwareRequirements + * [http://www.thechristianlife.com/z/NutchGuideForDummies.htm Tutorial] -- Latest step by Step Installation guide for dummies: Nutch 0.9. * [http://lucene.apache.org/nutch/tutorial.html Tutorial] -- A Step-by-Step guide to getting Nutch up and running. * NutchTutorial ''on the wiki'' * [Nutch - The Java Search Engine] (Builds on the basic tutorials. Includes index maintenance scripts)
[Nutch Wiki] Update of WritingPluginExample-0.9 by JasperKamperman
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JasperKamperman: http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 The comment on the change is: Added explanation why the field is added as UN_TOKENIZED -- == The Example == - Consider this as a plugin example: We want to be able to recommend specific web pages for given search terms. For this example we'll assume we're indexing this site. As you may have noticed, there are a number of pages that talk about plugins. What we want to do is have it so that if someone searches for the term plugin we recommend that they start at the PluginCentral page, but we also want to return all the normal hits in the expected ranking. We'll separate the search results page into a section of recommendations and then a section with the normal search results. + Consider this as a plugin example: We want to be able to recommend specific web pages for given search terms. For this example we'll assume we're indexing this site. As you may have noticed, there are a number of pages that talk about plugins. What we want to do is have it so that if someone searches for the term plugins we recommend that they start at the PluginCentral page, but we also want to return all the normal hits in the expected ranking. We'll separate the search results page into a section of recommendations and then a section with the normal search results. You go through your site and add meta-tags to pages that list what terms they should be recommended for. The tags look something like this: @@ -177, +177 @@ == The Indexer Extension == - The following is the code for the Indexing Filter extension. If the document being indexed had a recommended meta tag this extension adds a lucene text field to the index called recommended with the content of that meta tag.
Create a file called RecommendedIndexer.java in the source code directory: + The following is the code for the Indexing Filter extension. If the document being indexed has a recommended meta tag, this extension adds a Lucene text field called recommended to the index with the content of that meta tag. Create a file called RecommendedIndexer.java in the source code directory: {{{ package org.apache.nutch.parse.recommended; @@ -242, +242 @@ } } }}} + + Note that the field is UN_TOKENIZED because we don't want the recommended tag to be cut up by a tokenizer. Change it to TOKENIZED if you want to be able to search on parts of the tag, for example to put multiple recommended terms in one tag. == The QueryFilter ==
[Nutch Wiki] Trivial Update of NutchTutorial by JoeyMazzarelli
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JoeyMazzarelli: http://wiki.apache.org/nutch/NutchTutorial The comment on the change is: current path to DmozParser -- Next we select a random subset of these pages. (We use a random subset so that everyone who runs this tutorial doesn't hammer the same sites.) DMOZ contains around three million URLs. We select one out of every 5000, so that we end up with around 1000 URLs: {{{ mkdir dmoz - bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset 5000 dmoz/urls }}} + bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 dmoz/urls }}} The parser also takes a few minutes, as it must parse the full file. Finally, we initialize the crawl db with the selected urls.
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: Initial draft copied from protocol-http11 New page: == Introduction == 'protocol-httpclient' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web servers as well as proxy servers. == Author of Authentication Features == Susam Pal, Infosys Technologies Limited == Necessity == There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS, or the NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication, but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were written to solve these problems, provide additional features like authentication support for the proxy server, and give better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-httpclient' and use its authentication features. This is an improvement on the previous two plugins. The author of the authentication features has tested them at Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication, and this has been found to work well. == Download == Currently, this plugin is in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply it to trunk. == Quick Guide == This section is a quick guide to configuring authentication-related properties for 'protocol-httpclient'. 1. Include 'protocol-httpclient' in 'plugin.includes'. 1. For basic or digest authentication in the proxy server, set 'http.proxy.username' and 'http.proxy.password'.
Also, set 'http.proxy.realm' if you want to specify a realm as the authentication scope. 1. For NTLM authentication in the proxy server, set 'http.proxy.username', 'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running. 1. For basic or digest authentication in web servers, set 'http.auth.username' and 'http.auth.password'. Also, set 'http.auth.realm' if you want to specify a realm as the authentication scope. 1. For NTLM authentication in web servers, set 'http.auth.username', 'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running. 1. It is recommended that 'http.useHttp11' be set to true. This is explained in a little more detail in the next section. == Nutch Configuration == To use 'protocol-httpclient', 'conf/nutch-site.xml' has to be edited to include some properties which are explained in this section. First and foremost, to enable the plugin, it must be added to the 'plugin.includes' property of 'nutch-site.xml'. So, this property would typically look like: {{{<property> <name>plugin.includes</name> <value>protocol-httpclient|urlfilter-regex|...</value> <description>...</description> </property>}}} (... indicates a long line truncated) Next, if authentication is required for the proxy server, the following properties need to be set in 'conf/nutch-site.xml'. * http.proxy.username * http.proxy.password * http.proxy.realm (If a realm needs to be provided. In case of NTLM authentication, the domain name should be provided as its value.) * http.auth.host (This is required in case of NTLM authentication only. This is the host where the crawler would be running.) If the web servers of the intranet are in a particular domain or realm and require authentication, these properties should be set in 'conf/nutch-site.xml'.
* http.auth.username * http.auth.password * http.auth.realm * http.auth.host The explanation for these properties is similar to that of the proxy authentication properties. As you might have noticed, 'http.auth.host' is used both for proxy NTLM authentication and web server NTLM authentication. Since the host from which the HTTP requests originate is the same in both cases, the same property is used for both and two separate properties were not created. Even though the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set it in all cases, because if the crawler comes across a server which requires NTLM authentication (which you might not have anticipated), it can still fetch the page. == Underlying HttpClient Library == 'protocol-httpclient' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for authenticating users. Given that only one
[Nutch Wiki] Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: content moved to HttpAuthenticationSchemes -- + protocol-http11 has been converted to a patch for protocol-httpclient as per the discussion held at [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557]. - == Introduction == - 'protocol-http11' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server. + Therefore, the content of this page has been moved to HttpAuthenticationSchemes. - == Author == - Susam Pal, Infosys Technologies Limited - == Necessity == - There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. 'protocol-http11' was written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-http11' and use its authentication features. This is an improvement on the previous two plugins. The author of this plugin has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well. The name, 'protocol-http11' was chosen because, 'HTTP 1.1' is a valid protocol name. - - == Download == - Currently, this plugin is in the form of patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply it to trunk. 
- - == Quick Guide == - This section is a quick guide to configure authentication related properties for 'protocol-http11'. - - 1. Include 'protocol-http11' in 'plugin.includes'. - 1. For basic or digest authentication in proxy server, set 'http.proxy.username' and 'http.proxy.password'. Also, set 'http.proxy.realm' if you want to specify a realm as the authentication scope. - 1. For NTLM authentication in proxy server, set 'http.proxy.username', 'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running. - 1. For basic or digest authentication in web servers, set 'http.auth.username' and 'http.auth.password'. Also, set 'http.auth.realm' if you want to specify a realm as the authentication scope. - 1. For NTLM authentication in proxy server, set 'http.auth.username', 'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running. - 1. It is recommended that 'http.useHttp11' be set to true. - - This is explained in a little more detail in the next section. - - == Nutch Configuration == - To use 'protocol-http11', 'conf/nutch-site.xml has to be edited to include some properties which is explained in this section. First and foremost, to enable the plugin, this plugin must be added in the 'plugin.includes' of 'nutch-site.xml'. So, this property would typically look like:- - - {{{<property> - <name>plugin.includes</name> - <value>protocol-http11|urlfilter-regex|...</value> - <description>...</description> - </property>}}} - - (... indicates truncation) - - It is recommended that HTTP 1.1 should be enabled. - - {{{<property> - <name>http.useHttp11</name> - <value>true</value> - <description>...</description> - </property>}}} - - Next, if authentication is required for proxy server, the following properties need to be set in 'conf/nutch-site.xml'.
- - * http.proxy.username - * http.proxy.password - * http.proxy.realm (If a realm needs to be provided. In case of NTLM authentication, the domain name should be provided as its value.) - * http.auth.host (This is required in case of NTLM authentication only. This is the host where the crawler would be running.) - - If the web servers of the intranet are in a particular domain or realm and requires authentication, these properties should be set in 'conf/nutch-site.xml'. - - * http.auth.username - * http.auth.password - * http.auth.realm - * http.auth.host - - The explanation for these properties are similar to that of the proxy authentication properties. As you might have noticed, 'http.auth.host' is used both for proxy NTLM authentication and web server NTLM authentication. Since, the host at which the HTTP requests are originating are same for both, so the same property is used for both and two different
[Nutch Wiki] Update of FrontPage by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/FrontPage The comment on the change is: Http Authentication Schemes -- * CrossPlatformNutchScripts * MonitoringNutchCrawls - techniques for keeping an eye on a nutch crawl's progress. * [Nutch 0.9 Crawl Script Tutorial] + * HttpAuthenticationSchemes - How to enable Nutch to authenticate itself using NTLM, Basic or Digest authentication schemes. == Nutch Development == * [:Becoming_A_Nutch_Developer:Becoming a Nutch Developer] - Start developing and contributing to Nutch.
[Nutch Wiki] Update of Help Wanted by GordonMohr
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by GordonMohr: http://wiki.apache.org/nutch/Help_Wanted -- This listing is provided as a reference only. No endorsements are given or implied. - * The [http://www.archive.org Internet Archive] is seeking a [http://www.archive.org/about/webjobs.php#JavaSoftwareEngineer Java Software Engineer] (and/or a possible Indexing Operations Engineer) to contribute to [http://archive-access.sourceforge.net/projects/nutch NutchWAX] (the adaptation of Nutch for web archives), help make our ever-growing collections available for full-text search, and related projects. + * The [http://www.archive.org Internet Archive] is seeking a [http://www.archive.org/about/webjobs.php#SeniorSearchEngineer Senior Search Engineer] (and/or a possible Indexing Operations Engineer) to lead the development of our open source search tools and platforms.
[Nutch Wiki] Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: corrected XML code for enabling HTTP 1.1 -- {{{<property> <name>http.useHttp11</name> - <value>false</value> + <value>true</value> <description>...</description> </property>}}}
[Nutch Wiki] Update of Crawl by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/Crawl The comment on the change is: fixed topN bug -- == Script == {{{ #!/bin/sh - - # Runs the Nutch bot to crawl or re-crawl - # Usage: bin/runbot [safe] - #If executed in 'safe' mode, it doesn't delete the temporary - #directories generated during crawl. This might be helpful for - #analysis and recovery in case a crawl fails. - # - # Author: Susam Pal - - depth=2 + depth=8 threads=50 adddays=5 - topN=2 # Comment this statement if you don't want to set topN value + topN=1000 #Comment this statement if you don't want to set topN value # Parse arguments if [ $1 == safe ] @@ -101, +92 @@ if [ -n $topN ] then - topN=--topN $rank + topN=-topN $topN else topN= fi @@ -125, +116 @@ $NUTCH_HOME/bin/nutch fetch $segment -threads $threads if [ $? -ne 0 ] then - echo runbot: fetch $segment at depth $depth failed. Deleting segment $segment. + echo runbot: fetch $segment at depth `expr $i + 1` failed. Deleting segment $segment. rm -rf $segment continue fi @@ -138, +129 @@ $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/* if [ $safe != yes ] then - rm -rf crawl/segments/* + rm -rf crawl/segments else - mkdir crawl/FETCHEDsegments - mv --verbose crawl/segments/* crawl/FETCHEDsegments + mv $MVARGS crawl/segments crawl/FETCHEDsegments fi - mv --verbose crawl/MERGEDsegments/* crawl/segments + mv $MVARGS crawl/MERGEDsegments crawl/segments - rmdir crawl/MERGEDsegments echo - Invert Links (Step 4 of $steps) - $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*
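The topN fix in the script above is easy to misread in diff form. As a standalone sketch (same variable names as the script), the corrected option handling expands $topN into a "-topN <value>" option only when it is set:

```shell
#!/bin/sh
# Corrected -topN handling from the runbot script: the original bug
# built "--topN $rank" (doubled dash, wrong variable); the fix builds
# "-topN $topN" from the value set at the top of the script.
topN=1000    # comment this assignment out to skip the -topN option

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

# $topN is later appended to the generate command line.
printf '%s\n' "$topN"    # prints: -topN 1000
```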
[Nutch Wiki] Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: why the name is protocol-http11 -- Susam Pal, Infosys Technologies Limited == Necessity == - There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. 'protocol-http11' was written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-http11' and use its authentication features. This is an improvement on the previous two plugins. The author of this plugin has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well. + There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. 'protocol-http11' was written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-http11' and use its authentication features. This is an improvement on the previous two plugins. The author of this plugin has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well. 
The name 'protocol-http11' was chosen because 'HTTP 1.1' is a valid protocol name. == Download == Currently, this plugin is in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply it to trunk.
[Nutch Wiki] Trivial Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: use pronoun 'it' for 'HttpClient' -- Even though the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set it in all cases, because if the crawler comes across a server which requires NTLM authentication (which you might not have anticipated), the crawler can still fetch the page. == Underlying HttpClient Library == - 'protocol-http11' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for authenticating users. Given that only one scheme may be used at a time for authenticating, HttpClient must choose which scheme to use. To accomplish this, HttpClient uses an order of preference to select the correct authentication scheme. By default this order is: NTLM, Digest, Basic. For more information on the behavior during authentication, you might want to read the [http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html HttpClient Authentication Guide]. + 'protocol-http11' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for authenticating users. Given that only one scheme may be used at a time for authenticating, it must choose which scheme to use. To accomplish this, it uses an order of preference to select the correct authentication scheme. By default this order is: NTLM, Digest, Basic. For more information on the behavior during authentication, you might want to read the [http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html HttpClient Authentication Guide]. == Need Help? ==
If you need help, please feel free to post your question to the [http://lucene.apache.org/nutch/mailing_lists.html#Users nutch-user mailing list].
[Nutch Wiki] Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: download -- == Necessity == There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. 'protocol-http11' was written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-http11' and use its authentication features. This is an improvement on the previous two plugins. The author of this plugin has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well. + + == Download == + Currently, this plugin is in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply it to trunk. == Quick Guide == This section is a quick guide to configuring authentication-related properties for 'protocol-http11'.
[Nutch Wiki] Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: rough draft New page:

== Introduction ==

'protocol-http11' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with the Basic, Digest and NTLM authentication schemes for web servers as well as proxy servers.

== Author ==

Susam Pal, Infosys Technologies Limited

== Necessity ==

Two protocol plugins were already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS, or the NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication, but the NTLM authentication didn't work due to a bug. 'protocol-http11' was written to solve these problems and to provide additional features, such as authentication support for proxy servers and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-http11' and use its authentication features. It is an improvement on the previous two plugins. The author has tested it at Infosys Technologies Limited by crawling a corporate intranet requiring NTLM authentication, and it has been found to work well.

== Quick Guide ==

This section is a quick guide to configuring the authentication-related properties for 'protocol-http11'.

 1. Include 'protocol-http11' in 'plugin.includes'.
 1. For Basic or Digest authentication with a proxy server, set 'http.proxy.username' and 'http.proxy.password'. Also set 'http.proxy.realm' if you want to specify a realm as the authentication scope.
 1. For NTLM authentication with a proxy server, set 'http.proxy.username', 'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 'http.proxy.realm' is the NTLM domain name; 'http.auth.host' is the host on which the crawler is running.
 1. For Basic or Digest authentication with web servers, set 'http.auth.username' and 'http.auth.password'. Also set 'http.auth.realm' if you want to specify a realm as the authentication scope.
 1. For NTLM authentication with web servers, set 'http.auth.username', 'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' is the NTLM domain name; 'http.auth.host' is the host on which the crawler is running.
 1. It is recommended that 'http.useHttp11' be set to true. This is explained in a little more detail in the next section.

== Nutch Configuration ==

To use 'protocol-http11', 'conf/nutch-site.xml' has to be edited to include some properties, which are explained in this section. First and foremost, the plugin must be added to 'plugin.includes' in 'nutch-site.xml', so this property would typically look like:

{{{<property>
  <name>plugin.includes</name>
  <value>protocol-http11|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>...</description>
</property>}}}

It is recommended that HTTP 1.1 be enabled:

{{{<property>
  <name>http.useHttp11</name>
  <value>true</value>
  <description>...</description>
</property>}}}

Next, if authentication is required for the proxy server, the following properties need to be set in 'conf/nutch-site.xml':

 * http.proxy.username
 * http.proxy.password
 * http.proxy.realm (if a realm needs to be provided; in the case of NTLM authentication, the domain name should be given as its value)
 * http.auth.host (required for NTLM authentication only; this is the host on which the crawler runs)

If the web servers of the intranet are in a particular domain or realm and require authentication, these properties should be set in 'conf/nutch-site.xml'.
 * http.auth.username
 * http.auth.password
 * http.auth.realm
 * http.auth.host

The explanation of these properties is similar to that of the proxy authentication properties. As you might have noticed, 'http.auth.host' is used both for proxy NTLM authentication and for web server NTLM authentication. Since the host from which the HTTP requests originate is the same in both cases, a single property is used instead of two separate ones. Even though the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set it in all cases, so that if the crawler comes across a server requiring NTLM authentication (which you might not have anticipated), it can still fetch the page.

== Underlying HttpClient Library ==

'protocol-http11' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for authenticating users. Given that only one scheme may be used at a time for
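Putting the properties above together, a minimal 'nutch-site.xml' fragment for NTLM authentication against intranet web servers might look like the following sketch. All values shown (the DOMAIN realm, the crawler host name, and the credentials) are illustrative placeholders, not values from this page:

{{{<!-- Sketch: NTLM authentication against intranet web servers (placeholder values) -->
<property>
  <name>http.auth.username</name>
  <value>crawler_user</value>
  <description>Username for web server authentication (placeholder).</description>
</property>
<property>
  <name>http.auth.password</name>
  <value>secret</value>
  <description>Password for web server authentication (placeholder).</description>
</property>
<property>
  <name>http.auth.realm</name>
  <value>DOMAIN</value>
  <description>For NTLM, the NTLM domain name (placeholder).</description>
</property>
<property>
  <name>http.auth.host</name>
  <value>crawler.example.com</value>
  <description>Host on which the crawler runs (placeholder).</description>
</property>}}}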
[Nutch Wiki] Trivial Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: truncated long line --

 {{{<property>
 <name>plugin.includes</name>
- <value>protocol-http11|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+ <value>protocol-http11|urlfilter-regex|...</value>
 <description>...</description>
 </property>}}}
+
+ (... indicates truncation)

 It is recommended that HTTP 1.1 should be enabled.
[Nutch Wiki] Update of PluginCentral by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/PluginCentral The comment on the change is: protocol-http11 -- * GeoPosition * [German] * [http://issues.apache.org/jira/browse/NUTCH-422 index-extra] - Adds user-configurable fields to the index. - * [http://issues.apache.org/jira/browse/NUTCH-427 protocol-smb] - Allows Nutch to crawl MS Windows Shares folder + * [http://issues.apache.org/jira/browse/NUTCH-427 protocol-smb] - Allows Nutch to crawl MS Windows Shares folder. - + * [protocol-http11] - Adds support for HTTP 1.1, HTTPS, Basic, Digest and NTLM authentication. ([https://issues.apache.org/jira/browse/NUTCH-557 NUTCH-557])
[Nutch Wiki] Update of Nutch 0.9 Crawl Script Tutorial by Lyndon
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Lyndon: http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial -- echo - Merge Segments (Step 4 of $steps) - $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/* - if [ $? -ne 0 ] + if [ $? -eq 0 ] then if [ $safe != yes ] then
[Nutch Wiki] Trivial Update of Nutch 0.9 Crawl Script Tutorial by Lyndon
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Lyndon: http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial The comment on the change is: the mergesegs failure test was inverted, needed to test $? -eq instead of -ne -- echo - Merge Segments (Step 4 of $steps) - $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/* + + /!\ '''Edit conflict - other version:''' if [ $? -eq 0 ] + + /!\ '''Edit conflict - your version:''' + if [ $? -eq 0 ] + + /!\ '''End of edit conflict''' then if [ $safe != yes ] then
[Nutch Wiki] Trivial Update of Nutch 0.9 Crawl Script Tutorial by Lyndon
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Lyndon: http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial The comment on the change is: removing wiki inserted conflict text sorry -- echo - Merge Segments (Step 4 of $steps) - $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/* - /!\ '''Edit conflict - other version:''' if [ $? -eq 0 ] - - /!\ '''Edit conflict - your version:''' - if [ $? -eq 0 ] - - /!\ '''End of edit conflict''' then if [ $safe != yes ] then
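The inverted test that these three edits sort out reduces to a small sketch. The stub below merely stands in for `bin/nutch mergesegs`; the point is that cleanup of the temporary segments should happen only on a zero (success) exit status, i.e. `$? -eq 0`, not `-ne`:

```shell
#!/bin/sh
# merge is a stub standing in for:
#   $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
merge() {
  return 0   # pretend the merge succeeded
}

merge
if [ $? -eq 0 ]   # correct: exit status 0 means success, safe to clean up
then
  echo "merge succeeded - removing old segments"
else
  echo "merge failed - keeping segments for recovery"
fi
```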
[Nutch Wiki] Update of Globe+Correspondent by wikicninfo
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by wikicninfo: http://wiki.apache.org/nutch/Globe+Correspondent New page: But community health centers draw patients for a number of reasons. They offer one-stop shopping, which can include dental care, substance abuse treatment, pediatric and prenatal care, and social services. Most have child care and translators on site for non-English speakers. With the new Massachusetts health insurance law boosting the number of patients seeking care, community health centers south of Boston are scrambling to meet the demand. Manet Community Health Center, which has four locations in Quincy and one in Hull, is hiring two new family care physicians and a nurse practitioner. Brockton Neighborhood Health Center now stays open two hours later on weeknights. In February, it hired a nurse practitioner, two medical assistants, and two social workers, and is planning to hire 20 more staff members in the next six months. "We've seen a really significant increase in visits by new patients," said Sue Joss, executive director of the Brockton health center. "Our phones are ringing off the hook for new patients." The two centers are the only ones directly south of Boston. But community health centers in Fall River and New Bedford, which also serve people from this region, are experiencing the same increase in demand, and expanding hours to meet it. The state's universal health insurance law, which is being rolled out this year, is bringing formerly uninsured people into the healthcare system. Many of these individuals and families are turning to community health centers, the locally based nonprofit organizations that arose from the antipoverty movement of the 1960s. 
"We are front and center in the new healthcare legislation," said Kerin O'Toole, spokeswoman for the Massachusetts League of Community Health Centers. "We've seen quite a surge in demand." Although in many cases patients could go elsewhere, the health centers offer a whole range of services you can't get from a private provider. The nation's first community health center opened at Columbia Point in Dorchester in 1965 as part of President Johnson's war on poverty. Similar centers, supported by federal aid and private grants, opened across the country in poor and medically underserved areas. Today, the United States has more than a thousand centers, 52 of them in Massachusetts. Business is thriving. In April, the Brockton center on Main Street saw a 12 percent spike in patient visits over last year, and in May, a 9 percent increase. A new $16 million center is under construction next to the cramped downtown facility and is scheduled to open in November. Statewide, patient loads at community health centers have been on the rise. In 2006, [http://www.teamflyelectronic.com/ Burglar alarm] centers in Massachusetts saw 760,301 patients, an increase of nearly 94,000, or 14 percent, over the previous year. The surge in demand at community health centers with the new law was not fully expected. The centers have long been a safety net in the healthcare system - places where people could go whether they had insurance or not. The insured usually have many choices when seeking care. "People are more aware of the community health centers and the services we provide," said Sheryl Turgeon, chief executive officer of Healthfirst, which draws patients from Fall River and nearby towns. 
Community health centers also do outreach for Commonwealth Care, the new state health insurance program, and visitors to most centers can sign up for health insurance on the spot. The heavy promotions the state has been doing to get the uninsured to sign up and take advantage of healthcare also seem to be a factor in the increasing number of visits, according to Toni McGuire, chief executive officer of the Manet center. "I think one of the biggest reasons for the increase is the advertising around Commonwealth Care," McGuire said. Said Joss of the Brockton center, "There was never this kind of publicity around the free-care pool." In the past, institutions that treated the uninsured were compensated by a pool of money administered by the state and paid into by hospitals and other large providers. Another reason that community health centers are seeing more patients is that three of the four insurers working with Commonwealth Care tend to direct subscribers to the centers, according to Alan Sager, director of the health reform program at Boston University School of Public Health. Sager said he is concerned that some community health centers may not be able to hire physicians quickly enough to meet the demand. If health centers were deluged by
[Nutch Wiki] Update of wow+power+leveling by loki002
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by loki002: http://wiki.apache.org/nutch/wow+power+leveling New page: Baseball gave him his earliest challenge. He was an outstanding pitcher in Little League, and eventually, as a senior in high school, made the [http://www.toppowerlevel.net wow powerleveling] varsity, winning half the team's games with a record of five wins and two losses. At graduation, the coach named Daniel the [http://www.toppowerlevel.net wow power level] team's most valuable player.

His finest hour, though, came at a school science fair. He entered an exhibit showing how the [http://www.toppowerlevel.net wow power leveling] circulatory system works. It was primitive and crude, especially compared to the fancy, computerized, blinking-light models entered by other [http://www.toppowerlevel.net wow power level] students. My wife, Sara, felt embarrassed for him.

It turned out that the other kids [http://www.toppowerlevel.net wow power leveling] had not done their own work; their parents had made their exhibits. As the judges went on their rounds, they found that these other kids couldn't answer their questions. Daniel answered every one. When the judges awarded the [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] Albert Einstein Plaque for the best exhibit, they gave it to him.

By the time Daniel left for [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] he stood six feet tall and weighed 170 pounds. He was muscular and in superb [http://www.toppowerlevel.net wow powerleveling] condition, but he never pitched another inning, having given up baseball for English literature. I was sorry that he would not develop his athletic talent, but proud that he had made such a mature decision. 
One day I told Daniel that the great failing in my [http://www.toppowerlevel.net wow power leveling] life had been that I didn't take a year or two off to travel when I finished college. This is the best way, to my way of thinking, to broaden oneself and develop a larger perspective on life. Once I had married and begun working, I found that the dream of [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] in another culture had vanished.

Daniel thought about this. His friends said that he would be insane to put his career on [http://www.toppowerlevel.net wow powerleveling]. But he decided it wasn't so crazy. After graduation, he worked as a waiter at college, a bike messenger and a house painter. With the money he earned, he had enough to go to [http://www.toppowerlevel.net wow power level] Paris.

The [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] before he was to leave, I tossed in bed. I was trying to figure out something to say. Nothing came to mind. Maybe, I thought [http://www.toppowerlevel.net wow power leveling], it wasn't necessary to say anything.

What does it matter in the course of a [http://www.toppowerlevel.net wow power level] if a father never tells a son what he really thinks of him? But as I stood before Daniel, I knew that it does matter. My father and I loved each other. Yet, I always regretted never hearing him put his feelings into words and never having the memory of that moment. Now, I could feel my palms sweat and my throat tighten. Why is it so hard to tell a son something from the heart? My mouth turned dry, and I knew I would be able to get out only a few [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] words clearly.

"Daniel," I said, "if I could have picked [http://www.toppowerlevel.net wow powerleveling], I would have picked you."

That's all I could say. I wasn't sure he understood what I meant. Then he came toward me and threw his arms around me. 
For a moment, the [http://www.toppowerlevel.net wow power leveling] world and all its people vanished, and there was just Daniel and me in our home by the sea.

He was saying [http://www.toppowerlevel.net wow powerleveling], but my eyes misted over, and I couldn't understand what he was saying. All I was aware of was the stubble on his chin as his face pressed against mine. And then, the [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro powerleveling]. I went to [http://www.toppowerlevel.net wow power level] work, and Daniel left a few hours later with his girlfriend.

That was seven weeks ago, and I think about [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro powerleveling] when I walk along the beach on weekends. Thousands of miles away, somewhere out past the ocean waves breaking on the [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power level] deserted shore, he might be scurrying across Boulevard Saint Germain, strolling
[Nutch Wiki] Update of wow+power+leveling by matthieuriou
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by matthieuriou: http://wiki.apache.org/nutch/wow+power+leveling The comment on the change is: Spam attack --

- Baseball gave him his earliest challenge. He was an outstanding pitcher in Little League, and eventually, as a senior in high school, made the [http://www.toppowerlevel.net wow powerleveling] varsity, winning half the team's games with a record of five wins and two losses. At graduation, the coach named Daniel the [http://www.toppowerlevel.net wow power level] team's most valuable player.
+ deleted
- His finest hour, though, came at a school science fair. He entered an exhibit showing how the [http://www.toppowerlevel.net wow power leveling] circulatory system works. It was primitive and crude, especially compared to the fancy, computerized, blinking-light models entered by other [http://www.toppowerlevel.net wow power level] students. My wife, Sara, felt embarrassed for him.
-
- It turned out that the other kids [http://www.toppowerlevel.net wow power leveling] had not done their own work; their parents had made their exhibits. As the judges went on their rounds, they found that these other kids couldn't answer their questions. Daniel answered every one. When the judges awarded the [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] Albert Einstein Plaque for the best exhibit, they gave it to him.
-
- By the time Daniel left for [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] he stood six feet tall and weighed 170 pounds. He was muscular and in superb [http://www.toppowerlevel.net wow powerleveling] condition, but he never pitched another inning, having given up baseball for English literature. I was sorry that he would not develop his athletic talent, but proud that he had made such a mature decision.
-
- One day I told Daniel that the great failing in my [http://www.toppowerlevel.net wow power leveling] life had been that I didn't take a year or two off to travel when I finished college. This is the best way, to my way of thinking, to broaden oneself and develop a larger perspective on life. Once I had married and begun working, I found that the dream of [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] in another culture had vanished.
-
- Daniel thought about this. His friends said that he would be insane to put his career on [http://www.toppowerlevel.net wow powerleveling]. But he decided it wasn't so crazy. After graduation, he worked as a waiter at college, a bike messenger and a house painter. With the money he earned, he had enough to go to [http://www.toppowerlevel.net wow power level] Paris.
-
- The [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] before he was to leave, I tossed in bed. I was trying to figure out something to say. Nothing came to mind. Maybe, I thought [http://www.toppowerlevel.net wow power leveling], it wasn't necessary to say anything.
-
- What does it matter in the course of a [http://www.toppowerlevel.net wow power level] if a father never tells a son what he really thinks of him? But as I stood before Daniel, I knew that it does matter. My father and I loved each other. Yet, I always regretted never hearing him put his feelings into words and never having the memory of that moment. Now, I could feel my palms sweat and my throat tighten. Why is it so hard to tell a son something from the heart? My mouth turned dry, and I knew I would be able to get out only a few [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] words clearly.
-
- "Daniel," I said, "if I could have picked [http://www.toppowerlevel.net wow powerleveling], I would have picked you."
-
- That's all I could say. I wasn't sure he understood what I meant. Then he came toward me and threw his arms around me. For a moment, the [http://www.toppowerlevel.net wow power leveling] world and all its people vanished, and there was just Daniel and me in our home by the sea.
-
- He was saying [http://www.toppowerlevel.net wow powerleveling], but my eyes misted over, and I couldn't understand what he was saying. All I was aware of was the stubble on his chin as his face pressed against mine. And then, the [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro powerleveling]. I went to [http://www.toppowerlevel.net wow power level] work, and Daniel left a few hours later with his girlfriend.
-
- That was seven weeks ago, and I think about [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro powerleveling] when I walk along the beach on weekends. Thousands of miles away, somewhere out past the ocean waves
[Nutch Wiki] Update of FrontPage by KevinBurton
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KevinBurton: http://wiki.apache.org/nutch/FrontPage -- * [Search_Theory] Search Theory White Papers * [http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E Tutorial Hadoop+Nutch 0.8 night build Roberto Navoni 24-07-06] * [http://blog.foofactory.fi/ FooFactory] Nutch and Hadoop related posts + * [http://spinn3r Spinn3r] [http://spinn3r.com/opensource.php Open Source components] (our contribution to the crawling OSS community with more to come).
[Nutch Wiki] Update of FrontPage by KevinBurton
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KevinBurton: http://wiki.apache.org/nutch/FrontPage -- * [Search_Theory] Search Theory White Papers * [http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E Tutorial Hadoop+Nutch 0.8 night build Roberto Navoni 24-07-06] * [http://blog.foofactory.fi/ FooFactory] Nutch and Hadoop related posts - * [http://spinn3r Spinn3r] [http://spinn3r.com/opensource.php Open Source components] (our contribution to the crawling OSS community with more to come). + * [http://spinn3r.com Spinn3r] [http://spinn3r.com/opensource.php Open Source components] (our contribution to the crawling OSS community with more to come).
[Nutch Wiki] Trivial Update of GettingNutchRunningWithDebian by Ted Guild
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Ted Guild: http://wiki.apache.org/nutch/GettingNutchRunningWithDebian The comment on the change is: webapps not necessary and perhaps not desired to have running on server -- ''export JAVA_HOME''[[BR]] == Install Tomcat5.5 and Verify that it is functioning == - ''# apt-get install tomcat5.5 libtomcat5.5-java tomcat5.5-admin tomcat5.5-webapps''[[BR]] + ''# apt-get install tomcat5.5 libtomcat5.5-java tomcat5.5-admin ''[[BR]] Verify Tomcat is running:[[BR]] ''# /etc/init.d/tomcat5.5 status''[[BR]] ''#Tomcat servlet engine is running with Java pid /var/lib/tomcat5.5/temp/tomcat5.5.pid''[[BR]]
[Nutch Wiki] Update of GettingNutchRunningWithDebian by Ted Guild
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Ted Guild: http://wiki.apache.org/nutch/GettingNutchRunningWithDebian The comment on the change is: removing BR noise on conf example --

Under Debian Etch, the Catalina configuration files are located under '''/etc/tomcat5.5/policy.d''' At runtime they are combined into a single file, ''/usr/share/tomcat5.5/conf/catalina.policy'' Do not edit the latter, as it will be overwritten.[[BR]] At the end of /etc/tomcat5.5/policy.d/04webapps.policy include the following code:[[BR]]

- {{{grant codeBase "file:/usr/share/tomcat5.5-webapps/-" {[[BR]]
+ {{{grant codeBase "file:/usr/share/tomcat5.5-webapps/-" {
- permission java.util.PropertyPermission "user.dir", "read";[[BR]]
+ permission java.util.PropertyPermission "user.dir", "read";
- permission java.util.PropertyPermission "java.io.tmpdir", "read,write";[[BR]]
+ permission java.util.PropertyPermission "java.io.tmpdir", "read,write";
- permission java.util.PropertyPermission "org.apache.*", "read,execute";[[BR]]
+ permission java.util.PropertyPermission "org.apache.*", "read,execute";
- permission java.io.FilePermission "/usr/local/nutch/crawls/-", "read";[[BR]]
+ permission java.io.FilePermission "/usr/local/nutch/crawls/-", "read";
- permission java.io.FilePermission "/var/lib/tomcat5.5/temp", "read";[[BR]]
+ permission java.io.FilePermission "/var/lib/tomcat5.5/temp", "read";
- permission java.io.FilePermission "/var/lib/tomcat5.5/temp/-", "read,write,execute,delete";[[BR]]
+ permission java.io.FilePermission "/var/lib/tomcat5.5/temp/-", "read,write,execute,delete";
- permission java.lang.RuntimePermission "createClassLoader";[[BR]]
+ permission java.lang.RuntimePermission "createClassLoader";
- permission java.security.AllPermission;[[BR]]
+ permission java.security.AllPermission;
- };[[BR]]}}}
+ };}}}

'''Warning: The last line here was necessary in order to make things work for me. 
If anybody can supply a more restrictive permission set, please do so!!! The effects of this are unknown'''[[BR]] == Install Multiple Copies of Nutch under Tomcat5.5 and Prepare for Searching ==
[Nutch Wiki] Update of FrontPage by jbv
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by jbv: http://wiki.apache.org/nutch/FrontPage -- * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes * CrossPlatformNutchScripts * MonitoringNutchCrawls - techniques for keeping an eye on a nutch crawl's progress. + * [Nutch 0.9 Crawl Script Tutorial] == Nutch Development == * [:Becoming_A_Nutch_Developer:Becoming a Nutch Developer] - Start developing and contributing to Nutch.
[Nutch Wiki] Update of Nutch 0.9 Crawl Script Tutorial by jbv
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by jbv: http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial New page:

== Nutch 0.9 Crawl Script Tutorial ==

This is a walkthrough of the Nutch 0.9 crawl.sh script provided by Susam Pal. (Thanks for getting me started, Susam!) I am only a novice at the whole Nutch thing, so this article may not be 100% accurate, but I think it will be helpful to other people just getting started, and I am hopeful that people who know more about Nutch will correct my mistakes and add more useful information to this document. Thanks to everyone in advance! (By the way, I made changes to Susam's script, so if I broke stuff or made stupid mistakes, please correct me. ;)

{{{
#!/bin/sh
# Runs the Nutch bot to crawl or re-crawl
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal
}}}

First we specify some variables. depth tells how many times to crawl the web pages. It seems like about 6 will get us all the files, but to be really thorough 9 should be enough. threads sets how many threads to crawl with, though ultimately this is limited by the conf file's max-threads-per-server setting, because for intranet crawling (like we are doing) there is really only one server. adddays is something I don't know... need to figure out how to use this to our advantage for only crawling updated pages. topN is not used right now because we want to crawl the whole intranet. You can use it during testing to limit the maximum number of pages to crawl per depth, but then you won't get all the possible results. 
{{{
depth=9
threads=50
adddays=5
#topN=100 # Comment this statement if you don't want to set topN value
NUTCH_HOME=/data/nutch
CATALINA_HOME=/var/lib/tomcat5.5
}}}

NUTCH_HOME and CATALINA_HOME have to be configured to point to where you installed Nutch and Tomcat respectively.

{{{
# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo "runbot: $0 could not find environment variable NUTCH_HOME"
  echo "runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME"
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo "runbot: $0 could not find environment variable CATALINA_HOME"
  echo "runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME"
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi
}}}

This last part just looks at the incoming variables and sets defaults, etc... Now on to the real work!!

== Step 1 : Inject ==

First thing is to inject the crawldb with an initial set of urls to crawl. In our case we are injecting only a single url contained in the nutch/seed/urls file. But in the future this file will probably get filled with out-of-date pages in order to hasten their recrawl.

{{{
steps=10
echo "- Inject (Step 1 of $steps) -"
$NUTCH_HOME/bin/nutch inject crawl/crawldb seed
}}}

== Step 2 : Crawl ==

Next we do a for loop for $depth number of times. This for loop performs a couple of steps which make up a basic 'crawl' procedure. First it generates a segment which (I think?) is filled with empty data for each url in the crawldb that has reached its expiration (i.e. has not been fetched in a month). I am not really sure what this does yet... Then it fetches pages for those urls and stores that data in the segment. During this fetch phase, it also fills the crawldb with any new urls it finds (as long as they are not excluded by the filters we configured). 
This is really the key to making this for loop work, because the next time it gets to the segment generation there will be more urls in the crawldb for it to crawl. Notice however that the crawldb never gets cleared in this script... so if I am not mistaken there is no need to re-inject the root url. Then we parse the data in the segments. Although, depending on your configuration in the xml files, this can be done automatically, in our case we are parsing manually because I read it would be faster this way... Haven't really given it a good test yet. After these steps are done we have a nice set of segments that are full of data to be indexed.

{{{
echo "- Generate, Fetch, Parse, Update (Step 2 of $steps) -"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`
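The control flow of the loop above (truncated in this digest) can be sketched on its own. In this hedged sketch a stub stands in for the `bin/nutch generate` call; the point is that a non-zero exit status from generate (no more URLs) is what breaks out of the depth loop early:

```shell
#!/bin/bash
# Stub standing in for "$NUTCH_HOME/bin/nutch generate ...".
# Here it pretends the crawldb runs out of URLs after 3 rounds.
generate() {
  [ "$1" -lt 3 ]
}

depth=5
for ((i = 0; i < depth; i++))
do
  if ! generate "$i"
  then
    echo "Stopping at depth $i. No more URLs to fetch."
    break
  fi
  echo "Crawled depth $i"
done
```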
[Nutch Wiki] Update of jbv by jbv
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by jbv: http://wiki.apache.org/nutch/jbv New page: Oh look its my page. I'm Jeff Van Boxtel. I am a nutch newb, but I hope to learn more and contribute to this wiki. Yoroshiku.
[Nutch Wiki] Update of ClusteringPlugin by DawidWeiss
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DawidWeiss: http://wiki.apache.org/nutch/ClusteringPlugin The comment on the change is: Adding initial help for configuring clustering with different clustering algs. --

 * configurable parameters: Take a look at the defaults defined in nutch-default.xml (search for 'clustering').
 * meta data added to index: None. Clustering is performed dynamically for each result set.
 * required jars: The entire `lib` folder in the plugin must be present in classpath. More JARs might be needed from the Carrot2 project if additional algorithms or languages are to be used.
- * plugin extension point interface: net.nutch.clustering.OnlineClusterer
+ * plugin extension point interface: `net.nutch.clustering.OnlineClusterer`
 * '''Carrot2 JARs come from codebase in version: 2.1'''

@@ -61, +61 @@

 /property
 }}}

+ == Using other Carrot2 clustering algorithms ==
+
+ To limit the size of the clustering plugin, the default implementation is shipped with the Lingo
+ algorithm -- just one of several alternatives available in the Carrot2 project. This section describes
+ how to substitute the default algorithm with a different one.
+
+ First, prepare the following:
+
+  * Install Nutch, enable the clustering plugin and make sure it works in the default configuration.
+  * Get a precompiled distribution of Carrot2 ([http://project.carrot2.org/download.html]), for example the DCS demo, or compile it from scratch.
+
+ Now you are ready to install another clustering algorithm. The instructions below show how to run STC (Suffix Tree Clustering) instead
+ of Lingo. We will use a binary release of the DCS as a source of the required Carrot2 JARs. It is assumed that Nutch's WAR is installed in a Web application container such as Jetty or Tomcat.
+
+ (will finish later)
+
[Nutch Wiki] Update of ClusteringPlugin by DawidWeiss
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DawidWeiss: http://wiki.apache.org/nutch/ClusteringPlugin -- * Install Nutch, enable the clustering plugin and make sure it works in the default configuration. * Get a precompiled distribution of Carrot2 ([http://project.carrot2.org/download.html]), for example the DCS demo, or compile it from scratch. - Now you are ready to install another clustering algorithm. The instructions below show how to run STC (Suffix Tree Clustering) instead + Now you are ready to install a different clustering algorithm. The instructions below show how to run STC (Suffix Tree Clustering) instead - of Lingo. We will use a binary release of the DCS as a source of the required Carrot2 JARs. It is assumed that Nutch's WAR is installed in a Web application container such as Jetty or Tomcat. + of Lingo on the Jetty server (6.1.5). We will use a binary release of the DCS as a source of the required Carrot2 JARs. - (will finish later) + * Make a symbolic link from `webapps/ROOT.war` to a compiled Nutch WAR file (or just place it there). + * Make sure clustering and search work as expected. + * Download the binary release of DCS (we will use the one from the latest stable release: version 2.1). + * The Carrot2 framework has the notion of a process -- a pipeline of components that process search results and emit clusters. We will need to provide the name of an XML file which defines such a process to Nutch's clustering extension and give it access to all the required classes it may need. Let's start by defining a process. Unpack the DCS distribution and locate the `descriptors` folder. You'll see a bunch of files inside; the one that interests us is called `alg-stc-en.xml`. 
Its contents should look like this:
+ {{{
+ <local-process id="stc-en">
+   <name>STC (+English)</name>
+   <description>Suffix Tree Clustering Algorithm</description>
+   <input component-key="input-demo-webapp" />
+
+   <filter component-key="filter-language-detection-en" />
+   <filter component-key="filter-tokenizer" />
+   <filter component-key="filter-case-normalizer" />
+   <filter component-key="filter-stc" />
+
+   <output component-key="output-demo-webapp" />
+ </local-process>
+ }}}
+ * Now edit the above file and change the `input` component key to `input-nutch` and the `output` component key to `output-array`, leaving everything else exactly as it was.
+ {{{
+ <local-process id="stc-en">
+   <name>STC (+English)</name>
+   <description>Suffix Tree Clustering Algorithm</description>
+
+   <input component-key="input-nutch" />
+
+   <filter component-key="filter-language-detection-en" />
+   <filter component-key="filter-tokenizer" />
+   <filter component-key="filter-case-normalizer" />
+   <filter component-key="filter-stc" />
+
+   <output component-key="output-array" />
+ </local-process>
+ }}}
+ * The filters you see in the process descriptor should also be available. Some of them are built into the Carrot2 core; the others should be copied from the DCS distribution to the same temporary folder we copied the process definition to. In our case the following filter definition files should be copied: `filter-language-detection-en.bsh`, `filter-tokenizer.bsh`, `filter-case-normalizer.bsh` and `filter-stc.bsh`.
+ * Process and component descriptors are read as a resource (relative to the classpath). Jetty can be configured to have an additional classpath entry as a folder, but it slightly complicates things (the hierarchy of classloaders may result in some hard-to-track errors). It will be easier to just place all the required files in Nutch's Web application context under `WEB-INF`. If you work with the WAR file directly, you'll need to add the resources mentioned below to the WAR file (it's a ZIP file, so that's not a problem). 
+ * Copy process and component descriptor files to `{NUTCH-CONTEXT}/WEB-INF/classes/`. +
[Nutch Wiki] Update of ClusteringPlugin by DawidWeiss
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DawidWeiss: http://wiki.apache.org/nutch/ClusteringPlugin -- * The filters you see in the process descriptor should also be available. Some of them are built into the Carrot2 core; the others should be copied from the DCS distribution to the same temporary folder we copied the process definition to. In our case the following filter definition files should be copied: `filter-language-detection-en.bsh`, `filter-tokenizer.bsh`, `filter-case-normalizer.bsh` and `filter-stc.bsh`. * Process and component descriptors are read as a resource (relative to the classpath). Jetty can be configured to have an additional classpath entry as a folder, but it slightly complicates things (the hierarchy of classloaders may result in some hard-to-track errors). It will be easier to just place all the required files in Nutch's Web application context under `WEB-INF`. If you work with the WAR file directly, you'll need to add the resources mentioned below to the WAR file (it's a ZIP file, so that's not a problem). * Copy process and component descriptor files to `{NUTCH-CONTEXT}/WEB-INF/classes/`.
+ * Copy all JAR files from the DCS (`WEB-INF/lib/*.jar`) to `{NUTCH-CONTEXT}/WEB-INF/lib`. Overwrite older libraries whenever prompted.
+ * Finally, the path to the clustering process should be added to `{NUTCH-CONTEXT}/WEB-INF/classes/nutch-site.xml`:
+ {{{
+ <property>
+   <name>extension.clustering.carrot2.process-resource</name>
+   <value>/alg-stc-en.xml</value>
+ </property>
+ }}}
+ * Restart your Web application container. The clustering plugin should use the STC clustering algorithm if everything went ok.
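The copy steps above can be sketched as a small shell script. This is only a sketch: the `mktemp` directories below simulate an unpacked DCS distribution and a deployed Nutch webapp context, and the single `carrot2-core.jar` stand-in represents the full set of DCS JARs; in a real install, point the two variables at the actual locations instead.

```shell
#!/bin/sh
# Illustrative stand-ins: mktemp dirs simulate an unpacked DCS distribution
# and a deployed Nutch webapp context. In a real install, set these two
# variables to the actual locations instead.
DCS_HOME=$(mktemp -d)
NUTCH_CONTEXT=$(mktemp -d)
mkdir -p "$DCS_HOME/descriptors" "$DCS_HOME/WEB-INF/lib" \
         "$NUTCH_CONTEXT/WEB-INF/classes" "$NUTCH_CONTEXT/WEB-INF/lib"

# Empty files stand in for the real descriptors and JARs from the DCS.
for f in alg-stc-en.xml filter-language-detection-en.bsh filter-tokenizer.bsh \
         filter-case-normalizer.bsh filter-stc.bsh; do
  : > "$DCS_HOME/descriptors/$f"
done
: > "$DCS_HOME/WEB-INF/lib/carrot2-core.jar"

# Step 1: process and component descriptors go onto the webapp classpath.
cp "$DCS_HOME"/descriptors/* "$NUTCH_CONTEXT/WEB-INF/classes/"
# Step 2: Carrot2 JARs replace the older copies shipped with the plugin.
cp "$DCS_HOME"/WEB-INF/lib/*.jar "$NUTCH_CONTEXT/WEB-INF/lib/"
```

After this, the `extension.clustering.carrot2.process-resource` property is added to `nutch-site.xml` and the container restarted, as the entry above describes.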
[Nutch Wiki] Update of Support by RidaBenjelloun
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by RidaBenjelloun: http://wiki.apache.org/nutch/Support -- * [http://www.sigram.com Andrzej Bialecki] ab at sigram.com * CNLP http://www.cnlp.org/tech/lucene.asp - * [http://www.doculibre.com/ Doculibre Inc.] Open source and information management consulting. (Lucene, Nutch, Hadoop etc.) info at doculibre.com + * [http://www.doculibre.com/ Doculibre Inc.] Open source and information management consulting. (Lucene, Nutch, Hadoop, Solr, Lius etc.) info at doculibre.com * [http://www.dsen.nl DSEN - Java | J2EE | Agile Development Consultancy] * eventax GmbH info at eventax.com * [http://www.foofactory.fi/ FooFactory] / Sami Siren info at foofactory dot fi
[Nutch Wiki] Update of WritingPluginExample-0.9 by BUlicny
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by BUlicny: http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 The comment on the change is: Updated format of QueryFilter extension in plugin.xml --
       name="Recommended Search Query Filter"
       point="org.apache.nutch.searcher.QueryFilter">
     <implementation id="RecommendedQueryFilter"
-        class="org.apache.nutch.parse.recommended.RecommendedQueryFilter"
+        class="org.apache.nutch.parse.recommended.RecommendedQueryFilter">
-        fields="DEFAULT"/>
+        <parameter name="fields" value="recommended"/>
+    </implementation>
   </extension>
 </plugin>
[Nutch Wiki] Update of ClusteringPlugin by DawidWeiss
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DawidWeiss: http://wiki.apache.org/nutch/ClusteringPlugin The comment on the change is: Updated the info about clustering plugin and instructions. -- - -- Main.DawidWeiss - 01 Dec 2004 + = Clustering Plugin = - * plugin name: Online Search Results Clustering using Carrot2's Lingo component + plugin name:: Online Search Results Clustering using Carrot2 components - * plugin version: 0.9.0 + plugin version:: 1.0.3 + == Plugin Info == - * provider: Dawid Weiss, The Carrot2 project - * plugin home url: Included in Nutch CVS. Home WWW of the project: http://carrot2.sourceforge.net - * plugin download url: A binary is included in Nutch CVS. The plugin builds together with Nutch. - * license: BSD-style - * short description: Search results clustering plugin. + * provider: The Carrot2 project, [http://www.carrot2.org] + * plugin home url: Plugin is included in Nutch codebase. + * plugin download url: Binaries included with Nutch. + * license: BSD-style + * short description: Plugin for clustering search results at query-time. - * long description: A plugin that clusters search results into groups of (related, hopefully) documents. + * long description: This plugin organizes search results into groups of (related, hopefully) documents. - * configureable parameters: Take a look at the defaults defined in nutch-default.xml (search for 'clustering'). + * configureable parameters: Take a look at the defaults defined in nutch-default.xml (search for 'clustering'). - * meta data added to index: None. Clustering is performed dynamically for each result set. + * meta data added to index: None. Clustering is performed dynamically for each result set. + * required jars: The entire `lib` folder in the plugin must be present in classpath. More JARs might be needed from the Carrot2 project if additional algorithms or languages are to be used. 
- * required jars: Many - the entire lib folder in the plugin must be present in classpath. - * plugin extension points: - - * plugin extension point interface: net.nutch.clustering.OnlineClusterer + * plugin extension point interface: net.nutch.clustering.OnlineClusterer - * plugin extension point xml snippet: ? - = Installation guide + == Installation guide == - * Create some index using the instructions provided in Nutch documentation, + * Create a search index using the instructions provided in Nutch documentation. - * Deploy Nutch Web application and make sure the index is found and works (type a query and see if you + * Deploy Nutch Web application and make sure the index is found and searching works (type a query and see if you get any results). - get any results). + * Stop the web server (Tomcat, Jetty or anything you like). + * Modify `WEB-INF/classes/nutch-default.xml` file and include the clustering plugin (it is by default ignored) by adding `clustering-carrot2` to `plugin.includes` property. + * Restart your web server and reload the search page. You should see the `clustering` checkbox next to `search` button. Enable it and rerun your query. Cluster labels and documents should appear to the right of search results. - * Stop Web container (Tomcat) - * You must modify =WEB-INF/classes/nutch-default.xml= file and include the clustering plugin (it is by default - ignored). - - plugin.includes - - protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)|clustering-carrot2 - Regular expression naming plugin directory names to - - include. Any plugin not matching this expression is excluded. By - default Nutch includes crawling just HTML and plain text via HTTP, - and basic indexing and search plugins. - - * Restart Tomcat. - - * Reload the search page of Nutch. You should see the =clustering= checkbox next to =search= button. - Enable it and rerun your query. Clustered results should appear to the right. -
[Nutch Wiki] Update of ClusteringPlugin by DawidWeiss
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DawidWeiss: http://wiki.apache.org/nutch/ClusteringPlugin The comment on the change is: Configuration help added. -- * required jars: The entire `lib` folder in the plugin must be present in classpath. More JARs might be needed from the Carrot2 project if additional algorithms or languages are to be used. * plugin extension point interface: net.nutch.clustering.OnlineClusterer + * '''Carrot2 JARs come from codebase in version: 2.1''' == Installation guide == @@ -27, +28 @@ * Modify `WEB-INF/classes/nutch-default.xml` file and include the clustering plugin (it is by default ignored) by adding `clustering-carrot2` to the `plugin.includes` property. * Restart your web server and reload the search page. You should see the `clustering` checkbox next to the `search` button. Enable it and rerun your query. Cluster labels and documents should appear to the right of search results. + Note that the user interface in the default Nutch Web application is very limited and you'll most likely need something more application-specific. Look at [http://www.carrot2.org] or [http://www.carrot-search.com] for inspiration.
+
+ == Configuration guide ==
+
+ Libraries in this release are precompiled with stemming and stop words for the various languages present in the Carrot2 codebase. You should define the default language and supported languages in the Nutch configuration file (nutch-site.xml). If nothing is given in the Nutch configuration, English is used by default. The following properties can be added to `nutch-site.xml`:
+
+ {{{
+ <!-- Carrot2 Clustering plugin configuration -->
+
+ <property>
+   <name>extension.clustering.carrot2.defaultLanguage</name>
+   <value>en</value>
+   <description>Two-letter ISO code of the language.
+   http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt</description>
+ </property>
+
+ <property>
+   <name>extension.clustering.carrot2.languages</name>
+   <value>en,nl,da,fi,fr,de,it,no,pl,pt,ru,es,sv</value>
+   <description>All languages to be used by the clustering plugin.
+   This list includes all currently supported languages (although not all of them
+   will successfully instantiate -- support for Polish requires additional
+   libraries, for instance). Adjust to your needs; fewer languages take less
+   memory.
+
+   If you use the language recognizer plugin, then each hit will come with its
+   own ISO language code. All hits with no explicit language take the default
+   language specified in the extension.clustering.carrot2.defaultLanguage property.
+   </description>
+ </property>
+ }}}
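A quick way to double-check which languages a given configuration actually lists is to pull the `<value>` out of the languages property with `sed`. This is only a sketch: the config fragment is generated inline as a stand-in for a real nutch-site.xml, whose path would normally be passed in instead.

```shell
#!/bin/sh
# Extract the clustering language list from a nutch-site.xml-style file.
# The file is generated inline here as a stand-in for a real config.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<property>
  <name>extension.clustering.carrot2.languages</name>
  <value>en,nl,da,fi,fr,de,it,no,pl,pt,ru,es,sv</value>
</property>
EOF

# Find the <name> line for the languages property, read the next line,
# and keep only the text between <value> and </value>.
langs=$(sed -n '/extension.clustering.carrot2.languages/{n;s/.*<value>\(.*\)<\/value>.*/\1/p;}' "$conf")
echo "$langs"
```

Note this assumes the `<value>` element sits on the line directly after `<name>`, which holds for the property blocks shown above.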
[Nutch Wiki] Update of ClusteringPlugin by DawidWeiss
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DawidWeiss: http://wiki.apache.org/nutch/ClusteringPlugin The comment on the change is: Additional languages for the supported language list. --
 <property>
   <name>extension.clustering.carrot2.languages</name>
-  <value>en,nl,da,fi,fr,de,it,no,pl,pt,ru,es,sv</value>
+  <value>en,nl,da,fi,fr,de,it,no,pl,pt,ru,es,sv,tr,ro,hu</value>
   <description>All languages to be used by the clustering plugin.
   This list includes all currently supported languages (although not all of them
   will successfully instantiate -- support for Polish requires additional
[Nutch Wiki] Update of PublicServers by rlhoad
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by rlhoad: http://wiki.apache.org/nutch/PublicServers -- = Public search engines using Nutch = Please sort by name alphabetically - - * [http://www.arancia.com Arancia Outlaw] Italian search engine for legal material, laws and high court sentences. * [http://askaboutoil.com AskAboutOil] is a vertical search portal for the petroleum industry.
[Nutch Wiki] Update of FrontPage by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/FrontPage The comment on the change is: crawl script -- * [Upgrading_from_0.8.x_to_0.9] * RunNutchInEclipse for v0.8 * [RunNutchInEclipse0.9] for v0.9 + * Crawl - script to crawl (and possibly recrawl too) * IntranetRecrawl - script to recrawl a crawl * MergeCrawl - script to merge 2 (or more) crawls * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes
[Nutch Wiki] Update of Crawl by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/Crawl The comment on the change is: crawl script New page: == Introduction == This is a script to crawl an intranet or the web. It does not crawl using the 'bin/crawl' tool or the 'Crawl' class present in Nutch; therefore the filters present in 'conf/crawl-urlfilter.txt' have no effect on this script. The filters for this script must be set in 'regex-urlfilter.txt'. == Steps == The complete job of this script has been divided broadly into 8 steps. # Inject URLs # Generate, Fetch, Parse, Update Loop # Merge Segments # Invert Links # Index # Dedup # Merge Indexes # Reload index == Modes of Execution == The script can be executed in two modes: * Normal Mode * Safe Mode === Normal Mode === If the script is executed with the command 'bin/runbot', it will delete all the directories such as fetched segments, generated indexes, etc., so as to save space. It will also reload the index after it finishes crawling, and the new crawl DB will go live. '''Caution:''' This also means that if something has gone wrong during the crawl and the resultant crawl DB is corrupt or incomplete, it might not return results for any query. And since this crawl DB goes live in 'normal mode', your visitors may not see any results. === Safe Mode === Alternatively, the script can be executed in safe mode as 'bin/runbot safe', which will prevent deletion of these directories. If errors occur, you can take recovery action because the directories haven't been deleted. You can then manually merge the segments, generate indexes, etc. from the directories and make the resultant crawl DB go live. Safe mode also suppresses the automatic reloading of the new index. Therefore, the resultant crawl DB does not go live immediately after crawling. This gives you a chance to first test the new crawl DB for valid results. 
If it is found to work, you can make this new DB go live. === Normal Mode vs. Safe Mode === Ideally, you should run the script in safe mode a couple of times to make sure the crawl is running fine. If you are sure that everything will go fine, you need not run it in safe mode. == Tinkering == Adjust the variables 'depth', 'threads', 'adddays' and 'topN' as per your needs. Delete or comment out the statement for the 'topN' assignment if you do not wish to set a 'topN' value. === NUTCH_HOME === If you are not executing the script as 'bin/runbot' from the Nutch directory, you should either set the environment variable 'NUTCH_HOME' or edit the following in the script:
{{{
if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
}}}
Set 'NUTCH_HOME' to the path of the Nutch directory (if you are not setting it as an environment variable; if the environment variable is set, the above assignment is ignored). === CATALINA_HOME === 'CATALINA_HOME' points to the Tomcat installation directory. You must either set this as an environment variable or set it by editing the following lines in the script:
{{{
if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
}}}
As in the previous section, if this variable is set in the environment, then the above assignment is ignored. == Can it re-crawl? == The author has used this script to re-crawl a couple of times. However, no real-world testing has been done for re-crawling. Therefore, you may try to use the script for re-crawl. Whether or not it works properly for re-crawl, please let us know. == Script == {{{
#!/bin/sh

# Runs the Nutch bot to crawl or re-crawl
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails. 
#
# Author: Susam Pal

depth=2
threads=50
adddays=5
topN=2 # Comment this statement if you don't want to set topN value

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo "runbot: $0 could not find environment variable NUTCH_HOME"
  echo "runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME"
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo "runbot: $0 could not find environment variable CATALINA_HOME"
  echo "runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME"
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8
echo "- Inject (Step 1 of $steps) -"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "- Generate, Fetch, Parse, Update (Step 2 of $steps) -"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb
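The Generate/Fetch/Parse/Update loop the script performs in step 2 can be reduced to the sketch below. This is not the script itself: `NUTCH` is deliberately an `echo` stand-in so the command sequence can be inspected without a Nutch install, and the segment path is a placeholder for the usual "newest directory under crawl/segments" lookup.

```shell
#!/bin/sh
# Dry-run sketch of the Generate/Fetch/Parse/Update loop (step 2 of 8).
# NUTCH is an echo stand-in; replace with $NUTCH_HOME/bin/nutch for real runs.
NUTCH="echo bin/nutch"
depth=2
threads=50
topN=2

i=1
while [ $i -le $depth ]; do
  echo "--- Beginning crawl at depth $i of $depth ---"
  $NUTCH generate crawl/crawldb crawl/segments -topN $topN
  segment="crawl/segments/$i"   # placeholder: normally the newest dir under crawl/segments
  $NUTCH fetch "$segment" -threads $threads
  $NUTCH updatedb crawl/crawldb "$segment"
  i=`expr $i + 1`
done
```

With `fetcher.parse` left at its default, parsing happens during the fetch step, which is why no separate parse command appears in the loop.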
[Nutch Wiki] Update of FrontPage by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/FrontPage -- * [Upgrading_from_0.8.x_to_0.9] * RunNutchInEclipse for v0.8 * [RunNutchInEclipse0.9] for v0.9 - * Crawl - script to crawl (and possible recrawl too) + * [Crawl] - script to crawl (and possible recrawl too) * IntranetRecrawl - script to recrawl a crawl * MergeCrawl - script to merge 2 (or more) crawls * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes
[Nutch Wiki] Trivial Update of Crawl by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/Crawl -- == Steps == The complete job of this script has been divided broadly into 8 steps. - # Inject URLs + 1. Inject URLs - # Generate, Fetch, Parse, Update Loop + 2. Generate, Fetch, Parse, Update Loop - # Merge Segments + 3. Merge Segments - # Invert Links + 4. Invert Links - # Index + 5. Index - # Dedup + 6. Dedup - # Merge Indexes + 7. Merge Indexes - # Reload index + 8. Reload index == Modes of Execution == The script can be executed in two modes:-
[Nutch Wiki] Update of IntranetRecrawl by JamesVictor
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JamesVictor: http://wiki.apache.org/nutch/IntranetRecrawl The comment on the change is: clarified script location, removed erroneous info on 0.9.0 changes -- }}} == Version 0.8.0 and 0.9.0 == + - Place in the bin sub-directory within Nutch and run. + Place in the `bin` sub-directory within your Nutch install and run. - ** MUST CALL SCRIPT USING THE FULL PATH TO THE SCRIPT OR IT WON'T WORK*** + '''CALL THE SCRIPT USING THE FULL PATH TO THE SCRIPT OR IT WON'T WORK''' === Example Usage === `./usr/local/nutch/bin/recrawl /usr/local/tomcat/webapps/ROOT /usr/local/nutch/crawl 10 31` - (with adddays being '31', all pages will be recrawled) + Setting adddays to `31` causes all pages to be recrawled. === Changes for 0.9.0 === + No changes necessary for this to run with Nutch 0.9.0. - Change line 76 to read - {{{ - #Sets the path to bin - nutch_dir=`dirname $0`/bin - }}} - - in order for the proper path to be built. Everything else may remain the same. === Code ===
[Nutch Wiki] Update of IntranetRecrawl by JamesVictor
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JamesVictor: http://wiki.apache.org/nutch/IntranetRecrawl The comment on the change is: added instructions for Nutch 0.9.0 script -- }}} - == Version 0.8.0 == + == Version 0.8.0 and 0.9.0 == Place in the bin sub-directory within Nutch and run. ** MUST CALL SCRIPT USING THE FULL PATH TO THE SCRIPT OR IT WON'T WORK*** + === Example Usage === - ./usr/local/nutch/bin/recrawl /usr/local/tomcat/webapps/ROOT /usr/local/nutch/crawl 10 31 + `./usr/local/nutch/bin/recrawl /usr/local/tomcat/webapps/ROOT /usr/local/nutch/crawl 10 31` (with adddays being '31', all pages will be recrawled) + + === Changes for 0.9.0 === + + Change line 76 to read + {{{ + #Sets the path to bin + nutch_dir=`dirname $0`/bin + }}} + + in order for the proper path to be built. Everything else may remain the same. === Code ===
[Nutch Wiki] Update of FrontPage by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FrontPage -- * [:08CommandLineOptions:Commandline] options for version 0.8 * OverviewDeploymentConfigs * NutchConfigurationFiles - * GettingNutchRunningWithUtf8 - For support of non-ASCII characters (Chinese, Japanese and Korean). + * GettingNutchRunningWithUtf8 - For support of non-ASCII characters (Chinese, German, Japanese, Korean). * GettingNutchRunningWithResin - Resin is a JSP/Servlet/EJB application server (alternative to tomcat). * GettingNutchRunningWithJetty * GettingNutchRunningWithUbuntu
[Nutch Wiki] Trivial Update of FrontPage by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FrontPage -- * CreateNewFilter - for example to add a category metadata to your index and be able to search for it * UpgradeFrom07To08 * [Upgrading_from_0.8.x_to_0.9] - * RunNutchInEclipse + * RunNutchInEclipse for v0.8 - * [RunNutchInEclipse0.9] (update - work in progress for 0.9) + * [RunNutchInEclipse0.9] for v0.9 * IntranetRecrawl - script to recrawl a crawl * MergeCrawl - script to merge 2 (or more) crawls * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes
[Nutch Wiki] Update of RunNutchInEclipse0.9 by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 -- * change the property plugin.folders to ./src/plugin in $NUTCH_HOME/conf/nutch-default.xml * make sure Nutch is configured correctly before testing it in Eclipse ;-) + === missing org.farng and com.etranslate === + You will encounter problems with some imports in the parse-mp3 and parse-rtf plugins (30 errors in my case). + Because of incompatibility with the Apache license they were left out of the sources. + You can download them here: + + http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ + + http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ + + Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. + Then add them as libraries to the build path (first refresh the workspace, then right-click the source + folder -> Java Build Path -> Libraries -> Add JARs). + + === Build Nutch === * If you set up the project correctly, Eclipse will build Nutch for you into tmp_build. - --- okay up to here... going to do rest tomorrow... @@ -62, +75 @@ * click on Run * if all works, you should see Nutch getting busy at crawling :-) - == Debug Nutch in Eclipse == + == Debug Nutch in Eclipse (not yet tested for 0.9) == * Set breakpoints and debug a crawl * It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs. Here are a few good places to set breakpoints: {{{ @@ -78, +91 @@ Yes, Nutch and Eclipse can be a difficult companionship sometimes ;-) === eclipse: Cannot create project content in workspace === - The nutch source code must be out of the workspace folder. My first attemp was download the code with eclipse (svn) under my workspace. When I try to create the project using existing code, eclipse don't let me do it from source code into the workspace. 
I use the source code out of my workspace and it work fine. + The nutch source code must be outside the workspace folder. My first attempt was to download the code with Eclipse (svn) under my workspace. When I tried to create the project using existing code, Eclipse didn't let me do it from source code inside the workspace. I used the source code outside my workspace and it worked fine. === plugin dir not found === - Make sure you set your plugin.folders property correct, instead of using a relative path you can use a absoluth one as well in nutch-defaults.xml or may be better in nutch-site.xml + Make sure you set your plugin.folders property correctly; instead of a relative path you can use an absolute one as well, in nutch-default.xml or, maybe better, in nutch-site.xml
{{{
<property>
  <name>plugin.folders</name>
-  <value>/home/../nutch-0.8/src/plugin</value>
+  <value>/home/../nutch-0.9/src/plugin</value>
}}}
@@ -107, +120 @@ * open the class itself, rightclick * refresh the build dir - === missing org.farng and com.etranslate === - You may have problems with some imports in parse-mp3 and parse-rtf plugins. Because of incompatibility with apache licence they were left from sources. You can find it here: - - http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ - - http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ - - You need to copy jar files into plugin lib path and refresh the project. - === debugging hadoop classes === Sometimes it makes sense to also have the hadoop classes available during debugging. So, you can check out the Hadoop sources on your machine and add the sources to the hadoop-xxx.jar. Alternatively, you can:
[Nutch Wiki] Update of WritingPluginExample-0.9 by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 -- Edit this block to add a line for your plugin before the `</target>` tag. {{{ - ant dir=reccomended target=deploy / + ant dir=recommended target=deploy / }}} Running 'ant' in the root of your checkout directory should get everything compiled and jarred up. The next time you run a crawl, your parser and index filter should get used.
[Nutch Wiki] Trivial Update of FrontPage by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FrontPage -- * UpgradeFrom07To08 * [Upgrading_from_0.8.x_to_0.9] * RunNutchInEclipse - * RunNutchInEclipse0.9 (update - work in progress for 0.9) + * RunNutchInEclipse0_9 (update - work in progress for 0.9) * IntranetRecrawl - script to recrawl a crawl * MergeCrawl - script to merge 2 (or more) crawls * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes
[Nutch Wiki] Update of FrontPage by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FrontPage -- * UpgradeFrom07To08 * [Upgrading_from_0.8.x_to_0.9] * RunNutchInEclipse - * RunNutchInEclipse0_9 (update - work in progress for 0.9) + * [RunNutchInEclipse0.9] (update - work in progress for 0.9) * IntranetRecrawl - script to recrawl a crawl * MergeCrawl - script to merge 2 (or more) crawls * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes
[Nutch Wiki] Update of FrontPage by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FrontPage -- * UpgradeFrom07To08 * [Upgrading_from_0.8.x_to_0.9] * RunNutchInEclipse + * RunNutchInEclipse0.9 (update - work in progress for 0.9) * IntranetRecrawl - script to recrawl a crawl * MergeCrawl - script to merge 2 (or more) crawls * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes
[Nutch Wiki] Update of RunNutchInEclipse0.9 by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 New page: = RunNutchInEclipse = This is a work in progress. If you find errors or would like to improve this page, just create an account [UserPreferences] and start editing this page :-) == Tested with == * Nutch release 0.9 * Eclipse 3.3 - aka Europa * Java 1.6 * Ubuntu (should work on most platforms, though) == Before you start == Setting up Nutch to run in Eclipse can be tricky, and most of the time you are much faster if you edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). However, it's very useful to be able to debug Nutch in Eclipse. But again you might be quicker by looking at the logs (logs/hadoop.log)... == Steps == === Install Nutch === * Grab a fresh release of Nutch 0.9 * Do not build Nutch now. Make sure you have no .project and .classpath files in the Nutch directory === Create a new java project in Eclipse === * File > New > Project > Java project > click Next * select Create project from existing source and use the location where you downloaded Nutch * click on Next, and wait while Eclipse is scanning the folders * add the folder conf to the classpath (third tab and then add class folder) * Eclipse should have guessed all the java files that must be added on your classpath. If that's not the case, add src/java, src/test and all plugin src/java and src/test folders to your source folders. 
 * Also add all jars in lib and in the plugin lib folders to your libraries
 * Set the output dir to tmp_build; create it if necessary
 * DO NOT add build to the classpath

=== Configure Nutch ===
 * See the [http://wiki.apache.org/nutch/NutchTutorial Tutorial]
 * Change the property plugin.folders to ./src/plugin in $NUTCH_HOME/conf/nutch-default.xml
 * Make sure Nutch is configured correctly before testing it in Eclipse ;-)

=== Build Nutch ===
 * If you set up the project correctly, Eclipse will build Nutch for you into tmp_build.

--- okay up to here... going to do rest tomorrow...

=== Create Eclipse launcher ===
 * Menu Run > Run...
 * Create a New launcher for Java Application
 * Set Main class to: {{{ org.apache.nutch.crawl.Crawl }}}
 * On the Arguments tab, set Program Arguments to: {{{ urls -dir crawl -depth 3 -topN 50 }}}
 * and VM arguments to: {{{ -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log }}}
 * Click Run
 * If all works, you should see Nutch getting busy at crawling :-)

== Debug Nutch in Eclipse ==
 * Set breakpoints and debug a crawl
 * It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs. Here are a few good places to set breakpoints:
{{{
Fetcher [line: 371] - run
Fetcher [line: 438] - fetch
Fetcher$FetcherThread [line: 149] - run()
Generator [line: 281] - generate
Generator$Selector [line: 119] - map
OutlinkExtractor [line: 111] - getOutlinks
}}}

== If things do not work... ==
Yes, Nutch and Eclipse can be a difficult companionship sometimes ;-)

=== eclipse: Cannot create project content in workspace ===
The Nutch source code must be outside the workspace folder. My first attempt was to download the code with Eclipse (svn) under my workspace. When I tried to create the project using the existing code, Eclipse didn't let me do it from source code inside the workspace. I used the source code outside of my workspace and it worked fine.
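For reference, the Eclipse launcher described above corresponds roughly to the following command line (a sketch added here, not from the original wiki page; it assumes the tmp_build output dir and the lib jars set up in the steps above - adjust paths to your checkout). The `echo` is kept so the sketch is safe to paste; drop it to actually run the crawl.

```shell
# Sketch: the same crawl the Eclipse launcher runs, from $NUTCH_HOME.
# Assumes classes compiled into tmp_build and jars under lib/.
CLASSPATH="conf:tmp_build"
for jar in lib/*.jar; do CLASSPATH="$CLASSPATH:$jar"; done   # append every lib jar

# Same main class, program arguments and VM arguments as the launcher:
echo java -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log \
     -cp "$CLASSPATH" \
     org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 50
```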
=== plugin dir not found ===
Make sure you set your plugin.folders property correctly. Instead of a relative path you can use an absolute one as well, in nutch-default.xml or, maybe better, in nutch-site.xml:
{{{
<property>
  <name>plugin.folders</name>
  <value>/home/../nutch-0.8/src/plugin</value>
</property>
}}}

=== No plugins loaded during unit tests in Eclipse ===
During unit testing, Eclipse ignores conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.

=== Unit tests work in eclipse but fail when running ant in the command line ===
Suppose your unit tests work perfectly in Eclipse, but each and every one fails when running '''ant test''' on the command line - including the ones you haven't modified. Check whether you defined the '''plugin.folders''' property in hadoop-site.xml. In that case, try removing it from that file and adding it directly to nutch-site.xml. Run '''ant test''' again. That should have solved the problem. If that didn't solve the problem, are you testing a plugin? If so, did you add the plugin to the list of packages in plugin/build.xml, in the test target?

=== classNotFound ===
 * Open the class itself, right-click
 * Refresh the build dir

=== missing org.farng and com.etranslate ===
You may have problems with some imports in the parse-mp3 and parse-rtf plugins. Because of incompatibility with the Apache licence they were left out of the sources. You can find them here:
[Nutch Wiki] Update of FAQ by KaiMiddleton
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KaiMiddleton: http://wiki.apache.org/nutch/FAQ

--
 Assuming your index is located at /index:
 {{{% cd /index/
 % $CATALINA_HOME/bin/startup.sh}}}
- '''Now you can search.''
+ '''Now you can search.'''
 2) After building your first index, start and stop Tomcat, which will make Tomcat extract the Nutch webapp. Then you need to edit the nutch-site.xml and put in it the location of the index folder.
 {{{% $CATALINA_HOME/bin/startup.sh
@@ -391, +391 @@
 </property>
 }}}
 After that, __don't forget to crawl again__ and you should be able to retrieve the mime-type and content-length through the class HitDetails (via the fields primaryType, subType and contentLength) as you normally do for the title and URL of the hits.
- (Note by DanielLopez) Thanks to Doğacan Güney for the tip.
+ (Note by DanielLopez) Thanks to Dogacan Güney for the tip.

 === Crawling ===
@@ -399, +399 @@
 The crawl tool expects as its first parameter the folder name where the seeding urls file is located, so for example if your urls.txt is located in /nutch/seeds the crawl command would look like: crawl seed -dir /user/nutchuser...
- Some pages are not indexed but my regex file and everything else is okay - what is going on?
+ Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?
 The crawl tool has a default limitation of 100 outlinks of one page that are being fetched.
- To overcome this limitation change the property to a higher value or simply -1 (unlimited).
+ To overcome this limitation change the '''db.max.outlinks.per.page''' property to a higher value or simply -1 (unlimited).
 file: conf/nutch-default.xml
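The nutch-site.xml edit mentioned in step 2 above records where the index lives. As a sketch (an addition here, not part of the FAQ diff; searcher.dir is the standard Nutch 0.x property for this, and /index is the example path used above):

```xml
<!-- Sketch: nutch-site.xml of the deployed Nutch webapp.
     searcher.dir points the webapp at the crawl/index directory;
     /index is the example location from this FAQ entry. -->
<property>
  <name>searcher.dir</name>
  <value>/index</value>
</property>
```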
[Nutch Wiki] Update of FAQ by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FAQ

--
 The crawl tool expects as its first parameter the folder name where the seeding urls file is located so for example if your urls.txt is located in /nutch/seeds the crawl command would look like: crawl seed -dir /user/nutchuser...
+ Some pages are not indexed but my regex file and everyhing else is okay - what is going on?
+ The crawl tool has a default limitation of 100 outlinks of one page that are being fetched.
+ To overcome this limitation change the property to a higher value or simply -1.
+
+ file: conf/nutch-default.xml
+
+ <property>
+ <name>db.max.outlinks.per.page</name>
+ <value>-1</value>
+ <description>The maximum number of outlinks that we'll process for a page.
+ If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
+ will be processed for a page; otherwise, all outlinks will be processed.
+ </description>
+ </property>
+
+ see also: http://www.mail-archive.com/[EMAIL PROTECTED]/msg08665.html

 === Discussion ===
 [http://grub.org/ Grub] has some interesting ideas about building a search engine using distributed computing. ''And how is that relevant to nutch?''
[Nutch Wiki] Update of FAQ by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FAQ

--
 The crawl tool expects as its first parameter the folder name where the seeding urls file is located so for example if your urls.txt is located in /nutch/seeds the crawl command would look like: crawl seed -dir /user/nutchuser...
- Some pages are not indexed but my regex file and everyhing else is okay - what is going on?
+ Some pages are not indexed but my regex file and everything else is okay - what is going on?
 The crawl tool has a default limitation of 100 outlinks of one page that are being fetched.
- To overcome this limitation change the property to a higher value or simply -1.
+ To overcome this limitation change the property to a higher value or simply -1 (unlimited).
 file: conf/nutch-default.xml
+ {{{
 <property>
 <name>db.max.outlinks.per.page</name>
@@ -415, +416 @@
 </property>
 }}}
 see also: http://www.mail-archive.com/[EMAIL PROTECTED]/msg08665.html
-
+ (tested under nutch 0.9)
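As an aside (an addition here, not part of the FAQ diffs above): rather than editing conf/nutch-default.xml in place, the usual Nutch convention is to put the override in conf/nutch-site.xml, whose values take precedence over the defaults:

```xml
<!-- Sketch: override in conf/nutch-site.xml instead of editing
     nutch-default.xml; site settings win over the shipped defaults. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
```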
[Nutch Wiki] Update of FAQ by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FAQ

--
 To overcome this limitation change the property to a higher value or simply -1.
 file: conf/nutch-default.xml
-
+ {{{
 <property>
 <name>db.max.outlinks.per.page</name>
 <value>-1</value>
@@ -413, +413 @@
 will be processed for a page; otherwise, all outlinks will be processed.
 </description>
 </property>
-
+ }}}
 see also: http://www.mail-archive.com/[EMAIL PROTECTED]/msg08665.html
[Nutch Wiki] Update of Support by ThomasDelnoij
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ThomasDelnoij: http://wiki.apache.org/nutch/Support

--
  * [http://www.sigram.com Andrzej Bialecki] ab at sigram.com
  * CNLP http://www.cnlp.org/tech/lucene.asp
  * [http://www.doculibre.com/ Doculibre Inc.] Open source and information management consulting. (Lucene, Nutch, Hadoop etc.) info at doculibre.com
+ * [http://www.dsen.nl DSEN - Java | J2EE | Agile Development Consultancy]
  * eventax GmbH info at eventax.com
  * [http://www.foofactory.fi/ FooFactory] / Sami Siren info at foofactory dot fi
  * [http://www.lucene-consulting.com/ Lucene Consulting] / Otis Gospodnetic otis at apache.org
[Nutch Wiki] Update of GettingNutchRunningWithWindows by JamesVictor
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JamesVictor: http://wiki.apache.org/nutch/GettingNutchRunningWithWindows The comment on the change is: removed index-more from example; it threw an exception on my indexing

--
 Edit `conf/nutch-site.xml` and change the value of `plugin.includes` to include the plugins for the document types that you want Nutch to handle.
- For example, to add parsing for PDF, MS Office, and OpenOffice documents, and use the `index-more` instead of `index-basic`, you'll have something like:
+ Example: to add parsing for PDF, MS Office, and OpenOffice documents, you'll have something like:
 {{{
 <property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|
- index-more|query-(basic|site|url)|summary-basic|scoring-opic|
+ index-basic|query-(basic|site|url)|summary-basic|scoring-opic|
 urlnormalizer-(pass|regex|basic)</value>
 </property>
 }}}