[Nutch Wiki] Update of RunningNutchAndSolr by Dmitrius
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The RunningNutchAndSolr page has been changed by Dmitrius. The comment on this change is: Fixed command (single quotes missed). http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=28&rev2=29

--

= New in Nutch 1.0-dev =
- Please note that in the nightly version of Apache Nutch there is now a Solr integration embedded so you can start to use a lot easier. Just download a nightly version from [[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/]].
+ Please note that the nightly version of Apache Nutch now has Solr integration embedded, so you can get started much more easily. Just download a nightly version from http://hudson.zones.apache.org/hudson/job/Nutch-trunk/.

= Pre Solr Nutch integration =
- This is just a quick first pass at a guide for getting Nutch running with Solr. I'm sure there are better ways of doing some/all of it, but I'm not aware of them. By all means, please do correct/update this if someone has a better idea. Many thanks to [[http://variogram.com||Brian Whitman at Variogr.am]] and [[http://blog.foofactory.fi||Sami Siren at FooFactory]] for all the help! You guys saved me a lot of time! :)
+ This is just a quick first pass at a guide for getting Nutch running with Solr. I'm sure there are better ways of doing some/all of it, but I'm not aware of them. By all means, please do correct/update this if someone has a better idea. Many thanks to http://variogram.com and http://blog.foofactory.fi for all the help! You guys saved me a lot of time! :) I'm posting it under Nutch rather than Solr on the presumption that people are more likely to be learning/using Solr first, then come here looking to combine it with Nutch. I'm going to skip over doing it command by command for right now. I'm running/building on Ubuntu 7.10 using Java 1.6.0_05.
I'm assuming that the Solr trunk code is checked out into solr-trunk and the Nutch trunk code is checked out into nutch-trunk.

@@ -12, +12 @@
 * apt-get install sun-java6-jdk subversion ant patch unzip

== Steps ==
- The first step to get started is to download the required software components, namely Apache Solr and Nutch.

'''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from the Download page

@@ -23, +22 @@
'''4.''' Extract the Nutch package

 tar xzf apache-nutch-1.0.tar.gz

+ '''5.''' Configure Solr For the sake of simplicity we are going to use the example configuration of Solr as a base.
- '''5.''' Configure Solr
- For the sake of simplicity we are going to use the example
- configuration of Solr as a base.
- '''a.''' Copy the provided Nutch schema from directory
- apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (override the existing file)
+ '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (overwrite the existing file)

We want to allow Solr to create the snippets for search results so we need to store the content in addition to indexing it:

@@ -52, +48 @@
 <str name="qf">
- content^0.5 anchor^1.0 title^1.2
+ content^0.5 anchor^1.0 title^1.2 </str>
- </str>
- <str name="pf">
- content^0.5 anchor^1.5 title^1.2 site^1.5
+ <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
- </str>
+ <str name="fl">url</str>
- <str name="fl">
- url
- </str>
+ <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
- <str name="mm">
- 2&lt;-1 5&lt;-2 6&lt;90%
- </str>
 <int name="ps">100</int>

@@ -91, +80 @@
'''6.''' Start Solr

+ cd apache-solr-1.3.0/example java -jar start.jar
- cd apache-solr-1.3.0/example
- java -jar start.jar

'''7. Configure Nutch'''

a. Open nutch-site.xml in directory apache-nutch-1.0/conf, replace its contents with the following (we specify our crawler name, active plugins and limit the maximum URL count for a single host per run to 100):

+ <?xml version="1.0"?> <configuration>
- <?xml version="1.0"?>
- <configuration>
 <property>

@@ -109, +96 @@
 </property>
- <property>
- <name>generate.max.per.host</name>
+ <property> <name>generate.max.per.host</name>
 <value>100</value>

@@ -126, +112 @@
 </configuration>

- '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf, replace its content with the following:

 -^(https|telnet|file|ftp|mailto):
+
-
- # skip some suffixes
- -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+ # skip some suffixes -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-
+
-
- # skip URLs
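The suffix-skipping rule quoted in the diff above can be tried out directly in a shell before putting it into regex-urlfilter.txt. This sketch uses `grep -vE` with a trimmed subset of the page's suffix list (Nutch itself applies such rules through its urlfilter-regex plugin, not grep):

```shell
# A trimmed subset of the suffix-skip pattern from regex-urlfilter.txt.
# grep -vE keeps only the URLs that do NOT match, i.e. the ones a crawl
# configured with this rule would still fetch.
pattern='\.(swf|SWF|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|gz|exe)$'
printf '%s\n' \
  'http://example.com/index.html' \
  'http://example.com/logo.png' \
  'http://example.com/report.pdf' \
  | grep -vE "$pattern"
# prints only http://example.com/index.html
```

Testing the pattern this way catches escaping mistakes cheaply before a full crawl run.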
[Nutch Wiki] Update of RunningNutchAndSolr by Dmitrius
The RunningNutchAndSolr page has been changed by Dmitrius. The comment on this change is: It's a problem to make the wiki display a grave accent. Managed to do that using HTML codes. http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=29&rev2=30

--

The above command will generate a new segment directory under crawl/segments that at this point contains files that store the url(s) to be fetched. In the following commands we need the latest segment dir as a parameter, so we'll store it in an environment variable:
- export SEGMENT=crawl/segments/``ls -tr crawl/segments|tail -1``
+ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`

Now I launch the fetcher that actually goes to get the content:
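The backtick command substitution above just picks the most recently modified directory under crawl/segments. It can be checked without running a crawl at all; here two fake segment directories (names and timestamps are made up for the sketch) stand in for real ones:

```shell
# Simulate two segment directories and pick the newest, the same way the
# wiki command does: ls -t sorts newest first, -r reverses to oldest first,
# so tail -1 yields the newest entry.
cd "$(mktemp -d)"
mkdir -p crawl/segments/20090101000000 crawl/segments/20090202000000
touch -t 200901010000 crawl/segments/20090101000000
touch -t 200902020000 crawl/segments/20090202000000
export SEGMENT=crawl/segments/`ls -tr crawl/segments | tail -1`
echo "$SEGMENT"   # prints crawl/segments/20090202000000
```

Because the selection is by modification time, re-running generate always makes the freshly created segment the one captured.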
[Nutch Wiki] Update of RunningNutchAndSolr by Dmitrius
The RunningNutchAndSolr page has been changed by Dmitrius. http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=30&rev2=31

--

The above command will generate a new segment directory under crawl/segments that at this point contains files that store the url(s) to be fetched. In the following commands we need the latest segment dir as a parameter, so we'll store it in an environment variable:
- export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
+ export SEGMENT=crawl/segments/&#96;ls -tr crawl/segments|tail -1&#96;

Now I launch the fetcher that actually goes to get the content:
[Nutch Wiki] Update of FrontPage by JulienNioche
The FrontPage page has been changed by JulienNioche. http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=128&rev2=129

--

 * [[Mailing]] Lists
 * AcademicArticles that deal with Nutch
 * [[http://videolectures.net/iiia06_cutting_ense/|Experiences with the Nutch search engine]] author: Doug Cutting, Video Lecture
-
== Nutch Administration ==
 * DownloadingNutch

@@ -89, +88 @@
 * TikaPlugin - Comments on the Tika integration and differences with existing parse plugins

== Nutch 2.0 ==
+ * Nutch2Roadmap -- Discussions on the architecture and features of Nutch 2.0
- * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture.
+ * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture (old)
 * NewScoring -- New stable pagerank-like webgraph and link-analysis jobs.
 * NewScoringIndexingExample -- Two full fetch cycles of commands using the new scoring and indexing systems.
[Nutch Wiki] Update of Nutch2Roadmap by JulienNioche
The Nutch2Roadmap page has been changed by JulienNioche. http://wiki.apache.org/nutch/Nutch2Roadmap

--

New page:

= Nutch2Roadmap =
Here is a list of the features and architectural changes that will be implemented in Nutch 2.0.
 * Storage Abstraction
  * initially with back end implementations for HBase and HDFS
  * extend it to other storages later, e.g. MySQL etc.
 * Plugin cleanup: Tika only for parsing document formats
  * keep only HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format.
 * Externalize functionality to the crawler-commons project [http://code.google.com/p/crawler-commons/]
  * robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats.
 * Remove index / search and delegate to SOLR
  * we may still keep a thin abstract layer to allow other indexing/search backends (ElasticSearch?), but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there.
 * Various new functionalities
  * e.g. sitemap support, canonical tag, better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc.

This document is meant to serve as a basis for discussion; feel free to contribute to it.
[Nutch Wiki] Update of Nutch2Roadmap by JulienNioche
The Nutch2Roadmap page has been changed by JulienNioche. http://wiki.apache.org/nutch/Nutch2Roadmap?action=diff&rev1=1&rev2=2

--

 * Storage Abstraction
  * initially with back end implementations for HBase and HDFS
  * extend it to other storages later, e.g. MySQL etc.
- * Plugin cleanup : Tika only for parsing document formats
+ * Plugin cleanup : Tika only for parsing document formats (see http://wiki.apache.org/nutch/TikaPlugin)
  * keep only HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format.
 * Externalize functionality to the crawler-commons project [http://code.google.com/p/crawler-commons/]
  * robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats.
[Nutch Wiki] Update of FAQ by Ankit Dangi
The FAQ page has been changed by Ankit Dangi. http://wiki.apache.org/nutch/FAQ?action=diff&rev1=115&rev2=116

--

TableOfContents

== Nutch FAQ ==
- === General ===

Are there any mailing lists available?
- There are user, developer, commits and agents lists, all available at http://lucene.apache.org/nutch/mailing_lists.html.

How can I stop Nutch from crawling my site?
- Please visit our [[http://lucene.apache.org/nutch/bot.html|webmaster info page]]

Will Nutch be a distributed, P2P-based search engine?
- We don't think it is presently possible to build a peer-to-peer search engine that is competitive with existing search engines. It would just be too slow. Returning results in less than a second is important: it lets people rapidly reformulate their queries so that they can more often find what they're looking for. In short, a fast search engine is a better search engine. I don't think many people would want to use a search engine that takes ten or more seconds to return results. That said, if someone wishes to start a sub-project of Nutch exploring distributed searching, we'd love to host it. We don't think these techniques are likely to solve the hard problems Nutch needs to solve, but we'd be happy to be proven wrong.

Will Nutch use a distributed crawler, like Grub?
- Distributed crawling can save download bandwidth, but, in the long run, the savings are not significant. A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages, so making the crawler use less bandwidth does not reduce overall bandwidth requirements. The dominant expense of operating a large search engine is not crawling, but searching.

Won't open source just make it easier for sites to manipulate rankings?
- Search engines work hard to construct ranking algorithms that are immune to manipulation.
Search engine optimizers still manage to reverse-engineer the ranking algorithms used by search engines, and improve the ranking of their pages. For example, many sites use link farms to manipulate search engines' link-based ranking algorithms, and search engines retaliate by improving their link-based algorithms to neutralize the effect of link farms. With an open-source search engine, this will still happen, just out in the open. This is analogous to encryption and virus protection software. In the long term, making such algorithms open source makes them stronger, as more people can examine the source code to find flaws and suggest improvements. Thus we believe that an open source search engine has the potential to better resist manipulation of its rankings.

What Java version is required to run Nutch?
- Nutch 0.7 will run with Java 1.4 and up.

Exception: java.net.SocketException: Invalid argument or cannot assign requested address on Fedora Core 3 or 4
- It seems you have installed IPV6 on your machine. To solve this problem, add the following java param to the java instantiation in bin/nutch:

 JAVA_IPV4=-Djava.net.preferIPv4Stack=true
- # run it
- exec $JAVA $JAVA_HEAP_MAX $NUTCH_OPTS $JAVA_IPV4 -classpath "$CLASSPATH" $CLASS "$@"
+ # run it exec $JAVA $JAVA_HEAP_MAX $NUTCH_OPTS $JAVA_IPV4 -classpath "$CLASSPATH" $CLASS "$@"

I have two XML files, nutch-default.xml and nutch-site.xml, why?
+ nutch-default.xml is the out of the box configuration for nutch. Most configuration can (and should, unless you know what you're doing) stay as it is. nutch-site.xml is where you make the changes that override the default settings. The same goes for the servlet container application.
-
- nutch-default.xml is the out of the box configuration for nutch. Most configuration can (and should unless you know what your doing) stay as it is.
- nutch-site.xml is where you make the changes that override the default settings.
- The same goes to the servlet container application.
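The bin/nutch change described in the IPv4 answer above just splices one extra system property into the final java invocation. As a sketch (the variable values here are illustrative placeholders, not what bin/nutch actually sets):

```shell
# Sketch: where the JAVA_IPV4 flag lands in the command line that bin/nutch
# builds up. Only JAVA_IPV4 is the actual fix; the other values are made up.
JAVA=java
JAVA_HEAP_MAX=-Xmx1000m
NUTCH_OPTS=""
CLASSPATH=conf
CLASS=org.apache.nutch.crawl.Crawl
JAVA_IPV4=-Djava.net.preferIPv4Stack=true
echo "$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS $JAVA_IPV4 -classpath $CLASSPATH $CLASS"
```

Echoing the assembled line before switching it to `exec` is an easy way to confirm the property really made it into the invocation.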
My system does not find the segments folder. Why? Or: How do I tell the ''Nutch Servlet'' where the index files are located?
- There are at least two choices to do that:
- First you need to copy the .WAR file to the servlet container webapps folder.
+ . First you need to copy the .WAR file to the servlet container webapps folder.
+ {{{
 % cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war
 }}}
-
- 1) After building your first index, start Tomcat from the index folder.
+ . 1) After building your first index, start Tomcat from the index folder.
- Assuming your index is located at /index :
+ . Assuming your index is located at /index :
+ {{{
 % cd /index/
 % $CATALINA_HOME/bin/startup.sh
 }}}
- '''Now you can search.'''
+ . '''Now
[Nutch Wiki] Update of Support by Christopher Bader
The Support page has been changed by Christopher Bader. http://wiki.apache.org/nutch/Support?action=diff&rev1=48&rev2=49

--

 * [[http://www.dsen.nl|Thomas Delnoij (DSEN) - Java | J2EE | Agile Development Consultancy]]
 * eventax GmbH info at eventax.com
 * [[http://www.foofactory.fi/|FooFactory]] / Sami Siren info at foofactory dot fi
+ * [[http://www.kratylos.com/|Kratylos Technologies]] - Consulting, development, and tech support for open source search and speech.
 * [[http://www.lucene-consulting.com/|Lucene Consulting]] - Nutch, Solr, Lucene, Hadoop consulting and development. Founded by Otis Gospodnetic, [[http://www.amazon.com/Lucene-Action-Otis-Gospodnetic/dp/1932394281|Lucene in Action]] co-author.
 * Stefan Groschupf sg at media-style.com
 * Michael Nebel mn at nebel.de (germany preferred)
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
The HttpAuthenticationSchemes page has been changed by susam. http://wiki.apache.org/nutch/HttpAuthenticationSchemes?action=diff&rev1=18&rev2=19

--

=== Important Points ===
 1. For the authscope tag, the 'host' and 'port' attributes should always be specified. The 'realm' and 'scheme' attributes may or may not be specified depending on your needs. If you are tempted to omit the 'host' and 'port' attributes because you want the credentials to be used for any host and any port for that realm/scheme, please use the 'default' tag instead. That's what the 'default' tag is meant for.
 1. One authentication scope should not be defined twice as different authscope tags for different credentials tags. However, if this is done by mistake, the credentials for the last defined authscope tag would be used. This is because the XML parsing code reads the file from top to bottom and sets the credentials for authentication-scopes. If the same authentication scope is encountered once again, it will be overwritten with the new credentials. However, one should not rely on this behavior as it might change with further development.
- 1. Do not define multiple authscope tags with the same host, port but different realms if the server requires NTLM authentication. This means there should not be multiple tags with same host, port, scheme=NTLM but different realms. If you are omitting the scheme attribute and the server requires NTLM authentication, then there should not be multiple tags with same host, port but different realms. This is discussed more in the next section.
+ 1. Do not define multiple authscope tags with the same host, port but different realms if the server requires NTLM authentication. This means there should not be multiple authscope tags with same host, port, scheme=NTLM but different realms.
If you are omitting the scheme attribute and the server requires NTLM authentication, then there should not be multiple tags with same host, port but different realms. This is discussed more in the next section.
 1. If you are using the NTLM scheme, you should also set the 'http.agent.host' property in conf/nutch-site.xml

=== A note on NTLM domains ===
NTLM does not use the concept of realms. Therefore, multiple realms for a web-server can not be defined as different authentication scopes for the same web-server requiring NTLM authentication. There should be exactly one authscope tag for an NTLM-scheme authentication scope for a particular web-server. The authentication domain should be specified as the value of the 'realm' attribute. NTLM authentication also requires the name or IP address of the host on which the crawler is running. Thus, 'http.agent.host' should be set properly.

== Underlying HttpClient Library ==
- 'protocol-httpclient' is based on [[http://jakarta.apache.org/httpcomponents/httpclient-3.x/|Jakarta Commons HttpClient]]. Some servers support multiple schemes for authenticating users. Given that only one scheme may be used at a time for authenticating, it must choose which scheme to use. To accompish this, it uses an order of preference to select the correct authentication scheme. By default this order is: NTLM, Digest, Basic. For more information on the behavior during authentication, you might want to read the [[http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html|HttpClient Authentication Guide]].
+ 'protocol-httpclient' is based on [[http://hc.apache.org/httpclient-3.x/|Jakarta Commons HttpClient]]. Some servers support multiple schemes for authenticating users. Given that only one scheme may be used at a time for authenticating, it must choose which scheme to use. To accomplish this, it uses an order of preference to select the correct authentication scheme. By default this order is: NTLM, Digest, Basic.
For more information on the behavior during authentication, you might want to read the [[http://hc.apache.org/httpclient-3.x/authentication.html|HttpClient Authentication Guide]]. == Need Help? == If you need help, please feel free to post your question to the [[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing list]]. The author of this work, Susam Pal, usually responds to mails related to authentication problems. The DEBUG logs may be required to troubleshoot the problem. You must enable the debug log for 'protocol-httpclient' before running the crawler. To enable debug log for 'protocol-httpclient', open 'conf/log4j.properties' and add the following line:
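As a sketch of what the points above describe, a single authentication scope in Nutch's conf/httpclient-auth.xml might look roughly like this. The element and attribute names here are inferred from the description above; check them against the httpclient-auth.xml template shipped with your Nutch version before relying on them:

```xml
<auth-configuration>
  <!-- Hypothetical host, realm and credentials, for illustration only. -->
  <credentials username="crawler" password="secret">
    <!-- 'host' and 'port' should always be specified; 'realm' and
         'scheme' are optional, as the Important Points above explain. -->
    <authscope host="intranet.example.com" port="80"
               realm="CORP" scheme="NTLM"/>
  </credentials>
</auth-configuration>
```

For an NTLM server, the points above imply exactly one such authscope per host, with the NTLM domain carried in the 'realm' attribute.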
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
The HttpAuthenticationSchemes page has been changed by susam. The comment on this change is: Added suggestion to enable debug log for Jakarta Commons HttpClient. http://wiki.apache.org/nutch/HttpAuthenticationSchemes?action=diff&rev1=19&rev2=20

--

'protocol-httpclient' is based on [[http://hc.apache.org/httpclient-3.x/|Jakarta Commons HttpClient]]. Some servers support multiple schemes for authenticating users. Given that only one scheme may be used at a time for authenticating, it must choose which scheme to use. To accomplish this, it uses an order of preference to select the correct authentication scheme. By default this order is: NTLM, Digest, Basic. For more information on the behavior during authentication, you might want to read the [[http://hc.apache.org/httpclient-3.x/authentication.html|HttpClient Authentication Guide]].

== Need Help? ==
- If you need help, please feel free to post your question to the [[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing list]]. The author of this work, Susam Pal, usually responds to mails related to authentication problems. The DEBUG logs may be required to troubleshoot the problem. You must enable the debug log for 'protocol-httpclient' before running the crawler. To enable debug log for 'protocol-httpclient', open 'conf/log4j.properties' and add the following line:
+ If you need help, please feel free to post your question to the [[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing list]]. The author of this work, Susam Pal, usually responds to mails related to authentication problems. The DEBUG logs may be required to troubleshoot the problem. You must enable the debug log for 'protocol-httpclient' and Jakarta Commons !HttpClient before running the crawler.
To enable debug log for 'protocol-httpclient' and !HttpClient, open 'conf/log4j.properties' and add the following lines:
{{{
 log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
+ log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout
}}}
It would be good to check the following things before asking for help.
[Nutch Wiki] Trivial Update of HttpAuthenticationSchemes by susam
The HttpAuthenticationSchemes page has been changed by susam. The comment on this change is: Added a link to my website. http://wiki.apache.org/nutch/HttpAuthenticationSchemes?action=diff&rev1=20&rev2=21

--

'protocol-httpclient' is based on [[http://hc.apache.org/httpclient-3.x/|Jakarta Commons HttpClient]]. Some servers support multiple schemes for authenticating users. Given that only one scheme may be used at a time for authenticating, it must choose which scheme to use. To accomplish this, it uses an order of preference to select the correct authentication scheme. By default this order is: NTLM, Digest, Basic. For more information on the behavior during authentication, you might want to read the [[http://hc.apache.org/httpclient-3.x/authentication.html|HttpClient Authentication Guide]].

== Need Help? ==
- If you need help, please feel free to post your question to the [[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing list]]. The author of this work, Susam Pal, usually responds to mails related to authentication problems. The DEBUG logs may be required to troubleshoot the problem. You must enable the debug log for 'protocol-httpclient' and Jakarta Commons !HttpClient before running the crawler.
+ If you need help, please feel free to post your question to the [[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing list]]. The author of this work, [[http://susam.in/|Susam Pal]], usually responds to mails related to authentication problems. The DEBUG logs may be required to troubleshoot the problem. You must enable the debug log for 'protocol-httpclient' and Jakarta Commons !HttpClient before running the crawler.
To enable debug log for 'protocol-httpclient' and !HttpClient, open 'conf/log4j.properties' and add the following lines:
{{{
 log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
 log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout
}}}
[Nutch Wiki] Trivial Update of Crawl by susam
The Crawl page has been changed by susam. The comment on this change is: Fixed wiki markup for code. http://wiki.apache.org/nutch/Crawl?action=diff&rev1=8&rev2=9

--

=== NUTCH_HOME ===
If you are not executing the script as 'bin/runbot' from the Nutch directory, you should either set the environment variable 'NUTCH_HOME' or edit the following in the script:
+ {{{
- {{{if [ -z "$NUTCH_HOME" ]
+ if [ -z "$NUTCH_HOME" ]
 then
- NUTCH_HOME=.}}}
+ NUTCH_HOME=.
+ }}}
Set 'NUTCH_HOME' to the path of the Nutch directory if you are not setting it as an environment variable (if the environment variable is set, the above assignment is ignored).

=== CATALINA_HOME ===
'CATALINA_HOME' points to the Tomcat installation directory. You must either set this as an environment variable or set it by editing the following lines in the script:
+ {{{
- {{{if [ -z "$CATALINA_HOME" ]
+ if [ -z "$CATALINA_HOME" ]
 then
- CATALINA_HOME=/opt/apache-tomcat-6.0.10}}}
+ CATALINA_HOME=/opt/apache-tomcat-6.0.10
+ }}}
Similar to the previous section, if this variable is set in the environment, then the above assignment is ignored.
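The pattern the runbot script uses in both sections above, assigning a default only when the variable is unset, can be seen in isolation (a sketch of the idiom, not the full script):

```shell
# Mimic the runbot logic: keep NUTCH_HOME from the environment if it is
# already set, otherwise fall back to the current directory.
if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
fi
echo "NUTCH_HOME is $NUTCH_HOME"
```

This is why an exported environment variable silently wins over the in-script default, exactly as the surrounding text notes.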
[Nutch Wiki] Trivial Update of Crawl by susam
The Crawl page has been changed by susam. The comment on this change is: Fixed typo. http://wiki.apache.org/nutch/Crawl?action=diff&rev1=9&rev2=10

--

Similar to the previous section, if this variable is set in the environment, then the above assignment is ignored.

== Can it re-crawl? ==
- The author has used this script to re-crawl a couple of times. However, no real world testing has been done for re-crawling. Therefore, you may try to use the script of re-crawl. If it works out fine or it doesn't work properly for re-crawl, please let us know.
+ The author has used this script to re-crawl a couple of times. However, no real-world testing has been done for re-crawling. Therefore, you may try to use the script for re-crawl. If it works fine or it doesn't work properly for re-crawl, please let us know.

== Script ==
{{{
[Nutch Wiki] Update of Becoming_A_Nutch_Developer by maqboolzee
The Becoming_A_Nutch_Developer page has been changed by maqboolzee. http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer?action=diff&rev1=7&rev2=8

--

 * [[http://www.mail-archive.com/index.php?hunt=nutch|Nutch Mail Archive]]
 * [[http://www.nabble.com/forum/Search.jtp?query=nutch|Nabble Nutch]]
+ * [[http://search.lucidimagination.com/search/#/p:nutch][Lucid Imagination Email]]

When searching the list for errors you have received, it is good to search both by component, for example fetcher, and by the actual error received. If you are not finding the answers you are looking for on the list, you may want to move to JIRA and search there for answers.
[Nutch Wiki] Update of Becoming_A_Nutch_Developer by maqboolzee
The Becoming_A_Nutch_Developer page has been changed by maqboolzee. http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer?action=diff&rev1=8&rev2=9

--

 * [[http://www.mail-archive.com/index.php?hunt=nutch|Nutch Mail Archive]]
 * [[http://www.nabble.com/forum/Search.jtp?query=nutch|Nabble Nutch]]
- * [[http://search.lucidimagination.com/search/#/p:nutch][Lucid Imagination Email]]
+ * [[http://search.lucidimagination.com/search/#/p:nutch|Lucid Imagination Email]]

When searching the list for errors you have received, it is good to search both by component, for example fetcher, and by the actual error received. If you are not finding the answers you are looking for on the list, you may want to move to JIRA and search there for answers.
New attachment added to page Evaluations on Nutch Wiki
Dear Wiki user, You have subscribed to a wiki page Evaluations for change notification. An attachment has been added to that page by IvanKelly. The following detailed information is available:

Attachment name: OSU_Queries.pdf
Attachment size: 77705
Attachment link: http://wiki.apache.org/nutch/Evaluations?action=AttachFile&do=get&target=OSU_Queries.pdf
Page link: http://wiki.apache.org/nutch/Evaluations
[Nutch Wiki] Update of Evaluations by IvanKelly
The Evaluations page has been changed by IvanKelly. http://wiki.apache.org/nutch/Evaluations?action=diff&rev1=4&rev2=5

--

-- DougCutting - 29 Jun 2004

||'''Attachment:'''||'''Action:'''||'''Size:'''||'''Date:'''||'''Who:'''||'''Comment:'''||
- ||[[http://www.nutch.org/twiki/Main/Evaluations/OSU_Queries.pdf|OSU_Queries.pdf]]||action||77705||29 Jun 2004 - 17:07||DougCutting||OSU evaluation by Lyle Benedict||
+ ||[[attachment:OSU_Queries.pdf|OSU_Queries.pdf]]||action||77705||29 Jun 2004 - 17:07||DougCutting||OSU evaluation by Lyle Benedict||
|| || || || || ||
[Nutch Wiki] Update of RunNutchInEclipse1.0 by maqboolzee
The RunNutchInEclipse1.0 page has been changed by maqboolzee. http://wiki.apache.org/nutch/RunNutchInEclipse1.0?action=diff&rev1=15&rev2=16

--

 1. Execute 'ant job' (which is the default) after downloading Nutch through SVN
- 1. Update plugin.folders (under nutch-default.xml) to ECLIPSE_OUTPUT_FOLDER/plugins
+ 1. Update plugin.folders (under nutch-default.xml) to build/plugins (where ant builds plugins)
 1. If it still fails, increase your memory allocation or find a simpler website to crawl.
[Nutch Wiki] Update of RunNutchInEclipse1.0 by maqboolzee
The RunNutchInEclipse1.0 page has been changed by maqboolzee. The comment on this change is: plugins BasicURLNormalizer exception resolution. http://wiki.apache.org/nutch/RunNutchInEclipse1.0?action=diff&rev1=13&rev2=14

--

= Run Nutch In Eclipse on Linux and Windows nutch version 1.0 =
- This is a work in progress. If you find errors or would like to improve this page, just create an account [UserPreferences] and start editing this page :-)

== Tested with ==
@@ -12, +11 @@
 * Windows XP and Vista

== Before you start ==
- Setting up Nutch to run in Eclipse can be tricky, and most of the time it is much faster if you edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). However, it's very useful to be able to debug Nutch in Eclipse. Sometimes examining the logs (logs/hadoop.log) is quicker for debugging a problem.
- == Steps ==
-
- === For Windows Users ===
- If you are running Windows (tested on Windows XP) you must first install cygwin. Download it from http://www.cygwin.com/setup.exe. Install cygwin and set the PATH environment variable for it. You can set it from the Control Panel, System, Advanced Tab, Environment Variables and edit/add PATH. Example PATH:
+ {{{
 C:\Sun\SDK\bin;C:\cygwin\bin
 }}}
If you run bash from the Windows command line (Start > Run... cmd.exe) it should successfully run cygwin. If you are running Eclipse on Vista, you will need to either give cygwin administrative privileges or [[http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/|turn off Vista's User Access Control (UAC)]]. Otherwise Hadoop will likely complain that it cannot change a directory permission when you later run the crawler:
+ {{{
 org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions of ...
Permission denied }}} - - See [[http://markmail.org/message/ymgygimtvuksn2ic#query:Exception%20in%20thread%20main%20org.apache.hadoop.util.Shell%24ExitCodeException%3A%20chmod%3A%20changing%20permissions+page:1+mid:pj3spjhvdtjx736q+state:results|this]] for more information about the UAC issue. + See [[http://markmail.org/message/ymgygimtvuksn2ic#query:Exception%20in%20thread%20main%20org.apache.hadoop.util.Shell$ExitCodeException:%20chmod:%20changing%20permissions+page:1+mid:pj3spjhvdtjx736q+state:results|this]] for more information about the UAC issue. === Install Nutch === - * Grab a [[http://lucene.apache.org/nutch/version_control.html|fresh release]] of Nutch 1.0 or download and untar the [[http://lucene.apache.org/nutch/release/|official 1.0 release]]. + * Grab a [[http://lucene.apache.org/nutch/version_control.html|fresh release]] of Nutch 1.0 or download and untar the [[http://lucene.apache.org/nutch/release/|official 1.0 release]]. * Do not build Nutch yet. Make sure you have no .project and .classpath files in the Nutch directory - === Create a new Java Project in Eclipse === * File New Project Java project click Next * Name the project (Nutch_Trunk for instance) * Select Create project from existing source and use the location where you downloaded Nutch * Click on Next, and wait while Eclipse is scanning the folders - * Add the folder conf to the classpath (Right-click on the project, select properties then Java Build Path tab (left menu) and then the Libraries tab. Click Add Class Folder... button, and select conf from the list) + * Add the folder conf to the classpath (Right-click on the project, select properties then Java Build Path tab (left menu) and then the Libraries tab. Click Add Class Folder... button, and select conf from the list) * Go to Order and Export tab, find the entry for added conf folder and move it to the top (by checking it and clicking the Top button). 
This is required so Eclipse will take config (nutch-default.xml, nutch-final.xml, etc.) resources from our conf folder and not from somewhere else. - * Eclipse should have guessed all the Java files that must be added to your classpath. If that's not the case, add src/java, src/test and all plugin src/java and src/test folders to your source folders. Also add all jars in lib and in the plugin lib folders to your libraries + * Eclipse should have guessed all the Java files that must be added to your classpath. If that's not the case, add src/java, src/test and all plugin src/java and src/test folders to your source folders. Also add all jars in lib and in the plugin lib folders to your libraries * Click the Source tab and set the default output folder to Nutch_Trunk/bin/tmp_build. (You may need to create the tmp_build folder.) * Click the Finish button * DO NOT add build to classpath - === Configure Nutch === * See the
[Nutch Wiki] Update of RunNutchInEclipse1.0 by maqboolzee
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The RunNutchInEclipse1.0 page has been changed by maqboolzee. http://wiki.apache.org/nutch/RunNutchInEclipse1.0?action=diff&rev1=14&rev2=15 -- === NOTE: Additional note for people who want to run eclipse with latest nutch code === If you are getting the following exception - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: [[http://org.apache.nutch.net.urlnormalizer.basic.ba/|org.apache.nutch.net]].urlnormalizer.basic.BasicURLNormalizer - 1. Execute 'ant job' (which is the default) after downloading nutch through SVN + 1. Execute 'ant job' (which is the default) after downloading nutch through SVN + - 2. Update plugin.folders (under nutch-default.xml) to ECLIPSE_OUTPUT_FOLDER/plugins + 1. Update plugin.folders (under nutch-default.xml) to ECLIPSE_OUTPUT_FOLDER/plugins + - 3. If it still fails increase your memory allocation or find a simpler website to crawl. + 1. If it still fails increase your memory allocation or find a simpler website to crawl. === Unit tests work in eclipse but fail when running ant in the command line === Suppose your unit tests work perfectly in eclipse, but each and every one fails when running '''ant test''' in the command line - including the ones you haven't modified. Check if you defined the '''plugin.folders''' property in hadoop-site.xml. In that case, try removing it from that file and adding it directly to nutch-site.xml @@ -235, +237 @@ Original credits: RenaudRichardet + Updated by: Zeeshan +
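The fix above amounts to a one-property override. With the wiki escaping undone, the entry looks roughly like this (a sketch; ECLIPSE_OUTPUT_FOLDER is the placeholder from the note — substitute your Eclipse output folder, e.g. bin/tmp_build):

```xml
<property>
  <name>plugin.folders</name>
  <!-- ECLIPSE_OUTPUT_FOLDER is a placeholder for your Eclipse output folder -->
  <value>ECLIPSE_OUTPUT_FOLDER/plugins</value>
</property>
```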
[Nutch Wiki] Update of Support by OtisGospodnetic
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The Support page has been changed by OtisGospodnetic. http://wiki.apache.org/nutch/Support?action=diff&rev1=47&rev2=48 -- * [[http://www.ingate.de|INGATE GmbH]] * [[http://www.intrafind.de|IntraFind Software AG]] * Michael Rosset mrosset at btmeta.com + * [[http://sematext.com/|Sematext]] (Otis Gospodnetic, Lucene in Action and Solr in Action co-author) - Solr, Lucene, Nutch, Hadoop, HBase, EC2. Lucene/Solr [[http://sematext.com/services/tech-support.html|tech support]], development and [[http://sematext.com/services/index.html|consulting services]], and [[http://sematext.com/products/index.html|search products]]. Presence in North America and Europe. * Supreet Sethi supreet at linux-delhi.org (India preferred) * Sudhi Seshachala sudhi_...@yahoo.com Please visit http://www.myopensourcejobs.com (Built on LAMP and Nutch) * http://www.termindoc.de (SP data GmbH, Germany schackenberg at termindoc.de)
Page search2.net deleted from Nutch Wiki
Dear wiki user, You have subscribed to a wiki page Nutch Wiki for change notification. The page search2.net has been deleted by search2.net. The comment on this change is: empty page. http://wiki.apache.org/nutch/search2.net
[Nutch Wiki] Update of FrontPage by JohnWhelan
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FrontPage page has been changed by JohnWhelan. The comment on this change is: Changes to Cygwin mount points have broken the WhelanLabs Search Engine Manager. No new version is planned. http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=128&rev2=129 -- * [[http://blog.foofactory.fi/|FooFactory]] Nutch and Hadoop related posts * [[http://spinn3r.com|Spinn3r]] [[http://spinn3r.com/opensource.php|Open Source components]] (our contribution to the crawling OSS community with more to come). * [[http://www.interadvertising.co.uk/blog/nutch_logos|Larger / better quality Nutch logos]] Re-created Nutch logos available in GIF, PNG and EPS in resolutions up to 1200 x 449 - * [[http://www.whelanlabs.com/content/SearchEngineManager.htm|WhelanLabs SearchEngine Manager]] An all-in-one, bundled implementation of Nutch, Tomcat, Cygwin, and JRE for Microsoft Windows. Includes an installer and a simplified administrative UI.
[Nutch Wiki] Update of RunningNutchAndSolr by GeoffBentley
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The RunningNutchAndSolr page has been changed by GeoffBentley. http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=28&rev2=29 -- = New in Nutch 1.0-dev = - Please note that in the nightly version of Apache Nutch there is now a Solr integration embedded, so you can get started a lot more easily. Just download a nightly version from [[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/]]. + Please note that in the nightly version of Apache Nutch there is now a Solr integration embedded, so you can get started a lot more easily. Just download a nightly version from http://hudson.zones.apache.org/hudson/job/Nutch-trunk/. = Pre Solr Nutch integration = - This is just a quick first pass at a guide for getting Nutch running with Solr. I'm sure there are better ways of doing some/all of it, but I'm not aware of them. By all means, please do correct/update this if someone has a better idea. Many thanks to [[http://variogram.com||Brian Whitman at Variogr.am]] and [[http://blog.foofactory.fi||Sami Siren at FooFactory]] for all the help! You guys saved me a lot of time! :) + This is just a quick first pass at a guide for getting Nutch running with Solr. I'm sure there are better ways of doing some/all of it, but I'm not aware of them. By all means, please do correct/update this if someone has a better idea. Many thanks to http://variogram.com and http://blog.foofactory.fi for all the help! You guys saved me a lot of time! :) I'm posting it under Nutch rather than Solr on the presumption that people are more likely to be learning/using Solr first, then come here looking to combine it with Nutch. I'm going to skip going command by command for right now. I'm running/building on Ubuntu 7.10 using Java 1.6.0_05. I'm assuming that the Solr trunk code is checked out into solr-trunk and the Nutch trunk code is checked out into nutch-trunk. 
@@ -12, +12 @@ * apt-get install sun-java6-jdk subversion ant patch unzip == Steps == - The first step to get started is to download the required software components, namely Apache Solr and Nutch. '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from Download page @@ -23, +22 @@ '''4.''' Extract the Nutch package tar xzf apache-nutch-1.0.tar.gz + '''5.''' Configure Solr For the sake of simplicity we are going to use the example configuration of Solr as a base. - '''5.''' Configure Solr - For the sake of simplicity we are going to use the example - configuration of Solr as a base. - '''a.''' Copy the provided Nutch schema from directory - apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (override the existing file) + '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (override the existing file) We want to allow Solr to create the snippets for search results so we need to store the content in addition to indexing it: @@ -52, +48 @@ str name=qf - content^0.5 anchor^1.0 title^1.2 + content^0.5 anchor^1.0 title^1.2 /str - /str - str name=pf - content^0.5 anchor^1.5 title^1.2 site^1.5 + str name=pf content^0.5 anchor^1.5 title^1.2 site^1.5 /str - /str + str name=fl url /str - str name=fl - url - /str + str name=mm 2-1 5-2 690% /str - str name=mm - 2lt;-1 5lt;-2 6lt;90% - /str int name=ps100/int - bool hl=true/ + bool name=hltrue/bool str name=q.alt*:*/str @@ -91, +80 @@ '''6.''' Start Solr + cd apache-solr-1.3.0/example java -jar start.jar - cd apache-solr-1.3.0/example - java -jar start.jar '''7. Configure Nutch''' a. Open nutch-site.xml in directory apache-nutch-1.0/conf, replace it’s contents with the following (we specify our crawler name, active plugins and limit maximum url count for single host per run to be 100) : + ?xml version=1.0? configuration - ?xml version=1.0? 
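For readability, here is the dismax fragment from the diff above with the wiki escaping undone (a reconstruction: the stripped angle brackets are restored, and the `lt;` runs in the `mm` parameter are mangled `&lt;` entities — in solrconfig.xml those `<` signs must stay escaped):

```xml
<str name="qf">content^0.5 anchor^1.0 title^1.2</str>
<str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
<str name="fl">url</str>
<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
<int name="ps">100</int>
<bool name="hl">true</bool>
<str name="q.alt">*:*</str>
```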
- configuration property @@ -109, +96 @@ /property - property - namegenerate.max.per.host/name + property namegenerate.max.per.host/name value100/value @@ -126, +112 @@ /configuration - '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf, replace its content with the following: -^(https|telnet|file|ftp|mailto): + - - # skip some suffixes - -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$ + # skip some suffixes -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$ - + - # skip URLs
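With the escaping undone, the generate.max.per.host change in the nutch-site.xml diff above reads as follows (a reconstruction; the surrounding properties for the crawler name and active plugins are elided in the diff and are not filled in here):

```xml
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
</property>
```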
[Nutch Wiki] Update of TikaPlugin by JulienNioche
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The TikaPlugin page has been changed by JulienNioche. http://wiki.apache.org/nutch/TikaPlugin?action=diffrev1=3rev2=4 -- = Tika Plugin = The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first attempt at delegating the parsing to Tika instead of having to maintain the parser plugins in Nutch. This page will list the differences in coverage or functionality between the Tika plugin and the existing Nutch parsers. Tika also has more formats not covered by Nutch which are not described here and has a more generic capability of representing structured content which can be useful for HtmlParseFilters (which are currently limited to HTML content). - '''html''': ? + '''html''': comparable '''js''': ? @@ -21, +21 @@ '''rss''': ? - '''rtf''': comparable + '''rtf''': deactivated in Nutch for licensing reasons | works in Tika '''swf''' : not yet covered in Tika (see https://issues.apache.org/jira/browse/TIKA-337)
[Nutch Wiki] Update of TikaPlugin by JulienNioche
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The TikaPlugin page has been changed by JulienNioche. http://wiki.apache.org/nutch/TikaPlugin?action=diffrev1=4rev2=5 -- '''js''': ? - '''mp3''': ? + '''mp3''': Nutch identifies several fields (Title, Album, Artist) whereas Tika knows only about Titles, the rest is stored as paragraphs. '''msexcel''': comparable (+ Tika able to represent content in structured way as XHTML tables which can be useful for HTML parser plugins) @@ -19, +19 @@ '''pdf''': comparable - '''rss''': ? + '''rss''': Tika identifies only the Mimetype but does nothing about the content '''rtf''': deactivated in Nutch for licensing reasons | works in Tika '''swf''' : not yet covered in Tika (see https://issues.apache.org/jira/browse/TIKA-337) - '''text''': ? + '''text''': comparable '''zip''': ?
[Nutch Wiki] Trivial Update of PublicServers by GeoffreyMcCaleb
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The PublicServers page has been changed by GeoffreyMcCaleb. The comment on this change is: Updated description of nsyght.com. http://wiki.apache.org/nutch/PublicServers?action=diff&rev1=72&rev2=73 -- * [[http://www.myopensourcejobs.com|MyOpensourcejobs]] An Opensource skills jobs site using NUTCH and a LAMP based DRUPAL CMS. - * [[http://www.nsyght.com|Nsyght.com]] is a social search engine that customizes a user's search based on their social graph. + * [[http://www.nsyght.com|Nsyght.com]] is a real-time search and aggregation service that leverages users' social graph. * [[http://www.nursewebsearch.com|Nurse Web Search]] - Health Internet Search Engine.
[Nutch Wiki] Update of FAQ by GodmarBack
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FAQ page has been changed by GodmarBack. The comment on this change is: Corrected formatting - the {{{ must be in the first column, apparently. http://wiki.apache.org/nutch/FAQ?action=diff&rev1=111&rev2=112 -- There are at least two choices to do that: First you need to copy the .WAR file to the servlet container webapps folder. + {{{ - {{{% cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war +% cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war }}} 1) After building your first index, start Tomcat from the index folder. Assuming your index is located at /index : + {{{ - {{{% cd /index/ + % cd /index/ - % $CATALINA_HOME/bin/startup.sh}}} + % $CATALINA_HOME/bin/startup.sh + }}} '''Now you can search.''' 2) After building your first index, start and stop Tomcat, which will make Tomcat extract the Nutch webapp. Then you need to edit nutch-site.xml and put in it the location of the index folder. + {{{ - {{{% $CATALINA_HOME/bin/startup.sh + % $CATALINA_HOME/bin/startup.sh - % $CATALINA_HOME/bin/shutdown.sh}}} + % $CATALINA_HOME/bin/shutdown.sh + }}} + {{{ - {{{% vi $CATALINA_HOME/bin/webapps/ROOT/WEB-INF/classes/nutch-site.xml + % vi $CATALINA_HOME/bin/webapps/ROOT/WEB-INF/classes/nutch-site.xml ?xml version=1.0? ?xml-stylesheet type=text/xsl href=nutch-conf.xsl? @@ -85, +91 @@ /nutch-conf - % $CATALINA_HOME/bin/startup.sh}}} + % $CATALINA_HOME/bin/startup.sh + }}} === Injecting === @@ -110, +117 @@ You'll need to create a file fetcher.done in the segment directory and then: [[http://wiki.apache.org/nutch/bin/nutch_updatedb|updatedb]], [[http://wiki.apache.org/nutch/bin/nutch_generate|generate]] and [[http://wiki.apache.org/nutch/bin/nutch_fetch|fetch]] . 
Assuming your index is at /index + {{{ - {{{ % touch /index/segments/2005somesegment/fetcher.done + % touch /index/segments/2005somesegment/fetcher.done % bin/nutch updatedb /index/db/ /index/segments/2005somesegment/ % bin/nutch generate /index/db/ /index/segments/2005somesegment/ - % bin/nutch fetch /index/segments/2005somesegment}}} + % bin/nutch fetch /index/segments/2005somesegment + }}} All the pages that were not crawled will be re-generated for fetch. If you fetched lots of pages, and don't want to have to re-fetch them again, this is the best way. @@ -146, +155 @@ If you have a fast internet connection (> 10Mb/sec) your bottleneck will definitely be in the machine itself (in fact you will need multiple machines to saturate the data pipe). Empirically I have found that the machine works well up to about 1000-1500 threads. - To get this to work on my Linux box I needed to set the ulimit to 65535 (ulimit -n 65535), and I had to make sure that the DNS server could handle the load (we had to speak with our colo to get them to shut off an artifical cap on the DNS servers). Also, in order to get the speed up to a reasonable value, we needed to set the maximum fetches per host to 100 (otherwise we get a quick start followed by a very long slow tail of fetching). + To get this to work on my Linux box I needed to set the ulimit to 65535 (ulimit -n 65535), and I had to make sure that the DNS server could handle the load (we had to speak with our colo to get them to shut off an artificial cap on the DNS servers). Also, in order to get the speed up to a reasonable value, we needed to set the maximum fetches per host to 100 (otherwise we get a quick start followed by a very long slow tail of fetching). To other users: please add to this with your own experiences, my own experience may be atypical. @@ -208, +217 @@ +.* 3) By default the [[http://www.nutch.org/docs/api/net/nutch/protocol/file/package-summary.html|file plugin]] is disabled. 
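The fetcher.done marker step above can be exercised safely against a throwaway directory (a sketch: 2005somesegment stands in for a real segment name, and the bin/nutch updatedb/generate/fetch steps are omitted because they need a real crawl db):

```shell
# Create a scratch "index" layout and mark a segment as fetched,
# mirroring the "% touch .../fetcher.done" step without touching a real crawl.
INDEX=$(mktemp -d)
mkdir -p "$INDEX/segments/2005somesegment"
touch "$INDEX/segments/2005somesegment/fetcher.done"
ls "$INDEX/segments/2005somesegment"
```

On a real index you would run the same touch against your actual segment directory, then the updatedb, generate, and fetch commands shown above.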
nutch-site.xml needs to be modified to allow this plugin. Add an entry like this: - + {{{ property nameplugin.includes/name valueprotocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)/value /property + }}} Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha is that if you use Mozilla it will '''not''' load file: URLs from a web paged fetched with http, so if you test with the Nutch web container running in Tomcat, annoyingly, as you click on results nothing will happen as Mozilla by default does not load file URLs. This is mentioned [[http://www.mozilla.org/quality/networking/testing/filetests.html|here]] and this behavior may be disabled by a [[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]] (see security.checkloaduri). IE5 does not have this problem. Nutch crawling parent directories for file protocol -
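With the stripped angle brackets restored, the plugin.includes entry quoted above reads as follows (a reconstruction of the same values, not a new recommendation):

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```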
[Nutch Wiki] Update of FAQ by GodmarBack
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FAQ page has been changed by GodmarBack. The comment on this change is: added useful link to Crawling the local filesystem page.. http://wiki.apache.org/nutch/FAQ?action=diffrev1=112rev2=113 -- Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha is that if you use Mozilla it will '''not''' load file: URLs from a web paged fetched with http, so if you test with the Nutch web container running in Tomcat, annoyingly, as you click on results nothing will happen as Mozilla by default does not load file URLs. This is mentioned [[http://www.mozilla.org/quality/networking/testing/filetests.html|here]] and this behavior may be disabled by a [[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]] (see security.checkloaduri). IE5 does not have this problem. - Nutch crawling parent directories for file protocol - misconfigured URLFilters + Nutch crawling parent directories for file protocol + + If you find nutch crawling parent directories when using the file protocol, the following kludge may help: + - [[http://issues.apache.org/jira/browse/NUTCH-407]] E.g. for urlfilter-regex you should put the following in regex-urlfilter.txt : + [[http://issues.apache.org/jira/browse/NUTCH-407]] E.g. for urlfilter-regex you could put the following in regex-urlfilter.txt : {{{ +^file:///c:/top/directory/ -. }}} + + Alternatively, you could apply the patch described [[http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch|on this page]], which would avoid the hardwiring of the site-specific /top/directory in your configuration file. How do I index remote file shares?
[Nutch Wiki] Update of FAQ by GodmarBack
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FAQ page has been changed by GodmarBack. http://wiki.apache.org/nutch/FAQ?action=diffrev1=113rev2=114 -- {{{ property nameplugin.includes/name - valueprotocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)/value + valueprotocol-file|...copy original values from nutch-default here.../value /property }}} + + where you should copy and paste all values from nutch-default.xml in the plugin.includes setting provided there. This will ensure that all plug-in normally enabled will be enabled, plus the protocol-file plugin. Make sure to include parse-pdf if you want to parse PDF files. Make sure that urlfilter-regexp is included, or else '''the *urlfilter files will be ignored''', leading nutch to accept all URLs. You need to enable crawl URL filters to prevent nutch from crawling up the parent directory, see below. Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha is that if you use Mozilla it will '''not''' load file: URLs from a web paged fetched with http, so if you test with the Nutch web container running in Tomcat, annoyingly, as you click on results nothing will happen as Mozilla by default does not load file URLs. This is mentioned [[http://www.mozilla.org/quality/networking/testing/filetests.html|here]] and this behavior may be disabled by a [[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]] (see security.checkloaduri). IE5 does not have this problem.
[Nutch Wiki] Update of FAQ by GodmarBack
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FAQ page has been changed by GodmarBack. The comment on this change is: Fixed erroneous instructions for how to include protocol-file. http://wiki.apache.org/nutch/FAQ?action=diffrev1=114rev2=115 -- /property }}} - where you should copy and paste all values from nutch-default.xml in the plugin.includes setting provided there. This will ensure that all plug-in normally enabled will be enabled, plus the protocol-file plugin. Make sure to include parse-pdf if you want to parse PDF files. Make sure that urlfilter-regexp is included, or else '''the *urlfilter files will be ignored''', leading nutch to accept all URLs. You need to enable crawl URL filters to prevent nutch from crawling up the parent directory, see below. + where you should copy and paste all values from nutch-default.xml in the plugin.includes setting provided there. This will ensure that all plug-ins normally enabled will be enabled, plus the protocol-file plugin. Make sure to include parse-pdf if you want to parse PDF files. Make sure that urlfilter-regexp is included, or else '''the *urlfilter files will be ignored''', leading nutch to accept all URLs. You need to enable crawl URL filters to prevent nutch from crawling up the parent directory, see below. Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha is that if you use Mozilla it will '''not''' load file: URLs from a web paged fetched with http, so if you test with the Nutch web container running in Tomcat, annoyingly, as you click on results nothing will happen as Mozilla by default does not load file URLs. This is mentioned [[http://www.mozilla.org/quality/networking/testing/filetests.html|here]] and this behavior may be disabled by a [[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]] (see security.checkloaduri). IE5 does not have this problem.
[Nutch Wiki] Update of search2.net by search2.net
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The search2.net page has been changed by search2.net. http://wiki.apache.org/nutch/search2.net -- New page: ##language:en == search2.net == * [[http://search2.net/|search2.net]]
[Nutch Wiki] Update of PublicServers by search2.net
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The PublicServers page has been changed by search2.net. http://wiki.apache.org/nutch/PublicServers?action=diff&rev1=71&rev2=72 -- * [[http://www.gouv.qc.ca/|Government of Quebec websites]] Over 400 websites of the government of Quebec (Canada) are indexed by Nutch. The Web application has been developed by [[http://www.doculibre.com/index_en.html/|Doculibre inc.]] - * [[http://search2.net/|search2.net]] is a search engine based on Nutch. + * [[http://search2.net/|search2.net]] General search engine with an international index based on Nutch. * [[http://www.searchmitchell.com/|SearchMitchell.com]] is a community search engine for businesses and organizations in Mitchell, SD. * [[http://www.umkreisfinder.de/|UmkreisFinder.de]] is running the GeoPosition plugin for local searches in Germany and in German. Please insert a search term in the first field, a German city name in the second field and choose a perimeter at the last field.
[Nutch Wiki] Update of PublicServers by RBalmes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The PublicServers page has been changed by RBalmes. http://wiki.apache.org/nutch/PublicServers?action=diffrev1=71rev2=72 -- = Public search engines using Nutch = + Please sort by name alphabetically - * [[http://askaboutoil.com|AskAboutOil]] is a vertical search portal for the petroleum industry. + * [[http://askaboutoil.com|AskAboutOil]] is a vertical search portal for the petroleum industry. - * [[http://www.asbestosinfo.info|Asbestos]] is a vertical search portal and discussion forum for the asbestos and related information. + * [[http://www.asbestosinfo.info|Asbestos]] is a vertical search portal and discussion forum for the asbestos and related information. - * [[http://www.baynote.com/go|Baynote]] provides free hosted Nutch search for businesses. + * [[http://www.baynote.com/go|Baynote]] provides free hosted Nutch search for businesses. - * [[http://betherebesquare.com|BeThere BeSquare]] is an Event Search Engine for the San Francisco Bay Area that allows users to specify keywords, date, city, address, and category and get details about events in 4 different views. + * [[http://betherebesquare.com|BeThere BeSquare]] is an Event Search Engine for the San Francisco Bay Area that allows users to specify keywords, date, city, address, and category and get details about events in 4 different views. - * [[http://www.bible-ref.om/|Biible]] is the first biblical search engine that allows people to search the web for comments of biblical verse or range of verse. 6 major languages are fully recognized and 150 partially for now. Based on Nutch. + * [[http://www.bigsearch.ca/|Bigsearch.ca]] uses nutch open source software to deliver its search results. - * [[http://www.bigsearch.ca/|Bigsearch.ca]] uses nutch open source software to deliver its search results. 
+ * [[http://busytonight.com/|BusyTonight]]: Search for any event in the United States, by keyword, location, and date. Event listings are automatically crawled and updated from original source Web sites. - * [[http://busytonight.com/|BusyTonight]]: Search for any event in the United States, by keyword, location, and date. Event listings are automatically crawled and updated from original source Web sites. + * [[http://www.centralbudapest.com/search|Central Budapest Search]] is a search engine for English language sites focussing on Budapest news, restaurants, accommodation, life and events. - * [[http://www.centralbudapest.com/search|Central Budapest Search]] is a search engine for English language sites focussing on Budapest news, restaurants, accommodation, life and events. + * [[http://circuitscout.com|Circuit Scout]] is a search engine for electrical circuits. - * [[http://circuitscout.com|Circuit Scout]] is a search engine for electrical circuits. + * [[http://www.comtecsearch.com|Comtec Search]] is a search engine for UK Tour Operator Package Holiday Brochures. - * [[http://www.comtecsearch.com|Comtec Search]] is a search engine for UK Tour Operator Package Holiday Brochures. + * [[http://www.coder-suche.de|Coder-Suche.de]] searches for coding stuff like APIs, documentations, tutorials, openBooks and more. Its origin is German, its contents are mainly English. - * [[http://www.coder-suche.de|Coder-Suche.de]] searches for coding stuff like APIs, documentations, tutorials, openBooks and more. Its origin is German, its contents are mainly English. + * [[http://campusgw.library.cornell.edu/|Cornell University Library]] is collaborating with the research group of Thorsten Joachims to develop a learning search engine for library web pages based on Nutch. The nutch-based search engine is near the bottom of the page. 
- * [[http://campusgw.library.cornell.edu/|Cornell University Library]] is collaborating with the research group of Thorsten Joachims to develop a learning search engine for library web pages based on Nutch. The nutch-based search engine is near the bottom of the page. + * [[http://search.creativecommons.org/|Creative Commons]] is a search engine for creative commons licensed material. - * [[http://search.creativecommons.org/|Creative Commons]] is a search engine for creative commons licensed material. + * [[http://www.dadi360.com/|Dadi360]] Uses the nutch search engine for providing search of Chinese language websites in North America. - * [[http://www.dadi360.com/|Dadi360]] Uses the nutch search engine for providing search of Chinese language websites in North America. + * [[http://www.ecolicommunity.org/Websearch|EcoliHub Web Search]] an E. coli specific search engine based on Nutch. EcoliHub WebSearch includes only those sites relevant to E. coli, thereby reducing the number of spurious hits. Searches can be optionally limited to your choice of resources. More
[Nutch Wiki] Update of PublicServers by search2.net
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The PublicServers page has been changed by search2.net. http://wiki.apache.org/nutch/PublicServers?action=diff&rev1=70&rev2=71 -- * [[http://www.gouv.qc.ca/|Government of Quebec websites]] Over 400 websites of the government of Quebec (Canada) are indexed by Nutch. The Web application has been developed by [[http://www.doculibre.com/index_en.html/|Doculibre inc.]] - * [[http://search2.net/|search2.net]] is a search engine based on Nutch. + * [[http://search2.net/|search2.net]] General search engine with an international index. * [[http://www.searchmitchell.com/|SearchMitchell.com]] is a community search engine for businesses and organizations in Mitchell, SD. * [[http://www.umkreisfinder.de/|UmkreisFinder.de]] is running the [[GeoPosition]] plugin for local searches in Germany and in German. Please insert a search term in the first field, a German city name in the second field and choose a perimeter at the last field.
[Nutch Wiki] Update of PublicServers by search2.net
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The PublicServers page has been changed by search2.net. http://wiki.apache.org/nutch/PublicServers?action=diff&rev1=71&rev2=72 -- * [[http://www.gouv.qc.ca/|Government of Quebec websites]] Over 400 websites of the government of Quebec (Canada) are indexed by Nutch. The Web application has been developed by [[http://www.doculibre.com/index_en.html/|Doculibre inc.]] - * [[http://search2.net/|search2.net]] General search engine with an international index. + * [[http://search2.net/|search2.net]] is a general search engine with an international index. * [[http://www.searchmitchell.com/|SearchMitchell.com]] is a community search engine for businesses and organizations in Mitchell, SD. * [[http://www.umkreisfinder.de/|UmkreisFinder.de]] is running the [[GeoPosition]] plugin for local searches in Germany and in German. Please insert a search term in the first field, a German city name in the second field and choose a perimeter at the last field.
[Nutch Wiki] Update of TikaPlugin by JulienNioche
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The TikaPlugin page has been changed by JulienNioche. http://wiki.apache.org/nutch/TikaPlugin?action=diff&rev1=2&rev2=3 -- = Tika Plugin = - The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first attempt at delegating the parsing to Tika instead of having to maintain the parser plugins in Nutch. This page will list the differences in coverage or functionality between the Tika plugin and the existing Nutch parsers. Tika also has more formats not covered by Nutch which are not described here. + The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first attempt at delegating the parsing to Tika instead of having to maintain the parser plugins in Nutch. This page will list the differences in coverage or functionality between the Tika plugin and the existing Nutch parsers. Tika also has more formats not covered by Nutch which are not described here and has a more generic capability of representing structured content which can be useful for HtmlParseFilters (which are currently limited to HTML content). '''html''': ? @@ -9, +9 @@ '''mp3''': ? - '''msexcel''': ? + '''msexcel''': comparable (+ Tika able to represent content in structured way as XHTML tables which can be useful for HTML parser plugins) - '''mspowerpoint''': ? + '''mspowerpoint''': comparable - '''msword''': ? + '''msword''': Tika does not support Word 95; other versions are comparable - '''openoffice''': ? + '''openoffice''': comparable - '''pdf''': ? + '''pdf''': comparable '''rss''': ? - '''rtf''': ? + '''rtf''': comparable '''swf''' : not yet covered in Tika (see https://issues.apache.org/jira/browse/TIKA-337) '''text''': ? - '''zip''': ?not covered in Tika + '''zip''': ?
[Nutch Wiki] Update of TikaPlugin by JulienNioche
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The TikaPlugin page has been changed by JulienNioche. http://wiki.apache.org/nutch/TikaPlugin?action=diff&rev1=1&rev2=2 -- - =Tika Plugin= + = Tika Plugin = + The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first attempt at delegating the parsing to Tika instead of having to maintain the parser plugins in Nutch. This page will list the differences in coverage or functionality between the Tika plugin and the existing Nutch parsers. Tika also has more formats not covered by Nutch which are not described here. + '''html''': ? - The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first attempt at delegating the parsing to Tika instead of having to maintain the parser plugins in Nutch. - This page will list the differences in coverage or functionality between the Tika plugin and the existing Nutch parsers. + '''js''': ? + + '''mp3''': ? + + '''msexcel''': ? + + '''mspowerpoint''': ? + + '''msword''': ? + + '''openoffice''': ? + + '''pdf''': ? + + '''rss''': ? + + '''rtf''': ? + + '''swf''' : not yet covered in Tika (see https://issues.apache.org/jira/browse/TIKA-337) + + '''text''': ? + + '''zip''': ?not covered in Tika +
[Nutch Wiki] Update of FrontPage by JulienNioche
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FrontPage page has been changed by JulienNioche. http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=127&rev2=128 -- * JavaDemoApplication - A simple demonstration of how to use the Nutch API in a Java application * InstallingWeb2 * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6) + * TikaPlugin - Comments on the Tika integration and differences with existing parse plugins == Nutch 2.0 == * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture.
[Nutch Wiki] Trivial Update of Automating_Fetches_with_Python by newacct
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The Automating_Fetches_with_Python page has been changed by newacct. http://wiki.apache.org/nutch/Automating_Fetches_with_Python?action=diff&rev1=5&rev2=6 -- import sys import getopt import re - import string import logging import logging.config import commands @@ -259, +258 @@ total_urls += 1 urllinecount.close() numsplits = total_urls / splitsize - padding = "0" * len(`numsplits`) + padding = "0" * len(repr(numsplits)) # create the url load folder - linenum = 0 filenum = 0 - strfilenum = `filenum` + strfilenum = repr(filenum) urloutdir = outdir + "/urls-" + padding[len(strfilenum):] + strfilenum os.mkdir(urloutdir) urlfile = urloutdir + "/urls" @@ -275, +273 @@ outhandle = open(urlfile, "w") # loop through the file - for line in inhandle: + for linenum, line in enumerate(inhandle): # if we have come to a split then close the current file, create a new # url folder and open a new url file - if linenum > 0 and (linenum % splitsize == 0): + if linenum > 0 and linenum % splitsize == 0: - filenum = filenum + 1 + filenum += 1 - strfilenum = `filenum` + strfilenum = repr(filenum) urloutdir = outdir + "/urls-" + padding[len(strfilenum):] + strfilenum os.mkdir(urloutdir) urlfile = urloutdir + "/urls" @@ -290, +288 @@ outhandle.close() outhandle = open(urlfile, "w") - # write the url to the file and increase the number of lines read + # write the url to the file outhandle.write(line) - linenum = linenum + 1 # close the input and output files inhandle.close() @@ -362, +359 @@ # fetch the current segment outar = result[1].splitlines() - output = outar[len(outar) - 1] + output = outar[-1] - tempseg = string.split(output)[0] + tempseg = output.split()[0] tempseglist.append(tempseg) fetch = self.nutchdir + "/bin/nutch fetch " + tempseg self.log.info("Starting fetch for: " + tempseg) @@ -392, +389 @@ # merge the crawldbs self.log.info("Merging master and temp crawldbs.")
- crawlmerge = (self.nutchdir + "/bin/nutch mergedb mergetemp/crawldb " + + crawlmerge = self.nutchdir + "/bin/nutch mergedb mergetemp/crawldb " + \ - mastercrawldbdir + " " + string.join(tempdblist, " ")) + mastercrawldbdir + " " + " ".join(tempdblist) self.log.info("Running: " + crawlmerge) result = commands.getstatusoutput(crawlmerge) self.checkStatus(result, "Error occurred while running command " + crawlmerge) @@ -404, +401 @@ result = commands.getstatusoutput(getsegment) self.checkStatus(result, "Error occurred while running command " + getsegment) outar = result[1].splitlines() - output = outar[len(outar) - 1] + output = outar[-1] - masterseg = string.split(output)[0] + masterseg = output.split()[0] - mergesegs = (self.nutchdir + "/bin/nutch mergesegs mergetemp/segments " + + mergesegs = self.nutchdir + "/bin/nutch mergesegs mergetemp/segments " + \ - masterseg + " " + string.join(tempseglist, " ")) + masterseg + " " + " ".join(tempseglist) self.log.info("Running: " + mergesegs) result = commands.getstatusoutput(mergesegs) self.checkStatus(result, "Error occurred while running command " + mergesegs) @@ -464, +461 @@ usage.append("[-b | --backupdir] The master backup directory, [crawl-backup].\n") usage.append("[-s | --splitsize] The number of urls per load [50].\n") usage.append("[-f | --fetchmerge] The number of fetches to run before merging [1].\n") - message = "".join(usage) + message = "".join(usage) print message
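The idioms this diff switches to (enumerate() instead of a manual line counter, repr() instead of backticks, and str methods instead of the deprecated string module) can be seen in a small self-contained sketch. The split_urls function and its folder-naming scheme below are illustrative only, not part of the wiki script:

```python
# Sketch of the url-splitting loop using the modernized idioms from the diff:
# enumerate() replaces the hand-maintained linenum counter, repr() replaces
# backtick-repr, and "".join / str.split replace the string module helpers.

def split_urls(urls, splitsize):
    """Group urls into chunks of splitsize, returning (folder_name, chunk) pairs."""
    numsplits = len(urls) // splitsize
    padding = "0" * len(repr(numsplits))  # zero-pad folder numbers, e.g. "007"
    chunks = []
    filenum = 0
    current = []
    for linenum, line in enumerate(urls):
        # when we reach a split boundary, close out the current chunk
        if linenum > 0 and linenum % splitsize == 0:
            strfilenum = repr(filenum)
            chunks.append(("urls-" + padding[len(strfilenum):] + strfilenum, current))
            filenum += 1
            current = []
        current.append(line)
    strfilenum = repr(filenum)
    chunks.append(("urls-" + padding[len(strfilenum):] + strfilenum, current))
    return chunks
```

The real script writes each chunk to its own urls file under a per-chunk directory; here the chunks are simply returned so the splitting logic can be checked in isolation.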
[Nutch Wiki] Update of FrontPage by Davinder
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FrontPage page has been changed by Davinder. http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=123&rev2=124 -- * [[http://www.interadvertising.co.uk/blog/nutch_logos|Larger / better quality Nutch logos]] Re-created Nutch logos available in GIF, PNG & EPS in resolutions up to 1200 x 449 * [[http://www.whelanlabs.com/content/SearchEngineManager.htm|WhelanLabs SearchEngine Manager]] An all-in-one, bundled implementation of Nutch, Tomcat, and Cygwin, and JRE for Microsoft Windows. Includes an installer and a simplified administrative UI. + * [[http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch search engine + author: Doug Cutting, Yahoo! Research ]]Experiences with the Nutch search engine + author: Doug Cutting, Yahoo! Research . +
[Nutch Wiki] Update of FrontPage by Davinder
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FrontPage page has been changed by Davinder. http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=124&rev2=125 -- Please contribute your knowledge about Nutch here! == General Information == - * [[http://www.nutch.org|Nutch Website ]] + * [[http://www.nutch.org|Nutch Website]] * [[Features]] * PublicServers running Nutch * [[Presentations]] on Nutch @@ -51, +51 @@ * [[RunNutchInEclipse1.0]] for v1.0 (Linux and Windows) * [[Crawl]] - script to crawl (and possible recrawl too) * IntranetRecrawl - script to recrawl a crawl - * MergeCrawl - script to merge 2 (or more) crawls + * MergeCrawl - script to merge 2 (or more) crawls * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes * CrossPlatformNutchScripts * MonitoringNutchCrawls - techniques for keeping an eye on a nutch crawl's progress. @@ -78, +78 @@ * [[Website_Update_HOWTO]] * [[Image_Search_Design]] * [[NutchOSGi]] - * [[StrategicGoals]] + * StrategicGoals - * [[IndexStructure]] + * IndexStructure * [[Getting_Started]] * JavaDemoApplication - A simple demonstration of how to use the Nutch API in a Java application * InstallingWeb2 * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6) == Nutch 2.0 == - * [[Nutch2Architecture]] -- Discussions on the Nutch 2.0 architecture. + * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture. - * [[NewScoring]] -- New stable pagerank like webgraph and link-analysis jobs. + * NewScoring -- New stable pagerank like webgraph and link-analysis jobs. - * [[NewScoringIndexingExample]] -- Two full fetch cycles of commands using new scoring and indexing systems. + * NewScoringIndexingExample -- Two full fetch cycles of commands using new scoring and indexing systems.
== Other Resources == * [[http://nutch.sourceforge.net/blog/cutting.html|Doug's Weblog]] -- He's the one who originally wrote Lucene and Nutch. @@ -96, +96 @@ * [[http://frutch.free.fr/wikini/|Frutch Wiki]] -- French Nutch Wiki * The [[http://nutch.sourceforge.net/cgi-bin/twiki/view/Main/Nutch|Old Wiki]] * [[Search_Theory]] Search Theory White Papers - * [[http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E|Tutorial Hadoop+Nutch 0.8 night build Roberto Navoni 24-07-06]] + * [[http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20Navoni%20Roberto|Tutorial Hadoop+Nutch 0.8 night build Roberto Navoni 24-07-06]] * [[http://blog.foofactory.fi/|FooFactory]] Nutch and Hadoop related posts * [[http://spinn3r.com|Spinn3r]] [[http://spinn3r.com/opensource.php|Open Source components]] (our contribution to the crawling OSS community with more to come). * [[http://www.interadvertising.co.uk/blog/nutch_logos|Larger / better quality Nutch logos]] Re-created Nutch logos available in GIF, PNG & EPS in resolutions up to 1200 x 449 * [[http://www.whelanlabs.com/content/SearchEngineManager.htm|WhelanLabs SearchEngine Manager]] An all-in-one, bundled implementation of Nutch, Tomcat, and Cygwin, and JRE for Microsoft Windows. Includes an installer and a simplified administrative UI. + * http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch search engine author: Doug Cutting,Video Lecture - * [[http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch search engine - author: Doug Cutting, Yahoo! Research ]]Experiences with the Nutch search engine - author: Doug Cutting, Yahoo! Research . -
[Nutch Wiki] Update of FrontPage by Davinder
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FrontPage page has been changed by Davinder. http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=125&rev2=126 -- * [[http://spinn3r.com|Spinn3r]] [[http://spinn3r.com/opensource.php|Open Source components]] (our contribution to the crawling OSS community with more to come). * [[http://www.interadvertising.co.uk/blog/nutch_logos|Larger / better quality Nutch logos]] Re-created Nutch logos available in GIF, PNG & EPS in resolutions up to 1200 x 449 * [[http://www.whelanlabs.com/content/SearchEngineManager.htm|WhelanLabs SearchEngine Manager]] An all-in-one, bundled implementation of Nutch, Tomcat, and Cygwin, and JRE for Microsoft Windows. Includes an installer and a simplified administrative UI. - * http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch search engine author: Doug Cutting,Video Lecture + * http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch search engine author: Doug Cutting,Video Lecture
[Nutch Wiki] Update of FrontPage by Davinder
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FrontPage page has been changed by Davinder. http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=126&rev2=127 -- * Commercial [[Support]] and developers for hire * [[Mailing]] Lists * AcademicArticles that deal with Nutch + * http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch search engine author:Doug Cutting,Video Lecture + == Nutch Administration == * DownloadingNutch @@ -101, +103 @@ * [[http://spinn3r.com|Spinn3r]] [[http://spinn3r.com/opensource.php|Open Source components]] (our contribution to the crawling OSS community with more to come). * [[http://www.interadvertising.co.uk/blog/nutch_logos|Larger / better quality Nutch logos]] Re-created Nutch logos available in GIF, PNG & EPS in resolutions up to 1200 x 449 * [[http://www.whelanlabs.com/content/SearchEngineManager.htm|WhelanLabs SearchEngine Manager]] An all-in-one, bundled implementation of Nutch, Tomcat, and Cygwin, and JRE for Microsoft Windows. Includes an installer and a simplified administrative UI. - * http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch search engine author: Doug Cutting,Video Lecture
[Nutch Wiki] Update of OptimizingCrawls by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The OptimizingCrawls page has been changed by DennisKubes. The comment on this change is: Page about optimizing crawling speed. http://wiki.apache.org/nutch/OptimizingCrawls -- New page: '''Here are the things that could potentially slow down fetching''' 1) DNS setup 2) The number of crawlers you have, too many, too few. 3) Bandwidth limitations 4) Number of threads per host (politeness) 5) Uneven distribution of urls to fetch and politeness. 6) High crawl-delays from robots.txt (usually along with an uneven distribution of urls). 7) Many slow websites (again usually with an uneven distribution). 8) Downloading lots of content (PDFs, very large html pages, again possibly an uneven distribution). 9) Others '''Now how do we fix them''' 1) Have a DNS setup on each local crawling machine; with multiple crawling machines and a single centralized DNS, crawling can act like a DoS attack on the DNS server, slowing the entire system. We always did a two-layer setup, hitting first the local DNS cache and then a large DNS cache like OpenDNS or Verizon. 2) This would be number of map tasks * fetcher.threads.fetch. So 10 map tasks * 20 threads = 200 fetchers at once. Too many and you overload your system; too few and other factors leave the machine sitting idle. You will need to play around with this setting for your setup. 3) Bandwidth limitations. Use ntop, ganglia, and other monitoring tools to determine how much bandwidth you are using. Account for in and out bandwidth. A simple test: from a server inside the fetching network but not itself fetching, if it is very slow connecting to or downloading content while fetching is occurring, it is a good bet you are maxing out bandwidth. If you set the http timeout as we describe later and are maxing your bandwidth, you will start seeing many http timeout errors.
4) Politeness along with uneven distribution of urls is probably the biggest limiting factor. If one thread is processing a single site and there are a lot of urls from that site to fetch, all other threads will sit idle while that one thread finishes. Some solutions: use fetcher.server.delay to shorten the time between page fetches and use fetcher.threads.per.host to increase the number of threads fetching for a single site (this would still be in the same map task, though, and hence the same JVM ChildTask process). If increasing this > 0 you could also set fetcher.server.min.delay to some value > 0 for politeness to min and max bound the process. 5) Fetching a lot of pages from a single site, or a lot of pages from a few sites, will slow down fetching dramatically. For full web crawls you want an even distribution so all fetching threads can be active. Setting generate.max.per.host to a value > 0 will limit the number of pages from a single host/domain to fetch. 6) Crawl-delay can be used and is obeyed by nutch in robots.txt. Most sites don't use this setting but a few (some malicious) do. I have seen crawl-delays as high as 2 days in seconds. The fetcher.max.crawl.delay variable will ignore pages with crawl delays > x. I usually set this to 10 seconds; the default is 30. Even at 10 seconds, if you have a lot of pages from a site from which you can only crawl 1 page every 10 seconds, it is going to be slow. On the flip side, setting this to a low value will ignore and not fetch those pages. 7) Sometimes (many times) websites are just slow. Setting a low value for http.timeout helps. The default is 10 seconds. If you don't care and want as many pages as fast as possible, set it lower. Some websites, Digg for instance, will bandwidth-limit you on their side, only allowing x connections per given time frame. So even if you only have, say, 50 pages from a single site (which I still think is too many), it may be waiting 10 seconds on each page.
The ftp.timeout can also be set if fetching ftp content. 8) Lots of content means slower fetching. If downloading PDFs and other non-html documents this is especially true. To avoid non-html content you can use the url filters. I prefer the prefix and suffix filters. The http.content.limit and ftp.content.limit can be used to limit the amount of content downloaded for a single document. 9) Other things that could be causing slow fetching: * Maxing out the number of open sockets/files on a machine. You will start seeing IO errors or can't-open-socket errors. * Poor routing. Bad routers or home routers might not be able to handle the number of connections going through at once. An incorrect routing setup could also be causing problems, but those are usually much more complex to diagnose. Use network trace and mapping tools if you think this is happening. Upstream routing can also be a problem from your network provider. * Bad network cards. I have seen network
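The tuning knobs named in points 2) through 8) above are all properties that go in conf/nutch-site.xml. A sketch, with illustrative starting values only (these are not recommendations; check the actual defaults and units in nutch-default.xml for your Nutch version):

{{{
<!-- Illustrative nutch-site.xml overrides for the tuning knobs discussed above. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>20</value> <!-- threads per map task; total fetchers = map tasks * this -->
</property>
<property>
  <name>generate.max.per.host</name>
  <value>100</value> <!-- > 0 caps urls per host in a single fetch list -->
</property>
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>10</value> <!-- skip pages whose robots.txt crawl-delay exceeds this (seconds) -->
</property>
<property>
  <name>http.timeout</name>
  <value>10000</value> <!-- http timeout, in milliseconds -->
</property>
<property>
  <name>http.content.limit</name>
  <value>65536</value> <!-- max bytes downloaded per document -->
</property>
}}}

fetcher.server.delay, fetcher.threads.per.host, fetcher.server.min.delay, and the ftp.* equivalents mentioned above are set the same way.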
[Nutch Wiki] Update of FrontPage by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FrontPage page has been changed by DennisKubes. http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=122&rev2=123 -- * NonDefaultIntranetCrawlingOptions - Desirable options to add to your intranet crawling configuration. * RunningNutchAndSolr - How to configure Nutch to crawl, but post to Solr for search/index * NutchWithChineseAnalyzer - References to some Chinese articles explaining how to setup Nutch with 3rd party Chinese analyzers + * OptimizingCrawls - How to optimize your crawling/fetching speed with Nutch. == Nutch Development == * [[Becoming_A_Nutch_Developer|Becoming a Nutch Developer]] - Start developing and contributing to Nutch.
[Nutch Wiki] Update of NutchHadoopTutorial by ilgiz
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The NutchHadoopTutorial page has been changed by ilgiz. The comment on this change is: max outlinks per page. http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=14&rev2=15 -- * This tutorial worked well for me, however, I ran into a problem where my crawl wasn't working. Turned out, it was because I needed to set the user agent and other properties for the crawl. If anyone is reading this, and running into the same problem, look at the updated tutorial http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial?highlight=%28hadoop%29%7C%28tutorial%29
+
+
+ * By default Nutch will read only the first 100 links on a page. This will result in incomplete indexes when scanning file trees. So I set the max outlinks per page option to -1 in nutch-site.xml and got complete indexes.
+ {{{
+ <property>
+ <name>db.max.outlinks.per.page</name>
+ <value>-1</value>
+ <description>The maximum number of outlinks that we'll process for a page.
+ If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
+ will be processed for a page; otherwise, all outlinks will be processed.
+ </description>
+ </property>
+ }}}
+
[Nutch Wiki] Trivial Update of NutchHadoopTutorial by ilgiz
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The NutchHadoopTutorial page has been changed by ilgiz. http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=15&rev2=16 -- - * By default Nutch will read only the first 100 links on a page. This will result in incomplete indexes when scanning file trees. So I set the max outlinks per page option to -1 in nutch-site.xml and got complete indexes. + * By default Nutch will read only the first 100 links on a page. This will result in incomplete indexes when scanning file trees. So I set the max outlinks per page option to -1 in nutch-site.xml and got complete indexes. {{{ <property> <name>db.max.outlinks.per.page</name>
[Nutch Wiki] Update of RunNutchInEclipse1.0 by Anas Elghafari
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The RunNutchInEclipse1.0 page has been changed by AnasElghafari. http://wiki.apache.org/nutch/RunNutchInEclipse1.0?action=diff&rev1=12&rev2=13 -- * Name the project (Nutch_Trunk for instance) * Select Create project from existing source and use the location where you downloaded Nutch * Click on Next, and wait while Eclipse is scanning the folders - * Add the folder conf to the classpath (click the Libraries tab, click Add Class Folder... button, and select conf from the list) + * Add the folder conf to the classpath (Right-click on the project, select Properties, then Java Build Path (left menu), and then the Libraries tab. Click the Add Class Folder... button, and select conf from the list) * Go to the Order and Export tab, find the entry for the added conf folder and move it to the top (by checking it and clicking the Top button). This is required so Eclipse will take config (nutch-default.xml, nutch-site.xml, etc.) resources from our conf folder and not from somewhere else. * Eclipse should have guessed all the Java files that must be added to your classpath. If that's not the case, add src/java, src/test and all plugin src/java and src/test folders to your source folders. Also add all jars in lib and in the plugin lib folders to your libraries * Click the Source tab and set the default output folder to Nutch_Trunk/bin/tmp_build. (You may need to create the tmp_build folder.) @@ -72, +72 @@ http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. - Then add the jar files to the build path (First refresh the workspace by pressing F5. Then right-click the project folder > Build Path > Configure Build Path... Then select the Libraries tab, click Add Jars... and then add each .jar file individually).
+ Then add the jar files to the build path (First refresh the workspace by pressing F5. Then right-click the project folder > Build Path > Configure Build Path... Then select the Libraries tab, click Add Jars... and then add each .jar file individually. If that does not work, you may try clicking Add External JARs and then pointing to the two directories above). === Two Errors with RTFParseFactory ===
[Nutch Wiki] Update of GettingNutchRunningWithJboss by TerrenceCurran
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The GettingNutchRunningWithJboss page has been changed by TerrenceCurran. http://wiki.apache.org/nutch/GettingNutchRunningWithJboss -- New page: = Running Nutch with JBoss AS 5.1 = I only had to make minor changes beyond the basic Tomcat tutorials to get Nutch running on JBoss AS 5.1 == Deployment == Make sure that your nutch-site.xml file is configured in your packaged .war file, or exploded .war directory. == Xerces == JBoss ships with a different version of Xerces installed and available to all deployed applications. I was getting an error about the conflict. Removing xerces-2_x_x-apis.jar and xerces-2_x_x.jar from the war file's lib directory fixed the problem. == Code changes == In the file /src/java/org/apache/nutch/plugin/PluginManifestParser.java, Nutch has a check to make sure it can find the plugin folder. The check looks like this: {{{ } else if (!"file".equals(url.getProtocol())) { LOG.warn("Plugins: not a file: url. Can't load plugins from: " + url); return null; } }}} This does not work in JBoss because local files in the deployment directory have a protocol of vfsfile://. Since vfsfile acts just like file, you just have to change this code to: {{{ } else if (!"file".equals(url.getProtocol()) && !"vfsfile".equals(url.getProtocol())) { LOG.warn("Plugins: not a file: url. Can't load plugins from: " + url); return null; } }}}
[Nutch Wiki] Update of FrontPage by TerrenceCurran
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FrontPage page has been changed by TerrenceCurran. http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=121&rev2=122 -- * GettingNutchRunningWithUtf8 - For support of non-ASCII characters (Chinese, German, Japanese, Korean). * GettingNutchRunningWithResin - Resin is a JSP/Servlet/EJB application server (alternative to tomcat). * GettingNutchRunningWithJetty + * GettingNutchRunningWithJboss * GettingNutchRunningWithUbuntu * GettingNutchRunningWithWindows * GettingNutchRunningWithMacOsx
[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The ApacheConUs2009MeetUp page has been changed by KenKrugler. http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=5&rev2=6 -- - We were planning to have a Web Crawler Developer !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland. + We had a Web Crawler Developer !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland. - Unfortunately the only time slot where people would be around was Thursday night, which wound up conflicting with the Hadoop !MeetUp. + It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. - So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. Location is TBD, hopefully we can get some space at the event but might be a lunch meeting :) + == Attendees == + + * Andrzej Bialeki - Apache Nutch + * Thorsten xxx - Apache Droids + * Michael Stack - Formerly with Heritrix, now HBase + * Ken Krugler - Bixo + + == Topics == + + === Roadmaps === + + Nutch - become more component based. + Droids - get more people involved.
+ + === Sharable Components === + + * robots.txt parsing + * URL normalization + * URL filtering + * Page cleansing + * General purpose + * Specialized + * Sub-page parsing (portlets) + * AJAX-ish page interactions + * Document parsing (via Tika) + * HttpClient (configuration) + * Text similarity + * Mime/charset/language detection + + === Tika === + + * Needs help to become really usable + * Would benefit from large test corpus + * Could do comparison with Nutch parser + * Needs option for direct DOM querying (screen scraping tasks) + * Handles mime charset detection now (some issues) + * Could be extended to include language detection (wrap other impl) + + === URL Normalization === + + * Includes both domain (www.x.com == x.com), path, and query portions of URL + * Often site-specific rules + * Option to derive rules using URLs to similar documents. + + === AJAX-ish Page Interaction === + + * Not applicable for broad/general crawling + * Can be very important for specific web sites + * Use Selenium or headless Mozilla + + === Component API Issues === + + * Want to avoid using an API that's tied too closely to any implementation. + * One option is to have simple (e.g. URL param) API that takes meta-data. + * Similar to Tika passing in of meta-data. + + === Hosting Options === + + * As part of Nutch - but easy to get lost in Nutch codebase, and can be associated too closely with Nutch. + * As part of Droids - but Droids is both a framework (queue-based) and set of components. + * New sub-project under Lucene TLP - but overhead to set up/maintain, and then confusion between it and Droids. + * Google code - seems like a good short-term solution, to judge level of interest and help shake out issues. + + == Next Steps == + + * Get input from Gordon re Heritrix. Stack to follow up with him. Ideally he'd add his comments to this page. + * Get input from Thorsten on Google code option. If OK as starting point, then Andrzej to set up. 
+ * Make decision about build system (and then move on to code formatting debate :)) + * I'm going to propose ant + maven ant tasks for dependency management. I'm using this with Bixo, and so far it's been pretty good. + * Start contributing code + * Ken will put in robots.txt parser. + + == Original Discussion Topic List == Below are some potential topics for discussion - feel free to add/comment.
[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The ApacheConUs2009MeetUp page has been changed by KenKrugler. http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=6&rev2=7 -- == Attendees == * Andrzej Bialeki - Apache Nutch - * Thorsten xxx - Apache Droids + * Thorsten Sherler - Apache Droids * Michael Stack - Formerly with Heritrix, now HBase * Ken Krugler - Bixo
[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The ApacheConUs2009MeetUp page has been changed by KenKrugler. http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=7&rev2=8 -- We had a Web Crawler Developer !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland. It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. + + - == Attendees == @@ -15, +17 @@ === Roadmaps === - Nutch - become more component based. + * Nutch - become more component based. - Droids - get more people involved. + * Droids - get more people involved. === Sharable Components === @@ -76, +78 @@ * Start contributing code * Ken will put in robots.txt parser. + - + == Original Discussion Topic List == Below are some potential topics for discussion - feel free to add/comment.
[Nutch Wiki] Update of ApacheConUs2009MeetUp by AndrzejBialecki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The ApacheConUs2009MeetUp page has been changed by AndrzejBialecki. http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=8&rev2=9 -- == Attendees == - * Andrzej Bialeki - Apache Nutch + * Andrzej Bialecki - Apache Nutch * Thorsten Sherler - Apache Droids * Michael Stack - Formerly with Heritrix, now HBase * Ken Krugler - Bixo
[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The ApacheConUs2009MeetUp page has been changed by KenKrugler. http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=4&rev2=5 -- - We're planning to have a Web Crawler Developer !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland. + We were planning to have a Web Crawler Developer !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland. - Tentative plan is for Thursday evening, November 5th. The actual schedule for !MeetUps is [[http://wiki.apache.org/apachecon/ApacheMeetupsUs09|here]]. + Unfortunately the only time slot where people would be around was Thursday night, which wound up conflicting with the Hadoop !MeetUp. + + So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. Location is TBD, hopefully we can get some space at the event but might be a lunch meeting :) Below are some potential topics for discussion - feel free to add/comment.
[Nutch Wiki] Update of DownloadingNutch by SteveKearns
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The DownloadingNutch page has been changed by SteveKearns. http://wiki.apache.org/nutch/DownloadingNutch?action=diff&rev1=5&rev2=6 -- You have two choices in how to get Nutch: - 1. You can download a release from http://lucene.apache.org/nutch/release/. This will give you a relatively stable release. At the moment the latest release is 0.9. + 1. You can download a release from http://lucene.apache.org/nutch/release/. This will give you a relatively stable release. At the moment the latest release is 1.0. - 2. Or, you can check out the latest source code from subversion and build it with Ant. This gets you closer to the bleeding edge of development. The 0.9 should be relatively stable but the trunk (from which the [[http://lucene.apache.org/nutch/nightly.html|nightly builds]] are built) is under heavy development with bugs showing up and getting squashed fairly frequently. + 2. Or, you can check out the latest source code from subversion and build it with Ant. This gets you closer to the bleeding edge of development. The 1.0 release should be relatively stable but the trunk (from which the [[http://lucene.apache.org/nutch/nightly.html|nightly builds]] are built) is under heavy development with bugs showing up and getting squashed fairly frequently. Note: As of 5/29/08 the Subversion trunk seems to be much better than the 0.9 release. If you have trouble with 0.9 your best bet is to try moving to trunk and see if the problems resolve themselves.
[Nutch Wiki] Trivial Update of 首页 by yongping8204
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The 首页 page has been changed by yongping8204. http://wiki.apache.org/nutch/%E9%A6%96%E9%A1%B5?action=diff&rev1=4&rev2=5 -- #format wiki #language zh #pragma section-numbers off - = Wiki link names Wiki = You may want to start from these links: + - * [[最新改动]]: who recently changed what + * [[最新改动]]: who recently changed what (I am editing this) * [[维基沙盘演练]]: edit and change things here freely as a warm-up exercise * [[查找网页]]: search and browse this site in several ways - * [[语法参考]]: a handy reference for wiki syntax + * [[语法参考]]: a handy reference for wiki syntax * [[站点导航]]: an overview of this site's content + What is this wiki about? Test + == How to use this site == + A wiki is a collaborative website: anyone can take part in building, editing, and maintaining it and share its content: - == How to use this site == - - A wiki is a collaborative website: anyone can take part in building, editing, and maintaining it and share its content: * Click '''GetText(Edit)''' in the header or footer of any page to freely edit and change that page. * Creating a link could not be simpler: you can use words joined together, with each word capitalized and no spaces between them (e.g. WikiSandBox), or use {{{[quoted words in brackets]}}}. Simplified Chinese links can use the latter, e.g. {{{[维基沙盘演练]}}}. * The search box in each page's header can be used for page-title search or full-text search.
[Nutch Wiki] Update of Support by KelvinTan
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KelvinTan: http://wiki.apache.org/nutch/Support -- * Sudhi Seshachala sudhi_...@yahoo.com Please visit http://www.myopensourcejobs.com (Built on LAMP and Nutch) * http://www.termindoc.de (SP data GmbH, Germany schackenberg at termindoc.de) * [http://www.mint.nl/ MINT] (Media Integration) info at mint.nl - * [http://www.supermind.org/ Kelvin Tan] kelvint at apache.org + * [http://www.supermind.org/ Kelvin Tan] Kelvin Tan - Lucene, Solr and Nutch consulting. Specializes in vertical search. * [http://www.tokenizer.org/ Tokenizer Inc.] Fuad Efendi, director, [1](416)993-2060, first_name at last_name.ca. Toronto, Canada. * [http://www.webdev2b.com Vladimir Brezhnev] rsdsoft at gmail.com * [http://www.wyona.com/ Wyona] open source software development, contact at wyona.com
[Nutch Wiki] Update of Support by Justin Gilbreath
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Justin Gilbreath: http://wiki.apache.org/nutch/Support -- Entries are listed alphabetically by company or last name. - * [http://www.30digits.com/ 30 Digits] - Implementation, consulting, support, and value-add components (i.e. spiders, UI, security) for Nutch, Lucene and Solr. Based in Germany with customers across Europe and North America. contact at 30digits.com + * [http://www.30digits.com/ 30 Digits] - Implementation, consulting, support, and value-add components (i.e. spiders, UI, security) for Nutch, Lucene and Solr. Based in Germany (Deutschland) with customers across Europe and North America. contact at 30digits.com * [http://www.sigram.com Andrzej Bialecki] ab at sigram.com * CNLP http://www.cnlp.org/tech/lucene.asp * [http://www.digitalpebble.com/ DigitalPebble Ltd.] contact at digitalpebble.com. Norwich, UK.
[Nutch Wiki] Update of PublicServers by ReinierBattenberg
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ReinierBattenberg: http://wiki.apache.org/nutch/PublicServers -- * [http://www.misterbot.fr Misterbot.fr] a search engine for French language web sites. + * [http://search.mountbatten.net Mountbatten Search] a search engine that crawls only the part of the Internet located in Uganda. + * [http://www.mozdex.com mozDex].com Running Nutch SVN release with Clustering Ontology support enabled. * [http://www.myopensourcejobs.com MyOpensourcejobs] An Opensource skills jobs site using NUTCH and LAMP based DRUPAL CMS.
[Nutch Wiki] Trivial Update of FrontPage by KenKrugler
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KenKrugler: http://wiki.apache.org/nutch/FrontPage -- * [Getting Started] * JavaDemoApplication - A simple demonstration of how to use the Nutch API in a Java application * InstallingWeb2 + * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6) == Nutch 2.0 == * [Nutch2Architecture] -- Discussions on the Nutch 2.0 architecture.
[Nutch Wiki] Trivial Update of FrontPage by KenKrugler
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KenKrugler: http://wiki.apache.org/nutch/FrontPage -- * [Getting Started] * JavaDemoApplication - A simple demonstration of how to use the Nutch API in a Java application * InstallingWeb2 - * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6) + * [ApacheConUs2009MeetUp ApacheCon US 2009 MeetUp] - List of topics for !MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6) == Nutch 2.0 == * [Nutch2Architecture] -- Discussions on the Nutch 2.0 architecture.
[Nutch Wiki] Trivial Update of FrontPage by KenKrugler
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KenKrugler: http://wiki.apache.org/nutch/FrontPage -- * [Getting Started] * JavaDemoApplication - A simple demonstration of how to use the Nutch API in a Java application * InstallingWeb2 - * [ApacheConUs2009MeetUp ApacheCon US 2009 MeetUp] - List of topics for !MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6) + * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6) == Nutch 2.0 == * [Nutch2Architecture] -- Discussions on the Nutch 2.0 architecture.
[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KenKrugler: http://wiki.apache.org/nutch/ApacheConUs2009MeetUp The comment on the change is: List of potential discussion topics for ApacheCon US 2009 MeetUp New page: We're planning to have a Web Crawler Developer !MeetUp at this year's ApacheCon US in Oakland. Tentative plan is for Thursday evening, November 5th. The actual schedule for !MeetUps is [http://wiki.apache.org/apachecon/ApacheMeetupsUs09 here]. Below are some potential topics for discussion - feel free to add/comment. * Potential synergies between crawler projects - e.g. sharing robots.txt processing code. * How to avoid end-user abuse - webmasters sometimes block crawlers because users configure it to be impolite. * Politeness vs. efficiency - various options for how to be considered polite, while still crawling quickly. * robots.txt processing - current problems with existing implementations * Avoiding crawler traps - link farms, honeypots, etc. * Parsing content - home grown, Neko/TagSoup, Tika, screen scraping * Search infrastructure - options for serving up crawl results (Nutch, Solr, Katta, others?) * Testing challenges - is it possible to unit test a crawler? * Fuzzy classification - mime-type, charset, language. * The future of Nutch, Droids, Heritrix, Bixo, etc. * Optimizing for types of crawling - intranet, focused, whole web.
[Nutch Wiki] Trivial Update of ApacheConUs2009MeetUp by KenKrugler
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KenKrugler: http://wiki.apache.org/nutch/ApacheConUs2009MeetUp -- Below are some potential topics for discussion - feel free to add/comment. - * Potential synergies between crawler projects - e.g. sharing robots.txt processing code. + * Potential synergies between crawler projects - e.g. sharing robots.txt processing code. - * How to avoid end-user abuse - webmasters sometimes block crawlers because users configure it to be impolite. + * How to avoid end-user abuse - webmasters sometimes block crawlers because users configure it to be impolite. - * Politeness vs. efficiency - various options for how to be considered polite, while still crawling quickly. + * Politeness vs. efficiency - various options for how to be considered polite, while still crawling quickly. - * robots.txt processing - current problems with existing implementations + * robots.txt processing - current problems with existing implementations - * Avoiding crawler traps - link farms, honeypots, etc. + * Avoiding crawler traps - link farms, honeypots, etc. - * Parsing content - home grown, Neko/TagSoup, Tika, screen scraping + * Parsing content - home grown, Neko/TagSoup, Tika, screen scraping - * Search infrastructure - options for serving up crawl results (Nutch, Solr, Katta, others?) + * Search infrastructure - options for serving up crawl results (Nutch, Solr, Katta, others?) - * Testing challenges - is it possible to unit test a crawler? + * Testing challenges - is it possible to unit test a crawler? - * Fuzzy classification - mime-type, charset, language. + * Fuzzy classification - mime-type, charset, language. - * The future of Nutch, Droids, Heritrix, Bixo, etc. + * The future of Nutch, Droids, Heritrix, Bixo, etc. - * Optimizing for types of crawling - intranet, focused, whole web. + * Optimizing for types of crawling - intranet, focused, whole web.
[Nutch Wiki] Trivial Update of ApacheConUs2009MeetUp by KenKrugler
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KenKrugler: http://wiki.apache.org/nutch/ApacheConUs2009MeetUp -- - We're planning to have a Web Crawler Developer !MeetUp at this year's ApacheCon US in Oakland. + We're planning to have a Web Crawler Developer !MeetUp at this year's [http://www.us.apachecon.com/c/acus2009/ ApacheCon US] in Oakland. Tentative plan is for Thursday evening, November 5th. The actual schedule for !MeetUps is [http://wiki.apache.org/apachecon/ApacheMeetupsUs09 here]. @@ -11, +11 @@ * Politeness vs. efficiency - various options for how to be considered polite, while still crawling quickly. * robots.txt processing - current problems with existing implementations * Avoiding crawler traps - link farms, honeypots, etc. - * Parsing content - home grown, Neko/TagSoup, Tika, screen scraping + * Parsing content - home grown, Neko/!TagSoup, Tika, screen scraping * Search infrastructure - options for serving up crawl results (Nutch, Solr, Katta, others?) * Testing challenges - is it possible to unit test a crawler? * Fuzzy classification - mime-type, charset, language.
[Nutch Wiki] Update of PublicServers by stoicleo
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by stoicleo: http://wiki.apache.org/nutch/PublicServers -- Please sort by name alphabetically * [http://askaboutoil.com AskAboutOil] is a vertical search portal for the petroleum industry. + + * [http://www.asbestosinfo.info Asbestos] is a vertical search portal and discussion forum for asbestos and related information. * [http://www.baynote.com/go Baynote] provides free hosted Nutch search for businesses.
[Nutch Wiki] Update of FrontPage by AlexMc
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by AlexMc: http://wiki.apache.org/nutch/FrontPage -- * [:Automating_Fetches_with_Python:Automating Fetches with Python] - How to automate the Nutch fetching process using Python * [:Upgrading_Hadoop:Upgrading Hadoop Version in Nutch] - Basic steps for upgrading Hadoop in Nutch. * [FAQ] - * [:CommandLineOptions:Commandline] options for 0.7.x + * [:07CommandLineOptions:Commandline] options for 0.7.x * [:08CommandLineOptions:Commandline] options for version 0.8 + * Current CommandLineOptions * OverviewDeploymentConfigs * NutchConfigurationFiles * GettingNutchRunningWithUtf8 - For support of non-ASCII characters (Chinese, German, Japanese, Korean).
[Nutch Wiki] Update of bin/nutch readdb by AlexMc
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by AlexMc: http://wiki.apache.org/nutch/bin/nutch_readdb -- CommandLineOptions + + (Actually this looks out of date. You might be looking for org.apache.nutch.crawl.CrawlDBReader instead) +
[Nutch Wiki] Trivial Update of bin/nutch readdb by AlexMc
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by AlexMc: http://wiki.apache.org/nutch/bin/nutch_readdb -- CommandLineOptions - (Actually this looks out of date. You might be looking for org.apache.nutch.crawl.CrawlDBReader instead) + (Actually this looks out of date. You might be looking for org.apache.nutch.crawl.CrawlDbReader instead)
[Nutch Wiki] Update of FrontPage by DanielZhou
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DanielZhou: http://wiki.apache.org/nutch/FrontPage -- * HttpAuthenticationSchemes - How to enable Nutch to authenticate itself using NTLM, Basic or Digest authentication schemes. * NonDefaultIntranetCrawlingOptions - Desirable options to add to your intranet crawling configuration. * RunningNutchAndSolr - How to configure Nutch to crawl, but post to Solr for search/index + * NutchWithChineseAnalyzer - References to some Chinese articles explaining how to setup Nutch with 3rd party Chinese analyzers == Nutch Development == * [:Becoming_A_Nutch_Developer:Becoming a Nutch Developer] - Start developing and contributing to Nutch.
[Nutch Wiki] Update of AddingNewLocalization by Mike Dawson
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Mike Dawson: http://wiki.apache.org/nutch/AddingNewLocalization New page: ===Adding a New Language to Nutch=== If you want to have Nutch in your language - hopefully the below helps. I just Googled around. * Unzip Nutch 1.0 to any folder * Translate the .properties files that you find in src/web/locale/org/nutch/jsp : ** For each file make sure that you have your own version ending in _langcode.properties e.g. _fa.properties . Btw OmegaT is an excellent Translation memory program to help with standardizing terms etc. * Make a folder src/web/include/langcode with a file header.xml - again this needs translated. * Make a folder src/web/pages/langcode and copy the .xml files from the English folder and then translate them. In search.xml look for the line: <pre> <input type="hidden" name="lang" value="fa"/> </pre> Change the value of lang to match the language you are adding (e.g. fa) * Add your language to src/web/include/footer.html * In the Nutch base directory run ant <pre> ant generate-docs </pre> * Work in progress - I now find that when doing the search it still comes back in English... for some reason it seems like the JSP loads the resource bundle according to the language passed by the browser headers, not according to the lang parameter...
[Nutch Wiki] Update of AddingNewLocalization by Mike Dawson
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Mike Dawson: http://wiki.apache.org/nutch/AddingNewLocalization -- If you want to have Nutch in your language - hopefully the below helps. I just Googled around. - * Unzip Nutch 1.0 to any folder + * Unzip Nutch 1.0 to any folder - * Translate the .properties files that you find in src/web/locale/org/nutch/jsp : + * Translate the .properties files that you find in src/web/locale/org/nutch/jsp : - ** For each file make sure that you have your own version ending in _langcode.properties e.g. _fa.properties . Btw OmegaT is an excellent Translation memory program to help with standardizing terms etc. + * For each file make sure that you have your own version ending in _langcode.properties e.g. _fa.properties . Btw OmegaT is an excellent Translation memory program to help with standardizing terms etc. - * Make a folder src/web/include/langcode with a file header.xml - again this needs translated. + * Make a folder src/web/include/langcode with a file header.xml - again this needs translated. - * Make a folder src/web/pages/langcode and copy the .xml files from the English folder and then translate them. In search.xml look for the line: + * Make a folder src/web/pages/langcode and copy the .xml files from the English folder and then translate them. In search.xml look for the line: - <pre> + + {{{ <input type="hidden" name="lang" value="fa"/> - </pre> + }}} Change the value of lang to match the language you are adding (e.g. fa) - * Add your language to src/web/include/footer.html + * Add your language to src/web/include/footer.html - * In the Nutch base directory run ant + * In the Nutch base directory run ant - <pre> + {{{ ant generate-docs - </pre> + }}} - * Work in progress - I now find that when doing the search it still comes back in English...
for some reason it seems like the JSP loads the resource bundle according to the language passed by the browser headers, not according to the lang parameter... + * Work in progress - I now find that when doing the search it still comes back in English... for some reason it seems like the JSP loads the resource bundle according to the language passed by the browser headers, not according to the lang parameter...
[Nutch Wiki] Update of AddingNewLocalization by Mike Dawson
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Mike Dawson: http://wiki.apache.org/nutch/AddingNewLocalization -- - ===Adding a New Language to Nutch=== + = Adding a New Language to Nutch = - If you want to have Nutch in your language - hopefully the below helps. I just Googled around. + If you want to have Nutch in your language - hopefully the below helps. I have been Googling around and digging in some source code... * Unzip Nutch 1.0 to any folder @@ -25, +25 @@ ant generate-docs }}} - * Work in progress - I now find that when doing the search it still comes back in English... for some reason it seems like the JSP loads the resource bundle according to the language passed by the browser headers, not according to the lang parameter... + * It seems like some changes are needed to search.jsp to make it behave as users would expect. The original appears to expect the language of the browser to take precedence over the language selected... After out.flush() at about line 160 add the following in src/web/jsp/search.jsp: + {{{ + + //see what locale we should use + Locale ourLocale = null; + if (!queryLang.equals("")) { + ourLocale = new Locale(queryLang); + language = new String(queryLang); + } else { + ourLocale = request.getLocale(); + } + + }}} + + Then change the line: + + {{{ + <i18n:bundle baseName="org.nutch.jsp.search"/> + }}} + + to: + + {{{ + <i18n:bundle baseName="org.nutch.jsp.search" locale="<%=ourLocale%>"/> + }}} + + * Now we are ready to build it: + + {{{ + ant war + }}} + + * Copy the .war file to your servlet container's webapp directory. If everything went well you will see your language code at the bottom; you can then select it, and the search interface will come back with the localisation you just put in. +
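The intent of the search.jsp change above - letting an explicit lang query parameter win over the locale derived from the browser's Accept-Language header - can also be illustrated outside of a JSP. This is a minimal standalone sketch; the class and method names are invented for the example:

```java
import java.util.Locale;

// Standalone sketch of the selection rule added to search.jsp:
// an explicit "lang" query parameter takes precedence over the
// locale derived from the browser's Accept-Language header.
public class LocalePickSketch {
    static Locale pick(String queryLang, Locale browserLocale) {
        if (queryLang != null && !queryLang.equals("")) {
            return new Locale(queryLang);   // user explicitly chose a language
        }
        return browserLocale;               // fall back to the browser's locale
    }

    public static void main(String[] args) {
        System.out.println(pick("fa", Locale.ENGLISH).getLanguage()); // explicit parameter wins
        System.out.println(pick("", Locale.ENGLISH).getLanguage());   // falls back to browser
    }
}
```

In the JSP above, `queryLang` plays the role of the lang request parameter and `request.getLocale()` supplies the browser locale.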
[Nutch Wiki] Update of Support by Justin Gilbreath
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Justin Gilbreath: http://wiki.apache.org/nutch/Support -- Entries are listed alphabetically by company or last name. + * [http://www.30digits.com/ 30 Digits] - Implementation, consulting, support, and value-add components (i.e. spiders, UI, security) for Nutch, Lucene and Solr. Based in Germany with customers across Europe and North America. * [http://www.sigram.com Andrzej Bialecki] ab at sigram.com * CNLP http://www.cnlp.org/tech/lucene.asp * [http://www.digitalpebble.com/ DigitalPebble Ltd.] contact at digitalpebble.com. Norwich, UK.
[Nutch Wiki] Update of Support by Justin Gilbreath
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Justin Gilbreath: http://wiki.apache.org/nutch/Support -- Entries are listed alphabetically by company or last name. - * [http://www.30digits.com/ 30 Digits] - Implementation, consulting, support, and value-add components (i.e. spiders, UI, security) for Nutch, Lucene and Solr. Based in Germany with customers across Europe and North America. + * [http://www.30digits.com/ 30 Digits] - Implementation, consulting, support, and value-add components (i.e. spiders, UI, security) for Nutch, Lucene and Solr. Based in Germany with customers across Europe and North America. contact at 30digits.com * [http://www.sigram.com Andrzej Bialecki] ab at sigram.com * CNLP http://www.cnlp.org/tech/lucene.asp * [http://www.digitalpebble.com/ DigitalPebble Ltd.] contact at digitalpebble.com. Norwich, UK.
[Nutch Wiki] Update of HttpAuthenticationSchemes by wobbet
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by wobbet: http://wiki.apache.org/nutch/HttpAuthenticationSchemes -- == Configuration == Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' is very brief, therefore this section would explain it in a little more detail. In all the examples below, the root element auth-configuration has been omitted for the sake of clarity. + + === Prerequisites === + In order use HTTP Authentication your Nutch install must be configured to use 'protocol-httpclient' instead of the default 'protocol-http'. To make this change copy the 'plugin.includes' property from 'conf/nutch-default.xml' and paste it into 'conf/nutch-site.xml'. Within that property replace 'protocol-http' with 'protocol-httpclient'. If you have made no other changes it will look as follows: + {{{ + <property> + <name>plugin.includes</name> + <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value> + <description>Regular expression naming plugin directory names to + include. Any plugin not matching this expression is excluded. + In any case you need at least include the nutch-extensionpoints plugin. By + default Nutch includes crawling just HTML and plain text via HTTP, + and basic indexing and search plugins. In order to use HTTPS please enable + protocol-httpclient, but be aware of possible intermittent problems with the + underlying commons-httpclient library. + </description> + </property> + }}} + + === Optional === + By default Nutch use credential from 'httpclient-auth.xml'. If you wish to use a different file you will need to copy the 'http.auth.file' property from 'conf/nutch-default.xml' and paste it into 'conf/nutch-site.xml' and then modify the 'value' element.
The default property appears as follows: + {{{ + <property> + <name>http.auth.file</name> + <value>httpclient-auth.xml</value> + <description>Authentication configuration file for 'protocol-httpclient' plugin.</description> + </property> + }}} + === Crawling an Intranet with Default Authentication Scope === Let's say all pages of an intranet are protected by basic, digest or ntlm authentication and there is only one set of credentials to be used for all web pages in the intranet, then a configuration as described below is enough. This is also the simplest possible configuration possible for authentication schemes.
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: Added TableOfContents and minor edits in Prerequisites and Optional sect -- + [[TableOfContents]] + == Introduction == This is a feature in Nutch that allows the crawler to authenticate itself to websites requiring NTLM, Basic or Digest authentication. This feature can not do POST based authentication that depends on cookies. More information on this can be found at: HttpPostAuthentication @@ -18, +20 @@ Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' is very brief, therefore this section would explain it in a little more detail. In all the examples below, the root element auth-configuration has been omitted for the sake of clarity. === Prerequisites === - In order use HTTP Authentication your Nutch install must be configured to use 'protocol-httpclient' instead of the default 'protocol-http'. To make this change copy the 'plugin.includes' property from 'conf/nutch-default.xml' and paste it into 'conf/nutch-site.xml'. Within that property replace 'protocol-http' with 'protocol-httpclient'. If you have made no other changes it will look as follows: + In order to use HTTP Authentication, the Nutch crawler must be configured to use 'protocol-httpclient' instead of the default 'protocol-http'. To do this copy 'plugin.includes' property from 'conf/nutch-default.xml' into 'conf/nutch-site.xml'. Replace 'protocol-http' with 'protocol-httpclient' in the value of the property. If you have made no other changes it should look as follows: {{{ <property> <name>plugin.includes</name> @@ -35, +37 @@ }}}
If you wish to use a different file you will need to copy the 'http.auth.file' property from 'conf/nutch-default.xml' and paste it into 'conf/nutch-site.xml' and then modify the 'value' element. The default property appears as follows: + By default Nutch uses credentials from 'conf/httpclient-auth.xml'. If you wish to use a different file, the file should be placed in the 'conf' directory and 'http.auth.file' property should be copied from 'conf/nutch-default.xml' into 'conf/nutch-site.xml' and then the file name in the 'value' element should be edited accordingly. The default property appears as follows: {{{ <property> <name>http.auth.file</name> @@ -43, +45 @@ <description>Authentication configuration file for 'protocol-httpclient' plugin.</description> </property> }}} - === Crawling an Intranet with Default Authentication Scope === Let's say all pages of an intranet are protected by basic, digest or ntlm authentication and there is only one set of credentials to be used for all web pages in the intranet, then a configuration as described below is enough. This is also the simplest possible configuration possible for authentication schemes.
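For the default-scope case just described, the credentials file takes roughly the following shape. This is a sketch based on the commented example shipped in 'conf/httpclient-auth.xml'; the username and password are placeholders, and the root auth-configuration element (omitted in the page's other examples) is shown here for completeness:

```xml
<auth-configuration>
  <!-- one set of credentials applied by default to every protected page -->
  <credentials username="someUser" password="somePassword">
    <default/>
  </credentials>
</auth-configuration>
```

Scoped elements (per-host or per-realm credentials) are described in the sections of the page that follow.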
[Nutch Wiki] Update of IntranetRecrawl by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/IntranetRecrawl -- echo frequent depending on disk constraints) and a new crawl generated. }}} + == Version 1.0 == + A crawl script that runs properly with bash and has been tested with Nutch 1.0 can be found here: Crawl +
[Nutch Wiki] Update of IntranetRecrawl by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/IntranetRecrawl The comment on the change is: Link to crawl script for Nutch 1.0 -- }}} == Version 1.0 == - A crawl script that runs properly with bash and has been tested with Nutch 1.0 can be found here: Crawl + A crawl script that runs properly with bash and has been tested with Nutch 1.0 can be found here: Self:Crawl. This script can do crawl as well as recrawl. However, not much real world recrawl has been done with this script. It might require a little bit of tweaking if you find that the script does not suit your needs.
[Nutch Wiki] Update of Support by JulienNioche
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JulienNioche: http://wiki.apache.org/nutch/Support -- * [http://www.sigram.com Andrzej Bialecki] ab at sigram.com * CNLP http://www.cnlp.org/tech/lucene.asp + * [http://www.digitalpebble.com/ DigitalPebble Ltd.] contact at digitalpebble.com. Norwich, UK. * [http://www.doculibre.com/ Doculibre Inc.] Open source and information management consulting. (Lucene, Nutch, Hadoop, Solr, Lius etc.) info at doculibre.com * [http://www.dsen.nl Thomas Delnoij (DSEN) - Java | J2EE | Agile Development Consultancy] * eventax GmbH info at eventax.com
[Nutch Wiki] Update of FrontPage by JohnWhelan
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JohnWhelan: http://wiki.apache.org/nutch/FrontPage The comment on the change is: Adding 'WhelanLabs SearchEngine Manager' under 'other resources'. -- * [http://blog.foofactory.fi/ FooFactory] Nutch and Hadoop related posts * [http://spinn3r.com Spinn3r] [http://spinn3r.com/opensource.php Open Source components] (our contribution to the crawling OSS community with more to come). * [http://www.interadvertising.co.uk/blog/nutch_logos Larger / better quality Nutch logos] Re-created Nutch logos available in GIF, PNG EPS in resolutions up to 1200 x 449 + * [http://www.whelanlabs.com/content/SearchEngineManager.htm WhelanLabs SearchEngine Manager] An all-in-one, bundled implementation of Nutch, Tomcat, and Cygwin, and JRE for Microsoft Windows. Includes an installer and a simplified administrative UI.
[Nutch Wiki] Update of GettingNutchRunningWithWindows by JohnWhelan
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JohnWhelan: http://wiki.apache.org/nutch/GettingNutchRunningWithWindows -- Since Nutch is written in Java, it is possible to get Nutch working in a Windows environment, provided that the correct software is installed. + + Note: If you're just interested in a basic installation on Windows and are not interested in knowing the details of how it is done, you might want to check whether the [http://www.whelanlabs.com/content/SearchEngineManager.htm WhelanLabs SearchEngine Manager] fits your needs. It is a free installer for Nutch on Windows. The following documents describe how I got it working on Windows XP Pro running Tomcat 5.28. Edit: page updated with my experience installing on Windows Server 2003.
[Nutch Wiki] Trivial Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes -- == Introduction == - This is a feature in Nutch, developed by Susam Pal, that allows the crawler to authenticate itself to websites requiring NTLM, Basic or Digest authentication. This feature can not do POST based authentication that depends on cookies. More information on this can be found at: HttpPostAuthentication + This is a feature in Nutch that allows the crawler to authenticate itself to websites requiring NTLM, Basic or Digest authentication. This feature cannot do POST-based authentication that depends on cookies. More information on this can be found at: HttpPostAuthentication == Necessity == There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and the NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication, but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for a proxy server, and give better inline documentation for the properties used to configure authentication. @@ -108, +108 @@ Once you have checked the items listed above and you are still unable to fix the problem or confused about any point listed above, please mail the issue with the following information: 1. Version of Nutch you are running. - 1. Complete code in ''conf/httpclient-auth.xml' file. + 1. Complete code in 'conf/httpclient-auth.xml' file. 1. Relevant portion from 'logs/hadoop.log' file. If you are clueless, send the complete file.
[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by amitkumar: http://wiki.apache.org/nutch/RunningNutchAndSolr -- - private static class LuceneDocumentWrapper implements Writable { + public static class LuceneDocumentWrapper implements Writable { ). + + Hi, I too faced problems integrating Solr and Nutch. After some work I found the article below and integrated successfully. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ +
[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by amitkumar: http://wiki.apache.org/nutch/RunningNutchAndSolr -- + public static class LuceneDocumentWrapper implements Writable { ). - HI, I to faced problems to integrate solr and nutch.After , some work i found the below article and integrated successfully. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ + Hi, I too faced problems integrating Solr and Nutch. After some work I found the article below and integrated successfully. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ +
[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by amitkumar: http://wiki.apache.org/nutch/RunningNutchAndSolr -- d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it <requestHandler name="/nutch" class="solr.SearchHandler"> + <lst name="defaults"> + <str name="defType">dismax</str> + <str name="echoParams">explicit</str> + <float name="tie">0.01</float> + <str name="qf">content^0.5 anchor^1.0 title^1.2</str> + <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str> + <str name="fl">url</str> + <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str> + <int name="ps">100</int> + <bool name="hl">true</bool> + <str name="q.alt">*:*</str> + <str name="hl.fl">title url content</str> + <str name="f.title.hl.fragsize">0</str> + <str name="f.title.hl.alternateField">title</str> + <str name="f.url.hl.fragsize">0</str> + <str name="f.url.hl.alternateField">url</str> + <str name="f.content.hl.fragmenter">regex</str> + </lst> + </requestHandler> 6. Start Solr
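The mm (minimum-should-match) value '2<-1 5<-2 6<90%' in the handler above is cryptic: with up to 2 optional query clauses, all must match; with more than 2, all but one; with more than 5, all but two; with more than 6, 90% (percentages round down). A small self-contained sketch of that arithmetic (this is just the mapping dismax applies, not Solr code):

```java
public class MinShouldMatch {
    // How dismax's mm value "2<-1 5<-2 6<90%" maps the number of optional
    // query clauses to the number of clauses that must match.
    static int required(int optionalClauses) {
        int n = optionalClauses;
        if (n <= 2) return n;               // no conditional applies: all clauses required
        if (n <= 5) return n - 1;           // "2<-1": more than 2 clauses -> all but 1
        if (n <= 6) return n - 2;           // "5<-2": more than 5 clauses -> all but 2
        return (int) Math.floor(n * 0.90);  // "6<90%": more than 6 -> 90%, rounded down
    }

    public static void main(String[] args) {
        System.out.println(required(2)); // 2
        System.out.println(required(4)); // 3
        System.out.println(required(8)); // 7
    }
}
```

So a seven-word query against this handler, for example, still matches documents that are missing one of the words.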
[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by amitkumar: http://wiki.apache.org/nutch/RunningNutchAndSolr -- * apt-get install sun-java6-jdk subversion ant patch unzip == Steps == - Setup The first step to get started is to download the required software components, namely Apache Solr and Nutch. - 1. Download Solr version 1.3.0 or LucidWorks for Solr from Download page + '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from Download page - 2. Extract Solr package + '''2.''' Extract Solr package - 3. Download Nutch version 1.0 or later (Alternatively download the nightly version of Nutch that contains the required functionality) + '''3.''' Download Nutch version 1.0 or later (Alternatively download the nightly version of Nutch that contains the required functionality) - 4. Extract the Nutch package + '''4.''' Extract the Nutch package tar xzf apache-nutch-1.0.tar.gz - tar xzf apache-nutch-1.0.tar.gz - - 5. Configure Solr + '''5.''' Configure Solr - For the sake of simplicity we are going to use the example configuration of Solr as a base. - a. Copy the provided Nutch schema from directory + '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (override the existing file) We want to allow Solr to create the snippets for search results so we need to store the content in addition to indexing it: - b. Change schema.xml so that the stored attribute of field "content" is true. + '''b.''' Change schema.xml so that the stored attribute of field "content" is true. <field name="content" type="text" stored="true" indexed="true"/> We want to be able to tweak the relevancy of queries easily so we'll create a new dismax request handler configuration for our use case: - d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it + '''d.''' Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it <requestHandler name="/nutch" class="solr.SearchHandler"> @@ -93, +89 @@ </requestHandler> - 6. Start Solr + '''6.''' Start Solr cd apache-solr-1.3.0/example java -jar start.jar - 7. Configure Nutch + '''7. Configure Nutch''' a. Open nutch-site.xml in directory apache-nutch-1.0/conf, replace its contents with the following (we specify our crawler name and active plugins, and limit the maximum url count for a single host per run to 100): <?xml version="1.0"?> <configuration> + <property> + <name>http.agent.name</name> + <value>nutch-solr-integration</value> + </property> + <property> + <name>generate.max.per.host</name> + <value>100</value> + </property> + <property> + <name>plugin.includes</name> + <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value> + </property> + </configuration> + - b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf, + '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf, replace its content with the following: - replace its content with the following: -^(https|telnet|file|ftp|mailto): @@ -135, +143 @@ # deny anything else -. - 8. Create a seed list (the initial urls to fetch) + '''8.''' Create a seed list (the initial urls to fetch) mkdir urls echo "http://www.lucidimagination.com/" >> urls/seed.txt - 9. Inject seed url(s) to nutch crawldb (execute in nutch directory) + '''9.''' Inject seed url(s) to nutch crawldb (execute in nutch directory) bin/nutch inject crawl/crawldb urls - 10. Generate fetch list, fetch and parse content + '''10.''' Generate fetch list, fetch and parse content bin/nutch generate crawl/crawldb crawl/segments @@ -166, +174 @@ Now a full fetch cycle is completed. Next you can repeat step 10 a couple more times to get some more content. - 11. Create linkdb + '''11.''' Create linkdb bin/nutch invertlinks crawl/linkdb -dir crawl/segments - 12. Finally index all content from all segments to Solr + '''12.''' Finally index all content from all segments to Solr bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
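The plugin.includes value set in step 7a is an ordinary Java regular expression that Nutch matches against plugin ids (the whole id must match for the plugin to be activated). A quick self-contained check of which ids that particular value admits:

```java
import java.util.regex.Pattern;

// Checks plugin ids against the plugin.includes expression from step 7a.
public class PluginIncludes {
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)"
        + "|query-(basic|site|url)|response-(json|xml)|summary-basic"
        + "|scoring-opic|urlnormalizer-(pass|regex|basic)");

    static boolean active(String pluginId) {
        // matches() requires the entire plugin id to match the expression.
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        System.out.println(active("index-basic"));         // true
        System.out.println(active("urlnormalizer-regex")); // true
        System.out.println(active("protocol-httpclient")); // false
    }
}
```

Note that 'protocol-httpclient' does not match, so only the plain HTTP protocol plugin is active with this configuration.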
[Nutch Wiki] Update of FrontPage by PalashRay
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by PalashRay: http://wiki.apache.org/nutch/FrontPage -- * ErrorMessages -- What they mean and suggestions for getting rid of them. * SetupProxyForNutch - using Tinyproxy on Ubuntu * CreateNewFilter - for example to add a category metadata to your index and be able to search for it + * HowToMakeCustomSearch * UpgradeFrom07To08 * [Upgrading_from_0.8.x_to_0.9] * RunNutchInEclipse for v0.8
[Nutch Wiki] Update of HowToMakeCustomSearch by PalashRay
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by PalashRay: http://wiki.apache.org/nutch/HowToMakeCustomSearch New page: [This is for Nutch 1.0] How do you index your custom data and then search for the same using the Nutch web interface? Suppose we want to search for the author of the website by his email id. == Indexing the email id == Before we can search for our custom data, we need to index it. Nutch has a plugin architecture very similar to that of Eclipse. We can write our own plugin for indexing. Here is the source code: {{{ package com.swayam.nutch.plugins.indexfilter; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.io.Text; import org.apache.nutch.crawl.CrawlDatum; import org.apache.nutch.crawl.Inlinks; import org.apache.nutch.indexer.IndexingException; import org.apache.nutch.indexer.IndexingFilter; import org.apache.nutch.indexer.NutchDocument; import org.apache.nutch.indexer.lucene.LuceneWriter; import org.apache.nutch.parse.Parse; /** * @author paawak */ public class EmailIndexingFilter implements IndexingFilter { private static final Log LOG = LogFactory.getLog(EmailIndexingFilter.class); private static final String KEY_CREATOR_EMAIL = "email"; private Configuration conf; public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException { // look up the email of the author based on the url of the site String creatorEmail = EmailLookup.getCreatorEmail(url.toString()); LOG.info("creatorEmail = " + creatorEmail); if (creatorEmail != null) { doc.add(KEY_CREATOR_EMAIL, creatorEmail); } return doc; } public void addIndexBackendOptions(Configuration conf) { LuceneWriter.addFieldOptions(KEY_CREATOR_EMAIL, LuceneWriter.STORE.YES, LuceneWriter.INDEX.TOKENIZED, conf); } public Configuration getConf() { 
return conf; } public void setConf(Configuration conf) { this.conf = conf; } } }}} Also, you need to create a ''plugin.xml'': {{{ <plugin id="index-email" name="Email Indexing Filter" version="1.0.0" provider-name="swayam"> <runtime> <library name="EmailIndexingFilterPlugin.jar"> <export name="*"/> </library> </runtime> <requires> <import plugin="nutch-extensionpoints"/> </requires> <extension id="com.swayam.nutch.plugins.indexfilter.EmailIndexingFilter" name="Email Indexing Filter" point="org.apache.nutch.indexer.IndexingFilter"> <implementation id="index-email" class="com.swayam.nutch.plugins.indexfilter.EmailIndexingFilter"/> </extension> </plugin> }}} This done, create a new folder in ''$NUTCH_HOME/plugins'' and put the jar and the plugin.xml there. Now we have to activate this plugin. To do this, we have to edit ''conf/nutch-site.xml''. {{{ <property> <name>plugin.includes</name> <value>nutch-extensionpoints|protocol-http|parse-(text|html)|index-(basic|email)|query-(basic|site|url)</value> <description>Regular expression naming plugin id names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. </description> </property> }}} == Now, how do I search my indexed data? == === Option 1 [cumbersome]: === Add my own query plugin: {{{ package com.swayam.nutch.plugins.queryfilter; import org.apache.nutch.searcher.FieldQueryFilter; /** * @author paawak */ public class MyEmailQueryFilter extends FieldQueryFilter { public MyEmailQueryFilter() { super("email"); } } }}} Do not forget to edit the plugin.xml. 
{{{ <plugin id="query-email" name="Email Query Filter" version="1.0.0" provider-name="swayam"> <runtime> <library name="EmailQueryFilterPlugin.jar"> <export name="*"/> </library> </runtime> <requires> <import plugin="nutch-extensionpoints"/> </requires> <extension id="com.swayam.nutch.plugins.queryfilter.MyEmailQueryFilter" name="Email Query Filter" point="org.apache.nutch.searcher.QueryFilter"> <implementation id="query-email" class="com.swayam.nutch.plugins.queryfilter.MyEmailQueryFilter"> <parameter name="fields" value="email"/> </implementation> </extension> </plugin> }}} This line is particularly important: '<parameter name="fields" value="email"/>' If you skip this line, you will never be able to see this in search results. The only catch here is you have to append the keyword email: to the search key. For example, if you want to search for jsm...@mydomain.com, you have to search for email:jsm...@mydomain.com or email:jsmith.
[Nutch Wiki] Update of HowToMakeCustomSearch by PalashRay
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by PalashRay: http://wiki.apache.org/nutch/HowToMakeCustomSearch -- [This is for Nutch 1.0] - How do you index your custom data and then search for the same using the Nutch web interface? Suppose we want to search for the author of the website by his email id. + How do you index your custom data and then search for the same using the Nutch web interface? + + == Use Case == + + Suppose we want to search for the author of the website by his email id. == Indexing the email id == @@ -184, +188 @@ If you skip this line, you will never be able to see this in search results. - The only catch here is you have to append the keyword email: to the search key. For example, if you want to search for jsm...@mydomain.com, you have to search for email:jsm...@mydomain.com or email:jsmith. + The only catch here is you have to prepend the keyword 'email:' to the search key. For example, if you want to search for 'jsm...@mydomain.com', you have to search for 'email:jsm...@mydomain.com' or 'email:jsmith'. There is an easier and more elegant way. @@ -224, +228 @@ }}} - With this while looking for jsm...@mydomain.com, you can simply enter jsm...@mydomain.com or a part the name like jsmit. + With this, while looking for 'jsm...@mydomain.com', you can simply enter 'jsm...@mydomain.com' or a part of the name like 'jsmit'. == Building a Nutch plugin ==
[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FrankMcCown: http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0 The comment on the change is: For Vista, give cygwin administrative privileges -- Install cygwin and set the PATH environment variable for it. You can set it from the Control Panel, System, Advanced Tab, Environment Variables and edit/add PATH. - I have in PATH like: + Example PATH: {{{ C:\Sun\SDK\bin;C:\cygwin\bin }}} If you run bash from the Windows command line (Start Run... cmd.exe) it should successfully run cygwin. - If you are running Eclipse on Vista, you will likely need to [http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/ turn off Vista's User Access Control (UAC)]. Otherwise Hadoop will likely complain that it cannot change a directory permission when you later run the crawler: + If you are running Eclipse on Vista, you will need to either give cygwin administrative privileges or [http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/ turn off Vista's User Access Control (UAC)]. Otherwise Hadoop will likely complain that it cannot change a directory permission when you later run the crawler: {{{ org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions of ... Permission denied }}}
[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FrankMcCown: http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0 The comment on the change is: Removed install of whoami in Windows (cygwin's whoami is used) -- == Before you start == + Setting up Nutch to run in Eclipse can be tricky, and most of the time it is much faster if you edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). However, it's very useful to be able to debug Nutch in Eclipse. Sometimes examining the logs (logs/hadoop.log) is a quicker way to debug a problem. - Setting up Nutch to run into Eclipse can be tricky, and most of the time you are much faster if you edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). - However, it's very useful to be able to debug Nutch in Eclipse. But again you might be quicker by looking at the logs (logs/hadoop.log)... == Steps == @@ -34, +33 @@ C:\Sun\SDK\bin;C:\cygwin\bin - If you run bash in Start-RUN-cmd.exe it should work. + If you run bash in Start Run... cmd.exe it should work. - - Then you should install tools from Microsoft website (adding 'whoami' command). - - Example for Windows XP and sp2 - - http://www.microsoft.com/downloads/details.aspx?FamilyId=49AE8576-9BB9-4126-9761-BA8011FABF38displaylang=en - - - Then you can follow rest of these steps === Install Nutch === * Grab a fresh release of Nutch 0.9 - http://lucene.apache.org/nutch/version_control.html
[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FrankMcCown: http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0 The comment on the change is: Add link to official release -- === Install Nutch === - * Grab a fresh release of Nutch 0.9 - http://lucene.apache.org/nutch/version_control.html + * Grab a [http://lucene.apache.org/nutch/version_control.html fresh release] of Nutch 1.0 or download and untar the [http://lucene.apache.org/nutch/release/ official 1.0 release]. - * Do not build Nutch now. Make sure you have no .project and .classpath files in the Nutch directory + * Do not build Nutch yet. Make sure you have no .project and .classpath files in the Nutch directory === Create a new java project in Eclipse ===
[Nutch Wiki] Trivial Update of RunNutchInEclipse1.0 by FrankMcCown
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FrankMcCown: http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0 The comment on the change is: Clarified some instructions and improved grammar -- * Do not build Nutch yet. Make sure you have no .project and .classpath files in the Nutch directory - === Create a new java project in Eclipse === + === Create a new Java Project in Eclipse === * File New Project Java project click Next * Name the project (Nutch_Trunk for instance) * Select Create project from existing source and use the location where you downloaded Nutch * Click on Next, and wait while Eclipse is scanning the folders - * Add the folder conf to the classpath (third tab and then add class folder) + * Add the folder conf to the classpath (click the Libraries tab, click Add Class Folder... button, and select conf from the list) - * Go to Order and Export tab, find the entry for added conf folder and move it to the top. It's required to make eclipse take config (nutch-default.xml, nutch-final.xml, etc.) resources from our conf folder not anywhere else. + * Go to Order and Export tab, find the entry for added conf folder and move it to the top (by checking it and clicking the Top button). This is required so Eclipse will take config (nutch-default.xml, nutch-final.xml, etc.) resources from our conf folder and not from somewhere else. - * Eclipse should have guessed all the java files that must be added on your classpath. If it's not the case, add src/java, src/test and all plugin src/java and src/test folders to your source folders. Also add all jars in lib and in the plugin lib folders to your libraries - * Set output dir to tmp_build, create it if necessary + * Eclipse should have guessed all the Java files that must be added to your classpath. If that's not the case, add src/java, src/test and all plugin src/java and src/test folders to your source folders. 
Also add all jars in lib and in the plugin lib folders to your libraries + * Click the Source tab and set the default output folder to Nutch_Trunk/bin/tmp_build. (You may need to create the tmp_build folder.) + * Click the Finish button * DO NOT add build to classpath
[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FrankMcCown: http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0 The comment on the change is: Added fix for RTFParseFactory issues -- Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. Then add the jar files to the build path (First refresh the workspace by pressing F5. Then right-click the project folder Build Path Configure Build Path... Then select the Libraries tab, click Add Jars... and then add each .jar file individually). + === Two Errors with RTFParseFactory === + + If you are trying to build the official 1.0 release, Eclipse will complain about 2 errors regarding the RTFParseFactory (this is after adding the RTF jar file from the previous step). This problem was fixed (see [http://issues.apache.org/jira/browse/NUTCH-644 NUTCH-644] and [http://issues.apache.org/jira/browse/NUTCH-705 NUTCH-705]) but was not included in the 1.0 official release because of licensing issues. So you will need to manually alter the code to remove these 2 build errors. + + In RTFParseFactory.java: + 1. Add the following import statement: {{{import org.apache.nutch.parse.ParseResult;}}} + + 2. 
Change + + {{{ + public Parse getParse(Content content) { + }}} + to + {{{ + public ParseResult getParse(Content content) { + }}} + 1.#3 In the getParse function, replace + {{{ + return new ParseStatus(ParseStatus.FAILED, +ParseStatus.FAILED_EXCEPTION, +e.toString()).getEmptyParse(conf); + }}} + with + {{{ + return new ParseStatus(ParseStatus.FAILED, + ParseStatus.FAILED_EXCEPTION, + e.toString()).getEmptyParseResult(content.getUrl(), getConf()); + }}} + 1.#4 In the getParse function, replace + {{{ + return new ParseImpl(text, + new ParseData(ParseStatus.STATUS_SUCCESS, +title, +OutlinkExtractor.getOutlinks(text, this.conf), +content.getMetadata(), +metadata)); + }}} + with + {{{ + return ParseResult.createParseResult(content.getUrl(), +new ParseImpl(text, +new ParseData(ParseStatus.STATUS_SUCCESS, +title, +OutlinkExtractor.getOutlinks(text, this.conf), +content.getMetadata(), +metadata))); + + }}} + + In TestRTFParser.java, replace + {{{ + parse = new ParseUtil(conf).parseByExtensionId(parse-rtf, content); + }}} + with + {{{ + parse = new ParseUtil(conf).parseByExtensionId(parse-rtf, content).get(urlString); + }}} + + Once you have made these changes and saved the files, Eclipse should build with no errors. === Build Nutch === If you setup the project correctly, Eclipse will build Nutch for you into tmp_build. See below for problems you could run into.
[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FrankMcCown: http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0 The comment on the change is: In case you forget to add cygwin to path -- * add the hadoop project as a dependent project of the nutch project * you can now also set break points within hadoop classes like InputFormat implementations etc. + + === Failed to get the current user's information === + + On Windows, if the crawler throws an exception complaining it Failed to get the current user's information or 'Login failed: Cannot run program bash', it is likely you forgot to set the PATH to point to cygwin. Open a new command line window (All Programs Accessories Command Prompt) and type bash. This should start cygwin. If it doesn't, type path to see your path. You should see within the path the cygwin bin directory (e.g., C:\cygwin\bin). See the steps for adding this to your PATH at the top of the article under For Windows Users. + + Original credits: RenaudRichardet
[Nutch Wiki] Trivial Update of RunNutchInEclipse1.0 by FrankMcCown
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FrankMcCown: http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0 The comment on the change is: Restart Eclipse after setting PATH -- === Failed to get the current user's information === - On Windows, if the crawler throws an exception complaining it Failed to get the current user's information or 'Login failed: Cannot run program bash', it is likely you forgot to set the PATH to point to cygwin. Open a new command line window (All Programs Accessories Command Prompt) and type bash. This should start cygwin. If it doesn't, type path to see your path. You should see within the path the cygwin bin directory (e.g., C:\cygwin\bin). See the steps to adding this to your PATH at the top of the article under For Windows Users. + On Windows, if the crawler throws an exception complaining it Failed to get the current user's information or 'Login failed: Cannot run program bash', it is likely you forgot to set the PATH to point to cygwin. Open a new command line window (All Programs Accessories Command Prompt) and type bash. This should start cygwin. If it doesn't, type path to see your path. You should see within the path the cygwin bin directory (e.g., C:\cygwin\bin). See the steps for adding this to your PATH at the top of the article under For Windows Users. After setting the PATH, you will likely need to restart Eclipse so it will use the new PATH. Original credits: RenaudRichardet
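Since code launched from Eclipse inherits the same environment, bash's visibility can also be sanity-checked from inside the JVM. A small sketch (the helper name is our own; it just scans the PATH entries the process sees):

```java
import java.io.File;

public class PathCheck {
    // Returns the first PATH entry that contains bash (or bash.exe on
    // Windows/cygwin), or null if bash is not visible on the PATH.
    static String findBashDir(String path) {
        for (String dir : path.split(File.pathSeparator)) {
            if (new File(dir, "bash").isFile() || new File(dir, "bash.exe").isFile()) {
                return dir;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String dir = findBashDir(System.getenv("PATH"));
        System.out.println(dir == null
            ? "bash NOT found on PATH -- Hadoop's shell commands will fail"
            : "bash found in: " + dir);
    }
}
```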
[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FrankMcCown: http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0 The comment on the change is: Moved heap problem to location of other problems -- * if all works, you should see Nutch getting busy at crawling :-) - == Java Heap Size problem == - - If you find in hadoop.log line similar to this: - - {{{ - 2009-04-13 13:41:06,105 WARN mapred.LocalJobRunner - job_local_0001 - java.lang.OutOfMemoryError: Java heap space - }}} - - You should increase amount of RAM for running applications from eclipse. - - Just set it in: - - Eclipse - Window - Preferences - Java - Installed JREs - edit - Default VM arguments - - I've set mine to - {{{ - -Xms5m -Xmx150m - }}} - because I have like 200MB RAM left after runnig all apps - - -Xms (minimum ammount of RAM memory for running applications) - -Xmx (maximum) - == Debug Nutch in Eclipse (not yet tested for 0.9) == * Set breakpoints and debug a crawl * It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs. Here are a few good places to set breakpoints: @@ -195, +171 @@ == If things do not work... == Yes, Nutch and Eclipse can be a difficult companionship sometimes ;-) + === Java Heap Size problem === + + If the crawler throws an IOException exception early in the crawl (Exception in thread main java.io.IOException: Job failed!), check the logs/hadoop.log file for further information. If you find in hadoop.log lines similar to this: + + {{{ + 2009-04-13 13:41:06,105 WARN mapred.LocalJobRunner - job_local_0001 + java.lang.OutOfMemoryError: Java heap space + }}} + + then you should increase amount of RAM for running applications from Eclipse. 
+ + Just set it in: + + Eclipse - Window - Preferences - Java - Installed JREs - edit - Default VM arguments + + I've set mine to + {{{ + -Xms5m -Xmx150m + }}} + because I have about 200MB RAM left after running all apps + + -Xms (minimum amount of RAM for running applications) + -Xmx (maximum) + - === eclipse: Cannot create project content in workspace === + === Eclipse: Cannot create project content in workspace === The Nutch source code must be outside the workspace folder. My first attempt was to download the code with Eclipse (svn) under my workspace. When I tried to create the project from existing code, Eclipse didn't let me do it from source code inside the workspace. I used the source code outside my workspace and it worked fine. === plugin dir not found ===