Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "RunningNutchAndSolr" page has been changed by GeoffBentley.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=28&rev2=29

--------------------------------------------------

  = New in Nutch 1.0-dev =
- Please note that in the nightly version of Apache Nutch there is now a Solr integration embedded so you can start to use a lot easier. Just download a nightly version from [[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/]].
+ Please note that in the nightly version of Apache Nutch there is now a Solr integration embedded so you can start to use a lot easier. Just download a nightly version from http://hudson.zones.apache.org/hudson/job/Nutch-trunk/.

  = Pre Solr Nutch integration =
- This is just a quick first pass at a guide for getting Nutch running with Solr. I'm sure there are better ways of doing some/all of it, but I'm not aware of them. By all means, please do correct/update this if someone has a better idea. Many thanks to [[http://variogram.com||Brian Whitman at Variogr.am]] and [[http://blog.foofactory.fi||Sami Siren at FooFactory]] for all the help! You guys saved me a lot of time! :)
+ This is just a quick first pass at a guide for getting Nutch running with Solr. I'm sure there are better ways of doing some/all of it, but I'm not aware of them. By all means, please do correct/update this if someone has a better idea. Many thanks to http://variogram.com and http://blog.foofactory.fi for all the help! You guys saved me a lot of time! :)

  I'm posting it under Nutch rather than Solr on the presumption that people are more likely to be learning/using Solr first, then come here looking to combine it with Nutch. I'm going to skip over doing command by command for right now. I'm running/building on Ubuntu 7.10 using Java 1.6.0_05. I'm assuming that the Solr trunk code is checked out into solr-trunk and Nutch trunk code is checked out into nutch-trunk.
@@ -12, +12 @@

   * apt-get install sun-java6-jdk subversion ant patch unzip

  == Steps ==
- The first step to get started is to download the required software components, namely Apache Solr and Nutch.

  '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from Download page

@@ -23, +22 @@

  '''4.''' Extract the Nutch package
  tar xzf apache-nutch-1.0.tar.gz

+ '''5.''' Configure Solr For the sake of simplicity we are going to use the example configuration of Solr as a base.
- '''5.''' Configure Solr
- For the sake of simplicity we are going to use the example
- configuration of Solr as a base.
- '''a.''' Copy the provided Nutch schema from directory
- apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (override the existing file)
+ '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (override the existing file)

  We want to allow Solr to create the snippets for search results so we need to store the content in addition to indexing it:

@@ -52, +48 @@

  <str name="qf">
- content^0.5 anchor^1.0 title^1.2
+ content^0.5 anchor^1.0 title^1.2 </str>
- </str>
- <str name="pf">
- content^0.5 anchor^1.5 title^1.2 site^1.5
+ <str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
- </str>
+ <str name="fl"> url </str>
- <str name="fl">
- url
- </str>
+ <str name="mm"> 2<-1 5<-2 6<90% </str>
- <str name="mm">
- 2<-1 5<-2 6<90%
- </str>
  <int name="ps">100</int>
- <bool hl="true"/>
+ <bool name="hl">true</bool>
  <str name="q.alt">*:*</str>

@@ -91, +80 @@

  '''6.''' Start Solr
+ cd apache-solr-1.3.0/example java -jar start.jar
- cd apache-solr-1.3.0/example
- java -jar start.jar

  '''7. Configure Nutch'''
  a.
  Open nutch-site.xml in directory apache-nutch-1.0/conf, replace it’s contents with the following (we specify our crawler name, active plugins and limit maximum url count for single host per run to be 100) :
+ <?xml version="1.0"?> <configuration>
- <?xml version="1.0"?>
- <configuration>
  <property>

@@ -109, +96 @@

  </property>
- <property>
- <name>generate.max.per.host</name>
+ <property> <name>generate.max.per.host</name>
  <value>100</value>

@@ -126, +112 @@

  </configuration>
-
  '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf,replace it’s content with following:
  -^(https|telnet|file|ftp|mailto):
+
-
- # skip some suffixes
- -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+ # skip some suffixes -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-
+
- # skip URLs containing certain characters as probable queries, etc.
+ # skip URLs containing certain characters as probable queries, etc. -[...@=]
+
+ # allow urls in foofactory.fi domain +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
+
- -[...@=]
-
- # allow urls in foofactory.fi domain
- +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
-
- # deny anything else
+ # deny anything else -.
- -.

  '''8.''' Create a seed list (the initial urls to fetch)
- mkdir urls
- echo "http://www.lucidimagination.com/" > urls/seed.txt
+ mkdir urls echo "http://www.lucidimagination.com/" > urls/seed.txt

  '''9.''' Inject seed url(s) to nutch crawldb (execute in nutch directory)

@@ -190, +170 @@

  http://127.0.0.1:8983/solr/nutch/?q=solr&version=2.2&start=0&rows=10&indent=on&wt=json
-
-
- HI, I to faced problems in integrating solr and nutch.
After, some work out i found the below article and integrated successfully. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
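
The regex-urlfilter.txt rules in step 7b depend on evaluation order: Nutch's regex URL filter checks each rule from top to bottom, the first pattern found in the URL decides whether it is kept (+) or dropped (-), and a URL that matches nothing is dropped. The Python sketch below only mimics that behaviour for illustration; it is not Nutch code. The names RULES and accepts are made up here, the suffix list is shortened, and the "probable queries" character-class rule is left out because its pattern is elided in the diff.

```python
import re

# Rules in file order, taken from the regex-urlfilter.txt shown in the diff.
# Assumptions: the suffix list is shortened for readability, and the
# character-class rule is omitted because its pattern is elided in the source.
RULES = [
    ("-", r"^(https|telnet|file|ftp|mailto):"),
    ("-", r"\.(swf|SWF|doc|DOC|pdf|PDF|gif|GIF|jpg|JPG|png|PNG|css|zip|gz)$"),
    ("+", r"^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/"),
    ("-", r"."),  # deny anything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched at all: treat the URL as rejected.
    return False


if __name__ == "__main__":
    for candidate in (
        "http://www.lucidimagination.com/",
        "https://www.lucidimagination.com/",
        "http://www.lucidimagination.com/logo.png",
        "http://example.org/",
    ):
        print(candidate, accepts(candidate))
```

With these rules only plain http pages under lucidimagination.com pass. Moving the final "-." rule above the "+" rule would reject every URL, which is why the deny-all rule has to come last.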