[Nutch Wiki] Update of "RunningNutchAndSolr" by GeoffBentley

Apache Wiki Sun, 17 Jan 2010 18:37:21 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "RunningNutchAndSolr" page has been changed by GeoffBentley.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=28&rev2=29

--------------------------------------------------

  = New in Nutch 1.0-dev =
- Please note that the nightly version of Apache Nutch now has Solr 
integration embedded, so it is a lot easier to get started. Just download a 
nightly version from [[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/]].
+ Please note that the nightly version of Apache Nutch now has Solr 
integration embedded, so it is a lot easier to get started. Just download a 
nightly version from http://hudson.zones.apache.org/hudson/job/Nutch-trunk/.
  
  = Pre Solr Nutch integration =
- This is just a quick first pass at a guide for getting Nutch running with 
Solr.  I'm sure there are better ways of doing some/all of it, but I'm not 
aware of them.  By all means, please do correct/update this if someone has a 
better idea.  Many thanks to [[http://variogram.com||Brian Whitman at 
Variogr.am]] and [[http://blog.foofactory.fi||Sami Siren at FooFactory]] for 
all the help!  You guys saved me a lot of time! :)
+ This is just a quick first pass at a guide for getting Nutch running with 
Solr.  I'm sure there are better ways of doing some/all of it, but I'm not 
aware of them.  By all means, please do correct/update this if someone has a 
better idea.  Many thanks to http://variogram.com and http://blog.foofactory.fi 
for all the help!  You guys saved me a lot of time! :)
  
  I'm posting it under Nutch rather than Solr on the presumption that people 
are more likely to be learning/using Solr first, then come here looking to 
combine it with Nutch.  I'm going to skip over doing command by command for 
right now.  I'm running/building on Ubuntu 7.10 using Java 1.6.0_05.  I'm 
assuming that the Solr trunk code is checked out into solr-trunk and Nutch 
trunk code is checked out into nutch-trunk.
  
@@ -12, +12 @@

   * apt-get install sun-java6-jdk subversion ant patch unzip
  
  == Steps ==
- 
  The first step to get started is to download the required software 
components, namely Apache Solr and Nutch.
  
  '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from the download page
@@ -23, +22 @@

  
  '''4.''' Extract the Nutch package       tar xzf apache-nutch-1.0.tar.gz
  
+ '''5.''' Configure Solr For the sake of simplicity we are going to use the 
example configuration of Solr as a base.
- '''5.''' Configure Solr
- For the sake of simplicity we are going to use the example
- configuration of Solr as a base.
  
- '''a.''' Copy the provided Nutch schema from directory
- apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf 
(override the existing file)
+ '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf 
to directory apache-solr-1.3.0/example/solr/conf (override the existing file)
  
  We want to allow Solr to create the snippets for search results so we need to 
store the content in addition to indexing it:
  
@@ -52, +48 @@

  
  <str name="qf">
  
- content^0.5 anchor^1.0 title^1.2
+ content^0.5 anchor^1.0 title^1.2 </str>
- </str>
  
- <str name="pf">
- content^0.5 anchor^1.5 title^1.2 site^1.5
+ <str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
- </str>
  
+ <str name="fl"> url </str>
- <str name="fl">
- url
- </str>
  
+ <str name="mm"> 2<-1 5<-2 6<90% </str>
- <str name="mm">
- 2&lt;-1 5&lt;-2 6&lt;90%
- </str>
  
  <int name="ps">100</int>
  
- <bool hl="true"/>
+ <bool name="hl">true</bool>
  
  <str name="q.alt">*:*</str>
  
@@ -91, +80 @@

  
  '''6.''' Start Solr
  
+ cd apache-solr-1.3.0/example java -jar start.jar
- cd apache-solr-1.3.0/example
- java -jar start.jar
  
  '''7. Configure Nutch'''
  
  a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its 
contents with the following (we specify our crawler name, the active plugins, 
and limit the maximum URL count for a single host per run to 100):
  
+ <?xml version="1.0"?> <configuration>
- <?xml version="1.0"?>
- <configuration>
  
  <property>
  
@@ -109, +96 @@

  
  </property>
  
- <property>
- <name>generate.max.per.host</name>
+ <property> <name>generate.max.per.host</name>
  
  <value>100</value>
  
@@ -126, +112 @@

  
  </configuration>
  
- 
  '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and 
replace its content with the following:
  
  -^(https|telnet|file|ftp|mailto):
+ 
-  
- # skip some suffixes
- 
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+ # skip some suffixes 
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-  
+ 
- # skip URLs containing certain characters as probable queries, etc.
+ # skip URLs containing certain characters as probable queries, etc. -[...@=]
+ 
+ # allow urls in foofactory.fi domain 
+^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
+ 
- -[...@=]
-  
- # allow urls in foofactory.fi domain
- +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
-  
- # deny anything else
+ # deny anything else -.
- -.
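
The regex-urlfilter.txt rules in step 7b are evaluated top to bottom: the first pattern that matches a URL decides its fate (`+` accepts, `-` rejects), and a URL that matches nothing is rejected. The sketch below emulates that behaviour in plain shell so a rule set can be tried out before a crawl. It is an illustration only, not how Nutch itself applies the file: `url_allowed` is a hypothetical helper, the suffix list is shortened, the domain pattern is rewritten for `grep -E` (Nutch uses Java regular expressions), and the character class `[?*!@=]` is assumed from the stock Nutch filter file.

```shell
# Hypothetical helper: emulate first-match-wins evaluation of URL filter rules.
# Returns 0 if the URL would be accepted, 1 if it would be rejected.
url_allowed() {
  url=$1
  while IFS= read -r rule; do
    # Skip blank lines and comments, as the filter file allows both.
    case $rule in
      ''|'#'*) continue ;;
    esac
    pattern=${rule#?}          # everything after the leading + or -
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      case $rule in
        '+'*) return 0 ;;      # first match is an accept rule
        *)    return 1 ;;      # first match is a reject rule
      esac
    fi
  done <<'EOF'
-^(https|telnet|file|ftp|mailto):
# skip some suffixes (shortened list for this sketch)
-\.(swf|pdf|gif|jpg|png|css|js|zip)$
# skip URLs containing certain characters (assumed stock Nutch class)
-[?*!@=]
# allow urls in the lucidimagination.com domain
+^http://([a-zA-Z0-9-]*\.)*lucidimagination.com/
# deny anything else
-.
EOF
  return 1                     # no rule matched: reject
}

for u in "http://www.lucidimagination.com/blog/" "http://example.com/"; do
  if url_allowed "$u"; then echo "accept $u"; else echo "reject $u"; fi
done
```

Because evaluation stops at the first match, the order of the rules matters: the allow rule for the domain has to come after the suffix and query-character rules, otherwise images and query URLs on that domain would be accepted too.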
  
  '''8.''' Create a seed list (the initial urls to fetch)
  
- mkdir urls
- echo "http://www.lucidimagination.com/" > urls/seed.txt
+ mkdir urls echo "http://www.lucidimagination.com/" > urls/seed.txt
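
Step 8 amounts to two separate commands. A minimal sketch, assuming it is run from the Nutch directory and that one seed URL per line is wanted; `mkdir -p` and `printf` are substitutions made here for robustness, not part of the original instructions:

```shell
# Create the seed directory (no error if it already exists).
mkdir -p urls

# Write the seed URL, one per line. The redirection must directly follow the
# command: a ';' before '>' would end the command and leave the file empty.
printf '%s\n' "http://www.lucidimagination.com/" > urls/seed.txt

# Show how many seed URLs were written.
wc -l < urls/seed.txt
```

The seed has to pass the URL filter configured in step 7b, so it uses the http scheme and the allowed domain.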
  
  '''9.''' Inject seed url(s) to nutch crawldb (execute in nutch directory)
  
@@ -190, +170 @@

  
  
http://127.0.0.1:8983/solr/nutch/?q=solr&version=2.2&start=0&rows=10&indent=on&wt=json
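
The query URL above is a base address followed by parameters separated by plain `&` characters. A small sketch of building it for an arbitrary search term; `solr_query_url` is a hypothetical helper, the host, port, and parameter values are copied from the example URL, and the term is assumed to need no URL-encoding:

```shell
# Hypothetical helper: print the example query URL for a given search term.
solr_query_url() {
  # $1: search term (assumed to contain no characters that need URL-encoding)
  printf 'http://127.0.0.1:8983/solr/nutch/?q=%s&version=2.2&start=0&rows=10&indent=on&wt=json\n' "$1"
}

solr_query_url solr

# With Solr running, the result could then be fetched with, for example:
#   curl -s "$(solr_query_url solr)"
```

Quoting the URL on the command line matters, since an unquoted `&` would be interpreted by the shell as a background operator.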
  
- 
- 
- 
  Hi, I too faced problems integrating Solr and Nutch. After some work, I 
found the article below and integrated them successfully. 
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
