The "RunningNutchAndSolr" page has been changed by Dmitrius.
The comment on this change is: Fixed command (single quotes missed).
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=28&rev2=29
--
= New in Nutch 1.0-dev =
- Please note that the nightly version of Apache Nutch now has Solr
integration built in, so getting started is much easier. Just download a
nightly version from [[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/]].
+ Please note that the nightly version of Apache Nutch now has Solr
integration built in, so getting started is much easier. Just download a
nightly version from http://hudson.zones.apache.org/hudson/job/Nutch-trunk/.
= Pre Solr Nutch integration =
- This is just a quick first pass at a guide for getting Nutch running with
Solr. I'm sure there are better ways of doing some/all of it, but I'm not
aware of them. By all means, please do correct/update this if someone has a
better idea. Many thanks to [[http://variogram.com||Brian Whitman at
Variogr.am]] and [[http://blog.foofactory.fi||Sami Siren at FooFactory]] for
all the help! You guys saved me a lot of time! :)
+ This is just a quick first pass at a guide for getting Nutch running with
Solr. I'm sure there are better ways of doing some/all of it, but I'm not
aware of them. By all means, please do correct/update this if someone has a
better idea. Many thanks to http://variogram.com and http://blog.foofactory.fi
for all the help! You guys saved me a lot of time! :)
I'm posting it under Nutch rather than Solr on the presumption that people
are more likely to be learning/using Solr first, then come here looking to
combine it with Nutch. I'm going to skip the command-by-command walkthrough
for now. I'm running/building on Ubuntu 7.10 using Java 1.6.0_05. I'm
assuming that the Solr trunk code is checked out into solr-trunk and Nutch
trunk code is checked out into nutch-trunk.
@@ -12, +12 @@
* apt-get install sun-java6-jdk subversion ant patch unzip
== Steps ==
-
The first step to get started is to download the required software
components, namely Apache Solr and Nutch.
'''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from the download page
@@ -23, +22 @@
'''4.''' Extract the Nutch package: tar xzf apache-nutch-1.0.tar.gz
+ '''5.''' Configure Solr. For the sake of simplicity, we are going to use the
example configuration of Solr as a base.
- '''5.''' Configure Solr
- For the sake of simplicity we are going to use the example
- configuration of Solr as a base.
- '''a.''' Copy the provided Nutch schema from directory
- apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf
(overwrite the existing file)
+ '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf
to directory apache-solr-1.3.0/example/solr/conf (overwrite the existing file)
We want to allow Solr to create the snippets for search results so we need to
store the content in addition to indexing it:
@@ -52, +48 @@
- content^0.5 anchor^1.0 title^1.2
+ content^0.5 anchor^1.0 title^1.2
-
-
- content^0.5 anchor^1.5 title^1.2 site^1.5
+ content^0.5 anchor^1.5 title^1.2 site^1.5
-
+ url
-
- url
-
+ 2<-1 5<-2 6<90%
-
- 2<-1 5<-2 6<90%
-
100
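The value `2<-1 5<-2 6<90%` in the hunk above has lost its parameter name in this diff; it looks like a Solr DisMax minimum-should-match ("mm") expression, and that reading is an assumption here. Under that assumption, the following is a minimal Python sketch of how such an expression could be evaluated. The function name `min_should_match` is hypothetical and this is not Solr's own code.

```python
import re


def min_should_match(clause_count: int, spec: str) -> int:
    """Return how many optional query clauses must match (sketch, assumed mm semantics)."""
    result = clause_count
    spec = spec.strip()
    if "<" in spec:
        # Conditional form: space-separated "N<value" pairs, N ascending.
        # A pair applies only when the query has MORE than N clauses.
        for part in re.sub(r"\s*<\s*", "<", spec).split():
            upper, value = part.split("<", 1)
            if clause_count <= int(upper):
                return result
            result = min_should_match(clause_count, value)
        return result
    if spec.endswith("%"):
        # Percentage of the clauses, truncated toward zero.
        calc = int(clause_count * int(spec[:-1]) / 100)
    else:
        calc = int(spec)
    # A negative value means "all clauses except this many".
    result = clause_count + calc if calc < 0 else calc
    return max(0, min(clause_count, result))


SPEC = "2<-1 5<-2 6<90%"
for n in (1, 2, 3, 5, 6, 7, 10):
    print(n, "clauses ->", min_should_match(n, SPEC), "required")
```

Read this way, queries with one or two terms require every term, three to five terms allow one miss, six terms allow two misses, and longer queries require 90% of the terms (rounded down).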
@@ -91, +80 @@
'''6.''' Start Solr
+ cd apache-solr-1.3.0/example
+ java -jar start.jar
- cd apache-solr-1.3.0/example
- java -jar start.jar
'''7. Configure Nutch'''
a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its
contents with the following (we specify our crawler name, the active plugins,
and limit the maximum URL count for a single host per run to 100):
+
-
-
@@ -109, +96 @@
-
- generate.max.per.host
+ generate.max.per.host
100
@@ -126, +112 @@
-
'''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and
replace its contents with the following:
-^(https|telnet|file|ftp|mailto):
+
-
- # skip some suffixes
-
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+ # skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-
+
- # skip URLs containing certain characters as probable queries, etc.
+ # skip URLs containing certain characters as probable queries, etc.
+ -[...@=]
+
+ # allow urls in lucidimagination.com domain
+^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
+
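As a rough illustration of how the added regex-urlfilter.txt rules behave: Nutch checks a URL against the rules in file order, the first pattern that matches decides (`+` accepts, `-` rejects), and a URL that matches nothing is dropped. The Python sketch below mimics that behavior. The helper names `parse_rules` and `accepts` are hypothetical, the suffix list is shortened for readability, and the `[...@=]` rule is omitted because its character class is elided in the source.

```python
import re

# Abbreviated copy of the rules from the diff (assumption: shortened suffix list,
# "[...@=]" rule left out because the source elides its contents).
RULES_TEXT = r"""
-^(https|telnet|file|ftp|mailto):

# skip some suffixes
-\.(swf|SWF|pdf|PDF|gif|GIF|jpg|JPG|png|PNG|css|zip|exe)$

# allow urls in lucidimagination.com domain
+^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
"""


def parse_rules(text):
    """Turn filter lines into (accept, compiled_regex) pairs, skipping comments and blanks."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False


RULES = parse_rules(RULES_TEXT)
for url in (
    "http://www.lucidimagination.com/blog/",
    "https://www.lucidimagination.com/",
    "http://www.lucidimagination.com/logo.png",
    "http://example.org/",
):
    print(url, "->", "fetch" if accepts(url, RULES) else "skip")
```

Because order matters, the reject rules for protocols and suffixes must come before the final `+` rule, otherwise every URL in the allowed domain would be accepted regardless of file type.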
- -[...@=]
-
- # allow urls in