Hello Nutch users,

I think this is a common problem. I found this thread:
http://lucene.472066.n3.nabble.com/ignore-content-between-tags-crawl-only-between-tags-td609676.html

and this blog entry:
http://www.adick.at/2008/09/16/nutch-prevent-sections-of-a-website-from-being-indexed/
 

I modified the DOMContentUtils class as suggested. When I test the plugin
using "bin/nutch plugin parse-htmlnoindex
com.example.nutch.parse.html.HtmlParser ~/test.html", everything works fine:
the resulting text contains only the parts that are not inside "nutch_noindex".
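For reference, my change follows the linked post: the text extraction simply
skips any element whose class attribute contains "nutch_noindex". A minimal
standalone sketch of that idea (the class name, sample HTML, and use of the
JDK DOM parser are illustrative only, not the actual Nutch code):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class NoIndexDemo {

    // Collect text recursively, but do not descend into elements
    // whose class attribute contains "nutch_noindex".
    static void getText(StringBuilder sb, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String cls = ((Element) node).getAttribute("class");
            if (cls.contains("nutch_noindex")) {
                return; // excluded section: skip the whole subtree
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue().trim()).append(' ');
        }
        for (Node child = node.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            getText(sb, child);
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>visible text</p>"
                + "<div class=\"nutch_noindex\"><p>hidden navigation</p></div>"
                + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
        StringBuilder sb = new StringBuilder();
        getText(sb, doc.getDocumentElement());
        System.out.println(sb.toString().trim());
    }
}
```

Running this prints only "visible text", which matches what I see when I
test my modified parser on a single file.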

Nevertheless, when indexing a whole site, the sections inside "nutch_noindex"
still appear in the Solr index. Assuming the modified "parse-html" plugin is
named "parse-htmlnoindex", my "nutch-site.xml" looks like this:

  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(htmlnoindex|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
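One thing I was unsure about: since both parse-htmlnoindex and parse-tika are
enabled, and parse-tika can also handle text/html, I assume conf/parse-plugins.xml
decides which plugin actually gets the HTML pages. I tried mapping the mime type
explicitly like this (assuming my plugin keeps the id "parse-htmlnoindex" and the
extension class from my test command) — I am not sure this is required, so
corrections are welcome:

```xml
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-htmlnoindex" />
  </mimeType>
  <aliases>
    <alias name="parse-htmlnoindex"
           extension-id="com.example.nutch.parse.html.HtmlParser" />
  </aliases>
</parse-plugins>
```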

I started the test crawling/indexing process with: "bin/nutch crawl urls
-solr http://localhost:8983/solr/de -depth 2 -topN 5"

My questions are:

1. Is that the right way to index/crawl a site, or do I have to include the
used plugins explicitly in the command?

2. Isn't this a common use case? Is such functionality already available in
Nutch, perhaps in Nutch 2.0?

I am using Nutch 1.5.1 with Solr 3.4.0. Solr uses the example schema.xml
shipped with Nutch. I am new to the whole Nutch topic, and I hope someone
can help me with this.


Kind regards,
Sebastian



Sent from the Nutch - User mailing list archive at Nabble.com.
