RE: Tell Nutch to only crawl parts of document

Mark Vega Thu, 02 Feb 2017 15:41:07 -0800

Christian,
I am using a Nutch plugin called Extractor from BayanGroup 
(https://github.com/BayanGroup/nutch-custom-search)  that allows you to select 
content elements on the page based on xpath expressions or css selectors.  I've 
mapped all the repeating content elements (navs, headers, footers, search bars, 
etc) on my sites to specific custom SOLR fields and am able to index the 
non-repeating content into the default 'content' field in SOLR.  Only the 
'content' field is used when conducting a search, thereby side-stepping the 
issue you've encountered of every page showing up in results for certain 
searches that match on repeated content.  I think the plugin may have changed 
somewhat since I included it in my Nutch 1.10 installation, but it was easy to 
set up and has worked well for several years now.  I still index the repeating 
elements, but now that information is in custom SOLR fields that are not 
searched (I indexed them anyway just in case I have some reason to search those 
fields in the future).

One caveat: when I first set this up, I was indexing 7 sites that used 
basically the same theme but had no consistent template across sites, i.e., 
the main 'content' section and the repeating content sections had different 
css selectors on each site.  The only way to, say, grab the left navs of every 
site and separate that content from the main searchable content was to create 
a very detailed Extractor config file that mapped each individual site's 
elements into a shared set of custom SOLR fields.  Again, only the main 
'content' section from each site is indexed into 
the default SOLR content field and repeating content is indexed into custom 
global nav, left nav, global search, header, and footer fields in SOLR.  As we 
undertook redesigns of our public sites last year, I took special pains to make 
sure that each site used the same css selectors for the repeating content 
elements and the main content section of all pages.  Now my Extractor config 
file is much smaller and still works great!
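
To illustrate the mapping idea described above, here is a rough Python sketch. 
It is not the Extractor plugin's config format or API; it only shows how 
per-site element ids can be routed into a shared set of fields so that 
repeating content stays out of the searchable 'content' field. The site names, 
element ids, and field names are all made up for the example, and it assumes 
reasonably well-formed HTML (every non-void tag is closed).

```python
from html.parser import HTMLParser

# Hypothetical per-site mapping: element id -> shared field name.
# Each site uses different ids, but they all feed the same fields.
SITE_FIELD_MAPS = {
    "site_a": {"main": "content", "topnav": "global_nav", "ft": "footer"},
    "site_b": {"page-body": "content", "nav-global": "global_nav",
               "footer": "footer"},
}

# Void elements never get an end tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FieldRouter(HTMLParser):
    """Collects text into fields based on the nearest mapped ancestor id."""

    def __init__(self, id_to_field):
        super().__init__()
        self.id_to_field = id_to_field
        self.fields = {name: [] for name in set(id_to_field.values())}
        # One entry per open element: the field it belongs to, or None.
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        inherited = self.stack[-1] if self.stack else None
        element_id = dict(attrs).get("id")
        # A mapped id starts a new field; otherwise inherit from the parent.
        self.stack.append(self.id_to_field.get(element_id, inherited))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Text outside any mapped element is dropped.
        if text and self.stack and self.stack[-1]:
            self.fields[self.stack[-1]].append(text)


def split_fields(site, html):
    """Return {field name: text} for one page of the given site."""
    router = FieldRouter(SITE_FIELD_MAPS[site])
    router.feed(html)
    router.close()
    return {name: " ".join(parts) for name, parts in router.fields.items()}
```

With a setup like this, a search that only looks at 'content' never matches 
on nav or footer text, and once all sites share the same ids the per-site map 
collapses to a single entry, which is the same effect as the smaller config 
file after the redesign.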


--
Mark F. Vega
Programmer/Analyst
UC Irvine Libraries - Web Services
[email protected]
949.824.9872
--


-----Original Message-----
From: Christian Kunz [mailto:[email protected]] 
Sent: Thursday, February 02, 2017 6:23 AM
To: [email protected]
Subject: Tell Nutch to only crawl parts of document

Hi everybody,

we've got a problem using Nutch: On the website that has to be crawled, there 
is a navigation on top of each page. Nutch crawls the navigation of each page 
which leads to the situation that for certain queries (that are included in the 
navigation) every page is delivered as a result.

Is there a way to tell Nutch to only crawl parts of a page like only the main 
content?

Thanks in advance and regards,
Christian
