RE: Removing header,Footer and left menus while crawling

Mark Vega Tue, 14 Nov 2017 13:18:03 -0800

Michael,
I don't know if it's compatible with v1.13, but I've been using an extractor 
plug-in from Bayan Group (https://github.com/BayanGroup/nutch-custom-search) 
with v1.10 to strip content that repeats on every page (header, footer, 
toc/nav) and index only the main content section into the default search field. 
The plug-in is easy to configure and use, and it lets you specify multiple 
elements to remove from the indexable content by element type, id, name, or CSS 
class.  It also allows you to map multiple elements from different sites with 
different element naming/classing conventions into the same field, helpful if 
you've got multiple sites that each call or class their main content section 
something different. I've been using it without issue for about four years now.

--
Mark F. Vega
Programmer/Analyst
UC Irvine Libraries - Web Services
[email protected]
949.824.9872
--

-----Original Message-----
From: Michael Coffey [mailto:[email protected]] 
Sent: Tuesday, November 14, 2017 11:25 AM
To: [email protected]
Cc: [email protected]
Subject: Re: Removing header,Footer and left menus while crawling

That is a very interesting note. I have been wanting something like that. I use 
the python-based "newspaper" package but it is not directly compatible with the 
nutch/hadoop infrastructure.

From: Jorge Betancourt <[email protected]>
To: [email protected]
Cc: [email protected]
Sent: Tuesday, November 14, 2017 5:35 AM
Subject: Re: Removing header,Footer and left menus while crawling

Hello Rushikesh,

Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, you could 
use the Tika boilerpipe implementation. To enable it, add this to 
nutch-site.xml:

<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

Then choose the extraction algorithm with
the tika.extractor.boilerpipe.algorithm setting.
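
As a rough sketch, that second property might look like the following. The 
ArticleExtractor value is only an assumption about what suits a content-heavy 
site. As far as I recall, nutch-default.xml lists DefaultExtractor, 
ArticleExtractor and CanolaExtractor as the valid values, so please check the 
copy shipped with your version:

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Assumed valid values: DefaultExtractor,
  ArticleExtractor or CanolaExtractor (verify against nutch-default.xml).
  </description>
</property>

As far as I know, both settings only take effect for documents handled by the 
parse-tika plugin, so HTML pages would need to be routed to it rather than to 
parse-html.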

This is not a perfect solution, but I've used it successfully in the past. Of 
course, your results will depend on the structure (markup) of the website.

Another option is to implement your own parser if you need more control over 
what to include or exclude from the HTML. Take a look at 
https://issues.apache.org/jira/browse/NUTCH-585, which contains some info and 
old patches.

Best Regards,
Jorge

On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <[email protected]>
wrote:

> Hello Sebastian,
> We are excited to be using Nutch 1.3 (with Solr 6.4) for crawling our 
> website, and we are happy with the search results. However, we have a 
> requirement to skip the header, footer, left menus, and some other 
> parts of the page. Can you please guide us on how to exclude those 
> parts? I have tried various approaches found on Google, but nothing 
> has worked for me.
>
> Thanks in advance for your help!
> --
> Regards
> Rushikesh M
> .Net Developer
>
