I believe you can do it with Tika.

I did it a different way, though: I recently had to do something similar
and wrote a small parse-filter plugin to accomplish it.

For reference, take a look at Jira issue NUTCH-585; it should give you some ideas.
https://issues.apache.org/jira/browse/NUTCH-585

If it helps, here is my public Nutch install with the plugin integrated (look
for the parse-html-filter-select-nodes plugin). I haven't created a patch,
but you are free to use it.
https://github.com/osohm/apache-nutch-1.10
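
As a rough illustration of the general idea only (this is not the code from
the plugin above, and it does not use the Nutch plugin API), a node-removal
filter walks the parsed DOM and drops blacklisted elements before the text
is extracted. The sketch below uses only the JDK DOM classes; the class name
NodeStripper and the tag/id blacklists are hypothetical, and it assumes the
page has already been parsed into well-formed markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeStripper {
    // Hypothetical blacklists; a real setup would read these from configuration.
    static final Set<String> BLACKLIST_TAGS = Set.of("header", "footer", "nav");
    static final Set<String> BLACKLIST_IDS = Set.of("menu", "sidebar");

    // True if the node is an element whose tag name or id is blacklisted.
    static boolean isBlacklisted(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return false;
        }
        Element el = (Element) node;
        return BLACKLIST_TAGS.contains(el.getTagName().toLowerCase(Locale.ROOT))
            || BLACKLIST_IDS.contains(el.getAttribute("id"));
    }

    // Recursively removes blacklisted children of the given node.
    static void strip(Node parent) {
        NodeList children = parent.getChildNodes();
        // Copy first: the NodeList is live and shifts as nodes are removed.
        List<Node> snapshot = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            snapshot.add(children.item(i));
        }
        for (Node child : snapshot) {
            if (isBlacklisted(child)) {
                parent.removeChild(child);
            } else {
                strip(child);
            }
        }
    }

    // Parses well-formed markup, strips blacklisted nodes, returns the remaining text.
    static String cleanText(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            strip(doc.getDocumentElement());
            return doc.getDocumentElement().getTextContent()
                .trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }
}
```

Calling NodeStripper.cleanText(...) on a page with a header, a div with
id "menu", a paragraph and a footer would leave only the paragraph text.
Inside Nutch the same walk would run on the parsed DocumentFragment in a
parse filter rather than on a string.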

cheers,

On Wed, Sep 30, 2015 at 11:57 AM, <[email protected]> wrote:

> Hi All,
>
> We need to remove header, footer and menu from the crawled content before
> we index content into SOLR. I researched online and found references to
> removal via Tika's boilerpipe support - Nutch-961
>
> We are currently using Nutch 1.7 but I am looking into updating to Nutch
> 1.10. I am hoping that the newer version of Tika in Nutch 1.10 will do a
> better job in removing extra content.
>
> I will be very thankful if you can let me know the best method and steps
> to achieve this goal and how effective this is in removal.
>
> Thanks so much,
> Madhvi
>
>


-- 
Camilo Tejeiro
*Be honest, be grateful, be humble.*
https://www.linkedin.com/in/camilotejeiro
http://camilotejeiro.wordpress.com
