I believe you can do it with Tika, but I did it a different way. I recently had to do something similar, and I wrote a small parse-filter plugin to accomplish it.
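To give a rough idea of the approach, here is a minimal sketch of the node-removal step, using only the JDK's DOM classes. This is not the plugin's actual code: the class name, the `UNWANTED` tag list, and the sample page are assumptions made up for illustration, and the sketch assumes well-formed XHTML input. In Nutch the same kind of walk would presumably run inside an HtmlParseFilter extension, which is handed the parsed DOM of each page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class StripNodesSketch {
    // Hypothetical list of element names to drop; adjust to your site's markup.
    static final Set<String> UNWANTED = Set.of("header", "footer", "nav");

    // Recursively remove unwanted elements below the given node.
    static void strip(Node node) {
        // Snapshot the children first: NodeList is live, so removing
        // while iterating over it directly would skip siblings.
        NodeList children = node.getChildNodes();
        List<Node> snapshot = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            snapshot.add(children.item(i));
        }
        for (Node child : snapshot) {
            boolean unwanted = child.getNodeType() == Node.ELEMENT_NODE
                    && UNWANTED.contains(child.getNodeName().toLowerCase(Locale.ROOT));
            if (unwanted) {
                node.removeChild(child);
            } else {
                strip(child);
            }
        }
    }

    // Parse well-formed XHTML, strip unwanted nodes, return the remaining text.
    static String cleanText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        strip(doc.getDocumentElement());
        return doc.getDocumentElement().getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><header>Site menu</header>"
                + "<div>Article text</div><footer>Copyright</footer></body></html>";
        System.out.println(cleanText(page)); // prints "Article text"
    }
}
```

Matching on tag names alone only works when the site uses semantic elements; for pages built from generic `div` blocks you would need to match on `id` or `class` attributes instead.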
For reference, look into Jira issue 585; it will give you some ideas:
https://issues.apache.org/jira/browse/NUTCH-585

If it helps, here is my open Nutch install with the integrated plugin (look for the parse-html-filter-select-nodes plugin). I haven't created a patch, but you are free to use it if it helps you:
https://github.com/osohm/apache-nutch-1.10

Cheers,

On Wed, Sep 30, 2015 at 11:57 AM, <[email protected]> wrote:
> Hi All,
>
> We need to remove header, footer and menu from the crawled content before
> we index content into SOLR. I researched online and found references to
> removal via Tika's boilerpipe support - Nutch-961
>
> We are currently using Nutch 1.7 but I am looking into updating to Nutch
> 1.10. I am hoping that the newer version of Tika in Nutch 1.10 will do a
> better job in removing extra content.
>
> I will be very thankful if you can let me know the best method and steps
> to achieve this goal and how effective this is in removal.
>
> Thanks so much,
> Madhvi

--
Camilo Tejeiro
*Be honest, be grateful, be humble.*
https://www.linkedin.com/in/camilotejeiro
http://camilotejeiro.wordpress.com

