Camilo, thank you so much for sharing your changes. I am checking them out.

On 9/30/15 3:37 PM, "Camilo Tejeiro" <[email protected]> wrote:

>I believe you can do it with Tika,
>
>I did it a different way...
>I recently had to do something similar and I wrote a little parse-filter
>plugin to accomplish this.
>
>For reference look into the Jira Issue 585, it will give you some ideas.
>https://issues.apache.org/jira/browse/NUTCH-585
>
>If it helps here is my open nutch install with the integrated plugin (look
>for the parse-html-filter-select-nodes plugin). I haven't created a patch
>but you are free to use it if it helps you...
>https://github.com/osohm/apache-nutch-1.10
>
>cheers,
>
>On Wed, Sep 30, 2015 at 11:57 AM, <[email protected]> wrote:
>
>> Hi All,
>>
>> We need to remove header, footer and menu from the crawled content before
>> we index content into SOLR. I researched online and found references to
>> removal via Tika's boilerpipe support - Nutch-961
>>
>> We are currently using Nutch 1.7 but I am looking into updating to Nutch
>> 1.10. I am hoping that the newer version of Tika in Nutch 1.10 will do a
>> better job in removing extra content.
>>
>> I will be very thankful if you can let me know the best method and steps
>> to achieve this goal and how effective this is in removal.
>>
>> Thanks so much,
>> Madhvi
>>
>
>--
>Camilo Tejeiro
>*Be honest, be grateful, be humble.*
>https://www.linkedin.com/in/camilotejeiro
>http://camilotejeiro.wordpress.com

