Re: Remove Header Footer and Menus from crawled content

marora Thu, 01 Oct 2015 05:21:33 -0700

Camilo, thank you so much for sharing your changes. I am checking it out.


On 9/30/15 3:37 PM, "Camilo Tejeiro" <[email protected]> wrote:

>I believe you can do it with Tika,
>
>I did it a different way...
>I recently had to do something similar and I wrote a little parse-filter
>plugin to accomplish this.
>
>For reference look into the Jira Issue 585, it will give you some ideas.
>https://issues.apache.org/jira/browse/NUTCH-585
>
>If it helps here is my open nutch install with the integrated plugin (look
>for the parse-html-filter-select-nodes plugin). I haven't created a patch
>but you are free to use it if it helps you...
>https://github.com/osohm/apache-nutch-1.10
>
>cheers,
>
>On Wed, Sep 30, 2015 at 11:57 AM, <[email protected]> wrote:
>
>> Hi All,
>>
>> We need to remove header, footer and menu from the crawled content before
>> we index content into SOLR. I researched online and found references to
>> removal via Tika's boilerpipe support - NUTCH-961
>>
>> We are currently using Nutch 1.7 but I am looking into updating to Nutch
>> 1.10. I am hoping that the newer version of Tika in Nutch 1.10 will do a
>> better job in removing extra content.
>>
>> I will be very thankful if you can let me know the best method and steps
>> to achieve this goal and how effective this is in removal.
>>
>> Thanks so much,
>> Madhvi
>>
>>
>
>
>-- 
>Camilo Tejeiro
>*Be honest, be grateful, be humble.*
>https://www.linkedin.com/in/camilotejeiro
>http://camilotejeiro.wordpress.com
