I have been using something similar to this for a while, because we came
from Google Search Appliance and had googleon and googleoff tags all over
the place.  I don't really like having to patch the parse-html plugin every
time I do an upgrade; I wish I could move that into its own plugin somehow.
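
For anyone who has not run into these tags, here is a minimal sketch of the
idea, not the actual patch to parse-html. It assumes the tags appear as HTML
comments such as <!--googleoff: index--> and <!--googleon: index-->, and it
works on the raw HTML string rather than on the parsed DOM. The class name
GoogleOffStripper and the method strip are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: drops everything between a
// googleoff comment and the next googleon comment.
final class GoogleOffStripper {

    // Assumption: only the "index" and "all" scopes should keep text out
    // of the index; "anchor" and "snippet" blocks are left alone.
    // DOTALL lets a block span several lines; ".*?" stops at the first
    // googleon that follows, so separate blocks are removed separately.
    private static final Pattern OFF_ON = Pattern.compile(
        "<!--\\s*googleoff:\\s*(?:index|all)\\s*-->"
            + ".*?"
            + "<!--\\s*googleon:\\s*(?:index|all)\\s*-->",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    private GoogleOffStripper() {
    }

    // Returns the HTML with every complete googleoff/googleon block removed.
    static String strip(String html) {
        return OFF_ON.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>keep</p>"
            + "<!--googleoff: index--><div>menu</div><!--googleon: index-->"
            + "<p>also keep</p>";
        // prints: <p>keep</p><p>also keep</p>
        System.out.println(strip(html));
    }
}
```

A googleoff with no matching googleon is left untouched by this sketch, so a
real filter would have to decide whether to drop the rest of the page in
that case.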

Speaking of googleon/googleoff, is there any standard for denoting
indexable elements?  That one seems specific to GSA; it would be nice if
there were something other search engines might also take into consideration.

On Thu, Oct 1, 2015 at 7:20 AM, <[email protected]> wrote:

> Camilo, thank you so much for sharing your changes. I am checking them out.
>
>
> On 9/30/15 3:37 PM, "Camilo Tejeiro" <[email protected]> wrote:
>
> >I believe you can do it with Tika,
> >
> >I did it a different way...
> >I recently had to do something similar and I wrote a little parse-filter
> >plugin to accomplish this.
> >
> >For reference, look at Jira issue NUTCH-585; it will give you some ideas.
> >https://issues.apache.org/jira/browse/NUTCH-585
> >
> >If it helps here is my open nutch install with the integrated plugin (look
> >for the parse-html-filter-select-nodes plugin). I haven't created a patch
> >but you are free to use it if it helps you...
> >https://github.com/osohm/apache-nutch-1.10
> >
> >cheers,
> >
> >On Wed, Sep 30, 2015 at 11:57 AM, <[email protected]> wrote:
> >
> >> Hi All,
> >>
> >> We need to remove the header, footer and menu from the crawled content
> >> before we index it into Solr. I researched online and found references
> >> to removal via Tika's boilerpipe support (NUTCH-961).
> >>
> >> We are currently using Nutch 1.7 but I am looking into updating to Nutch
> >> 1.10. I am hoping that the newer version of Tika in Nutch 1.10 will do a
> >> better job in removing extra content.
> >>
> >> I will be very thankful if you can let me know the best method and steps
> >> to achieve this goal and how effective this is in removal.
> >>
> >> Thanks so much,
> >> Madhvi
> >>
> >>
> >
> >
> >--
> >Camilo Tejeiro
> >*Be **honest, be grateful, be humble.*
> >https://www.linkedin.com/in/camilotejeiro
> >http://camilotejeiro.wordpress.com
>
>
