Hi,

If you are going to crawl the whole web, then there is a Java library called
Boilerpipe (https://code.google.com/p/boilerpipe/) that might help you.

The Boilerpipe library provides algorithms to detect and remove the surplus
"clutter" (boilerplate, templates) around the main textual content of a web
page.
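With Boilerpipe itself, extraction is essentially a one-liner
(ArticleExtractor.INSTANCE.getText(html)). As a self-contained illustration of
the idea behind it, that boilerplate blocks tend to be short and link-heavy
while content blocks are long, here is a minimal word-count sketch in plain
Java; the class name and threshold are mine, not Boilerpipe's:

```java
import java.util.ArrayList;
import java.util.List;

public class DensityFilter {

    // Split the page on block-level tags, strip remaining inline markup, and
    // keep only blocks whose word count suggests real content: navigation,
    // headers, and footers tend to be short and link-heavy.
    static List<String> contentBlocks(String html, int minWords) {
        List<String> kept = new ArrayList<>();
        String blockTags = "(?i)</?(?:p|div|li|td|h[1-6]|nav|header|footer)[^>]*>";
        for (String raw : html.split(blockTags)) {
            String text = raw.replaceAll("<[^>]+>", " ")   // drop inline tags
                             .replaceAll("\\s+", " ")
                             .trim();
            if (!text.isEmpty() && text.split(" ").length >= minWords) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String page = "<div>Home | About | Contact</div>"
                + "<p>This long paragraph is the actual article body and it"
                + " easily clears the minimum word threshold we chose.</p>"
                + "<footer>Copyright 2015</footer>";
        System.out.println(contentBlocks(page, 8));
    }
}
```

Boilerpipe's real extractors (ArticleExtractor, DefaultExtractor) classify
blocks with trained shallow-text features rather than a fixed threshold, so
expect much better results from the library itself.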

Imtiaz Shakil Siddique
On Oct 3, 2015 2:05 AM, "Camilo Tejeiro" <[email protected]> wrote:

> @marora: I am glad it helps!
> @john: I don't think you have to patch or modify the parse-html plugin; you
> can build a parse-filter that is executed afterwards. This is how I am
> doing it currently, because I read somewhere (I don't remember where) that
> it is good practice to extend the parse-html plugin rather than modifying
> it directly, since, as you mention, your changes might have to be reapplied
> to new Nutch releases. But if you are concerned about having to execute an
> extra process after parsing, you could also make your own very similar
> parse plugin (integrating the filter functionality at parse time) and
> replace the parse-html plugin in the nutch-site includes with your own.
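The heart of a parse filter like the one described above is just pruning
unwanted nodes from the parsed DOM. Here is a self-contained sketch using only
the JDK's org.w3c.dom; the class name and the id/class markers are
illustrative, not part of the Nutch API:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class NodeStripper {

    // Drop elements whose id or class marks them as page chrome
    // (header/footer/nav/menu), then return the remaining text. A Nutch
    // HtmlParseFilter would apply the same pruning to the DocumentFragment
    // it receives at parse time, before the text reaches the index.
    static String stripChrome(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xhtml.getBytes(StandardCharsets.UTF_8)));
            removeMarked(doc.getDocumentElement());
            return doc.getDocumentElement().getTextContent()
                    .replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static void removeMarked(Element el) {
        NodeList children = el.getChildNodes();
        // Iterate backwards so removals do not shift indices we still visit.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child instanceof Element) {
                Element e = (Element) child;
                String marker = (e.getAttribute("id") + " "
                        + e.getAttribute("class")).toLowerCase();
                if (marker.matches(".*\\b(header|footer|nav|menu)\\b.*")) {
                    el.removeChild(e);
                } else {
                    removeMarked(e);
                }
            }
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id='header'>Site menu</div>"
                + "<div class='content'>Real article text.</div>"
                + "<div class='footer'>Copyright</div></body></html>";
        System.out.println(stripChrome(page));  // Real article text.
    }
}
```

Note this sketch assumes well-formed XHTML; inside Nutch the parse-html plugin
has already built the DOM for you, so the filter only needs the pruning step.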
>
>
>
> On Thu, Oct 1, 2015 at 9:17 AM, John Lafitte <[email protected]>
> wrote:
>
> > I have been using something similar to this for a while because we came
> > from Google Search Appliance and had googleon and googleoff all over the
> > place.  I don't really like having to patch the parse-html plugin every
> > time I do an upgrade; I wish I could move that into its own plugin
> > somehow.
> >
> > Speaking of googleon/googleoff, is there any standard for denoting
> > indexable elements?  That one seems specific to GSA; it would be nice if
> > there were something other search engines might also take into
> > consideration.
> >
> > On Thu, Oct 1, 2015 at 7:20 AM, <[email protected]> wrote:
> >
> > > Camilo, thank you so much for sharing your changes. I am checking it
> > > out.
> > >
> > >
> > > On 9/30/15 3:37 PM, "Camilo Tejeiro" <[email protected]> wrote:
> > >
> > > >I believe you can do it with Tika,
> > > >
> > > >I did it a different way...
> > > >I recently had to do something similar, and I wrote a little
> > > >parse-filter plugin to accomplish this.
> > > >
> > > >For reference, look into Jira issue NUTCH-585; it will give you some
> > > >ideas: https://issues.apache.org/jira/browse/NUTCH-585
> > > >
> > > >If it helps, here is my open Nutch install with the integrated plugin
> > > >(look for the parse-html-filter-select-nodes plugin). I haven't created
> > > >a patch, but you are free to use it if it helps you:
> > > >https://github.com/osohm/apache-nutch-1.10
> > > >
> > > >cheers,
> > > >
> > > >On Wed, Sep 30, 2015 at 11:57 AM, <[email protected]>
> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> We need to remove the header, footer, and menu from the crawled
> > > >> content before we index it into Solr. I researched online and found
> > > >> references to removal via Tika's Boilerpipe support (NUTCH-961).
> > > >>
> > > >> We are currently using Nutch 1.7, but I am looking into updating to
> > > >> Nutch 1.10. I am hoping that the newer version of Tika in Nutch 1.10
> > > >> will do a better job of removing the extra content.
> > > >>
> > > >> I would be very thankful if you could let me know the best method
> > > >> and steps to achieve this goal, and how effective the removal is.
> > > >>
> > > >> Thanks so much,
> > > >> Madhvi
> > > >>
> > > >>
> > > >
> > > >
> > > >--
> > > >Camilo Tejeiro
> > > >*Be **honest, be grateful, be humble.*
> > > >https://www.linkedin.com/in/camilotejeiro
> > > >http://camilotejeiro.wordpress.com
> > >
> > >
> >
>
>
>
> --
> Camilo Tejeiro
> *Be **honest, be grateful, be humble.*
> https://www.linkedin.com/in/camilotejeiro
> http://camilotejeiro.wordpress.com
>
