Hi,

If you are going to crawl the whole web, there is a Java library called Boilerpipe (https://code.google.com/p/boilerpipe/) that might help you. The Boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

Imtiaz Shakil Siddique

On Oct 3, 2015 2:05 AM, "Camilo Tejeiro" <[email protected]> wrote:

> @marora: I am glad it helps!
> @john: I think you don't have to patch or modify the parse-html plugin; you
> can build a parse filter that is executed afterwards. This is the way I am
> doing it currently, because I read somewhere (I don't remember where) that it
> is good practice to extend the parse-html plugin as opposed to modifying it
> directly, since, as you mention, your changes might have to be reapplied to
> new Nutch releases. But if you are concerned about having to execute an
> extra process after parsing, you could also make your own very similar parse
> plugin (and integrate the filter functionality at parse time) and replace
> the parse-html plugin in the nutch-site includes with your own.
>
> On Thu, Oct 1, 2015 at 9:17 AM, John Lafitte <[email protected]> wrote:
>
> > I have been using something similar to this for a while because we came
> > from Google Search Appliance and had googleon and googleoff all over the
> > place. I don't really like having to patch the parse-html plugin every time
> > I do an upgrade; I wish I could move that into its own plugin somehow.
> >
> > Speaking of googleon/googleoff, is there any standard for denoting
> > indexable elements? That one seems specific to GSA; it would be nice if
> > there were something other search engines might also take into
> > consideration.
> >
> > On Thu, Oct 1, 2015 at 7:20 AM, <[email protected]> wrote:
> >
> > > Camilo, thank you so much for sharing your changes. I am checking it out.
> > >
> > > On 9/30/15 3:37 PM, "Camilo Tejeiro" <[email protected]> wrote:
> > >
> > > > I believe you can do it with Tika.
> > > >
> > > > I did it a different way...
> > > > I recently had to do something similar, and I wrote a little parse-filter
> > > > plugin to accomplish this.
> > > >
> > > > For reference, look into Jira issue NUTCH-585; it will give you some ideas:
> > > > https://issues.apache.org/jira/browse/NUTCH-585
> > > >
> > > > If it helps, here is my open Nutch install with the integrated plugin (look
> > > > for the parse-html-filter-select-nodes plugin). I haven't created a patch,
> > > > but you are free to use it if it helps you:
> > > > https://github.com/osohm/apache-nutch-1.10
> > > >
> > > > Cheers,
> > > >
> > > > On Wed, Sep 30, 2015 at 11:57 AM, <[email protected]> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > We need to remove the header, footer, and menu from the crawled content
> > > > > before we index the content into Solr. I researched online and found
> > > > > references to removal via Tika's Boilerpipe support (NUTCH-961).
> > > > >
> > > > > We are currently using Nutch 1.7, but I am looking into updating to
> > > > > Nutch 1.10. I am hoping that the newer version of Tika in Nutch 1.10
> > > > > will do a better job of removing extra content.
> > > > >
> > > > > I will be very thankful if you can let me know the best method and
> > > > > steps to achieve this goal, and how effective this removal is.
> > > > >
> > > > > Thanks so much,
> > > > > Madhvi
> > > >
> > > > --
> > > > Camilo Tejeiro
> > > > *Be honest, be grateful, be humble.*
> > > > https://www.linkedin.com/in/camilotejeiro
> > > > http://camilotejeiro.wordpress.com
>
> --
> Camilo Tejeiro
> *Be honest, be grateful, be humble.*
> https://www.linkedin.com/in/camilotejeiro
> http://camilotejeiro.wordpress.com
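[Editor's note] The googleon/googleoff markers John mentions are GSA-specific HTML comments (`<!--googleoff: index-->` ... `<!--googleon: index-->`), and as he says there is no cross-engine standard for them. A minimal standalone sketch of stripping the excluded regions with a regex pass over the raw HTML — the class and method names here are illustrative, not part of Nutch or any existing plugin:

```java
import java.util.regex.Pattern;

// Illustrative sketch: drop regions wrapped in GSA-style
// googleoff/googleon comments before indexing. A real Nutch
// integration would do this inside a parse-filter plugin.
public class GoogleOffStripper {

    // Matches everything from a googleoff comment up to the next
    // googleon comment; DOTALL lets the match span newlines, and
    // the reluctant .*? keeps each excluded region separate.
    private static final Pattern EXCLUDED = Pattern.compile(
            "<!--\\s*googleoff:\\s*index\\s*-->.*?<!--\\s*googleon:\\s*index\\s*-->",
            Pattern.DOTALL);

    public static String strip(String html) {
        return EXCLUDED.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<body><!--googleoff: index--><div>menu</div>"
                + "<!--googleon: index--><p>main content</p></body>";
        System.out.println(strip(html));  // prints: <body><p>main content</p></body>
    }
}
```

A comment-based approach like this works on the raw HTML before DOM parsing; if the filter runs after parse-html, the comments may already be gone, which is one reason to hook in at parse time as discussed above.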
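[Editor's note] Camilo's parse-html-filter-select-nodes plugin hooks into Nutch's HtmlParseFilter extension point, which hands the filter the DOM that parse-html already built. The core idea — prune header/footer/menu subtrees, then re-extract the text — can be sketched standalone with the JDK's XML parser. This is an illustration on well-formed XHTML with hypothetical names, not the plugin's actual code:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Illustrative sketch of the node-selection idea behind such a
// parse filter: remove unwanted subtrees, keep the rest as text.
// (A real HtmlParseFilter would receive the already-parsed
// DocumentFragment instead of parsing here.)
public class NodeSelectSketch {

    private static final Set<String> DROP = Set.of("header", "footer", "nav");

    public static String mainText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xhtml.getBytes(StandardCharsets.UTF_8)));
        prune(doc.getDocumentElement());
        return doc.getDocumentElement().getTextContent().trim();
    }

    // Depth-first removal of unwanted elements; the next sibling is
    // saved before removal so iteration survives the mutation.
    private static void prune(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child instanceof Element
                    && DROP.contains(child.getNodeName().toLowerCase())) {
                node.removeChild(child);
            } else {
                prune(child);
            }
            child = next;
        }
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><header>site menu</header>"
                + "<p>article body</p><footer>links</footer></body></html>";
        System.out.println(mainText(page));  // prints: article body
    }
}
```

Selecting nodes by tag or class is deterministic per site template, which is why a small filter like this can outperform generic boilerplate detection (Boilerpipe/Tika) on a known set of sites, while the generic approach scales better to arbitrary pages.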

