The plugin works on headings only, but if you check the sources, you can 
quickly adapt it to select any other element or attribute. 
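
To illustrate the kind of adaptation meant here, below is a minimal sketch of selecting text by tag name plus an optional attribute value, such as h1 or <div class="test">. It is not Nutch code: the headings plugin itself is written in Java, and the names ElementTextCollector and extract are hypothetical, chosen only for this example. It uses Python's standard html.parser to show the matching logic you would port into the plugin.

```python
from html.parser import HTMLParser


class ElementTextCollector(HTMLParser):
    """Collect the text inside elements matching a tag and an optional attribute.

    Hypothetical helper for illustration only; not part of Nutch.
    """

    def __init__(self, tag, attr=None, value=None):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.depth = 0      # nesting depth inside a matching element
        self.buffer = []    # text pieces of the element being read
        self.results = []   # one string per matching element

    def _matches(self, tag, attrs):
        if tag != self.tag:
            return False
        if self.attr is None:
            return True
        return dict(attrs).get(self.attr) == self.value

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            # Track nested elements of the same name so we stop at the right end tag.
            if tag == self.tag:
                self.depth += 1
        elif self._matches(tag, attrs):
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract(html, tag, attr=None, value=None):
    """Return the text of every element matching tag and, if given, attr == value."""
    parser = ElementTextCollector(tag, attr, value)
    parser.feed(html)
    parser.close()
    return parser.results


page = '<h1>Title</h1><div class="test">Hello <b>world</b></div><div class="other">skip</div>'
headings = extract(page, "h1")
test_divs = extract(page, "div", "class", "test")
```

In the plugin the same idea would apply to the DOM that Nutch hands to the parse filter: walk the nodes, compare the element name and attribute, and write the collected text to a field of your choice.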
 
-----Original message-----
> From:Vishal Sharma <[email protected]>
> Sent: Thursday 27th November 2014 18:25
> To: user <[email protected]>
> Subject: Re: How to parse specific html tag in nutch+solr while crawling
> 
> Hi Markus,
> 
> Thank you so much for your reply.
> 
> Quick question: will this parse hN tags only, or can we configure it for
> other HTML tags as well, such as <div class="test">?
> 
> On Thu, Nov 27, 2014 at 10:33 PM, Markus Jelsma <[email protected]>
> wrote:
> 
> > You may want to check the headings plugin; it reads content from those
> > elements and writes it to a field. Very basic.
> >
> >
> >
> > -----Original message-----
> > > From:Vishal Sharma <[email protected]>
> > > Sent: Thursday 27th November 2014 17:59
> > > To: user <[email protected]>
> > > Subject: How to parse specific html tag in nutch+solr while crawling
> > >
> > > I searched on Google as well, but found nothing useful. I would
> > > appreciate any help.
> > >
> > > Is there a way to parse specific HTML tags while crawling with Nutch
> > > and then index them into Solr?
> > >
> > > For example, I don't want the whole HTML page to go into the content
> > > node. I would like to parse the h1 and h2 tags into separate nodes.
> >
> 
