Simply implement an HtmlParseFilter, which will receive a DOM representation from the Tika or HTML parser. Look at the existing plugins for examples, or search the mailing list.
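
As a rough illustration of what such a filter might do with the DOM it is handed, here is a minimal sketch that applies the XPath suggested earlier in this thread, using only the JDK's built-in XPath support. It is not Nutch code: the class name PriceXPathSketch, the method extractPrice, and the sample markup wrapped around the quoted span are all assumptions made for the example. In a real filter the DOM would come from the parser rather than from a string, and element-name case or namespaces may differ depending on the parser, so the expression may need adjusting.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch or Tika.
final class PriceXPathSketch {

    // XPath expression quoted in the thread.
    static final String PRICE_XPATH =
        "//li[@class='good-value selected']/span[@class='value']";

    // Parses well-formed markup and returns the text of the first match,
    // or an empty string when nothing matches.
    static String extractPrice(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(PRICE_XPATH, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed markup: the inner span is the one quoted in the thread,
        // the surrounding ul/li is guessed from the XPath.
        String sample = "<ul><li class=\"good-value selected\">"
            + "<span class=\"value\"><span class=\"icon\"></span>$1,110</span>"
            + "</li></ul>";
        System.out.println(extractPrice(sample));
    }
}
```

Inside an actual HtmlParseFilter the same expression could be evaluated against the DOM node passed to the filter, and the result stored in the parse metadata.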

On 20 July 2011 08:53, Cheng Li <[email protected]> wrote:

> Thank you.
>
> What do you mean by Xpath? Could you explain a little bit more?
>
> Actually I was considering using Tika to deal with the extraction part. Any
> suggestions for that?
>
> Thanks,
>
> On Wed, Jul 20, 2011 at 12:37 AM, Hannes Carl Meyer <[email protected]> wrote:
>
> > As I can see the price is on the source code.
> > You could use for example XPath to extract that information via
> >
> > //li[@class='good-value selected']/span[@class='value']
> >
> > BR
> >
> > Hannes
> >
> > On Wed, Jul 20, 2011 at 9:13 AM, Gora Mohanty <[email protected]> wrote:
> >
> > > On Wed, Jul 20, 2011 at 12:12 PM, Cheng Li <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > I want to extract price data (here the price is $1110) from
> > > >
> > > > http://www.kbb.com/volkswagen/jetta/1991-volkswagen-jetta/gl-sedan-2d/?vehicleid=11638&intent=buy-used&pricetype=private-party&condition=good
> > > >
> > > > But in the website source code, I cannot find any information about the
> > > > price of $1110. How should I extract the price data from this page?
> > >
> > > Haven't tried crawling the site with Nutch, but the price is in the source
> > > code. Do a "View Source" in your browser, and search for 1,100 (there
> > > is a comma in there). I see
> > > <span class="value"><span class="icon"></span>$1,110</span>
> > >
> > > Regards,
> > > Gora
>
> --
> Cheng Li

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

