Hello Everyone,

I've been investigating, and I understand that the RegexTransformer is one option for identifying and extracting data into multiple fields from a single RSS source value. But rather than hack something together, I wanted to check with the community once more: is there another option for navigating the HTML DOM tree, using a well-tested transformer, Tika, or something similar?
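
For context, the RegexTransformer route I am considering would look roughly like the sketch below. It is untested. The entity name, the forEach path, the "image" and "price" column names, and both regexes are my own assumptions based on the sample HTML in the quoted message. I am also assuming that RegexTransformer keeps the first capture group when only regex and sourceColName are given.

```xml
<dataConfig>
  <dataSource type="URLDataSource" />
  <document>
    <!-- "amazonFeed" and forEach="/rss/item" are assumed; adjust to the real feed layout -->
    <entity name="amazonFeed"
            processor="XPathEntityProcessor"
            url="http://persistent.info/cgi-bin/feed-proxy?url=http%3A%2F%2Fwww.amazon.com%2Frss%2Ftag%2Fblu-ray%2Fnew%2Fref%3Dtag_rsh_hl_ersn"
            forEach="/rss/item"
            transformer="RegexTransformer">

      <!-- Raw HTML chunk, exactly as I read it today -->
      <field column="description" xpath="/rss/item/description" />

      <!-- Assumed regex: capture the URL inside the first img src attribute -->
      <field column="image"
             sourceColName="description"
             regex='&lt;img\s+src="\s*([^"\s]+)' />

      <!-- Assumed regex: capture the number after the dollar sign in the price span -->
      <field column="price"
             sourceColName="description"
             regex='class="tgProductPrice"&gt;\$([0-9.]+)' />
    </entity>
  </document>
</dataConfig>
```

This works only as long as the markup stays exactly as it is now, which is why I would prefer something that actually parses the HTML.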

Thanks!
- Pulkit

On Mon, Sep 12, 2011 at 1:45 PM, Pulkit Singhal <pulkitsing...@gmail.com> wrote:

> Given an RSS raw feed source link such as the following:
>
> http://persistent.info/cgi-bin/feed-proxy?url=http%3A%2F%2Fwww.amazon.com%2Frss%2Ftag%2Fblu-ray%2Fnew%2Fref%3Dtag_rsh_hl_ersn
>
> I can easily get to the value of the description for an item like so:
> <field column="description" xpath="/rss/item/description" />
>
> But the content of "description" happens to be in HTML and sadly it is this
> HTML chunk that has some pretty decent information that I would like to
> import as well.
> 1) For example it has the image for the item:
> <img src="http://ecx.images-amazon.com/images/I/51yyAAoYzKL._SL160_SS160_.jpg" ... />
> 2) It has the price for the item:
> <span class="tgProductPrice">$13.99</span>
> And many other useful pieces of data that aren't in a proper rss format but
> they are simply thrown together inside the html chunk that is served as the
> value for the xpath="/rss/item/description"
>
> So, how can I configure DIH to start importing this html information as
> well?
> Is Tika the way to go?
> Can someone give a brief example of what a config file with both Tika
> config and RSS config would/should look like?
>
> Thanks!
> - Pulkit