: I've been investigating and I understand that using the RegexTransformer is
: an option that is open for identifying and extracting data to multiple
: fields from a single RSS value source ... But rather than hack together
: something I once again wanted to check with the community: Is there another
: option for navigating the HTML DOM tree using some well-tested transformer
: or Tika or something?
I don't think so ... if it's a *really* well-formed feed, then the description will actually be XHTML nodes (with the appropriate namespace) that are already part of the Document's DOM.

But if it's just a blob of CDATA that happens to contain well-formed HTML, then I think a regex is currently your best option -- you'll probably want something tailor-made for the subtleties of the site whose RSS you're scraping anyway, since things like "are & chars in the URLs HTML-escaped?" are going to vary from site to site.

It would probably be possible to write a DIH Transformer based on something like TagSoup to actually produce a DOM from an arbitrary HTML string in an entity, so you could then treat it as a subentity and use the XPathEntityProcessor -- but I don't think I've seen anyone talk about doing anything like that before.

-Hoss
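
To make the regex route a bit more concrete, here is a rough sketch in plain Java (java.util.regex only, not the DIH API) of pulling several fields out of one description blob. The class name, the field names, the sample HTML and both patterns are made up for illustration; a real feed would need patterns tuned to its own markup, including whether & is escaped as &amp;amp; in its URLs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts a few fields from an HTML fragment
// found in an RSS <description>. Not part of Solr or DIH.
class DescriptionFields {

    // First <a href="...">text</a> in the fragment (assumes double-quoted attributes).
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // First <img ... src="..."> in the fragment.
    private static final Pattern IMAGE = Pattern.compile(
            "<img\\s+[^>]*src=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Site-specific assumption: this feed escapes & as &amp; inside URLs.
    static String unescapeAmp(String s) {
        return s.replace("&amp;", "&");
    }

    // Returns whichever of link, linkText and image could be found.
    static Map<String, String> extract(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher link = LINK.matcher(html);
        if (link.find()) {
            fields.put("link", unescapeAmp(link.group(1)));
            fields.put("linkText", link.group(2).trim());
        }
        Matcher image = IMAGE.matcher(html);
        if (image.find()) {
            fields.put("image", unescapeAmp(image.group(1)));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up description content for demonstration.
        String description =
                "<p><a href=\"http://example.com/story?id=7&amp;ref=rss\">Some headline</a></p>"
                + "<img src=\"http://example.com/thumb.png\"/>";
        extract(description).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The same kind of patterns should carry over to RegexTransformer fields, one pattern per target field, though the escaping rules will still have to be checked against each site.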