Re: How to combine RSS w/ Tika when using Data Import Handler (DIH)

Chris Hostetter Tue, 13 Sep 2011 09:10:14 -0700

: I've been investigating and I understand that using the RegexTransformer is
: an option that is open for identifying and extracting data to multiple
: fields from a single rss value source ... But rather than hack together
: something I once again wanted to check with the community: Is there another
: option for navigating the HTML DOM tree using some well-tested transformer
: or Tika or something?


I don't think so ... if it's a *really* well-formed feed, then the 
description will actually be XHTML nodes (with the appropriate 
namespace) that are already part of the Document's DOM.

But if it's just a blob of CDATA that happens to contain well-formed HTML, 
then I think a regex is currently your best option -- you'll probably want 
something tailor-made for the subtleties of the site whose RSS you're 
scraping anyway, since things like "are & chars in the URLs HTML-escaped?" 
are going to vary from site to site.
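
As a rough illustration of the regex approach outside of DIH itself, here 
is a minimal Java sketch using only java.util.regex. The class name, the 
field names (link_url, link_text, image_url), the two patterns, and the 
sample description are all made up for the example; a real feed will need 
patterns tuned to its own markup. It assumes double-quoted attributes and 
only undoes the "&amp;" escaping mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls a few fields out of an HTML blob taken from
// an RSS <description>. Not part of Solr or DIH.
class DescriptionFields {

    // First <a href="..."> and its link text (assumes double-quoted attributes).
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // First <img src="..."> (assumes double-quoted attributes).
    private static final Pattern IMG = Pattern.compile(
            "<img\\s+[^>]*src=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Undo HTML escaping of '&' in URLs; whether this is needed is site-specific.
    static String unescapeAmp(String s) {
        return s.replace("&amp;", "&");
    }

    // Returns only the fields whose pattern matched.
    static Map<String, String> extract(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher link = LINK.matcher(html);
        if (link.find()) {
            fields.put("link_url", unescapeAmp(link.group(1)));
            fields.put("link_text", link.group(2).trim());
        }
        Matcher img = IMG.matcher(html);
        if (img.find()) {
            fields.put("image_url", unescapeAmp(img.group(1)));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample of what a CDATA description might contain.
        String description =
                "<p><a href=\"http://example.com/story?id=1&amp;src=rss\">Full story</a> "
              + "<img src=\"http://example.com/a.png\" alt=\"x\"/></p>";
        System.out.println(extract(description));
    }
}
```

Inside DIH the same patterns would presumably go into RegexTransformer 
field definitions instead, one field per capture, with the unescaping 
handled in a separate step.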

It would probably be possible to write a DIH Transformer based on 
something like TagSoup to actually produce a DOM from an arbitrary HTML 
string in an entity, so you could then treat it as a subentity and use the 
XPathEntityProcessor -- but I don't think I've seen anyone talk about 
doing anything like that before.

-Hoss
