Hi,

I looked at the HtmlHandler code of Tika 0.9: attributes are only checked for "link" attributes, and the rest are discarded.

I would like to get the alt and title attributes of image tags and the title attributes of anchor tags, so that their content is included in the fulltext output of the result. They are missing at the moment. I guess the default HtmlHandler does not support this, right?

So, a question on how to support this: I would extend HtmlParser and hack the code to pass my attributes through in the startElement call. To get the text, I would need my own WriteOutContentHandler which not only outputs the characters but also processes the startElement calls and extracts the alt and title attributes.

Is this how it would be done with Tika, or is there a better way?

Regards,
Torsten
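
P.S. To make the second half of the idea concrete, here is a minimal sketch of the kind of handler I have in mind. It uses only the plain SAX API from the JDK, not Tika's WriteOutContentHandler, and the class name AttributeTextHandler is made up for illustration. It also assumes the alt and title attributes actually reach the handler, which is exactly the part that does not seem to happen with the default HtmlHandler.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data and, in addition,
// the alt/title attributes of <img> and the title attribute of <a>.
class AttributeTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Fall back to the qualified name if the parser gives no local name.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("img".equalsIgnoreCase(name)) {
            appendAttribute(atts, "alt");
            appendAttribute(atts, "title");
        } else if ("a".equalsIgnoreCase(name)) {
            appendAttribute(atts, "title");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Appends the attribute value, separated from neighbouring text by a space.
    private void appendAttribute(Attributes atts, String attrName) {
        String value = atts.getValue("", attrName); // lookup by namespace + local name
        if (value == null) {
            value = atts.getValue(attrName);        // lookup by qualified name
        }
        if (value == null || value.trim().isEmpty()) {
            return;
        }
        if (text.length() > 0 && !Character.isWhitespace(text.charAt(text.length() - 1))) {
            text.append(' ');
        }
        text.append(value.trim()).append(' ');
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Feeds a few SAX events by hand, as a parser would for
    // <img alt="Logo" title="Company logo">Hello<a title="Home page">home</a>
    public static void main(String[] args) {
        AttributeTextHandler handler = new AttributeTextHandler();

        AttributesImpl img = new AttributesImpl();
        img.addAttribute("", "alt", "alt", "CDATA", "Logo");
        img.addAttribute("", "title", "title", "CDATA", "Company logo");
        handler.startElement("", "img", "img", img);

        char[] hello = "Hello".toCharArray();
        handler.characters(hello, 0, hello.length);

        AttributesImpl a = new AttributesImpl();
        a.addAttribute("", "title", "title", "CDATA", "Home page");
        handler.startElement("", "a", "a", a);

        char[] home = "home".toCharArray();
        handler.characters(home, 0, home.length);

        System.out.println(handler);
    }
}
```

Such a handler would be passed to the parse call in place of the usual text handler. Whether the attribute text should go before or after the element's own text is a matter of taste; here it simply appears at the position of the start tag.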
