Hi,

I looked at the HtmlHandler code of Tika 0.9, and attributes are only
checked for "link" attributes; all others are discarded.

I would like to get the content of the alt and title attributes on image
tags and of the title attributes on anchor tags, so that they are included
in the fulltext output of the result. They are missing at the moment.

I guess the default HtmlHandler does not support this, right?

So, some questions on how to support this:

I would extend the HtmlParser and hack the code to pass my attributes
through on the startElement call.
To get the text, I would need my own WriteOutContentHandler, which not
only outputs the characters but also processes startElement calls and
extracts the alt and title attributes.
Is this how it would be done with Tika, or is there a better way?
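
To make the handler idea concrete, here is a minimal sketch using only the
plain JDK SAX API, not Tika itself. The class name AttributeTextHandler and
the helper extract() are hypothetical, and the sketch assumes the alt and
title attributes actually reach the handler, which is exactly the part that
would require changing the HTML handling in Tika 0.9.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data and, in addition,
// the alt/title attributes of <img> and the title attribute of <a>.
class AttributeTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // localName is empty when the parser is not namespace-aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("img".equalsIgnoreCase(name)) {
            appendAttribute(atts, "alt");
            appendAttribute(atts, "title");
        } else if ("a".equalsIgnoreCase(name)) {
            appendAttribute(atts, "title");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Appends the attribute value, padded with spaces so that it does not
    // run into the surrounding text.
    private void appendAttribute(Attributes atts, String attName) {
        String value = atts.getValue(attName);
        if (value != null && !value.isEmpty()) {
            text.append(' ').append(value).append(' ');
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Convenience helper for trying the handler on a well-formed XHTML snippet.
    static String extract(String xhtml) {
        try {
            AttributeTextHandler handler = new AttributeTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }
}
```

With Tika, the same startElement logic would presumably sit in a handler
wrapped around (or replacing) the text-writing handler; whether the
attributes arrive there depends on what the HTML parser lets through.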

regards

Torsten
