Hello,

I've just started using Tika to process PDFs with embedded images.  I'm
getting fantastic results but I'm having to post-process the generated
XHTML to correct the value of the src attribute on the img elements.  The
generated XHTML has elements like:

<img src="embedded:image1.jpg" alt="image1.jpg" />

My EmbeddedDocumentExtractor is saving image1.jpg in the same directory as
the generated XHTML.  Looking in PDF2XHTML.java, it appears that the img
element is written with a hard-coded src of:  "embedded:" + fileName

My questions are:

   1. Is there any significance to the word "embedded"?  I can't find any
   reference to "embedded" in XHTML img elements.  I was thinking that it
   might indicate there's a base64-encoded object in the page, but that
   does not appear to be the case.
   2. Is there a pattern for overriding the embedded img src value?  I see
   that "parseEmbedded" is called with outputHtml=false.   Would there be a
   way to have parseEmbedded return the img element if that were set to true?

Any direction would be greatly appreciated.  I'm currently just passing the
generated XHTML through a regex that converts the src attributes, and that
works fine; it just feels like there may be a more idiomatic way that I'm
not seeing.
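
For illustration, a regex pass of this kind might look like the minimal
Java sketch below.  The class name EmbeddedSrcRewriter and the prefix
parameter are hypothetical and not part of Tika, and the sketch assumes
the src value always begins with the literal "embedded:" shown above and
is double-quoted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: rewrites the src attribute of
// img elements in the XHTML string that Tika has already produced.
final class EmbeddedSrcRewriter {

    // Group 1 captures everything from "<img" up to and including the
    // opening quote of src; the literal "embedded:" after it is dropped.
    private static final Pattern EMBEDDED_SRC =
            Pattern.compile("(<img\\b[^>]*?\\bsrc=\")embedded:");

    // Replaces "embedded:" with the given prefix, for example "" when the
    // images sit next to the XHTML file, or "images/" for a subdirectory.
    static String rewrite(String xhtml, String prefix) {
        Matcher m = EMBEDDED_SRC.matcher(xhtml);
        // quoteReplacement keeps '$' or '\' in the prefix from being
        // interpreted as group references.
        return m.replaceAll("$1" + Matcher.quoteReplacement(prefix));
    }
}
```

Calling EmbeddedSrcRewriter.rewrite(xhtml, "") on the img element above
leaves src="image1.jpg" and does not touch the alt attribute.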

Cheers,
Mike
