Hello,

I've just started using Tika to process PDFs with embedded images. I'm getting fantastic results, but I'm having to post-process the generated XHTML to correct the value of the src attribute on the img elements. The generated XHTML has elements like:

    <img src="embedded:image1.jpg" alt="image1.jpg" />

My EmbeddedDocumentExtractor is saving image1.jpg in the same directory as the generated XHTML. Looking in PDF2XHTML.java, it appears that the img element is written with a hard-coded src of:

    "embedded:" + fileName

My questions are:

1. Is there any significance to the word "embedded"? I can't find any reference to "embedded" in XHTML img elements. I was thinking that it might indicate there's a base64-encoded object in the page, but that does not appear to be the case.

2. Is there a pattern for overriding the embedded img src value? I see that "parseEmbedded" is called with outputHtml=false. Would there be a way to have parseEmbedded return the img element if that were set to true?

Any direction would be greatly appreciated. I'm currently just passing the generated XHTML through a regex that converts the src attributes, and that works fine; it just feels like there may be a more idiomatic way that I'm not seeing.

Cheers,
Mike
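
P.S. To make the workaround concrete, here is a minimal sketch of the kind of regex post-processing I mean. It is plain Java with no Tika calls; the class name EmbeddedSrcRewriter, the rewrite method, and the prefix parameter are made up for illustration, and it assumes the img elements look like the example above (double-quoted src beginning with "embedded:").

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: rewrites src="embedded:NAME" on
// img elements so that it points at the file saved by the extractor.
public class EmbeddedSrcRewriter {

    // Captures everything up to and including src=" as "pre", and the file
    // name plus the closing quote as "rest"; the "embedded:" marker between
    // them is dropped by the replacement.
    private static final Pattern EMBEDDED_SRC = Pattern.compile(
            "(?<pre><img\\b[^>]*?\\bsrc=\")embedded:(?<rest>[^\"]*\")");

    // prefix is an assumed relative path to the saved images; pass "" when
    // they sit in the same directory as the XHTML.
    public static String rewrite(String xhtml, String prefix) {
        Matcher m = EMBEDDED_SRC.matcher(xhtml);
        return m.replaceAll("${pre}" + Matcher.quoteReplacement(prefix) + "${rest}");
    }

    public static void main(String[] args) {
        String in = "<img src=\"embedded:image1.jpg\" alt=\"image1.jpg\" />";
        System.out.println(rewrite(in, ""));
        System.out.println(rewrite(in, "images/"));
    }
}
```

This only touches the src attribute of img elements, so the alt text and any other occurrence of "embedded:" in the page are left alone. It is still string surgery on serialized markup, which is why I'm hoping there is a cleaner hook.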
