On Fri, 3 Jan 2020, Mike Dalrymple wrote:
I've just started using Tika to process PDFs with embedded images.  I'm
getting fantastic results but I'm having to post-process the generated
XHTML to correct the value of the src attribute on the img elements.

That is expected. A simple SAX handler should let you do that, re-writing each src to point to wherever you're saving the images.

The generated XHTML has elements like:

<img src="embedded:image1.jpg" alt="image1.jpg" />

The embedded: prefix is Tika's way of letting you know there was an embedded image there, and what name it would have if you extracted it (which you may not have done).

The idea is that, for the extract+display case, you re-write it to match where you stored the image. For other cases, you at least know it was an embedded image rather than an external reference.

Any direction would be greatly appreciated. I'm currently just passing the generated XHTML through a regex that converts the src attributes, and that works fine; it just feels like there may be a more idiomatic way that I'm not seeing.

Several jobs ago, I wrote some code to do this for Alfresco:
https://github.com/alfresco-mirror/alfresco-mirror/blob/b3d815063d3634d4bde83b4a214db62215a490fd/root/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java#L490

The idea is that it re-writes just the embedded image links to point to a specific folder path or prefix where the embedded images were written, while leaving all other (external) images alone.
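
For illustration, here is a minimal sketch of that idea using only the JDK's SAX classes. It is not the Alfresco code linked above. The class name EmbeddedImageRewriter, the rewrite() helper and the "images/" prefix are all hypothetical, and it assumes the src values look like the embedded:image1.jpg example shown earlier.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: turns img src="embedded:NAME" into src=PREFIX + NAME. */
class EmbeddedImageRewriter extends XMLFilterImpl {
    private static final String EMBEDDED = "embedded:";
    private final String prefix; // assumed location of the extracted images

    EmbeddedImageRewriter(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if ("img".equals(localName) || "img".equals(qName)) {
            // Attributes may be read-only, so work on a copy.
            AttributesImpl copy = new AttributesImpl(atts);
            for (int i = 0; i < copy.getLength(); i++) {
                boolean isSrc = "src".equals(copy.getLocalName(i))
                        || "src".equals(copy.getQName(i));
                String value = copy.getValue(i);
                // Only embedded: links are touched; external URLs pass through.
                if (isSrc && value.startsWith(EMBEDDED)) {
                    copy.setValue(i, prefix + value.substring(EMBEDDED.length()));
                }
            }
            atts = copy;
        }
        super.startElement(uri, localName, qName, atts);
    }

    /** Demo helper: parse an XML string, filter it, serialize it again. */
    static String rewrite(String xhtml, String prefix) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // A TransformerHandler is a ContentHandler that writes XML back out.
        TransformerHandler serializer =
                ((SAXTransformerFactory) SAXTransformerFactory.newInstance())
                        .newTransformerHandler();
        serializer.getTransformer()
                .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        EmbeddedImageRewriter filter = new EmbeddedImageRewriter(prefix);
        filter.setParent(reader);
        filter.setContentHandler(serializer);
        filter.parse(new InputSource(new StringReader(xhtml)));
        return out.toString();
    }
}
```

The rewrite() helper re-parses a string only to keep the sketch self-contained. Because XMLFilterImpl is itself a ContentHandler, the same filter could instead be handed to Tika's Parser.parse(...) as the handler, with setContentHandler(...) pointing at whatever handler writes your output, which would avoid the second pass over the XHTML.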

Nick
