On Fri, 3 Jan 2020, Mike Dalrymple wrote:
I've just started using Tika to process PDFs with embedded images. I'm
getting fantastic results but I'm having to post-process the generated
XHTML to correct the value of the src attribute on the img elements.
That is expected. A simple SAX handler should let you do that, rewriting
the src to point at wherever you're saving the images.
The generated XHTML has elements like:
<img src="embedded:image1.jpg" alt="image1.jpg" />
The "embedded:" prefix is Tika's way of letting you know there was an
embedded image there, and what name it would have if you extracted it
(which you may not have done).
The idea is that, for the extract+display case, you re-write it to match
where you stored the image. For other cases, you know it was an embedded
image rather than an external reference.
Any direction would be greatly appreciated. I'm currently just passing
the generated XHTML through a regex that converts the src attributes and
that works fine, it just feels like there may be a more idiomatic way
that I'm not seeing.
Several jobs ago, I wrote some code to do this for Alfresco:
https://github.com/alfresco-mirror/alfresco-mirror/blob/b3d815063d3634d4bde83b4a214db62215a490fd/root/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java#L490
The idea is that it re-writes just the embedded image links to point to a
specific folder path or prefix where the embedded images were written,
while leaving all other (external) images alone.
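
For illustration, here is a minimal sketch of that kind of handler. It is not
the Alfresco code linked above. It uses only the JDK's SAX classes, and the
class name EmbeddedImageSrcRewriter and the imagePrefix parameter are
made up for this example. It assumes the src values look like the
"embedded:image1.jpg" form shown earlier.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper: forwards all SAX events to a downstream handler,
// rewriting only <img src="embedded:..."> on the way through.
class EmbeddedImageSrcRewriter extends XMLFilterImpl {
    private static final String EMBEDDED_PREFIX = "embedded:";

    // Assumption: the folder path or URL prefix where the images were saved.
    private final String imagePrefix;

    EmbeddedImageSrcRewriter(ContentHandler downstream, String imagePrefix) {
        setContentHandler(downstream);
        this.imagePrefix = imagePrefix;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        // localName can be empty when the source is not namespace-aware.
        String name = localName.isEmpty() ? qName : localName;
        if ("img".equals(name)) {
            // Copy the attributes, since the incoming object may be read-only.
            AttributesImpl rewritten = new AttributesImpl(atts);
            for (int i = 0; i < rewritten.getLength(); i++) {
                String attr = rewritten.getLocalName(i).isEmpty()
                        ? rewritten.getQName(i) : rewritten.getLocalName(i);
                String value = rewritten.getValue(i);
                if ("src".equals(attr) && value.startsWith(EMBEDDED_PREFIX)) {
                    rewritten.setValue(i,
                            imagePrefix + value.substring(EMBEDDED_PREFIX.length()));
                }
            }
            atts = rewritten;
        }
        // External images and every other element pass through unchanged.
        super.startElement(uri, localName, qName, atts);
    }
}
```

You would wrap whatever ContentHandler serialises the XHTML in one of
these, for example new EmbeddedImageSrcRewriter(yourHandler, "images/"),
and hand the wrapper to the parser in its place.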
Nick