This makes sense and I think that ContentHandlerDecorator in your code may actually help me improve my processing elsewhere.
Thank you for the detailed reply, it's appreciated.

Mike

On Fri, Jan 3, 2020 at 9:17 AM Nick Burch <[email protected]> wrote:

> On Fri, 3 Jan 2020, Mike Dalrymple wrote:
> > I've just started using Tika to process PDFs with embedded images. I'm
> > getting fantastic results but I'm having to post-process the generated
> > XHTML to correct the value of the src attribute on the img elements.
>
> That is expected. A simple SAX handler should let you do that, to re-write
> it to where you're saving the images.
>
> > The generated XHTML has elements like:
> >
> > <img src="embedded:image1.jpg" alt="image1.jpg" />
>
> The embedded prefix is Tika's way of letting you know there was an
> embedded image there, and what name it would have if you extracted it
> (which you may not have done).
>
> The idea is that, for the extract+display case, you re-write it to match
> where you stored the image. For other cases, you know it was an embedded
> image rather than an external reference.
>
> > Any direction would be greatly appreciated. I'm currently just passing
> > the generated XHTML through a regex that converts the src attributes and
> > that works fine, it just feels like there may be a more idiomatic way
> > that I'm not seeing.
>
> Several jobs ago, I wrote some code to do this for Alfresco:
>
> https://github.com/alfresco-mirror/alfresco-mirror/blob/b3d815063d3634d4bde83b4a214db62215a490fd/root/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java#L490
>
> The idea is it re-writes just the embedded image links to point to a
> specific folder path or prefix where the embedded images were written,
> while leaving all other (external) images alone.
>
> Nick
>
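
For anyone finding this thread later, here is a minimal sketch of the SAX rewriting approach described above. It is not the Alfresco code linked in the thread. To keep it runnable without Tika on the classpath it uses the JDK's XMLFilterImpl in place of Tika's ContentHandlerDecorator, and the class name EmbeddedImageRewriter, the rewrite helper, and the "images/" prefix are all hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper: rewrites src="embedded:NAME" on img elements to
// src="<imagePrefix>NAME" and passes every other event through unchanged.
class EmbeddedImageRewriter extends XMLFilterImpl {
    private static final String EMBEDDED = "embedded:";
    private final String imagePrefix;

    EmbeddedImageRewriter(XMLReader parent, String imagePrefix) {
        super(parent);
        this.imagePrefix = imagePrefix;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if ("img".equals(localName) || "img".equals(qName)) {
            // Attributes is read-only, so work on a mutable copy.
            AttributesImpl copy = new AttributesImpl(atts);
            for (int i = 0; i < copy.getLength(); i++) {
                boolean isSrc = "src".equals(copy.getLocalName(i))
                        || "src".equals(copy.getQName(i));
                String value = copy.getValue(i);
                if (isSrc && value.startsWith(EMBEDDED)) {
                    copy.setValue(i, imagePrefix + value.substring(EMBEDDED.length()));
                }
            }
            atts = copy;
        }
        super.startElement(uri, localName, qName, atts);
    }

    // Convenience method for trying the filter on an XHTML string.
    static String rewrite(String xhtml, String imagePrefix) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        XMLReader filter = new EmbeddedImageRewriter(reader, imagePrefix);

        // An identity transform serialises the filtered SAX events back to text.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(
                new SAXSource(filter, new InputSource(new StringReader(xhtml))),
                new StreamResult(out));
        return out.toString();
    }
}
```

With Tika itself, the same startElement override should carry over to a subclass of ContentHandlerDecorator wrapped around whatever handler you already pass to the parser, so the rewrite happens during parsing rather than in a second pass over the XHTML. External images are left alone because only values starting with "embedded:" are touched.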
