This makes sense and I think that ContentHandlerDecorator in your code may
actually help me improve my processing elsewhere.

Thank you for the detailed reply, it's appreciated.

Mike

On Fri, Jan 3, 2020 at 9:17 AM Nick Burch wrote:

> On Fri, 3 Jan 2020, Mike Dalrymple wrote:
> > I've just started using Tika to process PDFs with embedded images.  I'm
> > getting fantastic results but I'm having to post-process the generated
> > XHTML to correct the value of the src attribute on the img elements.
>
> That is expected. A simple SAX handler should let you do that, rewriting
> it to point to wherever you're saving the images.
>
> > The generated XHTML has elements like:
> >
> > <img src="embedded:image1.jpg" alt="image1.jpg" />
>
> The embedded prefix is Tika's way of letting you know there was an
> embedded image there, and what name it would have if you extracted it
> (which you may not have done).
>
> The idea is that, for the extract+display case, you re-write it to match
> where you stored the image. For other cases, you know it was an embedded
> image rather than an external reference.
>
> > Any direction would be greatly appreciated.  I'm currently just passing
> > the generated XHTML through a regex that converts the src attributes and
> > that works fine, it just feels like there may be a more idiomatic way
> > that I'm not seeing.
>
> Several jobs ago, I wrote some code to do this for Alfresco:
>
> https://github.com/alfresco-mirror/alfresco-mirror/blob/b3d815063d3634d4bde83b4a214db62215a490fd/root/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java#L490
>
> The idea is it re-writes just the embedded image links to point to a
> specific folder path or prefix where the embedded images were written,
> while leaving all other (external) images alone.
>
> Nick
>
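
For anyone finding this thread later, here is a minimal sketch of the
rewriting idea Nick describes. It is not the Alfresco code. It uses only
the JDK's SAX classes, with XMLFilterImpl standing in for the role that
Tika's ContentHandlerDecorator would play in a real pipeline. The class
name EmbeddedImageRewriter, the rewrite() helper, and the "images/" prefix
are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper: rewrites only "embedded:" image links and
// passes every other SAX event through unchanged.
class EmbeddedImageRewriter extends XMLFilterImpl {
    private static final String EMBEDDED_PREFIX = "embedded:";

    // Assumed to be the folder path or URL prefix where the
    // extracted images were written.
    private final String imagePrefix;

    EmbeddedImageRewriter(String imagePrefix) {
        this.imagePrefix = imagePrefix;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if ("img".equals(localName) || "img".equals(qName)) {
            // Copy the attributes so the parser's own object is not modified.
            AttributesImpl copy = new AttributesImpl(atts);
            for (int i = 0; i < copy.getLength(); i++) {
                boolean isSrc = "src".equals(copy.getLocalName(i))
                        || "src".equals(copy.getQName(i));
                String value = copy.getValue(i);
                if (isSrc && value.startsWith(EMBEDDED_PREFIX)) {
                    String name = value.substring(EMBEDDED_PREFIX.length());
                    copy.setValue(i, imagePrefix + name);
                }
            }
            atts = copy;
        }
        super.startElement(uri, localName, qName, atts);
    }

    // Convenience wrapper for trying the filter on an XHTML string.
    static String rewrite(String xhtml, String imagePrefix) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            EmbeddedImageRewriter filter = new EmbeddedImageRewriter(imagePrefix);
            filter.setParent(reader);

            // An identity transform serializes the filtered SAX events.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(
                    new SAXSource(filter, new InputSource(new StringReader(xhtml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite XHTML", e);
        }
    }
}
```

Calling EmbeddedImageRewriter.rewrite(xhtml, "images/") on Tika's XHTML
output should turn src="embedded:image1.jpg" into src="images/image1.jpg"
and leave external image URLs as they are. Working on SAX events rather
than a regex means only real src attributes on img elements are touched.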
