Re: Understanding XML/JSON output structure

Markus Wed, 15 May 2019 07:09:37 -0700

Hi Tim et al.,

Unfortunately, the "unpack" endpoint does not seem to be working for PDFs.
I can successfully get the content, metadata and images for DOCX files like
this:


curl -T test.docx http://localhost:32768/unpack/all > x.zip

However, when using a PDF and setting the correct header:

curl -H "X-Tika-PDFextractInlineImages: true" -T test.pdf http://localhost:32768/unpack/all > x.zip

the resulting zip file contains only metadata and text, no images.
Could this be a bug? I am not sure how it is supposed to work.
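
For anyone who wants to reproduce this from Python rather than curl, here is a
minimal sketch using only the standard library. It sends the same PUT request
that "curl -T" sends and then lists what came back in the zip. The function
names, the list of image suffixes, and the assumption that embedded images keep
an ordinary file extension are mine for illustration; the URL and header are
the ones from the curl calls above.

```python
import io
import urllib.request
import zipfile

# Host and port as used in the curl calls above; adjust to your own server.
TIKA_UNPACK_URL = "http://localhost:32768/unpack/all"

# Assumption: embedded images are stored under a common image extension.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff", ".bmp")


def fetch_unpack_zip(path, url=TIKA_UNPACK_URL, extra_headers=None):
    """PUT the file to the unpack endpoint (as `curl -T` does); return the response bytes."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        url, data=body, method="PUT", headers=dict(extra_headers or {})
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def read_zip_members(zip_bytes):
    """Return {member name: content bytes} for every file entry in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if not name.endswith("/")  # skip directory entries
        }


def image_members(members):
    """Return the sorted member names that look like image files."""
    return sorted(name for name in members if name.lower().endswith(IMAGE_SUFFIXES))


# Example use against a running server (not executed here):
#   data = fetch_unpack_zip("test.pdf",
#                           extra_headers={"X-Tika-PDFextractInlineImages": "true"})
#   members = read_zip_members(data)
#   print(sorted(members))          # everything in the archive
#   print(image_members(members))   # empty for my PDF, which is the problem
```

Running this for test.docx and test.pdf and comparing the two member lists
should make the difference described above easy to see.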

Markus


On Mon, May 13, 2019 at 9:47 PM Tim Allison <[email protected]> wrote:

> Hi Markus,
>   If you want the embedded images as the literal binary images, you
> can use the unpack resource:
> https://wiki.apache.org/tika/TikaJAXRS#Unpack_Resource
>   The /rmeta endpoint, as you found, only includes metadata and extracted
> text.
>
>           Best,
>
>                Tim
>
> On Mon, May 13, 2019 at 5:03 AM Markus <[email protected]> wrote:
> >
> > Hi,
> >
> > I am running Tika with curl against the REST service.
> > When I request extraction of the inline images from a PDF like this:
> > curl --header "X-Tika-PDFextractInlineImages:true" -T ./test.pdf http://localhost:32768/rmeta >result.json
> >
> > I do obtain structure (xml/html) including image tags such as "<img src=\"embedded:image0.jpg\" alt=\"image0.jpg\" />"
> >
> > I see separate metadata in the output such as ""X-TIKA:embedded_resource_path": "/image0.jpg"" but nothing in there
> > resembles binary image data in any form.
> > And I have to restrict myself to REST only, since I will be using Python to talk to Tika.
> >
> > Is there any specification for the embedded image data?
> > Is there a way to pull/save the embedded images in actual binary image format?
> >
> > Example JSON output attached.
> >
> > Markus
>
