Re: Understanding XML/JSON output structure

Tim Allison Mon, 13 May 2019 12:48:06 -0700

Hi Markus,
  If you want the embedded images as actual binary image files, you
can use the unpack resource:
https://wiki.apache.org/tika/TikaJAXRS#Unpack_Resource
  The /rmeta endpoint, as you found, only includes metadata and extracted text.
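
  Since you plan to talk to Tika from Python, here is a minimal sketch of
calling the unpack resource using only the standard library. It assumes
the server from your curl command (localhost:32768), that /unpack accepts
a PUT of the raw file and answers with a zip archive of the embedded
files, and that it answers 204 when nothing is embedded. The helper names
unpack_embedded and read_zip are hypothetical, not part of Tika.

```python
import io
import urllib.request
import zipfile


def read_zip(zip_bytes):
    """Return {entry name: entry bytes} for an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


def unpack_embedded(pdf_bytes, base_url="http://localhost:32768"):
    """PUT a PDF to the /unpack resource and return its embedded files.

    Assumption: base_url matches the port used in the curl example.
    """
    request = urllib.request.Request(
        base_url + "/unpack",
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            "Accept": "application/zip",
            # Same header as in the original curl call.
            "X-Tika-PDFextractInlineImages": "true",
        },
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: 204 No Content means no embedded files were found.
        if response.status == 204:
            return {}
        return read_zip(response.read())
```

  Calling unpack_embedded(open("test.pdf", "rb").read()) against a running
server should then give you a dict whose values can be written straight
to disk as image files.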


          Best,

               Tim

On Mon, May 13, 2019 at 5:03 AM Markus <[email protected]> wrote:
>
> Hi,
>
> I am running Tika with curl against the REST service.
> When I also request extraction of the inline images from a PDF, like this:
> curl --header "X-Tika-PDFextractInlineImages:true" -T ./test.pdf 
> http://localhost:32768/rmeta >result.json
>
> I do obtain structure (xml/html) including image tags such as "<img 
> src=\"embedded:image0.jpg\" alt=\"image0.jpg\" />"
>
> I see separate metadata in the output such as 
> ""X-TIKA:embedded_resource_path": "/image0.jpg"" but nothing in there 
> resembles binary image data in any form.
> And I have to restrict myself to REST only, since I will be using Python to 
> talk to Tika.
>
> Is there any specification for the embedded image data?
> Is there a way to pull/save the embedded images in actual binary image format?
>
> Example JSON output attached.
>
> Markus
