Understanding XML/JSON output structure

Markus Mon, 13 May 2019 02:04:20 -0700

Hi,

I am running the Tika REST service and querying it with curl.
When I ask it to also extract the inline images from a PDF, like this:

    curl --header "X-Tika-PDFextractInlineImages:true" -T ./test.pdf http://localhost:32768/rmeta > result.json


the output does contain structure (XML/HTML), including image tags such as
"*<img src=\"embedded:image0.jpg\" alt=\"image0.jpg\" />*"

I also see separate metadata entries in the output, such as
"*"X-TIKA:embedded_resource_path": "/image0.jpg"*",
but nothing in there resembles binary image data in any form.
I also have to restrict myself to the REST interface, since I will be using
Python to talk to Tika.
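
For reference, this is roughly how the same request could look from Python using only the standard library. It is a sketch, not tested against every Tika version: the function names and the default base URL (taken from the curl call above) are my own placeholders, the explicit Content-Type header is an addition that curl -T does not send, and I am assuming that /rmeta returns a JSON array with one metadata object per document or embedded resource.

```python
import json
import urllib.request

# Metadata key seen in the /rmeta output for embedded resources.
EMBEDDED_PATH_KEY = "X-TIKA:embedded_resource_path"


def build_rmeta_request(pdf_bytes, base_url="http://localhost:32768"):
    """Build the PUT request that mirrors the curl -T call (hypothetical helper)."""
    return urllib.request.Request(
        base_url + "/rmeta",
        data=pdf_bytes,
        method="PUT",  # curl -T uploads with PUT
        headers={
            "X-Tika-PDFextractInlineImages": "true",
            # Assumption: set explicitly so urllib does not add its
            # form-encoded default Content-Type to the upload.
            "Content-Type": "application/pdf",
        },
    )


def fetch_rmeta(pdf_path, base_url="http://localhost:32768"):
    """Send the PDF to Tika and return the parsed JSON response."""
    with open(pdf_path, "rb") as fh:
        request = build_rmeta_request(fh.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def embedded_resource_paths(records):
    """Return the embedded resource path of every record that carries one."""
    return [rec[EMBEDDED_PATH_KEY] for rec in records if EMBEDDED_PATH_KEY in rec]
```

Calling embedded_resource_paths(fetch_rmeta("./test.pdf")) should then list entries such as "/image0.jpg", which is the same information I already see in result.json: paths and metadata, but no image bytes.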

Is there any specification for how the embedded image data is represented?
Is there a way to pull/save the embedded images in their actual binary image
format?

Example JSON output attached.

Markus

result.json
Description: application/json
