Ugh... could be a bug. I’ll look into it this afternoon.

On Wed, May 15, 2019 at 10:09 AM Markus <[email protected]> wrote:
> Hi Tim et al.,
>
> unfortunately the "unpack" endpoint does not seem to be working for PDFs.
> I can successfully get the content, metadata and images for DOCX files
> like this:
>
> curl -T test.docx http://localhost:32768/unpack/all > x.zip
>
> However, when using a PDF and setting the correct header:
>
> curl -H "X-Tika-PDFextractInlineImages: true" -T test.pdf http://localhost:32768/unpack/all > x.zip
>
> the result is "only" metadata and text in a zip file. No images.
> Could this be a bug? Not sure how it is supposed to work.
>
> Markus
>
>
> On Mon, May 13, 2019 at 9:47 PM Tim Allison <[email protected]> wrote:
>
>> Hi Markus,
>> If you want the embedded images as the literal binary images, you
>> can use the unpack resource:
>> https://wiki.apache.org/tika/TikaJAXRS#Unpack_Resource
>> The /rmeta endpoint, as you found, only includes metadata and extracted
>> text.
>>
>> Best,
>>
>> Tim
>>
>> On Mon, May 13, 2019 at 5:03 AM Markus <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I am running Tika with curl against the REST service.
>> > When I request extraction of the inline images from a PDF as well, like this:
>> > curl --header "X-Tika-PDFextractInlineImages:true" -T ./test.pdf http://localhost:32768/rmeta >result.json
>> >
>> > I do obtain structure (xml/html) including image tags such as
>> > "<img src=\"embedded:image0.jpg\" alt=\"image0.jpg\" />"
>> >
>> > I see separate metadata in the output such as
>> > ""X-TIKA:embedded_resource_path": "/image0.jpg"" but nothing in there
>> > resembles binary image data in any form.
>> > And I have to restrict myself to REST only, since I will be using Python
>> > to talk to Tika.
>> >
>> > Is there any specification for the embedded image data?
>> > Is there a way to pull/save the embedded images in actual binary image
>> > format?
>> >
>> > Example JSON output attached.
>> >
>> > Markus
>> >
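
Since the client in this thread will be Python, here is a minimal sketch of the same /unpack/all call as the curl commands above, using only the standard library. The host and port are taken from those commands. The function names (request_headers, unpack_all, image_members) and the list of image suffixes are illustrative assumptions, not part of Tika. It uses http.client so that the header name is sent with the same capitalization as in the curl command. It does not work around the PDF problem reported above: if the server returns no images, the list will simply be empty.

```python
import http.client
import io
import zipfile

# Assumption: same host and port as in the curl commands quoted in this thread.
HOST, PORT = "localhost", 32768

# Assumption: illustrative set of suffixes used to spot image entries in the zip.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff", ".bmp")


def request_headers(inline_images=True):
    """Headers for the upload; mirrors `curl -H "X-Tika-PDFextractInlineImages: true"`."""
    headers = {"Accept": "*/*"}  # what curl sends by default
    if inline_images:
        headers["X-Tika-PDFextractInlineImages"] = "true"
    return headers


def unpack_all(data, host=HOST, port=PORT, inline_images=True):
    """PUT the raw file bytes to /unpack/all (like `curl -T`) and return the response body."""
    conn = http.client.HTTPConnection(host, port, timeout=60)
    try:
        conn.request("PUT", "/unpack/all", body=data,
                     headers=request_headers(inline_images))
        response = conn.getresponse()
        body = response.read()
        if response.status != 200:
            raise RuntimeError(f"Tika returned HTTP {response.status}")
        return body
    finally:
        conn.close()


def image_members(zip_bytes):
    """Names of the zip entries that look like images, in archive order."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [name for name in archive.namelist()
                if name.lower().endswith(IMAGE_SUFFIXES)]


# Usage (needs a running Tika server, so it is left commented out):
#   with open("test.pdf", "rb") as fh:
#       print(image_members(unpack_all(fh.read())))
```

Running the commented usage lines against test.docx and test.pdf should make the difference described above easy to compare: per the report, the DOCX archive contains image entries and the PDF archive does not.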
