That's a bug. Thank you for trying it and letting us know!

https://issues.apache.org/jira/browse/TIKA-2876
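
Since you mentioned Python: a rough sketch like the one below may help to check what /unpack/all actually returns for a given file. It mirrors your curl commands using only the standard library. The base URL is copied from your examples; the function names (unpack_all, zip_members), the explicit Accept header and the error handling are my own assumptions, not anything Tika prescribes. http.client is used so that the X-Tika-PDFextractInlineImages header is sent with its capitalization unchanged.

```python
import http.client
import io
import zipfile
from urllib.parse import urlsplit

# Assumed base URL, copied from the curl commands in this thread.
TIKA_URL = "http://localhost:32768"


def zip_members(zip_bytes: bytes) -> dict:
    """Map each entry name in a zip archive to that entry's raw bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


def unpack_all(path: str, base_url: str = TIKA_URL,
               pdf_inline_images: bool = False) -> dict:
    """PUT a file to /unpack/all (like `curl -T`) and return the zip entries."""
    parts = urlsplit(base_url)
    # Assumption: ask for a zip explicitly instead of relying on the default.
    headers = {"Accept": "application/zip"}
    if pdf_inline_images:
        # Same header as in the curl example for PDFs.
        headers["X-Tika-PDFextractInlineImages"] = "true"

    with open(path, "rb") as handle:
        body = handle.read()

    connection = http.client.HTTPConnection(parts.hostname, parts.port)
    try:
        connection.request("PUT", parts.path.rstrip("/") + "/unpack/all",
                           body=body, headers=headers)
        response = connection.getresponse()
        if response.status != 200:
            raise RuntimeError(f"Tika returned {response.status} {response.reason}")
        return zip_members(response.read())
    finally:
        connection.close()
```

For example, sorted(unpack_all("test.pdf", pdf_inline_images=True)) gives the entry names in the returned archive, so it is easy to see whether any images came back alongside the text and metadata.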

On Wed, May 15, 2019 at 10:09 AM Markus <[email protected]> wrote:
>
> Hi Tim et al.,
>
> unfortunately the "unpack" endpoint does not seem to be working for PDFs.
> I can successfully get the content, metadata and images for DOCX files like
> this:
>
> curl -T test.docx http://localhost:32768/unpack/all > x.zip
>
> however, when using a PDF and setting the correct header:
>
> curl -H "X-Tika-PDFextractInlineImages: true" -T test.pdf http://localhost:32768/unpack/all > x.zip
>
> the result is a zip file containing only metadata and text, with no images.
> Could this be a bug? Not sure how it is supposed to work.
>
> Markus
>
>
> On Mon, May 13, 2019 at 9:47 PM Tim Allison <[email protected]> wrote:
>>
>> Hi Markus,
>>   If you want the embedded images as literal binary files, you
>> can use the unpack resource:
>> https://wiki.apache.org/tika/TikaJAXRS#Unpack_Resource
>>   The /rmeta endpoint, as you found, only includes metadata and
>> extracted text.
>>
>>           Best,
>>
>>                Tim
>>
>> On Mon, May 13, 2019 at 5:03 AM Markus <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I am running Tika with curl against the REST service.
>> > When I also request extraction of the inline images from a PDF like this:
>> > curl --header "X-Tika-PDFextractInlineImages:true" -T ./test.pdf http://localhost:32768/rmeta >result.json
>> >
>> > I do obtain structure (XML/HTML) including image tags such as
>> > "<img src=\"embedded:image0.jpg\" alt=\"image0.jpg\" />"
>> >
>> > I see separate metadata in the output such as
>> > "X-TIKA:embedded_resource_path": "/image0.jpg", but nothing in there
>> > resembles binary image data in any form.
>> > And I have to restrict myself to REST only, since I will be using Python to
>> > talk to Tika.
>> >
>> > Is there any specification for the embedded image data?
>> > Is there a way to pull/save the embedded images in actual binary image
>> > format?
>> >
>> > Example JSON output attached.
>> >
>> > Markus
