Parsing PDF files

A.M. Sabuncu Wed, 24 Dec 2014 12:32:29 -0800

I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
using the following curl command to test text extraction from PDF files:


curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"

Even on trivial PDF files (e.g. one created using Word 2010's convert-to-pdf
functionality, containing only the text "Testing" and about 81 KB in size),
the curl command returns nothing, and on the tika-server end I see the
following output:

<lots of garbage characters displayed on screen, followed by>

WARNING: Did not found XRef object at specified startxref position 0
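
One possible cause worth ruling out (an assumption, not something confirmed
in this message): according to curl's documentation, -d strips carriage
returns and newlines when it reads data from a file, so a binary file such
as a PDF may reach the server altered. curl's --data-binary and -T options
send the file unmodified. The Python sketch below is only a simplified model
of that stripping; simulate_curl_d and pdf_tail are hypothetical names made
up for illustration, and the bytes are a stand-in for the end of a PDF, not
taken from the file above.

```python
def simulate_curl_d(payload: bytes) -> bytes:
    """Simplified model of `curl -d @file`: drop CR and LF bytes.

    This is an illustration only, not curl's actual implementation.
    """
    return payload.replace(b"\r", b"").replace(b"\n", b"")


# Hypothetical stand-in for the tail of a PDF file. The number after
# "startxref" is a byte offset into the file, so removing bytes earlier
# in the file would leave that offset pointing at the wrong place.
pdf_tail = (
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\n"
    b"startxref\n9\n%%EOF\n"
)

sent = simulate_curl_d(pdf_tail)

# Compare the size on disk with the size the server would receive.
print("original bytes:", len(pdf_tail))
print("bytes sent    :", len(sent))
print("line breaks kept:", b"\n" in sent)
```

If the byte counts differ for the real file as well, the upload itself, and
not the PDF parser, would be the first thing to check.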

Being new to Tika, I would like to know whether I am doing something wrong,
or if PDF parsing is not yet an exact science.

Many thanks in advance.

Sabuncu
