Re: Parsing PDF files

David Meikle Mon, 29 Dec 2014 14:51:56 -0800

Hello,

> On 24 Dec 2014, at 20:30, A.M. Sabuncu <[email protected]> wrote:
> 
> I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
> using the following curl command to test text extraction from PDF files:
> curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
> On trivial PDF files (e.g. created using Word 2010's convert-to-pdf 
> functionality and containing only the text "Testing", about 81 KB in size), I 
> get errors in that there's nothing returned from the curl command, and on the 
> tika-server end, I see the following errors:
> 
> <lots of garbage characters displayed on screen, followed by>
> 
> WARNING: Did not found XRef object at specified startxref position 0
> 
> Being new to Tika, I would like to know whether I am doing something wrong, 
> or if PDF parsing is not yet an exact science.
> 
> Many thanks in advance.
> 
> Sabuncu


Working through this, we discovered that we were using different commands, 
which in turn uncovered an error on the TikaJAXRS wiki page: all of the 
examples, regardless of the nature of the content, use the -d flag (effectively 
--data-ascii) in their curl commands.  With -d @file, curl treats the file as 
text and strips carriage returns and newlines before sending it, so binary 
files such as PDFs reach the server corrupted.
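
To illustrate why this matters, here is a minimal Python sketch of the effect. 
It assumes the documented curl behaviour for -d/--data, where carriage returns 
and newlines are dropped when the data is read from a file with @.  The helper 
name strip_line_breaks and the sample bytes are made up for illustration; they 
are not taken from curl or Tika.

```python
def strip_line_breaks(payload: bytes) -> bytes:
    """Mimic (as an assumption) what `curl -d @file` does: drop CR and LF bytes."""
    return payload.replace(b"\r", b"").replace(b"\n", b"")


# Hypothetical stand-in for a small binary file; a real PDF is much larger.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Length 4 >>\nstream\n"
    b"\x00\n\r\x01"  # binary stream data that happens to contain LF and CR
    b"\nendstream\nendobj\n"
)

sent_with_data = strip_line_breaks(sample)  # roughly what -d would send
sent_with_data_binary = sample              # --data-binary sends the bytes unchanged

print("original bytes:        ", len(sample))
print("bytes with -d:         ", len(sent_with_data))
print("bytes with data-binary:", len(sent_with_data_binary))
print("payload intact with -d?", sent_with_data == sample)
```

Every byte that happens to equal CR or LF disappears, so the byte offsets 
inside the file shift.  That may well be why the parser then complains about 
the XRef position, although I have not verified that in detail.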

Based on the above, all that was required was to change the command from:

curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"

To:

curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"

I have updated the TikaJAXRS wiki page accordingly but felt it was worth 
posting back to the list for future reference.
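
For anyone scripting this rather than using curl, the same request can be 
sketched with Python's standard library.  This is only a rough equivalent, 
assuming a tika-server listening on http://localhost:9998/tika as in the 
commands above and assuming the response is UTF-8 text; the function names 
build_tika_put and extract_text are my own and not part of Tika.

```python
import urllib.request


def build_tika_put(pdf_bytes: bytes,
                   url: str = "http://localhost:9998/tika") -> urllib.request.Request:
    """Build a PUT request carrying the raw, unmodified PDF bytes."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-type": "application/pdf"},
    )


def extract_text(path: str) -> str:
    """Send the file at `path` to the (assumed) local tika-server and return the body."""
    # Open in binary mode so no bytes are altered, which is the point of --data-binary.
    with open(path, "rb") as fh:
        request = build_tika_put(fh.read())
    with urllib.request.urlopen(request) as response:
        # Assumption: the server replies with UTF-8 encoded text.
        return response.read().decode("utf-8", errors="replace")
```

With the server running, extract_text("GeoSPARQL.pdf") should return the 
extracted text.  The important detail is opening the file with "rb", which 
keeps the payload byte-for-byte identical to the file on disk.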

Cheers,
Dave
