Thanks Dave!

------------------------
Chris Mattmann
[email protected]
-----Original Message-----
From: David Meikle <[email protected]>
Reply-To: <[email protected]>
Date: Monday, December 29, 2014 at 2:50 PM
To: <[email protected]>
Subject: Re: Parsing PDF files

>Hello,
>
>On 24 Dec 2014, at 20:30, A.M. Sabuncu <[email protected]> wrote:
>
>I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
>using the following curl command to test text extraction from PDF files:
>
>curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
>
>On trivial PDF files (e.g. created using Word 2010's convert-to-pdf
>functionality and containing only the text "Testing", about 81 KB in
>size), I get errors in that there's nothing returned from the curl
>command, and on the tika-server end, I see the following errors:
>
><lots of garbage characters displayed on screen, followed by>
>
>WARNING: Did not found XRef object at specified startxref position 0
>
>Being new to Tika, I would like to know whether I am doing something
>wrong, or if PDF parsing is not yet an exact science.
>
>Many thanks in advance.
>
>Sabuncu
>
>
>Working through this we have discovered we were using different commands,
>which then uncovered an error in the example on the TikaJAXRS wiki page
>where all examples, regardless of the nature of the content, use the -d
>flag (effectively --data-ascii) in the curl commands. This means that
>binary files are being processed as ASCII content.
>
>Based on the above, all that was required was to change the command from:
>
>curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
>
>To:
>
>curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
>
>I have updated the TikaJAXRS wiki page accordingly but felt it was worth
>posting back to the list for future reference.
>
>Cheers,
>Dave
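
For anyone finding this thread later, here is a minimal sketch of why -d can damage a PDF upload. Per the curl man page, -d @file strips carriage returns and newlines when it reads the file, whereas --data-binary posts the bytes unchanged. The script below does not call curl or Tika at all; it only approximates that stripping with tr on a made-up stand-in file and compares byte counts. The sample contents and the variable names are illustrative assumptions, not anything from the original report.

```shell
#!/bin/sh
# Hypothetical stand-in for a PDF: a few bytes containing CR and LF.
sample=$(mktemp)
printf '%%PDF-1.4\r\n1 0 obj\n<<>>\nendobj\n' > "$sample"

# Size of the file as --data-binary would send it (bytes untouched).
original_size=$(wc -c < "$sample" | tr -d ' ')

# Approximation of what -d @file sends: CR and LF removed.
# This simulates the documented behaviour; it is not curl itself.
stripped_size=$(tr -d '\r\n' < "$sample" | wc -c | tr -d ' ')

echo "bytes on disk:            $original_size"
echo "bytes after CR/LF strip:  $stripped_size"

rm -f "$sample"
```

A PDF's startxref entry records a byte offset into the file, so dropping bytes in transit would plausibly account for the XRef warning quoted above. That link is an inference on my part rather than something confirmed in this thread.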
