I'd like to thank David Meikle for his persistent assistance in resolving this problem. Much appreciated.
Todd

On Tue, Dec 30, 2014 at 12:50 AM, David Meikle <[email protected]> wrote:

> Hello,
>
> On 24 Dec 2014, at 20:30, A.M. Sabuncu <[email protected]> wrote:
>
> I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
> using the following curl command to test text extraction from PDF files:
>
> curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
>
> On trivial PDF files (e.g. created using Word 2010's convert-to-pdf
> functionality and containing only the text "Testing", about 81 KB in size),
> I get errors in that there's nothing returned from the curl command, and on
> the tika-server end, I see the following errors:
>
> <lots of garbage characters displayed on screen, followed by>
>
> WARNING: Did not found XRef object at specified startxref position 0
>
> Being new to Tika, I would like to know whether I am doing something
> wrong, or if PDF parsing is not yet an exact science.
>
> Many thanks in advance.
>
> Sabuncu
>
>
> Working through this we have discovered we were using different commands,
> which then uncovered an error in the example on the TikaJAXRS wiki page
> where all examples, regardless of the nature of the content, use the -d
> flag (effectively --data-ascii) in the curl commands. This means that
> binary files are being processed as ASCII content.
>
> Based on the above, all that was required was to change the command from:
>
> curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
>
> To:
>
> curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
>
> I have updated the TikaJAXRS wiki page accordingly but felt it was worth
> posting back to the list for future reference.
>
> Cheers,
> Dave
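
For anyone finding this thread later, here is a rough Python sketch of why -d can damage a PDF. It only approximates one documented behaviour of curl (when -d reads from a file, carriage returns and newlines are stripped out), and it is not curl's actual implementation. The function name and the sample bytes are made up for illustration; other differences between -d and --data-binary may also exist depending on the curl version.

```python
def strip_like_curl_d(payload: bytes) -> bytes:
    """Approximate curl's documented '-d @file' handling:
    carriage returns and newlines are removed from the file contents.
    (Hypothetical helper, not part of curl or Tika.)"""
    return payload.replace(b"\r", b"").replace(b"\n", b"")


# A made-up, PDF-like fragment. Real PDFs contain many line breaks and
# raw binary stream data, so they are affected in the same way.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n"
    b"<< /Length 4 >>\n"
    b"stream\n"
    b"\x00\r\n\xff\n"
    b"endstream\n"
    b"endobj\n"
    b"startxref\n"
    b"9\n"
    b"%%EOF\n"
)

mangled = strip_like_curl_d(sample)

# The body that would be uploaded is shorter than the file on disk, so any
# byte offsets stored inside the PDF no longer point where they should.
print("original bytes:", len(sample))
print("after stripping:", len(mangled))
print("identical:", sample == mangled)
```

Since a PDF locates its cross-reference table through byte offsets, a body altered like this would plausibly lead to warnings such as the XRef one quoted above. With --data-binary the file is sent exactly as it is on disk.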
