Re: Parsing PDF files

A.M. Sabuncu Tue, 30 Dec 2014 10:14:49 -0800

I'd like to thank David Meikle for his persistent assistance in resolving
this problem.  Much appreciated.


Todd

On Tue, Dec 30, 2014 at 12:50 AM, David Meikle <[email protected]> wrote:

> Hello,
>
> On 24 Dec 2014, at 20:30, A.M. Sabuncu <[email protected]> wrote:
>
> I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
> using the following curl command to test text extraction from PDF files:
>
> curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header 
> "Content-type: application/pdf"
>
> On trivial PDF files (e.g. created using Word 2010's convert-to-pdf
> functionality and containing only the text "Testing", about 81 KB in size),
> I get errors in that there's nothing returned from the curl command, and on
> the tika-server end, I see the following errors:
>
> <lots of garbage characters displayed on screen, followed by>
>
> WARNING: Did not found XRef object at specified startxref position 0
>
> Being new to Tika, I would like to know whether I am doing something
> wrong, or if PDF parsing is not yet an exact science.
>
> Many thanks in advance.
>
> Sabuncu
>
>
> Working through this, we discovered that we were using different commands.
> That uncovered an error on the TikaJAXRS wiki page: every example,
> regardless of the nature of the content, uses the -d flag (effectively
> --data-ascii) in its curl command.  With -d, curl treats the file as text
> and strips carriage returns and newlines, so binary files such as PDFs
> arrive at the server corrupted.
>
> Based on the above, all that was required was to change the command from:
>
> curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header
> "Content-type: application/pdf"
>
> To:
>
> curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header
> "Content-type: application/pdf"
>
> I have updated the TikaJAXRS wiki page accordingly but felt it was worth
> posting back to the list for future reference.
>
> Cheers,
> Dave
>
>
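
For anyone who finds this thread later, here is a rough sketch of why the -d
flag breaks binary uploads. Per the curl manual, -d strips carriage returns
and newlines when it reads data from a file, whereas --data-binary sends the
bytes untouched. The script below only imitates that stripping with tr; it
does not call curl or Tika. The file names and the sample bytes are made up
for illustration.

```shell
#!/bin/sh
# Simulation only: no curl, Tika server, or network access is involved.
# sample.bin and stripped.bin are hypothetical stand-ins for a PDF and for
# what the server would receive after "curl -d @file".
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.bin"
stripped="$tmpdir/stripped.bin"

# Write a few PDF-like bytes that include two CR LF pairs.
printf '%%PDF-1.4\r\nstartxref\r\n' > "$sample"

# Imitate the documented behaviour of -d: drop every CR and LF byte.
tr -d '\r\n' < "$sample" > "$stripped"

# Compare the sizes; tr -d ' ' removes padding some wc versions print.
original_size=$(wc -c < "$sample" | tr -d ' ')
stripped_size=$(wc -c < "$stripped" | tr -d ' ')
echo "original: $original_size bytes, after stripping: $stripped_size bytes"

rm -rf "$tmpdir"
```

Every byte after the first removed line ending ends up at a different offset
than the file itself records, which would plausibly explain the XRef warning
quoted above, since a PDF locates its cross-reference table by byte offset.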
