Re: Parsing PDF files

A.M. Sabuncu Thu, 25 Dec 2014 02:01:10 -0800

OK, I obtained GeoSPARQL.pdf file from here:
http://www.w3.org/2011/02/GeoSPARQL.pdf

I first tried the following command line:

*curl -T GeoSPARQL.pdf http://localhost:9998/tika --header
"Content-type: application/pdf"*

I got nothing back from the above curl command, and the server dumped the
following on screen, part of a longer trace:

*Caused by: java.io.IOException: Push back buffer is full*

I did some research and tried starting tika-server as follows, to raise the
PDFBox push-back buffer size property to 1 GB:

*java -Dorg.apache.pdfbox.baseParser.pushBackSize=1073741824 -jar
tika-server-1.6.jar*

I still got nothing back from the curl command, but this time the server did
not produce a stack trace, just the following output:

*Dec 25, 2014 9:40:33 AM org.apache.tika.server.TikaResource logRequest
INFO: tika (application/pdf)*

I have a feeling I am missing something rudimentary.

I am running tika-server on an AWS Ubuntu instance, and issuing the curl
commands from a Windows 7 system. I downloaded and built Tika 1.6 from
apache.org/dist/tika, with timestamp 2014-09-05 05:42.
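
For anyone who wants to reproduce the request without curl, here is a minimal
Python sketch of the same PUT upload. It assumes tika-server is reachable at
http://localhost:9998/tika as in the commands above; the helper names
build_tika_put and extract_text are hypothetical, and this is only one way to
send the file, not an official Tika client.

```python
import urllib.request

# Assumption: same endpoint as in the curl commands above.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_put(pdf_bytes, url=TIKA_URL):
    """Build (but do not send) a PUT request whose body is the raw PDF bytes."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf"},
    )


def extract_text(path, url=TIKA_URL):
    """Upload the file at `path` and return whatever body the server sends back."""
    # Open in binary mode so the bytes reach the server unmodified.
    with open(path, "rb") as f:
        request = build_tika_put(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling extract_text("GeoSPARQL.pdf") should correspond to the curl -T command.
Sending the bytes unmodified matters for a binary format like PDF: curl's -d
option strips carriage returns and newlines when it reads the data from a
file, whereas -T and --data-binary send the file as-is.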

Thanks so much, happy holidays.

On Thu, Dec 25, 2014 at 8:02 AM, Nick Burch <[email protected]> wrote:

> On Wed, 24 Dec 2014, A.M. Sabuncu wrote:
>
>> I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
>> using the following curl command to test text extraction from PDF files:
>>
>> curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header
>> "Content-type: application/pdf"
>>
>
> What happens if you try
>
> curl -T GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type:
> application/pdf"
>
> ? That works fine for me for a test pdf
>
> Nick
>
