Re: Parsing PDF files

A.M. Sabuncu Sun, 28 Dec 2014 01:11:47 -0800

Update: I have since written a small program using PDFBox (1.8.8) to
extract text from PDF files.  PDFBox has parsed nearly every PDF I have
handed it, including the files that gave me problems with tika-server.
Even when a file has problems and causes an error, PDFBox reports the
error but still extracts text from the file.


FYI.
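
P.S. For anyone who would rather script the upload than use curl, here is
a rough sketch of what the "curl -T" command quoted below does: an HTTP
PUT of the file's raw bytes with a Content-Type of application/pdf.  As far
as I understand it, "curl -d @file" strips carriage returns and newlines
from the file, which would corrupt a binary PDF, whereas "-T" (or
"--data-binary") sends the bytes untouched.  The class name TikaPut and
the method putPdf are made up for illustration, and decoding the response
as UTF-8 is my assumption, not something I have checked against the
server.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: PUTs a PDF to a running tika-server.
class TikaPut {

    // Sends pdfBytes unmodified in the body of a PUT request and returns
    // the response body decoded as UTF-8 (the charset is an assumption).
    static String putPdf(String endpoint, byte[] pdfBytes) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(endpoint).openConnection();
        try {
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/pdf");
            conn.setFixedLengthStreamingMode(pdfBytes.length);

            // Write the raw bytes; nothing is stripped or re-encoded.
            try (OutputStream out = conn.getOutputStream()) {
                out.write(pdfBytes);
            }

            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Server returned HTTP " + status);
            }

            // Read the whole response body into memory.
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the PDF; args[1] (optional): server endpoint.
        String endpoint = args.length > 1 ? args[1] : "http://localhost:9998/tika";
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(putPdf(endpoint, pdf));
    }
}
```

Usage would be something like "java TikaPut GeoSPARQL.pdf", with the
endpoint defaulting to the localhost URL from the curl examples.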

On Thu, Dec 25, 2014 at 11:59 AM, A.M. Sabuncu <[email protected]> wrote:

> OK, I obtained GeoSPARQL.pdf file from here:
> http://www.w3.org/2011/02/GeoSPARQL.pdf
>
> I first tried the following command line:
>
> curl -T GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
>
> I got nothing back from the above curl command, and the server dumped the
> following on screen, part of a longer trace:
>
> Caused by: java.io.IOException: Push back buffer is full
>
> Did research, and tried starting tika-server as follows to increase the
> property in question to 1 GB:
>
>
> java -Dorg.apache.pdfbox.baseParser.pushBackSize=1073741824 -jar tika-server-1.6.jar
>
> I still got nothing back from the curl command, but the server did not
> produce a stack trace, instead just the following output:
>
>
> Dec 25, 2014 9:40:33 AM org.apache.tika.server.TikaResource logRequest
> INFO: tika (application/pdf)
>
> Have a feeling maybe I am missing something rudimentary.
>
> I am running tika-server on an AWS Ubuntu instance, and issuing the curl
> commands from a Windows 7 system.  I downloaded and built Tika 1.6 from
> apache.org/dist/tika, with timestamp 2014-09-05 05:42.
>
> Thanks so much, happy holidays.
>
>
> On Thu, Dec 25, 2014 at 8:02 AM, Nick Burch <[email protected]> wrote:
>
>> On Wed, 24 Dec 2014, A.M. Sabuncu wrote:
>>
>>> I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
>>> using the following curl command to test text extraction from PDF files:
>>>
>>> curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header
>>> "Content-type: application/pdf"
>>>
>>
>> What happens if you try
>>
>> curl -T GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
>>
>> ? That works fine for me for a test PDF
>>
>> Nick
>>
>
>
