Update: I have since written a small program that uses PDFBox (1.8.8) directly to extract text from PDF files. PDFBox parses pretty much every PDF I hand it, including the files that were giving me trouble through tika-server. Even when a file is problematic and triggers an error, PDFBox reports the error but still extracts the text.
FYI.

On Thu, Dec 25, 2014 at 11:59 AM, A.M. Sabuncu <[email protected]> wrote:

> OK, I obtained GeoSPARQL.pdf file from here:
> http://www.w3.org/2011/02/GeoSPARQL.pdf
>
> I first tried the following command line:
>
> curl -T GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
>
> I got nothing back from the above curl command, and the server dumped the
> following on screen, part of a longer trace:
>
> Caused by: java.io.IOException: Push back buffer is full
>
> Did research, and tried starting tika-server as follows to increase the
> property in question to 1 GB:
>
> java -Dorg.apache.pdfbox.baseParser.pushBackSize=1073741824 -jar tika-server-1.6.jar
>
> I still got nothing back from the curl command, but the server did not
> produce a stack trace, instead just the following output:
>
> Dec 25, 2014 9:40:33 AM org.apache.tika.server.TikaResource logRequest
> INFO: tika (application/pdf)
>
> Have a feeling maybe I am missing something rudimentary.
>
> I am running tika-server on an AWS Ubuntu instance, and issuing the curl
> commands from a Windows 7 system. I downloaded and built Tika 1.6 from
> apache.org/dist/tika, with timestamp 2014-09-05 05:42.
>
> Thanks so much, happy holidays.
>
> On Thu, Dec 25, 2014 at 8:02 AM, Nick Burch <[email protected]> wrote:
>
>> On Wed, 24 Dec 2014, A.M. Sabuncu wrote:
>>
>>> I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and
>>> using the following curl command to test text extraction from PDF files:
>>>
>>> curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
>>
>> What happens if you try
>>
>> curl -T GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
>>
>> ? That works fine for me for a test pdf
>>
>> Nick
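
For anyone who wants to repeat the tika-server test from the thread without curl, here is a minimal sketch of the same request in plain Java. It mirrors the `curl -T` form: the file is sent as the raw body of a PUT with the Content-Type header set. That may matter because curl's documentation notes that `-d` strips carriage returns and newlines when it reads from a file. The class name TikaPutSketch and the method putFile are my own hypothetical names, the endpoint is the one used in the thread, and decoding the response as UTF-8 is an assumption. It uses only JDK classes and is not a fix for the "Push back buffer is full" error.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika or PDFBox.
final class TikaPutSketch {

    // PUTs the file's bytes unchanged to the endpoint and returns the response body.
    static String putFile(String endpoint, Path file, String contentType) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
        try {
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", contentType);
            // Known length: the body is streamed instead of buffered in memory.
            conn.setFixedLengthStreamingMode(Files.size(file));
            try (OutputStream out = conn.getOutputStream()) {
                Files.copy(file, out); // raw bytes, no newline stripping
            }

            int status = conn.getResponseCode();
            if (status / 100 != 2) {
                throw new IOException("Server answered HTTP " + status);
            }

            ByteArrayOutputStream body = new ByteArrayOutputStream();
            try (InputStream in = conn.getInputStream()) {
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    body.write(chunk, 0, n);
                }
            }
            // Assumption: the server replies with UTF-8 text.
            return new String(body.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java TikaPutSketch <file.pdf> [endpoint]");
            return;
        }
        // Default endpoint taken from the curl commands in the thread.
        String endpoint = args.length > 1 ? args[1] : "http://localhost:9998/tika";
        System.out.println(putFile(endpoint, Paths.get(args[0]), "application/pdf"));
    }
}
```

Run against the file from the thread, this would be `java TikaPutSketch GeoSPARQL.pdf`; an empty line of output would correspond to the "got nothing back" result described above.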
