Hello,
I am trying to extract the text from the following pdf document:
http://cheatsheet.codeslower.com/CheatSheet.pdf
It is a "cheat sheet" for Haskell. The problem is that the Haskell code
comes out of the extraction as unreadable gibberish. However, the debug log
shows that the code is recognized.
Here is a sample of the log (the COSString objects contain the Haskell code,
so it is recognized):
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
COSArray{[COSString{pos}, COSInt{-517}, COSString{x}, COSInt{-516},
COSString{|}, COSInt{-517}, COSString{x}, COSInt{-517}, COSString{<},
COSInt{-517}, COSString{0}, COSInt{-516}, COSString{=}, COSInt{-517},
COSString{negate}, COSInt{-517}, COSString{x}]}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
PDFOperator{TJ}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
COSFloat{33.824}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
COSFloat{-14.226}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
PDFOperator{Td}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
COSArray{[COSString{|}, COSInt{-517}, COSString{otherw}, COSInt{1},
COSString{ise}, COSInt{-517}, COSString{=}, COSInt{-517}, COSString{x}]}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
PDFOperator{TJ}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
COSFloat{-33.824}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
COSFloat{-28.454}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
PDFOperator{Td}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
COSArray{[COSString{neg}, COSInt{-517}, COSString{y}, COSInt{-516},
COSString{|}, COSInt{-517}, COSString{y}, COSInt{-517}, COSString{>},
COSInt{-517}, COSString{0}, COSInt{-516}, COSString{=}, COSInt{-517},
COSString{negate}, COSInt{-517}, COSString{y}]}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
PDFOperator{TJ}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
COSFloat{33.824}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
COSFloat{-14.226}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
PDFOperator{Td}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine - token:
COSArray{[COSString{|}, COSInt{-517}, COSString{otherw}, COSInt{1},
COSString{ise}, COSInt{-517}, COSString{=}, COSInt{-517}, COSString{y}]}
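As far as I can tell, each TJ array above alternates text fragments with integer position adjustments, where a large negative value such as -517 leaves a gap about the width of a space and a small value such as 1 is only kerning. A minimal sketch that reassembles one logged line under that reading; the class name TjArrayJoin and the -200 threshold are my own assumptions for illustration, not anything PDFBox does:

```java
import java.util.Arrays;
import java.util.List;

public class TjArrayJoin {
    // Assumed cutoff: adjustments at or below this value are treated as a word gap.
    static final int SPACE_THRESHOLD = -200;

    // Concatenates the string fragments of a TJ-style array, inserting a
    // space wherever the numeric adjustment is at or below the threshold.
    static String join(List<Object> tjArray) {
        StringBuilder sb = new StringBuilder();
        for (Object item : tjArray) {
            if (item instanceof String) {
                sb.append((String) item);
            } else if (item instanceof Integer && (Integer) item <= SPACE_THRESHOLD) {
                sb.append(' ');
            }
            // Other numbers (e.g. the 1 between "otherw" and "ise") are ignored.
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values copied from the second COSArray in the log above.
        List<Object> line = Arrays.<Object>asList(
                "|", -517, "otherw", 1, "ise", -517, "=", -517, "x");
        System.out.println(join(line)); // prints "| otherwise = x"
    }
}
```

So the operands in the content stream already spell out readable Haskell.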
However, the text extraction gives something like this:
a74a117a115a116 a40a70a105a114a115a116 a95a41 a45a62
a34a70a105a114a115a116a33a34
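For what it's worth, every token in that output looks like the letter "a" followed by a decimal character code (a74 would be 'J', a117 would be 'u'). A minimal sketch that decodes the output under that assumption; the class name GlyphNameDecoder is made up for illustration, and the approach would also mangle any genuine text that happens to contain "a" followed by digits:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GlyphNameDecoder {
    // Matches one token of the form a<decimal code>, e.g. "a74".
    private static final Pattern TOKEN = Pattern.compile("a(\\d+)");

    // Replaces every a<NNN> token with the character whose code is NNN;
    // everything else (such as the spaces between words) is kept as-is.
    static String decode(String extracted) {
        Matcher m = TOKEN.matcher(extracted);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1));
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two lines of extracted output quoted above.
        System.out.println(decode("a74a117a115a116 a40a70a105a114a115a116 a95a41 a45a62"));
        System.out.println(decode("a34a70a105a114a115a116a33a34"));
    }
}
```

Run on the sample, this gives `Just (First _) ->` and `"First!"`, which does look like Haskell from the cheat sheet, so the extracted output seems to carry glyph names rather than the characters themselves.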
I have used the PDFTextStripper (as used in the ExtractText example).
Code snippet:
// "document" is the PDDocument loaded earlier (not shown here)
PDFTextStripper stripper = new PDFTextStripper();
int startPage = 1;
int endPage = Integer.MAX_VALUE;  // i.e. extract through the last page
stripper.setSortByPosition(false);
stripper.setStartPage(startPage);
stripper.setEndPage(endPage);
File outputFile = new File("pdf2txt.txt");
Writer output = new OutputStreamWriter(new FileOutputStream(outputFile));
try {
    stripper.writeText(document, output);
} finally {
    output.close();  // flush the extracted text to disk
}
Am I doing something wrong?
Best regards,
Stefan