source code extraction problem

Stefan Chis Sun, 07 Feb 2010 02:54:29 -0800

Hello,

I am trying to extract the text from the following pdf document:
http://cheatsheet.codeslower.com/CheatSheet.pdf


It is a "cheat sheet" for Haskell. The problem is that the Haskell code
comes out of the extraction as unreadable gibberish. However, the log output
shows that the code is recognized.

Here is a sample of the log (the COSString objects contain the Haskell code,
so it is recognized):

2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
COSArray{[COSString{pos}, COSInt{-517}, COSString{x}, COSInt{-516},
COSString{|}, COSInt{-517}, COSString{x}, COSInt{-517}, COSString{<},
COSInt{-517}, COSString{0}, COSInt{-516}, COSString{=}, COSInt{-517},
COSString{negate}, COSInt{-517}, COSString{x}]}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
PDFOperator{TJ}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
COSFloat{33.824}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
COSFloat{-14.226}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
PDFOperator{Td}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
COSArray{[COSString{|}, COSInt{-517}, COSString{otherw}, COSInt{1},
COSString{ise}, COSInt{-517}, COSString{=}, COSInt{-517}, COSString{x}]}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
PDFOperator{TJ}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
COSFloat{-33.824}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
COSFloat{-28.454}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
PDFOperator{Td}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
COSArray{[COSString{neg}, COSInt{-517}, COSString{y}, COSInt{-516},
COSString{|}, COSInt{-517}, COSString{y}, COSInt{-517}, COSString{>},
COSInt{-517}, COSString{0}, COSInt{-516}, COSString{=}, COSInt{-517},
COSString{negate}, COSInt{-517}, COSString{y}]}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
PDFOperator{TJ}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
COSFloat{33.824}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
COSFloat{-14.226}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
PDFOperator{Td}
2516 [main] DEBUG org.apache.pdfbox.util.PDFStreamEngine  - token:
COSArray{[COSString{|}, COSInt{-517}, COSString{otherw}, COSInt{1},
COSString{ise}, COSInt{-517}, COSString{=}, COSInt{-517}, COSString{y}]}
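
For reference, the operand of each TJ operator in the log is an array that interleaves string fragments with position adjustments, expressed in thousandths of a text-space unit; negative values widen the gap before the next fragment. The following is a minimal sketch of how one such array reads as a line of code. It is not how PDFBox processes the array: the class name, the method and the -200 threshold are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

public class TjArrayText {
    // Hypothetical cutoff: adjustments at or below this value are treated as a word gap.
    static final int GAP_THRESHOLD = -200;

    // Concatenates the string fragments of a TJ array, inserting a space
    // wherever a numeric adjustment is large enough to look like a word gap.
    static String join(List<Object> tjArray) {
        StringBuilder sb = new StringBuilder();
        for (Object item : tjArray) {
            if (item instanceof String) {
                sb.append((String) item);
            } else if (item instanceof Integer
                    && ((Integer) item) <= GAP_THRESHOLD) {
                sb.append(' ');
            }
            // Small adjustments (such as the 1 inside "otherwise") are kerning and are ignored.
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values copied from the second TJ array in the log above.
        List<Object> tj = Arrays.<Object>asList(
                "|", -517, "otherw", 1, "ise", -517, "=", -517, "x");
        System.out.println(join(tj)); // prints "| otherwise = x"
    }
}
```

Read this way, the arrays in the log correspond to lines such as "pos x | x < 0 = negate x", which is why the code looks recognizable at the token level.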


However, the text extraction gives something like this:
a74a117a115a116 a40a70a105a114a115a116 a95a41 a45a62
a34a70a105a114a115a116a33a34
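
Each token in this output looks like the letter "a" followed by a decimal character code (a74 is "J", a117 is "u", and so on), which suggests that glyph names are being written out instead of characters. Assuming that pattern holds for the whole document, the following post-processing sketch turns the output back into text. It is a workaround, not a fix in PDFBox; the class and method names are hypothetical, and it would also mangle any genuine text that happens to match "a" followed by digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GlyphNameDecoder {
    // Assumed token shape: 'a' followed by a 1-3 digit decimal character code.
    private static final Pattern TOKEN = Pattern.compile("a(\\d{1,3})");

    // Replaces every aNNN token with the character whose code is NNN;
    // everything else (spaces, line breaks) is copied through unchanged.
    static String decode(String extracted) {
        Matcher m = TOKEN.matcher(extracted);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The sample output quoted above, joined onto one line.
        String sample = "a74a117a115a116 a40a70a105a114a115a116 a95a41 a45a62 "
                + "a34a70a105a114a115a116a33a34";
        System.out.println(decode(sample)); // prints: Just (First _) -> "First!"
    }
}
```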


I am using PDFTextStripper, as in the ExtractText example.

Code snippet:

            PDFTextStripper stripper = new PDFTextStripper();

            int startPage = 1;
            int endPage = Integer.MAX_VALUE;
            stripper.setSortByPosition(false);
            stripper.setStartPage(startPage);
            stripper.setEndPage(endPage);

            File outputFile = new File("pdf2txt.txt");
            Writer output = new OutputStreamWriter(
                    new FileOutputStream(outputFile));
            stripper.writeText(document, output);
            // Close the writer so that buffered text is flushed to the file.
            output.close();

Am I doing something wrong?


Best regards,
Stefan
