I am using PDFBox's PDFTextStripper.getText() for a particular kind of
PDF file generated by a government agency, and the text I'm getting does
not match that displayed by Acrobat Reader for the same files. The
getText() calls occasionally get characters Reader does not display, and
in one case getText() gets an "O" instead of the "U" displayed by
Reader. I would like to know if there's some way I can get the same text
that Reader displays.
The text from Reader is "correct", i.e., it is (clearly) the text
intended by the program(s) generating the files. The extracted text
contains typos and misspelled words.
Unfortunately, I cannot share any of the PDF files. They contain
confidential information.
The rest of this email relates various things I have tried, mostly to
understand the problem better.
I copied the text within Reader, just using control-A / control-C, then
pasted the text into a text editor. The text pasted this way matches the
extracted text, not the Reader-displayed text (except that the pasted text
does not have the line breaks that getText() gives). With my newfound
(very limited) knowledge of how PDFs are constructed, this made me
wonder if some of the content displayed by Reader is somewhere other
than the Tj streams in the document.
I've downloaded and attempted to extract information with various tools
-- mupdf, qpdf, and XpdfReader, so far. I've found it difficult to
figure out how to use them, mostly because their help text assumes you
know things about PDF that I'm still trying to learn. I have not yet
managed, with any of them, to get an uncompressed text document that
shows the PDF commands and their arguments in readable form. I thought
if I could do that I might at least figure out the location of the
information that is displayed by Reader but not extracted by PDFBox. I
haven't gotten much useful out of them yet.
I downloaded PDFBox source and stepped through code to follow how
getText() works. I ran across the LegacyPDFStreamEngine class comments
indicating that it is only to be used for PDFTextStripper. At least
sometimes, a word from the file is passed to
PDFTextStripper.showText(byte[] string) as a byte array of PDF letter
codes, and then showGlyph() is called on each one. Oddly, the spacing
for each glyph is not a constant, which I expected it to be for a
fixed-width font, but if it's only used for extraction, I guess that
doesn't matter.
I put a trace statement on PDFTextStripper.processTextPosition(); for
every character on page 6 of a particular document, it displayed the
page number, character string, the flag indicating whether the character
is to be shown (always true), and the X and Y position of the character.
I put the result into a spreadsheet and sorted it by Y then by X, to see
if the Reader-displayed characters showed up out of sequence. None of
them do.
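
For reference, the sort-by-Y-then-X check described above can also be
done in code instead of a spreadsheet. This is only a minimal sketch,
assuming each trace line has already been parsed into a small value
object. The TraceRow class, its field names, and the sample rows are
hypothetical and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TraceSort {
    /** Hypothetical holder for one traced character; not a PDFBox type. */
    static final class TraceRow {
        final int page;
        final String text;
        final double x;
        final double y;

        TraceRow(int page, String text, double x, double y) {
            this.page = page;
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns a copy of the rows ordered by Y, then by X within equal Y. */
    static List<TraceRow> sortByYThenX(List<TraceRow> rows) {
        List<TraceRow> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingDouble((TraceRow r) -> r.y)
                .thenComparingDouble(r -> r.x));
        return sorted;
    }

    /** Concatenates the text of the rows in their current order. */
    static String joinText(List<TraceRow> rows) {
        StringBuilder sb = new StringBuilder();
        for (TraceRow r : rows) {
            sb.append(r.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up rows, deliberately out of order, all on one line (same Y).
        List<TraceRow> rows = new ArrayList<>();
        rows.add(new TraceRow(6, "S", 30.0, 100.0));
        rows.add(new TraceRow(6, "C", 10.0, 100.0));
        rows.add(new TraceRow(6, "T", 40.0, 100.0));
        rows.add(new TraceRow(6, "U", 20.0, 100.0));
        System.out.println(joinText(sortByYThenX(rows)));
    }
}
```

The sort only groups characters whose Y values are exactly equal, so
glyphs on the same visual line with slightly different baselines would
need some tolerance. Whether ascending Y means top-to-bottom depends on
which coordinate was traced.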
In the case of "O" instead of "U" -- part of the page header on the
Reader-displayed printout has a line with "CUST ID" on it; for pages 2-5
of this file, the extracted text shows "CUST ID" correctly, but it shows
"COST ID" on the 6th page.
Here's the Reader version of those lines:
And the extracted text of the same lines.
The "CUST ID" is part of a page header; on pages 2-5, "CUST ID" is
displayed and extracted correctly. This is the only case I've noticed so
far where there's a seeming change in a character, as opposed to extra
characters.
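
A substitution like this one can be pinned down by comparing the two
versions of a line position by position and printing the code points.
The following is a rough sketch only: the LineDiff class and its method
are hypothetical helpers, it compares UTF-16 units one for one, and it is
only meaningful for substitutions, not for lines where the extracted
text has extra characters (those shift every later position).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LineDiff {
    /**
     * Describes every index at which the two strings differ, comparing
     * position by position over their common length.
     */
    static List<String> differences(String expected, String actual) {
        List<String> out = new ArrayList<>();
        int common = Math.min(expected.length(), actual.length());
        for (int i = 0; i < common; i++) {
            char e = expected.charAt(i);
            char a = actual.charAt(i);
            if (e != a) {
                out.add(String.format(Locale.ROOT,
                        "index %d: expected '%c' (U+%04X), got '%c' (U+%04X)",
                        i, e, (int) e, a, (int) a));
            }
        }
        // A length mismatch is reported once rather than character by character.
        if (expected.length() != actual.length()) {
            out.add("length differs: expected " + expected.length()
                    + ", got " + actual.length());
        }
        return out;
    }
}
```

Calling LineDiff.differences("CUST ID", "COST ID") reports a single
difference at index 1, U+0055 ("U") versus U+004F ("O").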
Here are some other redacted lines from the Reader display of this report:
And here is the extracted text from the same part of the file:
I included these images inline; I also attached them, since I don't know
what facilities people have to read inline attachments.
The similarity of the errors on these lines -- that all three of the
error lines had dates in February in the second position on the line and
all had the same error -- must mean something, but I don't know what.
I've got other information, but I don't know how much of it (or of what
I've provided) is helpful.
I do not expect anyone to 'solve the problem' based on this information.
But I was hoping to get pointers to ways I could attempt to get the same
text that Acrobat Reader displays, hopefully using PDFBox, but I'll
change libraries or methods if I need to.
rc