Hi,
Your screenshots didn't get through. So many things can go wrong with a
PDF that it's difficult to tell without the file.
"then pasted the text into a text editor. The text pasted this way
matches the extracted text"
That means PDFBox is extracting correctly. It's possible that the Unicode
text assigned to a glyph is wrong. Sometimes this is done intentionally to
make text extraction difficult. It could also be the result of crappy OCR.
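To illustrate how display and extraction can disagree, here is a minimal toy model in plain Java (it does not use PDFBox). It assumes a font carries two independent tables: character code to glyph shape (what a viewer draws) and character code to Unicode (what extraction and copy/paste return). The class name, the codes, and the table contents are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class GlyphVsUnicode {
    // Hypothetical table: character code -> glyph shape drawn by a viewer.
    static final Map<Integer, String> GLYPHS = new HashMap<>();
    // Hypothetical table: character code -> Unicode text used for extraction.
    static final Map<Integer, String> TO_UNICODE = new HashMap<>();

    static {
        GLYPHS.put(1, "C"); TO_UNICODE.put(1, "C");
        GLYPHS.put(2, "U"); TO_UNICODE.put(2, "O"); // deliberately wrong entry
        GLYPHS.put(3, "S"); TO_UNICODE.put(3, "S");
        GLYPHS.put(4, "T"); TO_UNICODE.put(4, "T");
    }

    // What a viewer would show: look up the glyph shape for each code.
    static String render(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(GLYPHS.get(code));
        }
        return sb.toString();
    }

    // What extraction or copy/paste would return: look up the Unicode text.
    static String extract(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(TO_UNICODE.get(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 3, 4};
        System.out.println("displayed: " + render(codes)); // displayed: CUST
        System.out.println("extracted: " + extract(codes)); // extracted: COST
    }
}
```

In this model both the viewer and the extractor behave correctly; only the Unicode table is wrong, which would also explain why copy/paste from Reader matches the extracted text.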
"I have not yet managed, with any of them, to get an uncompressed text
document that shows the PDF commands and their arguments in readable form"
Try PDFBox PDFDebugger!
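As background on why the raw file is not readable in a text editor: content streams are commonly stored with the FlateDecode filter, which is zlib/deflate compression. The sketch below uses only the Java standard library to show a round trip on a made-up content stream. It does not parse a PDF; the class name and the sample stream are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class FlateDemo {
    // Made-up content stream: begin text, set font, move, show a string, end text.
    static final String SAMPLE = "BT /F1 12 Tf 72 720 Td (CUST ID) Tj ET";

    // Compress bytes in zlib format, as a FlateDecode stream would be stored.
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompress zlib-format bytes back into the readable operators.
    static byte[] inflate(byte[] compressed) {
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] stored = deflate(SAMPLE.getBytes(StandardCharsets.US_ASCII));
        // The stored bytes are binary; inflating them recovers the operators.
        System.out.println(new String(inflate(stored), StandardCharsets.US_ASCII));
    }
}
```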
Tilman
On 23.01.2022 at 19:02, Ralph Cook wrote:
I am using PDFBox's PDFTextStripper.getText() for a particular kind of
PDF file generated by a government agency, and the text I'm getting
does not match that displayed by Acrobat Reader for the same files.
The getText() calls occasionally get characters Reader does not
display, and in one case getText() gets an "O" instead of the "U"
displayed by Reader. I would like to know if there's some way I can
get the same text as Reader displays.
The text from Reader is "correct", i.e., it is (clearly) the text
intended by the program(s) generating the files. The extracted text
contains typos and misspelled words.
Unfortunately, I cannot share any of the PDF files. They contain
confidential information.
The rest of this email relates various things I have tried, mostly to
understand the problem better.
I copied the text within Reader, just using control-A / control-C,
then pasted the text into a text editor. The text pasted this way
matches the extracted text, not the Reader-displayed text (the
copied/pasted text does not have the line breaks that getText()
gives). With my newfound (very limited) knowledge of how PDFs are
constructed, this made me wonder if some of the content displayed by
Reader is somewhere other than the Tj streams in the document.
I've downloaded and attempted to extract information with various
tools -- mupdf, qpdf, and XpdfReader, so far. I've found it difficult
to figure out how to use them, mostly because their help text assumes
you know things about PDF that I'm still trying to learn. I have not
yet managed, with any of them, to get an uncompressed text document
that shows the PDF commands and their arguments in readable form. I
thought if I could do that I might at least figure out the location of
the information that is displayed by Reader but not extracted by
PDFBox. I haven't gotten much that is useful out of them yet.
I downloaded PDFBox source and stepped through code to follow how
getText() works. I ran across the LegacyPDFStreamEngine class comments
indicating that it is only to be used for PDFTextStripper. At least
sometimes, a word from the file is passed to
PDFTextStripper.showText(byte[] string) as a byte array of PDF letter
codes, and then showGlyph() is called on each one. Oddly, the spacing
for each glyph is not a constant, which I expected it to be for a
fixed-width font, but if it's only used for extraction, I guess that
doesn't matter.
I put a trace statement on PDFTextStripper.processTextPosition(); for
every character on page 6 of a particular document, it displayed the
page number, character string, the flag indicating whether the
character is to be shown (always true), and the X and Y position of
the character. I put the result into a spreadsheet and sorted it by Y
then by X, to see if the Reader-displayed characters showed up out of
sequence. None of them do.
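The sort-by-Y-then-X check described above can be sketched in plain Java rather than a spreadsheet. This is a minimal illustration, not PDFBox code: the Glyph class, its fields, and the sample coordinates are hypothetical, and it assumes glyphs on the same line report exactly the same Y value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortGlyphs {
    // Hypothetical record of one traced character and its position.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Return a copy ordered by Y first, then by X within the same Y.
    static List<Glyph> sortByYThenX(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                              .thenComparingDouble(g -> g.x));
        return sorted;
    }

    // Concatenate the characters in list order.
    static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up trace: characters emitted out of reading order on one line.
        List<Glyph> trace = Arrays.asList(
            new Glyph("S", 30f, 100f),
            new Glyph("C", 10f, 100f),
            new Glyph("T", 40f, 100f),
            new Glyph("U", 20f, 100f));
        System.out.println(join(trace));               // SCTU
        System.out.println(join(sortByYThenX(trace))); // CUST
    }
}
```

If the sorted order and the emitted order produce the same text, as reported above, the discrepancy is not caused by characters being drawn out of sequence.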
In the case of "O" instead of "U" -- part of the page header on the
Reader displayed printout has a line with "CUST ID" on it; for pages
2-5 of this file, the extracted text shows "CUST ID" correctly, but
"COST ID" on the 6th page.
Here's the Reader version of those lines:
And the extracted text of the same lines.
The "CUST ID" is part of a page header; on pages 2-5, "CUST ID" is
displayed and extracted correctly. This is the only case I've noticed
so far where there's a seeming change in a character, as opposed to
extra characters.
Here are some other redacted lines from the Reader display of this
report:
And here is the extracted text from the same part of the file:
I included these images inline; I also attached them, since I don't
know what facilities people have to read inline attachments.
The similarity of the errors on these lines -- that all three of the
error lines had dates in February in the second position on the line
and all had the same error -- must mean something, but I don't know what.
I've got other information, but I don't know how much of it (or of
what I've provided) is helpful.
I do not expect anyone to 'solve the problem' based on this
information. But I was hoping to get pointers to ways I could attempt
to get the same text that Acrobat Reader displays, hopefully using
PDFBox, but I'll change libraries or methods if I need to.
rc
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
---------------------------------------------------------------------