Re: Problem with text extraction

Tilman Hausherr Sun, 23 Jan 2022 10:39:47 -0800

Hi,

Your screenshots didn't get through. There are so many things that can go wrong with PDF, so it's difficult to tell without the file.

"then pasted the text into a text editor. The text pasted this way matches the extracted text"

Then it means PDFBox is correct. It's possible that the Unicode text for a glyph is a wrong one. Sometimes that is done intentionally to make text extraction difficult. It could also be crappy OCR.
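
To illustrate how the displayed and the extracted text can differ, here is a toy model in plain Java. It is not PDFBox code, and every name in it (ToyFont, define, render, extract) is made up for the example. The assumption is the usual one: a viewer draws whatever glyph the font holds for a character code, while extraction and copy/paste look the same code up in a separate code-to-Unicode mapping.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical class, not part of PDFBox.
final class ToyFont {
    // Character code -> the glyph a viewer would draw on screen.
    private final Map<Integer, String> glyphForCode = new HashMap<>();
    // Character code -> the Unicode text used for extraction and copy/paste.
    private final Map<Integer, String> unicodeForCode = new HashMap<>();

    void define(int code, String drawnGlyph, String unicode) {
        glyphForCode.put(code, drawnGlyph);
        unicodeForCode.put(code, unicode);
    }

    // What a viewer shows: looks only at the glyphs.
    String render(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(glyphForCode.getOrDefault(code, "?"));
        }
        return sb.toString();
    }

    // What extraction returns: looks only at the Unicode mapping.
    String extract(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(unicodeForCode.getOrDefault(code, "?"));
        }
        return sb.toString();
    }
}
```

If code 2 is defined with the glyph "U" but the Unicode value "O", render() gives "CUST" and extract() gives "COST" for the same codes. Whether that is what happens in your file can only be checked by looking at the fonts on that page.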

"I have not yet managed, with any of them, to get an uncompressed text document that shows the PDF commands and their arguments in readable form"


Try PDFBox PDFDebugger!

Tilman

On 23.01.2022 at 19:02, Ralph Cook wrote:

I am using PDFBox's PDFTextStripper.getText() for a particular kind of PDF file generated by a government agency, and the text I'm getting does not match that displayed by Acrobat Reader for the same files. The getText() calls occasionally get characters Reader does not display, and in one case getText() gets an "O" instead of the "U" displayed by Reader. I would like to know if there's some way I can get the same text as Reader displays.
The text from Reader is "correct", i.e., it is (clearly) the text intended by the program(s) generating the files. The extracted text contains typos and misspelled words.
Unfortunately, I cannot share any of the PDF files. They contain confidential information.
The rest of this email relates various things I have tried, mostly to understand the problem better.
I copied the text within Reader, just using control-A / control-C, then pasted the text into a text editor. The text pasted this way matches the extracted text, not the Reader-displayed text (the copied/pasted text does not have the line breaks that getText() gives). With my newfound (very limited) knowledge of how PDFs are constructed, this made me wonder if some of the content displayed by Reader is somewhere other than the Tj streams in the document.
I've downloaded and attempted to extract information with various tools -- mupdf, qpdf, and XpdfReader, so far. I've found it difficult to figure out how to use them, mostly because their help text assumes you know things about PDF that I'm still trying to learn. I have not yet managed, with any of them, to get an uncompressed text document that shows the PDF commands and their arguments in readable form. I thought if I could do that I might at least figure out the location of the information that is displayed by Reader but not extracted by PDFBox. I haven't gotten much useful out of them yet.
I downloaded PDFBox source and stepped through code to follow how getText() works. I ran across the LegacyPDFStreamEngine class comments indicating that it is only to be used for PDFTextStripper. At least sometimes, a word from the file is passed to PDFTextStripper.showText(byte[] string) as a byte array of PDF letter codes, and then showGlyph() is called on each one. Oddly, the spacing for each glyph is not a constant, which I expected it to be for a fixed-width font, but if it's only used for extraction, I guess that doesn't matter.
I put a trace statement on PDFTextStripper.processTextPosition(); for every character on page 6 of a particular document, it displayed the page number, character string, the flag indicating whether the character is to be shown (always true), and the X and Y position of the character. I put the result into a spreadsheet and sorted it by Y then by X, to see if the Reader-displayed characters showed up out of sequence. None of them do.
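The spreadsheet step can also be done in code. This is only a minimal sketch in plain Java: GlyphTrace and GlyphTraceSorter are hypothetical names, not PDFBox classes, and it assumes the traced values (page, character string, X, Y) have already been collected into a list. It sorts by page, then ascending Y, then ascending X; whether ascending Y runs down or up the page depends on which coordinates were traced.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for one traced character; not a PDFBox type.
final class GlyphTrace {
    final int page;
    final String text;
    final float x;
    final float y;

    GlyphTrace(int page, String text, float x, float y) {
        this.page = page;
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class GlyphTraceSorter {
    // Returns a sorted copy: by page, then by Y, then by X (all ascending).
    static List<GlyphTrace> byPageThenYThenX(List<GlyphTrace> traces) {
        List<GlyphTrace> sorted = new ArrayList<>(traces);
        sorted.sort(Comparator
                .comparingInt((GlyphTrace g) -> g.page)
                .thenComparingDouble(g -> g.y)
                .thenComparingDouble(g -> g.x));
        return sorted;
    }

    // Concatenates the character strings in list order, for easy comparison
    // with the text in the order the characters were originally traced.
    static String joinText(List<GlyphTrace> traces) {
        StringBuilder sb = new StringBuilder();
        for (GlyphTrace g : traces) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

Comparing joinText() of the traced order with joinText() of the sorted order shows whether any character arrives out of sequence.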
In the case of "O" instead of "U" -- part of the page header on the Reader-displayed printout has a line with "CUST ID" on it; for pages 2-5 of this file, the extracted text shows "CUST ID" correctly, but "COST ID" on the 6th page.
Here's the Reader version of those lines:



And the extracted text of the same lines.
The "CUST ID" is part of a page header; on pages 2-5, "CUST ID" is displayed and extracted correctly. This is the only case I've noticed so far where there's a seeming change in a character, as opposed to extra characters.
Here are some other redacted lines from the Reader display of this report:
And here is the extracted text from the same part of the file:
I included these images inline; I also attached them, since I don't know what facilities people have to read inline attachments.
The similarity of the errors on these lines -- that all three of the error lines had dates in February in the second position on the line and all had the same error -- must mean something, but I don't know what.
I've got other information, but I don't know how much of it (or of what I've provided) is helpful.
I do not expect anyone to 'solve the problem' based on this information. But I was hoping to get pointers to ways I could attempt to get the same text that Acrobat Reader displays, hopefully using PDFBox, but I'll change libraries or methods if I need to.
rc

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



