Problem with text extraction

Ralph Cook Sun, 23 Jan 2022 10:02:31 -0800

I am using PDFBox's PDFTextStripper.getText() for a particular kind of PDF file generated by a government agency, and the text I'm getting does not match that displayed by Acrobat Reader for the same files. The getText() calls occasionally get characters Reader does not display, and in one case getText() gets an "O" instead of the "U" displayed by Reader. I would like to know if there's some way I can get the same text as Reader displays.

The text from Reader is "correct", i.e., it is (clearly) the text intended by the program(s) generating the files. The extracted text contains typos and misspelled words.

Unfortunately, I cannot share any of the PDF files. They contain confidential information.

The rest of this email relates various things I have tried, mostly to understand the problem better.

I copied the text within Reader, just using control-A / control-C, then pasted the text into a text editor. The text pasted this way matches the extracted text, not the Reader-displayed text (except that the copied/pasted text does not have the line breaks that getText() gives). With my newfound (very limited) knowledge of how PDFs are constructed, this made me wonder if some of the content displayed by Reader is somewhere other than the Tj streams in the document.

I've downloaded and attempted to extract information with various tools -- mupdf, qpdf, and XpdfReader, so far. I've found it difficult to figure out how to use them, mostly because their help text assumes you know things about PDF that I'm still trying to learn. I have not yet managed, with any of them, to get an uncompressed text document that shows the PDF commands and their arguments in readable form. I thought if I could do that I might at least figure out the location of the information that is displayed by Reader but not extracted by PDFBox. I haven't gotten much that is useful out of them yet.
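As a point of reference for the "uncompressed" goal above: page content in a PDF is commonly stored in streams compressed with the /FlateDecode filter, which is ordinary zlib data. The following is only a minimal sketch using the JDK's java.util.zip classes. It assumes the raw bytes of one stream have already been cut out of the file (locating them is not shown), and it does not handle predictors or any other filter. The class name FlateStreamDemo and both method names are made up for illustration; deflate() exists only so the example can produce its own test data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateStreamDemo {

    /** Inflates zlib-wrapped bytes, the format used by the /FlateDecode filter. */
    public static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and nothing left to read: the data is truncated or unsupported.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Helper for the demo only: compresses bytes the same way a PDF writer might. */
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        // Hypothetical content stream; real ones come from the PDF file itself.
        String content = "BT /F1 12 Tf 72 700 Td (CUST ID) Tj ET";
        byte[] packed = deflate(content.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(new String(inflate(packed), StandardCharsets.ISO_8859_1));
    }
}
```

In practice a dedicated tool is likely easier than doing this by hand. qpdf documents a --qdf mode (for example, qpdf --qdf --object-streams=disable in.pdf out.pdf) that is meant to write a PDF whose streams are uncompressed and readable in a text editor; the file names here are placeholders.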

I downloaded PDFBox source and stepped through code to follow how getText() works. I ran across the LegacyPDFStreamEngine class comments indicating that it is only to be used for PDFTextStripper. At least sometimes, a word from the file is passed to PDFTextStripper.showText(byte[] string) as a byte array of PDF letter codes, and then showGlyph() is called on each one. Oddly, the spacing for each glyph is not a constant, which I expected it to be for a fixed-width font, but if it's only used for extraction, I guess that doesn't matter.

I put a trace statement on PDFTextStripper.processTextPosition(); for every character on page 6 of a particular document, it displayed the page number, character string, the flag indicating whether the character is to be shown (always true), and the X and Y position of the character. I put the result into a spreadsheet and sorted it by Y then by X, to see if the Reader-displayed characters showed up out of sequence. None of them do.
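The spreadsheet step can also be done in code. This is a rough sketch only: the Glyph class is a hypothetical stand-in for one traced row (it is not a PDFBox type), the sample coordinates are invented, and it assumes that a smaller Y means higher on the page and that glyphs on the same line report exactly the same Y. Real trace data may need Y values rounded or bucketed before sorting.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GlyphTraceSort {

    /** One traced character: page number, its string, and its X/Y position. */
    public static final class Glyph {
        final int page;
        final String text;
        final float x;
        final float y;

        public Glyph(int page, String text, float x, float y) {
            this.page = page;
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns a copy of the list ordered by Y first, then by X within equal Y. */
    public static List<Glyph> sortByYThenX(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.<Glyph>comparingDouble(g -> g.y)
                .thenComparingDouble(g -> g.x));
        return sorted;
    }

    /** Concatenates the glyph strings in list order. */
    public static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Invented sample rows, deliberately out of reading order.
        List<Glyph> trace = new ArrayList<>();
        trace.add(new Glyph(6, "S", 30f, 100f));
        trace.add(new Glyph(6, "C", 10f, 100f));
        trace.add(new Glyph(6, "T", 40f, 100f));
        trace.add(new Glyph(6, "U", 20f, 100f));
        System.out.println(join(sortByYThenX(trace)));
    }
}
```

Comparing the joined string before and after sorting shows whether any character arrived out of reading order.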

In the case of "O" instead of "U" -- part of the page header on the Reader-displayed printout has a line with "CUST ID" on it; for pages 2-5 of this file, the extracted text shows "CUST ID" correctly, but "COST ID" on the 6th page.


Here's the Reader version of those lines:



And the extracted text of the same lines.

The "CUST ID" is part of a page header; on pages 2-5, "CUST ID" is displayed and extracted correctly. This is the only case I've noticed so far where there's a seeming change in a character, as opposed to extra characters.
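Since the header is supposed to repeat on every page, a small check can flag pages where the extracted header differs. This is just a sketch: it assumes the text of each page has already been extracted into its own string (with PDFBox, one way is to call PDFTextStripper.setStartPage() and setEndPage() with the same page number before getText()). The class and method names are made up, and the page strings below are invented placeholders, not data from the real files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderCheck {

    /**
     * Returns the 1-based numbers of the pages whose extracted text
     * does not contain the expected header string.
     */
    public static List<Integer> pagesMissing(List<String> pageTexts, String expected) {
        List<Integer> missing = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (!pageTexts.get(i).contains(expected)) {
                missing.add(i + 1);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Invented per-page text; only the last page has the altered header.
        List<String> pages = Arrays.asList(
                "CUST ID 1", "CUST ID 2", "CUST ID 3",
                "CUST ID 4", "CUST ID 5", "COST ID 6");
        System.out.println(pagesMissing(pages, "CUST ID"));
    }
}
```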


Here are some other redacted lines from the Reader display of this report:



And here is the extracted text from the same part of the file:

I included these images inline; I also attached them, since I don't know what facilities people have to read inline attachments.

The similarity of the errors on these lines -- that all three of the error lines had dates in February in the second position on the line and all had the same error -- must mean something, but I don't know what.

I've got other information, but I don't know how much of it (or of what I've provided) is helpful.

I do not expect anyone to 'solve the problem' based on this information. But I was hoping to get pointers to ways I could attempt to get the same text that Acrobat Reader displays, hopefully using PDFBox, but I'll change libraries or methods if I need to.

rc

