Re: Text extraction adding lots of strange spaces

Kevin Day Sat, 14 Dec 2024 09:03:15 -0800

I will also review that jira commit you referenced. It seems like it might
be related.
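
For anyone following along, here is a rough sketch of the overlap idea from
further down the thread: drop a space glyph when a non-space glyph is drawn on
top of it, before the positional sort. It is only an illustration under some
assumptions. The Glyph class and the dropCoveredSpaces method are made-up names,
not PDFBox API. A real patch would work on PDFBox's TextPosition objects and
would probably need a tolerance instead of a strict box intersection.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapSpaceFilter {

    /** Hypothetical stand-in for a positioned glyph; not a PDFBox class. */
    static final class Glyph {
        final String text;
        final float x, y, width, height;

        Glyph(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        /** Treats whitespace-only (or empty) text as a space. */
        boolean isSpace() {
            return text.trim().isEmpty();
        }
    }

    /** True when the two bounding boxes share a region of positive area. */
    static boolean overlaps(Glyph a, Glyph b) {
        return a.x < b.x + b.width && b.x < a.x + a.width
            && a.y < b.y + b.height && b.y < a.y + a.height;
    }

    /** Returns the glyphs in input order, minus spaces covered by a non-space. */
    static List<Glyph> dropCoveredSpaces(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean covered = false;
            if (g.isSpace()) {
                for (Glyph other : glyphs) {
                    if (!other.isSpace() && overlaps(g, other)) {
                        covered = true;
                        break;
                    }
                }
            }
            if (!covered) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented coordinates: one render draws "(", three spaces and ")",
        // a later render draws two 9s on top of the first two spaces.
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("(", 0f, 0f, 2f, 7f));
        glyphs.add(new Glyph(" ", 2f, 0f, 2f, 7f));
        glyphs.add(new Glyph(" ", 4f, 0f, 2f, 7f));
        glyphs.add(new Glyph(" ", 6f, 0f, 2f, 7f));
        glyphs.add(new Glyph(")", 8f, 0f, 2f, 7f));
        glyphs.add(new Glyph("9", 2.5f, 0f, 1.5f, 7f));
        glyphs.add(new Glyph("9", 4.5f, 0f, 1.5f, 7f));

        List<Glyph> kept = dropCoveredSpaces(glyphs);
        System.out.println("kept " + kept.size() + " of " + glyphs.size() + " glyphs");
    }
}
```

The nested loop is quadratic, which is fine for a sketch. On real pages it would
make more sense to sort by position first and only compare neighbours.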


On Sat, Dec 14, 2024, 9:59 AM Kevin Day <ke...@trumpetinc.com> wrote:

> Great, I will put that together in the next couple of days.
>
> I haven't looked at the iText code in several years - this will be
> greenfield development work.
>
> Take care,
>
> K
>
> On Fri, Dec 13, 2024, 9:21 PM Tilman Hausherr <thaush...@t-online.de>
> wrote:
>
>> On 13.12.2024 15:41, Kevin Day wrote:
>> > Ok, this is very interesting.
>> >
>> > So we have one text render that adds a bunch of spaces.
>> >
>> > Then we have another text render that puts visible characters on top of
>> > those spaces.
>> >
>> > Then we positional-sort the text positions, and they wind up interleaving,
>> > which kills the text extraction accuracy.
>> >
>> > Would a reasonable refinement to the algorithm be to look for overlapping
>> > glyphs? And if there is a space that overlaps a non-space, then ignore the
>> > space?
>> >
>> > I know that text extraction is a black art (disclosure: I wrote the
>> > original text extraction modules for iText, so lots of hard-earned
>> > experience here), but I think the above would be appropriate for all
>> > situations...
>> >
>> > Would you be up for reviewing a patch if I implement this?
>> >
>> > PS I also think that having an option to ignore spaces in rendering (the
>> > issue you linked to) would be a good idea. But that should be optional. I'm
>> > happy to include that in my patch if you would like.
>>
>> Yes to both!
>>
>> Note that there is a pending patch about a corner case:
>>
>> https://github.com/apache/pdfbox/pull/155
>>
>> But don't use code from itext unless you are allowed to, i.e. check the
>> papers that you signed. If your patch is more than just a small bugfix
>> you may have to sign something with us as well.
>>
>> https://www.apache.org/licenses/icla.pdf
>>
>> Tilman
>>
>>
>>
>> >
>> > K
>> >
>> > On Thu, Dec 12, 2024, 9:40 PM Tilman Hausherr <thaush...@t-online.de> wrote:
>> >
>> >> Hi,
>> >>
>> >> These spaces are really in the PDF:
>> >>
>> >>      BT
>> >>      /Content <</MCID 224 >>BDC
>> >>      1 i
>> >>      /T1_4 1 Tf
>> >>      7 0 0 7 *195.4 110.502* Tm
>> >>      *(\(                                                     \))Tj*
>> >>      EMC
>> >>      /Content <</MCID 225 >>BDC
>> >>      /T1_0 1 Tf
>> >>      -19.686 6.786 Td
>> >>      (Beginning capital account)Tj
>> >>      /T1_4 1 Tf
>> >>      ( )Tj
>> >>      14.057 0 Td
>> >>      (.)Tj
>> >>      1.714 0 Td
>> >>      (.)Tj
>> >>      1.714 0 Td
>> >>      (.)Tj
>> >>      EMC
>> >>      ET
>> >>
>> >>
>> >> And *195.4 110.502* is really the position, you can move your mouse
>> >> there in PDFDebugger.
>> >>
>> >> The font messages are not important here.
>> >> There is a way to get rid of such spaces, but it requires a source code
>> >> change, it is described here:
>> >> https://issues.apache.org/jira/browse/PDFBOX-3774
>> >>
>> >> However it's possible that other files would have a bad text extraction.
>> >>
>> >> Tilman
>> >>
>> >> On 13.12.2024 03:29, Kevin Day wrote:
>> >>> Hello-
>> >>>
>> >>> We are using PDFTextStripper, and have found some cases where there are a
>> >>> *lot* of extraneous spaces being added to the output.  It almost acts like
>> >>> the stripper is thinking that the space width of the font is super tiny.
>> >>>
>> >>> I managed to get a document that exhibits the behavior:
>> >>>
>> >>>
>> >>> https://drive.google.com/file/d/1B2Mc4mMdsYfk9jKVqQ9OxEhKLRAxprrU/view?usp=sharing
>> >>> The easiest way to see the behavior is in PDFDebugger, View->Show Stripper
>> >>> Text Positions.
>> >>>
>> >>> Note in the lower left corner of the document, there is text "999". The
>> >>> text above and below that is fine, but the line with 999 has a *ton* of
>> >>> extra space rectangles displayed.
>> >>>
>> >>> The extract text function in PDFDebugger doesn't sort, so that one comes
>> >>> out fine, but if you use PDFTextStripper with sorting enabled, the line
>> >>> renders like this:
>> >>>
>> >>> Withdrawals and distributions . . . $ ( 9 9 9 )
>> >>>
>> >>> Note the many space characters, and that there are even spaces between each
>> >>> 9.
>> >>>
>> >>> I also observe that the PDF has warning messages about fonts (not sure if
>> >>> this might be involved):
>> >>>
>> >>> [main] WARN org.apache.pdfbox.pdmodel.font.PDType1Font - Using fallback font ArialMT for HelveticaLTStd-Roman
>> >>>
>> >>> [main] WARN org.apache.fontbox.ttf.CmapSubtable - Format 14 cmap table is not supported and will be ignored
>> >>>
>> >>>
>> >>>
>> >>> It almost acts like the parentheses on the line are triggering some
>> >>> different detection mode where the font's space width is computed to be
>> >>> much smaller than it should be.
>> >>>
>> >>> Any ideas on what is going on or if it is fixable?
>> >>>
>> >>> Thanks!
>> >>>
>> >>> - K
>> >>>
