Re: PDFTextStripper: space characters inside words

kirillkh Wed, 06 Jul 2011 10:28:24 -0700

Hi,

I was wondering whether the mail I sent a month ago was received on this
list, since I haven't received any responses. (I guess it's possible it was
not received because it contained an attachment.) The original mail is
quoted below.


Thanks,
-Kirill

2011/6/7 kirillkh <[email protected]>

> Hi,
>
> I've encountered two issues with PDFTextStripper and found (imperfect)
> workarounds for both. Could one of the maintainers please take a look at
> the issues and at my patch (which is admittedly pretty hackish)?
> The patch is based off trunk, but I only tested it with PDFBox 1.5.0.
> https://github.com/kirillkh/pdfbox/commit/9a23c3956a96c276dfc677a0862c6954661b6d6a
>
> 1. With the attached document (I hope the mailing list accepts it; if not,
> contact me and I'll send it to you directly), I'm seeing spaces
> interspersed inside certain words (e.g., in the title on the second page).
> The document is in Hebrew (RTL), which might or might not matter.
>
> While I don't know exactly what the code is doing, I got the impression
> that the problem is caused by zero-width space characters. It looks like the
> document was produced by software that incorrectly set the width of every
> space character to 0 and also inserted such characters at random places
> inside the document. (Does that make any sense? In any case, that was my
> impression.) I assume that a real PDF renderer just ignores such characters,
> but PDFTextStripper outputs every one of them as text. I've managed to
> modify the code so that these space characters are ignored (see the patch),
> but chances are it is not the best solution.
>
> 2. (RTL-specific) After working around the main issue, I've encountered
> another one. In some cases, the zero-width space characters coincided with
> word boundaries; since I removed them, PDFTextStripper switched to using the
> average character width to determine word boundaries. This resulted in
> special WordSeparator positions being inserted where spaces were before. The
> problem with that is the PDFTextStripper.normalize() method for some reason
> splits the text on these word boundaries (instead of splitting it on the
> line boundaries) to perform visual-to-logical reordering. For some lines,
> this results in the word order being reversed (the characters inside each
> word are in the correct order, but the words themselves appear in reverse).
>
> I solved this by outputting a space character for every WordSeparator
> encountered by normalize(). Again, this worked for me with this document,
> but I'm not sure that is the right way to go.
>
>
> -Kirill
>
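
To make issue 1 above concrete, here is a minimal sketch of the filtering idea: skip any glyph that is whitespace and has zero width. It is only an illustration and not the code from the patch. The Glyph class and the method names are hypothetical stand-ins (Glyph is NOT PDFBox's TextPosition), and Latin letters are used instead of Hebrew to keep the example readable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a positioned glyph; this is NOT PDFBox's TextPosition.
final class Glyph {
    final String text;   // the character(s) this glyph represents
    final float width;   // advance width as reported by the document

    Glyph(String text, float width) {
        this.text = text;
        this.width = width;
    }
}

final class ZeroWidthSpaceFilter {
    // Returns a copy of the list without whitespace glyphs whose width is zero or less.
    static List<Glyph> dropZeroWidthSpaces(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean blank = g.text.trim().isEmpty();
            if (blank && g.width <= 0f) {
                continue; // assumed to be a bogus space, so skip it
            }
            kept.add(g);
        }
        return kept;
    }

    // Concatenates the glyph texts in list order.
    static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("w", 5f));
        glyphs.add(new Glyph(" ", 0f)); // zero-width space inside a word
        glyphs.add(new Glyph("o", 5f));
        glyphs.add(new Glyph("r", 5f));
        glyphs.add(new Glyph(" ", 0f)); // another one
        glyphs.add(new Glyph("d", 5f));
        glyphs.add(new Glyph(" ", 3f)); // a real space with non-zero width
        glyphs.add(new Glyph("s", 5f));
        System.out.println(join(dropZeroWidthSpaces(glyphs))); // prints "word s"
    }
}
```

Note that a filter like this also drops zero-width spaces that happen to sit on real word boundaries, which is exactly what leads to issue 2.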

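The symptom in issue 2 above can be reproduced with a toy model, under heavy assumptions: the line is purely RTL, it is stored in visual order, and visual-to-logical reordering is a plain character reversal. This is not what PDFTextStripper.normalize() actually does internally; the class and method names are hypothetical, and Latin letters stand in for Hebrew.

```java
// Toy model of visual-to-logical reordering for a purely RTL line.
// Hypothetical illustration only; not the logic of PDFTextStripper.normalize().
final class ReorderSketch {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Reorders the whole line at once: both characters and words end up in logical order.
    static String reorderWholeLine(String visualLine) {
        return reverse(visualLine);
    }

    // Splits on word separators first and reorders each chunk on its own:
    // characters inside each word are fixed, but the words keep their visual order.
    static String reorderPerWord(String visualLine) {
        String[] words = visualLine.split(" ");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(reverse(words[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String visual = "fed cba"; // logical text "abc def" as stored in visual order
        System.out.println(reorderWholeLine(visual)); // prints "abc def"
        System.out.println(reorderPerWord(visual));   // prints "def abc"
    }
}
```

In this model, splitting on word boundaries before reordering gives correct characters inside each word but reversed word order, which matches what I saw on some lines.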