Hi, I was wondering whether the mail I sent a month ago was received on this list, since I haven't received any responses. (I guess it's possible it was not received because it contained an attachment.) The original mail is quoted below.
Thanks,
-Kirill

2011/6/7 kirillkh <[email protected]>

> Hi,
>
> I've encountered two issues with PDFTextStripper and discovered (imperfect)
> workarounds for both. Could one of the maintainers please take a look at
> the issues and at my patch (which is admittedly pretty hackish)?
> The patch is based off trunk, but I only tested it with PDFBox 1.5.0.
> https://github.com/kirillkh/pdfbox/commit/9a23c3956a96c276dfc677a0862c6954661b6d6a
>
> 1. With the attached document (I hope it will be accepted by the mailing
> list; if not, contact me and I'll send it to you directly), I'm seeing
> spaces interspersed inside certain words (e.g., in the second page's title).
> The document is in Hebrew (RTL), which might or might not matter.
>
> While I don't know what exactly the code is doing, I got the impression
> that the problem is caused by zero-width space characters. It looks like the
> document was produced by software that incorrectly specified the width of
> every space character as 0 and also inserted them at random places inside
> the document. (Does that make any sense? In any case, that was my
> impression.) I assume that a real PDF renderer just ignores such characters,
> but PDFTextStripper outputs every such character as text. I've managed to
> modify the code in a way that makes these space characters be ignored (see
> the patch), but chances are it is not the best solution.
>
> 2. (RTL-specific) After working around the main issue, I encountered
> another one. In some cases, the zero-width space characters coincided with
> word boundaries; since I removed them, PDFTextStripper switched to using the
> average character width to determine word boundaries. This resulted in
> special WordSeparator positions being inserted where the spaces were before.
> The problem with that is that the PDFTextStripper.normalize() method for
> some reason splits the text on these word boundaries (instead of splitting
> it on the line boundaries) to perform visual-to-logical reordering. For some
> lines, this results in the word order being reversed (the characters inside
> words are in the correct order, but the words are ordered in reverse).
>
> I solved this by outputting a space character for every WordSeparator
> encountered by normalize(). Again, this worked for me with this document,
> but I'm not sure that is the right way to go.
>
>
> -Kirill
>
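In case it helps to see the two workarounds in isolation, here is a minimal, standalone sketch of the idea. It is not the patch itself and it does not use any PDFBox classes: the Glyph class, the method names, and the "width == 0" test are all hypothetical stand-ins, and Latin letters stand in for Hebrew so the ordering is easy to follow. Visual-to-logical reordering is simplified to a plain string reversal, which only holds for a purely RTL line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from PDFBox.
final class StripperWorkaroundSketch {

    /** One extracted glyph: its text and the width the PDF declares for it. */
    static final class Glyph {
        final String text;
        final float width;

        Glyph(String text, float width) {
            this.text = text;
            this.width = width;
        }
    }

    /** Workaround 1 (assumed form): skip space glyphs whose declared width is zero. */
    static String dropZeroWidthSpaces(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            boolean zeroWidthSpace = g.text.equals(" ") && g.width == 0f;
            if (!zeroWidthSpace) {
                out.append(g.text);
            }
        }
        return out.toString();
    }

    /** Simplified visual-to-logical conversion for a purely RTL run. */
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /** Issue 2: reordering each word on its own leaves the words in visual order. */
    static String reorderPerWord(List<String> visualWords) {
        List<String> out = new ArrayList<>();
        for (String word : visualWords) {
            out.add(reverse(word));
        }
        return String.join(" ", out);
    }

    /** Workaround 2: emit real spaces first, then reorder the line as a whole. */
    static String reorderPerLine(List<String> visualWords) {
        return reverse(String.join(" ", visualWords));
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("a", 5f), new Glyph(" ", 0f), new Glyph("b", 5f),
                new Glyph(" ", 2.5f), new Glyph("c", 5f));
        System.out.println(dropZeroWidthSpaces(glyphs)); // prints "ab c"

        // Logical text "abc def" appears on the page (left to right) as "fed cba".
        List<String> visualWords = Arrays.asList("fed", "cba");
        System.out.println(reorderPerWord(visualWords)); // prints "def abc"
        System.out.println(reorderPerLine(visualWords)); // prints "abc def"
    }
}
```

The per-word variant reproduces the symptom I described (letters correct, words reversed), while the per-line variant gives the expected order. Real bidi handling is of course more involved than a reversal, so please treat this only as an illustration of why the split granularity matters.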

