Hi, I've encountered two issues with PDFTextStripper and found (imperfect) workarounds for both. Could one of the maintainers please take a look at the issues and at my patch (which is admittedly pretty hackish)? The patch is based on trunk, but I have only tested it with PDFBox 1.5.0. https://github.com/kirillkh/pdfbox/commit/9a23c3956a96c276dfc677a0862c6954661b6d6a

1. With the attached document (I hope the mailing list accepts it; if not, contact me and I'll send it to you directly), I'm seeing spaces interspersed inside certain words (e.g., in the title on the second page). The document is in Hebrew (RTL), which may or may not matter. I don't know exactly what the code is doing, but I got the impression that the problem is caused by zero-width space characters. It looks like the document was produced by software that incorrectly specified the width of every space character as 0 and also inserted such spaces at random places inside the document. (Does that make any sense? In any case, that was my impression.) I assume that a real PDF renderer simply draws nothing visible for such characters, but PDFTextStripper outputs every one of them as text. I've managed to modify the code so that these space characters are ignored (see the patch), but chances are it is not the best solution.

2. (RTL-specific) After working around the first issue, I encountered another one. In some cases, the zero-width space characters coincided with word boundaries; since I removed them, PDFTextStripper fell back to using the average character width to determine word boundaries. This resulted in special WordSeparator positions being inserted where the spaces used to be. The problem is that the PDFTextStripper.normalize() method, for some reason, splits the text on these word boundaries (instead of on line boundaries) to perform visual-to-logical reordering. For some lines, this results in the word order being reversed (the characters inside each word are in the correct order, but the words themselves appear in reverse order). I solved this by outputting a space character for every WordSeparator encountered by normalize(). Again, this worked for me with this document, but I'm not sure it is the right way to go.

-Kirill
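P.S. To make the two symptoms easier to discuss, here is a toy model in plain Java. It does not use PDFBox at all and is not the code from the patch: the class StripperSketch, the Glyph type, and all method names are hypothetical, and Latin letters stand in for Hebrew so that the ordering is easy to read. It only sketches, under those assumptions, (a) dropping space glyphs whose width is zero and (b) how reversing each word separately differs from reversing the whole line when converting from visual to logical order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these names exist in PDFBox.
class StripperSketch {

    /** A glyph as a text extractor might see it: its text and its advance width. */
    static final class Glyph {
        final String text;
        final float width;

        Glyph(String text, float width) {
            this.text = text;
            this.width = width;
        }
    }

    /** Issue 1: drop space glyphs whose width is exactly zero; keep everything else. */
    static List<Glyph> dropZeroWidthSpaces(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean zeroWidthSpace = g.text.equals(" ") && g.width == 0f;
            if (!zeroWidthSpace) {
                kept.add(g);
            }
        }
        return kept;
    }

    /** Concatenates the text of all glyphs in order. */
    static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    /** Issue 2, the symptom: reverse each space-separated word on its own. */
    static String reorderPerWord(String visualLine) {
        String[] words = visualLine.split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(new StringBuilder(words[i]).reverse());
        }
        return out.toString();
    }

    /** Issue 2, the expected result: reverse the whole line in one go. */
    static String reorderPerLine(String visualLine) {
        return new StringBuilder(visualLine).reverse().toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("a", 5f), new Glyph(" ", 0f), new Glyph("b", 5f),
                new Glyph(" ", 3f), new Glyph("c", 5f));
        System.out.println(join(dropZeroWidthSpaces(glyphs))); // prints "ab c"

        // Logical text is "abc def"; stored in visual (reversed) order it is "fed cba".
        String visual = "fed cba";
        System.out.println(reorderPerWord(visual)); // prints "def abc" (words reversed)
        System.out.println(reorderPerLine(visual)); // prints "abc def" (correct)
    }
}
```

In this model, reversing per word leaves the letters inside each word correct but the words in the wrong order, which matches what I see in the output for some lines. Real visual-to-logical reordering is more involved than a plain reversal (mixed-direction runs, digits, punctuation), so this is only meant to illustrate the symptom.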

