Thank you all for these precise answers. I will try the geometric approach based on the (x, y) coordinates of the characters.
Best regards,
Julien

On 7 March 2014 at 13:25, Peter Murray-Rust <[email protected]> wrote:

> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <[email protected]> wrote:
>
>> Sirs,
>>
>> I had already thought about this graphical approach to reconstruct the
>> words. I gave it up because I'm a bit sceptical about the reliability of
>> such a method. I can't help thinking that it will not be a 100% sure
>> method. I understand why CAD software would produce such an output,
>> though (thank you for this new word that I didn't know, "boustrophedonic",
>> but it explains the result obtained well).
>
> It's not as bad as you think. We have reconstructed the text from hundreds
> of scientific papers (so probably nearly a million words) and found very
> few problems. The reason we are doing this rather than using PDFBox tools
> is that scientific (and especially maths) PDFs contain many diacritics, high
> Unicode points, occasional graphics strokes, variable font size and style,
> ligatures, non-horizontal text, etc.
>
> For running text it works very well - assuming that the characters announce
> their widths. Then - roughly - "ab" is a word if
>
>     x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>
> else we can *crudely* estimate the number of intervening spaces (this is
> very suspect as publishers may elide concatenated spaces).
>
> All standard Fonts (see PDF spec) should announce their widths.
> Unfortunately scientific publishers use some of the worst constructed fonts
> in the world and sometimes we have to guess - by surveying a body of
> character positions and trying to work out spaces and font-type.
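To make sure I read the word test correctly, here is a minimal Python sketch of the rule x(a) + width(a)*fontSize(a) + tolerance >= x(b), applied to the characters of a single line. Everything in it is my own assumption for illustration: the `Glyph` record, the `group_words` function, the convention that `width` is expressed as a fraction of the font size, and the tolerance value. None of it is PDFBox API or Peter's actual code.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    # Hypothetical record for one extracted character.
    text: str
    x: float          # left edge of the character, in page units
    width: float      # announced width, as a fraction of the font size (assumed)
    font_size: float  # font size, in page units


def group_words(glyphs, tolerance=0.5):
    """Split the glyphs of ONE line into words with the adjacency rule."""
    words = []
    current = ""
    prev = None
    # The characters may arrive in any order, so sort them by x first.
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            right_edge = prev.x + prev.width * prev.font_size + tolerance
            if right_edge < g.x:
                # Gap wider than the tolerance: the previous word ends here.
                words.append(current)
                current = ""
        current += g.text
        prev = g
    if current:
        words.append(current)
    return words


# Four characters of size 10 and width 0.5 (advance 5), given out of order.
line = [
    Glyph("d", 18.0, 0.5, 10.0),
    Glyph("a", 0.0, 0.5, 10.0),
    Glyph("c", 13.0, 0.5, 10.0),
    Glyph("b", 5.0, 0.5, 10.0),
]
print(group_words(line))
```

The sketch only decides where a word ends. It does not try to estimate how many spaces sit in a gap, since that estimate is described as very suspect, and it relies entirely on the fonts announcing usable widths.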
>
>> Supposing that the characters appear in a totally arbitrary order,
>> detecting that they're on the same line is more or less a piece of cake
>> (except if I need to introduce a tolerance, which makes things more
>> difficult),
>
> In a modern PDF we find that all characters on the same line tend to have
> equal y-coords to at least 3 decimals. The problem is that OCR'ed
> characters may have variable y because of rounding errors and antialiasing.
>
>> but grouping the characters according to their X position is
>> not at all an easy task.
>
> The order should be fairly clear. The problems are:
> * spaces (see above)
> * hyphens at line-end (this requires heuristics - maybe lookup in Wordnet);
>   we generally solve > 90%. Hyphens in chemistry are meaningful
> * diacritics. Some characters have diacritics with the same x (e.g. E and
>   acute). These can occur in variable order. Where possible we try to
>   recreate a single Unicode point.
> * over and underbars
> * ligatures: in "waffle" there may be 6 characters or only 4 (w-a-ffl-e). We
>   split the latter.
>
>> But this is not an issue, my problem is more the fact that this method may
>> not be 100% reliable. What do you think?
>
> We are committed to solving it for English-language science and European
> personal names. The worst case is probably slanted text in diagrams.
>
>> As for the technical part (overloading the processText), it's ok, thanks
>> for the advice.
>>
>> Best regards
>>
>> Julien
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
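For my own notes, here is a rough Python sketch of three of the points in the quoted reply: grouping characters into lines by their y coordinate rounded to 3 decimals, merging a base character and a diacritic that share the same x into a single code point, and splitting a ligature such as "ffl" into its letters. The function names and the (text, x, y) tuples are hypothetical, and I am assuming that the diacritics arrive as Unicode combining marks and that y grows upward on the page. The only library calls are `unicodedata.normalize` and `unicodedata.combining` from the Python standard library.

```python
import unicodedata
from collections import defaultdict


def group_lines(glyphs, decimals=3):
    """Group (text, x, y) tuples into lines keyed on y rounded to `decimals`."""
    lines = defaultdict(list)
    for text, x, y in glyphs:
        lines[round(y, decimals)].append((x, text))
    result = []
    # Assumption: y grows upward, so the largest y is the first line.
    for _, items in sorted(lines.items(), reverse=True):
        items.sort(key=lambda item: item[0])  # left to right within the line
        result.append("".join(text for _, text in items))
    return result


def merge_stack(chars):
    """Merge characters that share one x position into a composed string."""
    # Put base characters (combining class 0) before combining marks,
    # whatever order they arrived in, then compose with NFC.
    ordered = sorted(chars, key=lambda c: unicodedata.combining(c) != 0)
    return unicodedata.normalize("NFC", "".join(ordered))


def split_ligatures(text):
    """Expand ligature code points into their letters with NFKC."""
    return unicodedata.normalize("NFKC", text)


glyphs = [("b", 10.0, 700.0004), ("a", 0.0, 700.0001), ("c", 0.0, 688.0)]
print(group_lines(glyphs))
print(merge_stack(["\u0301", "E"]))   # combining acute given before the E
print(split_ligatures("wa\ufb04e"))  # U+FB04 is the "ffl" ligature
```

Two caveats I can already see. Rounding puts two nearly equal y values in different lines when they fall on either side of a rounding boundary, so OCR'ed text would probably need a real tolerance instead. NFKC also rewrites other compatibility characters (superscript digits, for instance), which may be unwanted in scientific text, so a dedicated ligature table might be safer.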

