Peter,

I have seen the factor of 1000 you mentioned on a website dealing with PDFBox, so you may well be right.

I have tried the following assertion which, if true, means that 2 characters belong to the same word:

leftChar.getX() + leftChar.getWidth() + space * .5f + X_TOLERANCE >= rightChar.getX()

I tried it with X_TOLERANCE = 0. Here, space is simply leftChar.getWidthOfSpace(), a method of the TextPosition class; getWidth() is also a method of that class. The first results are very satisfying.

By the way, is there an "easy" way to delete text from a PDF, apart from parsing the tokens and deleting those that precede the "Tj" / "TJ" operators? I need this to erase the reference strings I have detected and create a hyperlink at the same location with the same font.

Once I have tested the PDF word extractor I will post the source code so that we can share our techniques. The extractor I am writing is a bit more advanced than the one embedded in PDFBox: it builds a list of pairs (XY position of a word, contents of the word) rather than just a list of words.

Thanks all!

Julien

On 8 March 2014 at 15:14, Peter Murray-Rust <[email protected]> wrote:

> The width appears to be a ratio, independent of size. It also seems to be
> conventionally multiplied by 1000 (I have not found a definition for this -
> I have only guessed it).
>
> Thus a character "A" of width=600 and fontSize=10.5 appears to have
> pixelwidth = 600. / 1000. * 10.5 = 6.3 pixels
>
> I'd be grateful for confirmation or correction...
>
>
> On Sat, Mar 8, 2014 at 11:12 AM, HQS <[email protected]> wrote:
>
>> Well, I would like to ask Peter for a clarification about this formula:
>>
>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>>
>> What is the difference between "width(a)" and "fontSize(a)"? Is it not
>> enough to know the width of the character "a" in pixels, as given by the
>> font, to check this assertion?
>>
>> Thanks!
>>
>>
>> On 7 March 2014 at 18:46, Maruan Sahyoun <[email protected]> wrote:
>>
>>> if you need further assistance please let us know.
>>>
>>> BR
>>> Maruan Sahyoun
>>>
>>> On 07.03.2014 at 18:24, HQS <[email protected]> wrote:
>>>
>>>> Thank you all for those accurate answers.
>>>> I will give a try to the geometrical approach based on the (x, y)
>>>> coordinates of the characters.
>>>>
>>>> Best regards,
>>>>
>>>> Julien
>>>>
>>>> On 7 March 2014 at 13:25, Peter Murray-Rust <[email protected]> wrote:
>>>>
>>>>> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Sirs,
>>>>>>
>>>>>> I had already thought about this graphical approach to reconstruct the
>>>>>> words. I dropped it because I'm a bit sceptical about the reliability of
>>>>>> such a method. I can't help thinking that it will not be a 100% sure
>>>>>> method. I understand why CAD software would produce such an output,
>>>>>> though (thank you for this new word that I didn't know, "boustrophedonic",
>>>>>> but it explains well the result obtained).
>>>>>>
>>>>>
>>>>> It's not as bad as you think. We have re-constructed the text from hundreds
>>>>> of scientific papers (so probably nearly a million words) and found very
>>>>> few problems. The reason we are doing this rather than using PDFBox tools
>>>>> is that scientific (and especially maths) PDFs contain many diacritics, high
>>>>> Unicode points, occasional graphics strokes, variable font size and style,
>>>>> ligatures, non-horizontal text, etc.
>>>>>
>>>>> For running text it works very well - assuming that the characters announce
>>>>> their widths. Then - roughly - "ab" is a word if
>>>>>
>>>>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>>>>>
>>>>> else we can *crudely* estimate the number of intervening spaces (this is
>>>>> very suspect as publishers may elide concatenated spaces).
>>>>>
>>>>> All standard Fonts (see PDF spec) should announce their widths.
>>>>> Unfortunately scientific publishers use some of the worst constructed fonts
>>>>> in the world and sometimes we have to guess - by surveying a body of
>>>>> character positions and trying to work out spaces and font-type.
>>>>>
>>>>>
>>>>>> Supposing that the characters appear in a totally arbitrary order,
>>>>>> detecting that they're on the same line is more or less a piece of cake
>>>>>> (except if I need to introduce a tolerance, which makes things more
>>>>>> difficult),
>>>>>
>>>>>
>>>>> In a modern PDF we find that all characters on the same line tend to have
>>>>> equal y-coords to at least 3 decimals. The problem is that OCR'ed
>>>>> characters may have variable y because of rounding errors and antialiasing.
>>>>>
>>>>>
>>>>>> but grouping the characters according to their X position is
>>>>>> not at all an easy task.
>>>>>>
>>>>>
>>>>> The order should be fairly clear. The problems are:
>>>>> * spaces (see above)
>>>>> * hyphens at line-end (this requires heuristics - maybe lookup in Wordnet)
>>>>> - we generally solve > 90%. Hyphens in chemistry are meaningful
>>>>> * diacritics. Some characters have diacritics with the same x (e.g. E and
>>>>> acute). These can occur in variable order. Where possible we try to
>>>>> recreate a single Unicode point.
>>>>> * over and underbars
>>>>> * ligatures (in "waffle") there may be 6 characters or only 4: w-a-ffl-e. We
>>>>> split the latter.
>>>>>
>>>>>
>>>>>>
>>>>>> But this is not an issue, my problem is more the fact that this method may
>>>>>> not be 100% reliable. What do you think?
>>>>>>
>>>>>
>>>>> We are committed to solving it for English-language science and European
>>>>> personal names. The worst case is probably slanted text in diagrams.
>>>>>
>>>>>
>>>>>>
>>>>>> As for the technical part (overloading the processText), it's ok, thanks
>>>>>> for the advice.
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> Julien
>>>>>>
>>>>>
>>>>> --
>>>>> Peter Murray-Rust
>>>>> Reader in Molecular Informatics
>>>>> Unilever Centre, Dep. Of Chemistry
>>>>> University of Cambridge
>>>>> CB2 1EW, UK
>>>>> +44-1223-763069
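
PS: To make the same-word test from the top of this message concrete, here is a minimal sketch of it, together with the width conversion Peter describes (width / 1000 * fontSize). It is only an illustration under assumptions, not the extractor itself: Glyph is a hypothetical stand-in for PDFBox's TextPosition that carries only the values the test needs, WordGrouper, Word, glyphWidth, sameWord and group are names made up for this example, and the glyphs are assumed to be already sorted left to right on a single line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WordGrouper {

    /** Hypothetical stand-in for PDFBox's TextPosition (assumption, not the real class). */
    static final class Glyph {
        final String text;
        final float x, y, width, widthOfSpace;

        Glyph(String text, float x, float y, float width, float widthOfSpace) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.widthOfSpace = widthOfSpace;
        }
    }

    /** A word together with the position of its first glyph. */
    static final class Word {
        final float x, y;
        final String text;

        Word(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Tolerance added to the left side of the test; 0 as in the message above. */
    static final float X_TOLERANCE = 0f;

    /** Converts a font width in thousandths to a size-scaled width, as Peter describes. */
    static float glyphWidth(float widthUnits, float fontSize) {
        return widthUnits / 1000f * fontSize;
    }

    /** True if the right glyph starts no further than half a space after the left one ends. */
    static boolean sameWord(Glyph left, Glyph right) {
        return left.x + left.width + left.widthOfSpace * 0.5f + X_TOLERANCE >= right.x;
    }

    /** Groups the glyphs of one line (assumed sorted by x) into positioned words. */
    static List<Word> group(List<Glyph> line) {
        List<Word> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph first = null;
        Glyph previous = null;
        for (Glyph g : line) {
            // A gap wider than the test allows closes the current word.
            if (previous != null && !sameWord(previous, g)) {
                words.add(new Word(first.x, first.y, current.toString()));
                current.setLength(0);
                first = null;
            }
            if (first == null) {
                first = g;
            }
            current.append(g.text);
            previous = g;
        }
        if (first != null) {
            words.add(new Word(first.x, first.y, current.toString()));
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up coordinates, for illustration only.
        List<Glyph> line = Arrays.asList(
                new Glyph("a", 0f, 100f, 5f, 3f),
                new Glyph("b", 5.5f, 100f, 5f, 3f),
                new Glyph("c", 20f, 100f, 5f, 3f));
        for (Word w : group(line)) {
            System.out.println(w.text + " @ (" + w.x + ", " + w.y + ")");
        }
    }
}
```

With real PDFBox data the glyphs would first have to be grouped by line (equal or nearly equal y) and sorted by x; diacritics, ligatures and line-end hyphens, as listed in Peter's reply, are deliberately left out of this sketch.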

