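The factor of 1000 is defined in the PDF specification: it maps from glyph space to text space. For most font types, 1000 units in glyph space correspond to 1 unit in text space (Type 3 fonts define their own font matrix). Have a look at sections 9.1 to 9.4 of the ISO 32000 spec.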
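As a small worked example of that mapping, here is a minimal sketch in plain Java with no PDFBox dependency. The names `GlyphWidth` and `toTextSpace` are made up for illustration, and the sketch assumes a simple font whose widths are given in 1/1000 of a text-space unit, with no character spacing, word spacing or horizontal scaling applied.

```java
// Hypothetical helper, not part of PDFBox: converts a glyph-space width
// into text-space units for a given font size.
final class GlyphWidth {
    // Assumption: 1000 glyph-space units = 1 text-space unit.
    // This holds for most font types; Type 3 fonts use their own FontMatrix.
    static final double GLYPH_UNITS_PER_TEXT_UNIT = 1000.0;

    static double toTextSpace(double glyphWidth, double fontSize) {
        return glyphWidth / GLYPH_UNITS_PER_TEXT_UNIT * fontSize;
    }

    public static void main(String[] args) {
        // Peter's example from the thread: width 600 at font size 10.5,
        // which should come out at roughly 6.3 units.
        System.out.println(toTextSpace(600, 10.5));
    }
}
```

This only covers the width ratio itself; the final position on the page also depends on the text matrix and the current transformation matrix.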
BR
Maruan Sahyoun

On 08.03.2014 at 18:23, HQS <[email protected]> wrote:

> Peter,
>
> What you said about the factor 1000, I’ve seen it on a website dealing
> with PDFBox, so you might be right.
> I have tried the following assertion which, if true, means that two
> characters belong to the same word:
>
> leftChar.getX() + leftChar.getWidth() + space * .5f + X_TOLERANCE >=
> rightChar.getX()
>
> I tried with X_TOLERANCE = 0.
>
> space is simply equal to leftChar.getWidthOfSpace(), a method in the
> TextPosition class.
> getWidth() is also a method of that class.
>
> The first results are very satisfying.
>
> By the way, is there an « easy » way to delete text from a PDF, apart
> from parsing the tokens and deleting those preceding the « Tj » / « TJ »
> operators? I need this to erase the reference strings that I have
> detected and create a hyperlink at the same location with the same font.
>
> Once I’ve tested the PDF word extractor I will post the source code so
> that we can share our techniques.
> The extractor I’m making is a bit more advanced than the one embedded in
> PDFBox, as it creates a list of pairs (XY position of a word, contents of
> a word) rather than just giving the list of words.
>
> Thanks all!
>
> Julien
>
>
> On 8 March 2014 at 15:14, Peter Murray-Rust <[email protected]> wrote:
>
>> The width appears to be a ratio, independent of size. It also seems to be
>> conventionally multiplied by 1000 (I have not found a definition for
>> this - I have only guessed it).
>>
>> Thus a character "A" of width=600 and fontSize=10.5 appears to have
>> pixelwidth = 600. / 1000. * 10.5 = 6.3 pixels
>>
>> I'd be grateful for confirmation or correction...
>>
>>
>> On Sat, Mar 8, 2014 at 11:12 AM, HQS <[email protected]> wrote:
>>
>>> Well, I have a clarification to ask of Peter about this formula:
>>>
>>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>>>
>>> What is the difference between « width(a) » and « fontSize(a) »? Is it
>>> not enough to know the width of the character « a » in pixels, as given
>>> by the font, to check this assertion?
>>>
>>> Thanks!
>>>
>>>
>>> On 7 March 2014 at 18:46, Maruan Sahyoun <[email protected]> wrote:
>>>
>>>> If you need further assistance please let us know.
>>>>
>>>> BR
>>>> Maruan Sahyoun
>>>>
>>>> On 07.03.2014 at 18:24, HQS <[email protected]> wrote:
>>>>
>>>>> Thank you all for those accurate answers.
>>>>> I will give the geometrical approach a try, based on the (x, y)
>>>>> coordinates of the characters.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Julien
>>>>>
>>>>> On 7 March 2014 at 13:25, Peter Murray-Rust <[email protected]> wrote:
>>>>>
>>>>>> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Sirs,
>>>>>>>
>>>>>>> I had already thought about this graphical approach to reconstruct
>>>>>>> the words. I dropped it because I'm a bit sceptical about the
>>>>>>> reliability of such a method. I can't help thinking that it will
>>>>>>> not be a 100% sure method. I understand why CAD software would
>>>>>>> produce such an output, though (thank you for this new word that I
>>>>>>> didn't know, "boustrophedonic", but it explains well the result
>>>>>>> obtained).
>>>>>>>
>>>>>>
>>>>>> It's not as bad as you think. We have re-constructed the text from
>>>>>> hundreds of scientific papers (so probably nearly a million words)
>>>>>> and found very few problems. The reason we are doing this rather
>>>>>> than using PDFBox tools is that scientific (and especially maths)
>>>>>> PDFs contain many diacritics, high Unicode points, occasional
>>>>>> graphics strokes, variable font size and style, ligatures,
>>>>>> non-horizontal text, etc.
>>>>>>
>>>>>> For running text it works very well - assuming that the characters
>>>>>> announce their widths. Then - roughly - "ab" is a word if
>>>>>>
>>>>>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>>>>>>
>>>>>> else we can *crudely* estimate the number of intervening spaces
>>>>>> (this is very suspect as publishers may elide concatenated spaces).
>>>>>>
>>>>>> All standard Fonts (see PDF spec) should announce their widths.
>>>>>> Unfortunately scientific publishers use some of the worst
>>>>>> constructed fonts in the world and sometimes we have to guess - by
>>>>>> surveying a body of character positions and trying to work out
>>>>>> spaces and font-type.
>>>>>>
>>>>>>
>>>>>>> Supposing that the characters appear in a totally arbitrary order,
>>>>>>> detecting that they're on the same line is more or less a piece of
>>>>>>> cake (except if I need to introduce a tolerance, which makes things
>>>>>>> more difficult),
>>>>>>
>>>>>>
>>>>>> In a modern PDF we find that all characters on the same line tend to
>>>>>> have equal y-coords to at least 3 decimals. The problem is that
>>>>>> OCR'ed characters may have variable y because of rounding errors and
>>>>>> antialiasing.
>>>>>>
>>>>>>
>>>>>>> but grouping the characters according to their X position is
>>>>>>> not at all an easy task.
>>>>>>>
>>>>>>
>>>>>> The order should be fairly clear. The problems are:
>>>>>> * spaces (see above)
>>>>>> * hyphens at line-end (this requires heuristics - maybe lookup in
>>>>>>   Wordnet) - we generally solve more than 90%. Hyphens in chemistry
>>>>>>   are meaningful
>>>>>> * diacritics. Some characters have diacritics with the same x (e.g.
>>>>>>   E and acute). These can occur in variable order. Where possible we
>>>>>>   try to recreate a single Unicode point.
>>>>>> * over and underbars
>>>>>> * ligatures (in "waffle") there may be 6 characters or only 4:
>>>>>>   w-a-ffl-e. We split the latter.
>>>>>>
>>>>>>
>>>>>>> But this is not an issue, my problem is more the fact that this
>>>>>>> method may not be 100% reliable. What do you think?
>>>>>>>
>>>>>>
>>>>>> We are committed to solving it for English-language science and
>>>>>> European personal names. The worst case is probably slanted text in
>>>>>> diagrams.
>>>>>>
>>>>>>
>>>>>>> As for the technical part (overloading the processText), it's ok,
>>>>>>> thanks for the advice.
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> Julien
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Peter Murray-Rust
>>>>>> Reader in Molecular Informatics
>>>>>> Unilever Centre, Dep. Of Chemistry
>>>>>> University of Cambridge
>>>>>> CB2 1EW, UK
>>>>>> +44-1223-763069
>>
>> --
>> Peter Murray-Rust
>> Reader in Molecular Informatics
>> Unilever Centre, Dep. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> +44-1223-763069
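
For reference, the adjacency test Julien describes in the thread could be sketched roughly as follows. This is plain Java with a hypothetical `Glyph` class standing in for PDFBox's `TextPosition` (only x, y, width and width-of-space are modelled). The names `WordGrouper`, `Glyph`, `Word`, `sameWord` and `groupIntoWords` are invented for illustration, and the glyphs are assumed to sit on one line and to be sorted left to right already.

```java
import java.util.ArrayList;
import java.util.List;

final class WordGrouper {
    // Hypothetical stand-in for the values read from a TextPosition.
    static final class Glyph {
        final String text;
        final float x, y, width, widthOfSpace;
        Glyph(String text, float x, float y, float width, float widthOfSpace) {
            this.text = text; this.x = x; this.y = y;
            this.width = width; this.widthOfSpace = widthOfSpace;
        }
    }

    // A word together with the position of its first glyph.
    static final class Word {
        final float x, y;
        final String text;
        Word(float x, float y, String text) {
            this.x = x; this.y = y; this.text = text;
        }
    }

    static final float X_TOLERANCE = 0f;

    // Julien's test: the right glyph starts no later than the end of the
    // left glyph plus half a space width (plus the tolerance).
    static boolean sameWord(Glyph left, Glyph right) {
        return left.x + left.width + left.widthOfSpace * .5f + X_TOLERANCE >= right.x;
    }

    // Walks one line of glyphs and starts a new word whenever the test fails.
    static List<Word> groupIntoWords(List<Glyph> line) {
        List<Word> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph first = null;
        Glyph previous = null;
        for (Glyph g : line) {
            if (previous != null && !sameWord(previous, g)) {
                words.add(new Word(first.x, first.y, current.toString()));
                current.setLength(0);
                first = null;
            }
            if (first == null) {
                first = g;
            }
            current.append(g.text);
            previous = g;
        }
        if (first != null) {
            words.add(new Word(first.x, first.y, current.toString()));
        }
        return words;
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("a", 0f, 100f, 5f, 3f));
        line.add(new Glyph("b", 5f, 100f, 5f, 3f));  // touches "a": same word
        line.add(new Glyph("c", 20f, 100f, 5f, 3f)); // gap of 10 exceeds 1.5: new word
        for (Word w : groupIntoWords(line)) {
            System.out.println(w.text + " at (" + w.x + ", " + w.y + ")");
        }
    }
}
```

With the sample data in `main`, "a" and "b" end up in one word and "c" starts a second one. Diacritics, ligatures and line-end hyphens, which Peter lists as the hard cases, are not handled here.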

