Peter,

I have seen the factor of 1000 you mentioned on a website dealing with PDFBox,
so you may well be right.
I have tried the following test which, when true, treats two characters as
belonging to the same word:

leftChar.getX() + leftChar.getWidth() + space * .5f + X_TOLERANCE >= 
rightChar.getX()

I tried it with X_TOLERANCE = 0.

space is simply leftChar.getWidthOfSpace(), a method of the TextPosition class.
getWidth() is also a method of that class.
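
For reference, here is a minimal sketch of that test in plain Java. Glyph and
WordJoin are hypothetical names that are not part of PDFBox; Glyph only stands
in for the three values I read from TextPosition (getX(), getWidth() and
getWidthOfSpace()), so the snippet compiles without the library.

```java
// Hypothetical stand-in for the values read from PDFBox's TextPosition
// (getX(), getWidth(), getWidthOfSpace()); not a PDFBox class.
final class Glyph {
    final float x;            // left edge of the character
    final float width;        // width of the character
    final float widthOfSpace; // width of a space in the character's font

    Glyph(float x, float width, float widthOfSpace) {
        this.x = x;
        this.width = width;
        this.widthOfSpace = widthOfSpace;
    }
}

final class WordJoin {
    // Extra slack added to the comparison; 0 in my first tests.
    static final float X_TOLERANCE = 0f;

    // True when the right character starts no further than half a space
    // (plus the tolerance) past the right edge of the left character.
    static boolean sameWord(Glyph left, Glyph right) {
        return left.x + left.width + left.widthOfSpace * .5f + X_TOLERANCE
                >= right.x;
    }
}
```

The half-space threshold is only the value I tried; a different fraction or a
non-zero X_TOLERANCE may suit other documents better.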

The first results are very satisfactory.

By the way, is there an "easy" way to delete text from a PDF, other than
parsing the tokens and deleting those preceding the "Tj" / "TJ" operators? I
need this to erase the reference strings that I have detected and create a
hyperlink at the same location with the same font.

Once I have tested the PDF word extractor, I will post the source code so that
we can share our techniques.
The extractor I'm writing is a bit more advanced than the one built into PDFBox:
it produces a list of pairs (XY position of a word, contents of the word)
rather than just a list of words.
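
To illustrate what I mean by a list of pairs, here is a rough sketch, not the
actual extractor. CharBox, PositionedWord and WordGrouper are hypothetical
names; the sketch assumes the characters already arrive in reading order, that
characters on the same line share exactly the same y, and that whitespace
glyphs simply end the current word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one character; with PDFBox the values would be
// copied from a TextPosition (text, x, y, width, width of a space).
final class CharBox {
    final String text;
    final float x, y, width, widthOfSpace;

    CharBox(String text, float x, float y, float width, float widthOfSpace) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.widthOfSpace = widthOfSpace;
    }
}

// One (XY position, contents) pair; the position is that of the first character.
final class PositionedWord {
    final float x, y;
    final String text;

    PositionedWord(float x, float y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

final class WordGrouper {
    // Groups characters (assumed sorted in reading order) into positioned words.
    static List<PositionedWord> group(List<CharBox> chars) {
        List<PositionedWord> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float startX = 0f, startY = 0f;
        CharBox prev = null;
        for (CharBox c : chars) {
            if (c.text.trim().isEmpty()) {
                // A whitespace glyph closes the word in progress.
                flush(words, current, startX, startY);
                prev = null;
                continue;
            }
            // Same line and a gap of at most half a space: same word.
            boolean joins = prev != null && prev.y == c.y
                    && prev.x + prev.width + prev.widthOfSpace * .5f >= c.x;
            if (!joins) {
                flush(words, current, startX, startY);
                startX = c.x;
                startY = c.y;
            }
            current.append(c.text);
            prev = c;
        }
        flush(words, current, startX, startY);
        return words;
    }

    // Emits the word in progress, if any, and resets the buffer.
    private static void flush(List<PositionedWord> words, StringBuilder current,
                              float x, float y) {
        if (current.length() > 0) {
            words.add(new PositionedWord(x, y, current.toString()));
            current.setLength(0);
        }
    }
}
```

Comparing y values exactly is the simplest choice; scanned or OCR'ed documents
would probably need a small tolerance on y as well.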

Thanks, all!

Julien


On 8 March 2014 at 15:14, Peter Murray-Rust <[email protected]> wrote:

> The width appears to be a ratio, independent of size. It also seems to be
> conventionally multiplied by 1000 (I have not found a definition for this -
> I have only guessed it).
> 
> Thus a character "A" of width=600 and fontSize=10.5 appears to have
> pixelwidth = 600. / 1000. * 10.5 = 6.3 pixels
> 
> I'd be grateful for confirmation or correction...
> 
> 
> On Sat, Mar 8, 2014 at 11:12 AM, HQS <[email protected]> wrote:
> 
>> Well, I have a point to clarify with Peter about this formula:
>> 
>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>> 
>> What is the difference between "width(a)" and "fontSize(a)"? Is it not
>> enough to know the width of the character "a" in pixels, as given by the
>> font, to check this condition?
>> 
>> Thanks !
>> 
>> 
>> On 7 March 2014 at 18:46, Maruan Sahyoun <[email protected]> wrote:
>> 
>>> If you need further assistance, please let us know.
>>> 
>>> BR
>>> Maruan Sahyoun
>>> 
>>> On 07.03.2014 at 18:24, HQS <[email protected]> wrote:
>>> 
>>>> Thank you all for those precise answers.
>>>> I will try the geometrical approach based on the (x, y) coordinates of
>>>> the characters.
>>>> 
>>>> Best regards,
>>>> 
>>>> Julien
>>>> 
>>>> On 7 March 2014 at 13:25, Peter Murray-Rust <[email protected]> wrote:
>>>> 
>>>>> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
>>>>> [email protected]> wrote:
>>>>> 
>>>>>> Sirs,
>>>>>> 
>>>>>> I had already thought about this graphical approach to reconstructing
>>>>>> the words. I dropped it because I'm a bit sceptical about the
>>>>>> reliability of such a method. I can't help thinking that it will not
>>>>>> be 100% reliable. I understand why CAD software would produce such
>>>>>> output, though (thank you for the new word "boustrophedonic", which I
>>>>>> didn't know but which explains the result well).
>>>>>> 
>>>>> 
>>>>> It's not as bad as you think. We have reconstructed the text from
>>>>> hundreds of scientific papers (so probably nearly a million words) and
>>>>> found very few problems. The reason we are doing this rather than using
>>>>> PDFBox tools is that scientific (and especially maths) PDFs contain many
>>>>> diacritics, high Unicode points, occasional graphics strokes, variable
>>>>> font size and style, ligatures, non-horizontal text, etc.
>>>>> 
>>>>> For running text it works very well - assuming that the characters
>>>>> announce their widths. Then - roughly - "ab" is a word if
>>>>> 
>>>>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>>>>> 
>>>>> else we can *crudely* estimate the number of intervening spaces (this
>>>>> is very suspect as publishers may elide concatenated spaces).
>>>>> 
>>>>> All standard Fonts (see PDF spec) should announce their widths.
>>>>> Unfortunately scientific publishers use some of the worst constructed
>>>>> fonts in the world and sometimes we have to guess - by surveying a body
>>>>> of character positions and trying to work out spaces and font-type.
>>>>> 
>>>>> 
>>>>>> Supposing that the characters appear in a totally arbitrary order,
>>>>>> detecting that they're on the same line is more or less a piece of cake
>>>>>> (except if I need to introduce a tolerance, which makes things more
>>>>>> difficult),
>>>>> 
>>>>> 
>>>>> In a modern PDF we find that all characters on the same line tend to
>>>>> have equal y-coords to at least 3 decimals. The problem is that OCR'ed
>>>>> characters may have variable y because of rounding errors and
>>>>> antialiasing.
>>>>> 
>>>>> 
>>>>> 
>>>>>> but grouping the characters according to their X position is
>>>>>> not at all an easy task.
>>>>>> 
>>>>> 
>>>>> The order should be fairly clear. The problems are:
>>>>> * spaces (see above)
>>>>> * hyphens at line-end (this requires heuristics - maybe lookup in
>>>>> Wordnet) - we generally solve > 90%. Hyphens in chemistry are meaningful
>>>>> * diacritics. Some characters have diacritics with the same x (e.g. E
>>>>> and acute). These can occur in variable order. Where possible we try to
>>>>> recreate a single Unicode point.
>>>>> * over and underbars
>>>>> * ligatures (in "waffle") there may be 6 characters or only 4
>>>>> w-a-ffl-e. We split the latter.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> But this is not the issue; my concern is rather that this method may
>>>>>> not be 100% reliable. What do you think?
>>>>>> 
>>>>> 
>>>>> We are committed to solving it for English-language science and
>>>>> European personal names. The worst case is probably slanted text in
>>>>> diagrams.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> As for the technical part (overloading the processText), it's ok,
>>>>>> thanks for the advice.
>>>>>> 
>>>>>> Best regards
>>>>>> 
>>>>>> Julien
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>> Peter Murray-Rust
>>>>> Reader in Molecular Informatics
>>>>> Unilever Centre, Dep. Of Chemistry
>>>>> University of Cambridge
>>>>> CB2 1EW, UK
>>>>> +44-1223-763069
>>>> 
>>> 
>> 
>> 
> 
> 
> -- 
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
