You could use the x and y position and the rotation of each character to 
determine whether two given characters are, relative to their size, close to 
each other and on the same line. 

BT / ET is not at all guaranteed to give you strings as perceived by a human.
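
A minimal sketch of that position-based grouping, assuming you have already
extracted each character together with its position, size and rotation. The
names GlyphGrouper, Glyph, continuesRun and group are made up for this
illustration and are not part of the PDFBox API, and the tolerance factors are
arbitrary guesses that would need tuning against real Inventor output. It also
assumes horizontal text: the gap is measured along the x axis only, so rotated
text would have to be transformed first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GlyphGrouper {

    /** One positioned character. Hypothetical type, not a PDFBox class. */
    static final class Glyph {
        final String text;
        final double x;        // left edge of the glyph
        final double y;        // baseline position
        final double width;    // horizontal advance
        final double height;   // glyph height or font size
        final double rotation; // text direction in degrees

        Glyph(String text, double x, double y,
              double width, double height, double rotation) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
            this.rotation = rotation;
        }
    }

    /** True if next plausibly follows prev in the same string (assumed thresholds). */
    static boolean continuesRun(Glyph prev, Glyph next) {
        boolean sameRotation = Math.abs(prev.rotation - next.rotation) < 1.0;
        boolean sameLine = Math.abs(prev.y - next.y) < 0.3 * prev.height;
        // Horizontal distance between the end of prev and the start of next.
        double gap = next.x - (prev.x + prev.width);
        boolean close = gap > -0.5 * prev.height && gap < 0.3 * prev.height;
        return sameRotation && sameLine && close;
    }

    /** Groups glyphs into strings; a new string starts wherever no run can be continued. */
    static List<String> group(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // Process glyphs left to right so each run grows in reading order.
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x)
                .thenComparingDouble(g -> g.y));

        List<StringBuilder> texts = new ArrayList<>();
        List<Glyph> lastGlyphs = new ArrayList<>(); // last glyph of each run
        for (Glyph g : sorted) {
            int target = -1;
            for (int i = 0; i < lastGlyphs.size(); i++) {
                if (continuesRun(lastGlyphs.get(i), g)) {
                    target = i;
                    break;
                }
            }
            if (target < 0) {
                texts.add(new StringBuilder(g.text));
                lastGlyphs.add(g);
            } else {
                texts.get(target).append(g.text);
                lastGlyphs.set(target, g);
            }
        }

        List<String> result = new ArrayList<>();
        for (StringBuilder sb : texts) {
            result.add(sb.toString());
        }
        return result;
    }
}
```

With this approach the end of a string is simply the point where no nearby
glyph continues the run, so no termination character is needed. Each rebuilt
string could then be checked for the I- prefix.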

Olaf


On 6 Mar 2014 at 21:06, HQS <[email protected]> wrote:

> Well, thank you both for your quick replies.
> 
> The PDFs are generated by Autodesk Inventor (even the latest version produces 
> that kind of output).
> 
> It is for one of my clients, who wants certain strings in the PDF to be
> turned into clickable links automatically.
> 
> My problem is very simple: with such a structure I have no way to know where 
> the string ends.
> 
> As a matter of fact, all the references to be transformed are prefixed
> with ‘I-’, but there is no termination character, for instance
> "I-HOIST-042".
> Given that in the PDF I, -, H, O, (etc.), 2 are separate characters, I cannot 
> rebuild the original string.
> 
> I was hoping that there would be one block of text (BT … ET) per string but, 
> as I mentioned, each character is put in its own block.
> 
> Regards,
> 
> 
> On 6 March 2014 at 18:57, Maruan Sahyoun <[email protected]> wrote:
> 
>> Hi Julien,
>> 
>> for 1) that’s possible and supported - how was the document generated? DTP 
>> application?
>> for 2) PDFBox doesn’t enforce a PDF version. In general it supports all PDF 
>> files. It doesn’t have full coverage of all features defined within 
>> certain PDF versions, but the coverage should be reasonable. There is no 
>> documentation on coverage yet, so I can’t guarantee that a specific feature 
>> is supported. Is there something special you are looking for?
>> 
>> BR
>> Maruan Sahyoun
>> 
>> On 06.03.2014 at 18:39, HQS <[email protected]> wrote:
>> 
>>> Hello all,
>>> 
>>> 1.
>>> Have you ever seen PDFs with this kind of (pseudo) structure:
>>> 
>>> BT
>>> <character>
>>> Tj
>>> ET
>>> 
>>> ?
>>> 
>>> That is, the strings are split into characters and there is one block 
>>> of text per character?
>>> It seems to be ill-formed, doesn't it?
>>> 
>>> 2. As a reminder from my first mail: which PDF standards does the library 
>>> comply with? 1.3 to 1.7?
>>> 
>>> 
>>> Thanks and regards
>>> 
>>> Julien
>>> 
>> 
> 
