Re: Problems converting Special characters from PDF to text

aduester Tue, 11 Mar 2014 04:45:08 -0700

When I open the converted txt file in a browser, it is displayed correctly.
Is there a way to convert the character in the PDF to its Unicode notation in the txt file?


Like this:

text U+2308 text
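
A rough sketch of one way to get that notation: post-process the extracted string and replace every non-ASCII code point with its U+XXXX form. This only helps if the extractor already returns the correct Unicode character (as the browser display suggests). The class and method names are made up for illustration, and it uses only the Java standard library, no PDFBox API.

```java
public class UnicodeEscaper {
    // Replace every code point above 0x7F with its "U+XXXX" notation.
    // Assumption: the input already holds the correct Unicode characters.
    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 0x7F) {
                out.append(String.format("U+%04X", cp));
            } else {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+2308 is LEFT CEILING
        System.out.println(escapeNonAscii("text \u2308 text")); // prints "text U+2308 text"
    }
}
```

Plain ASCII text passes through unchanged, so the output stays readable and the brackets become unique tokens that can be searched for later.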


Quoting [email protected]:

Thanks for the advice. I tried copying the character into Notepad++ and it's the same behavior. If I use PDFlib I get blank squares for the special characters.
The characters are these:
http://www.fileformat.info/info/unicode/char/2308/index.htm
http://www.fileformat.info/info/unicode/char/230b/index.htm

I don't have the Acrobat Pro version.

Attachment added.

Thank you

Quoting Olaf Drümmer <[email protected]>:
If the text encoding or ToUnicode table for that character does not make the connection to the right Unicode value, then it can't be extracted properly.
There are two quick ways to double check:
- use a recent version of Adobe Reader or Adobe Acrobat, copy the piece of text in question and paste it into a Unicode enabled text window or control
- try text extraction with PDFlib TET (cf. http://www.pdflib.com/download/tet/ )
If neither of these gets the right Unicode values, you are probably out of luck. If they are showing the right character, report back.
You could also use a low level inspection tool (for example in Acrobat Pro, use Preflight and from the Preflight window's options menu choose "Explore PDF structure") and drill down to the respective font resource and find out whether it has a decent ToUnicode entry or not.
Olaf


---
Olaf Druemmer | Managing Director | callas software GmbH | Schoenhauser Allee 6/7 | 10119 Berlin
Tel +49.30.4439031-0 | Fax +49.30.4416402 | [email protected] | www.callassoftware.com
PDF Days Europe 2014 - June 16-17, 2014 · Cologne
Two days packed with PDF - Register now at:
http://pdfa.org/pdf-days-europe-2014



On 10 Mar 2014 at 19:56, Andreas Düster <[email protected]> wrote:
Hi,
I am using PDFBox 1.7.0 (unofficial converted .NET version from http://pdfbox.lehmi.de/) to convert a PDF to text. It works fine for me except for one thing. My problem is that the PDF contains Gaussian brackets which are converted to a single letter (right ceil is converted to a "d" and left floor to a "c"). I need at least something unique because I want to parse the text later and I need to locate the brackets. I am not sure if that's a problem of PDFBox at all. If I copy the bracket out of the PDF manually, it's the same behavior. Any idea to help me?
Thanks!!
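
If the PDF has no usable ToUnicode entry, one possible workaround is to remap by font: treat "d" and "c" as brackets only when they were drawn with the symbol font. The sketch below assumes the extractor can give the font name for each text run (how to get that depends on the PDF library and is not shown). The class name, the font name "CMSY10" and the mapping itself are assumptions based on the behavior described above, not something read from the actual file.

```java
import java.util.HashMap;
import java.util.Map;

public class SymbolFontRemapper {
    // Hypothetical mapping, taken from the reported behavior:
    // right ceiling comes out as "d", left floor as "c".
    static final Map<Character, String> BRACKETS = new HashMap<>();
    static {
        BRACKETS.put('d', "\u2309"); // RIGHT CEILING
        BRACKETS.put('c', "\u230A"); // LEFT FLOOR
    }

    // Remap a text run only if it was drawn with the given symbol font;
    // runs in any other font are returned unchanged.
    public static String remap(String fontName, String text, String symbolFont) {
        if (!fontName.equals(symbolFont)) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        for (char ch : text.toCharArray()) {
            String replacement = BRACKETS.get(ch);
            out.append(replacement != null ? replacement : String.valueOf(ch));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "CMSY10" is an assumed font name for illustration only.
        System.out.println(remap("CMSY10", "c", "CMSY10"));
        System.out.println(remap("Times-Roman", "cd", "CMSY10"));
    }
}
```

Keying the replacement on the font keeps ordinary "c" and "d" letters in body text untouched, which a plain search-and-replace on the extracted string could not do.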
