I am by no means an expert here :)

But I use FontForge (a great tool!):

    http://fontforge.sourceforge.net/

You can use it to open a PDF and pick one of the fonts embedded in
it.  Then select the glyphs one by one and see whether FontForge
displays a Unicode character for each of them.

For example, when I do this with the chinese.pdf from PDFBOX-5 I see
U+????, so there is no Unicode character recorded for each glyph.
I do see the glyphs named g1332 and g3921, though -- not sure if
those are meaningful / could somehow be mapped to the corresponding
Unicode characters.
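
On whether names like g1332 carry any meaning, here is a minimal
sketch. It assumes the font follows the Adobe glyph naming convention,
where a name of the form uniXXXX or uXXXX[XX] (uppercase hex) implies a
code point. That is an assumption, not something verified against
chinese.pdf, and glyph_name_to_unicode is a hypothetical helper, not
part of PDFBox or FontForge. It handles only the single-code-point case.

```python
import re

# "uniXXXX": exactly four uppercase hex digits (single code point only;
# ligature names with several 4-digit groups are not handled here).
_UNI = re.compile(r"uni([0-9A-F]{4})")
# "uXXXX" .. "uXXXXXX": four to six uppercase hex digits.
_U = re.compile(r"u([0-9A-F]{4,6})")


def glyph_name_to_unicode(name):
    """Return the code point implied by a glyph name, or None."""
    # Drop any suffix after the first period, e.g. "uni4E2D.vert".
    base = name.split(".", 1)[0]
    match = _UNI.fullmatch(base) or _U.fullmatch(base)
    if match is None:
        return None
    code_point = int(match.group(1), 16)
    # Reject values outside Unicode and the surrogate range.
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        return None
    return code_point


if __name__ == "__main__":
    for name in ("g1332", "g3921", "uni4E2D", "u20BB7"):
        cp = glyph_name_to_unicode(name)
        label = "no Unicode implied" if cp is None else f"U+{cp:04X}"
        print(name, "->", label)
```

Under that convention g1332 and g3921 imply nothing, so they are
probably just generated index-style names. Mapping them to characters
would then need some other source, such as a mapping table stored
elsewhere in the PDF, if one exists.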

Mike McCandless

http://blog.mikemccandless.com



On Fri, Oct 21, 2011 at 5:29 AM, Srinivaas_Venkatarayan
<[email protected]> wrote:
> Hi Michael,
>
> It looks like PDFBox fails to extract text contents from PDF when the PDF has 
> custom encoded fonts in it. Is there a way to find out if the PDF has custom 
> encoded fonts using PDFBox?
>
> Srinivaas
>
> -----Original Message-----
> From: Srinivaas_Venkatarayan
> Sent: Tuesday, October 18, 2011 11:45 AM
> To: [email protected]
> Subject: RE: Issue while extracting chinese chars from pdf
>
> Michael,
>
> Thanks for your time, I was about to open a jira issue and then realized 
> there is an issue (PDFBOX-5) opened already on this related to CJK fonts. In 
> fact I have downloaded the chinese.pdf for my testing from this URL only 
> https://issues.apache.org/jira/browse/PDFBOX-5 (pls check 
> PDFBOX5-CJK.zip\chinese.pdf in the url).
>
> Pls let me know if there is any work around for this issue.
>
> Regards
> Srinivaas
>
> -----Original Message-----
> From: Michael McCandless [mailto:[email protected]]
> Sent: Friday, October 14, 2011 5:57 PM
> To: [email protected]
> Subject: Re: Issue while extracting chinese chars from pdf
>
> Can you open a jira issue and attach your PDF there?  Your attachment
> didn't come through.  Thanks.
>
> Unfortunately (from my limited understanding) PDFs can be tricky.  EG,
> if they use an embedded font and that font doesn't include unicode
> mappings for its glyphs then PDFBox won't be able to extract the
> character data, I believe.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Fri, Oct 14, 2011 at 7:31 AM, Srinivaas_Venkatarayan
> <[email protected]> wrote:
>> HI,
>>
>> Can someone pls help me with this issue? From this url 
>> http://www.pinxue.net/java/PDFBox_String_Charset_analyze_en.html it looks 
>> like PDFBox can handle CJK fonts, but I'm not sure what I have to do 
>> to extract Chinese chars.
>>
>> Thanks
>> Srinivaas
>> From: Srinivaas_Venkatarayan
>> Sent: Wednesday, October 12, 2011 5:12 PM
>> To: '[email protected]'
>> Subject: Issue while extracting chinese chars from pdf
>>
>> Hi,
>>
>> I'm trying to extract the text contents of a PDF file and store it in a txt 
>> file using PDFBox (ver 1.6.0). I have issues extracting the content of a PDF 
>> that has Chinese characters in it. Attached is the PDF and the java code. 
>> I'm not sure what encoding is being used in this PDF. Can you pls help?
>>
>> Thanks
>> Srini
>>
>>
>>
>> ________________________________
>> DISCLAIMER:
>> This email (including any attachments) is intended for the sole use of the 
>> intended recipient/s and may contain material that is CONFIDENTIAL AND 
>> PRIVATE COMPANY INFORMATION. Any review or reliance by others or copying or 
>> distribution or forwarding of any or all of the contents in this message is 
>> STRICTLY PROHIBITED. If you are not the intended recipient, please contact 
>> the sender by email and delete all copies; your cooperation in this regard 
>> is appreciated.
>>
>
