Re: Supporting multiple languages, including CJK

John Hewson Tue, 18 Oct 2016 16:49:40 -0700

> On 18 Oct 2016, at 15:08, Daniel King <dk...@halogensoftware.com> wrote:
> 
> Okay, that makes sense. Back to my root problem. I have a string which I want 
> to write to a PDF. This string contains both Latin and Arabic characters. I 
> can't use Arial Unicode MS since we are running on a Linux system, and I have 
> been told we have some licensing concerns as well, since PDFBox embeds the 
> font within the PDF file. I have found that Noto Naskh Arabic does support 
> Arabic characters but it doesn’t support Latin characters, so my string that 
> happens to contain both will throw a no-glyph exception when trying to print.


Unfortunately PDFBox does not support Arabic text layout at all. It can draw 
individual characters, but not words.

> One idea I have seen is to use a TrueType collection (TTC), but when loading 
> a TTC you seem to reference only a single font within the collection, 
> which won't allow characters from both Latin and Arabic to be printed from a 
> single string. Yes, in theory I could split this string, but I can't be 
> guaranteed where the Latin characters exist. Maybe I’m not understanding 
> something correctly about how TTC works.

PDF as a format doesn’t support TTC, so PDFBox allows you to extract a single 
font from a TTC and embed it in the PDF. What you need to do is break up your 
strings into runs based on script, and draw each one using an appropriate font. 

Java’s Character.UnicodeScript.of(codePoint) can help you do this. But if 
you’re dealing with genuine Arabic words and phrases, this won’t be enough: 
you’ll need to apply bidirectional (“bidi”) reordering first; see 
java.text.Bidi. This is not exactly easy.
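
To make the run-splitting idea concrete, here is a minimal sketch that uses 
only the JDK. The class name ScriptRuns, the Run type and the needsBidi helper 
are hypothetical names chosen for illustration; they are not part of PDFBox. 
The policy of attaching spaces, digits and punctuation (script COMMON or 
INHERITED) to the neighbouring run is also an assumption, and real text may 
need something smarter.

```java
import java.lang.Character.UnicodeScript;
import java.text.Bidi;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class: splits a string into runs that
// share one Unicode script, so each run can be drawn with a suitable font.
final class ScriptRuns {

    /** A stretch of text whose script-specific characters all share one script. */
    static final class Run {
        final UnicodeScript script;
        final String text;

        Run(UnicodeScript script, String text) {
            this.script = script;
            this.text = text;
        }
    }

    /** Splits s into runs; script-neutral characters stay with the current run. */
    static List<Run> split(String s) {
        List<Run> runs = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        UnicodeScript current = null; // null until the first non-neutral character

        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // iterate by code point, not by char
            i += Character.charCount(cp);
            UnicodeScript sc = UnicodeScript.of(cp);
            boolean neutral = sc == UnicodeScript.COMMON
                    || sc == UnicodeScript.INHERITED
                    || sc == UnicodeScript.UNKNOWN;

            // A new script starts a new run; neutral characters never do.
            if (!neutral && current != null && sc != current) {
                runs.add(new Run(current, buf.toString()));
                buf.setLength(0);
            }
            if (!neutral) {
                current = sc;
            }
            buf.appendCodePoint(cp);
        }
        if (buf.length() > 0) {
            runs.add(new Run(current == null ? UnicodeScript.COMMON : current,
                             buf.toString()));
        }
        return runs;
    }

    /** True if the text contains right-to-left characters or bidi controls. */
    static boolean needsBidi(String text) {
        return Bidi.requiresBidi(text.toCharArray(), 0, text.length());
    }

    public static void main(String[] args) {
        // Latin, then Arabic (written with escapes), then Latin again.
        String mixed = "Hello \u0645\u0631\u062D\u0628\u0627 world";
        for (Run r : split(mixed)) {
            System.out.println(r.script + " needsBidi=" + needsBidi(r.text)
                    + " length=" + r.text.length());
        }
    }
}
```

Each run could then be drawn with its own font via setFont and showText on 
the content stream. Note that this only picks fonts: it does no reordering 
and no Arabic shaping, so the caveat above still applies to the Arabic runs.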

— John

> Thanks,
> Dan
> 
> -----Original Message-----
> From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
> Sent: Tuesday, October 18, 2016 2:21 PM
> To: users@pdfbox.apache.org
> Subject: Re: Supporting multiple languages, including CJK
> 
> On 18.10.2016 at 15:32, Daniel King wrote:
>> I'm curious why you shouldn't load fonts that are scanned in by PDFBox using 
>> org.apache.fontbox.util.autodetect.FontDirFinder and instead reference a 
>> hard coded system directory?
> Because you don't know what you get when asking the FontMapper for "Arial", 
> especially if you run your code in different environments or on different OSes.
> 
> You may get a simple Arial font with a limited charset, or you may get "Arial 
> Unicode MS", which has wide support for non-Latin charsets, or you may get 
> any Arial-like font.
> 
> IMHO there are too many "may"s, especially if you are looking for a 
> CJK-capable font.
> 
> As John already said, it's the best idea to choose the font on your own to be 
> sure you get what you are looking for.
> 
> BR
> Andreas
> 
>> 
>> -----Original Message-----
>> From: John Hewson [mailto:j...@jahewson.com]
>> Sent: Tuesday, October 18, 2016 3:09 AM
>> To: users@pdfbox.apache.org
>> Subject: Re: Supporting multiple languages, including CJK
>> 
>> 
>>> On 12 Oct 2016, at 05:24, Daniel King <dk...@halogensoftware.com> wrote:
>>> 
>>> Hi,
>>> 
>>> I'm attempting to write text to a PDF in situations where I need to 
>>> support multiple languages in a single PDF. This may include regular 
>>> Latin characters as well as CJK characters. I've made many attempts 
>>> to do this and have it load the character sets from the OS, without 
>>> much success. The farthest I have gotten is supporting Latin characters, 
>>> some Russian and, I believe, Vietnamese characters found in the 
>>> embedded fonts example here:
>>> https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/EmbeddedFonts.java?view=markup
>>> 
>>> I'm taking a similar approach to the example, but I believe I'm using 
>>> the FileSystemFontProvider provided by the FontMappers class by doing 
>>> something such as
>>> 
>>> TrueTypeFont ttf = FontMappers.instance().getTrueTypeFont("Arial", null).getFont();
>>> PDFont font = PDType0Font.load(signatureDocument, ttf.getOriginalData());
>> 
>> Don’t load fonts like this. Follow the approach from the EmbeddedFonts 
>> example and load them from the filesystem.
>> 
>>> As I mentioned I seem to be able to support the text in the EmbeddedFonts 
>>> example but can't seem to determine how I can also support CJK. I’m 
>>> currently using 2.0.2 of PDFBox but could potentially upgrade to 2.0.3 if 
>>> that would help at all.
>> 
>> If you have a font which supports CJK then PDFBox should be able to use it. 
>> I recommend “Arial Unicode MS” as a good starting point, as it provides many 
>> more Unicode characters than plain “Arial”. Google’s Noto fonts also provide 
>> a great selection of characters.
>> 
>> — John
>> 
>>> Thanks for the help,
>>> Dan
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>> For additional commands, e-mail: users-h...@pdfbox.apache.org

