Re: IllegalStateException are thrown by surrogate pair character.

Toshiaki Ito Sat, 04 May 2024 06:22:21 -0700

Hi, Tilman.

𩸽 is a Kanji character defined by a standard called "JIS X 0213".


about JIS X 0213
https://en.wikipedia.org/wiki/JIS_X_0213

Although this is a common character, the set of JIS X 0213 characters
covered varies from font to font, so a given font may not contain it.
I have given up on using fonts that do not contain this character,
but I would like to output it with Noto Sans Japanese, which does
include it.


I read the source code of pdfbox 3.0.2.
applyGSUBRules in PDAbstractContentStream.java contains this line:

"char[] charArray = word.toCharArray();"

I think the problem is that this line splits 𩸽 into two chars
("\uD867" and "\uDE3D"), whereas I expect it to be treated as a
single character (U+29E3D).
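
To illustrate the difference, here is a minimal standalone sketch that
uses only the JDK (no pdfbox). The class and method names are
hypothetical and exist only for this example. It compares what
toCharArray() sees with what codePoints() sees for the same string.

```java
public class SurrogatePairDemo {

    // Number of UTF-16 code units, i.e. what toCharArray() iterates over.
    static int charUnits(String s) {
        return s.toCharArray().length;
    }

    // Unicode code points; a surrogate pair is combined into one value.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // U+29E3D written as a surrogate pair.
        String word = "\uD867\uDE3D";

        // Each char is a lone surrogate, which has no glyph of its own.
        for (char c : word.toCharArray()) {
            System.out.printf("char U+%04X, isSurrogate=%b%n",
                    (int) c, Character.isSurrogate(c));
        }

        // The code point view yields the single character U+29E3D.
        for (int cp : codePoints(word)) {
            System.out.printf("code point U+%04X%n", cp);
        }
    }
}
```

A cmap lookup on either surrogate alone cannot succeed, because the
font maps the combined code point, not its two halves.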


I modified the code to process each code point, and it works as expected.
(All existing tests have passed.)



By the way, with pdfbox 2.0.31, the same code produces the expected output.


**** Modified code ****

    private List<Integer> applyGSUBRules(GsubWorker gsubWorker,
            ByteArrayOutputStream out, PDType0Font font, String word)
            throws IOException
    {
        int[] codePointArray = word.codePoints().toArray();
        List<Integer> originalGlyphIds =
                new ArrayList<>(word.codePointCount(0, word.length()));
        CmapLookup cmapLookup = font.getCmapLookup();

        // convert characters into glyphIds
        for (int unicodeChar : codePointArray)
        {
            int glyphId = cmapLookup.getGlyphId(unicodeChar);
            if (glyphId <= 0)
            {
                throw new IllegalStateException(
                        "could not find the glyphId for the character: "
                                + unicodeChar);
            }
            originalGlyphIds.add(glyphId);
        }

        List<Integer> glyphIdsAfterGsub =
                gsubWorker.applyTransforms(originalGlyphIds);

        for (Integer glyphId : glyphIdsAfterGsub)
        {
            out.write(font.encodeGlyphId(glyphId));
        }

        return glyphIdsAfterGsub;
    }
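
One side note on the exception message in the code above: unicodeChar
is now an int, so string concatenation prints its decimal value
(171581 for this character) rather than the character itself. Below is
a minimal sketch of a more readable message, using only the JDK. The
class and method names are hypothetical and are not part of pdfbox.

```java
public class CodePointMessage {

    // Formats a code point as "U+XXXX (character)" for log or
    // exception messages. Character.toChars() rebuilds the surrogate
    // pair for code points above U+FFFF.
    static String describe(int codePoint) {
        return String.format("U+%04X (%s)", codePoint,
                new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        System.out.println("could not find the glyphId for the character: "
                + describe(0x29E3D));
    }
}
```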

On Sat, May 4, 2024 at 20:12, Tilman Hausherr <thaush...@t-online.de> wrote:
>
> Hi,
>
> Is it this one? 𩸽
>
> According to my understanding of
> https://www.compart.com/en/unicode/U+29E3D you should use \u29E3D  or 𩸽
> directly. However I tried this with your font and with MingLiU and MS
> Mincho and it didn't work either. Is this a very standard glyph? Or
> something unusual? So I don't know if this is a bug on our side, missing
> feature or a different problem.
>
> Tilman
>
> On 04.05.2024 07:21, 伊東寿晃 wrote:
> > Hi,
> >
> > In pdfbox 3.0, an IllegalStateException occurs when trying to output
> > surrogate pair characters.
> > According to the exception, it seems that one Kanji character is
> > processed as two chars.
> >
> > Is this a bug?
> > Is there any possible workaround on the program side?
> >
> >
> > **** Conditions ****
> > JDK: 21
> > PDFBox: 3.0.0 / 3.0.1 / 3.0.2
> > Font: Noto Sans Japanese 
> > (https://fonts.google.com/noto/specimen/Noto+Sans+JP)
> > Font and glyph preview :
> > https://fonts.google.com/noto/specimen/Noto+Sans+JP?preview.text=%F0%A9%B8%BD
> >
> > **** Test code ****
> >    public static void main(String[] args) throws IOException {
> >
> >      final String fontPath = "NotoSansJP-Regular.ttf";
> >      final String out = "output.pdf";
> >
> >      // Atka Mackerel in Japanese kanji. (surrogate pair)
> >      final String message = "\uD867\uDE3D";
> >
> >      try (PDDocument doc = new PDDocument()) {
> >        PDPage page = new PDPage();
> >        doc.addPage(page);
> >        PDFont font = PDType0Font.load(doc, new File(fontPath));
> >
> >        try (PDPageContentStream contents =
> >            new PDPageContentStream(doc, page)) {
> >          contents.beginText();
> >          contents.setFont(font, 64);
> >          contents.newLineAtOffset(100, 700);
> >          contents.showText(message);
> >          contents.endText();
> >        }
> >
> >        doc.save(out);
> >        System.out.println(out + " created!");
> >      }
> >    }
> >
> >
> > **** StackTrace ****
> > Exception in thread "main" java.lang.IllegalStateException: could not find the glyphId for the character: ?
> >      at org.apache.pdfbox.pdmodel.PDAbstractContentStream.applyGSUBRules(PDAbstractContentStream.java:1651)
> >      at org.apache.pdfbox.pdmodel.PDAbstractContentStream.encodeForGsub(PDAbstractContentStream.java:1632)
> >      at org.apache.pdfbox.pdmodel.PDAbstractContentStream.showTextInternal(PDAbstractContentStream.java:302)
> >      at org.apache.pdfbox.pdmodel.PDAbstractContentStream.showText(PDAbstractContentStream.java:266)
> >      at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:37)
> >      at org.example.App.main(App.java:30)
> >
> >
> >
> > My English isn't so good so feel free to ask me if there is anything 
> > unclear.
> >
> > --
> > Toshiaki Ito
> > Mail:evolut...@1024kb.cx
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail:users-unsubscr...@pdfbox.apache.org
> > For additional commands, e-mail:users-h...@pdfbox.apache.org
> >



-- 
Toshiaki Ito
Mail: evolut...@1024kb.cx

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
