FontBox: Failing to get multiple encodings from cmap table

Ty Lewis Tue, 16 Nov 2021 11:53:18 -0800

Hello,

I suppose this is a bug report. If this behavior is by design, please let
me know. (I had trouble checking whether this had already been reported in
Apache's Jira; it kept saying the server was down while I was searching.)


The following Scala program illustrates the problem I am experiencing:

```
import org.apache.commons.logging.LogFactory
import org.apache.fontbox.ttf.{CmapTable, CmapSubtable, OTFParser}
import java.io.File
import scala.jdk.CollectionConverters._

object MultipleEncodingTest {
  def main(args: Array[String]): Unit = {
    val fontFile = new File("./Noto_Sans_SC/NotoSansSC-Regular.otf")
    val otfParser = new OTFParser(false)
    val otfFont = otfParser.parse(fontFile)

    val unicodeCmapLookup = otfFont.getUnicodeCmapLookup()
    val gid = 8712
    val charCodes = unicodeCmapLookup.getCharCodes(gid)
    println(s"Unicode encodings for GID $gid: ${toUnicodeNotation(charCodes)}")

    val cmapTable = otfFont.getCmap()
    val unicodeBmpCmapTable =
      cmapTable.getSubtable(CmapTable.PLATFORM_UNICODE, CmapTable.ENCODING_UNICODE_2_0_BMP)
    val unicodeFullCmapTable =
      cmapTable.getSubtable(CmapTable.PLATFORM_UNICODE, CmapTable.ENCODING_UNICODE_2_0_FULL)

    val unicodeBmpCharCodes = unicodeBmpCmapTable.getCharCodes(gid)
    val unicodeFullCharCodes = unicodeFullCmapTable.getCharCodes(gid)

    println(s"Unicode encodings for GID $gid from table " +
      s"(platformId = ${unicodeBmpCmapTable.getPlatformId()} " +
      s"encodingId = ${unicodeBmpCmapTable.getPlatformEncodingId()}): " +
      s"${toUnicodeNotation(unicodeBmpCharCodes)}")
    println(s"Unicode encodings for GID $gid from table " +
      s"(platformId = ${unicodeFullCmapTable.getPlatformId()} " +
      s"encodingId = ${unicodeFullCmapTable.getPlatformEncodingId()}): " +
      s"${toUnicodeNotation(unicodeFullCharCodes)}")

    println(s"${unicodeCmapLookup == unicodeFullCmapTable}")
  }

  private def toUnicodeNotation(charCodes: java.util.List[Integer]): Seq[String] = {
    charCodes.asScala.toSeq.map(c => s"U+${Integer.toHexString(c)}")
  }
}
```

The output for this program is the following:
```
Unicode encodings for GID 8712: List(U+f967)
Unicode encodings for GID 8712 from table (platformId = 0 encodingId = 3): List(U+4e0d, U+f967)
Unicode encodings for GID 8712 from table (platformId = 0 encodingId = 4): List(U+f967)
true
```

The font I am using (`./Noto_Sans_SC/NotoSansSC-Regular.otf`) can be
downloaded from Google Fonts:
https://fonts.google.com/noto/specimen/Noto+Sans+SC

Taking a dump of NotoSansSC-Regular.otf using ttx, I was able to confirm
that both cmap subtables, (platformId = 0 encodingId = 3) and
(platformId = 0 encodingId = 4), map the code points U+4e0d and U+f967 to
GID 8712. Note, though, that the FontBox result for (platformId = 0
encodingId = 4) omits the U+4e0d mapping.

After doing some digging (based on the code at commit 6404de4b8), it seems
the problem is with the "processing" functions in the CmapSubtable class.
The subtable (platformId = 0 encodingId = 3) uses format 4 and is
parsed by the CmapSubtable.processSubtype4 method. This method builds up
the CmapSubtable.characterCodeToGlyphId map and then calls
CmapSubtable.buildGlyphIdToCharacterCodeLookup. The
buildGlyphIdToCharacterCodeLookup handles the multiple encodings correctly
and populates CmapSubtable.glyphIdToCharacterCodeMultiple and
CmapSubtable.glyphIdToCharacterCode accordingly.
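
To make the behavior I expected concrete, here is a minimal, self-contained sketch of inverting a code-to-glyph map while keeping every code point per glyph. This is not FontBox's implementation: `ReverseCmapSketch` and `invert` are hypothetical names, and the forward map contains only the two mappings from this report plus one made-up entry.

```scala
object ReverseCmapSketch {
  // Invert a (character code -> glyph ID) map, keeping every code that maps to a glyph.
  def invert(codeToGid: Map[Int, Int]): Map[Int, List[Int]] =
    codeToGid.toList
      .groupBy { case (_, gid) => gid }
      .map { case (gid, pairs) => gid -> pairs.map(_._1).sorted }

  def main(args: Array[String]): Unit = {
    // 0x41 -> 34 is an illustrative entry, not taken from the font.
    val codeToGid = Map(0x4e0d -> 8712, 0xf967 -> 8712, 0x41 -> 34)
    val gidToCodes = invert(codeToGid)
    println(gidToCodes(8712).map(c => f"U+$c%04x"))
  }
}
```

With a lookup built this way, GID 8712 resolves to both U+4e0d and U+f967, which matches what the format 4 subtable returns.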

On the other hand, the subtable (platformId = 0 encodingId = 4) uses
format 12 and is processed by CmapSubtable.processSubtype12. The
processSubtype12 method does not call
CmapSubtable.buildGlyphIdToCharacterCodeLookup and seems to just map each
GID to the last Unicode code point associated with it. This would explain
the output "Unicode encodings for GID 8712 from table (platformId = 0
encodingId = 4): List(U+f967)", where GID 8712 maps to a single
code point. This is a bug, no?
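
The symptom is consistent with a last-write-wins reverse lookup. The sketch below models format 12 groups (start character code, end character code, start glyph ID) and builds a single-valued glyph-to-code map from them. The names `Format12Sketch`, `Group`, `expand`, and `lastWins` are hypothetical, the group data is invented to reproduce the two mappings above, and this is only my guess at the mechanism, not FontBox's actual code.

```scala
object Format12Sketch {
  // One sequential map group: codes startCode..endCode map to consecutive glyph IDs.
  final case class Group(startCode: Int, endCode: Int, startGid: Int)

  // Expand groups into (code, gid) pairs in table order.
  def expand(groups: Seq[Group]): Seq[(Int, Int)] =
    for {
      g <- groups
      offset <- 0 to (g.endCode - g.startCode)
    } yield (g.startCode + offset, g.startGid + offset)

  // Single-valued reverse lookup: a later code silently overwrites an earlier one.
  def lastWins(groups: Seq[Group]): Map[Int, Int] =
    expand(groups).foldLeft(Map.empty[Int, Int]) {
      case (acc, (code, gid)) => acc.updated(gid, code)
    }

  def main(args: Array[String]): Unit = {
    // Invented groups: both code points from this report map to GID 8712.
    val groups = Seq(Group(0x4e0d, 0x4e0d, 8712), Group(0xf967, 0xf967, 8712))
    println(lastWins(groups).get(8712).map(_.toHexString))
  }
}
```

Because U+f967 is processed after U+4e0d, it is the only code point left for GID 8712, which is the same result I see from the format 12 subtable.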

I added the last line of output just to illustrate that
OpenTypeFont.getUnicodeCmapLookup seems to return the problem subtable.

Thank you,

Ty Lewis
