Hello, I suppose this is a bug report. If this behavior is by design, please let me know. (I had trouble checking whether this had already been reported in Apache's Jira; it kept saying the server was down while I was searching.)

The following Scala program illustrates the problem I am experiencing:

```
import org.apache.commons.logging.LogFactory
import org.apache.fontbox.ttf.{CmapTable, CmapSubtable, OTFParser}

import java.io.File
import scala.jdk.CollectionConverters._

object MultipleEncodingTest {
  def main(args: Array[String]): Unit = {
    val fontFile = new File("./Noto_Sans_SC/NotoSansSC-Regular.otf")
    val otfParser = new OTFParser(false)
    val otfFont = otfParser.parse(fontFile)
    val unicodeCmapLookup = otfFont.getUnicodeCmapLookup()
    val gid = 8712
    val charCodes = unicodeCmapLookup.getCharCodes(gid)
    println(s"Unicode encodings for GID $gid: ${toUnicodeNotation(charCodes)}")

    val cmapTable = otfFont.getCmap()
    val unicodeBmpCmapTable = cmapTable.getSubtable(CmapTable.PLATFORM_UNICODE, CmapTable.ENCODING_UNICODE_2_0_BMP)
    val unicodeFullCmapTable = cmapTable.getSubtable(CmapTable.PLATFORM_UNICODE, CmapTable.ENCODING_UNICODE_2_0_FULL)
    val unicodeBmpCharCodes = unicodeBmpCmapTable.getCharCodes(gid)
    val unicodeFullCharCodes = unicodeFullCmapTable.getCharCodes(gid)
    println(s"Unicode encodings for GID $gid from table (platformId = ${unicodeBmpCmapTable.getPlatformId()} encodingId = ${unicodeBmpCmapTable.getPlatformEncodingId()}): ${toUnicodeNotation(unicodeBmpCharCodes)}")
    println(s"Unicode encodings for GID $gid from table (platformId = ${unicodeFullCmapTable.getPlatformId()} encodingId = ${unicodeFullCmapTable.getPlatformEncodingId()}): ${toUnicodeNotation(unicodeFullCharCodes)}")
    println(s"${unicodeCmapLookup == unicodeFullCmapTable}")
  }

  private def toUnicodeNotation(charCodes: java.util.List[Integer]): Seq[String] = {
    charCodes.asScala.toSeq.map(c => s"U+${Integer.toHexString(c)}")
  }
}
```

The output for this program is the following:

```
Unicode encodings for GID 8712: List(U+f967)
Unicode encodings for GID 8712 from table (platformId = 0 encodingId = 3): List(U+4e0d, U+f967)
Unicode encodings for GID 8712 from table (platformId = 0 encodingId = 4): List(U+f967)
true
```

The font I am using
(`./Noto_Sans_SC/NotoSansSC-Regular.otf`) can be downloaded from Google Fonts: https://fonts.google.com/noto/specimen/Noto+Sans+SC

Taking a dump of NotoSansSC-Regular.otf using ttx, I was able to confirm that both cmap subtables, (platformId = 0 encodingId = 3) and (platformId = 0 encodingId = 4), map the code points U+4e0d and U+f967 to GID 8712. Note, though, that in the program output above, the result for (platformId = 0 encodingId = 4) omits the U+4e0d mapping.

After doing some digging (based on the code at commit 6404de4b8), it seems the problem is with the "processing" functions in the CmapSubtable class. The subtable (platformId = 0 encodingId = 3) is format 4 and gets parsed by the CmapSubtable.processSubtype4 method. This method builds up the CmapSubtable.characterCodeToGlyphId map and then calls CmapSubtable.buildGlyphIdToCharacterCodeLookup. The buildGlyphIdToCharacterCodeLookup method handles the multiple encodings correctly and populates CmapSubtable.glyphIdToCharacterCodeMultiple and CmapSubtable.glyphIdToCharacterCode accordingly.

On the other hand, the subtable (platformId = 0 encodingId = 4) is format 12 and gets processed by CmapSubtable.processSubtype12. The processSubtype12 method does not call CmapSubtable.buildGlyphIdToCharacterCodeLookup and seems to just map the GID to the last Unicode code point associated with it. This would explain the output "Unicode encodings for GID 8712 from table (platformId = 0 encodingId = 4): List(U+f967)", where GID 8712 maps to a single code point. This is a bug, no?

I added the last line of output just to illustrate that OpenTypeFont.getUnicodeCmapLookup seems to return the problem subtable.

Thank you,
Ty Lewis
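
P.S. To make the suspected difference concrete, here is a toy Scala sketch. It does not use FontBox at all, and it is not the library's code: `ReverseCmapSketch`, `lastWins`, and `collectAll` are hypothetical names that only model the two behaviors as I understand them. One reverse map overwrites on every assignment, so the last code point wins. The other keeps every code point that maps to the GID. The input uses the two mappings from this report and assumes Scala 2.13 or later (for `groupMap`).

```scala
object ReverseCmapSketch {
  // Forward mapping (code point -> GID), using the two entries from the report.
  val codeToGid: Seq[(Int, Int)] = Seq(0x4e0d -> 8712, 0xf967 -> 8712)

  // Models the suspected format 12 behavior: each assignment overwrites
  // the previous one, so only the last code point per GID survives.
  def lastWins(pairs: Seq[(Int, Int)]): Map[Int, Int] =
    pairs.foldLeft(Map.empty[Int, Int]) { case (acc, (code, gid)) =>
      acc.updated(gid, code)
    }

  // Models the expected behavior: every code point mapped to a GID is kept,
  // in the order it was encountered.
  def collectAll(pairs: Seq[(Int, Int)]): Map[Int, Seq[Int]] =
    pairs.groupMap(_._2)(_._1)

  private def hex(code: Int): String = s"U+${Integer.toHexString(code)}"

  def main(args: Array[String]): Unit = {
    val gid = 8712
    println(s"last wins:   ${lastWins(codeToGid).get(gid).map(hex)}")
    println(s"collect all: ${collectAll(codeToGid).getOrElse(gid, Seq.empty).map(hex)}")
  }
}
```

With this input, `lastWins` keeps only U+f967 for GID 8712, while `collectAll` keeps both U+4e0d and U+f967, which matches the difference between the two subtables in the output above.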