Ideally, the purpose of such a base-1024 encoding is to compact arbitrary data into plain text that is safely preserved, including through Unicode normalization and through encodings such as UTF-8. But then we should do it in a way that minimizes the size of the resulting UTF-8 strings (emojis are probably not the best set to use, since most of them lie in the supplementary planes).
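To make the size argument concrete, here is a small Python sketch that measures how many UTF-8 bytes one symbol costs and what fraction of those bytes carries payload when each symbol holds 10 bits. The function name `utf8_efficiency` and the three sample code points are only illustrative choices, not part of Ecoji or of any existing tool.

```python
def utf8_efficiency(code_point, payload_bits=10):
    """Return (UTF-8 byte length, payload bits per encoded bit) for one symbol."""
    nbytes = len(chr(code_point).encode("utf-8"))
    return nbytes, payload_bits / (8 * nbytes)

# Illustrative samples: one code point below U+0800, one elsewhere in the BMP,
# and one emoji from a supplementary plane.
for label, cp in [("U+0416", 0x0416), ("U+4E00", 0x4E00), ("U+1F600", 0x1F600)]:
    nbytes, ratio = utf8_efficiency(cp)
    print(f"{label}: {nbytes} bytes, {ratio:.1%} payload")
# U+0416: 2 bytes, 62.5% payload
# U+4E00: 3 bytes, 41.7% payload
# U+1F600: 4 bytes, 31.2% payload
```

For comparison, standard Base64 carries 6 bits per 1-byte ASCII character, that is 75% payload.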
You can choose another arbitrary set of 1024 code points in the BMP that is preserved by normalization (no decomposition, combining class = 0) and by text filters (no controls, no whitespace, possibly no punctuation, only letters or digits), and which is still simple to compute with a basic algorithm that does not require any large table lookup (only a few tests against boundary values, or a very small lookup table with 16 entries, one for each subset of 64 values).

Also, some frequent binary data (notably runs of null bytes) should be able to use shorter UTF-8 sequences from the ASCII set, so in my opinion the first 64 codes should be the same as in standard Base64. The others can easily be taken from CJK blocks or from the PUA block in the BMP, but you can also select some blocks below U+0800, so that they are encoded as 2 bytes rather than the 3 bytes needed for the rest of the BMP (or the 4 bytes needed for most emojis, where 10 bits become 32 bits, a huge waste of storage space in UTF-8).

So the real need is to find the smallest set of subranges of 64 consecutive code points, with minimal values, that contain only letters or digits and in which every position is assigned with these general properties. Emojis are unlikely to be among them! With this goal, you can even avoid using any PUA characters (which are likely to be filtered or forbidden by some protocols) and any compatibility characters (which are likely to be transformed by NFKC/NFKD).

And even within just the BMP, you could go beyond a 10-bit encoding (base 1024) and can probably find a 12-bit encoding (base 4096) or more: the CJK blocks of the BMP offer wide ranges of suitable characters, as do some extended Latin or extended Cyrillic blocks. If you want to use supplementary characters that are already encoded, then you can certainly use the CJK blocks in the large supplementary ideographic plane and create a 16-bit encoding (base 65536). Only some legacy emojis in the BMP would be used before that.
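A minimal sketch of such a scheme in Python, under stated assumptions: the 16-entry table uses the standard Base64 alphabet for values 0 to 63 and, purely for illustration, 15 runs of 64 code points taken from the start of the CJK Unified Ideographs block (U+4E00 to U+51BF) for the rest. The names `BLOCK_STARTS`, `symbol`, `value_of`, `encode`, `decode` and `is_stable`, and the trailing `=` convention for dropping one padding byte, are hypothetical choices made for this example; a real design would rather pick the lowest suitable ranges, as discussed above.

```python
import unicodedata

BASE64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
# 16-entry table, one entry per subset of 64 values. Entry 0 is the Base64
# alphabet; entries 1..15 are ASSUMED runs at the start of CJK Unified Ideographs.
BLOCK_STARTS = [None] + [0x4E00 + 64 * i for i in range(15)]

def symbol(value):
    """Map a 10-bit value (0..1023) to one character."""
    block, offset = divmod(value, 64)
    return BASE64[offset] if block == 0 else chr(BLOCK_STARTS[block] + offset)

def value_of(ch):
    """Map one character back to its 10-bit value."""
    pos = BASE64.find(ch)
    if pos >= 0:
        return pos
    cp = ord(ch)
    for block in range(1, 16):
        start = BLOCK_STARTS[block]
        if start <= cp < start + 64:
            return block * 64 + (cp - start)
    raise ValueError(f"not in the alphabet: {ch!r}")

def encode(data):
    nbits = 8 * len(data)
    pad = -nbits % 10                      # zero bits added to reach a multiple of 10
    bits = int.from_bytes(data, "big") << pad
    n = (nbits + pad) // 10
    out = [symbol((bits >> (10 * (n - 1 - i))) & 0x3FF) for i in range(n)]
    if pad == 8:                           # padding is a whole byte: mark it (assumed convention)
        out.append("=")
    return "".join(out)

def decode(text):
    drop = text.endswith("=")
    if drop:
        text = text[:-1]
    bits = 0
    for ch in text:
        bits = (bits << 10) | value_of(ch)
    nbits = 10 * len(text)
    nbytes = nbits // 8
    bits >>= nbits - 8 * nbytes            # discard the partial-byte padding bits
    data = bits.to_bytes(nbytes, "big")
    return data[:-1] if drop else data

def is_stable(ch):
    """True if the character survives all four Unicode normalization forms."""
    return all(unicodedata.normalize(f, ch) == ch for f in ("NFC", "NFD", "NFKC", "NFKD"))

print(encode(b"\x00" * 5))                 # AAAA
```

With this table, runs of null bytes stay in ASCII, and any input made of 10-bit values below 64 costs one UTF-8 byte per symbol; every other symbol costs 3 bytes, which blocks chosen below U+0800 would reduce to 2.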
2018-03-11 6:04 GMT+01:00 Keith Turner via Unicode <[email protected]>:

> I created a neat little project based on Unicode emojis. I thought
> some on this list may find it interesting. It encodes arbitrary data
> as 1024 emojis. The project is called Ecoji and is hosted on github
> at https://github.com/keith-turner/ecoji
>
> Below are some examples of encoding and decoding.
>
> $ echo 'Unicode emojis are awesome!!' | ecoji
> ๐ฆ๐ฑ๐ซ๐ค๐ข๐ฅ๐ฎ๐พ๐๐๐ฏ๐๐๐ข๐๐ฉ๐ฎ๐ช๐จ๐ค๐ฅ๐ค๐๐
>
> $ echo ๐ฆ๐ฑ๐ซ๐ค๐ข๐ฅ๐ฎ๐พ๐๐๐ฏ๐๐๐ข๐๐ฉ๐ฎ๐ช๐จ๐ค๐ฅ๐ค๐๐ | ecoji -d
> Unicode emojis are awesome!!
>
> I would eventually like to create a base4096 version when there are more
> emojis.
>
> Keith

