Re: base1024 encoding using Unicode emojis
On 2018/03/12 02:07, Keith Turner via Unicode wrote:

> Yeah, it certainly results in larger UTF-8 strings. For example, a SHA-256
> hash is 112 bytes when encoded as Ecoji UTF-8. For base64, SHA-256 is 44
> bytes.
>
> Even though it's more bytes, Ecoji has fewer visible characters than base64
> for SHA-256: Ecoji has 28 visible characters and base64 has 44. So that
> makes me wonder which one would be quicker for a human to verify on
> average? Also, which one is more accurate for a human to verify? I have no
> idea. For accuracy, it seems like a lot of thought was put into the visual
> uniqueness of Unicode emojis.

Using emoji to help people verify security information is an interesting idea. What I'm afraid of is that even if emoji are designed with distinctiveness in mind, some people may have difficulty distinguishing all the various face variants. Also, while emoji are designed so that in-font distinguishability is high, the same may not apply across fonts (e.g. if one has to compare a printed version with a version on-screen).

Regards, Martin.

2018-03-11 6:04 GMT+01:00 Keith Turner via Unicode :

> I created a neat little project based on Unicode emojis. I thought
> some on this list may find it interesting. It encodes arbitrary data
> using an alphabet of 1024 emojis. The project is called Ecoji and is
> hosted on GitHub at https://github.com/keith-turner/ecoji
>
> Below are some examples of encoding and decoding.
>
> $ echo 'Unicode emojis are awesome!!' | ecoji
> ๐ฆ๐ฑ๐ซ๐ค๐ข๐ฅ๐ฎ๐พ๐๐๐ฏ๐๐๐ข๐๐ฉ๐ฎ๐ช๐จ๐ค๐ฅ๐ค๐๐
>
> $ echo ๐ฆ๐ฑ๐ซ๐ค๐ข๐ฅ๐ฎ๐พ๐๐๐ฏ๐๐๐ข๐๐ฉ๐ฎ๐ช๐จ๐ค๐ฅ๐ค๐๐ | ecoji -d
> Unicode emojis are awesome!!
>
> I would eventually like to create a base4096 version when there are more
> emojis.
Re: base1024 encoding using Unicode emojis
Oh, let him have a little fun. At least he's using emoji for something related to characters, instead of playing Mr. Potato Head.

Incidentally, more prior art on large-base encoding:

https://sites.google.com/site/markusicu/unicode/base16k

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: base1024 encoding using Unicode emojis
On Sun, Mar 11, 2018 at 11:25 AM, Philippe Verdy wrote:

> Ideally, the purpose of such a base-1024 encoding is to compact
> arbitrary data into plain text that is safely preserved, including by
> Unicode normalization and by encoding transforms such as UTF-8.
> But there are ways to do that which minimize the UTF-8 string sizes
> (emojis are probably not the best set to use, since most of them lie
> in the supplementary planes).

Yeah, it certainly results in larger UTF-8 strings. For example, a SHA-256 hash is 112 bytes when encoded as Ecoji UTF-8. For base64, SHA-256 is 44 bytes.

Even though it's more bytes, Ecoji has fewer visible characters than base64 for SHA-256: Ecoji has 28 visible characters and base64 has 44. So that makes me wonder which one would be quicker for a human to verify on average? Also, which one is more accurate for a human to verify? I have no idea. For accuracy, it seems like a lot of thought was put into the visual uniqueness of Unicode emojis.

> You can choose another arbitrary set of 1024 code points in the BMP that is
> preserved by normalization (no decomposition, combining class = 0) and by
> text filters (no controls, no whitespace, possibly no punctuation, only
> letters or digits), and which is still simple to compute with a basic
> algorithm not requiring any large table lookup (only a few tests for some
> boundary values, or a very small lookup table with 16 entries, one entry
> for each subset of 64 values).
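The size figures in this message (44 characters for base64, 28 characters and 112 bytes for Ecoji) can be reproduced with a short Python sketch. It rests on assumptions that are not stated in the thread: that the base-1024 encoder packs 5 input bytes into 4 symbols of 10 bits each, that a trailing partial group is padded out to 4 symbols, and that every symbol costs 4 bytes in UTF-8. The function name `base1024_lengths` is made up for illustration.

```python
import base64
import hashlib
import math


def base1024_lengths(n_bytes, utf8_bytes_per_symbol=4):
    """Return (symbols, utf8_bytes) for n_bytes of input.

    Assumption: 5 input bytes (40 bits) become 4 symbols of 10 bits each,
    and a trailing partial group is padded out to a full 4 symbols.
    """
    groups = math.ceil(n_bytes / 5)
    symbols = groups * 4
    return symbols, symbols * utf8_bytes_per_symbol


# Any input works: a SHA-256 digest is always 32 bytes long.
digest = hashlib.sha256(b"Unicode emojis are awesome!!").digest()

# base64 output is pure ASCII, so characters and UTF-8 bytes coincide.
base64_chars = len(base64.b64encode(digest))

symbols, utf8_bytes = base1024_lengths(len(digest))

print(base64_chars, symbols, utf8_bytes)  # 44 28 112
```

Under these assumptions the emoji form is roughly 2.5 times larger in bytes while showing about a third fewer visible symbols, which is the trade-off the message describes.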
Re: base1024 encoding using Unicode emojis
Ideally, the purpose of such a base-1024 encoding is to compact arbitrary data into plain text that is safely preserved, including by Unicode normalization and by encoding transforms such as UTF-8. But there are ways to do that which minimize the UTF-8 string sizes (emojis are probably not the best set to use, since most of them lie in the supplementary planes).

You can choose another arbitrary set of 1024 code points in the BMP that is preserved by normalization (no decomposition, combining class = 0) and by text filters (no controls, no whitespace, possibly no punctuation, only letters or digits), and which is still simple to compute with a basic algorithm not requiring any large table lookup (only a few tests for some boundary values, or a very small lookup table with 16 entries, one entry for each subset of 64 values).

Also, some frequent binary data (notably runs of null bytes) should be able to use shorter UTF-8 sequences from the ASCII set, so my opinion is that the first 64 codes should be the same as standard Base64. The others can easily be taken from CJK blocks or from the PUA block in the BMP, but you can also select some blocks below U+0800 so that they are encoded as 2 bytes rather than the 3 bytes needed for the rest of the BMP (and 4 bytes for most emojis, where 10 bits become 32 bits, a huge waste of storage space in UTF-8).

So the real need is to find the smallest set of subranges of 64 consecutive code points, with minimal values, that contain only letters or digits and where all positions are assigned with such general properties. Emojis are unlikely to be among them! With this goal, you can even avoid using any PUA characters (which are likely to be filtered or forbidden by some protocols) or compatibility characters (likely to be transformed by NFKC/NFKD).
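The selection rule described above (16 subranges of 64 consecutive letters or digits that survive normalization, addressed through a 16-entry table) can be sketched in Python with the standard `unicodedata` module. This is an illustration under assumptions, not a vetted alphabet: the names `is_stable`, `find_blocks`, `encode_value` and `decode_char` are invented here, the blocks found depend on the Unicode version shipped with the Python interpreter, and the sketch does not make the first 64 codes match Base64. It also skips the Hangul Jamo block, an extra precaution not mentioned in the message, because conjoining jamo can compose with each other under NFC when they appear in sequence.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_stable(cp):
    """True if the code point is a letter or decimal digit, has combining
    class 0, and is unchanged by all four normalization forms."""
    if 0x1100 <= cp <= 0x11FF:
        # Assumption of this sketch: avoid conjoining Hangul jamo, which
        # are stable alone but can compose in sequence under NFC.
        return False
    ch = chr(cp)
    cat = unicodedata.category(ch)
    if not (cat.startswith("L") or cat == "Nd"):
        return False
    if unicodedata.combining(ch) != 0:
        return False
    return all(unicodedata.normalize(form, ch) == ch for form in FORMS)


def find_blocks(count=16, size=64, limit=0x10000):
    """Collect the start of the first `count` non-overlapping runs of `size`
    consecutive stable code points below `limit` (the BMP by default)."""
    bases = []
    run_start = None
    for cp in range(limit):
        if not is_stable(cp):
            run_start = None
            continue
        if run_start is None:
            run_start = cp
        if cp - run_start + 1 == size:
            bases.append(run_start)
            run_start = None
            if len(bases) == count:
                break
    return bases


# The 16-entry lookup table: one base code point per subset of 64 values.
BASES = find_blocks()


def encode_value(v):
    """Map a 10-bit value to one character: the high 4 bits pick the block,
    the low 6 bits pick the position inside it."""
    return chr(BASES[v >> 6] + (v & 0x3F))


def decode_char(ch):
    """Inverse of encode_value, using a linear scan of the 16 blocks."""
    cp = ord(ch)
    for index, base in enumerate(BASES):
        if base <= cp < base + 64:
            return (index << 6) | (cp - base)
    raise ValueError("character is not in the alphabet")


print([hex(base) for base in BASES])
```

Because the scan starts from the lowest code points, it favors blocks with short UTF-8 encodings, in line with the goal of minimizing string size. A production alphabet would still need review against the filters of the protocols it has to pass through.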
And even within just the BMP, you could go beyond a 10-bit encoding (base 1024) and can probably find a 12-bit encoding (base 4096) or more (the CJK blocks of the BMP offer wide ranges of suitable characters, as do some extended Latin or extended Cyrillic blocks).

If you want to use supplementary characters that are already encoded, then you can certainly use CJK blocks in the large supplementary ideographic plane and create a 16-bit encoding (base 65536). Only some legacy emojis in the BMP will be used before that.
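The storage argument in this message comes down to payload bits per bit of UTF-8 output. The following Python sketch works that out for the alphabets discussed. The sample code points are only representatives of each range, chosen here for illustration (for example U+4E00 for BMP CJK and U+1F600 for supplementary-plane emoji), and the names `utf8_len`, `density` and `SCHEMES` are hypothetical.

```python
def utf8_len(cp):
    """Number of bytes UTF-8 needs for one code point."""
    return len(chr(cp).encode("utf-8"))


def density(bits, cp):
    """Payload bits carried per bit of UTF-8 output."""
    return bits / (8 * utf8_len(cp))


# (label, payload bits per character, sample code point from the range)
SCHEMES = [
    ("base64, ASCII", 6, 0x41),
    ("base1024, below U+0800", 10, 0x0400),
    ("base1024, rest of the BMP", 10, 0x4E00),
    ("base1024, supplementary emoji", 10, 0x1F600),
    ("base4096, BMP CJK", 12, 0x4E00),
    ("base65536, supplementary CJK", 16, 0x20000),
]

for label, bits, cp in SCHEMES:
    out_bits = 8 * utf8_len(cp)
    print(f"{label:30} {bits:2} bits in {out_bits:2} = {density(bits, cp):.3f}")
```

With these samples, base64 carries 0.75 payload bits per output bit, a base-1024 alphabet below U+0800 carries 0.625, and a base-1024 alphabet made of supplementary-plane emoji carries 0.3125, since each 10-bit symbol occupies 4 bytes (32 bits). A 12-bit alphabet in the BMP and a 16-bit alphabet in the supplementary planes both reach 0.5.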
Re: base1024 encoding using Unicode emojis
Neat! Prior art:

- https://github.com/watson/base64-emoji
- https://github.com/nate-parrott/emojicode