
Re: base1024 encoding using Unicode emojis

2018-03-12 Thread Martin J. Dürst via Unicode

On 2018/03/12 02:07, Keith Turner via Unicode wrote:

> Yeah, it certainly results in larger utf8 strings. For example, a sha256
> hash is 112 bytes when encoded as Ecoji utf8. For base64, sha256 is 44
> bytes.
>
> Even though it's more bytes, Ecoji has fewer visible characters than base64
> for sha256. Ecoji has 28 visible characters and base64 has 44. So that makes
> me wonder which one would be quicker for a human to verify on average?
> Also, which one is more accurate for a human to verify? I have no idea. For
> accuracy, it seems like a lot of thought was put into the visual uniqueness
> of Unicode emojis.

Using emoji to help people verify security information is an interesting
idea. What I'm afraid of is that even if emoji are designed with
distinctiveness in mind, some people may have difficulty distinguishing
all the various face variants. Also, while emoji get designed so that
in-font distinguishability is high, the same may not apply across fonts
(e.g. if one has to compare a printed version with a version on-screen).

Regards, Martin.

2018-03-11 6:04 GMT+01:00 Keith Turner via Unicode :

> I created a neat little project based on Unicode emojis. I thought
> some on this list may find it interesting. It encodes arbitrary data
> as 1024 emojis. The project is called Ecoji and is hosted on github
> at https://github.com/keith-turner/ecoji
>
> Below are some examples of encoding and decoding.
>
> $ echo 'Unicode emojis are awesome!!' | ecoji
> 🐦😱🔫🤜👢🔥🇮🐾💎🗓🔯🚜👖🚢🐙🌩💮🔪🎨🤚👥📤🌈📑
>
> $ echo 🐦😱🔫🤜👢🔥🇮🐾💎🗓🔯🚜👖🚢🐙🌩💮🔪🎨🤚👥📤🌈📑 | ecoji -d
> Unicode emojis are awesome!!
>
> I would eventually like to create a base4096 version when there are more
> emojis.

Re: base1024 encoding using Unicode emojis

2018-03-11 Thread Doug Ewell via Unicode

Oh, let him have a little fun. At least he's using emoji for something 
related to characters, instead of playing Mr. Potato Head.


Incidentally, more prior art on large-base encoding:
https://sites.google.com/site/markusicu/unicode/base16k

--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: base1024 encoding using Unicode emojis

2018-03-11 Thread Keith Turner via Unicode

On Sun, Mar 11, 2018 at 11:25 AM, Philippe Verdy  wrote:

> Ideally, the purpose of such a base-1024 encoding is to compact
> arbitrary data into plain text that is safely preserved, including by
> Unicode normalization and by encoding forms like UTF-8.
> But then we should do that in such a way that it minimizes the
> UTF-8 string sizes (emojis are probably not the best set to use, since most
> of them lie in supplementary planes).
>

Yeah, it certainly results in larger utf8 strings. For example, a sha256
hash is 112 bytes when encoded as Ecoji utf8. For base64, sha256 is 44
bytes.

Even though it's more bytes, Ecoji has fewer visible characters than base64
for sha256. Ecoji has 28 visible characters and base64 has 44. So that makes
me wonder which one would be quicker for a human to verify on average?
Also, which one is more accurate for a human to verify? I have no idea. For
accuracy, it seems like a lot of thought was put into the visual uniqueness
of Unicode emojis.
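
The figures above can be checked with a little arithmetic. The Python sketch
below does not call Ecoji itself. It assumes that every 5 input bytes become
4 emojis, that a trailing partial group is padded out to 4 emojis (an
assumption, but one consistent with the 28-character figure), and that each
emoji is a single code point taking 4 bytes in UTF-8.

```python
import base64
import hashlib
import math

# Any input will do; a SHA-256 digest is always 32 raw bytes.
digest = hashlib.sha256(b"example input").digest()

# base64: 4 ASCII characters per 3 input bytes, padded with '='.
b64_chars = len(base64.b64encode(digest))

# Assumed Ecoji-style layout: 4 emojis per 5 input bytes, last group padded.
emoji_count = math.ceil(len(digest) / 5) * 4

# Assumption: each emoji is one supplementary-plane code point.
bytes_per_emoji = len("\U0001F426".encode("utf-8"))
emoji_utf8_bytes = emoji_count * bytes_per_emoji

print(b64_chars, emoji_count, emoji_utf8_bytes)
```

Under those assumptions this prints 44 28 112, matching the numbers quoted
above.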


>
> You can choose another arbitrary set of 1024 codepoints in the BMP that is
> preserved by normalization (no decomposition, combining class=0) and by text
> filters (no controls, no whitespace, possibly no punctuation, only letters
> or digits) and that is still simple to compute with a basic algorithm not
> requiring any table lookup (only a few tests for some boundary values or a
> very small lookup table with 16 entries, one entry for each subset of 64
> values).
>
> Also, some frequent binary data (notably runs of null bytes) should be
> able to use shorter UTF-8 sequences from the ASCII set, so my opinion is
> that the first 64 codes should be the same as standard Base-64. The others
> can be taken easily from CJK blocks or the PUA block in the BMP, but you can
> also select some blocks below the U+0800 codepoint so that they get encoded
> as 2 bytes, instead of 3 for the rest of the BMP (and 4 bytes for most
> emojis, where 10 bits become 32 bits, a huge waste of storage space in UTF-8).
>
> So the real need is to find the smallest set of subranges of 64
> consecutive codepoints with minimal values that contain only letters or
> digits and where all positions are assigned with such general properties.
> Emojis are unlikely to be among them! With this goal, you can even avoid
> using any PUAs (which are likely to be filtered/forbidden by some
> protocols), or compatibility characters (likely to be transformed by
> NFKC/NFKD).
>
> And even within just the BMP, you could reach more than a 10-bit encoding
> (base-1024) and can probably find a 12-bit encoding (base 4096) or more (CJK
> blocks of the BMP offer wide ranges of suitable characters, as well as some
> extended Latin or extended Cyrillic blocks).
>
> If you want to use supplementary characters that are already encoded, then
> you can certainly use CJK blocks in the large supplementary ideographic
> plane and create a 16-bit encoding (base 65536). Only some legacy Emojis in
> the BMP will be used before that.
>
>
>
> 2018-03-11 6:04 GMT+01:00 Keith Turner via Unicode :
>
>> I created a neat little project based on Unicode emojis.  I thought
>> some on this list may find it interesting.  It encodes arbitrary data
>> as 1024 emojis.  The project is called Ecoji and is hosted on github
>> at https://github.com/keith-turner/ecoji
>>
>> Below are some examples of encoding and decoding.
>>
>> $ echo 'Unicode emojis are awesome!!' | ecoji
>> 🐦😱🔫🤜👢🔥🇮🐾💎🗓🔯🚜👖🚢🐙🌩💮🔪🎨🤚👥📤🌈📑
>>
>> $ echo 🐦😱🔫🤜👢🔥🇮🐾💎🗓🔯🚜👖🚢🐙🌩💮🔪🎨🤚👥📤🌈📑   | ecoji -d
>> Unicode emojis are awesome!!
>>
>> I would eventually like to create a base4096 version when there are more
>> emojis.
>>
>> Keith
>>
>>
>

Re: base1024 encoding using Unicode emojis

2018-03-11 Thread Philippe Verdy via Unicode

Ideally, the purpose of such a base-1024 encoding is to compact
arbitrary data into plain text that is safely preserved, including by
Unicode normalization and by encoding forms like UTF-8.
But then we should do that in such a way that it minimizes the
UTF-8 string sizes (emojis are probably not the best set to use, since most
of them lie in supplementary planes).

You can choose another arbitrary set of 1024 codepoints in the BMP that is
preserved by normalization (no decomposition, combining class=0) and by text
filters (no controls, no whitespace, possibly no punctuation, only letters
or digits) and that is still simple to compute with a basic algorithm not
requiring any table lookup (only a few tests for some boundary values or a
very small lookup table with 16 entries, one entry for each subset of 64
values).
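
A minimal sketch of that idea in Python, under loud assumptions: instead of
16 separate ranges of 64, it uses one run of 1024 consecutive CJK Unified
Ideographs starting at U+4E00 (so no lookup table at all), and it marks a
dropped trailing byte with U+5200. The names BASE, PAD, encode and decode are
made up for illustration and are not part of Ecoji or any existing library.

```python
BASE = 0x4E00            # assumed alphabet: U+4E00..U+51FF (1024 code points)
PAD = chr(BASE + 1024)   # assumed marker (U+5200), outside the alphabet


def encode(data: bytes) -> str:
    """Pack bytes into 10-bit symbols, zero-padding the last symbol."""
    nbits = len(data) * 8
    pad_bits = (-nbits) % 10
    value = int.from_bytes(data, "big") << pad_bits
    nsym = (nbits + pad_bits) // 10
    out = [chr(BASE + ((value >> (10 * i)) & 0x3FF))
           for i in reversed(range(nsym))]
    # 4 leftover bytes and a full 5-byte group both yield 4 symbols,
    # so flag the 4-byte case explicitly.
    if len(data) % 5 == 4:
        out.append(PAD)
    return "".join(out)


def decode(text: str) -> bytes:
    """Inverse of encode(); raises ValueError on foreign characters."""
    drop_last = text.endswith(PAD)
    if drop_last:
        text = text[:-1]
    value = 0
    for ch in text:
        v = ord(ch) - BASE
        if not 0 <= v < 1024:
            raise ValueError(f"not in alphabet: {ch!r}")
        value = (value << 10) | v
    nbits = 10 * len(text)
    nbytes = nbits // 8
    value >>= nbits - 8 * nbytes   # discard the zero padding bits
    raw = value.to_bytes(nbytes, "big")
    return raw[:-1] if drop_last else raw


print(decode(encode(b"Unicode emojis are awesome!!")))
```

Every symbol here costs 3 bytes in UTF-8 for 10 bits of payload. The sketch
does not attempt the ASCII-first or below-U+0800 refinements.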

Also, some frequent binary data (notably runs of null bytes) should be
able to use shorter UTF-8 sequences from the ASCII set, so my opinion is
that the first 64 codes should be the same as standard Base-64. The others
can be taken easily from CJK blocks or the PUA block in the BMP, but you can
also select some blocks below the U+0800 codepoint so that they get encoded
as 2 bytes, instead of 3 for the rest of the BMP (and 4 bytes for most
emojis, where 10 bits become 32 bits, a huge waste of storage space in UTF-8).
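
The byte counts per range are easy to confirm. A small Python sketch follows;
the sample characters are arbitrary picks, one from each range, and utf8_len
is just a helper name used here.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# One arbitrary sample per range discussed above.
samples = {
    "ASCII (U+0041)": "\u0041",
    "below U+0800 (U+0416)": "\u0416",
    "rest of BMP (U+4E2D)": "\u4E2D",
    "supplementary plane (U+1F426)": "\U0001F426",
}

for label, ch in samples.items():
    n = utf8_len(ch)
    print(f"{label}: {n} byte(s) = {n * 8} bits in UTF-8")
```

So a base-1024 symbol, which carries 10 bits, takes 24 bits in UTF-8 when
drawn from the BMP at or above U+0800, and 32 bits when drawn from a
supplementary plane.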

So the real need is to find the smallest set of subranges of 64
consecutive codepoints with minimal values that contain only letters or
digits and where all positions are assigned with such general properties.
Emojis are unlikely to be among them! With this goal, you can even avoid
using any PUAs (which are likely to be filtered/forbidden by some
protocols), or compatibility characters (likely to be transformed by
NFKC/NFKD).

And even within just the BMP, you could reach more than a 10-bit encoding
(base-1024) and can probably find a 12-bit encoding (base 4096) or more (CJK
blocks of the BMP offer wide ranges of suitable characters, as well as some
extended Latin or extended Cyrillic blocks).

If you want to use supplementary characters that are already encoded, then
you can certainly use CJK blocks in the large supplementary ideographic
plane and create a 16-bit encoding (base 65536). Only some legacy Emojis in
the BMP will be used before that.



2018-03-11 6:04 GMT+01:00 Keith Turner via Unicode :

> I created a neat little project based on Unicode emojis.  I thought
> some on this list may find it interesting.  It encodes arbitrary data
> as 1024 emojis.  The project is called Ecoji and is hosted on github
> at https://github.com/keith-turner/ecoji
>
> Below are some examples of encoding and decoding.
>
> $ echo 'Unicode emojis are awesome!!' | ecoji
> 🐦😱🔫🤜👢🔥🇮🐾💎🗓🔯🚜👖🚢🐙🌩💮🔪🎨🤚👥📤🌈📑
>
> $ echo 🐦😱🔫🤜👢🔥🇮🐾💎🗓🔯🚜👖🚢🐙🌩💮🔪🎨🤚👥📤🌈📑   | ecoji -d
> Unicode emojis are awesome!!
>
> I would eventually like to create a base4096 version when there are more
> emojis.
>
> Keith
>
>

Re: base1024 encoding using Unicode emojis

2018-03-11 Thread Mathias Bynens via Unicode

Neat! Prior art:

   - https://github.com/watson/base64-emoji
   - https://github.com/nate-parrott/emojicode


On Sun, Mar 11, 2018 at 6:04 AM, Keith Turner via Unicode <
unicode@unicode.org> wrote:

> I created a neat little project based on Unicode emojis.  I thought
> some on this list may find it interesting.  It encodes arbitrary data
> as 1024 emojis.  The project is called Ecoji and is hosted on github
> at https://github.com/keith-turner/ecoji
>
> Below are some examples of encoding and decoding.
>
> $ echo 'Unicode emojis are awesome!!' | ecoji
> 🐦😱🔫🤜👢🔥🇮🐾💎🗓🔯🚜👖🚢🐙🌩💮🔪🎨🤚👥📤🌈📑
>
> $ echo 🐦😱🔫🤜👢🔥🇮🐾💎🗓🔯🚜👖🚢🐙🌩💮🔪🎨🤚👥📤🌈📑   | ecoji -d
> Unicode emojis are awesome!!
>
> I would eventually like to create a base4096 version when there are more
> emojis.
>
> Keith
>
>
