Re: Unicode, SMS and year 2012

anbu Sat, 28 Apr 2012 07:48:36 -0700

> How data is transformed to this string is
> undefined, which is a problem.

As mentioned in the mail, just as UTF-8 is pre-installed on most systems,
this design would also be pre-installed on the systems that intend to use
it. The example given above does not exist anywhere yet. One needs to come
up with the correct mapping based on frequency of use. Let's say all ANSI
characters that do not fit in eight bits would be encoded in 10 bits
(instead of the 16 bits of UTF-8), all the Cyrillic characters would be
encoded in either 10 or 12 bits (instead of the 16 bits of UTF-8), all the
Tamil characters would be encoded in 18 bits (instead of the 24 bits of
UTF-8), and so on. These are only possibilities. We assign each character
of these scripts to a code point in its specified range. (Please note that
this is not yet done and is possibly not the best choice; the example in
the previous mail is just an arbitrary assumption for conceptualisation,
not based on any theory.) The mapping would be generated in this way. If we
go by assigning all ANSI first, then Cyrillic, then the next suitable
script, and so on, most of the population would be covered.

> Code words starting with an initial 1 code variable-length values,
> which are magically created.

As noted above, they are not going to be magically created (once the
design is complete); codes from this design need to be assigned to
characters in advance. Please note that this encoding is a work in
progress, so I am still working on ways to assign the generated codes to
the characters. Once I have completed that, you may get a clearer picture
of what I want to do.

> * Code words starting with an initial 0 code literal 7-bit ASCII values,
> which follow the initial zero bit. 0MXX XXXL where M and L are MSB and
> LSB of the respective ASCII value.

Thanks! This is what I wanted to suggest here. No correction to this.
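
As a rough illustration of the 0MXX XXXL layout quoted above, here is a
minimal Python sketch. It assumes code words are handled as strings of '0'
and '1' characters; the function names are hypothetical and not part of the
design.

```python
def encode_ascii(ch: str) -> str:
    """Return the 8-bit code word for one 7-bit ASCII character."""
    value = ord(ch)
    if value > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    # A leading 0 bit followed by the seven ASCII bits, MSB first.
    return format(value, "08b")


def decode_ascii(word: str) -> str:
    """Inverse of encode_ascii for an 8-bit word starting with 0."""
    if len(word) != 8 or word[0] != "0":
        raise ValueError("not an ASCII code word")
    return chr(int(word, 2))
```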

> Code words starting with an initial 1 code variable-length values,
> which are magically created. Read N bits until a 1 bit is encountered
> (inclusive) on an even position within the bit string (where the
> position of the initial code word bit is 0) following a 0 bit on an even
> position. The complete word is N+2 bits long, including the initial 1
> bit.

I apologise for my poor explanation. I assure you further that the codes
are not magically created; they are created by the EBNF below. I have
rewritten the EBNF to make it as clear as possible; in fact, there are now
two:

1(0|1){1(0|1)}{0(0|1)}0(0|1)1(0|1)

1(0|1){0(0|1)}{1(0|1)}1(0|1)0(0|1)

All the codes produced by either of these EBNFs, and only these codes, are
valid. That is to say, a code produced by the first EBNF alone is valid,
and likewise a code produced by the second EBNF alone is valid. There is
one constraint on these EBNFs: the code (sentence) produced must always be
longer than 8 bits. That is, repeat any of the groups inside the curly
braces {} until the code is at least 10 bits long.
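
The two EBNFs and the length constraint can be sketched in Python as
regular expressions. This is only an illustration under the assumption that
a code is held as a string of '0' and '1' characters; the names EBNF_1,
EBNF_2, MIN_BITS and is_valid_code are hypothetical.

```python
import re

# Each EBNF repetition {...} becomes a (?:...)* group; (0|1) becomes [01].
EBNF_1 = re.compile(r"1[01](?:1[01])*(?:0[01])*0[01]1[01]")
EBNF_2 = re.compile(r"1[01](?:0[01])*(?:1[01])*1[01]0[01]")

MIN_BITS = 10  # the stated constraint: longer than 8 bits


def is_valid_code(bits: str) -> bool:
    """True if bits is produced by either EBNF and has at least 10 bits."""
    if len(bits) < MIN_BITS:
        return False
    return bool(EBNF_1.fullmatch(bits) or EBNF_2.fullmatch(bits))
```

With these definitions, is_valid_code("1001010110") accepts the example
code from the earlier mail, while a 6-bit string such as "101110" is
rejected by the length constraint.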

* Code words starting with an initial 1 code variable-length values,
which are created from either of the above EBNFs. The position of the
initial code word bit is 0. If the bit at the third position (position 2,
an even position) is a 1 bit, read N bits until a 1 bit is encountered
(inclusive) at an even position, following a 0 bit at an even position. If
the bit at the third position (position 2) is a 0 bit, read N bits until a
0 bit is encountered (inclusive) at an even position, following a 1 bit at
an even position other than the first position (position 0). The complete
word is N+2 bits long, including the initial 1 bit.
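
The reading rule above can be sketched in Python as follows. This follows
the wording of the rule only: it has not been checked against every code
the EBNFs can generate, and it does not enforce the 10-bit minimum. The
function name code_word_length and the use of '0'/'1' strings are
assumptions made for illustration.

```python
from typing import Optional


def code_word_length(stream: str) -> Optional[int]:
    """Length in bits of the code word at the start of stream, or None.

    stream is a string of '0'/'1' characters whose first bit is 1.
    Positions count from 0, so the bits tested are at even positions.
    """
    if len(stream) < 4 or stream[0] != "1":
        return None
    # The bit at position 2 selects the terminator: a 1 after a 0 when it
    # is 1, or a 0 after a 1 when it is 0.
    stop_bit = stream[2]
    before_stop = "0" if stop_bit == "1" else "1"
    # Start at position 4 so the 1 at position 0 never counts as the
    # preceding bit in the second case.
    for pos in range(4, len(stream) - 1, 2):
        if stream[pos] == stop_bit and stream[pos - 2] == before_stop:
            return pos + 2  # include the bit after the terminator
    return None
```

For example, code_word_length("1010000010" + "01000001") reads only the
first ten bits and leaves the following ASCII code word untouched.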

> Also, I wonder how efficiently your encoding can code general texts...
> Seeing as how your 10bit codes can only code 192 out of 512 possible
> values, 12 bit codes only 512 out of 2048 values and so on... This means
> you will have a massive amount of bits for rare-ish characters sooner or
> later...

As for the number of possible values, you are underestimating the longer
codes.
The number of values at 8 bits (which is also the cumulative total up to 8
bits) is 128.
For code lengths greater than 8 bits, the number of values at exactly that
length is given by:

(number of bits - 4) × 2 ^ (number of bits ÷ 2)

 8 bits -  128 values (cumulative:  128 values)
10 bits -  192 values (cumulative:  320 values)
12 bits -  512 values (cumulative:  832 values)
14 bits - 1280 values (cumulative: 2112 values)
16 bits - 3072 values (cumulative: 5184 values; this is about two and a
half times what UTF-8 provides in up to 16 bits = 128 (Basic Latin) + 1920
(all the 16-bit codes of UTF-8))
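
As a cross-check of the per-length figures, the following Python sketch
counts by brute force the bit strings of a given length that either EBNF
produces and compares the result with the formula. The regular expressions
are a direct transcription of the two EBNFs; the function names are
hypothetical.

```python
import re
from itertools import product

# The two EBNFs written as regular expressions.
PATTERNS = [
    re.compile(r"1[01](?:1[01])*(?:0[01])*0[01]1[01]"),
    re.compile(r"1[01](?:0[01])*(?:1[01])*1[01]0[01]"),
]


def count_by_enumeration(n_bits: int) -> int:
    """Count the n-bit strings that either EBNF can produce."""
    total = 0
    for tail in product("01", repeat=n_bits - 1):
        word = "1" + "".join(tail)
        if any(p.fullmatch(word) for p in PATTERNS):
            total += 1
    return total


def count_by_formula(n_bits: int) -> int:
    """(number of bits - 4) * 2 ^ (number of bits / 2)."""
    return (n_bits - 4) * 2 ** (n_bits // 2)


for n in (10, 12, 14, 16):
    print(n, count_by_enumeration(n), count_by_formula(n))
```

Each printed line should show the same number twice: 192, 512, 1280 and
3072 for 10, 12, 14 and 16 bits respectively.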

Thank you for your time. Please contact me if you need more
clarification. I am always willing to clarify this.

Regards,

Anbu

On Sat, 28 Apr 2012 01:46:58 +0200, Robert Abel
<[email protected]> wrote:
> Hi
> 
> On 2012/04/28 00:23, [email protected] wrote:
>> 1. let 'x' be the position of a code positioned at an odd number eg when
>> we take the code '1001010110', the first '1' is positioned at location
>> '1' (so an odd number), the first '0' is positioned at location '2' (not
>> an odd number), the next '0' is positioned at location '3' (an odd
>> number) and so on.
>>
>> 2. the program takes into memory all the bits till it reaches the end
>> (whether they are at position 'x' or not), till it has reached the end
>>
>> 3. the program checks each consecutive bit at position 'x'.
>>
>> 4. The program finds the end by the theory 'The bit before the last bit
>> of the code is reached if and only if the bit value at 'x' has changed
>> twice'. Changing twice is that the bit value must change from the initial
>> '1' to '0', then back to '1'. The last bit is immediately after the '1'
>> at position 'x', which in turn itself comes after a '0' at position 'x'.
>>
>> 5. Here we find this doesn't need much or complicated arithmetic. Simple
>> logic is enough.
> You stated that way too complicated... From what I understand from your
> description:
> 
> * Read data as string of bits. How data is transformed to this string is
> undefined, which is a problem.
> 
> * Code words starting with an initial 0 code literal 7-bit ASCII values,
> which follow the initial zero bit. 0MXX XXXL where M and L are MSB and
> LSB of the respective ASCII value.
> 
> * Code words starting with an initial 1 code variable-length values,
> which are magically created. Read N bits until a 1 bit is encountered
> (inclusive) on an even position within the bit string (where the
> position of the initial code word bit is 0) following a 0 bit on an even
> position. The complete word is N+2 bits long, including the initial 1
> bit.
> 
> 
> Also, I wonder how efficiently your encoding can code general texts...
> Seeing as how your 10bit codes can only code 192 out of 512 possible
> values, 12 bit codes only 512 out of 2048 values and so on... This means
> you will have a massive amount of bits for rare-ish characters sooner or
> later...
> 
> Regards,
> 
> Robert
