Please note some corrections and additions to the comparison of the values.

My design provides the following number of values for the specified number of bits:
8 bits - 128 values (Cumulative: 128 values)
10 bits - 192 values (Cumulative: 320 values)
12 bits - 512 values (Cumulative: 832 values)
14 bits - 1280 values (Cumulative: 2112 values)
16 bits - 3072 values (Cumulative: 5184 values)

Note: UTF-8 has 2048 values of 16 bits (Cumulative: 2176). This clearly shows that my design yields more than double the number of values of UTF-8.

18 bits - 7168 values (Cumulative: 12352 values)

and so on. At any given number of bits, my design yields more values (with the sole exception of 48 bits/6 bytes, where UTF-8 yields more values than my design; but at the next possible length, 50 bits, my design resumes its trajectory of having more values than UTF-8).

Another advantage is that my design increments progressively by two bits.

Please refer to the attached spreadsheet for more comparison of values.

-------- Original Message --------
Subject: Re: Unicode, SMS and year 2012
Date: Sat, 28 Apr 2012 07:54:02 -0400
From: <[email protected]>
To: <[email protected]>

> How data is transformed to this string is
> undefined, which is a problem.

As mentioned in the mail, just as UTF-8 is pre-installed in most systems, this design would also be pre-installed in the systems intending to use it. The example given above does not exist anywhere. One needs to come up with the correct mapping based on the frequency of use; let's say all ANSI characters not encoded in eight bits would be encoded in 10 bits (instead of the 16 bits of UTF-8), all the Cyrillic characters would be encoded in either 10 or 12 bits (instead of the 16 bits of UTF-8), all the Tamil characters would be assigned 18 bits (instead of the 24 bits of UTF-8), and so on.

The above are possibilities. We assign each character of the latter and former scripts to a code point in their specified range (please note that this is not yet done and possibly not the best; the example in the previous mail is just a random assumption for conceptualisation, not based on any theory).
We generate a mapping something like this. If we go by assigning all ANSI, then Cyrillic, then the next suitable script and so on, most of the population would be covered.

> Code words starting with an initial 1 code variable-length values,
> which are magically created.

As noted above, they are not going to be magically created (once the design is complete); codes from this design need to be predefined for characters. Please note that this encoding is a work in progress, so I am still working on ways to assign the generated codes to the characters. Maybe after I have completed that, you will get a clear picture of what I want to do.

> * Code words starting with an initial 0 code literal 7-bit ASCII values,
> which follow the initial zero bit. 0MXX XXXL where M and L are MSB and
> LSB of the respective ASCII value.

Thanks! This is what I wanted to suggest here. No correction to this.

> Code words starting with an initial 1 code variable-length values,
> which are magically created. Read N bits until a 1 bit is encountered
> (inclusive) on an even position within the bit string (where the
> position of the initial code word bit is 0) following a 0 bit on an even
> position. The complete word is N+2 bits long, including the initial 1 bit.

I apologise for my poor explanation. I further assure you that the codes are not magically created; they are created by the EBNF below. I regenerated the EBNF to make myself as clear as possible; in fact, there are now two:

1(0|1){1(0|1)}{0(0|1)}0(0|1)1(0|1)
1(0|1){0(0|1)}{1(0|1)}1(0|1)0(0|1)

All the codes produced, and only the codes produced, by either EBNF are valid. That is to say, a code produced independently from the first EBNF is valid; similarly, a code produced independently from the second EBNF is also valid. There is one constraint on these EBNFs: the code (sentence) produced must always be longer than 8 bits. That is, repeat any of the groups inside the curly braces {} until the code is at least 10 bits long.
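The two rules can be checked mechanically. The following is a minimal Python sketch, not part of the design itself: it transcribes each EBNF as a regular expression, assuming `{...}` means "zero or more repetitions", and counts the valid codes of a given length by brute force. The names `RULE_A`, `RULE_B`, `is_valid_code` and `count_codes` are made up for this illustration.

```python
import re
from itertools import product

# The two EBNF rules transcribed as regular expressions.
# Assumption: {...} is read as "zero or more repetitions".
RULE_A = re.compile(r"1[01](?:1[01])*(?:0[01])*0[01]1[01]")
RULE_B = re.compile(r"1[01](?:0[01])*(?:1[01])*1[01]0[01]")


def is_valid_code(bits: str) -> bool:
    """True if bits is longer than 8 bits and fully matches either rule."""
    if len(bits) <= 8:
        return False
    return bool(RULE_A.fullmatch(bits) or RULE_B.fullmatch(bits))


def count_codes(n_bits: int) -> int:
    """Count valid codes of exactly n_bits bits by trying every bit string."""
    return sum(
        is_valid_code("1" + "".join(rest))
        for rest in product("01", repeat=n_bits - 1)
    )


# Compare the brute-force count with (bits - 4) * 2 ^ (bits / 2).
for n in (10, 12, 14, 16):
    print(n, count_codes(n), (n - 4) * 2 ** (n // 2))
```

On this reading of the braces, the brute-force counts for 10 to 16 bits come out as 192, 512, 1280 and 3072, which agrees with the per-length figures quoted in this message.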
* Code words starting with an initial 1 code variable-length values, which are created from either of the above EBNFs. Read N bits until a 1 bit is encountered (inclusive) on an even position within the bit string (where the position of the initial code word bit is 0) following a 0 bit on an even position [this statement is correct and valid only if the bit in the third position (position 2, an even position) is a 1 bit]. If the bit in the third position (position 2, an even position) is a 0 bit, then read N bits until a 0 bit is encountered (inclusive) on an even position within the bit string (where the position of the initial code word bit is 0) following a 1 bit on an even position other than the first position (position 0). The complete word is N+2 bits long, including the initial 1 bit.

> Also, I wonder how efficiently your encoding can code general texts...
> Seeing as how your 10bit codes can only code 192 out of 512 possible
> values, 12 bit codes only 512 out of 2048 values and so on... This means
> you will have a massive amount of bits for rare-ish characters sooner or
> later...

As for the number of possible values, you are underestimating the future codes. The number of characters at 8 bits (which is also the cumulative total up to 8 bits) is 128 values. For lengths greater than 8 bits, the number of values at one specific length is given by:

[number of bits - 4] × [2 ^ (number of bits ÷ 2)]

8 bits - 128 values (cumulative: 128 values)
10 bits - 192 values (cumulative: 320 values)
12 bits - 512 values (cumulative: 704 values)
14 bits - 1280 values (cumulative: 1792 values)
16 bits - 3072 values (cumulative: 4352, this is double of what UTF-8 provides = 128 (Basic Latin) + 1024 (all the 16 bit codes of UTF-8 count to this))

Thank you for your time. Please contact me if you need more clarification; I am always willing to clarify this.
Regards,
Anbu

On Sat, 28 Apr 2012 01:46:58 +0200, Robert Abel <[email protected]> wrote:
> Hi
>
> On 2012/04/28 00:23, [email protected] wrote:
>> 1. let 'x' be the position of a code positioned at an odd number eg when
>> we take the code '1001010110', the first '1' is positioned at location
>> '1' (so an odd number), the first '0' is positioned at location '2' (not
>> an odd number), the next '0' is positioned at location '3' (an odd
>> number) and so on.
>>
>> 2. the program takes into memory all the bits till it reaches the end
>> (whether they are at position 'x' or not), till it has reached the end
>>
>> 3. the program checks each consecutive bit at position 'x'.
>>
>> 4. The program finds the end by the theory 'The bit before the last bit
>> of the code is reached if and only if the bit value at 'x' has changed
>> twice'. Changing twice is that the bit value must change from the
>> initial '1' to '0', then back to '1'. The last bit is immediately after
>> the '1' at position 'x', which in turn itself comes after a '0' at
>> position 'x'.
>>
>> 5. Here we find this doesn't need much or complicated arithmetic. Simple
>> logic is enough.
> You stated that way too complicated... From what I understand from your
> description:
>
> * Read data as string of bits. How data is transformed to this string is
> undefined, which is a problem.
>
> * Code words starting with an initial 0 code literal 7-bit ASCII values,
> which follow the initial zero bit. 0MXX XXXL where M and L are MSB and
> LSB of the respective ASCII value.
>
> * Code words starting with an initial 1 code variable-length values,
> which are magically created. Read N bits until a 1 bit is encountered
> (inclusive) on an even position within the bit string (where the
> position of the initial code word bit is 0) following a 0 bit on an even
> position. The complete word is N+2 bits long, including the initial 1 bit.
>
>
> Also, I wonder how efficiently your encoding can code general texts...
> Seeing as how your 10bit codes can only code 192 out of 512 possible
> values, 12 bit codes only 512 out of 2048 values and so on... This means
> you will have a massive amount of bits for rare-ish characters sooner or
> later...
>
> Regards,
>
> Robert
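
The end-of-code rule quoted in steps 1 to 5 above (the bit at the marked positions changes from '1' to '0' and back to '1', and the code ends one bit later) can be sketched in Python as follows. This is only an illustration of that original rule as quoted; it does not cover the later amendment for code words whose third bit is 0. The function name `code_length` is hypothetical, and positions are 0-based here, so the "odd" positions 1, 3, 5 of the quoted steps become indices 0, 2, 4.

```python
def code_length(bits: str) -> int:
    """Length of the code word at the start of bits, per the quoted rule.

    Assumption: bits is a string of '0'/'1' characters. Index 0 is the
    initial bit, and the marker bits sit at indices 0, 2, 4, ...
    """
    if not bits:
        raise ValueError("empty bit string")
    if bits[0] == "0":
        # Initial 0: a literal 7-bit ASCII value, 8 bits in total.
        return 8
    seen_zero = False
    # Walk the marker positions; stop early enough that index i + 1 exists.
    for i in range(2, len(bits) - 1, 2):
        if bits[i] == "0":
            seen_zero = True          # first change: 1 -> 0
        elif seen_zero:
            # Second change: 0 -> 1. The code ends with the next bit.
            return i + 2
    raise ValueError("no terminating 0 -> 1 change found")


# The example code word from the quoted message, followed by extra bits.
print(code_length("1001010110" + "0110"))
```

For the quoted example '1001010110' the marker bits are 1, 0, 0, 0, 1, so the scan stops at the fifth marker and reports a 10-bit code word, regardless of any bits that follow it.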
myDesign.vs.UTF8.xlsx
Description: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

