I am pretty sure that this whole discussion does more harm than good for most
people's understanding of Unicode.
It is best and (mostly) correct to think of a Unicode string as a sequence of
Unicode characters, each identified by a code point (one of tens of thousands
covering all languages).
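
To make the "sequence of code points" view concrete, here is a minimal sketch. It is written in Python rather than Smalltalk purely so it can be run anywhere; the variable names are illustrative and not taken from any image discussed in this thread.

```python
# A Python str is a sequence of Unicode code points, which matches the
# mental model described above. Escapes are used so the example does not
# depend on how this message itself was encoded.
text = "Les \u00e9l\u00e8ves"  # "Les élèves", with precomposed é and è

# One integer code point per character, independent of any byte encoding.
code_points = [ord(ch) for ch in text]

# The conventional U+XXXX notation for each code point.
labels = [f"U+{cp:04X}" for cp in code_points]

# Rebuilding the string from its code points gives back the original.
rebuilt = "".join(chr(cp) for cp in code_points)

print(len(text), labels[4], rebuilt == text)
```

Nothing here involves bytes: UTF-8, UTF-16 or ISO-8859-1 only enter the picture when the string has to leave the image.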
EuanM wrote
> ...
> all ISO-8859-1 maps 1:1 to Unicode UTF-8
> ...
I am late coming in to this conversation. If it hasn't already been said,
please do not conflate Unicode and UTF-8. I think that would be a recipe for
a high P.I.T.A. factor.
Unicode defines the meaning of the code
Hi Todd,
> On Dec 11, 2015, at 12:57 PM, Todd Blanchard wrote:
>
>
>> On Dec 11, 2015, at 12:19, EuanM wrote:
>>
>> "If it hasn't already been said, please do not conflate Unicode and
>> UTF-8. I think that would be a recipe for
>> a high P.I.T.A.
Hello Sven
On 12/9/15, Sven Van Caekenberghe wrote:
> The simplest example in a common language (the French letter é) is
>
> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>
> which can also be written as
>
> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301]
>
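
The two spellings of é quoted above can be compared directly. This is a sketch in Python using the standard `unicodedata` module, not the Smalltalk under discussion; it only illustrates what normalization would have to do, and the variable names are made up for the example.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Without normalization the two strings differ, even though they render
# as the same letter.
same_without_normalization = precomposed == decomposed

# NFC composes to the single code point; NFD decomposes to base + accent.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# Character names, as given in the quoted message.
names = [unicodedata.name(ch) for ch in decomposed]

print(same_without_normalization, nfc == precomposed, nfd == decomposed, names)
```

A system without normalization will treat these two strings as unequal, which is why normalization keeps coming up in this thread.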
On Wed, Dec 9, 2015 at 5:35 PM, Guillermo Polito
wrote:
>
>> On 8 dic 2015, at 10:07 p.m., EuanM wrote:
>>
>> "No. a codepoint is the numerical value assigned to a character. An
>> "encoded character" is the way a codepoint is represented in bytes
>>
> On 09 Dec 2015, at 10:35, Guillermo Polito wrote:
>
>
>> On 8 dic 2015, at 10:07 p.m., EuanM wrote:
>>
>> "No. a codepoint is the numerical value assigned to a character. An
>> "encoded character" is the way a codepoint is represented in bytes
> On 09 Dec 2015, at 14:16, EuanM wrote:
>
> "To encode Unicode for external representation as bytes, we use UTF-8
> like the rest of the modern world.
>
> So far, so good.
>
> Why all the confusion ?"
That was a rhetorical question.
I know that we lack normalization, we
"No. a codepoint is the numerical value assigned to a character. An
"encoded character" is the way a codepoint is represented in bytes
using a given encoding."
No.
A codepoint may represent a component part of an abstract character,
or may represent an abstract character, or it may do both (but
I am sorry but one of your basic assumptions is completely wrong:
'Les élèves Français' encodeWith: #iso88591.
=> #[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]
'Les élèves Français' utf8Encoded.
=> #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110
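
The same comparison can be reproduced outside the image. The sketch below uses Python's built-in codecs (not the Smalltalk encoders shown above) to show that ISO-8859-1 and UTF-8 agree only on the ASCII range; the names are illustrative.

```python
# Same text as in the Smalltalk example, written with escapes so the
# result does not depend on this message's own encoding.
text = "Les \u00e9l\u00e8ves Fran\u00e7ais"

latin1_bytes = list(text.encode("iso-8859-1"))
utf8_bytes = list(text.encode("utf-8"))

# In ISO-8859-1 every character here is one byte; in UTF-8 each of the
# three accented letters takes two bytes.
extra_bytes = len(utf8_bytes) - len(latin1_bytes)

# For pure ASCII text the two encodings produce identical bytes.
ascii_agrees = "Les".encode("iso-8859-1") == "Les".encode("utf-8")

print(latin1_bytes[:7], utf8_bytes[:7], extra_bytes, ascii_agrees)
```

So ISO-8859-1 maps 1:1 onto the first 256 Unicode code points, but not onto UTF-8 bytes: é is the single byte 233 in one and the pair 195 169 in the other.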
> On 07 Dec 2015, at 11:51 , EuanM wrote:
>
> And indeed, in principle.
>
> On 7 December 2015 at 10:51, EuanM wrote:
>> Verifying assumptions is the key reason why you should put documents like
>> this out for review.
>>
>> Sven -
>>
>> I'm confident I
And indeed, in principle.
On 7 December 2015 at 10:51, EuanM wrote:
> Verifying assumptions is the key reason why you should put documents like
> this out for review.
>
> Sven -
>
> Cuis is encoded with ISO 8859-15 (aka ISO Latin 9)
>
> Sven, this is *NOT* as you state, ISO
> On 07 Dec 2015, at 1:05 , EuanM wrote:
>
> Hi Henry,
>
> To be honest, at some point I'm going to long for the much
> more succinct semantics of healthcare systems and sports scoring and
> administration systems again. :-)
>
> codepoints are any of *either*
> -
> On 05 Dec 2015, at 17:35, Todd Blanchard wrote:
>
> would suggest that the only worthwhile encoding is UTF8 - the rest are
> distractions except for being able to read and convert from other encodings
> to UTF8. UTF16 is a complete waste of time.
>
> Read
Thanks for those pointers, Steph. I'll make sure they are on my
reading list. (I have a limited weekly time-budget for Unicode work,
but I expect this is a long-term project).
I'll keep in touch with Steph, so any new facilities can be
immediately useful to Pharo, and someone can guide them to
Todd, As long as others are using it, it's useful to be able to send
UTF16, and to successfully import it.
I like systems that play well with others. :-)
On 5 December 2015 at 16:35, Todd Blanchard wrote:
> would suggest that the only worthwhile encoding is UTF8 - the rest
Steph - I'll dig out the Fr phone book ordering from wherever it was
I read about it!
I thought I had it to hand, but I haven't found it tonight. It can't
be far away.
On 5 December 2015 at 13:08, stepharo wrote:
> Hi EuanM
>
> Le 4/12/15 12:42, EuanM a écrit :
>>
>> I'm
> On 06 Dec 2015, at 18:44, Sven Van Caekenberghe wrote:
>
>
>> On 05 Dec 2015, at 17:35, Todd Blanchard wrote:
>>
>> would suggest that the only worthwhile encoding is UTF8 - the rest are
>> distractions except for being able to read and convert from other
Hi EuanM
Le 4/12/15 12:42, EuanM a écrit :
I'm currently groping my way to seeing how feature-complete our
Unicode support is. I am doing this to establish what still needs to
be done to provide full Unicode support.
this is great. Thanks for pushing this. I wrote and collected some
roadmap
Hi Todd
thanks for the link.
It looks really interesting.
Stef
Le 5/12/15 17:35, Todd Blanchard a écrit :
would suggest that the only worthwhile encoding is UTF8 - the rest are
distractions except for being able to read and convert from other
encodings to UTF8. UTF16 is a complete waste of
Sent from the road
> On Dec 5, 2015, at 05:08, stepharo wrote:
>
> Hi EuanM
>
> Le 4/12/15 12:42, EuanM a écrit :
>> I'm currently groping my way to seeing how feature-complete our
>> Unicode support is. I am doing this to establish what still needs to
>> be done to
would suggest that the only worthwhile encoding is UTF8 - the rest are
distractions except for being able to read and convert from other encodings to
UTF8. UTF16 is a complete waste of time.
Read http://utf8everywhere.org/
I have extensive Unicode chops from around 1999 to 2004 and my
Hi Euan
I think it’s great that you’re trying this. I hope you know what you’re getting
yourself into :)
I’m no Unicode expert but I want to add two points to your list (although
you’ve probably already thought of them):
- Normalisation and conversion
> On 04 Dec 2015, at 17:00, Max Leske wrote:
>
> Hi Euan
>
> I think it’s great that you’re trying this. I hope you know what you’re
> getting yourself into :)
>
>
> I’m no Unicode expert but I want to add two points to your list (although
> you’ve probably already