[EMAIL PROTECTED] wrote:
> Beginners, even young children, can get the concept of characters
> being mapped to numbers. Certainly those young children that will
> thrive on programming will have a fascination with this process in
> and of itself (it's just like the kids-in-treehuts type cryptography
> such kids often like).
> (...)
> I don't think characters -> numbers -> bytes -> bits is
> particularly difficult as programming concepts go, or even
> é <=> e + U+0301 when compared to many higher-level string handling
> activities (regular expressions, bidirectional over-riding, and the
> subtler points of case operations).
>
> Even so, I think it's making those two levels meet that is the biggest
> stumbling block for beginners.
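(A concrete aside before responding: the chain described above can be sketched in a few lines of Java. This is my own illustration, not something from the quoted message; it simply walks one character through the characters -> numbers -> bytes -> bits levels.)

    import java.nio.charset.StandardCharsets;

    public class CharsToBits {
        public static void main(String[] args) {
            String s = "\u00E9";                     // "é", one user-perceived character

            // characters -> numbers: the Unicode code point
            int cp = s.codePointAt(0);
            System.out.printf("code point: U+%04X (decimal %d)%n", cp, cp);

            // numbers -> bytes: the UTF-8 encoding of that code point
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);   // 0xC3 0xA9

            // bytes -> bits
            for (byte b : utf8) {
                String bits = String.format("%8s",
                        Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
                System.out.printf("byte 0x%02X = %s%n", b & 0xFF, bits);
            }
        }
    }

When run, it prints the code point U+00E9 (decimal 233), then the two UTF-8 bytes 0xC3 and 0xA9 with their bit patterns.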
Well, if you just consider the concept of writing and learning how to do it, the decomposition of spoken language into words and letters, with conventional signs to mark them, creates a second meta-language layered on top of the spoken one, and that is already a similar abstraction.

If children can learn (sometimes with difficulty) to read and write the language they first learned to speak, using decomposition models built from collections of glyphs, themselves composed more or less regularly from strokes, then we cannot assume it is illogical to map grapheme clusters (the nearest model of the written form of a language) onto abstract characters (what children learn at school when they learn orthographic rules), and then onto code points (similar to what they learn when they start collating words by ordering characters with more or less complex rules, the simplest being as simple as counting, because that is what it takes to search a dictionary or a phone directory).

Most literate people stop at that step. Computer students then go on to learn about code units (what they meet when they start programming in most computer languages, with their completely arbitrary integer range limits), and then about streamed bytes (what they meet when they need to transmit their documents and find a way to interchange their local data).

If there is a level of abstraction that seems natural to all literate people, apart from computer students learning to write programs, it is the code point, not the code unit. That is exactly the level at which Unicode and ISO 10646 operate. It is also at that level (the decomposition of grapheme clusters into abstract characters and then into code points) that canonical equivalences and normalization forms occur (I exclude here all considerations of code units, including surrogates, and of streamed bytes or bits).

However, the standard C/C++ "string" handling library does not operate at the code point level (and neither does Java); it really works in terms of code units (whatever their effective size in terms of representable integer ranges, from 1 bit to 32 bits, and quite recently even 64-bit code units). It was not designed to operate on code points, which is the natural level of abstraction for written languages.

This means that C/C++ or Java strings are NOT a good abstraction of Unicode strings. Claiming conformance to Unicode when only the code unit level is implemented is an illusion: these computer languages were not designed to handle Unicode strings natively, so they cannot claim to "support and conform to Unicode". This is not true, however, of JavaScript/ECMAScript, and it should not be true of languages such as XML, HTML and SGML, which were designed specifically to represent natural written languages correctly.
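To make the two claims above concrete, namely that Java (like C/C++) counts code units rather than code points, and that canonical equivalence only becomes visible at the code point level, here is a minimal sketch (my own illustration, assuming Java 6 or later for java.text.Normalizer):

    import java.text.Normalizer;

    public class CodeUnitsVsCodePoints {
        public static void main(String[] args) {
            // A supplementary character, MUSICAL SYMBOL G CLEF (U+1D11E), needs two
            // UTF-16 code units (a surrogate pair) but is a single code point.
            String clef = "\uD834\uDD1E";
            System.out.println(clef.length());                          // 2  code units
            System.out.println(clef.codePointCount(0, clef.length()));  // 1  code point

            // Canonical equivalence lives at the code point level:
            // U+00E9 ("é") and U+0065 U+0301 ("e" + combining acute) are canonically
            // equivalent, but a code-unit comparison does not see that.
            String composed = "\u00E9";
            String decomposed = "e\u0301";
            System.out.println(composed.equals(decomposed));             // false (code units differ)
            System.out.println(
                Normalizer.normalize(composed, Normalizer.Form.NFC)
                    .equals(Normalizer.normalize(decomposed, Normalizer.Form.NFC)));  // true
        }
    }

The surrogate pair and the NFC comparison are exactly the places where a code-unit-only "string" abstraction diverges from the code point model that Unicode and ISO 10646 define.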

