Nah!!! STRICTLY NOBODY counts "scalar values". Everyone counts either:
- (a) code units (most often 8-bit bytes, more rarely 16-bit code units, e.g. in basic JavaScript code), or
- (b) code points (independently of the code units used in the storage or communication message format).
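To make the difference between the two kinds of counts concrete, here is a minimal Python sketch. The function name `count_units` and the sample string are only illustrative assumptions; nothing here is taken from Twitter or from the standard.

```python
def count_units(text: str) -> dict:
    """Return the length of `text` measured in three different units."""
    return {
        # A Python str is a sequence of code points.
        "code_points": len(text),
        # 8-bit code units of the UTF-8 encoding form.
        "utf8_code_units": len(text.encode("utf-8")),
        # 16-bit code units of UTF-16 ("-le" avoids emitting a BOM).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


# U+0061, U+00E9 and U+1F600: three code points in every encoding.
sample = "a\u00e9\U0001F600"
print(count_units(sample))
# {'code_points': 3, 'utf8_code_units': 7, 'utf16_code_units': 4}
```

The code point count stays at 3 whatever the encoding, while the code unit count depends on the code unit size.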
The application *may* enforce a normalization form prior to counting. (I'm not sure that Twitter effectively forces NFC before truncating messages; it just happens that most texts are already composed in NFC form. This is caused by keyboard drivers, or by IMEs on the client device or browser, which strongly favor the NFC form, and also by the fact that many devices are still unable to properly display some character clusters that are not in NFC form.)

Stop speaking about scalar values: they are just meant for internal arithmetic between distinct "abstract characters" (those that are given unique code points), or for the internal mappings necessary for converting between UTFs. Any other arbitrary arithmetic on them is completely unsafe and gives unpredictable results that are neither guaranteed nor stabilized in the standard.

Also, yes, the term "character" alone is ambiguous, but "abstract character" is defined in the Unicode standard (even if on many occasions it is abbreviated to just "character" *in this context*, where it may then contradict other definitions of "character" used, for example, in programming languages).

If you want to be clear, only speak about counting:
- "code points" (more or less the same as counting abstract characters, except that you can also count code points that are not yet assigned to abstract characters, code points assigned to "non-characters", or even code points assigned to surrogates, which you may find in non-conforming documents supposed to be encoded in UTF-32). Such a count is independent of the encoding. Code points are noted U+nnnn.
- "code units" (but be more specific and explicitly give their size). Such a count is fully dependent on the encoding. Code units are usually noted with fixed-width hexadecimal values. Code units do NOT have a "scalar value" in the sense given in TUS.
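As a small illustration of why the normalization form matters before counting, the following Python sketch counts code points of the same text in NFC and NFD using the standard `unicodedata` module. The helper name `code_point_counts` is hypothetical, and this is not a claim about what any particular service does.

```python
import unicodedata


def code_point_counts(text: str) -> dict:
    """Count code points of `text` after normalizing to NFC and to NFD."""
    return {
        form: len(unicodedata.normalize(form, text))
        for form in ("NFC", "NFD")
    }


composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Canonically equivalent, yet not equal as code point sequences.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# After normalization, both inputs give the same counts.
print(code_point_counts(composed))     # {'NFC': 1, 'NFD': 2}
print(code_point_counts(decomposed))   # {'NFC': 1, 'NFD': 2}
```

Without a stated normalization form, the two strings above would be measured as 1 and 2 code points respectively.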
If you count code units, you may also count some that have NO meaning in the standard UTF (or legacy 8-bit encoding), such as an 8-bit code unit equal to 0xFF found in a non-conforming UTF-8 string.

In all cases, however, the normalization form may change the result of your measurement. Technically, even if texts that are not normalized, or that are normalized to distinct forms, are "canonically equivalent", they are still not "equal", and it is normal that your counts will give different results.

But note that it is not always possible to normalize input documents: you may be able to measure these documents in code points or in code units even if they do not conform to their supposed UTF, but any prior normalization of these non-conforming documents will likely fail.

This also means that just counting code points or code units in an encoded text is not a conforming process, unless the counting is performed after first applying a (conforming) normalization. Such a conforming counting process is allowed to fail (and in fact should fail with an error if the document does not conform to its assumed UTF, just as it would fail if you converted it from/to a legacy encoding other than a standard UTF).

Normalization should be perceived like a transcoding. Some normalizations are conforming and will (should!) fail on non-conforming input; others are nonconforming and will never fail, but you know the risks of using nonconforming processes, because they create ambiguities. These are the same kind of ambiguities that occur when you just say you'll measure the "length" of a text without being very specific about what you are counting, in which dimensional space, through which surjective projection(s), with which unit of measure, and sometimes with which rounding mode if the returned measure has limited precision...

2013/9/16 Phillips, Addison <[email protected]>

> Actually, that's my bad: I meant to type scalar value.
>
>
> Stephan Stiller <[email protected]> wrote:
>
> On 9/15/2013 3:07 PM, Phillips, Addison wrote:
>
> Not if the limit is counted in characters and not in bytes. Twitter, for
> example, counts code points in the NFC representation of a tweet.
>
> "character", "code point" – these are confusing words :-)
>
> From the link it isn't entirely clear whether they
> (a) count scalar values of NFC *or*
> (b) count code points of NFC.
>
> That's why I think it's bad to write "code point" when "scalar value" is
> intended.
>
> Stephan
>
>
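Coming back to the 0xFF case mentioned before the quoted messages: the sketch below contrasts a count of 8-bit code units, which never fails, with a code point count that first decodes strictly and therefore fails on input that is not well-formed UTF-8. Both function names are hypothetical, and Python's strict decoder only stands in here for a conforming process.

```python
def count_code_units(data: bytes) -> int:
    """Count 8-bit code units; works on any byte sequence."""
    return len(data)


def count_code_points_strict(data: bytes) -> int:
    """Count code points, rejecting input that is not well-formed UTF-8.

    Raises UnicodeDecodeError on ill-formed input.
    """
    return len(data.decode("utf-8", errors="strict"))


good = "caf\u00e9".encode("utf-8")  # well-formed: 5 code units, 4 code points
bad = b"abc\xff"                    # 0xFF never occurs in well-formed UTF-8

print(count_code_units(good), count_code_points_strict(good))  # 5 4
print(count_code_units(bad))                                   # 4

try:
    count_code_points_strict(bad)
except UnicodeDecodeError as exc:
    print("strict count refused the input:", exc.reason)
```

A lenient variant (for example, decoding with `errors="replace"`) would always return a number, but that number would no longer measure the original document.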

