Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Clark Cox
On Fri, 10 Dec 2004 13:28:59 -0800, Michael (michka) Kaplan <[EMAIL PROTECTED]> wrote: > From: "Kenneth Whistler" <[EMAIL PROTECTED]> > > > On the other hand, for many English speakers, "RSVP" is simply > > learned as an unanalyzed verb, pronounced "aressveepee", meaning > > "send a response to th

Re: Please RSVP... (was: US-ASCII)

2004-12-10 Thread Philippe Verdy
From: "Kenneth Whistler" <[EMAIL PROTECTED]> That it has been morphologically reanalyzed is demonstrated by the fact that it takes regular English verb endings, as in: "I RSVPed yesterday, right after I got the email." As I said, it is now a bona fide English verb, and most English speakers will tre

Re: Nicest UTF

2004-12-10 Thread John Cowan
Philippe Verdy scripsit: > And I disagree with you about the fact the U+ can't be used in XML > documents. It can be used in URI through URI escaping mechanism, as > explicitly indicated in the XML specification... You have a hold of the right stick but at the wrong end. U+ can be enco

Re: Nicest UTF

2004-12-10 Thread John Cowan
Philippe Verdy scripsit: > >Okay, I'm confused. Does ≮ open a tag? Does it matter if it's > >composed or decomposed? > > It does not open a XML tag. > It does matter if it's composed (won't open a tag) or decomposed (will > open a tag, but with a combining character, invalid as an identifier >

Re: Nicest UTF

2004-12-10 Thread John Cowan
Philippe Verdy scripsit: > If you look at the XML 1.0 Second Edition The Second Edition has been superseded by the Third. > Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | > [#x10000-#x10FFFF] That is normative. > But the comment following it specifies: That comment is not n
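
For readers following the Char discussion, here is a minimal sketch of the production as a code-point test. It assumes the ranges given in the XML 1.0 Recommendation (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF); the function name `is_xml_char` is hypothetical and appears nowhere in the thread.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production.

    Hypothetical helper for illustration; ranges taken from the
    XML 1.0 Recommendation's Char production.
    """
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # BMP below the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # BMP above surrogates, minus U+FFFE/U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def all_xml_chars(text: str) -> bool:
    """Check every code point of a Python str against the production."""
    return all(is_xml_char(ord(ch)) for ch in text)
```

Note that the test is per code point, not per encoded byte or code unit, which is the distinction the messages in this thread argue about.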

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Asmus Freytag
At 12:50 PM 12/10/2004, Kenneth Whistler wrote: Tim Greenwood asked: > > ... a perfectly normal linguistic process of > > attributive disambiguation of a term which had grown ambiguous > > in usage. > > Is that like the 'Please RSVP' that I see all too often? Or should > that not be excused? *grins

Re: Please RSVP... (was: US-ASCII)

2004-12-10 Thread Kenneth Whistler
Philippe, > RSVP is a French acronym for "Répondez, s'il vous plait". Yes, we know that. But it is also a reanalyzed English verb which means "reply to a message (or invitation)". That it has been morphologically reanalyzed is demonstrated by the fact that it takes regular English verb endings, a

Re: Nicest UTF

2004-12-10 Thread Philippe Verdy
From: "D. Starner" <[EMAIL PROTECTED]> Okay, I'm confused. Does ≮ open a tag? Does it matter if it's composed or decomposed? It does not open a XML tag. It does matter if it's composed (won't open a tag) or decomposed (will open a tag, but with a combining character, invalid as an identifier star
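
The composed/decomposed point can be checked directly. The sketch below uses Python's standard `unicodedata` module and assumes only that U+226E (NOT LESS-THAN) canonically decomposes to U+003C followed by U+0338; the variable and function names are illustrative.

```python
import unicodedata

composed = "\u226E"                                  # NOT LESS-THAN as one code point
decomposed = unicodedata.normalize("NFD", composed)  # '<' followed by U+0338

# Code points of each form.
composed_cps = [hex(ord(ch)) for ch in composed]
decomposed_cps = [hex(ord(ch)) for ch in decomposed]

# A scanner working on code points sees '<' first only in the decomposed form.
starts_with_lt_composed = composed.startswith("<")
starts_with_lt_decomposed = decomposed.startswith("<")

# Plain string equality distinguishes the two forms ...
equal_raw = composed == decomposed


# ... while comparison after normalizing both sides does not.
def nfc_equal(a: str, b: str) -> bool:
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


equal_nfc = nfc_equal(composed, decomposed)
```

Here `starts_with_lt_decomposed` is true and `starts_with_lt_composed` is false, which is why the two forms behave differently for a parser that looks at individual code points. It also illustrates the string-equality question raised elsewhere in this thread: `equal_raw` is false while `equal_nfc` is true.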

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Mark Davis
This is just a confusion among the hoi polloi. Mark - Original Message - From: "Asmus Freytag" <[EMAIL PROTECTED]> To: "Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Friday, December 10, 2004 17:38 Subject: Re: US-ASCII (was: Re: Invalid UTF-8

Re: Please RSVP... (was: US-ASCII)

2004-12-10 Thread Philippe Verdy
> Is that like the 'Please RSVP' that I see all too often? Or should > that not be excused? *grins* Well, technically, that is not a case of attributive disambiguation, but rather ignorant redundancy. RSVP is a French acronym for "Répondez, s'il vous plait". SVP is also the wellknown French acronym

Re: Nicest UTF

2004-12-10 Thread Philippe Verdy
From: "John Cowan" <[EMAIL PROTECTED]> Marcin 'Qrczak' Kowalczyk scripsit: http://www.w3.org/TR/2000/REC-xml-20001006#charsets implies that the appropriate level for parsing XML is code points. You are reading the XML Recommendation incorrectly. It is not defined in terms of codepoints (8-bit, 16-

Re: Nicest UTF

2004-12-10 Thread D. Starner
John Cowan writes: > You are reading the XML Recommendation incorrectly.  It is not defined > in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of > characters.  XML processors are required to process UTF-8 and UTF-16, > and may process other character encodings or not.  But the inter

Re: Nicest UTF

2004-12-10 Thread John Cowan
Marcin 'Qrczak' Kowalczyk scripsit: > http://www.w3.org/TR/2000/REC-xml-20001006#charsets > implies that the appropriate level for parsing XML is code points. You are reading the XML Recommendation incorrectly. It is not defined in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of c

Re: Nicest UTF

2004-12-10 Thread Philippe Verdy
From: "Philippe Verdy" <[EMAIL PROTECTED]> From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> "Philippe Verdy" <[EMAIL PROTECTED]> writes: The XML/HTML core syntax is defined with fixed behavior of some individual characters like '&', '<', quotation marks, and with special behavior for spaces. T

Re: Nicest UTF

2004-12-10 Thread D. Starner
"Marcin 'Qrczak' Kowalczyk" writes: > "D. Starner" writes: > > > This implies that every programmer needs an indepth knowledge of > > Unicode to handle simple strings. > > There is no way to avoid that. Then there's no way that we're ever going to get reliable Unicode support. > If the ru

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
John Cowan <[EMAIL PROTECTED]> writes: >> > The XML/HTML core syntax is defined with fixed behavior of some >> > individual characters like '&', '<', quotation marks, and with special >> > behavior for spaces. >> >> The point is: what "characters" mean in this sentence. Code points? >> Combining

Re: Nicest UTF

2004-12-10 Thread Philippe Verdy
- Original Message - From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Friday, December 10, 2004 8:35 PM Subject: Re: Nicest UTF "Philippe Verdy" <[EMAIL PROTECTED]> writes: The XML/HTML core syntax is defined with fixed behavior of some individual chara

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
John Cowan <[EMAIL PROTECTED]> writes: >> > The XML/HTML core syntax is defined with fixed behavior of some >> > individual characters like '&', '<', quotation marks, and with special >> > behavior for spaces. >> >> The point is: what "characters" mean in this sentence. Code points? >> Combining

Re: Software support costs (was: Nicest UTF

2004-12-10 Thread Philippe Verdy
From: "Carl W. Brown" <[EMAIL PROTECTED]> Philippe, Also a broken opening tag for HTML/XML documents In addition to not having endian problems UTF-8 is also useful when tracing intersystem communications data because XML and other tags are usually in the ASCII subset of UTF-8 and stand out making

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Michael \(michka\) Kaplan
From: "Kenneth Whistler" <[EMAIL PROTECTED]> > On the other hand, for many English speakers, "RSVP" is simply > learned as an unanalyzed verb, pronounced "aressveepee", meaning > "send a response to this message". And to castigate such speakers > for politely prepending a "please" to that verb is

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread John Cowan
Kenneth Whistler scripsit: > On the other hand, for many English speakers, "RSVP" is simply > learned as an unanalyzed verb, pronounced "aressveepee", meaning > "send a response to this message". And to castigate such speakers > for politely prepending a "please" to that verb is a little > too muc

Re: Nicest UTF

2004-12-10 Thread John Cowan
Marcin 'Qrczak' Kowalczyk scripsit: > > The XML/HTML core syntax is defined with fixed behavior of some > > individual characters like '&', '<', quotation marks, and with special > > behavior for spaces. > > The point is: what "characters" mean in this sentence. Code points? > Combining character

Re: Software support costs (was: Nicest UTF)

2004-12-10 Thread Theodore H. Smith
Philippe, Also a broken opening tag for HTML/XML documents In addition to not having endian problems UTF-8 is also useful when tracing intersystem communications data because XML and other tags are usually in the ASCII subset of UTF-8 and stand out making it easier to find the specific data you a

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Kenneth Whistler
Tim Greenwood asked: > > ... a perfectly normal linguistic process of > > attributive disambiguation of a term which had grown ambiguous > > in usage. > > Is that like the 'Please RSVP' that I see all too often? Or should > that not be excused? *grins* Well, technically, that is not a case of at

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Tim Greenwood
On Fri, 10 Dec 2004 12:06:12 -0800 (PST), Kenneth Whistler <[EMAIL PROTECTED]> wrote: > In addition to Doug's historical clarification, you need to > understand this as a perfectly normal linguistic process of > attributive disambiguation of a term which had grown ambiguous > in usage. Is that li

Re: When to validate?

2004-12-10 Thread Antoine Leca
Arcane Jill va escriure: > And yet, in an expression such as tolower(trim(s)), the second > validation is unnecessary. The input to tolower() /must/ be valid, > because it is the output of trim(). But on the other hand, tolower() > could be called with arbitrary input, so I can't skip the validati

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Kenneth Whistler
> If any > criticism was present, it referred to the redundant "US-" prefix in > "US-ASCII", not to Unicode, and even that wasn't really criticism, just my > lack of understanding /why/. In addition to Doug's historical clarification, you need to understand this as a perfectly normal linguistic

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > The XML/HTML core syntax is defined with fixed behavior of some > individual characters like '&', '<', quotation marks, and with special > behavior for spaces. The point is: what "characters" mean in this sentence. Code points? Combining character se

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
"D. Starner" <[EMAIL PROTECTED]> writes: >> String equality in a programming language should not treat composed >> and decomposed forms as equal. Not this level of abstraction. > > This implies that every programmer needs an indepth knowledge of > Unicode to handle simple strings. There is no way

Re: When to validate?

2004-12-10 Thread Doug Ewell
Arcane Jill wrote: > Here's something that's been bothering me. Suppose I write a function > - let's call it trim(), which removes leading and trailing spaces from > a string, represented as one of the UTFs. If I've understood this > correctly, I'm supposed to validate the input, yes? > > Okay, n

Re: When to validate?

2004-12-10 Thread Andy Heninger
Arcane Jill wrote: Here's something that's been bothering me. Suppose I write a function - [ that process strings in one of the UTFs] > I'm supposed to validate the input, yes? You are designing the API - you get to choose what it does. An application as a whole needs to validate external input t
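
One way to read the "validate at the boundary" advice is sketched below. This is an assumption about how such an API might be laid out, not a design proposed in the thread: external bytes are decoded strictly once, and the hypothetical helpers `trim` and `to_lower` then work on an already-valid string without re-checking it.

```python
def validate_utf8(raw: bytes) -> str:
    """Boundary check: decode strictly, so ill-formed UTF-8 raises
    UnicodeDecodeError here instead of deeper in the program."""
    return raw.decode("utf-8", errors="strict")


def trim(s: str) -> str:
    """Remove leading and trailing U+0020 spaces; input is assumed valid."""
    return s.strip(" ")


def to_lower(s: str) -> str:
    """Lowercase the string; input is assumed valid, so no re-validation."""
    return s.lower()


def handle_external_input(raw: bytes) -> str:
    """Validate once at the edge, then compose internal functions freely."""
    s = validate_utf8(raw)
    return to_lower(trim(s))
```

With this layout, `to_lower(trim(s))` performs no second validation, because nothing after `validate_utf8` can introduce ill-formed data.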

Software support costs (was: Nicest UTF

2004-12-10 Thread Carl W. Brown
Philippe, > Also a broken opening tag for HTML/XML documents In addition to not having endian problems UTF-8 is also useful when tracing intersystem communications data because XML and other tags are usually in the ASCII subset of UTF-8 and stand out making it easier to find the specific data you

RE: When to validate?

2004-12-10 Thread Carl W. Brown
Jill, I think that the best practice is to validate input. Besides the overhead of revalidating there is the issue of what do you do with data that contains invalid characters. This has to be handled explicitly. Once validated all transforms should maintain valid data. If you also provide a mo

Re: When to validate?

2004-12-10 Thread Mark Davis
Use of the Unicode standard does *not* require constant validation of strings. The standard carefully distinguishes between Unicode strings (D29a-d, page 74) and UTFs. The Unicode strings are in-memory representations of Unicode, but do not have to be valid UTFs; so all Unicode X-bit strings are va

Re: When to validate?

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes: > Here's something that's been bothering me. Suppose I write a function > - > let's call it trim(), which removes leading and trailing spaces from a > string, represented as one of the UTFs. If I've understood this > correctly, I'm supposed to validate the

When to validate?

2004-12-10 Thread Arcane Jill
Here's something that's been bothering me. Suppose I write a function - let's call it trim(), which removes leading and trailing spaces from a string, represented as one of the UTFs. If I've understood this correctly, I'm supposed to validate the input, yes? Okay, now suppose I write a second f