Re: UTF-8 Corrigendum, new Glossary
On Thu, Nov 30, 2000 at 05:28:51PM -0800, David Starner wrote:
>Is that your rule in all cases, to try and guess what they meant and do
>that?

Not in all cases. But this particular Ister interpreter is designed to run CGI scripts. When it comes to CGI languages, I have the philosophy of graceful degradation: If I can interpret it, I will. Otherwise, the user (the person who is browsing, not the webmaster) might be confused.

>It'll be hell on anyone who has to try and interpret Ister if
>there's a large chunk of code that follows no standards, but was read by
>the original interpreter.

If it follows no standards, my interpreter will throw up its hands. But if the source code says it is in UTF-8, and I can decode it, I will. If it says it is in Latin1 (or some other encoding), I will convert it to UTF-8. In either case, my output will always be legal UTF-8.

>(Or even later versions of that interpreter -
>I've hung around the gcc lists long enough to know that people don't
>like "that's no longer supported" or even "that was never officially
>supported.")

I have been programming since 1965, and I have never said that. I have always gone to great pains to make sure later versions of my software could handle the data expected by older versions. Or, in some cases, I supplied conversion software, so old files could be converted to a new format.

>Even if it works fine in the case of your interpreter, it'll come to
>problems when it gets fed through a UTF-8 conformant (or non-multi-byte
>aware) text tool that won't interpret over-long sequences. Especially
>non-multi-byte aware tools, since they will seem to work and silently
>get stuff wrong. It seems better just to refuse it, and force the buggy
>software to get fixed, than have a bunch of obscure bugs show up later.

Well, the worst bug this particular language will produce is HTML with the wrong text or tags. Presumably, any webmaster worth his keep will check the output of his code before posting it on the web, and will fix his source code. All it does is convert from one mark-up language (Ister) to another (HTML/SGML/XML), e.g., it will convert:

^p (Hello, World!^/br ^b (Here I come!))

to:

Hello, World!Here I come!

No big security issues at stake here. :)

Cheers,
Adam
--
When two do the same, it's not the same
-- Slovak proverb
Re: UTF-8 Corrigendum, new Glossary
On Thu, Nov 30, 2000 at 04:48:56PM -0800, G. Adam Stanislav wrote:
> If the source (in Ister) uses illegal but decipherable UTF-8, my
> software accepts it. Naturally, before it sends it out it transforms
> it to perfectly legal UTF-8. The idea I should reject it is silly
> (and, no, the "internal data" clause does not apply here: my software
> accepts data from an external source). Rejecting it would mean
> that if the web page designer used some design software that messed
> up the UTF-8 encoding, the web page would suddenly miss a letter here,
> a letter there.

It could do a lot more than that - if the encoding is messed up, you could be getting anything, from Latin-1, to UTF-16, to pure noise. So why does this particular mistake matter more than those? It's the responsibility of the design software to get it right, not your code's obligation to try and understand it.

> Not rejecting it poses no security risk, so, for this
> specific application it is better to accept it (and correct it) than
> to reject it.

Is that your rule in all cases, to try and guess what they meant and do that? It'll be hell on anyone who has to try and interpret Ister if there's a large chunk of code that follows no standards, but was read by the original interpreter. (Or even later versions of that interpreter - I've hung around the gcc lists long enough to know that people don't like "that's no longer supported" or even "that was never officially supported.")

Even if it works fine in the case of your interpreter, it'll come to problems when it gets fed through a UTF-8 conformant (or non-multi-byte aware) text tool that won't interpret over-long sequences. Especially non-multi-byte aware tools, since they will seem to work and silently get stuff wrong. It seems better just to refuse it, and force the buggy software to get fixed, than have a bunch of obscure bugs show up later.

--
David Starner - [EMAIL PROTECTED]
http://dvdeug.dhis.org
Looking for a Debian developer in the Stillwater, Oklahoma area to sign my GPG key
Re: UTF-8 Corrigendum, new Glossary
G. Adam Stanislav wrote:
> On Thu, Nov 30, 2000 at 10:18:07AM -0800, Markus Scherer wrote:
> >you are free to write and use a non-conformant implementation. just be aware of what that means... :-)
> >markus
>
> I guess it means I'm a non-conformist. :)

It's tempting to make an observation about non-conformists who use international standards, but I'm living in a glass house.

Best regards,
James Kass.
Re: UTF-8 Corrigendum, new Glossary
Adam said:

> On Thu, Nov 30, 2000 at 10:18:07AM -0800, Markus Scherer wrote:
> >you are free to write and use a non-conformant implementation. just be aware of what that means... :-)
> >markus
>
> I guess it means I'm a non-conformist. :)
>
> I am currently working on software that translates mark-up made in one
> mark-up language (Ister) and translates it into another (HTML). It
> uses UTF-8, and works as CGI, i.e., generates HTML dynamically on a web
> server (see http://www.whizkidtech.net/ister/ for unfinished docs).
>
> If the source (in Ister) uses illegal but decipherable UTF-8, my
> software accepts it. Naturally, before it sends it out it transforms
> it to perfectly legal UTF-8. The idea I should reject it is silly
> (and, no, the "internal data" clause does not apply here: my software
> accepts data from an external source).

Basically, you've already answered your own question. If you recognize that your source data is "illegal UTF-8", and if you know that you are passing it into a controlled environment where it does not pose a security risk, then effectively you can have one layer that unmangles the "decipherable" but illegal UTF-8, and passes it to the layer that interprets legal UTF-8. As long as this is above board and explicit, then you should be o.k.

It is the conversion process that just silently interprets non-shortest UTF-8 without discrimination in an uncontrolled environment that is dangerous.

> Rejecting it would mean
> that if the web page designer used some design software that messed
> up the UTF-8 encoding, the web page would suddenly miss a letter here,
> a letter there. Not rejecting it poses no security risk, so, for this
> specific application it is better to accept it (and correct it) than
> to reject it.

As I read it, this would fall under the mangled text note. The "internal" note is referring to functions that don't *check* for illegal code unit sequences. Those are not conformant unless being used on certifiably legal data. But if your function is checking and catching illegal code unit sequences explicitly, you can fix them and proceed, as long as you know you are not part of a process pipeline that could lead to a security problem by doing so.

The point of the UTF-8 corrigendum was not to force people to do unreasonable things with their software, but rather to tighten up the definition sufficiently so that people could claim secure implementations of UTF-8.

--Ken
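The two-layer arrangement Ken describes could be sketched roughly as follows in Python. This is only an illustration under assumptions: the names `repair_non_shortest` and `interpret` are hypothetical and do not come from Ister or from the Corrigendum. The first layer explicitly detects non-shortest forms, rewrites them as legal UTF-8 and reports where it did so; the second layer is a strict decoder that never sees anything but the first layer's output.

```python
def repair_non_shortest(data: bytes) -> tuple[bytes, list[int]]:
    """Layer 1 (hypothetical): rewrite non-shortest forms as legal UTF-8.

    Returns the repaired bytes plus the offsets of every sequence that
    had to be rewritten, so the repair is explicit rather than silent.
    Raises ValueError for input that is not decipherable at all.
    """
    out = bytearray()
    repaired_at = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            n, cp = 0, lead
        elif 0xC0 <= lead < 0xE0:
            n, cp = 1, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:
            n, cp = 2, lead & 0x0F
        elif 0xF0 <= lead < 0xF8:
            n, cp = 3, lead & 0x07
        else:
            raise ValueError(f"undecipherable lead byte at offset {i}")
        seq = data[i : i + 1 + n]
        if len(seq) != n + 1 or any(t & 0xC0 != 0x80 for t in seq[1:]):
            raise ValueError(f"undecipherable sequence at offset {i}")
        for t in seq[1:]:
            cp = (cp << 6) | (t & 0x3F)
        # Re-encoding yields the shortest form; it raises a ValueError
        # subclass for surrogates and for values above U+10FFFF.
        legal = chr(cp).encode("utf-8")
        if legal != seq:
            repaired_at.append(i)
        out += legal
        i += n + 1
    return bytes(out), repaired_at


def interpret(data: bytes) -> str:
    """Layer 2 (hypothetical): strict decoding of already-legal UTF-8."""
    return data.decode("utf-8")


# 'W' (U+0057) written as the non-shortest two-byte sequence C1 97.
repaired, offsets = repair_non_shortest(b"Hello, \xc1\x97orld!")
print(interpret(repaired), offsets)
```

Returning the offsets is one way to keep the repair "above board and explicit": the caller can log them or refuse the input outright when it is running in an environment where accepting such data is not safe.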
Re: UTF-8 Corrigendum, new Glossary
On Thu, Nov 30, 2000 at 10:18:07AM -0800, Markus Scherer wrote:
>you are free to write and use a non-conformant implementation. just be aware of what that means... :-)
>markus

I guess it means I'm a non-conformist. :)

I am currently working on software that translates mark-up made in one mark-up language (Ister) and translates it into another (HTML). It uses UTF-8, and works as CGI, i.e., generates HTML dynamically on a web server (see http://www.whizkidtech.net/ister/ for unfinished docs).

If the source (in Ister) uses illegal but decipherable UTF-8, my software accepts it. Naturally, before it sends it out it transforms it to perfectly legal UTF-8. The idea I should reject it is silly (and, no, the "internal data" clause does not apply here: my software accepts data from an external source). Rejecting it would mean that if the web page designer used some design software that messed up the UTF-8 encoding, the web page would suddenly miss a letter here, a letter there. Not rejecting it poses no security risk, so, for this specific application it is better to accept it (and correct it) than to reject it.

Cheers,
Adam
--
Don't send me spam, I'm a vegetarian
Re: UTF-8 Corrigendum, new Glossary
On Thu, Nov 30, 2000 at 07:12:37AM -0800, Mark Davis wrote:
>We know of specific situations that caused problems, as outlined in the
>Corrigendum.

That does not justify forbidding it in other situations (ask the NRA :) ).

Adam
--
When a finger points at the Moon... do you look at the Moon? Or, do you prefer to worship the finger?
-- Unknown Zen Master
Re: UTF-8 Corrigendum, new Glossary
And to be clear, what it means in this case:

1) People have security concerns about UTF-8
2) The Unicode Consortium has an official solution to address these concerns
3) Your implementation does not

The "People" from (1) can believe what they will about your implementation!

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

- Original Message -
From: "Markus Scherer" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Thursday, November 30, 2000 10:18 AM
Subject: Re: UTF-8 Corrigendum, new Glossary

> Kevin Bracey wrote:
> > > I find this silly. That creation of such forms would be forbidden I can see
> > > and agree to. But interpretation? I understand the reasoning when security
> > > is an issue. But why make it flat illegal? There are many applications
> > > where such a sequence poses no security danger.
>
> you are free to write and use a non-conformant implementation. just be aware of what that means... :-)
> markus
Re: UTF-8 Corrigendum, new Glossary
Kevin Bracey wrote:
> > I find this silly. That creation of such forms would be forbidden I can see
> > and agree to. But interpretation? I understand the reasoning when security
> > is an issue. But why make it flat illegal? There are many applications
> > where such a sequence poses no security danger.

you are free to write and use a non-conformant implementation. just be aware of what that means... :-)

markus
Re: UTF-8 Corrigendum, new Glossary
"G. Adam Stanislav" <[EMAIL PROTECTED]> wrote:

>> 1. The Unicode Technical Committee has modified the definition of
>> UTF-8 to forbid conformant implementations from interpreting non-
>> shortest forms for BMP characters,
>
> I find this silly. That creation of such forms would be forbidden I
> can see and agree to. But interpretation? I understand the reasoning
> when security is an issue. But why make it flat illegal? There are
> many applications where such a sequence poses no security danger.

I used to be concerned about that. I think I cited the example of an encyclopedia on CD-ROM with text in UTF-8. Obviously this text is all internal and almost certainly valid, and there are no security holes involved, so the UTF-8 decoder can take certain shortcuts.

But this is now covered in the corrigendum:

> Internally, a particular function might be used that does not check
> for illegal code unit sequences. However, a conformant process can
> use that function _only_ on data that has already been certified to
> not contain any illegal code unit sequences.

The word "certified" did make me chuckle, though. Who would do the certifying? Katherine Harris?

-Doug Ewell
Fullerton, California
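One possible reading of the "certified" clause, sketched in Python purely as an illustration: the data is checked once by a strict validator, and only then handed to a function that does no checking of its own. The names `certify` and `count_code_points_unchecked` are hypothetical; the Corrigendum does not prescribe any particular mechanism.

```python
def certify(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8.

    Python's built-in decoder is strict: it rejects non-shortest forms.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def count_code_points_unchecked(data: bytes) -> int:
    """Shortcut with no validation: count every non-continuation byte.

    The result is only meaningful on data that has already been certified.
    """
    return sum(1 for b in data if b & 0xC0 != 0x80)


text = "na\u00efve caf\u00e9".encode("utf-8")
if certify(text):
    print(count_code_points_unchecked(text))
```

In the CD-ROM example, the certification step would presumably run once when the disc is mastered, after which the reader software could use the unchecked shortcut freely.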
Re: UTF-8 Corrigendum, new Glossary
We know of specific situations that caused problems, as outlined in the Corrigendum.

- Process A performs security checks, but does not check for non-shortest forms.
- Process B accepts the byte sequence from process A, and transforms it into UTF-16 while interpreting non-shortest forms.
- The UTF-16 text may then contain characters that should have been filtered out by process A.
- Process C interprets the text, and does something bad.

The case was with "..\". It was "hidden" in a non-shortest form. Process A missed it. After this was all done, Process C interpreted it, and executed a program in a higher level directory that the client should not have had access to.

While a correctly written set of programs would not fall prey to this problem, the UTC decided that given the real-world situations it would be better to close off that avenue.

So what about interpreting a surrogate pair encoded in UTF-8 as two separate 3-byte sequences? (For example, interpreting the UTF-8 sequence as UTF-16 (equivalently as UTF-32 <0001>)). It is still permissible according to the conformance rules to interpret such UTF-8 sequences, although not to generate them. The Unicode Technical Committee has debated this last issue at length, but has not made a final decision about how to deal with it. It is complicated by widespread practice of actually generating those types of sequences in older software.

Mark

- Original Message -
From: "G. Adam Stanislav" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 29, 2000 22:42
Subject: Re: UTF-8 Corrigendum, new Glossary

> At 21:08 29-11-2000 -0800, Mark Davis wrote:
> >1. The Unicode Technical Committee has modified the definition of UTF-8 to
> >forbid conformant implementations from interpreting non-shortest forms for
> >BMP characters,
>
> I find this silly. That creation of such forms would be forbidden I can see
> and agree to. But interpretation? I understand the reasoning when security
> is an issue. But why make it flat illegal? There are many applications
> where such a sequence poses no security danger.
>
> Whatever happened to the ancient "abusus non tollit usum" principle? Looks
> like Big Brother to me...
>
> Adam
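For readers who want to see the Process A / B / C scenario concretely, here is a minimal Python sketch of it. Everything in it is an assumption made for illustration: the function names `process_a_is_safe` and `lenient_decode` and the file name `program.exe` are hypothetical, and the real incident involved different software. The backslash (U+005C) is hidden as the non-shortest two-byte sequence C1 9C.

```python
def process_a_is_safe(raw: bytes) -> bool:
    """Process A (hypothetical): byte-level check for a literal '..\\'."""
    return b"..\\" not in raw


def lenient_decode(data: bytes) -> str:
    """Process B (hypothetical): decodes UTF-8 and, unsafely, also
    interprets non-shortest forms instead of rejecting them."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, n = b, 0
        elif 0xC0 <= b < 0xE0:
            cp, n = b & 0x1F, 1
        elif 0xE0 <= b < 0xF0:
            cp, n = b & 0x0F, 2
        elif 0xF0 <= b < 0xF8:
            cp, n = b & 0x07, 3
        else:
            raise ValueError(f"bad lead byte at offset {i}")
        trail = data[i + 1 : i + 1 + n]
        if len(trail) != n or any(t & 0xC0 != 0x80 for t in trail):
            raise ValueError(f"bad trail bytes at offset {i}")
        for t in trail:
            cp = (cp << 6) | (t & 0x3F)
        out.append(chr(cp))
        i += n + 1
    return "".join(out)


payload = b"..\xc1\x9cprogram.exe"

assert process_a_is_safe(payload)        # Process A sees nothing wrong
path = lenient_decode(payload)           # Process B interprets C1 9C as '\'
assert path == "..\\program.exe"         # Process C now receives the traversal

try:
    payload.decode("utf-8")              # Python's built-in decoder is strict
except UnicodeDecodeError:
    print("strict decoder rejected the non-shortest form")
```

The sketch also shows why forbidding interpretation closes the hole: with a strict decoder in Process B, the payload never reaches Process C in decoded form.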
Re: UTF-8 Corrigendum, new Glossary
In message <[EMAIL PROTECTED]> "G. Adam Stanislav" <[EMAIL PROTECTED]> wrote:

> At 21:08 29-11-2000 -0800, Mark Davis wrote:
> >1. The Unicode Technical Committee has modified the definition of UTF-8 to
> >forbid conformant implementations from interpreting non-shortest forms for
> >BMP characters,
>
> I find this silly. That creation of such forms would be forbidden I can see
> and agree to. But interpretation? I understand the reasoning when security
> is an issue. But why make it flat illegal? There are many applications
> where such a sequence poses no security danger.

Consistency. If some implementations won't read the non-shortest forms and some will, you end up in the mess that HTML has fallen into due to lack of rigorous parsing.

"This file is illegal."
"But it works on my system!"

--
Kevin Bracey, Principal Software Engineer
Pace Micro Technology plc             Tel: +44 (0) 1223 518566
645 Newmarket Road                    Fax: +44 (0) 1223 518526
Cambridge, CB5 8PB, United Kingdom    WWW: http://www.pace.co.uk/
Re: UTF-8 Corrigendum, new Glossary
At 21:08 29-11-2000 -0800, Mark Davis wrote:
>1. The Unicode Technical Committee has modified the definition of UTF-8 to
>forbid conformant implementations from interpreting non-shortest forms for
>BMP characters,

I find this silly. That creation of such forms would be forbidden I can see and agree to. But interpretation? I understand the reasoning when security is an issue. But why make it flat illegal? There are many applications where such a sequence poses no security danger.

Whatever happened to the ancient "abusus non tollit usum" principle? Looks like Big Brother to me...

Adam