Re: UTF-8 Corrigendum, new Glossary

2000-12-05 Thread G. Adam Stanislav

On Thu, Nov 30, 2000 at 05:28:51PM -0800, David Starner wrote:
>Is that your rule in all cases, to try and guess what they meant and do
>that?

Not in all cases. But this particular Ister interpreter is designed
to run CGI scripts. When it comes to CGI languages, I have the philosophy
of graceful degradation: If I can interpret it, I will. Otherwise,
the user (the person who is browsing, not the webmaster) might be confused.

> It'll be hell on anyone who has to try and interpret Ister if 
>there's a large chunk of code that follows no standards, but was read by
>the original interpreter.

If it follows no standards, my interpreter will throw up hands. But if
the source code says it is in UTF-8, and I can decode it, I will. If it
says it is in Latin1 (or some other encoding), I will convert it to UTF-8.
In either case, my output will always be legal UTF-8.
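
A minimal Python sketch of the policy described in this paragraph, trusting the declared encoding and always emitting well-formed UTF-8. The function name and the set of accepted encoding labels are assumptions made for illustration; this is not the Ister source. Note that Python's built-in decoder is strict, so this sketch rejects non-shortest forms rather than deciphering them.

```python
def to_legal_utf8(raw: bytes, declared: str) -> bytes:
    """Re-encode raw as well-formed UTF-8, trusting the declared encoding."""
    name = declared.strip().lower()
    if name in ("utf-8", "utf8"):
        # Strict decode: raises UnicodeDecodeError on ill-formed input.
        text = raw.decode("utf-8")
    elif name in ("latin1", "latin-1", "iso-8859-1"):
        # Every byte value maps to a code point, so this cannot fail.
        text = raw.decode("latin-1")
    else:
        raise ValueError(f"unsupported declared encoding: {declared!r}")
    return text.encode("utf-8")


print(to_legal_utf8(b"caf\xe9", "Latin1"))
```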

>(Or even later versions of that interpreter -
>I've hung around the gcc lists long enough to know that people don't
>like "that's no longer supported" or even "that was never officially
>supported.") 

I have been programming since 1965, and I have never said that. I have
always gone to great pains to make sure later versions of my software
could handle the data expected by older versions. Or, in some cases,
I supplied conversion software, so old files could be converted to a
new format.

>Even if it works fine in the case of your interpreter, it'll come to
>problems when it gets fed through a UTF-8 conformant (or non-multi-byte
>aware) text tool that won't interpret over-long sequences. Especially
>non-multi-byte aware tools, since they will seem to work and silently
>get stuff wrong. It seems better just to refuse it, and force the buggy
>software to get fixed, than have a bunch of obscure bugs show up later.

Well, the worst bug this particular language will produce is HTML with
the wrong text or tags. Presumably, any webmaster worth his keep will check
the output of his code before posting it on the web, and will fix his
source code.

All it does is convert from one mark-up language (Ister) to another (HTML/
SGML/XML), e.g., it will convert:

^p (Hello, World!^/br ^b (Here I come!))

to:

Hello, World!Here I come!

No big security issues at stake here. :)

Cheers,
Adam

-- 
When two do the same, it's not the same
-- Slovak proverb



Re: UTF-8 Corrigendum, new Glossary

2000-11-30 Thread David Starner

On Thu, Nov 30, 2000 at 04:48:56PM -0800, G. Adam Stanislav wrote:
> If the source (in Ister) uses illegal but decipherable UTF-8, my
> software accepts it. Naturally, before it sends it out it transforms
> it to perfectly legal UTF-8. The idea I should reject it is silly
> (and, no, the "internal data" clause does not apply here: my software
> accepts data from an external source). Rejecting it would mean
> that if the web page designer used some design software that messed
> up the UTF-8 encoding, the web page would suddenly miss a letter here,
> a letter there. 

It could do a lot more than that - if the encoding is messed up, you
could be getting anything, from Latin-1, to UTF-16, to pure noise. So
why does this particular mistake matter more than those? It's the
responsibility of the design software to get it right, not your code's
obligation to try and understand it. 

> Not rejecting it poses no security risk, so, for this
> specific application it is better to accept it (and correct it) than
> to reject it.

Is that your rule in all cases, to try and guess what they meant and do
that? It'll be hell on anyone who has to try and interpret Ister if 
there's a large chunk of code that follows no standards, but was read by
the original interpreter. (Or even later versions of that interpreter -
I've hung around the gcc lists long enough to know that people don't
like "that's no longer supported" or even "that was never officially
supported.") 

Even if it works fine in the case of your interpreter, it'll come to
problems when it gets fed through a UTF-8 conformant (or non-multi-byte
aware) text tool that won't interpret over-long sequences. Especially
non-multi-byte aware tools, since they will seem to work and silently
get stuff wrong. It seems better just to refuse it, and force the buggy
software to get fixed, than have a bunch of obscure bugs show up later.

-- 
David Starner - [EMAIL PROTECTED]
http://dvdeug.dhis.org
Looking for a Debian developer in the Stillwater, Oklahoma area 
to sign my GPG key



Re: UTF-8 Corrigendum, new Glossary

2000-11-30 Thread James Kass

G. Adam Stanislav wrote:

> On Thu, Nov 30, 2000 at 10:18:07AM -0800, Markus Scherer wrote:
> >you are free to write and use a non-conformant implementation. just be aware of
> >what that means... :-)
> >markus
>
> I guess it means I'm a non-conformist. :)
>

It's tempting to make an observation about non-conformists
who use international standards, but I'm living in a glass house.

Best regards,

James Kass.






Re: UTF-8 Corrigendum, new Glossary

2000-11-30 Thread Kenneth Whistler

Adam said:

> On Thu, Nov 30, 2000 at 10:18:07AM -0800, Markus Scherer wrote:
> >you are free to write and use a non-conformant implementation. just be aware of
> >what that means... :-)
> >markus
> 
> I guess it means I'm a non-conformist. :)
> 
> I am currently working on software that takes mark-up made in one
> mark-up language (Ister) and translates it into another (HTML). It
> uses UTF-8, and works as CGI, i.e., generates HTML dynamically on a web
> server (see http://www.whizkidtech.net/ister/ for unfinished docs).
> 
> If the source (in Ister) uses illegal but decipherable UTF-8, my
> software accepts it. Naturally, before it sends it out it transforms
> it to perfectly legal UTF-8. The idea I should reject it is silly
> (and, no, the "internal data" clause does not apply here: my software
> accepts data from an external source).

Basically, you've already answered your own question. If you recognize
that your source data is "illegal UTF-8", and if you know that you
are passing it into a controlled environment where it does not
pose a security risk, then effectively you can have one layer that
unmangles the "decipherable" but illegal UTF-8, and passes it to the
layer that interprets legal UTF-8.

As long as this is above board and explicit, then you should be o.k.
It is the conversion process that just silently interprets non-shortest
UTF-8 without discrimination in an uncontrolled environment that
is dangerous.

> Rejecting it would mean
> that if the web page designer used some design software that messed
> up the UTF-8 encoding, the web page would suddenly miss a letter here,
> a letter there. Not rejecting it poses no security risk, so, for this
> specific application it is better to accept it (and correct it) than
> to reject it.

As I read it, this would fall under the mangled text note.

The "internal" note is referring to functions that don't *check* for
illegal code unit sequences. Those are not conformant unless being
used on certifiably legal data. But if your function is checking and
catching illegal code unit sequences explicitly, you can fix them
and proceed, as long as you know you are not part of a process pipeline
that could lead to a security problem by doing so.
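
The "check, catch, fix, and proceed" layer described above might be sketched in Python roughly as follows. The function name `repair_utf8` and the choice to report byte offsets are assumptions for illustration only. The sketch accepts non-shortest forms, records where they occurred so the repair stays explicit, and hands only well-formed UTF-8 to the next layer. Anything it cannot decipher is rejected.

```python
from typing import List, Tuple

# Smallest code point that legitimately needs a sequence of each length.
MIN_FOR_LEN = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000}


def repair_utf8(raw: bytes) -> Tuple[bytes, List[int]]:
    """Return (well-formed UTF-8, offsets of the non-shortest forms repaired)."""
    chars = []
    repaired = []
    i = 0
    while i < len(raw):
        lead = raw[i]
        if lead < 0x80:
            n, cp = 1, lead
        elif 0xC0 <= lead < 0xE0:
            n, cp = 2, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:
            n, cp = 3, lead & 0x0F
        elif 0xF0 <= lead < 0xF8:
            n, cp = 4, lead & 0x07
        else:
            raise ValueError(f"undecipherable lead byte at offset {i}")
        tail = raw[i + 1:i + n]
        if len(tail) != n - 1 or any(t & 0xC0 != 0x80 for t in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value at offset {i}")
        if cp < MIN_FOR_LEN[n]:
            # Decipherable but illegal: longer than the shortest form.
            repaired.append(i)
        chars.append(chr(cp))
        i += n
    return "".join(chars).encode("utf-8"), repaired


clean, fixed_at = repair_utf8(b"A\xc0\xafB")
print(clean, fixed_at)
```

Returning the offsets lets the caller log or refuse the input depending on context, which keeps the lenient path explicit rather than silent.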

The point of the UTF-8 corrigendum was not to force people to do
unreasonable things with their software, but rather to tighten up
the definition sufficiently so that people could claim secure
implementations of UTF-8.

--Ken




Re: UTF-8 Corrigendum, new Glossary

2000-11-30 Thread G. Adam Stanislav

On Thu, Nov 30, 2000 at 10:18:07AM -0800, Markus Scherer wrote:
>you are free to write and use a non-conformant implementation. just be aware of what 
>that means... :-)
>markus

I guess it means I'm a non-conformist. :)

I am currently working on software that takes mark-up made in one
mark-up language (Ister) and translates it into another (HTML). It
uses UTF-8, and works as CGI, i.e., generates HTML dynamically on a web
server (see http://www.whizkidtech.net/ister/ for unfinished docs).

If the source (in Ister) uses illegal but decipherable UTF-8, my
software accepts it. Naturally, before it sends it out it transforms
it to perfectly legal UTF-8. The idea I should reject it is silly
(and, no, the "internal data" clause does not apply here: my software
accepts data from an external source). Rejecting it would mean
that if the web page designer used some design software that messed
up the UTF-8 encoding, the web page would suddenly miss a letter here,
a letter there. Not rejecting it poses no security risk, so, for this
specific application it is better to accept it (and correct it) than
to reject it.

Cheers,
Adam

-- 
Don't send me spam, I'm a vegetarian



Re: UTF-8 Corrigendum, new Glossary

2000-11-30 Thread G. Adam Stanislav

On Thu, Nov 30, 2000 at 07:12:37AM -0800, Mark Davis wrote:
>We know of specific situations that caused problems, as outlined in the
>Corrigendum.

That does not justify forbidding it in other situations (ask the NRA :) ).

Adam

-- 
When a finger points at the Moon... do you look at the Moon?
Or, do you prefer to worship the finger?
-- Unknown Zen Master



Re: UTF-8 Corrigendum, new Glossary

2000-11-30 Thread Michael (michka) Kaplan

And to be clear, what it means in this case:

1) People have security concerns about UTF-8
2) The Unicode Consortium has an official solution to address these
concerns
3) Your implementation does not

The "People" from (1) can believe what they will about your implementation!

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/



- Original Message -
From: "Markus Scherer" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Thursday, November 30, 2000 10:18 AM
Subject: Re: UTF-8 Corrigendum, new Glossary


> Kevin Bracey wrote:
> > > I find this silly. That creation of such forms would be forbidden I can see
> > > and agree to. But interpretation? I understand the reasoning when security
> > > is an issue. But why make it flat illegal? There are many applications
> > > where such a sequence poses no security danger.
>
> you are free to write and use a non-conformant implementation. just be
> aware of what that means... :-)
> markus
>




Re: UTF-8 Corrigendum, new Glossary

2000-11-30 Thread Markus Scherer

Kevin Bracey wrote:
> > I find this silly. That creation of such forms would be forbidden I can see
> > and agree to. But interpretation? I understand the reasoning when security
> > is an issue. But why make it flat illegal? There are many applications
> > where such a sequence poses no security danger.

you are free to write and use a non-conformant implementation. just be aware of what 
that means... :-)
markus



Re: UTF-8 Corrigendum, new Glossary

2000-11-30 Thread Doug Ewell

"G. Adam Stanislav" <[EMAIL PROTECTED]> wrote:

>> 1. The Unicode Technical Committee has modified the definition of
>> UTF-8 to forbid conformant implementations from interpreting non-
>> shortest forms for BMP characters,
>
> I find this silly. That creation of such forms would be forbidden I
> can see and agree to. But interpretation? I understand the reasoning
> when security is an issue. But why make it flat illegal? There are
> many applications where such a sequence poses no security danger.

I used to be concerned about that.  I think I cited the example of an
encyclopedia on CD-ROM with text in UTF-8.  Obviously this text is all
internal and almost certainly valid, and there are no security holes
involved, so the UTF-8 decoder can take certain shortcuts.

But this is now covered in the corrigendum:

> Internally, a particular function might be used that does not check
> for illegal code unit sequences.  However, a conformant process can
> use that function _only_ on data that has already been certified to
> not contain any illegal code unit sequences.

The word "certified" did make me chuckle, though.  Who would do the
certifying?  Katherine Harris?
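
One way to read the "certified data" clause in code is a wrapper type that can only be built by passing a strict check, so that unchecked shortcuts accept nothing else. The sketch below is an assumption about how that might look in Python; `CertifiedUtf8` and `fast_count_code_points` are hypothetical names, not anything from the corrigendum.

```python
class CertifiedUtf8:
    """Bytes that have passed a strict well-formedness check exactly once."""

    def __init__(self, raw: bytes):
        # Python's decoder is strict: it raises UnicodeDecodeError on
        # non-shortest forms and other ill-formed sequences.
        raw.decode("utf-8")
        self.raw = raw


def fast_count_code_points(data: CertifiedUtf8) -> int:
    """Unchecked shortcut: count the bytes that are not continuation bytes.

    This is only correct on well-formed input, which the type guarantees.
    """
    return sum(1 for b in data.raw if b & 0xC0 != 0x80)


text = CertifiedUtf8("na\u00efve".encode("utf-8"))
print(fast_count_code_points(text))
```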

-Doug Ewell
 Fullerton, California



Re: UTF-8 Corrigendum, new Glossary

2000-11-30 Thread Mark Davis

We know of specific situations that caused problems, as outlined in the
Corrigendum.

- Process A performs security checks, but does not check for non-shortest
  forms.
- Process B accepts the byte sequence from process A, and transforms it
  into UTF-16 while interpreting non-shortest forms.
- The UTF-16 text may then contain characters that should have been
  filtered out by process A.
- Process C interprets the text, and does something bad.

The case was with "..\". It was "hidden" in a non-shortest form. Process A
missed it. After this was all done, Process C interpreted it, and executed a
program in a higher level directory that the client should not have had
access to. While a correctly written set of programs would not fall prey to
this problem, the UTC decided that given the real-world situations it would
be better to close off that avenue.
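
The three-process scenario above can be reproduced in a few lines of Python. Everything named here (`process_a`, `process_b`, `process_c`, the `ROOT` directory, the file name) is a hypothetical stand-in chosen for illustration, and Process B produces a Python string rather than UTF-16. The backslash is hidden as the two-byte non-shortest form C1 9C.

```python
import ntpath
import re

# Hypothetical document root, chosen only for this illustration.
ROOT = "C:\\site\\public"

# 0xC0 and 0xC1 never occur in well-formed UTF-8; followed by a continuation
# byte they form a two-byte non-shortest encoding of an ASCII character.
OVERLONG_2 = re.compile(rb"[\xc0\xc1][\x80-\xbf]")


def process_a(data: bytes) -> bytes:
    """Security check on raw bytes that does not look for non-shortest forms."""
    if b"..\\" in data or b"../" in data:
        raise PermissionError("parent-directory reference rejected")
    return data


def process_b(data: bytes) -> str:
    """Converter that interprets non-shortest forms instead of rejecting them."""
    def shortest(match):
        lead, trail = match.group(0)
        return bytes([((lead & 0x1F) << 6) | (trail & 0x3F)])
    return OVERLONG_2.sub(shortest, data).decode("utf-8")


def process_c(path: str) -> str:
    """Consumer that trusts the earlier check and resolves the path."""
    return ntpath.normpath(ntpath.join(ROOT, path))


request = b"..\xc1\x9csecret.txt"
print(process_c(process_b(process_a(request))))  # C:\site\secret.txt
```

A conformant decoder in place of `process_b` closes the hole: `request.decode("utf-8")` raises `UnicodeDecodeError` in Python 3, so the hidden backslash never reaches Process C.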


So what about interpreting a surrogate pair encoded in UTF-8 as two separate
3-byte sequences? (For example, interpreting the UTF-8 sequence
<ED A0 80 ED B0 80> as UTF-16 <D800 DC00> (equivalently as UTF-32
<00010000>)). It is
still permissible according to the conformance rules to interpret such UTF-8
sequences, although not to generate them. The Unicode Technical Committee
has debated this last issue at length, but has not made a final decision
about how to deal with it. It is complicated by widespread practice of
actually generating those types of sequences in older software.
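
The arithmetic behind such a sequence can be worked through in Python. The sketch below takes U+10000 as an illustrative code point; the helper names are hypothetical. Each surrogate is decoded from its own three-byte sequence, and the pair is then combined using the standard UTF-16 formula.

```python
def decode_3byte(seq: bytes) -> int:
    """Decode one three-byte UTF-8 sequence without rejecting surrogates."""
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def combine_surrogate_utf8(raw: bytes) -> int:
    """Interpret six bytes as a UTF-8-encoded surrogate pair."""
    high = decode_3byte(raw[0:3])
    low = decode_3byte(raw[3:6])
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


pair = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])
scalar = combine_surrogate_utf8(pair)
print(hex(scalar))                        # 0x10000
print(chr(scalar).encode("utf-8").hex())  # f0908080, the four-byte form
```

Python 3's own decoder rejects the six-byte form, so `pair.decode("utf-8")` raises `UnicodeDecodeError`.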

Mark

- Original Message -
From: "G. Adam Stanislav" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 29, 2000 22:42
Subject: Re: UTF-8 Corrigendum, new Glossary


> At 21:08 29-11-2000 -0800, Mark Davis wrote:
> >1. The Unicode Technical Committee has modified the definition of UTF-8 to
> >forbid conformant implementations from interpreting non-shortest forms for
> >BMP characters,
>
> I find this silly. That creation of such forms would be forbidden I can see
> and agree to. But interpretation? I understand the reasoning when security
> is an issue. But why make it flat illegal? There are many applications
> where such a sequence poses no security danger.
>
> Whatever happened to the ancient "abusus non tollit usum" principle? Looks
> like Big Brother to me...
>
> Adam




Re: UTF-8 Corrigendum, new Glossary

2000-11-30 Thread Kevin Bracey

In message <[EMAIL PROTECTED]>
  "G. Adam Stanislav" <[EMAIL PROTECTED]> wrote:

> At 21:08 29-11-2000 -0800, Mark Davis wrote:
> >1. The Unicode Technical Committee has modified the definition of UTF-8 to
> >forbid conformant implementations from interpreting non-shortest forms for
> >BMP characters,
> 
> I find this silly. That creation of such forms would be forbidden I can see
> and agree to. But interpretation? I understand the reasoning when security
> is an issue. But why make it flat illegal? There are many applications
> where such a sequence poses no security danger.
> 

Consistency. If some implementations won't read the non-shortest forms and
some will, you end up in the mess that HTML has fallen into due to lack of
rigorous parsing. "This file is illegal." "But it works on my system!"

-- 
Kevin Bracey, Principal Software Engineer
Pace Micro Technology plc            Tel: +44 (0) 1223 518566
645 Newmarket Road                   Fax: +44 (0) 1223 518526
Cambridge, CB5 8PB, United Kingdom   WWW: http://www.pace.co.uk/



Re: UTF-8 Corrigendum, new Glossary

2000-11-29 Thread G. Adam Stanislav

At 21:08 29-11-2000 -0800, Mark Davis wrote:
>1. The Unicode Technical Committee has modified the definition of UTF-8 to
>forbid conformant implementations from interpreting non-shortest forms for
>BMP characters,

I find this silly. That creation of such forms would be forbidden I can see
and agree to. But interpretation? I understand the reasoning when security
is an issue. But why make it flat illegal? There are many applications
where such a sequence poses no security danger.

Whatever happened to the ancient "abusus non tollit usum" principle? Looks
like Big Brother to me...

Adam