Re: utf-8 encoding scheme

2000-08-04 Thread Markus Kuhn

Bruno Haible wrote on 2000-08-03 21:10 UTC:
> The point of ASCII compatibility of UTF-8 is that software changes are
> kept to a minimum. It would be stupid if the kernel had to verify
> every filename passed to it via a system call or read from disk to see
> whether it's well-formed UTF-8. For the kernel, a filename continues
> to be a sequence of bytes with a NUL byte at the end.

Amen!

As to the remaining discussion between Henry and Peter, please reread

  http://mail.nl.linux.org/linux-utf8/2000-07/msg00073.html

carefully, which I believe to be the only appropriate answer here.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: 

-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/lists/



Re: utf-8 encoding scheme

2000-08-03 Thread Bruno Haible

Henry Spencer writes:

> If you're not using the raw input, why does this matter?

Many programs use the raw input, for example the kernel - which
doesn't know about encodings at the filename level -, 90% of the GNU
fileutils, 50% of the GNU textutils, etc. Whereas others know that
it's UTF-8 and perform a conversion in order to have a different
internal representation.

The point of ASCII compatibility of UTF-8 is that software changes are
kept to a minimum. It would be stupid if the kernel had to verify
every filename passed to it via a system call or read from disk to see
whether it's well-formed UTF-8. For the kernel, a filename continues
to be a sequence of bytes with a NUL byte at the end.

Bruno



Re: utf-8 encoding scheme

2000-08-03 Thread H. Peter Anvin

Followup to:  <[EMAIL PROTECTED]>
By author: Henry Spencer <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
>
> On 3 Aug 2000, H. Peter Anvin wrote:
> > > ...The potential for security holes comes when you
> > > attempt to use the raw input, *without* decoding it.  It is the
> > > *non-decoding* users who are vulnerable.
> > 
> > Great.  Now you have a datastream which may contain, say, embedded '/'
> > in filenames, or null characters.  If you then convert them back to
> > UTF-8 you now have a string referring to a potentially completely
> > different file than you started with.
> 
> If you're not using the raw input, why does this matter?  My point stands: 
> it's only people who try to use the raw input -- that is, users who are
> *not* decoding -- who are vulnerable.  If you always decode the input
> before processing it, checking it, filtering it, etc., then games played
> with non-minimal encodings *cannot* affect you. 
> 

Sure.  Now find a case where that *isn't* going to happen.  There are
enough layers of software you need to worry about -- including the
filesystem itself.  Seriously.  Your argument sounds a lot like "if
your computer is off, you can't break into it" -- a truism, but an
utterly useless one.

-hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: utf-8 encoding scheme

2000-08-03 Thread Henry Spencer

On 3 Aug 2000, H. Peter Anvin wrote:
> > ...The potential for security holes comes when you
> > attempt to use the raw input, *without* decoding it.  It is the
> > *non-decoding* users who are vulnerable.
> 
> Great.  Now you have a datastream which may contain, say, embedded '/'
> in filenames, or null characters.  If you then convert them back to
> UTF-8 you now have a string referring to a potentially completely
> different file than you started with.

If you're not using the raw input, why does this matter?  My point stands: 
it's only people who try to use the raw input -- that is, users who are
*not* decoding -- who are vulnerable.  If you always decode the input
before processing it, checking it, filtering it, etc., then games played
with non-minimal encodings *cannot* affect you. 
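Henry's claim can be checked against any strict decoder: a non-minimal sequence never survives decoding, so code that operates only on decoded text never sees an aliased character. A minimal sketch in Python (used here purely as a convenient checker; the thread's context is C):

```python
# A strict UTF-8 decoder refuses non-minimal (overlong) sequences outright,
# so code that decodes before processing never sees a smuggled '/' or NUL.
overlong_slash = b'\xc0\xaf'   # overlong 2-byte spelling of '/' (U+002F)

try:
    overlong_slash.decode('utf-8')   # strict mode is the default
    decoded = True
except UnicodeDecodeError:
    decoded = False

print(decoded)  # False: the overlong form is rejected, not aliased to '/'
```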

  Henry Spencer
   [EMAIL PROTECTED]




Re: utf-8 encoding scheme

2000-08-03 Thread H. Peter Anvin

Followup to:  <[EMAIL PROTECTED]>
By author: Henry Spencer <[EMAIL PROTECTED]>
In newsgroup: linux.utf8

> Um, no, I think you've missed my point.  The user of a decoder is *not*
> going to get bitten by these security holes, because he's
> *decoding*.

... and thus losing any verification done by any other layer of
software.

> The act of decoding transforms the input into a form where these
> holes do not exist.  The potential for security holes comes when you
> attempt to use the raw input, *without* decoding it.  It is the
> *non-decoding* users who are vulnerable.

Great.  Now you have a datastream which may contain, say, embedded '/'
in filenames, or null characters.  If you then convert them back to
UTF-8 you now have a string referring to a potentially completely
different file than you started with.  If this isn't a security hole,
I don't know what is.
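The attack hpa describes can be sketched concretely. Assume a hypothetical byte-level filter that scans a raw filename for '/', followed by a decoder that (wrongly) accepts overlong forms; the decoder below is a deliberately buggy toy written for illustration, not any real library:

```python
def lenient_decode(data: bytes) -> str:
    """Toy decoder that WRONGLY accepts overlong 2-byte sequences."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                    # plain ASCII byte
            out.append(chr(b)); i += 1
        elif 0xC0 <= b < 0xE0 and i + 1 < len(data) and data[i + 1] & 0xC0 == 0x80:
            # 110xxxxx 10xxxxxx -- no shortest-form check: that's the bug
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))); i += 2
        else:
            raise ValueError("sequence not handled by this sketch")
    return ''.join(out)

raw = b'uploads\xc0\xafetc'    # no 0x2F byte anywhere in the raw stream
assert b'/' not in raw          # so a byte-level filter sees nothing wrong
print(lenient_decode(raw))      # uploads/etc -- the '/' reappears after decoding
```

A path check done on the raw bytes and a path lookup done on the decoded string thus disagree about which file is named, which is exactly the hole.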

> This being so, decoding users -- who are not vulnerable -- may balk at
> having their programs misbehave on inputs which do not threaten them anyway.

This is complete nonsense.  See above.

> > Implicit aliases are very dangerous.
> 
> I agree, but the problem is to protect the non-decoding users, and doing
> substitutions in decoders may not be the best way to do that. 
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: utf-8 encoding scheme

2000-07-27 Thread Henry Spencer

On 21 Jul 2000, H. Peter Anvin wrote:
> > > One possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
> > > CHARACTER on encountering illegal sequences.
> > Unless you are Bill Gates and have the power to decree that your users
> > *will* use your preferred decoder, this may be a mistake.  Remember that
> > the users of a decoder see no advantage from this behavior, since they are
> > canonicalizing anyway.
> 
> Um... not so...
> The user of the decoder is the user that gets bitten by these security
> holes...

Um, no, I think you've missed my point.  The user of a decoder is *not*
going to get bitten by these security holes, because he's *decoding*.  The
act of decoding transforms the input into a form where these holes do not
exist.  The potential for security holes comes when you attempt to use the
raw input, *without* decoding it.  It is the *non-decoding* users who are
vulnerable. 

This being so, decoding users -- who are not vulnerable -- may balk at
having their programs misbehave on inputs which do not threaten them anyway.

> Implicit aliases are very dangerous.

I agree, but the problem is to protect the non-decoding users, and doing
substitutions in decoders may not be the best way to do that. 

  Henry Spencer
   [EMAIL PROTECTED]

-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/lists/



Re: utf-8 encoding scheme

2000-07-23 Thread Florian Weimer

  Larry Wall <[EMAIL PROTECTED]> writes:

> [EMAIL PROTECTED] writes:
> :   "H. Peter Anvin" <[EMAIL PROTECTED]> writes:
> : 
> : > The alternate spelling
> : > 
> : >   11000001 10001011
> : > 
> : > ... is not the character K (U+004B) but INVALID SEQUENCE.  One
> : > possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
> : > CHARACTER on encountering illegal sequences.
> : 
> : Is there any consensus whether to use one or two U+FFFD characters in
> : such situations? For example, what do Perl, Tcl and Java do here?

In the meantime, I've looked at Tcl: Invalid UTF-8 sequences are
treated as characters from ISO-8859-1, i.e. the sequence "c0 80" is
converted to "U+00C0 U+0080".  (Perhaps my test routine is wrong?
This behavior doesn't match the comments in the C source.)

Python is going to follow RFC 2279 strictly.  Invalid UTF-8 sequences
raise an exception or are replaced by U+FFFD characters (how many of
them is still subject to debate, that's why I asked).

Sun's Java documentation doesn't specify what happens if their
UTF-8 decoder is fed with invalid sequences.  It's probably
implementation-dependent.
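For comparison with the survey above, present-day CPython (an anachronism relative to this thread, shown only as a data point) replaces each byte of an invalid sequence such as "c0 80" separately, so it emits two U+FFFD characters:

```python
# How many U+FFFD does CPython substitute for the invalid sequence c0 80?
# 0xC0 can never start a valid sequence, so each byte is replaced: two.
bad = b'\xc0\x80'
replaced = bad.decode('utf-8', errors='replace')
print(len(replaced))  # 2
```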

> At the moment Perl does no input validation on UTF-8.

> That being said, we will certainly be having input disciplines that do
> validation and canonicalization, and I'd imagine that we'll allow the
> user to choose how picky to be.

Thanks for your explanation.  IOW, the Unicode/UTF-8 support in Perl
is still quite rudimentary.

Anyway, why do most UTF-8 decoders ignore the advice in RFC
2279?  Maybe Bruce Schneier is right after all when he claims UTF-8 is
inherently insecure.  Perhaps we would have been better off with
a slightly more complicated format in which there is exactly one
representation for each UCS character which can be encoded (like
UTF-16, for example).



Re: utf-8 encoding scheme

2000-07-22 Thread Larry Wall

[EMAIL PROTECTED] writes:
:   "H. Peter Anvin" <[EMAIL PROTECTED]> writes:
: 
: > The alternate spelling
: > 
: > 1101 10001011
: > 
> : > ... is not the character K (U+004B) but INVALID SEQUENCE.  One
: > possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
: > CHARACTER on encountering illegal sequences.
: 
: Is there any consensus whether to use one or two U+FFFD characters in
> : such situations? For example, what do Perl, Tcl and Java do here?

At the moment Perl does no input validation on UTF-8.  This is not as
big a problem as you might expect, since in high-security situations
Perl marks any input strings as "tainted", so you can't use them
directly in secure operations anyway.  And when "vetting" such strings
for use in secure operations, we always tell people to check for the
presence of "good" characters, not the absence of "bad" characters.
That's just good policy regardless of the character set.
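Larry's allowlist advice can be sketched directly; the particular pattern of "good" characters below is an assumption chosen for illustration, not Perl's actual policy:

```python
import re

# Vet a filename by requiring known-good characters, rather than scanning
# for known-bad ones; anything outside the allowlist is rejected.
GOOD = re.compile(r'[A-Za-z0-9_.-]+')   # illustrative policy only

def is_safe_name(name: str) -> bool:
    return GOOD.fullmatch(name) is not None

print(is_safe_name('report-2000.txt'))   # True
print(is_safe_name('../etc/passwd'))     # False: '/' is not on the allowlist
```

The allowlist approach also sidesteps the overlong-encoding problem: an overlong '/' is not the byte 0x2F, but it is not an allowlisted byte either.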

That being said, we will certainly be having input disciplines that do
validation and canonicalization, and I'd imagine that we'll allow the
user to choose how picky to be.  If they don't choose, how picky the
default discipline will be may depend on whether we're running in a
high-security situation (that is, whether taint mode is turned on).
One of the reasons we chose UTF-8 for the internal representation of
strings in Perl was so that we could slurp in a UTF-8 file very
efficiently.

As for whether a strict discipline ought to substitute one or two
U+FFFD characters for the sequence above, that'd probably depend on
whether you thought the author of the data was trying to sneak some
naughty bits in, or just screwed up by embedding Latin-1 in a UTF-8
file.  I expect that the latter is likelier in practice.

On the other hand, good security experts never attribute to stupidity
that which can adequately be explained by malice.

Larry



Re: utf-8 encoding scheme

2000-07-22 Thread Florian Weimer

  "H. Peter Anvin" <[EMAIL PROTECTED]> writes:

> The alternate spelling
> 
>   11000001 10001011
> 
> ... is not the character K (U+004B) but INVALID SEQUENCE.  One
> possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
> CHARACTER on encountering illegal sequences.

Is there any consensus whether to use one or two U+FFFD characters in
such situations? For example, what do Perl, Tcl and Java do here?



Re: utf-8 encoding scheme

2000-07-21 Thread H. Peter Anvin

Followup to:  <[EMAIL PROTECTED]>
By author: Henry Spencer <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
> 
> > This is incredibly important, since any misguided
> > attempt to "be liberal in what you accept" without addition of an
> > explicit canonicalization step would lead to the kind of security
> > holes that Microsoft web-related applications have been so full of...
> 
> There is somewhat of a contradiction here:  the usual place to "be liberal
> in what you accept" is a decoder, which has an inherent canonicalization
> step.  In practice, I think the only place where you make a meaningful
> *choice* as to whether to be liberal or not is when canonicalization is
> imminent anyway.  Someone who is comparing the raw UTF-8 sequences is
> making a mistake, yes, but it's not a "be liberal in what you accept"
> mistake, it's an error in choice of working representation. 
> 

It's not such a contradiction, actually.  If your decoder does
canonicalization, you absolutely must guarantee that the stream of
bits has not been in any way, shape, or form analyzed or filtered
prior to decoding.  This is usually an extremely difficult thing to
guarantee; in fact, the design of UTF-8 is done exactly so that common
text string operations such as (sub)string matching and sorting in
encoding order are still valid.  These guarantees *require* that
illegal encodings be rejected.  Thus, unless you are willing to do
some *serious* security analysis, being liberal in what to accept is
probably a very bad idea in this particular context.
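The guarantee hpa refers to can be demonstrated: for shortest-form UTF-8, comparing the encoded byte strings gives the same order as comparing code points, so byte-level sorting and substring tests remain valid without decoding. A small check in Python (illustrative only):

```python
# For valid (shortest-form) UTF-8, byte order == code point order, so
# sorting the encoded bytes sorts the decoded text identically.
words = ['zebra', 'apple', '\u00e9clair', '\u4e2d\u6587']
by_codepoint = sorted(words)
by_bytes = [b.decode('utf-8') for b in sorted(w.encode('utf-8') for w in words)]
print(by_codepoint == by_bytes)  # True -- holds only because overlong forms are banned
```

Allow overlong forms and the property collapses: the same code point would then have several byte spellings that sort to different places.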

> > The alternate spelling
> > 11000001 10001011
> > ... is not the character K (U+004B) but INVALID SEQUENCE.  One
> > possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
> > CHARACTER on encountering illegal sequences.
> 
> Unless you are Bill Gates and have the power to decree that your users
> *will* use your preferred decoder, this may be a mistake.  Remember that
> the users of a decoder see no advantage from this behavior, since they are
> canonicalizing anyway.
> 

Um... not so.  Bill Gates decreed that everyone should use Windows,
and Windows has security holes up the wazoo, *because* its encodings
are ambiguous, although not because of UTF-8.

The user of the decoder is the user that gets bitten by these security
holes, so I would *definitely* say the user of the decoder would see a
benefit from this behaviour!!

Implicit aliases are very dangerous.

 -hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: utf-8 encoding scheme

2000-07-13 Thread Henry Spencer

On Thu, 13 Jul 2000, Jeu George wrote:
> > >   2-byte characters 110xxxxx 10xxxxxx
> > The bits are encoded bigendian (MSB first), i.e. the way you would
> > read the bits when written in the above form.
> 
> for a two byte long character,
> where will the MSB be:
> in the 4th bit of the first byte from the left, or in the 3rd bit of the
> second byte from the left?

As he said:  MSB first.  The 16-bit character 00000pqrstuvwxyz is encoded
as 110pqrst 10uvwxyz.
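The packing Henry describes can be written out directly (Python used for illustration; U+00E9 is an arbitrary example that needs the 2-byte form):

```python
# Pack a code point into the 2-byte form 110pqrst 10uvwxyz, MSB first.
cp = 0x00E9                    # LATIN SMALL LETTER E WITH ACUTE
first  = 0xC0 | (cp >> 6)      # 110 + top 5 bits (pqrst)
second = 0x80 | (cp & 0x3F)    # 10 + low 6 bits (uvwxyz)
encoded = bytes([first, second])
print(encoded == '\u00e9'.encode('utf-8'))  # True: 0xC3 0xA9
```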

> Will this be OS dependent, i.e. the arrangement of bits?

No, the encoding fully defines the bit arrangement.

> How is the null character going to be?? 00000000 ??
> but u have mentioned something else below

Properly, ASCII NUL (U+0000), the 16-bit character 0000000000000000,
should be encoded as just 00000000, since that is the shortest encoding
for it.  This is one case where there has been some violation of the rules
internally within some systems, although with luck it will remain an
internal oddity and won't become visible.
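The internal convention Henry alludes to (what Java calls "modified UTF-8") can be sketched as a pair of helpers. These are illustrative, not any standard API, and the sketch ignores modified UTF-8's separate treatment of supplementary characters:

```python
# "Modified UTF-8": NUL is written as the overlong pair C0 80 so that the
# encoded string itself never contains a 0x00 terminator byte.
def to_modified_utf8(s: str) -> bytes:
    return s.encode('utf-8').replace(b'\x00', b'\xc0\x80')

def from_modified_utf8(data: bytes) -> str:
    return data.replace(b'\xc0\x80', b'\x00').decode('utf-8')

enc = to_modified_utf8('a\x00b')
print(b'\x00' not in enc)       # True: safe for NUL-terminated C-style APIs
print(from_modified_utf8(enc) == 'a\x00b')  # True: round-trips internally
```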

> I thought that all the ASCII characters were retained in UTF-8.
> That is the major reason why 1-byte long characters will always have the
> MSB as 0. Am I right??

Correct.

> > One possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
> > CHARACTER on encountering illegal sequences.
> 
> What is this U+FFFD SUBSTITUTION CHARACTER about exactly? Could you elaborate on
> this also??

The character U+FFFD, whose Unicode 3.0 name is REPLACEMENT CHARACTER, is
"used to replace an incoming character whose value is unknown or
unrepresentable in Unicode".  That is, it marks the place where something
untranslatable used to be.

  Henry Spencer
   [EMAIL PROTECTED]




Re: utf-8 encoding scheme

2000-07-13 Thread Jeu George



On 12 Jul 2000, H. Peter Anvin wrote:

> Followup to:  <[EMAIL PROTECTED]>
> By author: Jeu George <[EMAIL PROTECTED]>
> In newsgroup: linux.utf8
> >
> > 
> > Hello,
> > 
> > The utf-8 encoding scheme goes like this
> >   for
> >   1-byte characters 0xxx 
> >   2-byte characters 110x 10xx
> >   3-byte characters 1110 10xx
> > 
> 
> 4-byte characters 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
> 5-byte characters 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 6-byte characters 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 
> > here the bits marked x are used up for the actual encoding of characters.
> > I would like to know the way these bits are used to code a particular
> > character; also, is this dependent on the operating system? Can you provide a
> > program which checks/finds this, or any link that provides information
> > about this
> 
> The bits are encoded bigendian (MSB first), i.e. the way you would
> read the bits when written in the above form.



> 
> It is also very important to realize that ONLY THE SHORTEST POSSIBLE
> SEQUENCE IS LEGAL.  This is incredibly important, since any misguided

*
Could you elaborate on this with some examples or something?
The example given below was not clear to me.
Thanks for the help anyway.
*


for a two byte long character,
where will the MSB be:
in the 4th bit of the first byte from the left, or in the 3rd bit of the
second byte from the left?

Will this be OS dependent, i.e. the arrangement of bits?

How is the null character going to be?? 00000000 ??
but you have mentioned something else below.
I thought that all the ASCII characters were retained in UTF-8.
That is the major reason why 1-byte long characters will always have the
MSB as 0. Am I right??



> attempt to "be liberal in what you accept" without addition of an
> explicit canonicalization step would lead to the kind of security
> holes that Microsoft web-related applications have been so full of,
> because MS operating systems have way too many ways to say the same
> thing.
> 
> Thus, the character K (U+004B) is encoded as:
> 
>   01001011
> 
> The alternate spelling
> 
>   11000001 10001011
> 
> ... is not the character K (U+004B) but INVALID SEQUENCE.  One
> possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
> CHARACTER on encountering illegal sequences.

What is this U+FFFD SUBSTITUTION CHARACTER about exactly? Could you elaborate on
this also??


Regards,
Jeu



> 
>   -hpa
> 
> -- 
> <[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
> "Unix gives you enough rope to shoot yourself in the foot."
> http://www.zytor.com/~hpa/puzzle.txt




Re: utf-8 encoding scheme

2000-07-12 Thread Bruno Haible

Henry Spencer writes:

> It's worth noting that some implementations which generally hew pretty
> close to this make one exception:  they represent ASCII NUL (U+0000) as
> 11000000 10000000, so that 00000000 can be used as a terminator within
> programs without worrying that it will collide with a user character. 

Yes, the Java virtual machine uses this encoding for strings in .class
files. It is internal to Java and not visible at user level.
The JDK documentation of "java.io.DataOutput.writeUTF" says:
"Writes a Unicode string by encoding it using modified UTF-8 format."
Therefore I don't think people will be misled to use this function for
text output.

Bruno



Re: utf-8 encoding scheme

2000-07-12 Thread Henry Spencer

On 12 Jul 2000, H. Peter Anvin wrote:
> >   1-byte characters 0xxxxxxx
> >   2-byte characters 110xxxxx 10xxxxxx
> 
> ...It is also very important to realize that ONLY THE SHORTEST POSSIBLE
> SEQUENCE IS LEGAL...

It's worth noting that some implementations which generally hew pretty
close to this make one exception:  they represent ASCII NUL (U+0000) as
11000000 10000000, so that 00000000 can be used as a terminator within
programs without worrying that it will collide with a user character. 
This *is* a violation of the rules, and one would hope that it will stay
a program-internal convention, but it may well sneak out into files.

> This is incredibly important, since any misguided
> attempt to "be liberal in what you accept" without addition of an
> explicit canonicalization step would lead to the kind of security
> holes that Microsoft web-related applications have been so full of...

There is somewhat of a contradiction here:  the usual place to "be liberal
in what you accept" is a decoder, which has an inherent canonicalization
step.  In practice, I think the only place where you make a meaningful
*choice* as to whether to be liberal or not is when canonicalization is
imminent anyway.  Someone who is comparing the raw UTF-8 sequences is
making a mistake, yes, but it's not a "be liberal in what you accept"
mistake, it's an error in choice of working representation. 

> The alternate spelling
>   11000001 10001011
> ... is not the character K (U+004B) but INVALID SEQUENCE.  One
> possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
> CHARACTER on encountering illegal sequences.

Unless you are Bill Gates and have the power to decree that your users
*will* use your preferred decoder, this may be a mistake.  Remember that
the users of a decoder see no advantage from this behavior, since they are
canonicalizing anyway.

  Henry Spencer
   [EMAIL PROTECTED]





Re: utf-8 encoding scheme

2000-07-12 Thread H. Peter Anvin

Followup to:  <[EMAIL PROTECTED]>
By author: Jeu George <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
>
> 
> Hello,
> 
>   The utf-8 encoding scheme goes like this
>   for
>   1-byte characters 0xxx 
>   2-byte characters 110x 10xx
>   3-byte characters 1110 10xx
> 

4-byte characters   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5-byte characters   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6-byte characters   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

> here the bits marked x are used up for the actual encoding of characters.
> I would like to know the way these bits are used to code a particular
> character; also, is this dependent on the operating system? Can you provide a
> program which checks/finds this, or any link that provides information
> about this

The bits are encoded bigendian (MSB first), i.e. the way you would
read the bits when written in the above form.

It is also very important to realize that ONLY THE SHORTEST POSSIBLE
SEQUENCE IS LEGAL.  This is incredibly important, since any misguided
attempt to "be liberal in what you accept" without addition of an
explicit canonicalization step would lead to the kind of security
holes that Microsoft web-related applications have been so full of,
because MS operating systems have way too many ways to say the same
thing.

Thus, the character K (U+004B) is encoded as:

01001011

The alternate spelling

11000001 10001011

... is not the character K (U+004B) but INVALID SEQUENCE.  One
possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
CHARACTER on encountering illegal sequences.
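hpa's example can be replayed against a strict decoder (Python here, purely as a convenient checker for the byte sequences above):

```python
# The shortest form decodes to K; the overlong 2-byte spelling is rejected.
print(b'\x4b'.decode('utf-8'))       # K (01001011)
try:
    b'\xc1\x8b'.decode('utf-8')      # 11000001 10001011, overlong for U+004B
    verdict = 'accepted'
except UnicodeDecodeError:
    verdict = 'INVALID SEQUENCE'
print(verdict)
```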

-hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: utf-8 encoding scheme

2000-06-26 Thread Bruno Haible

Jeu George writes:

> can you provide a program which checks/finds this, or any link that
> provides information about this

Sample code for UTF-8 conversion can be found at
   ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/CVTUTF.H
   ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/CVTUTF.C

But you don't really need to do the conversion yourself; you can use
the system's iconv facility: start with calling
iconv_open("UTF-8","ISO-8859-1"). See "man iconv" for more info.
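In Python the same round trip is two codec calls; this is shown as an alternative sketch, with Bruno's iconv route being the C-level equivalent:

```python
# Latin-1 <-> UTF-8, the same conversion iconv_open("UTF-8", "ISO-8859-1")
# performs at the C level.
def latin1_to_utf8(data: bytes) -> bytes:
    return data.decode('iso-8859-1').encode('utf-8')

def utf8_to_latin1(data: bytes) -> bytes:
    return data.decode('utf-8').encode('iso-8859-1')

print(latin1_to_utf8(b'caf\xe9'))     # b'caf\xc3\xa9'
print(utf8_to_latin1(b'caf\xc3\xa9')) # b'caf\xe9'
```

Note the asymmetry: every Latin-1 byte string converts to UTF-8, but only UTF-8 text whose code points fit in U+00FF converts back.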

Bruno



Re: utf-8 encoding scheme

2000-06-26 Thread Jeu George




I need a function that converts a character from a Latin-1 character set
to UTF-8, and another that converts from UTF-8 to Latin-1.
Thanks
Jeu George


On Mon, 26 Jun 2000, Markus Kuhn wrote:

> Jeu George wrote on 2000-06-26 11:17 UTC:
> > i would like to know the way these bits are used to code a particular
> > character; also, is this dependent on the operating system? Can you provide a
> > program which checks/finds this, or any link that provides information
> > about this
> 
> Try
> 
>   http://www.cl.cam.ac.uk/~mgk25/unicode.html
> 
> and get a copy of
> 
>   http://www.amazon.com/exec/obidos/ASIN/0201616335/mgk25
> 
> UTF-8 itself is independent of the operating system, however there are
> different line terminator conventions (just as with ASCII) and some
> operating systems (Microsoft) like to put a signature code in front of
> many UTF-8 files, while Unix systems generally don't like this.
> 
> Markus
> 
> -- 
> Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
> Email: mkuhn at acm.org,  WWW: 
> 

-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/lists/



Re: utf-8 encoding scheme

2000-06-26 Thread Markus Kuhn

Jeu George wrote on 2000-06-26 11:17 UTC:
> i would like to know the way these bits are used to code a particular
> character; also, is this dependent on the operating system? Can you provide a
> program which checks/finds this, or any link that provides information
> about this

Try

  http://www.cl.cam.ac.uk/~mgk25/unicode.html

and get a copy of

  http://www.amazon.com/exec/obidos/ASIN/0201616335/mgk25

UTF-8 itself is independent of the operating system; however, there are
different line-terminator conventions (just as with ASCII), and some
operating systems (Microsoft) like to put a signature code in front of
many UTF-8 files, while Unix systems generally don't like this.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: 




utf-8 encoding scheme

2000-06-26 Thread Jeu George


Hello,

The utf-8 encoding scheme goes like this
  for
  1-byte characters 0xxxxxxx
  2-byte characters 110xxxxx 10xxxxxx
  3-byte characters 1110xxxx 10xxxxxx 10xxxxxx

here the bits marked x are used up for the actual encoding of characters.
I would like to know the way these bits are used to code a particular
character; also, is this dependent on the operating system? Can you provide a
program which checks/finds this, or any link that provides information
about this?

Thanks
Jeu George

