Title: RE: UTF-8 vs. Non-UTF-8 Locales and File Names (WAS: Re: Roundtripping in Unicode)
Edward H. Trager wrote:
UTF-8's home directory). So both users could probably guess the filename
they were looking at.
Which, BTW, is true for most of Europe but is not true for some other
Arcane Jill writes:
The obvious solution is for all Unix machines everywhere to be using the same
locale - and it had better be UTF-8. But an instantaneous global switch-over is
never going to happen, so we see this gradual switch-over ... and it is during
this transition phase that
-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Philippe Verdy
Sent: 14 December 2004 22:47
To: Marcin 'Qrczak' Kowalczyk
Cc: [EMAIL PROTECTED]
Subject: Re: Roundtripping in Unicode
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Arcane Jill [EMAIL PROTECTED] writes:
If so,
Arcane Jill [EMAIL PROTECTED] writes:
OBSERVATION - Requirement (4) is not met absolutely, however,
the probability of the UTF-8 encoding of this sequence occurring
accidentally at an arbitrary offset in an arbitrary octet stream
is approximately one in 2^384;
Assuming that the distribution of
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk replied:
Arcane Jill [EMAIL PROTECTED] writes:
If so, Marcin, what exactly is the error, and whose fault is it?
It's an error to use locales with different encodings on the same
system.
U, and whose fault is it?
Title: RE: Roundtripping in Unicode
Oops, correction:
In response to Marcin 'Qrczak' Kowalczyk
Question: should a new programming language which uses Unicode for
string representation allow non-characters in strings? Argument for
allowing them: otherwise they are completely useless
Title: RE: Roundtripping in Unicode
D. Starner wrote:
The only solution is (a) to use ASCII or (b) to make the switch over as quick
and clean as possible. Anyone who wants to create new files in UTF-8 and leave
their old files in the old encoding is asking for trouble. There's no
Title: RE: Roundtripping in Unicode
Kenneth Whistler wrote:
Lars said:
According to UTC, you need to keep processing the UNIX filenames as BINARY
data. And, also according to UTC, any UTF-8 function is allowed to reject
invalid sequences. Basically, you are not supposed to use
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
If one application switches from standard UTF-8 to your modification,
and another application continues to use standard UTF-8, then the
ability to pass arbitrary Unicode strings between them by serializing
them to UTF-8
Yes, but only if you can have some reasonable assurance that the byte
sequence emitted by UTF(c,x) (where c is the single reserved codepoint you
suggest, and x is U+00xx, the value to be escaped expressed as a character)
will not occur in plain text. This is theoretically checkable - the total
Title: RE: Roundtripping Solved
Arcane Jill wrote:
solution, again without breaking the Unicode model. If I have
It is for reasons of requirement (4) that Lars proposes the introduction of
128 BMP codepoints. His intention is that they be marked as reserved - do
not use, so
Nope. No data corruption. You just get the odd bytes back. And achieve
I see more of what you are trying to do; let me try to be more clear.
Suppose that the conversion is defined in the following way, between Unicode
strings (D29a-d, page 74) and UTFs using your proposed new characters, for
now
I followed (and understood) Lars' explanation as to why the NOT-
solution wouldn't work for him. Shame really - but here's another bash at a
solution, again without breaking the Unicode model. If I have understood
this correctly, these are Lars' requirements:
1) There exists a function,
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
But it's not possible in the direction NOT-UTF-16 - NOT-UTF-8 -
NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
awkward way which would happen to exclude those subsequences of
non-characters which
On 15/12/2004 00:22, Mike Ayers wrote:
From: Peter Kirk [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 3:37 PM
Thanks for the clarification. Perhaps the bifurcation could
be better expressed as into strings of characters as defined
by the locale and strings of non-null octets.
Arcane Jill [EMAIL PROTECTED] writes:
Unix makes it possible for /you/ to change /your/ locale - but by
your reasoning, this is an error, unless all other users do so
simultaneously.
Not necessarily: you can change the locale as long as it uses the same
default encoding.
By error I mean a
Title: RE: Roundtripping in Unicode
Philippe Verdy wrote:
I have not found a solution to this problem, and I don't know if such a
solution even exists; if such a solution exists, it should be quite complex...).
I think it should be possible to mathematically prove that it doesn't exist.
Title: RE: Roundtripping in Unicode
Arcane Jill wrote:
The obvious solution is for all Unix machines everywhere to be using the
same locale - and it had better be UTF-8. But an instantaneous global
switch-over is never going to happen, so we see this gradual switch-over ...
and it
Lars Kristan [EMAIL PROTECTED] writes:
OK, strcpy does not need to interpret UTF-8. But strchr probably should.
No. Its argument is a byte, even though it's passed as type int.
By byte here I mean C char value, which is an octet in virtually
all modern C implementations; the C standard doesn't
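The point that strchr compares bytes rather than characters can be illustrated with a small sketch. This is Python rather than C, and `find_byte` is a hypothetical helper standing in for strchr, not a real library function. It shows that looking for an ASCII octet in UTF-8 data is safe without decoding, while looking for a single octet at or above 0x80 only finds a fragment of a character.

```python
def find_byte(haystack: bytes, byte_value: int) -> int:
    """Return the index of the first octet equal to byte_value, or -1.

    Hypothetical stand-in for C strchr: it compares octets, not characters.
    """
    return haystack.find(bytes([byte_value]))


# A path with non-ASCII letters, encoded as UTF-8 (escapes keep the
# source file itself pure ASCII).
path = "za\u017c\u00f3\u0142\u0107/g\u0119\u015bl\u0105.txt".encode("utf-8")

# Octets below 0x80 never occur inside a multi-byte UTF-8 sequence, so a
# byte-wise search for "/" cannot match in the middle of a character.
slash_at = find_byte(path, ord("/"))

# 0xC5 is only the lead byte of U+017C (encoded as C5 BC); a byte-wise
# search finds that lead byte, not a whole character.
lead_at = find_byte(path, 0xC5)

directory = path[:slash_at].decode("utf-8")
```

Splitting at the position of an ASCII octet leaves both halves valid UTF-8, which is why byte-oriented path handling keeps working on UTF-8 filenames.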
On 15/12/2004 11:11, Arcane Jill wrote:
I followed (and understood) Lars' explanation as to why the NOT-
solution wouldn't work for him. Shame really - but here's another bash
at a solution, again without breaking the Unicode model. If I have
understood this correctly, these are Lars'
Marcin 'Qrczak' Kowalczyk qrczak at knm dot org dot pl wrote:
OBSERVATION - Requirement (4) is not met absolutely, however,
the probability of the UTF-8 encoding of this sequence occurring
accidentally at an arbitrary offset in an arbitrary octet stream
is approximately one in 2^384;
Assuming
Title: RE: Roundtripping in Unicode
From: Peter Kirk [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 15, 2004 3:52 AM
But surely octets 0x80 to 0x9f are (at least mostly) invalid
in ISO 8859?
They are in fact valid. However, because they are control characters, they are not
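As a quick check of the claim about 0x80 to 0x9f, here is a minimal sketch using Python's `latin-1` codec, which maps every octet to the code point of the same value. It only shows how that one codec behaves; it is not a statement about every ISO 8859 part or every implementation.

```python
import unicodedata

# The 32 octets in question.
c1_octets = bytes(range(0x80, 0xA0))

# Decoding succeeds: the codec accepts every octet value.
decoded = c1_octets.decode("latin-1")

# Each resulting code point is U+0080..U+009F, general category Cc (control).
categories = {unicodedata.category(ch) for ch in decoded}
code_points = [ord(ch) for ch in decoded]
```

So a decoder of this kind never rejects these octets; whether they are meaningful in a filename or a text file is a separate question.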
Arcane Jill arcanejill at ramonsky dot com wrote:
DEFINITION - f is a function which maps an arbitrary octet stream to
a sequence of Unicode characters, such that (1) any substring which
happens to be valid UTF-8 is mapped to the sequence of Unicode
characters which would have been produced
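For comparison, a function with roughly the shape described here exists in Python as the `surrogateescape` error handler. It is neither Jill's nor Lars' proposal: valid UTF-8 decodes normally, and each invalid octet 0xNN becomes the lone surrogate U+DCNN, so the result is not a well-formed Unicode string in the strict sense. The sketch below only shows the round trip; the names `f` and `f_inverse` are chosen to echo the definition above and are otherwise hypothetical.

```python
def f(octets: bytes) -> str:
    """Map an arbitrary octet stream to a Python string.

    Valid UTF-8 decodes as usual; each invalid octet is escaped as a
    lone surrogate in the range U+DC80..U+DCFF.
    """
    return octets.decode("utf-8", errors="surrogateescape")


def f_inverse(text: str) -> bytes:
    """Recover the original octets from the output of f."""
    return text.encode("utf-8", errors="surrogateescape")


# Valid UTF-8 for "caf\u00e9-" followed by a stray 0xE9 octet, such as a
# filename fragment left over from a Latin-1 locale.
raw = b"caf\xc3\xa9-\xe9.txt"

text = f(raw)
back = f_inverse(text)
```

A strict UTF-8 encoder rejects the escaped string, which matches the concern raised elsewhere in the thread that standard UTF-8 functions are allowed to refuse such data.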
On 15/12/2004 14:36, Arcane Jill wrote:
Yes, but only if you can have some reasonable assurance that the byte
sequence emitted by UTF(c,x) (where c is the single reserved codepoint
you suggest, and x is U+00xx, the value to be escaped expressed as a
character) will not occur in plain text. This
Peter Kirk [EMAIL PROTECTED] writes:
Jill, again your solution is ingenious. But would it not work just
as well for Lars' purposes to use, instead of your string of
random characters, just ONE reserved code point followed by U+0xx?
Instead of asking the UTC to allocate a specific code
25 matches