RE: UTF-8 vs. Non-UTF-8 Locales and File Names (WAS: Re: Roundtripping in Unicode)

2004-12-15 Thread Lars Kristan
Edward H. Trager wrote: UTF-8's home directory). So both users could probably guess the filename they were looking at. Which, BTW, is true for most of Europe but is not true for some other

RE: Roundtripping in Unicode

2004-12-15 Thread D. Starner
Arcane Jill writes: The obvious solution is for all Unix machines everywhere to be using the same locale - and it had better be UTF-8. But an instantaneous global switch-over is never going to happen, so we see this gradual switch-over ... and it is during this transition phase that

RE: Roundtripping in Unicode

2004-12-15 Thread Arcane Jill
-Original Message- From: [EMAIL PROTECTED] On Behalf Of Philippe Verdy Sent: 14 December 2004 22:47 To: Marcin 'Qrczak' Kowalczyk Cc: [EMAIL PROTECTED] Subject: Re: Roundtripping in Unicode From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Arcane Jill [EMAIL PROTECTED] writes: If so,

Re: Roundtripping Solved

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes: OBSERVATION - Requirement (4) is not met absolutely, however, the probability of the UTF-8 encoding of this sequence occurring accidentally at an arbitrary offset in an arbitrary octet stream is approximately one in 2^384; Assuming that the distribution of

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk replied: Arcane Jill [EMAIL PROTECTED] writes: If so, Marcin, what exactly is the error, and whose fault is it? It's an error to use locales with different encodings on the same system. U, and whose fault is it?

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Oops, correction: In response to Marcin 'Qrczak' Kowalczyk Question: should a new programming language which uses Unicode for string representation allow non-characters in strings? Argument for allowing them: otherwise they are completely useless

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
D. Starner wrote: The only solution is (a) to use ASCII or (b) to make the switch over as quick and clean as possible. Anyone who wants to create new files in UTF-8 and leave their old files in the old encoding is asking for trouble. There's no

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Kenneth Whistler wrote: Lars said: According to UTC, you need to keep processing the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8 function is allowed to reject invalid sequences. Basically, you are not supposed to use

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: If one application switches from standard UTF-8 to your modification, and another application continues to use standard UTF-8, then the ability to pass arbitrary Unicode strings between them by serializing them to UTF-8

Re: Roundtripping Solved

2004-12-15 Thread Arcane Jill
Yes, but only if you can have some reasonable assurance that the byte sequence emitted by UTF(c,x) (where c is the single reserved codepoint you suggest, and x is U+00xx, the value to be escaped expressed as a character) will not occur in plain text. This is theoretically checkable - the total

RE: Roundtripping Solved

2004-12-15 Thread Lars Kristan
Arcane Jill wrote: solution, again without breaking the Unicode model. If I have It is for reasons of requirement (4) that Lars proposes the introduction of 128 BMP codepoints. His intention is that they be marked as reserved - do not use, so

Re: Roundtripping in Unicode

2004-12-15 Thread Mark Davis
Nope. No data corruption. You just get the odd bytes back. And achieve I see more of what you are trying to do; let me try to be more clear. Suppose that the conversion is defined in the following way, between Unicode strings (D29a-d, page 74) and UTFs using your proposed new characters, for now

Roundtripping Solved

2004-12-15 Thread Arcane Jill
I followed (and understood) Lars' explanation as to why the NOT- solution wouldn't work for him. Shame really - but here's another bash at a solution, again without breaking the Unicode model. If I have understood this correctly, these are Lars' requirements: 1) There exists a function,

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 -> NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an awkward way which would happen to exclude those subsequences of non-characters which

Re: Roundtripping in Unicode

2004-12-15 Thread Peter Kirk
On 15/12/2004 00:22, Mike Ayers wrote: From: Peter Kirk [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 14, 2004 3:37 PM Thanks for the clarification. Perhaps the bifurcation could be better expressed as into strings of characters as defined by the locale and strings of non-null octets.

Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes: Unix makes it possible for /you/ to change /your/ locale - but by your reasoning, this is an error, unless all other users do so simultaneously. Not necessarily: you can change the locale as long as it uses the same default encoding. By error I mean a

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Philippe Verdy wrote: I have not found a solution to this problem, and I don't know if such solution even exists; if such solution exists, it should be quite complex...). I think it should be possible to mathematically prove that it doesn't exist.

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Arcane Jill wrote: The obvious solution is for all Unix machines everywhere to be using the same locale - and it had better be UTF-8. But an instantaneous global switch-over is never going to happen, so we see this gradual switch-over ... and it

Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: OK, strcpy does not need to interpret UTF-8. But strchr probably should. No. Its argument is a byte, even though it's passed as type int. By byte here I mean C char value, which is an octet in virtually all modern C implementations; the C standard doesn't

Re: Roundtripping Solved

2004-12-15 Thread Peter Kirk
On 15/12/2004 11:11, Arcane Jill wrote: I followed (and understood) Lars' explanation as to why the NOT- solution wouldn't work for him. Shame really - but here's another bash at a solution, again without breaking the Unicode model. If I have understood this correctly, these are Lars'

Re: Roundtripping Solved

2004-12-15 Thread Doug Ewell
Marcin 'Qrczak' Kowalczyk qrczak at knm dot org dot pl wrote: OBSERVATION - Requirement (4) is not met absolutely, however, the probability of the UTF-8 encoding of this sequence occurring accidentally at an arbitrary offset in an arbitrary octet stream is approximately one in 2^384; Assuming

RE: Roundtripping in Unicode

2004-12-15 Thread Mike Ayers
From: Peter Kirk [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 15, 2004 3:52 AM But surely octets 0x80 to 0x9f are (at least mostly) invalid in ISO 8859? They are in fact valid. However, because they are control characters, they are not

Re: Roundtripping Solved

2004-12-15 Thread Doug Ewell
Arcane Jill arcanejill at ramonsky dot com wrote: DEFINITION - f is a function which maps an arbitrary octet stream to a sequence of Unicode characters, such that (1) any substring which happens to be valid UTF-8 is mapped to the sequence of Unicode characters which would have been produced

Re: Roundtripping Solved

2004-12-15 Thread Peter Kirk
On 15/12/2004 14:36, Arcane Jill wrote: Yes, but only if you can have some reasonable assurance that the byte sequence emitted by UTF(c,x) (where c is the single reserved codepoint you suggest, and x is U+00xx, the value to be escaped expressed as a character) will not occur in plain text. This

Re: Roundtripping Solved

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Peter Kirk [EMAIL PROTECTED] writes: Jill, again your solution is ingenious. But would it not work just as well for Lars' purposes to use, instead of your string of random characters, just ONE reserved code point followed by U+00xx? Instead of asking the UTC to allocate a specific code