Re: Roundtripping Solved

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes:

 OBSERVATION - Requirement (4) is not met absolutely, however,
 the probability of the UTF-8 encoding of this sequence occurring
 accidentally at an arbitrary offset in an arbitrary octet stream
 is approximately one in 2^384;

Assuming that the distribution of sequences of characters is uniform.
But it's not! As soon as you start using this encoding somewhere,
the probability of this sequence appearing rises dramatically.
If you convert UTF-8 -> UTF-32 using modified rules, and UTF-32 -> UTF-8
using standard rules, then you get this sequence without waiting
2^340 years.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes:

 Unix makes it possible for /you/ to change /your/ locale - but by
 your reasoning, this is an error, unless all other users do so
 simultaneously.

Not necessarily: you can change the locale as long as it uses the same
default encoding.

By "error" I mean a bad idea. The system does not prevent you from
changing the locale to a different encoding. But then you are on your
own and various things can break: terminal output will be mangled, you
can't enter characters used in a different encoding from the keyboard,
text files will be illegible, and Unicode programs which process texts
may reject your data or even filenames. If you still need to change
encodings, it's safer to use ASCII-only filenames.

This situation is temporary. Well, it may last 10 more years or so,
but it will probably gradually improve:

First, more protocols and file formats are becoming aware of character
encodings and either label them explicitly or use a known encoding
(generally some Unicode encoding scheme). This is especially true of
protocols for data interchange over the Internet: WWW, email, usenet,
modern instant messaging protocols like Jabber. Some old protocols
remain encoding-ignorant, e.g. irc and finger. GNOME 1 used the locale
encoding, GNOME 2 uses UTF-8. Copying & pasting text in X window now
has a separate API which uses UTF-8. While the irc protocol doesn't
specify the encoding, the irssi client can now recode texts itself
to conform to customs of particular channels.

Second, UTF-8 is becoming more usable as the default encoding
specified by the locale. I don't use it now because too many things
still break, but it's improving: there are things which didn't work
just a few years ago and work now. Terminal emulators in X widely
support UTF-8 mode now. The curses library now has a working wide
character API. Emacs and vi work in UTF-8 (Emacs still has problems).
Readline now works in UTF-8. Localized messages (gettext) are now
recoded automatically.

Other programs still don't work. Bash works, while zsh and ksh don't.
Most full-screen text programs use the narrow character curses API and
don't work in UTF-8. The brokenness of interactive interpreters of
various languages varies.

BTW, in the wide character curses API (the only way curses can work
in a UTF-8 terminal), characters are expressed as sequences of wchar_t
(base char + some combining chars, possibly double width). Which means
that you must somehow translate filenames to this representation
in order to display them - same as with a Unicode-based GUI. It's
meaningless to render arbitrary bytes on the terminal, and you can't
force curses to emit the original byte sequences which form filenames
(which would be a bad idea for control characters anyway). By
legitimizing non-UTF-8 filenames in a UTF-8 system you increase the
problems such applications must overcome: not only do they have to
show control characters somehow, but also invalid UTF-8.

 But it goes beyond that. Copy a file onto a floppy disc and then
 physically take that floppy disc to a different Unix machine and log
 on as guest and insert the disc ... Will the filename look the same?

Depends on the filesystem and the way it is mounted.

For example if it's FAT with long filenames (which I think is the
usual format for floppies even on Unix), filenames can be recoded by
the kernel: you specify the encoding to present filenames in and the
encoding of short names. I don't know what happens with filenames
which are not expressible in the selected encoding.

In this way filenames may automatically convert between systems which
use different default encodings, preserving the character semantics
rather than the byte representation. Of course file contents will not
be converted.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 OK, strcpy does not need to interpret UTF-8. But strchr probably should.

No. Its argument is a byte, even though it's passed as type int.
By byte here I mean C char value, which is an octet in virtually
all modern C implementations; the C standard doesn't guarantee this
but POSIX does.

Many C functions are not suitable for processing UTF-8, or are
suitable only as long as we consider all non-ASCII characters opaque
bags of bytes. For example isalpha takes a byte, toupper transforms
a byte to a byte, and strncpy copies up to n bytes even if that cuts
a UTF-8 character in the middle.
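
To illustrate (a minimal sketch; the particular character is just an
example):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "za\xC5\xBC" is UTF-8 for "zaż": 'z', 'a', then U+017C as 0xC5 0xBC. */
        const char *src = "za\xC5\xBC";
        char dst[4];

        /* Copy at most 3 bytes: the lead byte 0xC5 is copied, but its
           continuation byte 0xBC is cut off, leaving invalid UTF-8. */
        strncpy(dst, src, 3);
        dst[3] = '\0';

        printf("last copied byte: 0x%02X\n", (unsigned char)dst[2]);
        return 0;
    }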

There are wide character versions like iswalpha and towupper. But then
data must be converted from a sequence of char to a sequence of wchar_t.
Standard and semi-standard functions which do this conversion for UTF-8
reject invalid UTF-8 (they all have a means of reporting errors).
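
For instance, the standard mbrtowc reports the error like this (a
sketch; it assumes a UTF-8 locale is installed and named as below):

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Assumption: this locale name exists on the system. */
        setlocale(LC_CTYPE, "en_US.UTF-8");

        const char *bad = "\xC5 ";   /* lead byte not followed by a continuation */
        wchar_t wc;
        mbstate_t st = {0};

        if (mbrtowc(&wc, bad, 2, &st) == (size_t)-1)
            perror("mbrtowc rejected invalid UTF-8");   /* errno == EILSEQ */
        return 0;
    }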

The assumption that wchar_t has something to do with Unicode is not
as common as the one about char and bytes. I don't know whether
FreeBSD has finally changed their wchar_t to Unicode. And it can be
UTF-32 (Unix) or UTF-16 (Windows).

 But then all languages are supposed to provide functions for
 processing opaque strings in addition to their Unicode functions.

Yes, IMHO all general-purpose languages should support processing
arrays of bytes, in addition to Unicode strings.

It's not clear, however, what the API for filenames should look like,
especially if it is to be portable to Windows.

 But sooner or later you need to incorporate the filename in some
 UTF-8 text. An error report, for example.

While it's not clear what a well-behaved application should do by
default, in order to be 100% robust and preserve all information
you must change the usual conventions anyway. Remember that any byte
except \0 and / is valid in a filename, so you must either escape
some characters, or delimit the filename with \0, or prefix it with
the length, or something like this. Backup software should do this
and not pay attention to the locale. But for end-user software like
an image viewer, processing arbitrary filenames is less important.
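
A sketch of the length-prefixing variant (the record format here is
made up, just to show that no byte value needs special-casing):

    #include <stdio.h>
    #include <string.h>

    /* Write one filename record: a 4-byte big-endian length, then the raw
       bytes.  Every byte value survives, including bytes invalid in the
       current locale; only the length field delimits the name. */
    static int write_name_record(FILE *out, const char *name)
    {
        size_t n = strlen(name);   /* '\0' cannot occur inside a filename anyway */
        unsigned char len[4] = {
            (unsigned char)(n >> 24), (unsigned char)(n >> 16),
            (unsigned char)(n >> 8),  (unsigned char)(n)
        };
        if (fwrite(len, 1, 4, out) != 4) return -1;
        if (fwrite(name, 1, n, out) != n) return -1;
        return 0;
    }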

 What are stdin, stdout and argv (command line parameters) when a
 process is running in a UTF-8 locale?

Technically they are binary (command line arguments must not contain
zero bytes). Users expect stdin and stdout to be treated as
text or binary depending on the program, while command line arguments
are generally interpreted as text or filenames.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: Roundtripping Solved

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Peter Kirk [EMAIL PROTECTED] writes:

 Jill, again your solution is ingenious. But would it not work just
 as well for Lars' purposes to use, instead of your string of
 random characters, just ONE reserved code point followed by U+0xx?
 Instead of asking the UTC to allocate a specific code point for this
 (which it probably will not do), he can use either U+FFFE or U+FFFF,
 which are intended for process internal uses, but are not permitted
 for interchange. Let's call the one non-character chosen INVALID.

Perhaps what is needed is a shift of viewpoint, not a big technical
change.

Don't call it a UTF. Call it escaping. Don't reserve 128 code points.
Use an existing but rare code point to prefix a byte escaped among
code points, and escape the escape if it's found in the original.
Perhaps the character could be ESC (27) or SUB (26), followed by
U+00nn.
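
A rough sketch of such an escaping convention (the choice of SUB,
U+001A, as the escape is only an example; bytes which fail to decode
as UTF-8 are always 0x80..0xFF, so they cannot collide with the
doubled escape):

    #include <stddef.h>
    #include <stdint.h>

    #define ESCAPE_CP 0x1Au   /* example choice: SUB; ESC (0x1B) would do as well */

    /* Append a successfully decoded code point, doubling the escape so the
       result stays unambiguous.  Returns the new length of the buffer. */
    static size_t put_code_point(uint32_t *buf, size_t len, uint32_t cp)
    {
        buf[len++] = cp;
        if (cp == ESCAPE_CP)
            buf[len++] = ESCAPE_CP;   /* escape the escape */
        return len;
    }

    /* Append a byte that could not be decoded: ESCAPE followed by U+00nn.
       Undecodable UTF-8 bytes are always in 0x80..0xFF. */
    static size_t put_raw_byte(uint32_t *buf, size_t len, unsigned char b)
    {
        buf[len++] = ESCAPE_CP;
        buf[len++] = b;
        return len;
    }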

Well, a viewpoint shift doesn't solve all problems: it's still
dangerous for interoperability. If the programmer doesn't do anything
special when writing filenames to a file, then instead of an error
which indicates that the goal doesn't have a natural solution, he gets
an escaped string which will not be understood by other applications
which don't use this convention. If the filename is passed to a part
of the program which doesn't use this convention, then it will break
too. If something cannot be done reliably, it's better to signal the
problem immediately than to hide it and misbehave later.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes:

 OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 ->
 NOT-UTF-16 -> NOT-UTF-8

But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
awkward way which would happen to exclude those subsequences of
non-characters which would form a valid UTF-8 fragment.

Unicode has the following property. Consider sequences of valid
Unicode characters: from the range U+0000..U+10FFFF, excluding
non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10 and
U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded
in any UTF-n, and nothing else is expected from UTF-n.
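
Spelled out as a predicate (a sketch following the description above):

    #include <stdbool.h>
    #include <stdint.h>

    /* True if cp is a code point that every UTF-n must be able to encode:
       in range, not a surrogate, not a non-character. */
    static bool utf_encodable(uint32_t cp)
    {
        if (cp > 0x10FFFF)
            return false;                 /* beyond the UTF-16 limit */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;                 /* surrogates */
        if (cp >= 0xFDD0 && cp <= 0xFDEF)
            return false;                 /* the non-character block */
        if ((cp & 0xFFFE) == 0xFFFE)
            return false;                 /* U+nFFFE and U+nFFFF in every plane */
        return true;
    }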

With the exception of the set of non-characters being irregular and
IMHO too large (why exclude U+FDD0..U+FDEF?!), and a weird top
limit caused by UTF-16, this gives a precise and unambiguous set of
values for which encoders and decoders are supposed to work. Well,
except for the non-obvious treatment of a BOM (at which level should it be
stripped? does this include UTF-8?).

A variant of UTF-8 which includes all byte sequences yields a much
less regular set of abstract string values. Especially if we consider
that 11101111 10111111 10111110 binary is not valid UTF-8, as much as
0xFFFE is not valid UTF-16 (it's a reversed BOM; it must be invalid in
order for a BOM to fulfill its role).

Question: should a new programming language which uses Unicode for
string representation allow non-characters in strings? Argument for
allowing them: otherwise they are completely useless, except
U+FFFE for BOM detection. Argument for disallowing them: they make
UTF-n inappropriate for serialization of arbitrary strings, and thus
non-standard extensions of UTF-n must be used for serialization.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 Hm, here lies the catch. According to UTC, you need to keep
 processing the UNIX filenames as BINARY data. And, also according
 to UTC, any UTF-8 function is allowed to reject invalid sequences.
 Basically, you are not supposed to use strcpy to process filenames.

No: strcpy passes raw bytes; it does not interpret them according to
the locale. It's not a UTF-8 function.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes:

 If so, Marcin, what exactly is the error, and whose fault is it?

It's an error to use locales with different encodings on the same
system.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Unicode filenames and other external strings on Unix - existing practice

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
I describe here languages which exclusively use Unicode strings.
Some languages have both byte strings and Unicode strings (e.g. Python);
there, byte strings are generally used for strings exchanged with
the OS, and the programmer is responsible for the conversion if he
wishes to use Unicode.

I consider situations where the encoding is implicit. For I/O of file
contents it's always possible to set the encoding explicitly somehow.

Corrections are welcome. This is mostly based on experimentation.


Java (Sun)
--

Strings are UTF-16.

Filenames are assumed to be in the locale encoding.

a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.

b) Creating. Characters which cannot be converted are replaced by ?.

Command line arguments and standard I/O are treated in the same way.


Java (GNU)
--

Strings are UTF-16.

Filenames are assumed to be in Java-modified UTF-8.

a) Interpreting. If a filename cannot be converted, a directory listing
   contains a null instead of a string object.

b) Creating. All Java characters are representable in Java-modified UTF-8.
   Obviously not all potential filenames can be represented.

Command line arguments are interpreted according to the locale.
Bytes which cannot be converted are skipped.

Standard I/O works in ISO-8859-1 by default. Obviously all input is
accepted. On output characters above U+00FF are replaced by ?.


C# (mono)
-

Strings are UTF-16.

Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS
environment variable, with UTF-8 implicitly added at the end. These
encodings are tried in order.

a) Interpreting. If a filename cannot be converted, it's skipped in
   a directory listing.

   The documentation says that if a filename, a command line argument
   etc. looks like valid UTF-8, it is treated as such first, and
   MONO_EXTERNAL_ENCODINGS is consulted only in remaining cases.
   Reality does not seem to match this (mono-1.0.5).

b) Creating. If UTF-8 is used, non-characters are converted to
   pseudo-UTF-8, U+0000 throws an exception (System.ArgumentException:
   Path contains invalid chars), paired surrogates are treated
   correctly, and an isolated surrogate causes an internal error:
** ERROR **: file strenc.c: line 161 (mono_unicode_to_external): assertion 
failed: (utf8!=NULL)
aborting...

Command line arguments are treated in the same way, except that if an
argument cannot be converted, the program dies at start:
[Invalid UTF-8]
Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea).
Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try again.

Console.WriteLine emits UTF-8. Paired surrogates are treated
correctly, non-characters and unpaired surrogates are converted to
pseudo-UTF-8.

Console.ReadLine interprets text as UTF-8. Bytes which cannot be
converted are skipped.


Perl


Depending on the convention used by a particular function and on
imported packages, a Perl string is treated either as Perl-modified
Unicode (with character values up to 32 bits or 64 bits depending on
the architecture) or as an unspecified locale encoding. It has two
internal representations: ISO-8859-1 and Perl-modified UTF-8 (with
an extended range).

If every Perl string is assumed to be a Unicode string, then filenames
are effectively ISO-8859-1.

a) Interpreting. Characters up to 0xFF are used.

b) Creating. If the filename has no characters above 0xFF, it is
   converted to ISO-8859-1. Otherwise it is converted to Perl-modified
   UTF-8 (all characters, not just those above 0xFF).

Command line arguments and standard I/O are treated in the same way,
i.e. ISO-8859-1 on input and a mixture of ISO-8859-1 and UTF-8 on
output, depending on the contents.

This behavior is modifiable by importing various packages and using
interpreter invocation flags. When Perl is told that command line
arguments are UTF-8, the behavior for strings which cannot be
converted is inconsistent: sometimes such a string is treated as ISO-8859-1,
sometimes an error is signalled.


Haskell
---

Haskell nominally uses Unicode. There is no conversion framework
standardized or implemented yet, though. Implementations which support
more than 256 characters currently assume ISO-8859-1 for filenames,
command line arguments and all I/O, taking the lowest 8 bits of a
character code on output.


Common Lisp: Clisp
--

The Common Lisp standard doesn't say anything about string encoding.
In Clisp strings are UTF-32 (internally optimized as UCS-2 and
ISO-8859-1 when possible). Any character code up to U+10FFFF is
allowed, including non-characters and isolated surrogates.

Filenames are assumed to be in the locale encoding.

a) Interpreting. If a byte cannot be converted, an exception is thrown.

b) Creating. If a character cannot be converted, an exception is thrown.


Kogut (my language; this is the current state - can be changed)
-

Strings are UTF-32 (internally optimized as ISO-8859-1 when possible).
Currently any 

Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 But, as I once already said, you can do it with UTF-8, you simply
 keep the invalid sequences as they are, and really handle them
 differently only when you actually process them or display them.

UTF-8 is painful to process in the first place. You are making it
even harder by demanding that all functions which process UTF-8 do
something sensible for bytes which don't form valid UTF-8. They can't
even temporarily convert it to UTF-32 for internal processing, for
convenience.

 Listing files in a directory should not signal anything. It MUST
 return all files and it should also return them in a way that this
 list can be used to access each of the files.

Which implies that they can't be interpreted as UTF-8.

By masking an error you are not encouraging users to fix it.
Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.

 Let's start with UTF-8 usernames. This is a likely scenario, since I
 think UTF-8 will typically be used in network communication. If you
 store the usernames in UTF-16, the conversion will signal an error
 and you will not have any users with invalid UTF-8 sequences nor
 will any invalid sequence be able to match any user. If you later on
 start comparing users somewhere else, in UTF-8, then you must not
 only strcmp them, but also validate each string. This is just a fact
 and I am not complaining about it.

If usernames are supposed to be UTF-8, and in fact they are not,
then it's normal that some software will signal an error instead
of processing them. The proper way is to fix the username database,
not to change programs.

 The interesting thing is that if you do start using my conversion,
 you can actually get rid of the need to validate UTF-8 strings
 in the first scenario. That of course means you will allow users
 with invalid UTF-8 sequences, but if one determines that this is
 acceptable (or even desired), then it makes things easier. But the
 choice is yours.

For me it's not acceptable, so I will not support declaring it valid.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 And once we understand that things are manageable and not as
 frightening as it seems at first, then we can stop using this as an
 argument against introducing 128 codepoints. People who will find
 them useful should and will bother with the consequences. Others
 don't need to and can roundtrip them as today.

A person who is against them can't ignore a motion to introduce them,
because if they are introduced, other people / programs will start
feeding our programs arbitrary byte sequences labeled as UTF-8
expecting them to accept the data.

 So, interpreting the 128 codepoints as 'recreate the original byte
 sequence' is an option.

Which guarantees that different programs will have different views of
the validity and meaning of the same data labeled with the same encoding.
Long live standardization.

 Even I will do the same where I just want to represent Unicode in
 UTF-8. I will only use this conversion in certain places.

So it's not just different programs, but even the same program in
different places. Great...

 The fact that my conversion actually produces UTF-8 from most of
 Unicode points does not mean it produced UTF-8.

Increasing the number of encodings means more opportunities for
mislabeling and for using the wrong libraries to process data (since it
works in most cases and thus the error is not detected immediately),
and a harder life for programs which aim at supporting all data.

Think further than the immediate moment when many people are
performing a transition from something to UTF-8. Look what happened
with the interpretation of HTML in web browsers.

If the standard had stood firmly from the beginning at disallowing
guessing what malformed HTML was supposed to mean, then people
would have learned how to produce correct HTML and the interpretation
would be unambiguous. But browsers tried to accept arbitrary contents
and interpret the parts of HTML they found there, guessing how errors
should be resolved, being friendly to careless webmasters. The effect
is that too often webmasters submitted a webpage after checking that
it worked in their browser, when in fact it had basic syntax errors.
Other browsers interpreted the errors differently, and the page was
inaccessible or looked bad.

When designing XML, they learned from this mistake:
http://www.xml.com/axml/target.html#dt-fatal
http://www.xml.com/axml/notes/Draconian.html

That's why people here reject balkanization of UTF-8 by introducing
variations with subtle differences, like Java-modified UTF-8.

 Inaccessible filenames are something we shouldn't accept. All your
 discussion of non-empty empty directories is just approaching the problem
 from the wrong end. One should fix the root cause, not consequences.

The root cause is that users and programs use different encodings in
different places, and thus Unix filenames can't be unambiguously and
context-freely interpreted as character sequences.

Unfortunately it's hard to fix.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Nicest UTF

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 My my, you are assuming all files are in the same encoding.

Yes. Otherwise nothing shows filenames correctly to the user.

 And what about all the references to the files in scripts?
 In configuration files?

Such files rarely use non-ASCII characters. Non-ASCII characters are
primarily used in names of documents created explicitly by the user.

 Soft links?

They can be fixed automatically.

 If you want to break things, this is definitely the way to do it.

Using non-ASCII filenames is risky to begin with. Existing tools don't
have a good answer to what should happen with these files when the
default encoding used by the user changes, or when a user using a
different encoding tries to access them.

As long as everybody uses the same encoding and files use it too,
things work. When the assumption is false, something will break.

 You mean, various programs will break at various points of time,
 instead of working correctly from the beginning?
 
 So far nothing broke. Because all the programs are in UTF-8.

This doesn't imply that they won't break. You are talking about
filenames which are *not* UTF-8, with the locale set to UTF-8.

Mozilla doesn't show such filenames in a directory listing. You
may consider it a bug, but this is a fact. Producing non-UTF-8 HTML
labeled as UTF-8 would be wrong too. There is no good solution to
the problem of filenames encoded in different encodings.

Handling such filenames is incompatible with using Unicode to process
strings. You have to go back to passing arrays of bytes with ambiguous
interpretation of non-ASCII characters, and live with inconveniences
like displaying garbage for non-ASCII filenames and broken sorting.

 Mixing any two incompatible filename encodings on the same file system
 is a bad idea.
 
 As soon as you realize you cannot convert filenames to UTF-8, you
 will see that all you can do is start adding new ones in UTF-8.
 Or forget about Unicode.

I'm not using a UTF-8 locale yet, because too many programs don't
support it. I'm using ISO-8859-2. But almost all filenames are ASCII.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Nicest UTF

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
D. Starner [EMAIL PROTECTED] writes:

 But demanding that each program which searches strings checks for 
 combining classes is, I'm afraid, too much. 

 How is it any different from a case-insenstive search?

We started from string equality, which somehow changed into searching.
Default string equality is case-sensitive.

Searching for an arbitrary substring entered by a user should use
user-friendly rules which fold various minor differences like
decomposition and case and soft hyphens, but it's a rare task and
changing rules generally affects convenience rather than correctness.

String equality is used for internal and important operations like
lookup in a dictionary (not necessarily of strings ever viewed by
the user), comparing XML tags, filenames, mail headers, program
identifiers, hyperlink addresses etc. They should be unambiguous,
simple and fast. Computing approximate equivalence by folding minor
differences must be done explicitly when needed, as mandated by
relevant protocols and standards, not forced as the default.

  Does \n followed by a combining code point start a new line? 
  
  The Standard says no, that's a defective combining sequence. 
 
 Is there *any* program which behaves this way? 

 I misstated that; it's a new line followed by a defective combining
 sequence.

What is the definition of combining sequences?

 It doesn't matter that accented backslashes don't occur in practice.
 I do care for unambiguous, consistent and simple rules.

 So do I; and the only unambiguous, consistent and simple rule that
 won't give users hell is that ba never matches b. Any programs
 for end-users must follow that rule.

Please give a precise definition of string equality. What representation
of strings does it need - a sequence of code points or something else?
Are all strings valid and comparable? Are there operations which give
different results for equal strings?

If string equality folded the difference between precomposed and
decomposed characters, then the API should hide that difference in
other places as well, otherwise string equality is not the finest
distinction between string values but some arbitrary equivalence
relation.

 My current implementation doesn't support filenames which can't be 
 encoded in the current default encoding. 

 The right thing to do, IMO, would be to support filenames as byte
 strings, and let the programmer convert them back and forth between
 character strings, knowing that it won't roundtrip.

Perhaps. Unfortunately it makes filename processing harder, e.g.
you can't store them in *text* files processed through a transparent
conversion between its encoding and Unicode. In effect we must go
back from manipulating context-insensitive character sequences to
manipulating byte sequences with context-dependent interpretation.

We can't even sort filenames using Unicode algorithms for collation
but must use some algorithms which are capable of processing both
strings in the locale's encoding and arbitrary byte sequences at the
same time. This is much more complicated than using Unicode algorithms
alone.

What is worse, on Windows the primary representation of filenames
is Unicode, so programs which carefully use APIs based on
byte sequences for processing filenames will be less general than
Unicode-based APIs when the program is ported to Windows.

The computing world is slowly migrating from processing byte sequences
in ambiguous encodings to processing Unicode strings, often represented
by byte sequences in explicitly labeled encodings. There are relics
when the new paradigm doesn't fit well, like Unix filenames, but
sticking to the old paradigm means that programs will continue to
support mixing scripts poorly or not at all.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/




Re: Nicest UTF

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes:

 It's hard to create a general model that will work for all scripts
 encoded in Unicode. There are too many differences. So Unicode just
 appears to standardize a higher level of processing with combining
 sequences and normalization forms that better approach the
 linguistics and semantics of the scripts. Consider this level as an
 intermediate tool that will help simplify the identification of
 processing units.

While rendering and user input may use evolving rules with complex
specifications and implementations which depend on the environment
and user's configuration (actually there is no other choice: this
is inherently complicated for some scripts), string processing in
a programming language should have a stable base with well-defined
and easy to remember semantics which doesn't depend on too many
settable preferences and version variations.

The more complex rules a protocol demands (case-insensitive
programming language identifiers, compared after normalization,
after bidi processing, with soft hyphens removed etc.), the more
tools will implement it incorrectly. Usually with subtle errors
which don't manifest until someone tries to process an unusual name
(e.g. a documentation generation tool will produce dangling
hyperlinks, because a WWW server does not perform sufficient
transformations of addresses).

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 Please make up your mind: either they are valid and programs are
 required to accept them, or they are invalid and programs are required
 to reject them.
 
 I don't know what they should be called. The fact is there shouldn't be any.
 And that current software should treat them as valid. So, they are not valid
 but cannot (and must not) be validated. As stupid as it sounds. I am sure
 one of the standardizers will find a Unicodally correct way of putting it.

I am sure they will not.

There is a tension to migrate from processing strings in terms of
bytes in some vaguely specified encoding to processing them in terms
of code points of a known encoding, or even further: combining
character sequences, graphemes etc.

20 years ago the distinction was moot: a byte was a character, except
for some specialized programs for handling CJK. Today, when Latin names
with accented characters mixed with Cyrillic names are not displayed
correctly or not sorted according to the lexicographic conventions of
some culture, the program can be considered broken. Unfortunately
supporting this requires changing the paradigm. A font with 256
characters and a byte-based rendering engine is not enough for display,
and for sorting it's no longer enough to compare a byte at a time.

You are trying to stick with processing byte sequences, carefully
preserving the storage format instead of preserving the meaning in
terms of Unicode characters. This leads to less robust software
which is not certain about the encoding of texts it processes and
thus can't apply algorithms like case mapping without risking
meaningless damage to the text.

 Today, two invalid UTF-8 strings compare the same in UTF-16, after a
 valid conversion (using a single replacement char, U+FFFD) and they
 compare different in their original form,

Conversion should signal an error by default. Replacing errors by
U+FFFD should be done only when the data is processed purely for
showing it to the user, without any further processing, i.e. when it's
better to show the text partially even if we know that it's corrupted.
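
A sketch of such a display-only fallback (it assumes the locale is
UTF-8 and setlocale has already been called; the output is for the
user's eyes only and must not be fed back into further processing):

    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    /* Print a possibly-invalid UTF-8 byte string for display only:
       every byte that cannot be decoded becomes U+FFFD. */
    static void show_lossy(const char *s, size_t len)
    {
        mbstate_t st = {0};
        size_t i = 0;
        while (i < len) {
            wchar_t wc;
            size_t r = mbrtowc(&wc, s + i, len - i, &st);
            if (r == (size_t)-1 || r == (size_t)-2) {  /* invalid or truncated */
                wc = 0xFFFD;
                r = 1;                   /* skip one byte and resynchronize */
                memset(&st, 0, sizeof st);
            } else if (r == 0) {         /* embedded NUL */
                wc = L'\0';
                r = 1;
            }
            fputwc(wc, stdout);
            i += r;
        }
    }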

 Either you do everything in UTF-8, or everything in UTF-16. Not
 always, but typically. If comparisons are not always done in the
 same UTF, then you need to validate. And not validate while
 converting, but validate on its own. And now many designers will
 remember that they didn't. So, all UTF-8 programs (of that kind)
 will need to be fixed. Well, might as well adopt my broken
 conversion and fix all UTF-16 programs. Again, of that kind, not all
 in general, so there are few. And even those would not be all
 affected. It would depend on which conversion is used where. Things
 could be worked out. Even if we would start changing all the
 conversions. Even more so if a new conversion is added and only used
 when specifically requested.

I don't understand anything of this.

 I cannot afford not to access the files.

Then you have two choices:
- Don't use Unicode.
- Pretend that filenames are encoded in ISO-8859-1, and represent them
  as a sequence of code points U+0001..U+00FF. They will not be displayed
  correctly but the information will be preserved (a sketch follows).
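
The second option is trivial and lossless; a sketch (buffers assumed
to be large enough):

    #include <stddef.h>
    #include <stdint.h>

    /* The ISO-8859-1 pretence: every filename byte becomes one code point
       U+0001..U+00FF, and every such code point maps back to one byte.
       Not readable for non-Latin-1 names, but fully reversible. */
    static void bytes_to_code_points(const unsigned char *s, size_t n,
                                     uint32_t *out)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = s[i];
    }

    static void code_points_to_bytes(const uint32_t *cps, size_t n,
                                     unsigned char *out)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = (unsigned char)cps[i];  /* all values <= 0xFF by construction */
    }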

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: Nicest UTF

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes:

[...]
 This was later amended in an errata for XML 1.0 which now says that
 the list of code points whose use is *discouraged* (but explicitly
 *not* forbidden) for the Char production is now:
[...]

Ugh, it's a mess...

IMHO Unicode is partially to blame, by introducing various kinds of
holes in code point numbering (non-characters, surrogates), by not
being clear when the unit of processing should be a code point and
when a combining character sequence, and earlier by pushing UTF-16 as
the fundamental representation of the text (which led to such horrible
descriptions as http://www.xml.com/axml/notes/Surrogates.html).

XML is just an example of a standard which must decide:
A. What is the unit of text processing? (code point? combining character
   sequence? something else? hopefully not the UTF-16 code unit)
B. Which (sequences of) characters are valid when present in the raw
   source, i.e. what UTF-n really means?
C. Which (sequences of) characters can be formed by specifying a
   character number?

A programming language must do the same.

The language Kogut I'm designing and developing uses Unicode as its
string representation, but the details can still be changed. I want to
have rules which are correct as far as Unicode is concerned, and which
are simple enough to be practical (e.g. if a standard forced me to
make the conversion from code point number to actual character
contextual, or if it forced me to unconditionally unify precomposed
and decomposed characters, then I would quit and not support a broken
standard).

Internal text processing in a programming language can be more
permissive than an application of such processing like XML parsing:
if a particular character is valid in UTF-8 but XML disallows it,
everything is fine, it can be rejected at some stage. It must not be
more restrictive, however, as that would make it impossible to implement
XML parsing in terms of string processing.

Regarding A, I see three choices:
1. A string is a sequence of code points.
2. A string is a sequence of combining character sequences.
3. A string is a sequence of code points, but it's encouraged
   to process it in groups of combining character sequences.

I'm afraid that anything other than a mixture of 1 and 3 is too
complicated to be widely used. Almost everybody is representing
strings either as code points, or as even lower-level units like
UTF-16 units. And while 2 is nice from the user's point of view,
it's a nightmare from the programmer's point of view:
- Unicode character properties (like general category, character
  name, digit value) are defined in terms of code points. Choosing
  2 would immediately require two-stage processing: a string is
  a sequence of sequences of code points.
- Unicode algorithms (like collation, case mapping, normalization)
  are specified in terms of code points.
- Data exchange formats (UTF-n) are always closer to code points
  than to combining character sequences.
- Code points have a finite domain, so you can make dictionaries
  indexed by code points; for combining character sequences we would
  be forced to make functions which *compute* the relevant property
  basing on the structure of such a sequence.

I don't believe 2 is workable at all. The question is how to make 3
convenient enough to be used more often. Unfortunately it's much
harder than 1, unless strings used some completely different iteration
protocols than other sequences. I have no idea how to make 3
convenient.

Regarding B in the context of a programming language (not XML),
chapter 3.9 of the Unicode standard version 4.0 excludes only
surrogates: it does not exclude non-characters like U+FFFF.
But non-characters must be excluded somewhere, because otherwise
U+FFFE at the beginning would be mistaken for a BOM. I'm confused.

Regarding C, I'm confused too. Should a function which returns
the character of the given number accept surrogates? I guess no.
Should it accept non-characters? I don't know. I only know that
it should not accept values above 0x10FFFF.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 It's essential that any UTF-n can be translated to any other without
 loss of data. Because it allows to use an implementation of the given
 functionality which represents data in any form, not necessarily the
 form we have at hand, as long as correctness is concerned. Avoiding
 conversion should matter only for efficiency, not for correctness.
 
 When I am talking about roundtrip, I speak of arbitrary data, not
 just valid data.

You want to declare all byte sequences as valid. And thus valid data
is no longer preserved on round trip, because different UTFs are able
to encode different sequences of code points.

 Roundtrip for valid data is of course essential and needs to be
 preserved.

Your proposal does not do this.

 Unpaired surrogates are not valid UTF-16, and there are no surrogates
 in UTF-8 at all, so there is no point in trying to preserve UTF-16
 which is not really UTF-16.
 
 Actually, there is a point. It is just that you fail to understand it.
 But then, you needn't worry about it, since it is outside of your area
 of interest.

I would worry if my programs no longer accepted what Unicode considers
valid UTF-n. And I would worry if rules defined by Unicode made some
code point encodable as UTF-n, another code point encodable too, but
the sequence of the two not encodable (because UTF-n would no longer
be usable as a format for serialization of arbitrary strings of valid
code points).

I would also worry if an API, file format or network protocol intended
for use by various programs required a non-standard variant of UTF-n,
because I couldn't use standard UTF-n encoding and decoding functions
to interoperate with it.

I indeed don't worry about how you abuse UTF-n, as long as it's not
an official Unicode standard and it's not widely used in practice.

 If UTC takes 128 unassigned codepoints and declares them to be a new
 set of surrogates, you needn't worry either (your valid data will
 still convert to any UTF).

No, because it would remove the responsibility not to generate such
data and add a responsibility to accept it, and thus some programs
which are not currently broken would be broken under the changed rules.

 Unless you have a strict validator which already validates unpaired
 surrogates. But you don't. I am pretty sure about it.

I use the system-supplied iconv(), which does not accept anything that
can be described as unpaired surrogates.
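
For instance (a sketch; the exact failure mode depends on the iconv
implementation, but GNU iconv rejects the unpaired surrogate with
EILSEQ):

    #include <errno.h>
    #include <iconv.h>
    #include <stdio.h>

    int main(void)
    {
        /* UTF-16LE: 'a', then an unpaired high surrogate 0xD800, then 'b'. */
        char in[] = { 'a', 0, 0x00, (char)0xD8, 'b', 0 };
        char out[16];
        char *ip = in, *op = out;
        size_t il = sizeof in, ol = sizeof out;

        iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        if (iconv(cd, &ip, &il, &op, &ol) == (size_t)-1)
            perror("iconv");   /* typically EILSEQ: the surrogate is rejected */

        iconv_close(cd);
        return 0;
    }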

 If a user encounters corrupt data and cannot process it with your
 program, she (she is 'politically correct', but in this case can
 be seen as sexism) will blame it on the program, not the data.

I don't care.

 This has been discussed mails back. UNIX filenames are already 'submitted'.
 Once you set your locale to UTF-8, you have labelled them all as UTF-8.
 Suggestions?

Convert them to be valid UTF-8 (as long as the locales used in the system
use UTF-8 as the encoding, that is; otherwise keep them in the locale's
encoding).

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Nicest UTF

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
D. Starner [EMAIL PROTECTED] writes:

  This implies that every programmer needs an indepth knowledge of 
  Unicode to handle simple strings. 
 
 There is no way to avoid that. 

 Then there's no way that we're ever going to get reliable Unicode
 support.

This is probably true.

I wonder whether things could have been done significantly better,
or whether this is the inherent complexity of text. Just curious; it
doesn't help with the reality.

 If the runtime automatically performed NFC on input, then a part of a 
 program which is supposed to pass a string unmodified would sometimes 
 modify it. Similarly with NFD.

 No. By the same logic you used above, I can expect the programmer to
 understand their tools, and if they need to pass strings unmodified,
 they shouldn't load them using methods that normalize the string.

That's my point: if he normalizes, he does this explicitly.

If a standard (a programming language, XML, whatever) specifies that
identifiers should be normalized before comparison, a program should
do this. If it specifies that Cf characters are to be ignored, then a
program should comply. A standard doesn't have to specify such things
however, so a programming language shouldn't do too much automatically.
It's easier to apply a transformation than to undo a transformation
applied automatically.

 Sometimes things get ambiguous if one day &#349; is matched by s and one
 day &#349; isn't? That's absolutely wrong behavior; the program must serve
 the user, not the programmer.

If I use grep to search for a combining acute, I bet it will currently
match cases where it's a separate combining character but will not
match precomposed characters.

Do you say that this should be changed?

Hey, Linux grep matches only a single byte with '.', even in a UTF-8 locale.
Now, I can agree that this should be changed.

But demanding that each program which searches strings checks for
combining classes is, I'm afraid, too much.

 Does \n followed by a combining code point start a new line? 

 The Standard says no, that's a defective combining sequence.

Is there *any* program which behaves this way?

How useful is a rule in a standard which nobody obeys?

 Does a double quote followed by a combining code point start a
 string literal?

 That would depend on your language. I'd prefer no, but it's obvious
 many have made other choices.

Since my language is young and has almost no users, I can even
change decisions made earlier: I'm not constrained by compatibility
yet.

But if the lexical structure of the program worked in terms of combining
character sequences, it would have to be somehow supported by generic
string processing functions, and it would have to work consistently for
all lexical features. For example */ followed by a combining accent
would not end a comment, accented backslash would not need escaping in
a string literal, and something unambiguous would have to be done with
an accented newline.

Such rules would be harder to support with most text processing tools.
I know no language in which searching for a backslash in a string would
not find an accented backslash.

It doesn't matter that accented backslashes don't occur in practice.
I do care about unambiguous, consistent and simple rules.

 Does a slash followed by a combining code point separate 
 subdirectory names?

 In Unix, yes; that's because filenames in Unix are byte streams with
 the byte 0x2F acting as a path separator.

My current implementation doesn't support filenames which can't be
encoded in the current default encoding. The encoding can be changed
from within a program (perhaps locally during execution of some code).
So one can process any Unix filename by temporarily setting the
encoding to Latin1. It's unfortunate that the default setting is more
restrictive than the OS, but I have found no sensible alternative
other than encouraging processing strings in their transportation
encoding.

Anyway, if a string *is* accepted as a file name, the program's idea
about directory separators is the same as the OS's (as long as we assume
Unix; I don't yet provide any OS-generic pathname handling). If the
program assumed that an accented slash is not a directory separator,
I expect possible security holes (the program thinks that a string
doesn't include slashes, but from the OS point of view it does).

 The rules you are offering are only simple and unambiguous to the
 programmer; they appear completely random to the end user.

And yours are the opposite :-)

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 The other name for this is roundtripping. Currently, Unicode allows
 a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are
 several reasons why a UTF-8=>UTF-16(32)=>UTF-8 roundtrip is more
 valuable, even if it means that the other roundtrip is no longer
 guaranteed:

It's essential that any UTF-n can be translated to any other without
loss of data. Because it allows to use an implementation of the given
functionality which represents data in any form, not necessarily the
form we have at hand, as long as correctness is concerned. Avoiding
conversion should matter only for efficiency, not for correctness.

 Let me go a bit further. A UTF-16=>UTF-8=>UTF-16 roundtrip is only
 required for valid codepoints other than the surrogates. But it also
 works for surrogates unless you explicitly and intentionally break it.

Unpaired surrogates are not valid UTF-16, and there are no surrogates
in UTF-8 at all, so there is no point in trying to preserve UTF-16
which is not really UTF-16.

 I would opt for the latter (i.e. keep it working), according to my
 statement (in the thread When to validate) that validation should
 be separated from other processing, where possible.

Surely it should be separated: validation is only necessary when data
are passed from the external world to our system. Internal operations
should not produce invalid data from valid data. You don't have to
check at each point whether data is valid. You can assume that it is
always valid, as long as the combination of the programming language,
libraries and the program is not broken.

Some languages make it easier to ensure that strings are valid, to the
point that they guarantee it (they don't offer any way to construct
an invalid string). Unfortunately many languages don't: they say that
they represent strings in UTF-8 or UTF-16, but they are unsafe, they
do nothing to prevent constructing an array of words which is not
valid UTF-8 or UTF-16 and passing it to functions which assume that
it is. Blame these languages, not the definitions of UTF-n.

 A UTF-32=>UTF-8=>UTF-32 roundtrip is similar, except that 16-8-16 works even
 with concatenation, while 32-8-32 can be broken with concatenation.

It always works as long as the data was really UTF-32 in the first place.
A word with a value of 0xD800 is not UTF-32.

 All this is known and presents no problems, or - only problems that
 can be kept under control. So, by introducing another set of 128
 'surrogates', we don't get a new type of a problem, just another
 instance of a well known one.

Nonsense. UTF-8, UTF-16 and UTF-32 are interchangeable, and you would
like to break this. No way.

 On the other hand, UTF-8=>UTF-16=>UTF-8 as well as UTF-8=>UTF-32=>UTF-8
 can be both achieved, with no exceptions. This is something no other
 roundtrip can offer at the moment.

But they do! An isolated byte with the highest bit set is not UTF-8,
so there is no point in converting it to UTF-16 and back.

 On top of it, I repeatedly stressed that it is UTF-8 data that has the
 highest probability of any of the following:
 * contains portions that are not UTF-8
 * is not really UTF-8, but user has UTF-8 set as default encoding
 * is not really UTF-8, but was marked as such
 * a transmission error not only changes data but also creates invalid
 sequences

In these cases the data is broken and the damage should be signalled as
soon as possible, so the submitter can know this and correct it.

Alternatively you keep the original byte sequence, but don't pretend
that it's UTF-8. Delete the erroneous UTF-8 label instead of changing
the data.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 All assigned codepoints do roundtrip even in my concept.
 But unassigned codepoints are not valid data.

Please make up your mind: either they are valid and programs are
required to accept them, or they are invalid and programs are required
to reject them.

 Furthermore, I was proposing this concept to be used, but not
 unconditionally. So, you can, possibly even should, keep using
 whatever you are using.

So you prefer to make programs misbehave in unpredictable ways
(when they pass the data from a component which uses relaxed rules
to a component which uses strict rules) rather than have a clear and
unambiguous notion of a valid UTF-8?

 Perhaps I can convert mine, but I cannot convert all filenames on
 a user's system.

Then you can't access his files.

With your proposal you couldn't either, because you don't make them
valid unconditionally. Some programs would access them and some would
break, and it's not clear what should be fixed: programs or filenames.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: When to validate?

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes:

 Here's something that's been bothering me. Suppose I write a function
 -
 let's call it trim(), which removes leading and trailing spaces from a
 string, represented as one of the UTFs. If I've understood this
 correctly, I'm supposed to validate the input, yes?

What do you mean by "validate"?

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
D. Starner [EMAIL PROTECTED] writes:

 String equality in a programming language should not treat composed
 and decomposed forms as equal. Not this level of abstraction.

 This implies that every programmer needs an indepth knowledge of
 Unicode to handle simple strings.

There is no way to avoid that.

If the runtime automatically performed NFC on input, then a part of a
program which is supposed to pass a string unmodified would sometimes
modify it. Similarly with NFD.

You can't expect each and every program which compares strings to
perform normalization (e.g. Linux kernel with filenames).

Perhaps if there were a single normalization form which everybody
agreed on, and unnormalized strings were never used for data
interchange (if UTF-8 were specified so as to disallow unnormalized
data, etc.), things would be different. But Unicode treats both
composed and decomposed representations as valid.

 IMHO splitting into graphemes is the job of a rendering engine, not of
 a function which extracts a part of a string which matches a regex.

 So S should _sometimes_ match an accented S? Again, I feel extended misery
 of explaining to people why things aren't working right coming on.

Well, otherwise things get ambiguous, similarly to these XML issues.
Does \n followed by a combining code point start a new line? Does
a double quote followed by a combining code point start a string
literal? Does a slash followed by a combining code point separate
subdirectory names?

An iterator which delivers whole combining character sequences out of
a sequence of code points can be used. You can also manipulate strings
as arrays of combining character sequences. But if you insist that
this is the primary string representation, you become incompatible
with most programs which have different ideas about delimited strings.
You can't expect each and every program to check combining classes
of processed characters. It's hard enough to convince them that a
character is not the same as a byte.

 I expect breakage of XML-based protocols if implementations are
 actually changed to conform to these rules (I bet they don't now).

 Really? In what cases are you storing isolated combining code points
 in XML as text?

In case I want to circumvent security or deliberately cause a piece of
software to misbehave. Robustness requires unambiguous and simple rules.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes:

 The XML/HTML core syntax is defined with fixed behavior of some
  individual characters like '<', '>', quotation marks, and with special
 behavior for spaces.

The point is: what does "characters" mean in this sentence? Code points?
Combining character sequences? Something else?

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
John Cowan [EMAIL PROTECTED] writes:

  The XML/HTML core syntax is defined with fixed behavior of some
  individual characters like '<', '>', quotation marks, and with special
  behavior for spaces.
 
 The point is: what does "characters" mean in this sentence? Code points?
 Combining character sequences? Something else?

 Neither.  Unicode characters.

What does "Unicode characters" mean?

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
John Cowan [EMAIL PROTECTED] writes:

  The XML/HTML core syntax is defined with fixed behavior of some
  individual characters like '<', '>', quotation marks, and with special
  behavior for spaces.
 
 The point is: what does "characters" mean in this sentence? Code points?
 Combining character sequences? Something else?

 Neither.  Unicode characters.

http://www.w3.org/TR/2000/REC-xml-20001006#charsets
implies that the appropriate level for parsing XML is code points.

In particular XML allows a combining character directly after '>'.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: Nicest UTF

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
D. Starner [EMAIL PROTECTED] writes:

 You could hide combining characters, which would be extremely useful if 
 we were just using Latin and Cyrillic scripts.

It would need a separate API for examining the contents of a combining
character. You can't avoid the sequence of code points completely.

It would yield to surprising semantics: for example if you concatenate
a string with N+1 possible positions of an iterator with a string with
M+1 positions, you don't necessarily get a string with N+M+1 positions
because there can be combining characters at the border.

It's simpler to overlay various grouping styles on top of a sequence
of code points than to start with automatically combined combining
characters and process inwards and outwards from there (sometimes
looking inside characters, sometimes grouping them even more).

It would impose complexity in cases where it's not needed. Most of the
time you don't care which code points are combining and which are not,
for example when you compose a text file from many pieces (constants
and parts filled by users) or when parsing (if a string is specified
as ending with a double quote, then programs will in general treat a
double quote followed by a combining character as an end marker).

I believe code points are the appropriate general-purpose unit of
string processing.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: If only MS Word was coded this well

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
Theodore H. Smith [EMAIL PROTECTED] writes:

 It's because code points have variable lengths in bytes, so
 extracting individual characters is almost meaningless

 Same with UTF-16 and UTF-32. A character is multiple code-points,
 remember? (decomposed chars?)

 Nope. I've done tons of UTF-8 string processing. I've even done a case
 insensitive word-frequency measuring algorithm on UTF-8. It runs
 blastingly fast, because I can do the processing with bytes.

Ah, so first you say that a character means a base code point plus a
number of combining code points, and then you admit that your program
actually processes strings in terms of even lower level units: bytes of
the UTF-8 encoding?

Why don't you treat a string as a sequence of "base code point with
combining code points" items?

Answer: because often this grouping is irrelevant, like in your
example of word statistics. Code point grouping is more important:
Unicode algorithms are typically described in terms of code points.

 It just requires you to understand the actual logic of UTF-8 well
 enough to know that you can treat it as bytes, most of the time.

When I implemented the word boundary algorithm from Unicode, I was
glad that I could do it in terms of UTF-32 and ISO-8859-1 instead of
UTF-8, even though I do understand the logic of UTF-8.

 As for isspace... sure there is a UTF-8 non-byte space.

I don't understand.

If a string is exposed as a sequence of UTF-8 units, it makes no sense
to ask whether a particular unit isspace. And it makes no sense to ask
this about a whole string either. It would have to be a function which
works in terms of some iterator over strings.

Well, some things do work in terms of positions inside strings, for
example word boundaries. But people are used to thinking about isspace
as a property of a *character*, whatever the language exactly means by
this concept. In my language that means a Unicode code point, for the
conceptual simplicity of a string as seen by the language.

 My case insensitive utf-8 word frequency counter (which runs
 blastingly fast) however didn't find this to be any problem. It
 dealt with non-single byte all sorts of word breaks :o)

 It appears to run at about 3MB/second on my laptop, which involves
 for every word, doing a word check on the entire previous collection
 of words.

I happen to have written a case insensitive word frequency counter as
an example in my language, to test some Unicode algorithms.
It uses the word boundary algorithm to specify words; a segment
between boundaries must include a character of class L* or N* in order
to be counted as a word. It maintains subcounts of case-sensitive
forms of a case-insensitive word (implemented as a hash table of hash
tables of integers). It converts input using iconv(), i.e. from an
arbitrary locale encoding supported by the system.

It was not written with speed in mind. It has 24 lines, 10 of which
are formatting the output (statistics about 20 most common words).
http://cvs.sourceforge.net/viewcvs.py/kokogut/kokogut/tests/WordStat.ko?view=markup
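
For comparison, a rough Python approximation of the same idea (this is
not the Kokogut program: the regex \w+ stands in for the UAX #29 word
boundary algorithm, and str.casefold() stands in for full case folding):

import re
import sys
from collections import Counter, defaultdict

counts = defaultdict(Counter)      # folded word -> counts of original forms
for line in sys.stdin:
    for word in re.findall(r"\w+", line):
        counts[word.casefold()][word] += 1

top = sorted(counts.items(), key=lambda kv: -sum(kv[1].values()))[:20]
for folded, forms in top:
    print(sum(forms.values()), folded, dict(forms))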

It's written in a dynamically typed language, with dynamic dispatches
and higher order functions everywhere, where all values except small
integers are pointers, with immutable strings. Each line separately
is divided into words; a subsequence of spaces is materialized as a
string object before the program checks that there are no letters nor
numbers in it and thus it's not a word.

It processed 4.8MB in 3.2s on my machine (Athlon 2000, 1.25GHz), which
I think is good enough under these conditions. This input happens to
be ASCII (a mailbox) but the program didn't know beforehand that it's
ASCII.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Invalid UTF-8 sequences

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 Quite close. Except for the fact that:
 * U+EE93 is represented in UTF-32 as 0xEE93
 * U+EE93 is represented in UTF-16 as 0xEE93
 * U+EE93 is represented in UTF-8 as 0x93 (_NOT_ 0xEE 0xBA 0x93)

Then it would be impossible to represent sequences like
U+EEEE U+EEBA U+EE93 in UTF-8, and conversion UTF-32 -> UTF-8 -> UTF-32
would not round-trip.

Concatenation of UTF-8-encoded strings would not be equivalent to
UTF-8-encoding of the concatenation of code points.

This is broken.
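
A small Python sketch of why (the byte-for-code-point mapping below is
my reading of the proposal, i.e. an assumption, not something defined by
Unicode):

def proposed_encode(code_points):
    out = bytearray()
    for cp in code_points:
        if 0xEE80 <= cp <= 0xEEFF:         # "escaped byte" code points
            out.append(cp - 0xEE00)        # emit the original byte
        else:
            out += chr(cp).encode("utf-8") # ordinary UTF-8
    return bytes(out)

a = proposed_encode([0xEEEE, 0xEEBA, 0xEE93])
b = "\uee93".encode("utf-8")               # standard UTF-8 for U+EE93
print(a == b)                              # True: both are EE BA 93, so the
                                           # byte stream cannot round-trip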

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Nicest UTF

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
D. Starner [EMAIL PROTECTED] writes:

 The semantics there are surprising, but that's true no matter what you
 do. An NFC string + an NFC string may not be NFC; the resulting text
 doesn't have N+M graphemes.

Which implies that automatically NFC-ing strings as they are processed
would be a bad idea. They can be NFC-ed at the end of processing if the
consumer of this data will demand this. Especially if other consumers
would want NFD.
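
A quick check with Python's unicodedata: both pieces are already NFC,
but their concatenation is not.

import unicodedata as ud

a, b = "e", "\u0301"                       # "e" and a lone combining acute
print(ud.normalize("NFC", a) == a)         # True
print(ud.normalize("NFC", b) == b)         # True
print(ud.normalize("NFC", a + b) == a + b) # False: NFC(a+b) is U+00E9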

String equality in a programming language should not treat composed
and decomposed forms as equal. Not this level of abstraction.

IMHO splitting into graphemes is the job of a rendering engine, not of
a function which extracts a part of a string which matches a regex.

 If you do so with a language that includes ≮, you violate the Unicode
 standard, because <&#824; (not <) and &#8814; are canonically equivalent.

I think that Unicode tries to push implications of equivalence
too far.

They are supposed to be equivalent when they are actual characters.
What if they are numeric character references? Should <&#824;
(7 characters) represent a valid plain-text character or be a broken
opening tag?

Note that if it's a valid plain-text character, it's impossible
to represent isolated combining code points in XML, and thus it's
impossible to use XML for transportation of data which allows isolated
combining code points (except by introducing custom escaping of
course, e.g. transmitting decimal numbers instead of characters).
I expect breakage of XML-based protocols if implementations are
actually changed to conform to these rules (I bet they don't now).

OTOH if it's not a valid plain-text character, then conversion between
numeric character references and actual characters is getting more
hairy.

 I'll see if I have time after finals to pound out a basic API that
 implements this, in Ada or Lisp or something.

My language is quite similar to Lisp semantically.

Implementing an API which works in terms of graphemes over an API
which works in terms of code points is more sane than the converse,
which suggests that the core API should use code points if both APIs
are sometimes needed at all.

While I'm not obsessed with efficiency, it would be nice if changing
the API would not slow down string processing too much.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Nicest UTF

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
John Cowan [EMAIL PROTECTED] writes:

 String equality in a programming language should not treat composed
 and decomposed forms as equal. Not this level of abstraction.

 Well, that assumes that there's a special string equality predicate,
 as distinct from just having various predicates that DWIM.

No, I meant the default generic equality predicate when applied to two
strings.

 It's a broken opening tag.

Ok, so it's the conversion from raw text to escaped character
references which should treat combining characters specially.

What about < with combining acute, which doesn't have a precomposed
form? A broken opening tag or a valid text character?

What about &#65;ACUTE where ACUTE stands for combining acute? Is this
A with acute, or a broken character reference which ends with an
accented semicolon?

If it's a broken character reference, then what about A&#769; (769 is
the code for combining acute if I'm not mistaken)? If *this* is A with
acute, then it's inconsistent: here combining accents are processed
after resolving numeric character references, and previously it was
in the opposite order. OTOH if this is something else, then it's
impossible to represent letters without precomposed forms with numeric
character references.

The general trouble is that numeric character references can only
encode individual code points rather than graphemes (is this a correct
term for a non-combining code point with a sequence of combining code
points?). So if XML is supposed to be treated as a sequence of
graphemes, weird effects arise in the above boundary cases...

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Nicest UTF

2004-12-06 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 This is simply what you have to do. You cannot convert the data
 into Unicode in a way that says I don't know how to convert this
 data into Unicode. You must either convert it properly, or leave
 the data in its original encoding (properly marked, preferably).

 Here lies the problem. Suppose you have a document in UTF-8, which
 somehow got corrupted and now contains a single invalid sequence.
 Are you proposing that this document needs to be stored separately?

He is not proposing that.

 Everything else in the database would be stored in UTF-16, but now
 one must add the capability to store this document separately.

No, it can be stored in UTF-16 or whatever else is used. Except the
corrupted part of course, but it's corrupted, and thus useless, so it
doesn't matter what happens with it.

 Now suppose you have a UNIX filesystem, containing filenames in a legacy
 encoding (possibly even more than one). If one wants to switch to UTF-8
 filenames, what is one supposed to do? Convert all filenames to UTF-8?

Yes.

 Who will do that?

A system administrator (because he has access to all files).

 And when?

When the owners of the computer system decide to switch to UTF-8.

 Will all users agree?

It depends on who decides about such things. Either they don't have a
voice, or they agree and the change is made, or they don't agree and
the change is not made. What's the point?

 Should all filenames that do not conform to UTF-8 be declared invalid?

What do you mean by "invalid"? They are valid from the point of view
of the OS, but they will not work with reasonable applications which
use Unicode internally.

 If you keep all processing in UTF-8, then this is a decision you can
 postpone.

You mean, various programs will break at various points of time,
instead of working correctly from the beginning?

If it's broken, fix it, instead of applying patches which will
sometimes hide the fact that it's broken, or sometimes not.

 I didn't encourage users to mix UTF-8 filenames and Latin 1 filenames.
 Do you want to discourage them?

Mixing any two incompatible filename encodings on the same file system
is a bad idea.

 IMHO, preserving data is more important, but so far it seems it is
 not a goal at all. With a simple argument - that Unicode only
 defines how to process Unicode data. Understandably so, but this
 doesn't mean it needs to remain so.

If you don't know the encoding and want to preserve the values of
bytes, then don't convert it to Unicode.

 Well, you may have a wrong assumption here. You probably think that
 I convert invalid sequences into PUA characters and keep them as
 such in UTF-8. That is not the case. Any invalid sequences in UTF-8
 are left as they are. If they need to be converted to UTF-16, then
 PUA is used. If they are then converted to UTF-8, they are converted
 back to their original bytes, hence the incorrect sequences are
 re-created.

This does not make sense. If you want to preserve the bytes instead
of working in terms of characters, don't convert it at all - keep the
original byte stream.

 One more example of data loss that arises from your approach: If a
 single bit is changed in UTF-16 or UTF-32, that is all that will
 happen (in more than 99% of the cases). If a single bit changes in
 UTF-8, you risk that the entire character will be dropped or
 replaced with the U+FFFD. But funny, only if it ever gets converted
 to the UTF-16 or UTF-32. Not that this is a major problem on its
 own, but it indicates that there is something fishy in there.

If you change one bit in a file compressed by gzip, you might not be
able to recover any part of it. What's the point?

UTF-x were not designed to minimize the impact of corruption of
encoded bytes. If you want to preserve the text despite occasional
corruption, use a higher level protocol for this (if I remember
correctly, RAR can add additional information to an archive which
allows to recover the data even if parts of the archive, entire
blocks, have been lost).

 There was a discussion on nul characters not so long ago. Many text
 editors do not properly preserve nul characters in text files.
 But it is definitely a nice thing if they do. While preserving nul
 characters only has a limited value, preserving invalid sequences
 in text files could be crucial.

An editor should alert the user that the file is not encoded in a
particular encoding or that it's corrupted, instead of trying to guess
which characters were supposed to be there.

If it's supposed to edit binary files too, it should work on the bytes
instead of decoded characters.

 A UTF-8 based editor can easily do this. A UTF-16 based editor
 cannot do it at all. If you say that UTF-16 is not intended for such
 a purpose, then so be it. But this also means that UTF-8 is superior.

It's much easier with CP-1252, which shows that it's superior to UTF-8
:-)

 Yes, it is not related much. Except for the fact I was trying to see
 if UTF-32 

Re: Nicest UTF

2004-12-05 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes:

 The point is that indexing should better be O(1).

 SCSU is also O(1) in terms of indexing complexity...

It is not. You can't extract the nth code point without scanning the
previous n-1 code points.

 But individual characters do not always have any semantic. For
 languages, the relevant unit is almost always the grapheme cluster,
 not the character (so not its code point...).

How do you determine the semantics of a grapheme cluster? Answer: by
splitting it into code points. A code point is atomic, it's not split
any more, because there is a finite number of them.

When a string is exchanged with another application or network
computer or the OS, it always uses some encoding which is closer to
code points than to grapheme clusters, no matter if it's UTF-8 or
UTF-16 or ISO-8859-something. If the string was originally stored as
an array of grapheme clusters, it would have to be translated to code
points before further conversion.

 Which represent will be the best is left to implementers, but I really
 think that compressed schemes are often introduced to increase the
 application performances and reduce the needed resources both in
 memory and for I/O, but also in networking where interoperability
 across systems and bandwidth optimization are also important design
 goals...

UTF-8 is much better for interoperability than SCSU, because it's
already widely supported and SCSU is not.

It's also easier to add support for UTF-8 than for SCSU. UTF-8 is
stateless, SCSU is stateful - this is very important. UTF-8 is easier
to encode and decode.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Nicest UTF

2004-12-05 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes:

 The question is why you would need to extract the nth codepoint so
 blindly.

For example I'm scanning a string backwards (to remove '\n' at the
end, to find and display the last N lines of a buffer, to find the
last '/' or last '.' in a file name). SCSU in general supports
traversal only forwards.

 But remember the context in which this discussion was introduced:
 which UTF would be the best to represent (and store) large sets of
 immutable strings. The discussion about indexes in substrings is not
 relevant in that context.

It is relevant. A general purpose string representation should support
at least a bidirectional iterator, or preferably efficient random access.
Neither is possible with SCSU.

* * *

Now consider scanning forwards. We want to strip a beginning of a
string. For example the string is an irc message prefixed with a
command and we want to take the message only for further processing.
We have found the end of the prefix and we want to produce a string
from this position to the end (a copy, since strings are immutable).

With any stateless encoding a suitable library function will compute
the length of the result, allocate memory, and do an equivalent of
memcpy.

With SCSU it's not possible to copy the string without analysing it
because the prefix might have changed the state, so the suffix is not
correct when treated as a standalone string. If the stripped part is
short and the remaining part is long, it might pay off to scan the
part we want to strip and perform a shortcut of memcpy if the prefix
did not change the state (which is probably a common case). But in
general we must recompress the whole copied part! We can't even
precalculate its physical size. Decompressing into temporary memory
will negate benefits of a compressed encoding, so we should better
decompress and compress in parallel into a dynamically resizing
buffer. This is ridiculously complex compared to a memcpy.
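
To illustrate the stateless case (a Python sketch; the message text is
just a made-up example, and the SCSU side can't be shown here because
there is no SCSU codec at hand):

msg = "PRIVMSG #chan :zażółć gęślą jaźń".encode("utf-8")
start = msg.index(b":") + 1    # end of the prefix we want to strip
suffix = msg[start:]           # the memcpy-equivalent: a plain byte copy
print(suffix.decode("utf-8"))  # still a valid standalone UTF-8 string
# With SCSU, the bytes after `start` may depend on window state set up by
# the stripped prefix, so a plain copy like this would not be enough.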

The *only* advantage of SCSU is that it takes little space. Although
in most programs most strings are ASCII, and SCSU never beats
ISO-8859-1 which is what the implementation of my language is using
for strings with no characters above U+00FF, so it usually does
not have even this advantage.

Disadvantages are everywhere else: every operation which looks at the
contents of a string or produces contents of a string is more complex.
Some operations can't be supported at all with the same asymptotic
complexity, so the API would have to be changed as well to use opaque
iterators instead of indices. It's more complicated both for internal
processing and for interoperability (unless the other end understands
SCSU too, which is unlikely).

Plain immutable character arrays are not completely universal either
(e.g. they are not sufficient for a buffer of a text editor), but they
are appropriate as the default representation for common cases; for
representing filenames, URLs, email addresses, computer language
identifiers, command line option names, lines of a text file, messages
in a dialog in a GUI, names of columns of a database table etc. Most
strings are short and thus performing a physical copy when extracting
a substring is not disastrous. But the complexity of SCSU is too bad.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Nicest UTF

2004-12-04 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes:

 There's nothing that requires the string storage to use the same
 exposed array,

The point is that indexing should better be O(1).

Not having a constant size per code point requires one of three things:
1. Using opaque iterators instead of integer indices.
2. Exposing a different unit in the API.
3. Living with the fact that indexing is not O(1) in general; perhaps
   with clever caching it's good enough in common cases.

Although all three choices can work, I would prefer to avoid them.
If I had to, I would probably choose 1. But for now I've chosen a
representation based on code points.
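
A minimal sketch of choice 1 (assuming well-formed UTF-8; positions are
byte offsets used as opaque iterator state rather than code point
indices):

def code_points(buf):
    i = 0
    while i < len(buf):
        b = buf[i]
        if   b < 0x80: n = 1
        elif b < 0xE0: n = 2          # 2-byte lead (assumes valid input)
        elif b < 0xF0: n = 3
        else:          n = 4
        yield i, buf[i:i + n].decode("utf-8")
        i += n

for pos, cp in code_points("aβγ".encode("utf-8")):
    print(pos, cp)                    # 0 a / 1 β / 3 γ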

 Anyway, each time you use an index to access to some components of a
 String, the returned value is not an immutable String, but a mutable
 character or code unit or code point, from which you can build
 *other* immutable Strings

No, individual characters are immutable in almost every language.
Assignment to a character variable can be thought as changing the
reference to point to a different character object, even if it's
physically implemented by overwriting raw character code.

 When you do that, the returned character or code unit or code point
 does not guarantee that you'll build valid Unicode strings. In fact,
 such character-level interface is not enough to work with and
 transform Strings (for example it does not work to perform correct
 transformation of lettercase, or to manage grapheme clusters).

This is a different issue. Indeed transformations like case mapping
work in terms of strings, but in order to implement them you must
split a string into some units of bounded size (code points, bytes,
etc.).

All non-trivial string algorithms boil down to working on individual
units, because conditionals and dispatch tables must be driven by
finite sets. Any unit of a bounded size is technically workable, but
they are not equally convenient. Most algorithms are specified in
terms of code points, so I chose code points for the basic unit in
the API.

In fact in my language there is no separate character type: a code
point extracted from a string is represented by a string of length 1.
It doesn't change the fact that indexing a string by code point index
should run in constant time, and thus using UTF-8 internally would be
a bad idea unless we implement one of the three points above.

 Once you realize that, which UTF you use to handle immutable String
 objects is not important, because it becomes part of the blackbox
 implementation of String instances.

The black box must provide enough tools to implement any algorithm
specified in terms of characters, an algorithm which was not already
provided as a primitive by the language.

Algorithms generally scan strings sequentially, but in order to store
positions to come back to them later you must use indices or some
iterators. Indices are simpler (and in my case more efficient).

 Using SCSU for such String blackbox can be a good option if this
 effectively helps in store many strings in a compact (for global
 performance) but still very fast (for transformations) representation.

I disagree. SCSU can be a separate type to be used explicitly, but
it's a bad idea for the default string representation. Most strings
are short, and thus constant factors and simplicity matter more than
the amount of storage. And you wouldn't save much storage anyway:
as I said, in my representation strings which contain only characters
U+0000..U+00FF are stored one byte per character. The majority of
strings in average programs is ASCII.

In general what I don't like in SCSU is that there is no obvious
compression algorithm which makes good use of various features. Each
compression algorithm is either not as powerful as it could, or is
extremely slow (trying various choices), or is extremely complicated
(trying only sensible paths).

 Unfortunately, the immutable String implementations in Java or C#
 or Python does not allow the application designer to decide which
 representation will be the best (they are implemented as concrete
 classes instead of virtual interfaces with possible multiple
 implementations, as they should; the alternative to interfaces would
 have been class-level methods allowing the application to trade with
 the blackbox class implementation the tuning parameters).

Some functions accept any sequence of characters. Other functions
accept only standard strings. The question is how often to use each
style.

Choosing the first option increases flexibility but adds an overhead
in the common case. For example case mapping of a string would have to
either perform dispatching functions at each step, or be implemented
twice. Currently it's implemented for strings only, in C, and thus
avoids calling a generic indexing function and other overheads. At
some time I will probably implement it again, to work for arbitrary
sequences of characters, but it's more work for effects that I don't
currently need, so it's not a priority.


Re: Nicest UTF

2004-12-03 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes:

 Decoding SCSU is very straightforward,

But not for random access by code point index, which is needed by many
string APIs.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Nicest UTF

2004-12-02 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes:

 Oh for a chip with 21-bit wide registers!

Not 21-bit but 20.087462841250343-bit :-)
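
(The figure is just log2 of the size of the code space:)

import math
print(math.log2(0x110000))   # ~20.0875 bits for U+0000..U+10FFFF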

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Nicest UTF

2004-12-01 Thread Marcin 'Qrczak' Kowalczyk
Theodore H. Smith [EMAIL PROTECTED] writes:

 Assuming you had no legacy code. And no handy libraries either,
[...]
 What would be the nicest UTF to use?

For internals of my language Kogut I've chosen a mixture of ISO-8859-1
and UTF-32. Normalized, i.e. a string with characters which fit in
narrow characters is always stored in the narrow form.

I've chosen representations with fixed size code points because
nothing beats the simplicity of accessing characters by index, and the
most natural thing to index by is a code point.

Strings are immutable, so there is no need to upgrade or downgrade a
string in place, so having two representations doesn't hurt that much.
Since the majority of strings is ASCII, using UTF-32 for everything
would be wasteful.

Mutable and resizable character arrays use UTF-32 only.
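
A minimal Python sketch of such a two-form representation (assumed
details for illustration only, not the actual Kogut implementation):

class Str:
    def __init__(self, text):
        if all(ord(c) <= 0xFF for c in text):
            self._units, self._width = text.encode("latin-1"), 1
        else:
            self._units, self._width = text.encode("utf-32-le"), 4

    def __len__(self):                 # length in code points
        return len(self._units) // self._width

    def __getitem__(self, i):          # i-th code point, O(1) in both forms
        w = self._width
        unit = self._units[i * w:(i + 1) * w]
        return unit.decode("latin-1" if w == 1 else "utf-32-le")

print(Str("abc")[1], Str("ab\u0105")[2])   # b ą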

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Unicode IDNs

2004-11-09 Thread Marcin 'Qrczak' Kowalczyk
Donald Z. Osborn [EMAIL PROTECTED] writes:

 Is anyone aware of URLs that use extended Latin characters as examples?

http://w.pl/

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/




Re: bit notation in ISO-8859-x is wrong

2004-10-10 Thread Marcin 'Qrczak' Kowalczyk
[EMAIL PROTECTED] (James Kass) writes:

[...]
 If there are eight bits, why shouldn't they be bits one 
 through eight?

Because then the number of a bit doesn't correspond to the exponent
of its weight, so I even don't know in which order they are specified
(as many people order bits backwards, i.e. from the most significant).

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: XML and Unicode interoperability comes before HTML or even SGML (was: Combining across markup?)

2004-08-14 Thread Marcin 'Qrczak' Kowalczyk
In a message of Sat, 14-08-2004, at 12:35 +0200, Philippe Verdy wrote:

 Simply because, for both Unicode and ISO/IEC 10646, the character
 model includes the fact that ANY base character forms a combining
 character sequence with ANY following combining character or ZW(N)J
 character.

Shouldn't grapheme cluster boundary and word boundary rules in
http://www.unicode.org/reports/tr29/ handle ZW(N)J?

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/





Re: Combining across markup?

2004-08-12 Thread Marcin 'Qrczak' Kowalczyk
In a message of Thu, 12-08-2004, at 13:00 -0400, John Cowan wrote:


  Even better yet: Have the W3C rephrase their demand that no element
  should start with a defective sequence (when considered in separate)
  as that no *block-level* element should etc., and leave things like
  span, i and other in-line elements free to start with a combining
  character (provided that the said in-line container is not the first
  within a block-level element, of course).
 
 The trouble with that idea is that in XML generally we don't know
 what is a block-level element: elements are just elements, and it's
 up to rendering routines whether they appear as block, inline, or
 not at all.

So if on that level of abstraction it is not known whether it would make
sense or not for the higher layers, it should be permitted in all cases.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/





RE: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread Marcin 'Qrczak' Kowalczyk
In a message of Tue, 10-08-2004, at 18:33 +0100, Jon Hanna wrote:

 By the rules of XML replacing &#x338; with U+226F would mean the document was
 no longer well-formed.

Really? I don't have an XML spec handy, but character references like
&#x338; can't be processed before parsing tags, because &#60; is the
literal character <, not the start of a tag.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/





Re: Microsoft Unicode Article Review

2004-08-06 Thread Marcin 'Qrczak' Kowalczyk
In a message of Thu, 05-08-2004, at 15:52 -0500, John Tisdale wrote:

 Yet, if you are working with an application that must parse and
 manipulate text at the byte-level, the costliness of variable length
 encoding will probably outweigh the benefits of ASCII compatibility.
 In such a case the fixed length of UCS-2 will usually prove the better
 choice. This is why Windows NT and subsequent Microsoft operating
 systems, SQL Server 7 (and subsequent ones), XML, Java, COM, ODBC,
 OLEDB and the .NET framework are all built on UCS-2 Unicode encoding.

At least some of them use UTF-16, not UCS-2, e.g. Java 1.5. I wonder
if not most of them actually. At least in theory.

 The uniform length of UCS provides a good foundation when it comes to
 complex data manipulation.

And thus this point does not apply to them (unless you count apps which
break for characters outside BMP).

 There are other technical differences between these standards that you
 may want to consider that are beyond the scope of this article (such
 as how UTF-16 supports surrogate pairs but UCS-2 does not).

I don't like perpetuating the myth that Unicode is a 16-bit encoding
and UCS-2 can represent all Unicode characters. Yes, in some places you
mention that there are also some characters above the first 64k, but the
general impression from the article is that UCS-2 is one of equally-
functional representations of Unicode, while in fact this is the only
representation which doesn't cover all code points.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/





Re: UAX 15 hangul composition

2004-08-03 Thread Marcin 'Qrczak' Kowalczyk
In a message of Tue, 03-08-2004, at 13:47 +0200, Theo Veenker wrote:

 Don't know if this has been asked/reported before, but is the example code
 for hangul composition in UAX 15 correct?

I reported it a month ago and got a response stating that "This has been
forwarded to the right people, and they are looking into it."

 The TIndex <= TCount should be TIndex < TCount I think.

Right. Also, 0 <= TIndex should be 0 < TIndex.

 IMO the example would be more clear if the Hangul_Syllable_Type property
 would be used.

I prefer to have formulas rather than tables for something which can be
computed in a simple way.

Recently I implemented some Unicode algorithms in a way which resulted
in static linking of the relevant code into many programs. So it was
important to make the executable size small, which means that I had to
invent some compressed representation of various tables, and to prefer
formulas.

I used Hangul_Syllable_Type table before I realized that this data can
be computed.
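
A sketch of that computation in Python (using only the standard
composition constants; this covers the core jamo ranges used by the
composition algorithm, not the full Hangul_Syllable_Type property):

SBase, LBase, VBase, TBase = 0xAC00, 0x1100, 0x1161, 0x11A7
LCount, VCount, TCount = 19, 21, 28
SCount = LCount * VCount * TCount            # 11172

def hangul_type(cp):
    if LBase <= cp < LBase + LCount:
        return "L"
    if VBase <= cp < VBase + VCount:
        return "V"
    if TBase + 1 <= cp < TBase + TCount:     # no trailing consonant at TBase
        return "T"
    if SBase <= cp < SBase + SCount:
        return "LV" if (cp - SBase) % TCount == 0 else "LVT"
    return None

print(hangul_type(0xAC00), hangul_type(0xAC01))   # LV LVT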

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/





Re: Umlaut and Tréma, was: Variation selectors and vowel marks

2004-07-23 Thread Marcin 'Qrczak' Kowalczyk
In a message of Fri, 23-07-2004, at 18:01 +0200, Philipp Reichmuth
wrote:

 However, to return to the original problem, I don't remember ever having
 seen a data where it would be necessary to distinguish between trema and
 diaeresis in the data itself.

A similar issue: a Polish encyclopaedia I have from 1985 sorts words
with Ó differently depending on whether this is Polish Ó (sorted between
O and P, like other Polish letters are after letters without accents)
or foreign Ó (folded with O, like other foreign accents are folded).
It's typeset in the same way.

   MOQUETTE
   MÓR [mo:r], city in Hungary
   MORA
   MÓRA [mo:ro] Ferenc, Hungarian writer
   MORACZEWSKA
   [...]
   MOŻNOWŁADZTWO
   MÓR (a Polish word)
   [...]
   MÓŻDŻEK (a Polish word)
   MPHAHLELE

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/





Re: Folding algorithm and canonical equivalence

2004-07-18 Thread Marcin 'Qrczak' Kowalczyk
In a message of Sat, 17-07-2004, at 16:46 -0700, Asmus Freytag wrote:

 I wonder whether that's truly intended, or whether it could be replaced
 by a combination of
 
 AccentFolding
 OtherDiacriticFolding
 
 where AccentFolding removes *all* nonspacing marks following Latin, Greek 
 or Cyrillic letters and we would remove from DiacriticFolding all cases 
 that are already handled by accent folding.

I don't think folding cyrillic short I to I would be right. While
graphically it's a combining mark, semantically it would be like folding
I with J. What are the purposes of this folding?

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-10 Thread Marcin 'Qrczak' Kowalczyk
In a message of Fri, 09-07-2004, at 19:34 -0700, Asmus Freytag wrote:

 o-slash, can be analyzed as o and slash, even though that's not done 
 canonically in Unicode. Allowing users outside Scandinavia to perform 
 "fuzzy" searches for words with this character is useful.
 
 In this view of folding, Language-specific fuzzy searches would be tailored 
 (usually by being based on collation information, rather than on generic 
 diacritic folding).

In Polish, letters with diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż) are
sorted after the corresponding letters without. Omitting diacritics is
an error, even
though text without them is generally readable. They are removed when
the given protocol requires or encourages ASCII (e.g. filenames to be
used in URLs, login names, variable names in programming languages,
ancient computer systems). There is no alternate spelling scheme like
German AE/OE/UE/SS.

Polish letters are never folded when sorting lexicographically. This
applies to Ó in the same way as to the other eight letters. Foreign
diacritics are always folded though, at least I don't remember seeing
any other case. I think Ó would be folded together with O in an
encyclopaedia if this is a foreign O with some accent, unrelated to
Polish Ó which is a separate letter (can you suggest some non-Polish
word starting with Ó which could be found in an encyclopaedia?).

But there are cases when I would prefer to fold Polish diacritics in
searches.

It's basically every case when you are not sure that all stored data is
using diacritics, for example in generic WWW searching. There are still
people who don't use diacritics in usenet and email, or in entries in
guest books and other unprofessional web content. There are even
sometimes people who insist that Polish letters *should not* be used in
usenet and email because some computer systems can't handle them.
Diacritics are rare on IRC (because the IRC protocol doesn't distinguish
between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers
(because of laziness). This is why for searching archives of unknown
data it's generally better to fold them.

As far as I know, the default UCA folds these letters except Ł, and
standard Polish tailoring doesn't fold any Polish letter. While not
folding them in searching is technically correct and nobody would be
surprised that they are not folded, it's often more useful to fold them
and people would be pleasantly surprised if they don't have to repeat
the search with omitted diacritics.

If one wants to find data containing a word, rather than collect
statistics about usage of a word with and without diacritics, it's very
rare than folding does some harm.

Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
I will be happy to find occurrences of JEZYK too (non-existing word,
must have had diacritics stripped), but it makes no sense to return
JEŻYK (another existing word). It's not just making the letters
equivalent.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Marcin 'Qrczak' Kowalczyk
In a message of Tue, 06-07-2004, at 10:50 +0100, Peter Kirk wrote:

 I guess another similar change would be Danzig - Gdansk, but 
 I don't know where the initial G came from so possibly the Polish form 
 is older than the German.

A name with initial Gd is older than with D:
   http://encyclopedia.thefreedictionary.com/Gdansk
   http://en.wikipedia.org/wiki/Gda%C5%84sk#Names
but Wikipedia has now a hot dispute about how it should call the city:
   http://en.wikipedia.org/wiki/Talk:Gdansk/Naming_convention

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/





Error in Hangul composition code

2004-07-05 Thread Marcin 'Qrczak' Kowalczyk
http://www.unicode.org/reports/tr15/ says:

int SIndex = last - SBase;
if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0) {
    int TIndex = ch - TBase;
    if (0 <= TIndex && TIndex <= TCount) {

        // make syllable of form LVT

        last += TIndex;
        result.setCharAt(result.length()-1, last); // reset last
        continue; // discard ch
    }
}

But there is no character at TBase == U+11A7. TBase is put one code
point below the first trailing consonant, because TIndex == 0 as
computed from SIndex % TCount generally means that there is no trailing
consonant.

Also, the character at TBase + TCount doesn't compose with LV. Adding
a count to a base points to the first code point *after* the range.

So the condition should be if (0 < TIndex && TIndex < TCount).

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/




Re: Shape of the US Dollar Sign

2001-09-28 Thread Marcin 'Qrczak' Kowalczyk

Fri, 28 Sep 2001 09:58:39 -0600, Jim Melton [EMAIL PROTECTED] writes:

 I believe this is nothing but a font/glyph/presentation issue.

A font for text mode I once made had the dollar like this:

  . . . . . . . . .
  . . . # . # . . .
  . . . # . # . . .
  . . # # # # # . .
  . # # . # . # # .
  . # # . # . . . .
  . # # . # . . . .
  . . # # # . . . .
  . . . . # # # . .
  . . . . # . # # .
  . . . . # . # # .
  . # # . # . # # .
  . . # # # # # . .
  . . . # . # . . .
  . . . # . # . . .
  . . . . . . . . .

-- 
 __(  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: 3rd-party cross-platform UTF-8 support

2001-09-22 Thread Marcin 'Qrczak' Kowalczyk

Thu, 20 Sep 2001 12:46:49 -0700 (PDT), Kenneth Whistler [EMAIL PROTECTED] writes:

 If you are expecting better performance from a library that takes UTF-8
 API's and then does all its internal processing in UTF-8 *without*
 converting to UTF-16, then I think you are mistaken. UTF-8 is a bad
 form for much of the kind of internal processing that ICU has to do
 for all kinds of things -- particularly for collation weighting, for
 example. Any library worth its salt would *first* convert to UTF-16
 (or UTF-32) internally, anyway, before doing any significant semantic
 manipulation of the characters.

Why would UTF-16 be easier for internal processing than UTF-8?
Both are variable-length encodings.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: Any tools to convert HTML unicode to JAVA unicode

2001-09-22 Thread Marcin 'Qrczak' Kowalczyk

Wed, 19 Sep 2001 03:47:59 -0700 (PDT), MindTerm [EMAIL PROTECTED] writes:

   I would like to ask any tools to convert HTML
 unicode ( e.g. &#nnnn; ) to JAVA unicode ( e.g. \unnnn ) ?

Here is a Perl program which does this:

perl -pe 'BEGIN {sub java ($) {sprintf "\\u%04x", $_[0]}}
s/&#x([0-9A-Fa-f]+);/java hex $1/ge; s/&#(\d+);/java $1/ge'

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: CESU-8 vs UTF-8

2001-09-16 Thread Marcin 'Qrczak' Kowalczyk

Sun, 16 Sep 2001 01:14:06 -0700, Carl W. Brown [EMAIL PROTECTED] writes:

 If it can be demonstrated that there is a real need for an encoding
 like CESU-8 then is should be very different from UTF-8.  How does
 SCSU for example sort?

SCSU encoding is non-deterministic and its representations can't
be compared lexicographically at all (logically equal strings might
compare unequal).

Ehh, we wouldn't have the problem with CESU-8 now if Unicode hadn't
been described as a 16-bit encoding in the past. I still think that
UTF-16 was a big mistake. Too bad that it still affects people who
avoid it.

We can't change the past, but I hope that at least UTF-8 processing can
be done without treating surrogates in any special way. Surrogates are
relevant only for UTF-16; by not using UTF-16 you should be free of
surrogate issues, except by having a silly unused area in character
numbers and a silly highest character number. Please don't spread
UTF-16 madness where it doesn't belong.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: PDUTR #26 posted

2001-09-14 Thread Marcin 'Qrczak' Kowalczyk

Thu, 13 Sep 2001 12:52:04 -0700, Asmus Freytag [EMAIL PROTECTED] writes:

 UTF-32 does have the same byte order issues as UTF-16, except that
 byte order is recognizable without a BOM.

UTF-8 would be used for external communication almost exclusively.
Especially as it's compatible with ASCII and thus fits nicely into
existing protocols.

 Since you speak of internal processing: One software architect I
 spoke with brought this to a nice point: With UTF-16 I can put twice
 the data in my in-memory hash table and have *on average* the same
 1:1 character code:code point characteristics for processing. That's
 a win-win.

Only if you manage to process characters above U+FFFF correctly.
It's so easy to make processing efficient and wrong.

 UTF-8, while even more compressed for European data (it's 50% larger than 
 utf-16 for ideographs), uses multi-code element encoding for all but ASCII, 

But UTF-16 also uses multi-code element encoding! For program
complexity it doesn't matter how often it occurs if variable-length
encoding has to be handled anyway. You can't take a character from
a string by random index in either case for example.
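
For example (a quick check in Python):

ch = "\U00010000"                  # first code point above U+FFFF
units = ch.encode("utf-16-be")
print(len(units) // 2)             # 2 code units: a surrogate pair
print(units.hex())                 # d800dc00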

 Since most operations are perforce exposed to its variable length,
 unlike UTF-16 processing, which can be optimized for the much more
 frequent 1-unit case,

How optimized? By managing a flag when all characters fit under U+10000
and using separate routines for these cases? It's yet more efficient
to forget about UTF and store characters in 8, 16 or 32 bits, whatever
is the first which fits. Forget about surrogates. It's simpler.

 utf-8 cannot as readily be used as internal format.

It's as easy as UTF-16. Unless you want a broken implementation which
treats surrogates as pairs of characters. It's as broken as treating
multibyte sequences of UTF-8 as separate characters.

 Unicode limited to UTF-8 and UTF-32 would be a lot less attractive
 and you would not have seen it implemented in Windows, Office
 and other high volume platforms as early and as widespread as it
 has been.

I don't use Windows. I use UTF-8 much more often than UTF-16
(but still rarely).

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: PDUTR #26 posted

2001-09-13 Thread Marcin 'Qrczak' Kowalczyk

Wed, 12 Sep 2001 11:08:41 -0700, Julie Doll Allen [EMAIL PROTECTED] writes:

 Proposed Draft Unicode Technical Report #26: Compatibility Encoding
 Scheme for UTF-16: 8-Bit (CESU-8) is now available at:
 http://www.unicode.org/unicode/reports/tr26/

IMHO Unicode would have been a better standard if UTF-16
hadn't existed. Just UTF-8 and UTF-32, code points in the range
U+..7FFF, no surrogates, no confusion about "how many bits is
Unicode", an ASCII-compatible encoding in most external transmissions,
uniform width for internal processing, and practically no byte
ordering issues. Much simpler.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: [OT] o-circumflex

2001-09-10 Thread Marcin 'Qrczak' Kowalczyk

Mon, 10 Sep 2001 10:47:48 +0200, Marco Cimarosti [EMAIL PROTECTED] writes:

 It's as weird as some Italian names for German cities: Aquisgrana
 for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di
 Baviera) for Mnchen.

Interesting that Polish names of these cities are more like Italian
than German: Akwizgran, Augsburg, Moguncja, Monachium.

Ko/benhavn is Kopenhaga, again more like other foreign forms than
Danish.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: Nonsense in http://www.unicode.org/Public/PROGRAMS/CVTUTF/CVTUTF.C?

2001-08-25 Thread Marcin 'Qrczak' Kowalczyk

Wed, 22 Aug 2001 15:59:15 -0700, Michael (michka) Kaplan [EMAIL PROTECTED] writes:

 Functions ConvertUCS4toUTF8 and ConvertUTF8toUCS4 use surrogates
 in UCS4. In particular ConvertUTF8toUCS4 converts a character above
 U+FFFF into two UCS4 words. Why is this absurd there?!
 
 UCS-4 has no knowledge of surrogate code points or their significance; it is
 a purely algorithmic conversion. Not sure why the results would be so
 surprising, given this?

I don't understand. I'm talking about characters above U+FFFF, not
about characters from the range U+D800..DFFF. They are represented
as themselves in UCS-4. But the said routine represents them as pairs
of surrogates.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: COMMERCIAL AT

2001-07-15 Thread Marcin 'Qrczak' Kowalczyk

Sat, 14 Jul 2001 11:51:29 +0100, Michael Everson [EMAIL PROTECTED] writes:

 References to animals are the most common.  Germans, Dutch, Finns,
 Hungarians, Poles and South Africans see it as a "monkey tail".

Indeed it's commonly called "monkey" in Polish (in parallel with "at"),
but some call it "elephant's ear".

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-13 Thread Marcin 'Qrczak' Kowalczyk

Fri, 13 Jul 2001 03:01:10 EDT, [EMAIL PROTECTED] [EMAIL PROTECTED] writes:

 Unfortunately, you don't hear much about SCSU, and in particular
 the Unicode Consortium doesn't really seem to promote it much
 (although they may be trying to avoid the "too many UTF's" syndrome).

SCSU doesn't look very nice for me. The idea is OK but it's just
too complicated. Various proposals of encoding differences or XORs
between consecutive characters are IMHO technically better: much
simpler to implement and work as well.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: Terms constructed script, invented script (was: FW: Re: Shavian)

2001-07-11 Thread Marcin 'Qrczak' Kowalczyk

7 Jul 2001 11:01:18 GMT, Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes:

 I put a sample at http://qrczak.ids.net.pl/vi-001.gif

Now I put a prettier version there: with variable line width, serifs,
and by a slightly improved sizing engine (enlargement of rounded parts
to make them look the same size as straight parts happens locally
instead of only at the top and bottom of a letter), and with all
dots looking exactly the same due to rounding coordinates of their
centers to whole pixels (or whole pixels and a half, in case of an
even dot size).

I still can't have serifs on ends of slanted lines, but they happen
only in ASCII shapes, not in my script, so I'm not sure that I want
them badly enough. Serifs are really triangles, so they look like
traditional serifs only in small pixel sizes like that one.

It would be nice to be able to draw it with TeX, but I don't know
TeX well enough. I will not reimplement the whole Metafont myself
either:-)

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: Terms constructed script, invented script (was: FW: Re: Shavian)

2001-07-07 Thread Marcin 'Qrczak' Kowalczyk

In a message dated 2001-07-06 0:31:39 Pacific Daylight Time, [EMAIL PROTECTED] 
writes:
 
 I wonder: why aren't languages with simple syllabic structures
 written in hiragana? It seems to be built for them.

I have been using my own script, inspired by hiragana, for writing
Polish for 10 years. It looks very different; I only liked the idea of
having letters for consonant+vowel pairs and stretched it a bit.

I put a sample at http://qrczak.ids.net.pl/vi-001.gif (resolution
suitable for printing at 300dpi). For example the subject says:
Re: vi (Re: O wyższości znaku zachęty nad GUI), i.e. Re: vi (Re:
About the superiority of command-line prompt over GUI), which has
only 11 letters between the second Re: and GUI.

I won't dare proposing to encode it in Unicode. The number of users
is approaching two. But technically it's an interesting script with
a non-trivial rendering engine. I implemented the rendering engine
and a translator from standard Polish orthography (not perfect due to
ambiguities in our orthography - I modified the orthography a little
to resolve them). I did it to practice reading. I could only practice
writing before - it's hard to read what you just wrote, because you
remember what you wrote!

Letters are composed from core characters by the engine. There
are 35 consonants, 8 normal vowels, 1 extra vowel, joiner, and
non-joiner. They produce an unbounded number of letters.

(1) Adjacent consonants are joined up to some limit (2 is a good
choice, but there is no semantic difference here) and they are joined
with the following vowel if present (this is mandatory).

(2) A consonant+vowel pair must be split if this is a border
between a prefix and a stem or the like. Such pairs are also split
in some foreign words to force correct pronunciation (pronunciation
of a consonant sometimes depends on the following vowel and vice
versa). Non-joiner is used to encode such splitting in the stream of
core characters.

(3) The default (greedy) splitting of chunks of consonants is not
always perfect, e.g. when it would join a final part of a prefix with
the beginning of the stem. Joiner and non-joiner are used to prevent or
force splitting at certain points between consonants. Forced joining
overrides the limit of joined consonants.

(4) Any two letters can be joined by writing one above another with a
dot between. This is never required by the orthography but is sometimes
a good style, e.g. in the "od" prefix and in diphthongs. Joiner is
used to encode that.

Finally there are cases where a consonant+vowel pair is split according
to (2) and then joined according to (4). I am encoding such case with
joiner + non-joiner + joiner. I think that there is already a similar
practice in Unicode used for Arabic ligatures.

Actually I'm not using even PUA characters but an ASCII-based escaping
scheme, because I don't have an editor capable of editing text in
such a script. But simple non-joined letters put in a font with the
ability to directly edit joiners and non-joiners would be technically
workable. The meaning of a text file would then be unambiguous modulo
PUA assignment (the ASCII-based escaping is a hack).

-- 
 __(  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





Re: validity of lone surrogates

2001-07-04 Thread Marcin 'Qrczak' Kowalczyk

Tue, 3 Jul 2001 11:19:05 +0100, Michael Everson [EMAIL PROTECTED] writes:

I would be glad if the resolution allowed UTF-8 and UTF-32 encoders and
decoders to not worry about surrogates at all. Please leave surrogate
issues to UTF-16.
 
 But what if I want to put up a Web page in Etruscan?

UTF-8 and UTF-32 handle characters above U+FFFF with no problem.
I mean: forget about surrogates, i.e. about encoding those characters
as pairs of words in the range 0xD800..DFFF in encodings other than
UTF-16. For those encodings U+D800..DFFF are just code points like
others; they encode the whole contiguous range U+0000..10FFFF (maximum
would be U+7FFFFFFF if the idea of UTF-16 wasn't pushed so hard).

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: validity of lone surrogates (was Re: Unicode surroga tes: just say no!)

2001-07-03 Thread Marcin 'Qrczak' Kowalczyk

27 Jun 2001 13:38:33 +0100, Gaute B Strokkenes [EMAIL PROTECTED] writes:

 I would be indebted if any of the experts who hang out on the
 unicode list could sort out this confusion.

I would be glad if the resolution allowed UTF-8 and UTF-32 encoders and
decoders to not worry about surrogates at all. Please leave surrogate
issues to UTF-16.

It's a pity that UTF-16 doesn't encode characters up to U+FFFFF, such
that code points corresponding to lone surrogates can be encoded as
pairs of surrogates.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!)

2001-07-03 Thread Marcin 'Qrczak' Kowalczyk

Tue, 3 Jul 2001 01:50:56 -0700, Michael (michka) Kaplan [EMAIL PROTECTED] writes:

 It's a pity that UTF-16 doesn't encode characters up to U+FFFFF, such
 that code points corresponding to lone surrogates can be encoded as
 pairs of surrogates.
 
 Unfortunately, we would then be stuck with what happens when two such
 surrogate surrogates are next to each other

There is no problem with that.

Encoding: A character U+0000..D7FF or U+E000..FFFF is encoded as a single
16-bit word. A character U+D800..DFFF or U+10000..FFFFF is encoded as two
16-bit words: 0xD800 + (ch >> 10) and 0xDC00 + (ch & 0x3FF).

Decoding: A word 0x0000..D7FF or 0xE000..FFFF stands for itself.
Otherwise a word 0xD800..DBFF must be followed by a word 0xDC00..DFFF,
and the code obtained from them must be in the range U+D800..DFFF or
U+10000..FFFFF. The word stream is invalid in other cases (unpaired
surrogates or surrogates which encode a character which could be
encoded using a single word).

This gives unambiguous mapping of all code points U+0000..U+FFFFF to
single or double 16-bit words. The code space has exactly 20 bits.
Code points corresponding to surrogates could be even allocated for
real characters.
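
A Python sketch of this hypothetical scheme (not real UTF-16), checking
that the whole 20-bit space round-trips:

def encode(cp):
    if cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFF:
        return [cp]
    if 0xD800 <= cp <= 0xFFFFF:
        return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
    raise ValueError("outside the 20-bit code space")

def decode(words):
    out, i = [], 0
    while i < len(words):
        w = words[i]
        if w <= 0xD7FF or w >= 0xE000:
            out.append(w); i += 1
            continue
        if not (w <= 0xDBFF and i + 1 < len(words)
                and 0xDC00 <= words[i + 1] <= 0xDFFF):
            raise ValueError("unpaired surrogate")
        cp = ((w - 0xD800) << 10) | (words[i + 1] - 0xDC00)
        if cp < 0xD800 or 0xE000 <= cp <= 0xFFFF:
            raise ValueError("over-long pair")
        out.append(cp); i += 2
    return out

# exhaustive check over U+0000..U+FFFFF (takes a few seconds):
assert all(decode(encode(cp)) == [cp] for cp in range(0x100000))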

Unicode issues would be simpler if UTF-16 as defined today would not
exist. UTF-16 spreads its ugliness to other encoding forms and many
people think that Unicode implies 16 bits per character. There is a
tendency to use UTF-16 internally and ignore characters above U+FFFF,
treating surrogates as real characters which must come in pairs in
order to encode glyphs.

I suppose that we are stuck with UTF-16 forever, so please at least
don't spread surrogates to UTF-8 and UTF-32 which don't need to treat
the range U+D800..DFFF in any special way. It was hard enough for me
to accept that the code point space ends at a funny address U+10FFFF.
UTF-8 was so nice at 31 bits.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: How does Python Unicode treat surrogates?

2001-06-25 Thread Marcin 'Qrczak' Kowalczyk

Mon, 25 Jun 2001 07:24:28 -0700, Mark Davis [EMAIL PROTECTED] writes:

 In most people's experience, it is best to leave the low level interfaces
 with indices in terms of code units, then supply some utility routines that
 tell you information about code points.

It's yet better to work on characters instead of code units internally,
i.e. use UTF-whatever only for interaction with the external world.

Unfortunately some languages made the mistake of using only 16 bits per
character and it's not easy in them.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: How will software source code represent 21 bit unicode characters?

2001-04-17 Thread Marcin 'Qrczak' Kowalczyk

Tue, 17 Apr 2001 07:33:16 +0100, William Overington [EMAIL PROTECTED] 
writes:

 In Java source code one may currently represent a 16 bit unicode character
 by using \uhhhh where each h is any hexadecimal character.
 
 How will Java, and maybe other languages, represent 21 bit unicode
 characters?

In Haskell the character U+FFFD can be written thus (inside character
or string literal):
\65533
\xFFFD
\o177775
Such escape sequences can have any number of digits. The sequence \&
expands to the empty string and is used to protect a sequence from
the following text if it begins with a digit.

 May I, with permission, start a discussion by suggesting that \uhhhh \vhhhhh
 and \whhhhhh would be good formats.  Programmers could then enter unicode
 characters into software source code using \u and four hexadecimal characters
 or using \v and five hexadecimal characters or using \w and six hexadecimal
 characters, as convenient for any particular character.

This conflicts with the usage of \v as vertical tab.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK





Re: Latin digraph characters

2001-03-03 Thread Marcin 'Qrczak' Kowalczyk

Wed, 28 Feb 2001 13:35:17 -0800 (GMT-0800), Pierpaolo BERNARDI 
[EMAIL PROTECTED] pisze:

 The initial character of the name is transliterated as CH in English,
 TCH in French, TSCH in German, C or CI in Italian, C WITH CARON in the
 official Russian transliteration.

And CZ in Polish.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK




Re: [OT] Unicode-compatible SQL?

2001-02-05 Thread Marcin 'Qrczak' Kowalczyk

Mon, 5 Feb 2001 08:20:43 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] pisze:

 The topic came up in a UTC meeting some time ago, a "UTF-8S". The
 motivation was for performance (having a form that reproduces the
 binary order of UTF-16).

This is unfair: it slows down the conversion UTF-8 <-> UTF-32.

In both cases the speed difference is almost none, and it's a big
portability problem. I hope that such trash will not be accepted.
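
The ordering problem it addresses can be shown with two code points
(utf16 below is my own helper for standard UTF-16):

import Data.Bits (shiftR, (.&.))
import Data.Word (Word16)

-- Standard UTF-16 encoding of one code point.
utf16 :: Int -> [Word16]
utf16 ch
  | ch < 0x10000 = [fromIntegral ch]
  | otherwise    = let u = ch - 0x10000
                   in [ 0xD800 + fromIntegral (u `shiftR` 10)
                      , 0xDC00 + fromIntegral (u .&. 0x3FF) ]

main :: IO ()
main = do
  print (compare 0xE000 0x10000)                 -- LT: code point (UTF-8, UTF-32) order
  print (compare (utf16 0xE000) (utf16 0x10000)) -- GT: UTF-16 binary order

Encoding supplementary characters through their surrogate pairs would
preserve the UTF-16 order, at the price of extra surrogate arithmetic
in every conversion to or from UTF-32.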

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK




Re: Transcriptions of Unicode

2001-01-29 Thread Marcin 'Qrczak' Kowalczyk

Mon, 15 Jan 2001 13:09:47 -0800 (GMT-0800), G. Adam Stanislav [EMAIL PROTECTED] 
pisze:

 I would not be surprised if speakers of certain Slavic languages even
 changed the SPELLING to Unikod (with an acute over the [o]), as they
 have done with other imported words (such as futbal for football).

That is what we in Polish newsgroups often do, even if it's very
unofficial; I don't expect Unicode or Unikod in dictionaries soon.
But without the acute over the [o] - with the acute it would mean a
different thing. "Kod" in Polish actually means "code".

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK




Re: Transcriptions of Unicode

2001-01-29 Thread Marcin 'Qrczak' Kowalczyk

Fri, 12 Jan 2001 07:28:18 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] pisze:

 According to the references I have, the prefix "uni" is directly from
 Latin while the word "code" is through French. The Indo-European would
 have been *oi-no-kau-do ("give one strike"): *kau apparently being
 related to such English words as: hew, haggle, hoe, hag, hay, hack,
 caudad, caudal, caudate, caudex, coda, codex, codicil, coward, incus,
 and Kova (personal name: 'smith').

Oh, so my surname is related to Unicode? :-)
"Kowal" means "smith" in Polish.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK




Re: Teletext mappings

2001-01-27 Thread Marcin 'Qrczak' Kowalczyk

Sun, 21 Jan 2001 09:29:56 -0800 (GMT-0800), Rob Hardy [EMAIL PROTECTED] 
pisze:

  [Polish set] contains the line
  0x5B 0x01B5 # LATIN CAPITAL LETTER Z WITH STROKE
  should supposedly be
  0x5B 0x017B # LATIN CAPITAL LETTER Z WITH DOT ABOVE
 
 My teletext spec definitely has a Z with a stroke.

In Polish, capital Z with dot above is sometimes rendered with a stroke
instead of the dot. It's just a glyph variant; the meaning is exactly
the same. The letter should be consistently encoded as Z WITH DOT ABOVE
even if it's rendered with a stroke.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK




Re: Character properties

2000-10-25 Thread Marcin 'Qrczak' Kowalczyk

Mon, 23 Oct 2000 09:48:52 +0100, [EMAIL PROTECTED] [EMAIL PROTECTED] 
pisze:

  isDigit:    Nd
  isHexDigit: '0'..'9', 'A'..'F', 'a'..'f'
  isDecDigit: '0'..'9'
  isOctDigit: '0'..'7'
 
 The definition "Nd" is what I would have proposed for isDecDigit.

The name isDecDigit is confusing indeed... isAsciiDigit?
But it would be inconsistent with the rest.

 In general, I would consider any script's digit for decimal and octal
 numbers.  Not so for hex numbers, that are probably strictly bound
 to computer programming languages and, hence, to the Latin script.

Octal digits are bound to programming languages as much as hex digits.
I'm not sure about the names for Nd and '0'..'9', but I think that
there is no need for both an Nd-less-than-8 and a '0'..'7' predicate;
'0'..'7' alone is enough - it is used in programming languages and
formats with C-like string escapes.

 What is the meaning of isDigit? The intuitive meaning would be "Any
 kind of digit, as defined by the three specific functions below".

Any kind of digit which forms numbers in the positional decimal system,
convertible to an integer by the standard function digitToInt.

Actually digitToInt also understands 'A'..'F' and 'a'..'f' as hex
digits.
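
For example, with digitToInt from the standard Char library (module
name as in today's hierarchical libraries):

import Data.Char (digitToInt)

main :: IO ()
main = print (map digitToInt "7Ff")   -- [7,15,15]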

 So, I would say:

This does not provide any name for '0'..'9'. Nor for '0'..'9' +
'A'..'F' + 'a'..'f'. Since they are commonly used in existing formats
and programming languages, I'm afraid it's not enough. OTOH there
should not be too many variants that nobody will use.

  isUpper:Lu, Lt
  isLower:Ll
 
 I would say that "Lt" letter are *both* uppercase and lowercase.

An interesting point of view! Looks strange, but I must think about it.

Some derived tests become incorrect ("all letters are lowercase" must
no longer be checked by "all isLower" but by "not . any isUpper").

 Or alternatively, if you can (and wish to) add a new API entry:

I think that this phenomenon is too rare to deserve a separate entry.
It will not be used in practice by most people.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK




Re: Character properties

2000-10-21 Thread Marcin 'Qrczak' Kowalczyk

Wed, 11 Oct 2000 07:15:05 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] pisze:

 Here is my take on the way Unicode general categories should be
 mapped to POSIX ones.

Reiterated, here is my compilation of mapping of properties proposed
for Haskell:

isAssigned: all except Cs, Cn
isControl:  Cc, Cf
isPrint:    L*, M*, N*, P*, S*, Zs, Co
isSpace:    Zs (except U+00A0, U+202F), TAB, LF, VT, FF, CR
isGraph:    L*, M*, N*, P*, S*, Co
isPunct:    P*
isSymbol:   S*
isAlphaNum: L*, M*, N*
isDigit:    Nd
isHexDigit: '0'..'9', 'A'..'F', 'a'..'f'
isDecDigit: '0'..'9'
isOctDigit: '0'..'7'
isAlpha:    L*, M*
isUpper:    Lu, Lt
isLower:    Ll
isLatin1:   U+0000..U+00FF
isAscii:    U+0000..U+007F
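
A sketch of how a few of these mappings can be expressed, assuming a
generalCategory function as provided by today's Data.Char (at the time
this would have to be generated from UnicodeData.txt):

import Data.Char (GeneralCategory(..), generalCategory)

isSymbol', isUpper', isAlphaNum' :: Char -> Bool

isSymbol' c =      -- S*
  generalCategory c `elem` [MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol]

isUpper' c =       -- Lu, Lt
  generalCategory c `elem` [UppercaseLetter, TitlecaseLetter]

isAlphaNum' c =    -- L*, M*, N*
  generalCategory c `elem`
    [ UppercaseLetter, LowercaseLetter, TitlecaseLetter, ModifierLetter, OtherLetter
    , NonSpacingMark, SpacingCombiningMark, EnclosingMark
    , DecimalNumber, LetterNumber, OtherNumber ]

main :: IO ()
main = print (map isAlphaNum' "a1£")   -- [True,True,False]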

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK




Re: Character properties

2000-10-08 Thread Marcin 'Qrczak' Kowalczyk

Wed, 4 Oct 2000 18:48:17 -0700 (PDT), Kenneth Whistler [EMAIL PROTECTED] pisze:

 It is quite clear that many important character properties cannot
 be deduced from the General Category values in UnicodeData.txt alone.

What a pity. Especially as it does work for some properties and I
would like to avoid having too many arbitrary data sources.

  isControl  = c < ' ' || c >= '\x7F' && c <= '\x9F'
 
 This is fine if isControl is aimed at the ISO control codes associated
 with the ISO 2022 framework. However, Unicode introduces a number
 of other control functions encoded with characters, and it depends
 on what you want the property API to be sensitive to. An obvious
 example is the set of bidirectional format control characters.

The precise meaning is to be decided too.

I think that isControl should be more or less the complement of
isPrint, modulo unassigned characters and surrogates. They should
tell which characters may be output unescaped by programs like ls
(GNU ls uses isprint), or which are legal in the source of some
languages or text file formats. isPrint characters are definitely
safe for output; isControl characters are ones that should not occur
in plain text and should always be filtered out in some way before
displaying (unless handled explicitly, like \n \t \f); and for
characters in neither class it depends on the application which way
it wants to err... I'm not sure if this makes sense.
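
A rough sketch of the ls-style use I mean, with Data.Char's isPrint
standing in for whichever definition is finally chosen:

import Data.Char (isPrint)

-- Replace anything not considered printable before showing a filename.
displayName :: String -> String
displayName = map (\c -> if isPrint c then c else '?')

main :: IO ()
main = putStrLn (displayName "report\t\DEL.txt")   -- prints report??.txt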

On the linux-utf8 mailing list I've got conflicting responses about
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
Should they be plain control characters or ones in the "third" class
without clear status.

  isPrint= category is other than [Zl,Zp,Cc,Cf,Cs,Co,Cn]
 
 It probably isn't a good idea to include Co (Other, private use) in
 the exclusion set for isPrint. In most typical usage, if a user-defined
 character is assigned, it will be a printable character.

I was told the same on linux-utf8, and for Cf as well. Cf surprised
me, and I was told that programs like ls should not avoid outputting
Cf characters. Hmm...

  isSpace= one of "\t\n\r\f\v" || category is one of [Zs,Zl,Zp]
 
 You need to decide whether this is for space per se or for whitespace
 (as you have defined it).

I think whitespace - places safe to break a line into words, or
stuff allowed between identifiers in some file formats or programming
languages (those which say "any Unicode whitespace character", e.g.
Haskell source).

I was told that I should exclude
U+00A0 NO-BREAK SPACE
U+202F NARROW NO-BREAK SPACE
because of the application for line breaking. They are excluded from
is[w]space in the newest glibc.

 Depending on your system, you may have to add U+0085 as well.

I have never heard about U+0085 being used anywhere... What is it for?

  isGraph    = isPrint c && not (isSpace c)
  isPunct    = isGraph c && not (isAlphaNum c)
 
 This is closer to a definition of something like isSymbol, rather
 than isPunct.

I was told the same on linux-utf8, and thus now I have separate
isPunct and isSymbol (despite the standard C library which puts
both into is[w]punct).

  isAlphaNum = category is one of [Lu,Ll,Lt,Nd,Nl,No,Lm,Lo]
 
 This is definitely wrong. See isAlpha below, which has the same problem.

This seems to be the biggest problem (and only real problem): the
number of exceptions from any category-based predicate is large.

 The issue is that many scripts have combining characters which are
 fully alphabetic. Their General Category is typically Mc. You cannot
 omit those from an isAlpha or isAlphaNum and get the right results.

IMHO isAlpha[Num] should tell which characters form words to be
used as identifiers in various contexts. This is one of predicates
important for Haskell source, not only its library.
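
For instance, a toy identifier test in that spirit (purely
illustrative; the underscore rule is my own addition):

import Data.Char (isAlpha, isAlphaNum)

-- A word usable as an identifier: an alphabetic start, then
-- alphanumeric characters or underscores.
isIdentifier :: String -> Bool
isIdentifier []     = False
isIdentifier (c:cs) = isAlpha c && all (\x -> isAlphaNum x || x == '_') cs

main :: IO ()
main = print (map isIdentifier ["łódź", "x1", "1x"])   -- [True,True,False]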

I quickly wrote perl programs to compare PropList's Alphabetic +
Ideographic with the subsets derived from categories. Based on the
categories L* + Mc + Nl, the exception list is still large: twenty Lm
characters, two Mc characters, and 229 out of 447 Mn characters -
nearly half - are excluded! European accents are excluded, but many
marks from scripts that I don't know at all are included. It is not
obvious why characters like
U+073F SYRIAC RWAHA
U+0902 DEVANAGARI SIGN ANUSVARA
are included, and
U+0742 SYRIAC RUKKAKHA
U+093C DEVANAGARI SIGN NUKTA
are excluded.

I still don't know how to do it in an elegant way.

 Others pointed out the problem with this: isASCIIDigit  isDigit.

OK, this is fixed.

Perhaps there are important character classes that I have omitted altogether.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK




Re: Character properties

2000-09-23 Thread Marcin 'Qrczak' Kowalczyk

Fri, 22 Sep 2000 22:11:44 -0800 (GMT-0800), Roozbeh Pournader 
[EMAIL PROTECTED] pisze:

 intToDigit should look at the locale to select the preferred digit
 form, I think.

Sorry, that cannot apply to Haskell, because it's a pure functional
language: the function must work the same way every time it is called,
unless it had a different interface.

I am going to have isDigit and isAsciiDigit.

A framework for generic locale-dependent behavior is not designed yet.
The implementation of conversion between the default locale-dependent
byte encoding and Unicode will of course depend on the locale
internally - in its current design it is allowed. There is no external
interface to manual locale setting yet. Well, process-wide locale
setting is against the Haskell style, but I see no other convenient
interface...

What about definitions of other character predicates? They came
partially from my head, so may be incorrect or "incomplete".

*   *   *

What are the best ways to implement the conversion between the
default locale-dependent byte encoding and Unicode on various
platforms? Especially on the ones to which the Glasgow Haskell
Compiler is currently ported:
  * i386-unknown-{linux,freebsd,netbsd,cygwin32,mingw32}
  * sparc-sun-solaris2
  * hppa1.1-hp-hpux{9,10}

I was told on the linux-utf8 mailing list that since the assumption
that wchar_t is Unicode is non-portable, the recommended generic way
is to use iconv, and carry an iconv implementation (like libiconv)
for platforms where it's not available. I don't like this very much,
but probably it's indeed the best way on Unices, and something
Windows-specific on Windows?
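
For what it's worth, a minimal sketch of the behaviour I mean, assuming
the text-encoding support that GHC's System.IO gained much later
(hSetEncoding, localeEncoding):

import System.IO

main :: IO ()
main = do
  hSetEncoding stdin  localeEncoding   -- locale bytes -> Unicode on input
  hSetEncoding stdout localeEncoding   -- Unicode -> locale bytes on output
  line <- getLine
  putStrLn line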

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK




Re: Character properties

2000-09-22 Thread Marcin 'Qrczak' Kowalczyk

Thu, 21 Sep 2000 23:55:24 +0330 (IRT), Roozbeh Pournader [EMAIL PROTECTED] 
pisze:

  isDigit intentionally recognizes ASCII digits only. IMHO it's more
  often needed and this is what the Haskell 98 Report says. (But I
  don't follow the report in some other cases.)
 
 Would you please give me some URL?

http://www.haskell.org/definition/
The Haskell 98 Library Report, module Char.

 I disagree with the isDigit case, simply because my main language,
 Persian, uses alternate digits when written.

Do they form numbers in the same way as ASCII digits?

Does Unicode character database provide a way to tell which digits
form numbers in this way (decimal, "big Endian")?

Do you think that they (and digits from other languages) should
be recognized as numbers in sources for programming languages that
generally accept foreign letters in identifiers? (I don't know what
Haskell gurus would say for that.)

What about isOctDigit and isHexDigit?

Haskell provides digitToInt and intToDigit which currently deal with
ASCII digits and hexadecimal "digits" A..F a..f. If isDigit accepted
foreign digits, it would make sense to extend digitToInt to convert
them too. But obviously not intToDigit.

BTW. For using foreign alphabets in identifiers, Haskell divides
identifiers into two classes based on the case of the first letter,
similarly to Prolog, SML, OCaml, Clean. It is a problem for alphabets
without cases. I'm not sure what should be done with it. Haskell98
says that letters which are not lowercase should be considered
uppercase. I don't agree with it and my library extension/change
proposal allows characters which are isAlpha but neither isLower nor
isUpper. When carried to Haskell sources, it's not obvious how to
classify identifiers starting with these letters.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK




Character properties

2000-09-21 Thread Marcin 'Qrczak' Kowalczyk

I am trying to improve character properties handling in the language
Haskell. What should the following functions return, i.e. what is the
most standard/natural/preferred mapping between Unicode character
categories and predicates like isalpha etc.? What else should be
provided? Here are the definitions that I use currently:

isControl  = c < ' ' || c >= '\x7F' && c <= '\x9F'
isPrint    = category is other than [Zl,Zp,Cc,Cf,Cs,Co,Cn]
isSpace    = one of "\t\n\r\f\v" || category is one of [Zs,Zl,Zp]
isGraph    = isPrint c && not (isSpace c)
isPunct    = isGraph c && not (isAlphaNum c)
isAlphaNum = category is one of [Lu,Ll,Lt,Nd,Nl,No,Lm,Lo]
isHexDigit = isDigit c || c >= 'A' && c <= 'F' || c >= 'a' && c <= 'f'
isDigit    = c >= '0' && c <= '9'
isOctDigit = c >= '0' && c <= '7'
isAlpha    = category is one of [Lu,Ll,Lt,Lm,Lo]
isUpper    = category is one of [Lu,Lt]
isLower    = category is Ll
isLatin1   = c <= '\xFF'
isAscii    = c < '\x80'

isDigit intentionally recognizes ASCII digits only. IMHO it's more
often needed and this is what the Haskell 98 Report says. (But I
don't follow the report in some other cases.)

Titlecase could be handled too. Even then I think that isUpper should
be True for titlecase letters (so it's usable for testing if the first
letter of a word is uppercase), and there should be a separate function
for category Lu only (for testing if all characters are uppercase).
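
A sketch of that separate test, again assuming a generalCategory
lookup:

import Data.Char (GeneralCategory(UppercaseLetter), generalCategory)

-- True only for category Lu; isUpper would also accept titlecase (Lt).
isUppercaseOnly :: Char -> Bool
isUppercaseOnly c = generalCategory c == UppercaseLetter

main :: IO ()
main = print (map isUppercaseOnly "A\x01C5")   -- [True,False]; U+01C5 is titlecase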

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTPCZA
QRCZAK