As one of the proponents of the UTF-8S proposal, I feel compelled to
respond to some of the recent comments regarding the proposal on the
unicode and unicore lists. Although there have been some good comments
about how the goals of the proposal could be accomplished without a new
encoding form, there have also been numerous arguments against UTF-8S
varying from the simply unprofessional (the WTF thread) to the blatantly
false (an encoding doesn't imply a collation). Let me address each of the
comments individually. There's been a lot of talk about the UTF-8S
proposal on both the unicode and unicore list, so please forgive me (and
notify me if you feel the need) if I have missed any of the salient points
that require a response.
--
Toby Phipps
PeopleTools Product Manager - Global Technology
PeopleSoft, Inc.
[EMAIL PROTECTED]
1. UTF-8S doesn't need to be "accepted" or "approved" by the UTC, as its
use is within a proprietary, closed system.
Nothing could be further from the truth. Just look at which companies are
pushing the proposal (Oracle, SAP, PeopleSoft). These organizations all
share the same technological issue, but are also direct competitors. We
share a common technology - that of large SQL databases, and in the case of
PeopleSoft and SAP, heterogeneity across many different SQL databases. We
need a commonly understood UTF-8 encoding that can be used as a database
encoding, an in-memory encoding and other "internal" forms, but at the same
time, passed between systems from different vendors. PeopleSoft and SAP
support a range of database platforms, including Oracle, Microsoft SQL
Server, and IBM DB2. Communication between *applications* from one vendor
to a *database* from another vendor is not a closed system.
2. An encoding form does not imply a collation
False. The most basic collation in any system is the binary order of the
codepoints in their current encoding. That's what C gives you with the
strcmp() function, what COBOL gives you with " > ", and what Java gives you
with its basic string classes. Even though the binary collation of each
Unicode transformation makes no linguistic sense, developers all over the
world make use of binary collation string comparisons to optimize code,
especially when dealing with huge volumes of data. Just looking at
PeopleSoft's tens of millions of lines of code, the great majority of our
collation-dependent comparisons (e.g. comparisons returning more information
than simple equivalence) are used for performance and optimization.
There are most definitely cases where we need a linguistic comparison, and
we have the appropriate syntax in each of our languages (except COBOL) to
deal with this. However, these cases are rare, and typically the developer
is aware that they are performing a collation whose result will be visible
to the user, and therefore needs to be in linguistic order.
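To make the divergence concrete, here is a minimal Python sketch (the two characters are my own illustrative picks, not from the original discussion) showing that the binary order of two strings flips between UTF-8 and UTF-16 once a non-BMP character is involved:

```python
# U+FF5E (FULLWIDTH TILDE, in the BMP) vs. U+10000 (a supplementary character)
a, b = "\uFF5E", "\U00010000"

# In UTF-8, byte order follows code point order: U+FF5E sorts before U+10000.
assert a.encode("utf-8") < b.encode("utf-8")          # EF BD 9E < F0 90 80 80

# In UTF-16, U+10000 is the surrogate pair D800 DC00, and D800 < FF5E,
# so the same two strings compare in the opposite order.
assert b.encode("utf-16-be") < a.encode("utf-16-be")  # D8 00 ... < FF 5E
```

A strcmp()-style memory comparison on each encoding's bytes gives exactly these two contradictory answers.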
Given the proliferation of UTF-16-based programming languages (Java,
Microsoft Win32 C/C++, increasing numbers of non-Win32 C compilers), the
combination of a UTF-16 based database client communicating with a UTF-8
based database server is common. Without UTF-8S (and UTF-32S to a lesser
extent) as a database encoding, creating a single, portable database client
in a UTF-16-based language environment that can operate against a database
backend encoded in any of the Unicode transforms would be very difficult.
Introducing an alternative database encoding along the lines of UTF-8S
would allow the same UTF-16-based client application to operate against
either a UTF-8S or UTF-16 database without change.
3. Vendors can't expect other encodings to collate the same in binary, so
why expect this of the Unicode transforms?
This is true. We can't expect most other encodings to compare the same in
binary. This often leads us to the situation where we only support servers
and clients that share the same encoding. Before we supported Unicode,
with a couple of exceptions (EBCDIC being one), this was the case at
PeopleSoft - we required our servers and clients to share the same encoding.
In reality, this wasn't a big deal for our customer base - there was very
little utility running a server in ISO 8859-2 and a client in ISO 8859-1.
Only the lower 7 bits represent common characters (and were therefore
usable), so the system may as well have been running in 7-bit ASCII. Where
this did hurt was with the CJK encodings. We don't support running a
Shift-JIS client against an EUC-JP database server. Binary collation is just
one
reason. Expansion/contraction of character lengths is another. The
implementation of Unicode across our systems fixed most of this problem.
We all changed our database column size quantities to be character-based,
not byte-based, so the character-length issue went away, and until real
surrogates appeared on the scene with Unicode 3.1, we could rely on a
common binary collation between client and server tiers.
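As an illustration of why binary order was one of the blockers (my example, not one from the original post), the same pair of Japanese characters can sort in opposite binary order in Shift-JIS and EUC-JP:

```python
# U+4E00 (the kanji "one") and U+FF71 (half-width katakana A)
kanji, kana = "\u4e00", "\uff71"

# Shift-JIS: the kanji is 88 EA, the kana is the single byte B1,
# so the kanji sorts first.
assert kanji.encode("shift_jis") < kana.encode("shift_jis")

# EUC-JP: the kanji is B0 EC, the kana is 8E B1 (SS2-prefixed),
# so the order is reversed.
assert kanji.encode("euc_jp") > kana.encode("euc_jp")
```

No amount of server-side cleverness makes a byte comparison on one side agree with a byte comparison on the other.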
4. A database should be able to provide sorted output in any collation, not
just the binary collation of its encoding
True. However, for most SQL databases (at least those that use sorted
b-tree indexes such as Oracle, Microsoft SQL Server, Sybase, DB2/UDB etc.),
it is much faster and more efficient to provide data collated in the
binary encoding of the database than in any other collation. Why?
Because column indexes are stored on-disk in a binary-sorted order. In
order to return a pre-ordered result set to a SQL query, the database
simply has to do what's known as an "index-only scan". In this case, the
values returned in the result set are read directly from the index, and the
actual data blocks don't need to be fetched.
Of course, just about every database allows the result set to be in a
collation other than the binary sort of the database's binary encoding.
There are several ways of doing this. One is to sort data in
temporarily-allocated memory. This is incredibly inefficient, not only
because significant amounts of temporary space need to be allocated and
freed, but also because the entire result set of the query has to be
processed and sorted before the first row is returned. With result sets
involving several million rows, this is a very significant overhead,
especially if the typical user only looks at the first couple of hundred.
So, some vendors allow the creation of additional indexes, sorted by a
weighted collation key of the original value. This works well in practice;
however, it still doesn't allow for "index-only scans" as in the binary
collation example, as the index stores only the numerical collation key,
and not the actual value. After fetching the row from the sorted index,
the database must then fetch the actual data from the data block.
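The two index shapes described above can be sketched in a few lines of Python (a toy model I'm supplying for illustration; the table, the values, and the case-insensitive "collation key" are all hypothetical stand-ins for a real engine's structures):

```python
# A toy "table": row id -> column value.
rows = {1: "Zebra", 2: "apple", 3: "Mango"}

# Binary index: the key *is* the value, so an ORDER BY in binary order
# can be answered from the index alone (an "index-only scan").
binary_index = sorted((value, rid) for rid, value in rows.items())
assert [v for v, _ in binary_index] == ["Mango", "Zebra", "apple"]  # 'Z' < 'a'

# Collation-key index: the key is an opaque weight (here, a toy
# case-insensitive key), so each index entry still requires a row fetch
# to recover the actual value for the result set.
coll_index = sorted((value.casefold(), rid) for rid, value in rows.items())
assert [rows[rid] for _, rid in coll_index] == ["apple", "Mango", "Zebra"]
```

The extra fetch per row in the second case is exactly the cost the binary index avoids.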
Given this architecture (which is common across many SQL database
platforms), the most efficient way of encoding the database is to use an
encoding where the binary representation of the data on-disk matches the
collation expected most often by the database's clients. In the case of a
database with many UTF-16 clients, a database encoding of UTF-16 or UTF-8S
would make sense.
5. Oracle is pushing this proposal as it makes it easier for them to
support surrogates without changing their architecture
False. Oracle already supports UTF-8S (called UTF8 in their engine for
historical reasons), true UTF-8, and UTF-16 all as core database encodings.
Oracle gains little from having the UTF-8S encoding accepted as a UTR other
than gaining a simple nomenclature to describe one of their supported
encodings. It is the large-scale users of Oracle Unicode databases such as
SAP and PeopleSoft who are strongly encouraging them to get a common
industry acceptance of the UTF-8S transformation for several reasons.
- We believe we won't be the only vendors to have the requirement of
equivalent binary sorts across different Unicode encodings. Ignoring
non-BMP characters, we have this equivalence now, and I can confidently
guess that the majority of database-based Unicode systems today aren't
using non-BMP characters in their systems, so their reliance on equivalent
binary sorting has not yet become acutely obvious.
- We need some well-known way of describing the encoding of data in the
database. This is important for discussions with our customers,
documentation and technical architecture disclosures. Without an accepted
name such as UTF-8S, we'll be forever talking about the fact that our
internal data representation is "like UTF-8, but with individually encoded
surrogate pairs". Why do people need to know what our internal database
representation is? Because we'll be speaking it over database APIs (e.g.
PeopleSoft applications to a host Oracle database). Application developers
will see it in-memory when they use our debugging tools. It may "leak"
into debug or trace files when things go wrong.
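A rough sketch of that "UTF-8 with individually encoded surrogate pairs" representation, assuming my reading of the proposal is right (the helper name is mine, and a production encoder would handle errors this toy ignores):

```python
def utf8s_encode(s: str) -> bytes:
    """Encode each UTF-16 code unit separately as if it were a BMP code
    point, so a supplementary character becomes two 3-byte sequences,
    one per surrogate (the UTF-8S idea, as I understand it)."""
    out = bytearray()
    data = s.encode("utf-16-be")
    for i in range(0, len(data), 2):
        u = int.from_bytes(data[i:i + 2], "big")  # one UTF-16 code unit
        if u < 0x80:
            out.append(u)
        elif u < 0x800:
            out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        else:  # includes the surrogates D800-DFFF, unlike true UTF-8
            out += bytes([0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
    return bytes(out)

# U+10000 (surrogate pair D800 DC00) becomes two 3-byte sequences.
assert utf8s_encode("\U00010000") == b"\xed\xa0\x80\xed\xb0\x80"
# The payoff: UTF-8S byte order matches UTF-16 binary order
# (compare with true UTF-8, where U+10000 would sort after U+FF5E).
assert utf8s_encode("\U00010000") < utf8s_encode("\uFF5E")
```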
6. The UTF-8S proposal is asking for a "quasi-standard" acceptance which we
haven't seen before
False. The Unicode Consortium publishes the Unicode Standard (TUS) and
several Unicode Standard Annexes (UAX) which comprise TUS. These are
standardized components, and share components (such as the UTF-8
transformation and the code allocations) with ISO 10646. In addition to
TUS, the Unicode Consortium publishes Unicode Technical Reports (UTR).
UTRs are intended to make life easier for implementors of TUS by providing
common techniques for character representation, encoding, collation and
more. There is absolutely no requirement for anyone to implement any
component of a UTR in order to claim compliance with TUS. They are for
guidance only.
We are proposing UTF-8S as the topic of a UTR. As such, there is no
obligation for any implementor of Unicode to support such an encoding.
There is nothing compelling the encoding to be registered in the IANA
registry or be recognized by a web browser or XML parser. All we are
asking for is that the form of such an encoding be published and
recognized, so it can be referred to and used by implementors of the
Unicode Standard who share the need for equivalent binary collation that we
have identified, a need not specific to one organization.
This is very similar to the acceptance of UTF-EBCDIC as UTR #16.
PeopleSoft is a big user of UTF-EBCDIC. We use it in our COBOL when it's
running on an EBCDIC platform. We use it in trace files and dump files on
our EBCDIC platforms. Do we expect it to be recognized in HTML? No. XML?
No. The same is true for UTF-8S.