Hello, I'm in charge of preparing the conversion of a large DB-driven dynamic website to uniform use of UTF-8. There are some thorny issues I'm having difficulty deciding how to handle, so I'd be very grateful for any advice. A description of the problems follows.
The website in question is http://www.livejournal.com, a free service that lets one conveniently keep an online journal (and much more). All the software is written in Perl, is open source, and uses MySQL as the DB server. There is a large userbase (>300,000 active users), so any solution must leave the existing journal entries and other kinds of text usable. Currently the code and the site in general are completely 8-bit clean and just as completely encoding-unaware. The vast majority of users are Americans who use ASCII, and sometimes Latin-1 characters as well; however, there is also a fair number of international users who use whatever 8-bit encodings they are accustomed to and set their browsers accordingly.

The site could benefit enormously from being converted to Unicode. For just one example, one of its most attractive features is "friends views", where you see on one page all the latest entries posted by the people on your subscription list, in reverse chronological order; obviously, if you have "friends" writing in different languages, you cannot currently view their entries correctly on one page. We would also like to offer our users the ability to export their journals in XML, and other features which require knowledge of the encoding.

The modifications I'm writing will make every page on the site be built and output in UTF-8, including pages with HTML forms, so that new entries and other information submitted by users via these forms will automatically be sent by their browsers in UTF-8 and stored that way in the database (are there any gotchas to be aware of here? A sketch of what I have in mind for the forms is given below). I'm planning to use UTF-8 strings as opaque byte strings, both for storage in the database and for handling in the code (they almost never need to be formatted), i.e. I'm not planning to use Perl's native UTF-8 support or MySQL's not-yet-existent UTF-8 support. This part seems relatively easy; the main problems I'm encountering are with the existing data, which is in various 8-bit encodings I have no knowledge of and therefore cannot translate to UTF-8 automatically in the database.

Almost all of the text stored in the database consists of journal entries and comments on journal entries. For those, I plan to add a new column to the appropriate tables which marks whether the entry or comment in question is in UTF-8 or not. If not, the code that needs to display the text will check the user's properties for a new "default encoding" property users will be able to set in their profiles; if there is a default encoding, the code will translate the text to UTF-8 on the fly, and if there is none, the code will refuse to display the text (unless it's pure ASCII). A rough sketch of this display logic is also given below. This seems to take care of most of the data well enough, and leads me to my main difficulty: how to deal with the large amount of small miscellaneous text data left in the database: user names, profile information entered by the user such as a biography or an interest list, text in per-user to-do lists (another feature of the site), and so on.

There are a dozen or two places in the database where such small segments of user-entered text are stored, and all of them are currently encoding-unaware 8-bit text. I can't handle them by adding a new column for each kind of data to mark whether it's UTF-8, as I'm doing for actual journal entries; that would seriously bloat the database and complicate the code. I can, I guess, try to translate them to UTF-8 on the fly using the user's default encoding, but I still need a way to distinguish, e.g., user names written in native 8-bit encodings, which need such translation, from new user names entered after the site's conversion to UTF-8, which are already UTF-8 and need no conversion.
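To make the form question concrete: what I have in mind is simply to serve every page containing a form as UTF-8 (see the last question in this mail about how to declare that), and perhaps also to mark the forms themselves as a belt-and-suspenders measure; the action URL here is just a placeholder:

    print qq{<form method="post" action="/update.bml" accept-charset="utf-8">\n};

My understanding is that browsers submit form data in the encoding of the page that contains the form, so the accept-charset attribute may well be redundant, but this is exactly the sort of place where I'd like to hear about gotchas.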
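For the entries and comments themselves, the display-time logic I'm picturing is roughly the following sketch. The names (the is_utf8 flag, the default_encoding property, the function itself) are placeholders, and the Encode module is used purely to illustrate the conversion step; any 8-bit-to-UTF-8 converter (an iconv wrapper, a table-driven module) would do just as well:

    use Encode qw(from_to);

    # $text: raw bytes from the DB; $is_utf8: value of the new flag column;
    # $default_enc: the user's "default encoding" property, possibly undef
    sub text_for_display {
        my ($text, $is_utf8, $default_enc) = @_;

        return $text if $is_utf8;                    # new-style data, already UTF-8
        return $text unless $text =~ /[\x80-\xFF]/;  # pure ASCII, nothing to do

        if ($default_enc) {
            from_to($text, $default_enc, "utf-8");   # legacy bytes -> UTF-8 bytes, in place
            return $text;
        }
        return undef;  # no encoding known: the caller refuses to display the text
    }

The point is just that the per-row flag plus the per-user property gives the code enough information to pass the bytes through untouched, convert them, or refuse.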
Should I use some kind of identifying mark inside the string (the BOM, maybe)? Or should I check every string for UTF-8 well-formedness and assume that if it's non-ASCII 8-bit text, it will fail this test (a sketch of such a check is given near the end of this mail)? Or is there some other well-known solution to this problem?

Moreover, I'd like to give users the opportunity to translate such miscellaneous information to UTF-8 in the database using the encoding they say the data is in (we won't do any unprompted translation, only translation at the user's request), but I can't think of a good way to do this from the UI point of view. How can I show users their 8-bit data and say: select the encoding this data is in, and preview whether it displays correctly when translated to UTF-8 from that encoding, given that the HTML pages implementing this conversion interface should themselves be in UTF-8?

Finally, one other technical problem I'd like advice on is how to mark all the pages on the site as containing UTF-8 text: in the HTTP headers, in a <meta> tag inside the HTML HEAD section, or in both. Since the site is completely dynamic, I can do it any way I want, but I want to do the Right Thing (TM). I found some pages on the web strongly advising against meta tags, e.g. because of re-encoding proxies along the way, but those pages are very old and I don't know whether this is still something to worry about. Aside from that, someone reported to me that using just the HTTP header without the meta tag doesn't work in some browsers, but I was unable to reproduce that.
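For concreteness, the two mechanisms in question, the way our page-generation code would emit them, look like this:

    # in the HTTP response:
    print "Content-Type: text/html; charset=utf-8\r\n\r\n";

    # and/or inside <head> of the generated HTML:
    print qq{<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n};

Emitting either or both is equally easy for us, since every page goes through our own code; the question is purely which combination behaves best with real-world browsers and proxies.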
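Coming back to the well-formedness check mentioned above: the idea would be that a string containing non-ASCII bytes in some legacy 8-bit encoding is very unlikely to also happen to be well-formed UTF-8. As a sketch (this is the standard pattern describing well-formed UTF-8 byte sequences, applied here to the raw bytes):

    # returns true if $s is a well-formed UTF-8 byte string (pure ASCII included)
    sub looks_like_utf8 {
        my $s = shift;
        return $s =~ /\A(?:
              [\x00-\x7F]                       # ASCII
            | [\xC2-\xDF][\x80-\xBF]            # non-overlong 2-byte sequences
            |  \xE0[\xA0-\xBF][\x80-\xBF]       # 3-byte, excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
            |  \xED[\x80-\x9F][\x80-\xBF]       # 3-byte, excluding surrogates
            |  \xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
            |  \xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
        )*\z/x;
    }

Is relying on a test like this considered safe in practice, or is the chance of a legacy 8-bit string accidentally passing it high enough to matter?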
Many thanks in advance for any advice!

Yours,
Anatoly.

--
Anatoly Vorobey, my journal (in Russian): http://www.livejournal.com/users/avva/
[EMAIL PROTECTED]  http://pobox.com/~mellon/
"Angels can fly because they take themselves lightly" - G.K. Chesterton
