Hello, I'm in charge of preparing the conversion of a large DB-driven dynamic website to uniform use of UTF-8. There are some thorny issues I'm having difficulty deciding how to handle, so I'd be very grateful for any advice. A description of the problems follows.
The website in question is http://www.livejournal.com, a free service that lets one conveniently keep an online journal (and much more). All the software is written in Perl, is open source, and uses MySQL as the DB server. There is a large userbase (>300,000 active users), so any solution must leave the existing journal entries and other kinds of text usable. Currently the code and the site in general are completely 8-bit clean and just as completely encoding-unaware. The vast majority of users are Americans who use ASCII, and sometimes Latin-1 characters as well; however, there is also a fair number of international users who use whatever 8-bit encodings they are accustomed to and set their browsers accordingly.

The site could benefit enormously from being converted to Unicode. For just one example, one of its most attractive features is "friends views", where you see on one page all the latest entries posted by the people on your subscription list, in reverse chronological order; obviously, if you have "friends" writing in different languages, you cannot currently view their entries correctly on one page. We would also like to offer our users the ability to export their journals in XML, and other features which require knowledge of the encoding.

The modifications I'm writing will make every page on the site be built and output in UTF-8, including pages with HTML forms, so that new entries and other information submitted by users via these forms will automatically be sent by their browsers in UTF-8 and stored that way in the database (are there any gotchas to be aware of here? A sketch of what I have in mind for the forms is given below). I'm planning to use UTF-8 strings as opaque byte strings, both for storage in the database and for handling in the code (they almost never need to be formatted), i.e. I'm not planning to use Perl's native UTF-8 support or MySQL's not-yet-existent UTF-8 support. This part seems relatively easy; the main problems I'm encountering are with the existing data, which is in various 8-bit encodings I have no knowledge of and therefore cannot translate to UTF-8 automatically in the database.

Almost all of the text stored in the database consists of journal entries and comments on journal entries. For those, I plan to add a new column to the appropriate tables which marks whether the entry or comment in question is in UTF-8 or not. If not, the code that needs to display the text will check the user's properties for a new "default encoding" property users will be able to set in their profiles; if there is a default encoding, the code will translate the text to UTF-8 on the fly, and if there is none, the code will refuse to display the text (unless it's pure ASCII). A rough sketch of this display logic is also given below. This seems to take care of most of the data well enough, and leads me to my main difficulty: how to deal with the large amount of small miscellaneous text data left in the database: user names, profile information entered by the user such as a biography or an interest list, text in per-user to-do lists (another feature of the site), and so on.

There are a dozen or two places in the database where such small segments of user-entered text are stored, and all of them are currently encoding-unaware 8-bit text. I can't handle them by adding a new column for each kind of data to mark whether it's UTF-8, as I'm doing for actual journal entries; that would seriously bloat the database and complicate the code. I can, I guess, try to translate them to UTF-8 on the fly using the user's default encoding, but I still need a way to distinguish, e.g., user names written in native 8-bit encodings, which need such translation, from new user names entered after the site's conversion to UTF-8, which are already UTF-8 and need no conversion.
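To make the form question concrete: what I have in mind is simply to serve every page containing a form as UTF-8 (see the last question in this mail about how to declare that), and perhaps also to mark the forms themselves as a belt-and-suspenders measure; the action URL here is just a placeholder:

    print qq{<form method="post" action="/update.bml" accept-charset="utf-8">\n};

My understanding is that browsers submit form data in the encoding of the page that contains the form, so the accept-charset attribute may well be redundant, but this is exactly the sort of place where I'd like to hear about gotchas.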
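For the entries and comments themselves, the display-time logic I'm picturing is roughly the following sketch. The names (the is_utf8 flag, the default_encoding property, the function itself) are placeholders, and the Encode module is used purely to illustrate the conversion step; any 8-bit-to-UTF-8 converter (an iconv wrapper, a table-driven module) would do just as well:

    use Encode qw(from_to);

    # $text: raw bytes from the DB; $is_utf8: value of the new flag column;
    # $default_enc: the user's "default encoding" property, possibly undef
    sub text_for_display {
        my ($text, $is_utf8, $default_enc) = @_;

        return $text if $is_utf8;                    # new-style data, already UTF-8
        return $text unless $text =~ /[\x80-\xFF]/;  # pure ASCII, nothing to do

        if ($default_enc) {
            from_to($text, $default_enc, "utf-8");   # legacy bytes -> UTF-8 bytes, in place
            return $text;
        }
        return undef;  # no encoding known: the caller refuses to display the text
    }

The point is just that the per-row flag plus the per-user property gives the code enough information to pass the bytes through untouched, convert them, or refuse.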
Should I use some kind of identifying mark inside the string (the BOM, maybe)? Or should I check every string for UTF-8 well-formedness and assume that if it's non-ASCII 8-bit text, it will fail this test (a sketch of such a check is given near the end of this mail)? Or is there some other well-known solution to this problem?

Moreover, I'd like to give users the opportunity to translate such miscellaneous information to UTF-8 in the database using the encoding they say the data is in (we won't do any unprompted translation, only translation at the user's request), but I can't think of a good way to do this from the UI point of view. How can I show users their 8-bit data and say: select the encoding this data is in, and preview whether it displays correctly when translated to UTF-8 from that encoding, given that the HTML pages implementing this conversion interface should themselves be in UTF-8?

Finally, one other technical problem I'd like advice on is how to mark all the pages on the site as containing UTF-8 text: in the HTTP headers, in a <meta> tag inside the HTML HEAD section, or in both. Since the site is completely dynamic, I can do it any way I want, but I want to do the Right Thing (TM). I found some pages on the web strongly advising against meta tags, e.g. because of re-encoding proxies along the way, but those pages are very old and I don't know whether this is still something to worry about. Aside from that, someone reported to me that using just the HTTP header without the meta tag doesn't work in some browsers, but I was unable to reproduce that.
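For concreteness, the two mechanisms in question, the way our page-generation code would emit them, look like this:

    # in the HTTP response:
    print "Content-Type: text/html; charset=utf-8\r\n\r\n";

    # and/or inside <head> of the generated HTML:
    print qq{<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n};

Emitting either or both is equally easy for us, since every page goes through our own code; the question is purely which combination behaves best with real-world browsers and proxies.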
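Coming back to the well-formedness check mentioned above: the idea would be that a string containing non-ASCII bytes in some legacy 8-bit encoding is very unlikely to also happen to be well-formed UTF-8. As a sketch (this is the standard pattern describing well-formed UTF-8 byte sequences, applied here to the raw bytes):

    # returns true if $s is a well-formed UTF-8 byte string (pure ASCII included)
    sub looks_like_utf8 {
        my $s = shift;
        return $s =~ /\A(?:
              [\x00-\x7F]                       # ASCII
            | [\xC2-\xDF][\x80-\xBF]            # non-overlong 2-byte sequences
            |  \xE0[\xA0-\xBF][\x80-\xBF]       # 3-byte, excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
            |  \xED[\x80-\x9F][\x80-\xBF]       # 3-byte, excluding surrogates
            |  \xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
            |  \xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
        )*\z/x;
    }

Is relying on a test like this considered safe in practice, or is the chance of a legacy 8-bit string accidentally passing it high enough to matter?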
Many thanks in advance for any advice!

Yours,
Anatoly.

--
Anatoly Vorobey, my journal (in Russian): http://www.livejournal.com/users/avva/
[EMAIL PROTECTED]  http://pobox.com/~mellon/
"Angels can fly because they take themselves lightly" - G.K. Chesterton
