On Thursday 06 May 2010 21:16:03 David Mintz wrote:
> I don't really have a good understanding of issues around character sets,
> encoding, what have you, though I am starting to work on it.
>
> My problem involves a MySQL database and accented characters such as those
> you find in Spanish and French. My web server sends a "content-type:
> text/html; charset=iso-8859-1" header and my docs have an equivalent meta
> tag. My mysql's config says
>
> default-character-set = latin1
> character_set_server = latin1
> collation_server = latin1_general_ci
Here, in MySQL's configuration files, you can permanently set utf8 as the
default character set and collation, so that new databases and tables pick it
up automatically. That takes care of storing and retrieving UTF-8 data at
this level.

>
> and my data tables "SHOW CREATE" typically look like
>
> CREATE TABLE `people` (
> `id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
> `lastname` varchar(40) COLLATE latin1_general_ci NOT NULL,
> `firstname` varchar(40) COLLATE latin1_general_ci NOT NULL,
> /* etc */
> ) ENGINE=MyISAM AUTO_INCREMENT=546 DEFAULT CHARSET=latin1
> COLLATE=latin1_general_ci

You will need to convert the existing data from latin1_* to utf8_* so that
old and new data are stored and retrieved consistently.

>
> So what's the problem? Generally there is none. Characters like ó and ñ
> render correctly. The snag I am hitting now is writing a regular expression
> to whitelist the characters I can accept in proper names. I would think
> that the regex
>
> /^[-a-zA-Z\xC0-\xFF ']+$/
>
> would test for anything that isn't a "letter" in most western european
> languages, or a space, or an apostrophe. But it is returning true (meaning
> yes there is an illegal character) in the name Barceló, where false is what
> I would like to hear.

The biggest problem with UTF-8 data is text processing (sorting, searching,
validation and so on), which is why full UTF-8 support is still lacking in
many languages. But there are extensions such as mbstring, iconv and ICU that
can help you process UTF-8 data at a satisfactory level.

>
> Would this regex work if the data were utf-8? Should I consider converting
> everything and working in utf-8, and if so, how painful is it to convert a
> MySQL database? My initial research suggests that it isn't painless.

Yes, converting is worth considering. You will also need to change the meta
information in your web pages, so that the browser and the server are told
your text is UTF-8 encoded and not iso-8859-1. Finally, your editor must be
set to save your code and content as UTF-8 only.
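
To make the regex question above concrete, here is a minimal sketch of why a
byte range like \xC0-\xFF depends on the encoding of the input. It is written
in Python only so the bytes are easy to see; PHP's preg functions without the
/u modifier also match byte by byte, so the same reasoning should carry over.
The names BYTE_PATTERN and TEXT_PATTERN are mine, not part of any library,
and the character-based pattern is just one possible alternative.

```python
import re

# Byte-oriented pattern, mirroring the regex quoted above.
BYTE_PATTERN = re.compile(rb"^[-a-zA-Z\xC0-\xFF ']+$")

# Character-oriented alternative (an assumption, not the original poster's
# code): any Unicode letter, plus hyphen, apostrophe and space.
# [^\W\d_] is a common Python idiom for "a letter".
TEXT_PATTERN = re.compile(r"^(?:[^\W\d_]|[-' ])+$")

name = "Barcel\u00f3"  # "Barceló"

latin1_bytes = name.encode("latin-1")  # ó becomes the single byte 0xF3
utf8_bytes = name.encode("utf-8")      # ó becomes the two bytes 0xC3 0xB3

# 0xF3 falls inside \xC0-\xFF, so the latin1 bytes are accepted.
print(bool(BYTE_PATTERN.match(latin1_bytes)))

# 0xB3 falls outside \xC0-\xFF, so the UTF-8 bytes are rejected.
print(bool(BYTE_PATTERN.match(utf8_bytes)))

# Matching on decoded characters does not depend on the byte encoding.
print(bool(TEXT_PATTERN.match(name)))
```

If the name reaching the regex is actually UTF-8 (for example because the
form or the script file is UTF-8), the byte range rejects it even though the
database is latin1. Note also that in latin1 the range \xC0-\xFF includes
0xD7 (×) and 0xF7 (÷), which are not letters.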
I don't think you need to change anything else at the web server, OS or HTTP
level, apart from the charset announced in the Content-Type header you
mentioned, since nowadays they all handle UTF-8 data natively.

Thanks

Anirudh Zala
_______________________________________________
New York PHP Users Group Community Talk Mailing List
http://lists.nyphp.org/mailman/listinfo/talk

http://www.nyphp.org/Show-Participation