Re: [CODE4LIB] character-sets for dummies?
A classic general overview (on the topic of what the heck ARE character sets???): http://www.joelonsoftware.com/articles/Unicode.html

On Wed, Dec 16, 2009 at 11:02 AM, Ken Irwin kir...@wittenberg.edu wrote:

Hi all,

I'm looking for a good source to help me understand character sets and how to use them. I pretty much know nothing about this - the whole world of Unicode, ASCII, octal, UTF-8, etc. is baffling to me.

My immediate issue is that I think I need to integrate data from a variety of character sets into one MySQL table - I expect I need some way to convert from one to another, but I don't really even know how to tell which data are in which format. Our homegrown journal list (akin to SerialsSolutions) includes data ingested from publishers, vendors, the library catalog (III), etc.

When I look at the data in emacs, some of it renders like this:

Revista de Oncolog\303\255a [slashes-and-digits instead of diacritics]

And other data looks more like:

Revista de Música Latinoamericana [weird characters instead of diacritics]

My MySQL table is currently set up with the collation set to utf8-bin, and the titles from the second category (weird characters display in emacs) render properly when the database data is output to a web browser. The data from the former example (\###) renders as an "I don't know what character this is" placeholder in Firefox and IE.

So, can someone please point me toward any or all of the following?

· A good primer for understanding all of this stuff
· A method for converting all of my data to the same character set so it plays nicely in the database
· The names of which character-sets I might be working with here

Many thanks!
Ken
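For what it's worth, the `\303\255` in Ken's first example is how emacs displays raw bytes it can't interpret, written in octal. Decoded as UTF-8, those bytes come out as an accented character, which suggests (though doesn't prove) that the data is already UTF-8. A quick Python sketch of the arithmetic:

```python
# emacs shows undisplayable raw bytes as octal escapes:
# \303 = 0xC3 and \255 = 0xAD, which together form the UTF-8
# encoding of U+00ED (LATIN SMALL LETTER I WITH ACUTE).
raw = bytes([0o303, 0o255])
assert raw == b"\xc3\xad"
print(raw.decode("utf-8"))  # í

# The full title: a Python str literal interprets the same octal
# escapes, and a latin-1 round trip recovers the underlying bytes.
title = "Revista de Oncolog\303\255a".encode("latin-1").decode("utf-8")
print(title)  # Revista de Oncología
```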
Re: [CODE4LIB] character-sets for dummies?
This is probably one place to start: http://www.joelonsoftware.com/articles/Unicode.html

Mike
Re: [CODE4LIB] character-sets for dummies?
So, character encodings are really confusing, even for those who have dealt with them before. I'm not sure if there is a good 'dealing with character encodings for dummies' book, but if there is, I think I could use it too! But from your case, I can say:

Ideally your source records are in a _known_ character set. Either they are in a format that is documented somewhere as always using a particular encoding, or they are in a format that specifies exactly what the encoding is. You didn't mention exactly where your data is coming from. For instance, MARC data is (legally) always in either UTF-8 or MARC-8, and there's a byte somewhere in the MARC header that specifies which one. Assuming that byte is set properly.

If you really don't have any 'metadata' specifying what character encoding your data is in, and you have to guess from the data itself... that's not good. There's no foolproof way to do this; it's going to rely on heuristics. So you'd probably want to first narrow down a set of possibilities it could be encoded in, and then look around on the web for heuristic algorithms to try and guess from among that set. Or you could just try assuming everything is UTF-8, and see if it works. Your examples look like they _could_ be UTF-8; hard to say.

Once you do figure out what everything is, I can recommend with confidence that what you want to do is translate EVERYTHING into UTF-8 in your database. Try to do all UTF-8 all the time, and it will save you a world of headaches. And once you know what something is, you should be able to find a tool to translate it to UTF-8.

Hope this helps somewhat get you started thinking about the questions to ask. Character encoding issues are definitely confusing. Which is why the more UTF-8 the better: just get everything into UTF-8 and don't look back.

Jonathan
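Jonathan's "try assuming everything is UTF-8" idea is easy to sketch in Python (the function name here is my own; note that a clean decode is suggestive rather than conclusive, since pure-ASCII data passes for almost any encoding):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: does this byte string decode cleanly as UTF-8?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Ken's \303\255 example is valid UTF-8...
print(looks_like_utf8(b"Revista de Oncolog\xc3\xada"))  # True
# ...but the same title with a single Latin-1 'í' byte is not.
print(looks_like_utf8(b"Revista de Oncolog\xeda"))      # False
```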
Re: [CODE4LIB] character-sets for dummies?
On Wed, Dec 16, 2009 at 11:24 AM, Walker, David dwal...@calstate.edu wrote:

If you're looking to convert that data to UTF-8 (which I assume you would), then your best friend is a program from Index Data called yaz-marcdump, which comes with the Yaz toolkit. It runs on Linux and Windows, and can be invoked from the command line or from scripts to quickly and painlessly convert your catalog data into UTF-8.

Do keep in mind that if you've got a *mix* of character encodings in your database, you may have a Big Annoying Problem. Unless you know what records are in what format, there's no general way to do a conversion. You can use the sweet sweet python 'chardet' library to get a good idea of what encoding things are in, and maybe run things through iconv to normalize them to UTF-8.

Cheers,
-Nate
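chardet does the statistical guessing and iconv does the conversion; a crude, stdlib-only stand-in for that detect-then-transcode pipeline might look like the sketch below. The candidate list and function name are my own illustration, and guessing by trial decode is inherently unreliable; latin-1 must come last because it accepts any byte sequence.

```python
def to_utf8(data: bytes, candidates=("utf-8", "cp1252", "latin-1")) -> bytes:
    """Decode with the first codec that succeeds, then re-encode as UTF-8.

    A crude heuristic: latin-1 never raises, so it acts as a
    catch-all and must be tried last.
    """
    for codec in candidates:
        try:
            return data.decode(codec).encode("utf-8")
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate codec matched")  # unreachable with latin-1 last

# A Latin-1/cp1252 title gets transcoded...
print(to_utf8(b"Revista de M\xfasica"))   # b'Revista de M\xc3\xbasica'
# ...while data that is already UTF-8 passes through unchanged.
print(to_utf8("Música".encode("utf-8")))  # b'M\xc3\xbasica'
```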
Re: [CODE4LIB] character-sets for dummies?
When you see these kinds of errors:

Revista de Música Latinoamericana [weird characters instead of diacritics]

if you can look at the data in a web browser, the browser can be used as a tool to help you identify the correct encoding. Web browsers usually render character sets based on whatever appears in this line in the HTML source:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

but most browsers allow you to force a different character encoding, so if something is rendering incorrectly you can use browser display options to try to find the correct set. It would be under something like View > Encoding > (whatever). I find Opera to be great for this because I was able to add a handy button to quickly cycle through the most common encodings. Of course, web browsers in general might not grok MARC-8, but you get the idea.

Brian Stamper
The Ohio State University Libraries
Scholarly Resources Integration
610 Ackerman Road Rm. 5833
Columbus, OH 43202-4500
Re: [CODE4LIB] character-sets for dummies?
If you're looking for a book-length treatment, 'Unicode Explained' is fairly readable, and the first three chapters are about character encodings in general: http://books.google.com/books?id=PcWU2yxc8WkC&printsec=frontcover
Re: [CODE4LIB] character-sets for dummies?
Hi Ken,

In an effort to better understand character sets myself, I have brought together some information on my website, with an emphasis on library automation and the internet environment:

Coded Character Sets: A Technical Primer for Librarians
http://rocky.uta.edu/doran/charsets/

Make sure you look at the "Resources on the Web" page, too (http://rocky.uta.edu/doran/charsets/resources.html).

The quote about character sets that most resonated with me was "An apparently simple subject which turns out to be brutally complicated." They are definitely worth learning about, though! Have fun.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
Re: [CODE4LIB] character-sets for dummies?
Hi all -- thanks for these fabulous replies. I'm learning a lot. Armed with a bit of new knowledge, I've done some tinkering. I think I've solved my original quandaries, and have opened new cans of worms. I have a few more specific questions:

1) It appears that once I switch my MySQL table over from a latin character set to UTF-8, it is no longer case-insensitive (this makes sense based on what I learned from the Joel on Software post). All of the scripting I've done until now takes advantage of the case insensitivity; is there an easy way to keep this case-insensitive while in UTF-8?

2) Is there a good/easy way to make the database agnostic about diacritics, so that a search for "cafe" will also find "café"?

The answers to both of these may be "convert data to some normalized A-Z field that never displays," but I can only imagine that normalizing even most-Roman-characters-with-diacritics to plain ASCII-style characters can be a daunting task.

Any advice on these particulars?

Thanks,
Ken
Re: [CODE4LIB] character-sets for dummies?
Ken,

Great suggestions so far--I have just one thing to add. If you ever reach the point at which you find yourself examining code tables to figure out what character set something is using, you might also want to find a good hex editor so that you can examine your data byte by byte. Since what you're looking at otherwise is always going to be the data as interpreted by a particular program (email program, web browser, text editor), looking at it with a hex editor can give you a nice grounding in reality, without that extra layer of interpretation.

I use XVI32: http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm

Jason Thomale
Metadata Librarian
Texas Tech University Libraries
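The same grounding-in-reality a hex editor gives can also be approximated from a scripting language. A small Python sketch showing why the byte view matters: the very same title has different bytes under different encodings, and a program only sees the bytes.

```python
# View a title the way a hex editor would, under three encodings.
title = "Música"
for codec in ("utf-8", "latin-1", "cp1252"):
    encoded = title.encode(codec)
    print(codec, encoded.hex(" "))
# utf-8 uses two bytes (c3 ba) for 'ú'; latin-1 and cp1252 use one (fa).
# Which is why identical bytes can render differently depending on the
# program -- and assumed encoding -- interpreting them.
```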
Re: [CODE4LIB] character-sets for dummies?
Ken Irwin wrote:

Hi all, I'm looking for a good source to help me understand character sets and how to use them. I pretty much know nothing about this - the whole world of Unicode, ASCII, octal, UTF-8, etc. is baffling to me.

Other people have recommended a whole lot of fabulous resources, so I won't cover ground they already have. If, however, you need to deal with characters which don't qualify for inclusion in Unicode (or which do qualify but which haven't yet been assigned code points), I recommend tei:glyph: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-glyph.html

We use this to represent typographically interesting but short-lived approaches to the representation of Māori in printed works. See for example the 'wh' ligature (which looks like a 'vh' and is pronounced in modern usage like 'f') in the following text: http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html

For the underlying TEI XML representation see: http://www.nzetc.org/tei-source/Auc1911NgaM.xml

cheers
stuart

-- Stuart Yeates
http://www.nzetc.org/ New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository
Re: [CODE4LIB] character-sets for dummies?
Ken--

You may find a reason to create a normalized stealth field, but I have a couple of suggestions that will probably help you avoid that scenario.

1) Read up a little on the Unicode Normalization Forms (http://unicode.org/reports/tr15/) and convert all your UTF-8 characters to the composed form (NFC). The standard for MARC data is the decomposed form (NFD), but this is a real pain to work with if you like things to sort nicely (at least in MySQL). One way to do this is in Perl with Unicode::Normalize.

2) Use a collation other than utf8_bin (here's where you lost your case insensitivity, I think). Try utf8_unicode_ci ("ci" as in "case insensitive").

I wish I had written down everything I learned about this stuff, but I didn't--and I keep having to go back and refresh my memory.

Mike

-- Michael Kreyche
Systems Librarian / Associate Professor
Libraries and Media Services
Kent State University
330-672-1918
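For anyone following along in Python rather than Perl, Unicode::Normalize has a direct counterpart in the standard library's unicodedata module. A sketch of the NFC/NFD distinction Mike describes:

```python
import unicodedata

# 'í' can be stored as one code point (NFC) or as 'i' plus a combining
# acute accent (NFD) -- different code point sequences, visually identical.
composed   = "Oncolog\u00eda"    # NFC: í is U+00ED
decomposed = "Oncologi\u0301a"   # NFD: 'i' followed by U+0301

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(len(composed), len(decomposed))                        # 9 10
```

Picking one form and applying it to everything going into the database is what makes string comparison and sorting behave predictably.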
Re: [CODE4LIB] character-sets for dummies?
Hi Ken,

> 1) It appears that once I switch my MySQL table over from a latin
> character set to UTF-8

My understanding is that a database character set is essentially a *label* that means "my intention is to put data encoded in X character set in columns/fields of certain string datatypes." I'm more familiar with Oracle than with MySQL, but I assume they are similar in that changing the database character set from Latin-1 to UTF-8 doesn't change any data, just how that data is labeled. If all that data *was* UTF-8, then all is well. If some of the data was in a different character set, you still have the problem of data of mixed character sets in columns of similar datatype (a database no-no).

> 2) Is there a good/easy way to make the database agnostic about
> diacritics, so that a search for "cafe" will also find "café"? The
> answers to both of these may be "convert data to some normalized A-Z
> field that never displays," but I can only imagine that normalizing
> even most-Roman-characters-with-diacritics to plain ASCII-style
> characters can be a daunting task.

When I hear "normalized A-Z" it strikes me as a very English-centric approach. Which may be fine for your particular database and situation, but it tends not to scale well if at some point you find yourself having to deal with non-Roman languages. If you are learning about character sets, might as well aim for solutions that will have a wider applicability. ;-)

As suggested by Michael Kreyche, normalization is important, both for your database data and also in regards to user-supplied search terms. Unlike Mr. Kreyche, I would strongly advocate for NFD, the *decomposed* normalized form. Once both the search terms and the data are NFD, the quick-and-dirty way is to then strip out any combining characters and match on what remains. This is not ideal, since in some languages certain accented characters are considered to be different characters (and sort differently, too, if correctly localized) than the base, un-accented character. However, I am guessing that will probably work fine for your purposes.

Personally, I think a search feature that would list exact matches first (i.e. terms that match before stripping out the combining characters) and then fuzzy matches (i.e. terms that didn't match the first iteration but that match after stripping out the combining characters) is better. But that is also more complex to implement and perhaps overkill in this situation.

Depending on which scripting language you are using (and how much trouble you want to go to) I may have some more (opinionated) suggestions. If you end up coding some of this yourself, you may also want to investigate the Unicode Properties/Sub-Properties available in regular expressions. They provide a lot of power and flexibility.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
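Michael's quick-and-dirty NFD-then-strip approach can be sketched in Python with the standard library (the function name is mine; as he notes, this deliberately conflates characters that some languages treat as distinct):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Normalize to NFD, then drop combining marks (category Mn)."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if unicodedata.category(ch) != "Mn")

print(strip_accents("café"))    # cafe
print(strip_accents("Música"))  # Musica
```

Applying the same function to both the stored titles and the user's search terms is what makes "cafe" match "café".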
Re: [CODE4LIB] character-sets for dummies?
Hi,

2009/12/17 stuart yeates stuart.yea...@vuw.ac.nz:

> If, however, you need to deal with characters which don't qualify for
> inclusion in Unicode (or which do qualify but which haven't yet been
> assigned code points), I recommend tei:glyph:
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-glyph.html
> We use this to represent typographically interesting but short-lived
> approaches to the representation of Māori in printed works. See for
> example the 'wh' ligature (which looks like a 'vh' and is pronounced
> in modern usage like 'f') in the following text:
> http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html

An interesting approach, although not the only way to address that particular issue; it depends on whether you want to treat it as a ligature or as a character. Other approaches have been to:

1) use PUA assignments, e.g. the MUFI and SIL PUA assignments/registries as examples; or
2) use U+200D to request ligation.

Both these approaches would require specifically defined or modified fonts.

-- Andrew Cunningham
Vicnet Research and Development Coordinator
State Library of Victoria
Australia
andr...@vicnet.net.au
lang.supp...@gmail.com
Re: [CODE4LIB] character-sets for dummies?
Andrew Cunningham wrote:

> Other approaches have been to:
> 1) use PUA assignments, e.g. the MUFI and SIL PUA
> assignments/registries as examples; or
> 2) use U+200D to request ligation.

On reflection, this is a subtly more general approach than our TEI one, since this allows new non-glyph characters to be introduced as well as new glyph characters. OTOH, there are a limited number of PUA code points, a constraint that the TEI approach does not suffer.

[For those unfamiliar with Unicode PUA mechanisms, see http://unicode.org/faq/casemap_charprop.html#8 and http://www.alanwood.net/unicode/private_use_area.html ]

> Both these approaches would require specifically defined or modified
> fonts.

In our case, when generating (X)HTML (our primary delivery formats) we substitute character images cut from page scans of the original documents. Generating the right HTML and CSS for this is non-trivial.

cheers
stuart

-- Stuart Yeates
http://www.nzetc.org/ New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository