Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Chad Fennell
A classic general overview (on the topic of what the heck ARE
character sets???):

http://www.joelonsoftware.com/articles/Unicode.html



On Wed, Dec 16, 2009 at 11:02 AM, Ken Irwin kir...@wittenberg.edu wrote:
 Hi all,

 I'm looking for a good source to help me understand character sets and how to 
 use them. I pretty much know nothing about this - the whole world of Unicode, 
 ASCII, octal, UTF-8, etc. is baffling to me.

 My immediate issue is that I think I need to integrate data from a variety of 
 character sets into one MySQL table - I expect I need some way to convert 
 from one to another, but I don't really even know how to tell which data are 
 in which format.

 Our homegrown journal list (akin to SerialsSolutions) includes data ingested 
 from publishers, vendors, the library catalog (III), etc. When I look at the 
 data in emacs, some of it renders like this:
  Revista de Oncolog\303\255a                  [slashes-and-digits instead of 
 diacritics]
 And other data looks more like:
  Revista de Música Latinoamericana    [weird characters instead of 
 diacritics]

 My MySQL table is currently set up with the collation set to utf8_bin, and 
 the titles from the second category (weird characters display in emacs) 
 render properly when the database data is output to a web browser. The 
 data from the former example (\###) renders as an "I don't know what 
 character this is" placeholder in Firefox and IE.

 So, can someone please point me toward any or all of the following?

 · A good primer for understanding all of this stuff

 · A method for converting all of my data to the same character set so 
 it plays nicely in the database

 · The names of which character-sets I might be working with here

 Many thanks!

 Ken



Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Hagedon, Mike
This is probably one place to start:

http://www.joelonsoftware.com/articles/Unicode.html

Mike



Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Jonathan Rochkind
So, character encodings are really confusing, even for those who have 
dealt with them before. I'm not sure if there is a good 'dealing with 
character encodings for dummies' book, but if there is, I think I could 
use it too!


But from your case, I can say: ideally your source records are in a 
_known_ character set.  Either they are in a format that is documented 
somewhere as always using one particular encoding, or they are in a 
format that specifies exactly what the encoding is. You didn't mention 
exactly where your data is coming from.


For instance, MARC data is (legally) always in either UTF-8 or MARC-8.  
And there's a byte somewhere in the MARC header that specifies which one.
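
As a rough illustration of that check: in MARC 21 the byte in question is position 09 of the 24-character record leader, where a blank means MARC-8 and "a" means UCS/Unicode (UTF-8 in practice). The helper name below is made up for this sketch, and it simply trusts the leader without verifying it against the data.

```python
def marc_declared_encoding(leader: str) -> str:
    """Return the encoding a MARC 21 leader claims (hypothetical helper)."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is exactly 24 characters long")
    flag = leader[9]  # leader position 09: character coding scheme
    if flag == "a":
        return "UTF-8"
    if flag == " ":
        return "MARC-8"
    return "unknown"


# Two example leaders that differ only in position 09.
print(marc_declared_encoding("00714cam a2200205 a 4500"))  # UTF-8
print(marc_declared_encoding("00714cam  2200205 a 4500"))  # MARC-8
```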


Assuming that byte is set properly.  If you really don't have any 
'metadata' specifying what character encoding your data is in, and you 
have to guess from the data itself... that's not good. There's no 
foolproof way to do this; it's going to rely on heuristics. So you'd 
probably want to first narrow down a set of possibilities it could be 
encoded in, and then look around on the web for heuristic algorithms to 
try and guess from among that set.


Or you could just try assuming everything is UTF-8, and see if it 
works.  Your examples look like they _could_ be UTF-8, hard to say.
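
One way to run the "assume UTF-8 and see if it works" test is a strict decode, which fails loudly on byte sequences that are not valid UTF-8. The sketch below uses the two octal escapes from the emacs display (\303\255); the function name is illustrative only. A clean decode is evidence rather than proof, since pure-ASCII text is valid in many encodings.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8 (strict mode)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# emacs shows undecoded bytes as octal escapes: \303\255 is 0xC3 0xAD.
title = b"Revista de Oncolog\303\255a"
print(looks_like_utf8(title))                            # True
print(title.decode("utf-8") == "Revista de Oncología")   # True

# The same word stored as Latin-1 is not valid UTF-8.
print(looks_like_utf8("Oncología".encode("latin-1")))    # False
```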


Because once you do figure out what everything is, I can recommend with 
confidence that what you want to do is translate EVERYTHING into UTF-8 
in your database.  Try to do all UTF-8 all the time, and it will save 
you a world of headaches. And once you know what something is, you 
should be able to find a tool to translate it to UTF-8.


Hope this helps somewhat get you started thinking about the questions to 
ask. Character encoding issues are definitely confusing. Which is why 
the more UTF-8 the better, just get everything into UTF-8 and don't look 
back.


Jonathan



Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Nate Vack
On Wed, Dec 16, 2009 at 11:24 AM, Walker, David dwal...@calstate.edu wrote:

 If you're looking to convert that data to UTF-8 (which I assume you would), 
 then your best friend is a program from Index Data called yaz-marcdump, which 
 comes with the Yaz toolkit.  It runs on Linux and Windows, and can be invoked 
 from the command line or from scripts to quickly and painlessly convert your 
 catalog data into UTF-8.

Do keep in mind that if you've got a *mix* of character encodings in
your database, you may have a Big Annoying Problem. Unless you know
what records are in what format, there's no general way to do a
conversion.

You can use the sweet sweet python 'chardet' library to get a good
idea of what encoding things are in, and maybe run things through
iconv to normalize them to UTF8.
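
A stdlib-only sketch of the "work out the encoding, then normalize to UTF-8" step follows. It does not use chardet; instead it tries a short, assumed list of candidate encodings in order and takes the first that decodes without error. The function name and the candidate list are illustrative. Because Latin-1 accepts any byte sequence, it has to come last, and anything it "matches" deserves a manual check.

```python
from typing import Sequence, Tuple

# Assumed candidate order; adjust to whatever your sources plausibly use.
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def to_utf8(raw: bytes, candidates: Sequence[str] = CANDIDATES) -> Tuple[str, bytes]:
    """Return (encoding that worked, the text re-encoded as UTF-8)."""
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # not this encoding, try the next candidate
        return name, text.encode("utf-8")
    raise ValueError("no candidate encoding fit the data")


legacy = "Revista de Música".encode("cp1252")   # simulated vendor data
name, fixed = to_utf8(legacy)
print(name)                                          # cp1252
print(fixed == "Revista de Música".encode("utf-8"))  # True
```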

Cheers,
-Nate


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Brian Stamper

When you see this kind of error:

 Revista de Música Latinoamericana[weird characters instead of  
diacritics]


if you can look at the data in a web browser it can be used as a tool to  
help you identify the correct encoding. Web browsers usually render  
character sets based on whatever appears in this line in the HTML source:


<meta http-equiv="content-type" content="text/html; charset=UTF-8">

but most browsers allow you to force a different character encoding, so if  
something is rendering incorrectly you can use browser display options to  
try to find the correct set. It would be under something like View > 
Encoding > (whatever). I find Opera to be great for this because I was  
able to add a handy button to quickly cycle through the most common  
encodings. Of course, web browsers in general might not grok MARC-8, but  
you get the idea.
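
What the browser's encoding menu does can be imitated in a few lines: take the same bytes and decode them under several labels. The sketch assumes the title is stored as UTF-8 and shows why it looks wrong when read as Latin-1; the list of encodings tried is arbitrary.

```python
raw = "Revista de Música Latinoamericana".encode("utf-8")

# Decode the same bytes under different labels, much as a browser does
# when you force a different encoding from its menu.
for label in ("utf-8", "latin-1", "cp1252"):
    print(f"{label:8} {raw.decode(label, errors='replace')}")

# Read as Latin-1, the two UTF-8 bytes of "ú" (0xC3 0xBA) become two characters.
garbled = raw.decode("latin-1")
print(garbled == "Revista de MÃºsica Latinoamericana")  # True

# The damage is reversible as long as nothing was lost along the way.
print(garbled.encode("latin-1").decode("utf-8") == "Revista de Música Latinoamericana")  # True
```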



Brian Stamper
The Ohio State University Libraries
Scholarly Resources Integration
610 Ackerman Road Rm. 5833
Columbus, OH 43202-4500


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Cory Rockliff
If you're looking for a book-length treatment, 'Unicode Explained' is 
fairly readable, and the first three chapters are about character 
encodings in general:


http://books.google.com/books?id=PcWU2yxc8WkC&printsec=frontcover


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Doran, Michael D
Hi Ken,

In an effort to better understand character sets myself, I have brought 
together some information on my website, with an emphasis on library automation 
and the internet environment:
  
  Coded Character Sets: A Technical Primer for Librarians
  http://rocky.uta.edu/doran/charsets/

Make sure you look at the Resources on the Web page, too 
(http://rocky.uta.edu/doran/charsets/resources.html).

The quote about character sets that most resonated with me was "An apparently 
simple subject which turns out to be brutally complicated."  They are 
definitely worth learning about, though!  Have fun.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
 



Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Ken Irwin
Hi all -- thanks for these fabulous replies. I'm learning a lot. 

Armed with a bit of new knowledge, I've done some tinkering. I think I've 
solved my original quandaries, and have opened new cans of worms. I have a few 
more specific questions:

1) It appears that once I switch my MySQL table over from a Latin character set 
to UTF-8, it is no longer case-insensitive (this makes sense based on what I 
learned from the Joel on Software post). All of the scripting I've done until 
now takes advantage of the case insensitivity; is there an easy way to keep 
this case insensitive while in UTF-8? 

2) Is there a good/easy way to make the database agnostic about diacritics, so 
that a search for "cafe" will also find "café"?

The answers to both of these may be "convert data to some normalized A-Z field 
that never displays," but I can only imagine that normalizing even 
most-Roman-characters-with-diacritics to plain ASCII-style characters can be 
a daunting task.

Any advice on these particulars? 

Thanks,
Ken


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Thomale, J
Ken,

Great suggestions so far--I have just one thing to add.

If you ever reach the point at which you find yourself examining code tables to 
figure out what character set something is using, you might also want to find a 
good hex editor so that you can examine your data byte by byte. Since what 
you're looking at otherwise is always going to be the data as interpreted by a 
particular program (email program, web browser, text editor), looking at it 
with a hex editor can give you a nice grounding in reality, without that extra 
layer of interpretation.

I use XVI32: http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm
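
If no hex editor is at hand, a few lines of Python give a similar byte-level view. This is only a sketch (the function name is invented here), not a replacement for a real hex editor.

```python
def hexdump(raw: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values, and printable ASCII."""
    lines = []
    for offset in range(0, len(raw), width):
        chunk = raw[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3}} {text_part}")
    return "\n".join(lines)


# The accented "i" shows up as the two bytes c3 ad, which is consistent with UTF-8.
print(hexdump("Oncología".encode("utf-8")))
```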

Jason Thomale
Metadata Librarian
Texas Tech University Libraries





Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread stuart yeates

Ken Irwin wrote:

Hi all,

I'm looking for a good source to help me understand character sets and how to 
use them. I pretty much know nothing about this - the whole world of Unicode, 
ASCII, octal, UTF-8, etc. is baffling to me.


Other people have recommended a whole lot of fabulous resources, so I 
won't cover ground they already have.


If, however, you need to deal with characters which don't qualify for 
inclusion in Unicode (or which do qualify but which haven't yet been 
assigned code points), I recommend tei:glyph:


http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-glyph.html

We use this to represent typographically interesting but short-lived 
approaches to the representation of Māori in printed works. See for 
example the 'wh' ligature (which looks like a 'vh' and is pronounced in 
modern usage like 'f') in the following text:


http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html

for the underlying TEI XML representation see:

http://www.nzetc.org/tei-source/Auc1911NgaM.xml

cheers
stuart
--
Stuart Yeates
http://www.nzetc.org/   New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread KREYCHE, MICHAEL
Ken--

You may find a reason to create a normalized stealth field, but I have a 
couple of suggestions that will probably help you avoid that scenario.

1) Read up a little on the Unicode Normalization Forms 
(http://unicode.org/reports/tr15/) and convert all your UTF-8 characters to the 
composed form (NFC). The standard for MARC data is the decomposed form (NFD), 
but this is a real pain to work with if you like things to sort nicely (at 
least in MySQL). One way to do this is in perl with Unicode::Normalize. 

2) Use a collation other than utf8_bin (here's where you lost your case 
insensitivity, I think). Try utf8_unicode_ci (ci as in case insensitive).
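
For suggestion 1, Python's standard unicodedata module can do the same job as Perl's Unicode::Normalize. A minimal sketch, assuming the text has already been decoded from UTF-8 into a Python string:

```python
import unicodedata

decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))   # 5 4
print(composed == "caf\u00e9")          # True
# The two forms look identical on screen but compare unequal until normalized.
print(decomposed == composed)           # False
```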

I wish I had written down everything I learned about this stuff, but I 
didn't--and I keep having to go back and refresh my memory.

Mike
--
Michael Kreyche
Systems Librarian / Associate Professor
Libraries and Media Services 
Kent State University
330-672-1918



Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Doran, Michael D
Hi Ken,

 1) It appears that once I switch my MySQL table over from a latin
 character set to UTF-8

My understanding is that a database character set is essentially a *label* that 
means "My intention is to put data encoded in X character set in columns/fields 
of certain string datatypes."  I'm more familiar with Oracle than with MySQL, 
but I assume they are similar in that changing the database character set from 
Latin-1 to UTF-8 doesn't change any data, just how that data is labeled.  If 
all that data *was* UTF-8 then all is well.  If some of the data was a 
different character set, you still have a problem of data of mixed character 
sets in columns of similar datatype (a database no-no).

 2) Is there a good/easy way to make the database agnostic about
 diacritics, so that a search for cafe will also find café
 
 The answers to both of these may be convert data to some normalized A-
 Z field that never displays, but I can only imagine that normalizing
 even most-Roman-characters-with-diacritics to plain ASCII-style
 characters can be daunting task.

When I hear "normalized A-Z" it strikes me as a very English-centric approach.  
Which may be fine for your particular database and situation, but it tends not 
to scale well if at some point you find yourself having to deal with non-Roman 
languages.  If you are learning about character sets, might as well aim for 
solutions that will have a wider applicability.  ;-)

As suggested by Michael Kreyche, normalization is important, both for your 
database data and also in regards to user-supplied search terms.  Unlike Mr. 
Kreyche, I would strongly advocate for NFD, the *decomposed* normalized form.  
Once both the search terms and the data are NFD, the quick-and-dirty way is to 
then strip out any combining characters and match on what remains.  This is not 
ideal, since in some languages, certain accented characters are considered to 
be different characters (and sort differently, too, if correctly localized) 
than the base, un-accented character.  However, I am guessing that will 
probably work fine for your purposes.
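
The quick-and-dirty approach described above (normalize to NFD, strip combining characters, compare what remains) might look like this in Python. The function names are invented for the sketch, and casefold() is an addition assumed here so that the match is also case-insensitive.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop combining characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring diacritics and case."""
    return strip_marks(a).casefold() == strip_marks(b).casefold()


print(strip_marks("café"))            # cafe
print(loose_equal("cafe", "Café"))    # True
print(strip_marks("Oncología"))       # Oncologia
```

Letters that have no decomposition (for example "ø") pass through unchanged, so this covers accents but not every Latin-script variant.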

Personally, I think a search feature that would list exact matches first (i.e. 
terms that match before stripping out the combining characters) and then fuzzy 
matches (i.e. terms that didn't match the first iteration but that match after 
stripping out the combining characters) is better.  But also more complex to 
implement and perhaps over-kill in this situation.
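
A sketch of that two-pass idea: exact matches first, then terms that match only after the combining characters are stripped. Names and data are illustrative, and the in-memory list stands in for whatever the database would return.

```python
import unicodedata
from typing import Iterable, List


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop combining characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ranked_search(query: str, titles: Iterable[str]) -> List[str]:
    """Exact (normalized) matches first, diacritic-insensitive matches after."""
    query_nfd = unicodedata.normalize("NFD", query)
    query_bare = strip_marks(query)
    exact, fuzzy = [], []
    for title in titles:
        if unicodedata.normalize("NFD", title) == query_nfd:
            exact.append(title)      # matches before stripping
        elif strip_marks(title) == query_bare:
            fuzzy.append(title)      # matches only after stripping
    return exact + fuzzy


print(ranked_search("cafe", ["café", "cafe", "cafés"]))  # ['cafe', 'café']
```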

Depending on which scripting language you are using (and how much trouble you 
want to go to) I may have some more (opinionated) suggestions.  If you end up 
coding some of this yourself, you may also want to investigate the Unicode 
Properties/Sub-Properties available in regular expressions.  They provide a lot 
of power and flexibility.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
 



Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Andrew Cunningham
Hi

2009/12/17 stuart yeates stuart.yea...@vuw.ac.nz:

 If, however, you need to deal with characters which don't qualify for
 inclusion in Unicode (or which do qualify but which haven't yet been
 assigned code points). I recommend tei:glyph:





 http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-glyph.html

 We use this to represent typographically interesting but short-lived
 approaches to the representation of Māori in printed works. See for example
 the 'wh' ligature (which looks like a 'vh' and is pronounced in modern usage
 like 'f') in the following text:

An interesting approach, although not the only way to address that
particular issue, and it depends on whether you want to treat it as a
ligature or as a character.

Other approaches have been to :
1) use PUA assignments, e.g. the MUFI and SIL PUA
assignments/registries as examples; or
2) use U+200D to request ligation

Both these approaches would require specifically defined or modified fonts.





-- 
Andrew Cunningham
Vicnet Research and Development Coordinator
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread stuart yeates

Andrew Cunningham wrote:


an interesting approach, although not the only way to address that
particular issue.

and depends on whether you want to treat it as a ligature or as a character.

Other approaches have been to :
1) use PUA assignments, e.g. the MUFI and SIL PUA
assignments/registries as examples; or
2) use U+200D to request ligation


On reflection, this is a subtly more general approach than our TEI one, 
since this allows new non-glyph characters to be introduced as well as 
new glyph characters.


OTOH, there are a limited number of PUA code-points, a constraint that 
the TEI approach does not suffer.


[For those unfamiliar with Unicode PUA mechanisms, see 
http://unicode.org/faq/casemap_charprop.html#8 and 
http://www.alanwood.net/unicode/private_use_area.html ]
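
To put numbers on "a limited number of PUA code-points": Unicode reserves one private-use range in the Basic Multilingual Plane and two supplementary private-use planes. The sketch below counts them and checks a character's general category with the standard unicodedata module; the helper name is made up.

```python
import unicodedata

# Private Use Area ranges defined by Unicode (inclusive).
PUA_RANGES = [
    (0xE000, 0xF8FF),       # BMP Private Use Area
    (0xF0000, 0xFFFFD),     # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),   # Supplementary Private Use Area-B (plane 16)
]


def is_private_use(ch: str) -> bool:
    """True if the character has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"


sizes = [end - start + 1 for start, end in PUA_RANGES]
print(sizes)        # [6400, 65534, 65534]
print(sum(sizes))   # 137468
print(is_private_use("\ue000"), is_private_use("A"))  # True False
```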



Both these approaches would require specifically defined or modified fonts.


In our case, when generating (X)HTML (our primary delivery formats) we 
substitute character images cut from page scans of the original 
documents. Generating the right HTML and CSS for this is non-trivial.


cheers
stuart
--
Stuart Yeates
http://www.nzetc.org/   New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository