Re: [CODE4LIB] Handling non-Unicode characters (was: Unicode persistence)

2010-05-03 Thread Jakob Voss

Hi Stuart,

These have been included because they are in widespread use in a current 
written culture. The problems I personally have are down to characters 
used by a single publisher in a handful of books more than a hundred 
years ago. Such characters are explicitly excluded from Unicode.


In the early period of the standardisation of the Māori language there
were several competing ideas of what to use as a character set. One of
those included a 'wh' ligature as a character. Several works were
printed using this ligature. This ligature does not qualify for
inclusion in Unicode.


That is a matter of discussion. If you do not call it a 'ligature', the chances 
of getting it included are higher.



To see how we handle the text, see:

http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html

The underlying representation is TEI/XML, which has a mechanism to
handle such glyphs. The things I'm still unhappy with are:

* getting reasonable results when users cut-n-paste the text/image HTML
combination to some other application
* some browsers still like line-breaking on images in the middle of words


That's interesting, and it reminds me of the treatment of mathematical 
formulae in journal titles, which mostly end up as ugly images.


In Unicode you are allowed to assign private-use characters:

http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Private_use_characters
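
For illustration, here is a minimal Python sketch of that idea; the choice of 
U+E000 and its mapping to the historical 'wh' ligature are assumptions made up 
for the example, not an established convention:

    # Hypothetical assignment: use U+E000 (the first codepoint of the BMP
    # Private Use Area) for the historical 'wh' ligature.
    WH_LIGATURE = "\uE000"

    # The private character can be stored and round-tripped like any other.
    sample = WH_LIGATURE + "akapapa"   # made-up sample word
    assert sample == sample.encode("utf-8").decode("utf-8")
    print(hex(ord(sample[0])))         # 0xe000

Anything downstream that understands UTF-8 will preserve the codepoint; only 
the rendering layer needs to know what it means.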

The U+200D ZERO WIDTH JOINER could also be used, but most browsers will 
not support it, and either way you need a font that supports your character.


http://blogs.msdn.com/michkap/archive/2006/02/15/532394.aspx
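
A similarly minimal sketch of the ZWJ idea (whether anything actually ligates 
depends entirely on the font doing the rendering):

    ZWJ = "\u200d"              # ZERO WIDTH JOINER

    # Request a ligated rendering of 'w' + 'h'; a font without such a
    # ligature simply shows the two letters unchanged.
    text = "w" + ZWJ + "hare"   # made-up sample word
    print(text)
    print(len(text))            # 6 -- the joiner is an invisible extra codepoint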

In summary: Unicode covers only a subset of all the characters that have ever 
been used for written communication, and whether a character gets included 
depends not only on objective properties but also on lobbying and other 
circumstances. The deeper you dig, the nastier Unicode gets, as with all 
complex formats and standards.


Cheers
Jakob

P.S.: Michael Kaplan's blog also contains a funny article about emoji: 
http://blogs.msdn.com/michkap/archive/2010/04/27/10002948.aspx


--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] Handling non-Unicode characters (was: Unicode persistence)

2010-05-03 Thread Jonathan Rochkind
Hmm, you could theoretically assign characters in the Unicode Private Use Area 
to the characters you need -- and then have your application replace those 
characters with small images at rendering/display time.
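
A minimal sketch of that rendering step (the codepoint, image path, and alt 
text below are invented for the example):

    import html

    # Hypothetical mapping from Private Use Area codepoints to glyph images.
    PUA_IMAGES = {
        "\uE000": ("/glyphs/wh-ligature.png", "wh"),
    }

    def render_html(text):
        """Escape text and swap private-use characters for <img> elements."""
        parts = []
        for ch in text:
            if ch in PUA_IMAGES:
                src, alt = PUA_IMAGES[ch]
                parts.append('<img src="%s" alt="%s" class="glyph"/>' % (src, alt))
            else:
                parts.append(html.escape(ch))
        return "".join(parts)

    print(render_html("\uE000akapapa"))   # made-up sample word

The stored text keeps the private character throughout; only the HTML view 
carries the image.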


This seems as clean a solution as you are likely to find. Your TEI 
solution still requires chars-as-images for these unusual chars, right?  
So with regard to copying-and-pasting, browser display, and general 
interoperability this is no better than your TEI solution, but no 
worse either -- it's pretty much the same thing. And for the chars that 
actually ARE current Unicode codepoints, it may well be better on all of 
those counts.


If any of your private chars later become non-private Unicode 
codepoints, you could always globally replace your private codepoints 
with the new standard ones.
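
That migration is a one-line substitution; the target codepoint below is 
purely a placeholder, since no such assignment exists:

    # Hypothetical: the private codepoint and its (imagined) future standard
    # replacement -- both values invented for illustration.
    MIGRATION = str.maketrans({"\uE000": "\uA7B0"})

    stored_text = "\uE000akapapa"        # made-up sample record
    migrated = stored_text.translate(MIGRATION)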


With 137K private codepoints available, you _probably_ wouldn't run 
out. I think. You could try standardizing these private codepoints 
among people in communities with contexts and needs similar to yours -- it 
looks like there are several existing efforts to document shared uses of 
private codepoints for chars that have no official Unicode 
codepoints. They are mentioned in the Wikipedia article. 

[Reading that Wikipedia article also taught me something I didn't know 
about MARC21 and Unicode -- a topic generally on top of my pile 
these days: "The MARC 21 standard uses the [Private Use Area] to 
encode East Asian characters present in MARC-8 that have no Unicode 
encoding." Who knew? ]


Jonathan


Re: [CODE4LIB] Handling non-Unicode characters (was: Unicode persistence)

2010-05-02 Thread stuart yeates

Jakob Voss wrote:

Eric Hellman wrote:


May I just add here that of all the things we've talked about in
these threads, perhaps the only thing that will still be in use a
hundred years from now will be Unicode. إن شاء الله


Stuart Yeates wrote:

  Sadly, yes, I agree with you on this.
 
  Do you have any idea how demotivating that is for those of us
  maintaining collections with works containing characters that don't
  qualify for inclusion?

May I just add here that Unicode is evolving too, and you can help to 
get missing characters included. One of the next updates will even 
include hundreds of icons such as a slice of pizza, a kissing couple, 
and Mount Fuji (see this zipped PDF: http://is.gd/bABl9 and 
http://en.wikipedia.org/wiki/Emoji).


Indeed.

These have been included because they are in widespread use in a current 
written culture. The problems I personally have are down to characters 
used by a single publisher in a handful of books more than a hundred 
years ago. Such characters are explicitly excluded from Unicode.


In the early period of the standardisation of the Māori language there
were several competing ideas of what to use as a character set. One of
those included a 'wh' ligature as a character. Several works were
printed using this ligature. This ligature does not qualify for
inclusion in Unicode.

To see how we handle the text, see:

http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html

The underlying representation is TEI/XML, which has a mechanism to
handle such glyphs. The things I'm still unhappy with are:

* getting reasonable results when users cut-n-paste the text/image HTML
combination to some other application
* some browsers still like line-breaking on images in the middle of words

cheers
stuart
--
Stuart Yeates
http://www.nzetc.org/   New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository