Re: Displaying diacritics in a terminal vs. a browser

2004-07-09 Thread Ed Summers
On Thu, Jul 08, 2004 at 01:17:48PM -0400, Houghton,Andrew wrote:
> Unicode specifies four normalization methods, NFC, NFD, NFKC,
> and NFKD.  While RDF could have just accepted characters in
> unnormalized form, it decided to mandate that all data content
> be provided in NFC normalization form.  This has consequences
> when taking MARC-XML data to RDF.  MARC-XML uses NFD, loosely,
> as Michael Doran pointed out there are exceptions.

Ouch, is this documented somewhere? I imagine it must be, but I never
seem to have run across it before. It's probably in bold letters in the
spec :]

> I'm not announcing the availability of the code yet, but you can
> take a peek at http://staff.oclc.org/~houghtoa/repository/perl/utf-nf.pl

Thanks Andrew. I'll check it out. 

> I'm not sure how the internals of MARC::Charset work, but if it keeps
> the data in Perl's internal Unicode representation then all that you
> would need to do is call the normalize function in the Unicode::Normalize
> package.  If not then you would need to convert it first into Perl's
> internal Unicode representation, probably with the Encode package, which
> is also built into all 5.8.0 Perl distributions.

All MARC::Charset does is provide hash lookup for the LC mapping tables [1], 
and a fairly simple algorithm for reading the MARC-8 escapes and
translating to UTF-8 appropriately. One somewhat nice bonus is that the
big East Asian mapping is stored in a Berkeley DB file to save on memory--but I
guess memory is cheap these days. 

Thanks for the tip about Unicode::Normalize.  MARC::Charset already requires 
perl 5.8, so I think adding this normalization would be a good idea at
some point.
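
A minimal sketch of the two steps Andrew describes, assuming the converted
data arrives as raw UTF-8 bytes rather than as a Perl character string. The
helper name utf8_bytes_to_nfc is made up for illustration and is not part of
MARC::Charset; only the core Encode and Unicode::Normalize modules are used.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper (not part of MARC::Charset): takes raw UTF-8 bytes,
# for example the output of a MARC-8 to UTF-8 conversion, and returns
# NFC-normalized UTF-8 bytes.
sub utf8_bytes_to_nfc {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> Perl character string
    return encode('UTF-8', NFC($chars));   # compose, then back to bytes
}

# "e" followed by COMBINING ACUTE ACCENT (U+0301), as UTF-8 bytes.
my $decomposed = "e\xCC\x81";
my $composed   = utf8_bytes_to_nfc($decomposed);

printf "%d bytes in, %d bytes out\n", length($decomposed), length($composed);
# prints "3 bytes in, 2 bytes out"
```

If the data is already a character string, the decode and encode steps drop
out and a plain NFC($string) call should be enough.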

//Ed

[1] http://www.loc.gov/marc/specifications/specchartables.html



RE: Displaying diacritics in a terminal vs. a browser

2004-07-06 Thread Michael D Doran
> MARC-XML uses Unicode Normal form D, which means that the base
> character is separate from the diacritic.

I am not familiar with the MARC-XML specifications, so at the risk of
embarrassing myself: would it be correct to posit that it is not so much that
MARC-XML uses Unicode Normal form D, as that the MARC 21
UCS/Unicode environment is essentially the MARC-8 character repertoire
translated into the equivalent Unicode code points [1]?  Since the MARC-8
character repertoire relies largely on combining characters, the end result
will mostly be Unicode Normal form D.  However, there *are* exceptions.  One
example is UPPERCASE O-HOOK, which is a single character in MARC-8 (hex AC)
and therefore a precomposed character in MARC UCS/Unicode (hex 01A0) [and
therefore, I assume, in MARC-XML], even though there is a decomposed (i.e. Normal
Form D) Unicode version (hex 004F 031B) of that character.
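
The exception can be checked with a short script. This is only a sketch: the
helpers hex_points and is_nfd are names invented here, and the test string
uses LATIN CAPITAL LETTER O WITH HORN (U+01A0) as the precomposed form of the
uppercase O-hook.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: list the code points of a string as hex.
sub hex_points {
    my ($string) = @_;
    return join ' ', map { sprintf('%04X', ord($_)) } split //, $string;
}

# Hypothetical helper: a string is in NFD if decomposing it changes nothing.
sub is_nfd {
    my ($string) = @_;
    return $string eq NFD($string) ? 1 : 0;
}

my $precomposed = "\x{01A0}";     # single precomposed character
my $combining   = "Z\x{0331}";    # base letter + combining macron below

print hex_points($precomposed), ' -> ', hex_points(NFD($precomposed)), "\n";
# prints "01A0 -> 004F 031B"
print 'precomposed is NFD: ', is_nfd($precomposed), "\n";   # prints 0
print 'combining is NFD: ',   is_nfd($combining),   "\n";   # prints 1
```

A record that mixes both kinds of character is therefore neither NFC nor NFD
until it is normalized one way or the other.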

I have been trying to learn about character sets, especially with regard to
MARC and library environments, and have put some (hopefully) useful
information on the web [2].  Included is a technical primer for librarians
as well as extensive code charts/matrices for MARC character sets.  There is
also a fairly decent list of web resources [3].  Note that the PowerPoint slide
show is of limited use without the original commentary and is a huge file
because it includes embedded fonts.

[1] Coded Character Sets > A Technical Primer for Librarians > MARC Unicode
http://rocky.uta.edu/doran/charsets/unicode.html

[2] Coded Character Sets
http://rocky.uta.edu/doran/charsets/

[3] Resources on the Web: With an emphasis on library automation and the
internet
http://rocky.uta.edu/doran/charsets/resources.html

BTW, the earlier message I sent to the list had an unfinished sentence.  I
should have proofread before sending and I apologize.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-239-5368 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 




RE: Displaying diacritics in a terminal vs. a browser

2004-07-06 Thread Michael D Doran
Hi Andy,

> From: Houghton,Andrew [mailto:[EMAIL PROTECTED] 
>
> It just so happens that I have recently been converting 
> MARC-XML to RDF.  The RDF specification mandates Unicode 
> Normal form C, which means that the base character and the 
> diacritic are combined.

That's rather unfortunate, since Unicode includes the precomposed characters
largely for backward compatibility and the preferred 

> So I hacked together some Perl scripts to convert 
> Unicode NFD <-> Unicode NFC.
> 
> I was talking with a colleague, just yesterday, about whether 
> we should unleash these on the Net...  They need to be 
> cleaned up a little and need some basic documentation on how 
> to run the Perl scripts.

The W3C provides a Perl app that (I think) purports to do that [1].  I don't
know how much overlap there may be with your script, but just in case you
were not already aware of the W3C script, you may want to see if there is a
duplication of effort.

[1] "Charlint - A Character Normalization Tool" 
http://www.w3.org/International/charlint/

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-239-5368 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 



Re: Displaying diacritics in a terminal vs. a browser

2004-07-01 Thread Ed Summers
> A MARC-8 sequence places a combining diacritical mark BEFORE the letter 
> it's supposed to combine with, whereas Unicode syntax puts it AFTER the 
> letter it's supposed to combine with.  
>  
> Hence for example the letter: Ẕ
> is produced by the MARC-8 Sequence: 
> 75 5A (macron below + "Z")
> but 
> 005A 0331  ("Z" + Combining Macron below) in Unicode.
>  
> I believe if you don't account for this in your UTF-8 transformation, you 
> will get either no combining or combining with the wrong character.

Just FYI in case anyone is curious about what MARC::Charset does, to_utf8() 
will take care of repositioning the diacritics from before to after the 
character that they modify. 
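
The repositioning itself can be sketched in a few lines. This is not
MARC::Charset's actual code: marks_after_base is a made-up helper, and it
assumes every character has already been mapped to its Unicode code point
but is still in MARC-8 order (combining marks first, base letter second).

```perl
use strict;
use warnings;

# Hypothetical helper, not MARC::Charset's implementation. Input characters
# are assumed to be Unicode code points that are still in MARC-8 order.
sub marks_after_base {
    my ($string) = @_;
    # Move each run of combining marks (\p{M}) behind the character after it.
    $string =~ s/(\p{M}+)(\P{M})/$2$1/g;
    return $string;
}

my $marc_order    = "\x{0331}Z";    # combining macron below, then "Z"
my $unicode_order = marks_after_base($marc_order);

print join(' ', map { sprintf('%04X', ord($_)) } split //, $unicode_order), "\n";
# prints "005A 0331"
```

A real converter also has to handle the escape sequences and the mapping
tables, which this sketch leaves out entirely.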

//Ed


RE: Displaying diacritics in a terminal vs. a browser

2004-07-01 Thread Christopher Morgan
Jane,
 
Thanks very much for the information about Unicode and MARC-8.  I still have a lot to 
learn about the two formats! Since my MARC data is being manipulated primarily in a 
browser via a cgi script, I'll forego writing a converter for the terminal display for 
now, but I eventually plan to do that. Thanks again!
 
- Chris


From: Jacobs, Jane W [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 01, 2004 1:51 PM
To: 'Christopher Morgan'
Subject: RE: Displaying diacritics in a terminal vs. a browser



Hi Chris,

I hope my analysis is correct; I think that two problems are going on here:

1) Your terminal display is very likely not up to the "combining" aspect of combining 
diacriticals.
2) More importantly, the placement of diacritical marks differs between MARC-8 and 
Unicode:
 
A MARC-8 sequence places a combining diacritical mark BEFORE the letter it's supposed 
to combine with, whereas Unicode syntax puts it AFTER the letter it's supposed to 
combine with.  
 
Hence for example the letter: Ẕ
is produced by the MARC-8 Sequence: 
75 5A (macron below + "Z")
but 
005A 0331  ("Z" + Combining Macron below) in Unicode.
 
I believe if you don't account for this in your UTF-8 transformation, you will get 
either no combining or combining with the wrong character.
 
Hope that's useful.
JJ
 



**Views expressed by the author do not necessarily represent those of the Queens 
Library.**

Jane Jacobs
Asst. Coord., Catalog Division
Queens Borough Public Library
89-11 Merrick Blvd.
Jamaica, NY 11432

tel.: (718) 990-0804
e-mail: [EMAIL PROTECTED]
FAX. (718) 990-8566



-Original Message-
From: Christopher Morgan [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 01, 2004 10:50 AM
To: [EMAIL PROTECTED]
Subject: Displaying diacritics in a terminal vs. a browser



Hi all,

I use the $cs->to_utf8 conversion from MARC::Charset to display MARC Authority records 
in a browser, and the diacritics display properly there. But they don't display 
properly via STDOUT in my terminal window (I get two characters instead of one -- one 
with the letter and one with the accent mark). Am I doing something wrong? I'm using:

binmode (STDOUT, ":utf8");

Is there any way around this problem, or is it a limitation of terminal displays?

(I found a thread in the archives:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00280.html
that discusses a similar issue, but it didn't really answer my question).

Thanks!

-- Chris Morgan






Re: Displaying diacritics in a terminal vs. a browser

2004-07-01 Thread Ed Summers
On Thu, Jul 01, 2004 at 11:22:42AM -0400, Houghton,Andrew wrote:
> I'm not sure what MARC::Charset does internally, but MARC-8 
> defines the diacritic separate from the base character.  So 
> even using binmode(STDOUT,":utf8") will produce two characters,
> one for the base character followed by the diacritic.  If you
> want them combined then you need to combine them.

As you suggest Andy, MARC::Charset simply translates MARC-8 combining 
characters into UTF-8 combining characters.

> It just so happens that I have recently been converting MARC-XML
> to RDF.  The RDF specification mandates Unicode Normal form C,
> which means that the base character and the diacritic are 
> combined.  MARC-XML uses Unicode Normal form D, which means that 
> the base character is separate from the diacritic.  So I hacked 
> together some Perl scripts to convert Unicode NFD <-> Unicode NFC.
> The scripts require Perl 5.8.0.

Wow, I've always been under the impression that the character sets
operated the same in RDF as they do in XML proper with the 'encoding'
attribute:

 


> I was talking with a colleague, just yesterday, about whether we 
> should unleash these on the Net...  They need to be cleaned up a 
> little and need some basic documentation on how to run the Perl 
> scripts.

It would be nice to have them wrapped up with a module interface for use
in non-command-line apps. I'd be open to integrating this
functionality into MARC::Charset if you are interested.

//Ed


RE: Displaying diacritics in a terminal vs. a browser

2004-07-01 Thread Houghton,Andrew
> From: Paul Hoffman [mailto:[EMAIL PROTECTED] 
> Sent: 01 July, 2004 11:57
> Subject: Re: Displaying diacritics in a terminal vs. a browser
> 
> Unless I'm very much mistaken, Chris's code is outputting 
> UTF-8 to the terminal, not MARC-8.

> >> From: Christopher Morgan [mailto:[EMAIL PROTECTED]
> >> Sent: 01 July, 2004 10:50
> >> Subject: Displaying diacritics in a terminal vs. a browser
> >> 
> >> (I get two characters instead of one -- one with the letter 
> >> and one with the accent mark). Am I doing something wrong? 

I realized that he was outputting UTF-8, but if he started with
MARC-8 and used $cs->to_utf8 in MARC::Charset, MARC::Charset 
would most likely keep the data in Unicode Normal form D, which
is why he sees two characters.  When he views them with a browser,
the browser most likely receives the same two characters but,
depending upon which fonts are in use, renders them so that they
look as *if* they are one combined character.
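
For a terminal that does not render combining marks, one workaround is to
compose the string before printing it. The sketch below assumes the data is
already a Perl character string; compose_for_terminal is a hypothetical
helper, and the second test string (q plus acute) is chosen because it has
no precomposed form, so some marks can survive composition.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: compose where possible and report how many combining
# marks are still left for the terminal to render on its own.
sub compose_for_terminal {
    my ($string) = @_;
    my $composed = NFC($string);
    my $leftover = () = $composed =~ /\p{M}/g;   # count remaining marks
    return ($composed, $leftover);
}

binmode(STDOUT, ':encoding(UTF-8)');

for my $text ("e\x{0301}", "q\x{0301}") {
    my ($composed, $leftover) = compose_for_terminal($text);
    printf "%s: %d character(s), %d combining mark(s) left\n",
        $composed, length($composed), $leftover;
}
```

Sequences with no precomposed equivalent still reach the terminal as base
letter plus mark, so this helps only as far as the terminal's font and the
Unicode repertoire allow.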

> 
> http://mail.nl.linux.org/linux-utf8/2003-07/msg00231.html
> 

Nice reference...


Andy.

Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm


Re: Displaying diacritics in a terminal vs. a browser

2004-07-01 Thread Paul Hoffman
Unless I'm very much mistaken, Chris's code is outputting UTF-8 to
the terminal, not MARC-8.

The key is to find a terminal program that correctly displays UTF-8.
I doubt you'll have any trouble finding one -- for example, there
are at least two for Mac OS X alone (Terminal.app and iTerm).
Depending on your platform, freshmeat.net or tucows.com may be the
place to go.  This thread from the linux-utf8 list may also be
helpful (I googled for 'terminal UTF-8'):

http://mail.nl.linux.org/linux-utf8/2003-07/msg00231.html

Paul.

On Thursday, July 1, 2004, at 11:22  AM, Houghton,Andrew wrote:

>> From: Christopher Morgan [mailto:[EMAIL PROTECTED]
>> Sent: 01 July, 2004 10:50
>> Subject: Displaying diacritics in a terminal vs. a browser
>>
>> I use the $cs->to_utf8 conversion from MARC::Charset to
>> display MARC Authority records in a browser, and the
>> diacritics display properly there.
>> But they don't display properly via STDOUT in my terminal
>> window (I get two characters instead of one -- one with the
>> letter and one with the accent mark). Am I doing something
>> wrong? I'm using:
>>
>>   binmode (STDOUT, ":utf8");
>>
>> Is there any way around this problem, or is it a limitation
>> of terminal displays?
>
> I'm not sure what MARC::Charset does internally, but MARC-8
> defines the diacritic separate from the base character.  So
> even using binmode(STDOUT,":utf8") will produce two characters,
> one for the base character followed by the diacritic.  If you
> want them combined then you need to combine them.
>
> It just so happens that I have recently been converting MARC-XML
> to RDF.  The RDF specification mandates Unicode Normal form C,
> which means that the base character and the diacritic are
> combined.  MARC-XML uses Unicode Normal form D, which means that
> the base character is separate from the diacritic.  So I hacked
> together some Perl scripts to convert Unicode NFD <-> Unicode NFC.
> The scripts require Perl 5.8.0.
>
> I was talking with a colleague, just yesterday, about whether we
> should unleash these on the Net...  They need to be cleaned up a
> little and need some basic documentation on how to run the Perl
> scripts.
>
> Andy.
>
> Andrew Houghton, OCLC Online Computer Library Center, Inc.
> http://www.oclc.org/about/
> http://www.oclc.org/research/staff/houghton.htm

--
Paul Hoffman :: Taubman Medical Library :: Univ. of Michigan
[EMAIL PROTECTED] :: [EMAIL PROTECTED] :: http://www.nkuitse.com/


RE: Displaying diacritics in a terminal vs. a browser

2004-07-01 Thread Christopher Morgan
Andy,

Many thanks. I'd be interested in looking at your scripts if you do post
them!

-- Chris 

-Original Message-
From: Houghton,Andrew [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 01, 2004 10:23 AM
To: [EMAIL PROTECTED]
Subject: RE: Displaying diacritics in a terminal vs. a browser

> From: Christopher Morgan [mailto:[EMAIL PROTECTED]
> Sent: 01 July, 2004 10:50
> Subject: Displaying diacritics in a terminal vs. a browser
> 
> I use the $cs->to_utf8 conversion from MARC::Charset to display MARC 
> Authority records in a browser, and the diacritics display properly 
> there.
> But they don't display properly via STDOUT in my terminal window (I 
> get two characters instead of one -- one with the letter and one with 
> the accent mark). Am I doing something wrong? I'm using:
>  
>   binmode (STDOUT, ":utf8");
> 
> Is there any way around this problem, or is it a limitation of 
> terminal displays?

I'm not sure what MARC::Charset does internally, but MARC-8 defines the
diacritic separate from the base character.  So even using
binmode(STDOUT,":utf8") will produce two characters, one for the base
character followed by the diacritic.  If you want them combined then you
need to combine them.

It just so happens that I have recently been converting MARC-XML to RDF.
The RDF specification mandates Unicode Normal form C, which means that the
base character and the diacritic are combined.  MARC-XML uses Unicode Normal
form D, which means that the base character is separate from the diacritic.
So I hacked together some Perl scripts to convert Unicode NFD <-> Unicode
NFC.
The scripts require Perl 5.8.0.

I was talking with a colleague, just yesterday, about whether we should
unleash these on the Net...  They need to be cleaned up a little and need
some basic documentation on how to run the Perl scripts.


Andy.

Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm




RE: Displaying diacritics in a terminal vs. a browser

2004-07-01 Thread Houghton,Andrew
> From: Christopher Morgan [mailto:[EMAIL PROTECTED] 
> Sent: 01 July, 2004 10:50
> Subject: Displaying diacritics in a terminal vs. a browser
> 
> I use the $cs->to_utf8 conversion from MARC::Charset to 
> display MARC Authority records in a browser, and the 
> diacritics display properly there.
> But they don't display properly via STDOUT in my terminal 
> window (I get two characters instead of one -- one with the 
> letter and one with the accent mark). Am I doing something 
> wrong? I'm using:
>  
>   binmode (STDOUT, ":utf8");
> 
> Is there any way around this problem, or is it a limitation 
> of terminal displays? 

I'm not sure what MARC::Charset does internally, but MARC-8 
defines the diacritic separate from the base character.  So 
even using binmode(STDOUT,":utf8") will produce two characters,
one for the base character followed by the diacritic.  If you
want them combined then you need to combine them.

It just so happens that I have recently been converting MARC-XML
to RDF.  The RDF specification mandates Unicode Normal form C,
which means that the base character and the diacritic are 
combined.  MARC-XML uses Unicode Normal form D, which means that 
the base character is separate from the diacritic.  So I hacked 
together some Perl scripts to convert Unicode NFD <-> Unicode NFC.
The scripts require Perl 5.8.0.

I was talking with a colleague, just yesterday, about whether we 
should unleash these on the Net...  They need to be cleaned up a 
little and need some basic documentation on how to run the Perl 
scripts.


Andy.

Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm


Displaying diacritics in a terminal vs. a browser

2004-07-01 Thread Christopher Morgan
 
Hi all,

I use the $cs->to_utf8 conversion from MARC::Charset to display MARC
Authority records in a browser, and the diacritics display properly there.
But they don't display properly via STDOUT in my terminal window (I get two
characters instead of one -- one with the letter and one with the accent
mark). Am I doing something wrong? I'm using:
 
binmode (STDOUT, ":utf8");

Is there any way around this problem, or is it a limitation of terminal
displays? 
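
One way to narrow the question down is to list the code points that are
actually being sent, which shows whether the string holds one precomposed
character or a base letter plus a combining mark. A small diagnostic sketch,
assuming the text is a Perl character string; describe_chars is a
hypothetical helper built on the core charnames module.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical diagnostic helper: one line per character, giving its
# code point and Unicode name.
sub describe_chars {
    my ($string) = @_;
    return map {
        sprintf('U+%04X %s', ord($_), charnames::viacode(ord($_)) || '(unnamed)')
    } split //, $string;
}

# "e" followed by COMBINING ACUTE ACCENT, as a converter might produce it.
print "$_\n" for describe_chars("e\x{0301}");
# prints:
# U+0065 LATIN SMALL LETTER E
# U+0301 COMBINING ACUTE ACCENT
```

Two lines of output for what should be one accented letter suggest the data
is decomposed, in which case the display depends on how well the terminal
handles combining characters.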

(I found a thread in the archives:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00280.html 
that discusses a similar issue, but it didn't really answer my question).

Thanks!

-- Chris Morgan