[CODE4LIB] a note on MARC8 to UTF8 transcoding: Character references

2013-11-05 Thread Jonathan Rochkind
Do you sometimes deal with MARC in the MARC8 character encoding?  Do 
you deal with software that converts from MARC8 to UTF8?


Maybe sometimes you've seen weird escape sequences that look like HTML 
or XML character references, like, say, &#x200F;.


You, like me, might wonder what the heck that is about -- is it 
cataloger error, where a cataloger manually entered this by mistake? 
Or is it a software error, where some software accidentally stuck this 
in at some point in the pipeline?


You can't, after all, just put HTML/XML character references wherever 
you want -- there's no reason &#x200F; would mean anything other than 
&, #, x, 2, etc, when embedded in MARC ISO 2709 binary, right?


Wrong, it turns out!

There is actually a standard that says you _can_ embed XML/HTML-style 
character references in MARC8, for characters that can't otherwise be 
represented in MARC8. The relevant section of the LC specification is 
titled "Lossless conversion [from Unicode] to MARC-8 encoding":


http://www.loc.gov/marc/specifications/speccharconversion.html#lossless

Phew, who knew?!

Software that converts from MARC8 to UTF-8 may or may not properly 
un-escape these character references though. For instance, the Marc4J 
AnselToUnicode class, which converts from Marc8 to UTF8 (or other 
Unicode serializations), won't touch these lossless conversions (ie, 
HTML/XML character references). It leaves them alone in the output, 
as is.


yaz-marcdump also will NOT un-escape these entities when converting from 
Marc8 to UTF8.


So the system you then import your UTF8 records into will just 
display the literal HTML/XML-style character reference. It won't 
know to un-escape them either, since those literals in UTF8 really _do_ 
just mean & followed by a # followed by an x, etc. The sequence only 
means something special in HTML, or in XML -- or, it turns out, in 
MARC8, as a 'lossless character conversion'.


So, for instance, in my own Traject software that uses Marc4J to convert 
from Marc8 to UTF8 -- I'm going to have to go add another pass, that 
converts HTML/XML-character entities to actual UTF8 serializations.  Phew.


So be warned, you may need to add this to your software too.
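
A minimal sketch of what such an extra pass might look like, in Python 
rather than the Ruby/Java that Traject and Marc4J actually use. The 
function name and the regex are assumptions for illustration: the 
pattern assumes the references always take the hex form &#xXXXX; with 
4 to 6 hex digits, so check the LC specification before relying on it.

```python
import re

# Assumption: lossless-conversion references look like &#xXXXX; with
# 4 to 6 hexadecimal digits. Adjust the pattern if your data differs.
NCR_PATTERN = re.compile(r"&#x([0-9A-Fa-f]{4,6});")


def unescape_marc8_ncrs(text: str) -> str:
    """Replace &#xXXXX; references with the characters they name.

    Hypothetical helper, meant to run on a string that has already been
    transcoded from MARC8 to Unicode.
    """

    def replace(match: "re.Match[str]") -> str:
        code_point = int(match.group(1), 16)
        # Leave values that are not valid Unicode scalar values untouched
        # (beyond U+10FFFF, or in the surrogate range).
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return NCR_PATTERN.sub(replace, text)


if __name__ == "__main__":
    sample = "abc&#x200F;def"
    print(repr(unescape_marc8_ncrs(sample)))
```

Only run something like this on data you know came out of a MARC8 
conversion. In a record that was UTF8 all along, the same sequence of 
characters is just literal text and should be left as is.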


Re: [CODE4LIB] a note on MARC8 to UTF8 transcoding: Character references

2013-11-05 Thread Terry Reese
Yeah -- this has been part of the MARC standard for quite some time 
(2004?)...LC added it as a way to protect round trip ability.  MarcEdit has 
supported this for years -- it's actually one of the questions that I have to 
answer occasionally when people translate UTF8 code outside of the MARC8 
specification back to MARC8.
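
A rough sketch of that outbound direction, in Python and not taken from 
MarcEdit: any character the target repertoire can't represent is 
written as a hex character reference. The names escape_unmappable and 
ascii_only are hypothetical, and the ASCII-only test is just a stand-in 
for a real MARC8 repertoire check, which is much larger.

```python
from typing import Callable


def escape_unmappable(text: str, is_representable: Callable[[str], bool]) -> str:
    """Write characters the target encoding lacks as &#xXXXX; references.

    `is_representable` is supplied by the caller; a real one would
    consult the MARC8 code tables.
    """
    pieces = []
    for ch in text:
        if is_representable(ch):
            pieces.append(ch)
        else:
            # Uppercase hex, zero-padded to at least four digits.
            pieces.append("&#x{:04X};".format(ord(ch)))
    return "".join(pieces)


def ascii_only(ch: str) -> bool:
    # Stand-in repertoire for the example only, NOT the MARC8 repertoire.
    return ord(ch) < 0x80


if __name__ == "__main__":
    print(escape_unmappable("a\u200fb", ascii_only))
```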

--tr


Re: [CODE4LIB] a note on MARC8 to UTF8 transcoding: Character references

2013-11-05 Thread Bryan Baldus
> So be warned, you may need to add this to your software too.

One of these that may cause problems in some systems (including the ones we 
use; hopefully our customers' systems deal with it more appropriately) is the 
character used in the middle of [1], the Extended Roman alif character, 
which was changed to &#x02bc; in 2005 [2]. I only started seeing that code in 
bibliographic and authority records around April of 2013, about the 
same time as NAR n 79046204 was updated (I don't believe those were related 
events, though). Some systems aren't able to translate the &#x02bc; to the 
appropriate ' (apostrophe) character, making searching for things with that 
character more challenging.
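
For systems that build their own search keys, one possible workaround 
is to fold both the reference and the character itself to a plain 
apostrophe at index time. This is only a sketch in Python; the function 
name is hypothetical, and whether folding to an apostrophe is right for 
a given catalog is a local decision.

```python
import re

MODIFIER_APOSTROPHE = "\u02bc"  # U+02BC MODIFIER LETTER APOSTROPHE

# Matches the still-escaped form, e.g. &#x02bc; or &#x02BC;
_NCR_02BC = re.compile(r"&#x0*2bc;", re.IGNORECASE)


def fold_alif_for_search(text: str) -> str:
    """Return a search key with U+02BC (escaped or not) as an apostrophe."""
    text = _NCR_02BC.sub(MODIFIER_APOSTROPHE, text)
    return text.replace(MODIFIER_APOSTROPHE, "'")


if __name__ == "__main__":
    print(fold_alif_for_search("ab&#x02bc;cd"))
    print(fold_alif_for_search("ab\u02bccd"))
```

This would be applied to the search key only, leaving the stored record 
unchanged.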

[1] http://lccn.loc.gov/n79046204
[2] http://www.loc.gov/marc/marbi/2005/2005-05.html

Thank you,

Bryan Baldus
Senior Cataloger
Quality Books Inc.
The Best of America's Independent Presses
1-800-323-4241x402
bryan.bal...@quality-books.com