[CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind
I know how char encodings work in MARC ISO binary -- the encoding can legally be either Marc8 or UTF8 (nothing else). The encoding of a record is specified in its header. In the wild, specified encodings are frequently wrong, or data includes weird mixed encodings. Okay! But what's going on
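For reference, the "header" in a binary MARC 21 record is the 24-byte leader, and position 09 of the leader carries the character coding scheme: a blank means MARC-8 and "a" means UCS/Unicode (UTF-8 in practice). A minimal sketch of reading that flag follows; the function name and the sample leaders are made up for illustration, and the result is only what the record claims, not what it actually contains.

```python
LEADER_LENGTH = 24  # a MARC 21 leader is always 24 bytes


def declared_marc_encoding(record: bytes) -> str:
    """Return the encoding a binary MARC 21 record claims in leader/09."""
    if len(record) < LEADER_LENGTH:
        raise ValueError("record is shorter than the 24-byte leader")
    flag = record[9:10]  # leader position 09, counting from zero
    if flag == b"a":
        return "utf-8"
    if flag == b" ":
        return "marc-8"
    return "unknown"  # any other value is not defined for this position


# Hypothetical leaders, identical except for position 09.
unicode_leader = b"00714cam a2200205 a 4500"
marc8_leader = b"00714cam  2200205 a 4500"

print(declared_marc_encoding(unicode_leader))
print(declared_marc_encoding(marc8_leader))
```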

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread LeVan,Ralph
There are probably a couple of answers to that. XML rules define what character set is used. The encoding attribute on the <?xml?> header is where you find out what character set is being used. I've always gone under the assumption that if an encoding wasn't specified, then UTF-8 is in effect and
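The rule described here can be sketched roughly as below: look for an encoding pseudo-attribute in the XML declaration and fall back to UTF-8 when there is none. This is an illustration only, not a conforming XML parser. It assumes the document starts in an ASCII-compatible encoding and ignores the UTF-16 byte-order-mark case that the XML specification also allows; the function name is hypothetical.

```python
import re

# Matches an XML declaration at the very start of the document and
# captures the value of its encoding pseudo-attribute, if present.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_xml_encoding(document: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    head = document
    if head.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        head = head[3:]
    match = _XML_DECL.match(head[:200])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


print(declared_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><collection/>'))
print(declared_xml_encoding(b'<?xml version="1.0"?><collection/>'))
```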

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Kyle Banerjee
What's the legal thing to do? What's actually found 'in the wild' with MarcXML? In some cases, invalid XML. In an ideal world, the encoding should be included in the declaration. But I wouldn't trust it. kyle

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind
So what if the <?xml?> declaration says one charset encoding, but the MARC header included in the MarcXML says a different encoding... which one is the 'legal' one to believe? Is it legal to have MarcXML that is not UTF-8 _or_ Marc8, that is an entirely different charset that is legal in XML?
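To make the conflict concrete, here is a small sketch that pulls both claims out of a MarcXML document: the encoding named in the XML declaration, and the character coding scheme in position 09 of the embedded leader. It does not decide which one is "legal" to believe; it only surfaces a disagreement. The function name and the sample record are invented, and the declaration check is a simplification that assumes an ASCII-compatible document.

```python
import re
import xml.etree.ElementTree as ET

MARCXML_NS = "{http://www.loc.gov/MARC21/slim}"


def encoding_claims(document: bytes):
    """Return (XML declaration claim, leader/09 claim) for a MarcXML document."""
    decl = re.match(rb'<\?xml[^>]*?encoding\s*=\s*["\']([^"\']+)["\']', document)
    xml_claim = decl.group(1).decode("ascii").lower() if decl else "utf-8"

    root = ET.fromstring(document)
    leader = root.find(f".//{MARCXML_NS}leader")
    text = leader.text if leader is not None and leader.text else ""
    if len(text) < 10:
        leader_claim = "unknown"
    elif text[9] == "a":
        leader_claim = "utf-8"
    elif text[9] == " ":
        leader_claim = "marc-8"
    else:
        leader_claim = "unknown"
    return xml_claim, leader_claim


# Hypothetical record: the declaration says UTF-8, the leader says MARC-8.
doc = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<record xmlns="http://www.loc.gov/MARC21/slim">'
    b"<leader>00714cam  2200205 a 4500</leader></record>"
)
xml_claim, leader_claim = encoding_claims(doc)
if xml_claim != leader_claim:
    print(f"conflict: XML says {xml_claim}, leader says {leader_claim}")
```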

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind
On 4/17/2012 1:57 PM, Kyle Banerjee wrote: In some cases, invalid XML. In an ideal world, the encoding should be included in the declaration. But I wouldn't trust it. kyle So would you use the Marc header payload instead? Or you're just saying you wouldn't trust _any_ encoding declarations

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind
Okay, maybe here's another way to approach the question. If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all? If I want to have a

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread LeVan,Ralph
If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all? I'm going out on a limb here, but I don't think it is legal. There is no

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Doran, Michael D
,Ralph Sent: Tuesday, April 17, 2012 12:51 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MarcXML and char encodings There are probably a couple of answers to that. XML rules define what character set is used. The encoding attribute on the <?xml?> header is where you find out what

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind
Thanks, this is helpful feedback at least. I think it's completely irrelevant, when determining what is legal under standards, to talk about what certain Java tools happen to do, though. I don't care too much what some tool you happen to use does. In this case, I'm _writing_ the tools. I want

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Sheila M. Morrissey
] On Behalf Of Jonathan Rochkind Sent: Tuesday, April 17, 2012 2:46 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MarcXML and char encodings Thanks, this is helpful feedback at least. I think it's completely irrelevant, when determining what is legal under standards, to talk about what

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Houghton,Andrew
Jonathan Rochkind Sent: Tuesday, April 17, 2012 14:18 Subject: Re: [CODE4LIB] MarcXML and char encodings Okay, maybe here's another way to approach the question. If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Kyle Banerjee
So would you use the Marc header payload instead? Or you're just saying you wouldn't trust _any_ encoding declarations you find anywhere? This. The short version is that too many vendors and systems just supply some value without making sure that's what they're spitting out. I haven't had
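One way to act on "don't trust any declaration" is to test the bytes themselves. The sketch below is not taken from this message; it is just an assumed approach in which a strict UTF-8 decode serves as a sanity check. Note the limits: pure-ASCII data passes whatever it was meant to be (including MARC-8 records with no non-ASCII characters), and passing does not prove the data is meaningful UTF-8. The function name is hypothetical.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# "café" encoded as UTF-8 decodes cleanly; the same word in Latin-1 does not.
print(looks_like_utf8("café".encode("utf-8")))
print(looks_like_utf8("café".encode("latin-1")))
```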

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Karen Coyle
The discussions at the MARC standards group relating to Unicode all had to do with using Unicode *within* ISO2709. I can't find any evidence that MARCXML ever went through the standards process. (This may not be a bad thing.) So none of what we know about the MARBI discussions and resulting

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Houghton,Andrew
Karen Coyle Sent: Tuesday, April 17, 2012 15:41 Subject: Re: [CODE4LIB] MarcXML and char encodings The discussions at the MARC standards group relating to Unicode all had to do with using Unicode *within* ISO2709. I can't find any evidence that MARCXML ever went through the standards

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Decasm
: Re: [CODE4LIB] MarcXML and char encodings From: Jonathan Rochkind rochk...@jhu.edu To: CODE4LIB@LISTSERV.ND.EDU CC: Thanks, this is helpful feedback at least. I think it's completely irrelevant, when determining what is legal under standards, to talk about what certain Java tools happen

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind
On 4/17/2012 3:01 PM, Sheila M. Morrissey wrote: No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog, Wait, how can you declare a Marc8 encoding in an XML declaration/prolog/whatever it's called? The things that appear there need to be from a

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread LeVan,Ralph
No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog, Wait, how can you declare a Marc8 encoding in an XML declaration/prolog/whatever it's called? Nope, you can't do that. There is no approved name for the MARC-8 encoding. As Andy said, the closest

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Sheila M. Morrissey
[mailto:rochk...@jhu.edu] Sent: Tuesday, April 17, 2012 4:19 PM To: Code for Libraries Cc: Sheila M. Morrissey Subject: Re: [CODE4LIB] MarcXML and char encodings On 4/17/2012 3:01 PM, Sheila M. Morrissey wrote: No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Eric Lease Morgan
MARC-8. Cool in its time. Dumb now. Typical. --ELM

Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Sheila M. Morrissey
[mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of LeVan,Ralph Sent: Tuesday, April 17, 2012 4:21 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MarcXML and char encodings No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog, Wait, how can you