Re: [CODE4LIB] marcxml to mrc
many kudos for the quick response, and the valuable advice. i get many many times this...

element datafield: Schemas validity error : Element '{http://www.loc.gov/MARC21/slim}datafield': This element is not expected. Expected is ( {http://www.loc.gov/MARC21/slim}leader ).

2015-08-27 16:52 GMT+03:00 Galen Charlton g...@esilibrary.com:

Hi, On Thu, Aug 27, 2015 at 9:41 AM, Sergio Letuche code4libus...@gmail.com wrote:

<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>

OK, that's what I'd expect from a MARC21slim MARCXML file. Since you say the file is valid, try comparing it against the schema. This could be done like this:

curl http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd > MARC21slim.xsd
xmllint --noout --schema MARC21slim.xsd INPUTFILE

Regards, Galen -- Galen Charlton Infrastructure and Added Services Manager Equinox Software, Inc. / The Open Source Experts email: g...@esilibrary.com direct: +1 770-709-5581 cell: +1 404-984-4366 skype: gmcharlt web: http://www.esilibrary.com/ Supporting Koha and Evergreen: http://koha-community.org http://evergreen-ils.org
Re: [CODE4LIB] marcxml to mrc
Hi, On Thu, Aug 27, 2015 at 9:22 AM, Sergio Letuche code4libus...@gmail.com wrote: How could one make an mrc from a marcxml record? what could be the yaz-marcdump command for this? We have utf-8 iso2709 records

To convert from ISO2709 to MARCXML:

yaz-marcdump -i marc -o marcxml INPUTFILE > OUTPUTFILE

To go from MARCXML to ISO2709:

yaz-marcdump -i marcxml -o marc INPUTFILE > OUTPUTFILE

Regards, Galen
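Under the hood, the "mrc" side of that conversion is the ISO2709 byte layout. Here is a minimal sketch (not yaz code; the leader value and fields are invented for illustration) of how such a record is assembled: a 24-byte leader, a directory of 12-byte entries, the field data, and a record terminator.

```python
FT = b"\x1e"   # field terminator
RT = b"\x1d"   # record terminator
SF = b"\x1f"   # subfield delimiter

def to_iso2709(leader, fields):
    """Serialize (tag, bytes) pairs as one ISO2709 record.

    Layout: 24-byte leader / directory (12-byte entries: tag 3 chars,
    field length 4 digits, start offset 5 digits) / field data / RT.
    """
    directory = b""
    data = b""
    for tag, content in fields:
        field = content + FT
        directory += b"%3s%04d%05d" % (tag.encode(), len(field), len(data))
        data += field
    directory += FT
    base = 24 + len(directory)                 # base address of data
    total = base + len(data) + 1               # +1 for the record terminator
    # Leader 00-04 = record length, 12-16 = base address of data.
    leader = b"%05d" % total + leader[5:12] + b"%05d" % base + leader[17:]
    return leader + directory + data + RT

record = to_iso2709(
    b"00000nam a2200000 a 4500",               # invented sample leader
    [("001", b"demo-0001"),                    # invented sample fields
     ("245", b"10" + SF + b"aExample title")],
)
assert int(record[:5]) == len(record)          # leader 00-04 = record length
```

This is also why a missing or malformed leader breaks the conversion: the serializer needs those fixed positions to write the length and base address.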
Re: [CODE4LIB] marcxml to mrc
Hi, On Thu, Aug 27, 2015 at 9:41 AM, Sergio Letuche code4libus...@gmail.com wrote:

<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>

OK, that's what I'd expect from a MARC21slim MARCXML file. Since you say the file is valid, try comparing it against the schema. This could be done like this:

curl http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd > MARC21slim.xsd
xmllint --noout --schema MARC21slim.xsd INPUTFILE

Regards, Galen
Re: [CODE4LIB] marcxml to mrc
Terry Reese created MarcEdit, which is a great program and will do just what you need. http://marcedit.reeset.net/ Mark Sullivan Executive Director IDS Project Milne Library 1 College Circle SUNY Geneseo Geneseo, NY 14454 (585) 245-5172 On 8/27/2015 9:22 AM, Sergio Letuche wrote: Hello, How could one make an mrc from a marcxml record? what could be the yaz-marcdump command for this? We have utf-8 iso2709 records
[CODE4LIB] marcxml to mrc
Hello, How could one make an mrc from a marcxml record? what could be the yaz-marcdump command for this? We have utf-8 iso2709 records
Re: [CODE4LIB] marcxml to mrc
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>

2015-08-27 16:40 GMT+03:00 Galen Charlton g...@esilibrary.com: Hi, On Thu, Aug 27, 2015 at 9:36 AM, Sergio Letuche code4libus...@gmail.com wrote: and i get "yaz_marc_read_xml failed" the marcxml is valid, what could i do more? Could you send the beginning snippet of the MARCXML file? That yaz-marcdump command I sent assumes that the MARC21slim XML serialization is used, but there are others. Regards, Galen
Re: [CODE4LIB] marcxml to mrc
Hi, On Thu, Aug 27, 2015 at 10:03 AM, Sergio Letuche code4libus...@gmail.com wrote: many kudos for the quick response, and the valuable advice. i get many many times this... element datafield: Schemas validity error : Element '{http://www.loc.gov/MARC21/slim}datafield': This element is not expected. Expected is ( {http://www.loc.gov/MARC21/slim}leader ). Sounds like some or all of the records lack a leader element, and that would lead to the "yaz_marc_read_xml failed" you've been seeing. Making sure that each record starts with a leader element along the lines of <leader>0a220 4500</leader> might get yaz-marcdump to emit ISO2709 records. However, for those to be useful, you'll likely want to set positions 5, 6, and 7 in the leader to values that make sense for the type of records you're dealing with. Regards, Galen
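If the records really do lack leaders, one way to patch the file before re-running yaz-marcdump is to insert a placeholder leader into each record. This is only a sketch using Python's standard xml.etree; the placeholder leader value and the sample input are assumptions, and as noted above you should set positions 5-7 (record status, type of record, bibliographic level) to values that fit your records.

```python
import xml.etree.ElementTree as ET

NS = "http://www.loc.gov/MARC21/slim"
# Placeholder leader -- an assumption for illustration only.
PLACEHOLDER = "00000nam a2200000 a 4500"

def add_missing_leaders(xml_text):
    """Insert a <leader> as the first child of any <record> lacking one."""
    root = ET.fromstring(xml_text)
    for record in root.iter("{%s}record" % NS):
        if record.find("{%s}leader" % NS) is None:
            leader = ET.Element("{%s}leader" % NS)
            leader.text = PLACEHOLDER
            record.insert(0, leader)
    return root

doc = add_missing_leaders(
    '<collection xmlns="http://www.loc.gov/MARC21/slim">'
    '<record><datafield tag="245" ind1="1" ind2="0">'
    '<subfield code="a">Example</subfield></datafield></record>'
    '</collection>'
)
```

Serialize the result with `ET.tostring(doc, encoding="utf-8", xml_declaration=True)` and feed that back through yaz-marcdump.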
[CODE4LIB] MarcXML and char encodings
I know how char encodings work in MARC ISO binary -- the encoding can legally be either Marc8 or UTF8 (nothing else). The encoding of a record is specified in its header. In the wild, specified encodings are frequently wrong, or data includes weird mixed encodings. Okay! But what's going on with MarcXML? What are the legal encodings for MarcXML? Only Marc8 and UTF8, or anything that can be expressed in XML? The MARC header is (or can be) present in MarcXML -- trust the MARC header, or trust the XML doctype char encoding? What's the legal thing to do? What's actually found 'in the wild' with MarcXML? Can anyone advise? Jonathan
Re: [CODE4LIB] MarcXML and char encodings
There are probably a couple of answers to that. XML rules define what characterset is used. The encoding attribute on the <?xml?> header is where you find out what characterset is being used. I've always gone under the assumption that if an encoding wasn't specified, then UTF-8 is in effect, and that has always worked for me. It turns out the standard says US-ASCII is the default encoding. But, ignoring the encoding, the original MarcXML rules were the same as the MARC-21 rules for character repertoire, and you were supposed to restrict yourself to characters that could be mapped back into MARC-8. I don't know if that rule is still in force, but everyone ignores it. I hope that helps! Ralph
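To illustrate the point about the encoding attribute: a conforming parser decodes the bytes according to the declared encoding, so the same markup can travel in different charsets and still parse to identical Unicode text. A small sketch with Python's standard library (the element and sample text are just for illustration):

```python
import xml.etree.ElementTree as ET

# "café" serialized in Latin-1: the é becomes the single byte 0xE9,
# not the two-byte UTF-8 sequence 0xC3 0xA9.
markup = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
          '<subfield code="a">caf\u00e9</subfield>')
raw = markup.encode("iso-8859-1")

elem = ET.fromstring(raw)        # the parser honors the declared encoding
assert elem.text == "caf\u00e9"  # same Unicode text a UTF-8 file would yield
```

This is why the declaration, when it is accurate, is all an XML consumer needs: after parsing, everything is Unicode regardless of the wire charset.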
Re: [CODE4LIB] MarcXML and char encodings
What's the legal thing to do? What's actually found 'in the wild' with MarcXML? In some cases, invalid XML. In an ideal world, the encoding should be included in the declaration. But I wouldn't trust it. kyle -- Kyle Banerjee Digital Services Program Manager Orbis Cascade Alliance baner...@uoregon.edu / 503.999.9787
Re: [CODE4LIB] MarcXML and char encodings
So what if the <?xml?> declaration says one charset encoding, but the MARC header included in the MarcXML says a different encoding... which one is the 'legal' one to believe? Is it legal to have MarcXML that is not UTF-8 _or_ Marc8, that is an entirely different charset that is legal in XML? If you did that, what should the MARC header included in the XML say? I know how char encodings work in XML. I don't understand what the standards say about how that interacts with the MARC data in MarcXML. Jonathan
Re: [CODE4LIB] MarcXML and char encodings
On 4/17/2012 1:57 PM, Kyle Banerjee wrote: In some cases, invalid XML. In an ideal world, the encoding should be included in the declaration. But I wouldn't trust it. kyle

So would you use the Marc header payload instead? Or are you just saying you wouldn't trust _any_ encoding declarations you find anywhere? When writing a library to handle marc, I think the baseline should be making it do the official, legal, standards-compliant right thing. Extra heuristics to deal with invalid data can be added on top. But my trouble here is I can't even figure out what the official legal standards-compliant thing is. Maybe that's because the MarcXML standard simply doesn't address it, and it's all implementation dependent. Sigh. The problem is how the XML document's own char encoding is supposed to interact with the MARC header; especially because there's no way to put Marc8 in an XML char encoding doctype (is there?); and whether encodings other than Marc8 or UTF8 are legal in MarcXML, even though they aren't in MARC ISO binary. I think the answer might be nobody knows, and there is no standard right way to do it. Which is unfortunate.
Re: [CODE4LIB] MarcXML and char encodings
Okay, maybe here's another way to approach the question. If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all? If I want to have a MarcXML document encoded in UTF8, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? If I want to have a MarcXML document with a char encoding that is _neither_ Marc8 nor UTF8, but something else generally legal for XML -- is this legal at all? And if so, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML?
Re: [CODE4LIB] MarcXML and char encodings
If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all?

I'm going out on a limb here, but I don't think it is legal. There is no formal encoding that corresponds to MARC-8, so there's no way to tell XML tools how to interpret the bytes.

If I want to have a MarcXML document encoded in UTF8, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML?

<?xml encoding="UTF-8"?> I suppose you'll want to set the leader to UTF-8 as well, but it doesn't really matter to any XML tools.

If I want to have a MarcXML document with a char encoding that is _neither_ Marc8 nor UTF8, but something else generally legal for XML -- is this legal at all? And if so, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML?

I'd claim this is legal, if it is legal XML. Set your encoding to anything that is valid. As a Java programmer, using Java XML tools, the encoding is just a hint to the tools. I end up with Unicode strings after the XML is read. So I always ignore the encoding byte in the leader. Following that logic, that byte is about encoding. It has meaning when ISO 2709 is the transfer mechanism. But, in this case, XML is the transfer mechanism, and its rules for identifying the encoding are what matter. I'm proposing that the encoding byte in the leader is meaningless. Ralph
Re: [CODE4LIB] MarcXML and char encodings
Hi Ralph, But, ignoring the encoding, the original MarcXML rules were the same as the MARC-21 rules for character repertoire and you were supposed to restrict yourself to characters that could be mapped back into MARC-8. I don't know if that rule is still in force, but everyone ignores it. That rule no longer applies per the December 2007 revision of the MARC 21 Specifications: "To facilitate the movement of records between MARC-8 and Unicode environments, it was recommended for an initial period that the use of Unicode be restricted to a repertoire identical in extent to the MARC-8 repertoire. [...] however, such a restriction is no longer appropriate. The full UCS repertoire, as currently defined at the Unicode web site, is valid for encoding MARC 21 records subject only to the constraints described [in the current MARC 21 Specifications]." -- from MARC 21 Specifications (revised December 2007) [1] -- Michael [1] http://www.loc.gov/marc/specifications/speccharucs.html
Re: [CODE4LIB] MarcXML and char encodings
Thanks, this is helpful feedback at least. I think it's completely irrelevant, when determining what is legal under standards, to talk about what certain Java tools happen to do, though; I don't care too much what some tool you happen to use does. In this case, I'm _writing_ the tools. I want to make them do 'the right thing', with some mix of what's actually officially legally correct and what's practically useful. What your Java tools do is more or less irrelevant to me. I certainly _could_ make my tool respect the Marc leader encoded in MarcXML over the XML declaration if I wanted to. I could even make it assume the data is Marc8 in XML, even though there's no XML charset type for it, if the leader says it's Marc8. But do others agree that there is in fact no legal way to have Marc8 in MarcXML? Do others agree that you can use non-UTF8 encodings in MarcXML, so long as they are legal XML? I won't even ask someone to cite standards documents, because it's pretty clear that LC forgot to consider this when establishing MarcXML. (And I have no faith that one could get LC to make a call on this and publish it any time this century.) Has anyone seen any Marc8-encoded MarcXML in the wild? Is it common? How is it represented with regard to the XML leader and the Marc header? Has anyone seen any MarcXML with char encodings that are neither Marc8 nor UTF8 in the wild? Are they common? How are they represented with regard to XML leader and Marc header?
Re: [CODE4LIB] MarcXML and char encodings
Re: But do others agree that there is in fact no legal way to have Marc8 in MarcXML? No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog, and you will want to be aware that XML processors are only REQUIRED to process UTF-8 and UTF-16 -- in practice many (including Java-based ones) can handle other encodings -- but you will have to make sure whatever XML processor you use, in whatever language it is written, has a handy-dandy MARC8 coder/decoder ring. Sheila
Re: [CODE4LIB] MarcXML and char encodings
Jonathan Rochkind Sent: Tuesday, April 17, 2012 14:18 Subject: Re: [CODE4LIB] MarcXML and char encodings: Okay, maybe here's another way to approach the question. If I want to have a MarcXML document encoded in Marc8 -- what should it look like? [...]

You cannot have a MARC-XML document encoded in MARC-8 -- well, sort of, but it's not standard. To answer your questions you have to refer to a variety of standards: http://www.w3.org/TR/2008/REC-xml-20081126/#NT-EncodingDecl

"In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings should use names starting with an x- prefix.
XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings). In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration."

1) The above says that <?xml version="1.0"?> means the same as <?xml version="1.0" encoding="utf-8"?>, and if you prefer you can omit the XML declaration, in which case UTF-8 is assumed unless there is a BOM (Byte Order Mark), which determines UTF-8 vs UTF-16BE vs UTF-16LE.

2) If you really wanted to encode the XML in MARC-8 you need to specify x-, since if you refer to http://www.iana.org/assignments/character-sets, MARC-8 isn't a registered character set, hence cannot be specified in the encoding attribute unless the name was prefixed with x-. Which implies that no standard XML library will know how to convert the MARC-8 characters into Unicode so the XML DOM can be used. So unless you want to write your own MARC-8 to Unicode conversion routines and integrate them into your preferred XML library, it isn't going to work out of the box for anyone else but yourself.

When dealing with MARC-XML you should ignore the values in LDR/00-04, LDR/10, LDR/11, LDR/12-16, LDR/20-23. If you look at the MARC-XML schema you will note that the definition for leaderDataType specifies LDR/00-04 as [\d ]{5}, LDR/10 and LDR/11 as (2| ), LDR/12-16 as [\d ]{5}, and LDR/20-23 as (4500| ).
Note the MARC-XML schema allows spaces in those positions because they are not relevant in the XML format, though very relevant in the binary format. You probably should ignore LDR/09 since most MARC to MARC-XML converters do not change this value to 'a' although many converters do change the value when converting MARC binary between MARC-8 and UTF-8. The only valid character set for MARC-XML is Unicode and it *should* be encoded in UTF-8 in Unicode normalization form D (NFD) although most XML libraries will not know the difference if it was encoded as UTF-16BE or UTF-16LE in Unicode normalization form D since the XML libraries internally work with Unicode. I could have sworn that this information was specified on LC's site at one point in time, but I'm having trouble finding the documentation. Hope this helps, Andy.
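The normalization form D point above is easy to demonstrate with a stdlib sketch: the same visible character can be one precomposed code point (NFC) or a base letter plus a combining mark (NFD, the form described above for MARC-XML payloads), and `unicodedata` converts between them.

```python
import unicodedata

nfc = "caf\u00e9"                                # "café" with precomposed U+00E9
nfd = unicodedata.normalize("NFD", nfc)          # decompose: base letter + mark
assert nfd == "cafe\u0301"                       # 'e' + U+0301 COMBINING ACUTE ACCENT
assert unicodedata.normalize("NFC", nfd) == nfc  # recomposing round-trips
```

Both strings render identically, which is exactly why a consumer that compares or indexes MARCXML text should normalize first rather than trust the producer's form.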
Re: [CODE4LIB] MarcXML and char encodings
So would you use the Marc header payload instead? Or you're just saying you wouldn't trust _any_ encoding declarations you find anywhere?

This. The short version is that too many vendors and systems just supply some value without making sure that's what they're spitting out. I haven't had to mess with this stuff for a few years, so I'm hoping Terry Reese weighs in on this conversation -- he has a lot of experience dealing with encoding headaches. However, the bottom line is that the most reliable method is to use heuristics to detect what's going on. Yeah, that totally kills the point of listing encodings in the first place, but just as is the case with any unreliably used data point, it's all GIGO.

When writing a library to handle marc, I think the baseline should be making it do the official, legal, standards-compliant right thing. Extra heuristics to deal with invalid data can be added on top.

I'm hoping things have improved, but if heuristics are more reliable than reading the right areas of the record, you have to ignore what's there (which makes even reading it pointless). I do think there is value in encouraging vendors to actually pay attention to this stuff, as such basic screwups undermine both the credibility of the data source and the service that depends on the data.

But my trouble here is I can't even figure out what the official legal standards-compliant thing is. Maybe that's because the MarcXML standard simply doesn't address it, and it's all implementation dependent. Sigh. The problem is how the XML document's own char encoding is supposed to interact with the MARC header; especially because there's no way to put Marc8 in an XML char encoding doctype (is there?); and whether encodings other than Marc8 or UTF8 are legal in MarcXML, even though they aren't in MARC ISO binary. I think the answer might be nobody knows, and there is no standard right way to do it. Which is unfortunate.

A good summary of the situation as I understand it.
kyle
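The heuristic approach described above can start as simply as the following sketch. The fallback label is an assumption for illustration (real MARC-8 detection would look for its escape sequences and combining-character patterns), but strict UTF-8 decoding is a reliable first test because legacy non-ASCII bytes rarely form valid UTF-8 sequences.

```python
def guess_encoding(raw: bytes) -> str:
    """Trust no declaration: try strict UTF-8 first, else guess legacy.

    Caveat: pure-ASCII data passes the UTF-8 test even if it was
    produced by a MARC-8 system, since ASCII is a subset of both.
    """
    try:
        raw.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return "marc-8?"   # unverified legacy guess, label is illustrative

assert guess_encoding("pâté".encode("utf-8")) == "utf-8"
assert guess_encoding("pâté".encode("latin-1")) == "marc-8?"
```

A production detector would layer more checks on top (escape sequences, statistical frequency of combining marks), but this ordering, verify UTF-8 before believing anything else, matches the "heuristics over declarations" advice in the message above.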
Re: [CODE4LIB] MarcXML and char encodings
The discussions at the MARC standards group relating to Unicode all had to do with using Unicode *within* ISO2709. I can't find any evidence that MARCXML ever went through the standards process. (This may not be a bad thing.) So none of what we know about the MARBI discussions and resulting standards can really help us here, except perhaps by analogy. In LC's own example on the MARCXML page (the Sandburg example) the Leader is copied without change from the ISO2709/MARC-8 record to the MARCXML/Unicode record -- in other words, it still has a blank in offset 09, which means MARC-8. (The XML record is UTF-8.) My gut feeling is that the Leader in MARCXML should be treated like the human appendix -- something that once had a use, but is now just being carried along for historical reasons. I would not expect it to reflect the XML record within which it is embedded. Unfortunately, it is the only source of some key information, like type of record. The more I think about it, the more MARCXML strikes me as a really messed-up format. kc

-- Karen Coyle kco...@kcoyle.net http://kcoyle.net ph: 1-510-540-7596 m: 1-510-435-8234 skype: kcoylenet
Re: [CODE4LIB] MarcXML and char encodings
Karen Coyle Sent: Tuesday, April 17, 2012 15:41 Subject: Re: [CODE4LIB] MarcXML and char encodings The discussions at the MARC standards group relating to Unicode all had to do with using Unicode *within* ISO2709. I can't find any evidence that MARCXML ever went through the standards process. (This may not be a bad thing.) So none of what we know about the MARBI discussions and resulting standards can really help us here, except perhaps by analogy. Well, I can confirm that MARCXML didn't go through MARBI, since I was one of OCLC's representatives who solidified MARCXML. MARCXML came out of a meeting at LC between the MARC Standards office, OCLC, RLG, and one or two other interested parties whom I cannot remember or find in my emails or notes about the meeting. Andy.
Re: [CODE4LIB] MarcXML and char encodings
Let me make some recommendations. These are what I would consider best practices for interoperability. 1) Never put marc8 in xml. Just don't do it. No one expects it. Few will be willing to bother with it. 2) Always prefer utf8 for marcxml. You can use any standard charset if you need to, but without special circumstances, use utf8. 3) Ignore leader 9 in marcxml. Only consider the prolog. (Consider, not trust.) If you reasonably can, fail when the charset is wrong. /dev Sent via the Samsung Galaxy S™ II Skyrocket™, an AT&T 4G LTE smartphone. Original message Subject: Re: [CODE4LIB] MarcXML and char encodings From: Jonathan Rochkind rochk...@jhu.edu To: CODE4LIB@LISTSERV.ND.EDU CC: Thanks, this is helpful feedback at least. I think it's completely irrelevant, when determining what is legal under standards, to talk about what certain Java tools happen to do though; I don't care too much what some tool you happen to use does. In this case, I'm _writing_ the tools. I want to make them do 'the right thing', with some mix of what's actually officially, legally correct and what's practically useful. What your Java tools do is more or less irrelevant to me. I certainly _could_ make my tool respect the Marc leader encoded in MarcXML over the XML declaration if I wanted to. I could even make it assume the data is Marc8 in XML, even though there's no XML charset type for it, if the leader says it's Marc8. But do others agree that there is in fact no legal way to have Marc8 in MarcXML? Do others agree that you can use non-UTF8 encodings in MarcXML, so long as they are legal XML? I won't even ask someone to cite standards documents, because it's pretty clear that LC forgot to consider this when establishing MarcXML. (And I have no faith that one could get LC to make a call on this and publish it any time this century.) Has anyone seen any Marc8-encoded MarcXML in the wild? Is it common? How is it represented with regard to the XML declaration and the Marc leader?
Has anyone seen any MarcXML with char encodings that are neither Marc8 nor UTF8 in the wild? Are they common? How are they represented with regard to the XML declaration and the Marc leader? On 4/17/2012 2:32 PM, LeVan,Ralph wrote: If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all? I'm going out on a limb here, but I don't think it is legal. There is no formal encoding that corresponds to MARC-8, so there's no way to tell XML tools how to interpret the bytes. If I want to have a MarcXML document encoded in UTF8, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? <?xml encoding="UTF-8"?> I suppose you'll want to set the leader to UTF-8 as well, but it doesn't really matter to any XML tools. If I want to have a MarcXML document with a char encoding that is _neither_ Marc8 nor UTF8, but something else generally legal for XML
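/dev's three recommendations amount to a simple decision rule: take the charset from the XML prolog, ignore leader 09, and fail fast on MARC-8. A hedged sketch of that rule in Python; the regex-based prolog sniffing and the x-marc-8 name are illustrative assumptions (as the thread notes, no registered XML encoding name for MARC-8 exists):

```python
# Sketch of the interoperability rules above: trust the XML prolog,
# ignore leader offset 09, and refuse MARC-8 outright.
import re

def marcxml_encoding(raw_bytes):
    """Return the declared XML encoding, rejecting MARC-8 variants."""
    head = raw_bytes[:200].decode("ascii", errors="replace")
    m = re.search(r'encoding=["\']([^"\']+)["\']', head)
    # Simplification: treat "no declaration" as UTF-8, the common default.
    enc = m.group(1) if m else "UTF-8"
    if enc.lower() in ("marc-8", "marc8", "x-marc-8"):
        raise ValueError("MARC-8 is not a legal XML encoding; re-encode as UTF-8")
    return enc

print(marcxml_encoding(b'<?xml version="1.0" encoding="UTF-8"?><collection/>'))  # → UTF-8
```

A real implementation would let the XML parser itself resolve the encoding (including BOM handling); this sketch only captures the policy being recommended.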
Re: [CODE4LIB] MarcXML and char encodings
On 4/17/2012 3:01 PM, Sheila M. Morrissey wrote: No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog. Wait, how can you declare a Marc8 encoding in an XML declaration/prolog/whatever it's called? The things that appear there need to be from a specific list, and I didn't think Marc8 was on that list? Can you give me an example? And, if you happen to have it, a link to the XML standard that says this is legal?
Re: [CODE4LIB] MarcXML and char encodings
No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog. Wait, how can you declare a Marc8 encoding in an XML declaration/prolog/whatever it's called? Nope, you can't do that. There is no approved name for the MARC-8 encoding. As Andy said, the closest you could get would be to make up an experimental name, like x-marc-8, but no tool in the world would recognize that. Ralph
Re: [CODE4LIB] MarcXML and char encodings
In the XML standard: It is RECOMMENDED that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings SHOULD use names starting with an x- prefix. XML processors SHOULD match character encoding names in a case-insensitive way and SHOULD either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings). As I suggested -- since MARC8 isn't (so far as I know) registered -- you won't get far with most standard tools, in whatever language -- you'll have to extend them to first recognize the encoding name, and second, decode the content. smm -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Tuesday, April 17, 2012 4:19 PM To: Code for Libraries Cc: Sheila M. Morrissey Subject: Re: [CODE4LIB] MarcXML and char encodings On 4/17/2012 3:01 PM, Sheila M. Morrissey wrote: No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog. Wait, how can you declare a Marc8 encoding in an XML declaration/prolog/whatever it's called? The things that appear there need to be from a specific list, and I didn't think Marc8 was on that list? Can you give me an example? And, if you happen to have it, a link to the XML standard that says this is legal?
Re: [CODE4LIB] MarcXML and char encodings
MARC-8. Cool in its time. Dumb now. Typical. --ELM
Re: [CODE4LIB] MarcXML and char encodings
I think this is a case of being in violent agreement -- see some earlier replies in this thread -- Pragmatically, if you are going to hew to marc-8 encoding transported in XML -- you are losing the usefulness of standard tools for xml -- smm -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of LeVan,Ralph Sent: Tuesday, April 17, 2012 4:21 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MarcXML and char encodings No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog. Wait, how can you declare a Marc8 encoding in an XML declaration/prolog/whatever it's called? Nope, you can't do that. There is no approved name for the MARC-8 encoding. As Andy said, the closest you could get would be to make up an experimental name, like x-marc-8, but no tool in the world would recognize that. Ralph
[CODE4LIB] MARCXML to MADS 2.0 XSLT stylesheet (Revision 2.05)
The Library of Congress' MARCXML to MADS 2.0 XSLT stylesheet (Revision 2.05) http://www.loc.gov/standards/marcxml/xslt/MARC21slim2MADS.xsl is now available--it incorporates edits made in response to comments received since the release of Revision 2.04. The MARCXML to MADS 2.0 XSLT is based on the MARC to MADS 2.0 mapping made available by the Library of Congress (June 2011): http://www.loc.gov/standards/mads/mads-mapping.html. The mapping and the XSLT are also available via the Library of Congress' MADS Web site: http://www.loc.gov/standards/mads/. They will be revised periodically as users' comments are received and as subsequent MODS/MADS Editorial Committee analysis and decisions evolve. General questions about the mapping may be directed to the MODS/MADS Editorial Committee members via nd...@loc.gov. Specific questions about the stylesheet may be addressed to Tracy Meehleib at t...@loc.gov. Thank you, Tracy Tracy Meehleib Network Development and MARC Standards Office Library of Congress 101 Independence Ave SE Washington, DC 20540-4402 +1 202 707 0121 (voice) +1 202 707 0115 (fax)
[CODE4LIB] MARCXML to MODS: 590 Field
Dear hive-mind, Does anyone know why the Library of Congress-supplied MARCXML to MODS XSLT [1] does not handle the MARC 590 Local Notes field? It seems to handle everything else, not that I've done an exhaustive search... :) Granted, I could copy/create my own XSLT and add this functionality in myself, but I'm curious as to whether or not there's some logic behind this decision to not include it. Logic that I would not naturally understand since I'm not formally trained as a librarian. Thanks! --Joel [1] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3-4.xsl Joel Richard IT Specialist, Web Services Department Smithsonian Institution Libraries | http://www.sil.si.edu/ (202) 633-1706 | richar...@si.edu
Re: [CODE4LIB] MARCXML to MODS: 590 Field
I'm going to guess that it's because 59x fields are defined for local use: http://www.loc.gov/marc/bibliographic/bd59x.html ...but someone from LC should be able to confirm. -Jon -- Jon Stroop Metadata Analyst Firestone Library Princeton University Princeton, NJ 08544 Email: jstr...@princeton.edu Phone: (609)258-0059 Fax: (609)258-0441 http://pudl.princeton.edu http://diglib.princeton.edu http://diglib.princeton.edu/ead http://www.cpanda.org/cpanda On 05/19/2011 11:45 AM, Richard, Joel M wrote: Dear hive-mind, Does anyone know why the Library of Congress-supplied MARCXML to MODS XSLT [1] does not handle the MARC 590 Local Notes field? It seems to handle everything else, not that I've done an exhaustive search... :) Granted, I could copy/create my own XSLT and add this functionality in myself, but I'm curious as to whether or not there's some logic behind this decision to not include it. Logic that I would not naturally understand since I'm not formally trained as a librarian. Thanks! --Joel [1] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3-4.xsl Joel Richard IT Specialist, Web Services Department Smithsonian Institution Libraries | http://www.sil.si.edu/ (202) 633-1706 | richar...@si.edu
Re: [CODE4LIB] MARCXML to MODS: 590 Field
Thanks, Karen and Jon! That's what I suspected, but I couldn't find anything on the web about the thought process behind ignoring the 590 altogether. We'll likely end up using a local version of the XSLT to map it to mods:note as you suggested. We simply don't want this information to be lost in our MODS record as we, for example, embed it inside a METS document. --Joel On May 19, 2011, at 12:34 PM, Karen Miller wrote: Joel, The 590 is indeed defined for local use, so whatever your local institution uses it for should guide your mapping to MODS. There are some examples of what it's used for on the OCLC Bibliographic Formats and Standards pages: http://www.oclc.org/bibformats/en/5xx/590.shtm Frequently it's used as a note that is specific to a local copy of an item. If your institution uses it inconsistently, you might want to just map it to mods:note. Karen Karen D. Miller Monographic/Digital Projects Cataloger Bibliographic Services Dept. Northwestern University Library Evanston, IL k-mill...@northwestern.edu 847-467-3462 -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jon Stroop Sent: Thursday, May 19, 2011 11:07 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MARCXML to MODS: 590 Field I'm going to guess that it's because 59x fields are defined for local use: http://www.loc.gov/marc/bibliographic/bd59x.html ...but someone from LC should be able to confirm. -Jon -- Jon Stroop Metadata Analyst Firestone Library Princeton University Princeton, NJ 08544 Email: jstr...@princeton.edu Phone: (609)258-0059 Fax: (609)258-0441 http://pudl.princeton.edu http://diglib.princeton.edu http://diglib.princeton.edu/ead http://www.cpanda.org/cpanda On 05/19/2011 11:45 AM, Richard, Joel M wrote: Dear hive-mind, Does anyone know why the Library of Congress-supplied MARCXML to MODS XSLT [1] does not handle the MARC 590 Local Notes field? It seems to handle everything else, not that I've done an exhaustive search...
:) Granted, I could copy/create my own XSLT and add this functionality in myself, but I'm curious as to whether or not there's some logic behind this decision to not include it. Logic that I would not naturally understand since I'm not formally trained as a librarian. Thanks! --Joel [1] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3-4.xsl Joel Richard IT Specialist, Web Services Department Smithsonian Institution Libraries | http://www.sil.si.edu/ (202) 633-1706 | richar...@si.edu
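For anyone taking Karen's suggestion, the local mapping can also be done as a post-processing step rather than by editing LC's stylesheet. A sketch with Python's standard-library ElementTree; the function name and sample data are invented for illustration, and a production pipeline would more likely add one template to a local copy of the XSLT:

```python
# A minimal local workaround for the unmapped 590: copy each 590 $a from a
# MARCXML record into a MODS record as <note>. Hedged sketch, not a patch
# to LC's MARC21slim2MODS stylesheet.
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
MODS_NS = "http://www.loc.gov/mods/v3"

def add_590_notes(marc_record, mods_record):
    """Append each 590 $a from a MARCXML record to MODS as mods:note."""
    for df in marc_record.findall(f"{{{MARC_NS}}}datafield[@tag='590']"):
        for sf in df.findall(f"{{{MARC_NS}}}subfield[@code='a']"):
            note = ET.SubElement(mods_record, f"{{{MODS_NS}}}note")
            note.text = sf.text

# Illustrative record with a single local note.
marc = ET.fromstring(
    f'<record xmlns="{MARC_NS}">'
    '<datafield tag="590" ind1=" " ind2=" ">'
    '<subfield code="a">Library copy signed by the author.</subfield>'
    '</datafield></record>')
mods = ET.Element(f"{{{MODS_NS}}}mods")
add_590_notes(marc, mods)
print(ET.tostring(mods, encoding="unicode"))
```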
Re: [CODE4LIB] MARCXML - What is it for?
I saw a similar speedup when I switched from an OO approach to a more functional style. Using MARC::Record, it was taking a lot longer to run some data than I wanted. I rewrote my script, with ad-hoc functional code. And though I can't give a real rate increase, because I never bothered to wait for the OO version to finish, I can say that it went from hours to minutes. I didn't compare to the filter capacity of MARC::File::USMARC, though. Maybe that would have been fast enough for my needs. Ultimately, though, I was just dumping these fields into a file, and didn't need any objects for that. The speed increase I saw was made possible by the directory. I wouldn't have even been able to try that with the XML version of the data. /dev -- Devon Smith Consulting Software Engineer OCLC Research http://www.oclc.org/research/people/smith.htm -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Nate Vack Sent: Friday, November 19, 2010 12:34 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MARCXML - What is it for? On Mon, Oct 25, 2010 at 2:22 PM, Eric Hellman e...@hellman.net wrote: I think you'd have a very hard time demonstrating any speed advantage to MARC over MARCXML. Not to bring up this old topic again, but I'm just finishing up a conversion from parse this text structure to blit this binary data structure into memory. Both written in python. The text parsing is indeed fast -- tens of milliseconds to parse 100k or so of data on my laptop. The binary code, though, is literally 1,000 times faster -- tens of *microseconds* to read the same data. (And in this application, yeah, it'll matter.) Blitting is much, much, much faster than lexing and parsing, or even running a regexp over the data. Cheers, -Nate
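The speed Devon attributes to the directory comes from ISO2709's fixed-width directory entries -- tag (3 bytes), field length (4), starting offset (5) -- which let you slice a field straight out of the record bytes without scanning or parsing. A toy Python sketch; the sample record is hand-built for illustration:

```python
# Why the ISO2709 directory makes field access fast: each 12-byte entry
# gives tag, length, and offset, so a field can be sliced directly out of
# the record bytes.
FT, RT = b"\x1e", b"\x1d"  # field and record terminators

def get_field(record, want_tag):
    """Return the raw data of the first field with the given tag, or None."""
    base = int(record[12:17])           # leader 12-16: base address of data
    directory = record[24:base - 1]     # 12-byte entries, up to the FT
    for i in range(0, len(directory), 12):
        tag = directory[i:i + 3].decode()
        length = int(directory[i + 3:i + 7])
        start = int(directory[i + 7:i + 12])
        if tag == want_tag:
            return record[base + start: base + start + length].rstrip(FT)
    return None

# Hand-built 44-byte record with one 001 field containing "12345".
leader = b"00044nam a2200037 a 4500"
record = leader + b"001000600000" + FT + b"12345" + FT + RT
print(get_field(record, "001"))  # → b'12345'
```

Nothing comparable is possible with the XML serialization: there, finding a field always means lexing the whole document up to it, which is the asymmetry Nate measured.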
Re: [CODE4LIB] marcxml
On 11 Nov 2010, at 14:47, Galen Charlton wrote: Hi, On Thu, Nov 11, 2010 at 6:26 AM, J.D.Gravestock j.d.gravest...@open.ac.uk wrote: I'd be interested to know if anyone is using a good marcxml to marc converter (other than marcedit, i.e. non windows). I've tried the perl module marc::xml but having a few problems with the conversion which I can't replicate in marcedit. Are there any that I've missed? As far as Perl modules are concerned, MARC::XML is a bit long in the tooth. MARC::File::XML used in conjunction with MARC::Record may give you better results. Or File_MARC on PEAR if you prefer PHP. -- Dan Field d...@llgc.org.uk Ffôn/Tel. +44 1970 632 582 Peiriannydd Meddalwedd Senior Software Engineer Llyfrgell Genedlaethol Cymru National Library of Wales
Re: [CODE4LIB] marcxml
The XC team wrote (and uses) the oaitoolkit (http://code.google.com/p/xcoaitoolkit/) for this. We've run our entire collection (5.8M records) through it. -Ben On Thu, Nov 11, 2010 at 11:41 AM, Reese, Terry terry.re...@oregonstate.edu wrote: Yes -- that's right. There is a zip file with install instructions for any non-windows based system for which a MONO port is present. --TR -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Joel Marchesoni Sent: Thursday, November 11, 2010 8:40 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] marcxml There actually is a version of MARCEdit for Linux now. I think (although I can't remember and can't find it on the site) that it relies on Mono. MARCEdit download page: http://people.oregonstate.edu/~reeset/marcedit/html/downloads.html Joel -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of J.D.Gravestock Sent: Thursday, November 11, 2010 6:26 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] marcxml I'd be interested to know if anyone is using a good marcxml to marc converter (other than marcedit, i.e. non windows). I've tried the perl module marc::xml but having a few problems with the conversion which I can't replicate in marcedit. Are there any that I've missed? Jill ** Jill Gravestock Open University Library Milton Keynes -- The Open University is incorporated by Royal Charter (RC 000391), an exempt charity in England Wales and a charity registered in Scotland (SC 038302). --
Re: [CODE4LIB] marcxml
On Nov 11, 2010, at 6:26 AM, J.D.Gravestock wrote: I'd be interested to know if anyone is using a good marcxml to marc converter (other than marcedit, i.e. non windows). If I understand your question correctly, then try Index Data's yaz-marcdump application, which is a component of Yaz. [1] Once compiled and installed, you do something like this from the command line: yaz-marcdump -i marcxml -o marc file.xml > file.mrc [1] Yaz - http://www.indexdata.com/yaz/ BTW, say hello to James N. for me. -- Eric Lease Morgan University of Notre Dame
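yaz-marcdump is the right tool for real data; for the curious, though, the conversion it performs is mechanical enough to sketch. A minimal, unvalidated stdlib Python serializer for a single MARCXML record to ISO2709 bytes (it assumes UTF-8 output, hence 'a' at leader offset 09, and skips all the error handling a real converter needs):

```python
# What a MARCXML-to-ISO2709 converter does under the hood: serialize the
# fields, build the fixed-width directory, and recompute leader offsets.
import xml.etree.ElementTree as ET

NS = "{http://www.loc.gov/MARC21/slim}"
SF, FT, RT = b"\x1f", b"\x1e", b"\x1d"  # subfield, field, record delimiters

def marcxml_to_iso2709(record):
    fields = []
    for cf in record.findall(NS + "controlfield"):
        fields.append((cf.get("tag"), cf.text.encode("utf-8") + FT))
    for df in record.findall(NS + "datafield"):
        body = df.get("ind1").encode() + df.get("ind2").encode()
        for sf in df.findall(NS + "subfield"):
            body += SF + sf.get("code").encode() + sf.text.encode("utf-8")
        fields.append((df.get("tag"), body + FT))
    directory, data, pos = b"", b"", 0
    for tag, body in fields:
        directory += f"{tag}{len(body):04d}{pos:05d}".encode()
        data += body
        pos += len(body)
    base = 24 + len(directory) + 1          # leader + directory + FT
    total = base + len(data) + 1            # ... + data + RT
    leader = list(record.find(NS + "leader").text.ljust(24))
    leader[0:5] = f"{total:05d}"            # record length
    leader[9] = "a"                         # UTF-8 output
    leader[12:17] = f"{base:05d}"           # base address of data
    return "".join(leader).encode() + directory + FT + data + RT

# Illustrative record.
xml_rec = ET.fromstring(
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<leader>00000nam a2200000 a 4500</leader>'
    '<controlfield tag="001">12345</controlfield>'
    '<datafield tag="245" ind1="1" ind2="0">'
    '<subfield code="a">A title.</subfield></datafield>'
    '</record>')
iso = marcxml_to_iso2709(xml_rec)
```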
Re: [CODE4LIB] marcxml
There actually is a version of MARCEdit for Linux now. I think (although I can't remember and can't find it on the site) that it relies on Mono. MARCEdit download page: http://people.oregonstate.edu/~reeset/marcedit/html/downloads.html Joel -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of J.D.Gravestock Sent: Thursday, November 11, 2010 6:26 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] marcxml I'd be interested to know if anyone is using a good marcxml to marc converter (other than marcedit, i.e. non windows). I've tried the perl module marc::xml but having a few problems with the conversion which I can't replicate in marcedit. Are there any that I've missed? Jill ** Jill Gravestock Open University Library Milton Keynes -- The Open University is incorporated by Royal Charter (RC 000391), an exempt charity in England Wales and a charity registered in Scotland (SC 038302). --
Re: [CODE4LIB] marcxml
Yes -- that's right. There is a zip file with install instructions for any non-windows based system for which a MONO port is present. --TR -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Joel Marchesoni Sent: Thursday, November 11, 2010 8:40 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] marcxml There actually is a version of MARCEdit for Linux now. I think (although I can't remember and can't find it on the site) that it relies on Mono. MARCEdit download page: http://people.oregonstate.edu/~reeset/marcedit/html/downloads.html Joel -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of J.D.Gravestock Sent: Thursday, November 11, 2010 6:26 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] marcxml I'd be interested to know if anyone is using a good marcxml to marc converter (other than marcedit, i.e. non windows). I've tried the perl module marc::xml but having a few problems with the conversion which I can't replicate in marcedit. Are there any that I've missed? Jill ** Jill Gravestock Open University Library Milton Keynes -- The Open University is incorporated by Royal Charter (RC 000391), an exempt charity in England Wales and a charity registered in Scotland (SC 038302). --
Re: [CODE4LIB] MARCXML - What is it for?
I've only just had a chance to catch up on this thread. I'm not offended in the least by Turbomarc (anything round-trippable should serve just as well as an internal representation of MARC, right?), but I am a little puzzled--what are the 'special cases' alluded to in the blog post? When would there ever be a non-alphanumeric attribute value in MARCXML? Is this a non-MARC21 thing? C On 10/25/10 3:35 PM, MJ Suhonos wrote: I'll just leave this here: http://www.indexdata.com/blog/2010/05/turbomarc-faster-xml-marc-records That trade-off ought to offend both camps, though I happen to think it's quite clever. MJ On 2010-10-25, at 3:22 PM, Eric Hellman wrote: I think you'd have a very hard time demonstrating any speed advantage to MARC over MARCXML. XML parsers have been speed optimized out the wazoo; if there exists a MARC parser that has ever been speed-optimized without serious compromise, I'm sure someone on this list will have a good story about it. On Oct 25, 2010, at 3:05 PM, Patrick Hochstenbach wrote: Dear Nate, There is a trade-off: do you want very fast processing of data - go for binary data. do you want to share your data globally easily in many (not per se library related) environments - go for XML/RDF. Open your data and do both :-) Pat Sent from my iPhone On 25 Oct 2010, at 20:39, Nate Vack njv...@wisc.edu wrote: Hi all, I've just spent the last couple of weeks delving into and decoding a binary file format. This, in turn, got me thinking about MARCXML. In a nutshell, it looks like it's supposed to contain the exact same data as a normal MARC record, except in XML form. As in, it should be round-trippable. What's the advantage to this? I can see using a human-readable format for poorly-documented file formats -- they're relatively easy to read and understand. But MARC is well, well-documented, with more than one free implementation in cursory searching.
And once you know a binary file's format, it's no harder to parse than XML, and the data's smaller and processing faster. So... why the XML? Curious, -Nate Eric Hellman President, Gluejar, Inc. 41 Watchung Plaza, #132 Montclair, NJ 07042 USA e...@hellman.net http://go-to-hellman.blogspot.com/ @gluejar -- Cory Rockliff Technical Services Librarian Bard Graduate Center: Decorative Arts, Design History, Material Culture 18 West 86th Street New York, NY 10024 T: (212) 501-3037 rockl...@bgc.bard.edu BGC Exhibitions: In the Main Gallery: January 26, 2011– April 17, 2011 Cloisonné: Chinese Enamels from the Yuan, Ming, and Qing Dynasties Organized in collaboration with the Musée des arts Décoratifs, Paris. In the Focus Gallery: January 26, 2011– April 17, 2011 Objects of Exchange: Social and Material Transformation on the Late-Nineteenth-Century Northwest Coast Organized in collaboration with the American Museum of Natural History
Re: [CODE4LIB] MARCXML - What is it for?
Let me openly state that I've never used Turbomarc. I believe the special case they are referring to is the subfield code with a value of η, which is non-alphanumeric. I don't know enough about MARC to even begin guessing what this means or why it might occur (or not). The use case I see for Turbomarc is when you: 1- have a need for high performance 2- are converting binary MARC to XML 3- are writing your own XSLT to manipulate that XML (since it's not MARCXML) The first comment claims a 30-40% increase in XML parsing, which seems obvious when you compare the number of characters in the example provided: 277 vs. 419, or about 34% fewer going through the parser. But, really, look at that XML (if it can even be called that). Turbomarc somehow manages to make MARC even more inscrutable. But hey, it's fast. MJ On 2010-10-28, at 11:35 AM, Cory Rockliff wrote: I've only just had a chance to catch up on this thread. I'm not offended in the least by Turbomarc (anything round-trippable should serve just as well as an internal representation of MARC, right?), but I am a little puzzled--what are the 'special cases' alluded to in the blog post? When would there ever be a non-alphanumeric attribute value in MARCXML? Is this a non-MARC21 thing? C On 10/25/10 3:35 PM, MJ Suhonos wrote: I'll just leave this here: http://www.indexdata.com/blog/2010/05/turbomarc-faster-xml-marc-records That trade-off ought to offend both camps, though I happen to think it's quite clever. MJ On 2010-10-25, at 3:22 PM, Eric Hellman wrote: I think you'd have a very hard time demonstrating any speed advantage to MARC over MARCXML. XML parsers have been speed optimized out the wazoo; If there exists a MARC parser that has ever been speed-optimized without serious compromise, I'm sure someone on this list will have a good story about it. On Oct 25, 2010, at 3:05 PM, Patrick Hochstenbach wrote: Dear Nate, There is a trade-off: do you want very fast processing of data - go for binary data. 
do you want to share your data globally easily in many (not per se library related) environments - go for XML/RDF. Open your data and do both :-) Pat Sent from my iPhone On 25 Oct 2010, at 20:39, Nate Vack njv...@wisc.edu wrote: Hi all, I've just spent the last couple of weeks delving into and decoding a binary file format. This, in turn, got me thinking about MARCXML. In a nutshell, it looks like it's supposed to contain the exact same data as a normal MARC record, except in XML form. As in, it should be round-trippable. What's the advantage to this? I can see using a human-readable format for poorly-documented file formats -- they're relatively easy to read and understand. But MARC is well, well-documented, with more than one free implementation in cursory searching. And once you know a binary file's format, it's no harder to parse than XML, and the data's smaller and processing faster. So... why the XML? Curious, -Nate Eric Hellman President, Gluejar, Inc. 41 Watchung Plaza, #132 Montclair, NJ 07042 USA e...@hellman.net http://go-to-hellman.blogspot.com/ @gluejar -- Cory Rockliff Technical Services Librarian Bard Graduate Center: Decorative Arts, Design History, Material Culture 18 West 86th Street New York, NY 10024 T: (212) 501-3037 rockl...@bgc.bard.edu BGC Exhibitions: In the Main Gallery: January 26, 2011– April 17, 2011 Cloisonné: Chinese Enamels from the Yuan, Ming, and Qing Dynasties Organized in collaboration with the Musée des arts Décoratifs, Paris. In the Focus Gallery: January 26, 2011– April 17, 2011 Objects of Exchange: Social and Material Transformation on the Late-Nineteenth-Century Northwest Coast Organized in collaboration with the American Museum of Natural History
Re: [CODE4LIB] MARCXML - What is it for?
On 28 October 2010 17:37, MJ Suhonos m...@suhonos.ca wrote: Let me openly state that I've never used Turbomarc. I believe the special case they are referring to is the subfield code with a value of η, which is non-alphanumeric. I don't know enough about MARC to even begin guessing what this means or why it might occur (or not). The use case I see for Turbomarc is when you: 1- have a need for high performance 2- are converting binary MARC to XML 3- are writing your own XSLT to manipulate that XML (since it's not MARCXML) The first comment claims a 30-40% increase in XML parsing, which seems obvious when you compare the number of characters in the example provided: 277 vs. 419, or about 34% fewer going through the parser. The speedup can be much greater than that -- from the blog post itself, Using xsltproc --timing showed that our transformations were faster by a factor of 4-5. Shortening the element names only improved performance fractionally, but since everything counts, we decided to do this as well. xsltproc uses the highly optimised LibXML/LibXSLT stack, which I guess maybe doesn't have so much constant-time overhead as the PHP simplexml parser that yielded the smaller speedup.
Re: [CODE4LIB] MARCXML - What is it for?
The first comment claims a 30-40% increase in XML parsing, which seems obvious when you compare the number of characters in the example provided: 277 vs. 419, or about 34% fewer going through the parser. The speedup can be much greater than that -- from the blog post itself, Using xsltproc --timing showed that our transformations were faster by a factor of 4-5. Shortening the element names only improved performance fractionally, but since everything counts, we decided to do this as well. xsltproc uses the highly optimised LibXML/LibXSLT stack, which I guess maybe doesn't have so much constant-time overhead as the PHP simplexml parser that yielded the smaller speedup. Sure, but XML parsing (libxml) and XSLT (libxslt) transforming are very different operations. I would expect parsing to scale linearly with the byte-length of XML being parsed. XSLT, on the other hand, is presumably much more dependent on the complexity of the XSL being applied (depth/structure of XML, number of template matches, complexity of XPath statements, etc.) So I'd expect a series of XSLT transforms to have a much more variable change in performance than just parsing. As I say, if you're writing custom XSL anyway, then certainly having a more compact syntax is going to yield better performance. I'm sure to those for whom Turbomarc is useful, it's *very* useful, but it definitely seems to be nearing the limit of the readability-performance balance. ;-) Also, standing w00t: Indexdata++ MJ
Re: [CODE4LIB] MARCXML - What is it for?
I've been involved in several projects lambasted because managers think MARCXML is solving some imaginary problem It seems to me that this is really the heart of your argument. You had this experience, and now are projecting the opinions of these managers onto lots of people in the library world. I've worked in libraries for nearly a decade, and have never met anyone (manager or otherwise) who held the belief that XML in general, or MARC-XML in particular, somehow magically solves all metadata problems. I guess our two experiences cancel each other out, then. And, ultimately, none of that has anything to do with MARC-XML itself. --Dave == David Walker Library Web Services Manager California State University http://xerxes.calstate.edu From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Alexander Johannesen [alexander.johanne...@gmail.com] Sent: Monday, October 25, 2010 7:10 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MARCXML - What is it for? On Tue, Oct 26, 2010 at 12:48 PM, Bill Dueber b...@dueber.com wrote: Here, I think you're guilty of radically underestimating lots of people around the library world. No one thinks MARC is a good solution to our modern problems, and no one who actually knows what MARC is has trouble understanding MARC-XML as an XML serialization of the same old data -- certainly not anyone capable of meaningful contribution to work on an alternative. Slow down, Tex. Lots of people in the library world is not the same as developers, or even good developers, or even good XML developers, or even good XML developers who knows what the document model imposes to a data-centric approach. The problem we're dealing with is *hard*. Mind-numbingly hard. This is no justification for not doing things better. 
(And I'd love to know what the hard bits are; always interesting to hear from various people as to what they think are the *real* problems of library problems, as opposed to any other problem they have) The library world has several generations of infrastructure built around MARC (by which I mean AACR2), and devising data structures and standards that are a big enough improvement over MARC to warrant replacing all that infrastructure is an engineering and political nightmare. Political? For sure. Engineering? Not so much. This is just that whole blinded by MARC issue that keeps cropping up from time to time, and rightly so; it is truly a beast - at least the way we have come to know it through AACR2 and all its friends and its death-defying focus on all things bibliographic - that has paralyzed library innovation, probably to the point of making libraries almost irrelevant to the world. I'm happy to take potshots at the RDA stuff from the sidelines, but I never forget that I'm on the sidelines, and that the people active in the game are among the best and brightest we have to offer, working on a problem that invariably seems more intractable the deeper in you go. Well, that's a pretty scary sentence, for all sorts of reasons, but I think I shall not go there. If you think MARC-XML is some sort of an actual problem What, because you don't agree with me the problem doesn't exist? :) and that people just need to be shouted at to realize that and do something about it, then, well, I think you're just plain wrong. 
Fair enough, although you seem to be under the assumption that all of the stuff I'm saying is a figment of my imagination (I've been involved in several projects lambasted because managers think MARCXML is solving some imaginary problem; this is not bullshit, but pain and suffering from the battlefields of library development), that I'm not one of those developers (or one of you, although judging from this discussion it's clear that I am not), and that the things I say somehow don't apply because you don't agree with, umm, what I'm assuming is my somewhat direct approach to stating my heretical opinions. Alex -- Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps --- http://shelter.nu/blog/ -- -- http://www.google.com/profiles/alexander.johannesen ---
Re: [CODE4LIB] MARCXML - What is it for?
"One way is to first transform the MARC into MARC-XML. Then you can use XSLT to crosswalk the MARC-XML into that other schema. Very handy. Your criticisms of MARC-XML all seem to presume that MARC-XML is the goal, the end point in the process. But MARC-XML is really better seen as a utility, a middle step between binary MARC and the real goal, which is some other useful and interesting XML schema." Unless "useful and interesting" is a euphemism for Dublin Core, then using XSLT for crosswalking is not really an option. Well, not a good option. On the other end of the spectrum, assume ONIX for "useful and interesting" and XSLT simply won't work. Crosswalking doesn't hold water as a justification for MARCXML. /dev -- Devon Smith Consulting Software Engineer OCLC Research http://www.oclc.org/research/people/smith.htm -----Original Message----- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Walker, David Sent: Monday, October 25, 2010 8:57 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MARCXML - What is it for? [...]
--Dave == David Walker Library Web Services Manager California State University http://xerxes.calstate.edu From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Alexander Johannesen [alexander.johanne...@gmail.com] Sent: Monday, October 25, 2010 12:38 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MARCXML - What is it for? [...]
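The MARC-XML-as-middle-step workflow discussed above is easy to sketch even without XSLT. Here is a minimal Python illustration using only the standard library; the sample record and the tag-to-Dublin-Core mapping are invented for the demo and are nowhere near the full LOC crosswalk:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.loc.gov/MARC21/slim}"

# A tiny, hand-written MARCXML record used as input (illustrative only).
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Robertson, Robbie.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="4">
    <subfield code="a">The Last Waltz</subfield>
  </datafield>
</record>"""

def marcxml_to_dc(xml_text):
    """Crosswalk a MARCXML record to a flat Dublin Core-ish dict.
    The tag-to-element mapping below is a toy subset, not the
    official MARC-to-DC crosswalk."""
    root = ET.fromstring(xml_text)
    dc = {}
    for field in root.iter(NS + "datafield"):
        tag = field.get("tag")
        text = " ".join(sf.text for sf in field.iter(NS + "subfield") if sf.text)
        if tag == "245":
            dc["title"] = text
        elif tag in ("100", "110", "111"):
            dc.setdefault("creator", []).append(text)
    return dc

print(marcxml_to_dc(MARCXML))
```

In a real pipeline this step would typically be an XSLT stylesheet (LOC publishes several for MARCXML), but the point stands either way: MARCXML is the interchange form in the middle, not the destination.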
Re: [CODE4LIB] MARCXML - What is it for?
"But it looks just like the old thing using insert data scheme and some templates?" "Ah yes, but now we're doing it in XML!" I think this applies to 90% of instances where XML was adopted, especially within the enterprise IT industry. Through marketing or misunderstanding, XML was presumed to be the magic fairy dust that would solve countless problems simply by switching to it. The library world is certainly not unique in this respect. Returning to the original question, what is MARCXML for, I think there have been some very clear examples of where it can be useful to some people, sometimes. If it works for you, use it. If not, don't. To wit, I propose: Some people, when confronted with a problem, think "I know, I'll use MARCXML." Now they have three problems: MARC, XML, and the one they started with. Moving on. MJ
Re: [CODE4LIB] MARCXML - What is it for?
Alex, I think the problem is data like this: http://lccn.loc.gov/96516389/marcxml And while we can probably figure out a pattern to get the semantics out of this record, there is no telling how many other variations exist within our collections. So we've got lots of this data that is both hard to parse and, frankly, hard to find (since it has practically zero machine-readable data in fields we actually use), and it needs to coexist with some newer, semantically richer format. What I'm saying is that the library's legacy data problem is almost to the point of being existential. This is certainly a detriment to forward progress. Analogously (although at a much smaller scale), my wife and I have been trying for about two years to move our checking account from our out-of-state bank to something local. The problem is that we have built up a lot of infrastructure around our old bank (direct deposit and lots of automatic bill pay, etc.): migration would not only be time-consuming, any mistakes made could potentially be quite expensive, and we have a lot of uncertainty about how long it would actually take to migrate (and how that might affect the flow of payments, etc.). It's been, to date, easier for us just to drive across the state line (despite the fact that it's way out of our way to anywhere) rather than actually deal with it. In the meantime, more direct bill pay things have been set up and whatnot, making our eventual migration that much more difficult. I do think it would be useful to figure out what exactly in our legacy data is found only in libraries (that is, we could ditch this shoddy The Last Waltz record and pull the data from LinkedMDB or Freebase or somewhere) and determine the scale of the problem that only we can address, but even just this environmental scan is a fairly large undertaking. -Ross.
On Mon, Oct 25, 2010 at 10:10 PM, Alexander Johannesen alexander.johanne...@gmail.com wrote: [...]
Re: [CODE4LIB] MARCXML - What is it for?
"This is no justification for not doing things better. (And I'd love to know what the hard bits are; always interesting to hear from various people as to what they think are the *real* problems of library problems, as opposed to any other problem they have)" The problem is you have to deal with legacy systems and data. That's as real as it gets. That this is somehow a shortcoming peculiar to the library community is nonsense. Just changing the way dates were stored so that Y2K wasn't a big deal caused total chaos in the business world for years and required many billions of dollars worth of development. We still use 4-digit numeric PINs to access bank accounts. If I created some crummy website that used that level of protection, people would rightly call me an idiot. Eliminating MARC and basing systems on a completely different data structure would have a far more wide-reaching impact on system design than twiddling with a couple of date digits or allowing something more secure than 4 digits to protect access to thousands of dollars. So as crappy as our systems are, I don't buy that we're so much worse than everyone else out there. There is always the issue of developing the new standard in the first place, convincing all the vendors to adopt it, and retrofitting the systems to work with it. Problems are easiest to solve when it's someone else's job to make it happen. kyle -- -- Kyle Banerjee Digital Services Program Manager Orbis Cascade Alliance baner...@uoregon.edu / 503.877.9773
Re: [CODE4LIB] MARCXML - What is it for?
Hi, On Tue, Oct 26, 2010 at 1:23 PM, Bill Dueber b...@dueber.com wrote: Sorry. That was rude, and uncalled for. I disagree that the problem is easily solved, even without the politics. There've been lots of attempts to try to come up with a sufficiently expressive toolset for dealing with biblio data, and we're still working on it. If you do think you've got some insight, I'm sure we're all ears, but try to frame it in terms of the existing work if you can (RDA, some of the Dublin Core stuff, etc.) so we have a frame of reference. Well, I've whined enough both here and on NGC4LIB, and I'm kinda over it, just like I'm sure most people are over my whining. But suffice it to say that FRBR is a 15-year-old model that has still not been proven in the Real World[TM] in any meaningful way (the prototypes work fine until you dig a bit) and probably never will as long as MARC21 runs the show, and trying to stick RDA on top, with rules whose use cases are old enough to be my kids, well, I'm not very positive about that either. The direction of going ontological is a good one, and in the absence of anything else, RDF-infused FRBR / RDA is probably the way to go (except I'd ditch RDA and, uh, perhaps even FRBR, or at least seriously modify it), but the community is decidedly not talking about ontological interoperability, extensions, or the semantics involved to solve actual problems in the bibliographic world (including the fact that it is inherently bibliographic). There needs to be much more involvement by library geeks and managers in defining semantic reuse and extensibility, to properly define those things that are almost absent from AACR2 and friends: the relationships between the entities themselves. In other words, you need to get away from the record-centered view and embrace the subject-centric view. Anyway, enough from this old grumpy bum. Sorry to stir up the dust.
Regards, Alex -- Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps --- http://shelter.nu/blog/ -- -- http://www.google.com/profiles/alexander.johannesen ---
Re: [CODE4LIB] MARCXML - What is it for?
On Tue, 2010-10-26 at 03:32 +0200, Alexander Johannesen wrote: "Here's our new thing. And we did it by simply converting all our MARC into MARCXML that runs on a cron job every midnight, and a bit of horrendous XSLT that's impossible to maintain." I am in the development department of our library. We're a diverse bunch of guys, ranging from the bottom (that's me, hacking Lucene) to the top (our graphics guy). Somewhere in the middle we have two librarians. They do not program in traditional languages, but have been trained to produce XSLTs, and it actually works! They are capable of translating their vast knowledge of the myriad of standards we encounter into code that transforms our XML input into something we can use for indexing. "Aha!", you counter, "why not train them to use X instead, since X is much better at transforming normal MARC?" The answer is that MARC isn't the only format they need to handle. We currently have 20+ different sources that they need to transform. All of them except one are XML. The one is ISO 2709 MARC, which we - naturally - transform into MARCXML so that it can be processed the same way as the rest. There might be better tools than XSLT for transformation of XML that we could use, but the XML part is so ubiquitous at this point in time that it is the obvious choice for common ground. MARC is just one of many. It might be the most evil and unruly beast of the bunch, but we tame it with the same tools as the rest.
Re: [CODE4LIB] MARCXML - What is it for?
I think:
1. MARC must die. It has lived long enough.
2. But everybody uses MARC (which is in fact good); too many people are keeping it alive.
3. MARC in XML does not solve the problem, but it makes the suffering so much less painful.
Peter
Re: [CODE4LIB] MARCXML - What is it for?
"Political? For sure. Engineering? Not so much." "Ok. Solve it. Let us know when you're done." Wow, lamest reply so far. Surely you could muster a tad bit better? I was excited about getting a list of the hardest problems, for example; I'd love to see that. And then perhaps you could explain what this insurmountable, mind-boggling problem actually is, because, you know, you never actually said. Alex -- Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps --- http://shelter.nu/blog/ -- -- http://www.google.com/profiles/alexander.johannesen ---
Re: [CODE4LIB] MARCXML - What is it for?
I'd suspect that MARCXML isn't going anywhere fast - a shame, perhaps. The key difference between MARCXML and MARC is that MARCXML inherits XML's internationalisation features, an area where MARC is very poor. Andrew -- Andrew Cunningham Senior Project Manager, Research and Development Vicnet State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
Re: [CODE4LIB] MARCXML - What is it for?
On Oct 25, 2010, at 10:31 PM, Alexander Johannesen wrote: "Political? For sure. Engineering? Not so much." "Ok. Solve it. Let us know when you're done." Wow, lamest reply so far. Surely you could muster a tad bit better? I was excited about getting a list of the hardest problems, for example; I'd love to see that. And then perhaps you could explain what this insurmountable, mind-boggling problem actually is, because, you know, you never actually said. Now, now, boys. Don't make us turn this mailing list around and go right back home. Because we will. And you'll go to bed without dinner! Seriously, though, I've been following this thread closely since I'm new to the library world, and the petty bickering undermines both of your points and distracts from an otherwise intellectual and enlightening discussion. --Joel Joel Richard IT Specialist, Web Services Department Smithsonian Institution Libraries | http://www.sil.si.edu/ (202) 633-1706 | (202) 786-2861 (f) | richar...@si.edu
Re: [CODE4LIB] MARCXML - What is it for?
"Crosswalking doesn't hold water as a justification for MARCXML." To be fair, though, most of us have simpler crosswalking needs than OCLC. And if I need to go from binary MARC to some XML schema (which I sometimes do), then MARC-XML and the XSLT stylesheets at LOC seem like a pretty good starting point to me. Better than starting from scratch. Which isn't to say that that approach is always the right one for every project. I very much agree with MJ: If it works for you, use it. If not, don't. But if someone else has a better, general-purpose solution to this problem, then by all means open source that puppy and let the rest of us have at it! --Dave == David Walker Library Web Services Manager California State University http://xerxes.calstate.edu From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Smith,Devon [smit...@oclc.org] Sent: Tuesday, October 26, 2010 7:44 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MARCXML - What is it for? [...]
[CODE4LIB] MARCXML - What is it for?
Hi all, I've just spent the last couple of weeks delving into and decoding a binary file format. This, in turn, got me thinking about MARCXML. In a nutshell, it looks like it's supposed to contain the exact same data as a normal MARC record, except in XML form. As in, it should be round-trippable. What's the advantage to this? I can see using a human-readable format for poorly-documented file formats -- they're relatively easy to read and understand. But MARC is well, well-documented, with more than one free implementation in cursory searching. And once you know a binary file's format, it's no harder to parse than XML, and the data's smaller and processing faster. So... why the XML? Curious, -Nate
Re: [CODE4LIB] MARCXML - What is it for?
MARC records break parsing far too frequently. Apart from requiring no truly specialized tools, MARCXML should—should!—eliminate many of those problems. That's not to mention that MARC character sets vary a lot (DanMARC anyone?), and more even in practice than in theory. From my perspective the problem is simply that MARCXML isn't as ubiquitous as MARC. For what we do, at least, there's no point. We'd need to parse non-XML MARC data anyway. So if we're going to do it, we might as well do it for everything. Best, Tim On Mon, Oct 25, 2010 at 2:38 PM, Nate Vack njv...@wisc.edu wrote: [...] -- Check out my library at http://www.librarything.com/profile/timspalding
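One concrete way the XML toolchain helps with Tim's point about parsing breakage: a structurally broken record fails loudly, with a line and column, instead of being silently mis-parsed. A small standard-library Python sketch (the truncated record is fabricated for the demo):

```python
import xml.etree.ElementTree as ET

# A deliberately broken record: the closing </record> tag is missing.
BROKEN = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Unterminated</subfield>
  </datafield>"""

def parse_error(xml_text):
    """Return the parser's (line, column) diagnostic for malformed
    input, or None if the document parses cleanly."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        # err.position is a (line, column) tuple pointing at the failure.
        return err.position

print(parse_error(BROKEN))
```

A leader with a bad record length in binary MARC gives you nothing this precise; an off-the-shelf XML parser pinpoints the failure for free.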
Re: [CODE4LIB] MARCXML - What is it for?
I'm not a big user of MARCXML, but I can think of a few reasons off the top of my head:
- Existing libraries for reading, manipulating and searching XML-based documents are very mature.
- Documents can be validated for their well-formedness using these existing tools and a pre-defined schema (a validator for MARC would need to be custom-coded).
- MARCXML can easily be incorporated into XML-based meta-metadata schemas, like METS.
- It can be parsed and manipulated in a web service context without sending a binary blob over the wire.
- XML is self-describing, binary is not.
There's nothing stopping you from reading the MARCXML into a binary blob and working on it from there. But when sharing documents from different institutions around the globe, using a wide variety of tools and techniques, XML seems to be the lowest common denominator. -Andrew On 2010-10-25, at 2:38 PM, Nate Vack wrote: [...]
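Andrew's METS point (embedding MARCXML inside a larger XML document) comes down to namespaces. A hedged standard-library sketch; the `urn:example:wrapper` namespace is made up here, standing in for a real container schema such as METS:

```python
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# An invented wrapper document: a MARCXML record embedded, via
# namespaces, inside unrelated markup (as METS does with mdWrap).
WRAPPED = """<package xmlns="urn:example:wrapper"
                      xmlns:marc="http://www.loc.gov/MARC21/slim">
  <metadata>
    <marc:record>
      <marc:controlfield tag="001">ocm123</marc:controlfield>
    </marc:record>
  </metadata>
</package>"""

def embedded_records(xml_text):
    """Find every MARCXML record nested anywhere in a larger XML document."""
    root = ET.fromstring(xml_text)
    return root.findall(".//" + MARC_NS + "record")

recs = embedded_records(WRAPPED)
print(len(recs), recs[0].find(MARC_NS + "controlfield").text)
```

Because the record is addressed by namespace rather than by position, the same lookup works no matter what wrapper schema surrounds it; that is something binary MARC simply cannot do.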
Re: [CODE4LIB] MARCXML - What is it for?
Dear Nate, There is a trade-off: do you want very fast processing of data? Go for binary data. Do you want to share your data globally, easily, in many (not per se library-related) environments? Go for XML/RDF. Open your data and do both :-) Pat Sent from my iPhone On 25 Oct 2010, at 20:39, Nate Vack njv...@wisc.edu wrote: [...]
Re: [CODE4LIB] MARCXML - What is it for?
It's helpful to think of MARCXML as a sort of lingua franca. "Existing libraries for reading, manipulating and searching XML-based documents are very mature." Including XSLT and XPath; very powerful stuff. "There's nothing stopping you from reading the MARCXML into a binary blob and working on it from there. But when sharing documents from different institutions around the globe, using a wide variety of tools and techniques, XML seems to be the lowest common denominator." Assuming it's also round-trippable, MARC-in-JSON would accomplish this as well. Not to mention it's nice to be able to read and edit MARC records in any (any!!) text editor for those of us who are comfortable looking at JSON or XML but can't handle staring at binary bytestreams without having an aneurysm. MJ On 2010-10-25, at 2:38 PM, Nate Vack wrote: [...]
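MJ's MARC-in-JSON aside can be made concrete with nothing but the stdlib json module. The field layout below is invented for illustration (there are real MARC-in-JSON proposals, but this sketch does not follow any particular one); the point is the lossless, text-editor-friendly round trip:

```python
import json

# An invented (not standardized) JSON shape for a minimal MARC record.
record = {
    "leader": "00000nam a2200000 a 4500",
    "fields": [
        {"tag": "245", "ind1": "1", "ind2": "4",
         "subfields": [{"a": "The Last Waltz"}]},
    ],
}

# Serialize to text any editor can display, then parse it back.
text = json.dumps(record, indent=2)
roundtripped = json.loads(text)

print(roundtripped == record)
```

As with MARCXML, whether this is round-trippable with binary MARC depends entirely on the mapping being lossless, but the serialization itself costs two stdlib calls.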
Re: [CODE4LIB] MARCXML - What is it for?
"XML is self-describing, binary is not." Not to quibble, but that's only true in a theoretical sense here. Something like Amazon XML is truly self-describing. MARCXML is self-obfuscating. At least MARC records kinda imitate catalog cards. :) Tim On Mon, Oct 25, 2010 at 2:50 PM, Andrew Hankinson andrew.hankin...@gmail.com wrote: [...] -- Check out my library at http://www.librarything.com/profile/timspalding
Re: [CODE4LIB] MARCXML - What is it for?
On Monday, October 25, 2010 1:50 PM, Andrew Hankinson wrote: - Documents can be validated for their well-formedness using these existing tools and a pre-defined schema (a validator for MARC would need to be custom-coded) In Perl, MARC::Lint might be an example of such a validator (though I need to update it with the most recent MARC updates at some point soon). MarcEdit also includes a validator. Bryan Baldus bryan.bal...@quality-books.com eij...@cpan.org http://home.comcast.net/~eijabb/
Re: [CODE4LIB] MARCXML - What is it for?
I think you'd have a very hard time demonstrating any speed advantage to MARC over MARCXML. XML parsers have been speed-optimized out the wazoo; if there exists a MARC parser that has ever been speed-optimized without serious compromise, I'm sure someone on this list will have a good story about it. On Oct 25, 2010, at 3:05 PM, Patrick Hochstenbach wrote: [...] Eric Hellman President, Gluejar, Inc. 41 Watchung Plaza, #132 Montclair, NJ 07042 USA e...@hellman.net http://go-to-hellman.blogspot.com/ @gluejar
Re: [CODE4LIB] MARCXML - What is it for?
On Mon, Oct 25, 2010 at 2:09 PM, Tim Spalding t...@librarything.com wrote: - XML is self-describing, binary is not. Not to quibble, but that's only in a theoretical sense here. Something like Amazon XML is truly self-describing. MARCXML is self-obfuscating. At least MARC records kinda imitate catalog cards. Yeah -- this is kinda the source of my confusion. In the case of the files I'm reading, it's not that it's hard to find out where the nMeasurement field lives (it's six short ints starting at offset 64), but what the field means, and whether or not I care about it. Switching to an XML format doesn't help with that at all. WRT character encoding issues and validation: if MARC and MARCXML are round-trippable, a solution in one environment is equivalent to a solution in the other. And I think we've all seen plenty of unvalidated, badly-formed XML, and plenty with Character Encoding Problems™ ;-) Thanks for the input! -Nate
Re: [CODE4LIB] MARCXML - What is it for?
Hiya, On Tue, Oct 26, 2010 at 6:26 AM, Nate Vack njv...@wisc.edu wrote: Switching to an XML format doesn't help with that at all. I'm willing to take it further and say that MARCXML was the worst thing the library world ever did. Some might argue it was a good first step, and that it was better to have something rather than nothing, to which I respond: Poppycock! MARCXML is nothing short of evil. Not only does it go against every principle of good XML anywhere (don't rely on whitespace, structure over code, namespace conventions, identity management, document control, separation of entities and properties, and on and on), it breaks the ontological commitment that a better treatment of the MARC data could bring, deterring people from actually a) using the darn thing as anything but a bare minimal crutch, and b) expanding it to be actually useful and interesting. The quicker the library world can get rid of this monstrosity, the better, although I doubt that will ever happen; it will hang around like a foul stench for as long as there is MARC in the world. A long time. A long sad time. A few extra notes: http://shelterit.blogspot.com/2008/09/marcxml-beast-of-burden.html Can you tell I'm not a fan? :) Kind regards, Alex -- Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps --- http://shelter.nu/blog/ -- -- http://www.google.com/profiles/alexander.johannesen ---
Re: [CODE4LIB] MARCXML - What is it for?
I guess what I meant is that in MARCXML, you have a datafield element with subsequent subfield elements, each with fairly clear attributes, which, while not my idea of fun Sunday-afternoon reading, requires less specialized tooling to parse (hello TextMate!) and is a bit easier than trying to count INT positions. One quick XPath query and you can have all 245 fields, regardless of their length or position in the record. On 2010-10-25, at 3:26 PM, Nate Vack wrote: [...]
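[Editor's note] MJ's "one quick XPath query" can be made concrete. A minimal sketch using only Python's standard library; the sample record and title are invented for illustration, and ElementTree wants the namespace in Clark notation:

```python
import xml.etree.ElementTree as ET

# Namespace in Clark notation, as ElementTree expects it.
NS = "{http://www.loc.gov/MARC21/slim}"

# A tiny, invented MARCXML collection for illustration.
marcxml = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An example title :</subfield>
      <subfield code="b">a subtitle.</subfield>
    </datafield>
  </record>
</collection>"""

root = ET.fromstring(marcxml)
# One XPath query pulls every 245 field, wherever it sits in the record.
titles = [
    " ".join(sf.text for sf in df.findall(f"{NS}subfield"))
    for df in root.findall(f".//{NS}datafield[@tag='245']")
]
print(titles)  # → ['An example title : a subtitle.']
```

No byte-offset counting, no leader/directory arithmetic: the query reads the same whether the record has one field or a hundred.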
Re: [CODE4LIB] MARCXML - What is it for?
I'll just leave this here: http://www.indexdata.com/blog/2010/05/turbomarc-faster-xml-marc-records That trade-off ought to offend both camps, though I happen to think it's quite clever. MJ On 2010-10-25, at 3:22 PM, Eric Hellman wrote: [...]
Re: [CODE4LIB] MARCXML - What is it for?
On Mon, Oct 25, 2010 at 12:38 PM, Tim Spalding t...@librarything.com wrote: Does processing speed of something matter anymore? You'd have to be doing a LOT of processing to care, wouldn't you? Data migrations and data dumps are a common use case. Needing to break or make hundreds of thousands or millions of records is not uncommon. kyle
Re: [CODE4LIB] MARCXML - What is it for?
Does processing speed of something matter anymore? You'd have to be doing a LOT of processing to care, wouldn't you? Tim On Mon, Oct 25, 2010 at 3:35 PM, MJ Suhonos m...@suhonos.ca wrote: I'll just leave this here: http://www.indexdata.com/blog/2010/05/turbomarc-faster-xml-marc-records [...] -- Check out my library at http://www.librarything.com/profile/timspalding
Re: [CODE4LIB] MARCXML - What is it for?
On Mon, Oct 25, 2010 at 12:22 PM, Eric Hellman e...@hellman.net wrote: I think you'd have a very hard time demonstrating any speed advantage to MARC over MARCXML. XML parsers have been speed optimized out the wazoo; If there exists a MARC parser that has ever been speed-optimized without serious compromise, I'm sure someone on this list will have a good story about it. I'll take MarcEdit over an XML parser for MARCXML any day. For a benchmark test, try roundtripping a million records. Unless I've been messing with the wrong stuff, the differences are dramatic. kyle
Re: [CODE4LIB] MARCXML - What is it for?
Yes, it is designed to be a round-trippable expression of ordinary MARC in XML. Some reasons this is useful: 1. No maximum record length, unlike actual MARC, which tops out at ~100k (the record length in the leader is five digits). 2. You can use XSLT and other XML tools to work with it, and store it in stores optimized for XML (or that only accept XML), etc. 3. You can embed it inside XML schemas that allow arbitrary embeddable XML. 4. (Of much lesser importance than the others, but it still ends up being important to me -- saving the time of the developer does matter) it's a lot easier to debug the raw data; it doesn't require me to open up a hex editor and count bytes. Nate Vack wrote: [...]
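[Editor's note] Point 3 can be sketched concretely: because a MARCXML record is just XML, it can be dropped whole into any wrapper schema that allows embedded markup. A minimal stdlib sketch; the `<envelope>` and `<metadata>` element names here are invented stand-ins, not a real standard (OAI-PMH's `<metadata>` container is the classic real case):

```python
import xml.etree.ElementTree as ET

MARC = "{http://www.loc.gov/MARC21/slim}"

# Build a one-field MARCXML record in memory (content invented).
record = ET.Element(MARC + "record")
df = ET.SubElement(record, MARC + "datafield",
                   {"tag": "245", "ind1": "0", "ind2": "0"})
ET.SubElement(df, MARC + "subfield", {"code": "a"}).text = "Embedded title"

# Drop the record whole into a wrapper document. Any schema that permits
# arbitrary embedded XML can carry it; no re-serialization needed.
envelope = ET.Element("envelope")
ET.SubElement(envelope, "metadata").append(record)

wrapped = ET.tostring(envelope, encoding="unicode")
print(wrapped)
```

The namespace keeps the MARC elements distinguishable from the wrapper's own, which is exactly what the binary format cannot offer.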
Re: [CODE4LIB] MARCXML - What is it for?
MODS was an attempt to mostly-but-not-entirely-roundtrippably represent MARC data in a format that's more 'normal' XML, without packed bytes in elements, with element names that are more or less self-documenting, etc. It's caught on even less than MARCXML, though, so if you find MARCXML under-adopted (I disagree), you won't like MODS. Personally I think MODS is kind of the worst of both worlds. The only reason to stick with something that looks anything like MARC is to be round-trippable with legacy MARC, which MODS is not. But if you're going to give that up, you really want more improvements than MODS supplies; it's still got a lot of the unfortunate legacy of MARC in it. Nate Vack wrote: [...]
Re: [CODE4LIB] MARCXML - What is it for?
Marc in JSON can be a nice middle-ground: faster/smaller than MARCXML (although probably still not as fast or small as binary), based on a standard low-level data format so it's easier to work with using existing tools (and developers' eyes) than binary, and it has no maximum record length. There have been a couple of competing attempts to define a marc-expressed-in-json 'standard'; none have really caught on yet. I like Ross's latest attempt: http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/ Patrick Hochstenbach wrote: [...]
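[Editor's note] For the curious, a single record shaped roughly along the lines of Ross's marc-in-json proposal looks like this. The field layout follows my reading of the linked post (control fields map tag to string; data fields map tag to an object with indicators and a subfield list), and the record content is invented, so treat it as a sketch rather than the spec:

```python
import json

# One record, roughly in the shape of Ross Singer's marc-in-json proposal
# (see the linked post for the authoritative layout); content is invented.
record = {
    "leader": "00000nam a2200000 a 4500",
    "fields": [
        {"001": "ocm00000001"},          # control field: tag -> string
        {"245": {                        # data field: tag -> object
            "ind1": "1",
            "ind2": "0",
            "subfields": [{"a": "An example title"}],
        }},
    ],
}

# Round-trip through a plain JSON string -- no MARC-specific parser needed.
roundtrip = json.loads(json.dumps(record))
title = next(f["245"] for f in roundtrip["fields"] if "245" in f)
print(title["subfields"][0]["a"])  # → An example title
```

Order of fields and repeated tags survive because `fields` is an array rather than an object, which is the part of the design that makes it round-trippable.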
Re: [CODE4LIB] MARCXML - What is it for?
Tim Spalding wrote: Does processing speed of something matter anymore? You'd have to be doing a LOT of processing to care, wouldn't you? Yes, which sometimes you are. Say, when you're indexing 2 or 3 or 10 million MARC records into, say, Solr. Which is faster depends on what language and what libraries you are using for both binary MARC and MARCXML. But in many of our experiences, parsing and serializing binary MARC _is_ significantly faster than parsing and serializing MARCXML. That is of course just one of the various criteria that come into play when choosing a format. Here are Bill Dueber's benchmarks comparing MARCXML, binary MARC, and a marc-in-json format, in Ruby, using various library alternatives. I rather like the marc-in-json format for being a happy medium. Whether it's standard or not doesn't necessarily matter when you're dealing with your own records, passing them through several stops on a toolchain, and have tools available that can do it. Who cares if any/everyone else uses it. http://robotlibrarian.billdueber.com/sizespeed-of-various-marc-serializations-using-ruby-marc/
Re: [CODE4LIB] MARCXML - What is it for?
JSON++ I routinely re-index about 2.5M JSON records (originally from binary MARC), and it's several orders of magnitude faster than XML (measured in single-digit minutes rather than double-digit hours). I'm not sure if it's in the same range as binary MARC, but as Tim says, it's plenty fast enough for pragmatic purposes. Unfortunately JSON doesn't have as many mature tools for manipulation as XML (yet?), but I'd be inclined to call it the best of both worlds rather than a middle-ground or compromise. MJ Marc in JSON can be a nice middle-ground... I like Ross's latest attempt: http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/ [...]
Re: [CODE4LIB] MARCXML - What is it for?
Kyle Banerjee wrote: On Mon, Oct 25, 2010 at 12:38 PM, Tim Spalding t...@librarything.com wrote: Does processing speed of something matter anymore? You'd have to be doing a LOT of processing to care, wouldn't you? Data migrations and data dumps are a common use case. Needing to break or make hundreds of thousands or millions of records is not uncommon. kyle To make this concrete, we process the MARC records from 14 separate ILSs throughout the University of Wisconsin System. We extract, sort on OCLC number, dedup, and merge pieces from any campus that has a record for the work. The MARC that we then index and display here http://forward.library.wisconsin.edu/catalog/ocm37443537?school_code=WU is not identical to the version of the MARC record from any of the 4 schools that hold it. We extract 13 million records and dedup down to 8 million every week. Speed is paramount. -sm -- Stephen Meyer Library Application Developer UW-Madison Libraries 436 Memorial Library 728 State St. Madison, WI 53706 sme...@library.wisc.edu 608-265-2844 (ph) Just don't let the human factor fail to be a factor at all. - Andrew Bird, Tables and Chairs
Re: [CODE4LIB] MARCXML - What is it for?
It really is possible to make your point without being quite so obnoxious. Everyone else seems to be able to do so. --Ray -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Alexander Johannesen Sent: Monday, October 25, 2010 3:38 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MARCXML - What is it for? [...]
Re: [CODE4LIB] MARCXML - What is it for?
Ray Denenberg, Library of Congress r...@loc.gov wrote: It really is possible to make your point without being quite so obnoxious. Obnoxious? Alex -- Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps --- http://shelter.nu/blog/ -- -- http://www.google.com/profiles/alexander.johannesen ---
Re: [CODE4LIB] MARCXML - What is it for?
I know there are two parts to this discussion (speed on the one hand, applicability/features on the other), but for the former, running a little benchmark just isn't that hard. Aren't we supposed to, you know, prefer to make decisions based on data? Note: I'm only testing deserialization because there isn't, as of now, a fast serialization option for ruby-marc. It uses REXML, and it's dog-slow. I already looked at marc-in-json vs. marc binary at http://robotlibrarian.billdueber.com/sizespeed-of-various-marc-serializations-using-ruby-marc/ Benchmark source: http://gist.github.com/645683 18,883 records as either an XML collection or newline-delimited JSON. Open the file, read every record, pull out a title. Repeat 5 times for a total of 94,415 records (i.e., just under 100K records total). Under ruby-marc, using the libxml deserializer is the fastest option. If you're using the REXML parser, well, god help us all. ruby 1.8.7 (2010-08-16 patchlevel 302) [i686-darwin9.8.0]. User time reported in seconds. xml w/libxml 227 seconds marc-in-json w/yajl 130 seconds So... quite a bit faster (more than 40%). For a million records (assuming I can just say 10*these_values) you're talking about a difference of 16 minutes due to just reading speed. Assuming, of course, you're running your code on my desktop. Today. For the 8M records I have to deal with, that'd be roughly 8M * ((227-130) / 94,415) = 8,219 seconds, or about 137 minutes. So... a lot. Of course, if you're using a slower XML library or a slower JSON library, your numbers will vary quite a bit. REXML is unforgivingly slow, and json/pure (and even 'json') are quite a bit slower than yajl. And don't forget that you need to serialize these things from your source somehow... -Bill- On Mon, Oct 25, 2010 at 4:23 PM, Stephen Meyer sme...@library.wisc.edu wrote: [...] -- Bill Dueber Library Systems Programmer University of Michigan Library
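[Editor's note] For anyone who wants to poke at the methodology without Ruby, here is a stripped-down version of the same race using only Python's standard library. The records are invented one-field stubs, so the absolute numbers mean nothing; the sketch only shows the shape of the benchmark (deserialize N times, pull a title each time):

```python
import json
import time
import xml.etree.ElementTree as ET

NS = "{http://www.loc.gov/MARC21/slim}"

# One invented record in each serialization.
xml_rec = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<datafield tag="245" ind1="0" ind2="0">'
    '<subfield code="a">A title</subfield></datafield></record>'
)
json_rec = json.dumps(
    {"fields": [{"245": {"ind1": "0", "ind2": "0",
                         "subfields": [{"a": "A title"}]}}]}
)

def title_from_xml(s):
    root = ET.fromstring(s)
    return root.find(f".//{NS}datafield[@tag='245']/{NS}subfield").text

def title_from_json(s):
    field = next(f["245"] for f in json.loads(s)["fields"] if "245" in f)
    return field["subfields"][0]["a"]

# Deserialize each record N times, pulling out the title, and time both paths.
N = 10_000
t0 = time.perf_counter()
for _ in range(N):
    title_from_xml(xml_rec)
t1 = time.perf_counter()
for _ in range(N):
    title_from_json(json_rec)
t2 = time.perf_counter()
print(f"xml: {t1 - t0:.3f}s  json: {t2 - t1:.3f}s")
```

As Bill says, which side wins (and by how much) depends entirely on the parser libraries in play, so run it with your own stack before drawing conclusions.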
Re: [CODE4LIB] MARCXML - What is it for?
b) expanding it to be actually useful and interesting. But here I think you've missed the very utility of MARC-XML. Let's say you have a binary MARC file (the kind that comes out of an ILS) and want to transform that into MODS, Dublin Core, or maybe some other XML schema. How would you do that? One way is to first transform the MARC into MARC-XML. Then you can use XSLT to crosswalk the MARC-XML into that other schema. Very handy. Your criticisms of MARC-XML all seem to presume that MARC-XML is the goal, the end point in the process. But MARC-XML is really better seen as a utility, a middle step between binary MARC and the real goal, which is some other useful and interesting XML schema. --Dave == David Walker Library Web Services Manager California State University http://xerxes.calstate.edu From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Alexander Johannesen [alexander.johanne...@gmail.com] Sent: Monday, October 25, 2010 12:38 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MARCXML - What is it for? [...]
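[Editor's note] David's middle-step pipeline, shrunk to a toy: parse a MARCXML record and crosswalk one field into a Dublin Core element. Real crosswalks use the XSLT stylesheets LC publishes with the MARCXML toolkit; this stdlib sketch (with an invented record) only illustrates the shape of the step:

```python
import xml.etree.ElementTree as ET

MARC = "{http://www.loc.gov/MARC21/slim}"
DC = "{http://purl.org/dc/elements/1.1/}"

# An invented one-field MARCXML record (the middle step: binary MARC would
# already have been converted to this, e.g. with yaz-marcdump).
marcxml = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<datafield tag="245" ind1="0" ind2="0">'
    '<subfield code="a">An example title</subfield></datafield></record>'
)

record = ET.fromstring(marcxml)
title = record.find(
    f".//{MARC}datafield[@tag='245']/{MARC}subfield[@code='a']"
).text

# Crosswalk: map the MARC title field onto a Dublin Core title element.
dc = ET.Element("metadata")
ET.SubElement(dc, DC + "title").text = title
dc_xml = ET.tostring(dc, encoding="unicode")
print(dc_xml)
```

The point is exactly David's: once the data is MARCXML, the MARC-specific work is over, and the rest is generic XML tooling.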
Re: [CODE4LIB] MARCXML - What is it for?
On Oct 25, 2010, at 8:56 PM, Walker, David wrote: Your criticisms of MARC-XML all seem to presume that MARC-XML is the goal, the end point in the process. But MARC-XML is really better seen as a utility, a middle step between binary MARC and the real goal, which is some other useful and interesting XML schema. Exactly. -- Eric Morgan
Re: [CODE4LIB] MARCXML - What is it for?
On Tue, Oct 26, 2010 at 11:56 AM, Walker, David dwal...@calstate.edu wrote: Your criticisms of MARC-XML all seem to presume that MARC-XML is the goal, the end point in the process. But MARC-XML is really better seen as a utility, a middle step between binary MARC and the real goal, which is some other useful and interesting XML schema. How do you create an ontological commitment in a community to an expanding and useful set of tools and vocabularies? I think I need to remind people of what MARCXML is supposed to be: a framework for working with MARC data in an XML environment. This framework is intended to be flexible and extensible to allow users to work with MARC data in ways specific to their needs. The framework itself includes many components such as schemas, stylesheets, and software tools. I'm not assuming MARCXML is a goal, no matter how we define that. I'm poo-pooing MARCXML for the semantics we, as a community, have been given by a process I suspect had goals very different from reality. Very few people would work with MARC through MARCXML; they would use it to convert it, filter it, hack around it to something else entirely. And I'm afraid lots of people are missing the point: stunting the development of a community by embracing tools that push a packet that inhibits innovation. So, here's the point, paraphrased: Here's our new thing. And we did it by simply converting all our MARC into MARCXML that runs on a cron job every midnight, and a bit of horrendous XSLT that's impossible to maintain. But it looks just like the old thing using MARC and some templates? Ah yes, but now we're doing it in XML! (Yeah, yeah, your mileage will vary) I'm sorry if I'm overly pessimistic about the XML goodness in the world, not for the XML itself, but the consequences of the named entities involved.
I've been a die-hard XML wonk for far too many years, and the tools in that tool-chest don't automatically solve hard problems better by wrapping stuff up in angle brackets, and - dare I say it? - perhaps introduce a whole fleet of other problems rarely talked about when XML is the latest buzz-word, like using a document model on what's a traditional records model, character encodings, whitespace issues, Unicode, size and efficiencies (the other part of this thread), and so on. But let me also be a bit more specific about that hard semantic problem I'm talking about. Lots of people around the library world infrastructure will think that since your data is now in XML it has taken some important step towards being inter-operable with the rest of the world, that library data now is part of the real world in *any* meaningful way, but this is simply demonstrably not true. Having our data in XML has killed a few good projects where people have gone A new project to convert our MARC into useful XML? Aha! LoC has already solved that problem for us. Btw, to those who find me so obnoxious, at no point do I say it was intentionally evil, just evil all the same. The road to hell is, as always, paved with good intentions. Alex -- Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps --- http://shelter.nu/blog/ -- -- http://www.google.com/profiles/alexander.johannesen ---
Re: [CODE4LIB] MARCXML - What is it for?
On Mon, Oct 25, 2010 at 9:32 PM, Alexander Johannesen alexander.johanne...@gmail.com wrote: Lots of people around the library world infra-structure will think that since your data is now in XML it has taken some important step towards being inter-operable with the rest of the world, that library data now is part of the real world in *any* meaningful way, but this is simply demonstrably deceivingly not true. Here, I think you're guilty of radically underestimating lots of people around the library world. No one thinks MARC is a good solution to our modern problems, and no one who actually knows what MARC is has trouble understanding MARC-XML as an XML serialization of the same old data -- certainly not anyone capable of meaningful contribution to work on an alternative. You seem to presuppose that there's an enormous pent-up energy poised to sweep in changes to an obviously-better data format, and that the existence of MARC-XML somehow defuses all that energy. The truth is that a high percentage of people that work with MARC data actively think about (or curse) things that are wrong with it and gobs and gobs of ridiculously-smart people work on a variety of alternate solutions (not the least of which is RDA) and get their organizations to spend significant money to do so. The problem we're dealing with is *hard*. Mind-numbingly hard. The library world has several generations of infrastructure built around MARC (by which I mean AACR2), and devising data structures and standards that are a big enough improvement over MARC to warrant replacing all that infrastructure is an engineering and political nightmare. I'm happy to take potshots at the RDA stuff from the sidelines, but I never forget that I'm on the sidelines, and that the people active in the game are among the best and brightest we have to offer, working on a problem that invariably seems more intractable the deeper in you go. 
If you think MARC-XML is some sort of an actual problem, and that people just need to be shouted at to realize that and do something about it, then, well, I think you're just plain wrong. -Bill- -- Bill Dueber Library Systems Programmer University of Michigan Library
Re: [CODE4LIB] MARCXML - What is it for?
On Tue, Oct 26, 2010 at 12:48 PM, Bill Dueber b...@dueber.com wrote:

> Here, I think you're guilty of radically underestimating lots of people around the library world. No one thinks MARC is a good solution to our modern problems, and no one who actually knows what MARC is has trouble understanding MARC-XML as an XML serialization of the same old data -- certainly not anyone capable of meaningful contribution to work on an alternative.

Slow down, Tex. "Lots of people in the library world" is not the same as developers, or even good developers, or even good XML developers, or even good XML developers who know what the document model imposes on a data-centric approach.

> The problem we're dealing with is *hard*. Mind-numbingly hard.

That is no justification for not doing things better. (And I'd love to know what the hard bits are; it's always interesting to hear what various people think the *real* library problems are, as opposed to any other problems they have.)

> The library world has several generations of infrastructure built around MARC (by which I mean AACR2), and devising data structures and standards that are a big enough improvement over MARC to warrant replacing all that infrastructure is an engineering and political nightmare.

Political? For sure. Engineering? Not so much. This is just that whole "blinded by MARC" issue that keeps cropping up from time to time, and rightly so; it is truly a beast - at least the way we have come to know it through AACR2 and all its friends, with its death-defying focus on all things bibliographic - that has paralyzed library innovation, probably to the point of making libraries almost irrelevant to the world.

> I'm happy to take potshots at the RDA stuff from the sidelines, but I never forget that I'm on the sidelines, and that the people active in the game are among the best and brightest we have to offer, working on a problem that invariably seems more intractable the deeper in you go.

Well, that's a pretty scary sentence, for all sorts of reasons, but I think I shall not go there.

> If you think MARC-XML is some sort of an actual problem

What, because you don't agree with me the problem doesn't exist? :)

> and that people just need to be shouted at to realize that and do something about it, then, well, I think you're just plain wrong.

Fair enough, although you seem to be under the assumption that all the stuff I'm saying is a figment of my imagination (I've been involved in several projects lambasted because managers think MARCXML solves some imaginary problem; this is not bullshit, but pain and suffering from the battlefields of library development), that I'm not one of those developers (or one of you, although judging from this discussion it's clear that I am not), and that the things I say somehow don't apply because you don't agree with, umm, what I'm assuming is my somewhat direct approach to stating my heretic opinions.

Alex
--
Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
--- http://shelter.nu/blog/ --- http://www.google.com/profiles/alexander.johannesen ---
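[Editor's note: the point that MARCXML is just "an XML serialization of the same old data" can be made concrete with a small sketch. This is not code from the thread; the record content is invented, and only the well-known MARC21slim structure (leader, controlfield, datafield, subfield, namespace http://www.loc.gov/MARC21/slim) is assumed.]

```python
# A minimal sketch: the MARC leader/field/subfield structure maps
# one-to-one onto MARCXML elements, so "reading a field" is just an
# XPath lookup. Record content below is invented for illustration.
import xml.etree.ElementTree as ET

NS = "http://www.loc.gov/MARC21/slim"

record_xml = f"""<record xmlns="{NS}">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">demo0001</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>"""

record = ET.fromstring(record_xml)

def subfield(record, tag, code):
    """Pull a subfield value the same way one would read a MARC field."""
    path = f"{{{NS}}}datafield[@tag='{tag}']/{{{NS}}}subfield[@code='{code}']"
    el = record.find(path)
    return el.text if el is not None else None

print(subfield(record, "245", "a"))  # An example title
```

Note that the schema also imposes element order (leader before controlfields before datafields), which is exactly the validation error quoted earlier in the thread.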
Re: [CODE4LIB] MARCXML - What is it for?
I'm not a coder, but I undertook a study of XML some years after it came onto the scene, with the (likely confused) notion that it would be the next significant technology. I learned some XSL, and later was able to weave PubMed Central journal information (CSV transformed into XML) together with Dublin Core metadata of journal articles into MARCXML during harvest with MarcEdit (which the inestimable Terry Reese continues to tweak). I also used the same XML journal data to augment NLM journal records with PubMed Central holdings and other data, via a transform in my IDE, though it took me weeks to get right... so, no aspirations to become a coder. I probably did not get all of the MARC cataloging rules right, and I can empathize with those who come to MARC and cataloging standards without cataloging training or experience. My library experience was primarily as a library director, so my expertise in library specializations would always be under question. regards, dana -- Dana Pearson dbpearsonmlis.com
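[Editor's note: a hedged sketch of the kind of crosswalk Dana describes, with tabular journal data mapped into MARCXML datafields. This is not his XSLT or MarcEdit workflow; all field choices and values here are invented for illustration, and a real conversion handles far more of the cataloging rules than this toy does.]

```python
# Sketch: rows of journal data (a CSV string with invented values)
# crosswalked into MARCXML datafields inside a <collection>.
import csv
import io
import xml.etree.ElementTree as ET

NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("", NS)  # serialize with a default namespace

CSV_DATA = """title,issn
Journal of Examples,1234-5678
"""

def row_to_record(row):
    """Map one CSV row to a minimal MARCXML record (illustrative fields only)."""
    record = ET.Element(f"{{{NS}}}record")
    leader = ET.SubElement(record, f"{{{NS}}}leader")
    leader.text = "00000nas a2200000 a 4500"
    for tag, code, value in (("022", "a", row["issn"]),      # ISSN
                             ("245", "a", row["title"])):    # title
        df = ET.SubElement(record, f"{{{NS}}}datafield",
                           {"tag": tag, "ind1": " ", "ind2": " "})
        sf = ET.SubElement(df, f"{{{NS}}}subfield", {"code": code})
        sf.text = value
    return record

collection = ET.Element(f"{{{NS}}}collection")
for row in csv.DictReader(io.StringIO(CSV_DATA)):
    collection.append(row_to_record(row))

xml_out = ET.tostring(collection, encoding="unicode")
print(xml_out)
```

The resulting file is what tools like MarcEdit or yaz-marcdump (mentioned earlier in the thread) can then turn into binary ISO2709 MARC.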
Re: [CODE4LIB] MARCXML - What is it for?
On Mon, Oct 25, 2010 at 10:10 PM, Alexander Johannesen alexander.johanne...@gmail.com wrote:
> Political? For sure. Engineering? Not so much.

Ok. Solve it. Let us know when you're done.

--
Bill Dueber
Library Systems Programmer
University of Michigan Library
Re: [CODE4LIB] MARCXML - What is it for?
Sorry. That was rude, and uncalled for. I disagree that the problem is easily solved, even without the politics. There've been lots of attempts to come up with a sufficiently expressive toolset for dealing with biblio data, and we're still working on it. If you do think you've got some insight, I'm sure we're all ears, but try to frame it in terms of the existing work if you can (RDA, some of the Dublin Core stuff, etc.) so we have a frame of reference.

On Mon, Oct 25, 2010 at 10:18 PM, Bill Dueber b...@dueber.com wrote:
> On Mon, Oct 25, 2010 at 10:10 PM, Alexander Johannesen alexander.johanne...@gmail.com wrote:
>> Political? For sure. Engineering? Not so much.
> Ok. Solve it. Let us know when you're done.

--
Bill Dueber
Library Systems Programmer
University of Michigan Library