Re: [CODE4LIB] marcxml to mrc

2015-08-27 Thread Sergio Letuche
many kudos for the quick response, and the valuable advice

i get many many times this...

element datafield: Schemas validity error : Element '{
http://www.loc.gov/MARC21/slim}datafield': This element is not expected.
Expected is ( {http://www.loc.gov/MARC21/slim}leader ).

2015-08-27 16:52 GMT+03:00 Galen Charlton g...@esilibrary.com:

 Hi,

 On Thu, Aug 27, 2015 at 9:41 AM, Sergio Letuche code4libus...@gmail.com
 wrote:
  <?xml version="1.0" encoding="UTF-8"?>
  <collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>

 OK, that's what I'd expect from a MARC21slim MARCXML file.  Since you
 say the file is valid, try comparing it against the schema.  This
 could be done like this:

 curl http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd > MARC21slim.xsd
 xmllint --noout --schema MARC21slim.xsd INPUTFILE

 Regards,

 Galen
 --
 Galen Charlton
 Infrastructure and Added Services Manager
 Equinox Software, Inc. / The Open Source Experts
 email:  g...@esilibrary.com
 direct: +1 770-709-5581
 cell:   +1 404-984-4366
 skype:  gmcharlt
 web:http://www.esilibrary.com/
 Supporting Koha and Evergreen: http://koha-community.org 
 http://evergreen-ils.org



Re: [CODE4LIB] marcxml to mrc

2015-08-27 Thread Galen Charlton
Hi,

On Thu, Aug 27, 2015 at 9:22 AM, Sergio Letuche code4libus...@gmail.com wrote:
 How could one make an mrc from a marcxml record?

 what could be the yaz-marcdump command for this? We have utf-8 iso2709
 records

To convert from ISO2709 to MARCXML:

yaz-marcdump -i marc -o marcxml INPUTFILE > OUTPUTFILE

To go from MARCXML to ISO2709:

yaz-marcdump -i marcxml -o marc INPUTFILE > OUTPUTFILE
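
For a scripted alternative, a minimal Python sketch along these lines
should do the same MARCXML-to-ISO2709 conversion, assuming the
third-party pymarc library is installed (file names are placeholders):

  # Sketch only: read MARCXML records and write them back out as ISO2709.
  from pymarc import parse_xml_to_array, MARCWriter

  records = parse_xml_to_array("input.xml")     # MARCXML -> list of Record objects
  with open("output.mrc", "wb") as fh:
      writer = MARCWriter(fh)
      for record in records:
          writer.write(record)                  # each record serialized as binary MARC
      writer.close()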

Regards,

Galen
-- 
Galen Charlton
Infrastructure and Added Services Manager
Equinox Software, Inc. / The Open Source Experts
email:  g...@esilibrary.com
direct: +1 770-709-5581
cell:   +1 404-984-4366
skype:  gmcharlt
web:http://www.esilibrary.com/
Supporting Koha and Evergreen: http://koha-community.org 
http://evergreen-ils.org


Re: [CODE4LIB] marcxml to mrc

2015-08-27 Thread Galen Charlton
Hi,

On Thu, Aug 27, 2015 at 9:41 AM, Sergio Letuche code4libus...@gmail.com wrote:
 <?xml version="1.0" encoding="UTF-8"?>
 <collection xmlns="http://www.loc.gov/MARC21/slim">
 <record>

OK, that's what I'd expect from a MARC21slim MARCXML file.  Since you
say the file is valid, try comparing it against the schema.  This
could be done like this:

curl http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd > MARC21slim.xsd
xmllint --noout --schema MARC21slim.xsd INPUTFILE
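
The same check can be scripted; a rough sketch using the third-party
lxml library, assuming MARC21slim.xsd has already been downloaded as
above (the input file name is a placeholder):

  # Sketch only: validate a MARCXML file against the MARC21slim schema.
  from lxml import etree

  schema = etree.XMLSchema(etree.parse("MARC21slim.xsd"))
  doc = etree.parse("INPUTFILE.xml")
  if not schema.validate(doc):
      for error in schema.error_log:
          print(error.line, error.message)      # report each schema violation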

Regards,

Galen
-- 
Galen Charlton
Infrastructure and Added Services Manager
Equinox Software, Inc. / The Open Source Experts
email:  g...@esilibrary.com
direct: +1 770-709-5581
cell:   +1 404-984-4366
skype:  gmcharlt
web:http://www.esilibrary.com/
Supporting Koha and Evergreen: http://koha-community.org 
http://evergreen-ils.org


Re: [CODE4LIB] marcxml to mrc

2015-08-27 Thread Mark Sullivan
Terry Reese created MarcEdit, which is a great program and will do just 
what you need.


http://marcedit.reeset.net/


Mark Sullivan
Executive Director
IDS Project
Milne Library
1 College Circle
SUNY Geneseo
Geneseo, NY 14454
(585) 245-5172

On 8/27/2015 9:22 AM, Sergio Letuche wrote:

Hello,

How could one make an mrc from a marcxml record?

what could be the yaz-marcdump command for this? We have utf-8 iso2709
records


Re: [CODE4LIB] marcxml to mrc

2015-08-27 Thread Sergio Letuche
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>

2015-08-27 16:40 GMT+03:00 Galen Charlton g...@esilibrary.com:

 Hi,

 On Thu, Aug 27, 2015 at 9:36 AM, Sergio Letuche code4libus...@gmail.com
 wrote:
  and i get
  yaz_marc_read_xml failed
 
  the marcxml is valid, what could i do more?

 Could you send the beginning snippet of the MARCXML file? That
 yaz-marcdump command I sent assumes that the MARC21slim XML
 serialization is used, but there are others.

 Regards,

 Galen
 --
 Galen Charlton
 Infrastructure and Added Services Manager
 Equinox Software, Inc. / The Open Source Experts
 email:  g...@esilibrary.com
 direct: +1 770-709-5581
 cell:   +1 404-984-4366
 skype:  gmcharlt
 web:http://www.esilibrary.com/
 Supporting Koha and Evergreen: http://koha-community.org 
 http://evergreen-ils.org



Re: [CODE4LIB] marcxml to mrc

2015-08-27 Thread Galen Charlton
Hi,

On Thu, Aug 27, 2015 at 10:03 AM, Sergio Letuche
code4libus...@gmail.com wrote:
 many kudos for the quick response, and the valuable advice

 i get many many times this...

 element datafield: Schemas validity error : Element '{
 http://www.loc.gov/MARC21/slim}datafield': This element is not expected.
 Expected is ( {http://www.loc.gov/MARC21/slim}leader ).

Sounds like some or all of the records lack a leader element, and that
would lead to the "yaz_marc_read_xml failed" you've been seeing.
Making sure that each record starts with a leader element along the
lines of

  <leader>0a220   4500</leader>

might get yaz-marcdump to emit ISO2709 records.  However, for those to
be useful, you'll likely want to set positions 5, 6, and 7 in the
leader to values that make sense for the type of records you're
dealing with.
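
If editing the records by hand isn't practical, one way to patch them is
a small script; the sketch below uses only Python's standard library,
and the leader value in it is just a placeholder that should be adjusted
as described above (file names are placeholders too):

  # Sketch only: insert a placeholder leader into records that lack one.
  import xml.etree.ElementTree as ET

  NS = "http://www.loc.gov/MARC21/slim"
  ET.register_namespace("", NS)                 # keep the default namespace on output
  PLACEHOLDER = "00000nam a2200000 a 4500"      # 24 chars; set positions 5-7 for your data

  tree = ET.parse("input.xml")
  for record in tree.getroot().iter("{%s}record" % NS):
      if record.find("{%s}leader" % NS) is None:
          leader = ET.Element("{%s}leader" % NS)
          leader.text = PLACEHOLDER
          record.insert(0, leader)              # the leader must be the first child
  tree.write("output.xml", encoding="UTF-8", xml_declaration=True)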

Regards,

Galen
-- 
Galen Charlton
Infrastructure and Added Services Manager
Equinox Software, Inc. / The Open Source Experts
email:  g...@esilibrary.com
direct: +1 770-709-5581
cell:   +1 404-984-4366
skype:  gmcharlt
web:http://www.esilibrary.com/
Supporting Koha and Evergreen: http://koha-community.org 
http://evergreen-ils.org


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread LeVan,Ralph
There are probably a couple of answers to that.

XML rules define what character set is used. The encoding attribute on
the <?xml?> header is where you find out what character set is being
used.

I've always gone under the assumption that if an encoding wasn't
specified, then UTF-8 is in effect and that has always worked for me.
It turns out the standard says US-ASCII is the default encoding.

But, ignoring the encoding, the original MarcXML rules were the same as
the MARC-21 rules for character repertoire and you were supposed to
restrict yourself to characters that could be mapped back into MARC-8.
I don't know if that rule is still in force, but everyone ignores it.

I hope that helps!

Ralph

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Jonathan Rochkind
Sent: Tuesday, April 17, 2012 12:35 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: MarcXML and char encodings

I know how char encodings work in MARC ISO binary -- the encoding can 
legally be either Marc8 or UTF8 (nothing else).  The encoding of a 
record is specified in its header. In the wild, specified encodings are
frequently wrong, or data includes weird mixed encodings. Okay!

But what's going on with MarcXML?  What are the legal encodings for 
MarcXML?  Only Marc8 and UTF8, or anything that can be expressed in 
XML?  The MARC header is (or can) be present in MarcXML -- trust the 
MARC header, or trust the XML doctype char encoding?

What's the legal thing  to do? What's actually found 'in the wild' with 
MarcXML?

Can anyone advise?

Jonathan


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Kyle Banerjee
 What's the legal thing  to do? What's actually found 'in the wild' with
 MarcXML?


In some cases, invalid XML.

In an ideal world, the encoding should be included in the declaration. But
I wouldn't trust it.

kyle


-- 
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@uoregon.edu / 503.999.9787


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind
So what if the <?xml?> declaration says one charset encoding, but the 
MARC header included in the MarcXML says a different encoding... which 
one is the 'legal' one to believe?


Is it legal to have MarcXML that is not UTF-8 _or_ Marc8, that is an 
entirely different charset that is legal in XML?  If you did that, what 
should the MARC header included in the XML say?


I know how char encodings work in XML.  I don't understand what the 
standards say about how that interacts with the MARC data in MarcXML.


Jonathan

On 4/17/2012 1:51 PM, LeVan,Ralph wrote:

There are probably a couple of answers to that.

XML rules define what character set is used. The encoding attribute on
the <?xml?> header is where you find out what character set is being
used.

I've always gone under the assumption that if an encoding wasn't
specified, then UTF-8 is in effect and that has always worked for me.
It turns out the standard says US-ASCII is the default encoding.

But, ignoring the encoding, the original MarcXML rules were the same as
the MARC-21 rules for character repertoire and you were supposed to
restrict yourself to characters that could be mapped back into MARC-8.
I don't know if that rule is still in force, but everyone ignores it.

I hope that helps!

Ralph

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Jonathan Rochkind
Sent: Tuesday, April 17, 2012 12:35 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: MarcXML and char encodings

I know how char encodings work in MARC ISO binary -- the encoding can
legally be either Marc8 or UTF8 (nothing else).  The encoding of a
record is specified in its header. In the wild, specified encodings are
frequently wrong, or data includes weird mixed encodings. Okay!

But what's going on with MarcXML?  What are the legal encodings for
MarcXML?  Only Marc8 and UTF8, or anything that can be expressed in
XML?  The MARC header is (or can) be present in MarcXML -- trust the
MARC header, or trust the XML doctype char encoding?

What's the legal thing  to do? What's actually found 'in the wild' with
MarcXML?

Can anyone advise?

Jonathan



Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind

On 4/17/2012 1:57 PM, Kyle Banerjee wrote:

In some cases, invalid XML. In an ideal world, the encoding should be 
included in the declaration. But I wouldn't trust it. kyle 


So would you use the Marc header payload instead?

Or you're just saying you wouldn't trust _any_ encoding declarations you 
find anywhere?


When writing a library to handle marc, I think the baseline should be 
making it do the official legal standards-compliant right thing.  Extra 
heuristics to deal with invalid data can be added on top.


But my trouble here is I can't even figure out what the official legal 
standards-compliant thing is.


Maybe that's because the MarcXML standard simply doesn't address it, and 
it's all implementation dependent. sigh.


The problem is how the XML document's own char encoding is supposed to 
interact with the MARC header; especially because there's no way to put 
Marc8 in an XML char encoding doctype (is there?);  and whether 
encodings other than Marc8 or UTF8 are legal in MarcXML, even though 
they aren't in MARC ISO binary.


I think the answer might be nobody knows, and there is no standard 
right way to do it. Which is unfortunate.


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind

Okay, maybe here's another way to approach the question.

If I want to have a MarcXML document encoded in Marc8 -- what should it 
look like?  What should be in the XML declaration? What should be in the 
MARC header embedded in the XML?  Or is it not in fact legal at all?


If I want to have a MarcXML document encoded in UTF8, what should it 
look like? What should be in the XML declaration? What should be in the 
MARC header embedded in the XML?


If I want to have a MarcXML document with a char encoding that is 
_neither_ Marc8 nor UTF8, but something else generally legal for XML -- 
is this legal at all? And if so, what should it look like? What should 
be in the XML declaration? What should be in the MARC header embedded in 
the XML?


On 4/17/2012 1:57 PM, Kyle Banerjee wrote:

What's the legal thing  to do? What's actually found 'in the wild' with
MarcXML?


In some cases, invalid XML.

In an ideal world, the encoding should be included in the declaration. But
I wouldn't trust it.

kyle




Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread LeVan,Ralph
 If I want to have a MarcXML document encoded in Marc8 -- what should it
 look like?  What should be in the XML declaration? What should be in the
 MARC header embedded in the XML?  Or is it not in fact legal at all?

I'm going out on a limb here, but I don't think it is legal.  There is
no formal encoding that corresponds to MARC-8, so there's no way to tell
XML tools how to interpret the bytes.


 If I want to have a MarcXML document encoded in UTF8, what should it
 look like? What should be in the XML declaration? What should be in the
 MARC header embedded in the XML?

<?xml encoding="UTF-8"?>

I suppose you'll want to set the leader to UTF-8 as well, but it doesn't
really matter to any XML tools.


 If I want to have a MarcXML document with a char encoding that is
 _neither_ Marc8 nor UTF8, but something else generally legal for XML --
 is this legal at all? And if so, what should it look like? What should
 be in the XML declaration? What should be in the MARC header embedded in
 the XML?

I'd claim this is legal, if it is legal XML.  Set your encoding to
anything that is valid.

As a Java programmer, using java XML tools, the encoding is just a hint
to the tools.  I end up with Unicode strings after the XML is read.  So
I always ignore the encoding byte in the leader.

Following that logic, that byte is about encoding.  It has meaning when
ISO 2709 is the transfer mechanism.  But, in this case, XML is the
transfer mechanism and its rules for identifying the encoding are what
matter.  I'm proposing that the encoding byte in the leader is
meaningless.

Ralph


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Doran, Michael D
Hi Ralph,

 But, ignoring the encoding, the original MarcXML rules were the same as
 the MARC-21 rules for character repertoire and you were supposed to
 restrict yourself to characters that could be mapped back into MARC-8.
 I don't know if that rule is still in force, but everyone ignores it.

That rule no longer applies per the December 2007 revision of the MARC 21 
Specifications:

To facilitate the movement of records between MARC-8 
and Unicode environments, it was recommended for an 
initial period that the use of Unicode be restricted 
to a repertoire identical in extent to the MARC-8 
repertoire. [...] however, such a restriction is no 
longer appropriate. The full UCS repertoire, as currently 
defined at the Unicode web site, is valid for encoding 
MARC 21 records subject only to the constraints described 
[in the current MARC 21 Specifications].

-- from MARC 21 Specifications (revised December 2007) [1]

-- Michael

[1] http://www.loc.gov/marc/specifications/speccharucs.html

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 LeVan,Ralph
 Sent: Tuesday, April 17, 2012 12:51 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] MarcXML and char encodings
 
 There are probably a couple of answers to that.
 
 XML rules define what character set is used. The encoding attribute on
 the <?xml?> header is where you find out what character set is being
 used.
 
 I've always gone under the assumption that if an encoding wasn't
 specified, then UTF-8 is in effect and that has always worked for me.
 It turns out the standard says US-ASCII is the default encoding.
 
 But, ignoring the encoding, the original MarcXML rules were the same as
 the MARC-21 rules for character repertoire and you were supposed to
 restrict yourself to characters that could be mapped back into MARC-8.
 I don't know if that rule is still in force, but everyone ignores it.
 
 I hope that helps!
 
 Ralph
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Jonathan Rochkind
 Sent: Tuesday, April 17, 2012 12:35 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: MarcXML and char encodings
 
 I know how char encodings work in MARC ISO binary -- the encoding can
 legally be either Marc8 or UTF8 (nothing else).  The encoding of a
 record is specified in its header. In the wild, specified encodings are
 frequently wrong, or data includes weird mixed encodings. Okay!
 
 But what's going on with MarcXML?  What are the legal encodings for
 MarcXML?  Only Marc8 and UTF8, or anything that can be expressed in
 XML?  The MARC header is (or can) be present in MarcXML -- trust the
 MARC header, or trust the XML doctype char encoding?
 
 What's the legal thing  to do? What's actually found 'in the wild' with
 MarcXML?
 
 Can anyone advise?
 
 Jonathan


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind

Thanks, this is helpful feedback at least.

I think it's completely irrelevant, when determining what is legal under 
standards, to talk about what certain Java tools happen to do though, I 
don't care too much what some tool you happen to use does.


In this case, I'm _writing_ the tools. I want to make them do 'the right 
thing', with some mix of what's actually official legally correct and 
what's practically useful.  What your Java tools do is more or less 
irrelevant to me. I certainly _could_ make my tool respect the Marc 
leader encoded in MarcXML over the XML declaration if I wanted to. I 
could even make it assume the data is Marc8 in XML, even though there's 
no XML charset type for it, if the leader says it's Marc8.


But do others agree that there is in fact no legal way to have Marc8 in 
MarcXML?


Do others agree that you can use non-UTF8 encodings in MarcXML, so long 
as they are legal XML?


I won't even ask someone to cite standards documents, because it's 
pretty clear that LC forgot to consider this when establishing MarcXML.  
(And I have no faith that one could get LC to make a call on this and 
publish it any time this century).


Has anyone seen any Marc8-encoded MarcXML in the wild? Is it common? How 
is it represented with regard to the XML leader and the Marc header?


Has anyone seen any MarcXML with char encodings that are neither Marc8 
nor UTF8 in the wild? Are they common? How are they represented with 
regard to XML leader and Marc header?


On 4/17/2012 2:32 PM, LeVan,Ralph wrote:

If I want to have a MarcXML document encoded in Marc8 -- what should it
look like?  What should be in the XML declaration? What should be in the
MARC header embedded in the XML?  Or is it not in fact legal at all?

I'm going out on a limb here, but I don't think it is legal.  There is
no formal encoding that corresponds to MARC-8, so there's no way to tell
XML tools how to interpret the bytes.



If I want to have a MarcXML document encoded in UTF8, what should it
look like? What should be in the XML declaration? What should be in the
MARC header embedded in the XML?

<?xml encoding="UTF-8"?>

I suppose you'll want to set the leader to UTF-8 as well, but it doesn't
really matter to any XML tools.



If I want to have a MarcXML document with a char encoding that is
_neither_ Marc8 nor UTF8, but something else generally legal for XML


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Sheila M. Morrissey
Re: But do others agree that there is in fact no legal way to have Marc8 in 
MarcXML?

No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in 
the XML prolog, and you will want to be aware that XML processors are only 
REQUIRED to process UTF-8 and UTF-16 -- in practice many (including Java-based 
ones) can handle other encodings -- but you will have to make sure whatever XML 
processor you use, in whatever language it is written, has a handy-dandy MARC8 
coder/decoder ring.

Sheila

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
Jonathan Rochkind
Sent: Tuesday, April 17, 2012 2:46 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MarcXML and char encodings

Thanks, this is helpful feedback at least.

I think it's completely irrelevant, when determining what is legal under 
standards, to talk about what certain Java tools happen to do though, I 
don't care too much what some tool you happen to use does.

In this case, I'm _writing_ the tools. I want to make them do 'the right 
thing', with some mix of what's actually official legally correct and 
what's practically useful.  What your Java tools do is more or less 
irrelevant to me. I certainly _could_ make my tool respect the Marc 
leader encoded in MarcXML over the XML decleration if I wanted to. I 
could even make it assume the data is Marc8 in XML, even though there's 
no XML charset type for it, if the leader says it's Marc8.

But do others agree that there is in fact no legal way to have Marc8 in 
MarcXML?

Do others agree that you can use non-UTF8 encodings in MarcXML, so long 
as they are legal XML?

I won't even ask someone to cite standards documents, because it's 
pretty clear that LC forgot to consider this when establishing MarcXML.  
(And I have no faith that one could get LC to make a call on this and 
publish it any time this century).

Has anyone seen any Marc8-encoded MarcXML in the wild? Is it common? How 
is it represented with regard to the XML leader and the Marc header?

Has anyone seen any MarcXML with char encodings that are neither Marc8 
nor UTF8 in the wild? Are they common? How are they represented with 
regard to XML leader and Marc header?

On 4/17/2012 2:32 PM, LeVan,Ralph wrote:
 If I want to have a MarcXML document encoded in Marc8 -- what should
 it
 look like?  What should be in the XML decleration? What should be in
 the
 MARC header embedded in the XML?  Or is it not in fact legal at all?
 I'm going out on a limb here, but I don't think it is legal.  There is
 no formal encoding that corresponds to MARC-8, so there's no way to tell
 XML tools how to interpret the bytes.


 If I want to have a MarcXML document encoded in UTF8, what should it
 look like? What should be in the XML decleration? What should be in
 the
 MARC header embedded in the XML?
 <?xml encoding="UTF-8"?>

 I suppose you'll want to set the leader to UTF-8 as well, but it doesn't
 really matter to any XML tools.


 If I want to have a MarcXML document with a char encoding that is
 _neither_ Marc8 nor UTF8, but something else generally legal for XML


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Houghton,Andrew
 Jonathan Rochkind
 Sent: Tuesday, April 17, 2012 14:18
 Subject: Re: [CODE4LIB] MarcXML and char encodings
 
 Okay, maybe here's another way to approach the question.
 
 If I want to have a MarcXML document encoded in Marc8 -- what should it
 look like?  What should be in the XML decleration? What should be in
 the
 MARC header embedded in the XML?  Or is it not in fact legal at all?
 
 If I want to have a MarcXML document encoded in UTF8, what should it
 look like? What should be in the XML decleration? What should be in the
 MARC header embedded in the XML?
 
 If I want to have a MarcXML document with a char encoding that is
 _neither_ Marc8 nor UTF8, but something else generally legal for XML --
 is this legal at all? And if so, what should it look like? What should
 be in the XML decleration? What should be in the MARC header embedded
 in
 the XML?

You cannot have a MARC-XML document encoded in MARC-8, well sort of, but it's 
not standard. To answer your questions you have to refer to a variety of 
standards:

http://www.w3.org/TR/2008/REC-xml-20081126/#NT-EncodingDecl
In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", 
and "ISO-10646-UCS-4" should be used for the various encodings and 
transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", 
"ISO-8859-2", ... "ISO-8859-n" (where n is the part number) should be used 
for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and 
"EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. It is 
recommended that character encodings registered (as charsets) with the Internet 
Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be 
referred to using their registered names; other encodings should use names 
starting with an "x-" prefix. XML processors should match character encoding 
names in a case-insensitive way and should either interpret an IANA-registered 
name as the encoding registered at IANA for that name or treat it as unknown 
(processors are, of course, not required to support all IANA-registered 
encodings).

In the absence of information provided by an external transport protocol (e.g. 
HTTP or MIME), it is a fatal error for an entity including an encoding 
declaration to be presented to the XML processor in an encoding other than that 
named in the declaration, or for an entity which begins with neither a Byte 
Order Mark nor an encoding declaration to use an encoding other than UTF-8. 
Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not 
strictly need an encoding declaration.


1) The above says that <?xml version="1.0"?> means the same as <?xml 
version="1.0" encoding="utf-8"?>, and if you prefer you can omit the XML 
declaration and that is assumed to be UTF-8 unless there is a BOM (Byte Order 
Mark) which determines UTF-8 vs UTF-16BE vs UTF-16LE.

2) If you really wanted to encode the XML in MARC-8 you need to specify an "x-" 
name, since if you refer to http://www.iana.org/assignments/character-sets MARC-8 
isn't a registered character set, hence cannot be specified in the encoding 
attribute unless the name was prefixed with "x-". Which implies that no 
standard XML library will know how to convert the MARC-8 characters into 
Unicode so the XML DOM can be used. So unless you want to write your own MARC-8 
=> Unicode conversion routines and integrate them into your preferred XML library 
it isn't going to work out of the box for anyone else but yourself.

When dealing with MARC-XML you should ignore the values in LDR/00-04, LDR/10, 
LDR/11, LDR/12-16, LDR/20-23. If you look at the MARC-XML schema you will note 
that the definition for leaderDataType specifies LDR/00-04 "[\d ]{5}", LDR/10 
and LDR/11 "(2| )", LDR/12-16 "[\d ]{5}", LDR/20-23 "(4500| )". Note the 
MARC-XML schema allows spaces in those positions because they are not relevant 
in the XML format, though very relevant in the binary format.

You probably should ignore LDR/09 since most MARC to MARC-XML converters do not 
change this value to 'a' although many converters do change the value when 
converting MARC binary between MARC-8 and UTF-8. The only valid character set 
for MARC-XML is Unicode and it *should* be encoded in UTF-8 in Unicode 
normalization form D (NFD) although most XML libraries will not know the 
difference if it was encoded as UTF-16BE or UTF-16LE in Unicode normalization 
form D since the XML libraries internally work with Unicode.

I could have sworn that this information was specified on LC's site at one 
point in time, but I'm having trouble finding the documentation.


Hope this helps, Andy.


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Kyle Banerjee

 So would you use the Marc header payload instead?


 Or you're just saying you wouldn't trust _any_ encoding declerations you
 find anywhere?


This.

The short version is that too many vendors and systems just supply some
value without making sure that's what they're spitting out. I haven't had
to mess with this stuff for a few years, so I'm hoping Terry Reese weighs
in on this conversation -- he has a lot of experience dealing with encoding
headaches. However, the bottom line is that the most reliable method is to
use heuristics to detect what's going on. Yeah, that totally kills the
point of listing encodings in the first place, but just as is the case with any
unreliably used data point, it's all GIGO.
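
One crude heuristic of that sort: ignore what the record claims and test
whether the bytes actually decode as UTF-8 before deciding how to treat
them. A rough sketch (the file name and the fallback label are just
placeholders):

  # Sketch only: guess the encoding from the bytes, not the metadata.
  def guess_encoding(raw: bytes) -> str:
      try:
          raw.decode("utf-8")
          return "utf-8"                        # at least valid UTF-8
      except UnicodeDecodeError:
          # Not valid UTF-8: likely MARC-8 or some other legacy 8-bit
          # encoding, so hand it to a MARC-8-aware converter or flag it.
          return "unknown-legacy"

  with open("record.mrc", "rb") as fh:
      print(guess_encoding(fh.read()))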

When writing a library to handle marc, I think the base line should be
 making it do the official legal standards-complaint right thing.  Extra
 heuristics to deal with invalid data can be added on top.


I'm hoping things have improved, but if heuristics are more reliable than
reading the right areas of the record, you have to ignore what's there
(which makes even reading it pointless). I do think there is value in
encouraging vendors to actually pay attention to this stuff as such basic
screwups undermine both the credibility of the data source and the
service that depends on the data.


 But my trouble here is I can't even figure out what the official legal
 standards-compliant thing is.

 Maybe that's becuase the MarcXML standard simply doesn't address it, and
 it's all implementation dependent. sigh.

 The problem is how the XML documents own char encoding is supposed to
 interact with the MARC header; especially because there's no way to put
 Marc8 in an XML char encoding doctype (is there?);  and whether encodings
 other than Marc8 or UTF8 are legal in MarcXML, even though they aren't in
 MARC ISO binary.

 I think the answer might be nobody knows, and there is no standard right
 way to do it. Which is unfortunate.


A good summary of the situation as I understand it.

kyle

-- 
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@uoregon.edu / 503.999.9787


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Karen Coyle
The discussions at the MARC standards group relating to Unicode all had 
to do with using Unicode *within* ISO2709. I can't find any evidence 
that MARCXML ever went through the standards process. (This may not be a 
bad thing.) So none of what we know about the MARBI discussions and 
resulting standards can really help us here, except perhaps by analogy.


In LC's own example on the MARCXML page (the Sandburg example) the 
Leader is copied without change from the ISO2709/MARC-8 record to the 
MARCXML/Unicode record -- in other words, it still has a blank in offset 
09, which means MARC-8. (The XML record is UTF-8.) My gut feeling is 
that the Leader in MARCXML should be treated like the human appendix -- 
something that once had a use, but is now just being carried along for 
historical reasons. I would not expect it to reflect the XML record 
within which it is embedded. Unfortunately, it is the only source of 
some key information, like type of record. The more I think about it, 
the more MARCXML strikes me as a really messed-up format.


kc



On 4/17/12 11:46 AM, Jonathan Rochkind wrote:

Thanks, this is helpful feedback at least.

I think it's completely irrelevant, when determining what is legal under
standards, to talk about what certain Java tools happen to do though, I
don't care too much what some tool you happen to use does.

In this case, I'm _writing_ the tools. I want to make them do 'the right
thing', with some mix of what's actually official legally correct and
what's practically useful. What your Java tools do is more or less
irrelevant to me. I certainly _could_ make my tool respect the Marc
leader encoded in MarcXML over the XML decleration if I wanted to. I
could even make it assume the data is Marc8 in XML, even though there's
no XML charset type for it, if the leader says it's Marc8.

But do others agree that there is in fact no legal way to have Marc8 in
MarcXML?

Do others agree that you can use non-UTF8 encodings in MarcXML, so long
as they are legal XML?

I won't even ask someone to cite standards documents, because it's
pretty clear that LC forgot to consider this when establishing MarcXML.
(And I have no faith that one could get LC to make a call on this and
publish it any time this century).

Has anyone seen any Marc8-encoded MarcXML in the wild? Is it common? How
is it represented with regard to the XML leader and the Marc header?

Has anyone seen any MarcXML with char encodings that are neither Marc8
nor UTF8 in the wild? Are they common? How are they represented with
regard to XML leader and Marc header?

On 4/17/2012 2:32 PM, LeVan,Ralph wrote:

If I want to have a MarcXML document encoded in Marc8 -- what should it
look like? What should be in the XML declaration? What should be in the
MARC header embedded in the XML? Or is it not in fact legal at all?

I'm going out on a limb here, but I don't think it is legal. There is
no formal encoding that corresponds to MARC-8, so there's no way to tell
XML tools how to interpret the bytes.



If I want to have a MarcXML document encoded in UTF8, what should it
look like? What should be in the XML declaration? What should be in the
MARC header embedded in the XML?

<?xml encoding="UTF-8"?>

I suppose you'll want to set the leader to UTF-8 as well, but it doesn't
really matter to any XML tools.



If I want to have a MarcXML document with a char encoding that is
_neither_ Marc8 nor UTF8, but something else generally legal for XML


--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Houghton,Andrew
 Karen Coyle
 Sent: Tuesday, April 17, 2012 15:41
 Subject: Re: [CODE4LIB] MarcXML and char encodings
 
 The discussions at the MARC standards group relating to Unicode all had
 to do with using Unicode *within* ISO2709. I can't find any evidence
 that MARCXML ever went through the standards process. (This may not be
 a
 bad thing.) So none of what we know about the MARBI discussions and
 resulting standards can really help us here, except perhaps by analogy.

Well I can confirm that the MARCXML didn't go through MARBI since I was
one of OCLC's representatives who solidified MARCXML. MARCXML came out
of a meeting at LC between the MARC Standards office, OCLC, RLG, and 
one or two other interested parties whom I cannot remember or find in
my emails or notes about the meeting.


Andy.


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Decasm
Let me make some recommendations. These are what I would consider best 
practices for interoperability.

1) Never put marc8 in xml. Just don't do it. No one expects it. Few will be 
willing to bother with it.

2) Always prefer utf8 for marcxml. You can use any standard charset if you need 
 to, but without special circumstances, use utf8

3) Ignore leader/09 in marcxml. Only consider the prolog (consider, not trust).
If you reasonably can, fail when the charset is wrong.
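
A sketch of what recommendation 3 might look like in practice, using
only the standard library (the file name is a placeholder):

  # Sketch only: look at the prolog, never leader/09, and fail on
  # anything that is not a Unicode encoding.
  import re

  ALLOWED = {"utf-8", "utf-16", "utf-16le", "utf-16be"}

  with open("records.xml", "rb") as fh:
      prolog = fh.read(200).decode("ascii", errors="ignore").replace("\x00", "")

  match = re.search(r'encoding\s*=\s*["\']([^"\']+)["\']', prolog)
  declared = match.group(1).lower() if match else "utf-8"   # UTF-8 is the XML default

  if declared not in ALLOWED:
      raise ValueError("refusing non-Unicode MARCXML encoding: " + declared)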

/dev

Sent via the Samsung Galaxy S™ II Skyrocket™, an ATT 4G LTE smartphone.

 Original message 
Subject: Re: [CODE4LIB] MarcXML and char encodings 
From: Jonathan Rochkind rochk...@jhu.edu 
To: CODE4LIB@LISTSERV.ND.EDU 
CC:  

Thanks, this is helpful feedback at least.

I think it's completely irrelevant, when determining what is legal under 
standards, to talk about what certain Java tools happen to do though, I 
don't care too much what some tool you happen to use does.

In this case, I'm _writing_ the tools. I want to make them do 'the right 
thing', with some mix of what's actually official legally correct and 
what's practically useful.  What your Java tools do is more or less 
irrelevant to me. I certainly _could_ make my tool respect the Marc 
leader encoded in MarcXML over the XML decleration if I wanted to. I 
could even make it assume the data is Marc8 in XML, even though there's 
no XML charset type for it, if the leader says it's Marc8.

But do others agree that there is in fact no legal way to have Marc8 in 
MarcXML?

Do others agree that you can use non-UTF8 encodings in MarcXML, so long 
as they are legal XML?

I won't even ask someone to cite standards documents, because it's 
pretty clear that LC forgot to consider this when establishing MarcXML.  
(And I have no faith that one could get LC to make a call on this and 
publish it any time this century).

Has anyone seen any Marc8-encoded MarcXML in the wild? Is it common? How 
is it represented with regard to the XML leader and the Marc header?

Has anyone seen any MarcXML with char encodings that are neither Marc8 
nor UTF8 in the wild? Are they common? How are they represented with 
regard to XML leader and Marc header?

On 4/17/2012 2:32 PM, LeVan,Ralph wrote:
 If I want to have a MarcXML document encoded in Marc8 -- what should
 it
 look like?  What should be in the XML decleration? What should be in
 the
 MARC header embedded in the XML?  Or is it not in fact legal at all?
 I'm going out on a limb here, but I don't think it is legal.  There is
 no formal encoding that corresponds to MARC-8, so there's no way to tell
 XML tools how to interpret the bytes.


 If I want to have a MarcXML document encoded in UTF8, what should it
 look like? What should be in the XML decleration? What should be in
 the
 MARC header embedded in the XML?
 <?xml encoding="UTF-8"?>

 I suppose you'll want to set the leader to UTF-8 as well, but it doesn't
 really matter to any XML tools.


 If I want to have a MarcXML document with a char encoding that is
 _neither_ Marc8 nor UTF8, but something else generally legal for XML


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind

On 4/17/2012 3:01 PM, Sheila M. Morrissey wrote:

No -- it is perfectly legal - -but you MUST declare the encoding to BE Marc8 in 
the XML prolog,


Wait, how can you declare a Marc8 encoding in an XML 
declaration/prolog/whatever it's called?


The things that appear there need to be from a specific list, and I 
didn't think Marc8 was on that list?


Can you give me an example?  And, if you happen to have it, link to XML 
standard that says this is legal?


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread LeVan,Ralph
 No -- it is perfectly legal - -but you MUST declare the encoding to
BE Marc8 in the XML prolog,

 Wait, how canyou declare a Marc8 encoding in an XML 
 decleration/prolog/whatever it's called?

Nope, you can't do that.  There is no approved name for the MARC-8
encoding.  As Andy said, the closest you could get would be to make up
an experimental name, like x-marc-8, but no tool in the world would
recognize that.

Ralph


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Sheila M. Morrissey
In XML standard:

It is RECOMMENDED that character encodings registered (as charsets) 
with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those 
just listed, be referred to using their registered names; other encodings 
SHOULD use names starting with an "x-" prefix. XML processors SHOULD match 
character encoding names in a case-insensitive way and SHOULD either 
interpret an IANA-registered name as the encoding registered at IANA for that 
name or treat it as unknown (processors are, of course, not required to support 
all IANA-registered encodings).


As I suggested -- since MARC8 isn't (so far as I know) registered -- you won't 
get far with most standard tools, in whatever language -- you'll have to extend 
them to first recognize the encoding name, and second, decode the content.

smm

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Tuesday, April 17, 2012 4:19 PM
To: Code for Libraries
Cc: Sheila M. Morrissey
Subject: Re: [CODE4LIB] MarcXML and char encodings



On 4/17/2012 3:01 PM, Sheila M. Morrissey wrote:
 No -- it is perfectly legal - -but you MUST declare the encoding to BE Marc8 
 in the XML prolog,

Wait, how canyou declare a Marc8 encoding in an XML 
decleration/prolog/whatever it's called?

The things that appear there need to be from a specific list, and I 
didn't think Marc8 was on that list?

Can you give me an example?  And, if you happen to have it, link to XML 
standard that says this is legal?


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Eric Lease Morgan
MARC-8. Cool in its time. Dumb now. Typical. --ELM


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Sheila M. Morrissey
I think this is a case of being in violent agreement -- see some earlier 
replies in this thread -- 
Pragmatically, if you are going to hew to marc-8 encoding transported in XML -- 
you are losing the usefulness of standard tools for xml --
smm

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
LeVan,Ralph
Sent: Tuesday, April 17, 2012 4:21 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MarcXML and char encodings

 No -- it is perfectly legal - -but you MUST declare the encoding to
BE Marc8 in the XML prolog,

 Wait, how canyou declare a Marc8 encoding in an XML 
 decleration/prolog/whatever it's called?

Nope, you can't do that.  There is no approved name for the MARC-8
encoding.  As Andy said, the closest you could get would be to make up
an experimental name, like x-marc-8, but no tool in the world would
recognize that.

Ralph


Re: [CODE4LIB] MARCXML to MODS: 590 Field

2011-05-19 Thread Jon Stroop

I'm going to guess that it's because 59x fields are defined for local use:

http://www.loc.gov/marc/bibliographic/bd59x.html

...but someone from LC should be able to confirm.
-Jon

--
Jon Stroop
Metadata Analyst
Firestone Library
Princeton University
Princeton, NJ 08544

Email: jstr...@princeton.edu
Phone: (609)258-0059
Fax: (609)258-0441

http://pudl.princeton.edu
http://diglib.princeton.edu
http://diglib.princeton.edu/ead
http://www.cpanda.org/cpanda



On 05/19/2011 11:45 AM, Richard, Joel M wrote:

Dear hive-mind,

Does anyone know why the Library of Congress-supplied MARCXML to MODS XSLT [1] 
does not handle the MARC 590 Local Notes field? It seems to handle everything 
else, not that I've done an exhaustive search... :)

Granted, I could copy/create my own XSLT and add this functionality in myself, 
but I'm curious as to whether or not there's some logic behind this decision to 
not include it. Logic that I would not naturally understand since I'm not 
formally trained as a librarian.

Thanks!
--Joel

[1] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3-4.xsl


Joel Richard
IT Specialist, Web Services Department
Smithsonian Institution Libraries | http://www.sil.si.edu/
(202) 633-1706 | richar...@si.edu


Re: [CODE4LIB] MARCXML to MODS: 590 Field

2011-05-19 Thread Richard, Joel M
Thanks, Karen and Jon!

That's what I suspected, but I couldn't find anything on the web about the 
thought process behind ignoring the 590 altogether. We'll likely end up using a 
local version of the XSLT to map it to mods:note as you suggested. We simply 
don't want this information to be lost in our MODS record as we, for example, 
embed it inside a METS document.

--Joel


On May 19, 2011, at 12:34 PM, Karen Miller wrote:

 Joel,
 
 The 590 is indeed defined for local use, so whatever your local institution
 uses it for should guide your mapping to MODS. There are some examples of
 what it's used for on the OCLC Bibliographic Formats and Standards pages:
 
 http://www.oclc.org/bibformats/en/5xx/590.shtm
 
 Frequently it's used as a note that is specific to a local copy of an item.
 If your institution uses it inconsistently, you might want to just map it to
 mods:note.
 
 Karen
 
 Karen D. Miller
 Monographic/Digital Projects Cataloger
 Bibliographic Services Dept.
 Northwestern University Library
 Evanston, IL 
 k-mill...@northwestern.edu
 847-467-3462
 
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jon
 Stroop
 Sent: Thursday, May 19, 2011 11:07 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] MARCXML to MODS: 590 Field
 
 I'm going to guess that it's because 59x fields are defined for local use:
 
 http://www.loc.gov/marc/bibliographic/bd59x.html
 
 ...but someone from LC should be able to confirm.
 -Jon
 
 -- 
 Jon Stroop
 Metadata Analyst
 Firestone Library
 Princeton University
 Princeton, NJ 08544
 
 Email: jstr...@princeton.edu
 Phone: (609)258-0059
 Fax: (609)258-0441
 
 http://pudl.princeton.edu
 http://diglib.princeton.edu
 http://diglib.princeton.edu/ead
 http://www.cpanda.org/cpanda
 
 
 
 On 05/19/2011 11:45 AM, Richard, Joel M wrote:
 Dear hive-mind,
 
 Does anyone know why the Library of Congress-supplied MARCXML to MODS XSLT
 [1] does not handle the MARC 590 Local Notes field? It seems to handle
 everything else, not that I've done an exhaustive search... :)
 
 Granted, I could copy/create my own XSLT and add this functionality in
 myself, but I'm curious as to whether or not there's some logic behind this
 decision to not include it. Logic that I would not naturally understand
 since I'm not formally trained as a librarian.
 
 Thanks!
 --Joel
 
 [1] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3-4.xsl
 
 
 Joel Richard
 IT Specialist, Web Services Department
 Smithsonian Institution Libraries | http://www.sil.si.edu/
 (202) 633-1706 | richar...@si.edu


Re: [CODE4LIB] MARCXML - What is it for?

2010-11-19 Thread Smith,Devon
I saw a similar speedup when I switched from an OO approach to a more
functional style.

Using MARC::Record, it was taking a lot longer to run some data than I
wanted. I rewrote my script, with ad-hoc functional code. And though I
can't give a real rate increase, because I never bothered to wait for
the OO version to finish, I can say that it went from hours to minutes.
I didn't compare to the filter capacity of MARC::File::USMARC, though.
Maybe that would have been fast enough for my needs. Ultimately, though,
I was just dumping these fields into a file, and didn't need any objects
for that.

The speed increase I saw was made possible by the directory. I wouldn't
have even been able to try that with the XML version of the data.

/dev

-- 
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Nate Vack
Sent: Friday, November 19, 2010 12:34 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARCXML - What is it for?

On Mon, Oct 25, 2010 at 2:22 PM, Eric Hellman e...@hellman.net wrote:
 I think you'd have a very hard time demonstrating any speed advantage
to MARC over MARCXML.

Not to bring up this old topic again, but I'm just finishing up a
conversion from "parse this text structure" to "blit this binary data
structure into memory". Both written in Python.

The text parsing is indeed fast -- tens of milliseconds to parse 100k
or so of data on my laptop.

The binary code, though, is literally 1,000 times faster -- tens of
*microseconds* to read the same data. (And in this application, yeah,
it'll matter.)

Blitting is much, much, much faster than lexing and parsing, or even
running a regexp over the data.

Cheers,
-Nate


Re: [CODE4LIB] marcxml

2010-11-15 Thread Dan Field

On 11 Nov 2010, at 14:47, Galen Charlton wrote:


Hi,

On Thu, Nov 11, 2010 at 6:26 AM, J.D.Gravestock
j.d.gravest...@open.ac.uk wrote:
I'd be interested to know if anyone is using a good marcxml to marc  
converter (other than marcedit, i.e. non windows).  I've tried the  
perl module marc::xml but having a few problems with the conversion  
which I can't replicate in marcedit. Are there any that I've missed?


As far as Perl modules are concerned, MARC::XML is a bit long in the
tooth.  MARC::File::XML used in conjunction with MARC::Record may give
you better results.



Or File_MARC on PEAR if you prefer PHP.

--
Dan Field d...@llgc.org.uk   Ffôn/Tel. +44 1970 632 582
Peiriannydd Meddalwedd  Senior Software Engineer
Llyfrgell Genedlaethol Cymru   National Library of Wales


Re: [CODE4LIB] marcxml

2010-11-12 Thread Benjamin Anderson
The XC team wrote (and uses) the oaitoolkit (
http://code.google.com/p/xcoaitoolkit/) for this.  We've run our entire
collection (5.8M records) through it.

-Ben

On Thu, Nov 11, 2010 at 11:41 AM, Reese, Terry
terry.re...@oregonstate.edu wrote:

 Yes -- that's right.  There is a zip file with install instructions for any
 non-windows based system for which a MONO port is present.

 --TR

  -Original Message-
  From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
  Joel Marchesoni
  Sent: Thursday, November 11, 2010 8:40 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] marcxml
 
  There actually is a version of MARCEdit for Linux now. I think
  (although I can't remember and can't find it on the site) that it
  relies on Mono.
 
  MARCEdit download page:
  http://people.oregonstate.edu/~reeset/marcedit/html/downloads.html
 
  Joel
 
  -Original Message-
  From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
  J.D.Gravestock
  Sent: Thursday, November 11, 2010 6:26 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: [CODE4LIB] marcxml
 
  I'd be interested to know if anyone is using a good marcxml to marc
  converter (other than marcedit, i.e. non windows).  I've tried the perl
  module marc::xml but having a few problems with the conversion which I
  can't replicate in marcedit. Are there any that I've missed?
 
 
  Jill
 
  **
  Jill Gravestock
  Open University Library
  Milton Keynes
 
 
 
 
  --
  The Open University is incorporated by Royal Charter (RC 000391), an
  exempt charity in England  Wales and a charity registered in Scotland
  (SC 038302).
 
 
  --



Re: [CODE4LIB] marcxml

2010-11-11 Thread Eric Lease Morgan
On Nov 11, 2010, at 6:26 AM, J.D.Gravestock wrote:

 I'd be interested to know if anyone is using a good marcxml to marc converter 
 (other than marcedit, i.e. non windows).

If I understand your question correctly, then try Index Data's yaz-marcdump 
application which is a component of Yaz. [1] Once compiled and installed you do 
something like this from the command line:

  yaz-marcdump -i marcxml -o marc file.xml > file.marc

[1] Yaz - http://www.indexdata.com/yaz/

BTW, say hello to James N. for me.

-- 
Eric Lease Morgan
University of Notre Dame


Re: [CODE4LIB] marcxml

2010-11-11 Thread Joel Marchesoni
There actually is a version of MARCEdit for Linux now. I think (although I 
can't remember and can't find it on the site) that it relies on Mono.

MARCEdit download page: 
http://people.oregonstate.edu/~reeset/marcedit/html/downloads.html

Joel

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of 
J.D.Gravestock
Sent: Thursday, November 11, 2010 6:26 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] marcxml

I'd be interested to know if anyone is using a good marcxml to marc converter 
(other than marcedit, i.e. non windows).  I've tried the perl module marc::xml 
but having a few problems with the conversion which I can't replicate in 
marcedit. Are there any that I've missed?


Jill

**
Jill Gravestock
Open University Library
Milton Keynes




-- 
The Open University is incorporated by Royal Charter (RC 000391), an exempt 
charity in England  Wales and a charity registered in Scotland (SC 038302).


--


Re: [CODE4LIB] marcxml

2010-11-11 Thread Reese, Terry
Yes -- that's right.  There is a zip file with install instructions for any 
non-windows based system for which a MONO port is present.

--TR

 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Joel Marchesoni
 Sent: Thursday, November 11, 2010 8:40 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] marcxml
 
 There actually is a version of MARCEdit for Linux now. I think
 (although I can't remember and can't find it on the site) that it
 relies on Mono.
 
 MARCEdit download page:
 http://people.oregonstate.edu/~reeset/marcedit/html/downloads.html
 
 Joel
 
 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 J.D.Gravestock
 Sent: Thursday, November 11, 2010 6:26 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] marcxml
 
 I'd be interested to know if anyone is using a good marcxml to marc
 converter (other than marcedit, i.e. non windows).  I've tried the perl
 module marc::xml but having a few problems with the conversion which I
 can't replicate in marcedit. Are there any that I've missed?
 
 
 Jill
 
 **
 Jill Gravestock
 Open University Library
 Milton Keynes
 
 
 
 
 --
 The Open University is incorporated by Royal Charter (RC 000391), an
 exempt charity in England  Wales and a charity registered in Scotland
 (SC 038302).
 
 
 --


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-28 Thread Cory Rockliff
 I've only just had a chance to catch up on this thread. I'm not 
offended in the least by Turbomarc (anything round-trippable should 
serve just as well as an internal representation of MARC, right?), but I 
am a little puzzled--what are the 'special cases' alluded to in the blog 
post? When would there ever be a non-alphanumeric attribute value in 
MARCXML? Is this a non-MARC21 thing?


C

On 10/25/10 3:35 PM, MJ Suhonos wrote:

I'll just leave this here:

http://www.indexdata.com/blog/2010/05/turbomarc-faster-xml-marc-records

That trade-off ought to offend both camps, though I happen to think it's quite 
clever.

MJ

On 2010-10-25, at 3:22 PM, Eric Hellman wrote:


I think you'd have a very hard time demonstrating any speed advantage to MARC 
over MARCXML. XML parsers have been speed optimized out the wazoo; If there 
exists a MARC parser that has ever been speed-optimized without serious 
compromise, I'm sure someone on this list will have a good story about it.

On Oct 25, 2010, at 3:05 PM, Patrick Hochstenbach wrote:


Dear Nate,

There is a trade-off: do you want very fast processing of data -> go for binary 
data. Do you want to share your data globally easily in many (not per se library 
related) environments -> go for XML/RDF.
Open your data and do both :-)

Pat

Sent from my iPhone

On 25 Oct 2010, at 20:39, Nate Vack njv...@wisc.edu wrote:


Hi all,

I've just spent the last couple of weeks delving into and decoding a
binary file format. This, in turn, got me thinking about MARCXML.

In a nutshell, it looks like it's supposed to contain the exact same
data as a normal MARC record, except in XML form. As in, it should be
round-trippable.

What's the advantage to this? I can see using a human-readable format
for poorly-documented file formats -- they're relatively easy to read
and understand. But MARC is well, well-documented, with more than one
free implementation in cursory searching. And once you know a binary
file's format, it's no harder to parse than XML, and the data's
smaller and processing faster.

So... why the XML?

Curious,
-Nate

Eric Hellman
President, Gluejar, Inc.
41 Watchung Plaza, #132
Montclair, NJ 07042
USA

e...@hellman.net
http://go-to-hellman.blogspot.com/
@gluejar

---
[This E-mail scanned for viruses by Declude Virus]






--
Cory Rockliff
Technical Services Librarian
Bard Graduate Center: Decorative Arts, Design History, Material Culture
18 West 86th Street
New York, NY 10024
T: (212) 501-3037
rockl...@bgc.bard.edu

BGC Exhibitions:
In the Main Gallery:
January 26, 2011– April 17, 2011
Cloisonné: Chinese Enamels from the Yuan, Ming, and Qing Dynasties
Organized in collaboration with the Musée des arts Décoratifs, Paris.
In the Focus Gallery:
January 26, 2011– April 17, 2011
Objects of Exchange: Social and Material Transformation on the 
Late-Nineteenth-Century Northwest Coast
Organized in collaboration with the American Museum of Natural History

---
[This E-mail scanned for viruses by Declude Virus]


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-28 Thread MJ Suhonos
Let me openly state that I've never used Turbomarc.  I believe the special 
case they are referring to is the subfield code with a value of η, which is 
non-alphanumeric.  I don't know enough about MARC to even begin guessing what 
this means or why it might occur (or not).

The use case I see for Turbomarc is when you:

1- have a need for high performance
2- are converting binary MARC to XML
3- are writing your own XSLT to manipulate that XML (since it's not MARCXML)

The first comment claims a 30-40% increase in XML parsing, which seems obvious 
when you compare the number of characters in the example provided: 277 vs. 419, 
or about 34% fewer going through the parser.

But, really, look at that XML (if it can even be called that).  Turbomarc 
somehow manages to make MARC even more inscrutable.  But hey, it's fast.

MJ

On 2010-10-28, at 11:35 AM, Cory Rockliff wrote:

 I've only just had a chance to catch up on this thread. I'm not offended in 
 the least by Turbomarc (anything round-trippable should serve just as well as 
 an internal representation of MARC, right?), but I am a little puzzled--what 
 are the 'special cases' alluded to in the blog post? When would there ever be 
 a non-alphanumeric attribute value in MARCXML? Is this a non-MARC21 thing?
 
 C
 
 On 10/25/10 3:35 PM, MJ Suhonos wrote:
 I'll just leave this here:
 
 http://www.indexdata.com/blog/2010/05/turbomarc-faster-xml-marc-records
 
 That trade-off ought to offend both camps, though I happen to think it's 
 quite clever.
 
 MJ
 
 On 2010-10-25, at 3:22 PM, Eric Hellman wrote:
 
 I think you'd have a very hard time demonstrating any speed advantage to 
 MARC over MARCXML. XML parsers have been speed optimized out the wazoo; If 
 there exists a MARC parser that has ever been speed-optimized without 
 serious compromise, I'm sure someone on this list will have a good story 
 about it.
 
 On Oct 25, 2010, at 3:05 PM, Patrick Hochstenbach wrote:
 
 Dear Nate,
 
 There is a trade-off: do you want very fast processing of data -  go for 
 binary data. do you want to share your data globally easily in many (not 
 per se library related) environments -  go for XML/RDF.
 Open your data and do both :-)
 
 Pat
 
 Sent from my iPhone
 
 On 25 Oct 2010, at 20:39, Nate Vacknjv...@wisc.edu  wrote:
 
 Hi all,
 
 I've just spent the last couple of weeks delving into and decoding a
 binary file format. This, in turn, got me thinking about MARCXML.
 
 In a nutshell, it looks like it's supposed to contain the exact same
 data as a normal MARC record, except in XML form. As in, it should be
 round-trippable.
 
 What's the advantage to this? I can see using a human-readable format
 for poorly-documented file formats -- they're relatively easy to read
 and understand. But MARC is well, well-documented, with more than one
 free implementation in cursory searching. And once you know a binary
 file's format, it's no harder to parse than XML, and the data's
 smaller and processing faster.
 
 So... why the XML?
 
 Curious,
 -Nate
 Eric Hellman
 President, Gluejar, Inc.
 41 Watchung Plaza, #132
 Montclair, NJ 07042
 USA
 
 e...@hellman.net
 http://go-to-hellman.blogspot.com/
 @gluejar
 
 
 
 
 
 -- 
 Cory Rockliff
 Technical Services Librarian
 Bard Graduate Center: Decorative Arts, Design History, Material Culture
 18 West 86th Street
 New York, NY 10024
 T: (212) 501-3037
 rockl...@bgc.bard.edu
 
 BGC Exhibitions:
 In the Main Gallery:
 January 26, 2011– April 17, 2011
 Cloisonné: Chinese Enamels from the Yuan, Ming, and Qing Dynasties
 Organized in collaboration with the Musée des arts Décoratifs, Paris.
 In the Focus Gallery:
 January 26, 2011– April 17, 2011
 Objects of Exchange: Social and Material Transformation on the 
 Late-Nineteenth-Century Northwest Coast
 Organized in collaboration with the American Museum of Natural History
 


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-28 Thread Mike Taylor
On 28 October 2010 17:37, MJ Suhonos m...@suhonos.ca wrote:
 Let me openly state that I've never used Turbomarc.  I believe the special 
 case they are referring to is the subfield code with a value of η, which 
 is non-alphanumeric.  I don't know enough about MARC to even begin guessing 
 what this means or why it might occur (or not).

 The use case I see for Turbomarc is when you:

 1- have a need for high performance
 2- are converting binary MARC to XML
 3- are writing your own XSLT to manipulate that XML (since it's not MARCXML)

 The first comment claims a 30-40% increase in XML parsing, which seems 
 obvious when you compare the number of characters in the example provided: 
 277 vs. 419, or about 34% fewer going through the parser.

The speedup can be much greater than that -- from the blog post
itself: "Using xsltproc --timing showed that our transformations were
faster by a factor of 4-5. Shortening the element names only improved
performance fractionally, but since everything counts, we decided to
do this as well."  xsltproc uses the highly optimised LibXML/LibXSLT
stack, which I guess maybe doesn't have so much constant-time overhead
as the PHP simplexml parser that yielded the smaller speedup.


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-28 Thread MJ Suhonos
 The first comment claims a 30-40% increase in XML parsing, which seems 
 obvious when you compare the number of characters in the example provided: 
 277 vs. 419, or about 34% fewer going through the parser.
 
 The speedup can be much greater than that -- from the blog post
 itself, Using xsltproc --timing showed that our transformations were
 faster by a factor of 4-5. Shortening the element names only improved
 performance fractionally, but since everything counts, we decided to
 do this as well.  xsltproc uses the highly optimised LibXML/LibXSLT
 stack, which I guess maybe doesn't have so much constant-time overhead
 as the PHP simplexml parser that yielded the smaller speedup.


Sure, but XML parsing (libxml) and XSLT (libxslt) transforming are very 
different operations.  I would expect parsing to scale linearly with the 
byte-length of XML being parsed.  XSLT, on the other hand, is presumably much 
more dependent on the complexity of the XSL being applied (depth/structure of 
XML, number of template matches, complexity of XPath statements, etc.)

So I'd expect a series of XSLT transforms to have a much more variable change 
in performance than just parsing.  As I say, if you're writing custom XSL 
anyway, then certainly having a more compact syntax is going to yield better 
performance.

I'm sure to those for whom Turbomarc is useful, it's *very* useful, but it 
definitely seems to be nearing the limit of the readability-performance 
balance.  ;-)

Also, standing w00t:

Indexdata++

MJ


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-27 Thread Walker, David
 I've been involved in several projects lambasted 
 because managers think MARCXML is solving 
 some imaginary problem

It seems to me that this is really the heart of your argument.  You had this 
experience, and now are projecting the opinions of these managers onto lots of 
people in the library world.

I've worked in libraries for nearly a decade, and have never met anyone 
(manager or otherwise) who held the belief that XML in general, or MARC-XML in 
particular, somehow magically solves all metadata problems.  

I guess our two experiences cancel each other out, then.  And, ultimately, none 
of that has anything to do with MARC-XML itself. 

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Alexander 
Johannesen [alexander.johanne...@gmail.com]
Sent: Monday, October 25, 2010 7:10 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARCXML - What is it for?

On Tue, Oct 26, 2010 at 12:48 PM, Bill Dueber b...@dueber.com wrote:
 Here, I think you're guilty of radically underestimating lots of people
 around the library world. No one thinks MARC is a good solution to
 our modern problems, and no one who actually knows what MARC
 is has trouble understanding MARC-XML as an XML serialization of
 the same old data -- certainly not anyone capable of meaningful
 contribution to work on an alternative.

Slow down, Tex. Lots of people in the library world is not the same
as developers, or even good developers, or even good XML developers,
or even good XML developers who knows what the document model imposes
to a data-centric approach.

 The problem we're dealing with is *hard*. Mind-numbingly hard.

This is no justification for not doing things better. (And I'd love to
know what the hard bits are; always interesting to hear from various
people as to what they think are the *real* problems of library
problems, as opposed to any other problem they have)

 The library world has several generations of infrastructure built
 around MARC (by which I mean AACR2), and devising data
 structures and standards that are a big enough improvement over
  MARC to warrant replacing all that infrastructure is an engineering
  and political nightmare.

Political? For sure. Engineering? Not so much. This is just that whole
blinded by MARC issue that keeps cropping up from time to time, and
rightly so; it is truly a beast - at least the way we have come to
know it through AACR2 and all its friends and its death-defying focus
on all things bibliographic - that has paralyzed library innovation,
probably to the point of making libraries almost irrelevant to the
world.

 I'm happy to take potshots at the RDA stuff from the sidelines, but I never
 forget that I'm on the sidelines, and that the people active in the game are
 among the best and brightest we have to offer, working on a problem that
  invariably seems more intractable the deeper in you go.

Well, that's a pretty scary sentence, for all sorts of reasons, but I
think I shall not go there.

 If you think MARC-XML is some sort of an actual problem

What, because you don't agree with me the problem doesn't exist? :)

 and that people
 just need to be shouted at to realize that and do something about it, then,
 well, I think you're just plain wrong.

Fair enough, although you seem to be under the assumption that all of
the stuff I'm saying is a figment of my imagination (I've been
involved in several projects lambasted because managers think MARCXML
is solving some imaginary problem; this is not bullshit, but pain and
suffering from the battlefields of library development), that I'm not
one of those developers (or one of you, although judging from this
discussion it's clear that I am not), that the things I say somehow
don't apply because you don't agree with, umm, what I'm assuming is
my somewhat direct approach to stating my heretic opinions.


Alex
--
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
--- http://shelter.nu/blog/ --
-- http://www.google.com/profiles/alexander.johannesen ---


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-27 Thread Smith,Devon
 One way is to first transform the MARC into MARC-XML.  Then you can
use XSLT to crosswalk the MARC-XML
 into that other schema.  Very handy.

 Your criticisms of MARC-XML all seem to presume that MARC-XML is the
goal, the end point in the process.
 But MARC-XML is really better seen as a utility, a middle step between
binary MARC and the real goal,
 which is some other useful and interesting XML schema.

Unless "useful and interesting" is a euphemism for Dublin Core,
using XSLT for crosswalking is not really an option. Well, not a good
option. On the other end of the spectrum, assume Onix for "useful and
interesting" and XSLT simply won't work.

Crosswalking doesn't hold water as a justification for MARCXML.

/dev
-- 
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm




-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Walker, David
Sent: Monday, October 25, 2010 8:57 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARCXML - What is it for?

 b) expanding it to be actually useful and interesting.

But here I think you've missed the very utility of MARC-XML.

Let's say you have a binary MARC file (the kind that comes out of an
ILS) and want to transform that into MODS, Dublin Core, or maybe some
other XML schema.  

How would you do that?  

One way is to first transform the MARC into MARC-XML.  Then you can use
XSLT to crosswalk the MARC-XML into that other schema.  Very handy.

Your criticisms of MARC-XML all seem to presume that MARC-XML is the
goal, the end point in the process.  But MARC-XML is really better seen
as a utility, a middle step between binary MARC and the real goal, which
is some other useful and interesting XML schema.

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of
Alexander Johannesen [alexander.johanne...@gmail.com]
Sent: Monday, October 25, 2010 12:38 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARCXML - What is it for?

Hiya,

On Tue, Oct 26, 2010 at 6:26 AM, Nate Vack njv...@wisc.edu wrote:
 Switching to an XML format doesn't help with that at all.

I'm willing to take it further and say that MARCXML was the worst
thing the library world ever did. Some might argue it was a good first
step, and that it was better with something rather than nothing, to
which I respond ;

Poppycock!

MARCXML is nothing short of evil. Not only does it go against every
principle of good XML anywhere (don't rely on whitespace, structure
over code, namespace conventions, identity management, document
control, separation of entities and properties, and on and on), it
breaks the ontological commitment that a better treatment of the MARC
data could bring, deterring people from actually a) using the darn
thing as anything but a bare minimal crutch, and b) expanding it to be
actually useful and interesting.

The quicker the library world can get rid of this monstrosity, the
better, although I doubt that will ever happen; it will hang around
like a foul stench for as long as there is MARC in the world. A long
time. A long sad time.

A few extra notes;
   http://shelterit.blogspot.com/2008/09/marcxml-beast-of-burden.html

Can you tell I'm not a fan? :)


Kind regards,

Alex
--
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic
Maps
--- http://shelter.nu/blog/
--
-- http://www.google.com/profiles/alexander.johannesen
---


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-27 Thread MJ Suhonos
  "But it looks just like the old thing using [insert data scheme] and some 
 templates?"
 
  "Ah yes, but now we're doing it in XML!"

I think this applies to 90% of instances where XML was adopted, especially 
within the enterprise IT industry.  Through marketing or misunderstanding, 
XML was presumed to be the magic fairy dust that would solve countless 
problems simply by switching to it.  The library world is certainly not 
unique in this respect.

Returning to the original question, what is MARCXML for, I think there have 
been some very clear examples of where it can be useful to some people, 
sometimes.  If it works for you, use it.  If not, don't.

To wit, I propose:

Some people, when confronted with a problem, think "I know, I'll use MARCXML." 
Now they have three problems: MARC, XML, and the one they started with.

Moving on.

MJ


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-27 Thread Ross Singer
Alex,

I think the problem is data like this:

http://lccn.loc.gov/96516389/marcxml

And while we can probably figure out a pattern to get the semantics
out of this record, there is no telling how many other variations exist
within our collections.

So we've got lots of this data that is both hard to parse and,
frankly, hard to find (since it has practically zero machine readable
data in fields we actually use) and it needs to coexist with some
newer, semantically richer format.

What I'm saying is that the library's legacy data problem is almost to
the point of being existential.  This is certainly a detriment to
forward progress.

Analogously (although at a much smaller scale), my wife and I have
been trying for about 2 years to move our checking account from our
out of state bank to something local.  The problem is that we have
built up a lot of infrastructure around our old bank (direct deposit
and lots of automatic bill pay, etc.):  migration would not only be
time consuming, any mistakes made could potentially be quite expensive
and we have a lot of uncertainty of how long it would actually take to
migrate (and how that might affect the flow of payments, etc.).  It's
been, to date, easier for us just to drive across the state line
(despite the fact that it's way out of our way to anywhere) rather
than actually deal with it.  In the meantime, more direct bill pay
things have been set up and whatnot making our eventual migration that
much more difficult.

I do think it would be useful to figure out what exactly in our legacy
data is found only in libraries (that is, we could ditch this shoddy
"The Last Waltz" record and pull the data from LinkedMDB or Freebase
or somewhere) and determine the scale of the problem that only we can
address, but even just this environmental scan is a fairly large
undertaking.

-Ross.

On Mon, Oct 25, 2010 at 10:10 PM, Alexander Johannesen
alexander.johanne...@gmail.com wrote:
 On Tue, Oct 26, 2010 at 12:48 PM, Bill Dueber b...@dueber.com wrote:
 Here, I think you're guilty of radically underestimating lots of people
 around the library world. No one thinks MARC is a good solution to
 our modern problems, and no one who actually knows what MARC
 is has trouble understanding MARC-XML as an XML serialization of
 the same old data -- certainly not anyone capable of meaningful
 contribution to work on an alternative.

 Slow down, Tex. Lots of people in the library world is not the same
 as developers, or even good developers, or even good XML developers,
 or even good XML developers who knows what the document model imposes
 to a data-centric approach.

 The problem we're dealing with is *hard*. Mind-numbingly hard.

 This is no justification for not doing things better. (And I'd love to
 know what the hard bits are; always interesting to hear from various
 people as to what they think are the *real* problems of library
 problems, as opposed to any other problem they have)

 The library world has several generations of infrastructure built
 around MARC (by which I mean AACR2), and devising data
 structures and standards that are a big enough improvement over
  MARC to warrant replacing all that infrastructure is an engineering
  and political nightmare.

 Political? For sure. Engineering? Not so much. This is just that whole
 blinded by MARC issue that keeps cropping up from time to time, and
 rightly so; it is truly a beast - at least the way we have come to
 know it through AACR2 and all its friends and its death-defying focus
 on all things bibliographic - that has paralyzed library innovation,
 probably to the point of making libraries almost irrelevant to the
 world.

 I'm happy to take potshots at the RDA stuff from the sidelines, but I never
 forget that I'm on the sidelines, and that the people active in the game are
 among the best and brightest we have to offer, working on a problem that
  invariably seems more intractable the deeper in you go.

 Well, that's a pretty scary sentence, for all sorts of reasons, but I
 think I shall not go there.

 If you think MARC-XML is some sort of an actual problem

 What, because you don't agree with me the problem doesn't exist? :)

 and that people
 just need to be shouted at to realize that and do something about it, then,
 well, I think you're just plain wrong.

 Fair enough, although you seem to be under the assumption that all of
 the stuff I'm saying is a figment of my imagination (I've been
 involved in several projects lambasted because managers think MARCXML
 is solving some imaginary problem; this is not bullshit, but pain and
 suffering from the battlefields of library development), that I'm not
 one of those developers (or one of you, although judging from this
 discussion it's clear that I am not), that the things I say somehow
 don't apply because you don't agree with, umm, what I'm assuming is
 my somewhat direct approach to stating my heretic opinions.


 Alex
 --
  Project Wrangler, SOA, Information Alchemist, UX, 

Re: [CODE4LIB] MARCXML - What is it for?

2010-10-27 Thread Kyle Banerjee

 This is no justification for not doing things better. (And I'd love to
 know what the hard bits are; always interesting to hear from various
 people as to what they think are the *real* problems of library
 problems, as opposed to any other problem they have)


The problem is you have to deal with legacy systems and data. That's as real
as it gets.

That this is somehow a shortcoming peculiar to the library community is
nonsense. Just changing the way dates were stored so that Y2K wasn't a big
deal caused total chaos in the business world for years and required many
billions of dollars worth of development. We still use 4 digit numeric PINs
to access bank accounts. If I created some crummy website that used that
level of protection, people would rightly call me an idiot.

Eliminating MARC and basing systems on a completely different data structure
would have a far more far-reaching impact on system design than twiddling with a
couple of date digits or allowing something more secure than 4 digits to
protect access to thousands of dollars. So as crappy as our systems are, I
don't buy that we're so much worse than everyone else out there.

There is always the issue of developing the new standard in the first place,
convincing all the vendors to adopt it, and retrofitting the systems to work
with it. Problems are easiest to solve when it's someone else's job to make
it happen.

kyle


-- 
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@uoregon.edu / 503.877.9773


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-27 Thread Alexander Johannesen
Hi,

On Tue, Oct 26, 2010 at 1:23 PM, Bill Dueber b...@dueber.com wrote:
 Sorry. That was rude, and uncalled for. I disagree that the problem is
 easily solved, even without the politics. There've been lots of attempts to
 try to come up with a sufficiently expressive toolset for dealing with
 biblio data, and we're still working on it. If you do think you've got some
 insight, I'm sure we're all ears, but try to frame it terms of the existing
 work if you can (RDA, some of the dublin core stuff, etc.) so we have a
 frame of reference.

Well, I've whined enough both here and on NGC4LIB, and I'm kinda over
it, just like I'm sure most people are over my whining. But suffice it
to say that FRBR is a 15-year-old model that has still not been
proven in the Real World[TM] in any meaningful way (the prototypes
work fine until you dig a bit) and probably never will as long as
MARC21 runs the show, and trying to stick RDA on top, with rules that
have use-cases old enough to be my kids, well, I'm not
very positive about that either.

The direction of going ontological is a good one, and in the absence of
anything else, RDF-infused FRBR / RDA is probably the way to go
(except I'd ditch RDA and, uh, perhaps even FRBR, or at least
seriously modify it), but the community is decidedly not talking about
ontological interoperability nor extensions nor the semantics involved
to solve actual problems in the bibliographic world (including the
fact that it is inherently bibliographic). There needs to be much more
involvement by library geeks and managers in defining semantic reuse
and extensibility, to properly define those things that are almost
absent from the AACR2 and friends; the relationships between entities
themselves. In other words, you need to get away from the
record-centered view, and embrace the subject-centric view.

Anyway, enough from this old grumpy bum. Sorry to stir up the dust.


Regards,

Alex
-- 
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
--- http://shelter.nu/blog/ --
-- http://www.google.com/profiles/alexander.johannesen ---


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-27 Thread Toke Eskildsen
On Tue, 2010-10-26 at 03:32 +0200, Alexander Johannesen wrote:
Here's our new thing. And we did it by simply converting all our
 MARC into MARCXML that runs on a cron job every midnight, and a bit of
 horrendous XSLT that's impossible to maintain.

I am in the development department of our library. We're a diverse bunch
of guys, ranging from the bottom (that's me, hacking Lucene) to the top
(our graphics guy). Somewhere in the middle we have 2 librarians. They
do not program in traditional languages, but have been trained to
produce XSLTs, and it actually works! They are capable of translating
their vast knowledge of the myriad of standards we encounter into code
that transforms our XML-input into something we can use for indexing.

"Aha!", you counter, "why not train them to use X instead, since X is
much better at transforming normal MARC?"  The answer is that MARC isn't
the only format they need to handle. We currently have 20+ different
sources that they need to transform. All of them except one are XML. The
one is ISO 2709 MARC, which we - naturally - transform into MARCXML so
that it can be processed the same way as the rest. There might be better
tools than XSLT for transformation of XML that we could use, but the
XML-part is so ubiquitous at this point in time that it is the obvious
choice for common ground.

MARC is just one in many. It might be the most evil and unruly beast of
the bunch, but we tame it with the same tools as the rest.


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-27 Thread Boheemen, Peter van
I think:

1. Marc must die.  It has lived long enough. 
2. But everybody uses Marc (which is in fact good), too many people are keeping 
it alive.
3. MARC in XML does not solve the problem, but it makes the suffering so much 
less painful 


Peter


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-27 Thread Alexander Johannesen
 Political? For sure. Engineering? Not so much.

 Ok. Solve it. Let us know when you're done.

Wow, lamest reply so far. Surely you could muster a tad bit better? I
was excited about getting a list of the hardest problems, for example,
I'd love to see that. And then perhaps you could explain what this
insurmountably hard, mind-boggling problem actually is, because, you
know, you never actually said.


Alex
-- 
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
--- http://shelter.nu/blog/ --
-- http://www.google.com/profiles/alexander.johannesen ---


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-27 Thread Andrew Cunningham
I'd suspect that MARCXML isn't going anywhere fast, a shame perhaps.

The key difference between MARCXML and MARC is that MARCXML inherits
XML's internationalisation features.

It is an aspect at which MARC is very poor.

Andrew

-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-27 Thread Richard, Joel M
On Oct 25, 2010, at 10:31 PM, Alexander Johannesen wrote:

 Political? For sure. Engineering? Not so much.
 
 Ok. Solve it. Let us know when you're done.
 
 Wow, lamest reply so far. Surely you could muster a tad bit better? I
 was excited about getting a list of the hardest problems, for example,
 I'd love to see that. And then perhaps you could explain what this
 insurmountably hard, mind-boggling problem actually is, because, you
 know, you never actually said.


Now, now, boys. Don't make us turn this mailing list around and go right back 
home. Because we will. And you'll go to bed without dinner!

Seriously, though, I've been following this thread closely since I'm new to the 
library world and the petty bickering undermines both of your points and 
distracts from an otherwise intellectual and enlightening discussion.

--Joel

Joel Richard
IT Specialist, Web Services Department
Smithsonian Institution Libraries | http://www.sil.si.edu/
(202) 633-1706 | (202) 786-2861 (f) | richar...@si.edu


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-27 Thread Walker, David
 Crosswalking doesn't hold water as a justification for MARCXML.

To be fair, though, most of us have simpler crosswalking needs than OCLC.  

And if I need to go from binary MARC to some XML schema (which I sometimes do), 
then MARC-XML and the XSLT style sheets at LOC seem like a pretty good starting 
point to me.  Better than starting from scratch.
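
For example, once the binary MARC has been converted to MARCXML, the crosswalk 
step itself can be a few lines. Here is a minimal sketch, assuming Python with 
lxml installed and assuming you have saved one of the LOC stylesheets locally 
(both file names below are illustrative, not the actual LOC file names):

from lxml import etree

# Parse a MARCXML collection and a MARCXML-to-MODS stylesheet
# (placeholder file names; use whatever you actually have on disk).
records = etree.parse('records.marcxml')
transform = etree.XSLT(etree.parse('MARC21slim2MODS.xsl'))

# Apply the crosswalk and write the resulting MODS document out.
mods = transform(records)
with open('records.mods.xml', 'wb') as out:
    out.write(etree.tostring(mods, pretty_print=True))

The same plumbing works for Dublin Core or any other target you happen to have 
a stylesheet for.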

Which isn't to say that that approach is always the right one for every 
project. I very much agree with MJ: If it works for you, use it.  If not, don't.

But if someone else has a better, general purpose solution to this problem, 
then by all means open source that puppy and let the rest of us have at it!

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Smith,Devon 
[smit...@oclc.org]
Sent: Tuesday, October 26, 2010 7:44 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARCXML - What is it for?

 One way is to first transform the MARC into MARC-XML.  Then you can
use XSLT to crosswalk the MARC-XML
 into that other schema.  Very handy.

 Your criticisms of MARC-XML all seem to presume that MARC-XML is the
goal, the end point in the process.
 But MARC-XML is really better seen as a utility, a middle step between
binary MARC and the real goal,
 which is some other useful and interesting XML schema.

Unless "useful and interesting" is a euphemism for Dublin Core,
using XSLT for crosswalking is not really an option. Well, not a good
option. On the other end of the spectrum, assume Onix for "useful and
interesting" and XSLT simply won't work.

Crosswalking doesn't hold water as a justification for MARCXML.

/dev
--
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm




-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Walker, David
Sent: Monday, October 25, 2010 8:57 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARCXML - What is it for?

 b) expanding it to be actually useful and interesting.

But here I think you've missed the very utility of MARC-XML.

Let's say you have a binary MARC file (the kind that comes out of an
ILS) and want to transform that into MODS, Dublin Core, or maybe some
other XML schema.

How would you do that?

One way is to first transform the MARC into MARC-XML.  Then you can use
XSLT to crosswalk the MARC-XML into that other schema.  Very handy.

Your criticisms of MARC-XML all seem to presume that MARC-XML is the
goal, the end point in the process.  But MARC-XML is really better seen
as a utility, a middle step between binary MARC and the real goal, which
is some other useful and interesting XML schema.

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of
Alexander Johannesen [alexander.johanne...@gmail.com]
Sent: Monday, October 25, 2010 12:38 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARCXML - What is it for?

Hiya,

On Tue, Oct 26, 2010 at 6:26 AM, Nate Vack njv...@wisc.edu wrote:
 Switching to an XML format doesn't help with that at all.

I'm willing to take it further and say that MARCXML was the worst
thing the library world ever did. Some might argue it was a good first
step, and that it was better with something rather than nothing, to
which I respond ;

Poppycock!

MARCXML is nothing short of evil. Not only does it go against every
principle of good XML anywhere (don't rely on whitespace, structure
over code, namespace conventions, identity management, document
control, separation of entities and properties, and on and on), it
breaks the ontological commitment that a better treatment of the MARC
data could bring, deterring people from actually a) using the darn
thing as anything but a bare minimal crutch, and b) expanding it to be
actually useful and interesting.

The quicker the library world can get rid of this monstrosity, the
better, although I doubt that will ever happen; it will hang around
like a foul stench for as long as there is MARC in the world. A long
time. A long sad time.

A few extra notes;
   http://shelterit.blogspot.com/2008/09/marcxml-beast-of-burden.html

Can you tell I'm not a fan? :)


Kind regards,

Alex
--
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic
Maps
--- http://shelter.nu/blog/
--
-- http://www.google.com/profiles/alexander.johannesen
---


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Tim Spalding
MARC records break parsing far too frequently. Apart from requiring no
truly specialized tools, MARCXML should—should!—eliminate many of
those problems. That's not to mention that MARC character sets vary a
lot (DanMARC anyone?), and vary even more in practice than in theory.

From my perspective the problem is simply that MARCXML isn't as
ubiquitous as MARC. For what we do, at least, there's no point. We'd
need to parse non-XML MARC data anyway. So if we're going to do it, we
might as well do it for everything.

Best,
Tim

On Mon, Oct 25, 2010 at 2:38 PM, Nate Vack njv...@wisc.edu wrote:
 Hi all,

 I've just spent the last couple of weeks delving into and decoding a
 binary file format. This, in turn, got me thinking about MARCXML.

 In a nutshell, it looks like it's supposed to contain the exact same
 data as a normal MARC record, except in XML form. As in, it should be
 round-trippable.

 What's the advantage to this? I can see using a human-readable format
 for poorly-documented file formats -- they're relatively easy to read
 and understand. But MARC is well, well-documented, with more than one
 free implementation in cursory searching. And once you know a binary
 file's format, it's no harder to parse than XML, and the data's
 smaller and processing faster.

 So... why the XML?

 Curious,
 -Nate




-- 
Check out my library at http://www.librarything.com/profile/timspalding


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Andrew Hankinson
I'm not a big user of MARCXML, but I can think of a few reasons off the top of 
my head:

- Existing libraries for reading, manipulating and searching XML-based 
documents are very mature.
- Documents can be validated for their well-formedness using these existing 
tools and a pre-defined schema (a validator for MARC would need to be 
custom-coded)
- MARCXML can easily be incorporated into XML-based meta-metadata schemas, like 
METS.
- It can be parsed and manipulated in a web service context without sending a 
binary blob over the wire.
- XML is self-describing, binary is not.

There's nothing stopping you from reading the MARCXML into a binary blob and 
working on it from there. But when sharing documents from different 
institutions around the globe, using a wide variety of tools and techniques, 
XML seems to be the lowest common denominator.

-Andrew

On 2010-10-25, at 2:38 PM, Nate Vack wrote:

 Hi all,
 
 I've just spent the last couple of weeks delving into and decoding a
 binary file format. This, in turn, got me thinking about MARCXML.
 
 In a nutshell, it looks like it's supposed to contain the exact same
 data as a normal MARC record, except in XML form. As in, it should be
 round-trippable.
 
 What's the advantage to this? I can see using a human-readable format
 for poorly-documented file formats -- they're relatively easy to read
 and understand. But MARC is well, well-documented, with more than one
 free implementation in cursory searching. And once you know a binary
 file's format, it's no harder to parse than XML, and the data's
 smaller and processing faster.
 
 So... why the XML?
 
 Curious,
 -Nate


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Patrick Hochstenbach
Dear Nate,

There is a trade-off: do you want very fast processing of data? Go for binary 
data. Do you want to share your data globally and easily in many (not per se 
library-related) environments? Go for XML/RDF. 
Open your data and do both :-)

Pat

Sent from my iPhone

On 25 Oct 2010, at 20:39, Nate Vack njv...@wisc.edu wrote:

 Hi all,
 
 I've just spent the last couple of weeks delving into and decoding a
 binary file format. This, in turn, got me thinking about MARCXML.
 
 In a nutshell, it looks like it's supposed to contain the exact same
 data as a normal MARC record, except in XML form. As in, it should be
 round-trippable.
 
 What's the advantage to this? I can see using a human-readable format
 for poorly-documented file formats -- they're relatively easy to read
 and understand. But MARC is well, well-documented, with more than one
 free implementation in cursory searching. And once you know a binary
 file's format, it's no harder to parse than XML, and the data's
 smaller and processing faster.
 
 So... why the XML?
 
 Curious,
 -Nate


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread MJ Suhonos
It's helpful to think of MARCXML as a sort of lingua franca.

 - Existing libraries for reading, manipulating and searching XML-based 
 documents are very mature.
Including XSLT and XPath; very powerful stuff.

 There's nothing stopping you from reading the MARCXML into a binary blob and 
 working on it from there. But when sharing documents from different 
 institutions around the globe, using a wide variety of tools and techniques, 
 XML seems to be the lowest common denominator.

Assuming it's also round-trippable, MARC-in-JSON would accomplish this as well.

Not to mention it's nice to be able to read and edit MARC records in any 
(any!!) text editor for those of us who are comfortable looking at JSON or XML 
but can't handle staring at binary bytestreams without having an aneurysm.

MJ

 On 2010-10-25, at 2:38 PM, Nate Vack wrote:
 
 Hi all,
 
 I've just spent the last couple of weeks delving into and decoding a
 binary file format. This, in turn, got me thinking about MARCXML.
 
 In a nutshell, it looks like it's supposed to contain the exact same
 data as a normal MARC record, except in XML form. As in, it should be
 round-trippable.
 
 What's the advantage to this? I can see using a human-readable format
 for poorly-documented file formats -- they're relatively easy to read
 and understand. But MARC is well, well-documented, with more than one
 free implementation in cursory searching. And once you know a binary
 file's format, it's no harder to parse than XML, and the data's
 smaller and processing faster.
 
 So... why the XML?
 
 Curious,
 -Nate


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Tim Spalding
- XML is self-describing, binary is not.

Not to quibble, but that's only in a theoretical sense here. Something
like Amazon XML is truly self-describing. MARCXML is self-obfuscating.
At least MARC records kinda imitate catalog cards.
:)

Tim

On Mon, Oct 25, 2010 at 2:50 PM, Andrew Hankinson
andrew.hankin...@gmail.com wrote:
 I'm not a big user of MARCXML, but I can think of a few reasons off the top 
 of my head:

 - Existing libraries for reading, manipulating and searching XML-based 
 documents are very mature.
 - Documents can be validated for their well-formedness using these existing 
 tools and a pre-defined schema (a validator for MARC would need to be 
 custom-coded)
 - MARCXML can easily be incorporated into XML-based meta-metadata schemas, 
 like METS.
 - It can be parsed and manipulated in a web service context without sending a 
 binary blob over the wire.
 - XML is self-describing, binary is not.

 There's nothing stopping you from reading the MARCXML into a binary blob and 
 working on it from there. But when sharing documents from different 
 institutions around the globe, using a wide variety of tools and techniques, 
 XML seems to be the lowest common denominator.

 -Andrew

 On 2010-10-25, at 2:38 PM, Nate Vack wrote:

 Hi all,

 I've just spent the last couple of weeks delving into and decoding a
 binary file format. This, in turn, got me thinking about MARCXML.

 In a nutshell, it looks like it's supposed to contain the exact same
 data as a normal MARC record, except in XML form. As in, it should be
 round-trippable.

 What's the advantage to this? I can see using a human-readable format
 for poorly-documented file formats -- they're relatively easy to read
 and understand. But MARC is well, well-documented, with more than one
 free implementation in cursory searching. And once you know a binary
 file's format, it's no harder to parse than XML, and the data's
 smaller and processing faster.

 So... why the XML?

 Curious,
 -Nate




-- 
Check out my library at http://www.librarything.com/profile/timspalding


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Bryan Baldus
On  Monday, October 25, 2010 1:50 PM, Andrew Hankinson wrote:
- Documents can be validated for their well-formedness using these existing 
tools and a pre-defined schema (a validator for MARC would need to be 
custom-coded)

In Perl, MARC::Lint might be an example of such a validator (though I need to 
update it with the most recent MARC updates at some point soon). MarcEdit also 
includes a validator.

Bryan Baldus
bryan.bal...@quality-books.com
eij...@cpan.org
http://home.comcast.net/~eijabb/


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Eric Hellman
I think you'd have a very hard time demonstrating any speed advantage to MARC 
over MARCXML. XML parsers have been speed optimized out the wazoo; If there 
exists a MARC parser that has ever been speed-optimized without serious 
compromise, I'm sure someone on this list will have a good story about it.

On Oct 25, 2010, at 3:05 PM, Patrick Hochstenbach wrote:

 Dear Nate,
 
 There is a trade-off: do you want very fast processing of data - go for 
 binary data. do you want to share your data globally easily in many (not per 
 se library related) environments - go for XML/RDF. 
 Open your data and do both :-)
 
 Pat
 
 Sent from my iPhone
 
 On 25 Oct 2010, at 20:39, Nate Vack njv...@wisc.edu wrote:
 
 Hi all,
 
 I've just spent the last couple of weeks delving into and decoding a
 binary file format. This, in turn, got me thinking about MARCXML.
 
 In a nutshell, it looks like it's supposed to contain the exact same
 data as a normal MARC record, except in XML form. As in, it should be
 round-trippable.
 
 What's the advantage to this? I can see using a human-readable format
 for poorly-documented file formats -- they're relatively easy to read
 and understand. But MARC is well, well-documented, with more than one
 free implementation in cursory searching. And once you know a binary
 file's format, it's no harder to parse than XML, and the data's
 smaller and processing faster.
 
 So... why the XML?
 
 Curious,
 -Nate

Eric Hellman
President, Gluejar, Inc.
41 Watchung Plaza, #132
Montclair, NJ 07042
USA

e...@hellman.net 
http://go-to-hellman.blogspot.com/
@gluejar


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Nate Vack
On Mon, Oct 25, 2010 at 2:09 PM, Tim Spalding t...@librarything.com wrote:
 - XML is self-describing, binary is not.

 Not to quibble, but that's only in a theoretical sense here. Something
 like Amazon XML is truly self-describing. MARCXML is self-obfuscating.
 At least MARC records kinda imitate catalog cards.

Yeah -- this is kinda the source of my confusion. In the case of the
files I'm reading, it's not that it's hard to find out where the
nMeasurement field lives (it's six short ints starting at offset 64),
but what the field means, and whether or not I care about it.

Switching to an XML format doesn't help with that at all.

WRT character encoding issues and validation: if MARC and MARCXML are
round-trippable, a solution in one environment is equivalent to a
solution in the other.

And I think we've all seen plenty of unvalidated, badly-formed XML,
and plenty with Character Encoding Problems™ ;-)

Thanks for the input!
-Nate


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Alexander Johannesen
Hiya,

On Tue, Oct 26, 2010 at 6:26 AM, Nate Vack njv...@wisc.edu wrote:
 Switching to an XML format doesn't help with that at all.

I'm willing to take it further and say that MARCXML was the worst
thing the library world ever did. Some might argue it was a good first
step, and that it was better with something rather than nothing, to
which I respond ;

Poppycock!

MARCXML is nothing short of evil. Not only does it go against every
principle of good XML anywhere (don't rely on whitespace, structure
over code, namespace conventions, identity management, document
control, separation of entities and properties, and on and on), it
breaks the ontological commitment that a better treatment of the MARC
data could bring, deterring people from actually a) using the darn
thing as anything but a bare minimal crutch, and b) expanding it to be
actually useful and interesting.

The quicker the library world can get rid of this monstrosity, the
better, although I doubt that will ever happen; it will hang around
like a foul stench for as long as there is MARC in the world. A long
time. A long sad time.

A few extra notes;
   http://shelterit.blogspot.com/2008/09/marcxml-beast-of-burden.html

Can you tell I'm not a fan? :)


Kind regards,

Alex
-- 
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
--- http://shelter.nu/blog/ --
-- http://www.google.com/profiles/alexander.johannesen ---


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Andrew Hankinson
I guess what I meant is that in MARCXML, you have a datafield element with 
subsequent subfield elements each with fairly clear attributes, which, while 
not my idea of fun Sunday-afternoon reading, requires less specialized tools to 
parse (hello Textmate!) and is a bit easier than trying to count INT positions. 
One quick XPath query and you can have all 245 fields, regardless of their 
length or position in the record.
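
For instance, a rough sketch of that query, assuming Python with lxml installed 
(the file name is illustrative):

from lxml import etree

MARC_NS = {'marc': 'http://www.loc.gov/MARC21/slim'}

tree = etree.parse('records.marcxml')
# One XPath expression pulls every 245 field out of the whole collection,
# wherever it happens to sit in each record.
for datafield in tree.xpath('//marc:datafield[@tag="245"]', namespaces=MARC_NS):
    subfields = datafield.xpath('marc:subfield/text()', namespaces=MARC_NS)
    print(' '.join(subfields))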


On 2010-10-25, at 3:26 PM, Nate Vack wrote:

 On Mon, Oct 25, 2010 at 2:09 PM, Tim Spalding t...@librarything.com wrote:
 - XML is self-describing, binary is not.
 
 Not to quibble, but that's only in a theoretical sense here. Something
 like Amazon XML is truly self-describing. MARCXML is self-obfuscating.
 At least MARC records kinda imitate catalog cards.
 
 Yeah -- this is kinda the source of my confusion. In the case of the
 files I'm reading, it's not that it's hard to find out where the
 nMeasurement field lives (it's six short ints starting at offset 64),
 but what the field means, and whether or not I care about it.
 
 Switching to an XML format doesn't help with that at all.
 
 WRT character encoding issues and validation: if MARC and MARCXML are
 round-trippable, a solution in one environment is equivalent to a
 solution in the other.
 
 And I think we've all seen plenty of unvalidated, badly-formed XML,
  and plenty with Character Encoding Problems™ ;-)
 
 Thanks for the input!
 -Nate


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread MJ Suhonos
I'll just leave this here:

http://www.indexdata.com/blog/2010/05/turbomarc-faster-xml-marc-records

That trade-off ought to offend both camps, though I happen to think it's quite 
clever.

MJ

On 2010-10-25, at 3:22 PM, Eric Hellman wrote:

 I think you'd have a very hard time demonstrating any speed advantage to MARC 
 over MARCXML. XML parsers have been speed optimized out the wazoo; If there 
 exists a MARC parser that has ever been speed-optimized without serious 
 compromise, I'm sure someone on this list will have a good story about it.
 
 On Oct 25, 2010, at 3:05 PM, Patrick Hochstenbach wrote:
 
 Dear Nate,
 
 There is a trade-off: do you want very fast processing of data - go for 
 binary data. do you want to share your data globally easily in many (not per 
 se library related) environments - go for XML/RDF. 
 Open your data and do both :-)
 
 Pat
 
 Sent from my iPhone
 
 On 25 Oct 2010, at 20:39, Nate Vack njv...@wisc.edu wrote:
 
 Hi all,
 
 I've just spent the last couple of weeks delving into and decoding a
 binary file format. This, in turn, got me thinking about MARCXML.
 
 In a nutshell, it looks like it's supposed to contain the exact same
 data as a normal MARC record, except in XML form. As in, it should be
 round-trippable.
 
 What's the advantage to this? I can see using a human-readable format
 for poorly-documented file formats -- they're relatively easy to read
 and understand. But MARC is well, well-documented, with more than one
 free implementation in cursory searching. And once you know a binary
 file's format, it's no harder to parse than XML, and the data's
 smaller and processing faster.
 
 So... why the XML?
 
 Curious,
 -Nate
 
 Eric Hellman
 President, Gluejar, Inc.
 41 Watchung Plaza, #132
 Montclair, NJ 07042
 USA
 
 e...@hellman.net 
 http://go-to-hellman.blogspot.com/
 @gluejar


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Kyle Banerjee
On Mon, Oct 25, 2010 at 12:38 PM, Tim Spalding t...@librarything.com wrote:

 Does processing speed of something matter anymore? You'd have to be
 doing a LOT of processing to care, wouldn't you?


Data migrations and data dumps are a common use case. Needing to break or
make hundreds of thousands or millions of records is not uncommon.

kyle


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Tim Spalding
Does processing speed of something matter anymore? You'd have to be
doing a LOT of processing to care, wouldn't you?

Tim

On Mon, Oct 25, 2010 at 3:35 PM, MJ Suhonos m...@suhonos.ca wrote:
 I'll just leave this here:

 http://www.indexdata.com/blog/2010/05/turbomarc-faster-xml-marc-records

 That trade-off ought to offend both camps, though I happen to think it's 
 quite clever.

 MJ

 On 2010-10-25, at 3:22 PM, Eric Hellman wrote:

 I think you'd have a very hard time demonstrating any speed advantage to 
 MARC over MARCXML. XML parsers have been speed optimized out the wazoo; If 
 there exists a MARC parser that has ever been speed-optimized without 
 serious compromise, I'm sure someone on this list will have a good story 
 about it.

 On Oct 25, 2010, at 3:05 PM, Patrick Hochstenbach wrote:

 Dear Nate,

 There is a trade-off: do you want very fast processing of data - go for 
 binary data. do you want to share your data globally easily in many (not 
 per se library related) environments - go for XML/RDF.
 Open your data and do both :-)

 Pat

 Sent from my iPhone

 On 25 Oct 2010, at 20:39, Nate Vack njv...@wisc.edu wrote:

 Hi all,

 I've just spent the last couple of weeks delving into and decoding a
 binary file format. This, in turn, got me thinking about MARCXML.

 In a nutshell, it looks like it's supposed to contain the exact same
 data as a normal MARC record, except in XML form. As in, it should be
 round-trippable.

 What's the advantage to this? I can see using a human-readable format
 for poorly-documented file formats -- they're relatively easy to read
 and understand. But MARC is well, well-documented, with more than one
 free implementation in cursory searching. And once you know a binary
 file's format, it's no harder to parse than XML, and the data's
 smaller and processing faster.

 So... why the XML?

 Curious,
 -Nate

 Eric Hellman
 President, Gluejar, Inc.
 41 Watchung Plaza, #132
 Montclair, NJ 07042
 USA

 e...@hellman.net
 http://go-to-hellman.blogspot.com/
 @gluejar




-- 
Check out my library at http://www.librarything.com/profile/timspalding


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Kyle Banerjee
On Mon, Oct 25, 2010 at 12:22 PM, Eric Hellman e...@hellman.net wrote:

 I think you'd have a very hard time demonstrating any speed advantage to
 MARC over MARCXML. XML parsers have been speed optimized out the wazoo; If
 there exists a MARC parser that has ever been speed-optimized without
 serious compromise, I'm sure someone on this list will have a good story
 about it.


I'll take MarcEdit over an XML parser for MARCXML any day. For a benchmark
test, try roundtripping a million records. Unless I've been messing with the
wrong stuff, the differences are dramatic.
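
For anyone who wants to try the comparison, here is a rough round-trip timing 
sketch, assuming Python with pymarc installed (file names are illustrative, and 
this is nowhere near a rigorous benchmark):

import time
from pymarc import MARCReader, XMLWriter, parse_xml_to_array

t0 = time.time()
with open('records.mrc', 'rb') as fh:
    # Parse binary MARC; skip anything pymarc could not parse.
    records = [r for r in MARCReader(fh) if r is not None]
t1 = time.time()

with open('records.marcxml', 'wb') as out:
    writer = XMLWriter(out)            # serialize the same records as MARCXML
    for record in records:
        writer.write(record)
    writer.close()                     # writes the closing collection element
t2 = time.time()

roundtripped = parse_xml_to_array('records.marcxml')   # parse the XML back in
t3 = time.time()

print('binary parse %.1fs, xml write %.1fs, xml parse %.1fs (%d records)'
      % (t1 - t0, t2 - t1, t3 - t2, len(roundtripped)))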

kyle


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Jonathan Rochkind
Yes, it is designed to be a round-trippable expression of ordinary marc 
in XML. Some reasons this is useful:


1. No maximum record length, unlike actual marc, which tops out at 99,999 bytes.
2. You can use XSLT and other XML tools to work with it, and store it in 
stores optimized for XML (or that only accept XML), etc.
3. You can embed it inside XML schemas that allow arbitrary embeddable 
XML.
4. (Of much lesser importance than these others, but still ends up being 
important to me -- saving the time of the developer does matter) it's a 
lot easier to debug the raw data, doesn't require me to open up a hex 
editor and count bytes.


Nate Vack wrote:

Hi all,

I've just spent the last couple of weeks delving into and decoding a
binary file format. This, in turn, got me thinking about MARCXML.

In a nutshell, it looks like it's supposed to contain the exact same
data as a normal MARC record, except in XML form. As in, it should be
round-trippable.

What's the advantage to this? I can see using a human-readable format
for poorly-documented file formats -- they're relatively easy to read
and understand. But MARC is well, well-documented, with more than one
free implementation in cursory searching. And once you know a binary
file's format, it's no harder to parse than XML, and the data's
smaller and processing faster.

So... why the XML?

Curious,
-Nate

  


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Jonathan Rochkind
MODS was an attempt to mostly-but-not-entirely-roundtrippably represent 
data in MARC in a format that's more 'normal' XML, without packed bytes 
in elements, with element names that are more or less self-documenting, 
etc.  It's caught on even less than MARCXML though, so if you find 
MARCXML under-adopted (I disagree), you won't like MODS.


Personally I think MODS is kind of the worst of both worlds. The only 
reason to stick with something that looks anything like MARC is to be 
round-trippable with legacy MARC, which MODS is not.  But if you're 
going to give that up, you really want more improvements than MODS 
supplies, it's still got a lot of the unfortunate legacy of MARC in it.


Nate Vack wrote:

On Mon, Oct 25, 2010 at 2:09 PM, Tim Spalding t...@librarything.com wrote:
  

- XML is self-describing, binary is not.

Not to quibble, but that's only in a theoretical sense here. Something
like Amazon XML is truly self-describing. MARCXML is self-obfuscating.
At least MARC records kinda imitate catalog cards.



Yeah -- this is kinda the source of my confusion. In the case of the
files I'm reading, it's not that it's hard to find out where the
nMeasurement field lives (it's six short ints starting at offset 64),
but what the field means, and whether or not I care about it.

Switching to an XML format doesn't help with that at all.

WRT character encoding issues and validation: if MARC and MARCXML are
round-trippable, a solution in one environment is equivalent to a
solution in the other.

And I think we've all seen plenty of unvalidated, badly-formed XML,
and plenty with Character Encoding Problems™ ;-)

Thanks for the input!
-Nate

  


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Jonathan Rochkind
Marc in JSON can be a nice middle-ground, faster/smaller than MarcXML 
(although still probably not as compact or as fast as binary), based on a 
standard low-level data format so it's easier to work with using existing 
tools (and developers' eyes) than binary, and there's no maximum record length. 

There have been a couple competing attempts to define a 
marc-expressed-in-json 'standard', none have really caught on yet. I 
like Ross's latest attempt:  
http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/
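
To make that concrete, here is a minimal sketch of dumping binary MARC to 
one-JSON-object-per-line, assuming Python with pymarc installed; whether 
pymarc's as_json() output lines up exactly with Ross's proposed serialization 
is worth checking against his post (file names are illustrative):

from pymarc import MARCReader

with open('records.mrc', 'rb') as fh, open('records.jsonl', 'w') as out:
    for record in MARCReader(fh):
        if record is None:                  # skip anything pymarc could not parse
            continue
        out.write(record.as_json() + '\n')  # one JSON object per line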


Patrick Hochstenbach wrote:

Dear Nate,

There is a trade-off: do you want very fast processing of data? Go for binary 
data. Do you want to share your data globally and easily in many (not per se 
library-related) environments? Go for XML/RDF. 
Open your data and do both :-)


Pat

Sent from my iPhone

On 25 Oct 2010, at 20:39, Nate Vack njv...@wisc.edu wrote:

  

Hi all,

I've just spent the last couple of weeks delving into and decoding a
binary file format. This, in turn, got me thinking about MARCXML.

In a nutshell, it looks like it's supposed to contain the exact same
data as a normal MARC record, except in XML form. As in, it should be
round-trippable.

What's the advantage to this? I can see using a human-readable format
for poorly-documented file formats -- they're relatively easy to read
and understand. But MARC is well, well-documented, with more than one
free implementation in cursory searching. And once you know a binary
file's format, it's no harder to parse than XML, and the data's
smaller and processing faster.

So... why the XML?

Curious,
-Nate



  


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Jonathan Rochkind

Tim Spalding wrote:

Does processing speed of something matter anymore? You'd have to be
doing a LOT of processing to care, wouldn't you?
  


Yes, which sometimes you are. Say, when you're indexing 2 or 3 or 10 
million MARC records into, say, Solr.


Which is faster depends on what language and what libraries you are 
using for both binary marc and marcxml. But in many of our experiences, 
parseing and serializing binary marc _is_ significantly faster than 
parsing and serializing binary marc _is_ significantly faster than 
parsing and serializing marcxml.  That is of course just one of the 
various criteria that come into play when choosing a format.

Here are Bill Dueber's benchmarks comparing MarcXML, marc binary, and a 
marc-in-json format, in Ruby, using various library alternatives.  I 
rather like the marc-in-json format for being a happy medium.  Whether 
it's standard or not doesn't neccesarily matter when you're dealing 
with your own records, passing them through several stops on a 
toolchain, and have tools available that can do it. Who cares if 
any/everyone else uses it.


http://robotlibrarian.billdueber.com/sizespeed-of-various-marc-serializations-using-ruby-marc/


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread MJ Suhonos
JSON++

I routinely re-index about 2.5M JSON records (originally from binary MARC), and 
it's several orders of magnitude faster than XML (measured in single-digit 
minutes rather than double-digit hours).  I'm not sure if it's in the same 
range as binary MARC, but as Tim says, it's plenty fast enough for pragmatic 
purposes.

Unfortunately JSON doesn't have as many mature tools for manipulation as XML 
(yet?), but I'd be inclined to call it the best of both worlds rather than a 
middle-ground or compromise.

MJ

 Marc in JSON can be a nice middle-ground, faster/smaller than MarcXML 
 (although still probably not as binary), based on a standard low-level data 
 format so easier to work with using existing tools (and developers eyes) than 
 binary, no maximum record length. 
 There have been a couple competing attempts to define a 
 marc-expressed-in-json 'standard', none have really caught on yet. I like 
 Ross's latest attempt:  
 http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/
 
 Patrick Hochstenbach wrote:
 Dear Nate,
 
 There is a trade-off: do you want very fast processing of data - go for 
 binary data. do you want to share your data globally easily in many (not per 
 se library related) environments - go for XML/RDF. Open your data and do 
 both :-)
 
 Pat
 
 Sent from my iPhone
 
 On 25 Oct 2010, at 20:39, Nate Vack njv...@wisc.edu wrote:
 
  
 Hi all,
 
 I've just spent the last couple of weeks delving into and decoding a
 binary file format. This, in turn, got me thinking about MARCXML.
 
 In a nutshell, it looks like it's supposed to contain the exact same
 data as a normal MARC record, except in XML form. As in, it should be
 round-trippable.
 
 What's the advantage to this? I can see using a human-readable format
 for poorly-documented file formats -- they're relatively easy to read
 and understand. But MARC is well, well-documented, with more than one
 free implementation in cursory searching. And once you know a binary
 file's format, it's no harder to parse than XML, and the data's
 smaller and processing faster.
 
 So... why the XML?
 
 Curious,
 -Nate

 
  


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Stephen Meyer

Kyle Banerjee wrote:

On Mon, Oct 25, 2010 at 12:38 PM, Tim Spalding t...@librarything.com wrote:


Does processing speed of something matter anymore? You'd have to be
doing a LOT of processing to care, wouldn't you?



Data migrations and data dumps are a common use case. Needing to break or
make hundreds of thousands or millions of records is not uncommon.

kyle


To make this concrete, we process the MARC records from 14 separate 
ILS's throughout the University of Wisconsin System. We extract, sort on 
OCLC number, dedup and merge pieces from any campus that has a record 
for the work. The MARC that we then index and display here


 http://forward.library.wisconsin.edu/catalog/ocm37443537?school_code=WU

is not identical to the version of the MARC record from any of the 4 
schools that hold it.


We extract 13 million records and dedup down to 8 million every week. 
Speed is paramount.
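
(The core of a merge like that can be sketched in a few lines of ruby-marc.
This is purely an illustration, not the Wisconsin pipeline: where the OCLC
number lives (035 $a here), which fields get folded together (852 holdings),
and the filenames are all assumptions.)

  # Sketch: bucket records from several extracts by OCLC number, keep the
  # first record seen, and fold any extra 852 (holdings) fields into it.
  require 'marc'

  def oclc_number(record)
    field = record['035']
    return nil unless field && field['a']
    field['a'][/\d+/]          # strip prefixes like (OCoLC) or ocm
  end

  merged = {}
  %w[campus_a.mrc campus_b.mrc].each do |file|
    MARC::Reader.new(file).each do |record|
      key = oclc_number(record) or next
      if merged[key]
        record.fields('852').each { |holding| merged[key].append(holding) }
      else
        merged[key] = record
      end
    end
  end

  writer = MARC::Writer.new('deduped.mrc')
  merged.each_value { |record| writer.write(record) }
  writer.close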


-sm
--
Stephen Meyer
Library Application Developer
UW-Madison Libraries
436 Memorial Library
728 State St.
Madison, WI 53706

sme...@library.wisc.edu
608-265-2844 (ph)


Just don't let the human factor fail to be a factor at all.
- Andrew Bird, Tables and Chairs


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Ray Denenberg, Library of Congress
It really is possible to make your point without being quite so obnoxious.
Everyone else seems to be able to do so. --Ray

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Alexander Johannesen
Sent: Monday, October 25, 2010 3:38 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARCXML - What is it for?

Hiya,

On Tue, Oct 26, 2010 at 6:26 AM, Nate Vack njv...@wisc.edu wrote:
 Switching to an XML format doesn't help with that at all.

I'm willing to take it further and say that MARCXML was the worst thing the
library world ever did. Some might argue it was a good first step, and that
it was better to have something rather than nothing, to which I respond;

Poppycock!

MARCXML is nothing short of evil. Not only does it go against every
principle of good XML anywhere (don't rely on whitespace, structure over
code, namespace conventions, identity management, document control,
separation of entities and properties, and on and on), it breaks the
ontological commitment that a better treatment of the MARC data could bring,
deterring people from actually a) using the darn thing as anything but a
bare minimal crutch, and b) expanding it to be actually useful and
interesting.

The quicker the library world can get rid of this monstrosity, the better,
although I doubt that will ever happen; it will hang around like a foul
stench for as long as there is MARC in the world. A long time. A long sad
time.

A few extra notes;
   http://shelterit.blogspot.com/2008/09/marcxml-beast-of-burden.html

Can you tell I'm not a fan? :)


Kind regards,

Alex
--
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
--- http://shelter.nu/blog/ --
-- http://www.google.com/profiles/alexander.johannesen ---


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Alexander Johannesen
Ray Denenberg, Library of Congress r...@loc.gov wrote:
 It really is possible to make your point without being quite so obnoxious.

Obnoxious?


Alex
-- 
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
--- http://shelter.nu/blog/ --
-- http://www.google.com/profiles/alexander.johannesen ---


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Bill Dueber
I know there are two parts of this discussion (speed on the one hand,
applicability/features on the other), but for the former, running a little
benchmark just isn't that hard. Aren't we supposed to, you know, prefer to
make decisions based on data?

Note: I'm only testing deserialization because there isn't, as of now, a
fast serialization option for ruby-marc (it uses REXML, and it's dog-slow). I
already looked at marc-in-json vs. marc binary at
http://robotlibrarian.billdueber.com/sizespeed-of-various-marc-serializations-using-ruby-marc/

Benchmark Source: http://gist.github.com/645683

18,883 records as either an XML collection or newline-delimited json.
Open the file, read every record, pull out a title. Repeat 5 times for a
total of 94,415 records (i.e., just under 100K records total).
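
(Very roughly, the loop being timed has the shape below. This is a sketch, not
the actual gist code; it assumes the libxml-ruby and yajl-ruby gems, a
ruby-marc recent enough to build records from MARC-in-JSON hashes
(new_from_hash), and placeholder filenames.)

  require 'benchmark'
  require 'marc'
  require 'yajl'

  Benchmark.bm(20) do |bm|
    # MARCXML via ruby-marc's libxml-backed reader
    bm.report('marcxml w/libxml') do
      MARC::XMLReader.new('records.xml', :parser => 'libxml').each do |record|
        record['245'] && record['245']['a']
      end
    end

    # newline-delimited MARC-in-JSON via yajl
    bm.report('marc-in-json w/yajl') do
      File.foreach('records.ndj') do |line|
        record = MARC::Record.new_from_hash(Yajl::Parser.parse(line))
        record['245'] && record['245']['a']
      end
    end
  end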

Under ruby-marc, using the libxml deserializer is the fastest option. If
you're using the REXML parser, well,  god help us all.

ruby 1.8.7 (2010-08-16 patchlevel 302) [i686-darwin9.8.0]. User time
reported in seconds.

  xml w/libxml 227 seconds
  marc-in-json w/yajl  130 seconds


So... quite a bit faster (more than 40%). For a million records (assuming I
can just say 10*these_values) you're talking about a difference of 16
minutes due to just reading speed. Assuming, of course, you're running your
code on my desktop. Today.

For the 8M records I have to deal with, that'd be roughly 8M * ((227-130)
/ 94,415) ≈ 8,200 seconds, or about 137 minutes. So... a lot.

Of course, if you're using a slower XML library or a slower JSON library,
your numbers will vary quite a bit. REXML is unforgivingly slow, and
json/pure (and even 'json') are quite a bit slower than yajl. And don't
forget that you need to serialize these things from your source somehow...

 -Bill-



On Mon, Oct 25, 2010 at 4:23 PM, Stephen Meyer sme...@library.wisc.edu wrote:

 Kyle Banerjee wrote:

 On Mon, Oct 25, 2010 at 12:38 PM, Tim Spalding t...@librarything.com
 wrote:

  Does processing speed of something matter anymore? You'd have to be
 doing a LOT of processing to care, wouldn't you?


 Data migrations and data dumps are a common use case. Needing to break or
 make hundreds of thousands or millions of records is not uncommon.

 kyle


 To make this concrete, we process the MARC records from 14 separate ILS's
 throughout the University of Wisconsin System. We extract, sort on OCLC
 number, dedup and merge pieces from any campus that has a record for the
 work. The MARC that we then index and display here

  http://forward.library.wisconsin.edu/catalog/ocm37443537?school_code=WU

 is not identical to the version of the MARC record from any of the 4
 schools that hold it.

 We extract 13 million records and dedup down to 8 million every week. Speed
 is paramount.

 -sm
 --
 Stephen Meyer
 Library Application Developer
 UW-Madison Libraries
 436 Memorial Library
 728 State St.
 Madison, WI 53706

 sme...@library.wisc.edu
 608-265-2844 (ph)


 Just don't let the human factor fail to be a factor at all.
 - Andrew Bird, Tables and Chairs




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Walker, David
 b) expanding it to be actually useful and interesting.

But here I think you've missed the very utility of MARC-XML.

Let's say you have a binary MARC file (the kind that comes out of an ILS) and 
want to transform that into MODS, Dublin Core, or maybe some other XML schema.  

How would you do that?  

One way is to first transform the MARC into MARC-XML.  Then you can use XSLT to 
crosswalk the MARC-XML into that other schema.  Very handy.

Your criticisms of MARC-XML all seem to presume that MARC-XML is the goal, the 
end point in the process.  But MARC-XML is really better seen as a utility, a 
middle step between binary MARC and the real goal, which is some other useful 
and interesting XML schema.
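
(To make that concrete: a sketch only, with ruby-marc doing the
binary-MARC-to-MARCXML step and Nokogiri applying the stylesheet. The
stylesheet filename is a placeholder for whichever of LoC's MARCXML-to-MODS
stylesheets you actually use, and the other filenames are placeholders too.)

  require 'marc'
  require 'nokogiri'

  # step 1: serialize binary MARC as MARCXML
  writer = MARC::XMLWriter.new('records.xml')
  MARC::Reader.new('records.mrc').each { |record| writer.write(record) }
  writer.close

  # step 2: crosswalk the MARCXML to MODS (or whatever) with XSLT
  xslt = Nokogiri::XSLT(File.read('MARC21slim2MODS.xsl'))
  mods = xslt.transform(Nokogiri::XML(File.read('records.xml')))
  File.write('records-mods.xml', mods.to_xml)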

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Alexander 
Johannesen [alexander.johanne...@gmail.com]
Sent: Monday, October 25, 2010 12:38 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARCXML - What is it for?

Hiya,

On Tue, Oct 26, 2010 at 6:26 AM, Nate Vack njv...@wisc.edu wrote:
 Switching to an XML format doesn't help with that at all.

I'm willing to take it further and say that MARCXML was the worst
thing the library world ever did. Some might argue it was a good first
step, and that it was better to have something rather than nothing, to
which I respond;

Poppycock!

MARCXML is nothing short of evil. Not only does it go against every
principle of good XML anywhere (don't rely on whitespace, structure
over code, namespace conventions, identity management, document
control, separation of entities and properties, and on and on), it
breaks the ontological commitment that a better treatment of the MARC
data could bring, deterring people from actually a) using the darn
thing as anything but a bare minimal crutch, and b) expanding it to be
actually useful and interesting.

The quicker the library world can get rid of this monstrosity, the
better, although I doubt that will ever happen; it will hang around
like a foul stench for as long as there is MARC in the world. A long
time. A long sad time.

A few extra notes;
   http://shelterit.blogspot.com/2008/09/marcxml-beast-of-burden.html

Can you tell I'm not a fan? :)


Kind regards,

Alex
--
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
--- http://shelter.nu/blog/ --
-- http://www.google.com/profiles/alexander.johannesen ---


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Eric Lease Morgan
On Oct 25, 2010, at 8:56 PM, Walker, David wrote:

 Your criticisms of MARC-XML all seem to presume that MARC-XML is the goal, 
 the end point in the process.  But MARC-XML is really better seen as a 
 utility, a middle step between binary MARC and the real goal, which is some 
 other useful and interesting XML schema.

Exactly.

-- 
Eric Morgan


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Alexander Johannesen
On Tue, Oct 26, 2010 at 11:56 AM, Walker, David dwal...@calstate.edu wrote:
 Your criticisms of MARC-XML all seem to presume that MARC-XML is the
 goal, the end point in the process.  But MARC-XML is really better seen as a
 utility, a middle step between binary MARC and the real goal, which is some
 other useful and interesting XML schema.

How do you create an ontological commitment in a community to an
expanding and useful set of tools and vocabularies? I think I need to
remind people of what MARCXML is supposed to be ;

a framework for working with MARC data in a XML environment. This
framework is intended to be flexible and extensible to allow users to
work with MARC data in ways specific to their needs. The framework
itself includes many components such as schemas, stylesheets, and
software tools.

I'm not assuming MARCXML is a goal, no matter how we define that. I'm
pooh-poohing MARCXML for the semantics we, as a community, have been
given by a process I suspect had goals very different from reality.
Very few people would work with MARC through MARCXML; they would use
it to convert it, filter it, and hack around it into something else
entirely. And I'm afraid lots of people are missing how development in
a community gets stunted by embracing tools that push a package that
inhibits innovation. So, here's the point, paraphrased;

   Here's our new thing. And we did it by simply converting all our
MARC into MARCXML that runs on a cron job every midnight, and a bit of
horrendous XSLT that's impossible to maintain.

   But it looks just like the old thing using MARC and some templates?

   Ah yes, but now we're doing it in XML!

   (Yeah, yeah, your mileage will vary)

I'm sorry if I'm overly pessimistic about the XML goodness in the
world, not for the XML itself, but the consequences of the named
entities involved. I've been a die-hard XML wonk for far too many
years, and the tools in that tool-chest doesn't automatically solve
hard problems better by wrapping stuff up in angle brackets, and -
dare I say it? - perhaps introduces a whole fleet of other problems
rarely talked about when XML is the latest buzz-word, like using a
document model on what's a traditional records model, character
encodings, whitespace issues, unicode, size and efficiencies (the
other part of this thread), and so on.

But let me also be a bit more specific about that hard semantic
problem I'm talking about;

Lots of people around the library world infrastructure will think
that since your data is now in XML it has taken some important step
towards being interoperable with the rest of the world, that library
data now is part of the real world in *any* meaningful way, but this
is simply, demonstrably, deceptively not true. Having our data in XML
has killed a few good projects where people have gone "A new project
to convert our MARC into useful XML? Aha! LoC has already solved that
problem for us."

Btw, to those who find me so obnoxious: at no point do I say it was
intentionally evil, just evil all the same. The road to hell is, as
always, paved with good intentions.


Alex
-- 
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
--- http://shelter.nu/blog/ --
-- http://www.google.com/profiles/alexander.johannesen ---


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Bill Dueber
On Mon, Oct 25, 2010 at 9:32 PM, Alexander Johannesen 
alexander.johanne...@gmail.com wrote:

 Lots of people around the library world infra-structure will think
 that since your data is now in XML it has taken some important step
 towards being inter-operable with the rest of the world, that library
 data now is part of the real world in *any* meaningful way, but this
 is simply demonstrably deceivingly not true.


Here, I think you're guilty of radically underestimating lots of people
around the library world. No one thinks MARC is a good solution to our
modern problems, and no one who actually knows what MARC is has trouble
understanding MARC-XML as an XML serialization of the same old data --
certainly not anyone capable of meaningful contribution to work on an
alternative.

You seem to presuppose that there's an enormous pent-up energy poised to
sweep in changes to an obviously-better data format, and that the existence
of MARC-XML somehow defuses all that energy. The truth is that a high
percentage of people that work with MARC data actively think about (or
curse) things that are wrong with it and gobs and gobs of ridiculously-smart
people work on a variety of alternate solutions (not the least of which is
RDA) and get their organizations to spend significant money to do so. The
problem we're dealing with is *hard*. Mind-numbingly hard.

The library world has several generations of infrastructure built around
MARC (by which I mean AACR2), and devising data structures and standards
that are a big enough improvement over MARC to warrant replacing all
that infrastructure is an engineering and political nightmare. I'm happy to
take potshots at the RDA stuff from the sidelines, but I never forget that
I'm on the sidelines, and that the people active in the game are among the
best and brightest we have to offer, working on a problem that invariably
seems more intractable the deeper in you go.

If you think MARC-XML is some sort of an actual problem, and that people
just need to be shouted at to realize that and do something about it, then,
well, I think you're just plain wrong.

  -Bill-

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Alexander Johannesen
On Tue, Oct 26, 2010 at 12:48 PM, Bill Dueber b...@dueber.com wrote:
 Here, I think you're guilty of radically underestimating lots of people
 around the library world. No one thinks MARC is a good solution to
 our modern problems, and no one who actually knows what MARC
 is has trouble understanding MARC-XML as an XML serialization of
 the same old data -- certainly not anyone capable of meaningful
 contribution to work on an alternative.

Slow down, Tex. "Lots of people in the library world" is not the same
as developers, or even good developers, or even good XML developers,
or even good XML developers who know what the document model imposes
on a data-centric approach.

 The problem we're dealing with is *hard*. Mind-numbingly hard.

This is no justification for not doing things better. (And I'd love to
know what the hard bits are; it's always interesting to hear from various
people what they think the *real* library problems are, as opposed to
any other problem they have.)

 The library world has several generations of infrastructure built
 around MARC (by which I mean AACR2), and devising data
 structures and standards that are a big enough improvement over
  MARC to warrant replacing all that infrastructure is an engineering
  and political nightmare.

Political? For sure. Engineering? Not so much. This is just that whole
blinded by MARC issue that keeps cropping up from time to time, and
rightly so; it is truly a beast - at least the way we have come to
know it through AACR2 and all its friends and its death-defying focus
on all things bibliographic - that has paralyzed library innovation,
probably to the point of making libraries almost irrelevant to the
world.

 I'm happy to take potshots at the RDA stuff from the sidelines, but I never
 forget that I'm on the sidelines, and that the people active in the game are
 among the best and brightest we have to offer, working on a problem that
  invariably seems more intractable the deeper in you go.

Well, that's a pretty scary sentence, for all sorts of reasons, but I
think I shall not go there.

 If you think MARC-XML is some sort of an actual problem

What, because you don't agree with me the problem doesn't exist? :)

 and that people
 just need to be shouted at to realize that and do something about it, then,
 well, I think you're just plain wrong.

Fair enough, although you seem to be under the assumption that all of
the stuff I'm saying is a figment of my imagination (I've been
involved in several projects lambasted because managers think MARCXML
is solving some imaginary problem; this is not bullshit, but pain and
suffering from the battlefields of library development), that I'm not
one of those developers (or one of you, although judging from this
discussion it's clear that I am not), that the things I say somehow
doesn't apply because you don't agree with, umm, what I'm assuming is
my somewhat direct approach to stating my heretic opinions.


Alex
-- 
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
--- http://shelter.nu/blog/ --
-- http://www.google.com/profiles/alexander.johannesen ---


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Dana Pearson
I'm not a coder, but I undertook a study of XML some years after it
came onto the scene, with a likely confused notion that it would be
the next significant technology. I learned some XSL and later was able
to weave PubMed Central journal information (CSV transformed into XML)
together with Dublin Core metadata of journal articles into MARCXML
during harvest with MarcEdit (which the inestimable Terry Reese
continues to tweak).  Also used the same XML journal data to augment
NLM journal records with PubMed Central holdings and other data with
a transform in my IDE, though it took me weeks to get right... so, no
aspirations to become a coder.

Probably did not get all of the MARC cataloging rules right, and I can
empathize with those who come to MARC and cataloging standards without
cataloging training or experience. My library experience was primarily
as library director... my expertise on library specializations would
always be in question.

regards,
dana








-- 
Dana Pearson
dbpearsonmlis.com


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Bill Dueber
On Mon, Oct 25, 2010 at 10:10 PM, Alexander Johannesen 
alexander.johanne...@gmail.com wrote:

 Political? For sure. Engineering? Not so much.


Ok. Solve it. Let us know when you're done.


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Bill Dueber
Sorry. That was rude, and uncalled for. I disagree that the problem is
easily solved, even without the politics. There've been lots of attempts to
try to come up with a sufficiently expressive toolset for dealing with
biblio data, and we're still working on it. If you do think you've got some
insight, I'm sure we're all ears, but try to frame it in terms of the existing
work if you can (RDA, some of the Dublin Core stuff, etc.) so we have a
frame of reference.

On Mon, Oct 25, 2010 at 10:18 PM, Bill Dueber b...@dueber.com wrote:

 On Mon, Oct 25, 2010 at 10:10 PM, Alexander Johannesen 
 alexander.johanne...@gmail.com wrote:

 Political? For sure. Engineering? Not so much.


 Ok. Solve it. Let us know when you're done.



 --
 Bill Dueber
 Library Systems Programmer
 University of Michigan Library




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library