Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-18 Thread Jonathan Rochkind
So do you think the marc-hash-to-json proto-spec should suggest that 
the encoding HAS to be UTF-8, or should it leave it open to anything 
that's legal JSON?  (Is there a problem I don't know about with 
expressing characters outside of the Basic Multilingual Plane in 
UTF-8?  Any Unicode char can be encoded in any of the Unicode encodings, 
right?)

If collections means what I think, Bill's blog proto-spec says they 
should be serialized as JSON-separated-by-newlines, right?  That is, 
JSON for each record, separated by newlines, rather than the alternative 
approach you hypothesize there.  There are various reasons to prefer 
JSON-separated-by-newlines, which is an actual convention used in the 
wild, not something made up just for here.
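For what it's worth, the newline-delimited form is trivial to produce and consume one record at a time. A minimal Python sketch, using placeholder records rather than the proto-spec's actual field layout:

```python
import json

# Hypothetical records; the real marc-hash layout is defined by the proto-spec.
records = [
    {"type": "marc-hash", "version": [1, 0], "leader": "placeholder"},
    {"type": "marc-hash", "version": [1, 0], "leader": "placeholder"},
]

# Serialize: one JSON document per line (newline-delimited JSON).
ndj = "\n".join(json.dumps(r) for r in records)

# Parse: each line stands alone, so a consumer can stream records
# without holding the whole collection in memory.
parsed = [json.loads(line) for line in ndj.splitlines()]
assert parsed == records
```

Each line is independently valid JSON, which is what makes this convention friendly to streaming producers and consumers.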


Jonathan

Dan Scott wrote:

Hey Bill:

Do you have unit tests for MARC-HASH / JSON anywhere? If you do, that would 
make it easier for me to create a compliant PHP File_MARC_JSON variant, which 
I'll be happy-ish to create.

The only concerns I have with your write-up are:
  * JSON itself allows UTF-8, UTF-16, and UTF-32 encoding - and we've seen in 
Evergreen some cases where characters outside of the Basic Multilingual Plane 
are required.  We eventually wound up resorting to surrogate pairs in that 
case, so maybe this isn't a real issue.
  * You've mentioned that you would like to see better support for collections 
in File_MARC / File_MARCXML, but I don't see any mention of how collections 
would work in MARC-HASH / JSON. Would it just be something like the following?

{
  "collection": [
    {
      "type": "marc-hash",
      "version": [1, 0],
      "leader": "…leader string…",
      "fields": ["array", "of", "fields"]
    },
    {
      "type": "marc-hash",
      "version": [1, 0],
      "leader": "…leader string…",
      "fields": ["array", "of", "fields"]
    }
  ]
}

Dan

  



Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-18 Thread Dan Scott
I hate Groupwise for forcing me to top-post.

Yes, you are right about everything. Limiting MARC-HASH to just UTF-8, rather 
than supporting the full range of encodings allowed by JSON, probably makes it 
easier to generate and parse; it will bloat the size of the format for 
characters outside of the Basic Multilingual Plane, but probably nobody cares; 
bandwidth is cheap, right? And this is primarily meant as a transmission format.

I missed the part in the blog entry about newline-delimited JSON because I 
was specifically looking for a mention of collections. Newline-delimited JSON 
would work, yes, and would probably be easier / faster / less memory-intensive 
to parse.

Dan

 Jonathan Rochkind rochk...@jhu.edu 03/18/10 10:41 AM 
So do you think the marc-hash-to-json proto-spec should suggest that 
the encoding HAS to be UTF-8, or should it leave it open to anything 
that's legal JSON?  (Is there a problem I don't know about with 
expressing characters outside of the Basic Multilingual Plane in 
UTF-8?  Any Unicode char can be encoded in any of the Unicode encodings, 
right?)

If collections means what I think, Bill's blog proto-spec says they 
should be serialized as JSON-separated-by-newlines, right?  That is, 
JSON for each record, separated by newlines, rather than the alternative 
approach you hypothesize there.  There are various reasons to prefer 
JSON-separated-by-newlines, which is an actual convention used in the 
wild, not something made up just for here.

Jonathan

Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-18 Thread Jonathan Rochkind
Oh, I wasn't actually suggesting limiting to UTF-8 was the right way to 
go, I was asking your opinion!  It's not at all clear to me, but if your 
opinion is that UTF-8 is indeed the right way to go, that's comforting. :)

Bandwidth _does_ matter, I think; it's primarily intended as a 
transmission format, and the reason _I_ am interested in it as a 
transmission format over MarcXML is in large part precisely that it 
will be so much smaller a package; I'm running into various performance 
problems caused by the very large package size of MarcXML. (Disk space 
might be cheap, but bandwidth, over the network or to the file system, 
is not necessarily, for me anyway.)

But I'm not sure I'm concerned about UTF-8 bloating the size of the 
response; I think it will still be manageable and worth it to avoid 
confusion. I pretty much do _everything_ in UTF-8 myself these days, 
because it's just not worth the headache to me to do anything else. But 
I have MUCH less experience dealing with international character sets 
than you, which is why I was curious as to your opinion.  There's no 
reason the marc-hash-in-json proto-spec couldn't allow any valid JSON 
character encoding, if you/we/someone thinks it's necessary or more 
convenient.


Jonathan

Dan Scott wrote:

I hate Groupwise for forcing me to top-post.

Yes, you are right about everything. Limiting MARC-HASH to just UTF8, rather 
than supporting the full range of encodings allowed by JSON, probably makes it 
easier to generate and parse; it will bloat the size of the format for 
characters outside of the Basic Multilingual Plane but probably nobody cares, 
bandwidth is cheap, right? And this is primarily meant as a transmission format.

I missed the part in the blog entry about the newline-delimited JSON because I was 
specifically looking for a mention of collections. newline-delimited JSON 
would work, yes, and probably be easier / faster / less memory-intensive to parse.

Dan

  


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-17 Thread Dan Scott
Hey Bill:

Do you have unit tests for MARC-HASH / JSON anywhere? If you do, that would 
make it easier for me to create a compliant PHP File_MARC_JSON variant, which 
I'll be happy-ish to create.

The only concerns I have with your write-up are:
  * JSON itself allows UTF-8, UTF-16, and UTF-32 encoding - and we've seen in 
Evergreen some cases where characters outside of the Basic Multilingual Plane 
are required.  We eventually wound up resorting to surrogate pairs in that 
case, so maybe this isn't a real issue.
  * You've mentioned that you would like to see better support for collections 
in File_MARC / File_MARCXML, but I don't see any mention of how collections 
would work in MARC-HASH / JSON. Would it just be something like the following?

{
  "collection": [
    {
      "type": "marc-hash",
      "version": [1, 0],
      "leader": "…leader string…",
      "fields": ["array", "of", "fields"]
    },
    {
      "type": "marc-hash",
      "version": [1, 0],
      "leader": "…leader string…",
      "fields": ["array", "of", "fields"]
    }
  ]
}

Dan

 Bill Dueber b...@dueber.com 03/15/10 12:22 PM 
I'm pretty sure Andrew was (a) completely unaware of anything I'd done, and
(b) looking to match marc-xml as strictly as reasonable.

I also like the array-based rather than hash-based format, but I'm not gonna
go to the mat for it if no one else cares much.

I would like to see ind1 and ind2 get their own fields, though, for easier
use of stuff like jsonpath in json-centric nosql databases.

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-16 Thread Ere Maijala

On 03/15/2010 06:22 PM, Houghton,Andrew wrote:

Secondly, Bill's specification loses semantics from ISO 2709, as I
previously pointed out.  His specification clumps control and data
fields into one property named fields. According to ISO 2709, control
and data fields have different semantics.  You could have a control
field tagged as 001 and a data field tagged as 001 which have
different semantics.  MARC-21 has imposed certain rules for


I won't comment on Bill's proposal, but I'll just say that I don't think 
you can have a control field and a data field with the same code in a 
single MARC format. Well, technically it's possible, but in practice 
everything I've seen relies on the rules of the MARC format at hand. You 
could actually say that ISO 2709 works more like Bill's JSON, and 
MARCXML is the different one, as in ISO 2709 the directory doesn't 
separate control and data fields.
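To make that concrete: an ISO 2709 directory entry has a single fixed shape for every field. This sketch assumes the entry map MARC 21 uses (4-digit field length, 5-digit starting position); ISO 2709 itself lets the leader vary those widths:

```python
def parse_directory_entry(entry):
    # 3-character tag + 4-digit field length + 5-digit starting position.
    # Nothing in this 12-byte shape marks a field as control vs. data;
    # that distinction comes from the MARC format's tag conventions.
    assert len(entry) == 12
    return entry[0:3], int(entry[3:7]), int(entry[7:12])

tag, length, start = parse_directory_entry("245008900123")
assert (tag, length, start) == ("245", 89, 123)
```

Since the directory treats every field identically, the control/data split really is a convention of the MARC format at hand, not of ISO 2709 itself.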


--Ere

--
Ere Maijala (Mr.)
The National Library of Finland


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-16 Thread Jonathan Rochkind
Bill's format would allow there to be a control field and a data field 
with the same tag, however, so it's all good either way.


Ere Maijala wrote:

I won't comment on Bill's proposal, but I'll just say that I don't think 
you can have a control field and a data field with the same code in a 
single MARC format. Well, technically it's possible, but in practice 
everything I've seen relies on rules of the MARC format at hand. You 
could actually say that ISO 2709 works more like Bill's JSON, and 
MARCXML is the different one, as in ISO 2709 the directory doesn't 
separate control and data fields.


--Ere

  


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Jonathan Rochkind
I would just ask why you didn't use Bill Dueber's already existing 
proto-spec, instead of making up your own incompatible one.

I'd think we could somehow all do the same consistent thing here.

Since my interest in marc-json is getting as small a package as possible 
for transfer across the wire, I prefer Bill's approach.


http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/

Houghton,Andrew wrote:

From: Houghton,Andrew
Sent: Saturday, March 06, 2010 06:59 PM
To: Code for Libraries
Subject: RE: [CODE4LIB] Q: XML2JSON converter

Depending on how much time I get next week I'll talk with the developer
network folks to see what I need to do to put a specification under
their infrastructure



I finished documenting our existing use of MARC-JSON.  The specification can be 
found on the OCLC developer network wiki [1].  Since it is a wiki, registered 
developer network members can edit the specification and I would ask that you 
refrain from doing so.

However, please do use the discussion tab to record issues with the 
specification or add additional information to existing issues.  There are 
already two open issues on the discussion tab and you can use them as a 
template for new issues.  The first issue is Bill Dueber's request for some 
sort of versioning and the second issue is whether the specification should 
specify the flavor of MARC, e.g., marc21, unicode, etc.

It is recommended that you place issues on the discussion tab since that will 
be the official place for documenting and disposing of them.  I do monitor this 
listserv and the OCLC developer network listserv, but I only selectively look 
at messages on those listservs.  If you would like to use this listserv or 
the OCLC developer network listserv to discuss the MARC-JSON specification, 
make sure you place MARC-JSON in the subject line, to give me a clue that I 
*should* look at that message, or directly CC my e-mail address on your post.

This message marks the beginning of a two-week comment period on the 
specification, which will end at midnight on 2010-03-28.

[1] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11


Thanks, Andy. 

  


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Jonathan Rochkind
 Sent: Monday, March 15, 2010 11:53 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]
 
 I would just ask why you didn't use Bill Dueber's already existing
 proto-spec, instead of making up your own incompatible one.

Because the internal use of our specification predated Bill's blog entry, dated 
2010-02-25, by almost a year.  Bill's post reminded me that I had not published 
or publicly discussed our specification.

Secondly, Bill's specification loses semantics from ISO 2709, as I previously 
pointed out.  His specification clumps control and data fields into one 
property named fields. According to ISO 2709, control and data fields have 
different semantics.  You could have a control field tagged as 001 and a data 
field tagged as 001 which have different semantics.  MARC-21 has imposed 
certain rules for assignment of tags such that this isn't a concern, but other 
systems based on ISO 2709 may not.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Bill Dueber
 Sent: Monday, March 15, 2010 12:19 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]
 
 I would like to see ind1 and ind2 get their own fields, though, for
 easier
 use of stuff like jsonpath in json-centric nosql databases.

I'll add this issue to the discussion page.  Your point on being able to index 
a specific indicator property has value.  So a resolution to the issue would be 
to have nine indicator properties, per ISO 2709 and MarcXchange, but to require 
only the ones implied by the indicator count in the leader.
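For illustration, using a hypothetical record shape (not the draft's actual property names): once each indicator is its own property, a query can address it by key instead of slicing a combined indicator string:

```python
# Hypothetical field object with separate indicator properties.
field = {
    "tag": "245",
    "ind1": "1",
    "ind2": "0",
    "subfields": [{"code": "a", "value": "A title"}],
}

# A path-style query engine (or plain dict access) can target one
# indicator directly, which is what makes jsonpath-style filtering easy.
assert field["ind1"] == "1"
assert field["ind2"] == "0"
```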


Thanks, Andy.


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Bill Dueber
On the one hand, I'm all for following specs. But on the other...should we
really be too concerned about dealing with the full flexibility of the 2709
spec, vs. what's actually used? I mean, I hope to god no one is actually
creating new formats based on 2709!

If there are real-life examples in the wild of, say, multi-character
indicators, or subfield codes of more than one character, that's one thing.

BTW, in the stuff I proposed, you know a controlfield vs. a datafield
because of the length of the array (2 vs 5); it's well-specified, but by the
size of the tuple, not by label.
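Roughly, a consumer would dispatch on array length; a Python sketch, where the exact element layouts are my assumption about the array-based format rather than quoted from it:

```python
def field_kind(field):
    # Assumed shapes: a control field carries [tag, value] (2 elements),
    # while a data field carries something like
    # [tag, ind1, ind2, subfield_codes, subfield_values] (5 elements).
    if len(field) == 2:
        return "control"
    if len(field) == 5:
        return "data"
    raise ValueError("unrecognized field tuple: %r" % (field,))

assert field_kind(["001", "ocm123456"]) == "control"
assert field_kind(["245", "1", "0", ["a"], ["Some title"]]) == "data"
```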

On Mon, Mar 15, 2010 at 11:22 AM, Houghton,Andrew hough...@oclc.org wrote:

  From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
  Jonathan Rochkind
  Sent: Monday, March 15, 2010 11:53 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]
 
  I would just ask why you didn't use Bill Dueber's already existing
  proto-spec, instead of making up your own incompatible one.

 Because the internal use of our specification predated Bill's blog entry,
 dated 2010-02-25, by almost a year.  Bill's post reminded me that I had not
 published or publicly discussed our specification.

 Secondly, Bill's specification loses semantics from ISO 2709, as I
 previously pointed out.  His specification clumps control and data fields
 into one property named fields. According to ISO 2709, control and data
 fields have different semantics.  You could have a control field tagged as
 001 and a data field tagged as 001 which have different semantics.  MARC-21
 has imposed certain rules for assignment of tags such that this isn't a
 concern, but other systems based on ISO 2709 may not.


 Andy.




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Bill Dueber
 Sent: Monday, March 15, 2010 12:40 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]
 
 On the one hand, I'm all for following specs. But on the other...should
 we really be too concerned about dealing with the full flexibility of
 the 2709 spec, vs. what's actually used? I mean, I hope to god no one 
 is actually creating new formats based on 2709!
 
 If there are real-life examples in the wild of, say, multi-character
 indicators, or subfield codes of more than one character, that's one
 thing.

Yes, there are real-life examples, e.g., MarcXchange, now ISO 25577, being the 
one that comes to mind, where IFLA was *compelled* to create a new MARC-XML 
specification in a different namespace, and the main difference between the 
two specifications was being able to specify up to nine indicator values.  Given 
that the extra indicators are optional in the MarcXchange XML schema, personally 
I feel that IFLA and LC could have just extended the MARC-XML schema and added 
the optional attributes, making all existing MARC-XML documents MarcXchange 
documents, so the library community wouldn't have to deal with two XML 
specifications.  This is another example of the library community creating more 
barriers for people entering their market.

 BTW, in the stuff I proposed, you know a controlfield vs. a datafield
 because of the length of the array (2 vs 5); it's well-specified, but
 by the size of the tuple, not by label.

Ahh... I overlooked that aspect of your proposal.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-08 Thread Benjamin Young

On 3/6/10 6:59 PM, Houghton,Andrew wrote:

From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Bill Dueber
Sent: Saturday, March 06, 2010 05:11 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Q: XML2JSON converter

Anyway, hopefully, it won't be a huge surprise that I don't disagree
with any of the quote above in general; I would assert, though, that
application/json and application/marc+json should both return JSON
(in the same way that text/xml, application/xml, and
application/marc+xml can all be expected to return XML).
Newline-delimited json is starting to crop up in a few places
(e.g. couchdb) and should probably have its own mime type
and associated extension. So I would say something like:

application/json -- return json (obviously)
application/marc+json  -- return json
application/marc+ndj  -- return newline-delimited json
 

This sounds like consensus on how to deal with newline-delimited JSON in a 
standards based manner.

I'm not familiar with CouchDB, but I am using MongoDB which is similar.  I'll 
have to dig into how they deal with this newline-delimited JSON.  Can you 
provide any references to get me started?
   
Rather than using a newline-delimited format (the whole of which would 
not together be considered a valid JSON object) why not use the JSON 
array format with or without new lines? Something like:


[{"key": "value"}, {"key": "value"}]

You could include new line delimiters after the , if you needed to 
make pre-parsing easier (in a streaming context), but may be able to get 
away with just looking for the next , or ] after each valid JSON object.


That would allow the entire stream, if desired, to be saved to disk and 
read in as a single JSON object, or the same API to serve smaller JSON 
collections in a JSON standard way.
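A quick Python check of that property, with placeholder records: the array-wrapped collection is itself one well-formed JSON document, while bare concatenation of records is not:

```python
import json

# Placeholder records standing in for serialized MARC records.
records = [{"key": "value"}, {"key": "value"}]

# The array-wrapped collection is a single well-formed JSON document,
# so the entire stream round-trips through an ordinary parser.
doc = json.dumps(records)
assert json.loads(doc) == records

# Without the brackets and commas, concatenated records do not parse
# as one document.
bare = "".join(json.dumps(r) for r in records)
try:
    json.loads(bare)
    raise AssertionError("expected a parse failure")
except json.JSONDecodeError:
    pass
```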


CouchDB uses this array notation when returning multiple document 
revisions in one request. CouchDB also offers a slightly more annotated 
structure (which might be useful with streaming as well):


{
  "total_rows": 2,
  "offset": 0,
  "rows": [{"key": "value"}, {"key": "value"}]
}

Rows here plays the same role as the above array-based format, but 
provides an initial row count for the consumer to use (if it wants) for 
knowing what's ahead. The offset key is specific to CouchDB, but 
similar application-specific information could be stored in the header 
of the JSON object using this method.

In all cases, we should agree on a standard record serialization,
though, and the pure-json returns should include something that
indicates what the heck it is (hopefully a URI that can act as a
distinct namespace-type identifier, including a version in it).
 

I agree that our MARC-JSON serialization needs some namespace identifier in 
it, and it occurred to me that the way it handles indicators, e.g., the ind1 and 
ind2 properties, might be better handled as an array to accommodate IFLA's 
MARC-XML-ish format, where fields can have from 1 to 9 indicator values.

BTW, our MARC-JSON content is specified in Unicode, not MARC-8, per the JSON 
standard, which means you need to use \u notation to specify characters in 
strings; not sure I made that clear in earlier posts.  A downside to the 
current ECMA 262 specification is that it doesn't support a long-form \U 
escape for the extended characters, as Python does.  Hopefully that will get 
rectified in a future ECMA 262 specification.
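A quick Python illustration of that point (not part of the specification): a character outside the BMP has no single four-digit \u escape of its own, so an ASCII-safe JSON encoder emits a UTF-16 surrogate pair:

```python
import json

# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane,
# so the default ASCII-safe encoder must split it into a surrogate pair.
clef = "\U0001D11E"
escaped = json.dumps(clef)
assert escaped == '"\\ud834\\udd1e"'

# Decoding the pair recovers the original character.
assert json.loads(escaped) == clef
```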

   

The question for me, I think, is whether within this community,  anyone
who provides one of these types (application/marc+json and
application/marc+ndj) should automatically be expected to provide both.
I don't have an answer for that.
 
As far as mime-type declarations go in general, I'd recommend avoiding 
any format specific mime types and sticking to the application/json 
format and providing document level hints (if needed) for the content 
type. If you do find a need for the special case mime types, I'd 
recommend still responding to Accept: application/json whenever 
possible--for the sake of standards. :)


All told, I'm just glad to see this discussion being had. I'll be happy 
to provide some CouchDB test cases (replication, etc) if that's of 
interest to anyone.


Thanks,
Benjamin



Re: [CODE4LIB] Q: XML2JSON converter

2010-03-08 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Benjamin Young
 Sent: Monday, March 08, 2010 09:32 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter
 
 Rather than using a newline-delimited format (the whole of which would
 not together be considered a valid JSON object) why not use the JSON
 array format with or without new lines? Something like:
 
 [{"key":"value"}, {"key":"value"}]
 
 You could include new line delimiters after the , if you needed to
 make pre-parsing easier (in a streaming context), but may be able to
 get
 away with just looking for the next , or ] after each valid JSON
 object.
 
 That would allow the entire stream, if desired, to be saved to disk and
 read in as a single JSON object, or the same API to serve smaller JSON
 collections in a JSON standard way.

I think we just went around full circle again.  There appear to be two distinct 
use cases when dealing with MARC collections.  The first conforms to the ECMA 
262 JSON subset.  Which is what you described, above:

[ { "key" : "value" }, { "key" : "value" } ]

Its media type should be specified as application/json.

The second use case, which there was some discussion between Bill Dueber and 
myself, is a newline delimited format where the JSON array specifiers are 
omitted and the objects are specified one per line without commas separating 
objects.  The misunderstanding between Bill and me was that I thought this 
malformed JSON was being sent as media type application/json, which is not 
what he was proposing.  This newline delimited JSON appears to be an 
import/export format in both CouchDB and MongoDB.
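The attraction of that import/export style can be sketched in a few lines (record contents are hypothetical stand-ins for MARC records):

```python
import json

# Hypothetical stand-ins for serialized MARC records.
records = [{"leader": "01192cz"}, {"leader": "00714cam"}]

# Export: one complete JSON object per line.  The file as a whole is
# not one valid JSON value, but every individual line is.
ndj = "".join(json.dumps(rec) + "\n" for rec in records)

# Import: no streaming parser needed; parse one line at a time,
# holding only a single record in memory if desired.
restored = [json.loads(line) for line in ndj.splitlines()]
assert restored == records
```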

In the FAST work I'm doing I'm probably going to take an alternate approach to 
generating our 10,000 MARC record collection files for download.  The approach 
I'm going to take is to create valid JSON but make it easier for the CouchDB 
and MongoDB folks to import the collection of records.  The format will be:

[
{ "key" : "value" }
,
{ "key" : "value" }
]

the objects will be one per line, but the array specifier and comma delimiters 
between objects will appear on a separate line.  This would allow the CouchDB 
and MongoDB folks to run a simple sed script on the file before import:

sed -e '/^.$/d' file.json > file.txt

or if they are reading the data as a raw text file, they can just ignore all 
lines that start with opening brace, comma, or closing brace, or alternately 
only process lines starting with an opening brace.
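A reader taking that last route, as a sketch (the sample records are hypothetical; the layout matches the format described above):

```python
import json

# The collection format described above: valid JSON overall, with the
# array brackets and the comma delimiters on their own lines.
download = """[
{ "leader" : "01192cz" }
,
{ "leader" : "00714cam" }
]
"""

# Option 1: the whole file is valid JSON, so load it directly.
collection = json.loads(download)

# Option 2: treat it as raw text and keep only the lines that start
# with an opening brace, parsing each record independently.
records = [json.loads(line) for line in download.splitlines()
           if line.startswith("{")]

assert records == collection
```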

However, this doesn't mean that I'm balking on pursuing a separate media type 
specific to the library community that specifies a specific MARC JSON 
serialization encoded as a single line.

I see multiple steps here, with the first being a consensus on serializing MARC 
(ISO 2709) in JSON, which begins with me documenting it so people can throw 
some darts at it.  I don't think what we are proposing is controversial, but it's 
beneficial to have a variety of perspectives as input.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-06 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Bill Dueber
 Sent: Friday, March 05, 2010 08:48 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter
 
 On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew hough...@oclc.org
 wrote:
 
  OK, I will bite, you stated:
 
  1. That large datasets are a problem.
  2. That streaming APIs are a pain to deal with.
  3. That tool sets have memory constraints.
 
  So how do you propose to process large JSON datasets that:
 
  1. Comply with the JSON specification.
  2. Can be read by any JavaScript/JSON processor.
  3. Do not require the use of streaming API.
  4. Do not exceed the memory limitations of current JSON processors.
 
 
 What I'm proposing is that we don't process large JSON datasets; I'm
 proposing that we process smallish JSON documents one at a time by
 pulling
 them out of a stream based on an end-of-record character.
 
 This is basically what we use for MARC21 binary format -- have a
 defined
 structure for a valid record, and separate multiple well-formed record
 structures with an end-of-record character. This preserves JSON
 specification adherence at the record level and uses a different scheme
 to represent collections. Obviously, MARC-XML uses a different 
 mechanism to define a collection of records -- putting well-formed 
 record structures inside a collection tag.
 
 So... I'm proposing define what we mean by a single MARC record
 serialized to JSON (in whatever format; I'm not very opinionated 
 on this point) that preserves the order, indicators, tags, data, 
 etc. we need to round-trip between marc21binary, marc-xml, and 
 marc-json.
 
 And then separate those valid records with an end-of-record character
 -- \n.

Ok, what I see here are divergent use cases and the willingness of the library 
community to break existing Web standards.  This is how the library community 
makes it more difficult to use their data and places additional barriers for 
people and organizations to enter their market because of these library centric 
protocols and standards.

If I were to try to sell this idea to the Web community, at large, and tell 
them that when they send an HTTP request with an Accept: application/json 
header to our services, our services will respond with a 200 HTTP status and 
deliver them malformed JSON, I would be immediately impaled with multiple 
arrows and daggers :(  Not to mention that OCLC would be disparaged by a certain 
crowd in their blogs as being idiots who cannot follow standards.

OCLC's goal is to use and conform to Web standards to make library data easier 
to use by people or organizations outside the library community, otherwise 
libraries and their data will become irrelevant.  The JSON serialization is a 
standard and the Web community expects that when they make HTTP requests with 
an Accept: application/json header they will get back JSON conforming 
to the standard.  JSON's main use case is in AJAX scenarios where you are not 
supposed to be sending megabytes of data across the wire.

Your proposal is asking me to break a widely deployed Web standard that is used 
by AJAX frameworks and by millions (ok, many) Web sites.

 Unless I've read all this wrong, you've come to the conclusion that the
 benefit of having a JSON serialization that is valid JSON at both the
 record and collection level outweighs the pain of having to deal with
 a streaming parser and writer.  This allows a single collection to be
 treated as any other JSON document, which has obvious benefits (which 
 I certainly don't mean to minimize) and all the drawbacks we've been 
 talking about *ad nauseam*.

The goal is to adhere to existing Web standards and your underlying assumption 
is that you can or will be retrieving large datasets through an AJAX scenario.  
As I pointed out this is more an API design issue and due to the way AJAX works 
you should never design an API in that manner.  Your assumption that you can or 
will be retrieving large datasets through an AJAX scenario is false given the 
caveat of a well designed API.  Therefore you will never be put into the 
scenario requiring the use of JSON streaming, so your argument from this point 
of view is moot.

But for argument's sake let's say you could retrieve a line delimited list of 
JSON objects.  You can no longer use any existing AJAX framework for getting 
back that JSON since it's malformed.  You could use the AJAX framework's 
XMLHTTP to retrieve this line delimited list of JSON objects, but this still 
doesn't help because the XMLHTTP object will keep the entire response in memory.

So when our service sends the user agent 100MB of line delimited JSON objects, 
the XMLHTTP object is going to try to slurp the entire 100MB HTTP response into 
memory and that is going to exceed the memory requirement of the JSON/JavaScript 
processor or the browser that is controlling the XMLHTTP object and the 
application will never get

Re: [CODE4LIB] Q: XML2JSON converter

2010-03-06 Thread Bill Dueber
On Sat, Mar 6, 2010 at 1:57 PM, Houghton,Andrew hough...@oclc.org wrote:

  A way to fix this issue is to say that use cases #1 and #2 conform to
 media type application/json and use case #3 conforms to a new media type
 say: application/marc+json.  This new application/marc+json media type now
 becomes a library centric standard and it avoids breaking a widely deployed
 Web standard.


I'm so sorry -- it never dawned on me that anyone would think that I was
asserting that a JSON MIME type should return anything but JSON. For the
record, I think that's batshit crazy. JSON needs to return json. I'd been
hoping to convince folks that we need to have a standard way to pass records
around that doesn't require a streaming parser/writer; not ignore standard
MIME-types willy-nilly. My use cases exist almost entirely outside the
browser environment (because, my god, I don't want to have to try to deal
with MARC21, whatever the serialization, in a browser environment); it
sounds like Andy is almost purely worried about working with a MARC21
serialization within a browser-based javascript environment.

Anyway, hopefully, it won't be a huge surprise that I don't disagree with
any of the quote above in general; I would assert, though, that
application/json and application/marc+json should both return JSON (in the
same way that text/xml, application/xml, and application/marc+xml can all be
expected to return XML). Newline-delimited json is starting to crop up in a
few places (e.g. couchdb) and should probably have its own mime type and
associated extension. So I would say something like:

application/json -- return json (obviously)
application/marc+json  -- return json
application/marc+ndj  -- return newline-delimited json

In all cases, we should agree on a standard record serialization, though,
and the pure-json returns should include something that indicates what the
heck it is (hopefully a URI that can act as a distinct namespace-type
identifier, including a version in it).

The question for me, I think, is whether within this community,  anyone who
provides one of these types (application/marc+json and application/marc+ndj)
should automatically be expected to provide both. I don't have an answer for
that.

 -Bill-


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-06 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Bill Dueber
 Sent: Saturday, March 06, 2010 05:11 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter
 
 Anyway, hopefully, it won't be a huge surprise that I don't disagree
 with any of the quote above in general; I would assert, though, that
 application/json and application/marc+json should both return JSON
 (in the same way that text/xml, application/xml, and 
 application/marc+xml can all be expected to return XML). 
 Newline-delimited json is starting to crop up in a few places 
 (e.g. couchdb) and should probably have its own mime type
 and associated extension. So I would say something like:
 
 application/json -- return json (obviously)
 application/marc+json  -- return json
 application/marc+ndj  -- return newline-delimited json

This sounds like consensus on how to deal with newline-delimited JSON in a 
standards based manner.

I'm not familiar with CouchDB, but I am using MongoDB which is similar.  I'll 
have to dig into how they deal with this newline-delimited JSON.  Can you 
provide any references to get me started?

 In all cases, we should agree on a standard record serialization,
 though, and the pure-json returns should include something that 
 indicates what the heck it is (hopefully a URI that can act as a 
 distinct namespace-type identifier, including a version in it).

I agree that our MARC-JSON serialization needs some namespace identifier in 
it and it occurred to me that the way it is handling indicators, e.g., ind1 and 
ind2 properties, might be better handled as an array to accommodate IFLA's 
MARC-XML-ish where they can have from 1-9 indicator values.

BTW, our MARC-JSON content is specified in Unicode not MARC-8, per the JSON 
standard, which means you need to use \u notation to specify characters in 
strings, not sure I made that clear in earlier posts.  A downside to the 
current ECMA 262 specification is that it doesn't support \U00XX, as Python 
does, for the extended characters.  Hopefully that will get rectified in a 
future ECMA 262 specification.
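Characters outside the Basic Multilingual Plane can still be expressed under the current spec by pairing \u escapes as UTF-16 surrogates; a sketch (Python shown only because it has the \U notation mentioned above; the escaping itself is defined by the JSON grammar):

```python
import json

# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so no single
# \uXXXX escape can represent it; JSON encoders emit a UTF-16
# surrogate pair of \u escapes instead.
encoded = json.dumps("\U0001D11E")   # Python's \U notation, 8 hex digits
assert encoded == '"\\ud834\\udd1e"'

# Decoding the surrogate pair recovers the original character.
assert json.loads(encoded) == "\U0001D11E"
```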

 The question for me, I think, is whether within this community,  anyone
 who provides one of these types (application/marc+json and
 application/marc+ndj) should automatically be expected to provide both.
 I don't have an answer for that.

I think this issue gets into familiar territory when dealing with RDF formats.  
Let's see, there is N3, NT, XML, Turtle, etc.  Do you need to provide all of 
them?  No, but it's nice of the server to at least provide NT or Turtle and 
XML.  Ultimately it's up to the server.  But the only difference between use 
cases #2 and #3 is whether the output is wrapped in an array, so it's probably 
easy for the server to produce both.

Depending on how much time I get next week I'll talk with the developer network 
folks to see what I need to do to put a specification under their 
infrastructure.  Looks like from my schedule it's going to be another week of 
hell :(


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Ulrich Schaefer

Godmar Back wrote:

Hi,

Can anybody recommend an open source XML2JSON converter in PhP or
Python (or potentially other languages, including XSLT stylesheets)?

Ideally, it should implement one of the common JSON conventions, such
as Google's JSON convention for GData [1], but anything that preserves
all elements, attributes, and text content of the XML file would be
acceptable.

Note that json_encode(simplexml_load_file(...)) does not meet this
requirement - in fact, nothing based on simplexml_load_file() will.
(It can't even load MarcXML correctly).

Thanks!

 - Godmar

[1] http://code.google.com/apis/gdata/docs/json.html
  

Hi,
try this: http://code.google.com/p/xml2json-xslt/

best,
Ulrich

--
Dr.-Ing. Ulrich Schaefer http://dfki.de/~uschaefer phone:+496813025154
   DFKI Language Technology Lab, D-66123 Saarbruecken, Germany
---
  Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
  Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
(Vorsitzender), Dr. Walter Olthoff. Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes. Amtsgericht Kaiserslautern, HRB 2313


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Godmar Back
On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer ulrich.schae...@dfki.de wrote:

 Hi,
 try this: http://code.google.com/p/xml2json-xslt/


I should have mentioned that I already tried everything I could find after
googling - this stylesheet doesn't meet the requirements, not by far. It
drops attributes just like simplexml_json does.

The one thing I didn't try is a program called 'BadgerFish.php' which I
couldn't locate - Google once indexed it at badgerfish.ning.com

 - Godmar


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Kevin S. Clarke
Internet Archive seems to have a copy of that:

http://web.archive.org/web/20071013052842/badgerfish.ning.com/file.php?format=src&path=lib/BadgerFish.php

as well as several versions of the site:

http://web.archive.org/web/*/http://badgerfish.ning.com

Kevin



On Fri, Mar 5, 2010 at 8:15 AM, Godmar Back god...@gmail.com wrote:
 On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer 
ulrich.schae...@dfki.de wrote:

 Hi,
 try this: http://code.google.com/p/xml2json-xslt/


 I should have mentioned that I already tried everything I could find after
 googling - this stylesheet doesn't meet the requirements, not by far. It
 drops attributes just like simplexml_json does.

 The one thing I didn't try is a program called 'BadgerFish.php' which I
 couldn't locate - Google once indexed it at badgerfish.ning.com

  - Godmar



Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Benjamin Young

On 3/5/10 8:15 AM, Godmar Back wrote:

On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer ulrich.schae...@dfki.de wrote:

Hi,
try this: http://code.google.com/p/xml2json-xslt/


I should have mentioned that I already tried everything I could find after
googling - this stylesheet doesn't meet the requirements, not by far. It
drops attributes just like simplexml_json does.

The one thing I didn't try is a program called 'BadgerFish.php' which I
couldn't locate - Google once indexed it at badgerfish.ning.com

 - Godmar

Godmar,

I'd be interested in collaborating with you on creating one. I'd bounced 
this question off the CouchDB IRC channel a while back, and the summary 
was that you'd generally create a JSON structure for your document and 
then write the code to map the XML to JSON. However, I do think 
something more generic like Google's GData to JSON would fit the bill 
for most use cases...sadly, it doesn't seem they've made the conversion 
code available.


If you're looking at putting MARC into JSON, there was some discussion 
of that during code4lib 2010. Jonathan Rochkind, who was at code4lib 
2010 blogged about marc-json recently:

http://bibwild.wordpress.com/2010/03/03/marc-json/
He references a project that Bill Dueber's been playing with for a year:
http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/

All told, there's growing momentum for a MARC in JSON format to be 
created, so you might jump in there.


Additionally, I'd love to find a project building code to do what 
Google's done with the GData to JSON format. If you find one, I'd enjoy 
seeing it.


Thanks, Godmar,
Benjamin

--
President
BigBlueHat
P: 864.232.9553
W: http://www.bigbluehat.com/
http://www.linkedin.com/in/benjaminyoung


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Cary Gordon
You can find it here, although I wouldn't get too excited: http://bit.ly/acROxH

You could also fish for more info by badgering its creator at
http://www.sklar.com/page/section/contact.

Cary

On Fri, Mar 5, 2010 at 5:15 AM, Godmar Back god...@gmail.com wrote:
 On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer 
ulrich.schae...@dfki.de wrote:

 Hi,
 try this: http://code.google.com/p/xml2json-xslt/


 I should have mentioned that I already tried everything I could find after
 googling - this stylesheet doesn't meet the requirements, not by far. It
 drops attributes just like simplexml_json does.

 The one thing I didn't try is a program called 'BadgerFish.php' which I
 couldn't locate - Google once indexed it at badgerfish.ning.com

  - Godmar




-- 
Cary Gordon
The Cherry Hill Company
http://chillco.com


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Mark Mounts

have you tried this?

http://www.bramstein.com/projects/xsltjson/
http://github.com/bramstein/xsltjson

using the parameter use-rayfish=true seems to preserve everything but 
namespaces, but then there is a parameter to preserve namespaces as well.


Mark

On 3/5/2010 12:54 AM, Godmar Back wrote:

Hi,

Can anybody recommend an open source XML2JSON converter in PhP or
Python (or potentially other languages, including XSLT stylesheets)?

Ideally, it should implement one of the common JSON conventions, such
as Google's JSON convention for GData [1], but anything that preserves
all elements, attributes, and text content of the XML file would be
acceptable.

Note that json_encode(simplexml_load_file(...)) does not meet this
requirement - in fact, nothing based on simplexml_load_file() will.
(It can't even load MarcXML correctly).

Thanks!

  - Godmar

[1] http://code.google.com/apis/gdata/docs/json.html
   


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Joe Hourcle

On Fri, 5 Mar 2010, Godmar Back wrote:


On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer ulrich.schae...@dfki.de wrote:


Hi,
try this: http://code.google.com/p/xml2json-xslt/



I should have mentioned that I already tried everything I could find after
googling - this stylesheet doesn't meet the requirements, not by far. It
drops attributes just like simplexml_json does.

The one thing I didn't try is a program called 'BadgerFish.php' which I
couldn't locate - Google once indexed it at badgerfish.ning.com


http://web.archive.org/web/20080216200903/http://badgerfish.ning.com/

http://web.archive.org/web/20071013052842/badgerfish.ning.com/file.php?format=src&path=lib/BadgerFish.php

-Joe


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Jay Luker
If PHP/python isn't a hard requirement, I think this would be fairly
simple to do in perl using a combination of the XML::Simple [1] and
JSON::XS [2] modules.

In fact it's so simple, here's the code:


#!/usr/bin/perl

use JSON::XS;
use XML::Simple;
use strict;

my $filename = shift @ARGV;
my $parsed = XMLin($filename);
my $json = encode_json($parsed);
print $json, "\n";


XML::Simple, in spite of the name, actually allows for a myriad of
options for how the perl data structure gets created from the xml,
including attribute preservation, grouping of elements, etc.

--jay

[1] http://search.cpan.org/~grantm/XML-Simple-2.18/lib/XML/Simple.pm
[2] http://search.cpan.org/~makamaka/JSON-2.17/lib/JSON.pm

On Fri, Mar 5, 2010 at 9:55 AM, Joe Hourcle
onei...@grace.nascom.nasa.gov wrote:
 On Fri, 5 Mar 2010, Godmar Back wrote:

 On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer
 ulrich.schae...@dfki.de wrote:

 Hi,
 try this: http://code.google.com/p/xml2json-xslt/


 I should have mentioned that I already tried everything I could find after
 googling - this stylesheet doesn't meet the requirements, not by far. It
 drops attributes just like simplexml_json does.

 The one thing I didn't try is a program called 'BadgerFish.php' which I
 couldn't locate - Google once indexed it at badgerfish.ning.com

        http://web.archive.org/web/20080216200903/http://badgerfish.ning.com/

  http://web.archive.org/web/20071013052842/badgerfish.ning.com/file.php?format=src&path=lib/BadgerFish.php

 -Joe



Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Benjamin Young
 Sent: Friday, March 05, 2010 09:26 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter
 
 If you're looking at putting MARC into JSON, there was some discussion
 of that during code4lib 2010. Johnathan Rochkind, who was at code4lib
 2010 blogged about marc-json recently:
 http://bibwild.wordpress.com/2010/03/03/marc-json/
 He references a project that Bill Dueber's been playing with for a
 year:
 http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
 
 All told, there's growing momentum for a MARC in JSON format to be
 created, so you might jump in there.

Too bad I didn't attend code4lib.  OCLC Research has created a version of MARC 
in JSON and will probably release FAST concepts in MARC binary, MARC-XML and 
our MARC-JSON format among other formats.  I'm wondering whether there is some 
consensus that can be reached and standardized at LC's level, just like OCLC, 
RLG and LC came to consensus on MARC-XML.  Unfortunately, I have not had the 
time to document the format, although it is fairly straightforward, and yes we 
have an XSLT to convert from MARC-XML to MARC-JSON.  Basically the format I'm 
using is:

[
  ...
]

which represents a collection of MARC records or 

{
  ...
}

which represents a single MARC records that takes the form:

{
  "leader" : "01192cz  a2200301n  4500",
  "controlfield" :
  [
    { "tag" : "001", "data" : "fst01303409" },
    { "tag" : "003", "data" : "OCoLC" },
    { "tag" : "005", "data" : "20100202194747.3" },
    { "tag" : "008", "data" : "060620nn anznnbabn  || ana d" }
  ],
  "datafield" :
  [
    {
      "tag" : "040",
      "ind1" : " ",
      "ind2" : " ",
      "subfield" :
      [
        { "code" : "a", "data" : "OCoLC" },
        { "code" : "b", "data" : "eng" },
        { "code" : "c", "data" : "OCoLC" },
        { "code" : "d", "data" : "OCoLC-O" },
        { "code" : "f", "data" : "fast" }
      ]
    },
    {
      "tag" : "151",
      "ind1" : " ",
      "ind2" : " ",
      "subfield" :
      [
        { "code" : "a", "data" : "Hawaii" },
        { "code" : "z", "data" : "Diamond Head" }
      ]
    }
  ]
}
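As a sketch of how a consumer might walk that structure (the subfield_values helper is my own illustration, not part of any proposed specification):

```python
def subfield_values(record, tag, code):
    """Collect every $code value from datafields with the given tag."""
    return [sf["data"]
            for field in record.get("datafield", [])
            if field["tag"] == tag
            for sf in field["subfield"]
            if sf["code"] == code]

# Using the 151 field from the record above:
record = {
    "leader": "01192cz  a2200301n  4500",
    "datafield": [
        {"tag": "151", "ind1": " ", "ind2": " ",
         "subfield": [{"code": "a", "data": "Hawaii"},
                      {"code": "z", "data": "Diamond Head"}]}
    ]
}
assert subfield_values(record, "151", "a") == ["Hawaii"]
```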


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 12:01 PM, Houghton,Andrew hough...@oclc.org wrote:

 Too bad I didn't attend code4lib.  OCLC Research has created a version of
 MARC in JSON and will probably release FAST concepts in MARC binary,
 MARC-XML and our MARC-JSON format among other formats.  I'm wondering
 whether there is some consensus that can be reached and standardized at LC's
 level, just like OCLC, RLG and LC came to consensus on MARC-XML.
  Unfortunately, I have not had the time to document the format, although it
 fairly straight forward, and yes we have an XSLT to convert from MARC-XML to
 MARC-JSON.  Basically the format I'm using is:


The stuff I've been doing:

  http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/

... is pretty much the same, except:

  1. I don't explicitly split up control and data fields. There's a single
field list; an item that has two elements is a control field (tag/data); one
with four is a data field (tag / ind1 / ind2 / array_of_subfield)

  2. Instead of putting a collection in a big json array, I use
newline-delimited-json (basically, just stick one record on each line as a
single json hash). This has the advantage that it makes streaming much, much
easier, and makes doing some other things (e.g., grab the first record or
two) much cheaper for even the dumbest json parser. I'm not sure what the
state of JSON streaming parsers are; I know Jackson (for Java) can do it,
and perl's JSON::XS can...kind of...but it's not great.

3. I include a type (MARC-JSON, MARC-HASH, whatever) and version: [major,
minor] in each record. There's already a ton of JSON floating around the
library world; labeling what the heck a structure is is just friendly :-)

MARC's structure is dumb enough that we collectively basically can't miss;
there's only so much you can do with the stuff, and a round-trip to JSON and
back is easy to implement.
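The two-versus-four element convention in point 1 can be consumed with a simple length check; a sketch (field values are hypothetical, and the exact subfield shape here is my assumption rather than anything specified):

```python
# marc-hash sketch: a field is a plain list -- two elements for a
# control field (tag, data), four for a data field (tag, ind1, ind2,
# subfields).
fields = [
    ["001", "fst01303409"],
    ["151", " ", " ", [["a", "Hawaii"], ["z", "Diamond Head"]]],
]

control, data = [], []
for field in fields:
    if len(field) == 2:        # control field: tag / data
        control.append(field)
    else:                      # data field: tag / ind1 / ind2 / subfields
        data.append(field)

assert len(control) == 1 and len(data) == 1
```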

I'm not super-against explicitly labeling the data elements (tag:, ind1:,
etc.) but I don't see where it's necessary unless you're planning on adding
out-of-band data to the records/fields/subfields at some point. Which might
be kinda cool (e.g., language hints on a per-subfield basis? Tokenization
hints for non-whitespace-delimited languages? URIs for unique concepts and
authorities where they exist for easy creation of RDF?)

I *am*, however, willing to push and push and push for NDJ instead of having
to deal with streaming JSON parsing, which to my limited understanding is
hard to get right and to my more qualified understanding is a pain in the
ass to work with.

And anything we do should explicitly be UTF-8 only; converting from MARC-8
is a problem for the server, not the receiver.

Support for what I've been calling marc-hash (I like to decouple it from the
eventual JSON format in case the serialization preferences change, or at
least so implementations don't get stuck with a single JSON library) is
already baked into ruby-marc, and obviously implementations are dead-easy no
matter what the underlying language is.

Anyone from the LoC want to get in on this?

 -Bill-




 [
  ...
 ]

 which represents a collection of MARC records or

 {
  ...
 }

 which represents a single MARC records that takes the form:

 {
  "leader" : "01192cz  a2200301n  4500",
  "controlfield" :
  [
    { "tag" : "001", "data" : "fst01303409" },
    { "tag" : "003", "data" : "OCoLC" },
    { "tag" : "005", "data" : "20100202194747.3" },
    { "tag" : "008", "data" : "060620nn anznnbabn  || ana d" }
  ],
  "datafield" :
  [
    {
      "tag" : "040",
      "ind1" : " ",
      "ind2" : " ",
      "subfield" :
      [
        { "code" : "a", "data" : "OCoLC" },
        { "code" : "b", "data" : "eng" },
        { "code" : "c", "data" : "OCoLC" },
        { "code" : "d", "data" : "OCoLC-O" },
        { "code" : "f", "data" : "fast" }
      ]
    },
    {
      "tag" : "151",
      "ind1" : " ",
      "ind2" : " ",
      "subfield" :
      [
        { "code" : "a", "data" : "Hawaii" },
        { "code" : "z", "data" : "Diamond Head" }
      ]
    }
  ]
 }




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Bill Dueber
 Sent: Friday, March 05, 2010 12:30 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter
 
 On Fri, Mar 5, 2010 at 12:01 PM, Houghton,Andrew hough...@oclc.org
 wrote:
 
  Too bad I didn't attend code4lib.  OCLC Research has created a
 version of
  MARC in JSON and will probably release FAST concepts in MARC binary,
  MARC-XML and our MARC-JSON format among other formats.  I'm wondering
  whether there is some consensus that can be reached and standardized
 at LC's
  level, just like OCLC, RLG and LC came to consensus on MARC-XML.
   Unfortunately, I have not had the time to document the format,
 although it
  fairly straight forward, and yes we have an XSLT to convert from
 MARC-XML to
  MARC-JSON.  Basically the format I'm using is:
 
 
 The stuff I've been doing:
 
   http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
 
 ... is pretty much the same, except:

I decided to stick closer to a MARC-XML type definition since it would be 
easier to explain how the two specifications are related, rather than take a 
more radical approach in producing a specification less familiar.  Not to say 
that other approaches are bad, they just have different advantages and 
disadvantages.  I was going for simple and familiar.

I certainly would be willing to work with LC on creating a MARC-JSON specification 
as I did in creating the MARC-XML specification.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 1:10 PM, Houghton,Andrew hough...@oclc.org wrote:


 I decided to stick closer to a MARC-XML type definition since its would be
 easier to explain how the two specifications are related, rather than take a
 more radical approach in producing a specification less familiar.  Not to
 say that other approaches are bad, they just have different advantages and
 disadvantages.  I was going for simple and familiar.


That makes sense, but please consider adding a format/version (which we get
in MARC-XML from the namespace and isn't present here). In fact, please
consider adding a format / version / URI, so people know what they've got.

I'm also going to again push the newline-delimited-json stuff. The
collection-as-array is simple and very clean, but leads to trouble
for production (where for most of us we'd have to get the whole freakin'
collection in memory first and then call JSON.dump or whatever)
or consumption (have to deal with a streaming json parser). The production
part is particularly worrisome, since I'd hate for everyone to have to
default to writing out a '[', looping through the records, and writing a
']'. Yeah, it's easy enough, but it's an ugly hack that *everyone* would
have to do, as opposed to just something like:

  while (r = nextRecord) {
     print r.to_json, "\n"
  }
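
Concretely, the write-a-line / read-a-line pattern can be sketched in Python like this (the record contents are placeholders standing in for full MARC-hash structures):

```python
import io
import json

# Toy records standing in for full MARC-hash structures
# (the real "fields" arrays are elided here).
records = [
    {"type": "marc-hash", "version": [1, 0], "leader": "...", "fields": []},
    {"type": "marc-hash", "version": [1, 0], "leader": "...", "fields": []},
]

def write_ndjson(recs, stream):
    """One JSON document per line: no enclosing array, no streaming writer."""
    for rec in recs:
        stream.write(json.dumps(rec))
        stream.write("\n")

def read_ndjson(stream):
    """Consume records one line at a time; memory use stays per-record."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

buf = io.StringIO()         # stands in for a file or socket
write_ndjson(records, buf)
buf.seek(0)
round_tripped = list(read_ndjson(buf))
```

The consumer never needs more than one record's worth of JSON in memory, and any plain JSON parser suffices.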

Unless, of course, writing json to a stream and reading json from a stream
is a lot easier than I make it out to be across a variety of languages and I
just don't know it, which is entirely possible. The streaming writer
interfaces for Perl (
http://search.cpan.org/dist/JSON-Streaming-Writer/lib/JSON/Streaming/Writer.pm)
and Java's Jackson (
http://wiki.fasterxml.com/JacksonInFiveMinutes#Streaming_API_Example) are a
little more daunting than I'd like them to be.

Not wanting to argue unnecessarily, here; just adding input before things
get effectively set in stone.

 -Bill-

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Benjamin Young

On 3/5/10 1:10 PM, Houghton,Andrew wrote:

From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Bill Dueber
Sent: Friday, March 05, 2010 12:30 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Q: XML2JSON converter

On Fri, Mar 5, 2010 at 12:01 PM, Houghton,Andrew hough...@oclc.org
wrote:

 Too bad I didn't attend code4lib.  OCLC Research has created a version of
 MARC in JSON and will probably release FAST concepts in MARC binary,
 MARC-XML and our MARC-JSON format among other formats.  I'm wondering
 whether there is some consensus that can be reached and standardized at
 LC's level, just like OCLC, RLG and LC came to consensus on MARC-XML.
 Unfortunately, I have not had the time to document the format, although it
 is fairly straightforward, and yes we have an XSLT to convert from MARC-XML
 to MARC-JSON.  Basically the format I'm using is:

The stuff I've been doing:

  http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/

... is pretty much the same, except:

I decided to stick closer to a MARC-XML type definition since it would be
easier to explain how the two specifications are related, rather than take a
more radical approach and produce a less familiar specification.  Not to say
that other approaches are bad, they just have different advantages and
disadvantages.  I was going for simple and familiar.

I certainly would be willing to work with LC on creating a MARC-JSON
specification as I did in creating the MARC-XML specification.


Andy.
A CouchDB friend of mine just pointed me to the BibJSON format by the 
Bibliographic Knowledge Network:

http://www.bibkn.org/bibjson/index.html

Might be worth looking through for future collaboration/transformation 
options.


Benjamin


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Ross Singer
On Fri, Mar 5, 2010 at 1:10 PM, Houghton,Andrew hough...@oclc.org wrote:

 I certainly would be willing to work with LC on creating a MARC-JSON 
 specification as I did in creating the MARC-XML specification.

Quite frankly, I think I (and I imagine others) would much rather see
a more open, RFC-style process for creating a marc-json spec than "I
talked to LC and here you go."

Maybe I'm misreading this last paragraph a bit, however.

-Ross.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Ross Singer
On Fri, Mar 5, 2010 at 2:06 PM, Benjamin Young byo...@bigbluehat.com wrote:

 A CouchDB friend of mine just pointed me to the BibJSON format by the
 Bibliographic Knowledge Network:
 http://www.bibkn.org/bibjson/index.html

 Might be worth looking through for future collaboration/transformation
 options.

marc-json and BibJSON serve two different purposes:  marc-json would
need to be a lossless serialization of a MARC record which may or may
not contain bibliographic data (it may be an authority, holding or CID
record, for example).  BibJSON is more of a merging of data model and
serialization (which, admittedly, is no stranger to MARC) for the
purpose of bibliographic /citations/.  So it will probably be lossy,
and there would most likely be a lot of MARC data that is out of
scope.

That's not to say it wouldn't be useful to figure out how to get from
MARC to BibJSON, but from my perspective it's difficult to see the
advantage it brings (being tied to JSON) vs. BIBO.

-Ross.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Bill Dueber
 Sent: Friday, March 05, 2010 01:59 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter
 
 On Fri, Mar 5, 2010 at 1:10 PM, Houghton,Andrew hough...@oclc.org
 wrote:

  I decided to stick closer to a MARC-XML type definition since it would
  be easier to explain how the two specifications are related, rather
  than take a more radical approach and produce a less familiar
  specification.  Not to say that other approaches are bad, they just
  have different advantages and disadvantages.  I was going for simple
  and familiar.

 That makes sense, but please consider adding a format/version (which we
 get in MARC-XML from the namespace and isn't present here). In fact,
 please consider adding a format / version / URI, so people know what
 they've got.

This sounds reasonable and I'll consider adding it to our specification.

 I'm also going to again push the newline-delimited-json stuff. The
 collection-as-array is simple and very clean, but leads to trouble
 for production (where for most of us we'd have to get the whole
 freakin' collection in memory first ...

As far as our MARC-JSON specification is concerned, a server application can 
return either a collection or a record, mimicking the MARC-XML specification, 
where either the collection or record element can be used as the document element.

 Unless, of course, writing json to a stream and reading json from a
 stream is a lot easier than I make it out to be across a variety of
 languages and I just don't know it, which is entirely possible. The
 streaming writer interfaces for Perl (
 http://search.cpan.org/dist/JSON-Streaming-Writer/lib/JSON/Streaming/Writer.pm)
 and Java's Jackson (
 http://wiki.fasterxml.com/JacksonInFiveMinutes#Streaming_API_Example)
 are a little more daunting than I'd like them to be.

As you point out, JSON streaming doesn't work with all clients, and I am hesitant 
to build on anything that not all clients can accept.  I think part of the issue 
here is proper API design.  Sending tens of megabytes back to a client and 
expecting it to process them seems like poor API design regardless of whether 
the client can stream them or not.  It might make more sense to have a server 
API send back 10 of our MARC-JSON records in a JSON collection and have the 
client request an additional batch of records for the result set.  In addition, 
if I remember correctly, JSON streaming and other streaming methods keep the 
connection to the server open, which is not good for maintaining server 
throughput.
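
The batching idea can be sketched as follows (a toy: the response shape and key names "records", "offset", "total", "next" are made up for illustration, not OCLC's actual API):

```python
def make_page(all_records, offset, page_size=10):
    """Slice one batch out of a result set and report where the next
    batch starts, so the client can request records incrementally
    instead of receiving one giant JSON array."""
    batch = all_records[offset:offset + page_size]
    next_offset = offset + len(batch)
    return {
        "records": batch,
        "offset": offset,
        "total": len(all_records),
        # None signals there are no more batches to fetch
        "next": next_offset if next_offset < len(all_records) else None,
    }

# Hypothetical result set of 25 minimal records
results = [{"id": i} for i in range(25)]
page1 = make_page(results, 0)
page3 = make_page(results, 20)
```

Each response is a small, fully valid JSON document, so no streaming parser is ever needed on the client side.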


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Benjamin Young
 Sent: Friday, March 05, 2010 02:06 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter
 
 A CouchDB friend of mine just pointed me to the BibJSON format by the
 Bibliographic Knowledge Network:
 http://www.bibkn.org/bibjson/index.html
 
 Might be worth looking through for future collaboration/transformation
 options.

Unfortunately, it doesn't really work for authority and classification data 
that I'm frequently involved with.

Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Ross Singer
 Sent: Friday, March 05, 2010 02:32 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter
 
 On Fri, Mar 5, 2010 at 1:10 PM, Houghton,Andrew hough...@oclc.org
 wrote:
 
  I certainly would be willing to work with LC on creating a MARC-JSON
  specification as I did in creating the MARC-XML specification.
 
 Quite frankly, I think I (and I imagine others) would much rather see
 a more open, RFC-style process for creating a marc-json spec than "I
 talked to LC and here you go."
 
 Maybe I'm misreading this last paragraph a bit, however.

Yes, you misread the last paragraph.

Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Benjamin Young

On 3/5/10 2:46 PM, Ross Singer wrote:

On Fri, Mar 5, 2010 at 2:06 PM, Benjamin Young byo...@bigbluehat.com wrote:

 A CouchDB friend of mine just pointed me to the BibJSON format by the
 Bibliographic Knowledge Network:
 http://www.bibkn.org/bibjson/index.html

 Might be worth looking through for future collaboration/transformation
 options.

marc-json and BibJSON serve two different purposes:  marc-json would
need to be a lossless serialization of a MARC record which may or may
not contain bibliographic data (it may be an authority, holding or CID
record, for example).  BibJSON is more of a merging of data model and
serialization (which, admittedly, is no stranger to MARC) for the
purpose of bibliographic /citations/.  So it will probably be lossy,
and there would most likely be a lot of MARC data that is out of
scope.

That's not to say it wouldn't be useful to figure out how to get from
MARC to BibJSON, but from my perspective it's difficult to see the
advantage it brings (being tied to JSON) vs. BIBO.

-Ross.
Thanks for the clarification, Ross. I thought it would be helpful (if 
nothing else) to see how data was being mapped in a related domain into 
and out of JSON. I'm new to library data in general, so I appreciate the 
clarification on which format is for what.


Appreciated,
Benjamin


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 3:14 PM, Houghton,Andrew hough...@oclc.org wrote:


 As you point out, JSON streaming doesn't work with all clients, and I am
 hesitant to build on anything that not all clients can accept.  I think part
 of the issue here is proper API design.  Sending tens of megabytes back to a
 client and expecting them to process it seems like poor API design
 regardless of whether they can stream it or not.  It might make more sense
 to have a server API send back 10 of our MARC-JSON records in a JSON
 collection and have the client request an additional batch of records for
 the result set.  In addition, if I remember correctly, JSON streaming or
 other streaming methods keep the connection to the server open which is not
 a good thing to do to maintain server throughput.


I guess my concern here is that the specification, as you're describing it,
is closing off potential uses.  It seems fine if, for example, your primary
concern is javascript-in-the-browser, and browser-requested,
pagination-enabled systems are all you're worried about right now.

That's not the whole universe of uses, though. People are going to want to
dump these things into a file to read later -- no possibility for pagination
in that situation. Others may, in fact, want to stream a few thousand
records down the pipe at once, but without a streaming parser that can't
happen if it's all one big array.

I worry that as specified, the *only* use will be, "Pull these down a thin
pipe, and if you want to keep them for later, or want a bunch of them, you
have to deal with marc-xml."  Part of my incentive is to *not* have to use
marc-xml, but in this case I'd just be trading one technology I don't like
(marc-xml) for two technologies, one of which I don't like (that'd be
marc-xml again).

I really do understand the desire to make this parallel to marc-xml, but
there's a seam between the two technologies that makes that a problematic
approach.



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread LeVan,Ralph
 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Bill Dueber
 
 I really do understand the desire to make this parallel to marc-xml, but
 there's a seam between the two technologies that makes that a problematic
 approach.

As a confession, here in OCLC Research, we do pass around files of
marc-xml records that are newline delimited without a wrapper element
containing them.  We do that for all the reasons you gave for wanting
the same thing for JSON records.

Ralph


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Benjamin Young

On 3/5/10 3:45 PM, Bill Dueber wrote:

On Fri, Mar 5, 2010 at 3:14 PM, Houghton,Andrew hough...@oclc.org wrote:

 As you point out, JSON streaming doesn't work with all clients, and I am
 hesitant to build on anything that not all clients can accept.  I think part
 of the issue here is proper API design.  Sending tens of megabytes back to a
 client and expecting them to process it seems like poor API design
 regardless of whether they can stream it or not.  It might make more sense
 to have a server API send back 10 of our MARC-JSON records in a JSON
 collection and have the client request an additional batch of records for
 the result set.  In addition, if I remember correctly, JSON streaming or
 other streaming methods keep the connection to the server open which is not
 a good thing to do to maintain server throughput.

I guess my concern here is that the specification, as you're describing it,
is closing off potential uses.  It seems fine if, for example, your primary
concern is javascript-in-the-browser, and browser-requested,
pagination-enabled systems are all you're worried about right now.

That's not the whole universe of uses, though. People are going to want to
dump these things into a file to read later -- no possibility for pagination
in that situation. Others may, in fact, want to stream a few thousand
records down the pipe at once, but without a streaming parser that can't
happen if it's all one big array.

I worry that as specified, the *only* use will be, "Pull these down a thin
pipe, and if you want to keep them for later, or want a bunch of them, you
have to deal with marc-xml."  Part of my incentive is to *not* have to use
marc-xml, but in this case I'd just be trading one technology I don't like
(marc-xml) for two technologies, one of which I don't like (that'd be
marc-xml again).

I really do understand the desire to make this parallel to marc-xml, but
there's a seam between the two technologies that makes that a problematic
approach.
For my part, I'd like to explore the options of putting MARC data into 
CouchDB (which stores documents as JSON) which could then open the door 
for replicating that data between any number of installations of CouchDB 
as well as providing for various output formats (marc-xml, etc).


It's just an idea, but it's one that uses JSON outside of the browser 
and is a good proof case for any MARC in JSON format.


Thanks,
Benjamin

--
President
BigBlueHat
P: 864.232.9553
W: http://www.bigbluehat.com/
http://www.linkedin.com/in/benjaminyoung


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Bill Dueber
 Sent: Friday, March 05, 2010 03:45 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter
 
 I guess my concern here is that the specification, as you're describing
 it, is closing off potential uses.  It seems fine if, for example, your
 primary concern is javascript-in-the-browser, and browser-request,
 pagination-enabled systems might be all you're worried about right now.
 
 That's not the whole universe of uses, though. People are going to want
 to dump these things into a file to read later -- no possibility for
 pagination in that situation.

I disagree that you couldn't dump a paginated result set into a file for 
reading later.  I do this all the time, not only in Javascript but in many 
other programming languages as well.

 Others may, in fact, want to stream a few thousand
 records down the pipe at once, but without a streaming parser that
 can't happen if it's all one big array.

Well, if your service isn't allowing them to stream a few thousand records 
at a time, then that isn't an issue :)

Maybe I have been misled about, or have misunderstood, JSON streaming.  My 
understanding was that you can generate an arbitrarily large outgoing stream on 
the server side and can read an arbitrarily large incoming stream on the client 
side.  So it shouldn't matter if the result set is delivered as one big JSON 
array.  The SAX-like interface that JSON streaming uses provides the necessary 
events to allow you to pull the individual records from that arbitrarily large 
array.
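
For what it's worth, the pull-records-out-of-one-big-array idea can be approximated in pure Python with the standard library's `JSONDecoder.raw_decode` (a toy: it still needs the entire string in memory, which is exactly the limitation a true streaming parser like ijson or Jackson avoids; the sample records are made up):

```python
import json

decoder = json.JSONDecoder()

def iter_array_items(text):
    """Yield the objects of a serialized top-level JSON array one at a
    time, without handing the whole array to json.loads at once."""
    idx = text.index("[") + 1
    while True:
        # skip whitespace and the commas between elements
        while idx < len(text) and text[idx] in " \t\r\n,":
            idx += 1
        if idx >= len(text) or text[idx] == "]":
            return
        # raw_decode parses one JSON value and reports where it ended
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

collection = '[ {"tag": "001"}, {"tag": "245"} ]'
items = list(iter_array_items(collection))
```

So record-at-a-time extraction from an array is possible, but the caller still has to hold the whole serialized collection, which is the memory concern being debated here.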

 I worry that as specified, the *only* use will be, "Pull these down a
 thin pipe, and if you want to keep them for later, or want a bunch of
 them, you have to deal with marc-xml."

Don't quite follow this.  MARC-XML is an XML format; MARC-JSON is our JSON 
format for expressing the various MARC-21 formats, e.g., authority, 
bibliographic, classification, community information and holdings, in JSON.  The 
JSON is based on the structure of MARC-XML, which was based on the structure of 
ISO 2709.  I don't see how MARC-XML comes into play when you are dealing with 
JSON.  If you want to save our MARC-JSON you don't have to convert it to 
MARC-XML on the client side.  Just save it as a text file.

 Part of my incentive is to *not* have to use marc-xml, but in this 
 case I'd just be trading one technology I don't like (marc-xml) 
 for two technologies, one of which I don't like (that'd be marc-xml 
 again).

Again, not sure how to address this concern.  If you are dealing with library 
data, then its current communication formats are either MARC binary (ISO 2709) 
or MARC-XML, ignoring IFLA's MARC-XML-ish format for the moment.  You might not 
like it, but that is life in library land.  You can go develop your own formats 
based on the various MARC-21 format specifications, but you are unlikely to 
achieve any sort of interoperability with existing library systems and services.

We chose our MARC-JSON to maintain the structural components of MARC-XML and 
hence MARC binary (ISO 2709).  In MARC, control fields have different semantics 
from data fields, and you cannot merge them into one thing called "field".  If 
you look closely at the MARC-XML schema, you might notice that the controlfield 
and datafield elements can have non-numeric tags.  If you merge everything into 
something called "field", then you cannot distinguish between a non-numeric tag 
for a controlfield vs. a datafield element.  There are valid reasons why we 
decided to maintain the existing structure of MARC.
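
A record shape that keeps that controlfield/datafield distinction might look like this sketch (the key names and sample values are illustrative assumptions, not OCLC's published MARC-JSON spec):

```python
# Hypothetical MARC-in-JSON record keeping MARC-XML's two field kinds.
record = {
    "leader": "00000nam a2200000 a 4500",  # placeholder leader
    "controlfield": [
        # control fields: just a tag and raw data, no indicators/subfields
        {"tag": "001", "data": "ocm00000000"},
    ],
    "datafield": [
        # data fields: two indicators plus a list of coded subfields
        {
            "tag": "245",
            "ind1": "1",
            "ind2": "0",
            "subfield": [{"code": "a", "data": "An example title"}],
        },
    ],
}

def is_control(rec, tag):
    """Dispatch on which list a tag appears in, rather than inspecting a
    merged 'field' object for missing indicators."""
    return any(f["tag"] == tag for f in rec["controlfield"])
```

Because the two kinds live in separate lists, a parser never has to guess whether a non-numeric tag was meant as a control field or a data field.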


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Benjamin Young
 Sent: Friday, March 05, 2010 04:24 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter
 
 For my part, I'd like to explore the options of putting MARC data into
 CouchDB (which stores documents as JSON) which could then open the door
 for replicating that data between any number of installations of
 CouchDB
 as well as providing for various output formats (marc-xml, etc).
 
 It's just an idea, but it's one that uses JSON outside of the browser
 and is a good proof case for any MARC in JSON format.

This was partly the reason why I developed our MARC-JSON format since I'm using 
MongoDB [1] which is a NoSQL database based on JSON.


Andy.

[1] http://www.mongodb.org/display/DOCS/Home


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 4:38 PM, Houghton,Andrew hough...@oclc.org wrote:


 Maybe I have been misled about, or have misunderstood, JSON streaming.

This is my central point. I'm actually saying that JSON streaming is painful
and rare enough that it should be avoided as a requirement for working with
any new format.

I guess, in sum, I'm making the following assertions:

1. Streaming APIs for JSON, where they exist, are a pain in the ass. And
they don't exist everywhere. Without a JSON streaming parser, you have to
pull the whole array of documents up into memory, which may be impossible.
This is the crux of my argument -- if you disagree with it, then I would
assume you disagree with the other points as well.

2. Many people -- and I don't think I'm exaggerating here, honestly --
really don't like using MARC-XML but have to because of the length
restrictions on MARC-binary. A useful alternative, based on dead-easy
parsing and production, is very appealing.

2.5 Having to deal with a streaming API takes away the dead-easy part.

3. If you accept my assertions about streaming parsers, then dealing with
the format you've proposed for large sets is either painful (with a
streaming API) or impossible (where such an API doesn't exist) due to memory
constraints.

4. Streaming JSON writer APIs are also painful; everything that applies to
reading applies to writing. Sans a streaming writer, trying to *write* a
large JSON document also results in you having to have the whole thing in
memory.

5. People are going to want to deal with this format, because of its
benefits over marc21 (record length) and marc-xml (ease of processing),
which means we're going to want to deal with big sets of data and/or dump
batches of it to a file. Which brings us back to #1, the pain or absence of
streaming apis.

"Write a better JSON parser/writer" or "use a different language" seem like
bad solutions to me, especially when a (potentially) useful alternative
exists.

As I pointed out, if streaming JSON is no harder for you to use (and no less
available) than non-streaming JSON, then this is mostly moot. I assert that
for many people in this community it is one or the other, which is why I'm
leery of it.

  -Bill-


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew hough...@oclc.org wrote:

 OK, I will bite, you stated:

 1. That large datasets are a problem.
 2. That streaming APIs are a pain to deal with.
 3. That tool sets have memory constraints.

 So how do you propose to process large JSON datasets that:

 1. Comply with the JSON specification.
 2. Can be read by any JavaScript/JSON processor.
 3. Do not require the use of streaming API.
 4. Do not exceed the memory limitations of current JSON processors.


What I'm proposing is that we don't process large JSON datasets; I'm
proposing that we process smallish JSON documents one at a time by pulling
them out of a stream based on an end-of-record character.

This is basically what we use for MARC21 binary format -- have a defined
structure for a valid record, and separate multiple well-formed record
structures with an end-of-record character. This preserves JSON
specification adherence at the record level and uses a different scheme to
represent collections. Obviously, MARC-XML uses a different mechanism to
define a collection of records -- putting well-formed record structures
inside a collection tag.

So... I'm proposing we define what we mean by a single MARC record serialized
to JSON (in whatever format; I'm not very opinionated on this point) that
preserves the order, indicators, tags, data, etc. we need to round-trip
between marc21binary, marc-xml, and marc-json.

And then separate those valid records with an end-of-record character --
\n.

Unless I've read all this wrong, you've come to the conclusion that the
benefit of having a JSON serialization that is valid JSON at both the record
and collection level outweighs the pain of having to deal with a streaming
parser and writer.  This allows a single collection to be treated as any
other JSON document, which has obvious benefits (which I certainly don't
mean to minimize) and all the drawbacks we've been talking about
*ad nauseam*.

I go the other way. I think the pain of dealing with a streaming API
outweighs the benefits of having a single valid JSON structure for a
collection, and instead have put forward that we use a combination of JSON
records and a well-defined end-of-record character (\n) to represent a
collection.  I recognize that this involves providing special-purpose code
which must call for JSON-deserialization on each line, instead of being able
to throw the whole stream/file/whatever at your JSON parser. I accept
that because getting each line of a text file is something I find easy
compared to dealing with streaming parsers.

And our point of disagreement, I think, is that I believe that defining the
collection structure in such a way that we need two steps (get a line;
deserialize that line) and can't just call the equivalent of
JSON.parse(stream) has benefits in ease of implementation and use that
outweigh the loss of having both a single record and a collection of records
be valid JSON. And you, I think, don't :-)

I'm going to bow out of this now, unless I've got some part of our positions
wrong, to let any others that care (which may number zero) chime in.

 -Bill-

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Ross Singer
 Sent: Friday, March 05, 2010 09:18 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q: XML2JSON converter
 
 I actually just wrote the same exact email as Bill (although probably
 not as polite -- I called the marcxml collection element a "contrivance
 that appears nowhere in marc21").  I even wrote the "marc21 is
 EOR-character-delimited files" bit.  I was hoping to figure out how to
 use unix split to make my point, couldn't, and then discarded my draft.
 
 But I was *right there*.
 
 -Ross.

I'll answer Bill's message tomorrow after I have had some sleep :) 

Actually, I contend that the MARC-XML collection element does appear in MARC 
(ISO 2709), but it is at the physical layer and not at the structural layer.  
Remember MARC records were placed on a tape reel, thus the tape reel was the 
collection (container).  Placed on disk in a file, the file is the collection 
(container).  I agree that it's not spelled out in the standard, but the 
concept of a collection (container) is implicit when you have more than one 
record of anything.

Basic set theory: a set is a container for its members :)

The obvious reason why it exists in XML is that the XML infoset requires a 
single document element (container).  This is why the MARC-XML schema allows 
either a collection or record element to be specified as the document element.  
It is unfortunate that the XML infoset requires a single document element, 
otherwise you would be back to the file on disk being the implicit collection 
(container) as it is in ISO 2709.


Andy.