Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-18 Thread Jonathan Rochkind
Oh, I wasn't actually suggesting limiting to UTF-8 was the right way to 
go; I was asking your opinion!  It's not at all clear to me, but if your 
opinion is that UTF-8 is indeed the right way to go, that's comforting. :)


Bandwidth _does_ matter, I think; it's primarily intended as a 
transmission format, and the reason _I_ am interested in it as a 
transmission format over MarcXML is in large part precisely that it 
will be so much smaller a package. I'm running into various performance 
problems caused by the very large package size of MarcXML. (Disk space 
might be cheap, but bandwidth, over the network or to the file system, 
is not necessarily, for me anyway.)


But I'm not sure I'm concerned about UTF-8 bloating the size of the 
response; I think it will still be manageable and worth it to avoid 
confusion. I pretty much do _everything_ in UTF-8 myself these days, 
because it's just not worth the headache to me to do anything else. But 
I have MUCH less experience dealing with international character sets 
than you, which is why I was curious as to your opinion.  There's no 
reason the marc-hash-in-json proto-spec couldn't allow any valid JSON 
character encoding, if you/we/someone thinks it's necessary/more convenient.


Jonathan

Dan Scott wrote:

I hate GroupWise for forcing me to top-post.

Yes, you are right about everything. Limiting MARC-HASH to just UTF8, rather 
than supporting the full range of encodings allowed by JSON, probably makes it 
easier to generate and parse; it will bloat the size of the format for 
characters outside of the Basic Multilingual Plane but probably nobody cares, 
bandwidth is cheap, right? And this is primarily meant as a transmission format.

I missed the part in the blog entry about the newline-delimited JSON because I was 
specifically looking for a mention of "collections". newline-delimited JSON 
would work, yes, and probably be easier / faster / less memory-intensive to parse.

Dan

  

Jonathan Rochkind  03/18/10 10:41 AM >>>

So do you think the marc-hash-to-json "proto-spec" should suggest that 
the encoding HAS to be UTF-8, or should it leave it open to anything 
that's legal JSON?   (Is there a problem I don't know about with 
expressing "characters outside of the Basic Multilingual Plane" in 
UTF-8?  Any unicode char can be encoded in any of the unicode encodings, 
right?). 

If "collections" means what I think, Bill's blog proto-spec says they 
should be serialized as JSON-separated-by-newlines, right?  That is, 
JSON for each record, separated by newlines, rather than the alternative 
approach you hypothesize there. There are various reasons to prefer 
json-separated-by-newlines, which is an actual convention used in the 
wild, not something made up just for here.


Jonathan

Dan Scott wrote:
  

Hey Bill:

Do you have unit tests for MARC-HASH / JSON anywhere? If you do, that would 
make it easier for me to create a compliant PHP File_MARC_JSON variant, which 
I'll be happy-ish to create.

The only concerns I have with your write-up are:
  * JSON itself allows UTF8, UTF16, and UTF32 encoding - and we've seen in 
Evergreen some cases where characters outside of the Basic Multilingual Plane 
are required. We eventually wound up resorting to surrogate pairs, in that 
case; so maybe this isn't a real issue.
  * You've mentioned that you would like to see better support for collections 
in File_MARC / File_MARCXML; but I don't see any mention of how collections 
would work in MARC-HASH / JSON. Would it just be something like the following?

"collection": [
  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "…leader string … ",
    "fields" : [array, of, fields]
  },
  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "…leader string … ",
    "fields" : [array, of, fields]
  }
]

Dan

  


Bill Dueber  03/15/10 12:22 PM >>>

  

I'm pretty sure Andrew was (a) completely unaware of anything I'd done, and
(b) looking to match marc-xml as strictly as reasonable.

I also like the array-based rather than hash-based format, but I'm not gonna
go to the mat for it if no one else cares much.

I would like to see ind1 and ind2 get their own fields, though, for easier
use of stuff like jsonpath in json-centric nosql databases.

On Mon, Mar 15, 2010 at 10:52 AM, Jonathan Rochkind wrote:

  


I would just ask why you didn't use Bill Dueber's already existing
proto-spec, instead of making up your own incompatible one.

I'd think we could somehow all do the same consistent thing here.

Since my interest in marc-json is getting as small a package as possible
for transfer across the wire, I prefer Bill's approach.

Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-18 Thread Dan Scott
I hate GroupWise for forcing me to top-post.

Yes, you are right about everything. Limiting MARC-HASH to just UTF8, rather 
than supporting the full range of encodings allowed by JSON, probably makes it 
easier to generate and parse; it will bloat the size of the format for 
characters outside of the Basic Multilingual Plane but probably nobody cares, 
bandwidth is cheap, right? And this is primarily meant as a transmission format.

I missed the part in the blog entry about the newline-delimited JSON because I 
was specifically looking for a mention of "collections". newline-delimited JSON 
would work, yes, and probably be easier / faster / less memory-intensive to 
parse.

Dan

>>> Jonathan Rochkind  03/18/10 10:41 AM >>>
So do you think the marc-hash-to-json "proto-spec" should suggest that 
the encoding HAS to be UTF-8, or should it leave it open to anything 
that's legal JSON?   (Is there a problem I don't know about with 
expressing "characters outside of the Basic Multilingual Plane" in 
UTF-8?  Any unicode char can be encoded in any of the unicode encodings, 
right?). 

If "collections" means what I think, Bill's blog proto-spec says they 
should be serialized as JSON-separated-by-newlines, right?  That is, 
JSON for each record, separated by newlines, rather than the alternative 
approach you hypothesize there. There are various reasons to prefer 
json-separated-by-newlines, which is an actual convention used in the 
wild, not something made up just for here.

Jonathan

Dan Scott wrote:
> Hey Bill:
>
> Do you have unit tests for MARC-HASH / JSON anywhere? If you do, that would 
> make it easier for me to create a compliant PHP File_MARC_JSON variant, which 
> I'll be happy-ish to create.
>
> The only concerns I have with your write-up are:
>   * JSON itself allows UTF8, UTF16, and UTF32 encoding - and we've seen in 
> Evergreen some cases where characters outside of the Basic Multilingual Plane 
> are required. We eventually wound up resorting to surrogate pairs, in that 
> case; so maybe this isn't a real issue.
>   * You've mentioned that you would like to see better support for 
> collections in File_MARC / File_MARCXML; but I don't see any mention of how 
> collections would work in MARC-HASH / JSON. Would it just be something like 
> the following?
>
> "collection": [
>   {
>     "type" : "marc-hash",
>     "version" : [1, 0],
>     "leader" : "…leader string … ",
>     "fields" : [array, of, fields]
>   },
>   {
>     "type" : "marc-hash",
>     "version" : [1, 0],
>     "leader" : "…leader string … ",
>     "fields" : [array, of, fields]
>   }
> ]
>
> Dan
>
>   
>>>> Bill Dueber  03/15/10 12:22 PM >>>
>>>> 
> I'm pretty sure Andrew was (a) completely unaware of anything I'd done, and
> (b) looking to match marc-xml as strictly as reasonable.
>
> I also like the array-based rather than hash-based format, but I'm not gonna
> go to the mat for it if no one else cares much.
>
> I would like to see ind1 and ind2 get their own fields, though, for easier
> use of stuff like jsonpath in json-centric nosql databases.
>
> On Mon, Mar 15, 2010 at 10:52 AM, Jonathan Rochkind wrote:
>
>   
>> I would just ask why you didn't use Bill Dueber's already existing
>> proto-spec, instead of making up your own incompatible one.
>>
>> I'd think we could somehow all do the same consistent thing here.
>>
>> Since my interest in marc-json is getting as small a package as possible
>> for transfer across the wire, I prefer Bill's approach.
>>
>> http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
>>
>>
>> Houghton,Andrew wrote:
>>
>> 
>>> From: Houghton,Andrew
>>>   
>>>> Sent: Saturday, March 06, 2010 06:59 PM
>>>> To: Code for Libraries
>>>> Subject: RE: [CODE4LIB] Q: XML2JSON converter
>>>>
>>>> Depending on how much time I get next week I'll talk with the developer
>>>> network folks to see what I need to do to put a specification under
>>>> their infrastructure
>>>>
>>>>
>>>> 
>>> I finished documenting our existing use of MARC-JSON.  The specification
>>> can be found on the OCLC developer network wiki [1].  Since it is a wiki,
>>> registered developer network members can edit the specification and I would
>>> ask that you refrain from doing so.
>>>
>>> However, please do use the discussion tab to record issues with the
>>> specification.

Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-18 Thread Jonathan Rochkind
So do you think the marc-hash-to-json "proto-spec" should suggest that 
the encoding HAS to be UTF-8, or should it leave it open to anything 
that's legal JSON?   (Is there a problem I don't know about with 
expressing "characters outside of the Basic Multilingual Plane" in 
UTF-8?  Any unicode char can be encoded in any of the unicode encodings, 
right?). 

If "collections" means what I think, Bill's blog proto-spec says they 
should be serialized as JSON-separated-by-newlines, right?  That is, 
JSON for each record, separated by newlines, rather than the alternative 
approach you hypothesize there. There are various reasons to prefer 
json-separated-by-newlines, which is an actual convention used in the 
wild, not something made up just for here.
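
The newline-delimited convention described above can be sketched roughly as follows. This is just an illustration, not the proto-spec itself; the record shapes (keys, leader string, field arrays) are abbreviated placeholders:

```python
import json

# Two MARC-ish records, field structure abbreviated to placeholders.
records = [
    {"type": "marc-hash", "version": [1, 0], "fields": [["001", "ocm111"]]},
    {"type": "marc-hash", "version": [1, 0], "fields": [["001", "ocm222"]]},
]

# Writing: one JSON object per line -- no enclosing array, no commas
# between records.
stream = "\n".join(json.dumps(r) for r in records)

# Reading: each line parses independently, so a consumer never has to
# hold the whole collection in memory at once.
parsed = [json.loads(line) for line in stream.splitlines()]
assert parsed == records
```

The memory point is the practical draw: a parser can process records one line at a time, which a single enclosing JSON array does not allow without a streaming parser.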


Jonathan

Dan Scott wrote:

Hey Bill:

Do you have unit tests for MARC-HASH / JSON anywhere? If you do, that would 
make it easier for me to create a compliant PHP File_MARC_JSON variant, which 
I'll be happy-ish to create.

The only concerns I have with your write-up are:
  * JSON itself allows UTF8, UTF16, and UTF32 encoding - and we've seen in 
Evergreen some cases where characters outside of the Basic Multilingual Plane 
are required. We eventually wound up resorting to surrogate pairs, in that 
case; so maybe this isn't a real issue.
  * You've mentioned that you would like to see better support for collections 
in File_MARC / File_MARCXML; but I don't see any mention of how collections 
would work in MARC-HASH / JSON. Would it just be something like the following?

"collection": [
  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "…leader string … ",
    "fields" : [array, of, fields]
  },
  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "…leader string … ",
    "fields" : [array, of, fields]
  }
]

Dan

  

Bill Dueber  03/15/10 12:22 PM >>>


I'm pretty sure Andrew was (a) completely unaware of anything I'd done, and
(b) looking to match marc-xml as strictly as reasonable.

I also like the array-based rather than hash-based format, but I'm not gonna
go to the mat for it if no one else cares much.

I would like to see ind1 and ind2 get their own fields, though, for easier
use of stuff like jsonpath in json-centric nosql databases.

On Mon, Mar 15, 2010 at 10:52 AM, Jonathan Rochkind wrote:

  

I would just ask why you didn't use Bill Dueber's already existing
proto-spec, instead of making up your own incompatible one.

I'd think we could somehow all do the same consistent thing here.

Since my interest in marc-json is getting as small a package as possible
for transfer across the wire, I prefer Bill's approach.

http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/


Houghton,Andrew wrote:



From: Houghton,Andrew
  

Sent: Saturday, March 06, 2010 06:59 PM
To: Code for Libraries
Subject: RE: [CODE4LIB] Q: XML2JSON converter

Depending on how much time I get next week I'll talk with the developer
network folks to see what I need to do to put a specification under
their infrastructure




I finished documenting our existing use of MARC-JSON.  The specification
can be found on the OCLC developer network wiki [1].  Since it is a wiki,
registered developer network members can edit the specification and I would
ask that you refrain from doing so.

However, please do use the discussion tab to record issues with the
specification or add additional information to existing issues.  There are
already two open issues on the discussion tab and you can use them as a
template for new issues.  The first issue is Bill Dueber's request for some
sort of versioning and the second issue is whether the specification should
specify the flavor of MARC, e.g., marc21, unicode, etc.

It is recommended that you place issues on the discussion tab since that
will be the official place for documenting and disposing of them.  I do
monitor this listserv and the OCLC developer network listserv, but I only
selectively look at messages on those listservs.  If you would like to use
this listserv or the OCLC developer network listserv to discuss the
MARC-JSON specification, make sure you place MARC-JSON in the subject line,
to give me a clue that I *should* look at that message, or directly CC my
e-mail address on your post.

This message marks the beginning of a two week comment period on the
specification which will end on midnight 2010-03-28.

[1] <http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11>


Thanks, Andy.


  



  


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-17 Thread Dan Scott
Hey Bill:

Do you have unit tests for MARC-HASH / JSON anywhere? If you do, that would 
make it easier for me to create a compliant PHP File_MARC_JSON variant, which 
I'll be happy-ish to create.

The only concerns I have with your write-up are:
  * JSON itself allows UTF8, UTF16, and UTF32 encoding - and we've seen in 
Evergreen some cases where characters outside of the Basic Multilingual Plane 
are required. We eventually wound up resorting to surrogate pairs, in that 
case; so maybe this isn't a real issue.
  * You've mentioned that you would like to see better support for collections 
in File_MARC / File_MARCXML; but I don't see any mention of how collections 
would work in MARC-HASH / JSON. Would it just be something like the following?

"collection": [
  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "…leader string … ",
    "fields" : [array, of, fields]
  },
  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "…leader string … ",
    "fields" : [array, of, fields]
  }
]
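
The Basic Multilingual Plane concern raised above can be shown concretely. This is a generic Unicode sketch, not anything from either proto-spec: a non-BMP character such as U+1D11E (musical G clef) occupies four bytes in UTF-8, while JSON's \u escape form must spell it as a UTF-16 surrogate pair:

```python
import json

gclef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

# Raw UTF-8: four bytes for a non-BMP character.
assert len(gclef.encode("utf-8")) == 4

# ASCII-escaped JSON spells the same character as a surrogate pair,
# \ud834\udd1e -- twelve characters on the wire.
escaped = json.dumps(gclef)
assert escaped == '"\\ud834\\udd1e"'

# Either serialization round-trips to the same character.
assert json.loads(escaped) == gclef
assert json.loads(json.dumps(gclef, ensure_ascii=False)) == gclef
```

So restricting the format to UTF-8 loses nothing in expressiveness; any Unicode character, BMP or not, survives the round trip.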

Dan

>>> Bill Dueber  03/15/10 12:22 PM >>>
I'm pretty sure Andrew was (a) completely unaware of anything I'd done, and
(b) looking to match marc-xml as strictly as reasonable.

I also like the array-based rather than hash-based format, but I'm not gonna
go to the mat for it if no one else cares much.

I would like to see ind1 and ind2 get their own fields, though, for easier
use of stuff like jsonpath in json-centric nosql databases.

On Mon, Mar 15, 2010 at 10:52 AM, Jonathan Rochkind wrote:

> I would just ask why you didn't use Bill Dueber's already existing
> proto-spec, instead of making up your own incompatible one.
>
> I'd think we could somehow all do the same consistent thing here.
>
> Since my interest in marc-json is getting as small a package as possible
> for transfer across the wire, I prefer Bill's approach.
>
> http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
>
>
> Houghton,Andrew wrote:
>
>> From: Houghton,Andrew
>>> Sent: Saturday, March 06, 2010 06:59 PM
>>> To: Code for Libraries
>>> Subject: RE: [CODE4LIB] Q: XML2JSON converter
>>>
>>> Depending on how much time I get next week I'll talk with the developer
>>> network folks to see what I need to do to put a specification under
>>> their infrastructure
>>>
>>>
>>
>> I finished documenting our existing use of MARC-JSON.  The specification
>> can be found on the OCLC developer network wiki [1].  Since it is a wiki,
>> registered developer network members can edit the specification and I would
>> ask that you refrain from doing so.
>>
>> However, please do use the discussion tab to record issues with the
>> specification or add additional information to existing issues.  There are
>> already two open issues on the discussion tab and you can use them as a
>> template for new issues.  The first issue is Bill Dueber's request for some
>> sort of versioning and the second issue is whether the specification should
>> specify the flavor of MARC, e.g., marc21, unicode, etc.
>>
>> It is recommended that you place issues on the discussion tab since that
>> will be the official place for documenting and disposing of them.  I do
>> monitor this listserv and the OCLC developer network listserv, but I only
>> selectively look at messages on those listservs.  If you would like to use
>> this listserv or the OCLC developer network listserv to discuss the
>> MARC-JSON specification, make sure you place MARC-JSON in the subject line,
>> to give me a clue that I *should* look at that message, or directly CC my
>> e-mail address on your post.
>>
>> This message marks the beginning of a two week comment period on the
>> specification which will end on midnight 2010-03-28.
>>
>> [1] <http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11>
>>
>>
>> Thanks, Andy.
>>
>>
>


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-16 Thread Jonathan Rochkind
Bill's format would allow there to be a control field and a data field 
with the same tag, however, so it's all good either way.


Ere Maijala wrote:

On 03/15/2010 06:22 PM, Houghton,Andrew wrote:
  

Secondly, Bill's specification loses semantics from ISO 2709, as I
previously pointed out.  His specification clumps control and data
fields into one property named fields. According to ISO 2709, control
and data fields have different semantics.  You could have a control
field tagged as 001 and a data field tagged as 001 which have
different semantics.  MARC-21 has imposed certain rules for



I won't comment on Bill's proposal, but I'll just say that I don't think 
you can have a control field and a data field with the same code in a 
single MARC format. Well, technically it's possible, but in practice 
everything I've seen relies on rules of the MARC format at hand. You 
could actually say that ISO 2709 works more like Bill's JSON, and 
MARCXML is the different one, as in ISO 2709 the directory doesn't 
separate control and data fields.


--Ere

  


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-16 Thread Ere Maijala

On 03/15/2010 06:22 PM, Houghton,Andrew wrote:

Secondly, Bill's specification loses semantics from ISO 2709, as I
previously pointed out.  His specification clumps control and data
fields into one property named fields. According to ISO 2709, control
and data fields have different semantics.  You could have a control
field tagged as 001 and a data field tagged as 001 which have
different semantics.  MARC-21 has imposed certain rules for


I won't comment on Bill's proposal, but I'll just say that I don't think 
you can have a control field and a data field with the same code in a 
single MARC format. Well, technically it's possible, but in practice 
everything I've seen relies on rules of the MARC format at hand. You 
could actually say that ISO 2709 works more like Bill's JSON, and 
MARCXML is the different one, as in ISO 2709 the directory doesn't 
separate control and data fields.


--Ere

--
Ere Maijala (Mr.)
The National Library of Finland


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Monday, March 15, 2010 12:40 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]
> 
> On the one hand, I'm all for following specs. But on the other...should
> we really be too concerned about dealing with the full flexibility of
> the 2709 spec, vs. what's actually used? I mean, I hope to god no one 
> is actually creating new formats based on 2709!
> 
> If there are real-life examples in the wild of, say, multi-character
> indicators, or subfield codes of more than one character, that's one
> thing.

Yes, there are real-life examples, e.g., MarcXchange, now ISO 25577, being the 
one that comes to mind, where IFLA was *compelled* to create a new MARC-XML 
specification, in a different namespace, and the main difference between the 
two specifications was being able to specify up to nine indicator values.  Given 
that they are optional in the MarcXchange XML schema, personally I feel that 
IFLA and LC could have just extended the MARC-XML schema and added the optional 
attributes, making all existing MARC-XML documents MarcXchange documents, and the 
library community wouldn't have to deal with two XML specifications.  This is 
another example of the library community creating more barriers for people 
entering their market.

> BTW, in the stuff I proposed, you know a controlfield vs. a datafield
> because of the length of the array (2 vs 5); it's well-specified, but
> by the size of the tuple, not by label.

Ahh... I overlooked that aspect of your proposal.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Bill Dueber
On the one hand, I'm all for following specs. But on the other...should we
really be too concerned about dealing with the full flexibility of the 2709
spec, vs. what's actually used? I mean, I hope to god no one is actually
creating new formats based on 2709!

If there are real-life examples in the wild of, say, multi-character
indicators, or subfield codes of more than one character, that's one thing.

BTW, in the stuff I proposed, you know a controlfield vs. a datafield
because of the length of the array (2 vs 5); it's well-specified, but by the
size of the tuple, not by label.
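
The dispatch-by-length idea can be sketched as below. The exact element layout belongs to Bill's marc-hash proto-spec; the example field shapes here are illustrative guesses, not taken from the spec:

```python
def field_kind(field):
    """Classify a field array by its length alone -- no type label needed."""
    if len(field) == 2:
        return "controlfield"   # e.g. ["001", "ocm12345"] (illustrative)
    if len(field) == 5:
        return "datafield"      # five elements per the proto-spec
    raise ValueError("unrecognized field shape: %r" % (field,))

# Hypothetical field arrays, shown only to exercise the length rule.
assert field_kind(["001", "ocm12345"]) == "controlfield"
assert field_kind(["245", "1", "0", "a", "Some title"]) == "datafield"
```

The trade-off is that the distinction is structural rather than self-describing: a consumer must know the convention, but no bytes are spent on labels.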

On Mon, Mar 15, 2010 at 11:22 AM, Houghton,Andrew  wrote:

> > From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> > Jonathan Rochkind
> > Sent: Monday, March 15, 2010 11:53 AM
> > To: CODE4LIB@LISTSERV.ND.EDU
> > Subject: Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]
> >
> > I would just ask why you didn't use Bill Dueber's already existing
> > proto-spec, instead of making up your own incompatible one.
>
> Because the internal use of our specification predated Bill's blog entry,
> dated 2010-02-25, by almost a year.  Bill's post reminded me that I had not
> published or publicly discussed our specification.
>
> Secondly, Bill's specification loses semantics from ISO 2709, as I
> previously pointed out.  His specification clumps control and data fields
> into one property named fields. According to ISO 2709, control and data
> fields have different semantics.  You could have a control field tagged as
> 001 and a data field tagged as 001 which have different semantics.  MARC-21
> has imposed certain rules for assignment of tags such that this isn't a
> concern, but other systems based on ISO 2709 may not.
>
>
> Andy.
>



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Monday, March 15, 2010 12:19 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]
> 
> I would like to see ind1 and ind2 get their own fields, though, for
> easier
> use of stuff like jsonpath in json-centric nosql databases.

I'll add this issue to the discussion page.  Your point about being able to 
index a specific indicator property has value.  So a resolution to the issue 
would be to have 9 indicator properties, per ISO 2709 and MarcXchange, but 
require only the ones based on the leader's indicator count value.
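
The indexing advantage under discussion can be sketched as follows. The field object here is hypothetical, not the OCLC spec; it only shows why a named indicator key is easier to query than a positional slot:

```python
# Hypothetical field with named indicator properties.
field = {"tag": "245", "ind1": "1", "ind2": "0",
         "subfields": [{"a": "A title"}]}

# With named properties, a JSONPath-style lookup ($.ind1) reduces to a
# plain key access that a document database can index directly...
assert field["ind1"] == "1"

# ...whereas positional storage forces every consumer (and every index
# definition) to remember which slot holds which indicator.
positional = ["245", "1", "0", [{"a": "A title"}]]
assert positional[1] == "1"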


Thanks, Andy.


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Monday, March 15, 2010 11:53 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]
> 
> I would just ask why you didn't use Bill Dueber's already existing
> proto-spec, instead of making up your own incompatible one.

Because the internal use of our specification predated Bill's blog entry, dated 
2010-02-25, by almost a year.  Bill's post reminded me that I had not published 
or publicly discussed our specification.

Secondly, Bill's specification loses semantics from ISO 2709, as I previously 
pointed out.  His specification clumps control and data fields into one 
property named fields. According to ISO 2709, control and data fields have 
different semantics.  You could have a control field tagged as 001 and a data 
field tagged as 001 which have different semantics.  MARC-21 has imposed 
certain rules for assignment of tags such that this isn't a concern, but other 
systems based on ISO 2709 may not.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Bill Dueber
I'm pretty sure Andrew was (a) completely unaware of anything I'd done, and
(b) looking to match marc-xml as strictly as reasonable.

I also like the array-based rather than hash-based format, but I'm not gonna
go to the mat for it if no one else cares much.

I would like to see ind1 and ind2 get their own fields, though, for easier
use of stuff like jsonpath in json-centric nosql databases.

On Mon, Mar 15, 2010 at 10:52 AM, Jonathan Rochkind wrote:

> I would just ask why you didn't use Bill Dueber's already existing
> proto-spec, instead of making up your own incompatible one.
>
> I'd think we could somehow all do the same consistent thing here.
>
> Since my interest in marc-json is getting as small a package as possible
> for transfer across the wire, I prefer Bill's approach.
>
> http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
>
>
> Houghton,Andrew wrote:
>
>> From: Houghton,Andrew
>>> Sent: Saturday, March 06, 2010 06:59 PM
>>> To: Code for Libraries
>>> Subject: RE: [CODE4LIB] Q: XML2JSON converter
>>>
>>> Depending on how much time I get next week I'll talk with the developer
>>> network folks to see what I need to do to put a specification under
>>> their infrastructure
>>>
>>>
>>
>> I finished documenting our existing use of MARC-JSON.  The specification
>> can be found on the OCLC developer network wiki [1].  Since it is a wiki,
>> registered developer network members can edit the specification and I would
>> ask that you refrain from doing so.
>>
>> However, please do use the discussion tab to record issues with the
>> specification or add additional information to existing issues.  There are
>> already two open issues on the discussion tab and you can use them as a
>> template for new issues.  The first issue is Bill Dueber's request for some
>> sort of versioning and the second issue is whether the specification should
>> specify the flavor of MARC, e.g., marc21, unicode, etc.
>>
>> It is recommended that you place issues on the discussion tab since that
>> will be the official place for documenting and disposing of them.  I do
>> monitor this listserv and the OCLC developer network listserv, but I only
>> selectively look at messages on those listservs.  If you would like to use
>> this listserv or the OCLC developer network listserv to discuss the
>> MARC-JSON specification, make sure you place MARC-JSON in the subject line,
>> to give me a clue that I *should* look at that message, or directly CC my
>> e-mail address on your post.
>>
>> This message marks the beginning of a two week comment period on the
>> specification which will end on midnight 2010-03-28.
>>
>> [1] <http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11>
>>
>>
>> Thanks, Andy.
>>
>>
>


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Jonathan Rochkind
I would just ask why you didn't use Bill Dueber's already existing 
proto-spec, instead of making up your own incompatible one.


I'd think we could somehow all do the same consistent thing here.

Since my interest in marc-json is getting as small a package as possible 
for transfer across the wire, I prefer Bill's approach.


http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/

Houghton,Andrew wrote:

From: Houghton,Andrew
Sent: Saturday, March 06, 2010 06:59 PM
To: Code for Libraries
Subject: RE: [CODE4LIB] Q: XML2JSON converter

Depending on how much time I get next week I'll talk with the developer
network folks to see what I need to do to put a specification under
their infrastructure



I finished documenting our existing use of MARC-JSON.  The specification can be 
found on the OCLC developer network wiki [1].  Since it is a wiki, registered 
developer network members can edit the specification and I would ask that you 
refrain from doing so.

However, please do use the discussion tab to record issues with the 
specification or add additional information to existing issues.  There are 
already two open issues on the discussion tab and you can use them as a 
template for new issues.  The first issue is Bill Dueber's request for some 
sort of versioning and the second issue is whether the specification should 
specify the flavor of MARC, e.g., marc21, unicode, etc.

It is recommended that you place issues on the discussion tab since that will 
be the official place for documenting and disposing of them.  I do monitor this 
listserv and the OCLC developer network listserv, but I only selectively look 
at messages on those listservs.  If you would like to use this listserv or 
the OCLC developer network listserv to discuss the MARC-JSON specification, 
make sure you place MARC-JSON in the subject line, to give me a clue that I 
*should* look at that message, or directly CC my e-mail address on your post.

This message marks the beginning of a two week comment period on the 
specification which will end on midnight 2010-03-28.

[1] <http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11>


Thanks, Andy. 

  


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-14 Thread Houghton,Andrew
> From: Houghton,Andrew
> Sent: Saturday, March 06, 2010 06:59 PM
> To: Code for Libraries
> Subject: RE: [CODE4LIB] Q: XML2JSON converter
> 
> Depending on how much time I get next week I'll talk with the developer
> network folks to see what I need to do to put a specification under
> their infrastructure

I finished documenting our existing use of MARC-JSON.  The specification can be 
found on the OCLC developer network wiki [1].  Since it is a wiki, registered 
developer network members can edit the specification and I would ask that you 
refrain from doing so.

However, please do use the discussion tab to record issues with the 
specification or add additional information to existing issues.  There are 
already two open issues on the discussion tab and you can use them as a 
template for new issues.  The first issue is Bill Dueber's request for some 
sort of versioning and the second issue is whether the specification should 
specify the flavor of MARC, e.g., marc21, unicode, etc.

It is recommended that you place issues on the discussion tab since that will 
be the official place for documenting and disposing of them.  I do monitor this 
listserv and the OCLC developer network listserv, but I only selectively look 
at messages on those listservs.  If you would like to use this listserv or 
the OCLC developer network listserv to discuss the MARC-JSON specification, 
make sure you place MARC-JSON in the subject line, to give me a clue that I 
*should* look at that message, or directly CC my e-mail address on your post.

This message marks the beginning of a two week comment period on the 
specification which will end on midnight 2010-03-28.

[1] <http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11>


Thanks, Andy. 


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-08 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Benjamin Young
> Sent: Monday, March 08, 2010 09:32 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> Rather than using a newline-delimited format (the whole of which would
> not together be considered a valid JSON object) why not use the JSON
> array format with or without new lines? Something like:
> 
> [{"key":"value"}, {"key","value"}]
> 
> You could include new line delimiters after the "," if you needed to
> make pre-parsing easier (in a streaming context), but may be able to
> get
> away with just looking for the next "," or "]" after each valid JSON
> object.
> 
> That would allow the entire stream, if desired, to be saved to disk and
> read in as a single JSON object, or the same API to serve smaller JSON
> collections in a JSON standard way.

I think we have just come full circle again.  There appear to be two distinct 
use cases when dealing with MARC collections.  The first conforms to the ECMA 
262 JSON subset, which is what you described above:

[ { "key" : "value" }, { "key" : "value" } ]

its media type should be specified as application/json.

The second use case, which Bill Dueber and I discussed earlier, is a newline 
delimited format where the JSON array specifiers are omitted and the objects 
are specified one per line without commas separating them.  The 
misunderstanding between Bill and me was that I thought this "malformed" JSON 
was being sent as media type application/json, which is not what he was 
proposing.  This newline delimited JSON appears to be an import/export format 
in both CouchDB and MongoDB.
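For concreteness, a minimal sketch of consuming newline-delimited JSON one
record at a time with an ordinary (non-streaming) parser; the sample field
names here are placeholders, not part of any MARC-JSON specification:

```python
import json

# Read newline-delimited JSON: one complete JSON object per line,
# so each line can be handed to a standard (non-streaming) parser.
def read_ndjson(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:                      # skip blank lines
                yield json.loads(line)    # parse one record at a time

# The same idea applied to an in-memory sample instead of a real file:
sample = '{"tag": "245"}\n{"tag": "100"}\n'
records = [json.loads(l) for l in sample.splitlines() if l.strip()]
print(records)  # → [{'tag': '245'}, {'tag': '100'}]
```

The point is that no streaming JSON API is needed: line splitting does the
framing, and each fragment is itself valid JSON.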

In the FAST work I'm doing I'm probably going to take an alternate approach to 
generating our 10,000 MARC record collection files for download.  The approach 
I'm going to take is to create valid JSON but make it easier for the CouchDB 
and MongoDB folks to import the collection of records.  The format will be:

[
{ "key" : "value" }
,
{ "key" : "value" }
]

the objects will be one per line, but the array specifier and comma delimiters 
between objects will appear on a separate line.  This would allow the CouchDB 
and MongoDB folks to run a simple sed script on the file before import:

sed -e '/^.$/D' file.json > file.txt

or if they are reading the data as a raw text file, they can ignore all lines 
that start with an opening bracket, comma, or closing bracket, or alternately 
only process lines starting with an opening brace.
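A small sketch of the two reading strategies this layout allows (the key
names are placeholders): parse the whole file as valid JSON, or scan it as
raw text and only process lines that begin with an opening brace:

```python
import json

# The proposed layout: objects one per line, with the array specifiers
# and comma delimiters each on their own line.
collection_text = """[
{ "key" : "value1" }
,
{ "key" : "value2" }
]"""

# Strategy 1: the whole file is valid JSON, so a standard parser works.
records = json.loads(collection_text)

# Strategy 2: treat it as raw text and only process lines that start
# with an opening brace, skipping the [ , ] delimiter lines.
records_by_line = [
    json.loads(line)
    for line in collection_text.splitlines()
    if line.lstrip().startswith("{")
]

assert records == records_by_line
print(records_by_line)  # → [{'key': 'value1'}, {'key': 'value2'}]
```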

However, this doesn't mean that I'm balking at pursuing a separate media type 
specific to the library community that specifies a specific MARC JSON 
serialization encoded as a single line.

I see multiple steps here, the first being consensus on serializing MARC 
(ISO 2709) in JSON, which begins with me documenting it so people can throw 
some darts at it.  I don't think what we are proposing is controversial, but 
it's beneficial to have a variety of perspectives as input.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-08 Thread Benjamin Young

On 3/6/10 6:59 PM, Houghton,Andrew wrote:

From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Bill Dueber
Sent: Saturday, March 06, 2010 05:11 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Q: XML2JSON converter

Anyway, hopefully, it won't be a huge surprise that I don't disagree
with any of the quote above in general; I would assert, though, that
application/json and application/marc+json should both return JSON
(in the same way that text/xml, application/xml, and
application/marc+xml can all be expected to return XML).
Newline-delimited json is starting to crop up in a few places
(e.g. couchdb) and should probably have its own mime type
and associated extension. So I would say something like:

application/json -- return json (obviously)
application/marc+json  -- return json
application/marc+ndj  -- return newline-delimited json

This sounds like consensus on how to deal with newline-delimited JSON in a 
standards based manner.

I'm not familiar with CouchDB, but I am using MongoDB which is similar.  I'll 
have to dig into how they deal with this newline-delimited JSON.  Can you 
provide any references to get me started?
Rather than using a newline-delimited format (the whole of which would 
not together be considered a valid JSON object) why not use the JSON 
array format with or without new lines? Something like:


[{"key":"value"}, {"key","value"}]

You could include new line delimiters after the "," if you needed to 
make pre-parsing easier (in a streaming context), but may be able to get 
away with just looking for the next "," or "]" after each valid JSON object.


That would allow the entire stream, if desired, to be saved to disk and 
read in as a single JSON object, or the same API to serve smaller JSON 
collections in a JSON standard way.


CouchDB uses this array notation when returning multiple document 
revisions in one request. CouchDB also offers a slightly more annotated 
structure (which might be useful with streaming as well):


{
  "total_rows": 2,
  "offset": 0,
  "rows":[{"key":"value"}, {"key","value"}]
}

Rows here plays the same role as the above array-based format, but 
provides an initial row count for the consumer to use (if it wants) for 
knowing what's ahead. The "offset" key is specific to CouchDB, but 
similar application specific information could be stored in the "header" 
of the JSON object using this method.

In all cases, we should agree on a standard record serialization,
though, and the pure-json returns should include something that
indicates what the heck it is (hopefully a URI that can act as a
distinct "namespace"-type identifier, including a version in it).

I agree that our MARC-JSON serialization needs some "namespace" identifier in 
it and it occurred to me that the way it is handling indicators, e.g., ind1 and ind2 
properties, might be better handled as an array to accommodate IFLA's MARC-XML-ish where 
they can have from 1-9 indicator values.

BTW, our MARC-JSON content is specified in Unicode not MARC-8, per the JSON 
standard, which means you need to use \u notation to specify characters in 
strings, not sure I made that clear in earlier posts.  A downside to the 
current ECMA 262 specification is that it doesn't support Python-style 
\UXXXXXXXX escapes for characters outside the Basic Multilingual Plane.  
Hopefully that will get rectified in a future ECMA 262 specification.
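To illustrate the \u escaping, a small sketch using Python's json module
(any conforming JSON serializer behaves the same way): a character outside
the Basic Multilingual Plane is escaped as a UTF-16 surrogate pair, since
each \u escape carries only four hex digits:

```python
import json

# JSON's \u escape takes exactly four hex digits, so characters outside
# the Basic Multilingual Plane are written as a UTF-16 surrogate pair.
gothic_ahsa = "\U00010330"          # GOTHIC LETTER AHSA, code point U+10330

escaped = json.dumps(gothic_ahsa)   # ensure_ascii=True is the default
print(escaped)                      # → "\ud800\udf30"  (surrogate pair)

# A conforming JSON parser reassembles the pair into one character.
assert json.loads(escaped) == gothic_ahsa
```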


The question for me, I think, is whether within this community,  anyone
who provides one of these types (application/marc+json and
application/marc+ndj) should automatically be expected to provide both.
I don't have an answer for that.
As far as mime-type declarations go in general, I'd recommend avoiding 
any format specific mime types and sticking to the application/json 
format and providing document level hints (if needed) for the content 
type. If you do find a need for the special case mime types, I'd 
recommend still responding to Accept: application/json whenever 
possible--for the sake of standards. :)


All told, I'm just glad to see this discussion being had. I'll be happy 
to provide some CouchDB test cases (replication, etc) if that's of 
interest to anyone.


Thanks,
Benjamin

I think this issue gets into familiar territory when dealing with RDF formats.  
Let's see, there is N3, NT, XML, Turtle, etc.  Do you need to provide all of 
them?  No, but it's nice of the server to at least provide NT or Turtle and 
XML.  Ultimately it's up to the server.  But the only difference between use 
cases #2 and #3 is whether the output is wrapped in an array, so it's probably 
easy for the server to produce both.

Depending on how much time I get next week I'll talk with the developer network 
folks to see what I need to do to put a specification under their 
infrastructure.  Looks like from my schedule it's going to be another week of 
hell :(


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-06 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Saturday, March 06, 2010 05:11 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> Anyway, hopefully, it won't be a huge surprise that I don't disagree
> with any of the quote above in general; I would assert, though, that
> application/json and application/marc+json should both return JSON
> (in the same way that text/xml, application/xml, and 
> application/marc+xml can all be expected to return XML). 
> Newline-delimited json is starting to crop up in a few places 
> (e.g. couchdb) and should probably have its own mime type
> and associated extension. So I would say something like:
> 
> application/json -- return json (obviously)
> application/marc+json  -- return json
> application/marc+ndj  -- return newline-delimited json

This sounds like consensus on how to deal with newline-delimited JSON in a 
standards based manner.

I'm not familiar with CouchDB, but I am using MongoDB which is similar.  I'll 
have to dig into how they deal with this newline-delimited JSON.  Can you 
provide any references to get me started?

> In all cases, we should agree on a standard record serialization,
> though, and the pure-json returns should include something that 
> indicates what the heck it is (hopefully a URI that can act as a 
> distinct "namespace"-type identifier, including a version in it).

I agree that our MARC-JSON serialization needs some "namespace" identifier in 
it and it occurred to me that the way it is handling indicators, e.g., ind1 and 
ind2 properties, might be better handled as an array to accommodate IFLA's 
MARC-XML-ish where they can have from 1-9 indicator values.

BTW, our MARC-JSON content is specified in Unicode not MARC-8, per the JSON 
standard, which means you need to use \u notation to specify characters in 
strings, not sure I made that clear in earlier posts.  A downside to the 
current ECMA 262 specification is that it doesn't support Python-style 
\UXXXXXXXX escapes for characters outside the Basic Multilingual Plane.  
Hopefully that will get rectified in a future ECMA 262 specification.

> The question for me, I think, is whether within this community,  anyone
> who provides one of these types (application/marc+json and
> application/marc+ndj) should automatically be expected to provide both.
> I don't have an answer for that.

I think this issue gets into familiar territory when dealing with RDF formats.  
Let's see, there is N3, NT, XML, Turtle, etc.  Do you need to provide all of 
them?  No, but it's nice of the server to at least provide NT or Turtle and 
XML.  Ultimately it's up to the server.  But the only difference between use 
cases #2 and #3 is whether the output is wrapped in an array, so it's probably 
easy for the server to produce both.

Depending on how much time I get next week I'll talk with the developer network 
folks to see what I need to do to put a specification under their 
infrastructure.  Looks like from my schedule it's going to be another week of 
hell :(


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-06 Thread Bill Dueber
On Sat, Mar 6, 2010 at 1:57 PM, Houghton,Andrew  wrote:

>  A way to fix this issue is to say that use cases #1 and #2 conform to
> media type application/json and use case #3 conforms to a new media type
> say: application/marc+json.  This new application/marc+json media type now
> becomes a library centric standard and it avoids breaking a widely deployed
> Web standard.
>

I'm so sorry -- it never dawned on me that anyone would think that I was
asserting that a JSON MIME type should return anything but JSON. For the
record, I think that's batshit crazy. JSON needs to return json. I'd been
hoping to convince folks that we need to have a standard way to pass records
around that doesn't require a streaming parser/writer; not ignore standard
MIME-types willy-nilly. My use cases exist almost entirely outside the
browser environment (because, my god, I don't want to have to try to deal
with MARC21, whatever the serialization, in a browser environment); it
sounds like Andy is almost purely worried about working with a MARC21
serialization within a browser-based javascript environment.

Anyway, hopefully, it won't be a huge surprise that I don't disagree with
any of the quote above in general; I would assert, though, that
application/json and application/marc+json should both return JSON (in the
same way that text/xml, application/xml, and application/marc+xml can all be
expected to return XML). Newline-delimited json is starting to crop up in a
few places (e.g. couchdb) and should probably have its own mime type and
associated extension. So I would say something like:

application/json -- return json (obviously)
application/marc+json  -- return json
application/marc+ndj  -- return newline-delimited json
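A hedged sketch of how a service might dispatch on these proposed types.
Note that application/marc+json and application/marc+ndj are only the names
suggested in this thread, not registered media types, and the helper below
is hypothetical:

```python
# Hypothetical mapping of the proposed media types to response styles;
# "application/marc+json" and "application/marc+ndj" are just the names
# floated in this thread, not registered types.
PROPOSED_TYPES = {
    "application/json": "json",       # a single valid JSON document
    "application/marc+json": "json",  # MARC-in-JSON, still valid JSON
    "application/marc+ndj": "ndj",    # newline-delimited JSON records
}

def response_style(accept_header):
    # Very small sketch: pick the first proposed type the client accepts,
    # ignoring q-values for simplicity.
    for media_type in accept_header.split(","):
        media_type = media_type.split(";")[0].strip().lower()
        if media_type in PROPOSED_TYPES:
            return PROPOSED_TYPES[media_type]
    return None  # a real service would answer 406 Not Acceptable

print(response_style("application/marc+ndj, application/json;q=0.5"))  # → ndj
```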

In all cases, we should agree on a standard record serialization, though,
and the pure-json returns should include something that indicates what the
heck it is (hopefully a URI that can act as a distinct "namespace"-type
identifier, including a version in it).

The question for me, I think, is whether within this community,  anyone who
provides one of these types (application/marc+json and application/marc+ndj)
should automatically be expected to provide both. I don't have an answer for
that.

 -Bill-


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-06 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 08:48 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew 
> wrote:
> 
> > OK, I will bite, you stated:
> >
> > 1. That large datasets are a problem.
> > 2. That streaming APIs are a pain to deal with.
> > 3. That tool sets have memory constraints.
> >
> > So how do you propose to process large JSON datasets that:
> >
> > 1. Comply with the JSON specification.
> > 2. Can be read by any JavaScript/JSON processor.
> > 3. Do not require the use of streaming API.
> > 4. Do not exceed the memory limitations of current JSON processors.
> >
> >
> What I'm proposing is that we don't process large JSON datasets; I'm
> proposing that we process smallish JSON documents one at a time by
> pulling
> them out of a stream based on an end-of-record character.
> 
> This is basically what we use for MARC21 binary format -- have a
> defined
> structure for a valid record, and separate multiple well-formed record
> structures with an end-of-record character. This preserves JSON
> specification adherence at the record level and uses a different scheme
> to represent collections. Obviously, MARC-XML uses a different 
> mechanism to define a collection of records -- putting well-formed 
> record structures inside a <collection> tag.
> 
> So... I'm proposing define what we mean by a single MARC record
> serialized to JSON (in whatever format; I'm not very opinionated 
> on this point) that preserves the order, indicators, tags, data, 
> etc. we need to round-trip between marc21binary, marc-xml, and 
> marc-json.
> 
> And then separate those valid records with an end-of-record character
> -- "\n".

Ok, what I see here are divergent use cases and the willingness of the library 
community to break existing Web standards.  This is how the library community 
makes it more difficult to use their data and places additional barriers for 
people and organizations to enter their market because of these library centric 
protocols and standards.

If I were to try to sell this idea to the Web community, at large, and tell 
them that when they send an HTTP request with an Accept: application/json 
header to our services, our services will respond with a 200 HTTP status and 
deliver them malformed JSON, I would be immediately impaled with multiple 
arrows and daggers :(  Not to mention that OCLC would be disparaged by a 
certain crowd in their blogs as being idiots who cannot follow standards.

OCLC's goals are to use and conform to Web standards to make library data 
easier to use by people or organizations outside the library community; 
otherwise libraries and their data will become irrelevant.  The JSON 
serialization is a standard, and the Web community expects that when they make 
HTTP requests with an Accept: application/json header they will get back JSON 
conforming to the standard.  JSON's main use case is in AJAX scenarios where 
you are not supposed to be sending megabytes of data across the wire.

Your proposal is asking me to break a widely deployed Web standard that is 
used by AJAX frameworks to access millions (ok, many) of Web sites.

> Unless I've read all this wrong, you've come to the conclusion that the
> benefit of having a JSON serialization that is valid JSON at both the
> record and collection level outweighs the pain of having to deal with
> a streaming parser and writer.  This allows a single collection to be
> treated as any other JSON document, which has obvious benefits (which 
> I certainly don't mean to minimize) and all the drawbacks we've been 
> talking about *ad nauseam*.

The goal is to adhere to existing Web standards and your underlying assumption 
is that you can or will be retrieving large datasets through an AJAX scenario.  
As I pointed out this is more an API design issue and due to the way AJAX works 
you should never design an API in that manner.  Your assumption that you can or 
will be retrieving large datasets through an AJAX scenario is false given the 
caveat of a well designed API.  Therefore you will never be put into a 
scenario requiring the use of JSON streaming, so your argument from this 
point of view is moot.

But for arguments sake let's say you could retrieve a line delimited list of 
JSON objects.  You can no longer use any existing AJAX framework for getting 
back that JSON since it's malformed.  You could use the AJAX framework's 
XMLHTTP to retrieve this line delimited list of JSON objects, but this still 
doesn't help because the XMLHTTP object will keep the entire response in memory.

So when our service se

Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ross Singer
> Sent: Friday, March 05, 2010 09:18 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> I actually just wrote the same exact email as Bill (although probably
> not as polite -- I called the marcxml "collection" element a
> "contrivance that appears nowhere in marc21").  I even wrote the
> "marc21 is EOR character delimited files" bit.  I was hoping to figure
> out how to use unix split to make my point, couldn't, and then
> discarded my draft.
> 
> But I was *right there*.
> 
> -Ross.

I'll answer Bill's message tomorrow after I have had some sleep :) 

Actually, I contend that the MARC-XML collection element does appear in MARC 
(ISO 2709), but it is at the physical layer and not at the structural layer.  
Remember MARC records were placed on a tape reel, thus the tape reel was the 
collection (container).  Placed on disk in a file, the file is the collection 
(container).  I agree that it's not spelled out in the standard, but the 
concept of a collection (container) is implicit when you have more than one 
record of anything.

Basic set theory: a set is a container for its members :)

The obvious reason why it exists in XML is that the XML infoset requires a 
single document element (container).  This is why the MARC-XML schema allows 
either a collection or record element to be specified as the document element.  
It is unfortunate that the XML infoset requires a single document element, 
otherwise you would be back to the file on disk being the implicit collection 
(container) as it is in ISO 2709.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Ross Singer
I actually just wrote the same exact email as Bill (although probably
not as polite -- I called the marcxml "collection" element a
"contrivance that appears nowhere in marc21").  I even wrote the
"marc21 is EOR character delimited files" bit.  I was hoping to figure
out how to use unix split to make my point, couldn't, and then
discarded my draft.

But I was *right there*.

-Ross.

On Fri, Mar 5, 2010 at 8:48 PM, Bill Dueber  wrote:
> On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew  wrote:
>
>> OK, I will bite, you stated:
>>
>> 1. That large datasets are a problem.
>> 2. That streaming APIs are a pain to deal with.
>> 3. That tool sets have memory constraints.
>>
>> So how do you propose to process large JSON datasets that:
>>
>> 1. Comply with the JSON specification.
>> 2. Can be read by any JavaScript/JSON processor.
>> 3. Do not require the use of streaming API.
>> 4. Do not exceed the memory limitations of current JSON processors.
>>
>>
> What I'm proposing is that we don't process large JSON datasets; I'm
> proposing that we process smallish JSON documents one at a time by pulling
> them out of a stream based on an end-of-record character.
>
> This is basically what we use for MARC21 binary format -- have a defined
> structure for a valid record, and separate multiple well-formed record
> structures with an end-of-record character. This preserves JSON
> specification adherence at the record level and uses a different scheme to
> represent collections. Obviously, MARC-XML uses a different mechanism to
> define a collection of records -- putting well-formed record structures
> inside a <collection> tag.
>
> So... I'm proposing define what we mean by a single MARC record serialized
> to JSON (in whatever format; I'm not very opinionated on this point) that
> preserves the order, indicators, tags, data, etc. we need to round-trip
> between marc21binary, marc-xml, and marc-json.
>
> And then separate those valid records with an end-of-record character --
> "\n".
>
> Unless I've read all this wrong, you've come to the conclusion that the
> benefit of having a JSON serialization that is valid JSON at both the record
> and collection level outweighs the pain of having to deal with a streaming
> parser and writer.  This allows a single collection to be treated as any
> other JSON document, which has obvious benefits (which I certainly don't
> mean to minimize) and all the drawbacks we've been talking about *ad nauseam
> *.
>
> I go the other way. I think the pain of dealing with a streaming API
> outweighs the benefits of having a single valid JSON structure for a
> collection, and instead have put forward that we use a combination of JSON
> records and a well-defined end-of-record character ("\n") to represent a
> collection.  I recognize that this involves providing special-purpose code
> which must call for JSON-deserialization on each line, instead of being able
> to throw the whole stream/file/whatever at your json parser. I accept
> that because getting each line of a text file is something I find easy
> compared to dealing with streaming parsers.
>
> And our point of disagreement, I think, is that I believe that defining the
> collection structure in such a way that we need two steps (get a line;
> deserialize that line) and can't just call the equivalent of
> JSON.parse(stream) has benefits in ease of implementation and use that
> outweigh the loss of having both a single record and a collection of records
> be valid JSON. And you, I think, don't :-)
>
> I'm going to bow out of this now, unless I've got some part of our positions
> wrong, to let any others that care (which may number zero) chime in.
>
>  -Bill-
>
>
>
>
>
>
>
>
>
>
> --
> Bill Dueber
> Library Systems Programmer
> University of Michigan Library
>


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew  wrote:

> OK, I will bite, you stated:
>
> 1. That large datasets are a problem.
> 2. That streaming APIs are a pain to deal with.
> 3. That tool sets have memory constraints.
>
> So how do you propose to process large JSON datasets that:
>
> 1. Comply with the JSON specification.
> 2. Can be read by any JavaScript/JSON processor.
> 3. Do not require the use of streaming API.
> 4. Do not exceed the memory limitations of current JSON processors.
>
>
What I'm proposing is that we don't process large JSON datasets; I'm
proposing that we process smallish JSON documents one at a time by pulling
them out of a stream based on an end-of-record character.

This is basically what we use for MARC21 binary format -- have a defined
structure for a valid record, and separate multiple well-formed record
structures with an end-of-record character. This preserves JSON
specification adherence at the record level and uses a different scheme to
represent collections. Obviously, MARC-XML uses a different mechanism to
define a collection of records -- putting well-formed record structures
inside a <collection> tag.

So... I'm proposing define what we mean by a single MARC record serialized
to JSON (in whatever format; I'm not very opinionated on this point) that
preserves the order, indicators, tags, data, etc. we need to round-trip
between marc21binary, marc-xml, and marc-json.

And then separate those valid records with an end-of-record character --
"\n".

Unless I've read all this wrong, you've come to the conclusion that the
benefit of having a JSON serialization that is valid JSON at both the record
and collection level outweighs the pain of having to deal with a streaming
parser and writer.  This allows a single collection to be treated as any
other JSON document, which has obvious benefits (which I certainly don't
mean to minimize) and all the drawbacks we've been talking about *ad nauseam
*.

I go the other way. I think the pain of dealing with a streaming API
outweighs the benefits of having a single valid JSON structure for a
collection, and instead have put forward that we use a combination of JSON
records and a well-defined end-of-record character ("\n") to represent a
collection.  I recognize that this involves providing special-purpose code
which must call for JSON-deserialization on each line, instead of being able
to throw the whole stream/file/whatever at your json parser. I accept
that because getting each line of a text file is something I find easy
compared to dealing with streaming parsers.

And our point of disagreement, I think, is that I believe that defining the
collection structure in such a way that we need two steps (get a line;
deserialize that line) and can't just call the equivalent of
JSON.parse(stream) has benefits in ease of implementation and use that
outweigh the loss of having both a single record and a collection of records
be valid JSON. And you, I think, don't :-)

I'm going to bow out of this now, unless I've got some part of our positions
wrong, to let any others that care (which may number zero) chime in.

 -Bill-










-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 05:22 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> This is my central point. I'm actually saying that JSON streaming is
> painful
> and rare enough that it should be avoided as a requirement for working
> with
> any new format.

OK, in principle we are in agreement here.

> I guess, in sum, I'm making the following assertions:
> 
> 1. Streaming APIs for JSON, where they exist, are a pain in the ass.
> And
> they don't exist everywhere. Without a JSON streaming parser, you have
> to
> pull the whole array of documents up into memory, which may be
> impossible.
> This is the crux of my argument -- if you disagree with it, then I
> would
> assume you disagree with the other points as well.

Agree with streaming APIs for JSON are a pain and not universal across all 
clients.

Agree that without a streaming API you are limited by memory constraints on the 
client. 

> 2. Many people -- and I don't think I'm exaggerating here, honestly --
> really don't like using MARC-XML but have to because of the length
> restrictions on MARC-binary. A useful alternative, based on dead-easy
> parsing and production, is very appealing.

Cannot address this concern.  MARC (ISO 2709) and MARC-XML are library 
community standards.  Doesn't matter whether I like them or not, or you like 
them or not.  This is what the library community has agreed to as a 
communications format between systems for interoperability.

> 2.5 Having to deal with a streaming API takes away the "dead-easy"
> part.

My assumption is that 2.5 is dealing with using a streaming API with MARC-XML.  
I agree that using SAX in XML on MARC-XML is a pain, but that's an issue with 
dealing with large XML datasets, in general, and has nothing to do with 
MARC-21.  In general, when processing large MARC-XML I use SAX to get me a 
complete record and process at the record level, that isn't too bad, but I'll 
concede it's still a pain.  Usually, I break up large datasets into 10,000 
record chunks and process them that way since most XML and XSLT tools cannot 
effectively deal with documents that are 100MB or larger, so I rarely ever use 
SAX anymore.
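A rough sketch of that record-at-a-time chunking approach, using Python's
stdlib iterparse rather than raw SAX; the chunk size matches the 10,000
records mentioned above, but the file names and helper names are
illustrative, not anything from a real toolchain:

```python
import xml.etree.ElementTree as ET

MARCXML_NS = "http://www.loc.gov/MARC21/slim"
CHUNK_SIZE = 10000  # records per output chunk, as in the message

def iter_records(path):
    # Stream the file and yield one complete <record> element at a time,
    # clearing each element afterwards so memory stays roughly bounded.
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == f"{{{MARCXML_NS}}}record":
            yield elem
            elem.clear()

def write_chunks(path, prefix="chunk"):
    # Split a large MARC-XML collection into CHUNK_SIZE-record files
    # that can then be processed in parallel.
    buf, n_chunk = [], 0
    for record in iter_records(path):
        buf.append(ET.tostring(record, encoding="unicode"))
        if len(buf) == CHUNK_SIZE:
            _flush(buf, prefix, n_chunk)
            buf, n_chunk = [], n_chunk + 1
    if buf:
        _flush(buf, prefix, n_chunk)

def _flush(records, prefix, n):
    with open(f"{prefix}-{n:04d}.xml", "w", encoding="utf-8") as out:
        out.write(f'<collection xmlns="{MARCXML_NS}">\n')
        out.writelines(records)
        out.write("</collection>\n")
```

The record-level iteration is the same "get me a complete record, process at
the record level" pattern described above, just without hand-written SAX
callbacks.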

> 3. If you accept my assertions about streaming parsers, then dealing
> with
> the format you've proposed for large sets is either painful (with a
> streaming API) or impossible (where such an API doesn't exist) due to
> memory
> constraints.

Large datasets, period, are a pain to deal with.  I deal with them all day long 
and have to deal with tool issues, disk space, processing times, etc.
I don't disagree with you here in principle, but as I previously pointed out 
this is an API issue.

If your API never allows you to return a collection of more than 10 records 
which is less than 1MB, you are not dealing with large datasets.  If your API 
is returning a large collection of records that is 100MB or larger, then you 
have problems and need to rethink your API.

This is no different than a large MARC-XML collection.  The entire LC authority 
dataset, names and subjects, is 8GB of MARC-XML.  Do I process that as 8GB of 
MARC-XML, heck no!!  I break it up into smaller chunks and process the chunks.  
This allows me to take those chunks and run parallel algorithms on them or 
throw the chunks at our cluster and get the results back quicker.

It's the size of the data that is the crux of your argument not the format of 
the data, e.g., XML, JSON, CSV, etc.

> 4. Streaming JSON writer APIs are also painful; everything that applies
> to
> reading applies to writing. Sans a streaming writer, trying to *write*
> a
> large JSON document also results in you having to have the whole thing
> in
> memory.

No disagreement here.
 
> 5. People are going to want to deal with this format, because of its
> benefits over marc21 (record length) and marc-xml (ease of processing),
> which means we're going to want to deal with big sets of data and/or
> dump batches of it to a file. Which brings us back to #1, the pain or
> absence of streaming apis.

So we are back to the general argument that large datasets, regardless of 
format, are a pain to deal with, and that tool sets have issues dealing with 
large datasets.  I don't disagree with these statements and run into these 
issues on a daily basis whether dealing with MARC datasets or other large 
datasets.  A solution to this issue is to create batches of stuff that can be 
processed in parallel, ever heard of Google and map-reduce :)

> "Write a better JSON parser/writer" or "use a different language" seem like
> bad solutions to me, especially when a (potentially) useful alternative
> exists.

Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 4:38 PM, Houghton,Andrew  wrote:

>
> Maybe I have been misled or misunderstood JSON streaming.


This is my central point. I'm actually saying that JSON streaming is painful
and rare enough that it should be avoided as a requirement for working with
any new format.

I guess, in sum, I'm making the following assertions:

1. Streaming APIs for JSON, where they exist, are a pain in the ass. And
they don't exist everywhere. Without a JSON streaming parser, you have to
pull the whole array of documents up into memory, which may be impossible.
This is the crux of my argument -- if you disagree with it, then I would
assume you disagree with the other points as well.

2. Many people -- and I don't think I'm exaggerating here, honestly --
really don't like using MARC-XML but have to because of the length
restrictions on MARC-binary. A useful alternative, based on dead-easy
parsing and production, is very appealing.

2.5 Having to deal with a streaming API takes away the "dead-easy" part.

3. If you accept my assertions about streaming parsers, then dealing with
the format you've proposed for large sets is either painful (with a
streaming API) or impossible (where such an API doesn't exist) due to memory
constraints.

4. Streaming JSON writer APIs are also painful; everything that applies to
reading applies to writing. Sans a streaming writer, trying to *write* a
large JSON document also results in you having to have the whole thing in
memory.

5. People are going to want to deal with this format, because of its
benefits over marc21 (record length) and marc-xml (ease of processing),
which means we're going to want to deal with big sets of data and/or dump
batches of it to a file. Which brings us back to #1, the pain or absence of
streaming apis.

"Write a better JSON parser/writer"  or "use a different language" seem like
bad solutions to me, especially when a (potentially) useful alternative
exists.

As I pointed out, if streaming JSON is no harder/unavailable to you than
non-streaming json, then this is mostly moot. I assert that for many people
in this community it is one or the other, which is why I'm leery of it.

  -Bill-


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Benjamin Young
> Sent: Friday, March 05, 2010 04:24 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> For my part, I'd like to explore the options of putting MARC data into
> CouchDB (which stores documents as JSON) which could then open the door
> for replicating that data between any number of installations of
> CouchDB
> as well as providing for various output formats (marc-xml, etc).
> 
> It's just an idea, but it's one that uses JSON outside of the browser
> and is a good proof case for any MARC in JSON format.

This was partly the reason why I developed our MARC-JSON format since I'm using 
MongoDB [1] which is a NoSQL database based on JSON.


Andy.

[1] <http://www.mongodb.org/display/DOCS/Home>


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 03:45 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> I guess my concern here is that the specification, as you're describing
> it, is closing off potential uses.  It seems fine if, for example, your
> primary concern is javascript-in-the-browser, and browser-request,
> pagination-enabled systems might be all you're worried about right now.
> 
> That's not the whole universe of uses, though. People are going to want
> to dump these things into a file to read later -- no possibility for
> pagination in that situation.

I disagree that you couldn't dump a paginated result set into a file for 
reading later.  I do this all the time, not only in JavaScript but in many 
other programming languages.

> Others may, in fact, want to stream a few thousand
> records down the pipe at once, but without a streaming parser that
> can't happen if it's all one big array.

Well, if your service doesn't allow them to stream a few thousand records 
at a time, then that isn't an issue :)

Maybe I have been misled or misunderstood JSON streaming.  My understanding 
was that you can generate an arbitrarily large outgoing stream on the server side 
and can read an arbitrarily large incoming stream on the client side.  So it 
shouldn't matter if the result set was delivered as one big JSON array.  The 
SAX-like interface that JSON streaming uses provides the necessary events to 
allow you to pull the individual records from that arbitrarily large array.
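For what it's worth, that pull model can be imitated with nothing but a vanilla JSON library. This Python sketch (stdlib only; the function name is made up) walks one big top-level array record by record. Note the caveat: it still needs the full text in memory, which is exactly the limitation a real streaming parser avoids.

```python
import json

def records_from_array(text):
    """Yield the objects inside one big top-level JSON array, one at a
    time, without building the whole array as a Python list."""
    dec = json.JSONDecoder()
    i = text.index("[") + 1
    while True:
        # skip whitespace and the commas between records
        while i < len(text) and text[i] in " \t\r\n,":
            i += 1
        if i >= len(text) or text[i] == "]":
            return
        record, i = dec.raw_decode(text, i)
        yield record

big = '[{"tag": "001", "data": "fst01303409"}, {"tag": "003", "data": "OCoLC"}]'
print([rec["tag"] for rec in records_from_array(big)])  # ['001', '003']
```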

> I worry that as specified, the *only* use will be, "Pull these down a
> thin pipe, and if you want to keep them for later, or want a bunch of
> them, you have to deal with marc-xml."

Don't quite follow this.  MARC-XML is an XML format; MARC-JSON is our JSON 
format for expressing the various MARC-21 formats, e.g., authority, 
bibliographic, classification, community information and holdings, in JSON.  The 
JSON is based on the structure of MARC-XML, which was based on the structure of 
ISO 2709.  I don't see how MARC-XML comes into play when you are dealing with 
JSON.  If you want to save our MARC-JSON you don't have to convert it to 
MARC-XML on the client side.  Just save it as a text file.

> Part of my incentive is to *not* have to use marc-xml, but in this 
> case I'd just be trading one technology I don't like (marc-xml) 
> for two technologies, one of which I don't like (that'd be marc-xml 
> again).

Again not sure how to address this concern.  If you are dealing with library 
data, then its current communication formats are either MARC binary (ISO 2709) 
or MARC-XML, ignoring IFLA's MARC-XML-ish format for the moment.  You might not 
like it, but that is life in library land.  You can go develop your own formats 
based on the various MARC-21 format specifications, but you are unlikely to 
achieve any sort of interoperability with the existing library systems and services.

We chose our MARC-JSON to maintain the structural components of MARC-XML and 
hence MARC binary (ISO 2709).  In MARC, control fields have different semantics 
from data fields and you cannot merge them into one thing called field.  If you 
look closely at the MARC-XML schema, you might notice that the controlfield and 
datafield elements can have non-numeric tags.  If you merge everything into 
something called field, then you cannot distinguish between a non-numeric tag 
for a controlfield vs. a datafield element.  There are valid reasons why we 
decided to maintain the existing structure of MARC.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Benjamin Young

On 3/5/10 3:45 PM, Bill Dueber wrote:

On Fri, Mar 5, 2010 at 3:14 PM, Houghton,Andrew  wrote:


   

As you point out JSON streaming doesn't work with all clients and I am
hesitant to build on anything that all clients cannot accept.  I think part
of the issue here is proper API design.  Sending tens of megabytes back to a
client and expecting them to process it seems like a poor API design
regardless of whether they can stream it or not.  It might make more sense
to have a server API send back 10 of our MARC-JSON records in a JSON
collection and have the client request an additional batch of records for
the result set.  In addition, if I remember correctly, JSON streaming or
other streaming methods keep the connection to the server open which is not
a good thing to do to maintain server throughput.

 

I guess my concern here is that the specification, as you're describing it,
is closing off potential uses.  It seems fine if, for example, your primary
concern is javascript-in-the-browser, and browser-request,
pagination-enabled systems might be all you're worried about right now.

That's not the whole universe of uses, though. People are going to want to
dump these things into a file to read later -- no possibility for pagination
in that situation. Others may, in fact, want to stream a few thousand
records down the pipe at once, but without a streaming parser that can't
happen if it's all one big array.

I worry that as specified, the *only* use will be, "Pull these down a thin
pipe, and if you want to keep them for later, or want a bunch of them, you
have to deal with marc-xml." Part of my incentive is to *not* have to use
marc-xml, but in this case I'd just be trading one technology I don't like
(marc-xml) for two technologies, one of which I don't like (that'd be
marc-xml again).

I really do understand the desire to make this parallel to marc-xml, but
there's a seam between the two technologies that makes that a problematic
approach.
   
For my part, I'd like to explore the options of putting MARC data into 
CouchDB (which stores documents as JSON) which could then open the door 
for replicating that data between any number of installations of CouchDB 
as well as providing for various output formats (marc-xml, etc).


It's just an idea, but it's one that uses JSON outside of the browser 
and is a good proof case for any MARC in JSON format.


Thanks,
Benjamin

--
President
BigBlueHat
P: 864.232.9553
W: http://www.bigbluehat.com/
http://www.linkedin.com/in/benjaminyoung


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread LeVan,Ralph
> -Original Message-
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> 
> I really do understand the desire to make this parallel to marc-xml, but
> there's a seam between the two technologies that makes that a problematic
> approach.

As a confession, here in OCLC Research, we do pass around files of
marc-xml records that are newline delimited without a wrapper element
containing them.  We do that for all the reasons you gave for wanting
the same thing for JSON records.
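Since each line of such a file is a complete XML document on its own, a consumer needs no streaming XML machinery either. A rough Python sketch of reading a newline-delimited MARC-XML file (the two inline records are fabricated stand-ins):

```python
import xml.etree.ElementTree as ET
from io import StringIO

NS = "{http://www.loc.gov/MARC21/slim}"

# Two records, one per line, with no surrounding <collection> wrapper.
ndxml = (
    '<record xmlns="http://www.loc.gov/MARC21/slim"><leader>A</leader></record>\n'
    '<record xmlns="http://www.loc.gov/MARC21/slim"><leader>B</leader></record>\n'
)

leaders = []
for line in StringIO(ndxml):
    line = line.strip()
    if not line:
        continue
    rec = ET.fromstring(line)  # each line parses as a standalone document
    leaders.append(rec.find(NS + "leader").text)

print(leaders)  # ['A', 'B']
```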

Ralph


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 3:14 PM, Houghton,Andrew  wrote:


> As you point out JSON streaming doesn't work with all clients and I am
> hesitant to build on anything that all clients cannot accept.  I think part
> of the issue here is proper API design.  Sending tens of megabytes back to a
> client and expecting them to process it seems like a poor API design
> regardless of whether they can stream it or not.  It might make more sense
> to have a server API send back 10 of our MARC-JSON records in a JSON
> collection and have the client request an additional batch of records for
> the result set.  In addition, if I remember correctly, JSON streaming or
> other streaming methods keep the connection to the server open which is not
> a good thing to do to maintain server throughput.
>

I guess my concern here is that the specification, as you're describing it,
is closing off potential uses.  It seems fine if, for example, your primary
concern is javascript-in-the-browser, and browser-request,
pagination-enabled systems might be all you're worried about right now.

That's not the whole universe of uses, though. People are going to want to
dump these things into a file to read later -- no possibility for pagination
in that situation. Others may, in fact, want to stream a few thousand
records down the pipe at once, but without a streaming parser that can't
happen if it's all one big array.

I worry that as specified, the *only* use will be, "Pull these down a thin
pipe, and if you want to keep them for later, or want a bunch of them, you
have to deal with marc-xml." Part of my incentive is to *not* have to use
marc-xml, but in this case I'd just be trading one technology I don't like
(marc-xml) for two technologies, one of which I don't like (that'd be
marc-xml again).

I really do understand the desire to make this parallel to marc-xml, but
there's a seam between the two technologies that makes that a problematic
approach.



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Benjamin Young

On 3/5/10 2:46 PM, Ross Singer wrote:

On Fri, Mar 5, 2010 at 2:06 PM, Benjamin Young  wrote:

   

A CouchDB friend of mine just pointed me to the BibJSON format by the
Bibliographic Knowledge Network:
http://www.bibkn.org/bibjson/index.html

Might be worth looking through for future collaboration/transformation
options.
 

marc-json and BibJSON serve two different purposes:  marc-json would
need to be a loss-less serialization of a MARC record which may or may
not contain bibliographic data (it may be an authority, holding or CID
record, for example).  BibJSON is more of a merging of data model and
serialization (which, admittedly, is no stranger to MARC) for the
purpose of bibliographic /citations/.  So it will probably be lossy
and there would most likely be a lot of MARC data that is out of
scope.

That's not to say it wouldn't be useful to figure out how to get from
MARC->BibJSON, but from my perspective it's difficult to see the
advantage it brings (being tied to JSON) vs. BIBO.

-Ross.
   
Thanks for the clarification, Ross. I thought it would be helpful (if 
nothing else) to see how data was being mapped in a related domain into 
and out of JSON. I'm new to library data in general, so I appreciate the 
clarification on which format is for what.


Appreciated,
Benjamin


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ross Singer
> Sent: Friday, March 05, 2010 02:32 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> On Fri, Mar 5, 2010 at 1:10 PM, Houghton,Andrew 
> wrote:
> 
> > I certainly would be willing to work with LC on creating a MARC-JSON
> > specification as I did in creating the MARC-XML specification.
> 
> Quite frankly, I think I (and I imagine others) would much rather see
> a more open, RFC-style process to creating a marc-json spec than "I
> talked to LC and here you go".
> 
> Maybe I'm misreading this last paragraph a bit, however.

Yes, you misread the last paragraph.

Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Benjamin Young
> Sent: Friday, March 05, 2010 02:06 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> A CouchDB friend of mine just pointed me to the BibJSON format by the
> Bibliographic Knowledge Network:
> http://www.bibkn.org/bibjson/index.html
> 
> Might be worth looking through for future collaboration/transformation
> options.

Unfortunately, it doesn't really work for authority and classification data 
that I'm frequently involved with.

Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 01:59 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> On Fri, Mar 5, 2010 at 1:10 PM, Houghton,Andrew 
> wrote:
> 
> >
> > I decided to stick closer to a MARC-XML type definition since it would be
> > easier to explain how the two specifications are related, rather than take a
> > more radical approach in producing a specification less familiar.  Not to
> > say that other approaches are bad, they just have different advantages and
> > disadvantages.  I was going for simple and familiar.
> 
> That makes sense, but please consider adding a format/version (which we get
> in MARC-XML from the namespace and isn't present here). In fact, please
> consider adding a format / version / URI, so people know what they've got.

This sounds reasonable and I'll consider adding it into our specification.

> I'm also going to again push the newline-delimited-json stuff. The
> collection-as-array is simple and very clean, but leads to trouble
> for production (where for most of us we'd have to get the whole
> freakin' collection in memory first ...

As far as our MARC-JSON specification is concerned, a server application can 
return either a collection or a record, which mimics the MARC-XML specification, 
where either the collection or record element can be used as the document element.

> Unless, of course, writing json to a stream and reading json from a stream
> is a lot easier than I make it out to be across a variety of languages and I
> just don't know it, which is entirely possible. The streaming writer
> interfaces for Perl (
> http://search.cpan.org/dist/JSON-Streaming-Writer/lib/JSON/Streaming/Writer.pm)
> and Java's Jackson (
> http://wiki.fasterxml.com/JacksonInFiveMinutes#Streaming_API_Example)
> are a little more daunting than I'd like them to be.

As you point out JSON streaming doesn't work with all clients and I am hesitant 
to build on anything that all clients cannot accept.  I think part of the issue 
here is proper API design.  Sending tens of megabytes back to a client and 
expecting them to process it seems like a poor API design regardless of whether 
they can stream it or not.  It might make more sense to have a server API send 
back 10 of our MARC-JSON records in a JSON collection and have the client 
request an additional batch of records for the result set.  In addition, if I 
remember correctly, JSON streaming or other streaming methods keep the 
connection to the server open which is not a good thing to do to maintain 
server throughput.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Ross Singer
On Fri, Mar 5, 2010 at 2:06 PM, Benjamin Young  wrote:

> A CouchDB friend of mine just pointed me to the BibJSON format by the
> Bibliographic Knowledge Network:
> http://www.bibkn.org/bibjson/index.html
>
> Might be worth looking through for future collaboration/transformation
> options.

marc-json and BibJSON serve two different purposes:  marc-json would
need to be a loss-less serialization of a MARC record which may or may
not contain bibliographic data (it may be an authority, holding or CID
record, for example).  BibJSON is more of a merging of data model and
serialization (which, admittedly, is no stranger to MARC) for the
purpose of bibliographic /citations/.  So it will probably be lossy
and there would most likely be a lot of MARC data that is out of
scope.

That's not to say it wouldn't be useful to figure out how to get from
MARC->BibJSON, but from my perspective it's difficult to see the
advantage it brings (being tied to JSON) vs. BIBO.

-Ross.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Ross Singer
On Fri, Mar 5, 2010 at 1:10 PM, Houghton,Andrew  wrote:

> I certainly would be willing to work with LC on creating a MARC-JSON 
> specification as I did in creating the MARC-XML specification.

Quite frankly, I think I (and I imagine others) would much rather see
a more open, RFC-style process to creating a marc-json spec than "I
talked to LC and here you go".

Maybe I'm misreading this last paragraph a bit, however.

-Ross.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Benjamin Young

On 3/5/10 1:10 PM, Houghton,Andrew wrote:

From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Bill Dueber
Sent: Friday, March 05, 2010 12:30 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Q: XML2JSON converter

On Fri, Mar 5, 2010 at 12:01 PM, Houghton,Andrew wrote:

Too bad I didn't attend code4lib.  OCLC Research has created a version of
MARC in JSON and will probably release FAST concepts in MARC binary,
MARC-XML and our MARC-JSON format among other formats.  I'm wondering
whether there is some consensus that can be reached and standardized at LC's
level, just like OCLC, RLG and LC came to consensus on MARC-XML.
Unfortunately, I have not had the time to document the format, although it is
fairly straightforward, and yes we have an XSLT to convert from MARC-XML to
MARC-JSON.  Basically the format I'm using is:

The stuff I've been doing:

   http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/

... is pretty much the same, except:

I decided to stick closer to a MARC-XML type definition since it would be 
easier to explain how the two specifications are related, rather than take a 
more radical approach in producing a specification less familiar.  Not to say 
that other approaches are bad, they just have different advantages and 
disadvantages.  I was going for simple and familiar.

I certainly would be willing to work with LC on creating a MARC-JSON 
specification as I did in creating the MARC-XML specification.


Andy.
   
A CouchDB friend of mine just pointed me to the BibJSON format by the 
Bibliographic Knowledge Network:

http://www.bibkn.org/bibjson/index.html

Might be worth looking through for future collaboration/transformation 
options.


Benjamin


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 1:10 PM, Houghton,Andrew  wrote:

>
> I decided to stick closer to a MARC-XML type definition since it would be
> easier to explain how the two specifications are related, rather than take a
> more radical approach in producing a specification less familiar.  Not to
> say that other approaches are bad, they just have different advantages and
> disadvantages.  I was going for simple and familiar.
>
>
That makes sense, but please consider adding a format/version (which we get
in MARC-XML from the namespace and isn't present here). In fact, please
consider adding a format / version / URI, so people know what they've got.

I'm also going to again push the newline-delimited-json stuff. The
collection-as-array is simple and very clean, but leads to trouble
for production (where for most of us we'd have to get the whole freakin'
collection in memory first and then call JSON.dump or whatever)
or consumption (have to deal with a streaming json parser). The production
part is particularly worrisome, since I'd hate for everyone to have to
default to writing out a '[', looping through the records, and writing a
']'. Yeah, it's easy enough, but it's an ugly hack that *everyone* would
have to do, as opposed to just something like:

  while (r = nextRecord) {
 print r.to_json, "\n"
  }

Unless, of course, writing json to a stream and reading json from a stream
is a lot easier than I make it out to be across a variety of languages and I
just don't know it, which is entirely possible. The streaming writer
interfaces for Perl (
http://search.cpan.org/dist/JSON-Streaming-Writer/lib/JSON/Streaming/Writer.pm)
and Java's Jackson (
http://wiki.fasterxml.com/JacksonInFiveMinutes#Streaming_API_Example) are a
little more daunting than I'd like them to be.
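The contrast Bill is drawing is easy to see in any language. A Python sketch, with an illustrative record shape that is not anyone's published spec:

```python
import json
from io import StringIO

records = [
    {"type": "marc-hash", "version": [1, 0], "fields": [["001", "fst01303409"]]},
    {"type": "marc-hash", "version": [1, 0], "fields": [["001", "fst01303410"]]},
]

# Writing NDJ: one dumps() call per record -- no '[' / ',' / ']' bookkeeping,
# and never more than one record in memory at a time.
buf = StringIO()
for rec in records:
    buf.write(json.dumps(rec) + "\n")

# Reading NDJ: a completely ordinary, non-streaming parser, one line at a time.
buf.seek(0)
parsed = [json.loads(line) for line in buf if line.strip()]
print(len(parsed))  # 2
```

Swapping `StringIO` for a real file handle is the whole difference between this sketch and a production dump-and-load loop.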

Not wanting to argue unnecessarily here; just adding input before things
get effectively set in stone.

 -Bill-

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 12:30 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> On Fri, Mar 5, 2010 at 12:01 PM, Houghton,Andrew 
> wrote:
> 
> > Too bad I didn't attend code4lib.  OCLC Research has created a version of
> > MARC in JSON and will probably release FAST concepts in MARC binary,
> > MARC-XML and our MARC-JSON format among other formats.  I'm wondering
> > whether there is some consensus that can be reached and standardized at LC's
> > level, just like OCLC, RLG and LC came to consensus on MARC-XML.
> > Unfortunately, I have not had the time to document the format, although it
> > fairly straight forward, and yes we have an XSLT to convert from MARC-XML to
> > MARC-JSON.  Basically the format I'm using is:
> >
> >
> The stuff I've been doing:
> 
>   http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
> 
> ... is pretty much the same, except:

I decided to stick closer to a MARC-XML type definition since it would be 
easier to explain how the two specifications are related, rather than take a 
more radical approach in producing a specification less familiar.  Not to say 
that other approaches are bad, they just have different advantages and 
disadvantages.  I was going for simple and familiar.

I certainly would be willing to work with LC on creating a MARC-JSON 
specification as I did in creating the MARC-XML specification.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 12:01 PM, Houghton,Andrew  wrote:

> Too bad I didn't attend code4lib.  OCLC Research has created a version of
> MARC in JSON and will probably release FAST concepts in MARC binary,
> MARC-XML and our MARC-JSON format among other formats.  I'm wondering
> whether there is some consensus that can be reached and standardized at LC's
> level, just like OCLC, RLG and LC came to consensus on MARC-XML.
>  Unfortunately, I have not had the time to document the format, although it
> fairly straight forward, and yes we have an XSLT to convert from MARC-XML to
> MARC-JSON.  Basically the format I'm using is:
>
>
The stuff I've been doing:

  http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/

... is pretty much the same, except:

  1. I don't explicitly split up control and data fields. There's a single
field list; an item that has two elements is a control field (tag/data); one
with four is a data field (tag / ind1 / ind2 / array_of_subfield)

  2. Instead of putting a collection in a big json array, I use
newline-delimited-json (basically, just stick one record on each line as a
single json hash). This has the advantage that it makes streaming much, much
easier, and makes doing some other things (e.g., grab the first record or
two) much cheaper for even the dumbest json parser. I'm not sure what the
state of JSON streaming parsers are; I know Jackson (for Java) can do it,
and perl's JSON::XS can...kind of...but it's not great.

3. I include a "type" (MARC-JSON, MARC-HASH, whatever) and version: [major,
minor] in each record. There's already a ton of JSON floating around the
library world; labeling what the heck a structure is is just friendly :-)
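Under points 1 and 3 above, telling the two field kinds apart needs no explicit labels at all. A sketch of one plausible reading of that convention in Python; the exact key names here are my guesses, not a published spec:

```python
record = {
    "type": "MARC-HASH",
    "version": [1, 0],
    "leader": "01192cz  a2200301n  4500",
    "fields": [
        ["001", "fst01303409"],                              # 2 items: control field
        ["040", " ", " ", [["a", "OCoLC"], ["b", "eng"]]],   # 4 items: data field
    ],
}

# No labels needed: the element count says what kind of field each entry is.
control = [f for f in record["fields"] if len(f) == 2]
data = [f for f in record["fields"] if len(f) == 4]

print(len(control), len(data))  # 1 1
```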

MARC's structure is dumb enough that we collectively basically can't miss;
there's only so much you can do with the stuff, and a round-trip to JSON and
back is easy to implement.

I'm not super-against explicitly labeling the data elements (tag:, ind1:,
etc.) but I don't see where it's necessary unless you're planning on adding
out-of-band data to the records/fields/subfields at some point. Which might
be kinda cool (e.g., language hints on a per-subfield basis? Tokenization
hints for non-whitespace-delimited languages? URIs for unique concepts and
authorities where they exist for easy creation of RDF?)

I *am*, however, willing to push and push and push for NDJ instead of having
to deal with streaming JSON parsing, which to my limited understanding is
hard to get right and to my more qualified understanding is a pain in the
ass to work with.

And anything we do should explicitly be UTF-8 only; converting from MARC-8
is a problem for the server, not the receiver.

Support for what I've been calling marc-hash (I like to decouple it from the
eventual JSON format in case the serialization preferences change, or at
least so implementations don't get stuck with a single JSON library) is
already baked into ruby-marc, and obviously implementations are dead-easy no
matter what the underlying language is.

Anyone from the LoC want to get in on this?

 -Bill-




> [
>  ...
> ]
>
> which represents a collection of MARC records or
>
> {
>  ...
> }
>
> which represents a single MARC records that takes the form:
>
> {
>  leader : "01192cz  a2200301n  4500",
>  controlfield :
>  [
>{ tag : "001", data : "fst01303409" },
>{ tag : "003", data : "OCoLC" },
>{ tag : "005", data : "20100202194747.3" },
>{ tag : "008", data : "060620nn anznnbabn  || ana d" }
>  ],
>  datafield :
>  [
>{
>  tag : "040",
>  ind1 : " ",
>  ind2 : " ",
>  subfield :
>  [
>{ code : "a", data : "OCoLC" },
>{ code : "b", data : "eng" },
>{ code : "c", data : "OCoLC" },
>{ code : "d", data : "OCoLC-O" },
>{ code : "f", data : "fast" },
>  ]
>},
>{
>  tag : "151",
>  ind1 : " ",
>  ind2 : " ",
>  subfield :
>  [
>{ code : "a", data : "Hawaii" },
>{ code : "z", data : "Diamond Head" }
>  ]
>}
>  ]
> }
>



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Benjamin Young
> Sent: Friday, March 05, 2010 09:26 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> If you're looking at putting MARC into JSON, there was some discussion
> of that during code4lib 2010. Jonathan Rochkind, who was at code4lib
> 2010, blogged about marc-json recently:
> http://bibwild.wordpress.com/2010/03/03/marc-json/
> He references a project that Bill Dueber's been playing with for a
> year:
> http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
> 
> All told, there's growing momentum for a MARC in JSON format to be
> created, so you might jump in there.

Too bad I didn't attend code4lib.  OCLC Research has created a version of MARC 
in JSON and will probably release FAST concepts in MARC binary, MARC-XML and 
our MARC-JSON format among other formats.  I'm wondering whether there is some 
consensus that can be reached and standardized at LC's level, just like OCLC, 
RLG and LC came to consensus on MARC-XML.  Unfortunately, I have not had the 
time to document the format, although it is fairly straightforward, and yes we 
have an XSLT to convert from MARC-XML to MARC-JSON.  Basically the format I'm 
using is:

[
  ...
]

which represents a collection of MARC records or 

{
  ...
}

which represents a single MARC records that takes the form:

{
  leader : "01192cz  a2200301n  4500",
  controlfield :
  [
{ tag : "001", data : "fst01303409" },
{ tag : "003", data : "OCoLC" },
{ tag : "005", data : "20100202194747.3" },
{ tag : "008", data : "060620nn anznnbabn  || ana d" }
  ],
  datafield :
  [
{
  tag : "040",
  ind1 : " ",
  ind2 : " ",
  subfield :
  [
{ code : "a", data : "OCoLC" },
{ code : "b", data : "eng" },
{ code : "c", data : "OCoLC" },
{ code : "d", data : "OCoLC-O" },
{ code : "f", data : "fast" },
  ]
},
{
  tag : "151",
  ind1 : " ",
  ind2 : " ",
  subfield :
  [
{ code : "a", data : "Hawaii" },
{ code : "z", data : "Diamond Head" }
  ]
}
  ]
}
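One nit on the example above: the keys are written JavaScript-style, and strict JSON parsers require them quoted. With that adjusted, pulling data out of this structure is trivial in most languages. A hedged Python illustration (`subfield_values` is made up for this sketch, not part of any spec):

```python
import json

marc_json = """
{
  "leader": "01192cz  a2200301n  4500",
  "controlfield": [
    {"tag": "001", "data": "fst01303409"},
    {"tag": "003", "data": "OCoLC"}
  ],
  "datafield": [
    {"tag": "151", "ind1": " ", "ind2": " ",
     "subfield": [
       {"code": "a", "data": "Hawaii"},
       {"code": "z", "data": "Diamond Head"}
     ]}
  ]
}
"""

record = json.loads(marc_json)

def subfield_values(record, tag, code):
    """Collect the data of every subfield matching a tag/code pair."""
    return [sf["data"]
            for df in record.get("datafield", [])
            if df["tag"] == tag
            for sf in df.get("subfield", [])
            if sf["code"] == code]

print(subfield_values(record, "151", "a"))  # ['Hawaii']
```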


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Jay Luker
If PHP/Python isn't a hard requirement, I think this would be fairly
simple to do in perl using a combination of the XML::Simple [1] and
JSON::XS [2] modules.

In fact it's so simple, here's the code:

"""
#!/usr/bin/perl

use strict;
use warnings;

use JSON::XS;
use XML::Simple;

my $filename = shift @ARGV;
my $parsed = XMLin($filename);
my $json = encode_json($parsed);
print $json, "\n";
"""

XML::Simple, in spite of the name, actually allows for a myriad of
options for how the perl data structure gets created from the xml,
including attribute preservation, grouping of elements, etc.

--jay

[1] http://search.cpan.org/~grantm/XML-Simple-2.18/lib/XML/Simple.pm
[2] http://search.cpan.org/~makamaka/JSON-2.17/lib/JSON.pm

On Fri, Mar 5, 2010 at 9:55 AM, Joe Hourcle
 wrote:
> On Fri, 5 Mar 2010, Godmar Back wrote:
>
>> On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer
>> wrote:
>>
>>> Hi,
>>> try this: http://code.google.com/p/xml2json-xslt/
>>>
>>>
>> I should have mentioned that I already tried everything I could find after
>> googling - this stylesheet doesn't meet the requirements, not by far. It
>> drops attributes just like simplexml_json does.
>>
>> The one thing I didn't try is a program called 'BadgerFish.php' which I
>> couldn't locate - Google once indexed it at badgerfish.ning.com
>
>        http://web.archive.org/web/20080216200903/http://badgerfish.ning.com/
>
>  http://web.archive.org/web/20071013052842/badgerfish.ning.com/file.php?format=src&path=lib/BadgerFish.php
>
> -Joe
>


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Joe Hourcle

On Fri, 5 Mar 2010, Godmar Back wrote:


On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer wrote:


Hi,
try this: http://code.google.com/p/xml2json-xslt/



I should have mentioned that I already tried everything I could find after
googling - this stylesheet doesn't meet the requirements, not by far. It
drops attributes just like simplexml_json does.

The one thing I didn't try is a program called 'BadgerFish.php' which I
couldn't locate - Google once indexed it at badgerfish.ning.com


http://web.archive.org/web/20080216200903/http://badgerfish.ning.com/

http://web.archive.org/web/20071013052842/badgerfish.ning.com/file.php?format=src&path=lib/BadgerFish.php

-Joe


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Godmar Back
Thanks for the Internet Archive pointer. I hadn't thought of it (probably
because of a few past unsuccessful attempts to find archived pages).

I tried BadgerFish (
http://libx.lib.vt.edu/services/code4lib/lccnrelay3/2004022563 which proxies
lccn.loc.gov's marcxml) and it meets the requirement of faithfully
reproducing the XML, albeit in a very verbose way that makes no attempt
at minimization.
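For anyone who wants to see the convention without digging up the archived PHP, here is a minimal Python sketch of a BadgerFish-style mapping as I understand the published rules (attributes become "@"-prefixed keys, text content becomes "$", repeated elements collapse into arrays; namespace handling is omitted):

```python
import json
import xml.etree.ElementTree as ET

def badgerfish(elem):
    """Map one element to a dict following the BadgerFish rules."""
    out = {"@" + name: value for name, value in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        out["$"] = text
    for child in elem:
        value = badgerfish(child)
        if child.tag in out:
            # Repeated elements are collected into an array.
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    return out

xml = ('<datafield tag="151" ind1=" " ind2=" ">'
       '<subfield code="a">Hawaii</subfield>'
       '<subfield code="z">Diamond Head</subfield>'
       '</datafield>')
root = ET.fromstring(xml)
doc = {root.tag: badgerfish(root)}
print(json.dumps(doc))
```

The verbosity Godmar mentions is visible even in this tiny fragment: every attribute and text node gets its own key, so the JSON is noticeably larger than the MARCXML it mirrors.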

That leaves, indeed, two independent problems:

a) a free converter to GData's JSON format, or to another convention less
redundant than BadgerFish. Having looked at this for only a second, I wonder
whether it is even possible to implement without knowing the schema of the
XML document: the convention says, for instance, to use arrays [] for
elements that may occur more than once.
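The array problem can be made concrete. A naive converter emits an object when an element happens to occur once and an array when it repeats, so the output shape depends on the instance document rather than the schema; the usual workaround is a hand-maintained set of element names that must always serialize as arrays. A sketch of that idea (the `ALWAYS_ARRAY` set below is hypothetical, schema-derived knowledge, not part of any real converter):

```python
import xml.etree.ElementTree as ET

# Hypothetical, schema-derived knowledge: these elements may repeat,
# so they must always serialize as arrays even when only one occurs.
ALWAYS_ARRAY = {"controlfield", "datafield", "subfield"}

def to_dict(elem):
    out = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        out["$t"] = text            # GData-style key for text content
    for child in elem:
        value = to_dict(child)
        if child.tag in ALWAYS_ARRAY:
            out.setdefault(child.tag, []).append(value)
        elif child.tag in out:
            # A repeat the schema hint didn't predict: the shape would drift.
            raise ValueError("schema knowledge missing for " + child.tag)
        else:
            out[child.tag] = value
    return out

one = ET.fromstring('<record><datafield tag="151"/></record>')
two = ET.fromstring('<record><datafield tag="040"/><datafield tag="151"/></record>')
# With the schema hint, both instances produce the same shape: a list.
print(type(to_dict(one)["datafield"]), type(to_dict(two)["datafield"]))
```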

b) something MARC-specific to express MARC records in JSON. I talked to
Nathan Trail from LOC at code4lib, and they're revamping their lccn server
this year to scale up and also serve more formats. Presumably, this effort
could lead to a de-facto standard of how to serve MARC in JSON.

Thinking out loud about this for a minute, I wonder whether part a) is
really a worthwhile goal. Aside from impromptu prototyping of an
XML-to-JSON gateway, I don't see any production use for an XML-to-JSON
converter that is agnostic to the schema, if only for performance reasons.

 - Godmar


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Mark Mounts

have you tried this?

http://www.bramstein.com/projects/xsltjson/
http://github.com/bramstein/xsltjson

Using the parameter use-rayfish=true seems to preserve everything but
namespaces, and there is a parameter to preserve namespaces as well.
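For reference, my reading of the rayfish convention that xsltjson implements is that every node becomes a uniform three-key object, with attributes modeled as child nodes whose names start with "@". A rough Python sketch of that mapping (details hedged; consult the xsltjson docs for the authoritative rules):

```python
import json
import xml.etree.ElementTree as ET

def rayfish(elem):
    # Every node becomes {"#name": ..., "#text": ..., "#children": [...]};
    # attributes are modeled as children whose names start with "@".
    children = [{"#name": "@" + name, "#text": value, "#children": []}
                for name, value in elem.attrib.items()]
    children += [rayfish(child) for child in elem]
    return {"#name": elem.tag,
            "#text": (elem.text or "").strip(),
            "#children": children}

root = ET.fromstring('<subfield code="a">Hawaii</subfield>')
print(json.dumps(rayfish(root)))
```

The uniform shape is what makes rayfish lossless: nothing from the XML has to be discarded to fit the JSON structure.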


Mark

On 3/5/2010 12:54 AM, Godmar Back wrote:

Hi,

Can anybody recommend an open source XML2JSON converter in PhP or
Python (or potentially other languages, including XSLT stylesheets)?

Ideally, it should implement one of the common JSON conventions, such
as Google's JSON convention for GData [1], but anything that preserves
all elements, attributes, and text content of the XML file would be
acceptable.

Note that json_encode(simplexml_load_file(...)) does not meet this
requirement - in fact, nothing based on simplexml_load_file() will.
(It can't even load MarcXML correctly).

Thanks!

  - Godmar

[1] http://code.google.com/apis/gdata/docs/json.html


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Cary Gordon
You can find it here, although I wouldn't get too excited: http://bit.ly/acROxH

You could also fish for more info by badgering its creator at
http://www.sklar.com/page/section/contact.

Cary

On Fri, Mar 5, 2010 at 5:15 AM, Godmar Back  wrote:
> On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer 
> wrote:
>
>> Hi,
>> try this: http://code.google.com/p/xml2json-xslt/
>>
>>
> I should have mentioned that I already tried everything I could find after
> googling - this stylesheet doesn't meet the requirements, not by far. It
> drops attributes just like simplexml_json does.
>
> The one thing I didn't try is a program called 'BadgerFish.php' which I
> couldn't locate - Google once indexed it at badgerfish.ning.com
>
>  - Godmar
>



-- 
Cary Gordon
The Cherry Hill Company
http://chillco.com


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Benjamin Young

On 3/5/10 8:15 AM, Godmar Back wrote:

On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer wrote:

Hi,
try this: http://code.google.com/p/xml2json-xslt/


I should have mentioned that I already tried everything I could find after
googling - this stylesheet doesn't meet the requirements, not by far. It
drops attributes just like simplexml_json does.

The one thing I didn't try is a program called 'BadgerFish.php' which I
couldn't locate - Google once indexed it at badgerfish.ning.com

  - Godmar

Godmar,

I'd be interested in collaborating with you on creating one. I'd bounced 
this question off the CouchDB IRC channel a while back, and the summary 
was that you'd generally create a JSON structure for your document and 
then write the code to map the XML to JSON. However, I do think 
something more "generic" like Google's GData-to-JSON convention would fit 
the bill for most use cases... sadly, it doesn't seem they've made the 
conversion code available.


If you're looking at putting MARC into JSON, there was some discussion 
of that during code4lib 2010. Jonathan Rochkind, who was at code4lib 
2010, blogged about marc-json recently:

http://bibwild.wordpress.com/2010/03/03/marc-json/
He references a project that Bill Dueber's been playing with for a year:
http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/

All told, there's growing momentum for a MARC in JSON format to be 
created, so you might jump in there.


Additionally, I'd love to find a project building code to do what 
Google's done with the GData to JSON format. If you find one, I'd enjoy 
seeing it.


Thanks, Godmar,
Benjamin

--
President
BigBlueHat
P: 864.232.9553
W: http://www.bigbluehat.com/
http://www.linkedin.com/in/benjaminyoung


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Kevin S. Clarke
Internet Archive seems to have a copy of that:

http://web.archive.org/web/20071013052842/badgerfish.ning.com/file.php?format=src&path=lib/BadgerFish.php

as well as several versions of the site:

http://web.archive.org/web/*/http://badgerfish.ning.com

Kevin



On Fri, Mar 5, 2010 at 8:15 AM, Godmar Back  wrote:
> On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer 
> wrote:
>
>> Hi,
>> try this: http://code.google.com/p/xml2json-xslt/
>>
>>
> I should have mentioned that I already tried everything I could find after
> googling - this stylesheet doesn't meet the requirements, not by far. It
> drops attributes just like simplexml_json does.
>
> The one thing I didn't try is a program called 'BadgerFish.php' which I
> couldn't locate - Google once indexed it at badgerfish.ning.com
>
>  - Godmar
>


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Godmar Back
On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer wrote:

> Hi,
> try this: http://code.google.com/p/xml2json-xslt/
>
>
I should have mentioned that I already tried everything I could find after
googling - this stylesheet doesn't meet the requirements, not by far. It
drops attributes just like simplexml_json does.

The one thing I didn't try is a program called 'BadgerFish.php' which I
couldn't locate - Google once indexed it at badgerfish.ning.com

 - Godmar


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Ulrich Schaefer

Godmar Back wrote:

Hi,

Can anybody recommend an open source XML2JSON converter in PhP or
Python (or potentially other languages, including XSLT stylesheets)?

Ideally, it should implement one of the common JSON conventions, such
as Google's JSON convention for GData [1], but anything that preserves
all elements, attributes, and text content of the XML file would be
acceptable.

Note that json_encode(simplexml_load_file(...)) does not meet this
requirement - in fact, nothing based on simplexml_load_file() will.
(It can't even load MarcXML correctly).

Thanks!

 - Godmar

[1] http://code.google.com/apis/gdata/docs/json.html

Hi,
try this: http://code.google.com/p/xml2json-xslt/

best,
Ulrich

--
Dr.-Ing. Ulrich Schaefer http://dfki.de/~uschaefer phone:+496813025154
   DFKI Language Technology Lab, D-66123 Saarbruecken, Germany
---
  Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
  Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
(Vorsitzender), Dr. Walter Olthoff. Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes. Amtsgericht Kaiserslautern, HRB 2313