Re: [CODE4LIB] Q: XML2JSON converter
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Saturday, March 06, 2010 05:11 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
>
> Anyway, hopefully, it won't be a huge surprise that I don't disagree
> with any of the quote above in general; I would assert, though, that
> application/json and application/marc+json should both return JSON
> (in the same way that text/xml, application/xml, and
> application/marc+xml can all be expected to return XML).
> Newline-delimited json is starting to crop up in a few places
> (e.g. couchdb) and should probably have its own mime type
> and associated extension. So I would say something like:
>
> application/json -- return json (obviously)
> application/marc+json -- return json
> application/marc+ndj -- return newline-delimited json

This sounds like consensus on how to deal with newline-delimited JSON in a standards-based manner. I'm not familiar with CouchDB, but I am using MongoDB, which is similar. I'll have to dig into how they deal with this newline-delimited JSON. Can you provide any references to get me started?

> In all cases, we should agree on a standard record serialization,
> though, and the pure-json returns should include something that
> indicates what the heck it is (hopefully a URI that can act as a
> distinct "namespace"-type identifier, including a version in it).

I agree that our MARC-JSON serialization needs some "namespace" identifier in it, and it occurred to me that the way it handles indicators, i.e., the ind1 and ind2 properties, might be better handled as an array to accommodate IFLA's MARC-XML-ish schema, where records can have from 1-9 indicator values. BTW, our MARC-JSON content is specified in Unicode, not MARC-8, per the JSON standard, which means you need to use \u notation to specify characters in strings; I'm not sure I made that clear in earlier posts.
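Andy's indicator-as-array idea and the \u escaping rule can be sketched together. The record shape below is purely illustrative (there is no settled MARC-JSON spec implied; field and property names are hypothetical):

```python
import json

# Hypothetical MARC-JSON record shape (illustrative, not a ratified spec):
# "indicators" is an array rather than ind1/ind2 properties, so profiles
# with more than two indicator values can be accommodated.
record = {
    "leader": "00714cam a2200205 a 4500",
    "fields": [
        {"tag": "245", "indicators": ["1", "0"],
         "subfields": [{"code": "a", "value": "Le d\u00e9but"}]}
    ],
}

# Per the JSON spec, non-ASCII characters may be written with \u escapes;
# json.dumps does exactly this by default (ensure_ascii=True).
text = json.dumps(record)
assert "\\u00e9" in text

# Round-tripping restores the original Unicode character.
assert json.loads(text)["fields"][0]["subfields"][0]["value"] == "Le début"
```

The point of the sketch is only that escaping is the serializer's problem, not the schema's: any conforming JSON writer produces \u notation for you.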
A downside to the current ECMA 262 specification is that it doesn't support \U00XX notation, as Python does, for the extended characters; characters outside the Basic Multilingual Plane have to be written as UTF-16 surrogate pairs instead. Hopefully that will get rectified in a future ECMA 262 specification.

> The question for me, I think, is whether within this community, anyone
> who provides one of these types (application/marc+json and
> application/marc+ndj) should automatically be expected to provide both.
> I don't have an answer for that.

I think this issue gets into familiar territory when dealing with RDF formats. Let's see, there is N3, NT, XML, Turtle, etc. Do you need to provide all of them? No, but it's nice of the server to at least provide NT or Turtle and XML. Ultimately it's up to the server. But the only difference between use cases #2 and #3 is whether the output is wrapped in an array, so it's probably easy for the server to produce both.

Depending on how much time I get next week, I'll talk with the developer network folks to see what I need to do to put a specification under their infrastructure. Looks like from my schedule it's going to be another week of hell :(

Andy.
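The \U limitation mentioned above is easy to demonstrate: JSON's \u escape carries only four hex digits, so a character beyond the Basic Multilingual Plane comes out as a surrogate pair. A sketch in Python (whose own string literals do support the 8-digit \U notation):

```python
import json

# U+1D11E (musical symbol G clef) lies outside the Basic Multilingual
# Plane, so it cannot be written as a single 4-digit \u escape.
clef = "\U0001D11E"

# json.dumps emits it as a UTF-16 surrogate pair of two \u escapes.
encoded = json.dumps(clef)
assert encoded == '"\\ud834\\udd1e"'

# Any conforming JSON parser recombines the pair on the way back in.
assert json.loads(encoded) == clef
```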
Re: [CODE4LIB] Code4Lib Midwest?
I think maybe I started the chatter in IRC, but now it looks as if I should redefine my region. Being in KC, I don't see myself or other MO/KS folks driving to Ohio and back in a day or weekend. I could be wrong. Thoughts? Any interest in a Code4Lib-Big-12-ish?

Kudos to Jonathan for running with this. I'm a drummer too, BTW...in the event of a C4L11 Battle of the Bands.

Jason
--
Jason Stirnaman
Biomedical Librarian, Digital Projects
A.R. Dykes Library, University of Kansas Medical Center
jstirna...@kumc.edu
913-588-7319

>>> On 3/5/2010 at 9:37 AM, in message , Bill Dueber wrote:
> I'm pretty sure I could make it from Ann Arbor!
>
> On Fri, Mar 5, 2010 at 10:12 AM, Ken Irwin wrote:
>
>> I would come from Ohio to wherever we choose. Kalamazoo would suit me just
>> fine; I've not been back there in entirely too long!
>> Ken
>>
>> > -----Original Message-----
>> > From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
>> > Scott Garrison
>> > Sent: Friday, March 05, 2010 8:37 AM
>> > To: CODE4LIB@LISTSERV.ND.EDU
>> > Subject: Re: [CODE4LIB] Code4Lib Midwest?
>> >
>> > +1
>> >
>> > ELM, I'm happy to help coordinate in whatever way you need.
>> >
>> > Also, if we can find a drummer, we could do a blues trio (count me in
>> > on bass). I could bring our band's drummer (a HUGE ND fan) down for a
>> > day or two if needed--he's awesome.
>> >
>> > --SG
>> > WMU in Kalamazoo
>> >
>> > ----- Original Message -----
>> > From: "Eric Lease Morgan"
>> > To: CODE4LIB@LISTSERV.ND.EDU
>> > Sent: Thursday, March 4, 2010 4:38:53 PM
>> > Subject: Re: [CODE4LIB] Code4Lib Midwest?
>> >
>> > On Mar 4, 2010, at 3:29 PM, Jonathan Brinley wrote:
>> >
>> > >> 2. share demonstrations
>> > >
>> > > I'd like to see this be something like a blend between lightning talks
>> > > and the ask anything session at the last conference
>> >
>> > This certainly works for me, and the length of time of each "talk"
>> > would/could be directly proportional to the number of people who attend.
>> > >> 4. give a presentation to library staff
>> > >
>> > > What sort of presentation did you have in mind, Eric?
>> > >
>> > > This also raises the issue of weekday vs. weekend. I'm game for
>> > > either. Anyone else have a preference?
>> >
>> > What I was thinking here was a possible presentation to library
>> > faculty/staff and/or computing faculty/staff from across campus. The
>> > presentation could be one or two cool hacks or solutions that solved
>> > wider, less geeky problems. Instead of "tweaking Solr's term-weighting
>> > algorithms to index OAI-harvested content" it would be "making journal
>> > articles easier to find". This would be an opportunity to show off the
>> > good work done by institutions outside Notre Dame. A prophet in their
>> > own land is not as convincing as the expert from afar.
>> >
>> > I was thinking it would happen on a weekday. There would be more stuff
>> > going on here on campus, as well as give everybody a break from their
>> > normal work week. More specifically, I would suggest such an event take
>> > place on a Friday so the people who stayed overnight would not have to
>> > take so many days off of work.
>> >
>> > >> 5. have a hack session
>> > >
>> > > It would be good to have 2 or 3 projects we can/should work on decided
>> > > ahead of time (in case no one has any good ideas at the time), and
>> > > perhaps a couple more inspired by the earlier presentations.
>> >
>> > True.
>> >
>> > --
>> > ELM
>> > University of Notre Dame
Re: [CODE4LIB] Q: XML2JSON converter
On Sat, Mar 6, 2010 at 1:57 PM, Houghton,Andrew wrote:

> A way to fix this issue is to say that use cases #1 and #2 conform to
> media type application/json and use case #3 conforms to a new media type,
> say: application/marc+json. This new application/marc+json media type now
> becomes a library centric standard and it avoids breaking a widely
> deployed Web standard.

I'm so sorry -- it never dawned on me that anyone would think that I was asserting that a JSON MIME type should return anything but JSON. For the record, I think that's batshit crazy. JSON needs to return JSON. I'd been hoping to convince folks that we need to have a standard way to pass records around that doesn't require a streaming parser/writer, not to ignore standard MIME types willy-nilly.

My use cases exist almost entirely outside the browser environment (because, my god, I don't want to have to try to deal with MARC21, whatever the serialization, in a browser environment); it sounds like Andy is almost purely worried about working with a MARC21 serialization within a browser-based JavaScript environment.

Anyway, hopefully, it won't be a huge surprise that I don't disagree with any of the quote above in general; I would assert, though, that application/json and application/marc+json should both return JSON (in the same way that text/xml, application/xml, and application/marc+xml can all be expected to return XML). Newline-delimited JSON is starting to crop up in a few places (e.g. CouchDB) and should probably have its own MIME type and associated extension. So I would say something like:

application/json -- return json (obviously)
application/marc+json -- return json
application/marc+ndj -- return newline-delimited json

In all cases, we should agree on a standard record serialization, though, and the pure-json returns should include something that indicates what the heck it is (hopefully a URI that can act as a distinct "namespace"-type identifier, including a version in it).
The question for me, I think, is whether within this community, anyone who provides one of these types (application/marc+json and application/marc+ndj) should automatically be expected to provide both. I don't have an answer for that.

-Bill-
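On the provide-both question: since the two proposed types differ only in the collection wrapper (an array versus newlines), a server that emits one can emit the other almost for free. A sketch with toy records -- the record shape is illustrative and the media-type names are just the ones proposed in this thread, not registered types:

```python
import json

# Toy records standing in for a parsed MARC batch (shape illustrative).
records = [{"id": "rec1"}, {"id": "rec2"}]

def as_marc_json(recs):
    """application/marc+json: one well-formed JSON array."""
    return json.dumps(recs)

def as_marc_ndj(recs):
    """application/marc+ndj: one complete JSON document per line."""
    return "\n".join(json.dumps(r) for r in recs)

# Both serializations round-trip to the same records.
assert json.loads(as_marc_json(records)) == records
assert [json.loads(ln) for ln in as_marc_ndj(records).splitlines()] == records
```

The newline-delimited form also shows why no streaming parser is needed on the consuming side: each line is parsed on its own, so memory use is bounded by the largest single record rather than the whole collection.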
Re: [CODE4LIB] Q: XML2JSON converter
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 08:48 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
>
> On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew wrote:
>
> > OK, I will bite, you stated:
> >
> > 1. That large datasets are a problem.
> > 2. That streaming APIs are a pain to deal with.
> > 3. That tool sets have memory constraints.
> >
> > So how do you propose to process large JSON datasets that:
> >
> > 1. Comply with the JSON specification.
> > 2. Can be read by any JavaScript/JSON processor.
> > 3. Do not require the use of a streaming API.
> > 4. Do not exceed the memory limitations of current JSON processors.
>
> What I'm proposing is that we don't process large JSON datasets; I'm
> proposing that we process smallish JSON documents one at a time by
> pulling them out of a stream based on an end-of-record character.
>
> This is basically what we use for the MARC21 binary format -- have a
> defined structure for a valid record, and separate multiple well-formed
> record structures with an end-of-record character. This preserves JSON
> specification adherence at the record level and uses a different scheme
> to represent collections. Obviously, MARC-XML uses a different
> mechanism to define a collection of records -- putting well-formed
> record structures inside a tag.
>
> So... I'm proposing we define what we mean by a single MARC record
> serialized to JSON (in whatever format; I'm not very opinionated
> on this point) that preserves the order, indicators, tags, data,
> etc. we need to round-trip between MARC21 binary, MARC-XML, and
> MARC-JSON.
>
> And then separate those valid records with an end-of-record character
> -- "\n".

OK, what I see here are divergent use cases and the willingness of the library community to break existing Web standards.
This is how the library community makes it more difficult to use its data and places additional barriers for people and organizations entering its market, because of these library centric protocols and standards. If I were to try to sell this idea to the Web community at large, and tell them that when they send an HTTP request with an Accept: application/json header to our services, our services will respond with a 200 HTTP status and deliver them malformed JSON, I would be immediately impaled with multiple arrows and daggers :( Not to mention that OCLC would be disparaged by a certain crowd in their blogs as idiots who cannot follow standards. OCLC's goals are to use and conform to Web standards, to make library data easier to use by people and organizations outside the library community; otherwise libraries and their data will become irrelevant.

The JSON serialization is a standard, and the Web community expects that when they make HTTP requests with an Accept: application/json header they will get back JSON conforming to the standard. JSON's main use case is in AJAX scenarios where you are not supposed to be sending megabytes of data across the wire. Your proposal is asking me to break a widely deployed Web standard that is used by AJAX frameworks to access millions (OK, many) Web sites.

> Unless I've read all this wrong, you've come to the conclusion that the
> benefit of having a JSON serialization that is valid JSON at both the
> record and collection level outweighs the pain of having to deal with
> a streaming parser and writer. This allows a single collection to be
> treated as any other JSON document, which has obvious benefits (which
> I certainly don't mean to minimize) and all the drawbacks we've been
> talking about *ad nauseam*.

The goal is to adhere to existing Web standards, and your underlying assumption is that you can or will be retrieving large datasets through an AJAX scenario.
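The standards-conformant pattern being argued for here -- honor the Accept header and keep individual AJAX responses small -- can be sketched framework-free. Everything below is hypothetical illustration (function names, paging parameters, and the marc+ndj type are assumptions from this thread, not registered standards):

```python
import json

def respond(accept_header, records, offset=0, limit=10):
    """Return (content_type, body) honoring the Accept header.

    Hypothetical handler sketch: every JSON-flavored request gets
    well-formed JSON, and offset/limit paging keeps any single AJAX
    response small instead of shipping the whole dataset at once.
    """
    page = records[offset:offset + limit]
    if "application/marc+ndj" in accept_header:
        return "application/marc+ndj", "\n".join(json.dumps(r) for r in page)
    # application/json and application/marc+json both get valid JSON.
    return "application/json", json.dumps(page)

ctype, body = respond("application/json", [{"id": i} for i in range(100)])
assert ctype == "application/json"
# The client receives one small, spec-conformant JSON page, never 100MB.
assert json.loads(body) == [{"id": i} for i in range(10)]
```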
As I pointed out, this is more an API design issue, and due to the way AJAX works you should never design an API in that manner. Your assumption that you can or will be retrieving large datasets through an AJAX scenario is false, given the caveat of a well designed API. Therefore you will never be put into the scenario requiring the use of JSON streaming, so your argument from this point of view is moot.

But for argument's sake, let's say you could retrieve a line-delimited list of JSON objects. You can no longer use any existing AJAX framework for getting back that JSON, since it's malformed. You could use the AJAX framework's XMLHTTP object to retrieve this line-delimited list of JSON objects, but this still doesn't help, because the XMLHTTP object will keep the entire response in memory. So when our service sends the user agent 100MB of line-delimited JSON objects, the XMLHTTP object is going to try to slurp the entire 100MB HTTP response into memory, and that is going to exceed the memory limitations of the JSON/JavaScript processor or the browser that is controlling the XMLHTTP object and