Re: [CODE4LIB] Q: XML2JSON converter

2010-03-06 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Saturday, March 06, 2010 05:11 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> Anyway, hopefully, it won't be a huge surprise that I don't disagree
> with any of the quote above in general; I would assert, though, that
> application/json and application/marc+json should both return JSON
> (in the same way that text/xml, application/xml, and 
> application/marc+xml can all be expected to return XML). 
> Newline-delimited json is starting to crop up in a few places 
> (e.g. couchdb) and should probably have its own mime type
> and associated extension. So I would say something like:
> 
> application/json -- return json (obviously)
> application/marc+json  -- return json
> application/marc+ndj  -- return newline-delimited json

This sounds like consensus on how to deal with newline-delimited JSON in a 
standards-based manner.

I'm not familiar with CouchDB, but I am using MongoDB, which is similar.  I'll 
have to dig into how they deal with this newline-delimited JSON.  Can you 
provide any references to get me started?
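For what it's worth, the newline-delimited idea is simple enough to sketch outside any particular database; this is a hypothetical illustration of the general shape, not CouchDB's or MongoDB's actual wire format:

```python
import json

# Hypothetical newline-delimited JSON: one complete JSON object per
# line, so each line can be parsed on its own without a streaming parser.
ndjson = '{"id": 1}\n{"id": 2}\n{"id": 3}\n'

# Each line is valid JSON even though the stream as a whole is not.
records = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
```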

> In all cases, we should agree on a standard record serialization,
> though, and the pure-json returns should include something that 
> indicates what the heck it is (hopefully a URI that can act as a 
> distinct "namespace"-type identifier, including a version in it).

I agree that our MARC-JSON serialization needs some "namespace" identifier in 
it, and it occurred to me that the way it handles indicators, e.g., the ind1 
and ind2 properties, might be better handled as an array, to accommodate 
IFLA's MARC-XML-ish format, where fields can have from 1-9 indicator values.
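To make the array idea concrete, here is a hedged sketch of a field object; the property names and values are illustrative, not our actual serialization:

```python
import json

# Sketch: indicators as an array ("ind") instead of separate
# "ind1"/"ind2" properties, so the list could grow to 9 entries
# for IFLA-style records. Tag, codes, and values are made up.
field = {
    "tag": "245",
    "ind": ["1", "0"],
    "subfields": [{"code": "a", "value": "Example title"}],
}

serialized = json.dumps(field)
```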

BTW, our MARC-JSON content is specified in Unicode, not MARC-8, per the JSON 
standard, which means you need to use \u notation to specify characters in 
strings; I'm not sure I made that clear in earlier posts.  A downside to the 
current ECMA 262 specification is that it doesn't support \UXXXXXXXX notation, 
as Python does, for characters outside the Basic Multilingual Plane.  
Hopefully that will get rectified in a future ECMA 262 specification.
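A quick demonstration of the difference (Python here just to exercise both notations):

```python
import json

# JSON only has the four-hex-digit \u escape, so a character inside the
# Basic Multilingual Plane is a single escape...
bmp = json.loads('"\\u00e9"')  # LATIN SMALL LETTER E WITH ACUTE

# ...while a character outside it (here U+1D11E, MUSICAL SYMBOL G CLEF)
# must be spelled as a UTF-16 surrogate pair in JSON text.
astral = json.loads('"\\ud834\\udd1e"')

# Python's own string literals do support the long form directly.
assert astral == "\U0001D11E"
```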

> The question for me, I think, is whether within this community,  anyone
> who provides one of these types (application/marc+json and
> application/marc+ndj) should automatically be expected to provide both.
> I don't have an answer for that.

I think this issue gets into familiar territory when dealing with RDF formats.  
Let's see: there are N3, NT, XML, Turtle, etc.  Do you need to provide all of 
them?  No, but it's nice of the server to at least provide NT or Turtle and 
XML.  Ultimately it's up to the server.  But the only difference between use 
cases #2 and #3 is whether the output is wrapped in an array, so it's probably 
easy for the server to produce both.
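A sketch of just how small that difference is, using throwaway records:

```python
import json

records = [{"id": 1}, {"id": 2}]

# Array-wrapped form: one valid JSON document for the whole collection.
wrapped = json.dumps(records)

# Newline-delimited form: the same records, one JSON document per line.
unwrapped = "\n".join(json.dumps(r) for r in records)

# Producing one shape from the other is a one-liner either way.
assert json.loads(wrapped) == [json.loads(l) for l in unwrapped.splitlines()]
```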

Depending on how much time I get next week I'll talk with the developer network 
folks to see what I need to do to put a specification under their 
infrastructure.  Looks like from my schedule it's going to be another week of 
hell :(


Andy.


Re: [CODE4LIB] Code4Lib Midwest?

2010-03-06 Thread Jason Stirnaman
I think maybe I started the chatter in IRC, but now it looks as if I
should redefine my region.  Being in KC, I don't see myself or other MO/KS
folks driving to Ohio and back in a day or weekend. I could be wrong.
Thoughts? Any interest in a Code4Lib-Big-12-ish? 
Kudos to Jonathan for running with this. I'm a drummer too, BTW...in
the event of a C4L11 Battle of the Bands.

Jason


-- 
Jason Stirnaman
Biomedical Librarian, Digital Projects
A.R. Dykes Library, University of Kansas Medical Center
jstirna...@kumc.edu
913-588-7319


>>> On 3/5/2010 at 9:37 AM, Bill Dueber wrote:
> I'm pretty sure I could make it from Ann Arbor!
> 
> On Fri, Mar 5, 2010 at 10:12 AM, Ken Irwin wrote:
> 
>> I would come from Ohio to wherever we choose. Kalamazoo would suit me just
>> fine; I've not been back there in entirely too long!
>> Ken
>>
>> > -----Original Message-----
>> > From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
>> > Scott Garrison
>> > Sent: Friday, March 05, 2010 8:37 AM
>> > To: CODE4LIB@LISTSERV.ND.EDU
>> > Subject: Re: [CODE4LIB] Code4Lib Midwest?
>> >
>> > +1
>> >
>> > ELM, I'm happy to help coordinate in whatever way you need.
>> >
>> > Also, if we can find a drummer, we could do a blues trio (count me in on
>> > bass). I could bring our band's drummer (a HUGE ND fan) down for a day or
>> > two if needed--he's awesome.
>> >
>> > --SG
>> > WMU in Kalamazoo
>> >
>> > ----- Original Message -----
>> > From: "Eric Lease Morgan"
>> > To: CODE4LIB@LISTSERV.ND.EDU
>> > Sent: Thursday, March 4, 2010 4:38:53 PM
>> > Subject: Re: [CODE4LIB] Code4Lib Midwest?
>> >
>> > On Mar 4, 2010, at 3:29 PM, Jonathan Brinley wrote:
>> >
>> > >>  2. share demonstrations
>> > >
>> > > I'd like to see this be something like a blend between lightning talks
>> > > and the ask anything session at the last conference
>> >
>> > This certainly works for me, and the length of time of each "talk"
>> > would/could be directly proportional to the number of people who attend.
>> >
>> >
>> > >>  4. give a presentation to library staff
>> > >
>> > > What sort of presentation did you have in mind, Eric?
>> > >
>> > > This also raises the issue of weekday vs. weekend. I'm game for
>> > > either. Anyone else have a preference?
>> >
>> > What I was thinking here was a possible presentation to library
>> > faculty/staff and/or computing faculty/staff from across campus. The
>> > presentation could be one or two cool hacks or solutions that solved
>> > wider, less geeky problems. Instead of "tweaking Solr's term-weighting
>> > algorithms to index OAI-harvested content" it would be "making journal
>> > articles easier to find". This would be an opportunity to show off the
>> > good work done by institutions outside Notre Dame. A prophet in their
>> > own land is not as convincing as the expert from afar.
>> >
>> > I was thinking it would happen on a weekday. There would be more stuff
>> > going on here on campus, as well as give everybody a break from their
>> > normal work week. More specifically, I would suggest such an event take
>> > place on a Friday so the people who stayed overnight would not have to
>> > take so many days off of work.
>> >
>> >
>> > >>  5. have a hack session
>> > >
>> > > It would be good to have 2 or 3 projects we can/should work on decided
>> > > ahead of time (in case no one has any good ideas at the time), and
>> > > perhaps a couple more inspired by the earlier presentations.
>> >
>> > True.
>> >
>> > --
>> > ELM
>> > University of Notre Dame
>>
> 


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-06 Thread Bill Dueber
On Sat, Mar 6, 2010 at 1:57 PM, Houghton,Andrew  wrote:

>  A way to fix this issue is to say that use cases #1 and #2 conform to
> media type application/json and use case #3 conforms to a new media type
> say: application/marc+json.  This new application/marc+json media type now
> becomes a library centric standard and it avoids breaking a widely deployed
> Web standard.
>

I'm so sorry -- it never dawned on me that anyone would think that I was
asserting that a JSON MIME type should return anything but JSON. For the
record, I think that's batshit crazy. JSON needs to return json. I'd been
hoping to convince folks that we need to have a standard way to pass records
around that doesn't require a streaming parser/writer; not ignore standard
MIME-types willy-nilly. My use cases exist almost entirely outside the
browser environment (because, my god, I don't want to have to try to deal
with MARC21, whatever the serialization, in a browser environment); it
sounds like Andy is almost purely worried about working with a MARC21
serialization within a browser-based javascript environment.

Anyway, hopefully, it won't be a huge surprise that I don't disagree with
any of the quote above in general; I would assert, though, that
application/json and application/marc+json should both return JSON (in the
same way that text/xml, application/xml, and application/marc+xml can all be
expected to return XML). Newline-delimited json is starting to crop up in a
few places (e.g. couchdb) and should probably have its own mime type and
associated extension. So I would say something like:

application/json -- return json (obviously)
application/marc+json  -- return json
application/marc+ndj  -- return newline-delimited json

In all cases, we should agree on a standard record serialization, though,
and the pure-json returns should include something that indicates what the
heck it is (hopefully a URI that can act as a distinct "namespace"-type
identifier, including a version in it).
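Something like this hypothetical envelope would do; the URI and property names are invented purely for illustration:

```python
import json

# Hypothetical record envelope carrying a versioned "namespace" URI so a
# consumer can tell which serialization (and version) it is looking at.
record = {
    "format": "http://example.org/ns/marc-json/1.0",  # invented URI
    "leader": "00000nam a2200000 a 4500",
    "fields": [],
}

payload = json.dumps(record)
```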

The question for me, I think, is whether, within this community, anyone who
provides one of these types (application/marc+json and application/marc+ndj)
should automatically be expected to provide both. I don't have an answer for
that.

 -Bill-


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-06 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 08:48 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew 
> wrote:
> 
> > OK, I will bite, you stated:
> >
> > 1. That large datasets are a problem.
> > 2. That streaming APIs are a pain to deal with.
> > 3. That tool sets have memory constraints.
> >
> > So how do you propose to process large JSON datasets that:
> >
> > 1. Comply with the JSON specification.
> > 2. Can be read by any JavaScript/JSON processor.
> > 3. Do not require the use of streaming API.
> > 4. Do not exceed the memory limitations of current JSON processors.
> >
> >
> What I'm proposing is that we don't process large JSON datasets; I'm
> proposing that we process smallish JSON documents one at a time by
> pulling
> them out of a stream based on an end-of-record character.
> 
> This is basically what we use for MARC21 binary format -- have a
> defined
> structure for a valid record, and separate multiple well-formed record
> structures with an end-of-record character. This preserves JSON
> specification adherence at the record level and uses a different scheme
> to represent collections. Obviously, MARC-XML uses a different 
> mechanism to define a collection of records -- putting well-formed 
> record structures inside a <collection> tag.
> 
> So... I'm proposing we define what we mean by a single MARC record
> serialized to JSON (in whatever format; I'm not very opinionated 
> on this point) that preserves the order, indicators, tags, data, 
> etc. we need to round-trip between marc21binary, marc-xml, and 
> marc-json.
> 
> And then separate those valid records with an end-of-record character
> -- "\n".

Ok, what I see here are divergent use cases and a willingness of the library 
community to break existing Web standards.  This is how the library community 
makes it more difficult to use its data, and places additional barriers before 
people and organizations entering its market, because of these library-centric 
protocols and standards.

If I were to try to sell this idea to the Web community at large, and tell 
them that when they send an HTTP request with an Accept: application/json 
header to our services, our services will respond with a 200 HTTP status and 
deliver them malformed JSON, I would be immediately impaled with multiple 
arrows and daggers :(  Not to mention that OCLC would be disparaged by a 
certain crowd in their blogs as being idiots who cannot follow standards.

OCLC's goal is to use and conform to Web standards to make library data easier 
to use by people or organizations outside the library community; otherwise 
libraries and their data will become irrelevant.  JSON is a standard 
serialization, and the Web community expects that when they make HTTP requests 
with an Accept: application/json header they will get back JSON conforming to 
the standard.  JSON's main use case is in AJAX scenarios, where you are not 
supposed to be sending megabytes of data across the wire.

Your proposal is asking me to break a widely deployed Web standard that AJAX 
frameworks use to access millions (ok, many) Web sites.

> Unless I've read all this wrong, you've come to the conclusion that the
> benefit of having a JSON serialization that is valid JSON at both the
> record and collection level outweighs the pain of having to deal with
> a streaming parser and writer.  This allows a single collection to be
> treated as any other JSON document, which has obvious benefits (which 
> I certainly don't mean to minimize) and all the drawbacks we've been 
> talking about *ad nauseam*.

The goal is to adhere to existing Web standards, and your underlying assumption 
is that you can or will be retrieving large datasets through an AJAX scenario.  
As I pointed out, this is more an API design issue: due to the way AJAX works, 
you should never design an API in that manner.  Given that caveat of a 
well-designed API, the assumption that you can or will be retrieving large 
datasets through AJAX is false.  Therefore you will never be put into a 
scenario requiring the use of JSON streaming, so your argument from this point 
of view is moot.

But for argument's sake, let's say you could retrieve a line-delimited list of 
JSON objects.  You can no longer use any existing AJAX framework for getting 
back that JSON, since it's malformed.  You could use the AJAX framework's 
XMLHTTP object to retrieve this line-delimited list of JSON objects, but this 
still doesn't help, because the XMLHTTP object will keep the entire response 
in memory.

So when our service sends the user agent 100MB of line-delimited JSON objects, 
the XMLHTTP object is going to try to slurp the entire 100MB HTTP response into 
memory, and that is going to exceed the memory requirement of the 
JSON/JavaScript processor or the browser that is controlling the XMLHTTP object and