Re: [CODE4LIB] LCSH, Bisac, facets, hierarchy?

2016-04-25 Thread Bill Dueber
The University of Michigan maintains what we call “High Level Browse” — a
mapping of LC/Dewey call numbers to a limited hierarchy, based loosely
around academic departments (at least at the time it started). It’s still
maintained, and may prove generally useful as well.

The HLB hierarchy <http://www.lib.umich.edu/browse> gives you an idea of
what it is, and you can download an XML dump of the categories and their
associated call number ranges
<http://www.lib.umich.edu/browse/categories/xml.php> (1.8MB)
if that's your thing.


On Wed, Apr 13, 2016 at 10:38 AM, William Denton <w...@pobox.com> wrote:

> On 13 April 2016, Mark Watkins wrote:
>
> I'm a library sciences newbie, but it seems like LCSH doesn't really
>> provide a formal hierarchy of genre/topic, just a giant controlled
>> vocabulary. Bisac seems to provide the "expected" hierarchy.
>>
>> Is anyone aware of any approaches (or better yet code!) that translates
>> lcsh to something like BISAC categories (either BISAC specifically or some
>> other hierarchy/ontology)? General web searching didn't find anything
>> obvious.
>>
>
> There's HILCC, the Hierarchical Interface of LC Classification:
>
> https://www1.columbia.edu/sec/cu/libraries/bts/hilcc/subject_map.html
>
> Bill
> --
> William Denton ↔  Toronto, Canada ↔  https://www.miskatonic.org/




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] Ruby MARC::Record: anyone need ruby 1.8 support anymore?

2016-02-28 Thread Bill Dueber
Ruby 1.8 was EOL'd about 2.5 years ago, so in theory everyone should be
long off of it. In practice, well, I thought I'd ask before making any
releases that change that.

Sidenote: does dropping support for a long-EOL'd version of the language
constitute a major version change under SemVer? None of the public
interfaces would change (it's a performance-focused release I'm
considering).
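
Whatever the SemVer verdict, the usual belt-and-suspenders move is to
declare the floor in the gemspec, so 1.8 users get a clean resolver error
instead of a runtime surprise. A sketch only -- the version number below is
invented, but required_ruby_version is a standard RubyGems attribute:

    # Hypothetical gemspec fragment; version and summary are made up.
    Gem::Specification.new do |spec|
      spec.name                  = 'marc'
      spec.version               = '0.9.0'
      spec.summary               = 'A ruby library for working with MARC records'
      spec.authors               = ['ruby-marc contributors']
      spec.required_ruby_version = '>= 1.9'
    end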

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] Traject 2.0.0 released: index MARC into Solr with ruby

2015-02-19 Thread Bill Dueber
[Apologies, as always, for any cross-post copies]

The traject <https://github.com/traject-project/traject/> maintainers are
happy to announce the release of traject version 2.0.0.

Traject is an ETL (extract/transform/load) system designed and optimized
for indexing MARC records into Solr. It is similar in functionality to
solrmarc <https://code.google.com/p/solrmarc/>, but with everything written
in ruby instead of java.
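
To give a flavor, a complete traject configuration can be as small as the
sketch below (field names and tags are illustrative, not pulled from a real
install); you'd run it with traject -c config.rb yourfile.mrc:

    # config.rb -- illustrative only; extract_marc and its
    # first/trim_punctuation options are standard traject macros.
    to_field 'id',           extract_marc('001', first: true)
    to_field 'title',        extract_marc('245ab', trim_punctuation: true)
    to_field 'author_facet', extract_marc('100abcd:110abcd:111abcd')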

Traject 2.0 brings several notable changes:

   - Support for MRI (“normal”) ruby, JRuby, and rbx
   - New Solr JSON writer (for solr versions >= 3.2) accessible from MRI and
   with about 20% better performance than previous indexing.
   - New writers for producing tab-delimited/CSV files

(Note that while traject runs fine under MRI, you’ll get substantially
faster indexing using JRuby due to traject’s use of multiple threads when
available).

Traject is in production use indexing metadata for the library catalogs of
the University of Michigan, the HathiTrust, Johns Hopkins, and Brown
University. (Using Traject? Let us know!)

   - The traject README <https://github.com/traject-project/traject/> and
   doc folder <https://github.com/traject-project/traject/tree/master/doc>
   contain reference information, and we also provide a sample real-ish
   configuration <https://github.com/traject-project/traject_sample> to
   help get you started.
   - Brown University is using traject for a new search interface; the Brown
   configuration <https://github.com/Brown-University-Library/bul-traject/>
   is a great example of a real-life traject installation.
   - The University of Michigan and HathiTrust catalogs are also indexed with
   traject; their shared configuration <https://github.com/billdueber/ht_traject>
   provides another (potentially overly complex) real-life set of
   configuration files.

Thanks to everyone who provided feedback for this release!
Feel free to contact me with questions directly, or add issues/pull
requests to the github project <https://github.com/traject-project/traject/>.
-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] Announcement: ruby-marc 0.8.2 re-released as version 1.0.0

2015-01-28 Thread Bill Dueber
The ruby-marc <https://github.com/ruby-marc/ruby-marc> team is happy to
announce that we’ve decided to release the current code as version 1.0.0.

There are no non-cosmetic changes to this code compared to the
until-now-current version 0.8.2.

The jump to version 1.0.0 reflects the *de facto* use of the marc gem in
production at dozens of institutions and allows further development to more
easily adhere to semantic versioning <http://semver.org/>.

In that vein, please begin the process of updating your gem directives in
Gemfiles and .gemspec files to something like

gem 'marc', '~> 1'

…to be sure you have the latest backwards-compatible version for your
projects.

Thanks to everyone involved, from committers to folks who file bugs, for
the progress ruby-marc has made over the years. Special thanks for the most
recent releases go to Jonathan Rochkind, whose work on encodings (including
MARC-8!!) has been relentless.

-Bill Dueber, for the ruby-marc contributors-
-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Lorem Ipsum metadata? Is there such a thing?

2013-12-09 Thread Bill Dueber
 code that handles paging in a UI, and I had to make it all up by hand.
 This hurts my soul. Someone please tell me such a service exists, and
 link me to it, so I never have to do this again. Or else, I may just
 make such a service, to save us all. But I don't want to go coding
 some new service if it already exists, because that sort of thing is
 for chumps.
   
--
HARDY POTTINGER pottinge...@umsystem.edu University of Missouri
Library Systems http://lso.umsystem.edu/~pottingerhj/
https://MOspace.umsystem.edu/
Making things that are beautiful is real fun. --Lou Reed
   
  
  
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] The lie of the API

2013-12-02 Thread Bill Dueber
On Sun, Dec 1, 2013 at 7:57 PM, Barnes, Hugh hugh.bar...@lincoln.ac.nz wrote:

 +1 to all of Richard's points here. Making something easier for you to
 develop is no justification for making it harder to consume or deviating
 from well supported standards.



I just want to point out that as much as we all really, *really* want
"easy to consume" and "following the standards" to be the same
thing...they're not. Correct content negotiation is one of those things
that often follows the phrase "all they have to do...", which is always a
red flag, as in "Why give the user different URLs when *all they have to
do is*..." Caching, json vs javascript vs jsonp, etc. all make this
harder. If *all* *I have to do* is know that all the consumers of my data
are going to do content negotiation right, and then I need to get deep into
the guts of my caching mechanism, then set up an environment where it's all
easy to test...well, it's harder.
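
To make the caching point concrete, here's a minimal Sinatra-flavored
sketch (hypothetical route and payloads; request.accept? and the Vary
header are the real mechanisms): the moment one URL serves two
representations, every cache between you and the client has to key on the
Accept header, and plenty don't.

    # Hypothetical sketch: one URL, two representations.
    require 'sinatra'

    get '/record/:id' do
      headers 'Vary' => 'Accept'   # caches must now key on Accept
      if request.accept?('application/json')
        content_type :json
        %Q({"id": "#{params[:id]}"})
      else
        content_type :html
        "<html><body>Record #{params[:id]}</body></html>"
      end
    end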

And don't tell me how lazy I am until you invent a day with a lot more
hours. I'm sick of people telling me I'm lazy because I'm not pure. I
expose APIs (which have their own share of problems, of course) because I
want them to be *useful* and *used*.

  -Bill, apparently feeling a little bitter this morning -




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Bill Dueber
I'm running it against the HathiTrust catalog right now. It'll just take a
while, given that I don't have access to Roy's Hadoop cluster :-)


On Wed, Oct 16, 2013 at 1:38 PM, Sean Hannan shan...@jhu.edu wrote:

 That sounds like a request for Roy to fire up the ole OCLC Hadoop.

 -Sean



 On 10/16/13 1:06 PM, Karen Coyle li...@kcoyle.net wrote:

 Anybody have data for the average length of specific MARC fields in some
 reasonably representative database? I mainly need 100, 245, 6xx.
 
 Thanks,
 kc
 
 --
 Karen Coyle
 kco...@kcoyle.net http://kcoyle.net
 m: 1-510-435-8234
 skype: kcoylenet




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Bill Dueber
For the HathiTrust catalog's 6,046,746 bibs and looking at only the lengths
of the subfields $a and $b in 245s, I get an average length of 62.0.
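
(Roughly what that computation looks like with ruby-marc -- a sketch, not
the actual script:)

    # Average the combined length of 245 $a and $b over a file of
    # binary MARC records.
    require 'marc'

    total = 0
    count = 0
    MARC::Reader.new('catalog.mrc').each do |rec|
      field = rec['245']
      next unless field
      total += field.subfields.select { |sf| ['a', 'b'].include?(sf.code) }
                              .inject(0) { |sum, sf| sum + sf.value.length }
      count += 1
    end
    puts Float(total) / count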


On Wed, Oct 16, 2013 at 3:24 PM, Kyle Banerjee kyle.baner...@gmail.com wrote:

 245 not including $c, indicators, or delimiters, |h (which occurs before
 |b), |n, |p, with trailing slash preceding |c stripped for about 9 million
 records for Orbis Cascade collections is 70.1

 kyle


 On Wed, Oct 16, 2013 at 12:00 PM, Karen Coyle li...@kcoyle.net wrote:

  Thanks, Roy (and others!)
 
  It looks like the 245 is including the $c - dang! I should have been more
  specific. I'm mainly interested in the title, which is $a $b -- I'm
 looking
  at the gains and losses of bytes should one implement FRBR. As a hedge,
  could I ask what've you got for the 240? that may be closer to reality.
 
  kc
 
 
  On 10/16/13 10:57 AM, Roy Tennant wrote:
 
  I don't even have to fire it up. That's a statistic that we generate
  quarterly (albeit via Hadoop). Here you go:
 
  100 - 30.3
  245 - 103.1
  600 - 41
  610 - 48.8
  611 - 61.4
  630 - 40.8
  648 - 23.8
  650 - 35.1
  651 - 39.6
  653 - 33.3
  654 - 38.1
  655 - 22.5
  656 - 30.6
  657 - 27.4
  658 - 30.7
  662 - 41.7
 
  Roy
 
 
  On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan shan...@jhu.edu wrote:
 
   That sounds like a request for Roy to fire up the ole OCLC Hadoop.
 
  -Sean
 
 
 
  On 10/16/13 1:06 PM, Karen Coyle li...@kcoyle.net wrote:
 
   Anybody have data for the average length of specific MARC fields in
 some
  reasonably representative database? I mainly need 100, 245, 6xx.
 
  Thanks,
  kc
 
  --
  Karen Coyle
  kco...@kcoyle.net http://kcoyle.net
  m: 1-510-435-8234
  skype: kcoylenet
 
 
  --
  Karen Coyle
  kco...@kcoyle.net http://kcoyle.net
  m: 1-510-435-8234
  skype: kcoylenet
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] ANNOUNCEMENT: Traject MARC-Solr indexer release

2013-10-15 Thread Bill Dueber
'traject' means to transmit (e.g., trajectory) -- or at least it did,
when people still used it, which they don't.

The traject workflow is incredibly general: *a reader* sends *a record* to *an
indexing routine* which stuffs...stuff...into a context object which is
then sent to *a writer*. We have a few different MARC readers, a few useful
writers (one of which, obviously, is the solr writer), and a bunch of
shipped routines (which we're calling macros but are just well-formed
ruby lambdas or blocks) for extracting and transforming common MARC data.

[see
http://robotlibrarian.billdueber.com/announcing-traject-indexing-software/
for more explanation and some examples]
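
As a concrete example of the writer half of that pipeline: any object that
responds to put(context) and close will do. The class and setting names
below are invented for illustration.

    # Invented example of the writer duck-type: initialize with the
    # settings hash, receive one Context per record via put, clean up
    # in close.
    class TSVWriter
      def initialize(settings)
        @out = File.open(settings['tsv_writer.filename'], 'w')
      end

      def put(context)
        h = context.output_hash   # field name => array of values
        @out.puts [h['id'], h['title']].map { |v| Array(v).join('|') }.join("\t")
      end

      def close
        @out.close
      end
    end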

But there's no reason why a reader couldn't produce a MODS record which
would then be worked on. I'm already imagining readers and writers that
target databases (RDBMS or NoSQL), or a queueing system like Hornet, etc.

If there are people at Stanford that want to talk about how (easy it is) to
extend traject, I'd be happy to have that conversation.



On Tue, Oct 15, 2013 at 12:28 PM, Tom Cramer tcra...@stanford.edu wrote:

 ++ Jonathan and Bill.

 1.) Do you have any thoughts on extending traject to index other types of
 data--say MODS--into solr, in the future?

 2.) What's the etymology of 'traject'?

 - Tom


 On Oct 14, 2013, at 8:53 AM, Jonathan Rochkind wrote:

  Jonathan Rochkind (Johns Hopkins) and Bill Dueber (University of
 Michigan), are happy to announce a robust, feature-complete beta release of
 traject, a tool for indexing MARC data to Solr.
 
  traject, in the vein of solrmarc, allows you to define your indexing
 rules using simple macro and translation files. However, traject runs under
 JRuby and is ruby all the way down, so you can easily provide additional
 logic by simply requiring ruby files.
 
  There's a sample configuration file to give you a feel for traject[1].
 
  You can view the code[2] on github, and easily install it as a (jruby)
 gem using gem install traject.
 
  traject is in a beta release hoping for feedback from more testers prior
 to a 1.0.0 release, but it is already being used in production to generate
 the HathiTrust (metadata-lookup) Catalog (http://www.hathitrust.org/).
 traject was developed using a test-driven approach and has undergone both
 continuous integration and an extensive benchmarking/profiling period to
 keep it fast. It is also well covered by high-quality documentation.
 
  Feedback is very welcome on all aspects of traject including
 documentation, ease of getting started, features, any problems you have,
 etc.
 
  What we think makes traject great:
 
  * It's all just well-crafted and documented ruby code; easy to program,
 easy to read, easy to modify (the whole code base is only 6400 lines of
 code, more than a third of which is tests)
  * Fast. Traject by default indexes using multiple threads, so you can
 use all your cores!
  * Decoupled from specific readers/writers, so you can use ruby-marc or
 marc4j to read, and write to solr, a debug file, or anywhere else you'd
 like with little extra code.
  * Designed so it's easy to test your own code and distribute it as a gem
 
  We're hoping to build up an ecosystem around traject and encourage
 people to ask questions and contribute code (either directly to the project
 or via releasing plug-in gems).
 
  [1]
 https://github.com/traject-project/traject/blob/master/test/test_support/demo_config.rb
  [2] http://github.com/traject-project/traject




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Good MARC PHP Libraries,

2013-09-26 Thread Bill Dueber
Given that File_MARC has been around since, what, the late 1950's, why
don't you just slap a 1.0 on it? It's not like anyone isn't using it
because they're waiting for the API to stabilize; we're all using it
regardless.


On Thu, Sep 26, 2013 at 12:01 AM, Dan Scott deni...@gmail.com wrote:

 I hear the maintainer of File_MARC is pretty responsive to questions
 and bug reports. This list might be a good place to raise questions
 about usage; others may be interested.

 Was the random undescriptive exit error something like the following?

 C:\php> pear install File_MARC
 Failed to download pear/File_MARC within preferred state stable,
 latest release is version 0.7.3, stability beta, use
 channel://pear.php.net/File_MARC-0.7.3 to install
 install failed

 One of these days that package will make it to 1.0 and the -beta
 will no longer be necessary. Or the pear.php.net install instructions
 will include that. Or newer versions of PEAR will be smarter about
 detecting that no stable version is available and automatically offer
 to install the beta.

 On Wed, Sep 25, 2013 at 8:18 PM, Riley Childs ri...@tfsgeo.com wrote:
  Thanks! I will give it a shot tomorrow
 
  Riley Childs
  Junior and Library Tech Manager
  Charlotte United Christian Academy
  +1 (704) 497-2086
  Sent from my iPhone
  Please excuse mistakes
 
  On Sep 25, 2013, at 8:14 PM, Ross Singer rossfsin...@gmail.com wrote:
 
  Try:
 
  pear install file_marc-beta
 
  -Ross.
 
  On Wednesday, September 25, 2013, Riley Childs wrote:
 
  I have been having some troubles with the installation (some random
  undescriptive exit error)
 
  Riley Childs
  Junior and Library Tech Manager
  Charlotte United Christian Academy
  +1 (704) 497-2086
  Sent from my iPhone
  Please excuse mistakes
 
  On Sep 25, 2013, at 7:28 PM, Eric Phetteplace phett...@gmail.com wrote:
 
  I think File_MARC is the standard:
  http://pear.php.net/package/File_MARC/
 
  Are there others?
 
  Best,
  Eric
 
 
  On Wed, Sep 25, 2013 at 7:17 PM, Riley Childs ri...@tfsgeo.com wrote:
 
  Does anyone know of any good MARC PHP Libraries, I am struggling to
  create
  MARC records out of our proprietary database.
 
  Riley Childs
  Junior and Library Tech Manager
  Charlotte United Christian Academy
  +1 (704) 497-2086
  Sent from my iPhone
  Please excuse mistakes
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] A Proposal to serialize MARC in JSON

2013-09-03 Thread Bill Dueber
I can see where you might think that no progress has been made because
the only real document of the format is that old, old blog post.

The problem, however, is not a lack of progress but a lack of documentation
of that progress. File_MARC (PHP), MARC::Record (perl), ruby-marc (ruby)
and marc4j (java) will all deal, to one extent or another, either with the
JSON directly or with a hash/map data structure that maps directly to that
JSON structure.

[BTW, can anyone summarize the state of pymarc wrt marc-in-json?]
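
For the ruby side, the mapping runs through plain hashes -- a sketch using
the marc gem's to_hash/new_from_hash, which mirror the marc-in-json
structure:

    # Round-trip marc-in-json with ruby-marc and the stdlib JSON parser.
    require 'marc'
    require 'json'

    record = MARC::Reader.new('records.mrc').first
    json   = JSON.generate(record.to_hash)
    # => {"leader":"...","fields":[{"001":"..."},
    #     {"245":{"ind1":"1","ind2":"0","subfields":[{"a":"..."}]}},...]}
    same   = MARC::Record.new_from_hash(JSON.parse(json))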





On Tue, Sep 3, 2013 at 5:09 AM, dasos ili dasos_...@yahoo.gr wrote:

 It is exactly three years back, and no real progress has been made
 concerning  this proposal to serialize MARC in JSON:


 http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/


 Meanwhile new tools for searching and retrieving records have come in,
 such as Solr and Elasticsearch. Any ideas on how one could alter (or
 propose a new format) more suited to the mechanisms of these two search
 platforms?

 Any example implementations would also be really appreciated,

 thank you in advance




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] Ruby AlephSequential file reader

2013-08-19 Thread Bill Dueber
I've written up a quick-and-dirty (well, except for the 'quick' part)
ruby-marc reader class to read AlephSequential files as output from the Ex
Libris Aleph system. If you don't know what that is, or why you would want
it, thank your god and move on.

Initial code is at https://github.com/billdueber/marc_alephsequential

If there's any interest, I'll gemify it and/or start a discussion about
whether or not to fold this into ruby-marc proper.

Speed isn't too awful -- about 150% the speed of reading a marc-binary file
with ruby-marc on my machine.
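
Usage is meant to look like any other ruby-marc reader (a sketch; check
the README for the current class name):

    require 'marc_alephsequential'

    # Assumes the gem exposes MARC::AlephSequential::Reader, per its README.
    reader = MARC::AlephSequential::Reader.new('export.seq')
    reader.each do |record|
      puts record['245']
    end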

Pull requests are *always *in fashion (...at the Copa...)

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] New perl module MARC::File::MiJ -- marc-in-json for perl

2013-07-15 Thread Bill Dueber
The marc-in-json
<http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/>
format is, as you might expect, a JSON serialization for MARC. A JSON
serialization for MARC is potentially useful in the same places where
MARC-XML would be useful (long records, utility of human-readable records,
etc.) without what many perceive to be the relative pain of working with
XML vs JSON.

It's currently supported across several implementations:

   - ruby's *marc* gem
   - php's *File_MARC*
   - java's *marc4j*
   - python's *pymarc*

There wasn't one for perl, so I wrote one :-)

MARC::File::MiJ
<http://search.cpan.org/~gmcharlt/MARC-File-MiJ-0.01/lib/MARC/File/MiJ.pm>
is a perl module that allows MARC::Record to encode/decode marc-in-json. It
also supplies a handler to MARC::File/MARC::Batch that will read
marc-in-json records from a newline-delimited-json (ndj) file (where each
line is a JSON object without unescaped newlines, ending with a newline).

marc-in-json encoding/decoding tends to be pretty fast
<http://robotlibrarian.billdueber.com/sizespeed-of-various-marc-serializations-using-ruby-marc/>,
since json parsers tend to be pretty fast, and uncompressed filesizes
occupy a middle ground between binary marc and marc-xml. A sample file of
about 18k marc records looks like this:

  31M topics.mrc
  56M topics.ndj (newline-delimited JSON)
  93M topics.xml

 8.9M topics.mrc.gz
 7.9M topics.ndj.gz
 8.7M topics.xml.gz

...so obviously it compresses pretty well, too.

I can take generic questions; bugs should go to
https://rt.cpan.org/Public/Bug/Report.html?Queue=MARC-File-MiJ

[Note that there are many other possible JSON serializations for MARC
<http://jakoblog.de/2011/04/13/mapping-bibliographic-record-subfields-to-json/>,
including the (incompatible) one implemented in the MARC::File::JSON
<http://search.cpan.org/~cfouts/MARC-File-JSON-0.002/lib/MARC/File/JSON.pm>
module.]




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] ISBN/LCCN normalization for Solr

2013-06-13 Thread Bill Dueber
Thanks to the efforts of Jay Luker, Jonathan Rochkind, and Adam
Constabaris, Solr analyzer filters to normalize ISBNs (to ISBN-13s) and
LCCNs are now cleaned up and ready to work with Solr 4.x.

I've extracted the code into a new repo, shined up the README, and provided
a .jar for download and instructions on what to do with it.

Get it while it's hot at
https://github.com/billdueber/solr-libstdnum-normalize
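
If you're wondering what the normalization itself involves, the heart of it
is just the standard ISBN-10 to ISBN-13 conversion -- a plain-ruby sketch
here, not the plugin's actual (Java) code:

    # Prepend 978, drop the old check digit, recompute the check with
    # alternating 1/3 weights per the ISBN-13 spec.
    def isbn10_to_isbn13(isbn10)
      core = '978' + isbn10.delete('-')[0, 9]
      sum  = core.chars.each_with_index.inject(0) do |s, (c, i)|
        s + c.to_i * (i.even? ? 1 : 3)
      end
      core + ((10 - sum % 10) % 10).to_s
    end

    isbn10_to_isbn13('080582796X')  # => "9780805827965"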

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] A Responsibility to Encourage Better Browsers ( ? )

2013-02-19 Thread Bill Dueber
Keep in mind that many old-IE users are there because their corporate/gov
entity requires it. Our entire university health/hospital complex, for
example, was on IE6 until...last year, maybe?...because they had several
critical pieces of software written as active-x components that only ran in
IE6. Which, sure, you can say that's dumb (because it is), but at the same
time we couldn't have a setup that made it hard for the doctors
and researchers to use the library.


On Tue, Feb 19, 2013 at 10:22 AM, Michael Schofield mschofi...@nova.edu wrote:

 Hi everyone,

 I'm having a change of heart.

 It is kind of sacrilegious, especially if you-like me-evangelize
 mobile-first, progressively enhanced web design, to  throw alerts when
 users hit your site using IE7 / IE8 that encourage upgrading or changing
 browsers. Especially in libraries which are legally and morally mandated to
 be the pinnacle of accessibility, your website should - er, ideally - be
 functional in every browser. That's certainly what I say when I give a talk.

 But you know what? I'm kind of starting to not care. I understand that
 patrons blah blah might not blah blah have access to anything but IE7 or
 IE8 - but, you know, if they're on anything other than Windows 95 that
 isn't true.


 * Using Old IE makes you REALLY vulnerable to malicious software.

 * Spriting IEs that don't support gradients, background size, CSS
 shapes, etc. and spinning-up IE friendly stylesheets (which, admittedly, is
 REALLY easy to do with Modernizr and SASS) can be a time-sink, which I am
 starting to think is more of a disservice to the tax- and tuition-payers
 that pad my wallet.

 I ensure that web services are 100% functional for deprecated browsers,
 and there is lingering pressure-especially from the public wing of our
 institution (which I totally understand and, in the past, sympathized with)
 to present identical experiences across browsers. But you know what I did
 today? I sinned. From our global script, if modernizr detects that the
 browser is lt-ie9, it appends just below the navbar a subtle notice: Did
 you know that your version of Internet Explorer is several years old? Why
 not give Firefox, Google Chrome, or Safari a try?*

 In most circles this is considered the most heinous practice. But, you
 know, I can no longer passively stand by and see IE8 rank above the others
 when I give the analytics report to our web committee. Nope. The first step
 in this process was dropping all support for IE7 / Compatibility Mode a few
 months ago. Now that Google, jQuery, and others will soon drop support for
 IE8 - its time to politely join-in and make luddite patrons aware. IMHO,
 anyway.

 Already, old IE users get the raw end of the bargain because just viewing
 our website makes several additional server requests to pull additional CSS
 and JS bloat, not to mention all the images graphics they don't support.
 Thankfully, IE8 is cool with icon fonts, otherwise I'd be weeping at my
 desk.

 Now, why haven't I extended this behavior to browsers with limited support
 for, say, css gradients? That's trickier. A user might have the latest HTC
 phone but opt to surf in Opera Mini. There are too many variables and too
 many webkits (etc.). With old IE you can infer that a.) the user has a lap-
 or desktop, and [more importantly] b.) that old IE will never be a phone.

 Anyway,

 This is a really small-potatoes rant / action, but in a culture of all
 accessibility / never pressuring the user / whatever, it feels momentous. I
 kind of feel stupid getting all high and mighty about it. What do you think?

 Michael | Front End Librarian | www.ns4lib.com

 * Why, you may ask, did I not suggest IE9? Well, IE9 isn't exactly the
 experience we'd prefer them to have, but also according to our analytics
 the huge majority of old IE users are on Windows XP - where 9 isn't an
 option anyway. Eventually, down the road, we'll encourage IE9ers to upgrade
 too (once things like flexbox become standard), and at least they should
 have the option to try IE10.




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] library date parsing

2013-02-07 Thread Bill Dueber
Speaking of which...does anyone have robust code for getting the date of
publication out of a MARC record, correcting for (or ignoring or otherwise
dealing with) stuff in the fixed fields, dates on other calendars, dates
that are far enough in the future that they must be a mistake, etc.?

  -Bill "yes, that *was* published in 5763" Dueber
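
The easy 80% is a few lines with ruby-marc; everything the sketch below
ignores is the hard part being asked about:

    # Naive sketch: prefer 008/07-10 ("Date 1") when it's four digits;
    # fall back to the first four-digit run in 260$c. Ignores "19uu",
    # other calendars (hello, 5763), and obviously-bogus future dates.
    def pub_year(record)
      f008 = record['008']
      if f008 && f008.value && f008.value[7, 4] =~ /\A\d{4}\z/
        return f008.value[7, 4].to_i
      end
      f260 = record['260']
      c = f260 && f260['c']
      c && c[/\d{4}/] && c[/\d{4}/].to_i
    end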


On Thu, Feb 7, 2013 at 11:40 AM, Kevin S. Clarke kscla...@gmail.com wrote:

 I have an idea stuck in my memory that OCLC wrote a Java-based date
 parsing library long ago (that parses all the library world's strange
 date formats).  My search-fu seems to be weak, though, because I don't
 seem to be able to Google/find it.  Was it just a crazy dream or does
 anyone know what I'm talking about (and how to find it)?

 Thanks,
 Kevin




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] code4lib 2013 location

2013-02-05 Thread Bill Dueber
On Tue, Feb 5, 2013 at 12:01 PM, Francis Kayiwa kay...@uic.edu wrote:

 Power will be better than the Superbowl post
 half-time but we expect you to share. :-)


Does this mean "We're loaded for bear" or "Bring your own plug-strips"?

Also, a reminder to people -- put your name on your computer *and your
power adapter.* Things can get...confusing.


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] conf presenters: a kind request

2013-02-04 Thread Bill Dueber
I'm gonna add to this briefly, and probably a bit less tactfully than
Jonathan :-)

   - My number-one complaint about past presentations: Don't have slides we
   can't read. "You probably can't read this, but..." isn't a helpful thing to
   hear during a presentation. Make it legible, or figure out a different way
   to present the information. A kick-ass poster or UML diagram or flowchart
   or whatever isn't kick-ass when we can't read it. It's just an
   uninformative blur. [Note: this doesn't mean you shouldn't include the
   kick-ass poster when you upload your slides. Please do!]
   - Make sure your content fits well in the time allotted. You're not
   there to get through as much as possible. You're there to best use our
   collective time to make the argument that what you're doing is
   important/impressive/worth knowing, and to convey *as much of the
   interesting bits as you can without rushing*. The goal isn't for you to
   get lots of words out of your mouth; the goal is for us to understand them.
   If you absolutely can't cut it down to a point where you're not rushing,
   then you haven't done the hard work of distilling out the interesting bits,
   and you should get on that right away.
   - On the flip side, don't present for 8 minutes and leave plenty of time
   for questions. Odds are you're not saying anything interesting enough to
   elicit questions in those 8 minutes. If you really only have 8 minutes of
   content, well, you shouldn't have proposed a talk. But odds are you *do*
   have interesting things to say, and may want to chat with your colleagues
   to figure out exactly what that is.
   - Don't make the 3.38 million messages on creating a non-threatening
   environment be for naught. Please.

As Jonathan said: this is a great, great audience. We're all forgiving,
we're all interested, we're all eager to learn new things and figure out how
to apply them to our own situations. We love to hear about your successes.
We *love* to hear about failures that include a way for us to avoid them,
and you're going to be well-received no matter what, because a bunch of
people voted to hear you!





On Mon, Feb 4, 2013 at 10:47 AM, Jonathan Rochkind rochk...@jhu.edu wrote:

 We are all very excited about the conference next week, to speak to our
 peers and to hear what our peers have to say!

 I would like to suggest that those presenting be considerate to your
 audience, and actually prepare your talk in advance!

 You may think you can get away with making some slides that morning during
 someone elses talk and winging it; nobody will notice right? Or they wont'
 care if they do?

 From past years, I can say that for me at least, yeah, I can often tell
 who hasn't actually prepared their talk. And I'll consider it disrespectful
 to the time of the audience, who voted for your talk and then got on
 airplanes to come see it, and you didn't spend the time to plan it advance
 and make it as high quality for them as you could.

 I don't mean to make people nervous about public speaking. The code4lib
 audience is a very kind and generous audience, they are a good audience.
 It'll go great! Just maybe repay their generosity by actually preparing
 your talk in advance, you know?  Do your best, it'll go great!

 If you aren't sure how to do this, the one thing you can probably do to
 prepare (maybe this is obvious) is practice your presentation in advance,
 with a timer, just once.  In front of a friend or just by yourself. Did you
 finish on time, and get at least half of what was important in? Then you're
 done preparing, that was it!  Yes, if you're going to have slides, this
 means making your slides or notes/outline in advance so you can practice
 your delivery just once!

 Just practice it once in advance (even the night before, as a last
 resort!), and it'll go great!

 Jonathan




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Why we need multiple discovery services engine?

2013-02-04 Thread Bill Dueber
 and hosted metadata results
 presented separately (although probably preferably in a consistent UI),
 rather than merged.
 
 A bunch more discussion of these issues is included in my blog post at:
 
 http://bibwild.wordpress.com/2012/10/02/article-search-improvement-strategy/
 
 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Wayne
 Lam [
 wing...@gmail.com]
 Sent: Thursday, January 31, 2013 9:31 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] Why we need multiple discovery services engine?
 
 Hi all,
 
 I saw in numerous of library website, many of them would have their own
 based discovery services (e.g. blacklight / vufind) and at the same time
 they will have vendor based discovery services (e.g. EDS / Primo /
 Summon).
 Instead of having to maintain 2 separate system, why not put everything
 into just one? Any special reason or concern?
 
 Best
 
 Wayne
 
 --
 Emily Lynema
 Associate Department Head
 Information Technology, NCSU Libraries
 919-513-8031
 emily_lyn...@ncsu.edu




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Code4Lib Conference streaming?

2013-01-30 Thread Bill Dueber
...and a gentle reminder to people actually *at* the conference to *please
don't stream the talk you're actually sitting in*. If you can't see, move
up; don't kill the wifi ;-)

 -Bill, remembering the conf at IU where this happened -


On Wed, Jan 30, 2013 at 8:49 AM, Sarah Wiebe swi...@georgebrown.ca wrote:

 +1
 Eagerly awaiting streaming news. :)

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Eric Phetteplace
 Sent: January-29-13 9:59 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Code4Lib Conference streaming?

 yayyy! I can't stress how valuable this is for those of us who can only
 attend a couple conferences a year.

 Best,
 Eric Phetteplace
 Emerging Technologies Librarian
 Chesapeake College
 Wye Mills, MD


 On Tue, Jan 29, 2013 at 9:41 PM, Margaret Heller mhell...@luc.edu wrote:

  Yes, thanks to the people at UIC Learning Environments & Technology
  Services the conference will be streamed and archived. We are awaiting
  details, but certainly will publicize it widely when we have them.
 
  Margaret Heller
 
  Margaret Heller
  Digital Services Librarian
  Loyola University Chicago
  773.508.2686
 
   Tom Keays tomke...@gmail.com 01/29/13 20:36 PM 
  I was wondering if talks from the conference would be streamed this year?
  It was really great to have it the last time I was unable to attend.
 
  Tom
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Adding authority control to IR's that don't have it built in

2013-01-29 Thread Bill Dueber
Has anyone created a nice little wrapper around FAST? I'd like to test out
including FAST subjects in our catalog, but am hoping someone else went
through the work of building the code to do it :-) I know FAST has a web
interface, but I've got about 10M records and would rather use something
local.


On Tue, Jan 29, 2013 at 4:36 PM, Ed Summers e...@pobox.com wrote:

 Hi Kyle,

 If you are thinking of doing name or subject authority control you
 might want to check out OCLC's VIAF AutoSuggest service [1] and FAST
 AutoSuggest [2]. There are also autosuggest searches for the name and
 subject authority files, that are lightly documented in their
 OpenSearch document [3].

 In general, I really like this approach, and I think it has a lot of
 potential for newer cataloging interfaces. I'll describe two scenarios
 that I'm familiar with, that have worked quite well (so far). Note,
 these aren't IR per-se, but perhaps they will translate to your
 situation.

 As part of the National Digital Newspaper Program LC has a simple app
 so that librarians can create essays that describe newspapers in
 detail. Rather than making this part of our public website we created
 an Essay Editor as a standalone django app that provides a web based
 editing environment, for authoring the essays. Part of this process is
 linking up the essay with the correct newspaper. Rather than load all
 the newspapers that could be described into the Essay Editor, and keep
 them up to date, we exposed an OpenSearch API in the main Chronicling
 America website (where all the newspaper records are loaded and
 maintained) [4]. It has been working quite well so far.

 Another example is the jobs.code4lib.org website that allows people to
 enter jobs announcements. I wanted to make sure that it was possible
 to view jobs by organization [5], or skill [6] -- so some form of
 authority control was needed. I ended up using Freebase Suggest [7]
 that makes it quite easy to build simple forms that present users with
 subsets of Freebase entities, depending on what they type. A nice side
 benefit of using Freebase is that you get descriptive text and images
 for the employers and topics for free. It has been working pretty well
 so far. There is a bit of an annoying conflict between the Freebase
 CSS and Twitter Bootstrap, which might be resolved by updating
 Bootstrap. Also, I've noticed Freebase's service slowing down a bit
 lately, which hopefully won't degrade further.

 The big caveat here is that these external services are dependencies.
 If they go down, a significant portion of your app might go down too.
 Minimizing this dependency, or allowing things degrade well is good to
 keep in mind. Also, it's worth remembering identifiers (if they are
 available) for the selected matches, so that they can be used for
 linking your data with the external resource. A simple string might
 change.

 I hope this helps. Thanks for the question, I think this is an area
 where we can really improve some of our back-office interfaces and
 applications.

 //Ed

 [1]
 http://www.oclc.org/developer/documentation/virtual-international-authority-file-viaf/request-types#autosuggest
 [2] http://experimental.worldcat.org/fast/assignfast/
 [3] http://id.loc.gov/authorities/opensearch/
 [4] http://chroniclingamerica.loc.gov/about/api/#autosuggest
 [5]
 http://jobs.code4lib.org/employer/university-of-illinois-at-urbana-champaign/
 [6] http://jobs.code4lib.org/jobs/ruby/
 [7] http://wiki.freebase.com/wiki/Freebase_Suggest

 On Tue, Jan 29, 2013 at 11:59 AM, Kyle Banerjee kyle.baner...@gmail.com
 wrote:
  How are libraries doing this and how well is it working?
 
  Most systems that even claim to have authority control simply allow a
  controlled keyword list. But this does nothing for the see and see also
  references that are essential for many use cases (people known by many
  names, entities that change names, merge or whatever over time, etc).
 
  The two most obvious solutions to me are to write an app that provides
 this
  information interactively as the query is typed (requires access to the
  search box) or to have a record that serves as a disambiguation page
 (might
  not be noticed by the user for a variety of reasons). Are there other
  options, and what do you recommend?
 
  Thanks,
 
  kyle




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Adding authority control to IR's that don't have it built in

2013-01-29 Thread Bill Dueber
Right -- I'd like to show the FAST stuff as facets in our catalog search
(or, at least try it out and see if anyone salutes). So I'd need to inject
the FAST data into the records at index time.


On Tue, Jan 29, 2013 at 4:59 PM, Ed Summers e...@pobox.com wrote:

 I think that Mike Giarlo and Michael Witt used the FAST AutoSuggest as
 part of their databib project [1]. But are you talking about bringing
 the data down for a local index?

 //Ed

 [1] http://databib.org/

 On Tue, Jan 29, 2013 at 4:45 PM, Bill Dueber b...@dueber.com wrote:
  Has anyone created a nice little wrapper around FAST? I'd like to test
 out
  including FAST subjects in our catalog, but am hoping someone else went
  through the work of building the code to do it :-) I know FAST has a web
  interface, but I've got about 10M records and would rather use something
  local.
 
 
  On Tue, Jan 29, 2013 at 4:36 PM, Ed Summers e...@pobox.com wrote:
 
  Hi Kyle,
 
  If you are thinking of doing name or subject authority control you
  might want to check out OCLC's VIAF AutoSuggest service [1] and FAST
  AutoSuggest [2]. There are also autosuggest searches for the name and
  subject authority files, that are lightly documented in their
  OpenSearch document [3].
 
  In general, I really like this approach, and I think it has a lot of
  potential for newer cataloging interfaces. I'll describe two scenarios
  that I'm familiar with, that have worked quite well (so far). Note,
  these aren't IR per-se, but perhaps they will translate to your
  situation.
 
  As part of the National Digital Newspaper Program LC has a simple app
  so that librarians can create essays that describe newspapers in
  detail. Rather than making this part of our public website we created
  an Essay Editor as a standalone django app that provides a web based
  editing environment, for authoring the essays. Part of this process is
  linking up the essay with the correct newspaper. Rather than load all
  the newspapers that could be described into the Essay Editor, and keep
  them up to date, we exposed an OpenSearch API in the main Chronicling
  America website (where all the newspaper records are loaded and
  maintained) [4]. It has been working quite well so far.
 
  Another example is the jobs.code4lib.org website that allows people to
  enter jobs announcements. I wanted to make sure that it was possible
  to view jobs by organization [5], or skill [6] -- so some form of
  authority control was needed. I ended up using Freebase Suggest [7]
  that makes it quite easy to build simple forms that present users with
  subsets of Freebase entities, depending on what they type. A nice side
  benefit of using Freebase is that you get descriptive text and images
  for the employers and topics for free. It has been working pretty well
  so far. There is a bit of an annoying conflict between the Freebase
  CSS and Twitter Bootstrap, which might be resolved by updating
  Bootstrap. Also, I've noticed Freebase's service slowing down a bit
  lately, which hopefully won't degrade further.
 
  The big caveat here is that these external services are dependencies.
  If they go down, a significant portion of your app might go down too.
  Minimizing this dependency, or allowing things degrade well is good to
  keep in mind. Also, it's worth remembering identifiers (if they are
  available) for the selected matches, so that they can be used for
  linking your data with the external resource. A simple string might
  change.
 
  I hope this helps. Thanks for the question, I think this is an area
  where we can really improve some of our back-office interfaces and
  applications.
 
  //Ed
 
  [1]
 
 http://www.oclc.org/developer/documentation/virtual-international-authority-file-viaf/request-types#autosuggest
  [2] http://experimental.worldcat.org/fast/assignfast/
  [3] http://id.loc.gov/authorities/opensearch/
  [4] http://chroniclingamerica.loc.gov/about/api/#autosuggest
  [5]
 
 http://jobs.code4lib.org/employer/university-of-illinois-at-urbana-champaign/
  [6] http://jobs.code4lib.org/jobs/ruby/
  [7] http://wiki.freebase.com/wiki/Freebase_Suggest
 
  On Tue, Jan 29, 2013 at 11:59 AM, Kyle Banerjee 
 kyle.baner...@gmail.com
  wrote:
   How are libraries doing this and how well is it working?
  
   Most systems that even claim to have authority control simply allow a
   controlled keyword list. But this does nothing for the see and see
 also
   references that are essential for many use cases (people known by many
   names, entities that change names, merge or whatever over time, etc).
  
   The two most obvious solutions to me are to write an app that provides
  this
   information interactively as the query is typed (requires access to
 the
   search box) or to have a record that serves as a disambiguation page
  (might
   not be noticed by the user for a variety of reasons). Are there other
   options, and what do you recommend?
  
   Thanks,
  
   kyle
 
 
 
 
  --
  Bill Dueber

Re: [CODE4LIB] Anyone have a SUSHI client?

2013-01-24 Thread Bill Dueber
Yeah -- I found that right away. Most of what's there appears to be
abandonware.


On Thu, Jan 24, 2013 at 9:10 AM, Tom Keays tomke...@gmail.com wrote:

 Hey. NISO has a list of SUSHI tools.

 http://www.niso.org/workrooms/sushi/tools/

 Tom




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] Anyone have a SUSHI client?

2013-01-23 Thread Bill Dueber
[Background: SUSHI
<http://www.niso.org/committees/SUSHI/SUSHI_comm.html> is a SOAP
protocol for getting data on use of electronic resources in the
COUNTER format]

I'm just starting to look at trying to get COUNTER data via SUSHI into our
data warehouse, and I'm discovering that apparently no one has worked on a
SUSHI client since late 2009.

Unless...I'm missing one? Anyone out there using SUSHI and have a client
that works and is up-to-date and has some documentation of some sort? I'd
prefer ruby or java, but will take anything that'll run under linux (i.e.,
not C#) at this point.

I'm desperately trying not to have to deal with the raw SOAP and parsing
the XML and such, so any help would be appreciated.
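
The closest I've come to not touching raw SOAP is a generic client like the
savon gem. A heavily hedged sketch -- the endpoint, IDs, and message
element names below are invented, and your provider's WSDL is the real
authority on the GetReport shape:

    require 'savon'

    # Hypothetical endpoint and identifiers throughout.
    client = Savon.client(wsdl: 'https://sushi.example.org/SushiService?wsdl')

    response = client.call(:get_report, message: {
      'Requestor'         => { 'ID' => 'my-requestor-id' },
      'CustomerReference' => { 'ID' => 'my-customer-id' },
      'ReportDefinition'  => { 'Name' => 'JR1', 'Release' => '4' }
    })

    puts response.body.inspect  # still COUNTER XML to pick apart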

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Anybody using the Open Library APIs?

2013-01-22 Thread Bill Dueber
The HathiTrust BibAPI might help you out -- you can get MARC-XML back with
a call, although of course it's only as good as the underlying record, and
our coverage won't be nearly as good as OCLC's.

Format is:

http://catalog.hathitrust.org/api/volumes/full/isbn/080582796X.json





On Tue, Jan 22, 2013 at 8:38 PM, William Denton w...@pobox.com wrote:

 On 21 January 2013, David Fiander wrote:

  All I'm really looking for at this point is a way to convert an ISBN into
 basic bibliographic data, and to find any related ISBNs, a la OCLC's xISBN
 service.


 LibraryThing's thingISBN is nice and might serve your needs:

 
 http://www.librarything.com/wiki/index.php/LibraryThing_APIs

 Bill
 --
 William Denton
 Toronto, Canada
 http://www.miskatonic.org/




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Zoia

2013-01-22 Thread Bill Dueber
On Tue, Jan 22, 2013 at 9:50 PM, Genny Engel gen...@sonoma.lib.ca.us
 wrote:

 Guess there's no groundswell of support for firing Zoia and replacing
 her/it with a GLaDOS irc bot, then?


I'm in. We've both said things you're going to regret.

[GLaDOS <https://en.wikipedia.org/wiki/Glados> is the really-quite-mean AI
from the games Portal and Portal 2]

On Tue, Jan 22, 2013 at 9:50 PM, Genny Engel gen...@sonoma.lib.ca.us wrote:

 Guess there's no groundswell of support for firing Zoia and replacing
 her/it with a GLaDOS irc bot, then?

 *Sigh.*

 Genny Engel


 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Andromeda Yelton
 Sent: Friday, January 18, 2013 11:30 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Zoia

 FWIW, I am both an active #libtechwomen participant and someone who is so
 thoroughly charmed by zoia I am frequently bothered she isn't right there
 *in my real life*.  (Yes, I have tried to issue zoia commands during
 face-to-face conversations with non-Code4Libbers.)

 I think a collaboratively maintained bot with a highly open ethos is always
 going to end up with some things that cross people's lines, and that's an
 opportunity to talk about those lines and rearticulate our group norms.
  And to that end, I'm in favor of weeding the collection of plugins,
 whether because of offensiveness or disuse.  (Perhaps this would be a good
 use of github's issue tracker, too?)

 I also think some sort of 'what's zoia and how can you contribute' link
 would be useful in any welcome-newbie plugin; it did take me a while to
 figure out what was going on there.  (Just as it took me the while to
 acquire the tastes for, say, coffee, bourbon, and blue cheese, tastes which
 I would now defend ferociously.)

 But not having zoia would make me sad.  And defining zoia to be
 woman-unfriendly, when zoia-lovers and zoia-haters appear to span the
 gender spectrum and have a variety of reasons (both gendered and non) for
 their reactions, would make me sad too.

 @love zoia.

 Andromeda




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] A gentle proposal: slim down zoia during the conference

2013-01-17 Thread Bill Dueber
I'd like to propose that zoia (the IRC bot that provides help and
entertainment in the #code4lib IRC channel) have some of its normal plugins
disabled during conf. With three or four times as many people online during
conference, things can get out of hand.

Lots of zoia plugins can be useful during conference; I'm mostly thinking
of stuff whose utility is suspect and whose output covers several lines.
Some examples:

   - @mf
   - @cast
   - @tdih
   - @sing

The goal, really, is to try and turn the firehose that the IRC channel
becomes into something at least plausibly manageable in realtime.

I can also make a case for things that newbies will just find confusing
(chef, takify, etc.) or offensive (@forecast, @mf again) but I'll let
others potentially make that case.



  -Bill-


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Groupon: $9 for 3-Day CTA Pass

2013-01-16 Thread Bill Dueber
I guess it depends on when you're leaving, but by my numbers it's more than
three weeks until the conference...


On Wed, Jan 16, 2013 at 11:22 AM, Wilhelmina Randtke rand...@gmail.com wrote:

 It says "Allow up to 3 weeks for delivery of CTA Pass." This is better if
 you are going to ALA over the summer, or something else more in the future.

 -Wilhelmina Randtke


 On Wed, Jan 16, 2013 at 10:17 AM, Carmen Mitchell
 carmenmitch...@gmail.comwrote:

  For the folks going to Chicago this year...This is a great deal.
 
   $9 for a 3-Day Pass from the Chicago Transit Authority ($20 Value)
 
 
 http://www.groupon.com/deals/chicago-transit-authority-cta-3?utm_campaign=UserReferral_dpamp;utm_medium=emailamp;utm_source=uu83298
 
  -Carmen
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] code4lib 2013 location

2013-01-11 Thread Bill Dueber
Because it seems like it might be useful, I've started a publicly-editable
google map at

http://goo.gl/maps/LWqay

Right now, it has two points: the hotel and the conference location. Please
add stuff as appropriate if the urge strikes you.




On Fri, Jan 11, 2013 at 7:54 PM, Francis Kayiwa kay...@uic.edu wrote:

 On Fri, Jan 11, 2013 at 06:41:26PM -0500, Cynthia Ng wrote:
  I'm sorry, but that doesn't actually clear up anything for me. The
  location on the layrd page just says Chicago. So, is the conference
  still happening at UIC? Since the conference hotel isn't super close,
  does that mean there will be transportation provided?

 The entire conference and pre-conference is at UIC. The Forum is a
 revenue generating part of UIC. The pre-conference will be at the
 University Libraries on Monday with the exception of the Drupal one.

 The hotel is a mile or thereabouts from UIC Forum. Here is the problem
 with us natives planning. It never crossed our minds that walking a mile
 while on the *upper limit* of our shuttling to and from work is not the
 norm for everyone. This was brought to our attention and we will have a
 shuttle from the Hotel to the Conference venue.

 
  While we're on the subject, are the pre-conferences happening at the
  same location?


 See above.

 ./fxk

 
  On Fri, Jan 11, 2013 at 2:51 PM, Francis Kayiwa kay...@uic.edu wrote:
   On Fri, Jan 11, 2013 at 10:41:54AM -0800, Erik Hetzner wrote:
   Hi all,
  
   Apparently code4lib 2013 is going to be held at the UIC Forum
  
 http://www.uic.edu/depts/uicforum/
  
   I assumed it would be at the conference hotel. This is just a note so
   that others do not make the same assumption, since nowhere in the
   information about the conference is the location made clear.
  
   Since the conference hotel is 1 mile from the venue, I assume
   transportation will be available.
  
   That's a good assumption to make. As to the confusion  I said to you
   when you asked me about this a couple of days ago.
  
   http://www.uic.edu/~kayiwa/code4lib.html was supposed to be our
   proposal. If you look at the document it also suggests that we were
   going to have the conference registration staggered by timezones. We
   have elected not to update that because as that was our proposal. When
   preparing our proposal we borrowed heavily from Yale's and IU's
 proposal
   and if someone would like to steal from us I think it is fair to leave
   that as is.
  
   If you want the conference page use the lanyrd.com link below. I can't
   even take credit for doing that. All of that goes to @pberry
  
   http://lanyrd.com/2013/c4l13/
  
   Cheers,
   ./fxk
  
  
  
  
   best, Erik Hetzner
  
   Sent from my free software system http://fsf.org/.
  
  
  
  
   --
   Speed is subsittute fo accurancy.
 

 --
 Speed is subsittute fo accurancy.




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] refworks export

2012-12-27 Thread Bill Dueber
The best generic format is probably RIS. It's simple and everyone reads
it.

For export to RefWorks, I actually use the RefWorks tagged format
<http://www.refworks.com/rwathens/help/RefWorks_Tagged_Format.htm>
-- it's at least as expressive as other tagged formats (RIS, Endnote, etc.)
and allows more types (conference proceeding, book, etc.).
I've attached two files (or, at least, I hope they're attached; not sure
what the mailing software will do) that are simple YAML files specifying
the mappings that I use, if you want to start there. It's pretty easy to
see from the YAML files how to write the code to produce the actual export
files.

Let me know if you improve them :-)
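
(For the curious: RIS itself is trivial to emit. A sketch from a simple
hash -- the input structure is invented:)

    # Minimal RIS emitter; RIS lines are "TAG  - value", starting with
    # TY and ending with ER.
    def to_ris(cite)
      lines = ["TY  - #{cite[:type] || 'GEN'}"]
      Array(cite[:authors]).each { |a| lines << "AU  - #{a}" }
      lines << "TI  - #{cite[:title]}" if cite[:title]
      lines << "PY  - #{cite[:year]}"  if cite[:year]
      lines << 'ER  - '
      lines.join("\r\n")
    end

    puts to_ris(type: 'BOOK', authors: ['Dueber, Bill'],
                title: 'An Example', year: 2012)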





On Thu, Dec 27, 2012 at 4:16 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 If I have software I'm writing that I want to provide an export to
 refwroks from...

 ...refworks supports import in a bazillion different formats, many
 vendor-specific.

 What are people's experience with the best, most complete, easiest to work
 wtih, 'generic' format for RefWorks import?

 EndNote? RIS? Other?




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


refworksFormatExport.yaml
Description: Binary data


risexport.yaml
Description: Binary data


Re: [CODE4LIB] Code4Lib MidWest

2012-04-28 Thread Bill Dueber
I'm very interested, and imagine there are a few of us here in Ann Arbor
that would make the day-trip. I'm personally on vacation the first week, if
you're keeping track.

On Sat, Apr 28, 2012 at 9:20 AM, Mita Williams mita.willi...@gmail.com wrote:

 I'm interested. I'd prefer during the week instead of weekends. Thanks! M

 On Fri, Apr 27, 2012 at 8:32 AM, Ken Irwin kir...@wittenberg.edu wrote:

  Thanks Ranti!
 
  I am definitely interested, and would favor a the latter end of the
  proposed timeframe.
 
  Ken
 
  -Original Message-
  From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
  Matt Schultz
  Sent: Thursday, April 26, 2012 3:08 PM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] Code4Lib MidWest
 
  Hi Ranti,
 
  I work virtually with Educopia Institute and the MetaArchive Cooperative,
  and am based near Grand Rapids, MI. I would definitely look forward to
  attending being so close and all, and could do so either early in the
 week
  or the weekend. But would prefer the weekend.
 
  Best,
 
  Matt Schultz
  Program Manager
  Educopia Institute, MetaArchive Cooperative http://www.metaarchive.org
  matt.schu...@metaarchive.org
  616-566-3204
 
  On Thu, Apr 26, 2012 at 2:45 PM, Ranti Junus ranti.ju...@gmail.com
  wrote:
 
   Hello All,
  
   Michigan State University (Lansing, MI) is hosting the next Code4Lib
   Midwest. We aim to hold the event in either week of July 16th or 23rd
   (but most likely not July 27th) either as 1.5 or 2 days event. So, my
   question for those who might be interested to come: would it be better
   to have it early in the week or weekend?
  
   Let me know and then I'll set up a doodle poll for the date options.
  
  
   thanks,
   ranti.
  
   --
   Bulk mail.  Postage paid.
  
 
 
 
  --
  Matt Schultz
  Program Manager
  Educopia Institute, MetaArchive Cooperative http://www.metaarchive.org
  matt.schu...@metaarchive.org
  616-566-3204
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-17 Thread Bill Dueber
On Tue, Apr 17, 2012 at 8:46 PM, Simon Spero sesunc...@gmail.com wrote:

 Actually Anglo and Francophone centric. And the USMARC style 245 was a poor
 replacement for the UKMARC approach (someone at the British Library hosted
 Linked Data meeting wondered why there were punctation characters included
 in the data in the title field. The catalogers wept slightly).

 Simon



Slightly? I cry my eyes out *every single day* about that. Well, every
weekday, anyway.


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] Modern NACO Normalization (esp. in java?)

2012-04-11 Thread Bill Dueber
I'm about to embark on trying to write code to apply NACO normalization to
strings (not for field-to-field comparisons, but for correctly sorting
things). I was driven to this by a complaint about how some Arabic
manuscript titles are sorting.

My end goal is a Solr filter, so I'm most interested in Java code.

It doesn't look hard so much as long and error-prone so I'm hoping
someone has already done this (or at least has a character map that I can
easily convert to java).

I've seen the code at OCLC
<http://www.oclc.org/research/activities/naco/default.htm>,
but it's 10 years old and doesn't have a lot of the non-latin stuff in it.

Evergreen has a perl implementation
<http://git.evergreen-ils.org/?p=Evergreen.git;a=blob;f=Open-ILS/src/perlmods/lib/OpenILS/Utils/Normalize.pm>:
that's probably where I'll start if no one has anything else.

Anyone?
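
Just to make it concrete, the shape of the thing I'm after is something like
the sketch below -- bare bones only, with a toy character map standing in for
the real NACO tables (which are far bigger), and in ruby rather than the java
I actually need:

    # Toy NACO-ish normalizer: lowercase, fold a few diacritics, map
    # punctuation to spaces, collapse runs of whitespace. Skeleton only;
    # the real rules cover vastly more characters and edge cases.
    TOY_MAP = { 'ä' => 'a', 'é' => 'e', 'ø' => 'o', 'ñ' => 'n' }

    def naco_normalize_stub(str)
      s = str.downcase
      TOY_MAP.each { |from, to| s = s.gsub(from, to) }
      s = s.gsub(/[[:punct:]]/, ' ')
      s.squeeze(' ').strip
    end

    naco_normalize_stub('Pérez, José!') # => "perez jose"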


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Modern NACO Normalization (esp. in java?)

2012-04-11 Thread Bill Dueber
Wow! Thanks, Ralph! This is great!

On Wed, Apr 11, 2012 at 12:04 PM, LeVan,Ralph le...@oclc.org wrote:

 I'm pretty sure attachments don't work on the list, so I'm just pasting
 my NACO normalizer below.  Note that there are 2007 versions of the
 normalize() method in there.  This is used for all the VIAF and
 Identities indexing.

 Ralph


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] Anyone using marc2solr?

2012-03-21 Thread Bill Dueber
A while ago I released the software I've been using for solr indexing as
marc2solr (and related gems).

I'm planning on starting over from the ground up, but...well, I really
like the name. :-)

Is there anyone out there actually *using* marc2solr besides me, in a way
that would make repurposing the github/rubygem name a bad idea? I know in
general it's a good idea to not do that, but I have a feeling this is
essentially an internal project that happens to be exposed on the public
web.

[Note: I'm pretty sure a flame war about reusing old github/gem names isn't
a great use of anyone's time.]

 -Bill-


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Unicode font for PDF generation?

2012-03-16 Thread Bill Dueber
I don't know if it's any good, but TITUS[1] is a pan-unicode font free for
non-commercial use. I don't know if that includes embedding in a PDF or not.

1. http://titus.fkidg1.uni-frankfurt.de/unicode/tituut.asp

On Fri, Mar 16, 2012 at 6:13 PM, Mark Redar mark.re...@ucop.edu wrote:

 Hi All,

 We're having some fun with unicode characters in PDF generation. We have a
 process that automatically generates a pdf from XML input. The tool stack
 doesn't support multiple fonts for displaying different codepoints so we
 need a good pan-unicode font to bundle with the pdfs.

 Currently, we use the DejaVu font family for creating the pdfs. This has
 good coverage for latin  cyrillic characters but has no CJK
 (chinese-japanese-korean) coverage. We've looked into licensing a
 commercial fonts, but for web server use these require annual licensing
 fees that are substantial (in the thousands of $).
 A number of our source documents contain CJK characters and some
 contributors have noticed the lack of support for these characters.

 Does anyone know of a good pan-unicode free font that includes CJK
 codepoints that looks good? Gnu unifont has the coverage, but it is not the
 best looking font.

 Barring that, we're thinking of rolling our own pan-unicode font. There
 are good open source fonts for portions of the unicode character sets.
 We're hoping to find some way to take a number of open source fonts and
 combine them into one large pan-unicode font.

 Does anyone have experience with font authoring and merging different
 fonts?

 It looks as though FontForge can merge fonts, but it's not clear how to
 deal with overlapping codepoints in the merged fonts.

 Thanks,

 Mark




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] NON-MARC ILS?

2012-03-14 Thread Bill Dueber
On Wed, Mar 14, 2012 at 2:17 PM, Wilfred Drew dr...@tc3.edu wrote:

 I did not mean to sound snarky in my earlier message but I do not
 understand why no one is talking about standards and why we have them.
  This includes standard ways to present and transmit data between systems.
  That is one of the big reasons for using MARC.


I think at least partially because the standard (MARC21 with AACR2) is
incredibly arcane with an enormous learning curve. It's hard, it doesn't
make sense in lots and lots of ways, and for many applications the initial
cost is just plain too steep, no matter what the eventual benefits.
MARC/AACR2 is the standard I spend most of my time with, but that doesn't
mean I find it easy to defend.

Personally, I don't find it hard to imagine bibliographic applications
where MARC cataloging is way over the top. If you only have a few thousand
volumes, even something as simplistic as an RIS record for each item that
includes a shelf-number will get you an awfully long way. Whether or not it
gets your far enough is a different (and more difficult) question that can
only be answered by the people on the ground, who know what they have and
can guess at what's coming.


Re: [CODE4LIB] Preserving hyperlinks in conversion from Excel/googledocs/anything to PDF (was Any ideas for free pdf to excel conversion?)

2012-03-06 Thread Bill Dueber
What exactly are you trying to do? Take a list of links and turn them
into...a list of hot links in a PDF file?

On Mon, Mar 5, 2012 at 8:46 AM, Matt Amory matt.am...@gmail.com wrote:

 Does anyone know of any script library that can convert a set of (~200)
 hyperlinks into Acrobat's goofy protocol?  I do own Acrobat Pro.

 Thanks

 On Wed, Dec 14, 2011 at 1:08 PM, Matt Amory matt.am...@gmail.com wrote:

  Just looking to preserve column structure.
 
  --
  Matt Amory
  (917) 771-4157
  matt.am...@gmail.com
  http://www.linkedin.com/pub/matt-amory/8/515/239
 
 


 --
 Matt Amory
 (917) 771-4157
 matt.am...@gmail.com
 http://www.linkedin.com/pub/matt-amory/8/515/239




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Metadata war stories...

2012-01-28 Thread Bill Dueber
://ead.lib.virginia.edu/vivaxtf/view?docId=uva-sc/viu00888.xml;query=;brand=default#adminlink
 
 
   On Fri, Jan 27, 2012 at 6:26 PM, Roy Tennant roytenn...@gmail.com
   wrote:
 
   Oh, I should have also mentioned that some of the worst problems occur
  when people treat their metadata like it will never leave their
  institution. When that happens you get all kinds of crazy cruft in a
  record. For example, just off the top of my head:
 
  * Embedded HTML markup (one of my favorites is animg  tag)
  * URLs to remote resources that are hard-coded to go through a
  particular institution's proxy
  * Notes that only have meaning for that institution
  * Text that is meant to display to the end-user but may only do so in
  certain systems; e.g., Click here in a particular subfield.
 
  Sigh...
  Roy
 
  On Fri, Jan 27, 2012 at 4:17 PM, Roy Tennant roytenn...@gmail.com
   wrote:
 
  Thanks a lot for the kind shout-out Leslie. I have been pondering
 what
  I might propose to discuss at this event, since there is certainly
  plenty of fodder. Recently we (OCLC Research) did an investigation of
  856 fields in WorldCat (some 40 million of them) and that might prove
  interesting. By the time ALA rolls around there may something else
  entirely I could talk about.
 
  That's one of the wonderful things about having 250 million MARC
  records sitting out on a 32-node cluster. There are any number of
  potentially interesting investigations one could do.
  Roy
 
  On Thu, Jan 26, 2012 at 2:10 PM, Johnston, Leslie lesl...@loc.gov
 
  wrote:
 
  Roy's fabulous Bitter Harvest paper:
 
  http://roytennant.com/bitter_harvest.html
 
 
  -Original Message-
  From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU]
  On Behalf
 
  Of Walter Lewis
 
  Sent: Wednesday, January 25, 2012 1:38 PM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] Metadata war stories...
 
  On 2012-01-25, at 10:06 AM, Becky Yoose wrote:
 
   - Dirty data issues when switching discovery layers or using
  legacy/vendor metadata (ex. HathiTrust)
 
 
  I have a sharp recollection of a slide in a presentation Roy Tennant
 
  offered up at Access  (at Halifax, maybe), where he offered up a
 range
  of
  dates extracted from an array of OAI harvested records.  The good, the
  bad,
  the incomprehensible, the useless-without-context (01/02/03 anyone?)
  and on
  and on.  In my years of migrating data, I've seen most of those
  variants.
  (except ones *intended* to be BCE).
 
 
  Then there are the fielded data sets without authority control.  My
 
  favourite example comes from staff who nominally worked for me, so
 I'm
  not
  telling tales out of school.  The classic Dynix product had a
 Newspaper
  index module that we used before migrating it (PICK migrations; such a
  joy).  One title had twenty variations on Georgetown Independent (I
  wish
  I was kidding) and the dates ranged from the early ninth century until
  nearly the 3rd millenium. (apparently there hasn't been much change in
  local council over the centuries).
 
 
  I've come to the point where I hand-walk the spatial metadata to
 links
 
  with to geonames.org for the linked open data. Never had to do it
 for
  a
  set with more than 40,000 entries though.  The good news is that it
  isn't
  hard to establish a valid additional entry when one is required.
 
 
  Walter
 
 
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Sending html via ajax -vs- building html in js (was: jQuery Ajax request to update a PHP variable)

2011-12-08 Thread Bill Dueber
To these I would add:

* Reuse. The call you're making may be providing data that would be useful
in other contexts as well. If you're generating application-specific html,
that can't happen.

But really, separation of concerns is the biggest one. Having to dig
through both template and code to make stylistic changes is icky. Now
excuse me, I have to go work with PHP. And then take a shower to try to get
the smell off me.

On Wed, Dec 7, 2011 at 5:19 PM, Robert Sanderson azarot...@gmail.com wrote:

 Here's some off the top of my head:

 * Separation of concerns -- You can keep your server side data
 transfer and change the front end easily by working with the
 javascript, rather than reworking both.

 * Lax Security -- It's easier to get into trouble when you're simply
 inlining HTML received, compared to building the elements.  Getting
 into the same bad habits as SQL injection. It might not be a big deal
 now, but it will be later on.

 * Obfuscation -- It's easier to debug one layer of code rather than
 two at once. It's thus also easier to maintain the two layers of code,
 and easier to see at which end the system is failing.

 Rob

 On Wed, Dec 7, 2011 at 3:12 PM, Jonathan Rochkind rochk...@jhu.edu
 wrote:
  A fair number? Anyone but Godmar?
 
  On 12/7/2011 5:02 PM, Nate Vack wrote:
 
  OK. So we have a fair number of very smart people saying, in essence,
  it's better to build your HTML in javascript than send it via ajax
  and insert it.
 
  So, I'm wondering: Why? Is it an issue of data transfer size? Is there
  a security issue lurking? Is it tedious to bind events to the new /
  updated code? Something else? I've thought about it a lot and can't
  think of anything hugely compelling...
 
  Thanks!
  -Nate
 
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] marc in json

2011-12-01 Thread Bill Dueber
I've worked to deprecate marc-hash (what tends to be referred to as "Bill
Dueber's JSON format") in favor of Ross's marc-in-json. To the best of my
knowledge, there is marc-in-json support for ruby (current ruby-marc), PHP
(current File_MARC), marc4j (currently in trunk, soon to be released, I
think), and perl (MARC::Record in the next release). I think that covers
all the major players except the IndexData yaz- stuff.

[Galen, any word on that next release of the perl module?]

I, at least, already use marc-in-json in production (It's a great way to
store MARC in solr). It would be great if folks would have the confidence
to use it, at least as a single-record format. I think for wider adoption
we'll need to all have either (a) json pull-parsers to read in a file that
contains an array of marc-in-json objects, or (b) a decision to use
newline-delimited-json (or some other record-delimiter), so folks can put
more than one of these in a file and be able to get them out without
running out of memory.
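
For anyone who hasn't looked at it yet, here's a sketch of what a single
record looks like and how it round-trips (data made up; method names are from
current ruby-marc, so check against your version):

    require 'marc'
    require 'json'

    # A marc-in-json record is just a leader plus an ordered list of fields.
    hash = {
      'leader' => '00000nam a2200000 a 4500',
      'fields' => [
        { '001' => '000111222' },
        { '245' => { 'ind1' => '1', 'ind2' => '0',
                     'subfields' => [{ 'a' => 'A made-up title' }] } }
      ]
    }

    record = MARC::Record.new_from_hash(hash) # hash -> MARC::Record
    json   = record.to_hash.to_json           # ...and back out again
    puts record['245']['a']                   # => "A made-up title"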

 -Bill-

On Thu, Dec 1, 2011 at 9:11 AM, Ross Singer rossfsin...@gmail.com wrote:

 Ed, I think this would be great.  Obviously, there's zero
 standardization around MARC/JSON (Andrew Houghton has come the
 closest by writing up the most RFC-y proposal:
 http://www.oclc.org/developer/content/marc-json-draft-2010-03-11).

 I generally fall more in the camp of working code wins, though,
 which, solely on the basis of MARC parser support, would put my
 proposal in front.  In the end, I don't think it matters which style
 is adopted; it's an interchange format, any one of them works (and
 they all, including Bill Dueber's, have their pluses and minuses).  The
 more important thing is that we pick -one- and go with it so we can
 use it with some confidence.

 While we're on the subject, if there are any other serializations of
 MARC that people are legitimately interested in (TurboMARC, for
 example:
 https://www.indexdata.com/blog/2010/05/turbomarc-faster-xml-marc-records)
 and wish ruby-marc supported, let me know.

 Thanks,
 -Ross.

 On Thu, Dec 1, 2011 at 5:57 AM, Ed Summers e...@pobox.com wrote:
  Martin Czygan recently added JSON support to pymarc [1]. Before this
  gets rolled into a release I was wondering if it might make sense to
  bring the implementation in line with Ross Singer's proposed JSON
  serialization for MARC [2]. After quickly looking around it seems to
  be what got implemented in ruby-marc [3] and PHP's File_MARC [4]. It
  also looked like there was a MARC::Record branch [5] for doing
  something similar, but I'm not sure if that has been released yet.
 
  It seems like a no-brainer to bring it in line, but I thought I'd ask
  since I haven't been following the conversation closely.
 
  //Ed
 
  [1]
 https://github.com/edsu/pymarc/commit/245ea6d7bceaec7215abe788d61a0b34a6cd849e
  [2]
 http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/
  [3]
 https://github.com/ruby-marc/ruby-marc/blob/master/lib/marc/record.rb#L227
  [4]
 http://pear.php.net/package/File_MARC/docs/latest/File_MARC/File_MARC_Record.html#methodtoJSON
  [5]
 http://marcpm.git.sourceforge.net/git/gitweb.cgi?p=marcpm/marcpm;a=shortlog;h=refs/heads/marc-json




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] marc in json

2011-12-01 Thread Bill Dueber
I was a strong proponent of NDJ at one point, but I've grown less strident
and more weary since then.

Brad Baxter has a good overview of some options[1]. I'm assuming it's a
given we'd all prefer to work with valid JSON files if the pain-point can
be brought down far enough.

A couple years have passed since we first talked about this stuff, and the
state of JSON pull-parsers is better than it once was:

  * yajl[2] is a super-fast C library for parsing json and supports stream
parsing. Bindings for ruby, node, python, and perl are linked to off the
home page. I found one PHP binding[3] on github which is broken/abandoned,
and no other pull-parser for PHP that I can find. Sadly, the ruby wrapper
doesn't actually expose the callbacks necessary for pull-parsing, although
there is a pull request[4] and at least one other option[5].
  * Perl's JSON::XS supports incremental parsing
  * the Jackson java library[6] is excellent and has an easy-to-use
pull-parser. There are a few simplistic efforts to wrap it for jruby/jython
use as well.

Pull-parsing is ugly, but no longer astoundingly difficult or slow, with
the possible exception of PHP. And output is simple enough.

As much as it makes me shudder, I think we're probably better off trying to
do pull parsers and have a marc-in-json document be a valid JSON array.

We could easily adopt a *convention* of, essentially, one-record-per-line,
but wrap it in '[]' to make it valid json. That would allow folks with a
pull-parser to write a real streaming reader, and folks without to cheat
(ditch the leading and trailing [], and read the rest as
one-record-per-line) until such a time as they can start using a more
full-featured json parser.
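
Concretely, the cheat version is only a few lines (sketch; file name made
up, and links are below):

    require 'json'

    # Read a bracket-wrapped, one-record-per-line file the cheating way:
    # skip the enclosing [ and ], strip any trailing comma, parse each line.
    File.foreach('records.json') do |line|
      line = line.strip.chomp(',')
      next if line.empty? || line == '[' || line == ']'
      rec = JSON.parse(line)
      # ...do something with rec...
    end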

1.
http://en.wikipedia.org/wiki/User:Baxter.brad/Drafts/JSON_Document_Streaming_Proposal
2. http://lloyd.github.com/yajl/
3. https://github.com/sfalvo/php-yajl
4. https://github.com/brianmario/yajl-ruby/pull/50
5. http://dgraham.github.com/json-stream/
6. http://wiki.fasterxml.com/JacksonHome



On Thu, Dec 1, 2011 at 12:56 PM, Michael B. Klein mbkl...@gmail.com wrote:

 +1 to marc-in-json
 +1 to newline-delimited records
 +1 to read support
 +1 to edsu, rsinger, BillDueber, gmcharlt, and the other module maintainers

 On Thu, Dec 1, 2011 at 9:31 AM, Keith Jenkins k...@cornell.edu wrote:

  On Thu, Dec 1, 2011 at 11:56 AM, Gabriel Farrell gsf...@gmail.com
  wrote: I suspect newline-delimited will win this race.
  Yes.  Everyone please cast a vote for newline-delimited JSON.
 
  Is there any consensus on the appropriate mime type for ndj?
 
  Keith
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Citation Analysis - like projects for print resources

2011-11-17 Thread Bill Dueber
If I'm understanding you correctly, you're describing citation analysis
(sometimes referred to as a part of bibliometrics). It is mostly applied to
article data (e.g., the web of science / web of knowledge at ISI) but there
are zillions of studies looking at co-citation and co-authorship networks,
the long tail of cited works and authors, etc. You can hardly shake a stick
at JASIST without hitting two or three of these studies.

As you're probably already thinking, getting a hold of the citation
information in a machine-readable format is the painful part. Things are
made harder by your desire to work with books, since many citation are to
individual chapters for edited works, and (of course) books just plain
aren't generally available digitally.

Article searches (in google scholar or your local academic library) for
bibliometrics or citation analysis should get you started on past and
future work.

On Thu, Nov 17, 2011 at 12:47 PM, Joe Hourcle onei...@grace.nascom.nasa.gov
 wrote:

 On Nov 17, 2011, at 12:09 PM, Miles Fidelman wrote:

  Matt Amory wrote:
  Is anyone involved with, or does anyone know of any project to extract
 and
  aggregate bibliography data from individual works to produce some kind
 of
  most-cited authors list across a collection?
  Local/Network/Digital/OCLC
  or historic?
 
  Sorry to be vague, but I'm trying to get my head around whether this is
 a
  tired old idea or worth pursuing...
 
 
  Sounds like you're describing citeseer - http://citeseerx.ist.psu.edu/ -
  it's a combination bibliographic and citation index for computer science
 literature.  It includes a good degree of citation analysis.  Incredibly
 useful tool.


 Another recent project (that I haven't had a chance to play with yet) is
 Total Impact :

http://total-impact.org/about.php

 It's from some of the folks in altmetrics, who are trying to find better
 bibliometrics for measuring value:

http://altmetrics.org/manifesto/

 I don't see a list of what they're scraping I think they're using the
 publisher's indexes, PubMed and other databases rather than parsing the
 text themselves ... but the software's available, if you wanted to take a
 look.  Or you could just ask Heather or Jason, they're both approachable
 and always eager to talk, when I've run into them at meetings.

 I also seem to remember someone at the DataCite meeting this summer who
 was involved in a project to parse references in papers ... unfortunately,
 I don't have that notebook here to check ... but I *think* it was John
 Kunze.  (and I don't think it was part of the person's presentation, but
 something that I had picked up in the Q/A part)

 -Joe




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] ISBN Regular Expression

2011-10-24 Thread Bill Dueber
So much duplication. If only there were some sort of organization that might
serve as a clearinghouse for this sort of code that's useful to libraries...

[Yes, I know the only appropriate response is, "Well, Dueber, step up and do
something about it."]
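
In the meantime, since the X-as-check-digit thing keeps coming up, here's a
quick sketch of what a check-digit-aware ISBN-10 test looks like
(illustrative only; hyphens and spaces stripped first, per Jon's point
below):

    # ISBN-10: the weighted sum (weights 10 down to 1) must be 0 mod 11;
    # 'X' stands for 10 and is only legal as the final character.
    def valid_isbn10?(raw)
      s = raw.to_s.upcase.gsub(/[\s-]/, '')
      return false unless s =~ /\A\d{9}[\dX]\z/
      digits = s.chars.map { |c| c == 'X' ? 10 : c.to_i }
      sum = digits.each_with_index.inject(0) { |t, (d, i)| t + d * (10 - i) }
      (sum % 11).zero?
    end

    valid_isbn10?('0-521-61678-6') # => true (Jon's example below)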

On Mon, Oct 24, 2011 at 4:59 PM, Jon Gorman jonathan.gor...@gmail.com wrote:

 Also, I don't know OpenBook to know your source data, but don't forget
 a lot of publishers have printed ISBNs in different ways over the past
 few years.  The regex would choke on any hyphens.  If users are
 copying from printed material, they could type them in. For example,
 one of the books near my desk has the ISBN printed like  0-521-61678-6

 if this is user input and nothing is striping characters like that
 out, it could cause problems.

 (I think I've also seen spaces used instead of hyphens, but less
 positive about this).

 Jon Gorman


 On Mon, Oct 24, 2011 at 9:44 AM, Jonathan Rochkind rochk...@jhu.edu
 wrote:
  John: That's not going to work, an ISBN can end in X as a check digit,
  which is not [0-9].  You are going to be rejecting valid ISBN's, you have
 a
  bug.
 
  On 10/24/2011 10:40 AM, John Miedema wrote:
 
  Here's a php function I use in OpenBook to test if a user has entered a
 10
  or 13 digit ISBN.
 
  //test if 10 or 13 digits ISBN
  function openbook_utilities_validISBN($testisbn) {
  return (ereg("([0-9]{10})", $testisbn, $regs) || ereg("([0-9]{13})",
  $testisbn, $regs));
  }
 
 
 
  On Fri, Oct 21, 2011 at 1:44 PM,
  Kozlowski, Brendon bkozlow...@sals.edu wrote:
 
  Hi all.
 
 
 
  I'm somewhat surprised that I've never had to validate an ISBN manually
  up
  until now. I suppose that's a testament to all of the software out
 there.
 
 
 
  However, I now find that I need to validate both the 10-digit and
  13-digit
  ISBNs. I realize there's also a check digit and a REGEX cannot check
 this
  value - one step at a time. Right now I just want to work on the REGEX.
 
 
 
  Does anyone know the exact specifications of both forms of an ISBN? The
  ISBN organization's website didn't seem to be overly clear to me.
  Alternatively, if anyone has a full working regular expression for this
  purpose I would definitely not mind if they'd be willing to share.
 
 
 
  The only thing I'm doing which is abnormal is that I am not requiring
 the
  hyphenation or spaces between numbers since some of this data will be
  coming
  from a system, and some will be coming from human input.
 
 
 
 
  Brendon Kozlowski
  Web Administrator
  Saratoga Springs Public Library
  49 Henry Street
  Saratoga Springs, NY, 12866
  [518] 584-7860 x217
 
  Please consider the environment before printing this message.
 
  To report this message as spam, offensive, or if you feel you have
  received
  this in error,
  please send e-mail to ab...@sals.edu including the entire contents and
  subject of the message.
  It will be reviewed by staff and acted upon appropriately.
 
 
 
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Examples of Web Service APIs in Academic Public Libraries

2011-10-08 Thread Bill Dueber
The HathiTrust BibAPI and DataAPIs are being used by several on this list
(and by me behind the scenes on occasion, although I sometimes cheat because
the data are local). Based on our logs, the most common use is hitting the
BibAPI to check HT availability of an item already in someone's local
catalog.

http://www.hathitrust.org/data
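
The check-against-your-own-catalog pattern is only a couple of lines
(sketch; endpoint spelling and response shape from memory, so double-check
against the docs above):

    require 'open-uri'
    require 'json'

    # Ask the Bib API what HathiTrust holds for a given OCLC number.
    def ht_items_for_oclc(oclc)
      url  = "http://catalog.hathitrust.org/api/volumes/brief/oclc/#{oclc}.json"
      data = JSON.parse(URI.open(url).read)
      data.fetch('items', [])
    end

    puts ht_items_for_oclc(424023).length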



On Sat, Oct 8, 2011 at 1:33 PM, Michel, Jason Paul miche...@muohio.edu wrote:

 Hello all,

 I'm a lurker on this listserv and am interested in gaining some insight
 into your experiences of utilizing web service APIs in either an academic
 library or public library setting.

 I'm writing a book for ALA Editions on the use of Web Service APIs in
 libraries.  Each chapter covers a specific API by delineating the
 technicalities of the API, discussing potential uses of the API in library
 settings, and step-by-step tutorials.

 I'm already including examples of how my library (Miami University in
 Oxford, Ohio) are utilizing these APIs but would like to give the reader
 more examples from a variety of settings.

 APIs covered in the book: Flickr, Vimeo, Google Charts, Twitter, Open
 Library, LibraryThing, Goodreads, OCLC.

 So, what are you folks doing with APIs?

 Thanks for any insight!

 Kind regards,

 Jason

 --
 Jason Paul Michel
 User Experience Librarian
 Miami University Libraries
 Oxford, Ohio 45044
 twitter:jpmichel




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Advice on a class

2011-07-28 Thread Bill Dueber
 was hiring a digital *librarian*, I'd also expect them to know
 Javascript, the language at the heart of the EPUB format.  But
 Javascript is kind of tricky; it's a subtle powerful language with bad
 syntax and weak libraries.  I certainly wouldn't recommend it to start
 with.

 Cary Gordon listu...@chillco.com wrote:
  There are still plenty of opportunities for Cobol coders, but I
  wouldn't recommend that either.

 Java is the COBOL of the 21st century, so if you know Java well, there
 will be a job in that for the next 20-30 years, I'd expect.  Until the
 Singularity happens, anyway.  I'd think there will always be lots of
 enterprise Java jobs around.

 Bill




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] stemming in author search?

2011-06-14 Thread Bill Dueber
We had stemming on for authors at first (maybe it was the VuFind default way
back when?) and turned it off as soon as we noticed. The initial complaint
was that searching on "Rowles" gave records for "Rowling", and of course
it's not hard to find other examples, esp. with the -ing suffix.

On Mon, Jun 13, 2011 at 8:08 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 In a Solr-based search, stemming is done at indexing time, into fields with
 stemmed tokens.

 It seems typical in library-catalog type applications based on Solr to have
 the default (or even only) searches be over these stemmed fields, thus
 'auto-stemming' to the user. (Search for 'monkey', find 'monkeys' too, and
 vice versa).

 I am curious how many people, who have Solr based catalogs (that is, I'm
 interested in people who have search engines with majority or only content
 originally from MARC), use such stemmed fields ('auto-stemming') over their
 _author_ fields as well.

 There are pro's and con's to this. There are certainly some things in an
 author field that would benefit from stemming (mostly various kinds of
 corporate authors, some of whose endings end up looking like english
 language phrases). There are also very many things in an author field that
 would not benefit from stemming, and thus when stemming is done it
 sometimes(/often?) results in false matches, pluralizing an author's last
 name in an inappropriate way for instance.

 So, wanna say on the list, if you are using a Solr-based catalog, are you
 using stemmed fields for your author searches? Curious what people end up
 doing.  If there are any other more complicated clever things you've done
 than just stem-or-not, let us know that too!

 Jonathan




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Seth Godin on The future of the library

2011-05-19 Thread Bill Dueber
My short answer: It's too damn expensive to check out everything that's
available for free to see if it's worth selecting for inclusion, and
libraries (at least as I see them) are supposed to be curated, not
comprehensive.

My long answer:

The most obvious issue is that the OPAC is traditionally a listing of
holdings, and free ebooks aren't held in any sense that helps
disambiguate them from any other random text on the Internet. Certainly the
fact that someone bothered to transform it into ebook form isn't indicative
of anything. Not everything that's available can be cataloged. I see "stuff
we paid for" not as an arbitrary bias, but simply as a very, very useful way
to define the borders of the library.

Free is a very recent phenomenon, but it just adds more complexity to the
existing problem of deciding what publications are within the library's
scope. Library collections are curated, and that curation mission is not
simply a side effect of limited funds. The filtering process that goes into
deciding what a library will hold is itself an incredibly valuable aspect of
the collection.

Up until very recently, the most important pre-purchase filter was the fact
that some publisher thought she could make some money by printing text on
paper, and by doing so also allocated resources to edit/typeset/etc. For a
traditionally-published work, we know that real person(s), with relatively
transparent goals, has already read it and decided it was worth the gamble
to sink some fixed costs into the project. It certainly wasn't a perfect
filter, but anyone who claims it didn't add enormous information to the
system is being disingenuous.

Now that (e)publishing and (e)printing costs have nosedived toward $0.00,
that filter is breaking. Even print-on-paper costs have been reduced
enormously. But going through the slush pile, doing market research,
filtering, editing, marketing -- these things all cost money, and for the
moment the traditional publishing houses still do them better and more
efficiently than anyone else. And they expect to be paid for their work, and
they should.

There's a tendency in the library world, I think, to dismiss the value of
non-academic professionals and assume random people or librarians can just
do the work (see also: web-site development, usability studies, graphic
design, instructional design and development), but successful publishers are
incredibly good at what they do, and the value they add shouldn't be
dismissed (although their business practices should certainly be under
scrutiny).

Of course, I'm not differentiating free (no money) and free (CC0). One can
imagine models where the functions of the publishing house move to a
work-for-hire model and the final content is released CC0, but it's not
clear who's going to pay them for their time.


  -Bill-



On Thu, May 19, 2011 at 8:04 AM, Andreas Orphanides 
andreas_orphani...@ncsu.edu wrote:

 On 5/19/2011 7:36 AM, Mike Taylor wrote:

 I dunno.  How do you assess the whole realm of proprietary stuff?
 Wouldn't the same approach work for free stuff?

 -- Mike.


 A fair question. I think there's maybe at least two parts: marketing and
 bundling.

 Marketing is of course not ideal, and likely counterproductive on a number
 of measures, but at least when a product is marketed you get sales demos.
 Even if they are designed to make a product or collection look as good as
 possible, it still gives you some sense of scale, quality, content, etc.

 I think bundling is probably more important. It's a challenge in the
 free-stuff realm, but for open access products where there is bundling (for
 instance, Directory of Open Access Journals) I think you are likely to see
 wider adoption.

 Bundling can of course be both good (lower management cost) and bad
 (potentially diluting collection quality for your target audience). But when
 there isn't any bundling, which is true for a whole lot of free stuff,
 you've got to locally gather a million little bits into a collection.

 I guess what's really happening in the bundling case, at least for free
 content, is that collection and quality management activities are being
 outsourced to a third party. This is probably why DOAJ gets decent
 adoption. But of course, this still requires SOME group to be willing to
 perform these activities, and for the content/package to remain free, they
 either have to get some kind of outside funding (e.g., donations) or be
 willing to volunteer their services.




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] Changes coming to the Library::CallNumber::LC perl module

2011-05-05 Thread Bill Dueber
...and that change could be YOU!

First thing: I'm abandoning this module; I never use it. If you want to
adopt it, lemme know. It's free!

Second thing: whoever picks it up might want to consider two major changes.

   1. For some reason, I'm only allowing two decimal places in the initial
   number (e.g., "A123.456" is invalid). The comments in the code indicate
   there might have been a good reason at one point. Heck, I'm sure there was.
   I just don't remember it. And there are plenty of call numbers with three
   digits there, esp. in the QAs. And the code I actually use now doesn't
   enforce that restriction and the sky hasn't fallen, so it should probably
   go.
   2. The output format, which seemed smart at the time, is dumb. "A123"
   expands to "A 123". Which means you have to url-escape the spaces, and
   muck with your search query so it doesn't look like two words, and that
   (in solr, at least) you can't do a wildcard query (in solr, "A 123*"
   isn't valid syntax). What I do in the java code is to use an @ sign
   instead, e.g. "A@@123". This makes things easier (see the sketch below).
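
For the curious, the idea in miniature (a toy, not the module's actual
logic -- real call numbers have cutters, dates, and plenty more):

    # Right-pad the class letters with '@' (which sorts before A-Z) and
    # zero-pad the numeric parts, so plain string comparison sorts call
    # numbers correctly. Field widths here are arbitrary.
    def normalize_stub(callnum)
      m = callnum.upcase.match(/\A([A-Z]{1,3})\s*(\d+)(?:\.(\d+))?/) or return nil
      letters, int, dec = m[1], m[2], m[3] || ''
      letters.ljust(3, '@') + int.rjust(5, '0') + dec.ljust(6, '0')
    end

    normalize_stub('A123.456') # => "A@@00123456000"
    normalize_stub('QA76.9')   # => "QA@00076900000"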

The second is obviously a backwards-incompatible change which warrants some
discussion.

But none of this matters until someone steps up and adopts it. Code is at
https://library-callnumber-lc.googlecode.com/ (a move to GitHub might make
sense, too) -- step right up and take your chances!

  -Bill-
-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] What do you wish you had time to learn?

2011-04-28 Thread Bill Dueber
I've thought for a while that libraries would be significantly better places
if there was always a big brisket near the reference desk that people could
just carve a slice off of and a giant pot of curry in the basement.

On Thu, Apr 28, 2011 at 8:55 AM, Andreas Orphanides 
andreas_orphani...@ncsu.edu wrote:


 Ranti, I think the call is clear: we need to start a group called Food4Lib.

 Who's with me?!



  Ranti Junus ranti.ju...@gmail.com 4/27/2011 11:39 PM 
 On Wed, Apr 27, 2011 at 12:57 PM, Bohyun Kim k...@fiu.edu wrote:
  Seems that we can use a class in cooking in addition to guitar playing at
 the next conference : )
 

 Hey, there's a Cooking for Geeks authored by Jeff Potter. [1]
 Perhaps we should invite him to do a workshop and raffle the books.


 ranti.

 [1] http://www.cookingforgeeks.com


 --
 Bulk mail.  Postage paid.




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] What do you wish you had time to learn?

2011-04-26 Thread Bill Dueber
play the guitar
real statistics (not have t-test, will travel!)
cook a really good roast
graph theory
map/reduce
Hebrew
some machine learning (esp. wrt parsing)

On Tue, Apr 26, 2011 at 4:15 PM, Ross Singer rossfsin...@gmail.com wrote:

 map/reduce
 coffeescript, node.js, other server side javascripts
 XSLT
 How to not make a not-completely-hideous-looking web app.

 -Ross.

 On Tue, Apr 26, 2011 at 8:30 AM, Edward Iglesias
 edwardigles...@gmail.com wrote:
  Hello All,
 
  I am doing a presentation at RILA (Rhode Island Library Association) on
  changing skill sets for Systems Librarians.  I did a formal survey a
 while
  back (if you participated, thank you) but this stuff changes so quickly I
  thought I would ask this another way.  What do you wish you had time to
  learn?
 
  My list includes
 
 
  CouchDB(NoSQL in general)
  neo4j
  nodejs
  prototype
  API Mashups
  R
 
  Don't be afraid to include Latin or Greek History.  I'm just going for a
  snapshot of System angst at not knowing everything.
 
  Thanks,
 
 
  ~
  Edward Iglesias
  Systems Librarian
  Central Connecticut State University
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Bill Dueber
On Fri, Apr 8, 2011 at 10:10 AM, Ross Singer rossfsin...@gmail.com wrote:

 But, yeah, it would be worth running your ideas by a few catalogers to
 see what they think.



And if anyone does this...please please *please* write it up!

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Bill Dueber
On Fri, Apr 8, 2011 at 1:50 PM, Shirley Lincicum shirley.linci...@gmail.com
 wrote:

 Ross is essentially correct. Education is an authorized subject term
 that can be subdivided geographically. Finance is a free-floating
 subdivision that is authorized for use under subject terms that
 conform to parameters given in the scope notes in its authority record
 (680 fields), but it cannot be subdivided geographically. England is
 an authorized geographic subject term that can be added to any heading
 that can be subdivided geographically.


Wait, so is it possible to know if "England" means the free-floating
geographic entity or the country? Or is that just plain unknowable?

Suddenly, my mouth is hungering for something gun-flavored.

I know OCLC did some work trying to dis-integrate different types of terms
with the FAST stuff, but it's not clear to me how I can leverage that (or
anything else) to make LCSH at all useful as a search target or (even
better) facet.  Has anyone done anything with it?


Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Bill Dueber
2011/4/8 Karen Miller k-mill...@northwestern.edu

 I hope I'm not pointing out the obvious,


That made me laugh so hard I almost ruptured something.

Thank you so much for such a complete (please, god, tell me it's
complete...) explanation. It's a little depressing, but at least now I know
why I'm depressed :-)


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] LAMP Hosting service that supports php_yaz?

2011-03-23 Thread Bill Dueber
On Wed, Mar 23, 2011 at 10:44 AM, Cary Gordon listu...@chillco.com wrote:

  You can probably find an curious intern to do it.


Oh, for the love of god, please don't go this route. This is why libraries
tend to be a huge mishmash of unsupported, one-off crap that some outgoing
student did for extra credit six years ago.

To ask the obvious question: You're at a real,
honest-to-god prestigious college. Why are you trolling code4lib for cheap
hosting environments? If IT won't give you a piece of a machine somewhere,
or at least set up a Mac running OSX, they're failing to support a critical
mission of the college and someone needs to be up in arms about it. If you
haven't even asked them, well, maybe you should.

 -Bill, who spent his first two years in a library dealing with crappy old
PHP code from long-gone students

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] LAMP Hosting service that supports php_yaz?

2011-03-23 Thread Bill Dueber
On Wed, Mar 23, 2011 at 11:19 AM, Mark A. Matienzo m...@matienzo.org wrote:

 You're definitely welcome here, and I don't think Bill's response was

to suggest that you weren't.


Not even a little! :-)

I was mostly responding to a perception that, for many in the code4lib
community, Central IT is a bogeyman to be avoided/deferred to at all
costs. Those of us in libraries tend to self-select as the sort of folks
that will find a way to get something done, no matter what. I think the
profession would benefit from more of us saying, "Well, OK. Then that's not
going to get done. Go explain it to the dean."

Kudos to you for doing stuff on your own time (and your own dime, no less).
And please don't let my little rant scare you off. Turning good, wholesome
librarians into...er...whatever it is that most of us here are...is what we
do best :-)


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] stats for the conference video?

2011-02-17 Thread Bill Dueber
"Cha: 16"? You must've been watching a different crowd than the rest of us
:-)

On Thu, Feb 17, 2011 at 8:38 PM, Simon Spero s...@unc.edu wrote:

 Str: 11
 Dex: 3
 Con: 8
 Int: 16:
 Wis: 18
 Cha: 16




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] Bad numbers in my lightning talk (e.g. 45% of sessions have one action: search)

2011-02-15 Thread Bill Dueber
Basically, I failed to exclude a whole swath of activity I should have
ignored.

An explanation, the new data, and an excellent link to a corroborating paper
by our usability group are at:

http://robotlibrarian.billdueber.com/corrected-code4lib-slides-are-up/

My sincere apologies to everyone. I'm trying to do due diligence, but if
anyone passed a copy of my slides along, please make sure they get the
better numbers.

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] VuFind Beyond MARC slides

2011-02-11 Thread Bill Dueber
Ditto lightning talks? Should we attach slides to the appropriate page
(e.g., http://code4lib.org/conference/2011/lightning)? Maybe pull in content
from the Wiki to reflect what actual lightning talks happened?

On Fri, Feb 11, 2011 at 2:16 PM, Ryan Wick ryanw...@gmail.com wrote:

 Thanks for posting these.

 There are already pages on code4lib.org (not the wiki), linked off the
 schedule, for individual talks. We'd like to host the slides on
 code4lib.org if possible, but with links at a minimum. Similar with
 lightning talks. If slides or other links are already on the wiki,
 I'll try and get them moved over to the (slightly more 'official')
 code4lib.org pages.

 If you are able to upload and add your slides to
 http://code4lib.org/conference/2011/katz  please do so. Let me know if
 you have any questions. Thanks.


 Ryan Wick

 On Fri, Feb 11, 2011 at 8:35 AM, Demian Katz demian.k...@villanova.edu
 wrote:
  On a similar note, I've posted the slides for my VuFind talk here:
 
  http://vufind.org/docs/beyond_marc.ppt
 
  Is there someplace in the Wiki where all of the slides are being
 collected?  I assume it would be better if we were all listing these in a
 central location rather than posting dozens of messages on the mailing
 list... but I couldn't find an obvious spot to put the link!
 
  thanks,
  Demian
 
  -Original Message-
  From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
  Rick Johnson
  Sent: Friday, February 11, 2011 10:42 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: [CODE4LIB] Notre Dame Hydra Digital Exhibit Slides Available
 
  Thanks for a great conference everyone!  Our slides for our
  presentation on our Hydra Digital Exhibit Plugin as well as the
  screencast demo are now posted online:
 
http://code4lib.library.nd.edu
 
  It seemed like there was interest in the IRC channel for reuse of our
  plugin.  Again the code can be found here:
 
https://github.com/ndliblis/hydra_exhibit
 
  I am also extremely interested to hear how many places would be
  interested in a Blacklight only version of the plugin.
 
  Also, the main Hydra branch of code we extended can be found here as a
  baseline:
 
https://github.com/projecthydra/hydrangea
 
  Matt Zumwalt also mentioned in the Hydra breakout that a good place to
  look at the moment for active projects using Hydra can be found on
  Duraspace's JIRA instance listed under Hydra Software:
 
https://jira.duraspace.org/secure/BrowseProjects.jspa#all
 
  Finally, a more formal web presence will be available soon in the
  coming weeks including more detailed instructions on how to download,
  install, and try out existing Hydra heads.
 
  Thanks!
  Rick
  --
  --
  Rick Johnson
  Unit Manager, Digital Library Applications and Local Programming Unit
  Library Information Systems
  University of Notre Dame
  Michiana Academic Library Consortium
  Notre Dame, IN USA 46556
  http://www.library.nd.edu
  574-631-1086
  
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Reminder: Newcomer dinner and Ribbons

2011-02-07 Thread Bill Dueber
Yep, that's me. Meet on the Mezzanine level near the comfy chairs at 5:30.
My shirt features a robot riding a dinosaur.

On Mon, Feb 7, 2011 at 3:41 PM, Jakub Skoczen ja...@indexdata.dk wrote:

 To the group that signed up for the Anyetsang's Little Tibet: I heard
 from Dot that he's not leading anymore, is anyone else going to take
 over his place or should we regroup?

 On Mon, Feb 7, 2011 at 8:03 PM, Richard, Joel M richar...@si.edu wrote:
  Roberto,
 
  I chose to meet outside of the Walnut conference room in order to not
 contribute to a large number of people in the Lobby. I know it's a bit out
 of the way, but that just means we'll be easier to find. I'll have a sign
 with large words to make it easy to find me.
 
  --Joel
 
 
 
  On Feb 7, 2011, at 2:52 PM, Roberto Hoyle 
 roberto.j.ho...@dartmouth.edu wrote:
 
  On Feb 2, 2011, at 11:11 AM, Richard, Joel M wrote:
 
  Just a general question, how are team leaders contacting their
 attendees? I have no one's email addresses, so for Crazy Horse, I've put
 mine in the Wiki.
 
  FYI, I'm one of the ones who signed up for the Crazy Horse.  I assume
 we'll meet in the lobby at 6?
 
  r.
 



 --

 Cheers,
 Jakub




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Reminder: Newcomer dinner and Ribbons

2011-02-07 Thread Bill Dueber
Ooops. OK. I'll be there at 5:30, but we won't be leaving until everyone
shows up.

On Mon, Feb 7, 2011 at 4:52 PM, Birkin James Diana
birkin_di...@brown.edu wrote:

 Bill,

 I recently signed up for this dinner-trek. 5:30 is fine with me, but just
 an fyi that the guidelines said 6ish, so I'm concerned others might
 be planning to show up then -- or maybe y'all have been in touch along the
 way. Regardless, I'll be there at 5:30.

 -Birkin

 ---
 Birkin James Diana
 Programmer, Digital Technologies
 Brown University Library
 birkin_di...@brown.edu


 On Feb 7, 2011, at 3:51 PM, Bill Dueber wrote:

  Yep, that's me. Meet on the Mezzanine level near the comfy chairs at
 5:30.
  My shirt features a robot riding a dinosaur.
 
  On Mon, Feb 7, 2011 at 3:41 PM, Jakub Skoczen ja...@indexdata.dk
 wrote:
 
  To the group that signed up for the Anyetsang's Little Tibet: I heard
  from Dot that he's not leading anymore, is anyone else going to take
  over his place or should we regroup?
 
  On Mon, Feb 7, 2011 at 8:03 PM, Richard, Joel M richar...@si.edu
 wrote:
  Roberto,
 
  I chose to meet outside of the Walnut conference room in order to not
  contribute to a large number of people in the Lobby. I know it's a bit
 out
  of the way, but that just means we'll be easier to find. I'll have a
 sign
  with large words to make it easy to find me.
 
  --Joel
 
 
 
  On Feb 7, 2011, at 2:52 PM, Roberto Hoyle 
  roberto.j.ho...@dartmouth.edu wrote:
 
  On Feb 2, 2011, at 11:11 AM, Richard, Joel M wrote:
 
  Just a general question, how are team leaders contacting their
  attendees? I have no one's email addresses, so for Crazy Horse, I've put
  mine in the Wiki.
 
  FYI, I'm one of the ones who signed up for the Crazy Horse.  I assume
  we'll meet in the lobby at 6?
 
  r.
 
 
 
 
  --
 
  Cheers,
  Jakub
 
 
 
 
  --
  Bill Dueber
  Library Systems Programmer
  University of Michigan Library




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] A/B Testing Catalogs and Such

2011-01-26 Thread Bill Dueber
I've proposed A/B testing for our OPAC. I managed to avoid the torches, but
the pitchforks...youch!

On Wed, Jan 26, 2011 at 5:55 PM, Sean Moore thedreadpirates...@gmail.com wrote:

 There's a lot of resistance in my institution to A/B or multivariate
 testing
 any of our live production properties (catalog, website, etc...).  I've
 espoused the virtues of having hard data to back up user activity (if I
 hear
 one more well, in my opinion, I'll just go blind), but the reply is
 always
 along the lines of, But it will confuse users!  I've pointed out the
 myriad successful and critical business that use these methodologies, but
 was told that businesses and academia are different.

 So, my question to you is, which of you academic libraries are using A/B
 testing; on what portion of your web properties (catalog, discovery
 interface, website, etc...); and I suppose to spark conversation, which
 testing suite are you using (Google Website Optimizer, Visual Website
 Optimizer, a home-rolled non-hosted solution)?

 I was told if I can prove it's a commonly accepted practice, I can move
 forward.  So help a guy out, and save me from having to read another survey
 of 12 undergrads that is proof positive of changes I need to make.

 Thanks!

 *Sean Moore*
 Web Application Programmer
 *Phone*: (504) 314-7784
 *Email*:  cmoo...@tulane.edu

 Howard-Tilton Memorial Library http://library.tulane.edu, Tulane
 University




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] code4lib 2011 Update

2011-01-18 Thread Bill Dueber
Right. The key is to make sure the N band has its own SSID. Mac Laptops, at
least, will always glom onto the strongest signal, so if you're broadcasting
on G and N with the same name, most of the time the laptop will grab the G
because the signals go through walls better. If we can just choose, e.g.,
Code4Lib2011 N, that problem goes away.

On Tue, Jan 18, 2011 at 12:46 PM, Richard, Joel M richar...@si.edu wrote:

 I think you missed a critical part of that message, Jonathan. (which I
 didn't write, BTW)

 it does not mean that you have to have one...

 Robert is saying that 802.11n is recommended and you'll have a better
 experience with it. It is not a requirement. Besides, I believe any router
 that supports the n standards is also backwards compatible to prior
 standards.

 --Joel


 Joel Richard
 IT Specialist, Web Services Department
 Smithsonian Institution Libraries | http://www.sil.si.edu/
 (202) 633-1706 | (202) 786-2861 (f) | richar...@si.edu



 On Jan 18, 2011, at 11:15 AM, Jonathan Rochkind wrote:

  On 1/18/2011 9:05 AM, Richard, Joel M wrote:
 
  Our central wireless group has recommended that if everyone has an
 802.11n card (5Ghz radio spectrum) in their device that they will likely
 have a much better experience for connectivity – it does not mean that you
 have to have one it will just be better download speeds etc.
 
  There is ABSOLUTELY no way to guarantee that 100% of 200 conference
  attendees will have 802.11n cards in their devices.
 
  I suspect the vast majority of us will bring the devices we have, and
  not upgrade our devices just for the conf.
 
  I would suggest you make sure IT is assuming that NOT everyone will
  have 802.11n -- there's no way that's going to happen.
 
  Jonathan




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Which O'Reilly books should we give away at Code4Lib 2011?

2010-12-14 Thread Bill Dueber
While both are document stores, there are some major differences in their
data model, most notably that mongoDB uses an update-replaces mechanism,
while CouchDB allows you to access any version of a document, which brings
with it issues of transaction overlaps (who wins?) and having to
periodically compact your database.

CouchDB uses a REST interface for all interaction; mongo has programming
language-specific drivers (although there are also REST interfaces
available), which in many cases can increase performance.
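
That REST-ness means even a quick experiment needs nothing but HTTP and JSON
(sketch; the server, database, and doc id here are made up):

    require 'net/http'
    require 'json'
    require 'uri'

    # In CouchDB, fetching a document is a plain HTTP GET returning JSON.
    uri = URI('http://localhost:5984/bibdata/record-000123')
    doc = JSON.parse(Net::HTTP.get(uri))
    puts doc['_rev'] # every CouchDB document carries its revision token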

Their querying approaches are different. Mongo is more akin to "define an
index and use it when possible at query time"; CouchDB is more of a "define
a view beforehand and use that view."

Oops. I just found a better overview than I can provide, at
http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB

There are lots of other players in this space, too -- see
http://nosql-database.org/



On Tue, Dec 14, 2010 at 9:12 AM, Thomas Dowling tdowl...@ohiolink.edu wrote:

 On 12/14/2010 07:58 AM, Luciano Ramalho wrote:

 
  I believe CouchDB will take the library world by storm, and the sooner
  the better.
 
  A document database is what we need for many of our applications.
  CouchDB, with its 100% RESTful API is a highly productive web-services
  platform with a document oriented data model and built-in peer-to-peer
  replication. In short, it does very well lots of things we need done.


 Amen.  Does anyone have helpful things to say about choosing between
 CouchDB and MongoDB?


 Thomas Dowling
 tdowl...@ohiolink.edu




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] PHP vs. Python [was: Re: Django]

2010-10-29 Thread Bill Dueber
On Fri, Oct 29, 2010 at 6:28 PM, Peter Schlumpf pschlu...@earthlink.net wrote:

 What's wrong with the library world developing its own domain language?


EVERYTHING!!!

We're already in a world of pain because we have our own data formats and
ways of dealing with them, all of which have basically stood idle while 30
years of advances computer science and information architecture have whizzed
by us with a giant WHOOSHing sound.

Having a bunch of non-experts design and implement a language that's
destined from the outset to be stuck in a tiny little ghetto of the
programming world is a guaranteed way to live with half- or un-supported
code, no decent libraries, and yet another legacy of pain we'd have to
support.

 I'm not picking on programming in particular. It's a dumb-ass move  EVERY
time a library is presented with a problem for which there are experts and
decades of research literature, and it chooses to ignore all of that and
decide to throw a committee of librarians (or whomever else happens to be in
the building at the time) at it based on the vague idea that librarians are
just that much smarter (or cheaper) than everyone else (I'm looking at you,
usability...)

 -Bill-




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Bill Dueber
I know there are two parts of this discussion (speed on the one hand,
applicability/features on the other), but for the former, running a little
benchmark just isn't that hard. Aren't we supposed to, you know, prefer to
make decisions based on data?

Note: I'm only testing deserialization because there isn't, as of now, a
fast serialization option for ruby-marc. It uses REXML, and it's dog-slow. I
already looked at marc-in-json vs marc binary at
http://robotlibrarian.billdueber.com/sizespeed-of-various-marc-serializations-using-ruby-marc/

Benchmark Source: http://gist.github.com/645683

18,883 records as either an XML collection or newline-delimited json.
Open the file, read every record, pull out a title. Repeat 5 times for a
total of 94,415 records (i.e., just under 100K records total).

Under ruby-marc, using the libxml deserializer is the fastest option. If
you're using the REXML parser, well,  god help us all.

ruby 1.8.7 (2010-08-16 patchlevel 302) [i686-darwin9.8.0]. User time
reported in seconds.

  xml w/libxml 227 seconds
  marc-in-json w/yajl  130 seconds


So...quite a bit faster (more than 40%). For a million records (assuming I
can just say 10*these_values) you're talking about a difference of 16
minutes due to just reading speed. Assuming, of course, you're running your
code on my desktop. Today.

For the 8M records I have to deal with, that'd be roughly 8M * ((227-130)
/ 94,415) = about 8,200 seconds, or roughly 137 minutes. So...a lot.

Of course, if you're using a slower XML library or a slower JSON library,
your numbers will vary quite a bit. REXML is unforgivingly slow, and
json/pure (and even 'json') are quite a bit slower than yajl. And don't
forget that you need to serialize these things from your source somehow...
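
If you want to re-run this against your own data, the harness boils down to
something like the following (stripped-down sketch: file names are made up,
stdlib json stands in for yajl, option spellings from memory -- the real
source is in the gist above):

    require 'benchmark'
    require 'json'
    require 'marc'

    Benchmark.bm(14) do |b|
      b.report('xml w/libxml') do
        MARC::XMLReader.new('records.xml', :parser => 'libxml').each { |r| r['245'] }
      end
      b.report('ndj w/json') do
        File.foreach('records.ndj') do |line|
          MARC::Record.new_from_hash(JSON.parse(line))['245']
        end
      end
    end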

 -Bill-



On Mon, Oct 25, 2010 at 4:23 PM, Stephen Meyer sme...@library.wisc.edu wrote:

 Kyle Banerjee wrote:

 On Mon, Oct 25, 2010 at 12:38 PM, Tim Spalding t...@librarything.com
 wrote:

  Does processing speed of something matter anymore? You'd have to be
 doing a LOT of processing to care, wouldn't you?


 Data migrations and data dumps are a common use case. Needing to break or
 make hundreds of thousands or millions of records is not uncommon.

 kyle


 To make this concrete, we processes the MARC records from 14 separate ILS's
 throughout the University of Wisconsin System. We extract, sort on OCLC
 number, dedup and merge pieces from any campus that has a record for the
 work. The MARC that we then index and display here

  http://forward.library.wisconsin.edu/catalog/ocm37443537?school_code=WU

 is not identical to the version of the MARC record from any of the 4
 schools that hold it.

 We extract 13 million records and dedup down to 8 million every week. Speed
 is paramount.

 -sm
 --
 Stephen Meyer
 Library Application Developer
 UW-Madison Libraries
 436 Memorial Library
 728 State St.
 Madison, WI 53706

 sme...@library.wisc.edu
 608-265-2844 (ph)


 Just don't let the human factor fail to be a factor at all.
 - Andrew Bird, Tables and Chairs




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Bill Dueber
On Mon, Oct 25, 2010 at 9:32 PM, Alexander Johannesen 
alexander.johanne...@gmail.com wrote:

 Lots of people around the library world infra-structure will think
 that since your data is now in XML it has taken some important step
 towards being inter-operable with the rest of the world, that library
 data now is part of the real world in *any* meaningful way, but this
 is simply demonstrably deceivingly not true.


Here, I think you're guilty of radically underestimating lots of people
around the library world. No one thinks MARC is a good solution to our
modern problems, and no one who actually knows what MARC is has trouble
understanding MARC-XML as an XML serialization of the same old data --
certainly not anyone capable of meaningful contribution to work on an
alternative.

You seem to presuppose that there's an enormous pent-up energy poised to
sweep in changes to an obviously-better data format, and that the existence
of MARC-XML somehow defuses all that energy. The truth is that a high
percentage of people that work with MARC data actively think about (or
curse) things that are wrong with it and gobs and gobs of ridiculously-smart
people work on a variety of alternate solutions (not the least of which is
RDA) and get their organizations to spend significant money to do so. The
problem we're dealing with is *hard*. Mind-numbingly hard.

The library world has several generations of infrastructure built around
MARC (by which I mean AACR2), and devising data structures and standards
that are a big enough improvement over MARC to warrant replacing all
that infrastructure is an engineering and political nightmare. I'm happy to
take potshots at the RDA stuff from the sidelines, but I never forget that
I'm on the sidelines, and that the people active in the game are among the
best and brightest we have to offer, working on a problem that invariably
seems more intractable the deeper in you go.

If you think MARC-XML is some sort of an actual problem, and that people
just need to be shouted at to realize that and do something about it, then,
well, I think you're just plain wrong.

  -Bill-

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Bill Dueber
On Mon, Oct 25, 2010 at 10:10 PM, Alexander Johannesen 
alexander.johanne...@gmail.com wrote:

 Political? For sure. Engineering? Not so much.


Ok. Solve it. Let us know when you're done.


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-25 Thread Bill Dueber
Sorry. That was rude, and uncalled for. I disagree that the problem is
easily solved, even without the politics. There've been lots of attempts to
try to come up with a sufficiently expressive toolset for dealing with
biblio data, and we're still working on it. If you do think you've got some
insight, I'm sure we're all ears, but try to frame it in terms of the existing
work if you can (RDA, some of the Dublin Core stuff, etc.) so we have a
frame of reference.

On Mon, Oct 25, 2010 at 10:18 PM, Bill Dueber b...@dueber.com wrote:

 On Mon, Oct 25, 2010 at 10:10 PM, Alexander Johannesen 
 alexander.johanne...@gmail.com wrote:

 Political? For sure. Engineering? Not so much.


 Ok. Solve it. Let us know when you're done.



 --
 Bill Dueber
 Library Systems Programmer
 University of Michigan Library




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] membership recommendations

2010-08-26 Thread Bill Dueber
Make sure to include a line:

Code4Lib...$0.00


On Thu, Aug 26, 2010 at 12:56 PM, Adam Wead aw...@rockhall.org wrote:

 Hi all,

 I'm budgeting for membership dues and am seeking suggestions for
 professional organizations that are good to have.  As a digital/systems
 librarian working with music and video in an archive, there are lots to
 choose from!  I'm hoping to chose a couple that cover most of the bases.

 Thanks in advance for the recommendations.

 ...adam








-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MODS and DCTERMS

2010-05-03 Thread Bill Dueber
On Mon, May 3, 2010 at 2:40 PM, MJ Suhonos m...@suhonos.ca wrote:

 Yes, even to me as a librarian but not a cataloguer, many (most?) of these
 elements seem like overkill.  I have no doubt there is an edge-case for
 having this fine level of descriptive detail, but I wonder:

 a) what proportion of records have this level of description
 b) what kind of (or how much) user access justifies the effort in creating
 and preserving it


On many levels, I agree. Or I wish I could.

If you look at a business model like Amazon, for example, it's easy to
imagine that their overriding goal is, Make the easy-to-find stuff
ridiculously easy to find. The revenue they get from someone finding an
edge-case book is exactly the same as the revenue they get from someone
buying Harry Potter. The ROI easy to think about.

But I work in an academic library. In a lot of ways, our *primary audience*
is some grad student 12 years from now who needs one trivial piece of crap
to make it all come together in her head. I know we have thousands of books
that have never been looked at, but computing the ROI on someone being able
to see them some day is difficult. Maybe it's zero. Maybe not. We just can't
tell.

Now, none of this is to say that MARC/AACR2 is necessarily the best (or even
a good) way to go about making these works findable. I'm just saying that
evaluating the edge cases in terms of user access are a complicated
business.

  -Bill-

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] A call for your OPAC (or other system) statistics! (Browse interfaces)

2010-05-03 Thread Bill Dueber
On Mon, May 3, 2010 at 7:10 PM, Bryan Baldus
bryan.bal...@quality-books.com wrote:
 I can't speak for other users (particularly the generic patron user type),
 but as a cataloger/librarian user,

...and THERE IT IS, ladies and gentlemen.

I've started trying to keep a list of IP addresses I *know* are staff
and separate out the statistics. The OPAC isn't for the librarians;
the ILS client is. If the client sucks so badly that librarians need
the OPAC to do our job (as I was told several times during our roll
out of vufind), then the solution is to fix the client, or
(alternately) build up a workaround for staff. NOT to overload the
OPAC.  If librarians need specialized tools, let's just build them
without some sort of pretense that they're anything but the tiniest
blip on the bell curve of patrons.
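
Separating them out is a five-minute filter over the access logs -- a
sketch, with made-up filenames:

  require 'set'

  staff_ips = Set.new(File.readlines('staff_ips.txt').map { |l| l.strip })

  File.open('access.log') do |f|
    f.each_line do |line|
      ip = line[/\A\S+/]  # first field of a common-log-format line
      next if staff_ips.include?(ip)
      # ...count this hit as an actual patron...
    end
  end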

And, BTW, just because you (and you know who you are!) do 8 hours of
reference desk work a week doesn't mean you have a hell of a lot more
insight. The patrons that self-select to actually speak to a librarian
sitting *in the library* are a freakshow themselves, statistically
speaking.

[Not meaning to imply that Bryan doesn't know the difference between
himself and a normal patron; his post makes it clear that he does. I
just took the opportunity to rant.]

I'm not saying that patrons don't use browse much (that's what I'm
trying to determine). But, to borrow from the 2009 code4lib
conference, every time a librarian's work habits inform the design of
a public-facing application, God kills a kitten.

  -Bill-

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] it's cool to hate on OpenURL (was: Twitter annotations...)

2010-05-03 Thread Bill Dueber
On Mon, May 3, 2010 at 6:34 PM, Karen Coyle li...@kcoyle.net wrote:
 Quoting Jakob Voss jakob.v...@gbv.de:

 I bet there are several reasons why OpenURL failed in some way but I
 think one reason is that SFX got sold to Ex Libris. Afterwards there
 was no interest of Ex Libris to get a simple clean standard and most
 libraries ended up in buying a black box with an OpenURL label on it -
 instead of developing they own systems based on a common standard. I
 bet you can track most bad library standards to commercial vendors. I
 don't trust any standard without open specification and a reusable Open
 Source reference implementation.

 For what it's worth, that does not coincide with my experience.


I'm going to turn this back on Karen and say that much of my pain does
come from vendors, but it comes from their shitty data. OpenURL and
resolvers would be a much more valuable piece of technology if the
vendors would/could get off their collective asses(1) and give us
better data.

 -Bill-

(1) By this, of course, I mean if the librarians would grow a pair
and demand better data via our contracts


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] A call for your OPAC (or other system) statistics! (Browse interfaces)

2010-05-03 Thread Bill Dueber
On Mon, May 3, 2010 at 8:39 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 So, Bill, you're still not certain yourself exactly what purposes browse is 
 used for by actual non-librarian searchers, if anything?

Right. I'm not sure *the extent* to which it's used (data which are
necessarily going to be messy and partially driven by how prevalent
browse vs search are in the interface), and I certainly don't know
what's going through people's heads when they choose to use it (on
those occasions when they make a conscious choice to use browse in
addition to/instead of  search).

My attempts to find stuff in the research literature failed me; if
anyone has other pointers, I'd love to read them! (If only there was a
real librarian around to help poor little me...)

 -Bill-


Re: [CODE4LIB] ILS short list

2010-04-08 Thread Bill Dueber
On Thu, Apr 8, 2010 at 2:32 PM, Ryan Eby ryan...@gmail.com wrote:
 Unicorn
 * Export
 Built in. MARC21 or flat file formats. Unicode support is available as an 
 extra.

...as an extra??? This is the saddest thing I've read all day.


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Zotero, unapi, and formats?

2010-04-06 Thread Bill Dueber
The unAPI support is also...non-ideal...in that you can't present
preferences for the best format to use. For example, the Refworks
Tagged format just plain has more tags (and hence more or
more-finely-grained information) than other formats (e.g., Endnote),
but Zotero will prefer Endnote just because it does. My RIS output is
better than my endnote output, but there's no way for me to tell
Zotero that.  For Mirlyn I ended up just having exactly one format
listed in my unapi-server file. Which is dumb. But I'm not sure what
else to do.
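
For the record, the single-format setup is as minimal as it sounds: the
unapi formats response ends up being something like this (id made up; the
RIS mime type is the unregistered one everybody seems to use):

  <formats id="12345">
    <format name="ris" type="application/x-Research-Info-Systems" />
  </formats>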

On Tue, Apr 6, 2010 at 10:16 AM, Jonathan Rochkind rochk...@jhu.edu wrote:
 Yeah, we need some actual documentation on Zotero's use of unAPI in general.
 Maybe if I can figure it out (perhaps by asking the developer(s)) I'll write
 some for them.

 Robert Forkel wrote:

 well, looks like a combination:
 in case of mods it checks for the namespace URL, in case of rdf, it
 looks for a format name of rdf_dc, ...
 and yes, endnote export would have to have a name of endnote (i ran
 into this problem as well with names like endnote-utf-8, ...). i think
 unapi would be more usable if there were at least a recommendation of
 common format names.

 On Tue, Apr 6, 2010 at 4:07 PM, Jonathan Rochkind rochk...@jhu.edu
 wrote:


 Wait, does it actually recognize the format by the format _name_ used,
 and
 not by a mime content-type?  Like unless my unAPI server calls the
 endnote
 format endnote, it won't recognize it?  That would be odd, and good to
 know. I thought the unAPI format names were purely arbitrary, but
 recognized
 by their association with a mime content-type like application/x-
 /endnote/-refer.   But no, at least as far as Zotero is concerned, you
 have
 to pick format shortnames that match what Zotero expects?


 Robert Forkel wrote:


 from looking at line 14 here
 https://www.zotero.org/trac/browser/extension/trunk/translators/unAPI.js
 i'd say:
 ad 1. RECOGNIZABLE_FORMATS = [mods, marc, endnote, ris,
 bibtex, rdf] also see function checkFormats
 ad 2. the order listed above
 ad 4.: from my experience the unapi scraper takes precedence over coins

 On Tue, Apr 6, 2010 at 3:48 PM, Jonathan Rochkind rochk...@jhu.edu
 wrote:



 Anyone know if there's any developer documentation for Zotero on it's
 use
 of
 unAPI?  Alternately, anyone know where I can find the answers to these
 questions, or know the answers to these questions themselves?

 1. What formats will Zotero use via unAPI. What mime content-types does
 it
 use to recognize those formats (sometimes a format has several in use,
 or
 no
 official content-type).

 2. What is Zotero's order of preference when multiple formats via unAPI
 are
 available?

 3. Will Zotero get confused if different documents on the page have
 different formats available?  This can be described with unAPI, but it
 seems
 atypical, so not sure if it will confuse Zotero.

 4. If both unAPI and COinS are on a given page -- will Zotero use both
 (resulting in possible double-import for citations exposed both ways).
 Or
 only one? Or depends on how you set up the HTML?

 5. Somewhere that now I can't find I saw a mention of a Zotero RDF
 format
 that Zotero would consume via unAPI. Is there any documentation of this
 format/vocabulary, how can I find out how to write it?











-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] planet code4lib code (was: newbie)

2010-03-28 Thread Bill Dueber
I know some systems (I'm thinking of CPAN and Gemcutter in particular) have
feeds of new releases -- maybe we could tap into those and note when
registered projects have new releases? I don't know if that's fine-grained
enough information for what folks want.

On Sun, Mar 28, 2010 at 6:44 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 Good point Aaron. Maybe that's possible, but I'm not seeing exactly what
 the interface would look like. Without worrying about how to implement it,
 can you say more about what you'd actually want to see as a user?  Expand on
 what you mean by listens for feeds of specific types, I'm not sure what
 that means.  You'd like to see, what? Just initial commits by certain users,
 and new stable releases on certain projects (or by certain users?).   Or you
 want to have an interface that gives you the ability to choose/search
 exactly what you want to see from categories like these, accross a wide
 swatch of projects chosen as of interest?
 
 From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Aaron
 Rubinstein [arubi...@library.umass.edu]
 Sent: Sunday, March 28, 2010 6:33 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] planet code4lib code (was: newbie)

 Quoting Jonathan Rochkind rochk...@jhu.edu:

  Hmm, an aggregated feed of the commit logs (from repos that offer
  feeds, as most do), of open source projects of interest to the
  code4lib community.  Would that be at all useful?

 I think that's a start but I'd imagine that just a feed of the commit
 logs would contain a lot of noise that would drown out what might
 actually be interesting, like newly published gists, initial commits
 of projects, new project releases, etc...  I'm most familiar with
 GitHub, which indicates the type of event being published, but I'm
 sure other code repos do something similar.  Would it be possible to
 put something together using Views that listens for feeds of specific
 types published by users in the code4lib community?

 Aaron




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] PHP bashing (was: newbie)

2010-03-25 Thread Bill Dueber
Also...it's pretty good for plugging leaks in ducts.

On Thu, Mar 25, 2010 at 11:51 AM, Nate Vack njv...@wisc.edu wrote:

 On Thu, Mar 25, 2010 at 10:00 AM, Joe Hourcle
 onei...@grace.nascom.nasa.gov wrote:

  You say that as if duct tape is a bad thing for auto repairs.  Not all
 duct
  tape repairs are candidates for There, I fixed it![1].  It works just
 fine
  for the occassional hose repair.

 At the risk of taking an off-topic conversation even further into
 Peanut Heaven, automotive hose repair is actually one of the things
 duct tape is least well-suited to. The adhesive doesn't bond when wet,
 it's not strong enough to hold much pressure or vacuum (especially
 moderate continuous pressure), and it fails very quickly at even
 moderately high temperatures. And it tends to leave goo all over
 everything, thus adding headaches to the proper repair you'll still
 need later.

 Duct tape is OK for keeping a wire bundle out of your fan or
 something, but if you try to fix a leak in your radiator hose with it,
 you'll still be stranded and also have gooey duct tape adhesive all
 over the place.

 Extending these points to the ongoing language debate is an exercise
 that will benefit no one ;-)

 Cheers (and just get that hose replaced ;-)
 -Nate




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Bill Dueber
On the one hand, I'm all for following specs. But on the other...should we
really be too concerned about dealing with the full flexibility of the 2709
spec, vs. what's actually used? I mean, I hope to god no one is actually
creating new formats based on 2709!

If there are real-life examples in the wild of, say, multi-character
indicators, or subfield codes of more than one character, that's one thing.

BTW, in the stuff I proposed, you know a controlfield vs. a datafield
because of the length of the array (2 vs. 4); it's well-specified, but by the
size of the tuple, not by label.

On Mon, Mar 15, 2010 at 11:22 AM, Houghton,Andrew hough...@oclc.org wrote:

  From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
  Jonathan Rochkind
  Sent: Monday, March 15, 2010 11:53 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]
 
  I would just ask why you didn't use Bill Dueber's already existing
  proto-spec, instead of making up your own incomptable one.

 Because the internal use of our specification predated Bill's blog entry,
 dated 2010-02-25, by almost a year.  Bill's post reminded me that I had not
 published or publicly discussed our specification.

 Secondly, Bill's specification loses semantics from ISO 2709, as I
 previously pointed out.  His specification clumps control and data fields
 into one property named fields. According to ISO 2709, control and data
 fields have different semantics.  You could have a control field tagged as
 001 and a data field tagged as 001 which have different semantics.  MARC-21
 has imposed certain rules for assignment of tags such that this isn't a
 concern, but other systems based on ISO 2709 may not.


 Andy.




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] questions about 2011 conference proposal

2010-03-15 Thread Bill Dueber
I'm pretty sure the closest real hotel (there are a couple bed & breakfasts)
is the new Hilton downtown; it's about 3/4 of a mile straight down Kirkwood
Ave and probably a 12-minute walk.

On Mon, Mar 15, 2010 at 9:20 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 (Code4Lib listserv, Robert McDonald CC'd). I have a question about the
 Bloomington Code4Lib conference proposal. (I would personally be quite happy
 for the conference to be in Bloomington).

 I note that the actual IU conference center has under 200 rooms. Probably
 not enough for all attendees even if we take every room. Are the other
 hotels in Bloomington a quick walk to the conference center, and what are
 their rates like?

 (I would ask the same thing about the Vancouver proposal, but they say they
 can secure $109 rates at two named hotels, which I'm assuming would have
 enough rooms for us all, and I'm assuming are close enough to the proposed
 meeting venue to work, although I haven't looked it up on google maps.)

 Jonathan




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-06 Thread Bill Dueber
On Sat, Mar 6, 2010 at 1:57 PM, Houghton,Andrew hough...@oclc.org wrote:

  A way to fix this issue is to say that use cases #1 and #2 conform to
 media type application/json and use case #3 conforms to a new media type
 say: application/marc+json.  This new application/marc+json media type now
 becomes a library centric standard and it avoids breaking a widely deployed
 Web standard.


I'm so sorry -- it never dawned on me that anyone would think that I was
asserting that a JSON MIME type should return anything but JSON. For the
record, I think that's batshit crazy. JSON needs to return json. I'd been
hoping to convince folks that we need to have a standard way to pass records
around that doesn't require a streaming parser/writer; not ignore standard
MIME-types willy-nilly. My use cases exist almost entirely outside the
browser environment (because, my god, I don't want to have to try to deal
with MARC21, whatever the serialization, in a browser environment); it
sounds like Andy is almost purely worried about working with a MARC21
serialization within a browser-based javascript environment.

Anyway, hopefully, it won't be a huge surprise that I don't disagree with
any of the quote above in general; I would assert, though, that
application/json and application/marc+json should both return JSON (in the
same way that text/xml, application/xml, and application/marc+xml can all be
expected to return XML). Newline-delimited json is starting to crop up in a
few places (e.g. couchdb) and should probably have its own mime type and
associated extension. So I would say something like:

application/json -- return json (obviously)
application/marc+json  -- return json
application/marc+ndj  -- return newline-delimited json

In all cases, we should agree on a standard record serialization, though,
and the pure-json returns should include something that indicates what the
heck it is (hopefully a URI that can act as a distinct namespace-type
identifier, including a version in it).

The question for me, I think, is whether within this community,  anyone who
provides one of these types (application/marc+json and application/marc+ndj)
should automatically be expected to provide both. I don't have an answer for
that.

 -Bill-


Re: [CODE4LIB] Code4Lib Midwest?

2010-03-05 Thread Bill Dueber
I'm pretty sure I could make it from Ann Arbor!

On Fri, Mar 5, 2010 at 10:12 AM, Ken Irwin kir...@wittenberg.edu wrote:

 I would come from Ohio to wherever we choose. Kalamazoo would suit me just
 fine; I've not been back there in entirely too long!
 Ken

  -Original Message-
  From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
  Scott Garrison
  Sent: Friday, March 05, 2010 8:37 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] Code4Lib Midwest?
 
  +1
 
  ELM, I'm happy to help coordinate in whatever way you need.
 
  Also, if we can find a drummer, we could do a blues trio (count me in on
 bass). I
  could bring our band's drummer (a HUGE ND fan) down for a day or two if
  needed--he's awesome.
 
  --SG
  WMU in Kalamazoo
 
  - Original Message -
  From: Eric Lease Morgan emor...@nd.edu
  To: CODE4LIB@LISTSERV.ND.EDU
  Sent: Thursday, March 4, 2010 4:38:53 PM
  Subject: Re: [CODE4LIB] Code4Lib Midwest?
 
  On Mar 4, 2010, at 3:29 PM, Jonathan Brinley wrote:
 
2. share demonstrations
  
   I'd like to see this be something like a blend between lightning talks
   and the ask anything session at the last conference
 
  This certainly works for me, and the length of time of each talk
 would/could be
  directly proportional to the number of people who attend.
 
 
4. give a presentation to library staff
  
   What sort of presentation did you have in mind, Eric?
  
   This also raises the issue of weekday vs. weekend. I'm game for
   either. Anyone else have a preference?
 
  What I was thinking here was a possible presentation to library
 faculty/staff
  and/or computing faculty/staff from across campus. The presentation could
 be
  one or two cool hacks or solutions that solved wider, less geeky
 problems.
  Instead of tweaking Solr's term-weighting algorithms to index
 OAI-harvested
  content it would be making journal articles easier to find. This would
 be an
  opportunity to show off the good work done by institutions outside Notre
 Dame.
  A prophet in their own land is not as convincing as the expert from afar.
 
  I was thinking it would happen on a weekday. There would be more stuff
 going
  on here on campus, as well as give everybody a break from their normal
 work
  week. More specifically, I would suggest such an event take place on a
 Friday
  so the poeple who stayed over night would not have to take so many days
 off of
  work.
 
 
5. have a hack session
  
   It would be good to have 2 or 3 projects we can/should work on decided
   ahead of time (in case no one has any good ideas at the time), and
   perhaps a couple more inspired by the earlier presentations.
 
 
 
  True.
 
  --
  ELM
  University of Notre Dame




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 12:01 PM, Houghton,Andrew hough...@oclc.org wrote:

 Too bad I didn't attend code4lib.  OCLC Research has created a version of
 MARC in JSON and will probably release FAST concepts in MARC binary,
 MARC-XML and our MARC-JSON format among other formats.  I'm wondering
 whether there is some consensus that can be reached and standardized at LC's
 level, just like OCLC, RLG and LC came to consensus on MARC-XML.
  Unfortunately, I have not had the time to document the format, although it
 fairly straight forward, and yes we have an XSLT to convert from MARC-XML to
 MARC-JSON.  Basically the format I'm using is:


The stuff I've been doing:

  http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/

... is pretty much the same, except:

  1. I don't explicitly split up control and data fields. There's a single
field list; an item that has two elements is a control field (tag/data); one
with four is a data field (tag / ind1 /ind2 / array_of_subfield)

  2. Instead of putting a collection in a big json array, I use
newline-delimited-json (basically, just stick one record on each line as a
single json hash). This has the advantage that it makes streaming much, much
easier, and makes doing some other things (e.g., grab the first record or
two) much cheaper for even the dumbest json parser). I'm not sure what the
state of JSON streaming parsers are; I know Jackson (for Java) can do it,
and perl's JSON::XS can...kind of...but it's not great.

3. I include a type (MARC-JSON, MARC-HASH, whatever) and version: [major,
minor] in each record. There's already a ton of JSON floating around the
library world; labeling what the heck a structure is is just friendly :-)

MARC's structure is dumb enough that we collectively basically can't miss;
there's only so much you can do with the stuff, and a round-trip to JSON and
back is easy to implement.
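
To make the tuple scheme concrete, here's a chunk of Andy's sample record
(quoted at the bottom of this message) rendered as marc-hash -- one of
these per line, wrapped here only for readability:

  { "type"    : "marc-hash",
    "version" : [1, 0],
    "leader"  : "01192cz  a2200301n  4500",
    "fields"  : [
      ["001", "fst01303409"],
      ["040", " ", " ", [["a", "OCoLC"], ["b", "eng"]]],
      ["151", " ", " ", [["a", "Hawaii"], ["z", "Diamond Head"]]]
    ] }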

I'm not super-against explicitly labeling the data elements ("tag":, "ind1":,
etc.) but I don't see where it's necessary unless you're planning on adding
out-of-band data to the records/fields/subfields at some point. Which might
be kinda cool (e.g., language hints on a per-subfield basis? Tokenization
hints for non-whitespace-delimited languages? URIs for unique concepts and
authorities where they exist for easy creation of RDF?)

I *am*, however, willing to push and push and push for NDJ instead of having
to deal with streaming JSON parsing, which to my limited understanding is
hard to get right and to my more qualified understanding is a pain in the
ass to work with.

And anything we do should explicitly be UTF-8 only; converting from MARC-8
is a problem for the server, not the receiver.

Support for what I've been calling marc-hash (I like to decouple it from the
eventual JSON format in case the serialization preferences change, or at
least so implementations don't get stuck with a single JSON library) is
already baked into ruby-marc, and obviously implementations are dead-easy no
matter what the underlying language is.

Anyone from the LoC want to get in on this?

 -Bill-




 [
  ...
 ]

 which represents a collection of MARC records or

 {
  ...
 }

 which represents a single MARC record that takes the form:

 {
   "leader" : "01192cz  a2200301n  4500",
   "controlfield" :
   [
     { "tag" : "001", "data" : "fst01303409" },
     { "tag" : "003", "data" : "OCoLC" },
     { "tag" : "005", "data" : "20100202194747.3" },
     { "tag" : "008", "data" : "060620nn anznnbabn  || ana d" }
   ],
   "datafield" :
   [
     {
       "tag" : "040",
       "ind1" : " ",
       "ind2" : " ",
       "subfield" :
       [
         { "code" : "a", "data" : "OCoLC" },
         { "code" : "b", "data" : "eng" },
         { "code" : "c", "data" : "OCoLC" },
         { "code" : "d", "data" : "OCoLC-O" },
         { "code" : "f", "data" : "fast" }
       ]
     },
     {
       "tag" : "151",
       "ind1" : " ",
       "ind2" : " ",
       "subfield" :
       [
         { "code" : "a", "data" : "Hawaii" },
         { "code" : "z", "data" : "Diamond Head" }
       ]
     }
   ]
 }




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 1:10 PM, Houghton,Andrew hough...@oclc.org wrote:


 I decided to stick closer to a MARC-XML type definition since it would be
 easier to explain how the two specifications are related, rather than take a
 more radical approach in producing a specification less familiar.  Not to
 say that other approaches are bad, they just have different advantages and
 disadvantages.  I was going for simple and familiar.


That makes sense, but please consider adding a format/version (which we get
in MARC-XML from the namespace and isn't present here). In fact, please
consider adding a format / version / URI, so people know what they've got.

I'm also going to again push the newline-delimited-json stuff. The
collection-as-array is simple and very clean, but leads to trouble
for production (where for most of us we'd have to get the whole freakin'
collection in memory first and then call JSON.dump or whatever)
or consumption (have to deal with a streaming json parser). The production
part is particularly worrisome, since I'd hate for everyone to have to
default to writing out a '[', looping through the records, and writing a
']'. Yeah, it's easy enough, but it's an ugly hack that *everyone* would
have to do, as opposed to just something like:

   while r = next_record
     print r.to_json, "\n"
   end

Unless, of course, writing json to a stream and reading json from a stream
is a lot easier than I make it out to be across a variety of languages and I
just don't know it, which is entirely possible. The streaming writer
interfaces for Perl (
http://search.cpan.org/dist/JSON-Streaming-Writer/lib/JSON/Streaming/Writer.pm)
and Java's Jackson (
http://wiki.fasterxml.com/JacksonInFiveMinutes#Streaming_API_Example) are a
little more daunting than I'd like them to be.

Not wanting to argue unnecessarily, here; just adding input before things
get effectively set in stone.

 -Bill-

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 3:14 PM, Houghton,Andrew hough...@oclc.org wrote:


 As you point out JSON streaming doesn't work with all clients and I am
 hesitent to build on anything that all clients cannot accept.  I think part
 of the issue here is proper API design.  Sending tens of megabytes back to a
 client and expecting them to process it seems like a poor API design
 regardless of whether they can stream it or not.  It might make more sense
 to have a server API send back 10 of our MARC-JSON records in a JSON
 collection and have the client request an additional batch of records for
 the result set.  In addition, if I remember correctly, JSON streaming or
 other streaming methods keep the connection to the server open which is not
 a good thing to do to maintain server throughput.


I guess my concern here is that the specification, as you're describing it,
is closing off potential uses. It seems fine if, for example, your primary
concern is javascript-in-the-browser and browser-requested,
pagination-enabled systems are all you're worried about right now.

That's not the whole universe of uses, though. People are going to want to
dump these things into a file to read later -- no possibility for pagination
in that situation. Others may, in fact, want to stream a few thousand
records down the pipe at once, but without a streaming parser that can't
happen if it's all one big array.

I worry that as specified, the *only* use will be, Pull these down a thin
pipe, and if you want to keep them for later, or want a bunch of them, you
have to deal with marc-xml. Part of my incentive is to *not* have to use
marc-xml, but in this case I'd just be trading one technology I don't like
(marc-xml) for two technologies, one of which I don't like (that'd be
marc-xml again).

I really do understand the desire to make this parallel to marc-xml, but
there's a seam between the two technologies that makes that a problematic
approach.



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 4:38 PM, Houghton,Andrew hough...@oclc.org wrote:


 Maybe I have been mislead or misunderstood JSON streaming.


This is my central point. I'm actually saying that JSON streaming is painful
and rare enough that it should be avoided as a requirement for working with
any new format.

I guess, in sum, I'm making the following assertions:

1. Streaming APIs for JSON, where they exist, are a pain in the ass. And
they don't exist everywhere. Without a JSON streaming parser, you have to
pull the whole array of documents up into memory, which may be impossible.
This is the crux of my argument -- if you disagree with it, then I would
assume you disagree with the other points as well.

2. Many people -- and I don't think I'm exaggerating here, honestly --
really don't like using MARC-XML but have to because of the length
restrictions on MARC-binary. A useful alternative, based on dead-easy
parsing and production, is very appealing.

2.5 Having to deal with a streaming API takes away the dead-easy part.

3. If you accept my assertions about streaming parsers, then dealing with
the format you've proposed for large sets is either painful (with a
streaming API) or impossible (where such an API doesn't exist) due to memory
constraints.

4. Streaming JSON writer APIs are also painful; everything that applies to
reading applies to writing. Sans a streaming writer, trying to *write* a
large JSON document also results in you having to have the whole thing in
memory.

5. People are going to want to deal with this format, because of its
benefits over marc21 (record length) and marc-xml (ease of processing),
which means we're going to want to deal with big sets of data and/or dump
batches of it to a file. Which brings us back to #1, the pain or absence of
streaming apis.

Write a better JSON parser/writer  or use a different language seem like
bad solutions to me, especially when a (potentially) useful alternative
exists.

As I pointed out, if streaming JSON is no harder/unavailable to you than
non-streaming json, then this is mostly moot. I assert that for many people
in this community it is one or the other, which is why I'm leery of it.

  -Bill-


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Bill Dueber
On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew hough...@oclc.org wrote:

 OK, I will bite, you stated:

 1. That large datasets are a problem.
 2. That streaming APIs are a pain to deal with.
 3. That tool sets have memory constraints.

 So how do you propose to process large JSON datasets that:

 1. Comply with the JSON specification.
 2. Can be read by any JavaScript/JSON processor.
 3. Do not require the use of streaming API.
 4. Do not exceed the memory limitations of current JSON processors.


What I'm proposing is that we don't process large JSON datasets; I'm
proposing that we process smallish JSON documents one at a time by pulling
them out of a stream based on an end-of-record character.

This is basically what we use for MARC21 binary format -- have a defined
structure for a valid record, and separate multiple well-formed record
structures with an end-of-record character. This preserves JSON
specification adherence at the record level and uses a different scheme to
represent collections. Obviously, MARC-XML uses a different mechanism to
define a collection of records -- putting well-formed record structures
inside a collection tag.

So... I'm proposing define what we mean by a single MARC record serialized
to JSON (in whatever format; I'm not very opinionated on this point) that
preserves the order, indicators, tags, data, etc. we need to round-trip
between marc21binary, marc-xml, and marc-json.

And then separate those valid records with an end-of-record character --
\n.

Unless I've read all this wrong, you've come to the conclusion that the
benefit of having a JSON serialization that is valid JSON at both the record
and collection level outweighs the pain of having to deal with a streaming
parser and writer.  This allows a single collection to be treated as any
other JSON document, which has obvious benefits (which I certainly don't
mean to minimize) and all the drawbacks we've been talking about *ad nauseam
*.

I go the the other way. I think the pain of dealing with a streaming API
outweighs the benefits of having a single valid JSON structure for a
collection, and instead have put forward that we use a combination of JSON
records and a well-defined end-of-record character (\n) to represent a
collection.  I recognize that this involves providing special-purpose code
which must call for JSON-deserialization on each line, instead of being able
to throw the whole stream/file/whatever at your json parser is. I accept
that because getting each line of a text file is something I find easy
compared to dealing with streaming parsers.
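
The entire consumer, in other words, is something like this (a sketch;
substitute whatever JSON library you like for the parse call):

  require 'json'

  File.open('records.ndj') do |f|
    f.each_line do |line|
      rec = JSON.parse(line)  # one complete, valid JSON record per line
      # ...do whatever you want with rec...
    end
  end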

And our point of disagreement, I think, is that I believe that defining the
collection structure in such a way that we need two steps (get a line;
deserialize that line) and can't just call the equivalent of
JSON.parse(stream) has benefits in ease of implementation and use that
outweigh the loss of having both a single record and a collection of records
be valid JSON. And you, I think, don't :-)

I'm going to bow out of this now, unless I've got some part of our positions
wrong, to let any others that care (which may number zero) chime in.

 -Bill-










-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] HathiTrust API

2010-02-24 Thread Bill Dueber
I didn't put in links for RIS-type formats because, I think, I don't really
understand the semantics of the link tag. The RIS output for a record is a
tiny percentage of what's in a record -- is it really another
representation or is it a different thing altogether?

On Wed, Feb 24, 2010 at 3:45 AM, Ed Summers e...@pobox.com wrote:

 Nice work Bill! I particularly like your use of the link element to
 enable auto-discovery of these resources:

  <link rel="canonical" href="/Record/005550418">
  <link rel="alternate" type="application/marc" href="/Record/005550418.mrc" >
  <link rel="alternate" type="application/marc+xml" href="/Record/005550418.xml" >
  <link rel="alternate" href="/Record/005550418.rdf" type="application/rdf+xml" />

 Did you shy away from adding the RIS and Refworks formats as links
 because it wasn't clear what MIME type to use?

 I'd be interested in helping flesh out the RDF a bit if you are interested.

 //Ed

 On Tue, Feb 23, 2010 at 4:07 PM, Bill Dueber b...@dueber.com wrote:
  Many of you just saw Albert Betram of the University of Michigan
 Libraries
  talk at #c4l10 about HathiTrust APIs available to anyone interested. One
 of
  these, the BibAPI, was formed mostly by me on the basis of Imaginary
 User
  Needs, not actual use cases. Anyone who has use cases that aren't
  well-covered by the existing BibAPI should drop me a line and let me
 know.
 
  This is also a good time to mention that catalog.hathitrust.org (and
  mirlyn.lib.umich.edu) support some limited export facilities by adding
 an
  extension to a record URL. SO...
 
 
  http://catalog.hathitrust.org/Record/005550418   Link to the
 Hathitrust
  page
  http://catalog.hathitrust.org/Record/005550418.marc  MARC21 binary
  http://catalog.hathitrust.org/Record/005550418.xml   MARC-XML
  http://catalog.hathitrust.org/Record/005550418.ris   RIS tagged format
  http://catalog.hathitrust.org/Record/005550418.refworks Refworks tagged
  format
  http://catalog.hathitrust.org/Record/005550418.rdf   Perfunctory RDF
  document
 
  I'd love help getting the RDF more fleshed out, btw.
 
  Again -- if you need anything else, or if you, say, wrap a nice jQuery
  plugin around the BibAPI, please let me know!
 
   -Bill-
 
 
 
  Bill Dueber
  Library Systems Programmer
  University of Michigan Library
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] HathiTrust API

2010-02-24 Thread Bill Dueber
OK, I've added links for RIS and Endnote, but it turns out I *don't* know
what mime type to use for Refworks. When actually talking to refworks with
their callback system, I need to send it as text/plain, and I've been unable
to track down what the preferred type is.

Anyone know?

On Wed, Feb 24, 2010 at 3:45 AM, Ed Summers e...@pobox.com wrote:

 Nice work Bill! I particularly like your use of the link element to
 enable auto-discovery of these resources:

  <link rel="canonical" href="/Record/005550418">
  <link rel="alternate" type="application/marc" href="/Record/005550418.mrc" >
  <link rel="alternate" type="application/marc+xml" href="/Record/005550418.xml" >
  <link rel="alternate" href="/Record/005550418.rdf" type="application/rdf+xml" />

 Did you shy away from adding the RIS and Refworks formats as links
 because it wasn't clear what MIME type to use?

 I'd be interested in helping flesh out the RDF a bit if you are interested.

 //Ed

 On Tue, Feb 23, 2010 at 4:07 PM, Bill Dueber b...@dueber.com wrote:
  Many of you just saw Albert Betram of the University of Michigan
 Libraries
  talk at #c4l10 about HathiTrust APIs available to anyone interested. One
 of
  these, the BibAPI, was formed mostly by me on the basis of Imaginary
 User
  Needs, not actual use cases. Anyone who has use cases that aren't
  well-covered by the existing BibAPI should drop me a line and let me
 know.
 
  This is also a good time to mention that catalog.hathitrust.org (and
  mirlyn.lib.umich.edu) support some limited export facilities by adding
 an
  extension to a record URL. SO...
 
 
  http://catalog.hathitrust.org/Record/005550418   Link to the
 Hathitrust
  page
  http://catalog.hathitrust.org/Record/005550418.marc  MARC21 binary
  http://catalog.hathitrust.org/Record/005550418.xml   MARC-XML
  http://catalog.hathitrust.org/Record/005550418.ris   RIS tagged format
  http://catalog.hathitrust.org/Record/005550418.refworks Refworks tagged
  format
  http://catalog.hathitrust.org/Record/005550418.rdf   Perfunctory RDF
  document
 
  I'd love help getting the RDF more fleshed out, btw.
 
  Again -- if you need anything else, or if you, say, wrap a nice jQuery
  plugin around the BibAPI, please let me know!
 
   -Bill-
 
 
 
  Bill Dueber
  Library Systems Programmer
  University of Michigan Library
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] HathiTrust API

2010-02-24 Thread Bill Dueber
OK, slow it down J-Rock. :-)

I'm looking for a MIME type for Refworks Tagged Format, which is NOT RIS.
It's a different tagged format. The three most common tagged formats are
RIS, Refworks, and Endnote-style-Refer. It's the Refworks one I need help
with.

And I gave up on the marc-lines-pretend-format stuff; I just send Refworks
their preferred tagged format now, I just don't know what MIME type to use.

On Wed, Feb 24, 2010 at 9:57 AM, Jonathan Rochkind rochk...@jhu.edu wrote:

 Do you mean what's the mime-type for RIS files? (RIS != Refworks, I
 forget what RIS stands for, but it's used by many reference managers, and
 may have originally been invented by EndNote?)

 There isn't a registered MIME type for RIS.  Googling around, it looks like
 the preferred one is: application/x-Research-Info-Systems

 (Guess that's what RIS stands for?)


 Or wait, you mean the Refworks callback?You can actually give Refworks
 a variety of types of content in the callback.  Are you giving it that weird
 marc-formatted-a-certain-way-in-a-textfile format?  I doubt there's any mime
 type for that other than text/plain.

 Jonathan


 
 From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Bill
 Dueber [b...@dueber.com]
 Sent: Wednesday, February 24, 2010 9:47 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] HathiTrust API

 OK, I've added links for RIS and Endnote, but it turns out I *don't* know
 what mime type to use for Refworks. When actually talking to refworks with
 their callback system, I need to send it as text/plain, and I've been
 unable
 to track down what the preferred type is.

 Anyone know?

 On Wed, Feb 24, 2010 at 3:45 AM, Ed Summers e...@pobox.com wrote:

  Nice work Bill! I particularly like your use of the link element to
  enable auto-discovery of these resources:
 
   <link rel="canonical" href="/Record/005550418">
   <link rel="alternate" type="application/marc" href="/Record/005550418.mrc" >
   <link rel="alternate" type="application/marc+xml" href="/Record/005550418.xml" >
   <link rel="alternate" href="/Record/005550418.rdf" type="application/rdf+xml" />
 
  Did you shy away from adding the RIS and Refworks formats as links
  because it wasn't clear what MIME type to use?
 
  I'd be interested in helping flesh out the RDF a bit if you are
 interested.
 
  //Ed
 
  On Tue, Feb 23, 2010 at 4:07 PM, Bill Dueber b...@dueber.com wrote:
   Many of you just saw Albert Betram of the University of Michigan
  Libraries
   talk at #c4l10 about HathiTrust APIs available to anyone interested.
 One
  of
   these, the BibAPI, was formed mostly by me on the basis of Imaginary
  User
   Needs, not actual use cases. Anyone who has use cases that aren't
   well-covered by the existing BibAPI should drop me a line and let me
  know.
  
   This is also a good time to mention that catalog.hathitrust.org (and
   mirlyn.lib.umich.edu) support some limited export facilities by adding
  an
   extension to a record URL. SO...
  
  
   http://catalog.hathitrust.org/Record/005550418   Link to the
  Hathitrust
   page
   http://catalog.hathitrust.org/Record/005550418.marc  MARC21 binary
   http://catalog.hathitrust.org/Record/005550418.xml   MARC-XML
   http://catalog.hathitrust.org/Record/005550418.ris   RIS tagged format
   http://catalog.hathitrust.org/Record/005550418.refworks Refworks
 tagged
   format
   http://catalog.hathitrust.org/Record/005550418.rdf   Perfunctory RDF
   document
  
   I'd love help getting the RDF more fleshed out, btw.
  
   Again -- if you need anything else, or if you, say, wrap a nice jQuery
   plugin around the BibAPI, please let me know!
  
-Bill-
  
  
  
   Bill Dueber
   Library Systems Programmer
   University of Michigan Library
  
 



 --
 Bill Dueber
 Library Systems Programmer
 University of Michigan Library




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] HathiTrust API

2010-02-23 Thread Bill Dueber
Many of you just saw Albert Betram of the University of Michigan Libraries
talk at #c4l10 about HathiTrust APIs available to anyone interested. One of
these, the BibAPI, was formed mostly by me on the basis of Imaginary User
Needs, not actual use cases. Anyone who has use cases that aren't
well-covered by the existing BibAPI should drop me a line and let me know.

This is also a good time to mention that catalog.hathitrust.org (and
mirlyn.lib.umich.edu) support some limited export facilities by adding an
extension to a record URL. SO...


http://catalog.hathitrust.org/Record/005550418   Link to the Hathitrust
page
http://catalog.hathitrust.org/Record/005550418.marc  MARC21 binary
http://catalog.hathitrust.org/Record/005550418.xml   MARC-XML
http://catalog.hathitrust.org/Record/005550418.ris   RIS tagged format
http://catalog.hathitrust.org/Record/005550418.refworks Refworks tagged
format
http://catalog.hathitrust.org/Record/005550418.rdf   Perfunctory RDF
document

I'd love help getting the RDF more fleshed out, btw.

Again -- if you need anything else, or if you, say, wrap a nice jQuery
plugin around the BibAPI, please let me know!

 -Bill-



Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] urldecode problem and CAS

2010-01-27 Thread Bill Dueber
I'd first make sure you're not url-encoding your return URL twice. I'd like
to believe that a CAS server would url-decode before the redirect,
but ...
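
For what it's worth, double-encoding would produce exactly this symptom,
since CAS (or anything else) only unescapes the service URL once. A quick
sketch, with a made-up URL:

  require 'cgi'

  return_url = 'http://example.edu/app.cfm?id=15'
  once  = CGI.escape(return_url)  # '=' becomes %3D -- what CAS expects
  twice = CGI.escape(once)        # %3D becomes %253D -- the bug

  # CAS unescapes the service parameter once before redirecting, so a
  # double-encoded URL comes back with the inner escaping still intact:
  CGI.unescape(twice) == once     # => true; the '=' is still stuck as %3D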

On Wed, Jan 27, 2010 at 12:38 PM, Jimmy Ghaphery jghap...@vcu.edu wrote:

 Yes the original url looks like

 http://../app.cfm?id=15
 and the return url coming back from CAS looks like

 http://../app.cfm?id%3d15

 I am pretty sure this is native to the way CAS returns urls, and probably
 need to ping some ColdFusion folks on how to deal with the urlencoded
 return. I'll also message the ColdFusion library group.

 If anyone out here has CAS experience and can confirm that urlencoded
 return urls seem normal that would be helpful.




 Walker, David wrote:

 So a user arrives at your app.  You see that they are not logged in, and
 so redirect them to the CAS server with a return URL back to your
 application.

 Do you have an example of that URL?

 --Dave

 ==
 David Walker
 Library Web Services Manager
 California State University
 http://xerxes.calstate.edu
 
 From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Jimmy
 Ghaphery [jghap...@vcu.edu]
 Sent: Wednesday, January 27, 2010 9:18 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] urldecode problem and CAS

 CODE4LIB,

 I'm looking for some urldecode help if possible. I have an app that gets
 a call through a url which looks like this in order to pull up a
 specific record:
 http://../app.cfm?id=15

 It is password protected and we have recently moved to CAS for
 authentication. After it gets passed from CAS back to our server it
 looks like this and tosses an error:
 http://../app.cfm?id%3d15

 The equals sign translated to %3d

 Any ideas are appreciated.

 thanks

 -Jimmy


 --
 Jimmy Ghaphery
 Head, Library Information Systems
 VCU Libraries
 http://www.library.vcu.edu
 --


 --
 Jimmy Ghaphery
 Head, Library Information Systems
 VCU Libraries
 http://www.library.vcu.edu
 --




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Choosing development platforms and/or tools, how'd you do it?

2010-01-06 Thread Bill Dueber
On Wed, Jan 6, 2010 at 8:53 AM, Joel Marchesoni jma...@email.wcu.eduwrote:

 I agree with Dan's last point about avoiding using a special IDE to develop
 with a language.


I'll respectfully, but vehemently, disagree. I would say avoid *forcing*
everyone working on the project depend on a special IDE -- avoid lockin.
Don't avoid use.

There's a spectrum of how much an editor/environment can know about a
program. At one end is Smalltalk, where the development environment *is* the
program. At the other end is something like LISP (and, to an extent, Ruby)
where so little can be inferred from the syntax of the code that a smart
IDE can't actually know much other than how to match parentheses.

For languages where little can be known at compile time, an IDE may not buy
you very much other than syntax highlighting and code folding. For Java,
C++, etc. an IDE can know damn near everything about your project and
radically up your productivity -- variable renaming, refactoring,
context-sensitive help, jump-to-definition, method-name completion, etc. It
really is a difference that makes a difference.

I know folks say they can get the same thing from vim or emacs, but at that
level those editors are no less complex (and a good deal more opaque) than
something like Eclipse or Netbeans unless you already have a decade of
experience with them.

If you're starting in a new language, try a couple editors, too. Both
Eclipse and Netbeans are free and cross-platform, and have support for a lot
of languages. Editors like Notepad++, EditPlus, TextMate, jEdit, and BBEdit
can all do very nice things with a variety of languages.



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] University of Michigan Solr filters for ISBN/LCCN and High Level Browse code available at github

2009-11-16 Thread Bill Dueber
I've made available the code we use in the solrmarc/solr installation behind
http://mirlyn.lib.umich.edu to normalize LCCNs and ISBNs and add our local
High Level Browse LC-callnumber-based categorization scheme.

The code itself and a downloadable .jar file for the normalizers are
available at

http://github.com/billdueber/lib.umich.edu-solr-stuff

The README has usage examples as well, so you know what to put in your
schema.xml.
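
The general shape is a filter in an analyzer chain, something like the
following -- the factory class name here is made up, so check the README
for the real one:

  <fieldType name="isbn" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <!-- hypothetical class name; see the README for the actual factory -->
      <filter class="edu.umich.lib.solr.ISBNNormalizerFilterFactory"/>
    </analyzer>
  </fieldType>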

The source is not pretty in the same way the sea is not above the sky, but
it all works as best as I can tell and we all know the dangers of waiting
to  clean up code before release. Patches are, of course, always welcome.

 -Bill-



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] FW: PURL Server Update 2

2009-09-02 Thread Bill Dueber
Andy, I think there are three issues here:

1. Should the GPO put in place, at least at the moment, some throttling for
user agents behaving like dicks?
2. Should III (and others), when acting as a user agent, be such a dick?
3. How do I know if I'm being a dick?

The answers folks are offering, I think, are (1) Yes, (2) No, and (3) It's
hard to know, but you should always check robots.txt, and you should always
throttle yourself to a reasonable level unless you know the target can take
the abuse.

For the majority of the web, for the majority of the time, basic courtesy
and the gentleperson's agreement ensconced in robots.txt works fine -- most
folks who write user agents don't want to be dicks. When this informality
doesn't work, as you point out, there are solutions you can implement at
some edge of your network. Of course, at that point the requests are already
flooding through to *somewhere*, so getting things stopped as close to the
point of origin is key.
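
For anyone following along, the gentleperson's agreement in question looks
about like this -- Crawl-delay isn't part of the original spec and not
every crawler honors it, but the polite ones do (the path is made up):

  User-agent: *
  Crawl-delay: 10
  Disallow: /search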


On Wed, Sep 2, 2009 at 11:26 AM, Houghton,Andrew hough...@oclc.org wrote:

  From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
  Thomas Dowling
  Sent: Wednesday, September 02, 2009 10:25 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] FW: PURL Server Update 2
 
  The III crawler has been a pain for years and Innovative has shown no
  interest
  in cleaning it up.  It not only ignores robots.txt, but it hits target
  servers
  just as fast and hard as it can.  If you have a lot of links that a lot
  of III
  catalogs check, its behavior is indistinguishable from a DOS attack.  (I
  know
  because our journals server often used to crash about 2:00am on the
  first of
  the month...)

 I see that I didn't fully make the connection to the point I was
 making... which is that there are hardware solutions to these
 issues rather than using robots.txt or sitemap.xml.  If a user
 agent is a problem, then network folks should change the router
 to ignore the user agent or reduce the number of requests it is
 allowed to make to the server.

 In the case you point to with III hitting the server as fast as
 it can and it looking like a DOS attack to the network which
 caused the server to crash, then 1) the router hasn't been set up
 to impose throttling limits on user agents, and 2) the server
 probably isn't part of a server farm that is being load balanced.

 In the case of GPO, they mentioned or implied, that they were
 having contention issues with user agents hitting the server
 while trying to restore the data.  This contention could be
 mitigated by imposing lower throttling limits in the router on
 user agents until the data is restored and then raising the
 limits back to whatever their prescribed SLA (service level
 agreement) was.

 You really don't need to have a document on the server to tell
 user agents what to do.  You can and should impose a network
 policy on user agents, which is a far better solution in my opinion.


 Andy.




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Open, public standards v. pay per view standards and usage

2009-07-16 Thread Bill Dueber
On Thu, Jul 16, 2009 at 11:26 AM, Houghton,Andrew hough...@oclc.org wrote:

 Not saying you're wrong Ross, but it depends.  People adopted MARC-XML
 by looking at the .xsd without an actual specification.  Granted it's
 not a complicated schema however, and there already existed the MARC 21
 Specifications for Record Structure, Character Sets, and Exchange Media
 so it wasn't a big leap to adopt MARC-XML, IMHO.


I'm not disagreeing with your overall point, but this is a specious example,
I think. Examining a MARC-XML file shows you how to do a mechanical
translation from a ridiculously simple non-XML syntax into an XML syntax --
the actual data itself remains completely opaque. The MARC-XML schema +
AACR2 gives you what you need.
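
A made-up fragment shows what I mean:

    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Some title :</subfield>
      <subfield code="b">a subtitle /</subfield>
      <subfield code="c">by Somebody.</subfield>
    </datafield>

The schema tells you a datafield carries a three-character tag and two
one-character indicators; nothing in it tells you what tag 245, indicator
1, or subfield c *mean*. That's the AACR2 half of the bargain.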

The ISO 20775 holdings schema, for example, includes elements like
<xs:element name="physicalLocation"/> -- and there's no way you're going to
know what the hell goes in there without a lot more help. And if you had to
pay for that help, many would rely on cheat-sheets or pattern-matching, and
it would all go to hell.


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] WARC file format now ISO standard

2009-06-03 Thread Bill Dueber
So, can we expect a leaked final draft of RDA, do you think :-)

On Tue, Jun 2, 2009 at 5:47 PM, David Fiander da...@fiander.info wrote:

 This is a common problem with ISO standards, and the common solution
 is to do just this: release the final draft before it's approved by
 ISO as an official standard. That's what the ISO Forth programming
 language group did as well.

 - David

 On Tue, Jun 2, 2009 at 5:35 PM, st...@archive.org st...@archive.org
 wrote:
  point well taken. :)
 
  there were no significant changes to the WARC format
  between the last draft and the published standard.
 
  you can use Heritrix WARCReader, or WARC Tools warcvalidator
  to verify that you have created a valid WARC in accordance
  with the spec.
 
 
  /st...@archive.org
 
 
  On 6/2/09 2:27 PM, Ray Denenberg, Library of Congress wrote:
 
  But you have to pay $200 for the document that lists changes from last
  draft to first official version.
 
  (Ok, Ok, it was just a joke. But you do get the point.)
 
 
  - Original Message - From: st...@archive.org 
 st...@archive.org
  To: CODE4LIB@LISTSERV.ND.EDU
  Sent: Tuesday, June 02, 2009 5:18 PM
  Subject: Re: [CODE4LIB] WARC file format now ISO standard
 
 
  hi Karen,
 
  understood.
 
  the final draft of the spec is available here:
 
 
 http://www.scribd.com/doc/4303719/WARC-ISO-28500-final-draft-v018-Zentveld-080618
 
  and other (similar) versions here:
  http://archive-access.sourceforge.net/warc/
 
 
  /st...@archive.org
 
 
 
  On 6/2/09 2:15 PM, Karen Coyle wrote:
 
  Unfortunately, being an ISO standard, to obtain it costs 118 CHF
 (about
  $110 USD). Hard to follow a standard you can't afford to read. Is
 there an
  online version somewhere?
 
  kc
 
  st...@archive.org wrote:
 
  hi code4lib,
 
  if you're archiving web content, please use the WARC format.
 
  thanks,
  /st...@archive.org
 
 
 
  WARC File Format Published as an International Standard
  http://netpreserve.org/press/pr20090601.php
 
  ISO 28500:2009 specifies the WARC file format:
 
  * to store both the payload content and control information from
   mainstream Internet application layer protocols, such as the
   Hypertext Transfer Protocol (HTTP), Domain Name System (DNS),
   and File Transfer Protocol (FTP);
  * to store arbitrary metadata linked to other stored data
   (e.g. subject classifier, discovered language, encoding);
  * to support data compression and maintain data record integrity;
  * to store all control information from the harvesting protocol
   (e.g. request headers), not just response information;
  * to store the results of data transformations linked to other
   stored data;
  * to store a duplicate detection event linked to other stored
   data (to reduce storage in the presence of identical or
   substantially similar resources);
  * to be extended without disruption to existing functionality;
  * to support handling of overly long records by truncation or
   segmentation, where desired.
 
 
  more info here:
  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
 
 
 
 




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] exact title searches with z39.50

2009-04-27 Thread Bill Dueber
Like so many library standards, z39.50 is a syntax and a set of rough
guidelines. You have no idea what's actually happening on the other end,
because it's not specified, and you just have to either find someone you can
ask at the target machine or reverse engineer it.
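
For what it's worth, Bib-1 does define attributes that are *supposed* to
mean exact -- something along the lines of:

   @attr 1=4 @attr 3=1 @attr 4=1 @attr 5=100 @attr 6=3 "foo bar"

where 3=1 is position (first in field), 4=1 is structure (phrase), 5=100
is truncation (do not truncate), and 6=3 is completeness (complete field).
Whether a given target honors all four is exactly the part you have to
reverse engineer.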

On Mon, Apr 27, 2009 at 5:13 PM, Eric Lease Morgan emor...@nd.edu wrote:

 What are the ways to accomplish exact title searches with z39.50?

 I'm looping through a list of MARC records trying to determine whether or
 not we own multiple copies of an item. After reading MARC field 245,
 subfield a I am creating the following z39.50 query:

  @attr 1=4 foo bar

 Unfortunately my local implementation seems to interpret this in a rather
 regular expression sort of way -- * foo bar *. Does anybody out there know
 how to create a more exact query? I only want to find titles exactly
 equalling foo bar.

 --
 Eric Lease Morgan
 University of Notre Dame




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Something completely different

2009-04-09 Thread Bill Dueber
On Thu, Apr 9, 2009 at 10:26 AM, Mike Taylor m...@indexdata.com wrote:

 I'm not sure what to make of this except to say that Yet Another XML
 Bibliographic Format is NOT the answer!


I recognize that you're being flippant, and yet I think there's an important
nugget in here.

When you say it that way, it makes it sound as if folks are debating the
finer points of OAI-MARC vs MARC-XML -- that it's simply syntactic sugar
(although I'm certainly one to argue for the importance of syntactic sugar)
over the top of what we already have.

What's actually being discussed, of course, is the underlying data model.
E-R pairs primarily analyzed by set theory, triples forming directed graphs,
whether or not links between data elements can themselves have attributes --
these are all possible characteristics of the fundamental underpinning of a
data model to describe the data we're concerned with.

The fact that they all have common XML representations is noise, and
referencing the currently-most-common xml schema for these things is just
convenient shorthand in a community that understands the exemplars. The fact
that many in the library community don't understand that syntax is not the
same as a data model is how we ended up with RDA.  (Mike: I don't know your
stuff, but I seriously doubt you're among that group. I'm talkin' in
general, here.)

Bibliographic data is astoundingly complex, and I believe wholeheartedly
that modeling it sufficiently is a very, very hard task. But no matter the
underlying model, we should still insist on starting with the basics that
computer science folks have been using for decades now: uids (and, these
days, guids) for the important attributes, separation of data and display,
definition of sufficient data types and reuse of those types whenever
possible, separation of identity and value, full normalization of data, zero
ambiguity in the relationship diagram as a fundamental tenet, and a rigorous
mathematical model to describe how it all fits together.
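
To make just the identity/value point concrete, in miniature (a toy, not a
proposal):

    # Identity (a uid) vs. value (a display string): thousands of records
    # point at author 42, so correcting the heading touches one row.
    AUTHORS = { 42 => 'Twain, Mark, 1835-1910' }

    record = { title: 'Adventures of Huckleberry Finn', author_id: 42 }
    puts AUTHORS[record[:author_id]]   # => Twain, Mark, 1835-1910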

This is hard stuff. But it's worth doing right.




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] ANN: University of Michigan Live vuFind beta!

2009-02-11 Thread Bill Dueber
The University of Michigan University Libraries has gone live with a beta
installation of vuFind, currently branded as Mirlyn2-Beta to differentiate
it from our existing OPAC interface, Mirlyn. You can take a look at

  http://mirlyn2-beta.lib.umich.edu/

We've added several enhancements:

  - Spellcheck when there are no results (possible because we use a recent
Solr nightly) -- try searching on 'minnesoa'
  - Extraction of search specifications into an external file for easier
tweaking
  - Integration of UMich's High Level Browse subject headings (seen here
as the Academic Discipline facet)
  - Inlining of more extensive, real-time availability information in search
results
  - Reworked Refworks export

Any comments or bug reports can be sent to me or, even better, via the "Tell
us what you think" button in the upper-right corner.

Thanks to everyone in all the various communities that have offered help and
feedback!

  -Bill-


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


  1   2   >