Re: [CODE4LIB] COinS

2012-11-20 Thread Godmar Back
Funny this topic comes up right now.

A few days ago, Wikipedia (arguably the biggest provider of COinS) decided
to discontinue it because they've discovered that generating the COinS
using their decrepit infrastructure uses up so much processing power that
attempts to edit pages with lots of citations time out. See [1, 2]. That
said, there is some movement to restore them once they get their act
together and improve their infrastructure. The big irony is that this move
was driven by editors and regular contributors (it doesn't affect anyone
not signed into Wikipedia), that is, exactly those users who *ought* to
make the most regular use of COinS to actually retrieve cited material...

Just by coincidence, we have finally embarked on a project to better
process COinS. As is, we're just linking to the OpenURL resolver, which is
hit and miss - that said, it's a facility that's used. We're now keeping
statistics, and for just 10 editions we've had over 5,000 clicks in the
last three months alone. But we have additional options - Link/360 being
one for SS clients, and Summon another. We think we can do a much better
job at resolving COinS with a combination of these services. None of this
depends on the specific COinS format, of course - any suitable microformat
would work, too.

 - Godmar

[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=19262
[2] https://en.wikipedia.org/wiki/Template_talk:Citation/core#Disappointed


On Tue, Nov 20, 2012 at 4:47 PM, Bigwood, David dbigw...@hou.usra.edu wrote:

 I've used the COinS Generator at OCLC for years. Now it is gone. Any
 suggestions on how I can get an occasional COinS for use in our
 bibliography? Do any of the citation managers generate COinS?



 Or is this just an old unused metadata format that should be replaced by
 something else?



 Thanks,

 Dave Bigwood

 dbigw...@hou.usra.edu

 Lunar and Planetary Institute



Re: [CODE4LIB] COinS

2012-11-20 Thread Godmar Back
Could you elaborate on your belief that COinS is actually illegal in
HTML5? Why would that be so?

 - Godmar



On Tue, Nov 20, 2012 at 5:20 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 It _IS_ an old unused metadata format that should be replaced by something
 else (among other reasons because it's actually illegal in HTML5), but I'm
 not sure there is a something else with the right balance of flexibility,
 simplicity, and actual adoption by consuming software.

 But COinS didn't have a whole lot of adoption by consuming software
 either. Can you say what you think the COinS you've been adding are useful
 for, what they are getting used for? And what sorts of 'citations' you were
 adding them for? For my own curiosity, and because it might help answer if
 there's another solution that would still meet those needs.

 But if you want to keep using COinS -- creating a COinS generator like
 OCLC's no longer existing one is a pretty easy thing to do, perhaps some
 code4libber reading this will be persuaded to find the time to create one
 for you and others. If you have a server that could host it, you could
 offer that. :)
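For reference, the core of such a generator is tiny: a COinS is just an empty span with class Z3988 whose title attribute carries a URL-encoded OpenURL ContextObject. Here is a minimal sketch in Python (the field choices are illustrative; this is not OCLC's code):

```python
from urllib.parse import urlencode
from html import escape

def coins_span(title, author_last, isbn=None, date=None):
    """Build a COinS <span> for a book citation.

    COinS embeds an OpenURL 1.0 ContextObject, URL-encoded, in the
    title attribute of an empty span with class "Z3988".
    """
    kev = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),
        ("rft.genre", "book"),
        ("rft.btitle", title),
        ("rft.aulast", author_last),
    ]
    if isbn:
        kev.append(("rft.isbn", isbn))
    if date:
        kev.append(("rft.date", date))
    # escape() turns the "&" separators into "&amp;" for attribute safety
    return '<span class="Z3988" title="%s"></span>' % escape(urlencode(kev))

print(coins_span("Mansfield Park", "Austen", isbn="9780141439808"))
```

Hosting this behind a small web form would essentially reproduce the old generator.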




 On 11/20/2012 4:47 PM, Bigwood, David wrote:

 I've used the COinS Generator at OCLC for years. Now it is gone. Any
 suggestions on how I can get an occasional COinS for use in our
 bibliography? Do any of the citation managers generate COinS?



 Or is this just an old unused metadata format that should be replaced by
 something else?



 Thanks,

 Dave Bigwood

 dbigw...@hou.usra.edu

 Lunar and Planetary Institute





Re: [CODE4LIB] Book metadata source

2012-10-26 Thread Godmar Back
If it's only in the hundreds, why not just look them up in WorldCat via
their basic search API and pull the ISBNs from the xISBN service? That's
quickly scripted.

 - Godmar
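For illustration, the core of such a script might look like the following sketch. The xISBN endpoint shape and the JSON field names are from memory and should be verified against OCLC's xID documentation:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

XISBN_BASE = "http://xisbn.worldcat.org/webservices/xid/isbn/"

def xisbn_url(isbn, method="getEditions"):
    # Build the xISBN request URL (endpoint shape as documented circa 2012).
    return "%s%s?method=%s&format=json" % (XISBN_BASE, quote(isbn), method)

def related_isbns(isbn):
    """Return ISBNs of related editions via one network round-trip.

    The "list"/"isbn" keys reflect my recollection of the JSON response
    and are an assumption; check against a live response.
    """
    with urlopen(xisbn_url(isbn)) as resp:
        data = json.load(resp)
    return [entry["isbn"][0] for entry in data.get("list", [])]
```

A few hundred titles means a few hundred round-trips, which is easily done in one batch run.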

On Thu, Oct 25, 2012 at 3:05 PM, Cab Vinton bibli...@gmail.com wrote:

 I have a list of several hundred book titles & corresponding authors,
 comprising our State Library's book group titles, & am looking for
 ways of putting these titles online in a way that would be useful to
 librarians & patrons. Something along the lines of a LibraryThing
 collection or Amazon wishlist.

 Without ISBNs, however, the process could be very labor-intensive.

 Any suggestions for how we could handle this as part of a batch process?

 I realize that different manifestations of the same work will have
 different ISBNs, so we'd be seeking any work in print format, ideally
 the most commonly held.

 The only thought I've had is to do a Z39.50 search using the author &
 title Bib-1 attributes, e.g. @and @attr 1=4 mansfield @attr 1=1003
 austen.

 Thanks for your thoughts,

 Cab Vinton, Director
 Sanbornton Public Library
 Sanbornton, NH



Re: [CODE4LIB] Q: Discovery products and authentication (esp Summon)

2012-10-24 Thread Godmar Back
On Wed, Oct 24, 2012 at 12:16 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 Looking at the major 'discovery' products, Summon, Primo, EDS

 ...all three will provide some results to un-authenticated users (the
 general public), but have some portions of the corpus that are restricted
 and won't show up in your results unless you have an authenticated user
 affiliated with customer's organization.


I brought this issue up on the Summon clients mailing list a few weeks ago.

My impression from the resulting reaction was that people are not overly
concerned about it, because

a) most queries come from on-campus
b) the only results missing are those that come from re-published A&I
databases (which don't allow unauthenticated access), which is a minority
of content when compared to what is indexed by Summon itself
c) there's an option "Use Off Campus Sign In to access full text and more
content" users can use to avoid the problem.

Personally, I think this option is little known, and insufficiently
presented to the user ("more content").

The key problem is that as libraries are increasingly offering their
discovery systems as OPAC replacements, users accustomed to the conventions
of OPACs do not expect this difference in behavior. OPACs generally
show the same results independent of the user's authentication status, and
do not require authentication just to search.

 - Godmar


Re: [CODE4LIB] Q: Discovery products and authentication (esp Summon)

2012-10-24 Thread Godmar Back
On Wed, Oct 24, 2012 at 1:54 PM, Mark Mounts mark.mou...@dartmouth.edu wrote:

 We have Summon at Dartmouth College. Authentication is IP based so with a
 Dartmouth IP address the user will see all our licensed content.

 There is also the option to see all the content Summon has beyond what we
 license by selecting the option Add results beyond your library's
 collection


That is, according to my understanding, not what Jonathan is talking about.

You can select "Add results beyond your library's collection" while being
unauthenticated/off-campus, but this still won't show you the same results.

The results that are never displayed to unauthenticated users are those
Summon republishes from A&I databases.

"Add results beyond your library's collection" just adds (public) results
from the holdings of other libraries; it doesn't add A&I results.

 - Godmar


Re: [CODE4LIB] Q.: software for vendor title list processing

2012-10-17 Thread Godmar Back
Thanks to everyone who replied to my question.

From a brief examination, if I understand it correctly, KBART and ONIX
create normative standards for how holdings data should be represented,
which vendors (increasingly) follow.

This leads to three follow-up questions.

First, is there software to translate/normalize existing vendor lists from
vendors that have not yet adopted either of these standards into these
formats? I'm thinking of a collection of adapters or converters, perhaps.
Each would likely require only a small effort, but there would be benefits
from sharing development and maintenance.

Second, if holdings lists were provided in, or converted to, for instance,
the KBART format, what software understands these formats to further
process them? In other words, is there an immediate bang for the buck in
adopting these standards?

Third, unsurprisingly, these efforts arose in the management of serials,
because holdings there change frequently depending on purchase agreements,
etc. It is my understanding that eBooks are now posing similar collection
management challenges. Are there separate normative efforts for eBooks or
is it believed that efforts such as KBART/ONIX can encompass eBooks as well?
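To make the second question concrete: once rows are normalized, emitting KBART's tab-separated layout takes only a few lines. A sketch follows; the column list is a from-memory subset of KBART's recommended headings and should be checked against the recommended practice:

```python
import csv

# A subset of the KBART recommended column headings (from memory).
KBART_FIELDS = [
    "publication_title", "print_identifier", "online_identifier",
    "date_first_issue_online", "date_last_issue_online",
    "title_url", "first_author", "title_id", "coverage_depth",
    "publisher_name",
]

def write_kbart(rows, path):
    """Write normalized holdings rows as a KBART-style TSV file.

    `rows` is an iterable of dicts keyed by KBART field names;
    missing fields are emitted as empty strings.
    """
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=KBART_FIELDS,
                                delimiter="\t", restval="")
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical example row, for illustration only
write_kbart([{"publication_title": "Journal of Examples",
              "print_identifier": "1234-5678",
              "coverage_depth": "fulltext"}], "holdings_kbart.txt")
```

The per-vendor work then reduces to mapping each vendor's columns onto these field names.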

 - Godmar


[CODE4LIB] Q.: software for vendor title list processing

2012-10-16 Thread Godmar Back
Hi,

at our library, there's an emerging need to process title lists from
vendors for various purposes, such as checking that the titles purchased
can be discovered via the discovery system and/or OPAC. It appears that the
formats in which those lists are provided are non-uniform, as is the
process of obtaining them.

For example, one vendor - let's call them Expedition Scrolls - provides
title lists for download as Excel files, which upon closer inspection turn
out to be HTML tables. They are encoded using an odd mixture of CP1250 and
HTML entities. Other vendors use entirely different formats.
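A converter for such a list mostly has to undo the two encodings at once: decode the CP1250 bytes, then let an HTML parser resolve the entities while extracting the table cells. A sketch using only the standard library (the sample row is made up):

```python
from html import unescape
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <tr>'s <td>/<th> cells."""
    def __init__(self):
        super().__init__()  # convert_charrefs=True resolves entities for us
        self.rows, self._row, self._in_cell = [], None, False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")
    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

def parse_vendor_list(raw_bytes):
    # The "Excel" download is really an HTML table in CP1250, sprinkled
    # with HTML entities; normalize both in one pass.
    parser = TableRows()
    parser.feed(raw_bytes.decode("cp1250"))
    return [[unescape(cell).strip() for cell in row] for row in parser.rows]

sample = ("<table><tr><th>Title</th></tr>"
          "<tr><td>Dvo&#345;&aacute;k studies</td></tr></table>").encode("cp1250")
rows = parse_vendor_list(sample)
```

From the resulting rows, writing out a clean CSV or a KBART-style list is straightforward.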

My question is whether there are efforts, software, or anything related to
streamlining the acquisition and processing of vendor title lists in
software systems that aid in the collection development and maintenance
process. Any pointers would be appreciated.

 - Godmar


[CODE4LIB] isoncampus service

2012-06-14 Thread Godmar Back
A number of web applications, both client- and server-side, could benefit
from being able to easily determine whether a user is on or off campus with
respect to accessing resources that use IP-address-based authentication.

For instance, a web site could show/hide a button asking the user to log
in, or a proxied/non-proxied URL could be displayed depending on whether
the user is connecting from within/outside an authorized IP range. This
would reduce or eliminate the need for special proxy setups/unnecessary
proxy use and could improve the user experience.

This is probably a problem for which many ad-hoc solutions exist on
campuses, as well as solutions integrated into vendor-provided systems. It
would be nice, and beneficial in particular to LibX but presumably also to
other software facing this problem, to have a reusable service
implementation/response format that is easily deployable and requires only
minimal effort for setup and maintenance. Maintenance should be as simple
as maintaining a file with the IP-ranges in a directory, like many
libraries already do for their communication with database vendors or
publishers.
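With the file-of-IP-ranges approach, the membership test itself is tiny. A sketch using Python's stdlib ipaddress module (the range shown is just an example, not a statement about any institution's actual allocation):

```python
import ipaddress

def load_ranges(path):
    """Read one CIDR range (or bare IP) per line; '#' starts a comment."""
    nets = []
    with open(path) as fh:
        for line in fh:
            line = line.split("#")[0].strip()
            if line:
                nets.append(ipaddress.ip_network(line, strict=False))
    return nets

def is_on_campus(ip, nets):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in nets)

# Example range, for illustration only
nets = [ipaddress.ip_network("128.173.0.0/16")]
print(is_on_campus("128.173.10.20", nets))  # True
print(is_on_campus("8.8.8.8", nets))        # False
```

Wrapping this in a JSONP-style endpoint would make it usable from client-side code such as LibX.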

My question is what existing ideas/standards/software exist for this
purpose, if any, or what approaches others could share.

I would like to point to a small piece of software I'm sharing, a
PHP-based isoncampus service [1]; a demo is available here [2]. If anyone
has a similar need and is interested in working together on a solution,
this could be a seed around which to start. Besides the easily deployable
PHP implementation, more efficient bindings/implementations for other
languages and/or server/cloud environments could be created (AppEngine
comes to mind.)

 - Godmar

[1] https://github.com/godmar/isoncampus
[2] http://libx.lib.vt.edu/services/isoncampus/isoncampus.php

ps: as a side-note, OCLC's OpenURL registry used to include IP-ranges as
they were known to OCLC; this was at some point removed due to privacy
concerns. I do note, however, that in general the ownership of IP ranges is
public information, as are CIDR ranges, both of which are easily accessible
via web services provided by arin.net or by the regional registries. Though
mapping from an IP address to its owner is not the same as listing the IP
ranges associated with an organization (many organizations own multiple
discontiguous CIDR ranges), I note that some of this information is also
public via the BGP-advertised IP prefixes for an institution's (main) AS.
In any event, no one would be forced to run this service if they have
privacy concerns.


Re: [CODE4LIB] WebOPAC/III Z39.50 PHP Query/PHPYAZ

2012-05-10 Thread Godmar Back
Scraping III systems has got to be one of the most frequently repeated
tasks in the history of coding librarianship.

Majax2 ([1,2]) is one such service, though (as of right now) it doesn't
support search by Call Number.
Here's an example ISBN search:
http://libx.lib.vt.edu/services/majax2/isbn/0747591059?opacbase=http://catalog.library.miami.edu/search

Since you have Summon, you could use their API. An example is here [3, 4].

 - Godmar

[1] http://libx.lib.vt.edu/services/majax2/
[2] http://code.google.com/p/majax2/
[3] http://libx.lib.vt.edu/services/summon/test.php
[4] http://libx.lib.vt.edu/services/summon/

On Wed, May 9, 2012 at 11:27 AM, Madrigal, Juan A j.madrig...@miami.edu wrote:

 Hi,

 I'm looking for a way to send a Call Number to WebOPAC via a query so that
 I can return data (title, author, etc…) for a specific book in the catalog
 preferably in JSON or XML (I'll even take text at this point).
 I'm thinking that one way  to accomplish this is via Z39.50 and send a
 query to the backend that powers WebOPAC

 Has anyone done something similar to this?

 PHP YAZ (https://www.indexdata.com/phpyaz) looks promising, but I'd
 appreciate any guidance.

 Thanks,

 Juan Madrigal

 Web Developer
 Web and Emerging Technologies
 University of Miami
 Richter Library



Re: [CODE4LIB] Anyone using node.js?

2012-05-09 Thread Godmar Back
On Tue, May 8, 2012 at 11:26 PM, Ed Summers e...@pobox.com wrote:


 For both these apps the socket.io library for NodeJS provided a really
 nice abstraction for streaming data from the server to the client
 using a variety of mechanisms: web sockets, flash socket, long
 polling, JSONP polling, etc. NodeJS' event driven programming model
 made it easy to listen to the Twitter stream, or the ~30 IRC channels,
 while simultaneously holding open socket connections to browsers to
 push updates to--all from within one process. Doing this sort of thing
 in a more typical web application stack like Apache or Tomcat can get
 very expensive where each client connection is a new thread or
 process--which can lead to lots of memory being used.


We've also been using socket.io for our cloudbrowser project, with great
success. The only drawback is that websockets don't (yet) support
compression, but that's not node.js's fault. Another limitation: you can't
easily migrate open socket.io connections across processes (yet). FWIW,
since you mention Rackspace - the lead student on the cloudbrowser project
has now accepted a job at Rackspace (having turned down M$), in part
because he finds their technology/environment more exciting.

I need to dampen the enthusiasm about memory use a bit. It's true that
you're saving the memory for additional threads etc., but - depending on
your application - you're also paying for that, because V8 still lacks some
sharing opportunities that other environments have. For instance, if you
run 25 Apache instances with, say, mod_whatever, they'll all share the code
via a shared .so file. In Java/Tomcat, the JVM exploits similar sharing
opportunities under the hood.

V8/node.js, as of now, does not. This means that if you need to load
libraries such as jQuery n times, you're paying a substantial price (we
found on the order of 1-2MB per instance), because V8 will not do any code
sharing under the hood. That said, whether you need to load it multiple
times depends on your application - but that's another subtle and
error-prone issue.


 If you've done any JavaScript programming in the browser, it will seem
 familiar, because of the extensive use of callbacks. This can take
 some getting used to, but it can be a real win in some cases,
 especially in applications that are more I/O bound than CPU bound.
 Ryan Dahl (the creator of NodeJS) gave a presentation [4] to a PHP
 group last year which does a really nice job of describing how NodeJS
 is different, and why it might be useful for you. If you are new to
 event driven programming I wouldn't underestimate how much time you
 might spend feeling like you are turning your brain inside out.


The complications arising from event-based programming are an extensively
written-about topic of research; one available approach is the use of
compilers that provide a linear syntax for asynchronous calls. The TAME
system, which originally arose from research at MIT, is one such example.
Originally for C++, there's now a version for JavaScript available:
http://tamejs.org/  Though I haven't tried it myself, I'm eager to, and
would also like to know if someone else has. The tamejs.org site provides
excellent reading on why/how you'd want to do this.

 - Godmar


Re: [CODE4LIB] Anyone using node.js?

2012-05-08 Thread Godmar Back
On Tue, May 8, 2012 at 10:17 AM, Ethan Gruber ewg4x...@gmail.com wrote:

 Thanks.  I have been working on a system that allows editing of RDF in web
 forms, creating linked data connections in the background, publishing to
 eXist and Solr for dissemination, and will eventually integrate operation
 with an RDF triplestore/SPARQL, all with Tomcat apps.  I'm not sure it is
 possible to create, manage, and deliver our content with node.js, but I was
 told by the project manager that Apache, Java, and Tomcat were showing
 signs of age.  I'm not so sure about this considering the prevalence of
 Tomcat apps both in libraries and industry.  I happen to be very fond of
 Solr, and it seems very risky to start over in node.js, especially since I
 can't be certain the end product will succeed.  I prefer to err on the side
 of stability.

 If anyone has other thoughts about the future of Tomcat applications in the
 library, or more broadly cultural heritage informatics, feel free to jump
 in.  Our data is exclusively XML, so LAMP/Rails aren't really options.


We've used node.js (but not Express, their web app framework) to build our
own experimental AJAX framework (http://cloudbrowser.cs.vt.edu/ ). We also
have extensive experience with Tomcat-based systems.

Given the wide and increasing use of node.js, I'm optimistic that it
should be stable and reliable enough for your needs; let me highlight a few
points you may want to consider.

a) You're programming in JavaScript/CoffeeScript, a higher-level language
than Java. My students are vastly more productive in it than in Java. The
use of CoffeeScript and require still allows for maintainable code.

b) node.js is a single-threaded environment. This reduces the potential
for some race conditions, but requires an asynchronous programming style.
If you've done client-side AJAX, you'll find it familiar; otherwise, you'll
need to adapt. It also introduces new potential for race conditions.

c) Scalability. Each node.js instance runs on a single core; modules exist
for clustering on a single machine. I don't know of, and don't believe
there is, session-state replication as well supported as Tomcat's. On the
other hand, Tomcat can be a setup nightmare (in my experience).

d) Supporting libraries. We've found the surrounding infrastructure
excellent. A large community is developing around it
(http://search.npmjs.org/). The cool thing is that many client-side
libraries work or are easily ported (e.g. moment.js).

e) Doing XML in JavaScript. Though JavaScript as a language was designed to
be embedded in web documents, processing XML in JavaScript can be almost as
awkward as in Java. JSON is clearly preferred and integrates very naturally.

 - Godmar


Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

2012-03-12 Thread Godmar Back
On Mon, Mar 12, 2012 at 3:38 AM, Ed Summers e...@pobox.com wrote:

 On Fri, Mar 9, 2012 at 12:12 PM, Godmar Back god...@gmail.com wrote:
  Here's my hand ||*(  [1].

 ||*)

 I'm sorry that I was so unhelpful w/ the patches welcome message on
 your docfix. You're right, it was antagonistic of me to suggest you
 send a patch for something so simple. Plus, it wasn't even accurate,
 because I actually wanted a pull request :-)


Here's a make-up pull request especially made for you :-)

https://github.com/edsu/pymarc/pull/25

 - Godmar


Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

2012-03-09 Thread Godmar Back
On Thu, Mar 8, 2012 at 3:53 PM, Mark A. Matienzo m...@matienzo.org wrote:

 On Thu, Mar 8, 2012 at 3:32 PM, Godmar Back god...@gmail.com wrote:

  One side comment here; while smart handling/automatic detection of
  encodings would be a nice feature to have, it would help if pymarc could
  operate in an 'agnostic', or 'raw' mode where it would simply preserve
 the
  encoding that's there after a record has been read when writing the
 record.
 
  [ Right now, pymarc does not have such a mode - if leader[9] == 'a', the
  data is unconditionally utf8 encoded on output as per mbklein's patch. ]

 Please feel free to write a patch and submit a pull request if you're
 able to contribute code to do this.


Mark, while I would be able to contribute code to pymarc, I probably won't
(unless my collaborators' needs with respect to pymarc become urgent).

I've been contributing to open source for over 15 years, my first major
contribution having been the ext2fs filesystem code in the FreeBSD kernel (
http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/filesystems-linux.html
),
and I'm a bit confused by how the spirit in the community has changed. The
phrase "patches welcome" used to be reserved for when there was a feature
request somebody wanted, but you (the owner/maintainer of the software)
didn't have the time or considered the problem unimportant.

Back then, it used to be that all suggestions were welcome. For instance,
if a user pointed out a typo, you'd fix it. Similarly, if a user or fellow
developer pointed out a potential design flaw, you'd understand that you
don't ask for patches, but that you go back to the drawing board and think
about your software's design. In pymarc's case, what's needed is not more
code (it already has a moderately confusing set of almost a dozen switches
for reading/writing), but a requirements analysis where you think about the
use cases you want to support - for instance, whether you want to support
reading/writing real-world records in batches (without touching them) even
if they have flaws, and/or whether you insist on always interpreting a
record's data in terms of encoding. That's something occasional
contributors cannot do; it requires work by the core team, in discussion
with frequent users. (I would have liked to take this discussion to a
pymarc-users list, but didn't find any.)

 - Godmar


Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

2012-03-09 Thread Godmar Back
On Fri, Mar 9, 2012 at 10:37 AM, Michael B. Klein mbkl...@gmail.com wrote:

 The internal discussion then becomes, I have a need, and I've written
 something that satisfies it. I think it could also be useful to others, but
 I'm not going to have time to make major changes or implement features
 others need. Should I open source this or keep it to myself? Does freeing
 my code come with an implicit requirement to maintain and support it?
 Should it?


It used to be that way; at least, it was this way when I grew up in open
source (in the 90s, before Eric Raymond invented the term). And it makes
sense for successful projects that have at least a moderate number of
users. Just dumping your code on GitHub helps very few people.


 I'd vote open source just about every time. If someone sees the need and
 has the time to do a functional/requirements analysis and develop a core
 team around pymarc, more power to them. The code that's already there will
 give them a head start. Or they can start from scratch.

 Until then, it will remain a fork-patch-and-pull, community-supported
 project.


It's not just an agreement on design goals the core team must reach; it's
also the issue of maintaining a record (in email discussions/posts and in
the developers' minds) of what issues arose, what legacy decisions were
made, and where backwards compatibility is required. That's something
maintainers do; it enables them to reason about future design decisions.
It takes people who feel a sense of ownership and mental investment. Sure,
I could throw in a flag 'dont_utf8_encode' to make the code work for my
case. But it wouldn't improve the software. (In pymarc's case, I'd also
recommend a discussion about data structures. For instance, what should the
type of the elements of the subfield array be that's passed to a Field
constructor? 8-bit strings or unicode objects? The thread you link to shows
ambiguity here.)
Staying with fork-patch-and-pull may help individual people meet their
individual needs, but it can prevent widespread adoption - and it creates
frustration for users who may lack the expertise to track down encoding
errors, or who are even unable to understand where the code they're using
lives on their machine. Once a piece of software has reached the stage
where it's distributed as a package (which pymarc, I believe, is), the
distributors have taken on a piece of responsibility. Relatedly, being
unwilling to fix even documentation typos unless someone clones the
repository and delivers a pull request (on a silver platter?) seems unusual
to me - but perhaps I'm just too old and culturally out of tune with
today's open source movement. (I'm not being ironic here; maybe there has
been a shift and I should just get with it.)

 - Godmar


Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

2012-03-09 Thread Godmar Back
On Fri, Mar 9, 2012 at 11:48 AM, Jon Gorman jonathan.gor...@gmail.com wrote:


 Can't we all just shake hands virtually or something?


Here's my hand ||*(  [1].

I overreacted, for which I'm sorry. (Also, I didn't see the entire GitHub
conversation until I visited the website just now; the GitHub email
notification seems selective and sent me only Ed's replies (?) in my
mailbox.)

 - Godmar

[1] http://www.kadifeli.com/fedon/smiley.htm


[CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

2012-03-08 Thread Godmar Back
Hi,

a few days ago, I showed pymarc to a group of technical librarians to
demonstrate how easily certain tasks can be scripted/automated.

Unfortunately, it blew up at me when I tried to write a record:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9:
ordinal not in range(128)

Investigation revealed this culprit:

=LDR  00916nam a2200241I  4500
=001  ocm10685946
=005  19880203211447.0
=007  cr\bn||abp
=007  cr\bn||cda
=008  840503s1939gw00010\ger\d
=040  \\$aMBB$cMBB$dCRL
=049  \\$aCRLL
=100  10$aEsser, Hermann,$d1900-
=245  14$aDie jE8udischer Weltpest ;$bjudendE1ammerung auf dem
Erdball,$cvon Hermann Esser.
=260  0\$aME8unchen,$bZentralverlag der N S D A P., F. Eher ahchf.,$c1939.
=300  \\$a243 [1] p.$c23 cm.
=533  \\$aAlso available as electronic reproduction.$bChicago :$cCenter for
Research Libraries,$d[2009]
=650  \0$aJewish question.
=700  12$aBierbrauer, Johann Jacob,$d1705-1760?
=710  2\$aCenter for Research Libraries (U.S.)
=856  41$uhttp://dds.crl.edu/CRLdelivery.asp?tid=10538$zOnline version
=907  \\$a.b28931622$b08-30-10$c08-30-10
=998  \\$awww$b08-30-10$cm$dz$e-$fger$ggw $h4$i0

The leader[9] field is set to 'a', so the record should contain
UTF8-encoded Unicode [1], but E8 75 in the 245$a appears to be ANSEL where
'E8' denotes the Umlaut preceding the lowercase 'u' (0x75). [2]

To me, this record looks misencoded... am I correct here? There are
thousands of such records in the data set I'm dealing with, which was
obtained using the 'Data Exchange' feature of III's Millennium system.

My question is how others, especially pymarc users dealing with III
records, deal with this issue or whatever other
experiences/hints/practices/kludges exist in this area.

Thanks.

 - Godmar

[1] http://www.loc.gov/marc/bibliographic/bdleader.html
[2] http://lcweb2.loc.gov/diglib/codetables/45.html


Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

2012-03-08 Thread Godmar Back
On Thu, Mar 8, 2012 at 1:46 PM, Terray, James james.ter...@yale.edu wrote:

 Hi Godmar,

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9:
 ordinal not in range(128)

 Having seen my fair share of these kinds of encoding errors in Python, I
 can speculate (without seeing the pymarc source code, so please don't hold
 me to this) that it's the Python code that's not set up to handle the UTF-8
 strings from your data source. In fact, the error indicates it's using the
 default 'ascii' codec rather than 'utf-8'. If it said 'utf-8' codec can't
 decode..., then I'd suspect a problem with the data.

 If you were to send the full traceback (all the gobbledy-gook that Python
 spews when it encounters an error) and the version of pymarc you're using
 to the program's author(s), they may be able to help you out further.


My question is less about the Python error, which I understand, than about
the MARC record causing the error, and about how others deal with this
issue (if it's a common issue, which I do not know).

But, here's the long story from pymarc's perspective.

The record has leader[9] == 'a', but really, truly contains ANSEL-encoded
data.  When reading the record with a MARCReader(to_unicode = False)
instance, the record reads ok since no decoding is attempted, but attempts
at writing the record fail with the above error since pymarc attempts to
utf8 encode the ANSEL-encoded string which contains non-ascii chars such as
0xe8 (the ANSEL Umlaut prefix). It does so because leader[9] == 'a' (see
[1]).

When reading the record with a MARCReader(to_unicode=True) instance, it'll
throw an exception during marc_decode when trying to utf8-decode the
ANSEL-encoded string. Rightly so.

I don't blame pymarc for this behavior; to me, the record looks wrong.

 - Godmar

(ps: that said, what pymarc does fails in different circumstances - from
what I can see, pymarc shouldn't assume that it's OK to utf8-encode the
field data if leader[9] is 'a'.  For instance, this would double-encode
correctly encoded MARC/Unicode records that were read with a
MARCReader(to_unicode=False) instance. But that's a separate issue that is
not my immediate concern. pymarc should probably remember if a record needs
or does not need encoding when writing it, rather than consulting the
leader[9] field.)


[1]
https://github.com/mbklein/pymarc/commit/ff312861096ecaa527d210836dbef904c24baee6


Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

2012-03-08 Thread Godmar Back
On Thu, Mar 8, 2012 at 3:18 PM, Ed Summers e...@pobox.com wrote:

 Hi Terry,

 On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry
 terry.re...@oregonstate.edu wrote:
  This is one of the reasons you really can't trust the information found
 in position 9.  This is one of the reasons why when I wrote MarcEdit, I
 utilize a mixed process when working with data and determining characterset
 -- a process that reads this byte and takes the information under
 advisement, but in the end treats it more as a suggestion and one part of a
 larger heuristic analysis of the record data to determine whether the
 information is in UTF8 or not.  Fortunately, determining if a set of data
 is in UTF8 or something else, is a fairly easy process.  Determining the
 something else is much more difficult, but generally not necessary.

 Can you describe in a bit more detail how MARCEdit sniffs the record
 to determine the encoding? This has come up enough times w/ pymarc to
 make it worth implementing.


One side comment here; while smart handling/automatic detection of
encodings would be a nice feature to have, it would help if pymarc could
operate in an 'agnostic', or 'raw' mode where it would simply preserve the
encoding that's there after a record has been read when writing the record.

[ Right now, pymarc does not have such a mode - if leader[9] == 'a', the
data is unconditionally utf8 encoded on output as per mbklein's patch. ]

 - Godmar


Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-27 Thread Godmar Back
On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote:

 On 26 Feb 2012, at 14:42, Godmar Back wrote:

  May I ask a side question and make a side observation regarding the
  harvesting of full text of the object to which a OAI-PMH record refers?
 
  In general, is the idea to use the dc:source/text() element, treat it
 as
  a URL, and then expect to find the object there (provided that there was
 a
  suitable dc:type and dc:format element)?
 
 I think dc:identifier is usually used to provide a URL for the item being
 described. The examples at
 http://www.openarchives.org/OAI/openarchivesprotocol.html#dublincore
 follow this, and the UK E-Thesis schema (
 http://naca.central.cranfield.ac.uk/ethos-oai/2.0/oai-uketd.xml) does as
 well.


Thanks. FWIW, the identifier contains the same URL as the source field
in my example; but your interpretation of the identifier matches that
found in the OAI-PMH spec at
http://www.openarchives.org/OAI/openarchivesprotocol.html#UniqueIdentifier
where it also points out that it may not necessarily be a URL, could be any
URN or even a DOI as long as it relates the metadata to the underlying item.


 This issue is certainly not unique to VT - we've come across this as part
 of our project.


I note that this means that providing the service point URL for the ETD
OAI-PMH server is not sufficient to facilitate full-text
harvesting/indexing by a provider such as Summon. (And sure enough, they've
indexed only the metadata.) They would have to/will have to employ
additional effort.

Re: your points about the right to full-text index.

If indeed you're right that full-text indexing is a fair use (is it? Eric
Hellman seems to indicate so:
http://go-to-hellman.blogspot.com/2010/02/copyright-safe-full-text-indexing-of.html
as long as the technical definition of making a copy is met) - if that's
indeed so, then of course the intentions of the author don't matter, at
least in the US legal system. Otherwise, my point would have been that I'd
like to see the signed ETD agreement forms extended to explicitly include
the author's permission for full-text indexing.

 - Godmar


Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-27 Thread Godmar Back
On Mon, Feb 27, 2012 at 8:31 AM, Diane Hillmann metadata.ma...@gmail.comwrote:

 On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote:

 
  This issue is certainly not unique to VT - we've come across this as part
  of our project. While the OAI-PMH record may point at the PDF, it can
 also
  point to a intermediary page. This seems to be standard practice in some
  instances - I think because there is a desire, or even requirement, that
 a
  user should see the intermediary page (which may contain rights
 information
  etc.) before viewing the full-text item. There may also be an issue where
  multiple files exist for the same item - maybe several data files and a
 pdf
  of the thesis attached to the same metadata record - as the metadata via
  OAI-PMH may not describe each asset.
 
 
 This has been an issue since the early days of OAI-PMH, and many large
 providers provide such intermediate pages (arxiv.org, for instance). The
 other issue driving providers towards intermediate pages is that it allows
 them to continue to derive statistics from usage of their materials, which
 direct access URIs and multiple web caches don't.  For providers dependent
 on external funding, this is a biggie.


Why do you place direct access URIs and multiple web caches into the same
category? I follow your argument re: usage statistics for web caches, but
as long as the item remains hosted in the repository, direct access URIs
should still be counted (provided proper cache-control headers are sent).
Perhaps it would require server-side statistics rather than client-based GA.

Also, it seems to me that, except for Google, full-text indexing engines
don't necessarily want to become providers of cached copies (certainly the
commercially provided discovery systems currently don't, AFAIK).

 - Godmar


Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-26 Thread Godmar Back
May I ask a side question and make a side observation regarding the
harvesting of full text of the object to which a OAI-PMH record refers?

In general, is the idea to use the dc:source/text() element, treat it as
a URL, and then expect to find the object there (provided that there was a
suitable dc:type and dc:format element)?

Example: http://scholar.lib.vt.edu/theses/OAI/cgi-bin/index.pl allows the
harvesting of ETD metadata.  Yet, its metadata reads:

<ListRecords>
   ...
   <metadata>
     <dc>
       <type>text</type>
       <format>application/pdf</format>
       <source>http://scholar.lib.vt.edu/theses/available/etd-3345131939761081/</source>



When one visits
http://scholar.lib.vt.edu/theses/available/etd-3345131939761081/ however
there is no 'text' document of type 'application/pdf' - rather, it's an
HTML title page that embeds links to one or more PDF documents, such as
http://scholar.lib.vt.edu/theses/available/etd-3345131939761081/unrestricted/Walker_1.pdf
to Walker_5.pdf.

Is VT's ETD OAI implementation deficient, or is OAI-PMH simply not set up
to allow the harvesting of full-text without what would basically amount to
crawling the ETD title page, or other repository-specific mechanisms?

On a related note, regarding rights. As a faculty member, I regularly sign
ETD approval forms.  At Tech, students have three options to choose from:
(a) open and immediate access, (b) restricted to VT for 1 year, (c)
withhold access completely for 1 year for patent/security purposes.  The
current form does not allow student authors to address whether the
full-text of their dissertation may be harvested for the purposes of
full-text indexing in such indexes as Google or Summon, nor does it allow
them to restrict where copies are served from.  Similarly, the dc:rights
section in the OAI-PMH records addresses copyright only.  In practice, Google
crawls, indexes, and serves full-text copies of our dissertations.

 - Godmar


Re: [CODE4LIB] Voting for c4l 2012 talks ends today

2011-12-09 Thread Godmar Back
This site shows:

Ruby (Rack) application could not be started
On Fri, Dec 9, 2011 at 11:50 AM, Anjanette Young
youn...@u.washington.eduwrote:

 Get your votes in before 5pm (PST)

 http://vote.code4lib.org/election/21  -- You will need your
 code4lib.orglogin in order to vote. If you do not have one you can
 create one at
 http://code4lib.org/



Re: [CODE4LIB] jQuery Ajax request to update a PHP variable

2011-12-07 Thread Godmar Back
On Tue, Dec 6, 2011 at 3:40 PM, Doran, Michael D do...@uta.edu wrote:


  Current trends certainly go in the opposite direction, look at jQuery
  Mobile.

 I agree that jQuery Mobile is very popular now.  However, that in no way
 negates the caution.  One could consider it as a tragedy of the commons
 in which a user's iPhone battery is the shared resource.  Why should I as a
 developer (rationally consulting my own self-interest) conserve battery
 power that doesn't belong to me, just so some other developer's app can use
 that resource?  I'm just playing the devil's advocate here. ;-)


You're taking it as a given that the use of JavaScript on a mobile device is
significantly less energy-efficient than an approach that would exercise
only the HTML parsing path. Be careful here; intuition can be misleading.
Devices cannot send HTML straight to their displays: it takes energy to
parse it, and energy to render it. Time is roughly proportional to energy.
Where do you think most time/energy is spent? (Page-provided) JavaScript
execution, HTML parsing, or page layout/rendering?

Based on the information I have available to me (I'd appreciate pointers to
other studies), JS execution does not dominate - it ranks last, behind page
layout and rendering [1], even for sites that are JS-heavy, such as webmail
sites. Interestingly, a large part of that is evaluating CSS selectors.

On a related note, let me point out that there are many ways to change the
DOM on the client. Client-side templating frameworks such as knockout.js or
jQuery tmpl produce HTML (which then must be parsed), but modern AJAX
frameworks such as ZK don't produce any HTML at all, skipping parsing
altogether.

I meant to add another reason why at this point teaching newbies an AJAX
style that relies on HTML-returning entry points is a really bad idea, and
that is the move from read-only applications (like Nate's) to applications
that actually update state on the server. In this case, multiple parts of
the client page (perhaps a label here, a link there) need to be updated.
Expressing this in HTML is cumbersome, to say the least. (As an aside, I
note that AJAX frameworks such as ZK, which pursued the HTML approach in
their first iterations, have moved away from it. Compare the client/server
traffic on a ZK 3.x application to the one in a ZK 5.x app to see this.)

For those interested in how to use one of possible client-side approaches
I'm suggesting, I prototyped Nate's application using only client-side
templating: http://libx.lib.vt.edu/services/popsubjects/cs/ It uses
knockout.js's data binding facilities as well as (due to qTip 1.0's design)
the jQuery tmpl engine. Read the (small, self-contained) source to learn
about the server-side entry points. (I should point out that in this case,
the need for the book cover ISBNs to be retrieved remotely is somewhat
contrived; they should probably be sent along with the page in the first
place.) A side effect of this JSON-oriented design is that it results in 2
nice JSON-P web services that can be embedded/used in other
pages/applications.

 - Godmar

[1]
http://www.eecs.berkeley.edu/~lmeyerov/projects/pbrowser/pubfiles/login.pdf


Re: [CODE4LIB] jQuery Ajax request to update a PHP variable

2011-12-06 Thread Godmar Back
On Tue, Dec 6, 2011 at 8:38 AM, Erik Hatcher erikhatc...@mac.com wrote:

 I'm with jrock on this one.   But maybe I'm a luddite that didn't get the
 memo either (but I am credited for being one of the instrumental folks in
 the Ajax world, heh - in one or more of the Ajax books out there, us old
 timers called it remote scripting).


On the in-jest rhetorical front, I'm wondering if referring to oneself as
oldtimer helps in defending against insinuations that opposing
technological change makes one a defender of the old ;-)

But:


 What I hate hate hate about seeing JSON being returned from a server for
 the browser to generate the view is stuff like:

    string = "<div>" + some_data_from_JSON + "</div>";

 That embodies everything that is wrong about Ajax + JSON.


That's exactly why you use new libraries such as knockout.js, to avoid just
that. Client-side template engines with automatic data-bindings.
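To make the contrast concrete (Python stands in for the JavaScript here; the function names are mine, and knockout.js does the binding declaratively in HTML rather than via calls like these):

```python
from string import Template
from html import escape

def bad(some_data):
    # The objectionable pattern: string concatenation - fragile, and
    # it is easy to forget escaping the interpolated data.
    return '<div>' + some_data + '</div>'

def good(some_data):
    # A template with an escaped data binding: markup stays in one
    # place, data is substituted (and escaped) separately.
    return Template('<div>$content</div>').substitute(content=escape(some_data))
```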

Alternatively, AJAX frameworks use JSON and then interpret the returned
objects as code. Take a look at the client/server traffic produced by ZK,
for instance.


 As Jonathan said, the server is already generating dynamic HTML... why
 have it return


It isn't. There is no server "already generating" anything; it's a new app
Nate is writing (unless you count his work of the past two days). The
dynamic HTML he's generating is heavily tailored to his JS. There's
extremely tight coupling, which now exists across multiple files written in
multiple languages. Simply avoidable bad software engineering. That's not
even making the computational cost argument that avoiding template
processing on the server is cheaper. And with respect to Jonathan's
argument of degradation, a degraded version of his app (presumably) would
use a <table> - or something like that; it'd look nothing like what he
showed us yesterday.

Heh - the proof of the pudding is in the eating. Why don't we create 2
versions of Nate's app, one with mixed server/client - like the one he's
completing now, and I create the client-side based one, and then we compare
side by side?  I'll work with Nate on that.

  - Godmar

[ I hope it's ok to snip off the rest of the email trail in my reply. ]


Re: [CODE4LIB] jQuery Ajax request to update a PHP variable

2011-12-06 Thread Godmar Back
On Tue, Dec 6, 2011 at 11:18 AM, Nate Hill nathanielh...@gmail.com wrote:

 I attached the app as it stands now.  There's something wrong w/ the regex
 matching in catscrape.php so only some of the images are coming through.


No, it's not the regexp. You're simply scraping Syndetics links, without
checking whether Syndetics actually has an image for those ISBNs. Searches
where the first four hits have jackets display images; the others don't.


 Also: should I be sweating the fact that basically every time someone
 mouses over one of these boxes they are hitting our library catalog with a
 query?  It struck me that this might be unwise.  But I don't know either
 way.


Yes, it's unwise, especially since the results won't change (much).
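Since the results for a given subject change rarely, a short-lived cache between the mouseover and the catalog query would avoid most of those hits. A sketch with hypothetical names; the same idea works equally well in the PHP proxy or in client-side JS:

```python
import time

def make_cached(fetch, ttl_seconds=3600):
    """Wrap a fetch function so repeated calls for the same subject
    within ttl_seconds return the cached result instead of re-querying."""
    cache = {}  # subject -> (timestamp, result)
    def cached_fetch(subject):
        now = time.time()
        if subject in cache and now - cache[subject][0] < ttl_seconds:
            return cache[subject][1]          # fresh enough: reuse
        result = fetch(subject)               # otherwise hit the catalog
        cache[subject] = (now, result)
        return result
    return cached_fetch
```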

 - Godmar


Re: [CODE4LIB] jQuery Ajax request to update a PHP variable

2011-12-06 Thread Godmar Back
On Tue, Dec 6, 2011 at 11:22 AM, Doran, Michael D do...@uta.edu wrote:

  You had earlier asked the question whether to do things client or server
  side - well in this example, the correct answer is to do it client-side.
  (Yours is a read-only application, where none of the advantages of
  server-side processing applies.)

 One thing to take into consideration when weighing the advantages of
 server-side vs. client-side processing, is whether the web app is likely to
 be used on mobile devices.  Douglas Crockford, speaking about the fact that
 JavaScript has become the de facto universal runtime, cautions: "Which I
 think puts even more pressure on getting JavaScript to go fast.
 Particularly as we're now going into mobile. Moore's Law doesn't apply to
 batteries. So how much time we're wasting interpreting stuff really matters
 there. The cycles count." [1]  Personally, I don't know enough to know how
 significant the impact would be.  However, I understand Douglas Crockford
 knows a little something about JavaScript and JSON.


It's certainly true that limited battery energy motivates minimizing
client-side processing, but it does not follow that the answer is
server-side generation of static HTML.

Current trends certainly go in the opposite direction, look at jQuery
Mobile.

 - Godmar


Re: [CODE4LIB] jQuery Ajax request to update a PHP variable

2011-12-06 Thread Godmar Back
On Tue, Dec 6, 2011 at 1:57 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 On 12/6/2011 1:42 PM, Godmar Back wrote:

 Current trends certainly go in the opposite direction, look at jQuery
 Mobile.


 Hmm, JQuery mobile still operates on valid and functional HTML delivered
 by the server. In fact, one of the designs of JQuery mobile is indeed to
 degrade to a non-JS version in feature phones (you know, eg, flip phones
 with a web browser but probably no javascript).  The non-JS version it
 degrades to is the same HTML that was delivered to the browser in either
 way, just not enhanced by JQuery Mobile.


My argument was that current platforms, such as jQuery Mobile, heavily rely
on JavaScript on the very platforms on which Crockford's statement points
out it would be wise to save energy. Look at the jQuery Mobile documentation,
A-grade platforms:
http://jquerymobile.com/demos/1.0/docs/about/platforms.html



If I were writing AJAX requests for an application targetted mainly at
 JQuery Mobile... I'd be likely to still have the server delivery HTML to
 the AJAX request, then have js insert it into the page and trigger JQuery
 Mobile enhancements on it.


I wouldn't. Return JSON and interpret or template the result.

 - Godmar


Re: [CODE4LIB] jQuery Ajax request to update a PHP variable

2011-12-05 Thread Godmar Back
FWIW, I would not send HTML back to the client in an AJAX request - that
style of AJAX fell out of favor years ago.

Send back JSON instead and keep the view logic client-side. Consider using
a library such as knockout.js. Instead of your current (difficult to
maintain) mix of PhP and client-side JavaScript, you'll end up with a
static HTML page, a couple of clean JSON services (for checked-out per
subject, and one for the syndetics ids of the first 4 covers), and clean
HTML templates.

You had earlier asked the question whether to do things client or server
side - well in this example, the correct answer is to do it client-side.
(Yours is a read-only application, where none of the advantages of
server-side processing applies.)
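A sketch of the server side of one such clean JSON service (the field names and row shape are invented for illustration; the real data would come from Nate's .csv):

```python
import json

def checkouts_by_subject(rows):
    """Turn (subject, checkout_count) pairs - e.g. parsed from the
    branch's .csv - into the JSON payload a client-side template binds to."""
    payload = [{'subject': subject, 'checkouts': count}
               for subject, count in rows]
    return json.dumps(payload)
```

The same endpoint is then reusable from other pages, which is part of the appeal of keeping the view logic client-side.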

 - Godmar

On Mon, Dec 5, 2011 at 6:18 PM, Nate Hill nathanielh...@gmail.com wrote:

 Something quite like that, my friend!
 Cheers
 N

 On Mon, Dec 5, 2011 at 3:10 PM, Walker, David dwal...@calstate.edu
 wrote:

  I gotcha.  More information is, indeed, better. ;-)
 
  So, on the PHP side, you just need to grab the term from the  query
  string, like this:
 
   $searchterm = $_GET['query'];
 
  And then in your JavaScript code, you'll send an AJAX request, like:
 
   http://www.natehill.net/vizstuff/catscrape.php?query=Cooking
 
  Is that what you're looking for?
 
  --Dave
 
  -
  David Walker
  Library Web Services Manager
  California State University
 
 
  -Original Message-
  From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
  Nate Hill
  Sent: Monday, December 05, 2011 3:00 PM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] jQuery Ajax request to update a PHP variable
 
  As always, I provided too little information.  Dave, it's much more
  involved than that
 
  I'm trying to make a kind of visual browser of popular materials from one
  of our branches from a .csv file.
 
  In order to display book covers for a series of searches by keyword, I
  query the catalog, scrape out only the syndetics images, and then
 display 4
  of them.  The problem is that I've hardcoded in a search for 'Drawing',
  rather than dynamically pulling the correct term and putting it into the
  catalog query.
 
  Here's the work in process, and I believe it will only work in Chrome
  right now.
  http://www.natehill.net/vizstuff/donerightclasses.php
 
  I may have a solution, Jason's idea got me part way there.  I looked all
  over the place for that little snippet he sent over!
 
  Thanks!
 
 
 
  On Mon, Dec 5, 2011 at 2:44 PM, Walker, David dwal...@calstate.edu
  wrote:
 
And I want to update 'Drawing' to be 'Cooking'  w/ a jQuery hover
effect on the client side then I need to make an Ajax request,
 correct?
  
   What you probably want to do here, Nate, is simply output the PHP
   variable in your HTML response, like this:
  
     <h1 id="foo"><?php echo $searchterm ?></h1>
  
   And then in your JavaScript code, you can manipulate the text through
   the DOM like this:
  
$('#foo').html('Cooking');
  
   --Dave
  
   -
   David Walker
   Library Web Services Manager
   California State University
  
  
   -Original Message-
   From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
   Of Nate Hill
   Sent: Monday, December 05, 2011 2:09 PM
   To: CODE4LIB@LISTSERV.ND.EDU
   Subject: [CODE4LIB] jQuery Ajax request to update a PHP variable
  
   If I have in my PHP script a variable...
  
   $searchterm = 'Drawing';
  
   And I want to update 'Drawing' to be 'Cooking'  w/ a jQuery hover
   effect on the client side then I need to make an Ajax request, correct?
   What I can't figure out is what that is supposed to look like...
   something like...
  
   $.ajax({
     type: "POST",
     url: "myfile.php",
data: ...not sure how to write what goes here to make it
 'Cooking'...
   });
  
   Any ideas?
  
  
   --
   Nate Hill
   nathanielh...@gmail.com
   http://www.natehill.net
  
 
 
 
  --
  Nate Hill
  nathanielh...@gmail.com
  http://www.natehill.net
 



 --
 Nate Hill
 nathanielh...@gmail.com
 http://www.natehill.net



Re: [CODE4LIB] jQuery Ajax request to update a PHP variable

2011-12-05 Thread Godmar Back
On Mon, Dec 5, 2011 at 6:45 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 I still like sending HTML back from my server. I guess I never got the
 message that that was out of style, heh.


I suppose there are always some stalwart defenders of the status quo ;-)

More seriously, I think I'd like to defend my statement.

The purpose of graceful degradation is well-acknowledged - I don't think
no-JS browsers are much of a concern, but web spiders are and so are
probably ADA accessibility requirements, as well as low-bandwidth
environments.

I do not believe, however, that such situations warrant any sharing of HTML
templates. If they do, it means your app is, well, perhaps outdated in that
it doesn't make full use of today's JS features. Certainly Gmail's basic
HTML version for low bandwidth environments shares no HTML templates with
the JS main app. In Nate's case, which is a heavily JS-dependent app (he
uses various jQuery plug-ins to drive his layout, as well as qtip for
tooltips), I find it difficult to see how any degraded environment would
share any HTML with his app.

That said, I'm genuinely interested in what others are thinking/have
experienced.

Also, for expository purposes, I'd love to prototype the client-side for
Nate's app. Then we could compare the mixed PhP server/client-side AJAX
version with the pure JS app I'm suggesting.

 - Godmar


On Mon, Dec 5, 2011 at 6:45 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 I still like sending HTML back from my server. I guess I never got the
 message that that was out of style, heh.

 My server application already has logic for creating HTML from templates,
 and quite possibly already creates this exact same piece of HTML in some
 other place, possibly for use with non-AJAX fallbacks, or some other
 context where that snippet of HTML needs to be rendered. I prefer to re-use
 this logic that's already on the server, rather than have a duplicate HTML
 generating/templating system in the javascript too.  It's working fine for
 me, in my use patterns.

 Now, certainly, if you could eliminate any PHP generation of HTML at all,
 as I think Godmar is suggesting, and basically have a pure Javascript app
 -- that would be another approach that avoids duplication of HTML
 generating logic in both JS and PHP. That sounds fine too. But I'm still
 writing apps that degrade if you have no JS (including for web spiders that
 have no JS, for instance), and have nice REST-ish URLs, etc.   If that's
 not a requirement and you can go all JS, then sure.  But I wouldn't say
 that making apps that use progressive enhancement with regard to JS and
 degrade fine if you don't have is out of style, or if it is, it ought not
 to be!

 Jonathan





Re: [CODE4LIB] Examples of Web Service APIs in Academic Public Libraries

2011-10-08 Thread Godmar Back
On Sat, Oct 8, 2011 at 1:40 PM, Patrick Berry pbe...@gmail.com wrote:

 We're (CSU, Chico) using http://code.google.com/p/googlebooks/ to provide
 easy access to partial and full text books.


Good to hear.

As an aside, we wrote up some background on how to use widgets and
webservices in a 2010 article published in LITA's ITAL magazine:

http://www.lita.org/ala/mgrps/divs/lita/publications/ital/29/2/back.pdf

 - Godmar



 On Sat, Oct 8, 2011 at 10:33 AM, Michel, Jason Paul miche...@muohio.edu
 wrote:

  Hello all,
 
  I'm a lurker on this listserv and am interested in gaining some insight
  into your experiences of utilizing web service APIs in either an academic
  library or public library setting.
 
  I'm writing a book for ALA Editions on the use of Web Service APIs in
  libraries.  Each chapter covers a specific API by delineating the
  technicalities of the API, discussing potential uses of the API in
 library
  settings, and step-by-step tutorials.
 
  I'm already including examples of how my library (Miami University in
  Oxford, Ohio) are utilizing these APIs but would like to give the reader
  more examples from a variety of settings.
 
  APIs covered in the book: Flickr, Vimeo, Google Charts, Twitter, Open
  Library, LibraryThing, Goodreads, OCLC.
 
  So, what are you folks doing with APIs?
 
  Thanks for any insight!
 
  Kind regards,
 
  Jason
 
  --
  Jason Paul Michel
  User Experience Librarian
  Miami University Libraries
  Oxford, Ohio 45044
  twitter:jpmichel
 



Re: [CODE4LIB] ny times best seller api

2011-09-29 Thread Godmar Back
On Wed, Sep 28, 2011 at 5:02 PM, Michael B. Klein mbkl...@gmail.com wrote:


 It's not NYTimes.com's fault; it's the cross-site scripting jerks who made
 the security necessary in the first place.


NYTimes could allow JSONP, but then developers would need to embed their API
key in their web pages, which means the API key would simply be a token used
for statistics, rather than for authentication. It's their choice that they
don't allow that.

Closer to the code4lib community: OCLC and Serials Solutions don't support
JSONP in their webservices, either, even though doing so would allow cool
services and would likely not affect their business models adversely in a
significant way, IMO. We should keep lobbying them to remove these
restrictions, as I've been doing for a while.

 - Godmar


Re: [CODE4LIB] ny times best seller api

2011-09-28 Thread Godmar Back
Are you trying to run this inside a webpage served from a domain other than
nytimes.com?
If so, you'd need to use JSONP, which a cursory examination of their API
documentation reveals they do not support. So, you need to use a proxy.

Here's one:
$ cat hardcover.php
<?
$cb = @$_GET['callback'];

$json = file_get_contents(
'http://api.nytimes.com/svc/books/v2/lists/hardcover-fiction.json?api-key='
);
header('Content-Type: text/javascript');
echo $cb . '(' . $json . ')';

?>

Install it on your webserver, then change your JavaScript code to refer to
it using callback=?.

For instance, if you installed it on
http://libx.lib.vt.edu/services/nytimes/hardcover.php
then you would be using the URL
http://libx.lib.vt.edu/services/nytimes/hardcover.php?callback=?
($.getJSON will replace the ? with a suitably generated function name).
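The wrapping the proxy performs is trivial; restated in Python as a sketch. One caveat of my own (not from the PHP above, which echoes $cb unvalidated): a cautious proxy should whitelist the callback name, since echoing arbitrary input lets callers inject script.

```python
import re

def wrap_jsonp(callback_name, json_text):
    """Wrap upstream JSON verbatim in the caller-supplied callback,
    turning a plain JSON response into a cross-domain-loadable script."""
    # Validation pattern is an assumption on my part, not part of the thread.
    if not re.fullmatch(r'[A-Za-z_$][\w$.]*', callback_name):
        raise ValueError('invalid JSONP callback name')
    return '%s(%s)' % (callback_name, json_text)
```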

 - Godmar

On Wed, Sep 28, 2011 at 3:28 PM, Nate Hill nathanielh...@gmail.com wrote:

 Anybody out there using the NY times best seller API to do stuff on their
 library websites?
 I can't figure out what's wrong with my code here.
 Data is returned as null; I can't seem to parse the response with jQuery.
 Any help would be supercool.
 I removed the API key - my code doesn't actually contain ''.
 Here's the jQuery:

 jQuery(document).ready(function(){
$(function(){
//json request to new york times
$.getJSON('

 http://api.nytimes.com/svc/books/v2/lists/hardcover-fiction.json?api-key=
 ',

function(data) {
//loop through the results with the following
 function
$.each(data.results.book_details, function(i,item){
//turn the title into a variable
var bookTitle = item.title;
$('#container').append('p'+bookTitle+'/p');

});
});
});
 });


 Here's a snippet of the JSON response:

 {
     "status": "OK",
     "copyright": "Copyright (c) 2011 The New York Times Company.  All Rights
 Reserved.",
     "num_results": 35,
     "last_modified": "2011-09-23T12:00:29-04:00",
     "results": [{
         "list_name": "Hardcover Fiction",
         "display_name": "Hardcover Fiction",
         "updated": "WEEKLY",
         "bestsellers_date": "2011-09-17",
         "published_date": "2011-10-02",
         "rank": 1,
         "rank_last_week": 0,
         "weeks_on_list": 1,
         "asterisk": 0,
         "dagger": 0,
         "isbns": [{
             "isbn10": "0399157786",
             "isbn13": "9780399157783"
         }],
         "book_details": [{
             "title": "NEW YORK TO DALLAS",
             "description": "An escaped child molester pursues Lt. Eve
 Dallas; by Nora Roberts, writing pseudonymously.",
             "contributor": "by J. D. Robb",
             "author": "J D Robb",
             "contributor_note": "",
             "price": 27.95,
             "age_group": "",
             "publisher": "Putnam",
             "primary_isbn13": "9780399157783",
             "primary_isbn10": "0399157786"
         }],
         "reviews": [{
             "book_review_link": "",
             "first_chapter_link": "",
             "sunday_review_link": "",
             "article_chapter_link": ""
         }]

 --
 Nate Hill
 nathanielh...@gmail.com
 http://www.natehill.net



Re: [CODE4LIB] internet explorer and pdf files

2011-08-31 Thread Godmar Back
On Wed, Aug 31, 2011 at 8:42 AM, Eric Lease Morgan emor...@nd.edu wrote:

 Eric wrote:

  Unfortunately IE's behavior is weird. The first time someone tries to
 load
  one of these URL nothing happens. When someone tries to load another one,
 it
  loads just fine. When they re-try the first one, it loads. We are banging
  our heads against the wall here at Catholic Pamphlet Central. Networking
  issue? Port issue? IE PDF plug-in? Invalid HTTP headers? On-campus versus
  off-campus issue?

 Thank you for all the replies.

 We're not one hundred percent positive, but we think the issue with IE has
 something to do with headers. As alluded to previously, IE needs/desires
 file name extensions in order to know what to do with incoming files. We are
 serving these PDF documents from Fedora which is sending out a stream, not
 necessarily a file. Apparently this confuses IE. Since Fedora is not really
 designed to be a file server, we will write a piece of intermediary
 software to act as a go between. This isn't really a big deal since all of
 our other implementations of Fedora are expected to work in the same way.
 Wish us luck.


FWIW, this is true for any and all HTTP servers.  Only the client's request
specifies a name (as the path component of the request, e.g.,
/fedora/get/CATHOLLIC-PAMPHLET:1000793/PDF1).

The server's reply does not contain a name at all. It simply specifies the
type and, typically, the length of the returned content. The returned
content itself is just a blob of bytes. Your server says this blob of
bytes is a PDF object (application/pdf), but it doesn't specify the length.
Not specifying the length makes the client's job slightly more
difficult, which is why the HTTP/1.1 specification discourages it; the
client now has to read the stream until the server closes the connection.
It is certainly possible that IE's PDF plug-in is not prepared to deal with
this situation, and I would certainly fix this first.
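A sketch of the header-building side of such a fix (illustrative only; the actual go-between would stream the real PDF bytes from Fedora): compute Content-Length from the byte length of the body, as RFC 2616 recommends.

```python
def make_headers(body: bytes, content_type='application/pdf'):
    """Build explicit response headers for a fully buffered body."""
    return {
        'Content-Type': content_type,
        'Content-Length': str(len(body)),  # length in bytes, not characters
    }
```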

 - Godmar


Re: [CODE4LIB] internet explorer and pdf files

2011-08-29 Thread Godmar Back
Earlier versions of IE were known to sometimes disregard the Content-Type
(which you set correctly to application/pdf) and look at the suffix of the
URL instead. For instance, they would render HTML if you served a .html as
text/plain, etc.

You may try creating URLs that end with .pdf

Separately, you're not sending a Content-Length header:

HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Server: Apache-Coyote/1.1
  Pragma: No-cache
  Cache-Control: no-cache
  Expires: Wed, 31 Dec 1969 19:00:00 EST
  Content-Type: application/pdf
  Date: Mon, 29 Aug 2011 19:47:27 GMT
  Connection: close
Length: unspecified [application/pdf]

which disregards RFC 2616,
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.13

On Mon, Aug 29, 2011 at 3:30 PM, Eric Lease Morgan emor...@nd.edu wrote:

 I need some technical support when it comes to Internet Explorer (IE) and
 PDF files.

 Here at Notre Dame we have deposited a number of PDF files in a Fedora
 repository. Some of these PDF files are available at the following URLs:

  *
 http://fedoraprod.library.nd.edu:8080/fedora/get/CATHOLLIC-PAMPHLET:1000793/PDF1
  *
 http://fedoraprod.library.nd.edu:8080/fedora/get/CATHOLLIC-PAMPHLET:832898/PDF1
  *
 http://fedoraprod.library.nd.edu:8080/fedora/get/CATHOLLIC-PAMPHLET:999332/PDF1
  *
 http://fedoraprod.library.nd.edu:8080/fedora/get/CATHOLLIC-PAMPHLET:832657/PDF1
  *
 http://fedoraprod.library.nd.edu:8080/fedora/get/CATHOLLIC-PAMPHLET:1001919/PDF1
  *
 http://fedoraprod.library.nd.edu:8080/fedora/get/CATHOLLIC-PAMPHLET:832818/PDF1
  *
 http://fedoraprod.library.nd.edu:8080/fedora/get/CATHOLLIC-PAMPHLET:834207/PDF1

 Retrieving the URLs with any browser other than IE works just fine.

 Unfortunately IE's behavior is weird. The first time someone tries to load
 one of these URL nothing happens. When someone tries to load another one, it
 loads just fine. When they re-try the first one, it loads. We are banging
 our heads against the wall here at Catholic Pamphlet Central. Networking
 issue? Port issue? IE PDF plug-in? Invalid HTTP headers? On-campus versus
 off-campus issue?

 Could some of y'all try to load some of the URLs with IE and tell me your
 experience? Other suggestions would be greatly appreciated as well.

 --
 Eric Lease Morgan
 University of Notre Dame

 (574) 631-8604



Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Godmar Back
On Tue, Mar 1, 2011 at 11:14 PM, Roy Tennant roytenn...@gmail.com wrote:
 On Tue, Mar 1, 2011 at 2:14 PM, Godmar Back god...@gmail.com wrote:

Similarly, the date associated with a record can come in a variety of
formats. Some are single-field (20080901), some are abbreviated
(200811), some are separated into year, month, date, etc.  Some
records have a mixture of those.

 In this world of MARC (s/MARC/hurt/) I call that an embarrassment of
 riches. I've spent some bit of time parsing MARC, especially lately,
 and just the fact that Summon provides a normalized date element is
 HUGE.

That's great to hear - but how do I know which elements to use?

For instance, look at the JSON excerpt at
http://api.summon.serialssolutions.com/help/api/search/response/documents

"PublicationDateCentury": [
  "1900"
],
"PublicationDateDecade": [
  "1970"
],
"PublicationDateYear": [
  "1979"
],
"PublicationDate": [
  "1979."
],
"PublicationDate_xml": [
  {
    "day": "01",
    "month": "01",
    "text": "1979.",
    "year": "1979"
  }
],

Which one is the cleaned up date, and in which order shall I be
looking for the date field in the record when some or all of this
information is missing in a particular record?

Andrew responded that, if given, PublicationDate_xml is the
preferred one - but this raises the question of which field in
PublicationDate_xml to use: .text, .day, or .year?  What if some are
missing?
What if PublicationDate_xml is missing - do I then look for
PublicationDate?  Or is PublicationDateYear/Month/Decade preferred over
PublicationDate?  Which fields are derived from which others?

These are the types of questions I'm looking to answer.

 - Godmar


Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Godmar Back
On Wed, Mar 2, 2011 at 11:12 AM, Roy Tennant roytenn...@gmail.com wrote:
 Godmar,
 I'm surprised you're asking this. Most of the questions you want
 answered could be answered by a basic programming construct: an
 if-then-else statement and a simple decision about what you want to
 use in your specific application (for example, do you prefer text
 with the period, or not?). About the only question that such a
 solution wouldn't deal with is which fields are derived from which
 others, which strikes me as superfluous to your application if you
 know a hierarchy of preference. But perhaps I'm missing something
 here.

I'm not asking how to code it, I'm asking for the algorithm I should
use, given the fact that I'm not familiar with the provenance and
status of the data Summon returns (which, I understand, is a mixture
of original, harvested data, and cleaned-up, processed data.)

Can you suggest such an algorithm, given the fact that each of the 8
elements I showed in the example (PublicationDateYear,
PublicationDateDecade, PublicationDate, PublicationDateCentury,
PublicationDate_xml.text, PublicationDate_xml.day,
PublicationDate_xml.month, PublicationDate_xml.year) is optional?  But
wait ... I think I've also seen records where there is a
PublicationDateMonth, and records where some values have arrays of
length > 1.

Can you suggest, or at least outline, such an algorithm?

It would be helpful to know, for instance, if the presence of a
PublicationDate_xml field supplants any other PublicationDate* fields
(does it?)  If a PublicationDate_xml field is absent, which field
would I want to look at next?  Is PublicationDate more reliable than a
combination of PublicationDateYear and PublicationDateMonth (and
perhaps PublicationDateDay if it exists?)?

If the PublicationDate_xml is present, then: should I prefer the .text
option?  What's the significance of that dot? Is it spurious, like the
identifier you mentioned you find in raw MARC records?  If not, what,
if anything, is known about the presence of the other fields?  What if
multiple fields are given in an array?  Is the ordering significant
(e.g., the first one is more trustworthy?)  Or should I sort them based
on a heuristic?  (e.g., if "20100523" and "201005" are given, prefer
the former?)  What if the data is contradictory?

These are the questions I'm seeking answers to; I know that those of
you who have coded your own Summon front-ends must have faced the
same questions when implementing their record displays.

 - Godmar


Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Godmar Back
On Wed, Mar 2, 2011 at 11:36 AM, Walker, David dwal...@calstate.edu wrote:
 Just out of curiosity, is there a Summon (API) developer listserv?  Should 
 there be?

Yes, there is - I'm waiting for my subscription there to be approved.

Like I said at the beginning of this thread, this is only tangentially
a Code4Lib issue, and certainly the details aren't.  But perhaps the
general problem is (?)

 - Godmar


Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Godmar Back
On Wed, Mar 2, 2011 at 11:54 AM, Demian Katz demian.k...@villanova.edu wrote:
 These are the questions I'm seeking answers to; I know that those of
 you who have coded your own Summon front-ends must have faced the
 same questions when implementing their record displays.

 Feel free to refer to VuFind's Summon template for reference if that is 
 helpful:

 https://vufind.svn.sourceforge.net/svnroot/vufind/trunk/web/interface/themes/default/Summon/record.tpl

 Andrew wrote this originally, and I've tweaked it in a few places to address 
 problems as they arose.  I don't claim that this offers the definitive answer 
 to your questions...  but it's working reasonably well for us so far.


Ah, thanks.  As they say, a piece of code speaks a thousand words!

So, to solve the conundrum: only PublicationDate_xml and
PublicationDate are of interest. If the former is given, use it and
print (if available) its .month, .day, and .year fields. Else, if the
latter is given, just print it.
Ignore all other date-related fields, including PublicationDate_xml.text.
If a date field holds more than one value, just use the first one.
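Under those assumptions (field names from the Summon JSON shown earlier; the fallback order mirrors my reading of the VuFind template, not an official rule), the date-picking logic might be sketched as:

```python
def display_date(record):
    """Pick a display date from a Summon record dict.
    Fallback order is an assumption based on the discussion above."""
    xml = record.get("PublicationDate_xml")
    if xml:  # preferred when present; take the first value, ignore .text
        d = xml[0]
        parts = [d.get(k) for k in ("month", "day", "year")]
        return "/".join(p for p in parts if p)
    pd = record.get("PublicationDate")
    if pd:  # fall back to the raw string form
        return pd[0]
    return ""

print(display_date({"PublicationDate_xml": [{"day": "01", "month": "01", "year": "1979"}]}))
```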

This knowledge will also help me avoid sending unnecessary data to the
LibX client. As you know, Summon requires a proxy that talks to the
actual service, and cutting out redundant and derived fields at the
proxy could save a fair amount of bandwidth (though I'll have to check
if it also shaves off latency.) A typical search response (raw JSON,
with 20 hits) is > 500KB long, so investing computing time at the
proxy in cutting this down may be promising.

 - Godmar


Re: [CODE4LIB] C4L2011 Voting for Prepared Talks

2010-12-01 Thread Godmar Back
"Through Dec 1" typically means until Dec 1, 23:59 (in some time zone) -
yet the page says voting is closed.

Could this be fixed?

 - Godmar

On Mon, Nov 29, 2010 at 5:02 PM, McDonald, Robert H.
rhmcd...@indiana.eduwrote:

 Just a reminder that voting for prepared talks for code4lib 2011 is ongoing
 and open through Dec 1, 2010.

 Please vote if you have not done so already.

 To vote - go here - http://vote.code4lib.org/election/index/17

 If you have never voted before you will need to register here first -
 http://code4lib.org/user/register

 Thanks

 Robert

 **
 Robert H. McDonald
 Associate Dean for Library Technologies and Digital Libraries
 Associate Director, Data to Insight Center-Pervasive Technology Institute
 Executive Director, Kuali OLE
 Frye Leadership Institute Fellow 2009
 Indiana University
 Herman B Wells Library 234
 1320 East 10th Street
 Bloomington, IN 47405
 Phone: 812-856-4834
 Email: rob...@indiana.edu
 Skype/GTalk: rhmcdonald
 AIM/MSN: rhmcdonald1



Re: [CODE4LIB] detecting user copying URL?

2010-12-01 Thread Godmar Back
On Thu, Dec 2, 2010 at 12:25 AM, Susan Kane adarconsult...@gmail.comwrote:

 Absolutely this should be solved by the vendors / content providers but --
 just for the sake of argument -- it is a possible extension for LibX?

 You can't send a standard message every time a user copies a URL from their
 address bar -- they would kill you.

 Is there a way for a browser plugin to know that the user is on a
 specific
 website and to warn them for such actions while there?

 Or would that level of coordination between the website and the address bar
 be (a) impossible, or (b) not really worth the effort, or (c) a serious
 privacy concern?


Extensions such as LibX can certainly interpose when users bookmark items,
at least in Firefox (and possibly Chrome). The question is how to determine
if a URL is bookmarkable or not. This could be done either by consulting a
database - online or built-in, or perhaps by using heuristics (for instance,
URLs containing session ids are often not bookmarkable.)
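To illustrate the heuristic route (the patterns below are my own guesses, not anything LibX actually ships):

```python
import re

# common session-id query parameters / path segments (illustrative list only)
SESSION_MARKERS = re.compile(
    r"(;jsessionid=|[?&](PHPSESSID|sessionid|sid|CFTOKEN)=)", re.I)

def looks_bookmarkable(url: str) -> bool:
    """Guess whether a URL is stable enough to bookmark."""
    return SESSION_MARKERS.search(url) is None

print(looks_bookmarkable("http://example.org/record/42"))            # True
print(looks_bookmarkable("http://example.org/cat?sid=ab12&rec=42"))  # False
```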

 - Godmar


[CODE4LIB] Q: Summon API Service?

2010-10-27 Thread Godmar Back
Hi,

Unlike Link/360, Serials Solutions' Summon API is extremely cumbersome to
use - requiring, for instance, that requests be digitally signed. (*)

Has anybody developed a proxy server for Summon that makes its API public
(e.g. receives requests, signs them, forwards them to Summon, and relays the
result back to a HTTP client?)
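For reference, the signing such a proxy would have to do amounts to an HMAC over a handful of request components. A rough Python sketch (the component order, trailing newline, and query handling are from memory and should be checked against the authentication docs at (*)):

```python
import base64
import hashlib
import hmac
from urllib.parse import unquote_plus

def summon_digest(api_key, accept, date, host, path, query):
    """Build a base64 HMAC-SHA1 digest over the request components.
    Exact string-to-sign format is an assumption; verify against the docs."""
    # the query string is decoded and its parameters sorted before signing
    sorted_query = "&".join(sorted(unquote_plus(query).split("&")))
    to_sign = "\n".join([accept, date, host, path, sorted_query]) + "\n"
    mac = hmac.new(api_key.encode(), to_sign.encode(), hashlib.sha1)
    return base64.b64encode(mac.digest()).decode()

digest = summon_digest("secret", "application/json",
                       "Tue, 30 Jun 2009 12:10:24 GMT",
                       "api.summon.serialssolutions.com",
                       "/search", "s.q=forest&s.ff=ContentType,or,1,15")
print(digest)  # deterministic for fixed inputs
```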

Serials Solutions publishes some PHP5 and Ruby sample code in two API
libraries (**), but these don't appear to be fully-fledged or
easy-to-install solutions.  ("Easy to install" here is defined as: an average
systems librarian can download them, provide the API key, and have a running
solution in less time than it takes to install Wordpress.)

Thanks!

 - Godmar

(*) http://api.summon.serialssolutions.com/help/api/authentication
(**) http://api.summon.serialssolutions.com/help/api/code


Re: [CODE4LIB] Safari extensions

2010-08-06 Thread Godmar Back
On Fri, Aug 6, 2010 at 8:19 AM, Joel Marchesoni jma...@email.wcu.edu wrote:
 Honestly I try to switch to Chrome every month or so, but it just doesn't do 
 what Firefox does for me. I've actually been using a Firefox mod called Pale 
 Moon [1] that takes out some of the not so useful features for work (parental 
 controls, etc) and optimizes for current processors. It's not a huge speed 
 increase, but it is definitely noticeable.


Chrome is certainly behind Firefox in its extension capability. For
instance, it doesn't allow the extension of context menus yet (planned
for later this year or next), and even the planned API will be less
flexible than Firefox's  . It is hobbled by the fact that the browser
is not itself written using the same markup language as its
extensions, so Google's programmers have to add an API (along with a
C++ implementation) for every feature they want supported.

Regarding JavaScript performance, both Firefox and Chrome have
just-in-time compilers in their engines (Chrome uses V8, Firefox uses
TraceMonkey), each of which provides an order of magnitude or two of
speedup compared to the interpreters used in FF 3.0 and before.

Regarding resource usage, it's difficult to tell. Firefox is certainly
a memory hog, with internal memory leaks, but when the page itself is
the issue (perhaps because the JavaScript programmer leaked memory),
then both browsers are affected. In Chrome, I've observed two
problems. First, if a page leaks, then the corresponding tab will
simply ask for more memory from the OS. There are no resource controls
at this point. The effect is the same as in Firefox. Second, each page
is scheduled separately by the OS. I've observed that Chrome tabs slow
to a halt in Windows XP because the OS is starving a tab's thread if
there are CPU-bound activities on the machine, making Chrome actually
very difficult to use.

 - Godmar


Re: [CODE4LIB] Safari extensions

2010-08-05 Thread Godmar Back
No, nothing beyond a quick read-through.

The architecture is similar to Google Chrome's - which is perhaps not
surprising given that both Safari and Chrome are based on WebKit -
which for us at LibX means we should be able to leverage the redesign
we did for LibX 2.0.

A notable characteristic of this architecture is that content scripts
that interact with a page are in a separate OS process from the main
extensions' code, thus they have to communicate with the main
extension via message passing rather than by exploiting direct method
calls as in Firefox.

 - Godmar

On Thu, Aug 5, 2010 at 4:04 PM, Eric Hellman e...@hellman.net wrote:
 Has anyone played with the new Safari extensions capability? I'm looking at 
 you, Godmar.


 Eric Hellman
 President, Gluejar, Inc.
 41 Watchung Plaza, #132
 Montclair, NJ 07042
 USA

 e...@hellman.net
 http://go-to-hellman.blogspot.com/
 @gluejar



Re: [CODE4LIB] Safari extensions

2010-08-05 Thread Godmar Back
On Thu, Aug 5, 2010 at 4:15 PM, Raymond Yee y...@berkeley.edu wrote:
 Has anyone given thought to how hard it would be to port Firefox extensions
 such as LibX and  Zotero to Chrome or Safari?  (Am I the only one finding
 Firefox to be very slow compared to Chrome?)

We have ported LibX to Chrome, see http://libx.org/releases/gc/

Put briefly, Chrome provides an extension API that is entirely
JavaScript/HTML based. As such, existing libraries such as jQuery can
be used to implement the extensions' user interface (such as LibX's
search box, implemented as a browser action). Unlike Firefox, no
coding in a special-purpose user interface markup language such as XUL
is required. (That said, it's possible to achieve the same in Firefox,
and in fact we're now using the same HTML/JS code in Firefox, reducing
the XUL-specific code to a minimum). Safari will use the same approach.

Chrome also supports content scripts that interact with the page a
user is looking at. These scripts live in an environment that is
similar to the environment seen by client-side code coming from the
origin. In this sense, it's very similar to how Firefox works with its
sandboxes, with the exception mentioned in my previous email that all
communication outside has to be done via message passing (sending
JSON-encoded objects back and forth).

 - Godmar


Re: [CODE4LIB] SerSol 360Link API?

2010-04-19 Thread Godmar Back
I wrote a to-JSON proxy a while ago:
http://libx.lib.vt.edu/services/link360/index.html

I found that Link/360 doesn't handle load very well. Even a small burst of
requests leads to a spike in latency and error responses. I asked SS if this
was a bug or part of some intentional throttling attempt, but never received
a reply. Didn't pursue it further.

 - Godmar

On Mon, Apr 19, 2010 at 2:42 AM, David Pattern d.c.patt...@hud.ac.ukwrote:

 Hiya

 We're using it to add e-holdings into to our OPAC, e.g.
 http://library.hud.ac.uk/catlink/bib/396817/

 I've also tried using the API to add the coverage info to the
 availability text for journals in Summon (e.g. Availability: print
 (1998-2005) & electronic (2000-present)).

 I've made quite a few tweaks to our 360 Link (mostly using jQuery), so I'm
 half tempted to have a go using the API to develop a complete replacement
 for 360 Link.  If anyone's already done that, I'd be keen to hear more.

 regards
 Dave Pattern
 University of Huddersfield

 
 From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Jonathan
 Rochkind [rochk...@jhu.edu]
 Sent: 19 April 2010 03:50
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] SerSol 360Link API?

 Is anyone using the SerSol 360Link API in a real-world production or
 near-production application?  If so, I'm curious what you are using it for,
 what your experiences have been, and in particular if you have information
 on typical response times of their web API.  You could reply on list or off
 list just to me. If I get interesting information especially from several
 sources, I'll try to summarize on list and/or blog either way.

 Jonathan


 ---
 This transmission is confidential and may be legally privileged. If you
 receive it in error, please notify us immediately by e-mail and remove it
 from your system. If the content of this e-mail does not relate to the
 business of the University of Huddersfield, then we do not endorse it and
 will accept no liability.



Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Godmar Back
On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer ulrich.schae...@dfki.dewrote:

 Hi,
 try this: http://code.google.com/p/xml2json-xslt/


I should have mentioned that I already tried everything I could find after
googling - this stylesheet doesn't come close to meeting the requirements. It
drops attributes just like simplexml_json does.

The one thing I didn't try is a program called 'BadgerFish.php' which I
couldn't locate - Google once indexed it at badgerfish.ning.com

 - Godmar


[CODE4LIB] Q: XML2JSON converter

2010-03-04 Thread Godmar Back
Hi,

Can anybody recommend an open source XML2JSON converter in PHP or
Python (or potentially other languages, including XSLT stylesheets)?

Ideally, it should implement one of the common JSON conventions, such
as Google's JSON convention for GData [1], but anything that preserves
all elements, attributes, and text content of the XML file would be
acceptable.

Note that json_encode(simplexml_load_file(...)) does not meet this
requirement - in fact, nothing based on simplexml_load_file() will.
(It can't even load MarcXML correctly).
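For what it's worth, the requirement - keep all elements, attributes, and text - can be met in a few lines. A sketch loosely following the BadgerFish convention ('@' prefix for attributes, '$' for text content), written in Python rather than any of the libraries mentioned:

```python
import json
import xml.etree.ElementTree as ET

def to_badgerfish(elem):
    """Convert an Element to a dict, keeping attributes, text, and children."""
    node = {"@" + k: v for k, v in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        node["$"] = text
    for child in elem:
        # repeated child elements accumulate into a list
        node.setdefault(child.tag, []).append(to_badgerfish(child))
    return node

doc = ET.fromstring('<record id="1"><title lang="en">Hello</title></record>')
print(json.dumps({doc.tag: to_badgerfish(doc)}, sort_keys=True))
```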

Thanks!

 - Godmar

[1] http://code.google.com/apis/gdata/docs/json.html


Re: [CODE4LIB] Q: what is the best open source native XML database

2010-01-19 Thread Godmar Back
On Tue, Jan 19, 2010 at 10:09 AM, Sean Hannan shan...@jhu.edu wrote:
 I've had the best experience (query speed, primarily) with BaseX.  This was 
 primarily for large XML document processing, so I'm not sure how much it will 
 satisfy your transactional needs.

 I was initially using eXist, and then switched over to BaseX because the 
 speed gains were very noticeable.


What about the relative maturity/functionality of eXist vs BaseX? I'm
a bit skeptical about putting my eggs in a university-project basket not
backed by a continuous revenue stream (... did I just say that out
loud?)

 - Godmar


[CODE4LIB] Q: what is the best open source native XML database

2010-01-16 Thread Godmar Back
Hi,

we're currently looking for an XML database to store a variety of
small-to-medium sized XML documents. The XML documents are
unstructured in the sense that they do not follow a schema or DTD, and
that their structure will be changing over time. We'll need to do
efficient searching based on elements, attributes, and full text
within text content. More importantly, the documents are mutable.
We'd like to bring documents or fragments into memory in a DOM
representation, manipulate them, then put them back into the database.
Ideally, this should be done in a transaction-like manner. We need to
efficiently serve document fragments over HTTP, ideally in a manner
that allows for scaling through replication. We would prefer strong
support for Java integration, but it's not a must.

Have others encountered similar problems, and what have you been using?

So far, we're researching: eXist-DB (http://exist.sourceforge.net/ ),
Base-X (http://www.basex.org/ ), MonetDB/XQuery
(http://www.monetdb.nl/XQuery/ ), Sedna
(http://modis.ispras.ru/sedna/index.html ). Wikipedia lists a few
others here: http://en.wikipedia.org/wiki/XML_database
I'm wondering to what extent systems such as Lucene, or even digital
object repositories such as Fedora, could be coaxed into this usage
scenario.

Thanks for any insight you have or experience you can share.

 - Godmar


Re: [CODE4LIB] ipsCA Certs

2010-01-04 Thread Godmar Back
Hi,

in my role as unpaid tech advisor for our local library, may I ask a
question about the ipsCA issue?

Is my understanding correct that ipsCA currently reissues certificates [1]
signed with a root CA that is not yet in Mozilla products, due to IPS's
delaying the necessary vetting process [2]? In other words, Mozilla users
would see security warnings even if a reissued certificate was used?

The reason I'm confused is that I, like David, saw a number of still valid
certificates from IPS Internet publishing Services s.l. already shipping
with Firefox, alongside the now-expired certificate. But I suppose those
certificates are for something else and the reissued certificates won't be
signed using them?

Thanks,

 - Godmar

[1] http://certs.ipsca.com/Support/hierarchy-ipsca.asp
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=529286

On Thu, Dec 17, 2009 at 4:02 PM, John Wynstra john.wyns...@uni.edu wrote:

 Out of curiosity, did anyone else using ipsCA certs receive notification
 that due to the coming expiration of their root CA (December 29,2009), they
 would need a reissued cert under a new root CA?

 I am uncertain as to how this new Root CA will become a part of the
 browsers trusted roots without some type of user action including a software
 upgrade, but the following library website instructions lead me to believe
 that this is not going to be smooth.  http://bit.ly/53Npel

 We are just about to go live with EZProxy in January with an ipsCA cert
 issued a few months ago, and I am not about to do that if I have serious
 browser support issue.


 --
 
 John Wynstra
 Library Information Systems Specialist
 Rod Library
 University of Northern Iowa
 Cedar Falls, IA  50613
 wyns...@uni.edu
 (319)273-6399
 



Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Godmar Back
The string in question is double-encoded, that is, a string that's in
UTF-8 already was run through a UTF-8 encoder.

The string is "Acta Ortopédica", where the 'e' is really '\u00e9', aka
'Latin Small Letter E with Acute'. [1]

In UTF-8, the e-acute is encoded as the two bytes C3 A9.  If you run the
bytes C3 A9 through a UTF-8 encoder again, C3 ('\u00c3', Capital A with
tilde) becomes C3 83, and A9 (copyright sign, '\u00a9') becomes C2 A9.
C3 83 C2 A9 is exactly what JISC is serving; what it should be serving
is C3 A9.
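The round trip is easy to demonstrate in Python - and to undo, by decoding once as UTF-8 and mapping the resulting code points back to bytes via Latin-1:

```python
s = "é"                                         # U+00E9, e with acute
once = s.encode("utf-8")                        # correct encoding: C3 A9
twice = once.decode("latin-1").encode("utf-8")  # double-encoded: C3 83 C2 A9
print(once.hex(" "), "|", twice.hex(" "))

# repair: decode as UTF-8, reinterpret code points as bytes, decode again
fixed = twice.decode("utf-8").encode("latin-1").decode("utf-8")
print(fixed)  # é
```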

Send email to them.

 - Godmar

[1] http://www.utf8-chartable.de/

2009/12/21 Glen Newton glen.new...@nrc-cnrc.gc.ca

 [I realise there was a recent related 'Character-sets for dummies'[1]
 discussion recently]

 I am using tictocs[2] list of journal RSS feeds, and I am getting
 gibberish in places for diacritics. Below is an example:

 in emacs:
  221    Acta Ortop  dica Brasileira     
 http://www.scielo.br/rss.php?pid=1413-7852&lang=en      1413-7852
 in Firefox:
  221    Acta Ortop  dica Brasileira     
 http://www.scielo.br/rss.php?pid=1413-7852&lang=en      1413-7852

 Note that the emacs view is both of a save of the Firefox, and from a
 direct download using 'wget'.

 Is this something on my end, or are the tictocs people not serving
 proper UTF-8?

 The HTTP header from wget claims UTF-8:
  wget -S http://www.tictocs.ac.uk/text.php
  --2009-12-21 12:47:59--  http://www.tictocs.ac.uk/text.php
  Resolving www.tictocs.ac.uk... 130.88.101.131
  Connecting to www.tictocs.ac.uk|130.88.101.131|:80... connected.
  HTTP request sent, awaiting response...
    HTTP/1.1 200 OK
    Date: Mon, 21 Dec 2009 17:42:05 GMT
    Server: Apache/2.2.13 (Unix) mod_ssl/2.2.13 OpenSSL/0.9.8k PHP/5.3.0 DAV/2
    X-Powered-By: PHP/5.3.0
    Content-Type: text/plain; charset=utf-8
    Connection: close
  Length: unspecified [text/plain]
 stuff removed

 Can someone validate if they are also experiencing this issue?

 Thanks,
 Glen

 [1]https://listserv.nd.edu/cgi-bin/wa?S2=CODE4LIBq=s=character-sets+for+dummiesf=a=b=
 [2]http://www.tictocs.ac.uk/text.php

 --
 Glen Newton | glen.new...@nrc-cnrc.gc.ca
 Researcher, Information Science, CISTI Research
  NRC W3C Advisory Committee Representative
 http://tinyurl.com/yvchmu
 tel/tél: 613-990-9163 | facsimile/télécopieur 613-952-8246
 Canada Institute for Scientific and Technical Information (CISTI)
 National Research Council Canada (NRC)| M-55, 1200 Montreal Road
 http://www.nrc-cnrc.gc.ca/
 Institut canadien de l'information scientifique et technique (ICIST)
 Conseil national de recherches Canada | M-55, 1200 chemin Montréal
 Ottawa, Ontario K1A 0R6
 Government of Canada | Gouvernement du Canada
 --


Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Godmar Back
I believe they've changed it while we were having the discussion.

When I downloaded the file (with curl), it looked like this:

0020700   r   t   o   p   C etx   B   )   d   i   c   a  sp   B   r   a
72 74 6f 70 c3 83 c2 a9 64 69 63 61 20 42 72 61
0020720   s   i   l   e   i   r   a  ht   h   t   t   p   :   /   /   w
73 69 6c 65 69 72 61 09 68 74 74 70 3a 2f 2f 77

 - Godmar

On Mon, Dec 21, 2009 at 2:24 PM, Erik Hetzner erik.hetz...@ucop.edu wrote:
 At Mon, 21 Dec 2009 14:09:28 -0500,
 Glen Newton wrote:

 It seems that different people are seeing different things in their
 respective viewers (i.e some are OK and others are like what I am
 seeing).

 When I use wget and view the local file in Firefox (3.0.4, Linux Suse
 11.0) I see:
  http://cuvier.cisti.nrc.ca/~gnewton/tictoc1.gif
 [gif used as it is not lossy]

 The text is clearly not correct.

 The file I got with wget is:
   http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt

 Is this just a question of different client software (and/or OSes)
 viewing or mangling the content?

 When dealing with character set issues (especially the dreaded
 double-encoding!) I find it best to use hex editors or dumpers. If in
 emacs, try M-x hexl-find-file. On a Unix command line, the od or hd
 commands are useful.

 For the record:

   48 54 54 50 2f 31 2e 31  20 32 30 30 20 4f 4b 0d  |HTTP/1.1 200 OK.|
 0010  0a 44 61 74 65 3a 20 4d  6f 6e 2c 20 32 31 20 44  |.Date: Mon, 21 D|
 0020  65 63 20 32 30 30 39 20  31 39 3a 32 32 3a 33 38  |ec 2009 19:22:38|
 0030  20 47 4d 54 0d 0a 53 65  72 76 65 72 3a 20 41 70  | GMT..Server: Ap|
 0040  61 63 68 65 2f 32 2e 32  2e 31 33 20 28 55 6e 69  |ache/2.2.13 (Uni|
 0050  78 29 20 6d 6f 64 5f 73  73 6c 2f 32 2e 32 2e 31  |x) mod_ssl/2.2.1|
 0060  33 20 4f 70 65 6e 53 53  4c 2f 30 2e 39 2e 38 6b  |3 OpenSSL/0.9.8k|
 0070  20 50 48 50 2f 35 2e 33  2e 30 20 44 41 56 2f 32  | PHP/5.3.0 DAV/2|
 0080  0d 0a 58 2d 50 6f 77 65  72 65 64 2d 42 79 3a 20  |..X-Powered-By: |
 0090  50 48 50 2f 35 2e 33 2e  30 0d 0a 43 6f 6e 74 65  |PHP/5.3.0..Conte|
 00a0  6e 74 2d 54 79 70 65 3a  20 74 65 78 74 2f 70 6c  |nt-Type: text/pl|
 00b0  61 69 6e 3b 20 63 68 61  72 73 65 74 3d 75 74 66  |ain; charset=utf|
 00c0  2d 38 0d 0a 54 72 61 6e  73 66 65 72 2d 45 6e 63  |-8..Transfer-Enc|
 00d0  6f 64 69 6e 67 3a 20 63  68 75 6e 6b 65 64 0d 0a  |oding: chunked..|
 ...
 2230  4f 72 74 68 6f 70 61 65  64 69 63 61 09 68 74 74  |Orthopaedica.htt|
 2240  70 3a 2f 2f 69 6e 66 6f  72 6d 61 68 65 61 6c 74  |p://informahealt|
 2250  68 63 61 72 65 2e 63 6f  6d 2f 61 63 74 69 6f 6e  |hcare.com/action|
 2260  2f 73 68 6f 77 46 65 65  64 3f 6a 63 3d 6f 72 74  |/showFeed?jc=ort|
 2270  26 74 79 70 65 3d 65 74  6f 63 26 66 65 65 64 3d  |&type=etoc&feed=|
 2280  72 73 73 09 31 37 34 35  2d 33 36 37 34 09 31 37  |rss.1745-3674.17|
 2290  34 35 2d 33 36 38 32 0a  32 32 31 09 41 63 74 61  |45-3682.221.Acta|
 22a0  20 4f 72 74 6f 70 c3 a9  64 69 63 61 20 42 72 61  | Ortop..dica Bra|
 22b0  73 69 6c 65 69 72 61 09  68 74 74 70 3a 2f 2f 77  |sileira.http://w|
 ...

 best,
 Erik Hetzner

 ;; Erik Hetzner, California Digital Library
 ;; gnupg key id: 1024D/01DB07E3




Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Godmar Back
On Mon, Dec 21, 2009 at 2:09 PM, Glen Newton glen.new...@nrc-cnrc.gc.ca wrote:

 The file I got with wget is:
  http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt


(Just to convince myself I'm not going nuts...) - this file, which
Glen downloaded with wget, appears double-encoded:

# curl -s http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt | od -a -t x1
| head -1082 | tail -4
0020660   -   3   6   8   2  nl   2   2   1  ht   A   c   t   a  sp   O
2d 33 36 38 32 0a 32 32 31 09 41 63 74 61 20 4f
0020700   r   t   o   p   C etx   B   )   d   i   c   a  sp   B   r   a
72 74 6f 70 c3 83 c2 a9 64 69 63 61 20 42 72 61

 - Godmar


Re: [CODE4LIB] SerialsSolutions Javascript Question

2009-10-28 Thread Godmar Back
On Wed, Oct 28, 2009 at 9:49 PM, Michael Beccaria
mbecca...@paulsmiths.eduwrote:

 I should clarify. The most granular piece of information in the html is
 a class attribute (i.e. there is no id). Here is a snippet:

 <div class="SS_Holding" style="background-color: #CECECE">
 <!-- Journal Information -->
 <span class="SS_JournalTitle"><strong>Annals of forest
 science.</strong></span>&nbsp;<span
 class="SS_JournalISSN">(1286-4560)</span>


 I want to alter the <span class="SS_JournalISSN">(1286-4560)</span>
 section. Maybe add some HTML after the ISSN that tells whether it is
 peer reviewed or not.


Yes - you'd write code similar to this one:

$(document).ready(function () {
    $("span.SS_JournalISSN").each(function () {
        var issn = $(this).text().replace(/[^\dxX]/g, "");
        var self = this;
        $.getJSON("http://xissn.oclc..." + "issn=" + issn +
                  "&format=json&callback=?", function (data) {
            $(self).append( /* data ... [ 'is peer reviewed' ] */ );
        });
    });
});

 - Godmar


Re: [CODE4LIB] Setting users google scholar settings

2009-07-15 Thread Godmar Back
It used to be you could just GET the corresponding form, e.g.:

http://scholar.google.com/scholar_setprefs?num=10&instq=&inst=sfx-f7e167eec5dde9063b5a8770ec3aaba7&q=einstein&inststart=0&submit=Save+Preferences

 - Godmar

On Wed, Jul 15, 2009 at 3:17 AM, Stuart Yeatesstuart.yea...@vuw.ac.nz wrote:
 It's possible to send users to google scholar using URLs such as:

 http://scholar.google.co.nz/schhp?hl=eninst=8862113006238551395

 where the institution is obtained using the standard preference setting 
 mechanism. Has anyone found a way of persisting this setting in the users 
 browser, so when they start a new session this is the default?

 Yes, I know they can go Scholar Preferences - Save to persist it, but 
 I'm looking for a more automated way of doing it...

 cheers
 stuart



Re: [CODE4LIB] tricky mod_rewrite

2009-07-01 Thread Godmar Back
On Wed, Jul 1, 2009 at 4:58 AM, Peter Kiraly pkir...@tesuji.eu wrote:

 Hi Eric,

 try this:

 <IfModule mod_rewrite.c>
  RewriteEngine on
  RewriteBase /script
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteCond %{REQUEST_FILENAME} !-d
  RewriteCond %{REQUEST_URI} !=/favicon.ico
  RewriteRule ^(.*)$ script.cgi?param1=$1 [L,QSA]
 </IfModule>


Here's a challenge question:

is it possible to write this without hardwiring the RewriteBase in it?  So
that it can be used, for instance, in an .htaccess file from within any
/path?

  - Godmar


Re: [CODE4LIB] tricky mod_rewrite

2009-07-01 Thread Godmar Back
On Wed, Jul 1, 2009 at 9:13 AM, Peter Kiraly pkir...@tesuji.eu wrote:

 From: Godmar Back god...@gmail.com

 is it possible to write this without hardwiring the RewriteBase in it?  So
 that it can be used, for instance, in an .htaccess file from within any
 /path?


 Yes, you can put it into a .htaccess file, and the URL rewrite will
 apply on that directory only.


You misunderstood the question; let me rephrase it:

Can I write a .htaccess file without specifying the path where the script
will be located in RewriteBase?
For instance, consider
http://code.google.com/p/tictoclookup/source/browse/trunk/standalone/.htaccess
Here, anybody who wishes to use this code has to adapt the .htaccess file to
their path and change the RewriteBase entry.

Is it possible to write a .htaccess file that works *no matter* where it is
located, entirely based on where it is located relative to the Apache root
or an Apache directory?

 - Godmar


Re: [CODE4LIB] tricky mod_rewrite

2009-07-01 Thread Godmar Back
On Wed, Jul 1, 2009 at 10:18 AM, Walker, David dwal...@calstate.edu wrote:

  Is it possible to write a .htaccess file that works
  *no matter* where it is located

 I don't believe so.

 If the .htaccess file lives in a directory inside of the Apache root
 directory, then you _don't_ need to specify a RewriteBase.  It's really only
 necessary when .htacess lives in a virtual directory outside of the Apache
 root.


I see.

Unfortunately, that's the common deployment case for non-administrators (many
librarians). They can create .htaccess files, but don't always have control
of the main Apache httpd.conf or the root directory.

 - Godmar


Re: [CODE4LIB] tricky mod_rewrite

2009-07-01 Thread Godmar Back
On Wed, Jul 1, 2009 at 10:38 AM, Walker, David dwal...@calstate.edu wrote:

  They can create .htaccess files, but don't always
  have control of the main Apache httpd.conf or the
  root directory.

 Just to be clear, I didn't mean just the root directory itself.  If
 .htaccess lives within a sub-directory of the Apache root, then you _don't_
 need RewriteBase.

 RewriteBase is only necessary when you're in a virtual directory, which is
 physically located outside of Apache's DocumentRoot path.

 Correct me if I'm wrong.


You are correct!  If I omit the RewriteBase, it still works in this case.

Let's have some more of that sendmail koolaid and up the challenge.

How can I write an .htaccess that's path-independent if I like to exclude
certain files in that directory, such as index.html?  So far, I've been
doing:

RewriteCond %{REQUEST_URI} !^/services/tictoclookup/standalone/index.html

to avoid running my script for index.html. How would I do that?  (Hint: the
use of SERVER variables on the right-hand side in the CondPattern of a
RewriteCond is not allowed, but some trickery may be possible, according to
http://www.issociate.de/board/post/495372/Server-Variables_in_CondPattern_of_RewriteCond_directive.html)
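One avenue for the exclusion question (an untested sketch, not from the thread): in per-directory (.htaccess) context, mod_rewrite strips the directory prefix before matching, so index.html can be excluded with a relative pattern and no hardwired path:

```apache
# Hypothetical path-independent .htaccess: in per-directory context the
# RewriteRule pattern is matched against the path relative to this
# directory, so no absolute path (and often no RewriteBase) is needed.
RewriteEngine on
# Pass index.html through untouched and stop rewriting.
RewriteRule ^index\.html$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ script.cgi?param1=$1 [L,QSA]
```

(Untested; virtual-directory setups may still behave differently, per David's caveat.)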

 - Godmar


Re: [CODE4LIB] How to access environment variables in XSL

2009-06-23 Thread Godmar Back
Let me repeat a small comment I already sent to Mike in private email:
in a J2EE environment, information that characterizes a request (such as
path, remote addr, etc.) is not accessible in environment variables or
properties, unlike in a CGI environment. That means that even if you write
an extension for XALAN-J to trigger the execution of your Java code while
processing a stylesheet during a request, you don't normally obtain access
to this information. Rather it is passed by the servlet container to the
servlet via a request object. If you don't control the servlet code - say
because it's vendor-provided - then you have to either rely on any extension
functionality the vendor may provide, or you have to create your own servlet
that wraps the vendor's servlet, saving the request information somewhere
where your xalan extension can retrieve it, then forwards the request to the
vendor's servlet.
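The wrapping approach described above can be illustrated with a small Python/WSGI analogue (a sketch of the thread-local pattern only - the real solution would be Java servlet code, and all names here are hypothetical):

```python
import threading

# Sketch of the "wrapper" pattern: middleware stores the current request
# in thread-local storage so code invoked deep inside a third-party
# component (standing in for a Xalan extension) can retrieve it without
# it being passed explicitly.
_current = threading.local()

def get_current_request():
    # Callable from anywhere on the same thread during request handling.
    return getattr(_current, "request", None)

class RequestCapturingMiddleware:
    def __init__(self, app):
        self.app = app  # the wrapped (vendor) application

    def __call__(self, environ, start_response):
        _current.request = environ          # save before forwarding
        try:
            return self.app(environ, start_response)
        finally:
            _current.request = None         # don't leak across requests

# Demo: a trivial inner app that consults the thread-local store.
def inner_app(environ, start_response):
    req = get_current_request()
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [req["REMOTE_ADDR"].encode()]

app = RequestCapturingMiddleware(inner_app)
```

In the J2EE case, the middleware would be a wrapping servlet (or filter), and `get_current_request()` would be called from the Xalan extension function.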

 - Godmar


On Tue, Jun 23, 2009 at 2:04 PM, Cloutman, David
dclout...@co.marin.ca.uswrote:

 I'm in a similar situation in that I've spent the last 6 months cramming
 XSLT in order to do output from an application provided by a vendor. In
 my situation, I'm taking information stored in a CMS database as XML
 fragments and transforming it into our Web site's pages. (The CMS is
 called Cascade, and is okay, but not fantastic.)

 The tricky part of this situation is that simply grabbing a book on
 XPath and XSLT will not tell you everything you need to know in order to
 work with your proprietary software. Neither will simply knowing what
 language the middleware layer is written in. Specifically, you need to
 find out from your vendor what XSLT processor their application uses. In my
 case, I found out that my CMS uses Xalan, which impacts my situation
 significantly, since it limits me to XSLT 1.0. However, the Xalan
 processor does allow for one to script extensions, and in my case I
 _might_ be able to leverage that fact to access some system information,
 depending on what capabilities my vendor has given me. So, in short,
 making the most of the development environment you have in creating your
 XSLT will require you not only to grok the complexities of what I think
 is a rather difficult language to master, but also to gain a good
 understanding of what tools are and are not available to you through
 your preprocessor.

 Just to address your original question, XSLT really is not designed to
 work like a conventional programming language per se. You may or may not
 have direct access to environment variables. That is dependent upon how
 the XSLT processor is implemented by your vendor. I did see some
 creative ideas in other posts, and I do not know if they will or will
 not work. However, it is often possible for the middleware layer to pass
 data to the XSLT processor, thus exposing it to the XSLT developer.
 However, what data gets passed to the XSLT developer is generally under
 the control of the application developer.

 Here is a quick example of how XML data and XSLT presentation logic can
 be glued together in PHP using a non-native XSLT processor. This is
 being done similarly by our respective Java applications, using
 different XSLT processors, and hopefully a lot more error checking.

 http://frenzy.marinlibrary.org/code-samples/php-xslt/middleware.php

 In the example, I have passed some environment data to the XSLT
 processor from the PHP middleware layer. As you will see, what data is
 exposed is entirely determined by the PHP.

 Good luck!

 - David

 ---
 David Cloutman dclout...@co.marin.ca.us
 Electronic Services Librarian
 Marin County Free Library

 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Doran, Michael D
 Sent: Friday, June 19, 2009 2:53 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] How to access environment variables in XSL


 Hi Dave,

  What XSLT processor and programming language are you using?

 I'm embarrassed to say that I'm not sure.  I'm making modifications and
 enhancements to already existing XSL pages that are part of the
 framework of Ex Libris' new Voyager 7.0 OPAC.  This new version of the
 OPAC is running under Apache Tomcat (on Solaris) and my assumption is
 that the programming language is Java; however the source code for the
 app itself is not available to me (and I'm not a Java programmer anyway,
 so it's a moot point).  I assume also that the XSLT processor is what
 comes with Solaris (or Tomcat?).  As you can probably tell, this stuff
 is new to me.  I've been trying to take a Sun Ed XML/XSL class for the
 last year, but it keeps getting cancelled for lack of students.
 Apparently I'm the last person left in the Dallas/Fort Worth area that
 needs to learn this stuff. ;-)

 -- Michael

 # Michael Doran, Systems Librarian
 # University of Texas at Arlington
 # 817-272-5326 office
 # 817-688-1926 mobile
 # do...@uta.edu
 # http://rocky.uta.edu/doran/


  -Original Message-
  From: Code for 

Re: [CODE4LIB] How to access environment variables in XSL

2009-06-19 Thread Godmar Back
Running in a J2EE is somewhat different from running in a CGI environment.
Specifically, variables such as REMOTE_ADDR, etc. are not stored in
environment variables that are easily accessible.

Assuming that your XSLT is executed for each request (which, btw, is not a
given since Voyager may well be caching the results of the style-sheet
application), your vendor may set up the XSLT processor environment to
provide access to variables related to the current request, for instance,
via XALAN-J extensions. If they did that, it would probably be in the
documentation to which you have access under NDA.

If not, things will be a lot more complicated. You'll probably have to wrap
the servlet in your own; store the current servlet request in a thread-local
variable, then create an xalan extension to access it during the XSLT
processing. That requires a fair bit of Java/J2EE trickery, but is
definitely possible (and will likely void your warranty.)

 - Godmar

On Fri, Jun 19, 2009 at 9:42 PM, Tom Pasley tom.pas...@gmail.com wrote:

 Hi,

 I see Michael's here too - (he's a bit of a guru on the Voyager-L listserv
 :-D).

 Michael, if you have a look at the Vendor URL, there's some info there, but
 you might also try having a look through some of these G.search results:

 site:xml.apache.org inurl:xalan-j system

 - see if that helps any - like to help more, but I've got to go!

 Tom

 On Sat, Jun 20, 2009 at 10:11 AM, Doran, Michael D do...@uta.edu wrote:

  Hi Jon,
 
   Try putting somewhere in one of the xslt pages
 
  Cool!  Here's the output:
 
 Version: 1
 Vendor: Apache Software Foundation
 Vendor URL: http://xml.apache.org/xalan-j
 
  -- Michael
 
  # Michael Doran, Systems Librarian
  # University of Texas at Arlington
  # 817-272-5326 office
  # 817-688-1926 mobile
  # do...@uta.edu
  # http://rocky.uta.edu/doran/
 
 
   -Original Message-
   From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
   Behalf Of Jon Gorman
   Sent: Friday, June 19, 2009 5:05 PM
   To: CODE4LIB@LISTSERV.ND.EDU
   Subject: Re: [CODE4LIB] How to access environment variables in XSL
  
   Try putting somewhere in one of the xslt pages
  
   <p>
   Version:
   <xsl:value-of select="system-property('xsl:version')" />
   <br />
   Vendor:
   <xsl:value-of select="system-property('xsl:vendor')" />
   <br />
   Vendor URL:
   <xsl:value-of select="system-property('xsl:vendor-url')" />
   </p>
  
   Jon
  
   On Fri, Jun 19, 2009 at 4:53 PM, Doran, Michael
   Ddo...@uta.edu wrote:
Hi Dave,
   
What XSLT processor and programming language are you using?
   
I'm embarrassed to say that I'm not sure.  I'm making
   modifications and enhancements to already existing XSL pages
   that are part of the framework of Ex Libris' new Voyager 7.0
   OPAC.  This new version of the OPAC is running under Apache
   Tomcat (on Solaris) and my assumption is that the programming
   language is Java; however the source code for the app itself
   is not available to me (and I'm not a Java programmer anyway,
   so it's a moot point).  I assume also that the XSLT processor
   is what comes with Solaris (or Tomcat?).  As you can probably
   tell, this stuff is new to me.  I've been trying to take a
   Sun Ed XML/XSL class for the last year, but it keeps getting
   cancelled for lack of students.  Apparently I'm the last
   person left in the Dallas/Fort Worth area that needs to learn
   this stuff. ;-)
   
-- Michael
   
# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
   
   
-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
Behalf Of Walker, David
Sent: Friday, June 19, 2009 2:48 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] How to access environment variables in XSL
   
Michael,
   
What XSLT processor and programming language are you using?
   
--Dave
   
==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf
Of Doran, Michael D [do...@uta.edu]
Sent: Friday, June 19, 2009 12:44 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] How to access environment variables in XSL
   
I am working with some XSL pages that serve up HTML on the
web.  I'm new to XSL.   In my prior web development, I was
accustomed to being able to access environment variables (and
their values, natch) in my CGI scripts and/or via Server Side
Includes.  Is there an equivalent mechanism for accessing
those environment variables within an XSL page?
   
These are examples of the variables I'm referring to:
SERVER_NAME
SERVER_PORT
HTTP_HOST
DOCUMENT_URI
REMOTE_ADDR

Re: [CODE4LIB] FW: [CODE4LIB] Newbie asking for some suggestions with javascript

2009-06-15 Thread Godmar Back
On Mon, Jun 15, 2009 at 4:09 PM, Roy Tennant tenna...@oclc.org wrote:

 It is worth following up on Xiaoming's statement of a limit of 100 uses per
 day of the xISSN service with the information that exceptions to this
 limit
 are certainly granted. Annette probably knows that just such an exception
 was granted to her LibX project, and LibX remains the single largest user
 of
 this service.
 Roy


Yes, Roy is correct.

We are very grateful for OCLC's generous support and would like to
acknowledge that publicly.

FWIW, I suggested the inclusion of ticTOCs RSS feed data in the survey OCLC
sent out two weeks ago, and less than a week later, OCLC rolls out the
improved service. Excellent!

[ As an aside, in LibX, we are changing the way we use the service;
previously, we were looking up all ISSNs on any page a user visits; we are
now retrieving the metadata if the user actually hovers over the link. Not
that OCLC complained - but CrossRef did when they noticed over 100,000 hits per
day against their service for DOI metadata lookups. In fairness to CrossRef,
they are working on beefing up their servers as well. ]

 - Godmar  Annette for Team LibX.


Re: [CODE4LIB] Newbie asking for some suggestions with javascript

2009-06-11 Thread Godmar Back
Yes - see this email
http://serials.infomotions.com/code4lib/archive/2009/200905/0909.html

If you can host yourself, the stand-alone version is efficient and easy to
keep up to date - just run a cronjob that downloads the text file from JISC.
My WSGI script will automatically pick up if it has changed on disk.

 - Godmar

On Thu, Jun 11, 2009 at 4:08 PM, Annette Bailey afbai...@vt.edu wrote:

 Godmar Back wrote a web service in python for ticTOC with an eye to
 incorporating links into III's Millennium catalog.

 http://code.google.com/p/tictoclookup/

 http://tictoclookup.appspot.com/

 Annette

 On Thu, Jun 11, 2009 at 12:34 PM, Derik Badmandbad...@temple.edu wrote:
  Hello all,
 
  Just joined the list, and I'm hoping to get a suggestion or two.
 
  I'm working on using the ticTOCs ( http://www.tictocs.ac.uk/ ) text file
 of
  rss feed urls for journals to insert links to those feeds in our Serials
  Solution Journal Finder.
 
  I've got it working using a bit of jQuery.
 
  Demo here: http://155.247.22.22/badman/toc/demo.html
  The javascript is here: http://155.247.22.22/badman/toc/toc-rss.js
 
  Getting that working wasn't too hard, but I'm a bit concerned about
  efficiency and caching.
 
  I'm not sure the way I'm checking isbns against the text file is the most
  efficient way to go. Basically I'm making an ajax call to the file that
  takes the data and makes an array of objects. I then query the isbn of
 each
  journal on the page against the array of objects. If there's a match I
 pull
  the data and put it on the page. I'm wondering if there's a better way to
 do
  this, especially since the text file is over 1mb. I'm not looking for
 code,
  just ideas.
 
  I'm also looking for any pointers about using the file itself and somehow
  auto-downloading it to my server on a regular basis. Right now I just
 saved
  a copy to my server, but in the future it'd be good to automate grabbing
 the
  file from ticTOCs server on a regular basis and updating the one on my
  server (perhaps I'd need to use a cron job to do that?).
 
  Thanks for much for any suggestions or pointers. (For what it's worth, I
 can
  manage with javascript or php.)
 
 
  --
  Derik A. Badman
  Digital Services Librarian
  Reference Librarian for Education and Social Work
  Temple University Libraries
  Paley Library 209
  Philadelphia, PA
  Phone: 215-204-5250
  Email: dbad...@temple.edu
  AIM: derikbad
 
  Research makes times march forward, it makes time march backward, and it
  also makes time stand still. -Greil Marcus
 



Re: [CODE4LIB] A Book Grab by Google

2009-05-20 Thread Godmar Back
On Wed, May 20, 2009 at 8:42 PM, Karen Coyle li...@kcoyle.net wrote:

 No, it's not uniquely Google, but adding another price pressure point to
 libraries is still seen as detrimental.


I'm sure you saw:
http://www.nytimes.com/2009/05/21/technology/companies/21google.html

The new agreement, which Google hopes other libraries will endorse,
lets the University of Michigan object if it thinks the prices Google
charges libraries for access to its digital collection are too high, a
major concern of some librarians. Any pricing dispute would be
resolved through arbitration.

 - Godmar


Re: [CODE4LIB] web services and widgets: MAJAX 2, ticTOC lookup, Link/360 JSON, and Google Book Classes

2009-05-19 Thread Godmar Back
On Tue, May 19, 2009 at 8:26 AM, Boheemen, Peter van
peter.vanbohee...@wur.nl wrote:
 Clever idea to put the TicToc stuff 'in the cloud'. How are you going to
 keep it up to date?

By periodically reuploading the entire set (which takes about 15-20
mins), new or changed records can be updated. A changed record is one
with a new RSS feed for the same ISSN + Title combination; the data is
keyed by ISSN+Title. This process can be optimized by only uploading
the delta (you upload .csv files, so the delta can be obtained easily
via comm(1)).
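The delta computation mentioned above can be sketched in pure Python (a hypothetical stand-in for running comm(1) on two sorted exports; `csv_delta` is not part of the project):

```python
def csv_delta(old_lines, new_lines):
    """Hypothetical stand-in for comm(1): compare two uploads of the
    ticTOCs .csv export, treating each line as an opaque record."""
    old, new = set(old_lines), set(new_lines)
    added = sorted(new - old)      # records to (re)upload
    removed = sorted(old - new)    # records to delete via a handler
    return added, removed
```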

Removing records is a bit of a hassle since GAE does not provide an
easy-to-use interface for that. It's possible to wipe an entire table
clean by repeatedly deleting 500 records at a time (the entire set is
about 19,000 records), then doing a fresh import. This can be done by
uploading a console application into the cloud.
(http://con.appspot.com/console/help/about ) Alternatively, smaller
sets of records can be deleted via a remove handler, which I haven't
implemented yet.  A script will need to post the data to be removed
against the handler. Will do that though if anybody uses it. User
impact is low if old records aren't removed.

A possible alternative is to have the GAE app periodically verify the
validity of each requested record with a server we'd have to run.
(Pulling the data straight from tictocs.ac.uk doesn't work since it's
larger than what you're allowed to fetch.)
defeat the idea of the cloud since we'd have to rely on keeping that
server operational, albeit at a lower degree of availability and load.

Another potential issue is the quota Google provides: you get 10GBytes
and 1.3M requests free per 24 hour period, then they start charging
you ($.12 per GByte)

I think I mentioned in my post that I included a non-GAE version of
the server that only requires mod_wsgi. For that standalone version,
keeping the data set up to date is implemented by checking the last
mod time of its local copy - it will reread its data when it detects
a more recent jrss.txt in its current directory, so keeping its data
up to date is as simple as periodically curling
http://www.tictocs.ac.uk/text.php

 - Godmar


[CODE4LIB] web services and widgets: MAJAX 2, ticTOC lookup, Link/360 JSON, and Google Book Classes

2009-05-18 Thread Godmar Back
Hi,

I would like to share a few pointers to web services and widgets
Annette and I recently collaborated on. All are available under an
open source license.

Widgets are CSS-styled HTML elements (span or div) that provide
dynamic behavior related to the underlying web service. These are
suitable for non-JavaScript programmers familiar with HTML/CSS.

1. MAJAX 2: Includes a JSON web service (e.g.,
http://libx.lib.vt.edu/services/majax2/isbn/1412936373 or
http://libx.lib.vt.edu/services/majax2/isbn/006073132x?opacbase=http%3A%2F%2Flibcat.lafayette.edu%2Fsearch&jsoncallback=majax.processResults
) and a set of widgets to include results into web pages, see
http://libx.lib.vt.edu/services/majax2/  Supports the same set of
features as MAJAX 1 (libx.org/majax)
Source is at http://code.google.com/p/majax2/

2. ticTOC lookup: is a Google App Engine app that provides a REST
interface to JISC's ticTOC data set that maps ISSN to URLs of table of
contents RSS feeds. See http://tictoclookup.appspot.com/
Example: http://tictoclookup.appspot.com/0028-0836 and optional
refinement by title:
http://tictoclookup.appspot.com/0028-0836?title=Nature
A widget library is available; see
http://laurel.lib.vt.edu/record=b1251610~S7 for a demo (shows floating
tooltips with table of contents preview via Google Feeds and places a
link to RSS feeds)  The source is at
http://code.google.com/p/tictoclookup/ and includes a stand-alone
version of the web service which doesn't use GAE. The widget library
includes support for integration into III's record display.

3. Google Book Classes at http://libx.lib.vt.edu/services/googlebooks/
- these are widgets for Google's Book Search Dynamic Links API.
Noteworthy is support for integration into III's OPAC on the search
results page (briefcit.html), on the so-called bib display page
(bib_display.html) and their WebBridge product via field
selectors, all without JavaScript. Source is at
http://code.google.com/p/googlebooks/

4. A Link/360 JSON Proxy.  See
http://libx.lib.vt.edu/services/link360/index.html
This one takes Serials Solution's Link/360 XML Service and proxies it
as JSON. Currently does not include a widget set. Caches results 24
hours to match db update frequency.  Source is at
http://code.google.com/p/link360/  Could be combined with a widget
library, or programmed against directly, to weave Link/360 holdings data
into pages.

All JSON services accept 'jsoncallback=' for cross-domain client-side
integration.  The libx.lib.vt.edu URLs are ok to use for testing, but
for production use we recommend your own server. All modules are
written in Python as WSGI scripts, requiring setup as simple as
mod_wsgi + .htaccess.
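The 'jsoncallback=' mechanism these services accept is plain JSONP; a minimal sketch of the server side (a hypothetical helper, not the actual service code):

```python
import json

def jsonp_response(data, callback=None):
    """Hypothetical helper: serialize `data` as JSON and, when a
    jsoncallback name was supplied, wrap it in a function call so the
    browser can load it cross-domain via a <script> tag (JSONP)."""
    body = json.dumps(data)
    if callback:
        # The client executes callback({...}) when the script loads.
        return "%s(%s)" % (callback, body), "application/javascript"
    return body, "application/json"
```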

 - Godmar


Re: [CODE4LIB] Q: AtomPub (APP) server libraries for Python?

2009-01-28 Thread Godmar Back
 2) an XML library that doesn't choke on foreign characters. (I assume
 you're using ElementTree now?)

I meant foreign markup, as in foreign to the atom: name space.

Let me give an example. Suppose I want to serve results the way Google
does in YouTube; suppose I want to return XML similar to this one:

http://gdata.youtube.com/feeds/api/videos?vq=triumph+street+triple&racy=include&orderby=viewCount

It contains lots of foreign XML (opensearch, etc.) and it contains
lots of boilerplate (title, link, id, updated, category, etc. etc.)
that must be gotten right to be Atom-compliant. I don't want to
implement any of this.

I'd like to write the minimum amount of code that can turn information
I have in flat files into Atom documents, without having to worry
about the well-formedness or even construction of an Atom feed, or its
internal consistency.
(Perhaps similar to Pilgrim's feedparser, except that this library a)
doesn't handle all of Atom, b) doesn't support foreign XML - in fact,
doesn't even use an XML library), and is generally not intended for
the creation of feeds.
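For scale, here is roughly what a minimal Atom feed takes with only the standard library - which also hints at why hand-maintaining compliance (ids, updated timestamps, foreign markup) gets tedious. This is a sketch with hypothetical names, not a proposed library:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

def make_feed(feed_id, title, entries):
    """Build a minimal Atom feed from plain data (each entry a dict
    with 'id', 'title', 'updated') using only the standard library."""
    ET.register_namespace("", ATOM)  # serialize Atom as the default ns
    feed = ET.Element("{%s}feed" % ATOM)
    ET.SubElement(feed, "{%s}id" % ATOM).text = feed_id
    ET.SubElement(feed, "{%s}title" % ATOM).text = title
    ET.SubElement(feed, "{%s}updated" % ATOM).text = "2009-01-28T00:00:00Z"
    for e in entries:
        entry = ET.SubElement(feed, "{%s}entry" % ATOM)
        ET.SubElement(entry, "{%s}id" % ATOM).text = e["id"]
        ET.SubElement(entry, "{%s}title" % ATOM).text = e["title"]
        ET.SubElement(entry, "{%s}updated" % ATOM).text = e["updated"]
    return ET.tostring(feed, encoding="unicode")
```

None of this validates the feed or handles opensearch/foreign elements, which is exactly the boilerplate a server library would take off one's hands.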

Given the adoption RFC 5023 has seen by major companies, I'm really
surprised at the lack of any supporting server libraries; perhaps not
surprisingly, the same is not true for client libraries.

 - Godmar

On Wed, Jan 28, 2009 at 9:43 AM, Ross Singer rossfsin...@gmail.com wrote:
 Godmar,

 What do you need the library to do?  It seems like you'd be able to
 make an AtomPub server pretty easily with web.py (you could use the
 Jangle Core as a template, it's in Ruby, but the framework it uses,
 Sinatra, is very similar to web.py).

 It seems like there are two things you need here:

 1) something that can RESTfully broker a bunch of incoming HTTP
 requests and return Atom Feeds and Service documents

 Is that right?
 -Ross.

 On Wed, Jan 28, 2009 at 8:13 AM, Godmar Back god...@gmail.com wrote:
 Hi,

 does anybody know or can recommend any server side libraries for
 Python that produce AtomPub (APP)?

 Here are the options I found, none of which appear suitable for what
 I'd like to do:

 amplee: 
 http://mail.python.org/pipermail/python-announce-list/2008-February/006436.html
 django-atompub:  http://code.google.com/p/django-atompub/
 flatatompub http://blog.ianbicking.org/2007/09/12/flatatompub/

 Either they are immature, or require frameworks, or form frameworks,
 and most cannot well handle foreign XML.

  - Godmar




Re: [CODE4LIB] COinS in OL?

2008-12-05 Thread Godmar Back
On Thu, Dec 4, 2008 at 2:31 PM, Jonathan Rochkind [EMAIL PROTECTED] wrote:
 Not that I know of.

 You can say display:none, but that'll probably hide it from LibX etc too.

No, why would it.

BTW, I don't see why screen readers would stumble over this when the
child of the span is empty. Do they try to read empty text?  And if
a COinS is processed, we fix up the title so tooltips show nicely.

 - Godmar


 What is needed is a CSS @media for screen readers, like one exists for
 'print'. So you could have a seperate stylesheet for screenreaders, like you
 can have a seperate stylesheet for print. That would be the right way to do
 it.

 But doesn't exist.

 Jonathan

 Thomas Dowling wrote:

 On 12/04/2008 02:02 PM, Jonathan Rochkind wrote:


 Yeah, I had recently noticed indepedently, been unhappy with the way a
 COinS title shows up in mouse-overs, and is reccommended to be used by
 screen readers. Oops.



  By any chance, do current screen readers honor something like '<span
  class="Z3988" style="speak:none" title="...">'?



 --
 Jonathan Rochkind
 Digital Services Software Engineer
 The Sheridan Libraries
 Johns Hopkins University
 410.516.8886 rochkind (at) jhu.edu



Re: [CODE4LIB] COinS in OL?

2008-12-05 Thread Godmar Back
On Fri, Dec 5, 2008 at 1:14 PM, Ross Singer [EMAIL PROTECTED] wrote:
 On Fri, Dec 5, 2008 at 10:50 AM, Godmar Back [EMAIL PROTECTED] wrote:

 BTW, I don't see why screen readers would stumble over this when the
 child of the span is empty. Do they try to read empty text?  And if
 a COinS is processed, we fix up the title so tooltips show nicely.

 Thinking about this a bit more -- does this leave the COinS in an
 unusable state if some other agent executes after LibX is done?


I spoke too soon. We don't touch the 'title' attribute.

But we put content in the previously empty <span></span>, so there is
a potential problem with a screen reader then. (That content, though,
has its own 'title' attribute.)

 - Godmar


Re: [CODE4LIB] COinS in OL?

2008-12-04 Thread Godmar Back
On Wed, Dec 3, 2008 at 9:12 PM, Ed Summers [EMAIL PROTECTED] wrote:
 On Tue, Dec 2, 2008 at 3:11 PM, Godmar Back [EMAIL PROTECTED] wrote:
 COinS are still needed, in particular in situations in which multiple
 resources are displayed on a page (like, for instance, in the search
 results pages of most online systems or on pages such as
 http://citeulike.org, or in a list of references such as in the
 references section of many Wikipedia pages.)

 JSON is perfectly capable of returning a list of things.


True, but that's beside the point.

The metadata needs to be related to some element on the page, such as
the text in a reference. The most natural way to do this (and COinS
allows this) is to place the COinS next to (for instance) the
reference to which it refers.

 - Godmar


Re: [CODE4LIB] COinS in OL?

2008-12-02 Thread Godmar Back
Having a per-page link to get an alternate representation of a
resource is certainly helpful for some applications, and please do
support it, but don't consider the problem solved.

The primary weakness of this approach is that it works only if a page
is dedicated to a single resource.

COinS are still needed, in particular in situations in which multiple
resources are displayed on a page (like, for instance, in the search
results pages of most online systems or on pages such as
http://citeulike.org, or in a list of references such as in the
references section of many Wikipedia pages.)

 - Godmar

On Mon, Dec 1, 2008 at 11:21 PM, Ed Summers [EMAIL PROTECTED] wrote:
 On Mon, Dec 1, 2008 at 11:05 PM, Karen Coyle [EMAIL PROTECTED] wrote:
 I asked about COinS because it's something I have vague knowledge of. (And I
 assume it isn't too difficult to implement.) However, if there are other
 services that would make a bigger difference, I invite you (all) to speak
 up. It makes little sense to have this large quantity of bib data if it
 isn't widely and easily usable.

 Sorry to be overwhelming. I guess the main thing I wanted to
 communicate is that you could simply add:

   <link rel="alternate" type="application/json"
  href="http://openlibrary.org/api/get?key=/b/{open-library-id}" />

 to the head element in OpenLibrary HTML pages for books, and that
 would go a long way to making machine readable data for books
 discoverable by web clients.

 //Ed



Re: [CODE4LIB] COinS in OL?

2008-12-01 Thread Godmar Back
Correct.

Right now, COinS handling in LibX 1.0 is primitive and always links to
the OpenURL resolver. However, LibX 2.0 will allow customized handling
so that, for instance, ISBN COinS can be treated differently than
dissertation COinS or article COinS.  The framework for this is
already partially in place, so ambitious JavaScript programmers can
implement such custom handling for their extension; with LibX 2.0,
every LibX maintainer will be able to choose their own preferred way
of making use of COinS.

When you place COinS, don't assume it'll only be used by tools that
simply read the info from it - place it in a place in your DOM where
there's some white space, or where placing a small link or icon would
not destroy the look and feel of your interface.

 - Godmar

On Mon, Dec 1, 2008 at 11:45 AM, Stephens, Owen
[EMAIL PROTECTED] wrote:
 LibX uses COinS as well I think - so generally be useful in taking
 people from the global context (Open Library) to the local (via LibX)

 Owen

 Owen Stephens
 Assistant Director: eStrategy and Information Resources
 Central Library
 Imperial College London
 South Kensington Campus
 London
 SW7 2AZ

 t: +44 (0)20 7594 8829
 e: [EMAIL PROTECTED]

 -Original Message-
 From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf
 Of
 Karen Coyle
 Sent: 01 December 2008 16:08
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] COinS in OL?

 I have a question to ask for the Open Library folks and I couldn't
 quite
 figure out where to ask it. This seems like a good place.

 Would it be useful to embed COinS in the book pages of the Open
 Library?
 Does anyone think they might make use of them?

 Thanks,
 kc

 --
 ---
 Karen Coyle / Digital Library Consultant
 [EMAIL PROTECTED] http://www.kcoyle.net
 ph.: 510-540-7596   skype: kcoylenet
 fx.: 510-848-3913
 mo.: 510-435-8234
 



[CODE4LIB] GAE sample (was: a brief summary of the Google App Engine)

2008-07-16 Thread Godmar Back
FWIW, the sample application I built to familiarize myself with GAE is
a simple REST cache. It's written in  250 lines overall, including
Python + YAML.

For instance, a resource such as:
http://www.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=xml&id=3966282
can be accessed via GAE using:
http://libxcache.appspot.com/get?url=http%3a%2f%2fwww.ncbi.nlm.nih.gov%2fentrez%2feutils%2fesummary.fcgi%3fdb%3dpubmed%26retmode%3dxml%26id%3d3966282

Or, you can access:
http://demo.jangle.org/openbiblio/resources/5974
as
http://libxcache.appspot.com/get?url=http%3a%2f%2fdemo.jangle.org%2fopenbiblio%2fresources%2f5974
(To take some load off that Jangle demo, Ross, in case it's slashdotted.)
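The caching idea can be sketched framework-free (an in-memory table standing in for GAE's datastore/memcache, and `fetch` a hypothetical url-to-body callable such as GAE's fetch service; none of this is the actual libxcache code):

```python
import time

class RestCache:
    """Sketch of a REST cache: remember (url -> body) pairs with a TTL,
    consulting the upstream fetch function only on a miss."""
    def __init__(self, fetch, ttl=3600):
        self.fetch = fetch      # callable url -> body (e.g. urlfetch)
        self.ttl = ttl          # seconds a cached body stays fresh
        self._store = {}        # stand-in for memcache/BigTable

    def get(self, url):
        entry = self._store.get(url)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]                      # cache hit
        body = self.fetch(url)                   # miss: go upstream
        self._store[url] = (time.time(), body)
        return body
```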

 - Godmar


Re: [CODE4LIB] a brief summary of the Google App Engine

2008-07-15 Thread Godmar Back
On Tue, Jul 15, 2008 at 2:16 PM, Fernando Gomez [EMAIL PROTECTED] wrote:

 Any thoughts about a convenient way of storing and (more importantly)
 indexing  retrieving MARC records using GAE's Bigtable?


GAE uses Django's object-relational model. You can define a Python
class, inherit from db.model, declare properties of your model; then
instances can be created, stored, retrieved and updated.
GAE performs automatic indexing on some fields, and you can tell it to
index on others, or using certain combinations.

Aside from the limitations imposed by the index model, the problem
then is fundamentally similar to how you index MARC data for use in
any discovery system.  Presumably, you could learn from the
experiences of the many projects that have done that - some in Python,
such as http://code.google.com/p/fac-back-opac/  (though they use
Django, they don't appear to be using its object-relational db model
for MARC records; I say this from a 2 min examination of parts of
their code; I may be wrong. PyMarc itself doesn't support it.)

 - Godmar


[CODE4LIB] a brief summary of the Google App Engine

2008-07-13 Thread Godmar Back
Hi,

since I brought up the issue of the Google App Engine (GAE) (or
similar services, such as Amazon's EC2 Elastic Compute Cloud), I
thought I'd give a brief overview of what it can and cannot do, such
that we may judge its potential use for library services.

GAE is a cloud infrastructure into which developers can upload
applications. These applications are replicated among Google's network
of data centers and they have access to its computational resources.
Each application has access to a certain amount of resources at no
fee; Google recently announced the pricing for applications whose
resource use exceeds the no-fee threshold [1]. The no-fee threshold
is rather substantial: 500MB of persistent storage, and, according to
Google, enough bandwidth and cycles to serve about 5 million page
views per month.

Google Apps must be written in Python. They run in a sandboxed
environment. This environment limits what applications can do and how
they communicate with the outside world.  Overall, the sandbox is very
flexible - in particular, application developers have the option of
uploading additional Python libraries of their choice with their
application. The restrictions lie primarily in security and resource
management. For instance, you cannot use arbitrary socket connections
(all outside world communication must be through GAE's fetch service
which supports http/https only), you cannot fork processes or threads
(which would use up CPU cycles), and you cannot write to the
filesystem (instead, you must store all of your persistent data in
Google's scalable datastorage, which is also known as BigTable.)

All resource usage (CPU, Bandwidth, Persistent Storage - though not
memory) is accounted for and you can see your use in the application's
dashboard control panel. Resources are replenished on the fly where
possible, as in the case of CPU and Bandwidth. Developers are
currently restricted to 3 applications per account. Making
applications in multiple accounts work in tandem to work around quota
limitations is against Google's terms of use.

Applications are described by a configuration file that maps URI paths
to scripts in a manner similar to how you would use Apache
mod_rewrite.  URIs can also be mapped to explicitly named static
resources such as images. Static resources are uploaded along with
your application and, like the application, are replicated in Google's
server network.
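A hedged sketch of such a configuration file (GAE's app.yaml); the application name, URL patterns, and script names here are illustrative, not taken from any real deployment:

```yaml
# Illustrative app.yaml: URI paths mapped to handler scripts and to
# static resources (all names below are made up).
application: example-app
version: 1
runtime: python
api_version: 1

handlers:
- url: /images
  static_dir: static/images
- url: /get.*
  script: cache.py
- url: /.*
  script: main.py
```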

The programming environment is CGI 1.1.  Google suggests, but doesn't
require, the use of supporting libraries for this model, such as WSGI.
 This use of high-level libraries allows applications to be written in
a very compact, high-level style, the way one is used to from Python.
In addition to the WSGI framework, this allows the use of several
template libraries, such as Django.  Since the model is CGI 1.1, there
are no or very little restrictions on what can be returned - you can
return, for instance, XML or JSON and you have full control over the
Content-Type: returned.
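As a minimal sketch of that request-handling style, here is a plain stdlib WSGI application exercised without a server (GAE itself supplies the serving side; the handler content is made up):

```python
# Minimal WSGI application in the style described above; the status line
# and Content-Type are fully under the application's control.
from wsgiref.util import setup_testing_defaults

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "application/json")])
    return [b'{"status": "ok"}']

# Exercise the app without a real server, using a synthetic environ.
environ = {}
setup_testing_defaults(environ)
captured = {}

def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)

body = b"".join(app(environ, start_response))
print(captured["status"], body.decode())
```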

The execution model is request-based.  If a client request arrives,
GAE will start a new instance (or reuse an existing instance if
possible), then invoke the main() method. At this point, you have a
set limit to process this request (though not explicitly stated in
Google's doc, the limit appears to be currently 9 seconds) and return
a result to the client. Note that this per-request limit is a maximum;
you should usually be much quicker in your response. Also note that
any CPU cycles you use during those 9 seconds (but not time spent
waiting for results from other application tiers) count against your
overall CPU budget.

The key service the GAE runtime libraries provide is the Google
datastore, aka BigTable [2].
You can think of this service as a highly efficient, persistent store
for structured data: a simplified database that allows the creation,
retrieval, updating, and deletion (CRUD) of entries using keys and,
optionally, indices. It provides limited support for transactions as
well. Though it is less powerful than
conventional relational databases - which aren't nearly as scalable -
it can be accessed using GQL, a query language that's similar in
spirit to SQL.  Notably, GQL (or BigTable) does not support JOINs,
which means that you will have to adjust your traditional approach to
database normalization.

The Python binding for the structured data is intuitive and seamless.
You simply declare a Python class for the properties of objects you
wish to store, along with the types of the properties you wish
included, and you can subsequently use a put() or delete() method to
write and delete. Queries will return instances of the objects you
placed in a given table.  Tables are named using the Python classes.
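A sketch of that declarative style follows. The real code would start with `from google.appengine.ext import db` and inherit from `db.Model`; since the GAE SDK is assumed here and may not be available, minimal stand-ins are defined so the shape of the API can be shown self-contained.

```python
# Stand-ins for db.StringProperty / db.Model, only so this sketch runs
# outside GAE; with the SDK you would import these from
# google.appengine.ext.db instead.
class StringProperty:
    pass

class Model:
    _store = []          # crude stand-in for the datastore

    def put(self):
        Model._store.append(self)

# The declarative part, as described above (class and property names
# are made up for illustration):
class BibRecord(Model):
    title = StringProperty()
    author = StringProperty()

rec = BibRecord()
rec.title = "Eve's Diary"    # properties behave like plain attributes
rec.author = "Mark Twain"
rec.put()

# With the real SDK, retrieval would use GQL (no JOINs), e.g.:
#   db.GqlQuery("SELECT * FROM BibRecord WHERE author = :1", "Mark Twain")
print(len(Model._store))
```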

Google provides a number of additional runtime libraries, such as for
simple image processing a la Google Picasa, for sending email
(subject to resource limits), and for user authentication, solely
using Google 

Re: [CODE4LIB] anyone know about Inera?

2008-07-12 Thread Godmar Back
Min, Eric, and others working in this domain -

have you considered designing your software as a scalable web service
from the get-go, using such frameworks as Google App Engine? You may
be able to use Montepython for the CRF computations
(http://montepython.sourceforge.net/)

I know Min offers a WSDL wrapper around their software, but that's
simply a gateway to one single-machine installation, and it's not
intended as a production service at that.

 - Godmar

On Sat, Jul 12, 2008 at 3:20 AM, Min-Yen Kan [EMAIL PROTECTED] wrote:
 Hi Steve, all:

 I'm the key developer of ParsCit.  I'm glad to hear your feedback
 about what doesn't work with ParsCit.  Erik is correct in saying that
 we have only trained the system for what data we have correct answers
 for, namely computer science.  As such it doesn't perform well with
 other data (especially health sciences citations, on which we have
 also done some pilot tests).  I note that there are other citation
 parsers out there, including Erik's own HMM parser (I think Erik
 mentioned it as well; it's available from his website here:
 http://gales.cdlib.org/~egh/hmm-citation-extractor/)

 Anyways, I've tried your citation too, and got the same results from
 the demo -- it doesn't handle the authors correctly in this case.  I
 would very much love to have as many example cases of incorrectly
 parsed citations as the community is willing to share with us so we
 can improve ParsCit (it's open source so all can benefit from
 improvements to ParsCit).

 We are trying to be as proactive as possible about maintaining and
 improving ParsCit.  I know of at least two groups that have said they
 are willing to contribute more citations (with correct markings) to us
 so that we can re-train ParsCit, and there is interest in porting it
 to other languages (i.e. German right now).  We would love to get
 samples of your data too, where the program does go wrong, to help
 improve our system.  And to get feedback on other fields that need to
 be parsed as well: ISSNs, ISBNs, volumes, and issues.

 We are also looking to make the output of the ParsCit system
 compatible with EndNote, BibTeX.  We actually have an internal project
 to try to hook up ParsCit to find references on arbitrary web pages
 (to form something like Zotero that's not site specific and
 non-template based).  If and when this project comes to fruition we'll
 be announcing it to the list.

 If anyone has used ParsCit and has feedback on what can be further
 improved we'd love to hear from you.  You are our target audience!

 Cheers,

 Min

 --
 Min-Yen KAN (Dr) :: Assistant Professor :: National University of
 Singapore :: School of Computing, AS6 05-12, Law Link, Singapore
 117590 :: 65-6516 1885(DID) :: 65-6779 4580 (Fax) ::
 [EMAIL PROTECTED] (E) :: www.comp.nus.edu.sg/~kanmy (W)

 PS: Hi Erik, still planning on studying your HMM package for improving
 ParsCit ... It's on my agenda.
 Thanks again.

 On Sat, Jul 12, 2008 at 5:36 AM, Steve Oberg [EMAIL PROTECTED] wrote:
 Yeah, I am beginning to wonder, based on these really helpful replies, if I
 need to scale back to what is doable and reasonable. And reassess
 ParsCit.

 Thanks to all for this additional information.

 Steve

 On Fri, Jul 11, 2008 at 4:18 PM, Nate Vack [EMAIL PROTECTED] wrote:

 On Fri, Jul 11, 2008 at 3:57 PM, Steve Oberg [EMAIL PROTECTED] wrote:

  I fully realize how much of a risk that is in terms of reliability and
  maintenance.  But right now I just want a way to do this in bulk with a
 high
  level of accuracy.

 How bad is it, really, if you get some (5%?) bad requests into your
 document delivery system? Customers submit poor quality requests by
 hand with some frequency, last I checked...

 Especially if you can hack your system to deliver the original
 citation all the way into your doc delivery system, you may be able to
 make the case that 'this is a good service to offer; let's just deal
 with the bad parses manually.'

 Trying to solve this via pure technology is gonna get into a world of
 diminishing returns. A surprising number of citations in references
 sections are wrong. Some correct citations are really hard to parse,
 even by humans who look at a lot of citations.

 ParsCit has, in my limited testing, worked as well as anything I've
 seen (commercial or OSS), and much better than most.

 My $0.02,
 -Nate





Re: [CODE4LIB] use of OpenSearch response elements in libraries?

2008-06-24 Thread Godmar Back
[ this discussion may be a bit too detailed for the general readership of
code4lib; readers not interested in the upcoming WC search API may wish to
skip... ]

Roy,

Atom/RSS are simply the container formats used to return multiple items of
some kind --- I'm curious about what those items contain.

In the example shown in
http://worldcat.org/devnet/index.php/SearchAPIDetails#Using_OpenSearch it
appears that the items are only preformatted citations, rather than, for
instance, MARCXML or DC representation of records.  (The SRU interface, on
the other hand, appears to return MARCXML/DC.)  Is this impression false and
does the OpenSearch API in fact return record metadata beyond preformatted
citations? (I note that your search syntax for OpenSearch does not allow
the choice of a recordSchema.)

If so, what's the rationale for not supporting the retrieval of record
metadata via OpenSearch?

 - Godmar

On Tue, Jun 24, 2008 at 10:17 AM, Roy Tennant [EMAIL PROTECTED] wrote:

 To be specific, currently supported record formats for an OpenSearch query
 of the WorldCat API are Atom and RSS as well as the preformatted citation.
 Roy


  On 6/23/08 10:18 PM, Godmar Back [EMAIL PROTECTED] wrote:

  Thanks --- let me do some query refinement then -- does anybody know of
  examples where record metadata (e.g., MARCXML or DC) is returned as an
  OpenSearch response?  [ If I understand the proposed Worldcat API
 correctly,
  OpenSearch is used only for pre-formatted citations in HTML. ]
 
   - Godmar
 
  On Tue, Jun 24, 2008 at 12:54 AM, Roy Tennant [EMAIL PROTECTED] wrote:
 
  I believe WorldCat qualifies, although the API is not yet ready for
 general
  release (but soon):
 
  http://worldcat.org/devnet/index.php/SearchAPIDetails
 
  Roy
 
 
  On 6/23/08 8:55 PM, Godmar Back [EMAIL PROTECTED] wrote:
 
  Hi,
 
  are there any examples of functioning OpenSearch interfaces to library
  catalogs or library information systems?
 
  I'm specifically interested in those that not only advertise a
 text/html
  interface to their catalog, but who include OpenSearch response
 elements.
  One example I've found is Evergreen; though it's not clear to what
 extent
  this interface is used or implemented. For instance, their demo
  installation's OpenSearch description advertises an ATOM feed, but
 what's
  returned doesn't validate. (*)
 
  Are there other examples deployed (and does anybody know applications
  that
  consume OpenSearch feeds?)
 
   - Godmar
 
  (*) See, for instance:
  http://demo.gapines.org/opac/extras/opensearch/1.1/PINES/atom-full/keyword/?searchTerms=music&startPage=&startIndex=&count=&searchLang
  which is not a valid ATOM feed:
  http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fdemo.gapines.org%2Fopac%2Fextras%2Fopensearch%2F1.1%2FPINES%2Fatom-full%2Fkeyword%2F%3FsearchTerms%3Dmusic%26startPage%3D%26startIndex%3D%26count%3D%26searchLang
 
 
  --
 
 

 --



Re: [CODE4LIB] use of OpenSearch response elements in libraries?

2008-06-24 Thread Godmar Back
I too find this decision intriguing, and I'm wondering about its wider
implications on the use of RSS/Atom as a container format inside and
outside the context of OpenSearch as it relates to library systems.

I note that an OpenSearch description does not allow you to specify
the type of the items contained within an RSS or Atom feed being
advertised. As such, it's impossible to advertise multiple output
formats within a single OpenSearchDescription (specifically, you can
only have 1 Url element with 'type=application/rss+xml').
Therefore, clients consuming OpenSearch must be prepared to interpret
the record types correctly, but cannot learn from the server a priori
what those are.
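For illustration, here is a minimal OpenSearch description document showing the constraint: the Url element's type names the container format (here RSS), but says nothing about the record format carried inside the feed's items. The names and URL template are made up:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Example Catalog</ShortName>
  <Description>Illustrative description document</Description>
  <!-- Only one Url per type; the item-level record format is not declared -->
  <Url type="application/rss+xml"
       template="http://catalog.example.org/search?q={searchTerms}&amp;page={startPage?}"/>
</OpenSearchDescription>
```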

My guess would be that OCLC is shooting for OpenSearch consumers that
expect RSS/Atom feeds and that have some generic knowledge on how to
process items that contain, for instance, HTML; but at the same time
are unprepared to handle MARCXML or other metadata formats. Examples
may include Google Reader or the A9 metasearch engine.

The alternative, SRU, contains no expectation that items be processed
by clients that are unaware of library metadata formats. In addition,
its 'explain' verb allows clients to learn which metadata formats they
can request.

This may be reviving a discussion that an Internet search shows was
very active in the community about 4 years ago, although 4 years
later, I was unable to find out the outcome of this discussion, so it
may be good to capture the current thinking.

What client applications currently consume OpenSearch results vs. what
client applications consume SRU results?

I understand that a number of ILS vendors besides OCLC have already or
are in the process of providing web services interfaces to their
catalog; do they choose OpenSearch and/or SRU, or a heterogeneous mix
in the way OCLC does? If they choose OpenSearch, do they use RSS or
ATOM feeds to carry metadata records?

 - Godmar

On Tue, Jun 24, 2008 at 1:23 PM, Jonathan Rochkind [EMAIL PROTECTED] wrote:
 In general, is there a reason to have different metadata formats from SRU vs
 OpenSearch? Is there a way to just have the same metadata formats available
 for each? Or are the demands of each too different to just use the same
 underlying infrastructure, such that it really does take more work to
 include a metadata format as an OpenSearch option even if it's already been
 included as an SRU option?

 Personally, I'd like these alternate access methods to still have the same
 metadata format options, if possible. And other options. Everything should
 be as consistent as possible to avoid confusion.

 Jonathan

 Washburn,Bruce wrote:

 Godmar,

 I'm one of the developers working on the WorldCat API.  My take is that
 the API is evolving and adapting as we learn more about how it's
 expected to be used.  We haven't precluded the addition of more record
 metadata to OpenSearch responses; we opted not to implement it until we
 had more evidence of need.
 As you've noted, WorldCat API OpenSearch responses are currently limited
 to title and author information plus a formatted bibliographic citation,
 while more complete record metadata is available in DC or MARC XML in
 SRU responses. Until now we had not seen a strong push from the API
 early implementers for more record metadata in OpenSearch responses,
 based on direct feedback and actual use.  I can see how it could be a
 useful addition, though, so we'll look into it.

 Bruce



 --
 Jonathan Rochkind
 Digital Services Software Engineer
 The Sheridan Libraries
 Johns Hopkins University
 410.516.8886 rochkind (at) jhu.edu



Re: [CODE4LIB] use of OpenSearch response elements in libraries?

2008-06-23 Thread Godmar Back
Thanks --- let me do some query refinement then -- does anybody know of
examples where record metadata (e.g., MARCXML or DC) is returned as an
OpenSearch response?  [ If I understand the proposed Worldcat API correctly,
OpenSearch is used only for pre-formatted citations in HTML. ]

 - Godmar

On Tue, Jun 24, 2008 at 12:54 AM, Roy Tennant [EMAIL PROTECTED] wrote:

 I believe WorldCat qualifies, although the API is not yet ready for general
 release (but soon):

 http://worldcat.org/devnet/index.php/SearchAPIDetails

 Roy


  On 6/23/08 8:55 PM, Godmar Back [EMAIL PROTECTED] wrote:

  Hi,
 
  are there any examples of functioning OpenSearch interfaces to library
  catalogs or library information systems?
 
  I'm specifically interested in those that not only advertise a text/html
  interface to their catalog, but who include OpenSearch response elements.
  One example I've found is Evergreen; though it's not clear to what extent
  this interface is used or implemented. For instance, their demo
  installation's OpenSearch description advertises an ATOM feed, but what's
  returned doesn't validate. (*)
 
  Are there other examples deployed (and does anybody know applications
 that
  consume OpenSearch feeds?)
 
   - Godmar
 
  (*) See, for instance:
  http://demo.gapines.org/opac/extras/opensearch/1.1/PINES/atom-full/keyword/?searchTerms=music&startPage=&startIndex=&count=&searchLang
  which is not a valid ATOM feed:
  http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fdemo.gapines.org%2Fopac%2Fextras%2Fopensearch%2F1.1%2FPINES%2Fatom-full%2Fkeyword%2F%3FsearchTerms%3Dmusic%26startPage%3D%26startIndex%3D%26count%3D%26searchLang
 

 --



Re: [CODE4LIB] Open Source Repositories

2008-05-16 Thread Godmar Back
Generally, you won't find a credible site that would allow you to
upload unvetted binaries of adapted versions of low-volume software.
The obvious risks are just too high.

My recommendation would be a personal webpage, hosted on a site that's
associated with a real-world institution, and a real-world contact.

 - Godmar

On Fri, May 16, 2008 at 10:24 AM, Carol Bean [EMAIL PROTECTED] wrote:
 I probably should clarify that the friend is looking for a place to share
 what she's already fixed and compiled to run on a low resource machine (both
 in Windows and Linux)

 Thanks,
 Carol

 On Fri, May 16, 2008 at 9:52 AM, MJ Ray [EMAIL PROTECTED] wrote:

 Carol Bean [EMAIL PROTECTED] wrote:
  Does anyone know of open source repositories that have precompiled
  software?  (Especially low resource software)

 As well as their own, most of the free software operating systems have
 third-party repositories, such as those listed at
 http://www.apt-get.org/ for debian.

 Make sure you trust the third party provider, though!

 Regards,
 --
 MJ Ray (slef)
 Webmaster for hire, statistician and online shop builder for a small
 worker cooperative http://www.ttllp.co.uk/ http://mjr.towers.org.uk/
 (Notice http://mjr.towers.org.uk/email.html) tel:+44-844-4437-237




 --
 Carol Bean
 [EMAIL PROTECTED]



Re: [CODE4LIB] google books and OCLC numbers

2008-05-08 Thread Godmar Back
Mark,

I'll answer this one on list, but let's take discussion that is
specifically related to GBS classes off-list since you're asking
questions about this particular software --- I had sent the first
email to Code4Lib because I felt that our method of integrating the
Google Book viewability API into III Millennium in a clean way was
worth sharing with the community.

On Thu, May 8, 2008 at 10:07 AM, Custer, Mark [EMAIL PROTECTED] wrote:
 Slide 4 in that PowerPoint mentions something about a small set of
  Google Book Search information, but it also says that the items are
  indexed by ISBN, OCLC#, and LCCN.  And yet, during the admittedly brief
  time that I tried out this really nice demo, I was unable to find any
  links to books that were available in full view, which made me wonder
  if any of the search results were searching GBS with their respective
  OCLC #s (and not just ISBNs, if available).

GBS searches by whatever you tell it: ISBN, OCLC, *OR* LCCN. Not all of them.


  For example, if I use the demo site that's provided and search for mark
  twain and limit my results to publication dates of, say, 1860-1910, I
  don't receive a single GBS link.  So I checked to see if Eve's Diary
  was in GBS and, of course, it was... and then I made sure that the copy
  I found in the demo had the same OCLC# as the one in GBS; and it was.
  So, is this a feature that will be added later, or is it just that the
  entire set of bib records available at the demo site are not included in
  the GBS aspect of the demo?

By "demo site provided", do you mean addison.vt.edu:2082?
Remember that in this demo, the link is only displayed if Google has a
partial view, and *not* if Google has full text or no view. It's my
understanding that Twain's books are past copyright, so Google has
fully scanned them and they are available as full text.

If you take that into account, Eve's Diary (OCLC# 01052228) works
fine. I added it at the bottom of http://libx.org/gbs/tests.html
To search for this book by OCLC, you'd use this span:

<span title="OCLC:01052228" class="gbs-thumbnail gbs-link-to-preview gbs-if-partial-or-full">Eve's Diary</span>

which links to the full text version. Note that --- interestingly ---
Google does not appear to have a thumbnail for this book's cover.


  Secondly, I have another question which I hope that someone can clear up
  for me.  Again, I'll use this copy of Eve's Diary as an example, which
  has an  OCLC number of 01052228.  Now, if you search worldcat.org (using
  the advanced search, basic search, or even adding things like oclc:
  before the number), the only way that I can access this item is to
  search for 1052228 (removing the leading zero).  And this is exactly
  how the OCLC number displays in the metadata record, directly below the
  field that states that there are 18 editions of this work.

  All of that said, I can still access the book with either of these URLs:

  http://worldcat.org/wcpa/oclc/1052228
  http://worldcat.org/wcpa/oclc/01052228

  Now, I could've sworn that GBS followed a similar route, and so, I
  previously searched it for OCLC numbers by removing any leading zeroes.
  As of at least today, though, the only way for me to access this book
  via GBS is to use the OCLC number as it appears in the MARC record...
  that is, by searching for oclc01052228.

  Has anyone else noticed this change in GBS (though it's quite possible
  that I'm simply mistaken)?  And could anyone inform me about the
  technical details of any of these issues?  I mean, I get that worldcat
  has to also deal with ISSNs, but is there a way to use the search box to
  explicitly declare what type of number the query is... and why would the
  value need to have any leading 0's removed in the metadata display
  (especially since the URL method can access either)?


That's a question about the search interface accessed at
books.google.com, not about the book viewability API. Those are two
different services. The viewability API advertises that it supports
OCLC: and LCCN: prefixes to search for OCLC and LCCN, respectively, in
addition to ISBNs, and that works in your example, for instance,
visit:

http://books.google.com/books?jscmd=viewapi&bibkeys=OCLC:01052228&callback=X
or
http://books.google.com/books?jscmd=viewapi&bibkeys=OCLC:1052228&callback=X
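A hedged sketch of consuming such a response: the service returns JSON wrapped in the named callback (JSONP), which a non-browser client has to unwrap itself. The sample response text below is illustrative, not captured from Google.

```python
# Build a viewability-API request URL and unwrap a JSONP response.
import json
from urllib.parse import urlencode

def viewability_url(bibkey, callback="X"):
    qs = urlencode({"jscmd": "viewapi", "bibkeys": bibkey, "callback": callback})
    return "http://books.google.com/books?" + qs

url = viewability_url("OCLC:01052228")

# Illustrative response shape, wrapped in the callback name:
sample = 'X({"OCLC:01052228": {"bib_key": "OCLC:01052228", "preview": "full"}});'

def unwrap_jsonp(text, callback="X"):
    # Strip the leading 'X(' and the trailing ');', then parse the JSON.
    body = text[len(callback) + 1 : text.rindex(")")]
    return json.loads(body)

info = unwrap_jsonp(sample)
print(info["OCLC:01052228"]["preview"])
```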

The books.google.com search interface doesn't advertise the ability to
search by OCLC number --- the only reason you are successful with
searching for OCLC01052228 is because this string happens to occur
somewhere in this book's metadata description, and Google has the full
content of the metadata descriptions indexed like it indexed webpages.

Take also a look at the advanced search interface at:
http://books.google.com/advanced_book_search
You'll find no support for OCLC or LCCN. It does show, however, that
isbn: can be used to search for ISBNs, in the style in which prefixes
are used in other search interfaces.

 - Godmar


[CODE4LIB] google books for III millennium

2008-05-06 Thread Godmar Back
Hi,

here's a pointer to follow up on the earlier discussion on how to
integrate Google books viewability API into closed legacy systems that
allow only limited control regarding what is being output, such as
III's Millennium system. Compared to other solutions, no JavaScript
programming is required, and the integration into the vendor-provided
templates (such as briefcit.html etc.) is reasonably clean, provides
targeted placement, and allows for multiple uses per page.

Slides (excerpted from Annette Bailey's presentation at IUG 2008):
http://libx.org/gbs/GBSExcerptFromIUGTalk2008.ppt
A demo is currently available here: http://addison.vt.edu:2082/

 - Godmar


[CODE4LIB] coverage of google book viewability API

2008-05-06 Thread Godmar Back
Hi,

to examine the usability of Google's book viewability API when lookup
is done via ISBN, we did some experiments, the results of which I'd
like to share. [1]

For 1000 ISBNs randomly drawn from the 3,192,809 ISBNs extracted from a
snapshot of LoC's records [2], Google Books returned results for 852
ISBNs.  We then downloaded the page that was referred to in the
info_url parameter of the response (which is the About page Google
provides) for each result.

To examine whether Google retrieved the correct book, we checked if
the Info page contained the ISBN for which we'd searched. 815 out of
852 contained the same ISBN. 37 results referred to a different ISBN
than the one searched for.  We examined the 37 results manually: 33
referred to a different edition of the book whose ISBN was used to
search, as judged by comparing author/title information with OCLC's
xISBN service. (We compared the author/title returned by xISBN with
the author/title listed on Google's book information page.)  4 records
appeared to be misindexed.

I found the results (85.2% recall and 99% precision, if you allow for
the ISBN substitution; with a 3.1% margin of error) surprisingly high.

 - Godmar

[1] http://top.cs.vt.edu/~gback/gbs-accuracy-study/
[2] http://www.archive.org/details/marc_records_scriblio_net


Re: [CODE4LIB] google books for III millennium

2008-05-06 Thread Godmar Back
Kent,

the link you provide is for the Google API --- however, I was
referring to the Google Book Viewability API. They're unrelated, to my
knowledge.

My experience with the Google Book Viewability API is that it can be
invoked server-side (Google's terms notwithstanding), but requires a
user-agent that mimics an existing browser. A user agent such as the
one provided by Sun's JDK (I think it's jdk-1.6 or some such) will
be rejected; a referrer URL, on the other hand, does not appear to be
required.
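A sketch of such a server-side call with a browser-like User-Agent, as described above (the UA string is illustrative, and the actual network fetch is left commented out):

```python
# Prepare a server-side request that mimics a browser's User-Agent.
import urllib.request

req = urllib.request.Request(
    "http://books.google.com/books?jscmd=viewapi&bibkeys=OCLC:1052228&callback=X",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})

# The actual fetch (network access; from GAE you would use its fetch
# service instead) would be:
#   data = urllib.request.urlopen(req).read()
print(req.get_header("User-agent"))
```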

 - Godmar

On Tue, May 6, 2008 at 6:32 PM, Kent Fitch [EMAIL PROTECTED] wrote:
 Hi Jonathan,

  The Google API can now be invoked guilt-free from server-side, see:

  http://code.google.com/apis/ajaxsearch/documentation/#fonje

  For Flash developers, and those developers that have a need to access
  the AJAX Search API from other Non-Javascript environments, the API
  exposes a simple RESTful interface. In all cases, the method supported
  is GET and the response format is a JSON encoded result set with
  embedded status codes. Applications that use this interface must abide
  by all existing terms of use. An area to pay special attention to
  relates to correctly identifying yourself in your requests.
  Applications MUST always include a valid and accurate http referer
  header in their requests. In addition, we ask, but do not require,
  that each request contains a valid API Key. By providing a key, your
  application provides us with a secondary identification mechanism that
  is useful should we need to contact you in order to correct any
  problems.

  Well, guilt-free if you agree to the terms, which include:

  The API may be used only for services that are accessible to your end
  users without charge.

  You agree that you will not, and you will not permit your users or
  other third parties to: (a) modify or replace the text, images, or
  other content of the Google Search Results, including by (i) changing
  the order in which the Google Search Results appear, (ii) intermixing
  Search Results from sources other than Google, or (iii) intermixing
  other content such that it appears to be part of the Google Search
  Results; or (b) modify, replace or otherwise disable the functioning
  of links to Google or third party websites provided in the Google
  Search Results.

  Regards,

  Kent Fitch



  On Wed, May 7, 2008 at 7:53 AM, Jonathan Rochkind [EMAIL PROTECTED] wrote:
   This is interesting. These slides don't give me quite enough info to
figure out what's going on (I hate reading slides by themselves!), but
I'm curious about this statement: Without JavaScript coding
(even though Google's API requires JavaScript coding as it is) . Are
you making calls server-side, or are you still making them client-side?
  
As you may recall, one issue I keep beating upon is the desire to call
Google's API server-side. While it's technically possible to call it
server-side, Google doesn't want you to. I wonder if this is what
they're doing there? The problems with that are:
  
1) It may violate Google's terms of service
2) It may run up against Google traffic-limiting defenses
3) [Google's given reason]: It doesn't allow Google to tailor the
results to the end-users location (determined by IP).
  
Including an x-forwarded-for header _may_ get around #2 or #3. Including
an x-forwarded-for header should probably be considered a best practice
when doing this sort of thing server-side in general, but I'm still
nervous about doing this, and wish that Google would just plain say they
allow server-side calls.
  
  
  
  
  
Godmar Back wrote:
  
Hi,
   
here's a pointer to follow up on the earlier discussion on how to
integrate Google books viewability API into closed legacy systems that
allow only limited control regarding what is being output, such as
III's Millennium system. Compared to other solutions, no JavaScript
programming is required, and the integration into the vendor-provided
templates (such as briefcit.html etc.) is reasonably clean, provides
targeted placement, and allows for multiple uses per page.
   
Slides (excerpted from Annette Bailey's presentation at IUG 2008):
http://libx.org/gbs/GBSExcerptFromIUGTalk2008.ppt
A demo is currently available here: http://addison.vt.edu:2082/
   
 - Godmar
   
   
   
  
--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu
  



Re: [CODE4LIB] google books for III millennium

2008-05-06 Thread Godmar Back
The solution is entirely client-side; as it has to be for this
particular kind of legacy system. (In some so-called turn-key
versions, this particular company does not even provide access to the
server's file system, let alone the option of running any services.)

We had already discussed how it works (check the threads from March);
this particular pointer was simply a pointer about how to integrate it
into this particular system (since there were doubts back then about
how hard or easy such integration is.)

 - Godmar

On Tue, May 6, 2008 at 5:53 PM, Jonathan Rochkind [EMAIL PROTECTED] wrote:
 This is interesting. These slides don't give me quite enough info to
  figure out what's going on (I hate reading slides by themselves!), but
  I'm curious about this statement: "Without JavaScript coding"
  (even though Google's API requires JavaScript coding as it is). Are
  you making calls server-side, or are you still making them client-side?

  As you may recall, one issue I keep beating upon is the desire to call
  Google's API server-side. While it's technically possible to call it
  server-side, Google doesn't want you to. I wonder if this is what
  they're doing there? The problems with that are:

  1) It may violate Google's terms of service
  2) It may run up against Google traffic-limiting defenses
  3) [Google's given reason]: It doesn't allow Google to tailor the
  results to the end-users location (determined by IP).

  Including an x-forwarded-for header _may_ get around #2 or #3. Including
  an x-forwarded-for header should probably be considered a best practice
  when doing this sort of thing server-side in general, but I'm still
  nervous about doing this, and wish that Google would just plain say they
  allow server-side calls.





  Godmar Back wrote:

  Hi,
 
  here's a pointer to follow up on the earlier discussion on how to
  integrate Google books viewability API into closed legacy systems that
  allow only limited control regarding what is being output, such as
  III's Millennium system. Compared to other solutions, no JavaScript
  programming is required, and the integration into the vendor-provided
  templates (such as briefcit.html etc.) is reasonably clean, provides
  targeted placement, and allows for multiple uses per page.
 
  Slides (excerpted from Annette Bailey's presentation at IUG 2008):
  http://libx.org/gbs/GBSExcerptFromIUGTalk2008.ppt
  A demo is currently available here: http://addison.vt.edu:2082/
 
   - Godmar
 
 
 

  --
  Jonathan Rochkind
  Digital Services Software Engineer
  The Sheridan Libraries
  Johns Hopkins University
  410.516.8886
  rochkind (at) jhu.edu



Re: [CODE4LIB] coverage of google book viewability API

2008-05-06 Thread Godmar Back
On Tue, May 6, 2008 at 11:02 PM, Michelle Watson
[EMAIL PROTECTED] wrote:

  Is there something in the code that prevents the link from being
  offered unless it goes to at least a partial preview (which I take to
  mean scanned pages), or have I just been lucky in my searching?  I
  can't comment on whether or not the 'no preview'  is useful because
  every book I see has some scanned content.


Yes, in Annette's example, the link is only offered if Google has
preview pages in addition to the book information. See the docs on
libx.org/gbs for further detail (look for gbs-if-partial).

I had the same subjective impression in that I was surprised by how
many books have previews - for instance, if I search for genomics on
addison.vt.edu:2082, 24 of the first 50 hits returned have partial
previews. Incidentally, 2 out of the 24 lead to the wrong book.
This is why I sampled the LoC's ISBN set.

It's likely that there's observer bias (such as trying genomics),
and it's also possible that Google is more likely to have previews for
books libraries tend to hold, such as popular or recent books. (I note
that most of the 24 hits for genomics that have previews are less than
4 years old.)
Conversely, for those recent years, precision may be lower, with more
books misindexed.

 - Godmar


[CODE4LIB] how to obtain a sampling of ISBNs

2008-04-28 Thread Godmar Back
Hi,

for an investigation/study, I'm looking to obtain a representative
sample set (say a few hundreds) of ISBNs. For instance, the sample
could represent LoC's holdings (or some other acceptable/meaningful
population in the library world).

Does anybody have any pointers/ideas on how I might go about this?

Thanks!

 - Godmar


Re: [CODE4LIB] how to obtain a sampling of ISBNs

2008-04-28 Thread Godmar Back
Hi,

thanks to everybody who's replied with offers to provide ISBNs.

I need to clarify that I'm looking for a sample of ISBNs that is
representative of some larger population, such as all books cataloged
by LoC, or all books in library X's catalog, or all books sold by
Amazon.

It could be, for instance, a simple random sample [1].

What will not work are ISBNs coming from a FRBR service, from a
specialized collection, or the first n ISBNs coming from a catalog
dump (unless the order in which the catalog database is dumped is
explicitly random).

 - Godmar

[1] http://en.wikipedia.org/wiki/Simple_random_sample
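(Editorial note: one standard single-pass way to meet the constraint above, drawing a simple random sample from a catalog dump whose length and order are unknown in advance, is reservoir sampling. A minimal Java sketch; the class and method names are illustrative, not anything from the thread:)

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class ReservoirSample {

    // Draw a simple random sample of size k from a stream of unknown
    // length (e.g. ISBNs read line by line from a catalog dump),
    // in one pass and without holding the whole dump in memory.
    public static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item);              // fill the reservoir first
            } else {
                // keep the new item with probability k / seen,
                // replacing a uniformly chosen reservoir slot
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }
}
```

Each item in the stream ends up in the result with equal probability k/N, regardless of dump order, which is exactly the "simple random sample" property asked for.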

On Mon, Apr 28, 2008 at 10:40 AM, Shanley-Roberts, Ross A. Mr.
[EMAIL PROTECTED] wrote:
 I could give you any number of sets of isbns. What kind of material are you 
 interested in: videos, books, poetry, electronic resources, etc., or I could 
 supply a set of isbns for any subject area or LC classification area that you 
 might be interested in.

  Ross


  Ross Shanley-Roberts
  Special Projects Technologist
  Miami University Libraries
  Oxford, OH 45056
  [EMAIL PROTECTED]
  847 672-9609
  847 894-3911 cell




  -Original Message-
  From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Godmar Back
  Sent: Monday, April 28, 2008 8:35 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: [CODE4LIB] how to obtain a sampling of ISBNs



 Hi,

  for an investigation/study, I'm looking to obtain a representative
  sample set (say a few hundreds) of ISBNs. For instance, the sample
  could represent LoC's holdings (or some other acceptable/meaningful
  population in the library world).

  Does anybody have any pointers/ideas on how I might go about this?

  Thanks!

   - Godmar



Re: [CODE4LIB] Serials Solutions 360 API - PHP classes?

2008-04-03 Thread Godmar Back
Could you share, briefly, what this API actually does (if doing so
doesn't violate your NDA)?

 - Godmar

On Thu, Apr 3, 2008 at 1:40 PM, Yitzchak Schaffer [EMAIL PROTECTED] wrote:
 
  From: Code for Libraries on behalf of Yitzchak Schaffer
  Sent: Wed 4/2/2008 12:28 PM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: [CODE4LIB] Serials Solutions 360 API - PHP classes?
 
 
 
  Does anyone have/know of PHP classes for searching the Serials Solutions
  360 APIs, particularly Search?
 

  Okay, having not heard any affirmatives, I'm starting work on this.  I'm
  an OOP and PHP noob, so I'm donning my flak jacket/dunce cap in advance,
  but I'll try to make this as useful to the community and comprehensive
  as time and my ability allow.  Assuming that Serials Solutions will
  allow some kind of sharing for these - they make clients sign an NDA
  before they show you the docs.  I'm waiting to hear their response; I
  would be surprised if they wouldn't allow sharing of something like this
  among clients.



  --
  Yitzchak Schaffer
  Systems Librarian
  Touro College Libraries
  33 West 23rd Street
  New York, NY 10010
  Tel (212) 463-0400 x230
  Fax (212) 627-3197
  [EMAIL PROTECTED]



Re: [CODE4LIB] Google Book Search API - JavaScript Query

2008-03-26 Thread Godmar Back
On Thu, Mar 20, 2008 at 12:44 PM, KREYCHE, MICHAEL [EMAIL PROTECTED] wrote:
  -Original Message-
   From: Code for Libraries [mailto:[EMAIL PROTECTED] On

  Behalf Of Godmar Back
   Sent: Thursday, March 20, 2008 10:45 AM
   To: CODE4LIB@LISTSERV.ND.EDU
   Subject: Re: [CODE4LIB] Google Book Search API - JavaScript Query
  

  Have you tried placing your code in an window.onload handler?
Read the example I created at libx.org/gbs and if that works
   for you in IE6, use the technique there. (Or you may just use
   the entire script - it seems you're reimplementing a lot of
   it anyway.)

  I'll have to study that a bit and see how it works. I was aiming for a
  solution with a minimal amount of code, but perhaps a more robust
  approach like yours is in order.


I looked into this issue some more and would like to share a bit of what I learned.

The short answer is: use jQuery (or a library like it).

The longer answer is that window.onload (even the browser-compatible
version using the addEvent function described at
http://www.dustindiaz.com/rock-solid-addevent/) won't fire until all
images on a page have loaded, which can incur significant latency -
especially if there's a large number of embedded objects from
different origins on the page, some of which may be stragglers when
loading.

Instead, what you want is a browser-compatible notification when the
HTML content has been downloaded and parsed into the DOM tree
structure, that is, when the document is ready and it's safe to
manipulate it using such methods as getElementById(). Implementing
this notification requires a variety of browser-specific hacks (google
for details, or examine 'bindReady' in jQuery for a distilled summary
of the collective experience.)

jQuery implements those hacks and hides them from you, so in jQuery,
it's as simple as saying
$(function () { /* insert work here */ });

jQuery will determine when the document is ready and execute your
anonymous function then, which is at the earliest possible time. If no
hack is known for a particular platform, jQuery falls back to the
load handler.
If you think about it, that's not something you want to implement or
even think about.

(Note that jQuery's init constructor, which is what the $ symbol is
bound to, behaves differently depending on the type of its first argument.  If the
argument is a function, it means to call the function when the
document is ready. An alternate syntax is $().ready(function ...) -
which relies on jQuery substituting 'document' if the first argument
is not given.  The most readable syntax may be:
$(document).ready(function ...) though $(function ...) may make for
a good idiom.)

 - Godmar


Re: [CODE4LIB] Google Book Search API - JavaScript Query

2008-03-20 Thread Godmar Back
I didn't mean window.onload literally; use a browser-compatible
version of it [jQuery, btw, would figure that out automatically for
you, so if you can integrate jQuery in your page, you may want to try
Matt's plugin.]

My prototype uses a function called addEvent from Dustin Diaz, see
http://www.dustindiaz.com/rock-solid-addevent; I think it uses
'attachEvent' in IE6, which appears to work.
I'm also using it in Majax (libx.org/majax) and it works there as well in IE6.

 - Godmar

On Thu, Mar 20, 2008 at 11:22 AM, David Kane [EMAIL PROTECTED] wrote:
 Hi Godmar,

  Thanks.  Yes. I tried that, but the support for window.onload does not exist
  in IE6.  I also tried the defer=defer attribute in the script tag, which
  did not work either.  Tim's solution looks good.  I have yet to try it
  though.  ( will wait until after Easter).

  Cheers,

  David



  On 20/03/2008, Godmar Back [EMAIL PROTECTED] wrote:
  
   Have you tried placing your code in an window.onload handler?  Read
   the example I created at libx.org/gbs and if that works for you in
   IE6, use the technique there. (Or you may just use the entire script -
   it seems you're reimplementing a lot of it anyway.)
  
  
 - Godmar
  
  
   On Thu, Mar 20, 2008 at 9:09 AM, KREYCHE, MICHAEL [EMAIL PROTECTED]
   wrote:
Tim and David,
   
 Thanks for sharing you solutions; the IE problem has been driving me
 crazy. I've mostly been working on the title browse page of our
   catalog.
   
   
 Originally I had it working on Firefox, Safari, and IE7 (IE6 worked if
   I
 refreshed the page); after some rearrangement of the script, it's now
 working on IE6 but broken on Safari.
   
 This is still proof of concept code and is only on our staging server
 (http://kentlink.kent.edu:2082/). Try a keyword search and you should
 see some Google links.
   
 Mike
 --
 Michael Kreyche
 Systems Librarian / Associate Professor
 Libraries and Media Services
 Kent State University
 330-672-1918
   
   
   
  -Original Message-
  From: Code for Libraries [mailto:[EMAIL PROTECTED] On
  Behalf Of Tim Hodson
  Sent: Thursday, March 20, 2008 7:21 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] Google Book Search API - JavaScript Query
 
  One way I have used to resolve this is to poll the object
  until it exists before continuing.

  function myInit(id){
    // if myObj is not defined yet, create it and call this
    // function again until it is
    if (typeof myObj == "undefined"){
      createScript();
      setTimeout(myInit, 60);
      return;
    }
    // do stuff only if myObj is now an object
    else if (typeof myObj == "object"){
      myGo();
      return;
    }
  }

  HTH Tim
 
  On 20/03/2008, David Kane [EMAIL PROTECTED] wrote:
   HI Folks,
  
We were one of the first libraries to get the GBS API
  working on our OPAC.
Like many OPACs, ours is difficult to modify at times and
  requires a
   dynamic  insert of a generated (by PHP) JavaScript, which
  is hosted on
   a separate  server to the OPAC pages.
  
It seems to work fine on most browsers, giving an
  appropriate link to
   a  full/partial text preview of that work on GBS.  I run into a
   problem with  IE6, which means that the function defined in the
   JavaScript aren't  available by the time the script is
  called at the bottom of the page.
  
You should be able to see a GBS link on most pages, here
  is an example:
  
http://witcat.wit.ie/search/i?SEARCH=0192833987
  
The attached image shows you what you should see.
  
If anyone can shed any light on this, it would be appreciated.
  
Thanks and best regards,
  
  
David Kane
Systems Librarian
Waterford Institute of Technology
Ireland
  
  
 
 
  --
  Tim Hodson
  www.informationtakesover.co.uk
  www.timhodson.co.uk
 
   
  



Re: [CODE4LIB] Free covers from Google

2008-03-17 Thread Godmar Back
FWIW, realize that this is a client-side mashup. Google will see
individual requests from individual IP addresses from everybody
viewing your page. For each IP address from which it sees requests
it'll decide whether to block or not. It'll block if it thinks you're
harvesting their data.

Wageningen University owns the 137.224/16 network, so I find it
doubtful that you're all sharing the same IP address. It's probably
just your desktop IP address (or, if you're behind a NAT device, the
address used by that device - but that's probably only a small group
of computers.)

That makes it even more concerning that Google's defenses could be
triggered by your development and testing activities. Do complain
about it to them. (I doubt they change their logic, but you can try.)

I've received the CAPTCHA from Google in the past a few times if I use
it as a calculator. Enter more than a dozen or so expressions, and it
thinks I'm a computer who needs help from Google to compute simple
things such as English-to-metric conversions.

I think that's a huge drawback, actually. How does Amazon's image
service work? Does it suffer from the same issue?

 - Godmar

On Mon, Mar 17, 2008 at 4:50 AM, Boheemen, Peter van
[EMAIL PROTECTED] wrote:
 As i wrote earlier, I have implemented a link using the Google API in
  our library catalog.
  It worked ... for a while :)

  What we notice now is, that Google responds with an error message. It
  thinks that it has detected spyware or some virus.
  i see the same effect now when I click on the examples Godmar and Tim
  created.
  When I go to Google books directly with my browser now, I get the same
  message and am asked to enter a non-machine-readable string,
  and then I can go on. My API calls, however, still fail.
  This probably has to do with the fact that anybody who is accessing
  Google from the university campus exposes the same IP address to Google.
  This is probably a trigger for Google to respond with this error.
  Does anybody have any ideas about what to do about this, before I try to
  get in touch with Google?


  Peter van Boheemen

 Wageningen University and Research Library
  The Netherlands



Re: [CODE4LIB] Free covers from Google

2008-03-17 Thread Godmar Back
Although I completely agree that server-side queryability is something
we should ask from Google, I'd like to follow up on:

On Mon, Mar 17, 2008 at 11:06 AM, Jonathan Rochkind [EMAIL PROTECTED] wrote:
 The
  architecture of SFX would make it hard to implement Google Books API
  access as purely client javascript, without losing full integration with
  SFX on par with other 'services' used by SFX.  We will see what happens.


Could you elaborate? Do you mean 'hard' or 'impossible'?

Meanwhile, I've extended the google book classes (libx.org/gbs ) to
provide more flexibility; it now supports these classes:

gbs-thumbnail: Include an <img> embedding the thumbnail image
gbs-link-to-preview: Wrap span in link to preview at GBS
gbs-link-to-info: Wrap span in link to info page at GBS
gbs-link-to-thumbnail: Wrap span in link to thumbnail at GBS
gbs-if-noview: Keep this span only if GBS reports that book's viewability is 'noview'
gbs-if-partial-or-full: Keep this span only if GBS reports that book's viewability is at least 'partial'
gbs-if-partial: Keep this span only if GBS reports that book's viewability is 'partial'
gbs-if-full: Keep this span only if GBS reports that book's viewability is 'full'
gbs-remove-on-failure: Remove this span if GBS doesn't return bookInfo for this item

 - Godmar


Re: [CODE4LIB] Free covers from Google

2008-03-17 Thread Godmar Back
On Mon, Mar 17, 2008 at 11:13 AM, Tim Spalding [EMAIL PROTECTED] wrote:
   limits. I don't think it's a strict hits-per-day, I think it's heuristic
software meant to stop exactly what we'd be trying to do, server-side
machine-based access.

  Aren't we still talking about covers? I see *no* reason to go
  server-side on that. Browser-side gets you what you want—covers from
  Google—without the risk they'll shut you down over overuse.


But Peter's experience says otherwise, no?
His computer was shut down during development - I don't see how Google
would tell his use from the use of someone doing research using a
library catalog. Especially if NAT is used with a substantial number
of users as in Giles's use case.

 - Godmar


Re: [CODE4LIB] jquery plugin to grab book covers from Google and link to Google books

2008-03-17 Thread Godmar Back
Good, but why limit it to 1 class per span?

My proposal separates different functionality in multiple classes,
allowing the user to mix and match. If you limit yourself to 1 class,
you have to provide classes for all possible combinations a user might
want, such as: gbsv-link-to-preview-with-thumbnail.

 - Godmar

On Mon, Mar 17, 2008 at 4:30 PM, Bess Sadler [EMAIL PROTECTED] wrote:
 Matt Mitchell here at UVa just wrote a jquery plugin to access google
  book covers and link to google books. I wrote up how to use it here:
  http://www.ibiblio.org/bess/?p=107

  We're using it as part of Blacklight, and we're making it
  available through the Blacklight source code repository under an
  Apache 2.0 license.

  First, grab the plugin here:
  http://blacklight.rubyforge.org/svn/javascript/gbsv-jquery.js, and
  download jquery here:
  http://code.google.com/p/jqueryjs/downloads/detail?name=jquery-1.2.3.min.js.

  Now make yourself some HTML that looks like this:
   <html>
     <head>
       <script type="text/javascript" src="jquery-1.2.3.min.js"></script>
       <script type="text/javascript" src="gbsv-jquery.js"></script>
       <script type="text/javascript">
         $(function(){
           $.GBSV.init();
         });
       </script>
     </head>
     <body>
       <span title="ISBN:0743226720" class="gbsv-link-to-preview"></span>
       <span title="ISBN:0743226720" class="gbsv-link-to-info"></span>
       <span title="ISBN:0743226720" class="gbsv-thumbnail"></span>
       <span title="ISBN:0743226720" class="gbsv-link-to-preview-with-thumbnail"></span>
     </body>
   </html>

  Now load your page and you should see something like this:
  http://blacklight.rubyforge.org/gbsv.html

  If you link to a non-existent ISBN it will be silently ignored.

  Give it a shot and give us some feedback!

  Bess


  Elizabeth (Bess) Sadler
  Research and Development Librarian
  Digital Scholarship Services
  Box 400129
  Alderman Library
  University of Virginia
  Charlottesville, VA 22904

  [EMAIL PROTECTED]
  (434) 243-2305



Re: [CODE4LIB] many processes, one result

2008-02-18 Thread Godmar Back
If you're doing this in Java, use the java.util.concurrent package and
its Executor and Future framework, instead of using Thread.start/join,
synchronized etc. directly.

Get the book "Concurrent Programming in Java: Design Principles and
Patterns" (ISBN 0-201-31009-0), written by the master himself (Doug
Lea; see http://gee.cs.oswego.edu/dl/cpj/).
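(Editorial sketch of the Executor/Future approach suggested above; the class and method names here are illustrative, not from the book or the thread. Every backend is queried concurrently, and only the results that arrive before a shared deadline are kept, so one slow index no longer grinds everything to a halt:)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FederatedSearch {

    // Submit every "service" at once, then collect whatever results
    // arrive before the deadline; slow stragglers are simply skipped.
    public static List<String> searchAll(List<Callable<String>> services,
                                         long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(services.size());
        List<Future<String>> futures = new ArrayList<>();
        for (Callable<String> s : services) {
            futures.add(pool.submit(s));
        }
        List<String> results = new ArrayList<>();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        for (Future<String> f : futures) {
            long remaining = Math.max(0, deadline - System.currentTimeMillis());
            try {
                results.add(f.get(remaining, TimeUnit.MILLISECONDS));
            } catch (TimeoutException | ExecutionException e) {
                // service too slow or failed: ignore its result
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        pool.shutdownNow(); // interrupt any stragglers still running
        return results;
    }

    public static void main(String[] args) {
        List<Callable<String>> services = new ArrayList<>();
        services.add(() -> "spell: no suggestions");
        services.add(() -> { Thread.sleep(10_000); return "slow index"; });
        // Only the fast service makes the 500 ms cutoff.
        System.out.println(searchAll(services, 500));
    }
}
```

The timed Future.get is what turns "wait for everything" into "wait until the deadline," which is the crux of the federated-search problem posed in the original message.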

 - Godmar

On Feb 18, 2008 2:19 PM, Durbin, Michael R [EMAIL PROTECTED] wrote:
 This can be done in Java, but like everything in Java the solution is kind of 
 lengthy and perhaps requires several classes.

 I've attached a simple skeleton program that spawns threads to search but 
 then processes only those results returned in the first 10 seconds.  The code 
 for performing the searches is obviously missing as is the consolidation 
 code, but the concurrency issue is addressed.  In this example the search 
 threads aren't killed, but instead left running to finish naturally though 
 their results would be ignored if they weren't done in 10 seconds.  It might 
 be better to kill them depending on the circumstances.

 -Mike

 -Original Message-
 From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Eric Lease 
 Morgan
 Sent: Monday, February 18, 2008 1:43 PM
 To: CODE4LIB@LISTSERV.ND.EDU
  Subject: [CODE4LIB] many processes, one result

 How do I write a computer program that spawns many processes but
 returns one result?

 I suppose the classic example of my query is the federated search. Get
 user input. Send it to many remote indexes. Wait. Combine results.
 Return. In this scenario when one of the remote indexes is slow things
 grind to a halt.

 I have a more modern example. Suppose I want to take advantage of many
 Web Services. One might be spell checker. Another might be a
 thesaurus. Another might be an index. Another might be a user lookup
 function. Given this environment, where each Web Service will return
 different sets of streams, how do I query each of them simultaneously
  and then aggregate the result? I don't want to do this sequentially. I
 want to fork them all at once and wait for their return before a
 specific time out. In Perl I can use the system command to fork a
 process, but I must wait for it to return. There is another Perl
 command allowing me to fork a process and keep going but I don't
  remember what it is. Neither one of these solutions seems feasible. Is
  the idea of threading in Java supposed to be able to address this
  problem?

 --
 Eric Lease Morgan
 University Libraries of Notre Dame

 (574) 631-8604



Re: [CODE4LIB] xml java package

2008-02-01 Thread Godmar Back
On Jan 27, 2008 5:40 PM, Eric Lease Morgan [EMAIL PROTECTED] wrote:
 What is the most respected (useful, understandable) XML Java package?

 In a few fits of creative rage, I have managed to write my first Java
 programs. I can now index plain text files with Lucene and search the
 index. I can parse MARC files with MARC4J, index them with Lucene,
 and search the index. I can dump the results of the OAI-PMH
 ListRecords and Identify verbs using harvest2 from OCLC.

 I now need to read XML. Unlike indexing and doing OAI-PMH, there are
 a myriad of tools for reading and writing XML. I've done SAX before.
 I think I've done a bit of DOM. If I wanted a straight-forward and
 well-supported Java package that supported these APIs, then what
 package might I use?


If the data you're manipulating is partially or fully described by a
Schema or DTD, consider using a package such as Castor (castor.org)
that generates classes that store your XML data as Java beans. In this
case, you get XML parsing, XML generation, and even validation for
free, that is, using only about 3 lines of code. If you don't have a
Schema, consider creating one or asking the data provider for one -
compared to using SAX or a DOM-like API, the gain in
productivity and robustness is significant.

We're using Castor extensively in the LibX edition builder - for our
own configuration data, which is stored as XML, but also for accessing
a number of OCLC services, including the OpenURL registry (which has a
complete Schema!), the Worldcat registry (partial schema for SRW), and
the OCLC institution profiles (no Schema :-(, so slightly more
awkward.)

 - Godmar


Re: [CODE4LIB] xml java package

2008-02-01 Thread Godmar Back
I haven't used Castor for mixed content, but obviously, mixed content
is more difficult to map to Java types, even if you have a schema. I
probably wouldn't use Castor in those situations. Otherwise, it - or a
tool like it that can map schemata to Java types for automatic
parsing, generation, and validation - should still be your first
choice.

 - Godmar

On Feb 1, 2008 11:22 AM, Clay Redding [EMAIL PROTECTED] wrote:
 I don't know if it's still the case, but I know a recent EAD project
 that tried to use Castor said that it had problems with mixed content
 models.  -- Clay


 On Feb 1, 2008, at 10:50 AM, Riley, Jenn wrote:

  -Original Message-
  I now need to read XML. Unlike indexing and doing OAI-PMH, there are
  a myriad of tools for reading and writing XML. I've done SAX before.
  I think I've done a bit of DOM. If I wanted a straight-forward and
  well-supported Java package that supported these APIs, then what
  package might I use?
 
 
  If the data you're manipulating is partially or fully described by a
  Schema or DTD, consider using a package such as Castor (castor.org)
 
  I think I recall hearing in the past that Castor had trouble with
  XML files that used mixed content models (a set into which TEI and
  EAD both fall) - can anyone confirm if that's currently the case
  (or that it never was and I'm completely misremembering)?
 
  Jenn
 
  
  Jenn Riley
  Metadata Librarian
  Digital Library Program
  Indiana University - Bloomington
  Wells Library W501
  (812) 856-5759
  www.dlib.indiana.edu
 
  Inquiring Librarian blog: www.inquiringlibrarian.blogspot.com



Re: [CODE4LIB] arg! classpaths! [resolved]

2008-01-26 Thread Godmar Back
To add a bit of experience gained from 13 years of Java programming: I
strongly recommend against setting CLASSPATH in the shell. Instead,
use either the -cp switch to java, as in

java -cp lucene-core...jar:lucene-demo-.jar 

or use the env command in Unix, as in

env 
CLASSPATH=/home/eric/lucene/lucene-core-2.3.0.jar:/home/eric/lucence/lucene-demos-2.3.0.jar
java 

These options achieve the same effect, but unlike export, they will
not change the CLASSPATH environment variable for the remainder of
your shell session.  For instance, this command:

export 
CLASSPATH=/home/eric/lucene/lucene-core-2.3.0.jar:/home/eric/lucence/lucene-demos-2.3.0.jar

will make it impossible to execute javac or java for .class files in
the current directory (because you've excluded . from the classpath,
which by default is included.)

Note, however, that this rule does not apply to shell scripts: inside
shell scripts, it's okay to export CLASSPATH because such settings
will be valid only for the shell executing the script; in Unix,
changes to environment variable will not reflect back to the shell
from the shell script was started.

 - Godmar

 
  You use a plain directory as a CLASSPATH component only if you intend
  to use .class files that have not been packaged up in a JAR.



 Thank you for the prompt replies. Yes, my CLASSPATH needed to be more
 specific; it needed to specify the .jar files explicitly. I can now
 run the demo. (Arg! Classpaths!)

 --
 ELM



Re: [CODE4LIB] arg! classpaths! [resolved]

2008-01-26 Thread Godmar Back
On Jan 26, 2008 10:12 AM, Godmar Back [EMAIL PROTECTED] wrote:

 Note, however, that this rule does not apply to shell scripts: inside
 shell scripts, it's okay to export CLASSPATH because such settings
 will be valid only for the shell executing the script; in Unix,
 changes to environment variable will not reflect back to the shell
 from the shell script was started.


Oops, should read: ... changes to environment variables will not
reflect back to the shell from *which* the shell script was started.

I should also mention that if you place an export CLASSPATH command in
your ~/.bash_profile or ~/.bashrc, you've committed the same mistake
because the setting then will be valid for your initial shell session
(or every new session, or both, depending on the content of your
~/.bash_profile.) So ignore any instructions that propose you do that.

 - Godmar

