Re: [CODE4LIB] libxess

2007-05-09 Thread Karen Tschanz
Impressive sample. kst

 Godmar Back [EMAIL PROTECTED] 5/8/2007 5:57 PM 
On 5/8/07, Karen Tschanz [EMAIL PROTECTED] wrote:
 Hi, Godmar:

 I would be interested in receiving links from libraries that have
 implemented this, so that I could see the results. Thanks for your help!

Given that what I propose is still in the design phase/vaporware,
asking for examples may be premature -- yet here is one:
http://libx.org/libxess/cue.html shows what a possible application of
this technology might look like.

Also, keep in mind that what I'm proposing is not a new service that
libraries could deploy directly to their users -- rather, it's a piece
of infrastructure that would allow libraries to deploy services built
on this infrastructure to their users.

It's a bit of a chicken-and-egg problem, except that we will go
ahead and provide an initial chicken (LibX) and an initial egg
(David's script).

 - Godmar


[CODE4LIB] Server logs as tag clouds

2007-05-09 Thread Tom Keays

O'Reilly has a nifty feature that displays the top 20 search terms on
their various sites, using terms that someone typed into a search
engine (e.g., Google) and then followed a resulting link. (They're
also distributing these tags as JSON, which is a nice idea.)

http://www.oreillynet.com/feeds/widgets/organic_search_tagcloud/

Presumably they are doing server log analysis to get and rank search
terms as tags (although there is no way to tell for certain, since the
code is not GPL). It seems like it would be a good complement to
search log analysis to see how people are finding and using your site.
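The general idea is easy to sketch. Assuming a simple {term: count} mapping (a guess at what such a JSON feed might carry, not O'Reilly's actual schema), weighting the top terms into tag-cloud font sizes is a few lines of Python:

```python
# Hypothetical sketch: turn a {term: count} mapping (like one a JSON tag
# feed might provide) into weighted tag-cloud entries. The feed format
# here is an assumption, not O'Reilly's actual schema.
import math

def tag_cloud(counts, min_px=10, max_px=28, top_n=20):
    """Return (term, font_size) pairs for the top_n most frequent terms."""
    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    if not top:
        return []
    lo = math.log(min(c for _, c in top))
    hi = math.log(max(c for _, c in top))
    span = (hi - lo) or 1.0  # avoid divide-by-zero when all counts match
    return [(term, round(min_px + (max_px - min_px) * (math.log(c) - lo) / span))
            for term, c in top]
```

The log scale keeps one runaway term from flattening everything else to the minimum size, which is the usual choice for tag clouds.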

O'Reilly has addressed the potential issues of privacy and
appropriateness of the displayed tags by matching search terms back to
an index of their site: "While the keyword frequency does give some
idea of what people are looking for, keep in mind that the word had to
already be on our site in order for it to appear, and it had to be
ranked highly enough for someone to find it."

It also greatly helps that their site has a highly structured search
engine, allowing limiting of results by content type and by site. This
is probably only practical on sites that use a structured CMS.

Still, it is worth asking: has anyone made a stab at this -- i.e.,
publicly exposing server logs? Are there code examples? (Any
real-world, generalizable examples would be welcome.) Sorry for
cross-posting this.

--
Tom


Re: [CODE4LIB] Server logs as tag clouds

2007-05-09 Thread Joe Hourcle
On Wed, 9 May 2007, Tom Keays wrote:

 Still, it is worth asking: has anyone made a stab at this -- i.e.,
 publicly exposing server logs? Are there code examples? (Any
 real-world, generalizable examples would be welcome.) Sorry for
 cross-posting this.

I've done it in the past -- typically using general analytics programs
(e.g., Analog), or just parsing out relevant data with Perl.

The problem is that, a few years ago, spammers started sending bogus
requests to servers to try to get them to show up in your stats pages.
In ORA's case, they're only showing the top 20, and they presumably get
lots of requests, so someone would have to hit them pretty hard to get
something to show up.

If you're thinking about exposing your server logs, I'd recommend the
following:

1. Don't give out IP addresses of the requestors
   (privacy reasons)
2. Don't put on a public page any data that's generated by the
   user-agent, to include HTTP_USER_AGENT, HTTP_REFERER and
   QUERY_STRING.  All have been used by spammers to insert URLs to
   try to get links back to their sites.
3. Filter out all entries with 'error' results (people trying to
   probe your system for vulnerabilities, etc.)
4. Filter out all 'intranet' pages or other pages that the general
   public shouldn't be going to.
5. Avoid giving information that provides signatures of the CMS
   you're using, or other signatures of potential vulnerabilities.
6. Use robots.txt to request that search engines not index whatever
   pages you generate.

For the particular case of generating tag clouds from search terms, the
problem is that you typically need to use QUERY_STRING if it's a
local search script, and HTTP_REFERER if it's a remote search engine that
linked to you.  Neither value can be trusted.

In this particular case, I probably wouldn't try a fully automated
approach -- I'd generate the page, but require someone to manually verify
it before it got posted.
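To make the recommendations above concrete, here's a rough Python sketch along those lines: it pulls candidate search terms from the HTTP_REFERER field of a combined-format access log, drops non-2xx entries (rule 3), and only keeps words that already appear in an index of your own site, so untrusted referrer data is vetted before it could ever reach a public page. The log pattern and allowlist approach are illustrative, not anyone's production code.

```python
# Illustrative sketch: extract search terms from the referrer field of a
# combined-format log, skipping error responses and filtering every term
# against a site-index allowlist so spammer-controlled values never pass
# through unvetted.
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs

LOG_RE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) \S+ "(?P<referer>[^"]*)"'
)

def search_terms(log_lines, site_index):
    """Count referrer search terms, keeping only 2xx hits and indexed words."""
    counts = Counter()
    for line in log_lines:
        m = LOG_RE.search(line)
        if not m or not m.group('status').startswith('2'):
            continue  # skip errors and probe attempts
        # take the q= parameter from the referring search engine's URL
        q = parse_qs(urlparse(m.group('referer')).query).get('q', [''])[0]
        for word in q.lower().split():
            if word in site_index:  # vet terms against our own content
                counts[word] += 1
    return counts
```

Even with filtering like this, the manual-review step suggested below is still the safe default before anything goes public.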

-
Joe Hourcle


(insert some statement here about everything being my personal opinions,
and that I don't speak for any company, organization, etc.)


Re: [CODE4LIB] more metadata from xISBN

2007-05-09 Thread William Denton

On 8 May 2007, Eric Hellman wrote:


xISBN is free for non-commercial, low volume use.


The xISBN web site clarifies this as meaning at most 500 queries per day
for non-commercial purposes.  Over 500 queries in a day for non-commercial
use, or any number of queries for commercial use, requires paying:

   http://xisbn.worldcat.org/xisbnadmin/doc/price.htm

A library would pay $3,000 USD a year to be able to do 10,000 queries a
day.  That's a lot of queries, but I could imagine a big academic library
doing a bunch if they pushed out web tools to their students to make it
easy to check if any edition of a given book (seen at Amazon or in a blog,
etc.) is available in its collection.  1,000 queries a day (which used to
be free) is now $500 USD per year.  It's 20% off for OCLC members.

I'm not sure how to read the commercial price rates, or who would need
10,000,000 xISBN queries, but the prices push the service out of the reach
of the devoted library hacker as well as the small start-up or basement
business.

xISBN's availability, even to and through free and open source tools, is
now more limited.  On reflection, this is one of the rare times on
code4lib when an announced API offers less and not more.  Also, it's the
first big commodification of FRBR, which is intriguing.

Bill
--
William Denton, Toronto : www.miskatonic.org www.frbr.org www.openfrbr.org


Re: [CODE4LIB] more metadata from xISBN

2007-05-09 Thread Godmar Back

Interesting.

Thom Hickey commented a while ago about LibX's use of xISBN (*): "I
suspect that eventually the LibX xISBN support will become both less
visible and more automatic."

We were indeed planning on making it more automatic. For instance, a
user visiting a vendor's page such as Amazon might be presented with
options from their library catalog, based on related ISBNs found via
xISBN.

Would that qualify as noncommercial use?  For instance, if LibX with
this feature were installed on a public library machine, 500 requests
per day might be easily exceeded. Matters would be even worse if
multiple library machines were to share an IP because they are hidden
behind a NAT device or proxy.

- Godmar

(*) http://outgoing.typepad.com/outgoing/2006/05/libx_and_xisbn.html
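One way a client-side tool could soften the quota problem, sketched here purely as an illustration and not as LibX's actual design: cache responses per ISBN so repeat lookups cost nothing, and stop issuing network calls once a daily budget is spent. The fetch function is supplied by the caller; no real xISBN endpoint or request signature is assumed.

```python
# Sketch of client-side quota management: per-ISBN caching plus a daily
# request budget. `fetch` is whatever function actually queries the
# service; this wrapper never touches the network itself.
import time

class QuotaCachedLookup:
    def __init__(self, fetch, daily_budget=500):
        self.fetch = fetch              # caller-supplied lookup function
        self.daily_budget = daily_budget
        self.cache = {}
        self.day = None
        self.used = 0

    def get(self, isbn):
        today = time.strftime('%Y-%m-%d')
        if today != self.day:           # reset the counter each day
            self.day, self.used = today, 0
        if isbn in self.cache:
            return self.cache[isbn]     # cached: costs nothing
        if self.used >= self.daily_budget:
            return None                 # degrade gracefully: no related ISBNs
        self.used += 1
        self.cache[isbn] = self.fetch(isbn)
        return self.cache[isbn]
```

Caching helps most exactly in the NAT/proxy scenario above, where many machines behind one IP tend to look up the same popular titles.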

On 5/9/07, William Denton [EMAIL PROTECTED] wrote:

On 8 May 2007, Eric Hellman wrote:

 xISBN is free for non-commercial, low volume use.

The xISBN web site clarifies this as meaning at most 500 queries per day
for non-commercial purposes.  Over 500 queries in a day for non-commercial
use, or any number of queries for commercial use, requires paying:

http://xisbn.worldcat.org/xisbnadmin/doc/price.htm

A library would pay $3,000 USD a year to be able to do 10,000 queries a
day.  That's a lot of queries, but I could imagine a big academic library
doing a bunch if they pushed out web tools to their students to make it
easy to check if any edition of a given book (seen at Amazon or in a blog,
etc.) is available in its collection.  1,000 queries a day (which used to
be free) is now $500 USD per year.  It's 20% off for OCLC members.

I'm not sure how to read the commercial price rates, or who would need
10,000,000 xISBN queries, but the prices push the service out of the reach
of the devoted library hacker as well as the small start-up or basement
business.

xISBN's availability, even to and through free and open source tools, is
now more limited.  On reflection, this is one of the rare times on
code4lib when an announced API offers less and not more.  Also, it's the
first big commodification of FRBR, which is intriguing.

Bill
--
William Denton, Toronto : www.miskatonic.org www.frbr.org www.openfrbr.org



Re: [CODE4LIB] more metadata from xISBN

2007-05-09 Thread Nathan Vack

On May 9, 2007, at 11:56 AM, William Denton wrote:


On 8 May 2007, Eric Hellman wrote:


xISBN is free for non-commercial, low volume use.


A library would pay $3,000 USD a year to be able to do 10,000 queries a
day.  That's a lot of queries, but I could imagine a big academic library
doing a bunch if they pushed out web tools to their students to make it
easy to check if any edition of a given book (seen at Amazon or in a blog,
etc.) is available in its collection.  1,000 queries a day (which used to
be free) is now $500 USD per year.  It's 20% off for OCLC members.


Y'know, we could just all chip in for the data file and provide free
access through a web service.

Heh. Someday, I'm gonna get sued.

Also... did I somehow miss the legislation in which factual
information (like, everything contained within xISBN) became
copyrightable?

-Nate


Re: [CODE4LIB] more metadata from xISBN

2007-05-09 Thread Jonathan Rochkind

Nathan Vack wrote:

Also... did I somehow miss the legislation in which factual
information (like, everything contained within xISBN) became
copyrightable?


License agreements can restrict just about anything the agreement wants
to. If it's an agreement freely entered into, you can agree to a
restriction on what you can do well beyond what copyright law would support.

But yeah, on this general topic, this stifles a lot of things we'd want
to do with xISBN, indeed.

What options are there?
1) thingISBN, of course.

2) More interesting---OCLC's _initial_ work set grouping algorithm is
public. However, we know they've done a lot of additional work to
fine-tune the work set grouping algorithms.
(http://www.frbr.org/2007/01/16/midwinter-implementers).  Some of these
algorithms probably take advantage of all the cool data OCLC has that we
don't, okay.

But how about we start working to re-create this algorithm? Re-create
isn't a good word, because we aren't going to violate any NDAs; we're
going to develop/invent our own algorithm, but this one is going to be
open source, not a trade secret like OCLC's.

So we develop an algorithm on our own, and we run that algorithm on our
own data. Our own local catalog. Union catalogs. Conglomerations of
different catalogs that we do ourselves. Even reproductions of the OCLC
corpus (or significant subsets thereof) that we manage to assemble in
ways that don't violate copyright or license agreements.

And then we've got our own workset grouping service. Which is really all
xISBN is.  What is OCLC providing that is so special? Well, if what I've
just outlined above is so much work that we _can't_ pull it off, then I
guess we've got to pay OCLC, and if we are willing to do so (rather than go
without the service), then I guess OCLC has correctly pegged their
market price.
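
As a starting point, the kind of open algorithm being proposed could begin as simply as normalizing title and author and grouping records on the result -- far cruder than whatever OCLC actually runs, and offered purely as a sketch of where such a project might begin:

```python
# Naive first cut at open workset grouping: records sharing a normalized
# title/author key are treated as one work. Real algorithms add edition
# statements, authority control, fuzzy matching, etc.; this is only the
# illustrative baseline.
import re
from collections import defaultdict

def work_key(record):
    """Normalize title + primary author into a crude work-level key."""
    def norm(s):
        # lowercase, collapse punctuation/whitespace runs to single spaces
        return re.sub(r'[^a-z0-9]+', ' ', s.lower()).strip()
    return (norm(record.get('title', '')), norm(record.get('author', '')))

def group_worksets(records):
    """Group bib records that share a normalized title/author key."""
    sets = defaultdict(list)
    for rec in records:
        sets[work_key(rec)].append(rec)
    return dict(sets)
```

Running even something this crude over a local or union catalog would give a baseline to measure smarter open algorithms against.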

But our field is not a healthy field if all research is being done by
OCLC and other vendors. We need research from other places, we need
research that produces public domain results, not proprietary trade
secrets.

Jonathan




--
Jonathan Rochkind
Sr. Programmer/Analyst
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu


Re: [CODE4LIB] more metadata from xISBN

2007-05-09 Thread Eric Hellman

As long as LibX is free and not being used as a way to drive Amazon
revenue, I don't see how it could be considered to be commercial.

We've studied our logs pretty carefully. Most of the sites that have
exceeded the limit we set were commercial sites doing bulk harvest.

You can track the xISBN use by LibX by getting an affiliate id.

Eric

At 2:32 PM -0400 5/9/07, Godmar Back wrote:

Interesting.

Thom Hickey commented a while ago about LibX's use of xISBN (*): I
suspect that eventually the LibX xISBN support will become both less
visible and more automatic.

We were indeed planning on making it more automatic. For instance, a
user visiting a vendor's page such as amazon might be presented with
options from their library catalog, based on related ISBN found via
xISBN.

Would that qualify as noncommercial use?  For instance, if LibX with
this feature were installed on a public library machine, 500 requests
per day might be easily exceeded. Matters would be even worse if
multiple library machines were to share an IP because they are hidden
behind a NAT device or proxy.

- Godmar

(*) http://outgoing.typepad.com/outgoing/2006/05/libx_and_xisbn.html


--

Eric Hellman, Director                  OCLC Openly Informatics Division
[EMAIL PROTECTED]                       2 Broad St., Suite 208
tel 1-973-509-7800 fax 1-734-468-6216   Bloomfield, NJ 07003
http://openly.oclc.org/1cate/           1 Click Access To Everything


Re: [CODE4LIB] more metadata from xISBN

2007-05-09 Thread Godmar Back

On 5/9/07, Eric Hellman [EMAIL PROTECTED] wrote:

As long as LibX is free and not being used as a way to drive Amazon
revenue, I don't see how it could be considered to be commercial.



Probably a way to drive Amazon revenue down, considering that we offer
the alternative to borrow the book rather than buy it.


We've studied our logs pretty carefully. Most of the sites that have
exceeded the limit we set were commercial sites doing bulk harvest.

You can track the xISBN use by LibX by getting an affiliate id.



LibX is a client-side tool. We're not a user of xISBN; we give the
clients who have installed it the option to use xISBN.

Also, keep in mind that an important reason to use OCLC's xISBN
service - rather than using an alternate service or using the data
directly - is Jeff Young's OAI bookmark service, specifically the
know-how he's put into searching multiple catalogs and his keeping a
database of which library uses which catalog. That, as I understand,
is still not part of the officially supported xISBN, though.

- Godmar


Re: [CODE4LIB] more metadata from xISBN

2007-05-09 Thread Jonathan Rochkind

Yeah, that's a good point, Eric.

I am, however, worried that I can't do what I want to do without
exceeding 500 queries a day, and my institution is not going to be
willing to pay for it. So I'm interested in exploring other
opportunities. (Does Umlaut really not exceed 500 queries a day, for
instance?)

I am also interested in publicly shared and open-sourced algorithms
for workset grouping that we can all collectively work on to improve
the state of our collective knowledge.  I am unhappy that 'our'
collective institution (OCLC) keeps the products of its research (such
as the workset algorithm currently being used, but there are other
significant examples many of us know of) as trade secrets, and am
interested in a research project that would not do so.

If 'our' collective institution, OCLC, would share the results of its
research as open-sourced algorithms, and would provide the services I
need at more affordable costs, then of course neither of those would be
necessary. One option is certainly spending time on trying to lobby OCLC
to behave differently. Another option is creating an alternative. Both
are, to me, legitimate options.

Jonathan

Eric Hellman wrote:

Jonathan,

It's worth noting that OCLC *is* the "we" you are talking about.

OCLC member libraries contribute resources to do exactly what you
suggest, and to do it in a way that is sustainable for the long term.
Worldcat is created and maintained by libraries and by librarians.
I'm the last to suggest that OCLC is the best possible instantiation
of libraries-working-together, but we do try.


Eric



At 3:01 PM -0400 5/9/07, Jonathan Rochkind wrote:

2) More interesting---OCLC's _initial_ work set grouping algorithm is
public. However, we know they've done a lot of additional work to
fine-tune the work set grouping algorithms.
(http://www.frbr.org/2007/01/16/midwinter-implementers).  Some of these
algorithms probably take advantage of all the cool data OCLC has that we
don't, okay.

But how about we start working to re-create this algorithm? Re-create
isn't a good word, because we aren't going to violate any NDA's, we're
going to develop/invent our own algorithm, but this one is going to be
open source, not a trade secret like OCLC's.

So we develop an algorithm on our own, and we run that algorithm on our
own data. Our own local catalog. Union catalogs. Conglomerations of
different catalogs that we do ourselves. Even reproductions of the OCLC
corpus (or significant subsets thereof) that we manage to assemble in
ways that don't violate copyright or license agreements.

And then we've got our own workset grouping service. Which is really all
xISBN is.  What is OCLC providing that is so special? Well, if what I've
just outlined above is so much work that we _can't_ pull it off, then I
guess we've got to pay OCLC, and if we are willing to do so (rather than go
without the service), then I guess OCLC has correctly pegged their
market price.

But our field is not a healthy field if all research is being done by
OCLC and other vendors. We need research from other places, we need
research that produces public domain results, not proprietary trade
secrets.



--

Eric Hellman, Director                  OCLC Openly Informatics Division
[EMAIL PROTECTED]                       2 Broad St., Suite 208
tel 1-973-509-7800 fax 1-734-468-6216   Bloomfield, NJ 07003
http://openly.oclc.org/1cate/           1 Click Access To Everything



--
Jonathan Rochkind
Sr. Programmer/Analyst
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu


Re: [CODE4LIB] more metadata from xISBN

2007-05-09 Thread Eric Hellman

At 4:41 PM -0400 5/9/07, Godmar Back wrote:

On 5/9/07, Eric Hellman [EMAIL PROTECTED] wrote:

We've studied our logs pretty carefully. Most of the sites that have
exceeded the limit we set were commercial sites doing bulk harvest.

You can track the xISBN use by LibX by getting an affiliate id.



LibX is a client-side tool. We're not a user of xISBN, we provide
clients who have installed it the option to use xISBN.


I know, and I had to explain that to the legal department!



Also, keep in mind that an important reason to use OCLC's xISBN
service - rather than using an alternate service or using the data
directly - is Jeff Young's OAI bookmark service, specifically the
know-how he's put into searching multiple catalogs and his keeping a
database of which library uses which catalog. That, as I understand,
is still not part of the officially supported xISBN, though.


We will improve on that service...
--

Eric Hellman, Director                  OCLC Openly Informatics Division
[EMAIL PROTECTED]                       2 Broad St., Suite 208
tel 1-973-509-7800 fax 1-734-468-6216   Bloomfield, NJ 07003
http://openly.oclc.org/1cate/           1 Click Access To Everything


Re: [CODE4LIB] Z39.50 for III Database?

2007-05-09 Thread Birkin James Diana

Godmar,


... Is this code available under a license? ...


Not yet.

A third of me wishes I'd never seen Michael Doran's excellent
code4lib2007 presentation and could just blindly release stuff
open source (for those not there: amongst great info, he cautioned
against claiming to release stuff as open source when it may not
legally be so), but the other two-thirds is *very* appreciative I was
there, and our pro-open-source team hopes to get a process in place to
legitimately release stuff with an explicit license.

http://www.code4lib.org/2007/doran

So I'll just informally say that I hope this is useful to others for
now.

By the way, to all: when I went to the code4lib site to make sure I
attributed Michael properly, I didn't expect to see the nice
presentation of the slideshow and video. Kudos to those of you who
took the work of the folks we've thanked for producing this stuff and
put it together on the conference-schedule links. Very nice.

http://www.code4lib.org/2007/schedule

-Birkin

---
Birkin James Diana
Programmer, Integrated Technology Services
Brown University Library
[EMAIL PROTECTED]


On May 8, 2007, at 6:56 PM, Godmar Back wrote:


... Is this code available under a license? ...


On 5/8/07, Birkin James Diana [EMAIL PROTECTED] wrote:


On May 1, 2007 Godmar Back wrote:

 ..Are there any reusable, open source scripts out there that
 implement a REST interface that screenscrapes or otherwise
 efficiently accesses a III catalog?...

...Below is the link to my code

http://dl.lib.brown.edu/code/iii_opac_webservice.zip

http://128.148.7.210/~birkin/wikinotes/doku.php?id=public:soa_josiah_status


Re: [CODE4LIB] more metadata from xISBN

2007-05-09 Thread Ross Singer

On 5/9/07, Jonathan Rochkind [EMAIL PROTECTED] wrote:


I am, however, worried that I can't do what I want to do without
breaking 500 querries a day, and my institution is not going to be
willing to pay for it. So I'm interested in exploring other
opportunities. (Does Umlaut really not exceed 500 querries a day, for
instance?).


The current state of OpenURLs being what it is, and given how few of
them carry ISBNs, I don't think this would be a problem.

At least, it probably wouldn't be a problem at Tech...

-Ross.