Re: [CODE4LIB] libxess
Impressive sample.

kst

Godmar Back [EMAIL PROTECTED] 5/8/2007 5:57 PM

On 5/8/07, Karen Tschanz [EMAIL PROTECTED] wrote:
> Hi, Godmar: I would be interested in receiving links from libraries that
> have implemented this, so that I could see the results. Thanks for your help!

Given that what I propose is still in the design phase/vaporware, asking for examples may be premature, yet here is one: http://libx.org/libxess/cue.html shows what a possible application of this technology might look like.

Also, keep in mind that what I'm proposing is not a new service that libraries could deploy directly to their users -- rather, it's a piece of infrastructure that would allow libraries to deploy services built on this infrastructure to their users. It's a bit of a chicken-and-egg problem, except that we will go ahead and provide an initial chicken (LibX) and an initial egg (David's script).

- Godmar
[CODE4LIB] Server logs as tag clouds
O'Reilly has a nifty feature that displays the top 20 search terms on their various sites, using terms that someone typed into a search engine (e.g., Google) and then followed a resulting link. (They're also distributing these tags as JSON, which is a nice idea.) http://www.oreillynet.com/feeds/widgets/organic_search_tagcloud/

Presumably they are doing server log analysis to get and rank search terms as tags (although there is no way to tell for certain, since the code is not GPL). It seems like it would be a good complement to search log analysis to see how people are finding and using your site.

O'Reilly has addressed the potential issues of privacy and appropriateness of the displayed tags by matching search terms back to an index of their site:

> While the keyword frequency does give some idea of what people are looking
> for, keep in mind that the word had to already be on our site in order for
> it to appear, and it had to be ranked highly enough for someone to find it.

It also greatly helps that their site has a highly structured search engine, allowing limiting of results by content type and by site. This is probably only practical on sites that use a structured CMS.

Still, it is worth asking: Has anyone made a stab at this -- i.e., publicly exposing server logs? Are there code examples (any real-world, generalizable examples would be welcome)? Sorry for cross-posting this.

-- Tom
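O'Reilly doesn't publish the widget's code, but the last step -- turning ranked term counts into a weighted cloud -- is straightforward. A minimal sketch of the usual log-scaled mapping from frequency to font size (the function name and the sample counts here are made up for illustration, not O'Reilly's actual implementation):

```python
import math

def tag_cloud_sizes(term_counts, min_px=10, max_px=32):
    """Map raw term frequencies to font sizes on a log scale --
    the usual trick for keeping one runaway term from dwarfing the rest."""
    if not term_counts:
        return {}
    lo = math.log(min(term_counts.values()))
    hi = math.log(max(term_counts.values()))
    span = (hi - lo) or 1.0  # all counts equal -> flat cloud
    return {
        term: round(min_px + (math.log(n) - lo) / span * (max_px - min_px))
        for term, n in term_counts.items()
    }

# Hypothetical top-terms feed, already parsed from JSON: term -> hit count
counts = {"perl": 120, "ajax": 45, "xslt": 9}
sizes = tag_cloud_sizes(counts)  # most frequent term gets max_px
```

Emitting the resulting sizes as inline CSS font sizes is all that remains to render the cloud.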
Re: [CODE4LIB] Server logs as tag clouds
On Wed, 9 May 2007, Tom Keays wrote:
> Still, it is worth asking: Has anyone made a stab at this -- i.e., publicly
> exposing server logs? Are there code examples (any real-world, generalizable
> examples would be welcome)? Sorry for cross-posting this.

I've done it in the past -- typically using general analytics programs (e.g., analog), or just parsing out relevant data w/ perl. The problem is that, a few years ago, spammers started sending bogus requests to servers to try to get themselves to show up in your stats pages. In ORA's case, they're only showing the top 20, and they presumably get lots of requests, so someone would have to hit them pretty hard to get something to show up.

If you're thinking about exposing your server logs, I'd recommend the following:

1. Don't give out IP addresses of the requestors (privacy reasons).
2. Don't put on a public page any data that's generated by the user-agent, including HTTP_USER_AGENT, HTTP_REFERER and QUERY_STRING. All have been used by spammers to insert URLs to try to get links back to their sites.
3. Filter out all entries with 'error' results (people trying to probe your system for vulnerabilities, etc.).
4. Filter out all 'intranet' pages or other pages that the general public shouldn't be going to.
5. Avoid giving information that provides signatures of the CMS you're using, or other signatures of potential vulnerabilities.
6. Use robots.txt to ask search engines not to index whatever pages you generate.

For the particular case of generating tag clouds from search results, the problem is that you typically need to use QUERY_STRING if it's a local search script, and HTTP_REFERER if it's a remote search engine that linked to you. Neither value can be trusted. In this particular case, I probably wouldn't try a fully automated approach -- I'd generate the page, but require someone to manually verify it before it got posted.
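The precautions above can be sketched in code. This illustrative Python (the Google-style `q=` parameter and the sample site index are assumptions; real logs and search engines vary) parses Apache combined-format lines, skips non-200 responses, never retains the client IP, and -- borrowing O'Reilly's trick -- only counts referer search terms that already appear in the site's own index, since referers are user-supplied and can't be trusted:

```python
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Apache "combined" log format: ip ident user [date] "request" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<req>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def search_terms(log_lines, site_index):
    """Count external-search terms from referers, filtering as above.
    The client IP is matched but deliberately never stored or returned."""
    terms = Counter()
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m or m.group("status") != "200":
            continue  # unparseable, or an error/probe response
        qs = parse_qs(urlparse(m.group("referer")).query)
        for query in qs.get("q", []):  # Google-style query parameter
            for word in query.lower().split():
                if word in site_index:  # whitelist against our own content
                    terms[word] += 1
    return terms

logs = [
    '66.249.1.2 - - [09/May/2007:12:00:00 -0400] "GET /catalog HTTP/1.1" '
    '200 5120 "http://www.google.com/search?q=library+perl" "Mozilla/5.0"',
    '10.0.0.1 - - [09/May/2007:12:01:00 -0400] "GET /secret HTTP/1.1" '
    '404 312 "http://spam.example/?q=viagra" "EvilBot"',
]
terms = search_terms(logs, site_index={"library", "perl"})
```

The 404 line (a probe, with a spammy referer) contributes nothing; only whitelisted terms from successful requests survive, which is what makes the output safe-ish to publish after a manual check.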
- Joe Hourcle (insert some statement here about everything being my personal opinions, and that I don't speak for any company, organization, etc.)
Re: [CODE4LIB] more metadata from xISBN
On 8 May 2007, Eric Hellman wrote:
> xISBN is free for non-commercial, low volume use.

The xISBN web site clarifies this as meaning <= 500 queries per day for non-commercial purposes. Over 500 queries in a day for non-commercial use, or any number of queries for commercial use, requires paying: http://xisbn.worldcat.org/xisbnadmin/doc/price.htm

A library would pay $3,000 USD a year to be able to do 10,000 queries a day. That's a lot of queries, but I could imagine a big academic library doing a bunch if they pushed out web tools to their students to make it easy to check if any edition of a given book (seen at Amazon or in a blog, etc.) is available in its collection. 1,000 queries a day (which used to be free) is now $500 USD per year. It's 20% off for OCLC members.

I'm not sure how to read the commercial price rates, or who would need 10,000,000 xISBN queries, but the prices push the service out of the reach of the devoted library hacker as well as the small start-up or basement business. xISBN's availability, even to and through free and open source tools, is now more limited.

On reflection, this is one of the rare times on code4lib when an announced API offers less and not more. Also, it's the first big commodification of FRBR, which is intriguing.

Bill

-- William Denton, Toronto : www.miskatonic.org www.frbr.org www.openfrbr.org
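One way to live inside a 500-query/day cap is to put a local cache and a budget counter in front of the remote call. A generic sketch (the class name, limits, and pluggable `lookup` function are all illustrative, not part of the xISBN API or its terms of service):

```python
import time
from collections import OrderedDict

class QuotaGuard:
    """Front a remote lookup (e.g. an xISBN call) with a local cache and
    a daily query budget, so a free-tier client stays under its cap."""

    def __init__(self, lookup, daily_limit=500, max_cache=10000):
        self.lookup = lookup            # function: isbn -> related ISBNs
        self.daily_limit = daily_limit
        self.max_cache = max_cache
        self.cache = OrderedDict()      # isbn -> cached result
        self.day = None
        self.used = 0                   # remote queries spent today

    def get(self, isbn):
        today = time.strftime("%Y-%m-%d")
        if today != self.day:           # new day: reset the budget
            self.day, self.used = today, 0
        if isbn in self.cache:
            return self.cache[isbn]     # cache hit costs nothing
        if self.used >= self.daily_limit:
            return None                 # budget spent: degrade gracefully
        self.used += 1
        result = self.lookup(isbn)
        self.cache[isbn] = result
        if len(self.cache) > self.max_cache:
            self.cache.popitem(last=False)  # evict the oldest entry
        return result
```

A computer lab whose machines share one IP could likewise share one such cache behind a small proxy, so repeated lookups of popular titles don't each burn a remote query.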
Re: [CODE4LIB] more metadata from xISBN
Interesting. Thom Hickey commented a while ago about LibX's use of xISBN (*):

> I suspect that eventually the LibX xISBN support will become both less
> visible and more automatic.

We were indeed planning on making it more automatic. For instance, a user visiting a vendor's page such as Amazon might be presented with options from their library catalog, based on related ISBNs found via xISBN. Would that qualify as noncommercial use?

For instance, if LibX with this feature were installed on a public library machine, 500 requests per day might easily be exceeded. Matters would be even worse if multiple library machines were to share an IP because they are hidden behind a NAT device or proxy.

- Godmar

(*) http://outgoing.typepad.com/outgoing/2006/05/libx_and_xisbn.html

On 5/9/07, William Denton [EMAIL PROTECTED] wrote:
> On 8 May 2007, Eric Hellman wrote:
> > xISBN is free for non-commercial, low volume use.
>
> The xISBN web site clarifies this as meaning <= 500 queries per day for
> non-commercial purposes. Over 500 queries in a day for non-commercial use,
> or any number of queries for commercial use, requires paying:
> http://xisbn.worldcat.org/xisbnadmin/doc/price.htm
> [...]
Re: [CODE4LIB] more metadata from xISBN
On May 9, 2007, at 11:56 AM, William Denton wrote:
> On 8 May 2007, Eric Hellman wrote:
> > xISBN is free for non-commercial, low volume use.
>
> A library would pay $3,000 USD a year to be able to do 10,000 queries a
> day. That's a lot of queries, but I could imagine a big academic library
> doing a bunch if they pushed out web tools to their students to make it
> easy to check if any edition of a given book (seen at Amazon or in a blog,
> etc.) is available in its collection. 1,000 queries a day (which used to be
> free) is now $500 USD per year. It's 20% off for OCLC members.

Y'know, we could just all chip in for the data file and provide free access through a web service. Heh. Someday, I'm gonna get sued.

Also... did I somehow miss the legislation in which factual information (like everything contained within xISBN) became copyrightable?

-Nate
Re: [CODE4LIB] more metadata from xISBN
Nathan Vack wrote:
> Also... did I somehow miss the legislation in which factual information
> (like everything contained within xISBN) became copyrightable?

License agreements can restrict just about anything the agreement wants to. If it's an agreement freely entered into, you can agree to a restriction on what you can do well beyond what copyright law would support. But yeah, on this general topic, this stifles a lot of things we'd want to do with xISBN, indeed. What options are there?

1) thingISBN, of course.

2) More interesting -- OCLC's _initial_ work set grouping algorithm is public. However, we know they've done a lot of additional work to fine-tune the work set grouping algorithms (http://www.frbr.org/2007/01/16/midwinter-implementers). Some of these algorithms probably take advantage of all the cool data OCLC has that we don't, okay. But how about we start working to re-create this algorithm? Re-create isn't a good word, because we aren't going to violate any NDAs; we're going to develop/invent our own algorithm, but this one is going to be open source, not a trade secret like OCLC's.

So we develop an algorithm on our own, and we run that algorithm on our own data. Our own local catalog. Union catalogs. Conglomerations of different catalogs that we do ourselves. Even reproductions of the OCLC corpus (or significant subsets thereof) that we manage to assemble in ways that don't violate copyright or license agreements. And then we've got our own workset grouping service. Which is really all xISBN is.

What is OCLC providing that is so special? Well, if what I've just outlined above is so much work that we _can't_ pull it off, then I guess we've got to pay OCLC, and if we are willing to do so (rather than go without the service), then I guess OCLC has correctly pegged their market price. But our field is not a healthy field if all research is being done by OCLC and other vendors.
We need research from other places; we need research that produces public domain results, not proprietary trade secrets.

Jonathan

-- Jonathan Rochkind Sr. Programmer/Analyst The Sheridan Libraries Johns Hopkins University 410.516.8886 rochkind (at) jhu.edu
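The initial public algorithm mentioned above grouped records into work sets on a normalized author/title key. A deliberately naive sketch of that idea (the normalization rules and names here are my own illustration, not OCLC's refined, unpublished version):

```python
import re
import unicodedata
from collections import defaultdict

def work_key(author, title):
    """Build a crude work-set key: fold accents, case, punctuation, and
    leading English articles, so variant records for one work collide."""
    def norm(s):
        s = unicodedata.normalize("NFKD", s)
        s = "".join(c for c in s if not unicodedata.combining(c))
        s = re.sub(r"[^a-z0-9 ]", "", s.lower())
        return re.sub(r"^(the|a|an) ", "", s).strip()
    return (norm(author), norm(title))

def group_worksets(records):
    """records: iterable of (isbn, author, title) triples -> key -> ISBN set."""
    sets = defaultdict(set)
    for isbn, author, title in records:
        sets[work_key(author, title)].add(isbn)
    return sets

recs = [
    ("0618260307", "Tolkien, J. R. R.", "The Hobbit"),
    ("0345339681", "Tolkien, J. R. R.", "Hobbit"),
    ("0596000278", "Wall, Larry", "Programming Perl"),
]
sets = group_worksets(recs)  # the two Hobbit editions fall into one work set
```

Running something like this over a union catalog is exactly the kind of open, improvable starting point the message argues for; all the hard, interesting work is in making the key smarter than this.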
Re: [CODE4LIB] more metadata from xISBN
As long as LibX is free and not being used as a way to drive Amazon revenue, I don't see how it could be considered to be commercial. We've studied our logs pretty carefully. Most of the sites that have exceeded the limit we set were commercial sites doing bulk harvest. You can track the xISBN use by LibX by getting an affiliate id.

Eric

At 2:32 PM -0400 5/9/07, Godmar Back wrote:
> Interesting. Thom Hickey commented a while ago about LibX's use of xISBN
> (*): I suspect that eventually the LibX xISBN support will become both less
> visible and more automatic. We were indeed planning on making it more
> automatic. For instance, a user visiting a vendor's page such as Amazon
> might be presented with options from their library catalog, based on
> related ISBNs found via xISBN. Would that qualify as noncommercial use?
> [...]
> - Godmar
> (*) http://outgoing.typepad.com/outgoing/2006/05/libx_and_xisbn.html

-- Eric Hellman, Director, OCLC Openly Informatics Division [EMAIL PROTECTED] 2 Broad St., Suite 208, Bloomfield, NJ 07003 tel 1-973-509-7800 fax 1-734-468-6216 http://openly.oclc.org/1cate/ 1 Click Access To Everything
Re: [CODE4LIB] more metadata from xISBN
On 5/9/07, Eric Hellman [EMAIL PROTECTED] wrote:
> As long as LibX is free and not being used as a way to drive Amazon
> revenue, I don't see how it could be considered to be commercial.

Probably a way to drive Amazon revenue down, considering that we offer the alternative to borrow the book rather than buy it.

> We've studied our logs pretty carefully. Most of the sites that have
> exceeded the limit we set were commercial sites doing bulk harvest. You can
> track the xISBN use by LibX by getting an affiliate id.

LibX is a client-side tool. We're not a user of xISBN; we provide clients who have installed it the option to use xISBN.

Also, keep in mind that an important reason to use OCLC's xISBN service - rather than using an alternate service or using the data directly - is Jeff Young's OAI bookmark service, specifically the know-how he's put into searching multiple catalogs and his keeping a database of which library uses which catalog. That, as I understand it, is still not part of the officially supported xISBN, though.

- Godmar
Re: [CODE4LIB] more metadata from xISBN
Yeah, that's a good point, Eric. I am, however, worried that I can't do what I want to do without breaking 500 queries a day, and my institution is not going to be willing to pay for it. So I'm interested in exploring other opportunities. (Does Umlaut really not exceed 500 queries a day, for instance?)

I am also interested in publicly shared and open-sourced algorithms for workset grouping, that we can all collectively work on to improve the state of our collective knowledge. I am unhappy that 'our' collective institution (OCLC) keeps the products of its research (such as the workset algorithm currently being used, but there are other significant examples many of us know of) as trade secrets, and am interested in a research project that would not do so.

If 'our' collective institution, OCLC, would share the results of its research as open-sourced algorithms, and would provide the services I need at more affordable costs, then of course neither of those would be necessary. One option is certainly spending time on trying to lobby OCLC to behave differently. Another option is creating an alternative. Both are, to me, legitimate options.

Jonathan

Eric Hellman wrote:
> Jonathan, It's worth noting that OCLC *is* the we you are talking about.
> OCLC member libraries contribute resources to do exactly what you suggest,
> and to do it in a way that is sustainable for the long term. Worldcat is
> created and maintained by libraries and by librarians. I'm the last to
> suggest that OCLC is the best possible instantiation of
> libraries-working-together, but we do try.
>
> Eric
>
> At 3:01 PM -0400 5/9/07, Jonathan Rochkind wrote:
> > 2) More interesting -- OCLC's _initial_ work set grouping algorithm is
> > public. However, we know they've done a lot of additional work to
> > fine-tune the work set grouping algorithms
> > (http://www.frbr.org/2007/01/16/midwinter-implementers).
> > [...]

-- Eric Hellman, Director, OCLC Openly Informatics Division [EMAIL PROTECTED] 2 Broad St., Suite 208, Bloomfield, NJ 07003 tel 1-973-509-7800 fax 1-734-468-6216 http://openly.oclc.org/1cate/ 1 Click Access To Everything

-- Jonathan Rochkind Sr. Programmer/Analyst The Sheridan Libraries Johns Hopkins University 410.516.8886 rochkind (at) jhu.edu
Re: [CODE4LIB] more metadata from xISBN
At 4:41 PM -0400 5/9/07, Godmar Back wrote:
> On 5/9/07, Eric Hellman [EMAIL PROTECTED] wrote:
> > We've studied our logs pretty carefully. Most of the sites that have
> > exceeded the limit we set were commercial sites doing bulk harvest. You
> > can track the xISBN use by LibX by getting an affiliate id.
>
> LibX is a client-side tool. We're not a user of xISBN; we provide clients
> who have installed it the option to use xISBN.

I know, and I had to explain that to the legal department!

> Also, keep in mind that an important reason to use OCLC's xISBN service -
> rather than using an alternate service or using the data directly - is Jeff
> Young's OAI bookmark service, specifically the know-how he's put into
> searching multiple catalogs and his keeping a database of which library
> uses which catalog. That, as I understand it, is still not part of the
> officially supported xISBN, though.

We will improve on that service...

-- Eric Hellman, Director, OCLC Openly Informatics Division [EMAIL PROTECTED] 2 Broad St., Suite 208, Bloomfield, NJ 07003 tel 1-973-509-7800 fax 1-734-468-6216 http://openly.oclc.org/1cate/ 1 Click Access To Everything
Re: [CODE4LIB] Z39.50 for III Database?
Godmar,

> ... Is this code available under a license? ...

Not yet. A third of me wishes I'd never seen Michael Doran's excellent code4lib2007 presentation and could just blindly release stuff open-source (for those not there: amongst great info, he cautioned against claiming to release stuff as open source when it may not legally be so), but the other 2/3 is *very* appreciative I was there, and our pro-open-source team hopes to get a process in place to legitimately release stuff with an explicit license. http://www.code4lib.org/2007/doran

So I'll just informally say that I hope this is useful to others for now.

By the way, to all: when I went to the code4lib site to make sure I attributed Michael properly, I didn't expect to see the nice presentation of the slideshow and video. Kudos to those of you who took the work of those we've thanked for producing this stuff and put it together on the conference-schedule links. Very nice. http://www.code4lib.org/2007/schedule

-Birkin

--- Birkin James Diana Programmer, Integrated Technology Services Brown University Library [EMAIL PROTECTED]

On May 8, 2007, at 6:56 PM, Godmar Back wrote:
> ... Is this code available under a license? ...

On 5/8/07, Birkin James Diana [EMAIL PROTECTED] wrote:
> On May 1, 2007 Godmar Back wrote:
> > ...Are there any reusable, open source scripts out there that implement a
> > REST interface that screenscrapes or otherwise efficiently accesses a III
> > catalog?...
>
> ...Below is the link to my code
> http://dl.lib.brown.edu/code/iii_opac_webservice.zip
> http://128.148.7.210/~birkin/wikinotes/doku.php?id=public:soa_josiah_status
Re: [CODE4LIB] more metadata from xISBN
On 5/9/07, Jonathan Rochkind [EMAIL PROTECTED] wrote:
> I am, however, worried that I can't do what I want to do without breaking
> 500 queries a day, and my institution is not going to be willing to pay for
> it. So I'm interested in exploring other opportunities. (Does Umlaut really
> not exceed 500 queries a day, for instance?)

The current state of OpenURLs being what they are, and how few of them have ISBNs, I don't think this would be a problem. At least, it probably wouldn't be a problem at Tech...

-Ross.