Re: [CODE4LIB] Question re: ranking and FRBR

2006-04-10 Thread Colleen Whitney

Is it just a straight sum, Thom?


Hickey,Thom wrote:

Here at OCLC we're ranking based on the holdings of all the records in
the retrieved work set.  Seems to work pretty well.


-----Original Message-----
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Colleen Whitney
Sent: Monday, April 10, 2006 1:06 PM
Subject: [CODE4LIB] Question re: ranking and FRBR

Hello all,

Here's a question for anyone who has been thinking about or working with
FRBR for creating record groupings for display.  (Perhaps others have
already discussed or addressed this, in which case I'd be happy to have
a pointer to resources that are already out there.)

In a retrieval environment that presents ranked results (ranked by
record content, optionally boosted by circulation and/or holdings), how
could/should FRBR-like record groupings be factored into ranking?
Several approaches have been discussed here:
- Rank the results using the score from the highest-scoring record in a
group
- Use the sum of scores of documents in a group (this seems to me to
place too much weight on the group)
- Use the log of the sum of the scores of documents in a group
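The three candidate strategies could be sketched as a single scoring function; this is purely illustrative, using hypothetical names, and is not a reconstruction of OCLC's or anyone else's actual ranking code:

```python
import math

def group_score(record_scores, mode="max"):
    """Score a FRBR-style work set from its member record scores.

    record_scores: list of per-record relevance scores (floats).
    mode selects one of the strategies under discussion:
      "max"     - score of the highest-scoring record in the group
      "sum"     - sum of member scores (tends to over-weight large groups)
      "log_sum" - log of the sum, damping the group-size effect
    """
    total = sum(record_scores)
    if mode == "max":
        return max(record_scores)
    if mode == "sum":
        return total
    if mode == "log_sum":
        return math.log(total)
    raise ValueError(f"unknown mode: {mode}")
```

The trade-off is visible directly: "sum" lets a work with many mediocre manifestations outrank a single excellent match, while "log_sum" still rewards group size but far more gently.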

I'd be very interested in knowing whether others have already been
thinking about this.


--Colleen Whitney

Re: [CODE4LIB] analysis of referrer data

2006-03-30 Thread Colleen Whitney

In my webmastering days we used AWStats to analyze our log files.

It has been a while, but I remember it being very configurable and easy
to use.  It might be worth looking it over to see whether it would yield
what you want for your analysis...might save you some headaches.

Eric Lease Morgan wrote:

How would you go about doing some analysis of your website's referrer data?

I have committed to writing an article for the anniversary issue of
First Monday (as if I don't already have enough to do). Here is the
accepted/proposed title and abstract:

  Ethical issues surrounding freely available information
  found on the Web

  By reverse engineering Google queries and by tracing back
  the referrer values found in Apache log files, the use of
  content made available from Infomotions is examined and
  ethical questions are asked. While all the content from the
  site is freely available under the GNU Public License, the
  content is not always used in the intended manner. This
  raises interesting questions regarding the time spent making
  the content available, the expense of the hardware and
  network connections, and whether or not the application of
  the content is put to good and moral purposes. This essay
  addresses these and other ethical questions in an attempt to
  come to an understanding regarding the place of information
  and knowledge in an open environment.

I find it interesting to watch the content of my access_log scroll by
on my console. I am most interested in the referrer information. Most
of my hits originate as searches against Google. It is fun to feed these
queries back into Google and see what people searched for, watch what
the searches return, and see what page number my item is located on. I
see that a lot of the hits to my site come from sites where
teenaged and college aged girls have incorporated some of my pictures
into their pages. Another common use is on bulletin board systems
where someone used one of my pictures as their avatar. In these
second and third cases, should I expect some sort of remuneration, or
at least a link back to my site?

Some hits come from really weird places. For example, the search for
'lease' brings back many hits about equipment rental, but sometimes
my name and/or the Alex Catalogue of Electronic Texts is linked from
the equipment rental site. Sort of strange if you ask me. They are
using my name, sort of. (Is it 'my' name?)

In any event, I plan to take two months of access_log data, extract
the pages being looked at and the referrer information to more
systematically examine how the content on Infomotions is being
incorporated into other sites. How would you suggest I do this?
Presently I plan to extract the necessary information from my logs
and dump it into a flat database file where I will exploit various
incarnations of SQL SELECT statements. Count this. Group that. Sort
this way. Etc. Mind you, I am most interested in the one-off sort of
hits, not just the overall usage.

How would you go about doing this sort of analysis? All I have to
start with are my Apache combined access_log files.

Eric Lease Morgan
University Libraries of Notre Dame
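The flat-file-plus-SQL plan Eric describes can be sketched in a few lines: parse the combined log format with a regular expression, load path/referrer pairs into SQLite, and then run the "count this, group that" queries against the table. The regex and table layout here are illustrative assumptions, not a reconstruction of his actual setup:

```python
import re
import sqlite3

# Apache "combined" log format. The referrer and user agent are the two
# quoted fields at the end of each line.
LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def load_log(lines, db_path=":memory:"):
    """Parse combined-format log lines into a SQLite table for ad hoc SQL."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE hits (path TEXT, referrer TEXT, status INTEGER)"
    )
    for line in lines:
        m = LINE.match(line)
        if m:  # silently skip malformed lines
            conn.execute(
                "INSERT INTO hits VALUES (?, ?, ?)",
                (m.group("path"), m.group("referrer"), int(m.group("status"))),
            )
    conn.commit()
    return conn
```

From there, the referrer breakdown is one query, e.g. `SELECT referrer, COUNT(*) FROM hits GROUP BY referrer ORDER BY COUNT(*) DESC`, and the one-off hits Eric is after are simply the rows where that count is 1.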

Re: [CODE4LIB] SV: Re: [CODE4LIB] tagging

2006-03-09 Thread Colleen Whitney

Magnus Enger wrote:

[EMAIL PROTECTED] 09.03.2006 00:05 

On 3/8/06, Ian Nebe Barnett [EMAIL PROTECTED] wrote:

Ed's point about the tags being tied to the submitting user so that
obvious troublemakers can be blocked is a good one - one that should have
occurred to me, but that's why we're having the discussion.  That doesn't
address more subtle problems - theoretically, having a large enough
userbase to drown out the ignorant or malicious entries with good ones
will take care of it, but not everyone has enough users (that will
actually enter tags) to make that work.

Actually, this is the best point of all --  (in general) our communities are
/quite/ small and our collections /quite/ large.  Trying to figure out how
to make the tagging and other user-added input statistically significant is
something we've been struggling with here for the greater part of a year.
The logical choice is to open the collection up to other communities, but
then we struggle with the accountability issue.

I think the problem of large collections and small communities is an important 
one, and well described. One solution could perhaps be to build tagging etc. 
into a service outside of the catalog itself...

I agree with this last comment completely.  Most people don't do the
bulk of their research in the library catalog; it's one source, but far
from the only one.  I've been interviewing humanities undergraduates
this week for a related project and they've uniformly commented on how
hard it is to manage the disconnect between what they find in the
library catalog and what they find using other online resources.

Re: [CODE4LIB] Code4Lib Code Sharing - (Re: [CODE4LIB] journal)

2006-02-23 Thread Colleen Whitney

My point yesterday was not just about the dangers of the small group
(although that's true too), but more that if you go down the path of
starting a journal it makes sense to be really clear about your goals,
and the audience you expect to serve.   Which in turn drives the format
and content.  *If* you want to attract people on the fringes, then you
might need to do a little bit of legwork to find out what sort of
content appeals to them and make sure to provide at least some of that.
If serving the people who have already decided to participate is the
goal, then you've already got your finger on that pulse.

To pick up on your question of perception:  I lurk on this list and I
decided to submit a proposal to the conference because others in this
community are working on similar problems.  And I was really struck by
the openness of the process...decisions made by vote, not by an
invisible conference panel.  I think that the same openness was
reflected in things like the lightning talk format at the conference.
And that's all good.

But stepping into more formalized things like journals and going after
nonprofit status takes things up a notch in terms of the need for
awareness of your goals, and how you present and conduct yourselves,
because it becomes a reflection of a professional peer group.  That's
non-trivial.  Which way you go depends on whether you intend to grow
beyond the current self-selected participants.  That's not a value
judgement...just an observation.

Finis.  I promise. :-)


It's been nice to see this thread in public...since oftentimes
discussion happens in real time in IRC (which isn't publicly logged--
yet). I think Colleen is right to point out the dangers of having a
'core group'. But I would argue that the core group is really a
mirage, and that up until now it has simply been people who've
decided to participate (for better or for worse) in #code4lib.


Re: [CODE4LIB] Catalog Enhancements Extensions (Re: mylibrary @ockham)

2005-12-02 Thread Colleen Whitney

Roy, we're in the middle of re-indexing, and there's some broken code in
the search that won't be fixed until after Martin returns from
vacation.  Plus, we're working on the UI right now, so it will be much
more readable very soon.

So I would suggest not sending him the link quite yet.  Within 2 - 3
weeks I suspect we can start giving out sneak previews though.


Roy Tennant wrote:

Short answer now, longer/better answer next week when someone gets
back in the office. We have 4.5 million records indexed at the
moment, but have had up to 9 million indexed. Our dev system runs on
a Unix server (specs to come) that runs other apps as well. I'm not
sure if we can share the crude search interface so you can judge the
response, but will try to find out.

On Dec 2, 2005, at 12:36 PM, Andrew Nagy wrote:

Roy Tennant wrote:

Andrew, just as an additional data point, we have millions of records
indexed in our Lucene-based XTF system, and the response isn't too
bad even on a development server.

Can you and others on this list briefly describe your hardware
for this?  I am assuming this is not running on an old 486 that is
sitting around in your office :)

Do you feel that the searching is processor intensive and may be best
suited for a load balanced infrastructure?  I am implementing my pilot
using eXist, which stores the XML database in B-trees, which to my
knowledge is an in-memory data structure, so the machine would need
lots of RAM; however, I am curious as to the processing requirements.

Thanks, you guys rock!


[CODE4LIB] Sneak previews, etc.

2005-12-02 Thread Colleen Whitney

Sorry all, that was obviously meant for Roy.

But...that said...when the prototype is ready for sneak previews I'll
send the link around to the list.