Thanks...is it just a straight sum, Thom? --C

Hickey, Thom wrote:
Here at OCLC we're ranking based on the holdings of all the records in the retrieved work set. Seems to work pretty well. --Th

-----Original Message-----
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Colleen Whitney
Sent: Monday, April 10, 2006 1:06 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Question re: ranking and FRBR

Hello all,

Here's a question for anyone who has been thinking about or working with FRBR for creating record groupings for display. (Perhaps others have already discussed or addressed this...in which case I'd be happy to have a pointer to resources that are already out there.)

In a retrieval environment that presents ranked results (ranked by record content, optionally boosted by circulation and/or holdings), how could/should FRBR-like record groupings be factored into ranking? Several approaches have been discussed here:

- Rank the results using the score from the highest-scoring record in a group
- Use the sum of the scores of documents in a group (this seems to me to place too much weight on the group)
- Use the log of the sum of the scores of documents in a group

I'd be very interested in knowing whether others have already been thinking about this.

Regards,
--Colleen Whitney
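The three strategies in the thread can be sketched in a few lines of Python. This is only an illustration of the trade-offs being discussed, not anyone's actual ranking code; the function name and the log(1 + x) form of the damping are my assumptions.

```python
import math

def group_score(scores, strategy="max"):
    """Collapse the per-record scores in a FRBR-style work set
    into a single ranking score for the whole group."""
    if strategy == "max":       # score of the highest-scoring record
        return max(scores)
    if strategy == "sum":       # tends to over-weight large groups
        return sum(scores)
    if strategy == "log_sum":   # dampens the size effect of large groups
        return math.log(1 + sum(scores))
    raise ValueError("unknown strategy: " + strategy)

# Example: a work set of three records with relevance scores
records = [0.9, 0.4, 0.2]
print(group_score(records, "max"))                  # 0.9
print(round(group_score(records, "sum"), 2))        # 1.5
print(round(group_score(records, "log_sum"), 3))    # 0.916
```

The worry Colleen raises is visible in the numbers: under "sum", a work with many mediocre records can outrank a work with one excellent record, while the log variant compresses that advantage.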
In my webmastering days we used AWStats to analyze our log files:

http://awstats.sourceforge.net/

It has been a while, but I remember it being very configurable and easy to use. It might be worth looking it over to see whether it would yield what you want for your analysis...might save you some headaches.

Eric Lease Morgan wrote:
How would you go about doing some analysis of your website's referrer data?

I have committed to writing an article for the anniversary issue of First Monday (as if I don't already have enough to do). Here is the accepted/proposed title and abstract:

Ethical issues surrounding freely available information found on the Web

By reverse-engineering Google queries and by tracing back the referrer values found in Apache log files, the use of content made available from infomotions.com is examined and ethical questions are asked. While all the content from the site is freely available under the GNU Public License, the content is not always used in the intended manner. This raises interesting questions regarding the time spent making the content available, the expense of the hardware and network connections, and whether or not the content is put to good and moral purposes. This essay addresses these and other ethical questions in an attempt to come to an understanding regarding the place of information and knowledge in an open environment.

I find it interesting to watch the content of my access_log scroll by on my console. I am most interested in the referrer information. Most of my hits originate as searches against Google. It is fun to feed these queries back into Google, see what people searched for, watch what the searches return, and see on what page my item is located. I see that a lot of the hits to my site come from MySpace.com, where teenaged and college-aged girls have incorporated some of my pictures into their pages.
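"Feeding the queries back into Google" starts with recovering what the visitor searched for. A minimal sketch of that step, assuming standard Google referrer URLs with the search terms in the `q` parameter (the example URL is made up):

```python
from urllib.parse import urlparse, parse_qs

def google_query(referer):
    """If the referer is a Google search URL, return the query
    the visitor typed; otherwise return None."""
    parts = urlparse(referer)
    if "google." not in parts.netloc:
        return None
    q = parse_qs(parts.query).get("q")
    return q[0] if q else None

print(google_query("http://www.google.com/search?q=alex+catalogue+milton"))
# alex catalogue milton
print(google_query("http://example.com/some/page"))
# None
```

`parse_qs` handles the URL decoding (plus signs and percent escapes), so the returned string is the query as typed.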
Another common use is on bulletin-board systems, where someone has used one of my pictures as their avatar. In these second and third cases, should I expect some sort of remuneration, or at least a link back to infomotions.com?

Some hits come from really weird places. For example, the search for "lease" brings back many hits about equipment rental, but sometimes my name and/or the Alex Catalogue of Electronic Texts is linked from the equipment-rental site. Sort of strange, if you ask me. They are using my name, sort of. (Is it 'my' name?)

In any event, I plan to take two months of access_log data and extract the pages being looked at and the referrer information, to more systematically examine how the content on Infomotions is being incorporated into other sites. How would you suggest I do this? Presently I plan to extract the necessary information from my logs and dump it into a flat database file, where I will exploit various incarnations of SQL SELECT statements. Count this. Group that. Sort this way. Etc. Mind you, I am most interested in the one-off sort of hits, not just the overall usage.

How would you go about doing this sort of analysis? All I have to start with are my Apache combined access_log files.

--
Eric Lease Morgan
University Libraries of Notre Dame
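The extract-then-SELECT plan above could be sketched with SQLite standing in for the flat database file. This is only one way to do it; the regular expression covers the standard Apache "combined" log format, and the sample log line is invented for illustration:

```python
import re
import sqlite3

# Apache "combined" log format:
# host ident user [date] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def load_log(lines, conn):
    """Parse combined-format lines and store path/referer pairs in SQLite."""
    conn.execute("CREATE TABLE IF NOT EXISTS hits (path TEXT, referer TEXT)")
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            conn.execute("INSERT INTO hits VALUES (?, ?)",
                         (m.group("path"), m.group("referer")))
    conn.commit()

# Usage with a made-up log line:
sample = ('127.0.0.1 - - [10/Apr/2006:13:06:00 -0500] '
          '"GET /alex/milton.txt HTTP/1.1" 200 1234 '
          '"http://www.google.com/search?q=paradise+lost" "Mozilla/4.0"')
conn = sqlite3.connect(":memory:")
load_log([sample], conn)
for row in conn.execute(
        "SELECT referer, COUNT(*) FROM hits GROUP BY referer ORDER BY 2 DESC"):
    print(row)
```

From there, the "count this, group that" exploration is ordinary SQL: GROUP BY referer for the aggregate view, and HAVING COUNT(*) = 1 to surface exactly the one-off hits Eric says he cares about.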
Magnus Enger wrote: [EMAIL PROTECTED] 09.03.2006 00:05

On 3/8/06, Ian Nebe Barnett [EMAIL PROTECTED] wrote:
Ed's point about the tags being tied to the submitting user, so that obvious troublemakers can be blocked, is a good one -- one that should have occurred to me, but that's why we're having the discussion. That doesn't address more subtle problems. Theoretically, having a large enough userbase to drown out the ignorant or malicious entries with good ones will take care of it, but not everyone has enough users (that will actually enter tags) to make that work.

Actually, this is the best point of all -- (in general) our communities are /quite/ small and our collections /quite/ large. Trying to figure out how to make the tagging and other user-added input statistically significant is something we've been struggling with here for the greater part of a year. The logical choice is to open the collection up to other communities, but then we struggle with the accountability issue.

I think the problem of large collections and small communities is an important one, and well described. One solution could perhaps be to build tagging etc. into a service outside of the catalog itself.

I agree with this last comment completely. Most people don't do the bulk of their research in the library catalog; it's one source, but far from the only one. I've been interviewing humanities undergraduates this week for a related project, and they've uniformly commented on how hard it is to manage the disconnect between what they find in the library catalog and what they find using other online resources.
My point yesterday was not just about the dangers of the small group (although that's true too), but more that if you go down the path of starting a journal, it makes sense to be really clear about your goals and the audience you expect to serve, which in turn drives the format and content. *If* you want to attract people on the fringes, then you might need to do a little bit of legwork to find out what sort of content appeals to them and make sure to provide at least some of that. If serving the people who have already decided to participate is the goal, then you've already got your finger on that pulse.

To pick up on your question of perception: I lurk on this list, and I decided to submit a proposal to the conference because others in this community are working on similar problems. And I was really struck by the openness of the process...decisions made by vote, not by an invisible conference panel. I think that the same openness was reflected in things like the lightning talk format at the conference. And that's all good. But stepping into more formalized things like journals and going after nonprofit status takes things up a notch in terms of the need for awareness of your goals, and how you present and conduct yourselves, because it becomes a reflection of a professional peer group. That's non-trivial. Which way you go depends on whether you intend to grow beyond the current self-selected participants. That's not a value judgement...just an observation.

Finis. I promise. :-) --C

It's been nice to see this thread in public...since oftentimes discussion happens in real time in IRC (which isn't publicly logged -- yet). I think Colleen is right to point out the dangers of having a 'core group'. But I would argue that the core group is really a mirage, and that up until now it has simply been people who've decided to participate (for better or for worse) in #code4lib. //Ed
Roy, we're in the middle of re-indexing, and there's some broken code in the search that won't be fixed until after Martin returns from vacation. Plus, we're working on the UI right now...it will be much more readable very soon. So I would suggest not sending him the link quite yet. Within 2-3 weeks, I suspect, we can start giving out sneak previews, though. --C

Roy Tennant wrote:
Short answer now, longer/better answer next week when someone gets back in the office. We have 4.5 million records indexed at the moment, but have had up to 9 million indexed. Our dev system runs on a Unix server (specs to come) that runs other apps as well. I'm not sure if we can share the crude search interface so you can judge the response, but I will try to find out. Roy

On Dec 2, 2005, at 12:36 PM, Andrew Nagy wrote:

Roy Tennant wrote:
Andrew, just as an additional data point, we have millions of records indexed in our Lucene-based XTF system, and the response isn't too bad even on a development server.

Can you and others on this list briefly describe your hardware platform for this? I am assuming this is not running on an old 486 that is lying around in your office. :) Do you feel that the searching is processor-intensive and may be best suited for a load-balanced infrastructure? I am implementing my pilot using eXist, which stores the XML database in B-trees. From my understanding, that is an in-memory data structure, so the machine would need lots of RAM; however, I am curious as to the processing requirements.

Thanks, you guys rock!
Andrew
Sorry all, that was obviously meant for Roy. But...that said...when the prototype is ready for sneak previews I'll send the link around to the list. Cheers, --Colleen