Re: [CODE4LIB] indexing word documents using solr
On Feb 10, 2015, at 12:43, Eric Lease Morgan emor...@nd.edu wrote:

On Feb 10, 2015, at 11:46 AM, Erik Hatcher erikhatc...@mac.com wrote: First, with Solr 5, it's this easy:

Where can I download Solr 5? None of the other versions seem to be complete. —ELM

It's not yet released, but it will be in a matter of days. RC2 was generated last night here: http://people.apache.org/~anshum/staging_area/lucene-solr-5.0.0-RC2-rev1658469/solr/

Sorry for the tease on Solr 5, that's just where I've been living lately :) Erik
Re: [CODE4LIB] Restrict solr index results based on client IP
Post-processing results as in #1 has big disadvantages, as you can't easily "fill back in": the docs that were removed may already have been accounted for in facet counts, for example. #2 would be my recommendation as well. There is an open issue to create an IP(v6) field type in Solr, with a patch there for IPv4 already. Erik

On Jan 7, 2015, at 11:41 AM, Chad Mills cmmi...@rci.rutgers.edu wrote:

Hello, Basically I have a Solr index where, at times, some of the results from a query must be limited to a set of users based on their client's IP address. I have been thinking about accomplishing this in one of two ways.

1) Post-processing the results for IP validity against an external data source and dropping those results which are not valid. That could leave me with a partial result list that would need another query to fill back in. Say I want 10 results and end up dropping 2 of them; I need to fill those 2 back in by performing another query.

2) Making the IP permission check part of the query. Basically, appending an AND clause on a field that stores the permissible IP addresses. The index field would be set to allow all IPs to access the result by default, but at times could contain the allowable IP addresses or maybe even ranges somehow.

Are there some other ways to accomplish this I haven't considered? Right now #2 seems more desirable to me. Thanks in advance for your thoughts!

-- Chad Mills, Digital Library Architect, Rutgers University Libraries, Scholarly Communication Center, Room 409D, Alexander Library, 169 College Avenue, New Brunswick, NJ 08901. Ph: 848.932.5924, Fax: 848.932.1386, Cell: 732.309.8538. https://rucore.libraries.rutgers.edu/
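A minimal sketch of option #2, expressed as a Solr filter query rather than an AND clause in q, so relevance scoring is unaffected. Assume a multivalued string field named allowed_ip (hypothetical) that holds the literal token "all" for unrestricted documents, or specific permitted addresses otherwise; the application substitutes the requesting client's address at query time:

  q=whatever the user typed
  fq=allowed_ip:(all OR "192.168.1.50")

Because the restriction lives in fq, facet counts and ranking stay consistent and no post-query fill-in is needed. Matching address ranges rather than exact addresses is exactly what the SOLR-6741 field type is meant to handle.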
Re: [CODE4LIB] Restrict solr index results based on client IP
I meant to include this link in my first reply, sorry: https://issues.apache.org/jira/browse/SOLR-6741

On Jan 7, 2015, at 11:53 AM, Erik Hatcher erikhatc...@mac.com wrote: Post-processing results as in #1 has big disadvantages, as you can't easily "fill back in" ...
Re: [CODE4LIB] MARC reporting engine
I'm surprised you didn't recommend going straight to Solr and doing the reporting from there :) Index into Solr using your MARC library of choice (e.g. solrmarc) and then get all authorities using facet.field=authorities (or whatever field name is used). Erik

On Nov 2, 2014, at 7:24 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

If you are, can become, or know a programmer, that would be relatively straightforward in any programming language using the open source MARC processing library for that language (ruby marc, pymarc, perl marc, whatever). Although you might find more trouble than you expect around authorities, with them being less standardized in your corpus than you might like.

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Stuart Yeates [stuart.yea...@vuw.ac.nz] Sent: Sunday, November 02, 2014 5:48 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] MARC reporting engine

I have ~800,000 MARC records from an indexing service (http://natlib.govt.nz/about-us/open-data/innz-metadata CC-BY). I am trying to generate: (a) a list of person authorities (and sundry metadata), sorted by how many times they're referenced, in wikimedia syntax; and (b) a view of a person authority, with all the records by which they're referenced, processed into a wikipedia stub biography.

I have established that this is too much data to process in XSLT or with multi-line regexps in vi. What other MARC engines are out there? The two options I'm aware of are learning multi-line processing in sed or learning enough Koha to write reports in whatever its reporting engine is. Any advice?

cheers stuart -- I have a new phone number: 04 463 5692
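For the record, the facet request behind that suggestion would look something like this (the authorities field name is hypothetical; use whatever your indexing step produced):

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=authorities&facet.limit=-1&facet.sort=count

With rows=0 you get back just the facet block: every distinct authority heading with its record count, already sorted by how many times each is referenced, which is essentially report (a).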
Re: [CODE4LIB] solr computation field norm problem
Nicolas - Lucene 4 still encodes norms, as described here: http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/similarities/DefaultSimilarity.html#encodeNormValue%28float%29 using this function: http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/util/SmallFloat.html#floatToByte315%28float%29 You might want to give SweetSpotSimilarity a try: http://lucene.apache.org/core/4_4_0/misc/org/apache/lucene/misc/SweetSpotSimilarity.html Erik

On Sep 26, 2013, at 8:02 AM, Nicolas Franck nicolas.fra...@ugent.be wrote:

I've been testing with Solr 4 (Lucene 4), which uses the new DefaultSimilarity class. It does not use the encodeNorm and decodeNorm methods anymore that caused all the trouble (storing the floats as a single byte). But it doesn't change anything? The field norms remain the same?

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Chris Fitzpatrick [chrisfitz...@gmail.com] Sent: Wednesday, September 25, 2013 7:57 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] solr computation field norm problem

Yeah... I think you're running into this: http://lucene.472066.n3.nabble.com/field-length-normalization-tp495308p495311.html TL;DR: Jay Hill says fields with 3 terms and 4 terms both score at .5 in the lengthNorm.

On Wed, Sep 25, 2013 at 4:21 PM, Nicolas Franck nicolas.fra...@ugent.be wrote:

Hi there, I have a question about the way Lucene computes the length norm (field norm) for its documents. My documents are indexed using Solr. These are the documents that were indexed (ignore 'score'; that is not part of the document itself):

<doc>
  <float name="score">1.00711</float>
  <str name="_id">ejn01:25675596</str>
  <str name="title">Journal of neurology research</str>
</doc>
<doc>
  <float name="score">1.00711</float>
  <str name="_id">ejn01:954925518616</str>
  <str name="title">Journal of neurology</str>
</doc>

The field title has the following definition in schema.xml:

<fieldType name="utf8text" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="1024"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" format="solr" ignoreCase="false" expand="true" tokenizerFactory="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="1024"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" format="solr" ignoreCase="false" expand="true" tokenizerFactory="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

If I use the query journal of neurology, both documents get the same score, although the second document is more exact. Supplying a phrase query does not fix the issue. I also see that the computed fieldNorm is 0.5 for both documents. Does this have something to do with the loss of precision when storing the length norm in one byte?
These are all the supplied parameters (defaults in solrconfig.xml):

<str name="lowercaseOperators">false</str>
<str name="mm">-10%</str>
<str name="pf">author^3 title^2</str>
<str name="sort">score desc</str>
<arr name="bq">
  <str>source:ser01^10</str>
  <str>source:ejn01^10</str>
  <str>(*:* -type:article)^999</str>
</arr>
<str name="echoParams">all</str>
<str name="df">all</str>
<str name="tie">0</str>
<str name="qf">author^15 title^10 subject^1 summary^1 library^1 location^1 publisher^1 place_published^1 issn^1 isbn^1</str>
<str name="q.alt">*:*</str>
<str name="ps">2</str>
<str name="defType">edismax</str>
<str name="q">journal of neurology</str>
<str name="echoParams">all</str>
<str name="sort">score desc</str>

Looking at the computation of the score, I see no difference between them (see below). Any idea why the fieldNorm is the same for both documents? Thanks in advance! Greetings, Nicolas

<str name="ejn01:25675596">
1.0071099 = (MATCH) sum of:
  0.0053001107 = (MATCH) sum of:
    0.0017667036 = (MATCH) max of:
      0.0017667036 = (MATCH) weight(title:journal^10.0 in 0), product of:
        0.005943145 = queryWeight(title:journal^10.0), product of:
          10.0 = boost
          0.5945349 = idf(docFreq=2, maxDocs=2)
          9.996294E-4 = queryNorm
        0.29726744 = (MATCH) fieldWeight(title:journal in 0), product of:
          1.0 = tf(termFreq(title:journal)=1)
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(field=title, doc=0)
    0.0017667036 = (MATCH) max of:
      0.0017667036 = (MATCH) weight(title:of^10.0 in 0), product of:
        0.005943145 = queryWeight(title:of^10.0), product of:
          10.0 = boost
          0.5945349 = idf(docFreq=2, maxDocs=2)
          9.996294E-4 = queryNorm
        0.29726744 = (MATCH) fieldWeight(title:of in 0), product of:
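The short answer to the question above is yes: it's the one-byte norm encoding. A self-contained check using Lucene 4.x's SmallFloat (the same function linked in the reply); DefaultSimilarity computes lengthNorm as 1/sqrt(numTerms) and then squeezes it into a single byte:

  import org.apache.lucene.util.SmallFloat;

  public class NormCheck {
      public static void main(String[] args) {
          // "Journal of neurology" = 3 terms, "Journal of neurology research" = 4 terms
          for (int numTerms : new int[] {3, 4}) {
              float norm = (float) (1.0 / Math.sqrt(numTerms)); // DefaultSimilarity lengthNorm
              byte encoded = SmallFloat.floatToByte315(norm);   // lossy one-byte encoding
              float decoded = SmallFloat.byte315ToFloat(encoded);
              System.out.println(numTerms + " terms: " + norm + " -> " + decoded);
          }
      }
  }

Both 0.5773503 (3 terms) and 0.5 (4 terms) round-trip to the same byte and decode to 0.5, which is exactly the fieldNorm(field=title) value in both explains above.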
[CODE4LIB]
We can have the Solr session when and wherever! :) Organizers - feel free to move it however it fits best. Related: with all of those pre-conferences, it looks like there'll need to be 6 rooms, but the page says 4 (admittedly, it says 4+). Erik

On Nov 28, 2012, at 16:23, Bess Sadler wrote:

On Nov 28, 2012, at 1:04 PM, Shaun Ellis sha...@princeton.edu wrote: In that respect, I would suggest preconference hackfests/workshops that involve some kind of pair programming with experienced/inexperienced hackers, which could follow up into a mentor relationship outside of the conference. I do like the idea of mentor/mentee speed-dating to align interests, but in this sense, the workshop/hackfest you sign up for kind of does that for you (assuming all the preconference proposals[1] are actually going to happen). [1] http://wiki.code4lib.org/index.php/2013_preconference_proposals -Shaun

My understanding is that all of the pre-conference proposals are going to happen (note to self: ask Erik Hatcher whether the evening Solr session could happen at a bar somewhere). The RailsBridge workshop in particular is aimed at folks who are new to Rails and perhaps new to programming in general, and RailsBridge as a thing was started as a way to bring more women into tech. If anyone is interested in helping out at the RailsBridge session, or at the Blacklight-tailored-for-RailsBridge session in the afternoon, please join us! Workshops like this can never have too many people walking the room to help out, and if we had enough experienced folks, this would be a great opportunity for pair programming and meeting potential mentors. Bess
Re: [CODE4LIB] extracting tiff info
There's Tika (http://tika.apache.org/), which has command-line capabilities. I just launched the UI app, dropped a TIFF on it, and got this output:

Bits Per Sample: 8 8 8 8 bits/component/pixel
Compression: LZW
Content-Length: 262844
Content-Type: image/tiff
Orientation: Top, left side (Horizontal / normal)
Photometric Interpretation: RGB
Planar Configuration: Chunky (contiguous for each subsampling pixel)
Predictor: 2
Rows Per Strip: 30 rows/strip
Samples Per Pixel: 4 samples/pixel
Strip Byte Counts: 20668 7759 13240 15631 14302 17278 11236 14414 6226 5401 7310 4813 12716 5368 4213 3357 5664 6081 8466 12266 8083 8541 14306 7245 11916 9443 4636 705 705 417 bytes
Strip Offsets: 8 20676 28435 41675 57306 71608 6 100122 114536 120762 126163 133473 138286 151002 156370 160583 163940 169604 175685 184151 196417 204500 213041 227347 234592 246508 255951 260587 261292 261997
Thumbnail Image Height: 881 pixels
Thumbnail Image Width: 1081 pixels
Unknown tag (0x0152): 1
Unknown tag (0x0153): 1 1 1 1
resourceName: tika-view.tiff
tiff:BitsPerSample: 8
tiff:ImageLength: 881
tiff:ImageWidth: 1081
tiff:Orientation: 1
tiff:SamplesPerPixel: 4

Erik

On Nov 19, 2012, at 14:31, Kyle Banerjee wrote:

Howdy all, I need to extract all the metadata from a few thousand images on a network drive and put it into a spreadsheet. Since the files are huge (each is 100MB+) and my connection isn't that fast, I strongly prefer not to move them before working on them -- i.e. I'm using cygwin and/or windows. Just eyeballing these things, I see the headers contain everything I need in purty rdf. What's the best way to extract this? I thought tiffinfo would do the trick, but it's just giving me technical info. Of course I can just parse the files with perl, but I'm thinking there just has to be a slicker way to do this. What's my best option? Thanks, kyle
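If you'd rather script it than drag files onto the UI app, a minimal sketch using the Tika Java API (the file name comes from the command line; -1 disables Tika's default write limit):

  import java.io.FileInputStream;
  import java.io.InputStream;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  public class TiffInfo {
      public static void main(String[] args) throws Exception {
          AutoDetectParser parser = new AutoDetectParser();
          Metadata metadata = new Metadata();
          InputStream stream = new FileInputStream(args[0]);
          try {
              // BodyContentHandler collects extracted text; we only want the metadata here
              parser.parse(stream, new BodyContentHandler(-1), metadata, new ParseContext());
          } finally {
              stream.close();
          }
          // Print every metadata key/value pair, same as the UI output above
          for (String name : metadata.names()) {
              System.out.println(name + ": " + metadata.get(name));
          }
      }
  }

The tika-app jar also does this from the command line with java -jar tika-app.jar -m file.tiff, which loops nicely over a few thousand files from cygwin.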
Re: [CODE4LIB] New Newcomer Dinner option
Looks like some MARC records I've seen.

On Feb 4, 2012, at 16:19, Cary Gordon listu...@chillco.com wrote:

Probably their cat… They need this: http://www.bitboost.com/pawsense/

On Sat, Feb 4, 2012 at 12:49 PM, Eric Lease Morgan emor...@nd.edu wrote:

LlkjyYYYYyetyeyppf Prpfc EXpdpppePeppp Pp P$ $p Pp$epepp $ppeppPP PRpp PepplpereprpeprrprPRPeeopwprprPprppertrretrtrrterrtwrtrtww TrWtwteteetrteeetetttetrteyertEtrrtEgrerrtetteyeyeeytwtyeyeyeeyeeeyeey eryeeyeyyyeryyyeyeyeyeyeyyyeyyyeeyreyytrtrttrrtrregtrgghgg gdhfgdhfrtgrhdrghdghdhdggdffdfffvbXVcyvvfvfvffvffvvvfvffvvffffffvf ffxBbbCnvNVqfddZuytuyrutyguhUOyy

ROTFL!!! I'm not sure, but I think somebody's b^tt has sent a message to the mailing list. Does anybody here speak b^tt? -- ELM

-- Cary Gordon The Cherry Hill Company http://chillco.com
Re: [CODE4LIB] Another Sharpie Opportunity
canadian_snacks++ unless you mean poutine ;) but if you're talking Dangerous Dan's Diner, +1: http://www.dangerousdansdiner.com/
Re: [CODE4LIB] code4lib conference '12 - Solr pre-conference
First come, first served - since I'm not going to be making it to Seattle, I will gladly donate my conference slot to whoever 1) can make it and 2) e-mails me first @ erik.hatc...@lucidimagination.com

On Feb 1, 2012, at 19:02, Erik Hatcher wrote: Regretfully I must cancel my trip to Seattle, a bummer on several levels as I always love code4lib conferences, the people, the topics ...
Re: [CODE4LIB] my conference slot?
Don't sweat it, Elizabeth... this is the case of the sharpie marker. If someone takes my slot, just pretend they're me as far as everything on your side goes, and they sharpie their name on a badge. But no one has responded to me anyway.

I know it's rough running an event (my company has run two major conferences a year for the past couple of years). But it's silly that someone can't fill my seat if I hand it over to them. I was told yesterday that it was up to me to solicit someone to fill that seat, so I solicited. Again, however, no one has responded yet. Erik

On Feb 2, 2012, at 11:10, Elizabeth Duell wrote:

NO. We are NOT adding anyone else to the participants list. The registration has been closed and will remain so. It is 2 working days before the start of the convention. No changes to the participants list are going to happen. Again, we are NOT ADDING ANYONE ELSE TO THE PARTICIPANTS LIST. Elizabeth

Elizabeth Duell, Orbis Cascade Alliance, edu...@uoregon.edu, (541) 346-1883
[CODE4LIB] code4lib conference '12 - Solr pre-conference
Regretfully I must cancel my trip to Seattle, a bummer on several levels as I always love code4lib conferences, the people, the topics, and was also looking forward to enjoying downtown Seattle a bit too. Last minute urgent business duties call, alas. I have alerted the code4libcon e-mail list as well. This means I won't be at the What's New in Solr pre-conference event that I was going to lead. However, I will make myself available to call/Skype/IRC in and do a bit of facilitation and contribute what I can to the get-together. I think it will be a useful/productive time slot for folks to discuss Solr experiences, challenges, and future needs, so please don't worry about me not being physically there, and take the opportunity to make it an interactive session where everyone introduces themselves and their projects and delves into the gory details of Solr experiences. I'm open to suggestions on how I can best participate remotely and contribute as best I can. Thanks, Erik
Re: [CODE4LIB] jQuery Ajax request to update a PHP variable
I'm with jrock on this one. But maybe I'm a luddite who didn't get the memo either (though I am credited as one of the instrumental folks in the Ajax world, heh - in one or more of the Ajax books out there; us old timers called it remote scripting).

What I hate hate hate about seeing JSON returned from a server for the browser to generate the view is stuff like:

  string = "<div>" + some_data_from_JSON + "</div>";

That embodies everything that is wrong about Ajax + JSON. As Jonathan said, the server is already generating dynamic HTML... why have it return JSON and move processing/templating to the client for some things but not other things? Rhetorical question... of course it depends on the application. If everything is entirely client-side generated, then sure. But for traditional webapps, sending JSON to the client simply to piece it together as HTML is hideous. I spoke to this a bit at my recent ApacheCon talk; slides are here: http://www.slideshare.net/erikhatcher/solr-flair-10173707 (slides 4 and 8 particularly address this topic).

So in short, opinions differ on the right way to do Ajax, obviously. It depends, no question, on the bigger picture and architectural pieces in play, but there is absolutely nothing wrong with having HTML returned from the server for partial pieces of the page. And in many cases it's the cleanest way to do it anyway. Erik

On Dec 5, 2011, at 18:45, Jonathan Rochkind wrote:

I still like sending HTML back from my server. I guess I never got the message that that was out of style, heh. My server application already has logic for creating HTML from templates, and quite possibly already creates this exact same piece of HTML in some other place, possibly for use with non-AJAX fallbacks, or some other context where that snippet of HTML needs to be rendered. I prefer to re-use this logic that's already on the server, rather than have a duplicate HTML generating/templating system in the javascript too. It's working fine for me, in my use patterns.

Now, certainly, if you could eliminate any PHP generation of HTML at all, as I think Godmar is suggesting, and basically have a pure Javascript app -- that would be another approach that avoids duplication of HTML generating logic in both JS and PHP. That sounds fine too. But I'm still writing apps that degrade if you have no JS (including for web spiders that have no JS, for instance), and have nice REST-ish URLs, etc. If that's not a requirement and you can go all JS, then sure. But I wouldn't say that making apps that use progressive enhancement with regard to JS and degrade fine if you don't have it is out of style, or if it is, it ought not to be! Jonathan

On 12/5/2011 6:31 PM, Godmar Back wrote:

FWIW, I would not send HTML back to the client in an AJAX request - that style of AJAX fell out of favor years ago. Send back JSON instead and keep the view logic client-side. Consider using a library such as knockout.js. Instead of your current (difficult to maintain) mix of PHP and client-side JavaScript, you'll end up with a static HTML page, a couple of clean JSON services (one for checked-out per subject, and one for the syndetics ids of the first 4 covers), and clean HTML templates. You had earlier asked the question whether to do things client- or server-side - well, in this example, the correct answer is to do it client-side. (Yours is a read-only application, where none of the advantages of server-side processing applies.)
- Godmar

On Mon, Dec 5, 2011 at 6:18 PM, Nate Hill nathanielh...@gmail.com wrote: Something quite like that, my friend! Cheers N

On Mon, Dec 5, 2011 at 3:10 PM, Walker, David dwal...@calstate.edu wrote:

I gotcha. More information is, indeed, better. ;-) So, on the PHP side, you just need to grab the term from the query string, like this:

  $searchterm = $_GET['query'];

And then in your JavaScript code, you'll send an AJAX request, like:

  http://www.natehill.net/vizstuff/catscrape.php?query=Cooking

Is that what you're looking for? --Dave

David Walker, Library Web Services Manager, California State University

-----Original Message----- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Nate Hill Sent: Monday, December 05, 2011 3:00 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] jQuery Ajax request to update a PHP variable

As always, I provided too little information. Dave, it's much more involved than that. I'm trying to make a kind of visual browser of popular materials from one of our branches from a .csv file. In order to display book covers for a series of searches by keyword, I query the catalog, scrape out only the syndetics images, and then display 4 of them. The problem is that I've hardcoded in a search for 'Drawing', rather than dynamically pulling the correct term and putting it into the catalog
Re: [CODE4LIB] stemming in author search?
On Jun 14, 2011, at 08:10 , Keith Jenkins wrote: Does Solr support Soundex? (Soundex was originally developed to assist with alternate spellings of names) Indeed. And several other phonetic algorithms: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory Erik
Re: [CODE4LIB] stemming in author search?
It's documented in that wiki page link below as true/false -- true will add tokens to the stream, false will replace the existing token. So if you index cat and the phonetic filter turns it into KT, it can either index both cat and KT, or just KT. Erik

On Jun 14, 2011, at 10:45, Jonathan Rochkind wrote:

Hey Erik, in that wiki documentation the example it gives is:

  <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>

Do you know what that 'inject' argument is about, and where (if anywhere) I'd find it (and other available arguments for PhoneticFilterFactory, which may or may not differ depending on the encoder chosen) documented?

On 6/14/2011 8:31 AM, Erik Hatcher wrote:

On Jun 14, 2011, at 08:10, Keith Jenkins wrote: Does Solr support Soundex? (Soundex was originally developed to assist with alternate spellings of names)

Indeed. And several other phonetic algorithms: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory Erik
[CODE4LIB] japanese (Solr) analysis
I'm trying to cull together the best practices for indexing/searching Japanese text. For those of you using Solr, what analyzer/field-type definition do you have for Japanese? Thanks for sharing! Erik
Re: [CODE4LIB] A suggested role for text mining in library catalogs?
Solr _can_ use stemming, but to do it with POS would be flakey, I'd think. Is work a verb or a noun?

Some of the (Solr-using) customers that I work with have done POS tagging (using tools like BasisTech's Solr plugins for entity tagging). Payloads can be assigned to terms during indexing and then used to weight the score when query terms match. Lucene supports payloads and scoring based on them natively, but it requires some code to wire together. Solr supports a little in terms of payloads, but to really use them effectively custom coding is needed. See https://issues.apache.org/jira/browse/SOLR-1485 for an example. Erik

On Feb 22, 2011, at 09:02, Cindy Harper wrote:

It's not ironic - my post was musing inspired by your work. I guess I wasn't sure if I understood your results. You were looking at the overall POS usage in the entire texts as a possible way of ranking the texts. I was wondering about the POS of particular search terms - those that could take on several POS. A related question - does Solr use stemming to widen the search to various POS? Then would it be meaningful to rank the given texts by the POS of the actual search terms? And has anyone looked at samples of user search terms - are they almost always noun phrases? Just wanting to understand what you have explored. And I probably should have added to your thread on NGC4LIB, rather than Code4lib - I tend to conflate them.

Cindy Harper, Systems Librarian, Colgate University Libraries, char...@colgate.edu, 315-228-7363

On Sat, Feb 19, 2011 at 5:42 PM, Eric Lease Morgan emor...@nd.edu wrote:

On Feb 19, 2011, at 11:26 AM, Cindy Harper wrote: I was just testing our discovery engine for any technical issues after a reboot. I was just using random single words, and one word I used was correct. Looking at the first-ranked items, I wondered if there's some role for parts of speech in ranking hits - are nouns and, in this case, adjectives more indicative of aboutness than verbs? The first items were Miss Manners' excruciatingly correct behavior, then a bunch of govdocs on an act to correct. I don't think there's any reason to prefer nouns over verbs, but I thought I'd throw the thought at you anyway.

Ironically, I was playing with parts-of-speech (POS) analysis the other day. [1] Using a pseudo-random sample of texts, I found there to be surprisingly similar POS usage between texts. With such similarity, I thought it would be difficult to use general POS as a means for ranking or sorting. On the other hand, specific POS may be useful. For example, Thoreau was dominated by first-person male pronouns, but Austen was dominated by third-person female pronouns. I think there is something to be explored here. [1] POS - http://bit.ly/hsxD2i -- Eric Still Counting Tweets and Chats Morgan
[CODE4LIB] [code4libcon] QA fodder for the What's New in Solr 1.4 preconference??
Just like I did last year, I'm requesting folks send me (on or off-list, as appropriate) issues/questions regarding Solr that I can factor into the session on Feb. 7 in Bloomington. Suggestions on specifics you'd like covered will be eagerly accepted and factored in too. Last year I had a ton of great questions. And it's not that I can necessarily provide concrete answers for them (I will do my best, of course), but that we toss them out there to the lib+Solr intersected community and see if we can collaboratively address them. Here's a link to last year's slides; see the last part of the slide deck for the community-contributed questions that we discussed live: http://www.slideshare.net/erikhatcher/solr-black-belt-preconference Looking forward to your feedback! Thanks, Erik
Re: [CODE4LIB] javascript testing?
Here at Lucid we've got some Jasmine going on for LWE JS testing. Erik On Jan 11, 2011, at 21:25, Gabriel Farrell gsf...@gmail.com wrote: I like QUnit because it's minimal and I'm used to unit testing. A lot of people are jumping on Jasmine, though. It might be more your style if you're into BDD. On Tue, Jan 11, 2011 at 7:21 PM, Bess Sadler bess.sad...@gmail.com wrote: Can anyone recommend a javascript testing framework? At Stanford, we know we need to test the js portions of our applications, but we haven't settled on a tool for that yet. I've heard good things about celerity (http://celerity.rubyforge.org/) but I believe it only works with jruby, which has been a barrier to getting started with it so far. Anyone have other tools to suggest? Is anyone doing javascript testing in a way they like? Feel like sharing? Thanks! Bess
[CODE4LIB] [JOB] Directory, Online Library Environment, University of Virginia
I'm passing this on from contacts at UVa; please use the contact info below to follow up.

==

DIRECTOR, ONLINE LIBRARY ENVIRONMENT
University of Virginia Library

The University of Virginia Library seeks a strong technical leader for the position of Director of our "online library environment," a comprehensive suite of tools and services to provide access to the Library's physical and digital collections. We seek candidates who can successfully architect and implement solutions providing faculty and students a cohesive, innovative environment for accessing information used in research, teaching, and learning.

Environment: The University of Virginia Library (http://www.lib.virginia.edu) is a leader in innovative customer service, an international leader in digital library research and digital scholarship, and is recognized for the strength and variety of its collections. The Library system consists of twelve libraries, with independent libraries for health sciences, law, and business. The libraries support 12,000 undergraduates, 6,000 graduate students, and 1,600 teaching faculty. The University and the Library have a strong commitment to achieving diversity among faculty and staff. The Neoclassical buildings of founder Thomas Jefferson's Academical Village still serve as the center of the University's Grounds (http://www.virginia.edu/uvatours/slideshow/) and as a unique backdrop for teaching, learning, and research.

Responsibilities: The Director of the online library environment is responsible for leading the development and implementation of emerging information technologies as well as managing daily operations for the Library's access and delivery applications. The Director will head a newly formed department of software developers and librarians in carrying out this activity. She or he will have oversight of all aspects of the Library's Integrated Library System (ILS - Sirsi/Dynix Unicorn) and will lead development of an information architecture that provides cohesive access and delivery. She or he will assess, architect, and implement new ways to provide content and workflow services traditionally provided by an ILS and develop gateways to other information resources such as the Library's electronic resources and institutional repositories.

The Director will:
- provide leadership and vision that ensure easy, reliable online access to a wide array of collections, information, and services in support of research, teaching, and learning;
- provide technical leadership in the design and implementation of all aspects of the software and infrastructure for ongoing development projects, and provide technical guidance to developers and systems administrators on project requirements as needed;
- manage the daily operations environment for the Library's access and delivery applications;
- design and implement technical enhancements to the Library's ILS infrastructure to meet current and future needs;
- supervise the daily work of both faculty and classified staff positions;
- collaborate with and provide technical guidance to partners within the Library and among entities that require access to Library content; and
- engage professionally in activities related to librarianship and digital scholarship.

Qualifications: Master's degree in Library Science, or master's degree or PhD in Computer Science, Information Sciences, or a related area.
Successful candidates should have demonstrated significant and progressively responsible experience managing positions with a range of technology-specific and administrative responsibilities. Experience in libraries or information organizations is preferred.

Preferred candidates will also have:
- a demonstrated understanding of digital library concepts and standards (e.g., metadata standards, media-specific standards);
- experience in systems design and systems architecture;
- demonstrated experience in the implementation of Open Source software and tools, including but not limited to: Enterprise Java development, the Ruby scripting language or equivalent, Ruby on Rails, Solr, Tomcat, and Unix/Linux (AIX preferred);
- an understanding of and commitment to library technologies;
- the ability to communicate clearly, both verbally and in writing;
- demonstrated ability to manage and lead information technology staff and projects as well as departmental priorities;
- demonstrated knowledge of emerging technologies and related research, including but not limited to: Blacklight, Fedora/DuraSpace, Evergreen, Koha, and other Open Source and proprietary systems related to online library environments;
- strong interpersonal skills; and
- a customer-service orientation.

Salary and Benefits: Competitive depending on qualifications. This position has general faculty status with excellent benefits, including 22 days of vacation and TIAA/CREF and other retirement plans. Review
[CODE4LIB] evening CrossFit excursion
I posted to the blog and updated the wiki: http://wiki.code4lib.org/index.php/C4L2010_social_activities#CrossFit_Asheville If you're one of the few, the proud, the insane, meet me in the lobby at 5:45pm. I'll depart at 6pm. The gym is really close. Erik
[CODE4LIB] exercising at code4libcon next week
code4libcon is about here, yay! I'm kinda in a fitness craze right now, and will be doing some training in Asheville. Monday night, 6:30pm, I'm going to the CrossFit Asheville gym - http://www.crossfitasheville.com/ I contacted them and they said that was a good time to come. I'll likely go back on Wednesday night at the same time. (dunno if they'll charge some fee, though). There were a couple of folks that mentioned interest, and I can carpool up to 3 others. If you've never done it before, now's not the time to start and I'm sure they'll only let experienced folks partake, but I imagine those curious about the insanity are welcome to spectate. Jogging - what say folks up for runs meet in the hotel lobby at 6:30am any day next week. I'm game for a relatively short run (2-3 miles) both Monday and Wednesday. I fleshed out a daily signup on the wiki. If it's too cold or treacherous out, I'll just hit the treadmill or rowing machine if they have it. http://wiki.code4lib.org/index.php/C4L2010_social_activities#Working_Out I'm still debating how many pushups folks must do at the Black Belt preconference, and which kata to teach ;) Erik
Re: [CODE4LIB] preconference proposals - solr
+1, Bess! I'm especially psyched for the kata demonstrations and sparring matches we'll have at the end of the session :) I'll tinker with the advanced session description a bit when I can, but let's run with that for the time being. I'm happy to have Naomi join me however she likes. Erik

On Nov 13, 2009, at 11:25 AM, Bess Sadler wrote:

Hey, how about this? I've been discussing this off-list with Erik and Naomi, and this is what we came up with (I also added it to the wiki). This is a proposal for several pre-conference sessions that would fit together nicely for people interested in implementing a next-gen catalog system.

1. Morning session - Solr white belt. Instructor: Bess Sadler (anyone else want to join me?) The journey of Solr mastery begins with installation. We will then proceed to data types, indexing, querying, and inner harmony. You will leave this session with enough information to start running a Solr service with your own data.

2. Morning session - Solr black belt. Instructors: Erik Hatcher (and Naomi Dushay? she has offered to help, if that's of interest). Amaze your friends with your ability to combine boolean and weighted searching. Confound your enemies with your mastery of the secrets of dismax. Leave slow queries in the dust as you performance-tune Solr within an inch of its life. [We should probably add more specific advanced topics here... suggestions welcome]

3. Afternoon session - Blacklight. Instructors: Naomi Dushay, Jessie Keck, and Bess Sadler. Apply your Solr skills to running Blacklight as a front end for your library catalog, institutional repository, or anything you can index into Solr. We'll cover installation, source control with git, local modifications, test-driven development, and writing object-specific behaviors. You'll leave this workshop ready to revolutionize discovery at your library. Solr white belts or black belts are welcome.

And then anyone else who had a topic that built on Solr (e.g., VuFind?) could add it in the afternoon. Obviously I'm biased, but I really do think the topic of implementing a next-gen catalog is meaty enough for a half day, and I know people are asking me about it and eager to attend such a thing. What do you think, folks? Bess

On 12-Nov-09, at 4:10 PM, Gabriel Farrell wrote:

On Tue, Nov 10, 2009 at 02:47:42PM +, Jodi Schneider wrote: If you'd be up for it, Erik, I'd envision a basic session in the morning. Some of us (like me) have never gotten Solr up and running. Then the afternoon could break off for an advanced session. Though I like Bess's idea, too! Would that be suitable for a conference breakout? Not sure I'd want to pit it against a Solr advanced session!

The preconfs should be as inclusive as possible, but I'm wondering if the Solr session might be more beneficial if we dive into the particulars right off the bat in the morning. There are only a few steps to get Solr up and running -- it's in the configuration for our custom needs that the advice of a certain Mr. Hatcher can really be helpful. You're right, though, that the NGC thing sounds more like a BOF session. I'd support that in order to attend a full preconf day of Solr. Gabriel

Elizabeth (Bess) Sadler, Chief Architect for the Online Library Environment, Box 400129, Alderman Library, University of Virginia, Charlottesville, VA 22904, b...@virginia.edu, (434) 243-2305
Re: [CODE4LIB] preconference proposals - solr
On Nov 13, 2009, at 11:42 AM, Walter Lewis wrote:

On 13 Nov 09, at 11:25 AM, Bess Sadler wrote: 1. Morning session - solr white belt [delightful descriptions snipped] 2. Morning session - solr black belt 3. Afternoon session - Blacklight

Is there any chance that the black belt session needs to be/should be a two-parter and run through the afternoon as well? ... or repeat for those who have just acquired their white belts but are headed in different directions? I'd hate to miss the Blacklight session myself, though!

How about a compromise? I'll do the morning advanced Solr session as proposed and then gladly make myself available for the remainder of the conference for any folks who have specific questions/issues with Solr. Erik
Re: [CODE4LIB] preconference proposals
On Nov 11, 2009, at 6:46 PM, Naomi Dushay wrote: What do you think about the Solr part having some specific goodies like:

+1 to it all!

- lots on dismax magic
- how to do fielded searching (author/title/subject) with dismax
- how to do browsing (TermsComponent query, then a fielded query to get matching docs)
- how to do boolean (use the lucene QP, or fake it with dismax)

Or, use the new Lucid-contributed extended dismax parser ;) https://issues.apache.org/jira/browse/SOLR-1553 Erik
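A sketch of what that extended parser buys you (field names hypothetical): with defType=edismax you keep dismax features like qf boosting and mm, while q also accepts full Lucene syntax, including the booleans and fielded clauses that plain dismax chokes on. For example:

  defType=edismax
  qf=title^10 author^15
  q=author:austen AND (pride OR prejudice)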
Re: [CODE4LIB] solr | StopFilterFactory - stopwords.txt
I often recommend against stop word removal altogether. Is there any reason you need to remove them? The primary reason stop words get removed is to increase performance of queries with very common terms. If you are encountering that, using Solr's CommonGramsFilter(Factory) is a good solution to keep your stop words and alleviate the potential performance degradation. The HathiTrust folks have had success with the common-grams capability. Erik

On Nov 11, 2009, at 3:41 PM, Eric James wrote:

Has anyone already given some thought to refining the Solr stopwords.txt for library collections, particularly finding aids? The words included in the out-of-the-box stopwords.txt are of very questionable unimportance:

an and are as at be but by for if in into is it not of on or s such t that the their then there these they this to was will with

We were indexing a field id with "no." as one of its tokens (for number), but wanted a query with "no" (where the person did not add the period) to find the doc; in actuality the "no" would get stripped by the StopFilterFactory. And thus we stumbled upon this list, and were a bit surprised by some of the inclusions (ex: will) and exclusions (ex: a). Thanks, Eric James, Yale University Libraries
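A sketch of what that looks like in schema.xml (the field type name here is hypothetical): CommonGramsFilterFactory glues each stopword to its neighboring token at index time ("journal of" also indexes as "journal_of"), so common words stay searchable while phrase queries containing them remain fast:

  <fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>

With this in place the "no." vs "no" problem also goes away, since nothing is dropped from the token stream.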
Re: [CODE4LIB] preconference proposals
I'm interested in presenting something Solr+library related at c4l10. I'm soliciting ideas from the community on what angle makes the most sense. At first I was thinking a regular conference talk proposal, but perhaps a preconference session would be better. I could be game for a half-day session. It could be an introductory Solr class: get up and running with Solr (+ Blacklight, of course). Or maybe a more advanced session on topics like leveraging dismax, Solr performance and scalability tuning, and so on. Or maybe a freer-form Solr hackathon session where I'd be there to help with hurdles or answer questions. Thoughts? Suggestions? Anything I can do to help the library world with Solr is fair game - let me know. Thanks, Erik

On Nov 9, 2009, at 9:55 PM, Kevin S. Clarke wrote:

Hi all, It's time again to collect proposals for Code4Lib 2010 preconference sessions. We have space for six full-day sessions (or 12 half-day sessions, or some combination of the two). If we get more than we can accommodate, we'll vote... but I don't think we will (take that as a challenge to propose lots of interesting preconference sessions). Like last year, attendees will pay $12.50 for a half day or $25 for the whole day. The preconference space will be in the hotel, so we'll have wireless available. If you have a preconference idea, send it to this list, to me, or to the code4libcon planning list. We'll put them up on the wiki once we start receiving them. Some possible ideas? A Drupal in libraries session? LOD part two? An OCLC webservices hackathon? Send the proposals along... Thanks, Kevin
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
The Lucene Highlighter doesn't require that the text you want highlighted be stored. In fact, you can pass any arbitrary text to the Highlighter. See the various getBestFragments methods on the Highlighter class: http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/highlight/Highlighter.html Erik

On Sep 29, 2009, at 7:01 AM, Yitzchak Schaffer wrote:

Hello, Sorry for any cross-posting annoyance. I have a request for a Greenstone collection I'm working on, to add context snippets to search results; for example, a search for yak culture might return this in the list of results:

  ... addressing the fine points of <strong>yak culture</strong>, the zoosociologists took into account ...

Sounds like a pretty basic feature, say our sponsors, and I agree. (Ah, it's also an old Trac ticket at http://trac.greenstone.org/ticket/444) I see that GS out of the box is set *not* to store the fulltext in the index, which seems to be a prerequisite for this kind of thing, as in http://bit.ly/ljNkL . Has anyone modified the Lucene indexing wrapper locally to do this? Given that we don't have any Java coders on staff, I've started porting the Lucene wrapper to PHP for use with a custombuilder.pl and Zend_Search_Lucene. I already have a PHP frontend, so adjusting that to display the results shouldn't be a problem; OTOH, because the frontend is PHP, I'm restricted to using buildtype lucene, or something else with good PHP support. Many thanks,

-- Yitzchak Schaffer, Systems Manager, Touro College Libraries, 33 West 23rd Street, New York, NY 10010. Tel (212) 463-0400 x5230, Fax (212) 627-3197. Email yitzchak.schaf...@gmx.com
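A minimal sketch of that, with 2.4-era Lucene APIs (the query and text here are stand-ins for whatever you have on hand); note the text handed to getBestFragment comes from a plain String, not a stored field:

  import java.io.StringReader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.highlight.Highlighter;
  import org.apache.lucene.search.highlight.QueryScorer;
  import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

  public class Snippets {
      public static void main(String[] args) throws Exception {
          Analyzer analyzer = new StandardAnalyzer();
          Query query = new QueryParser("content", analyzer).parse("yak culture");
          Highlighter highlighter = new Highlighter(
              new SimpleHTMLFormatter("<strong>", "</strong>"), new QueryScorer(query));
          String text = "... addressing the fine points of yak culture, "
              + "the zoosociologists took into account ...";
          // Analyze the raw text on the fly; no index or stored field involved
          TokenStream tokens = analyzer.tokenStream("content", new StringReader(text));
          System.out.println(highlighter.getBestFragment(tokens, text));
      }
  }

So the fulltext can live in Greenstone's own storage (or a database) and be fetched only for the handful of hits on the current results page.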
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
On Sep 29, 2009, at 7:33 AM, Yitzchak Schaffer wrote:

Erik Hatcher wrote: The Lucene Highlighter doesn't require that the text you want highlighted be stored. In fact, you can pass any arbitrary text to the Highlighter.

Thanks Erik, What I'm looking for is to return the context of the search result, not just the ID of the containing document - e.g. when all I input is yak culture, I get back the context from the document as a search result, without having to retrieve the doc itself:

  ... addressing the fine points of <strong>yak culture</strong>, the zoosociologists took into account ...

GS out of the box does not appear to support this, as it does not store the fulltext in the index. So yes, I can highlight stuff, but as it stands, I don't have the text to work with. IANA Lucene guru, so correct me if I misunderstand.

I'm a bit confused, then. You mentioned that somehow Zend Lucene was going to help, but if you don't have the text to highlight anywhere, then the Highlighter isn't going to be of any use. Again, you don't need the full text in the Lucene index, but you do need to get it from somewhere in order to be able to highlight it. Erik
Re: [CODE4LIB] indexing pdf files
Here's a post on how easy it is to send PDF documents to Solr from Java: http://www.lucidimagination.com/blog/2009/09/14/posting-rich-documents-to-apache-solr-using-solrj-and-solr-cell-apache-tika/ Not only can you post PDF (and other rich-content) files to Solr for indexing, you can also, as shown in that blog entry, extract the text from such files and have it returned to the client. This Solr capability makes the tool chain a bit simpler. Erik

On Sep 15, 2009, at 10:31 AM, Peter Kiraly wrote:

Hi all, I would like to suggest an API for extracting text (including highlighted or annotated text) from PDF: iText (http://www.lowagie.com/iText/). It is a Java API (with a C# port), and it helped me a lot when we worked with extraordinary PDF files. Solr uses Tika (http://lucene.apache.org/tika) for extracting text from documents, and Tika uses PDFBox (http://incubator.apache.org/pdfbox/) for PDF files. PDFBox is a great tool for normal PDF files, but it has (or at least had) some features I wasn't satisfied with:

- it consumed more memory compared with iText, and couldn't read files above a given size (the limit was large, about 1 GB, but we had even larger files);
- it couldn't correctly handle the conditional hyphens at the end of lines;
- it had poorer documentation than iText, and its API was also poorer (by that time Manning had published the iText in Action book).

Our PDF files were double-layered (original hi-res image + OCRed text), documents several thousand pages long (Hungarian scientific journals, the diary of the Houses of Parliament from the 19th century, etc.). We indexed the content with Lucene, and in the UI we showed one page per screen, so the user didn't need to download the full PDF. We extracted the table of contents from the PDF as well, and we implemented it in the web UI, so the user can browse pages according to the full file's TOC. This project happened two years ago, so it is possible that lots of things have changed since then. Király Péter http://eXtensibleCatalog.org

----- Original Message ----- From: Mark A. Matienzo m...@matienzo.org To: CODE4LIB@LISTSERV.ND.EDU Sent: Tuesday, September 15, 2009 3:56 PM Subject: Re: [CODE4LIB] indexing pdf files

Eric, 5. Use pdftotext to extract the OCRed text from the PDF and index it along with the MyLibrary metadata using Solr. [3, 4]

Have you considered using Solr's ExtractingRequestHandler [1] for the PDFs? We're using it at NYPL with pretty great success. [1] http://wiki.apache.org/solr/ExtractingRequestHandler Mark A. Matienzo, Applications Developer, Digital Experience Group, The New York Public Library
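A sketch of the SolrJ side of that blog post (1.4-era API; the URL, file name, and id are hypothetical):

  import java.io.File;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
  import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

  public class PostPdf {
      public static void main(String[] args) throws Exception {
          SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
          ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
          req.addFile(new File("article.pdf"));    // Tika does the text extraction server-side
          req.setParam("literal.id", "article-1"); // supply your own unique key
          req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
          server.request(req);
      }
  }

Add extractOnly=true as a parameter and Solr returns the extracted text and metadata to the client instead of indexing it, which is the second capability mentioned above.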
Re: [CODE4LIB] Usability evaluation of library online catalogues
On Feb 4, 2008, at 4:12 PM, David Fiander wrote: Actually, the idea of using AJAX to create a way to add and remove limits diagonally is exactly what U Virginia's Blacklight interface does, although with a slightly different interface: http://blacklight.betech.virginia.edu/

David - that is not accurate. Blacklight doesn't use Ajax anywhere currently (that I know of, and not for the basic search/browse/facet functionality at least). The only place I had Ajaxed it was with as-you-type suggest, but we took that out fairly early on in the Solr + Flare + MARC project, before it was even Blacklight, to avoid it becoming a performance issue as we were toying with the faceted UI.

Blacklight morphed over time to deal with facets differently, with my first incarnation using server-side session scope to keep track of the user's query/browse/invert trail. After I left, the next developer to tinker with it moved that session state to the URL, to make it all bookmarkable: http://tinyurl.com/39abwa (and that is a very horrible URL, IMO, and deserves a refactoring to be readable and hackable).

But no Ajax in the mix currently, not even for as-you-type suggestion. You can see the suggest feature built into Solr Flare in action here, if you know a Japanese character or two: http://www.rondhuit-demo.com/yademo/ Erik

p.s. I enjoyed the OLA tech trends session on Saturday. How's Blacklight look on your cell phone? :)
Re: [CODE4LIB] arg! classpaths!
Sadly the Lucene demo is not all that great. I recommend you start with Solr rather than Lucene directly. Erik

On Jan 26, 2008, at 9:30 AM, Eric Lease Morgan wrote:

(Arg! Classpaths!) Please tell me why Java throws the NoClassDefFoundError error when I think I have set up my classpath correctly:

  $ pwd
  /home/eric/lucene
  $ ls -lh
  total 720K
  -rw-r--r-- 1 eric eric 650K 2008-01-26 08:30 lucene-core-2.3.0.jar
  -rw-r--r-- 1 eric eric  52K 2008-01-26 08:30 lucene-demos-2.3.0.jar
  $ export CLASSPATH=$PWD:.
  $ echo $CLASSPATH
  /home/eric/lucene:.
  $ java org.apache.lucene.demo.IndexFiles
  Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/demo/IndexFiles

Put another way, I have downloaded Lucene and I'm trying to do the demo application. [1] You are supposed to put the .jar files in your classpath and then run java org.apache.lucene.demo.IndexFiles. What am I doing wrong? [1] http://lucene.apache.org/java/2_3_0/demo.html -- Eric Lease Morgan
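For the record, the likely culprit in that transcript is that a bare directory on the classpath only picks up loose .class files; jar files sitting in that directory are not searched and must be listed explicitly. A sketch of the fix, in the same shell session:

  $ export CLASSPATH=$PWD/lucene-core-2.3.0.jar:$PWD/lucene-demos-2.3.0.jar:.
  $ java org.apache.lucene.demo.IndexFiles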
Re: [CODE4LIB] Getting started with SOLR
On Nov 22, 2007, at 3:41 PM, Kent Fitch wrote:

On Nov 23, 2007 4:11 AM, Binkley, Peter [EMAIL PROTECTED] wrote: ... If you use boost on the date field the way you suggest, remember you'll have to reindex from scratch every year to adjust the boost as items age.

Or maybe just use a method such that 2007 dates boost the document by 3.0, 2008 dates by 3.1, 2009 by 3.2 ... Whether this is feasible depends on how else you are expecting other scoring boosts to interact with document-intrinsic boosts.

Don't do date boosting at index time, but rather tune things on the query end, using FunctionQuery and such. That'll give you maximum flexibility. Erik
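A sketch of the query-time approach (1.x-era function syntax; the pub_date field name is hypothetical): with dismax, a boost function can decay scores by recency without baking anything into the index, so nothing needs reindexing as items age. For example:

  bf=recip(rord(pub_date),1,1000,1000)

Here rord() is the reverse ordinal of the date (newest = 1) and recip(x,m,a,b) computes a/(m*x+b), so the newest documents get a boost near 1.0 that falls off smoothly for older ones. Tweak the constants, or swap in another function, without touching the index.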
[CODE4LIB] Portland library geeks?
I'll be in Portland later today through Monday for RailsConf. The schedule is really tight, but if there are some library geeks in the area that want to get together around a pool table let me know. Erik
Re: [CODE4LIB] pspell aspell: make your own word lists/dictionaries
Martin has created a Google Group for his spell checker, and discussions have been ongoing since c4lcon about how to contribute it to Lucene. You can learn more about it here: http://groups.google.com/group/spelt Martin has packaged the code with tests for folks to try it out easily. Erik

On Apr 3, 2007, at 2:01 PM, Jonathan Rochkind wrote:

I haven't had time to look at it yet, but someone at the Code4Lib conference proposed a more sophisticated approach to spell checking that sounded really interesting to me, and said he was going to share the code. I hope to have time to investigate at some point. Let's see if I can find it on the conference page... yeah, it was Martin Haye. You can watch his presentation here: http://video.google.com/videoplay?docid=4028600349627496246&hl=en Looks like he's martin.haye[at]gmail.com. During the lightning talk, he said he didn't want to distribute the code separately but wanted to include it in Lucene if possible; later in the conference, though, he said he had been convinced by the interest in it to distribute the code as its own standalone thing, and planned to do that presently. If anyone does or has explored using Martin's code, please let us know about your experience. Jonathan

Kevin Kierans wrote:

Has anyone created their own dictionaries for aspell? We've created blank-delimited lists of words from our OPAC: one for titles, one for subjects, and one for authors. (We're thinking of a series one as well.) We would like to use one of these word lists to offer suggestions depending on which search the patron is making. We're assuming we can make better suggestions if the words come from our actual OPAC. We've got it working with the dictionary that comes with aspell, but are having problems (we can't do it!) substituting our own dictionaries. Does anyone have any experience/knowledge/hints/pointers they can share with us? We are using Linux, PHP 5, aspell 0.50.5, and the PHP pspell functions. Thanks, Kevin TNRD Library System, Kamloops, British Columbia, Canada

-- Jonathan Rochkind, Sr. Programmer/Analyst, The Sheridan Libraries, Johns Hopkins University, 410.516.8886, rochkind (at) jhu.edu
Re: [CODE4LIB] Video encoding done - Mashup idea request
Slides schmides. :) Having slides synched to a speaker works for some cases, but for those of us who love doing live demos, coding on the fly, and just flat-out winging it, the slides are often just barely related to what's being said. Having the actual screen being presented is all that would have made sense for my presentation, for example. Erik

On Mar 16, 2007, at 6:36 PM, Gabriel Farrell wrote:

On Fri, Mar 16, 2007 at 09:20:47PM -, Richard Wallis wrote: If a mix of video and screen capture could be achieved it would be great, but certainty of a successful result AND a fallback if it doesn't work should be part of the plan. RJW

Agreed. Video plus screen capture would be nice, but video of the presenter and a copy of their slides to follow along with works pretty damn well for me. As far as trying to combine the two and sleek up the experience, Columbia's SOA has a nice setup for this on their site (see http://www.columbia.edu/itc/soa/dmc/cory_arcangel/ for example). There are some glitches in the timing, though, and I'm just not sure it's worth the extra effort. Gabe

PS In a similar vein, when I don't have separately available slides to follow along with, I find many of the Google TechTalks annoying. It's a lot of watching some not-terribly-handsome person refer to things I can't see. Having both video and slides is key.
Re: [CODE4LIB] Flamenco
On Mar 7, 2007, at 6:55 AM, K.G. Schneider wrote: A mention of the Flamenco project (open source faceted navigation) on Catalogablog made me wonder if anyone on c4l had looked at this: http://flamenco.berkeley.edu/

Of course! Many of us have been all over Flamenco since we first saw it. There was a lightning talk about it at the conf. last week, in fact. It's very nice that it's open sourced now. I've yet to give it a try (besides using the online demos)... but it has some nice features that I'm going to borrow into the Solr Flare work, such as hierarchical facets. Erik
Re: [CODE4LIB] Preconference
On Feb 13, 2007, at 9:47 AM, Susan E Teague Rector/FS/VCU wrote: Are we supposed to be using a predefined set of data for the preconference or can we use our own data? Susan - I'm going to package up a lot of stuff (Solr, sample datasets, Luke, etc) to help everyone get started, but bringing your own data is encouraged as long as you also bring along the necessary tools and know-how to process that data into something usable by Solr (either XSLT to .xml files, or via code that speaks to Solr directly). So by all means bring your data. Erik
Re: [CODE4LIB] Preconference
On Feb 13, 2007, at 10:58 AM, Jonathan Rochkind wrote: If we bring MARCXML and/or MODS, can we assume that there will be people who can help us process that data into something usable by Solr? That would be nice, at any rate. Yes, yes you can make such an assumption. However, I want this to be clear - we'll have plenty of sample data to play with in your own environments, so you can rest assured that you'll leave the pre-conference knowing that you're in good hands with Solr. Fiddling with your specific data set, if it doesn't fit into any of the pre-fab scripts at the time of the pre-conf, could take some time, and it'll be much better to get Solr know-how than MARCXML know-how. Erik
Re: [CODE4LIB] Preconference
Or here: http://wiki.apache.org/solr/Solr4Lib I like the Solr wiki for highly Solr-specific stuff... though the code4lib site can handle file attachments whereas the Apache wiki does not (I don't think). So, maybe upload files to code4lib, but consider adding a blurb on the Solr4Lib wiki page pointing to it. Erik On Feb 13, 2007, at 12:48 PM, Bess Sadler wrote: Could we post them on the code4lib.org page about the pre-conference? http://code4lib.org/node/139 Bess On Feb 13, 2007, at 12:03 PM, Binkley, Peter wrote: That would be great. I've got a MODS-to-Solr xsl to share as well. Where would be a good place to post these, along with relevant Solr schemas? Peter -----Original Message----- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Andrew Nagy Sent: Tuesday, February 13, 2007 9:18 AM To: CODE4LIB@listserv.nd.edu Subject: Re: [CODE4LIB] Preconference I have an XSLT doc for transforming MARCXML to SOLR XML that I can share around. Andrew Jonathan Rochkind wrote: If we bring MARCXML and/or MODS, can we assume that there will be people who can help us process that data into something usable by Solr? That would be nice, at any rate. Jonathan Erik Hatcher wrote: On Feb 13, 2007, at 9:47 AM, Susan E Teague Rector/FS/VCU wrote: Are we supposed to be using a predefined set of data for the preconference or can we use our own data? Susan - I'm going to package up a lot of stuff (Solr, sample datasets, Luke, etc) to help everyone get started, but bringing your own data is encouraged as long as you also bring along the necessary tools and know-how to process that data into something usable by Solr (either XSLT to .xml files, or via code that speaks to Solr directly). So by all means bring your data. Erik -- Jonathan Rochkind Sr. Programmer/Analyst The Sheridan Libraries Johns Hopkins University 410.516.8886 rochkind (at) jhu.edu Elizabeth (Bess) Sadler Head, Technical and Metadata Services Digital Scholarship Services Box 400129 Alderman Library University of Virginia Charlottesville, VA 22904 [EMAIL PROTECTED] (434) 243-2305
Re: [CODE4LIB] Getting data from Voyager into XML?
On Jan 17, 2007, at 3:26 PM, Andrew Nagy wrote: One thing I am hoping that can come out of the preconference is a standard XSLT doc. I sat down with my metadata librarian to develop our XSLT doc -- determining what fields are to be searchable, what fields should be left out to help speed up results, etc. It's pretty easy; I think you will be amazed how fast you can have a functioning system with very little effort. You're quite right with that last statement. I am, however, skeptical of a purely MARC -> XSLT -> Solr solution. The MARC data I've seen requires some basic cleanup (removing dots at the end of subjects, normalizing dates, etc.) in order to be useful as facets. While XSLT is powerful, this type of data manipulation is better (IMO) done with scripting languages that allow for easy tweaking in a succinct way. I'm sure XSLT could do everything that you'd want done; you can also drive screws in with a hammer :) That being said - if you've got XSLT chops, and can easily go from MARC XML to Solr's XML [1], you'll be in great shape at the pre-conference for quickly getting your data into Solr and seeing what needs to be cleaned up. Seeing the raw data in a faceted way is actually very helpful for knowing where to go next with cleanup, and showing catalogers where inconsistencies live in the data. Erik [1] http://wiki.apache.org/solr/UpdateXmlMessages
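As a sketch of the sort of cleanup meant above - in Java rather than a scripting language, and with made-up sample headings - stripping ISBD-style trailing dots so that otherwise-identical subject facet values collapse together:

    public class SubjectCleaner {
        // Normalize a subject heading before it becomes a facet value.
        public static String clean(String subject) {
            String s = subject.trim();
            // Strip one trailing dot, but leave abbreviations like "U.S." alone.
            if (s.endsWith(".") && !s.matches(".*\\b[A-Z]\\.$")) {
                s = s.substring(0, s.length() - 1);
            }
            return s;
        }

        public static void main(String[] args) {
            System.out.println(clean("Poetry, English."));    // -> "Poetry, English"
            System.out.println(clean("United States. U.S.")); // left as-is
        }
    }

The abbreviation guard is only a heuristic; real MARC data will surface plenty of exceptions, which is exactly why seeing the raw facets first helps.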
Re: [CODE4LIB] Getting data from Voyager into XML?
Tod, Great information. I apologize for being a latecomer to the game and bringing up FAQs. What about date normalization? One thing that must be considered when doing faceted browsing is that it works best with some pre-processed data, such as years rather than full dates. The question becomes: where does the logic for stripping out the years belong? Solr could do it if configured with a custom analyzer for certain fields, or the client could do it. Is there XSLT available to do this sort of thing with dates? Erik On Jan 19, 2007, at 5:58 AM, Tod Olson wrote: On Jan 19, 2007, at 4:07 AM, Erik Hatcher wrote: On Jan 17, 2007, at 3:26 PM, Andrew Nagy wrote: One thing I am hoping that can come out of the preconference is a standard XSLT doc. I sat down with my metadata librarian to develop our XSLT doc -- determining what fields are to be searchable, what fields should be left out to help speed up results, etc. It's pretty easy; I think you will be amazed how fast you can have a functioning system with very little effort. You're quite right with that last statement. I am, however, skeptical of a purely MARC -> XSLT -> Solr solution. The MARC data I've seen requires some basic cleanup (removing dots at the end of subjects, normalizing dates, etc.) in order to be useful as facets. While XSLT is powerful, this type of data manipulation is better (IMO) done with scripting languages that allow for easy tweaking in a succinct way. I'm sure XSLT could do everything that you'd want done; you can also drive screws in with a hammer :) So the punctuation stripping has already been done in XSLT. LoC has a MARCXML -> MODS XSLT stylesheet [1] which strips out the evil ISBD punctuation. I've generally found mapping from MODS to be more convenient than mapping from MARC, so while it's an extra step, it does save a little programmer time since some of the hidden hierarchy in the MARC data is made explicit in the MODS structure. If hopping through MODS is unacceptable, the LoC has the punctuation-stripping nicely tucked away in a MARC Conversion Utility Stylesheet that you could use directly in a MARC XML -> Solr transformation. [2] [1] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS.xsl [2] http://www.loc.gov/marcxml/xslt/MARC21slimUtils.xsl Tod Olson [EMAIL PROTECTED] Programmer/Analyst University of Chicago Library
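Whichever side the logic lands on, the year extraction itself is small; here is a hedged client-side sketch in Java (the regex and sample inputs are assumptions about what a 260$c might contain):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class YearExtractor {
        // Find a plausible four-digit year; the lookarounds keep us from
        // grabbing four digits out of a longer run of digits.
        private static final Pattern YEAR =
            Pattern.compile("(?<!\\d)(1[0-9]{3}|20[0-9]{2})(?!\\d)");

        public static String extractYear(String rawDate) {
            Matcher m = YEAR.matcher(rawDate);
            return m.find() ? m.group(1) : null; // null = no usable year facet
        }

        public static void main(String[] args) {
            System.out.println(extractYear("c1997."));                  // 1997
            System.out.println(extractYear("[between 1850 and 1860]")); // 1850
            System.out.println(extractYear("n.d."));                    // null
        }
    }

The same logic could live in a custom Solr analyzer instead; doing it client-side just keeps the index configuration vanilla.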
Re: [CODE4LIB] Lucene Newbie Question
Andrew, On Jan 11, 2007, at 10:47 AM, Andrew Darby wrote: Hello, all. I'm trying to get started with Lucene for the Code4Lib preconference Excellent!!! and was wondering if someone could help. Of course I'm trying to do the first example from the Lucene site (http://lucene.apache.org/java/docs/demo.html) on my Windows XP machine but when I try to build the test index from the command line like so: C:\lucene-2.0.0>java org.apache.lucene.demo.IndexFiles C:\lucene-2.0.0/src I get the following error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/demo/IndexFiles My CLASSPATH looks like this: .;C:\Program Files\QuickTime\QTSystem\QTJava.zip;C:\lucene-2.0.0\build\lucene-core-2.0.1-dev.jar;C:\lucene-2.0.0\build\lucene-demos-2.0.1-dev.jar; 2.0.1? Where'd you get that version? I pulled down the latest stable release, 2.0.0, just now to run through this myself. Rather than setting CLASSPATH (an evil thing in the Java world, it can really bite you at inopportune times), I ran it this way successfully: java -cp lucene-core-2.0.0.jar:lucene-demos-2.0.0.jar org.apache.lucene.demo.IndexFiles src/ I assume this is a basic error, and something to do with the classpath, but as best I can tell everything is correct, the IndexFiles.class file is where it should be, etc. I'm not familiar with Java, if you haven't guessed. Any suggestions? Sadly the demo that ships with Lucene is pretty weak. For more examples, grab the Lucene in Action (LIA) codebase from http://www.lucenebook.com and fire it up simply by typing ant and following the instructions in the README too. That code is for Lucene 1.4.3 - 1.9.x. Lucene 2.0 removed deprecated methods, and there are a few tidbits of trivia for adjusting LIA code to Lucene 2.0 available here: http://www.nabble.com/Lucene-in-Action-examples-complie-problem-tf2418478.html#a6743189 The demo that ships with Lucene is barely usable for anything other than "yeah, it can search text", but boy is it a hassle to run. Keep in mind that Lucene is a low-level library, so for there to be much of use out of it, you have to build something around it. The Indexer and Searcher command-line apps in the LIA code base provide a better working demo out of the box, but still quite crude. Erik
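For a slightly more self-contained starting point than the bundled demo, here is a bare-bones sketch against the Lucene 2.0 API; the index directory, field names, and sample values are mine, not the demo's:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class MinimalIndexer {
        public static void main(String[] args) throws Exception {
            // true = create (or overwrite) the index in the "index" directory
            IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

            Document doc = new Document();
            // Store the fields so they come back with hits; tokenize them
            // so they are full-text searchable.
            doc.add(new Field("title", "Zen Training", Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("author", "Katsuki Sekida", Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);

            writer.optimize(); // merge segments for faster searching
            writer.close();
        }
    }

Compile and run it with the same -cp form shown above, adding the directory containing MinimalIndexer.class.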
Re: [CODE4LIB] Lucene Newbie Question
On Jan 11, 2007, at 12:10 PM, Andrew Darby wrote: Thanks Erik and Bess. Erik: Lamentably, your java -cp lucene-core-2.0.0.jar:lucene-demos-2.0.0.jar org.apache.lucene.demo.IndexFiles src/ threw the same error. That is probably due to your environment CLASSPATH (I told you it was trouble! :). Remove that environment variable altogether, or set it to blank, or at least remove all the Lucene JARs from it, and all should be well. I'm going to take a look at the LuceneInAction codebase and see if I can get it working that way. Thanks for taking the time to install 2.0.0. I don't know how I ended up with 2.0.1 jars--they appeared when I ran ant . . . . Oh, you built Lucene from source... that makes sense. The version number in the build file of a 2.0.0 release distribution is set one version higher in the build environment - this allows us Lucene developers to at least have a clue that the user built it themselves (and possibly tinkered with the source code) instead of using a binary release (it helps, in supporting users, to know what version folks are running). Erik
Re: [CODE4LIB] Lucene Newbie Question
On Jan 11, 2007, at 2:54 PM, Erik Hatcher wrote: On Jan 11, 2007, at 12:10 PM, Andrew Darby wrote: Thanks Erik and Bess. Erik: Lamentably, your java -cp lucene-core-2.0.0.jar:lucene-demos-2.0.0.jar org.apache.lucene.demo.IndexFiles src/ threw the same error. That is probably due to your environment CLASSPATH (I told you it was trouble! :). Remove that environment variable altogether, or set it to blank, or at least remove all the Lucene JARs from it, and all should be well. Also, in case you copied my exact example (I noticed you're on Windows), you'll need to adjust the path separator from ':' to ';' between the two JARs in the -cp switch. That might be the issue rather than CLASSPATH, come to think of it. Erik
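Spelled out, the Windows form of that command would presumably be java -cp lucene-core-2.0.0.jar;lucene-demos-2.0.0.jar org.apache.lucene.demo.IndexFiles src/ - run from the directory containing both JARs.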
[CODE4LIB] solrb DSL collaboration
A little edgy in #code4lib today about where we* are going with solrb (the Ruby/Solr domain-specific language API), so we're going to add a bit of process by fleshing it out via the solrb section of the Solr wiki. Below is the first draft, though I've since revised it slightly. Click the link to add your ideas. Notice you can get e-mail notifications of Apache wiki changes, though you'll have to sign up for the solr-commits e-mail list - email a blank message to [EMAIL PROTECTED] - and you'll also see the gory details of svn commit messages (not only in solrb/flare, but also to Solr itself). Reviewing code commit messages is a practice I highly recommend, for what it's worth. Erik * we is us. I'm really working hard on making this a collaborative, community-oriented effort. Our pre-conference will hopefully consist of a few folks that have gone along for the solrb/flare ride over the next several weeks and will literally be experts in the domain by then. Begin forwarded message: From: Apache Wiki [EMAIL PROTECTED] Date: January 9, 2007 5:00:01 PM EST To: solr-commits@lucene.apache.org Subject: [Solr Wiki] Update of solrb/BrainStorming by ErikHatcher Reply-To: solr-dev@lucene.apache.org Dear Wiki user, You have subscribed to a wiki page or wiki category on Solr Wiki for change notification. The following page has been changed by ErikHatcher: http://wiki.apache.org/solr/solrb/BrainStorming New page: This is straw-man, pie-in-the-sky thinking on a Ruby Solr DSL:

{{{
class Book
  attr_reader :isbn
  attr_reader :title
  attr_reader :subjects
  attr_reader :authors
  attr_reader :dates

  include Solrable

  solr_facet :subject
  solr_facet :year do
    self.dates.collect { |date| date.year }
  end
end

# Search for full-text matches in the "author" field
Book.search_by_author("sekida")

# Search for full-text matches in the "title" field
Book.search_by_title("zen")

# Search for full-text matches as a boolean AND in author and title
Book.search_by_author_and_title("sekida", "zen")
}}}

Other ideas?
[CODE4LIB] Solr Flare
code4libers, I've kicked off a sub-project of Solr called Flare. It has several goals, including being a Solr Ruby DSL and providing a general-purpose user interface framework that includes faceted browsing, suggest interfaces, and the folksonomy angle of tagging/annotating results. At the moment there isn't too much there, but I will be devoting every waking spare moment to this over the next couple of months. To get involved, first check out the Flare wiki: http://wiki.apache.org/solr/Flare Then sign up for the solr-user e-mail list, submit patches, and so on. From #code4lib, I've seen that this community has some very sharp Ruby talent, and I'm eager to have collaborators on this. I'm building this out as a distillation of Collex down to the faceted browsing piece, and evolving it up to the folksonomy features, as a way to start with a clean slate in a test-centric fashion. The initial goal is for this to become the demo I do for the UVa library folks on their 3M+ MARC records, and to also use this for the code4lib pre-conference. I'm looking forward to your suggestions, critique, patches, and collaborations at any level. Thanks, Erik
[CODE4LIB] Fwd: Solr 1.1 released
The state of Solr's official release was asked about on #code4lib the other day. Here ya go, hot off the press Begin forwarded message: From: Yonik Seeley [EMAIL PROTECTED] Date: December 22, 2006 5:07:49 PM EST To: solr-user@lucene.apache.org, solr-dev@lucene.apache.org Subject: Solr 1.1 released Reply-To: solr-dev@lucene.apache.org Solr 1.1 is now available for download! This is the first official release since Solr entered the Incubator. The release is available at http://people.apache.org/dist/incubator/solr/1.1/ and the detailed changelog is at http://people.apache.org/dist/incubator/solr/1.1/CHANGES.txt Thanks to everyone that helped make this happen! -Yonik
Re: [CODE4LIB] code4lib lucene pre-conference
On Nov 29, 2006, at 10:27 AM, Art Rhyno wrote: I am so behind in e-mail that I might be treading on ground that is worn out on this, but I would add to Eric's list that I don't care about the indexer if: Here's how Lucene/Solr fares on these points: * the indexer has an open and configurable relevancy weighting algorithm Adjusting relevancy with Lucene is configurable in a number of ways: boosts and tweaking Similarity. * the indexer allows control of how the data is normalized Is this the indexer's job? I say not. Sure, we'd all love to have everything including the kitchen sink hidden behind some drag-and-drop interface, but it really isn't Solr's job to clean up data. I'm not quite sure what you mean by normalized, though, so maybe I'm off base? * the indexer uses pluggable parsers Solr doesn't know MARC from Adam. Again, it isn't Solr's job to parse MARCXML, I'd argue. It's a full-text search engine, and overloading it to be more than that is asking for trouble later when you do want to swap things out. Maybe you mean tokenization rather than parsing, though, in which case Solr and Lucene certainly have great configurability. * the indexer supports very fast retrieval :) But of course! then, on the preferred side: * the indexer allows the index process to effectively leverage commodity hardware The beefier the better. * the indexer creates an index that can be combined with others Solr may eventually federate with other Solr instances - that is on the TODO list. And there was recently a message from someone adding an SRU/SRW interface to it. One of our most common comments when we do surveys of our user community is "don't show me what you can't deliver NOW". A world-class indexer opens the door for scoping at the collection level; there doesn't have to be one solution for IR, and it would be a very unhealthy ecosystem without variance, but I suspect it would be easier to convince a company like Elsevier that I want a lucene index for licensed content than almost any other technology offering. So a definite yes to SRU, OpenURL, Z39.50, and the rest, but I wonder if sustaining a lucene index is a good idea regardless of what the main building blocks for a library's preferred IR layer turn out to be. Library standards don't tend to delve into the architecture of indexing anyway, but this is really where a lot of what can be delivered gets defined. We discussed this in Windsor, but for everyone else's benefit: I personally don't think sharing a Lucene index is the right granularity to work with. The specifics of the index format evolve with each new version of Lucene (with backwards compatibility in mind, for sure). The better granularity to consider is the interface to the index, like SRU or Solr's custom interface, etc. And the library world already has these standards in place that could easily be put on top of Lucene or Solr. Erik
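To put some code behind the relevancy point, here is a hedged sketch of the two knobs mentioned - boosts and tweaking Similarity - against Lucene 2.0-era APIs; the boost value and field names are arbitrary illustrations:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.search.DefaultSimilarity;

    public class RelevancyTuning {
        // Index-time boost: matches in the title weigh 3x.
        public static Document boostedDoc(String title, String body) {
            Document doc = new Document();
            Field titleField = new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED);
            titleField.setBoost(3.0f); // arbitrary weighting
            doc.add(titleField);
            doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
            return doc;
        }

        // Tweaked Similarity: ignore field length, so short and long
        // records score alike for the same term matches.
        public static class FlatLengthSimilarity extends DefaultSimilarity {
            public float lengthNorm(String fieldName, int numTerms) {
                return 1.0f;
            }
        }
    }

The custom Similarity would be installed on both the IndexWriter and the searcher via their setSimilarity() methods.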
Re: [CODE4LIB] code4lib lucene pre-conference
On Nov 28, 2006, at 5:44 PM, Kevin S. Clarke wrote: Is there a standard for specifying how textual analysis works as well, so that tokenization can be standardized across these XQuery engines as well? Not that I know of. What I've seen so far is that tokenization is implementation specific. Perhaps this is something that is configurable so that implementations can be set up and then queried consistently. Any indexing engine worth its salt should be configurable, I'd think. There is nothing I'm aware of in the fulltext work, though, that defines how things are indexed. If you leave out all the configurability in tokenization for indexing and querying from the XQuery standard, then there will surely be extensions needed for concrete implementations to allow this stuff to be specified. Interesting issue. For all you Java-savvy folks out there, how about standards like J2EE that make it easy to move an application from one vendor's app server to another? It works for the simplest of applications, but all vendors have their own specific custom deployment descriptors too. Erik
Re: [CODE4LIB] code4lib lucene pre-conference
On Nov 28, 2006, at 3:28 PM, Andrew Nagy wrote: The major problem with it all is the ugly mess that is marcxml This brings up an interesting point about just dropping our source XML data into an XML-savvy database and using XQuery on it. Maybe y'all have much cleaner data than I've seen, but my experience with the Rossetti Archive has had many XML data hurdles. When I came on board, Tamino was being used for the search engine, with XPath queries all over the place. The raw data is not consistent, and a single-word query expanded into an enormous XPath query to look at many elements and attributes, not to mention it was SLOW. Analyzing the user interface and the real-world searching needs, I wrote Java code that normalized the data for searching purposes into a much coarser-grained set of fields, indexed it into Lucene, and voila: http://www.rossettiarchive.org/rose The point is that even with super-fast full-text searching with XQuery, most of our archives are probably going to require hideous expressions to query them using their raw structure, especially if you have to account for data cleanup too (such as date formatting issues, which we also have in RA raw data). I realize I'm sounding anti-XQuery, which is sorta true, but only because in the real world in which I work it works better to have some custom digesting of the raw data than to just toss it in and work with standards. Indexing is lossy - it's about keying things the way they need to be looked up. If your data is clean, you're in better shape than me. And if XQuery on your raw data does what you need, by all means I recommend it. Erik
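The flattening itself is mundane; here is a hedged sketch of the idea in Java (the element names and the single catch-all field are assumptions for illustration, not the actual Rossetti code):

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.NodeList;

    public class CoarseFields {
        // Collapse many fine-grained elements into one searchable blob,
        // instead of interrogating each element with XPath at query time.
        public static String collectText(org.w3c.dom.Document xml) throws Exception {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Hypothetical element names: everything whose text should be
            // reachable by a plain keyword search.
            NodeList nodes = (NodeList) xpath.evaluate(
                "//title | //author | //note | //line", xml, XPathConstants.NODESET);
            StringBuilder blob = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                blob.append(nodes.item(i).getTextContent()).append(' ');
            }
            return blob.toString(); // index this as a single "text" field
        }
    }

One XPath pass at index time, and the hideous per-element logic never has to run again at query time.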
Re: [CODE4LIB] code4lib lucene pre-conference
On Nov 27, 2006, at 5:46 PM, Binkley, Peter wrote: You've got enough flexibility in the way you set up your Lucene index, and Lucene search results give you access to the term weights for each hit, It does? so you can tell which fields actually matched. You can? I'm curious how you're doing that! Especially with Solr in the picture. There would probably be a lot of optimizations you could do within Solr to help with this kind of thing. Art and I talked a little about this at the ILS symposium: why not nestle the XML db inside Solr alongside Lucene? Solr could then manage the indexing of the contents of the db, and augment your search results with data from the db: you could get full records as part of your search results without having to store them in the Lucene index. There have been discussions in the Solr community about adding hooks to allow Solr plugins to pull data from external sources to return with search results. I don't think Solr itself is the entry point to these external systems, as that seems to couple things a bit too much for my tastes, so I think you'd still want to manage the external data source separately from indexing into Solr, but having hooks for Solr to return hybrid results could be just the ticket here. Erik
Re: [CODE4LIB] code4lib lucene pre-conference
On Nov 27, 2006, at 6:12 PM, Binkley, Peter wrote: Fair point, and that's how my current solr-based project works. I'm thinking I would like the other advantages of an XML db: the ability to run xqueries, batch updates, etc., alongside the Lucene searching. And I want them integrated under the hood in Solr so that people smarter than me will maintain and optimize the connections. But I agree that this approach will have to prove that the extra overhead is worth it. The main concern I have is where the hood line is drawn here. I don't think Solr should become a central repository interface, but it should be easily integrated and perhaps be able to tie external resources into returned results. My ideal solution is a WebDAV interface with commit hooks into Solr. With Subversion behind the scenes of the WebDAV interface, or an XML database, or a file system, or whatever. :) Erik
Re: [CODE4LIB] code4lib lucene pre-conference
On Nov 27, 2006, at 5:04 PM, Jonathan Rochkind wrote: Bess Sadler wrote: application. That way you can use solr / lucene for search, faceted browse, etc, and your XML database only for known item retrieval, which it is generally able to do without performance issues. I'm hopping up and down waiting for someone to take this approach with an ILS, so please come and show us what you've got! Would this approach complicate highlighting of hits-in-context? One of the biggest things missing from most current OPACs, in my opinion, is google-style excerpting of WHAT part of the record matched the query--on the results page. Many mainstream OPACs do currently provide some form of highlighting on the detail/full-bib page, but it's not generally truly identifying _which_ parts of the record _actually_ matched your search (a search just on title will still highlight the word found in a non-title field), which I find annoying. Do these kinds of hybrid approaches complicate the task of providing proper result highlighting in context, or am I off in the wrong direction? Highlighting is tricky business all the way around. XTF seems to be the best solution I've seen for very detailed context highlighting. But I suspect ILS systems don't need that level of complexity, but rather could leverage Solr's highlighting capabilities by ensuring that the specific fields that need highlighting are stored. Solr's highlighter (which is Lucene's contrib Highlighter under the covers) does do field-specific highlighting, but it still is not perfect. For example, if you searched for title:"Blessed Damozel", it would highlight blessed and damozel anywhere in the title field even though the query is a phrase query where proximity matters. For a proprietary contract job, I have written code that converts a general Lucene Query into a SpanQuery and a highlighter that does precise highlighting. Beyond this code being proprietary, such that I cannot share it, it is also not general purpose and does full-field highlighting, not scoring fragments like the Lucene highlighter does. The approach of converting to a SpanQuery is a good way to go, though, and has been discussed a bit on the Lucene e-mail list. In short, I think the hybrid approach is still a good one, separating the search engine from the actual data repository, but highlighting requirements need to be considered up front. Basic and decent field-specific highlighting can be achieved with Solr, but it's got caveats. Erik
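For the basic-but-caveated route, here is a sketch against the contrib Highlighter of that era; the field name and analyzer choice are assumptions, and note that QueryScorer will happily mark each phrase term independently - exactly the imprecision described above:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    public class TitleHighlighter {
        // Highlight query terms within a *stored* title field value.
        public static String highlight(String storedTitle, String userQuery) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();
            Query query = new QueryParser("title", analyzer).parse(userQuery);

            Highlighter highlighter = new Highlighter(new QueryScorer(query));
            TokenStream tokens = analyzer.tokenStream("title", new StringReader(storedTitle));
            String fragment = highlighter.getBestFragment(tokens, storedTitle);
            return fragment != null ? fragment : storedTitle; // no match: return unmarked
        }
    }

By default the Highlighter wraps matches in <B> tags; a custom Formatter can change that markup.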
Re: [CODE4LIB] code4lib lucene pre-conference
On Nov 27, 2006, at 5:49 PM, Andrew Nagy wrote: My only concern about lucene is the lack of a standard query language. I went down the native XML database path because of XQuery and XSL; does something like lucene and solr offer a strong query language? Is it a standard? What if someone developed a kick-ass text indexer in 2 years that totally blows lucene out of the water -- would you easily be able to switch systems? What-if games are mostly just guessing games in the high-tech world. Agility is the trait our projects need. Software is just that... soft. And malleable. Sure, we can code ourselves into a corner, but generally we can code ourselves right back out of it too. If software is built with decent separation of concerns, we can adapt to changes readily. Specifically to your concern about a standard query language: I prefer to think of things from a user's perspective. A user does not type in XQuery syntax, so at some point the system has to translate a user-entered expression into something the underlying search engine understands. Lucene, and thus Solr, support the syntax already mentioned here (Google-like syntax), and there is also a contrib module to Lucene that XMLifies the Lucene query syntax (and then some), called the xml-query-parser: http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/xml-query-parser/ Eventually this will be integrated into Solr. So, as long as you can parse a user-entered query into some type of data structure, you can convert it to either of the Lucene-supported syntaxes (or directly to a Query object if you're coding in Java), or to whatever pie-in-the-sky system comes along in the future. I'm looking forward to more discussion on this topic, as it is one that I hear most often as a negative to Lucene around my neck of the woods, and sadly the standard what-if scenario is used to choose inferior search engine technologies. Erik
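To ground the agility argument in code, a sketch (Lucene 2.0-era API assumed): the user's raw string is interpreted in exactly one place, and everything downstream of parse() works with a neutral data structure that could later be re-targeted at some future engine:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class QueryBoundary {
        // The single seam where user-entered syntax meets the engine.
        public static Query parseUserQuery(String userInput) throws Exception {
            QueryParser parser = new QueryParser("text", new StandardAnalyzer());
            return parser.parse(userInput);
        }

        public static void main(String[] args) throws Exception {
            Query q = parseUserQuery("title:\"blessed damozel\" AND rossetti");
            // toString() renders the parsed, engine-internal view of the query.
            System.out.println(q.toString("text"));
        }
    }

Swap out the engine in two years, and this seam - not every caller - is what gets rewritten.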