Re: [CODE4LIB] indexing word documents using solr

2015-02-10 Thread Erik Hatcher
 On Feb 10, 2015, at 12:43, Eric Lease Morgan emor...@nd.edu wrote:
 
 On Feb 10, 2015, at 11:46 AM, Erik Hatcher erikhatc...@mac.com wrote:
 
 First, with Solr 5, it’s this easy:
 
  Where can I download Solr 5 because none of the other versions seem to be 
 complete. —ELM

It's not yet released but will be in a matter of days.   RC2 was generated last 
night here: 
http://people.apache.org/~anshum/staging_area/lucene-solr-5.0.0-RC2-rev1658469/solr/

Sorry for the tease on Solr 5, that's just where I've been living lately :)

Erik


Re: [CODE4LIB] Restrict solr index results based on client IP

2015-01-07 Thread Erik Hatcher
Post-processing results as in #1 has big disadvantages: you can’t easily 
“fill back in” for the docs that were removed, and they may already have been 
accounted for in facet counts, for example.

#2 would be my recommendation as well.

There is an open issue to create an IP(v6) field type in Solr, with a patch 
there for IPv4 already.
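
For illustration, a minimal SolrJ sketch of approach #2 (the core URL, the
allowed_ip field, and the catch-all value "any" are all made up for the example):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class IpRestrictedSearch {
  public static void main(String[] args) throws Exception {
    // Hypothetical core URL and field/value names.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    String clientIp = "128.6.0.1";  // taken from the incoming request in a real app

    SolrQuery q = new SolrQuery("maps");
    // Approach #2: restrict at query time with a filter query.  Documents open
    // to everyone carry allowed_ip:any; restricted ones list the permissible
    // addresses in the same multi-valued field.
    q.addFilterQuery("allowed_ip:any OR allowed_ip:\"" + clientIp + "\"");

    QueryResponse rsp = solr.query(q);
    System.out.println("hits: " + rsp.getResults().getNumFound());
  }
}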

Erik



 On Jan 7, 2015, at 11:41 AM, Chad Mills cmmi...@rci.rutgers.edu wrote:
 
 Hello,
 
 Basically I have a solr index where, at times, some of the results from a 
 query will be limited to a set of users based on their client's IP 
 address.  I have been thinking about accomplishing this in one of two ways.
 
 1) Post-processing the results for IP validity against an external data 
 source and dropping out those results which are not valid.  That could leave 
 me with a partial result list that would need another query to fill back 
 in.  Say I want 10 results and end up dropping 2 of them; I need to fill 
 those 2 back in by performing another query.
 
 2) Making the IP permission check part of the query.  Basically appending an 
 AND in the query on a field that stores the permissible IP addresses.  The 
 index field would be set to allow all IPs to access the result by default, 
 but at times can contain the allowable IP addresses or maybe even ranges 
 somehow.
 
 Are there some other ways to accomplish this I haven't considered?  Right now 
 #2 seems more desirable to me.
 
 Thanks in advance for your thoughts!
 
 --
 Chad Mills
 Digital Library Architect
 Ph: 848.932.5924
 Fax: 848.932.1386
 Cell: 732.309.8538
 
 Rutgers University Libraries
 Scholarly Communication Center
 Room 409D, Alexander Library
 169 College Avenue, New Brunswick, NJ 08901
 
 https://rucore.libraries.rutgers.edu/


Re: [CODE4LIB] Restrict solr index results based on client IP

2015-01-07 Thread Erik Hatcher
I meant to include this link in my first reply, sorry: 
https://issues.apache.org/jira/browse/SOLR-6741 


 On Jan 7, 2015, at 11:53 AM, Erik Hatcher erikhatc...@mac.com wrote:
 
 Post-processing results as in #1 has big disadvantages: you can’t easily 
 “fill back in” for the docs that were removed, and they may already have been 
 accounted for in facet counts, for example.
 
 #2 would be my recommendation as well.
 
 There is an open issue to create an IP(v6) field type in Solr, with a patch 
 there for IPv4 already.
 
   Erik
 
 
 
 On Jan 7, 2015, at 11:41 AM, Chad Mills cmmi...@rci.rutgers.edu wrote:
 
 Hello,
 
 Basically I have a solr index where, at times, some of the results from a 
 query will be limited to a set of users based on their client's IP 
 address.  I have been thinking about accomplishing this in one of two ways.
 
 1) Post-processing the results for IP validity against an external data 
 source and dropping out those results which are not valid.  That could leave 
 me with a partial result list that would need another query to fill back 
 in.  Say I want 10 results and end up dropping 2 of them; I need to fill 
 those 2 back in by performing another query.
 
 2) Making the IP permission check part of the query.  Basically appending an 
 AND in the query on a field that stores the permissible IP addresses.  The 
 index field would be set to allow all IPs to access the result by default, 
 but at times can contain the allowable IP addresses or maybe even ranges 
 somehow.
 
 Are there some other ways to accomplish this I haven't considered?  Right 
 now #2 seems more desirable to me.
 
 Thanks in advance for your thoughts!
 
 --
 Chad Mills
 Digital Library Architect
 Ph: 848.932.5924
 Fax: 848.932.1386
 Cell: 732.309.8538
 
 Rutgers University Libraries
 Scholarly Communication Center
 Room 409D, Alexander Library
 169 College Avenue, New Brunswick, NJ 08901
 
 https://rucore.libraries.rutgers.edu/
 


Re: [CODE4LIB] MARC reporting engine

2014-11-03 Thread Erik Hatcher
I’m surprised you didn’t recommend going straight to Solr and doing the 
reporting from there :)   Index into Solr using your MARC library of choice 
(e.g. solrmarc) and then get all authorities using facet.field=authorities (or 
whatever field name is used).
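
A rough SolrJ sketch of that facet approach (the core URL and the authorities
field name are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AuthorityCounts {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/marc");

    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);                    // only the facet counts are needed
    q.setFacet(true);
    q.addFacetField("authorities");  // or whatever field solrmarc put them in
    q.setFacetLimit(-1);             // all values
    q.setFacetMinCount(1);

    QueryResponse rsp = solr.query(q);
    FacetField authorities = rsp.getFacetField("authorities");
    for (FacetField.Count c : authorities.getValues()) {
      // authority heading and reference count, tab-separated
      System.out.println(c.getName() + "\t" + c.getCount());
    }
  }
}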

Erik



On Nov 2, 2014, at 7:24 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 If you are, can become, or know, a programmer, that would be relatively 
 straightforward in any programming language using the open source MARC 
 processing library for that language. (ruby marc, pymarc, perl marc, 
 whatever).  
 
 Although you might find more trouble than you expect around authorities, with 
 them being less standardized in your corpus than you might like. 
 
 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Stuart 
 Yeates [stuart.yea...@vuw.ac.nz]
 Sent: Sunday, November 02, 2014 5:48 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] MARC reporting engine
 
 I have ~800,000 MARC records from an indexing service 
 (http://natlib.govt.nz/about-us/open-data/innz-metadata CC-BY). I am trying 
 to generate:
 
 (a) a list of person authorities (and sundry metadata), sorted by how many 
 times they're referenced, in wikimedia syntax
 
 (b) a view of a person authority, with all the records by which they're 
 referenced, processed into a wikipedia stub biography
 
 I have established that this is too much data to process in XSLT or 
 multi-line regexps in vi. What other MARC engines are there out there?
 
 The two options I'm aware of are learning multi-line processing in sed or 
 learning enough koha to write reports in whatever their reporting engine is.
 
 Any advice?
 
 cheers
 stuart
 --
 I have a new phone number: 04 463 5692


Re: [CODE4LIB] solr computation field norm problem

2013-09-26 Thread Erik Hatcher
Nicolas -

Lucene 4 still encodes norms, as described here:

http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/similarities/DefaultSimilarity.html#encodeNormValue%28float%29

using this function:

http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/util/SmallFloat.html#floatToByte315%28float%29

You might want to give SweetSpotSimilarity a try: 
http://lucene.apache.org/core/4_4_0/misc/org/apache/lucene/misc/SweetSpotSimilarity.html
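
A tiny sketch of that precision loss using the same SmallFloat routine (the term
counts mirror the two titles in question):

import org.apache.lucene.util.SmallFloat;

public class NormPrecision {
  public static void main(String[] args) {
    // DefaultSimilarity's lengthNorm is 1/sqrt(numTerms); the result is then
    // squeezed into a single byte with a 3-bit mantissa.
    for (int numTerms : new int[] {3, 4}) {
      float norm = (float) (1.0 / Math.sqrt(numTerms));
      byte encoded = SmallFloat.floatToByte315(norm);
      float decoded = SmallFloat.byte315ToFloat(encoded);
      System.out.println(numTerms + " terms: raw=" + norm + " decoded=" + decoded);
    }
    // Both the 3-term and 4-term titles decode to 0.5, which is why the two
    // "Journal of neurology..." docs end up with the same fieldNorm.
  }
}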

Erik


On Sep 26, 2013, at 8:02 AM, Nicolas Franck nicolas.fra...@ugent.be wrote:

 I've been testing with Solr 4 (Lucene 4), which uses the new DefaultSimilarity 
 class.
 It does not use the encodeNorm and decodeNorm methods anymore, which
 caused all the trouble (storing the floats as a single byte). But it doesn't 
 change anything?
 The field norms remain the same?
 
 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Chris 
 Fitzpatrick [chrisfitz...@gmail.com]
 Sent: Wednesday, September 25, 2013 7:57 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] solr computation field norm problem
 
 Yeah...I think you're running into this:
 
 http://lucene.472066.n3.nabble.com/field-length-normalization-tp495308p495311.html
 
 TL;DR:
 Jay Hill says fields with 3 terms and 4 terms both score at .5 in the
 lengthNorm.
 
 
 
 
 
 
 
 On Wed, Sep 25, 2013 at 4:21 PM, Nicolas Franck 
 nicolas.fra...@ugent.bewrote:
 
 Hi there,
 
 I have a question about the way Lucene computes the length norm (field
 norm) for its documents.
 My documents are indexed using Solr.
 These are the documents that were indexed (ignore 'score', which is not
 part of the document itself):
 
 <doc>
   <float name="score">1.00711</float>
   <str name="_id">ejn01:25675596</str>
   <str name="title">Journal of neurology research</str>
 </doc>
 <doc>
   <float name="score">1.00711</float>
   <str name="_id">ejn01:954925518616</str>
   <str name="title">Journal of neurology</str>
 </doc>
 
 
 The field title has the following definition in schema.xml:
 
 <fieldType name="utf8text" class="solr.TextField"
            positionIncrementGap="100" omitNorms="false">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="1024"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
             format="solr" ignoreCase="false" expand="true"
             tokenizerFactory="solr.WhitespaceTokenizerFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="1024"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
             format="solr" ignoreCase="false" expand="true"
             tokenizerFactory="solr.WhitespaceTokenizerFactory"/>
   </analyzer>
 </fieldType>
 
 
 If I use the query "journal of neurology", both documents have the same
 score, although the second document is more exact. Supplying a phrase query
 does not fix the issue. I also see that the computed fieldNorm is 0.5 for
 both documents. Does this have something to do with the loss of precision
 when storing the length norm into one byte?
 
 These are all the supplied parameters (defaults in solrconfig.xml):
 
 <str name="lowercaseOperators">false</str>
 <str name="mm">-10%</str>
 <str name="pf">author^3 title^2</str>
 <str name="sort">score desc</str>
 <arr name="bq">
   <str>source:ser01^10</str>
   <str>source:ejn01^10</str>
   <str>(*:* -type:article)^999</str>
 </arr>
 <str name="echoParams">all</str>
 <str name="df">all</str>
 <str name="tie">0</str>
 <str name="qf">
   author^15 title^10 subject^1 summary^1 library^1 location^1 publisher^1
   place_published^1 issn^1 isbn^1
 </str>
 <str name="q.alt">*:*</str>
 <str name="ps">2</str>
 <str name="defType">edismax</str>
 <str name="q">journal of neurology</str>
 <str name="echoParams">all</str>
 <str name="sort">score desc</str>
 
 Looking the computation of the score, I see no single difference between
 them (see down below)
 Any idea why the fieldNorm is the same for both documents?
 
 
 Thanks in advance!
 
 Greetings,
 
 Nicolas
 
 
 
 
 <str name="ejn01:25675596">
 1.0071099 = (MATCH) sum of:
  0.0053001107 = (MATCH) sum of:
0.0017667036 = (MATCH) max of:
  0.0017667036 = (MATCH) weight(title:journal^10.0 in 0), product of:
0.005943145 = queryWeight(title:journal^10.0), product of:
  10.0 = boost
  0.5945349 = idf(docFreq=2, maxDocs=2)
  9.996294E-4 = queryNorm
0.29726744 = (MATCH) fieldWeight(title:journal in 0), product of:
  1.0 = tf(termFreq(title:journal)=1)
  0.5945349 = idf(docFreq=2, maxDocs=2)
  0.5 = fieldNorm(field=title, doc=0)
0.0017667036 = (MATCH) max of:
  0.0017667036 = (MATCH) weight(title:of^10.0 in 0), product of:
0.005943145 = queryWeight(title:of^10.0), product of:
  10.0 = boost
  0.5945349 = idf(docFreq=2, maxDocs=2)
  9.996294E-4 = queryNorm
0.29726744 = (MATCH) fieldWeight(title:of in 0), product of:
 

[CODE4LIB]

2012-11-28 Thread Erik Hatcher
We can have the Solr session when and wherever! :)   Organizers - feel free to 
move it however it fits best.

Related: With all of those pre-conferences, it looks like there'll need to be 6 
rooms but the page says 4 (admittedly 4+ it says)

Erik

On Nov 28, 2012, at 16:23 , Bess Sadler wrote:

 On Nov 28, 2012, at 1:04 PM, Shaun Ellis sha...@princeton.edu wrote:
 
 In that respect, I would suggest the preconference hackfests/workshops that 
 involve some kind of pair programming with experienced/inexperienced 
 hackers, which could follow up into a mentor relationship outside of the 
 conference.  I do like the idea of mentor/mentee speed-dating to align 
 interests, but in this sense, the workshop/hackfest you sign up for kind of 
 does that for you (assuming all the preconference proposals[1] are actually 
 going to happen).
 
 [1] http://wiki.code4lib.org/index.php/2013_preconference_proposals
 
 -Shaun
 
 My understanding is that all of the pre-conference proposals are going to 
 happen (note to self: ask Erik Hatcher whether the evening solr session could 
 happen at a bar somewhere). The RailsBridge workshop in particular is aimed 
 at folks who are new to Rails and perhaps new to programming in general, and 
 RailsBridge as a thing was started as a way to bring more women into tech. If 
 anyone is interested in helping out at the RailsBridge session, or at the 
 Blacklight-tailored-for-RailsBridge session in the afternoon, please join us! 
 Workshops like this can never have too many people walking the room to help 
 out, and if we had enough experienced folks, this would be a great 
 opportunity for pair programming and meeting potential mentors. 
 
 Bess


Re: [CODE4LIB] extracting tiff info

2012-11-20 Thread Erik Hatcher
There's Tika http://tika.apache.org/, which has command-line capabilities.  I 
just launched the UI app, dropped a TIFF on it, and got this output:

Bits Per Sample: 8 8 8 8 bits/component/pixel
Compression: LZW
Content-Length: 262844
Content-Type: image/tiff
Orientation: Top, left side (Horizontal / normal)
Photometric Interpretation: RGB
Planar Configuration: Chunky (contiguous for each subsampling pixel)
Predictor: 2
Rows Per Strip: 30 rows/strip
Samples Per Pixel: 4 samples/pixel
Strip Byte Counts: 20668 7759 13240 15631 14302 17278 11236 14414 6226 5401 
7310 4813 12716 5368 4213 3357 5664 6081 8466 12266 8083 8541 14306 7245 11916 
9443 4636 705 705 417 bytes
Strip Offsets: 8 20676 28435 41675 57306 71608 6 100122 114536 120762 
126163 133473 138286 151002 156370 160583 163940 169604 175685 184151 196417 
204500 213041 227347 234592 246508 255951 260587 261292 261997
Thumbnail Image Height: 881 pixels
Thumbnail Image Width: 1081 pixels
Unknown tag (0x0152): 1
Unknown tag (0x0153): 1 1 1 1
resourceName: tika-view.tiff
tiff:BitsPerSample: 8
tiff:ImageLength: 881
tiff:ImageWidth: 1081
tiff:Orientation: 1
tiff:SamplesPerPixel: 4
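
For anyone who'd rather script it than use the UI, a rough Java sketch with
Tika's API (the file name is just an example; output is tab-separated so it can
be pasted into a spreadsheet):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TiffMetadata {
  public static void main(String[] args) throws Exception {
    Metadata metadata = new Metadata();
    InputStream in = new FileInputStream("scan-0001.tiff");  // example path
    try {
      // AutoDetectParser recognizes the TIFF and delegates to the image parser.
      new AutoDetectParser().parse(in, new BodyContentHandler(), metadata, new ParseContext());
    } finally {
      in.close();
    }
    for (String name : metadata.names()) {
      // name<TAB>value, one metadata field per line
      System.out.println(name + "\t" + metadata.get(name));
    }
  }
}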

Erik

On Nov 19, 2012, at 14:31 , Kyle Banerjee wrote:

 Howdy all,
 
 I need to extract all the metadata from a few thousand images on a network
 drive and put it into a spreadsheet. Since the files are huge (each is
 100MB+) and my connection isn't that fast, I strongly prefer to not move
 them before working on them -- i.e. I'm using cygwin and/or windows.
 
 Just eyeballing these things, I see the headers contain everything I need
 in purty rdf. What's the best way to extract this? I thought tiffinfo would
 do the trick, but it's just giving me technical info. Of course I can just
 parse the files with perl but I'm thinking there just has to be a slicker
 way to do this. What's my best option? Thanks,
 
 kyle


Re: [CODE4LIB] New Newcomer Dinner option

2012-02-04 Thread Erik Hatcher
Looks like some MARC records I've seen.  

On Feb 4, 2012, at 16:19, Cary Gordon listu...@chillco.com wrote:

 Probably their cat… They need this: http://www.bitboost.com/pawsense/
 
 On Sat, Feb 4, 2012 at 12:49 PM, Eric Lease Morgan emor...@nd.edu wrote:
 LlkjyYYYYyetyeyppf
 Prpfc
 EXpdpppePeppp
 Pp
 P$
 $p
 
 Pp$epepp
 $ppeppPP
 PRpp
 PepplpereprpeprrprPRPeeopwprprPprppertrretrtrrterrtwrtrtww
 TrWtwteteetrteeetetttetrteyertEtrrtEgrerrtetteyeyeeytwtyeyeyeeyeeeyeey
 eryeeyeyyyeryyyeyeyeyeyeyyyeyyyeeyreyytrtrttrrtrregtrgghgg
 gdhfgdhfrtgrhdrghdghdhdggdffdfffvbXVcyvvfvfvffvffvvvfvffvvffffffvf
 ffxBbbCnvNVqfddZuytuyrutyguhUOyy
 
 
 ROTFL!!!
 
 I'm not sure, but I think somebody's b^tt has sent a message to the mailing 
 list. Does anybody here speak b^tt?
 
 --
 ELM
 
 
 
 -- 
 Cary Gordon
 The Cherry Hill Company
 http://chillco.com


Re: [CODE4LIB] Another Sharpie Opportunity

2012-02-03 Thread Erik Hatcher
 canadian_snacks++

unless you mean poutine ;)

but if you're talking Dangerous Dan's Diner, +1: 
http://www.dangerousdansdiner.com/


Re: [CODE4LIB] code4lib conference '12 - Solr pre-conference

2012-02-02 Thread Erik Hatcher
first come first serve - Since I'm not going to be making it to Seattle, I will 
gladly donate my conference slot to whoever 1) can make it and 2) e-mails me 
first @ erik.hatc...@lucidimagination.com


On Feb 1, 2012, at 19:02 , Erik Hatcher wrote:

 Regretfully I must cancel my trip to Seattle, a bummer on several levels as I 
 always love code4lib conferences, the people, the topics, and was also 
 looking forward to enjoying downtown Seattle a bit too.  Last minute urgent 
 business duties call, alas.  I have alerted the code4libcon e-mail list as 
 well.
 
 This means I won't be at the What's New in Solr pre-conference event that I 
 was going to lead.  However, I will make myself available to call/Skype/IRC 
 in and do a bit of facilitation and contribute what I can to the 
 get-together.  I think it will be a useful/productive time slot for folks to 
 discuss Solr experiences, challenges, and future needs, so please don't worry 
 about me not being physically there, and take the opportunity to make it an 
 interactive session where everyone introduces themselves and their projects 
 and delves into the gory details of Solr experiences.
 
 I'm open to suggestions on how I can best participate remotely and contribute 
 as best I can.
 
 Thanks,
   Erik
 


Re: [CODE4LIB] my conference slot?

2012-02-02 Thread Erik Hatcher
Don't sweat it Elizabeth... this is the case of the sharpie marker.  If someone 
takes my slot, just pretend they're me as far as everything on your side goes 
and they sharpie their name on a badge.

But no one has responded to me anyway.  

I know it's rough running an event (my company runs two major conferences a 
year for the past couple of years).  But it's silly that someone can't fill my 
seat if I hand it over to them.  I was told yesterday that it was up to me to 
solicit someone to fill that seat, so I solicited.  Again however, no one has 
responded yet.

Erik


On Feb 2, 2012, at 11:10 , Elizabeth Duell wrote:

 NO. We are NOT adding anyone else to the participants list. The registration 
 has been closed and will remain so.
 
 It is 2 working days before the start of the convention. No changes to the 
 participants list are going to happen.
 
 Again, we are NOT ADDING ANYONE ELSE TO THE PARTICIPANTS LIST.
 
 Elizabeth
 
 
 Elizabeth Duell
 Orbis Cascade Alliance
 edu...@uoregon.edu
 (541) 346-1883


[CODE4LIB] code4lib conference '12 - Solr pre-conference

2012-02-01 Thread Erik Hatcher
Regretfully I must cancel my trip to Seattle, a bummer on several levels as I 
always love code4lib conferences, the people, the topics, and was also looking 
forward to enjoying downtown Seattle a bit too.  Last minute urgent business 
duties call, alas.  I have alerted the code4libcon e-mail list as well.

This means I won't be at the What's New in Solr pre-conference event that I 
was going to lead.  However, I will make myself available to call/Skype/IRC in 
and do a bit of facilitation and contribute what I can to the get-together.  I 
think it will be a useful/productive time slot for folks to discuss Solr 
experiences, challenges, and future needs, so please don't worry about me not 
being physically there, and take the opportunity to make it an interactive 
session where everyone introduces themselves and their projects and delves into 
the gory details of Solr experiences.

I'm open to suggestions on how I can best participate remotely and contribute 
as best I can.

Thanks,
Erik


Re: [CODE4LIB] jQuery Ajax request to update a PHP variable

2011-12-06 Thread Erik Hatcher
I'm with jrock on this one.   But maybe I'm a luddite that didn't get the memo 
either (but I am credited for being one of the instrumental folks in the Ajax 
world, heh - in one or more of the Ajax books out there, us old timers called 
it remote scripting).

What I hate hate hate about seeing JSON being returned from a server for the 
browser to generate the view is stuff like:

   string = "<div>" + some_data_from_JSON + "</div>";

That embodies everything that is wrong about Ajax + JSON.

As Jonathan said, the server is already generating dynamic HTML... why have it 
return JSON and move processing/templating to the client for some things but 
not other things?  Rhetorical question... of course it depends on the 
application.  If everything is entirely client-side generated, then sure.  But 
for traditional webapps, JSON to the client to simply piece it together as 
HTML is hideous.

I spoke to this a bit at my recent ApacheCon talk, slides are here: 
http://www.slideshare.net/erikhatcher/solr-flair-10173707 slides 4 and 8 
particularly on this topic.

So in short, opinions differ on the right way to do Ajax obviously.  It 
depends, no question, on the bigger picture and architectural pieces in play, 
but there is absolutely nothing wrong with having HTML returned from the 
server for partial pieces of the page.  And in many cases it's the cleanest way 
to do it anyway.

Erik



On Dec 5, 2011, at 18:45 , Jonathan Rochkind wrote:

 I still like sending HTML back from my server. I guess I never got the 
 message that that was out of style, heh.
 
 My server application already has logic for creating HTML from templates, and 
 quite possibly already creates this exact same piece of HTML in some other 
 place, possibly for use with non-AJAX fallbacks, or some other context where 
 that snippet of HTML needs to be rendered. I prefer to re-use this logic 
 that's already on the server, rather than have a duplicate HTML 
 generating/templating system in the javascript too.  It's working fine for 
 me, in my use patterns.
 
 Now, certainly, if you could eliminate any PHP generation of HTML at all, as 
 I think Godmar is suggesting, and basically have a pure Javascript app -- 
 that would be another approach that avoids duplication of HTML generating 
 logic in both JS and PHP. That sounds fine too. But I'm still writing apps 
 that degrade if you have no JS (including for web spiders that have no JS, 
 for instance), and have nice REST-ish URLs, etc.   If that's not a 
 requirement and you can go all JS, then sure.  But I wouldn't say that making 
 apps that use progressive enhancement with regard to JS and degrade fine if 
 you don't have it is out of style; or if it is, it ought not to be!
 
 Jonathan
 
 On 12/5/2011 6:31 PM, Godmar Back wrote:
 FWIW, I would not send HTML back to the client in an AJAX request - that
 style of AJAX fell out of favor years ago.
 
 Send back JSON instead and keep the view logic client-side. Consider using
 a library such as knockout.js. Instead of your current (difficult to
 maintain) mix of PhP and client-side JavaScript, you'll end up with a
 static HTML page, a couple of clean JSON services (for checked-out per
 subject, and one for the syndetics ids of the first 4 covers), and clean
 HTML templates.
 
 You had earlier asked the question whether to do things client or server
 side - well in this example, the correct answer is to do it client-side.
 (Yours is a read-only application, where none of the advantages of
 server-side processing applies.)
 
  - Godmar
 
 On Mon, Dec 5, 2011 at 6:18 PM, Nate Hillnathanielh...@gmail.com  wrote:
 
 Something quite like that, my friend!
 Cheers
 N
 
 On Mon, Dec 5, 2011 at 3:10 PM, Walker, Daviddwal...@calstate.edu
 wrote:
 
 I gotcha.  More information is, indeed, better. ;-)
 
 So, on the PHP side, you just need to grab the term from the  query
 string, like this:
 
  $searchterm = $_GET['query'];
 
 And then in your JavaScript code, you'll send an AJAX request, like:
 
  http://www.natehill.net/vizstuff/catscrape.php?query=Cooking
 
 Is that what you're looking for?
 
 --Dave
 
 -
 David Walker
 Library Web Services Manager
 California State University
 
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Nate Hill
 Sent: Monday, December 05, 2011 3:00 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] jQuery Ajax request to update a PHP variable
 
 As always, I provided too little information.  Dave, it's much more
 involved than that
 
 I'm trying to make a kind of visual browser of popular materials from one
 of our branches from a .csv file.
 
 In order to display book covers for a series of searches by keyword, I
 query the catalog, scrape out only the syndetics images, and then
 display 4
 of them.  The problem is that I've hardcoded in a search for 'Drawing',
 rather than dynamically pulling the correct term and putting it into the
 catalog 

Re: [CODE4LIB] stemming in author search?

2011-06-14 Thread Erik Hatcher
On Jun 14, 2011, at 08:10 , Keith Jenkins wrote:

 Does Solr support Soundex?  (Soundex was originally developed to
 assist with alternate spellings of names)

Indeed.  And several other phonetic algorithms:

   
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory

Erik


Re: [CODE4LIB] stemming in author search?

2011-06-14 Thread Erik Hatcher
It's documented in that wiki page link below as true/false -- true will add 
tokens to the stream, false will replace the existing token.

So if you index cat and the phonetic filter turns it into KT, it can 
either index both cat and KT or just KT.
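
To see what those codes look like, here's a little sketch using Apache Commons
Codec (the encoders the Solr filter wraps); the sample terms are arbitrary:

import org.apache.commons.codec.language.DoubleMetaphone;
import org.apache.commons.codec.language.Soundex;

public class PhoneticDemo {
  public static void main(String[] args) {
    DoubleMetaphone dm = new DoubleMetaphone();
    Soundex soundex = new Soundex();
    for (String term : new String[] {"smith", "smyth", "cat"}) {
      System.out.println(term
          + "  doubleMetaphone=" + dm.doubleMetaphone(term)
          + "  soundex=" + soundex.encode(term));
    }
    // smith and smyth collapse to the same codes (SM0 / S530), which is why a
    // phonetic field matches alternate spellings of names.  With inject=true
    // the original token is kept alongside the code; with inject=false only
    // the code is indexed.
  }
}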

Erik


On Jun 14, 2011, at 10:45 , Jonathan Rochkind wrote:

 Hey Erik, in that wiki documentation the example it gives is:
 
 <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" 
 inject="true"/>
 
 
 Do you know what that 'inject' argument is about, and where (if anywhere) I'd 
 find it (and other available arguments for PhoneticFilterFactory, which may 
 or may not differ depending on encoder chosen?) documented?
 
 On 6/14/2011 8:31 AM, Erik Hatcher wrote:
 On Jun 14, 2011, at 08:10 , Keith Jenkins wrote:
 
 Does Solr support Soundex?  (Soundex was originally developed to
 assist with alternate spellings of names)
 Indeed.  And several other phonetic algorithms:
 

 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory
 
  Erik
 


[CODE4LIB] japanese (Solr) analysis

2011-04-04 Thread Erik Hatcher
I'm trying to cull together the best practices for indexing/searching Japanese 
text.

For those of you using Solr, what analyzer/field-type definition do you have 
for Japanese?

Thanks for sharing!
Erik


Re: [CODE4LIB] A suggested role for text mining in library catalogs?

2011-02-22 Thread Erik Hatcher
Solr _can_ use stemming, but to do it with POS would be flakey, I'd think.  Is 
"work" a verb or a noun?

Some of the (Solr-using) customers that I work with have done POS tagging 
(using tools like BasisTech Solr plugins for entity tagging).  Payloads can be 
assigned to terms during indexing and then used to weight the score when query 
terms match.  Lucene supports payloads and scoring based on them natively, but 
it requires some code to wire together.  Solr supports a little in terms of 
payloads, but to really use them effectively custom coding is needed.  See 
https://issues.apache.org/jira/browse/SOLR-1485 for example.
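
A rough sketch of the payload pieces mentioned above, against the Lucene
3.x-era API (the field name and weight are invented; the index-time analysis
and the custom Similarity wiring are only described in comments):

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;

public class PayloadSketch {
  public static void main(String[] args) {
    // Index side (wiring not shown): a token filter -- DelimitedPayloadTokenFilter,
    // or a custom filter fed by a POS tagger -- attaches a per-term weight as a
    // payload, encoded as four bytes:
    byte[] nounWeight = PayloadHelper.encodeFloat(3.0f);
    System.out.println("payload bytes for 3.0f: " + nounWeight.length);

    // Query side: a payload-aware term query that folds the stored payloads into
    // the score (averaged across occurrences).  The Similarity's scorePayload()
    // must be overridden to decode the bytes (PayloadHelper.decodeFloat) before
    // the weight actually affects ranking.
    PayloadTermQuery q = new PayloadTermQuery(
        new Term("title", "correct"), new AveragePayloadFunction());
    System.out.println(q);
  }
}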

Erik

On Feb 22, 2011, at 09:02 , Cindy Harper wrote:

 It's not ironic - my post was musing inspired by your work.  I guess I
 wasn't sure if I understood your results. You were looking at the overall
 POS usage in the entire texts as a possible way of ranking the texts. I was
 wondering about POS of particular search terms - those that could take on
 several POS. A related question - does SOLR use stemming to widen the search
 to various POS?  Then would it be meaningful to rank the given texts by the
 POS of the actual search terms?  And has anyone looked at samples of user
 search terms - are they almost always noun phrases?  Just wanting to
 understand what you have explored.  And I probably should have added to your
 thread on NGC4LIB, rather than Code4lib - I tend to conflate them.
 
 Cindy Harper, Systems Librarian
 Colgate University Libraries
 char...@colgate.edu
 315-228-7363
 
 
 
 On Sat, Feb 19, 2011 at 5:42 PM, Eric Lease Morgan emor...@nd.edu wrote:
 
 On Feb 19, 2011, at 11:26 AM, Cindy Harper wrote:
 
 I just was testing our discovery engine for any technical issues after a
 reboot. I was just using random single words, and one word I used was
 "correct".  Looking at the first ranked items, I wondered if there's some
 role for parts-of-speech in ranking hits - are nouns and, in this case,
 adjectives more indicative of aboutness than verbs?  The first items were
 Miss Manners ... excruciatingly correct behavior, then a bunch of govdocs
 on an act to correct.  I don't think there's any reason to prefer
 nouns over verbs, but I thought I'd throw the thought at you anyway.
 
 
 
 Ironically, I was playing with parts-of-speech (POS) analysis the other
 day. [1]
 
 Using a pseudo-random sample of texts, I found there to be surprisingly
 similar POS usage between texts. With such similarity, I thought it would be
 difficult to use general POS as a means for ranking or sorting. On the other
 hand, specific POS may be useful. For example, Thoreau was dominated by
 first-person male pronouns but Austen was dominated by second person female
 pronouns.
 
 I think there is something to be explored here.
 
 [1] POS - http://bit.ly/hsxD2i
 
 --
 Eric Still Counting Tweets and Chats Morgan
 


[CODE4LIB] [code4libcon] QA fodder for the What's New in Solr 1.4 preconference??

2011-01-27 Thread Erik Hatcher
Just like I did last year, I'm requesting folks send me (on or off-list, as 
appropriate) issues/questions regarding Solr that I can factor into the session 
on Feb. 7 in Bloomington.  Suggestions on specifics you'd like covered will be 
eagerly accepted and factored in too.

Last year I had a ton of great questions.  And it's not that I can necessarily 
provide concrete answers for them (will do my best, of course), but that we 
toss them out there to the lib+Solr intersected community and see if we can 
collaboratively address them.  Here's a link to last year's slides; see the last 
part of the slide deck for the community-contributed questions that we 
discussed live: 
http://www.slideshare.net/erikhatcher/solr-black-belt-preconference

Looking forward to your feedback!

Thanks,
Erik


Re: [CODE4LIB] javascript testing?

2011-01-11 Thread Erik Hatcher
Here at Lucid we've got some Jasmine going on for LWE JS testing. 

Erik

On Jan 11, 2011, at 21:25, Gabriel Farrell gsf...@gmail.com wrote:

 I like QUnit because it's minimal and I'm used to unit testing. A lot
 of people are jumping on Jasmine, though. It might be more your style
 if you're into BDD.
 
 On Tue, Jan 11, 2011 at 7:21 PM, Bess Sadler bess.sad...@gmail.com wrote:
 Can anyone recommend a javascript testing framework? At Stanford, we know we 
 need to test the js portions of our applications, but we haven't settled on 
 a tool for that yet. I've heard good things about celerity 
 (http://celerity.rubyforge.org/) but I believe it only works with jruby, 
 which has been a barrier to getting started with it so far. Anyone have 
 other tools to suggest? Is anyone doing javascript testing in a way they 
 like? Feel like sharing?
 
 Thanks!
 
 Bess
 


[CODE4LIB] [JOB] Directory, Online Library Environment, University of Virginia

2010-07-19 Thread Erik Hatcher
I'm passing this on from contacts at UVa; please use the contact info  
below to follow up.


==

DIRECTOR, ONLINE LIBRARY ENVIRONMENT
University of Virginia Library


The University of Virginia Library seeks a strong technical leader for  
the position of Director of our “online library environment,” a  
comprehensive suite of tools and services to provide access to the  
Library’s physical and digital collections.  We seek candidates who  
can successfully architect and implement solutions providing faculty  
and students a cohesive, innovative environment for accessing  
information used in research, teaching, and learning.


Environment:  The University of Virginia Library (http://www.lib.virginia.edu 
) is a leader in innovative customer service, an international leader  
in digital library research and digital scholarship, and is recognized  
for the strength and variety of its collections.  The Library system  
consists of twelve libraries, with independent libraries for health  
sciences, law, and business. The libraries support 12,000  
undergraduates, 6,000 graduate students and 1,600 teaching faculty.  
The University and the Library have a strong commitment to achieving  
diversity among faculty and staff. The Neoclassical buildings of  
founder Thomas Jefferson's Academical Village still serve as the  
center of the University's Grounds (http://www.virginia.edu/uvatours/slideshow/ 
) and as a unique backdrop for teaching, learning, and research.


Responsibilities:  The Director of the online library environment is  
responsible for leading the development and implementation of emerging  
information technologies as well as managing daily operations for the  
Library’s access and delivery applications. The Director will head a  
newly formed department of software developers and librarians in  
carrying out this activity. She or he will have oversight of all  
aspects of the Library’s Integrated System (ILS – Sirsi/Dynix Unicorn)  
and will lead development of an information architecture that provides  
cohesive access and delivery. She or he will assess, architect, and  
implement new ways to provide content and workflow services  
traditionally provided by an ILS and develop gateways to other  
information resources such as the Library’s electronic resources and  
institutional repositories. The Director will:
provide leadership and vision that ensures easy, reliable online  
access to a wide array of collections, information, and services in  
support of research, teaching and learning;
provide technical leadership in the design and implementation of all  
aspects of the software and infrastructure for ongoing development  
projects.  Provide technical guidance to developers and systems  
administrators on project requirements as needed.
manage the daily operations environment for the Library’s access and  
delivery applications;  design and implement technical enhancements to  
the Library’s ILS infrastructure to meet current and future needs.

supervise the daily work of both faculty and classified staff positions;
collaborate with and provide technical guidance to partners within the  
Library and among entities that require access to Library content;
and engage professionally in activities related to librarianship and  
digital scholarship.


Qualifications:   Master’s degree in Library Science or master’s  
degree or PhD in Computer Science, Information Sciences or related  
area. Successful candidates should have demonstrated significant and  
progressively responsible experience managing positions with a range  
of technology-specific and administrative responsibilities.   
Experience in libraries or information organizations is preferred.   
Preferred candidates will also have:
- demonstrated understanding of digital library concepts and standards  
  (e.g., metadata standards, media-specific standards);
- experience in systems design and systems architecture;
- demonstrated experience in the implementation of Open Source software  
  and tools; these tools include, but are not limited to:
    Enterprise Java development
    Ruby scripting language or equivalent
    Ruby on Rails
    Solr
    Tomcat
    Unix, Linux, AIX preferred
- an understanding of and commitment to library technologies;
- the ability to communicate clearly, both verbally and in writing;
- demonstrated ability to manage and lead information technology staff  
  and projects as well as departmental priorities;
- demonstrated knowledge of emerging technologies and related research;  
  these include but are not limited to:
    Blacklight
    Fedora/DuraSpace
    Evergreen
    Koha
    other Open Source and proprietary systems related to online library  
    environments
- strong interpersonal skills;
- and a customer-service orientation.

Salary and Benefits:  Competitive depending on qualifications. This  
position has general faculty status with excellent benefits, including  
22 days of vacation and TIAA/CREF and other retirement plans. Review  

[CODE4LIB] evening CrossFit excursion

2010-02-22 Thread Erik Hatcher

I posted to the blog and update:

   
http://wiki.code4lib.org/index.php/C4L2010_social_activities#CrossFit_Asheville

If you're one of the few, the proud, the insane, meet me in the lobby  
at 5:45pm.  I'll depart at 6pm.  Gym is really close.


Erik


[CODE4LIB] exercising at code4libcon next week

2010-02-17 Thread Erik Hatcher

code4libcon is about here, yay!

I'm kinda in a fitness craze right now, and will be doing some  
training in Asheville.


Monday night, 6:30pm, I'm going to the CrossFit Asheville gym - 
http://www.crossfitasheville.com/
I contacted them and they said that was a good time to come.  I'll  
likely go back on Wednesday night at the same time.  (dunno if they'll  
charge some fee, though).  There were a couple of folks that mentioned  
interest, and I can carpool up to 3 others.  If you've never done it  
before, now's not the time to start and I'm sure they'll only let  
experienced folks partake, but I imagine those curious about the  
insanity are welcome to spectate.


Jogging - what say folks up for runs meet in the hotel lobby at 6:30am  
any day next week.  I'm game for a relatively short run (2-3 miles)  
both Monday and Wednesday.  I fleshed out a daily signup on the wiki.   
If it's too cold or treacherous out, I'll just hit the treadmill or  
rowing machine if they have it.


http://wiki.code4lib.org/index.php/C4L2010_social_activities#Working_Out

I'm still debating how many pushups folks must do at the Black Belt  
preconference, and which kata to teach ;)


Erik


Re: [CODE4LIB] preconference proposals - solr

2009-11-13 Thread Erik Hatcher
+1, Bess!  I'm especially psyched for the kata demonstrations and  
sparring matches we'll have at the end of the session :)


I'll tinker with the advanced session description a bit when I can,  
but let's run with that for the time being.  I'm happy to have Naomi  
join me however she likes.


Erik


On Nov 13, 2009, at 11:25 AM, Bess Sadler wrote:

Hey, how about this? I've been discussing this off list with Erik  
and Naomi and this is what we came up with (I also added it to the  
wiki):


This is a proposal for several pre-conference sessions that would  
fit together nicely for people interested in implementing a next-gen  
catalog system.


1. Morning session - solr white belt
Instructor: Bess Sadler (anyone else want to join me?)
The journey of solr mastery begins with installation. We will then  
proceed to data types, indexing, querying, and inner harmony. You  
will leave this session with enough information to start running a  
solr service with your own data.


2. Morning session - solr black belt
Instructors: Erik Hatcher (and Naomi Dushay? she has offered to  
help, if that's of interest)
Amaze your friends with your ability to combine boolean and weighted  
searching. Confound your enemies with your mastery of the secrets of  
dismax. Leave slow queries in the dust as you performance tune solr  
within an inch of its life. [We should probably add more specific  
advanced topics here... suggestions welcome]


3. Afternoon session - Blacklight
Instructors: Naomi Dushay, Jessie Keck, and Bess Sadler
Apply your solr skills to running Blacklight as a front end for your  
library catalog, institutional repository, or anything you can index  
into solr. We'll cover installation, source control with git, local  
modifications, test driving development, and writing object-specific  
behaviors. You'll leave this workshop ready to revolutionize  
discovery at your library. Solr white belts or black belts are  
welcome.


And then anyone else who had a topic that built on solr (e.g.,  
vufind?) could add it in the afternoon. Obviously I'm biased, but I  
really do think the topic of implementing a next gen catalog is  
meaty enough for a half day and I know people are asking me about it  
and eager to attend such a thing.


What do you think, folks?

Bess

On 12-Nov-09, at 4:10 PM, Gabriel Farrell wrote:


On Tue, Nov 10, 2009 at 02:47:42PM +, Jodi Schneider wrote:
If you'd be up for it Erik, I'd envision a basic session in the  
morning.

Some of us (like me) have never gotten Solr up and running.

Then the afternoon could break off for an advanced session.

Though I like Bess's idea, too! Would that be suitable for a  
conference

breakout? Not sure I'd want to pit it against Solr advanced session!


The preconfs should be as inclusive as possible, but I'm wondering if
the Solr session might be more beneficial if we dive into the
particulars right off the bat in the morning.  There are only a few
steps to get Solr up and running -- it's in the configuration for our
custom needs that the advice of a certain Mr. Hatcher can really be
helpful.

You're right, though, that the NGC thing sounds more like a BOF  
session.

I'd support that in order to attend a full preconf day of Solr.


Gabriel


Elizabeth (Bess) Sadler
Chief Architect for the Online Library Environment
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

b...@virginia.edu
(434) 243-2305



Re: [CODE4LIB] preconference proposals - solr

2009-11-13 Thread Erik Hatcher

On Nov 13, 2009, at 11:42 AM, Walter Lewis wrote:


On 13 Nov 09, at 11:25 AM, Bess Sadler wrote:


1. Morning session - solr white belt
[delightful descriptions snipped]
2. Morning session - solr black belt
3. Afternoon session - Blacklight


Is there any chance that the black belt session needs to be/should  
be a two parter and run through the afternoon as well?  ... or  
repeat for those who have just acquired their white belts but are  
headed in different directions?


I'd hate to miss the Blacklight session myself though!

How about a compromise?  I'll do the morning advanced Solr  
session as proposed and then gladly make myself available for the  
remainder of the conference for any folks that have specific questions/ 
issues with Solr.


Erik


Re: [CODE4LIB] preconference proposals

2009-11-12 Thread Erik Hatcher

On Nov 11, 2009, at 6:46 PM, Naomi Dushay wrote:

What do you think about the Solr part having some specific goodies  
like:


+1 to it all!


lots on dismax magic

how to do fielded searching (author/title/subject) with dismax

how to do browsing (termsComponent query, then fielded query to get  
matching docs)


how to do boolean  (use lucene QP, or fake it with dismax)


Or, use the new Lucid contributed extended dismax parser ;)

  https://issues.apache.org/jira/browse/SOLR-1553

Erik


Re: [CODE4LIB] solr | StopFilterFactory - stopwords.txt

2009-11-12 Thread Erik Hatcher
I often recommend against stop word removal altogether.  Is there any  
reason you need to remove them?


The primary reason stop words get removed is to increase performance  
of queries with very common terms.  If you are encountering that,  
using Solr's CommonGramsFilter(Factory) is a good solution to keep  
your stop words and alleviate the performance degradation potential.   
The HathiTrust folks have had success with the common grams capability.


Erik


On Nov 11, 2009, at 3:41 PM, Eric James wrote:

Has anyone already given some thought to refining the solr  
stopwords.txt for library collections, particularly finding aids?  
The words included in the out-of-the-box stopwords.txt are of  
questionable unimportance:


an and are as at be but by for if in into is it not of on or s such  
t that the their then there these they this to was will with




We were indexing a field id with "no." as one of its tokens (for  
number), but wanted a query with "no" (where the person did not add  
the period) to find the doc; in actuality the "no" would get  
stripped by the StopFilterFactory. And thus we stumbled upon this  
list, and were a bit surprised by some of the inclusions (ex: will)  
and exclusions (ex: a).




Thanks,

Eric James

Yale University Libraries



Re: [CODE4LIB] preconference proposals

2009-11-10 Thread Erik Hatcher
I'm interested in presenting something Solr+library related at c4l10.   
I'm soliciting ideas from the community on what angle makes the most  
sense.  At first I was thinking a regular conference talk proposal,  
but perhaps a preconference session would be better.  I could be game  
for a half day session.  It could be an introductory Solr class --  
getting up and running with Solr (+ Blacklight, of course) -- or  
maybe a more advanced session on topics like leveraging dismax, Solr  
performance and scalability tuning, and so on, or maybe a freer-form  
Solr hackathon session where I'd be there to help with hurdles or  
answer questions.


Thoughts?  Suggestions?   Anything I can do to help the library world  
with Solr is fair game - let me know.


Thanks,
Erik

On Nov 9, 2009, at 9:55 PM, Kevin S. Clarke wrote:


Hi all,

It's time again to collect proposals for Code4Lib 2010 preconference
sessions.  We have space for six full day sessions (or 12 half day
sessions (or some combination of the two)).  If we get more than we
can accommodate, we'll vote... but I don't think we will (take that as
a challenge to propose lots of interesting preconference sessions).
Like last year, attendees will pay $12.50 for a half day or $25 for
the whole day.  The preconference space will be in the hotel so we'll
have wireless available.  If you have a preconference idea, send it to
this list, to me, or to the code4libcon planning list.  We'll put them
up on the wiki once we start receiving them.  Some possible ideas?  A
Drupal in libraries session? LOD part two?  An OCLC webservices
hackathon?  Send the proposals along...

Thanks,
Kevin


Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Erik Hatcher
The Lucene Highlighter doesn't require that the text you want  
highlighted be stored.  In fact, you can pass in any arbitrary text to  
the Highlighter.


See the various getBestFragments from the Highlighter class:
  http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/highlight/Highlighter.html 
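
A small sketch against the 2.4-era Highlighter API linked above, with the
document text supplied from outside the index (the field name and text are
made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class SnippetDemo {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    Query query = new QueryParser("text", analyzer).parse("\"yak culture\"");

    // The text can come from anywhere -- the Greenstone document store, a
    // database, a file -- it does not have to be stored in the Lucene index.
    String docText = "... addressing the fine points of yak culture, "
                   + "the zoosociologists took into account ...";

    Highlighter highlighter = new Highlighter(
        new SimpleHTMLFormatter("<strong>", "</strong>"), new QueryScorer(query));
    System.out.println(highlighter.getBestFragment(analyzer, "text", docText));
  }
}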



Erik


On Sep 29, 2009, at 7:01 AM, Yitzchak Schaffer wrote:


Hello,

Sorry for any cross-posting annoyance.  I have a request for a  
Greenstone collection I'm working on, to add context snippets to  
search results; for example a search for yak culture might return  
this in the list of results:


... addressing the fine points of <strong>yak culture</strong>, the  
zoosociologists took into account ...


Sounds like a pretty basic feature, say our sponsors, and I agree.   
(Ah, it's also an old Trac ticket at http://trac.greenstone.org/ticket/444)


I see that GS out-of-the-box is set *not* to store the fulltext in  
the index, which seems to be a prerequisite for this kind of thing,  
as in http://bit.ly/ljNkL .  Has anyone modified the Lucene indexing  
wrapper locally to do this?


Given that we don't have any Java coders on staff, I've started  
porting the Lucene wrapper to PHP for use with a custombuilder.pl  
and Zend_Search_Lucene.  I already have a PHP frontend, so adjusting  
that to display the results shouldn't be a problem; OTOH because the  
frontend is PHP, I'm restricted to using buildtype lucene, or  
something else with good PHP support.


Many thanks,

--
Yitzchak Schaffer
Systems Manager
Touro College Libraries
33 West 23rd Street
New York, NY 10010
Tel (212) 463-0400 x5230
Fax (212) 627-3197
Email yitzchak.schaf...@gmx.com


Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Erik Hatcher

On Sep 29, 2009, at 7:33 AM, Yitzchak Schaffer wrote:


Erik Hatcher wrote:
The Lucene Highlighter doesn't require that the text you want  
highlighted be stored.  In fact, you can pass in any arbitrary text  
to the Highlighter.


Thanks Erik,

What I'm looking for is to return the context of the search result,  
not just the ID of the containing document - e.g. when all I input  
is yak culture, I get back the context from the document as a  
search result, without having to retrieve the doc itself:


... addressing the fine points of <strong>yak culture</strong>, the  
zoosociologists took into account ...


GS out of the box does not appear to support this, as it does not  
store the fulltext in the index.  So yes, I can highlight stuff, but  
as it stands, I don't have the text to work with.  IANA Lucene guru,  
so correct me if I misunderstand.


I'm a bit confused then.  You mentioned that somehow Zend Lucene was  
going to help, but if you don't have the text to highlight anywhere  
then the Highlighter isn't going to be of any use.  Again, you don't  
need the full text in the Lucene index, but you do need to get it from  
somewhere in order to be able to highlight it.


Erik


Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread Erik Hatcher

Here's a post on how easy it is to send PDF documents to Solr from Java:

  http://www.lucidimagination.com/blog/2009/09/14/posting-rich-documents-to-apache-solr-using-solrj-and-solr-cell-apache-tika/ 



Not only can you post PDF (and other rich content) files to Solr for  
indexing, you can also, as shown in that blog entry, extract the text  
from such files and have it returned to the client.  This Solr  
capability makes the tool chain a bit simpler.
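
In the same spirit as that blog post, a minimal SolrJ/Solr Cell sketch (the
file name, literal id, and field mappings are just examples):

import java.io.File;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class IndexPdf {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("sermon-001.pdf"));    // example file
    req.setParam("literal.id", "sermon-001");   // your own metadata as literal.* params
    req.setParam("uprefix", "attr_");           // catch-all prefix for unmapped fields
    req.setParam("fmap.content", "text");       // send the extracted text to the "text" field
    req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    solr.request(req);
  }
}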


Erik


On Sep 15, 2009, at 10:31 AM, Peter Kiraly wrote:


Hi all,

I would like to suggest an API for extracting text (including  
highlighted or

annotated ones) from PDF: iText (http://www.lowagie.com/iText/).
This is a Java API (it has a C# port), and it helped me a lot when we  
worked with extraordinary PDF files.

Solr uses Tika (http://lucene.apache.org/tika) for extracting text from  
documents, and Tika uses PDFBox (http://incubator.apache.org/pdfbox/)  
to extract from PDF files. It is a great tool for normal PDF files,  
but it has (or at least had) some features which I wasn't satisfied  
with:

- it consumed more memory compared with iText, and couldn't
read files above a given size (this was large, about 1 GB, but we
had even larger files)

- it couldn't correctly handle conditional hyphens at the end of
the line
- it had poorer documentation than iText, and its API was also
poorer (at that time Manning had published the iText in Action book).

Our PDF files were double layered (original hi-res image + OCR-ed text),  
several-thousand-page documents (Hungarian scientific journals,  
the diary of the Houses of Parliament from the 19th century, etc.).  
We indexed the content with Lucene, and in the UI we showed one page per screen,
so the user didn't need to download the full PDF. We extracted the
table of contents from the PDF as well, and we implemented it in the  
web UI, so the user can browse pages according to the full file's TOC.

This project happened two years ago, so it is possible that lots of  
things have changed since then.

Király Péter
http://eXtensibleCatalog.org

- Original Message - From: Mark A. Matienzo m...@matienzo.org 


To: CODE4LIB@LISTSERV.ND.EDU
Sent: Tuesday, September 15, 2009 3:56 PM
Subject: Re: [CODE4LIB] indexing pdf files



Eric,


 5. Use pdftotext to extract the OCRed text
  from the PDF and index it along with
  the MyLibrary metadata using Solr. [3, 4]



Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library


Re: [CODE4LIB] Usability evaluation of library online catalogues

2008-02-05 Thread Erik Hatcher

On Feb 4, 2008, at 4:12 PM, David Fiander wrote:

Actually, the idea of using AJAX to create a way to add and remove
limits dynamically is exactly what U Virginia's blacklight interface
does, although with a slightly different interface:

http://blacklight.betech.virginia.edu/


David - that is not accurate.  Blacklight doesn't use Ajax anywhere
currently (that I know of, and not for the basic search/browse/facet
functionality at least).   The only place I had Ajaxed it was with as-
you-type suggest, but we took that out fairly early on in the Solr +
Flare + MARC project, before it was even Blacklight, to avoid that as
a performance issue as we were toying with the faceted UI.

Blacklight morphed over time to deal with facets differently, with my
first incarnation using server-side session scope to keep track of
the users query/browse/invert trail.  After I left, the next
developer to tinker with it moved that session state to the URL, to
make it all bookmarkable:

http://tinyurl.com/39abwa  (and that is very horrible URL, IMO, and
deserves a refactoring to be readable and hackable)

But, no Ajax in the mix currently, not even for as-you-type
suggestion.   You can see the suggest feature built into Solr Flare
in action here, if you know a Japanese character or two:
http://www.rondhuit-demo.com/yademo/

   Erik

p.s. I enjoyed the OLA tech trends session on Saturday.  How's
Blacklight look on your cell phone?  :)


Re: [CODE4LIB] arg! classpaths!

2008-01-26 Thread Erik Hatcher

Sadly the Lucene demo is not all that great.

I recommend you start with Solr rather than Lucene directly.

   Erik

On Jan 26, 2008, at 9:30 AM, Eric Lease Morgan wrote:


(Arg! Classpaths!)

Please tell me why Java throws the NoClassDefFoundError error when I
think I have set up my classpath correctly:

$ pwd
/home/eric/lucene
$ ls -lh
total 720K
-rw-r--r-- 1 eric eric 650K 2008-01-26 08:30 lucene-core-2.3.0.jar
-rw-r--r-- 1 eric eric  52K 2008-01-26 08:30 lucene-demos-2.3.0.jar
$ export CLASSPATH=$PWD:.
$ echo $CLASSPATH
/home/eric/lucene:.
$ java org.apache.lucene.demo.IndexFiles
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/
lucene/demo/IndexFiles


Put another way, I have downloaded Lucene and I'm trying to do the
demo application. [1] You are suppose to put the .jar files in your
classpath and then run java org.apache.lucene.demo.IndexFiles. What
am I doing wrong?

[1] http://lucene.apache.org/java/2_3_0/demo.html

--
Eric Lease Morgan


Re: [CODE4LIB] Getting started with SOLR

2007-11-23 Thread Erik Hatcher

On Nov 22, 2007, at 3:41 PM, Kent Fitch wrote:

On Nov 23, 2007 4:11 AM, Binkley, Peter [EMAIL PROTECTED]
wrote:


...

If you use boost on the date field the way you suggest, remember
you'll
have to reindex from scratch every year to adjust the boost as items
age.



Or maybe just use a method such that 2007 dates boost the document
by 3.0,
2008 dates by 3.1, 2009 by 3.2 ...  Whether this is feasible
depends on how
else you are expecting other scoring boosts to interact with document
intrinsic boosts.


Don't do date boosting at index time, but rather tune things on the
query end, using FunctionQuery and such.   That'll give you maximum
flexibility.
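
As a rough illustration of a query-time recency boost -- using SolrJ, which
postdates this thread, and a hypothetical pub_date field:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class RecencyBoost {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/core");

    SolrQuery q = new SolrQuery("neurology");
    q.set("defType", "dismax");
    q.set("qf", "title author");
    // Query-time recency boost: newer pub_date values rank higher, and nothing
    // has to be reindexed as documents age (the classic recip/rord recipe).
    q.set("bf", "recip(rord(pub_date),1,1000,1000)");

    System.out.println(solr.query(q).getResults());
  }
}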

   Erik


[CODE4LIB] Portland library geeks?

2007-05-17 Thread Erik Hatcher

I'll be in Portland later today through Monday for RailsConf.  The
schedule is really tight, but if there are some library geeks in the
area that want to get together around a pool table let me know.

   Erik


Re: [CODE4LIB] pspell aspell: make your own word lists/dictionaries

2007-04-03 Thread Erik Hatcher

Martin has created a Google Group for his spell checker, and
discussions have been ongoing since c4lcon about how to contribute it
to Lucene.   You can learn more about it here:

   http://groups.google.com/group/spelt

Martin has packaged the code with tests for folks to try it out easily.

   Erik


On Apr 3, 2007, at 2:01 PM, Jonathan Rochkind wrote:


I haven't had time to look at it yet, but someone at Code4Lib
conference
proposed a more sophisticated approach to spell checking that sounded
really interesting to me, and said he was going to share the code. I
hope to have time to investigate at some point.

Let's see if I can find it on the conference page... yeah, it was
Martin Haye. You can watch his presentation here:
http://video.google.com/videoplay?docid=4028600349627496246hl=en

Looks like he's *martin*.*haye*[at]gmail.com.  During the lightning
talk, he said he didn't want to distribute the code separately but
wanted to include it in Lucene if possible---but later in the
conference, he said he had been convinced by the interest in it to
distribute the code as its own standalone thing, and planned to do
that presently.

If anyone does or has explored using martin's code, please let us know
about your experience.

Jonathan

Kevin Kierans wrote:

Has anyone created their own dictionaries
for aspell?  We've created blank delimited
lists of words from our opac.  One for title,
one for subjects, and one for authors.  (We're thinking
of a series one as well)

We would like to use
one of these word lists to offer suggestions
depending on which search the patron is making.
We're assuming we can make better suggestions
if the words come from our actual opac.

We've got it working with the dictionary that
comes with aspell, but having problems (we can't do it!)
substituting our own  dictionaries.

Does anyone have any experience/knowledge/hints/pointers
they can share with us?

We are using linux, php 5,  aspell 0.50.5, and
php - pspell functions.

Thanks,
Kevin
TNRD Library System, Kamloops, British Columbia, Canada




--
Jonathan Rochkind
Sr. Programmer/Analyst
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu


Re: [CODE4LIB] Video encoding done - Mashup idea request

2007-03-16 Thread Erik Hatcher

Slides schmides.  :)

Just having slides synched to a speaker works for some cases, but for
those of us that love doing live demos, coding on the fly, and just
flat out winging it, the slides are often just barely related to
what's being said.  Having the actual screen being presented is all
that would have made sense for my presentation, for example.

   Erik


On Mar 16, 2007, at 6:36 PM, Gabriel Farrell wrote:


On Fri, Mar 16, 2007 at 09:20:47PM -, Richard Wallis wrote:

If a mix of video and screen capture could be achieved it would be
great, but certainty of a successful result AND a fallback if it
doesn't
work should be part of the plan.

RJW



Agreed.  Video plus screen capture would be nice, but video of the
presenter and a copy of their slides to follow along with works pretty
damn well for me.

As far as trying to combine the two and sleek up the experience,
Columbia's SOA has a nice setup for this on their site (see
http://www.columbia.edu/itc/soa/dmc/cory_arcangel/ for example).
There are some glitches in the timing, though, and I'm just not sure
it's worth the extra effort.

Gabe

PS In a similar vein, when I don't have separately available slides to
follow along with, I find many of the Google TechTalks annoying.  It's
a lot of watching some not-terribly-handsome person refer to things I
can't see.  Having both video and slides is key.


Re: [CODE4LIB] Flamenco

2007-03-07 Thread Erik Hatcher

On Mar 7, 2007, at 6:55 AM, K.G. Schneider wrote:


A mention of the Flamenco project (open source faceted navigation) on
Catalogablog made me wonder if anyone on c4l had looked at this:

http://flamenco.berkeley.edu/


Of course!  Many of us have been all over Flamenco since we first saw
it.  There was a lightning talk about it at the conf. last week in
fact.  It's very nice that it's open sourced now.  I've yet to give
it a try (besides using the online demos)... but it has some nice
features that I'm going to borrow into the Solr flare work such as
hierarchical facets.

   Erik


Re: [CODE4LIB] Preconference

2007-02-13 Thread Erik Hatcher

On Feb 13, 2007, at 9:47 AM, Susan E Teague Rector/FS/VCU wrote:

Are we supposed to be using a predefined set of data for the
preconference
or can we use our own data?


Susan - I'm going to package up a lot of stuff (Solr, sample
datasets, Luke, etc) to help everyone get started, but bringing your
own data is encouraged as long as you also bring along the necessary
tools and know-how to process that data into something usable by Solr
(either XSLT to .xml files, or via code that speaks to Solr directly).

So by all means bring your data.
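(If it helps to see the target, the XML Solr wants for adds is about
as simple as it gets - field names here are made up, your schema
defines the real ones:

  <add>
    <doc>
      <field name="id">u12345</field>
      <field name="title">Zen Training</field>
      <field name="subject">Meditation</field>
    </doc>
  </add>

POST that to Solr's /update URL and you're indexing.)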

   Erik


Re: [CODE4LIB] Preconference

2007-02-13 Thread Erik Hatcher

On Feb 13, 2007, at 10:58 AM, Jonathan Rochkind wrote:


If we bring MARCXML and/or MODS, can we assume that there will be
people
who can help us process that data into something useable by Solr?
That
would be a nice, at any rate.


Yes, yes you can make such an assumption.  However, I want this to be
clear - we'll have plenty of sample data to play with in your own
environments, so you can rest assured you'll leave the pre-conference
knowing that you're in good hands with Solr.  Fiddling with your
specific data set, if it doesn't fit into any of the pre-fab scripts
at the time of the pre-conf, could take some time, and it'll be much
better to get Solr know-how than MARCXML know-how.

   Erik


Re: [CODE4LIB] Preconference

2007-02-13 Thread Erik Hatcher

Or here:

   http://wiki.apache.org/solr/Solr4Lib

I like the Solr wiki for highly Solr specific stuff... though the
code4lib site can handle file attachments whereas the Apache wiki
does not (I don't think).  So, maybe upload files to code4lib, but
consider adding a blurb on the Solr4Lib wiki page pointing to it.

   Erik



On Feb 13, 2007, at 12:48 PM, Bess Sadler wrote:

Could we post them on the code4lib.org page about the pre-conference?
http://code4lib.org/node/139

Bess

On Feb 13, 2007, at 12:03 PM, Binkley, Peter wrote:


That would be great. I've got a MODS-to-Solr xsl to share as well.
Where
would be a good place to post these, along with relevant Solr
schemas?

Peter

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On
Behalf Of
Andrew Nagy
Sent: Tuesday, February 13, 2007 9:18 AM
To: CODE4LIB@listserv.nd.edu
Subject: Re: [CODE4LIB] Preconference

I have an XSLT doc for transforming MARCXML to SOLR XML that I can
share
around.

Andrew

Jonathan Rochkind wrote:

If we bring MARCXML and/or MODS, can we assume that there will be
people who can help us process that data into something useable by
Solr?  That would be a nice, at any rate.

Jonathan

Erik Hatcher wrote:

On Feb 13, 2007, at 9:47 AM, Susan E Teague Rector/FS/VCU wrote:

Are we supposed to be using a predefined set of data for the
preconference or can we use our own data?


Susan - I'm going to package up a lot of stuff (Solr, sample
datasets, Luke, etc) to help everyone get started, but bringing
your
own data is encouraged as long as you also bring along the
necessary
tools and know-how to process that data into something usable by
Solr



(either XSLT to .xml files, or via code that speaks to Solr

directly).


So by all means bring your data.

   Erik



--
Jonathan Rochkind
Sr. Programmer/Analyst
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu


Elizabeth (Bess) Sadler
Head, Technical and Metadata Services
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[EMAIL PROTECTED]
(434) 243-2305


Re: [CODE4LIB] Getting data from Voyager into XML?

2007-01-19 Thread Erik Hatcher

On Jan 17, 2007, at 3:26 PM, Andrew Nagy wrote:

One thing I am hoping that can come out of the preconference is a
standard XSLT doc.  I sat down with my metadata librarian to
develop our
XSLT doc -- determining what fields are to be searchable what fields
should be left out to help speed up results, etc.

It's pretty easy, I think you will be amazed how fast you can have a
functioning system with very little effort.


You're quite right with that last statement.

I am, however, skeptical of a purely MARC -> XSLT -> Solr solution.
The MARC data I've seen requires some basic cleanup (removing dots at
the end of subjects, normalizing dates, etc) in order to be useful as
facets.  While XSLT is powerful, this type of data manipulation is
better (IMO) done with scripting languages that allow for easy
tweaking in a succinct way.  I'm sure XSLT could do everything that
you'd want done; you can also drive screws in with a hammer :)

That being said - if you've got XSLT chops, and can easily go from
MARC XML to Solr's XML [1], you'll be in great shape at the pre-
conference for quickly getting your data into Solr and seeing what
needs to be cleaned up.  Seeing the raw data in a faceted way is
actually very helpful for knowing where to go next with cleanup, and
showing catalogers where inconsistencies live in the data.
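(A rough sketch of the kind of cleanup I mean - in Java here for
familiarity, though the same couple of lines in Ruby or Perl is even
terser; the regex is illustrative, not a blessed rule:

  // strip ISBD-ish trailing punctuation from a subject heading
  String subject = rawSubject.replaceAll("[.,;:/\\s]+$", "").trim();
  // "Japan -- History." becomes "Japan -- History"

Easy to tweak as the catalogers point out the next oddity.)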

   Erik

[1] http://wiki.apache.org/solr/UpdateXmlMessages


Re: [CODE4LIB] Getting data from Voyager into XML?

2007-01-19 Thread Erik Hatcher

Tod,

Great information.  I apologize for being a latecomer to the game
and bringing up FAQs.

What about date normalization?

One thing that must be considered when doing faceted browsing is that
it works best with some pre-processed data, such as years rather than
full dates.  The question becomes where does the logic for stripping
out the years belong?  Solr could do it if configured with a custom
analyzer for certain fields, or the client could do it.  Is there
XSLT to do this sort of thing with dates available?
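(Client-side it's a one-liner in most languages - a sketch, in Java,
not tied to any particular MARC date convention:

  Matcher m = Pattern.compile("\\b(1[5-9]\\d\\d|20\\d\\d)\\b").matcher(rawDate);
  String year = m.find() ? m.group(1) : null;  // first plausible 4-digit year, if any

Pattern/Matcher are java.util.regex; the same idea ports to XSLT 2.0's
regex functions if you'd rather keep it all in the stylesheet.)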

   Erik


On Jan 19, 2007, at 5:58 AM, Tod Olson wrote:


On Jan 19, 2007, at 4:07 AM, Erik Hatcher wrote:


On Jan 17, 2007, at 3:26 PM, Andrew Nagy wrote:

One thing I am hoping that can come out of the preconference is a
standard XSLT doc.  I sat down with my metadata librarian to
develop our
XSLT doc -- determining what fields are to be searchable what fields
should be left out to help speed up results, etc.

It's pretty easy, I think you will be amazed how fast you can have a
functioning system with very little effort.


You're quite right with that last statement.

I am, however, skeptical of a purely MARC -> XSLT -> Solr solution.
The MARC data I've seen requires some basic cleanup (removing dots at
the end of subjects, normalizing dates, etc) in order to be useful as
facets.  While XSLT is powerful, this type of data manipulation is
better (IMO) done with scripting languages that allow for easy
tweaking in a succinct way.  I'm sure XSLT could do everything that
you'd want done; you can also drive screws in with a hammer :)


So the punctuation stripping has already been done in XSLT.

LoC has a MARCXML -> MODS XSLT stylesheet [1] which strips out the
evil
ISBD punctuation. I've generally found mapping from MODS to be more
convenient than mapping from MARC, so while it's an extra step, it
does
save a little programmer time since some of the hidden hierarchy in
the
MARC data is made explicit in the MODS structure.

If hopping through MODS is unacceptable, the LoC has the punctuation-
stripping nicely tucked away into a MARC Conversion Utility Stylesheet
that you could use directly in a MARC XML -> Solr transformation. [2]

[1] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS.xsl
[2] http://www.loc.gov/marcxml/xslt/MARC21slimUtils.xsl


Tod Olson [EMAIL PROTECTED]
Programmer/Analyst
University of Chicago Library


Re: [CODE4LIB] Lucene Newbie Question

2007-01-11 Thread Erik Hatcher

Andrew,




On Jan 11, 2007, at 10:47 AM, Andrew Darby wrote:

Hello, all.  I'm trying to get started with Lucene for the Code4Lib
preconference


Excellent!!!


and was wondering if someone could help.


Of course


I'm trying to
do the first example from the Lucene site
(http://lucene.apache.org/java/docs/demo.html) on my Windows XP
machine but when I try to build the test index from the command line
like so:

C:\lucene-2.0.0> java org.apache.lucene.demo.IndexFiles C:\lucene-2.0.0/src

I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/demo/IndexFiles

My CLASSPATH looks like this:

.;C:\Program Files\QuickTime\QTSystem\QTJava.zip;C:\lucene-2.0.0\build\lucene-core-2.0.1-dev.jar;C:\lucene-2.0.0\build\lucene-demos-2.0.1-dev.jar;


2.0.1?  Where'd you get that version?

I pulled down the latest stable release, 2.0.0, just now to run
through this myself.

Rather than setting CLASSPATH (an evil thing in the Java world, it
can really bite you at inopportune times), I ran it this way
successfully:

java -cp lucene-core-2.0.0.jar:lucene-demos-2.0.0.jar org.apache.lucene.demo.IndexFiles src/


I assume this is a basic error, and something to do with the
classpath, but as best I can tell everything is correct, the
IndexFiles.class file is where it should be, etc.  I'm not familiar
with Java, if you haven't guessed.  Any suggestions?


Sadly the demo that ships with Lucene is pretty weak.  For more
examples, grab the Lucene in Action (LIA) codebase from
http://www.lucenebook.com and fire it up simply by typing "ant" and
following the instructions in the README too.  That code is for
Lucene 1.4.3 - 1.9.x.  Lucene 2.0 removed deprecated methods, and
there are a few tidbits of trivia to adjust LIA code to Lucene 2.0
available here:

   http://www.nabble.com/Lucene-in-Action-examples-complie-problem-tf2418478.html#a6743189

The demo that ships with Lucene is barely usable for anything other
than "yeah, it can search text", but boy is it a hassle to run.  Keep
in mind that Lucene is a low-level library, so to get much use out of
it you have to build something around it.  The Indexer and Searcher
command-line apps in the LIA code base provide a better working demo
out of the box, but still quite crude.
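(For a flavor of what building around it looks like, here's roughly
the whole round trip against the Lucene 2.0 API - a sketch, with
imports (all org.apache.lucene.*) and exception handling elided:

  IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
  Document doc = new Document();
  doc.add(new Field("contents", "the quick brown fox", Field.Store.YES, Field.Index.TOKENIZED));
  writer.addDocument(doc);
  writer.close();

  IndexSearcher searcher = new IndexSearcher("index");
  Query query = new QueryParser("contents", new StandardAnalyzer()).parse("quick fox");
  Hits hits = searcher.search(query);
  for (int i = 0; i < hits.length(); i++) {
    System.out.println(hits.doc(i).get("contents"));  // stored field of each match
  }
  searcher.close();

Everything interesting - incremental updates, schema discipline,
caching, an HTTP face - is what you end up building yourself, which is
exactly the itch Solr scratches.)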

   Erik


Re: [CODE4LIB] Lucene Newbie Question

2007-01-11 Thread Erik Hatcher

On Jan 11, 2007, at 12:10 PM, Andrew Darby wrote:


Thanks Erik and Bess.  Erik:  Lamentably, your

java -cp lucene-core-2.0.0.jar:lucene-demos-2.0.0.jar
org.apache.lucene.demo.IndexFiles src/

threw the same error.


That is probably due to your environment CLASSPATH (I told you it was
trouble! :).  Remove that environment variable altogether, or set it
to blank, or at least remove all the Lucene JARs from it, and all
should be well.


  I'm going to take a look at the LuceneInAction
codebase and see if I can get it working that way.  Thanks for taking
the time to install 2.0.0.  I don't know how I ended up with 2.0.1
jars--they appeared when I ran ant . . . .


Oh, you built Lucene from source... that makes sense.  The version
number in the build file of a 2.0.0 release distribution is set one
version higher in the build environment - this allows us Lucene
developers to at least have a clue that the user built it themselves
(and possibly tinkered with the source code) instead of using a
binary release (helps in supporting users to know what version folks
are running).

   Erik


Re: [CODE4LIB] Lucene Newbie Question

2007-01-11 Thread Erik Hatcher

On Jan 11, 2007, at 2:54 PM, Erik Hatcher wrote:

On Jan 11, 2007, at 12:10 PM, Andrew Darby wrote:


Thanks Erik and Bess.  Erik:  Lamentably, your

java -cp lucene-core-2.0.0.jar:lucene-demos-2.0.0.jar
org.apache.lucene.demo.IndexFiles src/

threw the same error.


That is probably due to your environment CLASSPATH (I told you it was
trouble! :).  Remove that environment variable altogether, or set it
to blank, or at least remove all the Lucene JARs from it, and all
should be well.


Also, in case you copied my exact example, (I noticed you're on
Windows) you'll need to adjust the path separator from : to ; between
the two JARs in the -cp switch.  That might be the issue rather than
CLASSPATH, come to think of it.
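(In other words, on Windows the same command looks roughly like:

  java -cp lucene-core-2.0.0.jar;lucene-demos-2.0.0.jar org.apache.lucene.demo.IndexFiles src

with ; rather than : between the jars.)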

   Erik


[CODE4LIB] solrb DSL collaboration

2007-01-09 Thread Erik Hatcher

Things got a little edgy in #code4lib today about where we* are going
with solrb (the Ruby/Solr domain-specific language API), so we're
going to add a bit of process by fleshing it out via the solrb section
of the Solr wiki.  Below is the first draft, though I've revised it
slightly since then.  Click the link to add your ideas.

Notice you can get e-mail notifications of Apache wiki changes,
though you'll have to sign up for the solr-commits e-mail list -
email a blank message to [EMAIL PROTECTED] -
and you'll also see the gory details of svn commit messages (not only
in solrb/flare, but also to Solr itself).  Reviewing code commit
messages is a practice I highly recommend, for what it's worth.

   Erik

* we is us.  I'm really working hard on making this a collaborative
community-oriented effort.  Our pre-conference will hopefully consist
of a few folks that have gone along for the solrb/flare ride over the
next several weeks and will be literally experts in the domain by then.


Begin forwarded message:



From: Apache Wiki [EMAIL PROTECTED]
Date: January 9, 2007 5:00:01 PM EST
To: solr-commits@lucene.apache.org
Subject: [Solr Wiki] Update of solrb/BrainStorming by ErikHatcher
Reply-To: solr-dev@lucene.apache.org

Dear Wiki user,

You have subscribed to a wiki page or wiki category on Solr Wiki
for change notification.

The following page has been changed by ErikHatcher:
http://wiki.apache.org/solr/solrb/BrainStorming

New page:
This is a straw-man pie-in-the-sky thinking on a Ruby Solr DSL:

{{{
class Book
  attr_reader :isbn
  attr_reader :title
  attr_reader :subjects
  attr_reader :authors
  attr_reader :dates


  include Solrable
  solr_facet :subject
  solr_facet :year do
self.dates.collect { |date| date.year }
  end
end

# Search for full-text matches in the author* field
Book.search_by_author("sekida")

# Search for full-text matches in the title* field
Book.search_by_title("zen")

# Search for full-text matches as a boolean AND in author and title
Book.search_by_author_and_title("sekida", "zen")
}}}

Other ideas?



[CODE4LIB] Solr Flare

2007-01-02 Thread Erik Hatcher

code4libers,

I've kicked off a sub-project of Solr called Flare.  It has several
goals, including being a Solr Ruby DSL and providing a general-purpose
user interface framework covering faceted browsing, suggest
interfaces, and the folksonomy angle of tagging/annotating results.
At the moment there isn't too much there, but I will be devoting every
waking spare moment to this over the next couple of months.  To get
involved, first check out the Flare wiki:

   http://wiki.apache.org/solr/Flare

Then sign up for the solr-user e-mail list, submit patches, and so
on.  From #code4lib, I've seen that this community has some very
sharp Ruby talent and I'm eager to have collaborators on this.  I'm
building this out as a distillation of Collex to the faceted browsing
piece, and evolving it up to the folksonomy features, as a way to
start with a clean slate in a test centric fashion.  The initial goal
is for this to become the demo I do for the UVa library folks on
their 3M+ MARC records, and to also use this for the code4lib pre-
conference.

I'm looking forward to your suggestions, critique, patches, and
collaborations of any level.

Thanks,
   Erik


[CODE4LIB] Fwd: Solr 1.1 released

2006-12-23 Thread Erik Hatcher

The state of Solr's official release was asked about on #code4lib the
other day.  Here ya go, hot off the press


Begin forwarded message:


From: Yonik Seeley [EMAIL PROTECTED]
Date: December 22, 2006 5:07:49 PM EST
To: solr-user@lucene.apache.org, solr-dev@lucene.apache.org
Subject: Solr 1.1 released
Reply-To: solr-dev@lucene.apache.org

Solr 1.1 is now available for download!  This is the first official
release since Solr entered the Incubator.

The release is available at
 http://people.apache.org/dist/incubator/solr/1.1/
and the detailed changelog is at
 http://people.apache.org/dist/incubator/solr/1.1/CHANGES.txt

Thanks to everyone that helped make this happen!

-Yonik


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-29 Thread Erik Hatcher

On Nov 29, 2006, at 10:27 AM, Art Rhyno wrote:

I am so behind in e-mail that I might be treading on ground that is
worn
out on this, but I would add to Eric's list that I don't care about
the
indexer if:


Here's how Lucene/Solr fares on these points:


* the indexer has an open and configurable relevancy weighting
algorithm


Relevancy in Lucene is adjustable in a number of ways: index- and
query-time boosts, and tweaking Similarity.
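(A sketch of what those knobs look like in code - numbers picked out
of thin air, not recommendations:

  // index-time boosts on a whole document or a single field
  doc.setBoost(1.5f);
  titleField.setBoost(2.0f);

  // query-time boost on one clause
  TermQuery authorQuery = new TermQuery(new Term("author", "rossetti"));
  authorQuery.setBoost(3.0f);

  // or swap in your own scoring tweaks wholesale
  searcher.setSimilarity(new DefaultSimilarity() {
    public float lengthNorm(String fieldName, int numTerms) {
      return 1.0f;  // e.g. stop penalizing long fields
    }
  });
)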


* the indexer allows control of how the data is normalized


Is this the indexer's job?  I say not.   Sure, we'd all love to have
everything including the kitchen sink hidden behind some drag-and-
drop interface, but it really isn't Solr's job to clean up data.  I'm
not quite sure what you mean by "normalized" though, so maybe I'm off
base?


* the indexer uses pluggable parsers


Solr doesn't know MARC from Adam.

Again, it isn't Solr's job to parse MARCXML, I'd argue.  It's a full-
text search engine, and overloading it to be more than that is asking
for trouble later when you do want to swap things out.

Maybe you mean tokenization rather than parsing though, in which case
Solr and Lucene certainly have great configurability.


* the indexer supports very fast retrieval


:)   But of course!


then, on the preferred side:

* the indexer allows the index process to effectively leverage
commodity
hardware


The beefier the better.


* the indexer creates an index that can be combined with others


Solr may eventually federate with other Solr instances - that is on
the TODO list.  And there was recently a message from someone adding
an SRU/SRW interface to it.


One of our most common comments when we do
surveys of our user community is "don't show me what you can't deliver
NOW". A world class indexer opens the door for scoping at the
collection
level, there doesn't have to be one solution for IR and it would be
a very
unhealthy ecosystem without variance, but I suspect it would be
easier to
convince a company like Elsevier that I want a lucene index for
licensed
content than almost any other technology offering. So a definite
yes to
SRU, OpenURL,  Z39.50, and the rest, but I wonder if sustaining a
lucene
index is a good idea regardless of what the main building blocks for a
library's preferred IR layer turn out to be. Library standards
don't tend
to delve into the architecture of indexing anyway, but this is really
where a lot of what can be delivered gets defined.


We discussed this in Windsor, but for everyone else's benefit I
personally don't think sharing a Lucene index is the right
granularity to work with.  The specifics of the index format evolve
with each new version of Lucene (with backwards compatibility in
mind, for sure).  The better granularity to consider is the interface
to the index, like SRU or Solr's custom interface, etc.  And the
library world already has these standards in place that could easily
be put on top of Lucene or Solr.

   Erik


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Erik Hatcher

On Nov 28, 2006, at 5:44 PM, Kevin S. Clarke wrote:

Is there a standard for specifying how textual analysis works as
well, so that tokenization can be standardized across these XQuery
engines as well?


Not that I know.  What I've seen so far is that tokenization is
implementation specific.  Perhaps this is something that is
configurable so that implementations can be set up and then queried
consistently.  Any indexing engine worth its salt should be
configurable I'd think.  There is nothing I'm aware of in the fulltext
work though that defines how things are indexed.


If you leave out all the configurability in tokenization for indexing
and querying from the XQuery standard, then there will surely be
extensions needed for concrete implementations to allow this stuff to
be specified.  Interesting issue.

For all you Java-savvy folks out there, how about standards like
J2EE that make it easy to move an application from one vendor's app
server to another?  Works for the simplest of applications, but all
vendors have their own specific custom deployment descriptors too.

   Erik


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Erik Hatcher

On Nov 28, 2006, at 3:28 PM, Andrew Nagy wrote:

The major problem
with it all is the ugly mess that is marcxml


This brings up an interesting point about just dropping our source
XML data into an XML-savvy database and using XQuery on it.

Maybe y'all have much cleaner data than I've seen, but my experience
with Rossetti Archive has had many XML data hurdles.  When I came on
board, Tamino was being used for the search engine, with XPath
queries all over the place.  The raw data is not consistent, and a
single word query expanded into an enormous XPath query to look at
many elements and attributes, not to mention it was SLOW.  Analyzing
the user interface and the real-world searching needs, I wrote Java
code that normalized the data for searching purposes into a much
coarser-grained set of fields, indexing it into Lucene, and voila:
http://www.rossettiarchive.org/rose

The point is that even with super fast full-text searching with
XQuery, most of our archives are probably going to require hideous
expressions to query them using their raw structure, especially if you
have to account for data cleanup too (such as date formatting issues,
which we also have in RA raw data).

I realize I'm sounding anti-XQuery, which is sorta true, but only
because in the real-world in which I work it works better to have
some custom digesting of the raw data than to just toss it in and
work with standards.  Indexing is lossy - it's about keying things
the way they need to be looked up.  If your data is clean, you're in
better shape than me.  And if XQuery on your raw data does what you
need, by all means I recommend it.

   Erik


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-27 Thread Erik Hatcher

On Nov 27, 2006, at 5:46 PM, Binkley, Peter wrote:

You've got enough flexibility in the way you
set up your Lucene index, and Lucene search results give you access to
the term weights for each hit,


It does?


so you can tell which fields actually
matched.


You can?

I'm curious how you're doing that!  Especially with Solr in the picture.


There would probably be a lot of optimizations you could do within
Solr
to help with this kind of thing. Art and I talked a little about
this at
the ILS symposium: why not nestle the XML db inside Solr alongside
Lucene? Solr could then manage the indexing of the contents of the db,
and augment your search results with data from the db: you could get
full records as part of your search results without having to store
them
in the Lucene index.


There have been discussions in the Solr community about having hooks
added to allow Solr plugins to pull data from external sources to
return with search results.  I don't think Solr itself is the entry
point to these external systems, as that seems to couple things a bit
too much for my tastes, so I think you'd still want to manage the
external data source separately from indexing into Solr, but having
hooks for Solr to return hybrid results could be just the ticket here.

   Erik


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-27 Thread Erik Hatcher

On Nov 27, 2006, at 6:12 PM, Binkley, Peter wrote:


Fair point, and that's how my current solr-based project works. I'm
thinking I would like the other advantages of an XML db: the
ability to
run xqueries, batch updates, etc., alongside the Lucene searching.
And I
want them integrated under the hood in Solr so that people smarter
than
me will maintain and optimize the connections. But I agree that this
approach will have to prove that the extra overhead is worth it.


The main concern I have is where the hood line is drawn here.  I
don't think Solr should become a central repository interface, but it
should be easily integrated and perhaps be able to tie in external
resources to returned results.

My ideal solution is a WebDAV interface with commit hooks into Solr.
With Subversion behind the scenes of the WebDAV interface, or an XML
database, or file system or whatever. :)

   Erik


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-27 Thread Erik Hatcher

On Nov 27, 2006, at 5:04 PM, Jonathan Rochkind wrote:


Bess Sadler wrote:

application. That way you can use solr / lucene for search, faceted
browse, etc, and your XML database only for known item retrieval,
which it is generally able to do without performance issues. I'm
hopping up and down waiting for someone to take this approach with an
ILS, so please come and show us what you've got!


Would this approach complicate hilighting of hits-in-context?  One of
the biggest things missing from most current OPACs in my opinion is
google-style excerpting of WHAT part of the record matched the
query--on
the results page. Many mainstream OPACs do currently provide some form
of hilighting on the detail/full-bib page, but it's not generally
truly
identifying _which_ parts of the record _actually_ matched your search
(a search just on title will still hilight the word found in a non-
title
field), which I find annoying.

Do these kind of hybrid approaches complicate the task of providing
proper result hilighting in context, or am I off on the wrong
direction?


Highlighting is tricky business all the way around.  XTF seems to be
the best solution I've seen for very detailed context highlighting.
But I suspect ILS systems don't need that level of complexity but
rather could leverage Solr's highlighting capabilities by ensuring
that the specific fields that need highlighting are stored.

Solr's highlighter (which is Lucene's contrib Highlighter under the
covers) does do field-specific highlighting, but it still is not
perfect.  For example, if you searched for title:"Blessed Damozel",
it would highlight "blessed" and "damozel" anywhere in the title
field even though the query is a phrase query where proximity matters.
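(At the Lucene level that boils down to roughly this - a sketch using
the contrib Highlighter, which is what Solr wires up underneath:

  Highlighter highlighter = new Highlighter(new QueryScorer(query));
  String fragment = highlighter.getBestFragment(new StandardAnalyzer(), "title", storedTitle);
  // matching terms come back wrapped in <B>...</B>, with no awareness
  // of the phrase/proximity constraints - hence the caveat above
)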

For a proprietary contract job, I have written code that converts a
general Lucene Query into a SpanQuery and a highlighter that does
precise highlighting.  Beyond this code being proprietary, such that
I cannot share it, it is also not general purpose and does full field
highlighting, not scoring fragments like the Lucene highlighter
does.  The approach to converting to a SpanQuery is a good way to go
though, and has been discussed a bit in the Lucene e-mail list.

In short, I think the hybrid approach is still a good one, separating
the search engine from the actual data repository, but highlighting
requirements need to be considered up front.  Basic and decent field-
specific highlighting can be achieved with Solr, but it's got caveats.

   Erik


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-27 Thread Erik Hatcher

On Nov 27, 2006, at 5:49 PM, Andrew Nagy wrote:

My only concern about lucene is the lack of a standard query language.
I went down the native XML database path because of XQuery and XSL,
does
something like lucene and solr offer a strong query language?  Is it a
standard?  What if someone developed a kick ass text indexer in 2
years
that totally blows lucene out of the water, would you easily be
able to
switch systems?


"What if" games are mostly just guessing games in the high-tech
world.  Agility is the trait our projects need.  Software is just
that... soft.  And malleable.  Sure, we can code ourselves into a
corner, but generally we can code ourselves right back out of it
too.  If software is built with decent separation of concerns, we can
adapt to changes readily.

Specifically to your concern about a standard query language, I
prefer to think of things from a user's perspective.  A user does not
type in XQuery syntax, so at some point the system has to translate a
user-entered expression into something the underlying search engine
understands.  Lucene, and thus Solr, support the syntax already
mentioned here (Google-like syntax), and there is also a contrib
module to Lucene that XMLifies the Lucene query syntax (and then
some), called the xml-query-parser:

   http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/xml-query-parser/

Eventually this will be integrated into Solr.  So, as long as you can
parse a user-entered query into some type of data structure, you can
convert it to either of the Lucene-supported syntaxes (or directly to
a Query object if you're coding in Java), or to whatever pie-in-the-
sky system that comes along in the future.
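(The translation step itself is tiny - a sketch of the Java side:

  Query query = new QueryParser("title", new StandardAnalyzer()).parse("zen AND (training OR practice)");
  // hand the Query to an IndexSearcher, or emit the equivalent
  // expression as Solr request parameters / xml-query-parser XML instead
)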

I'm looking forward to more discussion on this topic, as it is the one
I hear most often as a negative to Lucene around my neck of the
woods, and sadly the standard what-if scenario is used to choose
inferior search engine technologies.

   Erik