Re: Lucene in the Humanities

2005-02-18 Thread Luke Shannon
Nice work, Erik. I would like to spend more time playing with it, but I
already saw a few things I really liked. When a specific query turns up no
results, you prompt the user to perform a free-form search; less savvy
search users will benefit from this strategy. I also like the display of
information when you select a result: everything is at your fingertips
without clutter.
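
I imagine the fallback amounts to something like this under the hood (just
my guess - the index path and the "name"/"contents" field names here are
made up, and I'm assuming the current Lucene 1.4 style API):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FallbackSearch {
    // If the exact name search finds nothing, retry the same input as a
    // free-form query against a catch-all field.
    public static Hits search(String name) throws Exception {
        IndexSearcher searcher = new IndexSearcher("index");  // made-up path
        Hits hits = searcher.search(new TermQuery(new Term("name", name)));
        if (hits.length() == 0) {
            Query freeForm =
                QueryParser.parse(name, "contents", new StandardAnalyzer());
            hits = searcher.search(freeForm);
        }
        return hits;
    }
}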

I did get this error when a name search failed to turn up results and I
clicked 'help' in the free-form search row (the second row).

Here is my browser info:

Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.5) Gecko/20041107
Firefox/1.0

Below are the details from the error:

 Page 'help-freeform.html' not found in application namespace.

 Stack Trace:
  org.apache.tapestry.resolver.PageSpecificationResolver.resolve(PageSpecificationResolver.java:120)
  org.apache.tapestry.pageload.PageSource.getPage(PageSource.java:144)
  org.apache.tapestry.engine.RequestCycle.getPage(RequestCycle.java:195)
  org.apache.tapestry.engine.PageService.service(PageService.java:73)
  org.apache.tapestry.engine.AbstractEngine.service(AbstractEngine.java:872)
  org.apache.tapestry.ApplicationServlet.doService(ApplicationServlet.java:197)
  org.apache.tapestry.ApplicationServlet.doGet(ApplicationServlet.java:158)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:740)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
  org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
  org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
  org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:256)
  org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
  org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
  org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
  org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
  org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
  org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
  org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
  org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2422)
  org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
  org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
  org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:171)
  org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
  org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:163)
  org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
  org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
  org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
  org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
  org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
  org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
  org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
  org.apache.ajp.tomcat4.Ajp13Processor.process(Ajp13Processor.java:457)
  org.apache.ajp.tomcat4.Ajp13Processor.run(Ajp13Processor.java:576)
  java.lang.Thread.run(Thread.java:534)

Luke

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene User" 
Sent: Friday, February 18, 2005 2:46 PM
Subject: Lucene in the Humanities



Re: Lucene in the Humanities

2005-02-18 Thread Praveen Peddi
Good work, Erik (even though the UI could be made prettier). We use Lucene,
so I have some knowledge of it. I could see the features you are using with
Lucene (such as paging, highlighting, and different kinds of phrases).
Overall, good stuff.

Praveen
- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene User" 
Sent: Friday, February 18, 2005 2:46 PM
Subject: Lucene in the Humanities



Lucene in the Humanities

2005-02-18 Thread Erik Hatcher
It's about time I actually did something real with Lucene  :)
I have been working with the Applied Research in Patacriticism group at 
the University of Virginia for a few months and am finally ready to 
present what I've been doing.  The primary focus of my group is working 
with the Rossetti Archive - poems, artwork, interpretations, 
collections, and so on of Dante Gabriel Rossetti.  I was initially 
brought on to build a collection and exhibit system, though I got 
detoured a bit as I got involved in applying Lucene to the archive to 
replace their existing search system.  The existing system used an old 
version of Tamino with XPath queries.  Tamino is not at fault here, at 
least not entirely, because our data is in a very complicated set of 
XML files with a lot of non-normalized and legacy metadata - getting at 
things via XPath is challenging and practically impossible in many 
cases.

My work is now presentable at
http://www.rossettiarchive.org/rose
(rose is for ROssetti SEarch)
This system is implicitly designed for academics who are delving into 
Rossetti's work, so it may not be all that interesting for most of you. 
 Have fun and send me any interesting things you discover, especially 
any issues you may encounter.

Here are some numbers to give you a sense of what is going on 
underneath... There are currently 4,983 XML files, totaling about 110MB. 
Without getting into a lot of details of the confusing domain, there 
are basically 3 types of XML files (works, pictures, and transcripts). 
It is important that there be case-sensitive and case-insensitive 
searches.  To accomplish that, a custom analyzer is used in two 
different modes, one applying a LowerCaseFilter and one not, with the 
same documents written to two different indexes.  There is one 
particular type of XML file that gets indexed as two different types of 
documents (a specialized summary/header type).  In this first set of 
indexes, it is basically a one-to-one mapping of XML file to Lucene 
Document (with one type being indexed twice in different ways) - all 
said, there are 5,539 documents in each of the two main indexes.  The 
transcript type gets sliced into another set of original-case and 
lowercased indexes, with each document in that index representing a 
document division (a <div> element in the XML).  There are 12,326 
documents in each of these <div>-level indexes.  All said, the 4 
indexes built total about 3GB in size - I'm storing several fields in 
order to hit-highlight.  Only one of these indexes is hit at a 
time - which index is used depends on the parameters you pass when 
querying.
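
A rough sketch of such a two-mode analyzer (not our actual analyzer, 
which does more - the stock StandardTokenizer chain here just stands in 
for it) looks like this:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class CaseModeAnalyzer extends Analyzer {
    private boolean lowercase;

    public CaseModeAnalyzer(boolean lowercase) {
        this.lowercase = lowercase;
    }

    // Same tokenization either way; the LowerCaseFilter is applied only
    // when building the case-insensitive index.
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardFilter(new StandardTokenizer(reader));
        if (lowercase) {
            stream = new LowerCaseFilter(stream);
        }
        return stream;
    }
}

Each document then gets written twice: once through an IndexWriter opened 
with the analyzer in original-case mode, and once through a writer opened 
with it in lowercasing mode.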

Lucene brought search times into a usable state - impressive to the 
scholars, even.  The previous search solution often timed the browser 
out!  Search results now come back in milliseconds.

The amount of data is tiny compared to most usages of Lucene, but 
things are getting interesting in other ways.  There has been little 
tuning in terms of ranking quality so far, but that is the next area of 
work.  There is one document type that is more important than the 
others, and it is being boosted during indexing.  There is now a 
growing interest in tinkering with all the knobs and dials that are 
now possible.  Adding similar-document and more-like-this features is 
desired and will be relatively straightforward to implement.  I'm 
currently using a catch-all aggregate field as the default field for 
QueryParser searching, though expanding queries across multiple fields 
would be desirable instead.  So I've got my homework to do, catching up 
on all the goodness that has been mentioned on this list recently 
regarding these techniques.
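
For the curious, the catch-all approach amounts to something like the 
sketch below (the field names, the "work" type, and the boost value are 
illustrative only, not our actual schema):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class CatchAllExample {
    // Build a Document with a catch-all "contents" field and an
    // index-time boost for the more important document type.
    public static Document makeDocument(String title, String text,
                                        String doctype) {
        Document doc = new Document();
        doc.add(Field.Text("title", title));
        doc.add(Field.Text("text", text));
        doc.add(Field.Keyword("doctype", doctype));
        // catch-all field: indexed but not stored, used as the default
        doc.add(Field.UnStored("contents", title + " " + text));
        if ("work".equals(doctype)) {
            doc.setBoost(1.5f);  // boost the more important type
        }
        return doc;
    }

    // QueryParser falls back to the catch-all field unless the user
    // names a specific field in the query.
    public static Query parse(String userQuery) throws Exception {
        return QueryParser.parse(userQuery, "contents",
                                 new StandardAnalyzer());
    }
}

The multi-field alternative would be along the lines of 
MultiFieldQueryParser, expanding the user's query across the individual 
fields rather than relying on the aggregate field.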

An area where I'd like to solicit more help from the community relates 
to something akin to personalization.  The scholars would like to be 
able to tune results based on the role (such as "art historian") of the 
person searching the site.  This would involve some type of training or 
continual-learning process, with searchers implicitly feeding back 
preferences for their queries by visiting the documents that interest 
them.  Now that the scholars have seen what is possible (I showed them 
the cool SearchMorph comparison page searching Wikipedia for 
"rossetti"), they want more and more!
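
One naive starting point I can imagine (purely a sketch - the "doctype" 
field and the per-role weights are hypothetical, and the 
continual-learning part is the real work) would be to wrap each query 
with an optional, boosted clause for the document types a given role 
tends to visit:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class RoleBias {
    // Wrap the user's query and add an optional, boosted clause for a
    // document type this role has tended to visit.  The weight could be
    // adjusted over time as click-through data accumulates.
    public static Query bias(Query userQuery, String preferredType,
                             float weight) {
        TermQuery preference = new TermQuery(new Term("doctype", preferredType));
        preference.setBoost(weight);
        BooleanQuery biased = new BooleanQuery();
        biased.add(userQuery, true, false);    // required: the user's query
        biased.add(preference, false, false);  // optional: nudges ranking
        return biased;
    }
}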

So - here's where I'm soliciting feedback - who's doing these types of 
things in the realm of the Humanities?  Where should we go from here in 
terms of researching and applying the types of features dreamed about 
here?  How would you recommend implementing these types of features?

I'd be happy to share more about what I've done under the covers.  As 
you may be able to tell, the web UI is Tapestry for the search and 
results pages (though you won't be able to tell from the URLs you'll 
see :).  The UI was designed primarily by one of our very 
graphics/CSS-savvy postdoc research associates, and was designed with 
the research scholar in mind.  I continue to