Re: Lucene in the Humanities
Nice work, Erik. I would like to spend more time playing with it, but I already saw a few things I really liked. When a specific query turns up no results, you prompt the client to perform a free-form search; less savvy search users will benefit from this strategy. I also like the display of information when you select a result: everything is at your fingertips, without clutter.

I did get this error when a name search failed to turn up results and I clicked 'help' in the free-form search row (the second row). Here is my browser info:

Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Below are the details from the error:

Page 'help-freeform.html' not found in application namespace.

Stack Trace:
org.apache.tapestry.resolver.PageSpecificationResolver.resolve(PageSpecificationResolver.java:120)
org.apache.tapestry.pageload.PageSource.getPage(PageSource.java:144)
org.apache.tapestry.engine.RequestCycle.getPage(RequestCycle.java:195)
org.apache.tapestry.engine.PageService.service(PageService.java:73)
org.apache.tapestry.engine.AbstractEngine.service(AbstractEngine.java:872)
org.apache.tapestry.ApplicationServlet.doService(ApplicationServlet.java:197)
org.apache.tapestry.ApplicationServlet.doGet(ApplicationServlet.java:158)
javax.servlet.http.HttpServlet.service(HttpServlet.java:740)
javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:256)
org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2422)
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:171)
org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:163)
org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
org.apache.ajp.tomcat4.Ajp13Processor.process(Ajp13Processor.java:457)
org.apache.ajp.tomcat4.Ajp13Processor.run(Ajp13Processor.java:576)
java.lang.Thread.run(Thread.java:534)

Luke

- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene User"
Sent: Friday, February 18, 2005 2:46 PM
Subject: Lucene in the Humanities
Re: Lucene in the Humanities
Good work, Erik (even though the UI could be made prettier). We use Lucene, so I have some knowledge of it. I could see the features you are using with Lucene (like paging, highlighting, and different kinds of phrase queries). All in all, good stuff.

Praveen

- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene User"
Sent: Friday, February 18, 2005 2:46 PM
Subject: Lucene in the Humanities
Lucene in the Humanities
It's about time I actually did something real with Lucene :)

I have been working with the Applied Research in Patacriticism group at the University of Virginia for a few months and am finally ready to present what I've been doing. The primary focus of my group is working with the Rossetti Archive - poems, artwork, interpretations, collections, and so on of Dante Gabriel Rossetti. I was initially brought on to build a collection and exhibit system, though I got detoured a bit as I got involved in applying Lucene to the archive to replace their existing search system. The existing system used an old version of Tamino with XPath queries. Tamino is not at fault here, at least not entirely, because our data is in a very complicated set of XML files with a lot of non-normalized and legacy metadata - getting at things via XPath is challenging and practically impossible in many cases.

My work is now presentable at http://www.rossettiarchive.org/rose (rose is for ROssetti SEarch). This system is implicitly designed for academics who are delving into Rossetti's work, so it may not be all that interesting for most of you. Have fun and send me any interesting things you discover, especially any issues you may encounter.

Here are some numbers to give you a sense of what is going on underneath... There are currently 4,983 XML files, totaling about 110MB. Without getting into a lot of details of the confusing domain, there are basically three types of XML files (works, pictures, and transcripts). It is important that there be both case-sensitive and case-insensitive searches. To accomplish that, a custom analyzer is used in two different modes - one applying a LowerCaseFilter and one not - with the same documents written to two different indexes. There is one particular type of XML file that gets indexed as two different types of documents (a specialized summary/header type).
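The dual-index scheme described above can be illustrated with a toy inverted index. This is a simplified stand-in in plain Python, not Lucene's actual analyzer/IndexWriter API; the tokenizer, document set, and function names are all hypothetical:

```python
# Toy illustration of the dual-index scheme: the same documents are
# tokenized twice, once preserving case (case-sensitive index) and once
# lowercased (case-insensitive index), and a query is routed to
# whichever index matches the requested mode.
from collections import defaultdict

def tokenize(text, lowercase):
    # Stand-in for an analyzer: whitespace split, optional lowercasing
    # (playing the role of Lucene's LowerCaseFilter)
    tokens = text.split()
    return [t.lower() for t in tokens] if lowercase else tokens

def build_index(docs, lowercase):
    # Inverted index: token -> set of document ids
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text, lowercase):
            index[token].add(doc_id)
    return index

# Hypothetical documents, written to both indexes
docs = {1: "The Blessed Damozel", 2: "blessed are the meek"}
exact_index = build_index(docs, lowercase=False)
folded_index = build_index(docs, lowercase=True)

def search(term, case_sensitive):
    # Route the query to the matching index, folding the query term
    # only when searching the case-insensitive index
    index = exact_index if case_sensitive else folded_index
    key = term if case_sensitive else term.lower()
    return sorted(index.get(key, set()))
```

Here `search("Blessed", case_sensitive=True)` matches only the first document, while the case-insensitive mode matches both - the query-time mode simply selects which of the two prebuilt indexes is consulted.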
In this first set of indexes, there is basically a one-to-one mapping of XML file to Lucene Document (with one type being indexed twice in different ways) - all said, there are 5,539 documents in each of the two main indexes. The transcript type gets sliced into another set of original-case and lowercased indexes, with each document in that index representing a document division (a element in the XML). There are 12,326 documents in each of these -level indexes. All said, the 4 indexes built total about 3GB in size - I'm storing several fields in order to hit-highlight. Only one of these indexes is hit at a time - which index is used depends on what parameters you use when querying.

Lucene brought the search times into a usable - and, to the scholars, impressive - state. The previous search solution often timed the browser out! Search results now come back in milliseconds. The amount of data is tiny compared to most usages of Lucene, but things are getting interesting in other ways. There has been little tuning of ranking quality so far, but this is the next area of work. There is one document type that is more important than the others, and it is being boosted during indexing. There is now a growing interest in tinkering with all the new knobs and dials that are now possible. Similar-document and more-like-this features are desired and will be relatively straightforward to implement. I'm currently using the catch-all-aggregate-field technique for a default field for QueryParser searching, though expanding queries across multiple fields would be preferable.

So, I've got my homework to do to catch up on all the goodness that has been mentioned on this list recently regarding all of these techniques.

An area where I'd like to solicit more help from the community relates to something akin to personalization. The scholars would like to be able to tune results based on the role (such as "art historian") that is searching the site.
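A minimal sketch of what such role-based tuning could look like, assuming implicit click feedback. The role names, the linear weighting, and the whole mechanism are hypothetical - nothing like this exists in the system described here:

```python
# Speculative sketch: record which documents a user in a given role
# actually visits, and let those accumulated click counts boost that
# role's later rankings. Purely illustrative, not the archive's code.
from collections import defaultdict

clicks = defaultdict(int)  # (role, doc_id) -> number of visits

def record_click(role, doc_id):
    # Implicit feedback: visiting a result counts as a preference signal
    clicks[(role, doc_id)] += 1

def rerank(role, scored_results, weight=0.1):
    # scored_results: list of (doc_id, base_score) pairs from the search
    # engine; each base score is scaled up by that role's click count
    boosted = [(doc_id, score * (1 + weight * clicks[(role, doc_id)]))
               for doc_id, score in scored_results]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)
```

With two recorded visits to a document by "art historian" users and `weight=0.1`, that document's score is scaled by 1.2, which can lift it past a slightly higher-scoring but never-visited neighbor for that role only.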
This would involve some type of training or continual learning process, so that someone searching implicitly feeds back preferences for their queries by visiting the actual documents that are of interest. Now that the scholars have seen what is possible (I showed them the cool SearchMorph comparison page searching Wikipedia for "rossetti"), they want more and more!

So - here's where I'm soliciting feedback - who's doing these types of things in the realm of the Humanities? Where should we go from here in terms of researching and applying the types of features dreamed about here? How would you recommend implementing these types of features? I'd be happy to share more about what I've done under the covers. As you may be able to tell, the web UI is Tapestry for the search and results pages (though you won't be able to tell from the URLs you'll see :). The UI was designed primarily by one of our very graphical/CSS-savvy post-doc research associates, and was designed with the research scholar in mind. I continue to