Re: ValueListHandler pattern with Lucene
On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
> That's the beauty: it is up to you to load the doc iff you want it.

As I want all of them, I don't see why this should be faster at all...

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: ValueListHandler pattern with Lucene
On Apr 11, 2004, at 5:25 AM, [EMAIL PROTECTED] wrote:
> On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
>> That's the beauty: it is up to you to load the doc iff you want it.
>
> As I want all of them, I don't see why this should be faster at all...

Then have a look at the Hits class. It does extra work for caching and for keeping a most-recently-used collection of documents around. By using a HitCollector you bypass those mechanisms. Whether it is measurably faster depends on several other factors.

	Erik
Re: ValueListHandler pattern with Lucene
On Sunday 11 April 2004 13:40, Erik Hatcher wrote:
> using a HitCollector you are bypassing those mechanisms. Whether it is measurably faster would depend on several other factors.

Well, it is hardly faster, so this is no real solution :-\
Re: ValueListHandler pattern with Lucene
On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
> That's the beauty: it is up to you to load the doc iff you want it.

Well, there's another problem with HitCollector: the list I build is not sorted by score :-(
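For readers following the thread: a collector receives each matching document together with its score, but not in score order, so a list built inside it has to be sorted afterwards. The following is only a minimal sketch of that step. The class names ScoredDoc and ScoreSortingCollector are made up for illustration, and the class deliberately does not extend Lucene's HitCollector so that it runs without Lucene on the classpath; in a real application the collect(int, float) method would be the HitCollector callback.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** One collected hit: a document number plus the score reported for it. */
final class ScoredDoc {
    final int doc;
    final float score;

    ScoredDoc(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

/**
 * Hypothetical collector (the name is an assumption). It is standalone here;
 * with Lucene it would extend HitCollector and override collect(int, float).
 */
final class ScoreSortingCollector {
    private final List<ScoredDoc> collected = new ArrayList<ScoredDoc>();

    /** Called once per matching document, in no particular score order. */
    void collect(int doc, float score) {
        collected.add(new ScoredDoc(doc, score));
    }

    /** Returns a sorted copy, best score first; ties keep collection order. */
    List<ScoredDoc> sortedByScore() {
        List<ScoredDoc> result = new ArrayList<ScoredDoc>(collected);
        // Collections.sort is stable, so equal scores stay in collection order.
        Collections.sort(result, new Comparator<ScoredDoc>() {
            public int compare(ScoredDoc a, ScoredDoc b) {
                return Float.compare(b.score, a.score); // descending by score
            }
        });
        return result;
    }
}
```

Sorting every collected hit costs memory and time proportional to the number of matches, which is part of why this route may not beat Hits when all documents are wanted anyway.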
Re: ValueListHandler pattern with Lucene
On Apr 11, 2004, at 9:32 AM, [EMAIL PROTECTED] wrote:
> On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
>> That's the beauty: it is up to you to load the doc iff you want it.
>
> Well, there's another problem with HitCollector: the list I build is not sorted by score :-(

HitCollector was just an option - and apparently not the right one for your use.

	Erik
Re: ValueListHandler pattern with Lucene
On Apr 11, 2004, at 10:00 AM, [EMAIL PROTECTED] wrote:
> On Sunday 11 April 2004 15:56, Erik Hatcher wrote:
>> HitCollector was just an option - and apparently not the right one for your use.
>
> So, any other option? :-)

Well, yes, the one we already discussed. Let your presentation tier talk directly to Hits, so you are as efficient as possible with access to documents and only fetch what you need. Again, don't let patterns get in your way.

	Erik
Re: ValueListHandler pattern with Lucene
On Sunday 11 April 2004 17:16, Erik Hatcher wrote:
> Well, yes, the one we already discussed. Let your presentation tier talk directly to Hits, so you are as efficient as possible with access to documents and only fetch what you need. Again, don't let patterns get in your way.

Well, the point of tiers and (BTW: language-independent) patterns is to modularize software and make things exchangeable. This way neither the presentation tier nor the search engine is exchangeable.

The actual problem is that VLH is designed to hold a static list of VOs. VLH needs to evolve to support something like a data provider that may add data dynamically. The problem so far is that an Iterator must throw a ConcurrentModificationException if the backing data is modified, but since data in a VLH is never removed, only added, this should be possible to implement.

Timo
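To make the idea of a value list fed by a data provider more concrete, here is a rough sketch under stated assumptions. The names DataProvider and LazyValueList are invented for this illustration and belong to neither the ValueListHandler pattern nor Lucene. The list is append-only and is read by index rather than through an Iterator, which is one way to sidestep the ConcurrentModificationException concern; a provider could wrap Hits so that a document is only loaded when its position is first requested.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical source of values, e.g. a thin wrapper around search results. */
interface DataProvider<T> {
    /** Total number of values available. */
    int size();

    /** Loads the value at the given position (possibly expensive). */
    T load(int index);
}

/** Append-only value list that fetches values from the provider on demand. */
final class LazyValueList<T> {
    private final DataProvider<T> provider;
    private final List<T> loaded = new ArrayList<T>();

    LazyValueList(DataProvider<T> provider) {
        this.provider = provider;
    }

    /** Total number of values, loaded or not. */
    int size() {
        return provider.size();
    }

    /** Number of values fetched from the provider so far. */
    int loadedCount() {
        return loaded.size();
    }

    /** Returns the value at index, loading every value up to it first. */
    T get(int index) {
        if (index < 0 || index >= provider.size()) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        // Values are only ever appended, never removed or replaced.
        while (loaded.size() <= index) {
            loaded.add(provider.load(loaded.size()));
        }
        return loaded.get(index);
    }

    /** Returns one page; nothing beyond the end of the page is loaded. */
    List<T> page(int from, int pageSize) {
        int to = Math.min(from + pageSize, provider.size());
        List<T> result = new ArrayList<T>();
        for (int i = from; i < to; i++) {
            result.add(get(i));
        }
        return result;
    }
}
```

The sketch loads values contiguously, so jumping straight to a late page loads everything before it. That keeps the "only added, never removed" property simple, at the cost of extra loads for random access.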
Re: ValueListHandler pattern with Lucene
On Apr 11, 2004, at 11:28 AM, [EMAIL PROTECTED] wrote:
> On Sunday 11 April 2004 17:16, Erik Hatcher wrote:
>> Well, yes, the one we already discussed. Let your presentation tier talk directly to Hits, so you are as efficient as possible with access to documents and only fetch what you need. Again, don't let patterns get in your way.
>
> Well, the point of tiers and (BTW: language-independent) patterns is to modularize software and make things exchangeable. This way neither the presentation tier nor the search engine is exchangeable.
>
> The actual problem is that VLH is designed to hold a static list of VOs. VLH needs to evolve to support something like a data provider that may add data dynamically. The problem so far is that an Iterator must throw a ConcurrentModificationException if the backing data is modified, but since data in a VLH is never removed, only added, this should be possible to implement.

In other words, you need to invent your own pattern here?! :)

The benefit of agility is knowing that any decision you make now does not prohibit you from changing later. Do you really think you're going to plug-and-play with search engines? Or will you be sticking with Lucene for the foreseeable future? Are you trying to plan for a future without Lucene when there is no use case for doing so? If you code with coupling to Lucene, do you see that as making life harder in the future, or are you smart enough and flexible enough to change your software as times change?

Throw your patterns away when they don't solve the problem. Be pragmatic _and_ agile.

	Erik
Stemming options
Has anyone on the list implemented a dictionary-based English stemmer with Lucene? Perhaps based on the freely available ispell dictionaries or something like that? The Porter and Snowball stemmers have not worked that well for our application, but it is a bit daunting to start from scratch in developing an alternate stemmer.

Alternatively, is there an algorithmic stemmer that anyone has used which is a little less aggressive than the Porter algorithm? We've been having problems with searches for "conversion" returning "converse" and "conversational", and "animal" returning "animate". Yes, these are morphologically related, but in our particular application it would be better to stick with removing simple inflections.

Thanks for any pointers -- Boris
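As an illustration of what "removing simple inflections" could look like, here is a small sketch loosely modeled on the plural-stripping rules usually attributed to Harman's S-stemmer. The exact rule set, the minimum word length, and the class name SimpleInflectionStemmer are assumptions made for this example, not something confirmed in the thread, and the class is not wired into a Lucene analyzer. It only touches "-ies", "-es" and "-s" endings and leaves "-ed" and "-ing" forms alone.

```java
import java.util.Locale;

/** Hypothetical light stemmer: strips plural-style "s" endings only. */
final class SimpleInflectionStemmer {
    private SimpleInflectionStemmer() {
    }

    /** Returns the lower-cased word with a simple "s" inflection removed. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ENGLISH);
        if (w.length() < 4) {
            return w; // assumption: very short words are left alone
        }
        if (w.endsWith("ies")) {
            if (w.endsWith("eies") || w.endsWith("aies")) {
                return w;
            }
            return w.substring(0, w.length() - 3) + "y"; // ponies -> pony
        }
        if (w.endsWith("es")) {
            if (w.endsWith("aes") || w.endsWith("ees") || w.endsWith("oes")) {
                return w;
            }
            return w.substring(0, w.length() - 1); // horses -> horse
        }
        if (w.endsWith("s")) {
            if (w.endsWith("us") || w.endsWith("ss")) {
                return w;
            }
            return w.substring(0, w.length() - 1); // animals -> animal
        }
        return w;
    }
}
```

With rules this conservative, "conversion", "converse" and "conversational" stay distinct, as do "animal" and "animate", while "conversions" and "animals" still fold onto their singular forms.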
Carrot 2 (was: Re: clustering results)
Carrot (2): http://www.cs.put.poznan.pl/dweiss/carrot/xml/index.xml?lang=en

Otis

--- [EMAIL PROTECTED] wrote:
> I got all excited reading the subject line "clustering results", but this isn't really clustering, is it? This is more like sorting. Does anyone know of any work within Lucene (or another indexer) to do actual subject clustering (i.e. like Vivisimo @ http://vivisimo.com/ or Kartoo @ http://www.kartoo.com/)? It would be pretty awesome if Lucene had such an ability. I know there aren't a whole lot of clustering options, and the commercial products are very expensive. Anyhow, just curious.
>
> A brief definition of clustering: automatically organizing search or database query results into meaningful hierarchical folders ... transforming long lists of search results into categorized information without any clumsy pre-processing of the source documents.
>
> I'm not sure how it would be done...? Based off of top term frequencies for a document?
>
> -K
>
> Quoting Michael A. Schoen [EMAIL PROTECTED]:
>> So as Venu pointed out, sorting doesn't seem to help the problem. If we have to walk the result set, access docs and dedupe using brute force, we're better off w/ the standard order by relevance.
>>
>> If you've got an example of this type of clustering done in a more efficient way, that'd be great.
>>
>> Any other ideas?
>>
>> ----- Original Message -----
>> From: Erik Hatcher [EMAIL PROTECTED]
>> To: Lucene Users List [EMAIL PROTECTED]
>> Sent: Saturday, April 10, 2004 12:35 AM
>> Subject: Re: clustering results
>>
>>> On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
>>>> I have an index of URLs, and need to display the top 10 results for a given query, but want to display only 1 result per domain. It seems that using either Hits or a HitCollector, I'll need to access the doc, grab the domain field (I'll have it parsed ahead of time) and only take/display documents that are unique.
>>>>
>>>> A significant percentage of the time I expect I may have to access thousands of results before I find 10 in unique domains. Is there a faster approach that won't require accessing thousands of documents?
>>>
>>> I have examples of this that I can post when I have more time, but a quick pointer... check out the overloaded IndexSearcher.search() methods which accept a Sort. You can do really, really interesting slicing and dicing, I think, using it. Try this one on for size:
>>>
>>>     example.displayHits(allBooks,
>>>         new Sort(new SortField[]{
>>>           new SortField("category"),
>>>           SortField.FIELD_SCORE,
>>>           new SortField("pubmonth", SortField.INT, true)
>>>         }));
>>>
>>> Be clever indexing the piece you want to group on - I think you may find this the solution you're looking for.
>>>
>>>	Erik
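For reference, the brute-force walk described in the quoted question (go through the results in rank order and keep the first hit from each domain until ten are found) can be sketched as below. This is only the baseline being discussed, not a faster alternative. DomainHit and OnePerDomain are names invented for the example, and the sketch assumes the domain of each hit is already available without loading the full document.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical view of one search result: its URL and pre-parsed domain. */
final class DomainHit {
    final String url;
    final String domain;

    DomainHit(String url, String domain) {
        this.url = url;
        this.domain = domain;
    }
}

/** Keeps the best-ranked hit per domain (names are assumptions). */
final class OnePerDomain {
    private OnePerDomain() {
    }

    /**
     * Walks hits in the order given (assumed best first) and returns at most
     * limit hits, never more than one per domain.
     */
    static List<DomainHit> topUnique(Iterable<DomainHit> rankedHits, int limit) {
        Set<String> seenDomains = new HashSet<String>();
        List<DomainHit> result = new ArrayList<DomainHit>();
        for (DomainHit hit : rankedHits) {
            if (result.size() >= limit) {
                break; // stop walking as soon as enough unique domains are found
            }
            if (seenDomains.add(hit.domain)) { // add() is false for a repeat
                result.add(hit);
            }
        }
        return result;
    }
}
```

Because the method accepts any Iterable, the hits can be produced lazily, so the walk stops touching results once the limit is reached. In the worst case it still visits thousands of hits, which is exactly the cost the question is about.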
Re: clustering results
Hi (danger: shameless advertising below),

our partner, Brox It Solutions, is using our - XtraMind Technologies GmbH - clustering to implement meta-search clustering of search results à la Vivisimo. Check out: http://www.anyfinder.de/

The clustering is done on the snippets coming from search engines, but the original version that we still use in our own products is based on modified Lucene indexes, as these can efficiently handle lots of information on texts and terms.

Our clustering engine does not only cluster search results; it also performs trend recognition for competitive intelligence and similar tasks, but not too many people require such specialized features. Brox' price models for this engine may be interesting for those who find other products too expensive; it also works with all existing search engines, not only Lucene.

--
Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab
XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com

----- Original Message -----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 11, 2004 19:03
To: Lucene Users List
Subject: Re: clustering results

I got all excited reading the subject line "clustering results", but this isn't really clustering, is it? This is more like sorting. Does anyone know of any work within Lucene (or another indexer) to do actual subject clustering (i.e. like Vivisimo @ http://vivisimo.com/ or Kartoo @ http://www.kartoo.com/)? It would be pretty awesome if Lucene had such an ability. I know there aren't a whole lot of clustering options, and the commercial products are very expensive. Anyhow, just curious.

A brief definition of clustering: automatically organizing search or database query results into meaningful hierarchical folders ... transforming long lists of search results into categorized information without any clumsy pre-processing of the source documents.

I'm not sure how it would be done...? Based off of top term frequencies for a document?

-K

Quoting Michael A. Schoen [EMAIL PROTECTED]:
> So as Venu pointed out, sorting doesn't seem to help the problem. If we have to walk the result set, access docs and dedupe using brute force, we're better off w/ the standard order by relevance.
>
> If you've got an example of this type of clustering done in a more efficient way, that'd be great.
>
> Any other ideas?
>
> ----- Original Message -----
> From: Erik Hatcher [EMAIL PROTECTED]
> To: Lucene Users List [EMAIL PROTECTED]
> Sent: Saturday, April 10, 2004 12:35 AM
> Subject: Re: clustering results
>
>> On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
>>> I have an index of URLs, and need to display the top 10 results for a given query, but want to display only 1 result per domain. It seems that using either Hits or a HitCollector, I'll need to access the doc, grab the domain field (I'll have it parsed ahead of time) and only take/display documents that are unique.
>>>
>>> A significant percentage of the time I expect I may have to access thousands of results before I find 10 in unique domains. Is there a faster approach that won't require accessing thousands of documents?
>>
>> I have examples of this that I can post when I have more time, but a quick pointer... check out the overloaded IndexSearcher.search() methods which accept a Sort. You can do really, really interesting slicing and dicing, I think, using it. Try this one on for size:
>>
>>     example.displayHits(allBooks,
>>         new Sort(new SortField[]{
>>           new SortField("category"),
>>           SortField.FIELD_SCORE,
>>           new SortField("pubmonth", SortField.INT, true)
>>         }));
>>
>> Be clever indexing the piece you want to group on - I think you may find this the solution you're looking for.
>>
>>	Erik