RE: clustering results
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: April 11, 2004 1:03 PM
To: Lucene Users List
Subject: Re: clustering results

> I got all excited reading the subject line "clustering results", but this isn't really clustering, is it? This is more like sorting.
>
> Does anyone know of any work within Lucene (or another indexer) to do actual subject clustering (e.g. like Vivisimo @ http://vivisimo.com/ or Kartoo @ http://www.kartoo.com/)? It would be pretty awesome if Lucene had such an ability. I know there aren't a whole lot of clustering options, and the commercial products are very expensive. Anyhow, just curious.

The one I know about is Carrot - http://www.cs.put.poznan.pl/dweiss/carrot/

Regards,

Bruce Ritchie
http://www.jivesoftware.com/
Carrot 2 (was: Re: clustering results)
Carrot (2): http://www.cs.put.poznan.pl/dweiss/carrot/xml/index.xml?lang=en

Otis

--- [EMAIL PROTECTED] wrote:
> I got all excited reading the subject line "clustering results", but this isn't really clustering, is it? This is more like sorting.
>
> Does anyone know of any work within Lucene (or another indexer) to do actual subject clustering (e.g. like Vivisimo @ http://vivisimo.com/ or Kartoo @ http://www.kartoo.com/)? It would be pretty awesome if Lucene had such an ability. I know there aren't a whole lot of clustering options, and the commercial products are very expensive. Anyhow, just curious.
>
> A brief definition of clustering: automatically organizing search or database query results into meaningful hierarchical folders ... transforming long lists of search results into categorized information without any clumsy pre-processing of the source documents.
>
> I'm not sure how it would be done...? Based on the top term frequencies for a document?
>
> -K
Re: clustering results
On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
> I have an index of URLs, and need to display the top 10 results for a given query, but want to display only 1 result per domain. It seems that using either Hits or a HitCollector, I'll need to access the doc, grab the domain field (I'll have it parsed ahead of time), and only take/display documents that are unique. A significant percentage of the time, I expect I may have to access thousands of results before I find 10 in unique domains.
>
> Is there a faster approach that won't require accessing thousands of documents?

I have examples of this that I can post when I have more time, but a quick pointer: check out the overloaded IndexSearcher.search() methods which accept a Sort. You can do really interesting slicing and dicing, I think, using it. Try this one on for size:

    example.displayHits(allBooks,
        new Sort(new SortField[]{
            new SortField("category"),
            SortField.FIELD_SCORE,
            new SortField("pubmonth", SortField.INT, true)
        }));

Be clever indexing the piece you want to group on - I think you may find this the solution you're looking for.

    Erik
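The ordering that this Sort asks for can be mimicked on an in-memory list. The sketch below is plain Java, not Lucene: the Book class, the sample data, and the YYYYMM encoding of pubmonth are assumptions made purely for illustration. It orders by category ascending, then by score with the best score first (which is how SortField.FIELD_SCORE ranks), then by pubmonth descending (the reverse flag set to true).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortOrderSketch {
    // Hypothetical stand-in for an indexed book; this is not a Lucene class.
    static final class Book {
        final String title;
        final String category;
        final float score;   // relevance score assigned by a search
        final int pubmonth;  // assumed to be encoded as YYYYMM

        Book(String title, String category, float score, int pubmonth) {
            this.title = title;
            this.category = category;
            this.score = score;
            this.pubmonth = pubmonth;
        }
    }

    // category ascending, then score descending, then pubmonth descending
    static final Comparator<Book> ORDER = (a, b) -> {
        int c = a.category.compareTo(b.category);
        if (c != 0) return c;
        c = Float.compare(b.score, a.score);            // higher score first
        if (c != 0) return c;
        return Integer.compare(b.pubmonth, a.pubmonth); // newer month first
    };

    // Made-up sample data.
    static List<Book> sample() {
        return Arrays.asList(
            new Book("A", "tech", 0.5f, 200403),
            new Book("B", "health", 0.9f, 200401),
            new Book("C", "tech", 0.9f, 200312),
            new Book("D", "tech", 0.9f, 200402));
    }

    // Returns the titles in sorted order without modifying the input list.
    static List<String> sortedTitles(List<Book> books) {
        List<Book> copy = new ArrayList<Book>(books);
        copy.sort(ORDER);
        List<String> titles = new ArrayList<String>();
        for (Book b : copy) {
            titles.add(b.title);
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(sortedTitles(sample())); // prints [B, D, C, A]
    }
}
```

Four books go in and four come out: the sort groups each category together, but it drops nothing.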
Re: clustering results
Erik,

Thanks for the pointer. I am not sure how sort can filter out results. Sort will just sort the results, right?

Let's say I had the results below:

http://www.b.com/1.html
http://www.a.com/1.html
http://www.b.com/2.html
http://www.a.com/2.html

If you sort by domain name, the results might be:

http://www.a.com/1.html
http://www.a.com/2.html
http://www.b.com/1.html
http://www.b.com/2.html

What I want is one result per domain. No sorting, just filtering out some results:

http://www.b.com/1.html
http://www.a.com/1.html

Can this still be achieved using sort? If not, are there any other ways of doing this?

Thanks.
Re: clustering results
On Apr 10, 2004, at 9:47 AM, Venu Durgam wrote:
> I am not sure how sort can filter out results. Sort will just sort the results, right?

Right, no filtering using Sort.

> Let's say I had the results below:
>
> http://www.b.com/1.html
> http://www.a.com/1.html
> http://www.b.com/2.html
> http://www.a.com/2.html
>
> If you sort by domain name, the results might be:
>
> http://www.a.com/1.html
> http://www.a.com/2.html
> http://www.b.com/1.html
> http://www.b.com/2.html
>
> What I want is one result per domain. No sorting, just filtering out some results:
>
> http://www.b.com/1.html
> http://www.a.com/1.html
>
> Can this still be achieved using sort? If not, are there any other ways of doing this?

Not that I know of, directly. The brute-force way of sorting and then walking the results yourself to collect things in the way you want is the only method I can think of at the moment.

    Erik
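The brute-force walk described above can be sketched roughly as follows. This is plain Java over a list of URL strings rather than Lucene Hits; the class and method names are hypothetical. It assumes the list is already in the order you want to keep (for example, best hit first), and it treats the URL's host as the "domain".

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OnePerDomain {
    // Walks the URLs in the order given and keeps the first URL seen for each
    // host, stopping as soon as `limit` results have been collected.
    // URLs without a parseable host are skipped.
    static List<String> firstPerDomain(List<String> rankedUrls, int limit) {
        Set<String> seenHosts = new HashSet<String>();
        List<String> kept = new ArrayList<String>();
        for (String url : rankedUrls) {
            if (kept.size() >= limit) {
                break; // enough unique domains; no need to look further
            }
            String host = URI.create(url).getHost();
            if (host != null && seenHosts.add(host)) {
                kept.add(url); // first hit for this host
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList(
            "http://www.b.com/1.html",
            "http://www.a.com/1.html",
            "http://www.b.com/2.html",
            "http://www.a.com/2.html");
        // prints [http://www.b.com/1.html, http://www.a.com/1.html]
        System.out.println(firstPerDomain(ranked, 10));
    }
}
```

The walk stops early once the limit is reached, but in the worst case it still touches every result, which is the cost the original question was hoping to avoid.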
Re: clustering results
So as Venu pointed out, sorting doesn't seem to help the problem. If we have to walk the result set, access docs, and dedupe using brute force, we're better off with the standard order by relevance.

If you've got an example of this type of clustering done in a more efficient way, that'd be great. Any other ideas?

----- Original Message -----
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Saturday, April 10, 2004 12:35 AM
Subject: Re: clustering results

> On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
> > I have an index of URLs, and need to display the top 10 results for a given query, but want to display only 1 result per domain. It seems that using either Hits or a HitCollector, I'll need to access the doc, grab the domain field (I'll have it parsed ahead of time), and only take/display documents that are unique. A significant percentage of the time, I expect I may have to access thousands of results before I find 10 in unique domains.
> >
> > Is there a faster approach that won't require accessing thousands of documents?
>
> I have examples of this that I can post when I have more time, but a quick pointer: check out the overloaded IndexSearcher.search() methods which accept a Sort. You can do really interesting slicing and dicing, I think, using it. Try this one on for size:
>
>     example.displayHits(allBooks,
>         new Sort(new SortField[]{
>             new SortField("category"),
>             SortField.FIELD_SCORE,
>             new SortField("pubmonth", SortField.INT, true)
>         }));
>
> Be clever indexing the piece you want to group on - I think you may find this the solution you're looking for.
>
>     Erik