RE: clustering results

2004-04-12 Thread Bruce Ritchie
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
 Sent: April 11, 2004 1:03 PM
 To: Lucene Users List
 Subject: Re: clustering results
 
 I got all excited reading the subject line "clustering results",
 but this isn't really clustering, is it?  This is more like sorting.
 Does anyone know of any work within Lucene (or another indexer) to
 do actual subject clustering (e.g. like Vivisimo @ http://vivisimo.com/
 or Kartoo @ http://www.kartoo.com/)?  It would be pretty awesome if
 Lucene had such an ability. I know there aren't a whole lot of
 clustering options, and the commercial products are very expensive.
 Anyhow, just curious.

The one I know about is Carrot - http://www.cs.put.poznan.pl/dweiss/carrot/


Regards,

Bruce Ritchie
http://www.jivesoftware.com/




Carrot 2 (was: Re: clustering results)

2004-04-11 Thread Otis Gospodnetic
Carrot (2):

  http://www.cs.put.poznan.pl/dweiss/carrot/xml/index.xml?lang=en

Otis

--- [EMAIL PROTECTED] wrote:
 I got all excited reading the subject line "clustering results", but
 this isn't really clustering, is it?  This is more like sorting.  Does
 anyone know of any work within Lucene (or another indexer) to do actual
 subject clustering (e.g. like Vivisimo @ http://vivisimo.com/ or
 Kartoo @ http://www.kartoo.com/)?  It would be pretty awesome if Lucene
 had such an ability. I know there aren't a whole lot of clustering
 options, and the commercial products are very expensive.
 Anyhow, just curious.
 
 A brief definition of clustering: automatically organizing search or
 database query results into meaningful hierarchical folders ...
 transforming long lists of search results into categorized information
 without any clumsy pre-processing of the source documents.
 
 I'm not sure how it would be done...?  Based on the top term
 frequencies for a document?
 
 -K
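
The term-frequency idea above can be sketched in a few lines of plain Java. This is a deliberately naive illustration, not what Carrot, Vivisimo or Kartoo actually do: each document is filed under its single most frequent term. The class and method names (TopTermGrouping, topTerm, group) are made up for this example, and no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Naive illustration only: files each document under its single most
// frequent term. Hypothetical names; not part of Lucene or Carrot.
class TopTermGrouping {

    // Returns the most frequent lowercase word in the text. Ties go to the
    // alphabetically first word. Returns "" when the text has no words.
    static String topTerm(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        String best = "";
        int bestCount = 0;
        // TreeMap iterates in alphabetical order, so a strict ">" keeps
        // the alphabetically first word among equally frequent ones.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Buckets documents by their top term, keeping input order per bucket.
    static Map<String, List<String>> group(List<String> docs) {
        Map<String, List<String>> folders = new TreeMap<>();
        for (String doc : docs) {
            folders.computeIfAbsent(topTerm(doc), k -> new ArrayList<>()).add(doc);
        }
        return folders;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "lucene index lucene search",
            "jaguar car jaguar dealer",
            "lucene query parser lucene");
        System.out.println(group(docs));
    }
}
```

Without stop-word removal, a word like "the" would win on most real documents, so anything practical would need at least a stop list and some term weighting on top of this.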
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: clustering results

2004-04-10 Thread Erik Hatcher
On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
 I have an index of urls, and need to display the top 10 results for a 
 given query, but want to display only 1 result per domain. It seems 
 that using either Hits or a HitCollector, I'll need to access the doc, 
 grab the domain field (I'll have it parsed ahead of time) and only 
 take/display documents that are unique.

 A significant percentage of the time I expect I may have to access 
 thousands of results before I find 10 in unique domains. Is there a 
 faster approach that won't require accessing thousands of documents?

I have examples of this that I can post when I have more time, but a 
quick pointer... check out the overloaded IndexSearcher.search() 
methods which accept a Sort.  You can do really really interesting 
slicing and dicing, I think, using it.  Try this one on for size:

example.displayHits(allBooks,
    new Sort(new SortField[]{
      new SortField("category"),
      SortField.FIELD_SCORE,
      new SortField("pubmonth", SortField.INT, true)
    }));

Be clever indexing the piece you want to group on - I think you may 
find this the solution you're looking for.
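
To make the ordering concrete, here is a plain-Java sketch of what that three-key Sort appears to express, assuming that sorting by score means highest score first and that the trailing `true` reverses the pubmonth order. BookHit and MultiKeySortSketch are hypothetical stand-ins invented for this example, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MultiKeySortSketch {

    // Hypothetical stand-in for a book hit; not a Lucene class.
    static final class BookHit {
        final String title;
        final String category;
        final float score;
        final int pubmonth; // e.g. 200404 for April 2004

        BookHit(String title, String category, float score, int pubmonth) {
            this.title = title;
            this.category = category;
            this.score = score;
            this.pubmonth = pubmonth;
        }
    }

    // Category ascending, then score descending, then pubmonth descending.
    static int compare(BookHit a, BookHit b) {
        int c = a.category.compareTo(b.category);
        if (c != 0) return c;
        c = Float.compare(b.score, a.score);
        if (c != 0) return c;
        return Integer.compare(b.pubmonth, a.pubmonth);
    }

    // Returns a sorted copy and leaves the input list untouched.
    static List<BookHit> sorted(List<BookHit> hits) {
        List<BookHit> copy = new ArrayList<>(hits);
        copy.sort(MultiKeySortSketch::compare);
        return copy;
    }

    public static void main(String[] args) {
        List<BookHit> hits = Arrays.asList(
            new BookHit("Ant", "/technology", 0.5f, 200301),
            new BookHit("Lucene", "/technology", 0.9f, 200404),
            new BookHit("Tao", "/philosophy", 0.5f, 199910),
            new BookHit("Zen", "/philosophy", 0.5f, 200201));
        for (BookHit h : sorted(hits)) {
            System.out.println(h.category + "  " + h.title);
        }
    }
}
```

Hits that share a category end up adjacent, which is the grouping effect; nothing is dropped from the result set.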

	Erik



Re: clustering results

2004-04-10 Thread Venu Durgam
Erik,

Thanks for the pointer.
I am not sure how sort can filter out results.
Sort will just sort the results, right?

Let's say I had the results below:
http://www.b.com/1.html
http://www.a.com/1.html
http://www.b.com/2.html
http://www.a.com/2.html

If you sort by domain name, the results might be:
http://www.a.com/1.html
http://www.a.com/2.html
http://www.b.com/1.html
http://www.b.com/2.html

I want one result per domain: no sorting, just filtering out some results.
http://www.b.com/1.html
http://www.a.com/1.html

Can this still be achieved using sort? If not, are there any other ways of doing this?

Thanks.
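
The difference between the two operations can be shown on exactly these four URLs. The following is a small plain-Java sketch with no Lucene involved; the class and method names are hypothetical, and the domain is taken from the URL host via java.net.URI.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SortVersusFilter {

    // Extracts the host part, e.g. "www.a.com" from "http://www.a.com/1.html".
    static String host(String url) {
        return URI.create(url).getHost();
    }

    // Sorting: every URL is kept, only the order changes. List.sort is
    // stable, so URLs of one domain keep their original relative order.
    static List<String> sortByDomain(List<String> urls) {
        List<String> copy = new ArrayList<>(urls);
        copy.sort(Comparator.comparing(SortVersusFilter::host));
        return copy;
    }

    // Filtering: the original order is kept, but only the first URL seen
    // for each domain survives.
    static List<String> firstPerDomain(List<String> urls) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (seen.add(host(url))) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://www.b.com/1.html",
            "http://www.a.com/1.html",
            "http://www.b.com/2.html",
            "http://www.a.com/2.html");
        System.out.println(sortByDomain(urls));
        System.out.println(firstPerDomain(urls));
    }
}
```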





Re: clustering results

2004-04-10 Thread Erik Hatcher
On Apr 10, 2004, at 9:47 AM, Venu Durgam wrote:
 I am not sure how sort can filter out results.
 Sort will just sort the results, right?

Right, no filtering using Sort.

 Let's say I had the results below:
 http://www.b.com/1.html
 http://www.a.com/1.html
 http://www.b.com/2.html
 http://www.a.com/2.html

 If you sort by domain name, the results might be:
 http://www.a.com/1.html
 http://www.a.com/2.html
 http://www.b.com/1.html
 http://www.b.com/2.html

 I want one result per domain: no sorting, just filtering out some
 results.
 http://www.b.com/1.html
 http://www.a.com/1.html

 Can this still be achieved using sort? If not, are there any other
 ways of doing this?

Not that I know of, directly.  The brute force way of sorting and then 
walking the results yourself to collect things in the way you want is 
the only method I can think of at the moment.
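
A rough sketch of that brute-force walk in plain Java follows. It assumes the hits arrive already ranked and that each one carries a domain value stored at index time. Hit, Result and topUnique are hypothetical names for this example and do not correspond to Lucene's Hits or HitCollector. The walk stops as soon as enough unique domains have been collected and reports how many hits it had to look at.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class UniqueDomainWalk {

    // Hypothetical hit; the domain is assumed to be stored at index time.
    static final class Hit {
        final String url;
        final String domain;

        Hit(String url, String domain) {
            this.url = url;
            this.domain = domain;
        }
    }

    // Outcome of one walk: the hits kept and how many were examined.
    static final class Result {
        final List<Hit> kept = new ArrayList<>();
        int examined = 0;
    }

    // Walks the ranked hits in order, keeps the first hit of each domain,
    // and stops once `limit` hits are kept or the hits run out.
    static Result topUnique(Iterator<Hit> ranked, int limit) {
        Result result = new Result();
        Set<String> seenDomains = new HashSet<>();
        while (result.kept.size() < limit && ranked.hasNext()) {
            Hit hit = ranked.next();
            result.examined++;
            if (seenDomains.add(hit.domain)) {
                result.kept.add(hit);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Hit> ranked = Arrays.asList(
            new Hit("http://www.b.com/1.html", "www.b.com"),
            new Hit("http://www.b.com/2.html", "www.b.com"),
            new Hit("http://www.b.com/3.html", "www.b.com"),
            new Hit("http://www.a.com/1.html", "www.a.com"),
            new Hit("http://www.a.com/2.html", "www.a.com"),
            new Hit("http://www.c.com/1.html", "www.c.com"));
        Result r = topUnique(ranked.iterator(), 2);
        System.out.println("kept " + r.kept.size() + " after examining " + r.examined);
    }
}
```

The examined count is the cost the original question worries about: when a few domains dominate the top of the ranking, it can be far larger than the number of results displayed.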

	Erik



Re: clustering results

2004-04-10 Thread Michael A. Schoen
So as Venu pointed out, sorting doesn't seem to help the problem. If we have
to walk the result set, access docs and dedupe using brute force, we're
better off w/ the standard order by relevance.

If you've got an example of this type of clustering done in a more efficient
way, that'd be great.

Any other ideas?






