RE: clustering results

2004-04-12 Thread Bruce Ritchie
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
 Sent: April 11, 2004 1:03 PM
 To: Lucene Users List
 Subject: Re: clustering results
 
 I got all excited reading the subject line "clustering 
 results" but this isn't really clustering is it?  This is 
 more sorting.  Does anyone know of any work within Lucene (or 
 another indexer) to do actual subject clustering (i.e. like 
 Vivisimo @ http://vivisimo.com/ or Kartoo @ 
 http://www.kartoo.com/)?  It would be pretty awesome if 
 Lucene had such ability, I know there aren't a whole lot of 
 clustering options, and the commercial products are very expensive.  
 Anyhow, just curious.

The one I know about is Carrot - http://www.cs.put.poznan.pl/dweiss/carrot/


Regards,

Bruce Ritchie
http://www.jivesoftware.com/




Carrot 2 (was: Re: clustering results)

2004-04-11 Thread Otis Gospodnetic
Carrot (2):

  http://www.cs.put.poznan.pl/dweiss/carrot/xml/index.xml?lang=en

Otis

--- [EMAIL PROTECTED] wrote:
 I got all excited reading the subject line "clustering results" but
 this isn't really clustering is it?  This is more sorting.  Does anyone
 know of any work within Lucene (or another indexer) to do actual subject
 clustering (i.e. like Vivisimo @ http://vivisimo.com/ or Kartoo @
 http://www.kartoo.com/)?  It would be pretty awesome if Lucene had such
 ability, I know there aren't a whole lot of clustering options, and the
 commercial products are very expensive.  Anyhow, just curious.
 
 A brief definition of clustering: automatically organizing search or
 database query results into meaningful hierarchical folders ...
 transforming long lists of search results into categorized information
 without any clumsy pre-processing of the source documents.
 
 I'm not sure how it would be done...?  Based off of top Term
 Frequencies for a document?
 
 -K
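
K's guess (grouping by top term frequencies) can be tried in a few lines. What follows is only a naive sketch in plain Java, with no Lucene or Carrot calls: each result snippet is filed under its single most frequent term after dropping a tiny stop-word list. The class name TopTermFolders, the stop-word list, and the "(other)" folder are all assumptions made up for illustration; real clustering engines do considerably more than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only: files each snippet under its most frequent term.
final class TopTermFolders {

    // Tiny, made-up stop-word list; a real system would use a proper one.
    private static final Set<String> STOP = new HashSet<>(
            Arrays.asList("the", "a", "of", "and", "to", "in", "for", "is"));

    // Returns the most frequent non-stop-word term, ties broken alphabetically,
    // or null when the text contains no usable term.
    static String topTerm(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (tok.isEmpty() || STOP.contains(tok)) {
                continue;
            }
            counts.merge(tok, 1, Integer::sum);
        }
        String best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (best == null) {
                best = e.getKey();
                continue;
            }
            int cmp = Integer.compare(e.getValue(), counts.get(best));
            if (cmp > 0 || (cmp == 0 && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
            }
        }
        return best;
    }

    // Groups snippets into "folders" keyed by their top term (sorted by label).
    static Map<String, List<String>> folders(List<String> snippets) {
        Map<String, List<String>> out = new TreeMap<>();
        for (String s : snippets) {
            String label = topTerm(s);
            if (label == null) {
                label = "(other)";
            }
            out.computeIfAbsent(label, k -> new ArrayList<>()).add(s);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(folders(Arrays.asList(
                "Jaguar cars: the jaguar dealer",
                "Jaguar habitat and jaguar diet",
                "Python snake care, python feeding")));
    }
}
```

Both "Jaguar" snippets land in the same folder here, which shows the weakness of raw term frequency: it cannot tell the car from the cat.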
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



AW: clustering results

2004-04-11 Thread Karsten Konrad

Hi (danger: shameless advertising below),

our partner, Brox It Solutions, uses our (XtraMind Technologies GmbH) clustering 
engine to implement meta-search clustering of search results à la Vivisimo. Check out:

http://www.anyfinder.de/

The clustering is done on the snippets coming from search engines, but the original 
version that we still use in our own products is based on modified Lucene indexes, as 
these can efficiently handle lots of information on texts and terms. Our clustering 
engine not only clusters search results but also performs trend recognition for 
competitive intelligence and similar tasks, although not many people require such 
specialized features.

Brox's price models for this engine may be interesting for those who find other 
products too expensive; it also works with all existing search engines, not only 
Lucene. 

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com










Re: clustering results

2004-04-10 Thread Erik Hatcher
On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
 I have an index of urls, and need to display the top 10 results for a 
 given query, but want to display only 1 result per domain. It seems 
 that using either Hits or a HitCollector, I'll need to access the doc, 
 grab the domain field (I'll have it parsed ahead of time) and only 
 take/display documents that are unique.

 A significant percentage of the time I expect I may have to access 
 thousands of results before I find 10 in unique domains. Is there a 
 faster approach that won't require accessing thousands of documents?

I have examples of this that I can post when I have more time, but a 
quick pointer... check out the overloaded IndexSearcher.search() 
methods which accept a Sort.  You can do really really interesting 
slicing and dicing, I think, using it.  Try this one on for size:

    example.displayHits(allBooks,
        new Sort(new SortField[]{
          new SortField("category"),                      // group by category first
          SortField.FIELD_SCORE,                          // then by relevance score
          new SortField("pubmonth", SortField.INT, true)  // then by pubmonth, reversed
        }));

Be clever indexing the piece you want to group on - I think you may 
find this the solution you're looking for.

	Erik
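
To see the kind of ordering such a Sort asks for without an index at hand, here is a minimal plain-Java sketch. It makes no Lucene calls. The Book class, its fields, and the GroupedOrdering name are hypothetical stand-ins for indexed documents, and the comparator assumes that the score key puts the most relevant hit first and that the reversed pubmonth key puts the newest month first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of a three-key ordering; not Lucene code.
final class GroupedOrdering {

    // Stand-in for an indexed document plus its relevance score for one query.
    static final class Book {
        final String title;
        final String category;
        final float score;
        final int pubmonth; // e.g. 200404 for April 2004

        Book(String title, String category, float score, int pubmonth) {
            this.title = title;
            this.category = category;
            this.score = score;
            this.pubmonth = pubmonth;
        }
    }

    // category ascending, then score descending, then pubmonth descending.
    static final Comparator<Book> ORDER = (a, b) -> {
        int c = a.category.compareTo(b.category);
        if (c != 0) {
            return c;
        }
        c = Float.compare(b.score, a.score); // higher score first
        if (c != 0) {
            return c;
        }
        return Integer.compare(b.pubmonth, a.pubmonth); // newer month first
    };

    // Returns the titles in sorted order, leaving the input list untouched.
    static List<String> sortedTitles(List<Book> books) {
        List<Book> copy = new ArrayList<>(books);
        copy.sort(ORDER);
        List<String> titles = new ArrayList<>();
        for (Book b : copy) {
            titles.add(b.title);
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(sortedTitles(Arrays.asList(
                new Book("A", "java", 0.5f, 200401),
                new Book("B", "ant", 0.9f, 200312),
                new Book("C", "java", 0.8f, 200310),
                new Book("D", "java", 0.8f, 200402))));
    }
}
```

Note that this only groups hits that share a category next to each other; it does not drop any of them.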



Re: clustering results

2004-04-10 Thread Venu Durgam
Erik,

Thanks for the pointer.
I am not sure how sort can filter out results.
Sort will just sort the results, right?

Let's say I had the results below:
http://www.b.com/1.html
http://www.a.com/1.html
http://www.b.com/2.html
http://www.a.com/2.html

If you sort by domain name, the results might be:
http://www.a.com/1.html
http://www.a.com/2.html
http://www.b.com/1.html
http://www.b.com/2.html

I want to have one result per domain: no sorting, just filtering out some results.
http://www.b.com/1.html
http://www.a.com/1.html

Can this still be achieved using sort? If not, are there any other ways of doing this?

Thanks.



Re: clustering results

2004-04-10 Thread Erik Hatcher
On Apr 10, 2004, at 9:47 AM, Venu Durgam wrote:
 I am not sure how sort can filter out results.
 sort will just sort the results right ?

Right, no filtering using Sort.

 lets say if i had below results
 http://www.b.com/1.html
 http://www.a.com/1.html
 http://www.b.com/2.html
 http://www.a.com/2.html
 if you sort by domain name, results might be
 http://www.a.com/1.html
 http://www.a.com/2.html
 http://www.b.com/1.html
 http://www.b.com/2.html
 If i want to have one result per domain. no sorting, just filtering 
 out some results.
 http://www.b.com/1.html
 http://www.a.com/1.html

 can this still be achieved using sort? If not any other ways of doing 
 this ?

Not that I know of, directly.  The brute force way of sorting and then 
walking the results yourself to collect things in the way you want is 
the only method I can think of at the moment.

	Erik
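
The brute-force walk described above might look roughly like the following. This is a sketch in plain Java over an already ranked list of URLs, not Lucene code: the DomainDedupe class and firstPerDomain method are hypothetical names, and in a real searcher the loop would run over the hits and read the stored domain field instead of parsing each URL.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: keeps the first (best-ranked)
// URL seen for each host and stops as soon as enough hosts are found.
final class DomainDedupe {

    static List<String> firstPerDomain(Iterable<String> rankedUrls, int wanted) {
        List<String> picked = new ArrayList<>();
        if (wanted <= 0) {
            return picked;
        }
        Set<String> seenHosts = new HashSet<>();
        for (String url : rankedUrls) {
            String host = URI.create(url).getHost();
            // Set.add returns false when the host was already seen.
            if (host != null && seenHosts.add(host.toLowerCase(Locale.ROOT))) {
                picked.add(url);
                if (picked.size() == wanted) {
                    break; // stop walking once enough unique hosts are collected
                }
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList(
                "http://www.b.com/1.html",
                "http://www.a.com/1.html",
                "http://www.b.com/2.html",
                "http://www.a.com/2.html");
        // Prints [http://www.b.com/1.html, http://www.a.com/1.html]
        System.out.println(firstPerDomain(ranked, 10));
    }
}
```

Because the walk keeps relevance order and stops early, the cost depends on how many hits must be skipped before enough distinct hosts turn up, which is exactly the concern raised at the start of the thread.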



Re: clustering results

2004-04-10 Thread Michael A. Schoen
So as Venu pointed out, sorting doesn't seem to help the problem. If we have
to walk the result set, access docs and dedupe using brute force, we're
better off w/ the standard order by relevance.

If you've got an example of this type of clustering done in a more efficient
way, that'd be great.

Any other ideas?





clustering results

2004-04-09 Thread Michael A. Schoen
I have an index of URLs and need to display the top 10 results for a given query, but 
want to display only one result per domain. It seems that using either Hits or a 
HitCollector, I'll need to access the doc, grab the domain field (I'll have it parsed 
ahead of time) and only take/display documents that are unique.

A significant percentage of the time I expect I may have to access thousands of 
results before I find 10 in unique domains. Is there a faster approach that won't 
require accessing thousands of documents?