Re: clustering results

2004-04-10 Thread Erik Hatcher
On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
> I have an index of urls, and need to display the top 10 results for a
> given query, but want to display only 1 result per domain. It seems
> that using either Hits or a HitCollector, I'll need to access the doc,
> grab the domain field (I'll have it parsed ahead of time) and only
> take/display documents that are unique.
>
> A significant percentage of the time I expect I may have to access
> thousands of results before I find 10 in unique domains. Is there a
> faster approach that won't require accessing thousands of documents?

I have examples of this that I can post when I have more time, but a 
quick pointer... check out the overloaded IndexSearcher.search() 
methods which accept a Sort.  You can do really really interesting 
slicing and dicing, I think, using it.  Try this one on for size:

  example.displayHits(allBooks,
      new Sort(new SortField[]{
        new SortField("category"),
        SortField.FIELD_SCORE,
        new SortField("pubmonth", SortField.INT, true)
      }));

Be clever indexing the piece you want to group on - I think you may 
find this the solution you're looking for.

	Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: ValueListHandler pattern with Lucene

2004-04-10 Thread lucene
On Friday 09 April 2004 23:59, Ype Kingma wrote:
> When you need 3000 hits and their stored fields, you might
> consider using the lower level search API with your own HitCollector.

I apologize for the stupid question but ... where's the actual result in
HitCollector? :-)

  collect(int doc, float score) 

Where doc is the index and score is its score - and where's the Document?

Timo




Re: Highlighter package v2 RC1

2004-04-10 Thread markharw00d
> Can I customize the way it highlights terms? Right now it does so by
> surrounding them with <b>.

That's the job of a formatter class. You can pass one in the constructor, e.g.:

Formatter myFormatter = new SimpleHTMLFormatter("<i>", "</i>");
Highlighter h = new Highlighter(myFormatter, new QueryScorer(query));

If you look at the Formatter interface you can see it is now passed scores for
each token. You could provide a Formatter implementation which coloured words
with different colour intensity based on these scores, if that was a useful
effect.
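
A rough sketch of the colour-intensity idea, in plain Java so it runs without
the highlighter package. The class name, the method names and the two-argument
highlightTerm shape are all made up for illustration; they are not the
Formatter interface itself, so the logic would need adapting to whatever
arguments the real interface passes. It also assumes scores fall between 0
and 1.

```java
public class ScoreColourSketch {

    /**
     * Maps a score to an HTML background colour. Scores are clamped to 0..1
     * (an assumption of this sketch); 0 gives white, 1 gives full yellow.
     */
    public static String colourFor(float score) {
        float s = Math.max(0f, Math.min(1f, score));
        // The blue channel fades from 255 to 0 as the score rises.
        int blue = Math.round(255 * (1f - s));
        return String.format("#FFFF%02X", blue);
    }

    /**
     * Hypothetical stand-in for a formatter callback: wraps a scoring term
     * in a span whose colour depends on the score, and leaves terms with a
     * zero (or negative) score untouched.
     */
    public static String highlightTerm(String originalText, float score) {
        if (score <= 0f) {
            return originalText;
        }
        return "<span style=\"background:" + colourFor(score) + "\">"
                + originalText + "</span>";
    }
}
```

The same mapping could just as well drive font weight or a CSS class name
instead of an inline style.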



> Second, I miss the ability to let it highlight all fields or a selection of
> several fields

Just concatenate them:

String textToBeHighlighted=field1Text+field2Text..;

and then send that to the highlighter, no?


> What am I going to do if I want the entire text to be highlighted - and not
> fragmented?

Use one big fragment...

highlighter.setTextFragmenter(new SimpleFragmenter(100));

Cheers
Mark




Re: clustering results

2004-04-10 Thread Venu Durgam
Erik,

Thanks for the pointer.
I am not sure how sort can filter out results.
Sort will just sort the results, right?

Let's say I had the results below:
http://www.b.com/1.html
http://www.a.com/1.html
http://www.b.com/2.html
http://www.a.com/2.html

If you sort by domain name, the results might be:
http://www.a.com/1.html
http://www.a.com/2.html
http://www.b.com/1.html
http://www.b.com/2.html

What I want is one result per domain: no sorting, just filtering out some results.
http://www.b.com/1.html
http://www.a.com/1.html

Can this still be achieved using sort? If not, are there any other ways of doing this?

Thanks.



Re: ValueListHandler pattern with Lucene

2004-04-10 Thread Erik Hatcher
On Apr 10, 2004, at 5:08 AM, [EMAIL PROTECTED] wrote:
> On Friday 09 April 2004 23:59, Ype Kingma wrote:
>> When you need 3000 hits and their stored fields, you might
>> consider using the lower level search API with your own HitCollector.
>
> I apologize for the stupid question but ... where's the actual result
> in HitCollector? :-)
>
>   collect(int doc, float score)
>
> Where doc is the index and score is its score - and where's the
> Document?

That's the beauty: it is up to you to load the doc iff you want it.
In many situations, loading the doc would slow things down
dramatically.  For example, QueryFilter uses a HitCollector internally,
but couldn't care less about the actual document object, just its id
(which you get from the int doc).  To get the doc:

	 Document document = searcher.doc(doc);

(I'd use 'id' for the int, personally).
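
To make the "load the doc only if you want it" point concrete, here is a small
sketch in plain Java. The Collector interface below is a stand-in with the
same collect(int, float) shape, not Lucene's HitCollector, and the loader
argument stands in for a call such as searcher.doc(id); every name here is
made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class LazyLoadSketch {

    /** Stand-in for a collect(int doc, float score) callback. */
    public interface Collector {
        void collect(int doc, float score);
    }

    /** Remembers only the ids of hits at or above a score threshold. */
    public static class IdCollector implements Collector {
        public final List<Integer> ids = new ArrayList<>();
        private final float minScore;

        public IdCollector(float minScore) {
            this.minScore = minScore;
        }

        @Override
        public void collect(int doc, float score) {
            // No document is loaded here; only the cheap int id is kept.
            if (score >= minScore) {
                ids.add(doc);
            }
        }
    }

    /** Loads documents for at most 'limit' ids, in collection order. */
    public static <D> List<D> loadFirst(List<Integer> ids, int limit,
                                        IntFunction<D> loader) {
        List<D> docs = new ArrayList<>();
        for (int i = 0; i < ids.size() && i < limit; i++) {
            docs.add(loader.apply(ids.get(i)));
        }
        return docs;
    }
}
```

The expensive step (the loader) runs only for the few ids that are actually
displayed, however many hits were collected.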

	Erik



Re: clustering results

2004-04-10 Thread Erik Hatcher
On Apr 10, 2004, at 9:47 AM, Venu Durgam wrote:
> I am not sure how sort can filter out results.
> sort will just sort the results right ?

Right, no filtering using Sort.

> lets say if i had below results
> http://www.b.com/1.html
> http://www.a.com/1.html
> http://www.b.com/2.html
> http://www.a.com/2.html
> if you sort by domain name, results might be
> http://www.a.com/1.html
> http://www.a.com/2.html
> http://www.b.com/1.html
> http://www.b.com/2.html
> If i want to have one result per domain. no sorting, just filtering
> out some results.
> http://www.b.com/1.html
> http://www.a.com/1.html
>
> can this still be achieved using sort? If not any other ways of doing
> this ?

Not that I know of, directly.  The brute force way of sorting and then
walking the results yourself to collect things in the way you want is
the only method I can think of at the moment.
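
A minimal sketch of that walk, in plain Java with no Lucene calls. It assumes
the doc ids arrive in whatever order the search returned them, and that the
domain of every document has already been read into an array indexed by doc
id. That array is an assumption of this sketch (nothing in the thread
provides it), as are the class and method names; the point is only that the
walk itself then needs no stored documents.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OnePerDomainSketch {

    /**
     * Walks ranked doc ids and keeps the first id seen for each domain.
     *
     * @param rankedDocIds  doc ids in result order, best first
     * @param domainByDocId hypothetical lookup table: domainByDocId[id] is
     *                      the domain of document id, filled once up front
     * @param wanted        stop as soon as this many results are kept
     */
    public static List<Integer> firstPerDomain(int[] rankedDocIds,
                                               String[] domainByDocId,
                                               int wanted) {
        Set<String> seen = new HashSet<>();
        List<Integer> kept = new ArrayList<>();
        for (int id : rankedDocIds) {
            if (kept.size() == wanted) {
                break; // enough unique domains, no need to walk further
            }
            // Set.add returns false when the domain was already present.
            if (seen.add(domainByDocId[id])) {
                kept.add(id);
            }
        }
        return kept;
    }
}
```

With the four urls from the example above as docs 0 to 3 (b, a, b, a), the
walk keeps docs 0 and 1 and skips the other two. It is still brute force:
in the worst case every hit is visited, just without loading each document.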

	Erik



Re: clustering results

2004-04-10 Thread Michael A. Schoen
So as Venu pointed out, sorting doesn't seem to help the problem. If we have
to walk the result set, access docs and dedupe using brute force, we're
better off w/ the standard order by relevance.

If you've got an example of this type of clustering done in a more efficient
way, that'd be great.

Any other ideas?


- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Saturday, April 10, 2004 12:35 AM
Subject: Re: clustering results

