Re: clustering results
On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
> I have an index of URLs, and need to display the top 10 results for a given query, but want to display only 1 result per domain. It seems that using either Hits or a HitCollector, I'll need to access the doc, grab the domain field (I'll have it parsed ahead of time) and only take/display documents that are unique. A significant percentage of the time I expect I may have to access thousands of results before I find 10 in unique domains. Is there a faster approach that won't require accessing thousands of documents?

I have examples of this that I can post when I have more time, but a quick pointer: check out the overloaded IndexSearcher.search() methods which accept a Sort. You can do really, really interesting slicing and dicing, I think, using it. Try this one on for size:

    example.displayHits(allBooks,
        new Sort(new SortField[]{
            new SortField("category"),
            SortField.FIELD_SCORE,
            new SortField("pubmonth", SortField.INT, true)
        }));

Be clever indexing the piece you want to group on - I think you may find this the solution you're looking for.

Erik

- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
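To picture the ordering that three-field Sort asks for, here is a minimal plain-Java sketch with no Lucene dependency. It assumes the usual reading of the example: category ascending, then score with the best match first, then pubmonth descending (the `true` flag meaning reverse). The `Book` class, `MultiKeySortSketch` and the sample values are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an indexed document; not a Lucene class.
final class Book {
    final String title;
    final String category;
    final float score;
    final int pubmonth; // assumed encoding, e.g. 200404 for April 2004

    Book(String title, String category, float score, int pubmonth) {
        this.title = title;
        this.category = category;
        this.score = score;
        this.pubmonth = pubmonth;
    }
}

final class MultiKeySortSketch {
    // category ascending, then score descending, then pubmonth descending
    static final Comparator<Book> ORDER = new Comparator<Book>() {
        @Override
        public int compare(Book a, Book b) {
            int c = a.category.compareTo(b.category);
            if (c != 0) return c;
            c = Float.compare(b.score, a.score);       // higher score first
            if (c != 0) return c;
            return Integer.compare(b.pubmonth, a.pubmonth); // newer month first
        }
    };

    // Returns the titles in sorted order without modifying the input list.
    static List<String> sortedTitles(List<Book> books) {
        List<Book> copy = new ArrayList<>(books);
        copy.sort(ORDER);
        List<String> titles = new ArrayList<>();
        for (Book b : copy) {
            titles.add(b.title);
        }
        return titles;
    }
}
```

Note that this only groups equal categories next to each other; nothing is removed from the list.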
Re: ValueListHandler pattern with Lucene
On Friday 09 April 2004 23:59, Ype Kingma wrote:
> When you need 3000 hits and their stored fields, you might consider using the lower level search API with your own HitCollector.

I apologize for the stupid question, but where's the actual result in HitCollector? :-)

    collect(int doc, float score)

Here doc is the index and score is its score, but where's the Document?

Timo
Re: Highlighter package v2 RC1
> Can I customize the way it highlights terms? Right now it does so by surrounding them with <b>.

That's the job of a formatter class. You can pass one in the constructor, e.g.:

    Formatter myFormatter = new SimpleHTMLFormatter("<i>", "</i>");
    Highlighter h = new Highlighter(myFormatter, new QueryScorer(query));

If you look at the Formatter interface you can see it is now passed scores for each token. You could provide a Formatter implementation which coloured words with different colour intensity based on these scores, if that was a useful effect.

> Second, I miss the ability to let it highlight all fields, or a selection of several fields.

Just concatenate them:

    String textToBeHighlighted=field1Text+field2Text..;

and then send that to the highlighter, no?

> What am I going to do if I want the entire text to be highlighted, and not fragmented?

Use one big fragment:

    highlighter.setTextFragmenter(new SimpleFragmenter(100));

Cheers
Mark
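The score-based colouring idea could look roughly like the plain-Java sketch below. It deliberately does not implement the highlighter's Formatter interface, whose exact method signature is not shown in this thread; `ScoreShadingSketch`, `shade` and the `maxScore` parameter are hypothetical names, and the red-channel mapping is just one possible choice.

```java
// Hypothetical helper; a real Formatter implementation would wrap logic like this.
final class ScoreShadingSketch {
    /**
     * Wraps a term in a font tag whose red intensity grows with score / maxScore.
     * Terms with a non-positive score are returned unchanged (not highlighted).
     */
    static String shade(String term, float score, float maxScore) {
        if (score <= 0f || maxScore <= 0f) {
            return term;
        }
        float ratio = Math.min(1f, score / maxScore);   // clamp to [0, 1]
        int red = Math.round(255 * ratio);              // 0..255
        String colour = String.format("#%02X0000", red);
        return "<font color=\"" + colour + "\">" + term + "</font>";
    }
}
```

A term at the maximum score comes out as full red, while weaker matches get a darker shade.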
Re: clustering results
Erik,

Thanks for the pointer. I am not sure how sort can filter out results. Sort will just sort the results, right? Let's say I had the results below:

    http://www.b.com/1.html
    http://www.a.com/1.html
    http://www.b.com/2.html
    http://www.a.com/2.html

If you sort by domain name, the results might be:

    http://www.a.com/1.html
    http://www.a.com/2.html
    http://www.b.com/1.html
    http://www.b.com/2.html

What if I want to have one result per domain, with no sorting, just filtering out some results?

    http://www.b.com/1.html
    http://www.a.com/1.html

Can this still be achieved using sort? If not, are there any other ways of doing this?

Thanks.
Re: ValueListHandler pattern with Lucene
On Apr 10, 2004, at 5:08 AM, [EMAIL PROTECTED] wrote:
> On Friday 09 April 2004 23:59, Ype Kingma wrote:
>> When you need 3000 hits and their stored fields, you might consider using the lower level search API with your own HitCollector.
>
> I apologize for the stupid question, but where's the actual result in HitCollector? :-)
>
>     collect(int doc, float score)
>
> Here doc is the index and score is its score, but where's the Document?

That's the beauty: it is up to you to load the doc iff you want it. In many situations, loading the doc would slow things down dramatically. For example, QueryFilter uses a HitCollector internally, but couldn't care less about the actual document object, just its id (which you get from the int doc). To get the doc:

    Document document = searcher.doc(doc);

(I'd use 'id' for the int, personally.)

Erik
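The "load the doc only if you want it" pattern can be sketched in plain Java as follows. Nothing here is Lucene API: `ScoreCollector` is a hypothetical stand-in for the collect(int, float) callback, `TopIds` is an illustrative collector that remembers only ids and scores, and the `loader` function stands in for a call such as searcher.doc(id).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.IntFunction;

final class LazyLoadSketch {
    /** Hypothetical stand-in for the collect(int doc, float score) callback. */
    interface ScoreCollector {
        void collect(int id, float score);
    }

    /** Keeps the ids of the n best-scoring hits and never touches a document. */
    static final class TopIds implements ScoreCollector {
        private static final class Entry {
            final int id;
            final float score;
            Entry(int id, float score) { this.id = id; this.score = score; }
        }

        private final int n;
        // Min-heap on score: the weakest kept hit sits at the head.
        private final PriorityQueue<Entry> heap = new PriorityQueue<>(new Comparator<Entry>() {
            @Override
            public int compare(Entry a, Entry b) { return Float.compare(a.score, b.score); }
        });

        TopIds(int n) { this.n = n; }

        @Override
        public void collect(int id, float score) {
            heap.add(new Entry(id, score));
            if (heap.size() > n) {
                heap.poll(); // drop the current weakest hit
            }
        }

        /** Ids of the kept hits, best score first. */
        List<Integer> bestFirst() {
            List<Entry> entries = new ArrayList<>(heap);
            entries.sort(new Comparator<Entry>() {
                @Override
                public int compare(Entry a, Entry b) { return Float.compare(b.score, a.score); }
            });
            List<Integer> ids = new ArrayList<>();
            for (Entry e : entries) {
                ids.add(e.id);
            }
            return ids;
        }
    }

    /** Loads documents for the kept ids only; loader is a stand-in for searcher.doc(id). */
    static <D> List<D> load(TopIds top, IntFunction<D> loader) {
        List<D> docs = new ArrayList<>();
        for (int id : top.bestFirst()) {
            docs.add(loader.apply(id));
        }
        return docs;
    }
}
```

However many hits pass through collect, the loader runs at most n times, which is the saving Erik describes.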
Re: clustering results
On Apr 10, 2004, at 9:47 AM, Venu Durgam wrote:
> I am not sure how sort can filter out results. Sort will just sort the results, right?

Right, no filtering using Sort.

> Let's say I had the results below:
>
>     http://www.b.com/1.html
>     http://www.a.com/1.html
>     http://www.b.com/2.html
>     http://www.a.com/2.html
>
> If you sort by domain name, the results might be:
>
>     http://www.a.com/1.html
>     http://www.a.com/2.html
>     http://www.b.com/1.html
>     http://www.b.com/2.html
>
> What if I want to have one result per domain, with no sorting, just filtering out some results?
>
>     http://www.b.com/1.html
>     http://www.a.com/1.html
>
> Can this still be achieved using sort? If not, are there any other ways of doing this?

Not that I know of, directly. The brute force way of sorting and then walking the results yourself to collect things in the way you want is the only method I can think of at the moment.

Erik
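The brute-force walk might look like the plain-Java sketch below: go through the results in ranked order, keep the first URL seen for each domain, and stop once enough have been collected. It does not use Lucene; `OnePerDomainSketch` and its methods are hypothetical, and the domain is parsed from the URL here only to keep the example self-contained (the original plan was a pre-parsed domain field).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OnePerDomainSketch {
    /** Extracts the host from an absolute URL such as http://www.a.com/1.html. */
    static String domainOf(String url) {
        return URI.create(url).getHost();
    }

    /**
     * Walks rankedUrls in order and keeps the first URL of each domain,
     * stopping as soon as limit results have been kept.
     */
    static List<String> firstPerDomain(List<String> rankedUrls, int limit) {
        Set<String> seenDomains = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String url : rankedUrls) {
            if (kept.size() >= limit) {
                break; // enough unique domains found; stop walking
            }
            if (seenDomains.add(domainOf(url))) {
                kept.add(url); // first (best-ranked) hit for this domain
            }
        }
        return kept;
    }
}
```

With the four URLs from the example, in the order given, this keeps http://www.b.com/1.html and http://www.a.com/1.html. The walk still has to pass over every duplicate it meets, so it does not remove the cost Michael is worried about; it only stops early once the limit is reached.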
Re: clustering results
So as Venu pointed out, sorting doesn't seem to help the problem. If we have to walk the result set, access docs and dedupe using brute force, we're better off w/ the standard order by relevance.

If you've got an example of this type of clustering done in a more efficient way, that'd be great. Any other ideas?