Re: Detecting duplicates

2011-03-05 Thread Li Li
It's indeed very slow, because it does collapsing on all matched documents. We tackled this problem by doing collapsing on the top 100 documents only. 2011/3/6 Mark > I'm familiar with Deduplication; however, I do not wish to remove my > duplicates and my needs are slightly different. I would like to mark the

Looking for a Lucene Contractor

2011-03-05 Thread Drew Kutcharian
Hi Everyone, We are looking for someone to help us build a similarity engine. Basically we want to be able to show similar posts when a user posts a new block of text. A good example of this is StackOverflow. When a user tries to ask a new question, the system displays similar questions. Can L

Re: Detecting duplicates

2011-03-05 Thread Mark
I'm familiar with Deduplication; however, I do not wish to remove my duplicates and my needs are slightly different. I would like to mark the first document with signature 'xyz' as unique but the next one as a duplicate. This way I can filter out "duplicates" during searching using a filter query
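
A minimal sketch of the marking idea described above, in plain Java with no Lucene dependency. It is only an illustration under stated assumptions: the class name `DuplicateMarker`, the choice of MD5 as the signature, and the in-memory set of seen signatures are all hypothetical and are not taken from Solr's Deduplication code or from Lucene's DuplicateFilter.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: marks documents as duplicates instead of removing them.
public class DuplicateMarker {

    // Builds a hex MD5 signature over the selected field values.
    // MD5 is an assumption here; any stable hash would do.
    static String signature(String... fieldValues) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String value : fieldValues) {
                md.update(value.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator so ("ab","c") differs from ("a","bc")
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns one flag per document, in input order:
    // false for the first document with a given signature, true for later ones.
    static List<Boolean> markDuplicates(List<String[]> docs) {
        Set<String> seen = new HashSet<>();
        List<Boolean> flags = new ArrayList<>();
        for (String[] fields : docs) {
            // Set.add returns false when the signature was already present.
            flags.add(!seen.add(signature(fields)));
        }
        return flags;
    }
}
```

At index time, the flag could be stored in a field of its own (the name is up to the application) so that a filter query can exclude flagged documents at search time. For a large or incrementally updated index, the in-memory set would presumably have to be replaced by a lookup of the signature in the index itself.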

Combining analyzers in Lucene

2011-03-05 Thread Martin O'Shea
Hello, I have a situation where I'm using two methods in a Java class to implement a StandardAnalyzer in Lucene to index text strings and return their word frequencies, as follows: public void indexText(String suffix, boolean includeStopWords) { StandardAnalyzer analyzer = null;

Re: Detecting duplicates

2011-03-05 Thread Devon H. O'Dell
There is a DuplicateFilter class in contrib that works pretty well. 2011/3/5 Grant Ingersoll: > See http://wiki.apache.org/solr/Deduplication. Should be fairly easy to pull > out if you are doing just Lucene. > > On Mar 5, 2011, at 1:49 AM, Mark wrote: > >> Is there a way one could detect dupli

Re: Detecting duplicates

2011-03-05 Thread Grant Ingersoll
See http://wiki.apache.org/solr/Deduplication. Should be fairly easy to pull out if you are doing just Lucene. On Mar 5, 2011, at 1:49 AM, Mark wrote: > Is there a way one could detect duplicates (say by using some unique hash of > certain fields) and marking a document as a duplicate but not

Re: Lucene nightly build: similarity score per field

2011-03-05 Thread Patrick Diviacco
Never mind, I've finally solved it. Now I just need to figure out how to retrieve the per-field scores in my results. I need to know how similar each field is. I know I can use explain(), but it slows down computation... Thanks. On 4 March 2011 21:21, Patrick Diviacco wrote: > ok thanks, one