It is indeed very slow, because it collapses across all matched documents.
We tackled this problem by collapsing only the top 100 documents.
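A minimal sketch of that idea in plain Java, not the actual Solr field-collapsing code: it assumes the hits are already sorted by descending score and only looks at the first N of them. The names TopNCollapse, Hit and collapseTopN are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Solr.
final class TopNCollapse {

    /** Hypothetical hit: a document id, the value it is collapsed on, and its score. */
    static final class Hit {
        final int docId;
        final String collapseKey;
        final float score;

        Hit(int docId, String collapseKey, float score) {
            this.docId = docId;
            this.collapseKey = collapseKey;
            this.score = score;
        }
    }

    /**
     * Collapses only the first topN entries of rankedHits (assumed to be sorted
     * by descending score), keeping the first hit seen for each collapse key.
     * Hits beyond topN are never inspected, which is what bounds the cost.
     */
    static List<Hit> collapseTopN(List<Hit> rankedHits, int topN) {
        List<Hit> collapsed = new ArrayList<>();
        Set<String> seenKeys = new HashSet<>();
        int limit = Math.min(topN, rankedHits.size());
        for (int i = 0; i < limit; i++) {
            Hit hit = rankedHits.get(i);
            // Set.add returns false when the key was already present.
            if (seenKeys.add(hit.collapseKey)) {
                collapsed.add(hit);
            }
        }
        return collapsed;
    }
}
```

The trade-off is that duplicates ranked below the cutoff are not collapsed, so the cutoff only makes sense when users rarely page past it.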
2011/3/6 Mark
> I'm familiar with Deduplication however I do not wish to remove my
> duplicates and my needs are slightly different. I would like to mark the
Hi Everyone,
We are looking for someone to help us build a similarity engine. Basically we
want to be able to show similar posts when a user posts a new block of text. A
good example of this is StackOverflow. When a user tries to ask a new question,
the system displays similar questions.
Can L
I'm familiar with Deduplication however I do not wish to remove my
duplicates and my needs are slightly different. I would like to mark the
first document with signature 'xyz' as unique but the next one as a
duplicate. This way I can filter out "duplicates" during searching using
a filter query
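A rough sketch of the marking step described above, in plain Java with no Lucene or Solr calls. It assumes the signature is an MD5 hash over the chosen field values and that all documents pass through one indexing run, so an in-memory set of seen signatures is enough. DuplicateMarker, signature and isDuplicate are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Solr.
final class DuplicateMarker {

    // Signatures seen so far in this indexing run (assumption: fits in memory).
    private final Set<String> seenSignatures = new HashSet<>();

    /** Hex-encoded MD5 over the given field values, in the order supplied. */
    static String signature(String... fieldValues) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String value : fieldValues) {
                md5.update(value.getBytes(StandardCharsets.UTF_8));
                // Separator byte so that ("ab", "c") and ("a", "bc") differ.
                md5.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Returns false for the first document with a given signature and true for
     * every later one, so the caller can index the flag instead of dropping
     * the document.
     */
    boolean isDuplicate(String... fieldValues) {
        return !seenSignatures.add(signature(fieldValues));
    }
}
```

The returned flag would then be stored in an indexed field of the caller's choosing and excluded at search time with a filter query on that field.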
Hello
I have a situation where I'm using two methods in a Java class to implement
a StandardAnalyzer in Lucene to index text strings and return their word
frequencies as follows:
public void indexText(String suffix, boolean includeStopWords) {
    StandardAnalyzer analyzer = null;
There is a DuplicateFilter class in contrib that works pretty well.
2011/3/5 Grant Ingersoll :
> See http://wiki.apache.org/solr/Deduplication. Should be fairly easy to pull
> out if you are doing just Lucene.
>
> On Mar 5, 2011, at 1:49 AM, Mark wrote:
>
>> Is there a way one could detect dupli
See http://wiki.apache.org/solr/Deduplication. Should be fairly easy to pull
out if you are doing just Lucene.
On Mar 5, 2011, at 1:49 AM, Mark wrote:
> Is there a way one could detect duplicates (say by using some unique hash of
> certain fields) and marking a document as a duplicate but not
Never mind, I've finally solved it.
Now I just need to figure out how to retrieve the per-field scores in my
results.
I need to know how similar each field is. I know I can use explain(),
but it slows down the computation...
thanks
On 4 March 2011 21:21, Patrick Diviacco wrote:
> ok thanks, one