Duplicate documents

2013-06-27 Thread Moshe Immerman
Hi, we have recently upgraded from Lucene 3.6 to 4.3.1 and have encountered an intermittent issue of IndexSearcher.search returning duplicate documents (based on the Lucene doc number, not a custom field), i.e. TopDocs docs = IndexSearcher.search(query, filter, 10, sort) assert

Re: Do duplicate documents affect term scoring?

2011-11-28 Thread Ian Lea
Lucene won't be aware that you've got duplicate documents, but scoring does take account of the number of documents in which search terms appear. See http://lucene.apache.org/java/3_5_0/scoring.html and the javadocs for oal.search.Similarity. Only you can say whether or not you nee
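
To illustrate the point in the reply above that scoring depends on how many documents contain a term, here is a minimal sketch. It assumes the idf formula documented for DefaultSimilarity in Lucene 3.x, idf = 1 + ln(numDocs / (docFreq + 1)); other Similarity implementations may differ. The class name, the method name, and the corpus size of 1000 other documents are hypothetical, and the code does not use Lucene itself. The figure of 36 copies comes from the original question.

```java
public class IdfDuplicateSketch {
    // Assumed formula (Lucene 3.x DefaultSimilarity): 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        int otherDocs = 1000; // hypothetical number of unrelated documents

        // The term occurs in a single document that is indexed once.
        double indexedOnce = idf(1, otherDocs + 1);

        // The same document is indexed 36 times, so docFreq and numDocs both grow.
        double indexed36Times = idf(36, otherDocs + 36);

        System.out.println("idf, document indexed once:     " + indexedOnce);
        System.out.println("idf, document indexed 36 times: " + indexed36Times);
    }
}
```

Under this formula the duplicated case yields the lower idf, so a term that is really rare looks more common to the scorer.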

Do duplicate documents affect term scoring?

2011-11-27 Thread Stephen Thomas
does not change very much between each version, sometimes not at all. I end up with duplicate documents; the only difference is the timestamp. Does this impact the term weighting used by Lucene? My intuition is that if a term only occurs in one document, but that document occurs 36 times, then the

Duplicate documents in a corpus

2011-07-28 Thread Rich Heimann
All, I am curious if Lucene and/or Mahout can identify duplicate documents? I am having trouble with many redundant docs in my corpus, which is causing inflated values and an expense on users to process and reprocess much of the material. Can the redundancy be removed or managed in some sense my

Duplicate documents

2009-10-04 Thread Felipe Lobo
Hi, I'm having a problem with my search process: my search results are being duplicated (but they are not duplicated in the index; I checked with Luke). I checked the ids of the results, and one exists in the index while the other is out of range (e.g. my index has 300 documents, one result has id

FW: Eliminating duplicate documents when indexing

2007-10-03 Thread Rod Giles
Duplicate Documents In An Index: The documentation for the updateDocument method of IndexWriter indicates that a delete by term occurs before the updated document is added (i.e. the document is replaced in the index, not duplicated). Has anyone been able to get this process to work? The term that I am using

Re: is there a way to find duplicate documents in the index?

2006-03-13 Thread Yonik Seeley
On 3/13/06, emerson cargnin <[EMAIL PROTECTED]> wrote: > I notice some duplicated entries in my index, by just looking at it, > and I suspect there might be more than those I found. Is there a > way to detect duplicate documents in an index? > > Emerson Cargnin If th

is there a way to find duplicate documents in the index?

2006-03-13 Thread emerson cargnin
I notice some duplicated entries in my index, by just looking at it, and I suspect there might be more than those I found. Is there a way to detect duplicate documents in an index? Emerson Cargnin

Re: deleting duplicate documents from my index

2006-01-30 Thread gekkokid
hi, thats exactly what i did :) works perfectly thanks _gk - Original Message - From: "Chris Hostetter" <[EMAIL PROTECTED]> To: Sent: Monday, January 30, 2006 5:56 AM Subject: Re: deleting duplicate documents from my index : Hi, im trying to delete duplicate d

Re: deleting duplicate documents from my index

2006-01-29 Thread Chris Hostetter
: Hi, I'm trying to delete duplicate documents from my index, the unique : identifier is the document's url (aka field "url"). : : my initial thought of how to accomplish this is to open the index via a : reader and sort them by the document's url and then iterate through them : looking f

Re: deleting duplicate documents from my index

2006-01-29 Thread Jeff Rodenburg
issue that needs to be addressed, it's worth it. Hope this helps. -- j On 1/28/06, gekkokid <[EMAIL PROTECTED]> wrote: > > Hi, I'm trying to delete duplicate documents from my index, the unique > identifier is the document's url (aka field "url"). > > my init

deleting duplicate documents from my index

2006-01-28 Thread gekkokid
Hi, I'm trying to delete duplicate documents from my index, the unique identifier is the document's url (aka field "url"). my initial thought of how to accomplish this is to open the index via a reader and sort them by the document's url and then iterate through them looking for a matc
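
The sort-then-scan idea described in the post above can be sketched in plain Java, without Lucene. This is only an illustration under stated assumptions: the Entry class is a hypothetical stand-in for an indexed document's doc id and "url" field value, and the policy of keeping the lowest doc id per url is an assumption, not something the thread states. How the entries are read from an index and how the deletions are applied are left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class UrlDuplicateSketch {
    // Hypothetical stand-in for one indexed document: doc id plus its "url" field value.
    static final class Entry {
        final int docId;
        final String url;

        Entry(int docId, String url) {
            this.docId = docId;
            this.url = url;
        }
    }

    // Sorts a copy by url (then doc id) and scans neighbours: any entry whose
    // url equals the previous entry's url is reported as a duplicate to delete.
    static List<Integer> duplicateDocIds(List<Entry> entries) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparing((Entry e) -> e.url)
                              .thenComparingInt(e -> e.docId));

        List<Integer> toDelete = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {
            if (sorted.get(i).url.equals(sorted.get(i - 1).url)) {
                toDelete.add(sorted.get(i).docId);
            }
        }
        return toDelete;
    }
}
```

Because equal urls end up adjacent after sorting, one pass over the sorted list is enough; the first entry of each run is kept and the rest are returned for deletion.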

Optimizing insertion of duplicate documents

2005-09-06 Thread Robichaud, Jean-Philippe
Hi Everyone, I have a special scenario where I frequently want to insert duplicate documents in the index. For example, I know that I want 400 copies of the same document. (I use the docboost for something else so I can't just add one document and set the docboost to 400). I would like to hac

Re: Ideas Needed - Finding Duplicate Documents

2005-06-13 Thread Paul Libbrecht
o poll the community's opinion on good strategies for identifying duplicate documents in a lucene index. You see, I have an index containing roughly 25 million lucene documents. My task requires me to work at sentence level so each lucene document actually contains exactly one sentence. T

Re: Ideas Needed - Finding Duplicate Documents

2005-06-12 Thread Chris Hostetter
: Yes, when I say "duplicate" sentences, they are exact copies of the same : string. you still haven't explained how you indexed these sentences, what do you mean by "each lucene document actually contains exactly one sentence." ? Did you tokenize the sentence into one field? do you have a field for

Re: Ideas Needed - Finding Duplicate Documents

2005-06-12 Thread Chris Lamprecht
I'd have to see your indexing code to see if there are any obvious performance gotchas there. If you can run your indexer under a profiler (OptimizeIt, JProbe, or just the free one with java using -Xprof), it will tell you in which methods most of your CPU time is spent. If you're using StandardA

AW: Ideas Needed - Finding Duplicate Documents

2005-06-12 Thread Karsten Konrad
Hi David, >> I would like to poll the community's opinion on good strategies for identifying duplicate documents in a lucene index. >> Do you mean 100% duplicates or some kind of similarity? >> Obviously the brute force method of pairwise compares would take forever.

Re: Ideas Needed - Finding Duplicate Documents

2005-06-12 Thread Dave Kor
r example), providing a fast way to find duplicates at > search time. > > If you can give more details on your requirements, people in this list > can probably come up with some pretty good solutions. > > -chris > > On 6/12/05, Dave Kor <[EMAIL PROTECTED]> wrote: > > Hi

Re: Ideas Needed - Finding Duplicate Documents

2005-06-12 Thread Chris Lamprecht
MAIL PROTECTED]> wrote: > Hi, > > I would like to poll the community's opinion on good strategies for > identifying > duplicate documents in a lucene index. > > You see, I have an index containing roughly 25 million lucene documents. My > task > requires me to work

Ideas Needed - Finding Duplicate Documents

2005-06-12 Thread Dave Kor
Hi, I would like to poll the community's opinion on good strategies for identifying duplicate documents in a lucene index. You see, I have an index containing roughly 25 million lucene documents. My task requires me to work at sentence level so each lucene document actually contains exactl
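
For the exact-duplicate case raised in this thread (identical sentence strings), one common way to avoid pairwise comparison is to group documents by a digest of the sentence. The sketch below is an assumption-laden illustration, not necessarily what the thread settled on: the class and method names are hypothetical, list positions stand in for doc ids, and everything is held in memory, which would need rethinking at 25 million documents.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExactDuplicateSketch {
    // Hex-encoded SHA-256 of the sentence; identical strings give identical digests.
    static String digest(String sentence) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(sentence.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Maps each digest to the positions of the sentences that share it,
    // keeping only digests seen at least twice (the duplicate groups).
    static Map<String, List<Integer>> duplicateGroups(List<String> sentences) {
        Map<String, List<Integer>> byDigest = new LinkedHashMap<>();
        for (int i = 0; i < sentences.size(); i++) {
            byDigest.computeIfAbsent(digest(sentences.get(i)), k -> new ArrayList<>()).add(i);
        }
        byDigest.values().removeIf(ids -> ids.size() < 2);
        return byDigest;
    }
}
```

This makes a single pass over the sentences instead of comparing every pair. It only catches byte-for-byte identical strings; near-duplicates would need some form of normalization or similarity measure.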

Re: finding potential duplicate documents

2005-05-29 Thread Marco Dissel
Any tips on this issue? Thanks Marco - Original Message - From: Marco Dissel To: java-user@lucene.apache.org Sent: Friday, May 13, 2005 9:05 AM Subject: finding potential duplicate documents Hello I've got many documents that are potentially duplicate (merging se

finding potential duplicate documents

2005-05-13 Thread Marco Dissel
ys comparing one document with the index. Is there a way to give back all the potential duplicate documents in the index without iterating over every document in the index and comparing it with the other documents in the index? Thanks Marco ---
