Hi
We recently upgraded from Lucene 3.6 to 4.3.1 and have encountered an
intermittent issue of IndexSearcher.search returning duplicate
documents (based on the Lucene doc number, not a custom field),
i.e.
TopDocs docs = indexSearcher.search(query, filter, 10, sort);
assert
Lucene won't be aware that you've got duplicate documents, but scoring
does take account of the number of documents in which search terms
appear. See http://lucene.apache.org/java/3_5_0/scoring.html and the
javadocs for oal.search.Similarity.
Only you can say whether or not you nee
does not change very much between each version,
sometimes not at all. I end up with duplicate documents; the only
difference is the timestamp. Does this impact the term weighting used
by Lucene?
My intuition is that if a term only occurs in one document, but that
document occurs 36 times, then the
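It does. Lucene's classic DefaultSimilarity/TFIDFSimilarity computes idf as 1 + ln(numDocs / (docFreq + 1)), so duplicating the one document that contains a term raises both docFreq and numDocs and pulls the term's weight down. A quick sketch of the arithmetic (the corpus size of 1000 is just an example, not from the original post):

```java
public class IdfSketch {
    // Classic Lucene DefaultSimilarity idf: 1 + ln(numDocs / (docFreq + 1))
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        // Term occurs in exactly one document:
        System.out.printf("unique doc:    idf = %.3f%n", idf(1, numDocs));
        // Same document duplicated 36 times: docFreq and numDocs both grow.
        System.out.printf("36 duplicates: idf = %.3f%n", idf(36, numDocs + 35));
    }
}
```

The duplicated case yields a noticeably smaller idf, so a rare term starts to look common and its contribution to the score shrinks.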
All,
I am curious whether Lucene and/or Mahout can identify duplicate documents. I am
having trouble with many redundant docs in my corpus, which is causing
inflated values and forcing users to process and reprocess much of the
material. Can the redundancy be removed or managed in some sense my
Hi, I'm having a problem with my search process, because my search results
are being duplicated (but they are not duplicated in the index; I checked with
Luke).
I checked the ids of the results: one exists in the index and the other is
out of range (e.g. my index has 300 documents, but one result has id
Duplicate Documents In An Index
The updateDocument method of IndexWriter indicates that a delete of the term
occurs before the updated
document is added (i.e. the document is replaced in the index, but
not duplicated). Has anyone
been able to get this process to work? The term that I am using
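The contract of updateDocument(term, doc) behaves like a map put keyed by the delete term: every document matching the term is deleted, then the new document is added. A plain-Java sketch of that semantics (the "42" id value is illustrative; a real index is keyed by a Term, not a map):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UpdateSemantics {
    // Stand-in for an index keyed by a unique term, e.g. new Term("id", ...).
    // IndexWriter.updateDocument(term, doc) first deletes every document
    // matching the term, then adds the new one -- like Map.put.
    static final Map<String, String> index = new LinkedHashMap<>();

    static void updateDocument(String idTerm, String doc) {
        index.remove(idTerm); // delete phase
        index.put(idTerm, doc); // add phase
    }

    public static void main(String[] args) {
        updateDocument("42", "version 1");
        updateDocument("42", "version 2"); // replaces, does not duplicate
        System.out.println(index.size() + " document(s): " + index.get("42"));
    }
}
```

A common cause of duplicates with updateDocument is a tokenized key field: the delete Term must exactly match an indexed term, so if the id field was analyzed into tokens, nothing is deleted and the add duplicates the document. In Lucene 4.x, StringField indexes the value untokenized for exactly this purpose.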
On 3/13/06, emerson cargnin <[EMAIL PROTECTED]> wrote:
> I noticed some duplicated entries in my index just by looking at it,
> and I suspect there might be more than those I found. Is there a
> way to detect duplicate documents in an index?
>
> Emerson Cargnin
If th
I noticed some duplicated entries in my index just by looking at it,
and I suspect there might be more than those I found. Is there a
way to detect duplicate documents in an index?
Emerson Cargnin
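One way to detect them, assuming each document stores a unique key such as a URL: read that stored field for every document and count occurrences; any value seen more than once marks a duplicate. The counting step itself is just (field name and input shape are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateDetector {
    // Given the stored key (e.g. the "url" field) of every document in the
    // index, return the values that occur more than once.
    static List<String> findDuplicates(List<String> keys) {
        Map<String, Integer> counts = new HashMap<>();
        for (String k : keys) {
            counts.merge(k, 1, Integer::sum);
        }
        List<String> dups = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                dups.add(e.getKey());
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        System.out.println(findDuplicates(List.of("a", "b", "a", "c", "b")));
    }
}
```

This is a single pass over the index, so it scales linearly with the number of documents.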
hi, that's exactly what I did :) works perfectly
thanks
_gk
- Original Message -
From: "Chris Hostetter" <[EMAIL PROTECTED]>
To:
Sent: Monday, January 30, 2006 5:56 AM
Subject: Re: deleting duplicate documents from my index
: Hi, I'm trying to delete duplicate d
: Hi, I'm trying to delete duplicate documents from my index; the unique
: identifier is the document's url (aka field "url").
:
: my initial thought of how to accomplish this is to open the index via a
: reader and sort them by the document's url and then iterate through them
: looking f
issue that
needs to be addressed, it's worth it.
Hope this helps.
-- j
On 1/28/06, gekkokid <[EMAIL PROTECTED]> wrote:
>
> Hi, I'm trying to delete duplicate documents from my index; the unique
> identifier is the document's url (aka field "url").
>
> my init
Hi, I'm trying to delete duplicate documents from my index; the unique
identifier is the document's url (aka field "url").
My initial thought of how to accomplish this is to open the index via a reader
and sort them by the document's url and then iterate through them looking for a
matc
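Rather than sorting first, a single pass with a seen-set is enough: keep the first document for each url and collect the doc ids of every later repeat for deletion. The selection logic, sketched in plain Java (the list index stands in for reading the stored "url" field of each doc id):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupPass {
    // urls.get(docId) is the stored "url" field of document docId.
    // Keep the first document for each url; return ids of later repeats.
    static List<Integer> idsToDelete(List<String> urls) {
        Set<String> seen = new HashSet<>();
        List<Integer> delete = new ArrayList<>();
        for (int docId = 0; docId < urls.size(); docId++) {
            if (!seen.add(urls.get(docId))) {
                delete.add(docId); // url already seen -> duplicate
            }
        }
        return delete;
    }

    public static void main(String[] args) {
        // First "x" (doc 0) is kept; docs 2 and 3 are duplicates of it.
        System.out.println(idsToDelete(List.of("x", "y", "x", "x", "z")));
    }
}
```

The returned ids can then be deleted in one batch, trading the sort for O(n) memory in the seen-set.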
Hi Everyone,
I have a special scenario where I frequently want to insert duplicate
documents into the index. For example, I know that I want 400 copies of the
same document. (I use the doc boost for something else, so I can't just add one
document and set the doc boost to 400.)
I would like to hac
o poll the community's opinion on good strategies for
identifying
duplicate documents in a lucene index.
You see, I have an index containing roughly 25 million lucene
documents. My task
requires me to work at sentence level so each lucene document actually
contains
exactly one sentence. T
: Yes, when I say "duplicate" sentences, they are exact copies of the same
: string.
you still haven't explained how you indexed these sentences. What do you
mean by "each lucene document actually contains exactly one sentence"?
Did you tokenize the sentence into one field? Do you have a field for
I'd have to see your indexing code to see if there are any obvious
performance gotchas there. If you can run your indexer under a
profiler (OptimizeIt, JProbe, or just the free one with java using
-Xprof), it will tell you in which methods most of your CPU time is
spent. If you're using StandardA
Hi David,
>>
I would like to poll the community's opinion on good strategies for identifying
duplicate documents in a lucene index.
>>
Do you mean 100% duplicates or some kind of similarity?
>>
Obviously the brute force method of pairwise compares would take forever.
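For 100% duplicates, a content hash avoids the pairwise blow-up: digest each document's text once, and two documents need comparing only if their digests collide. A sketch using SHA-256 (the trim/lowercase normalization is an assumption; pick whatever normalization defines "duplicate" for your data):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class ContentHash {
    // SHA-256 of the (lightly normalized) document text; exact duplicates
    // map to the same hex digest, so grouping is O(n) instead of O(n^2).
    static String digest(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] h = md.digest(
                    text.trim().toLowerCase().getBytes(StandardCharsets.UTF_8));
            return String.format("%064x", new BigInteger(1, h));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> groups = new HashMap<>();
        for (String s : new String[] {"A sentence.", "a sentence.", "Another."}) {
            groups.merge(digest(s), 1, Integer::sum);
        }
        System.out.println(groups.size() + " distinct contents");
    }
}
```

Storing the digest as an untokenized field at index time also lets you find all copies of a given document with a single term query at search time.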
r example), providing a fast way to find duplicates at
> search time.
>
> If you can give more details on your requirements, people in this list
> can probably come up with some pretty good solutions.
>
> -chris
>
> On 6/12/05, Dave Kor <[EMAIL PROTECTED]> wrote:
> > Hi
MAIL PROTECTED]> wrote:
> Hi,
>
> I would like to poll the community's opinion on good strategies for
> identifying
> duplicate documents in a lucene index.
>
> You see, I have an index containing roughly 25 million lucene documents. My
> task
> requires me to work
Hi,
I would like to poll the community's opinion on good strategies for identifying
duplicate documents in a lucene index.
You see, I have an index containing roughly 25 million lucene documents. My task
requires me to work at sentence level so each lucene document actually contains
exactl
Any tips on this issue?
Thanks
Marco
- Original Message -
From: Marco Dissel
To: java-user@lucene.apache.org
Sent: Friday, May 13, 2005 9:05 AM
Subject: finding potential duplicate documents
Hello
I've got many documents that are potentially duplicate (merging se
ys
comparing one document with the index. Is there a way to give back all the
potential duplicate documents in the index without iterating over every document in
the index and comparing it with the other documents in the index.
Thanks
Marco
---