Re: dealing with duplicates

Joe Calderon Mon, 10 Aug 2009 12:59:34 -0700

in case someone can help me with the query syntax, the relational
query i would use for this would be something like:


SELECT * FROM videos
WHERE
title LIKE 'family guy'
AND desc LIKE 'stewie%'
AND (
  ( is_dup = 0 )
  OR
  ( is_dup = 1 AND master_id NOT IN  -- master_id: assumed column holding the id of the dup's master
    (
    SELECT id FROM videos
    WHERE
    title LIKE 'family guy'
    AND desc LIKE 'stewie%'
    AND is_dup = 0
    )
  )
)
ORDER BY views
LIMIT 10

can a similar query be written in lucene or do i need to structure my
index differently to be able to do such a query?
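
to make the intent concrete, here is a minimal sketch against a toy sqlite table. it is only an illustration of the semantics, not anything solr or lucene provides. the column name master_id is an assumption (each dup is assumed to store the id of its master there), and the sample rows are made up. the sketch compares a dup's master id, not its own id, against the matched masters, since a dup's own id can never appear in a list of masters.

```python
import sqlite3

# Assumed schema: master_id is a hypothetical column holding, for each
# duplicate, the id of its master (NULL for masters themselves).
SCHEMA = """
CREATE TABLE videos (
    id INTEGER PRIMARY KEY,
    title TEXT,
    "desc" TEXT,          -- quoted because DESC is an SQL keyword
    is_dup INTEGER,
    master_id INTEGER,
    views INTEGER
)
"""

# Masters that match, plus duplicates that match but whose master did not.
QUERY = """
SELECT id FROM videos
WHERE title LIKE :title AND "desc" LIKE :descr
  AND (
    is_dup = 0
    OR (is_dup = 1 AND master_id NOT IN (
          SELECT id FROM videos
          WHERE title LIKE :title AND "desc" LIKE :descr AND is_dup = 0))
  )
ORDER BY views
LIMIT :limit
"""


def make_demo_db():
    """Build an in-memory table with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO videos VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "family guy", "stewie builds a time machine", 0, None, 50),
            (2, "family guy", "stewie time machine clip", 1, 1, 5),
            (3, "family guy", "brian writes a novel", 0, None, 70),
            (4, "family guy", "stewie reads brians novel", 1, 3, 20),
        ],
    )
    return conn


def search(conn, title, descr, limit=10):
    """Return matching ids, skipping dups whose master already matched."""
    params = {"title": title, "descr": descr, "limit": limit}
    return [row[0] for row in conn.execute(QUERY, params)]


if __name__ == "__main__":
    print(search(make_demo_db(), "family guy", "stewie%"))
```

in the toy data, row 2 matches the text but is dropped because its master (row 1) matched too, while row 4 is kept because its master (row 3) did not match the text.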

thx much

--joe


On Sat, Aug 1, 2009 at 9:15 AM, Joe Calderon<calderon....@gmail.com> wrote:
> hello, thanks for the response, i did take a look at that document but
> in my application i actually want the duplicates, as i mentioned, the
> matching text could be very different among cluster members, what
> joins them together is a similar set of numeric features.
>
> currently i do a query with fq=duplicate:0 and show a link to
> optionally show the "dupes" by querying for all dupes of the
> master id, however im currently missing any documents that matched the
> query but are duplicates of other masters not included in that result
> set.
>
> in a relational database (fulltext indexing aside) i would use a
> subquery, i imagine a similar approach could be used with lucene, i
> just dont know the syntax
>
> best,
>
> --joe
>
> On Fri, Jul 31, 2009 at 11:32 PM, Otis
> Gospodnetic<otis_gospodne...@yahoo.com> wrote:
>> Joe,
>>
>> Maybe we can take a step back first.  Would it be better if your index was 
>> cleaner and didn't have flagged duplicates in the first place?  If so, have 
>> you tried using http://wiki.apache.org/solr/Deduplication ?
>>
>>  Otis
>> --
>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>
>>
>>
>> ----- Original Message ----
>>> From: Joe Calderon <calderon....@gmail.com>
>>> To: solr-user@lucene.apache.org
>>> Sent: Friday, July 31, 2009 5:06:48 PM
>>> Subject: dealing with duplicates
>>>
>>> hello all, i have a collection of a few million documents; i have many
>>> duplicates in this collection. they have been clustered with a simple
>>> algorithm, i have a field called 'duplicate' which is 0 or 1 and
>>> fields called 'description', 'tags', 'meta'. documents are clustered on
>>> different criteria and the text i search against could be very
>>> different among members of a cluster.
>>>
>>> im currently using a dismax handler to search across the text fields
>>> with different boosts, and a filter query to restrict to masters
>>> (duplicate: 0)
>>>
>>> my question is then, how do i best query for documents which are
>>> masters OR match text but are not included in the matched set of
>>> masters?
>>>
>>> does this make sense?
>>
>>
>
