so, in case someone can help me with the query syntax, the relational query i would use for this would be something like:

SELECT *
FROM videos
WHERE title LIKE 'family guy'
  AND desc LIKE 'stewie%'
  AND ( ( is_dup = 0 )
        OR ( is_dup = 1
             AND id NOT IN ( SELECT id
                             FROM videos
                             WHERE title LIKE 'family guy'
                               AND desc LIKE 'stewie%'
                               AND is_dup = 0 ) ) )
ORDER BY views
LIMIT 10

can a similar query be written in lucene, or do i need to structure my
index differently to be able to do such a query?

thx much

--joe

On Sat, Aug 1, 2009 at 9:15 AM, Joe Calderon<calderon....@gmail.com> wrote:
> hello, thanks for the response, i did take a look at that document but
> in my application i actually want the duplicates, as i mentioned, the
> matching text could be very different among cluster members, what
> joins them together is a similar set of numeric features.
>
> currently i do a query with fq=duplicate:0 and show a link to
> optionally show the "dupes" by querying for all dupes of the
> master id, however i'm currently missing any documents that matched the
> query but are duplicates of other masters not included in that result
> set.
>
> in a relational database (fulltext indexing aside) i would use a
> subquery, i imagine a similar approach could be used with lucene, i
> just don't know the syntax
>
> best,
>
> --joe
>
> On Fri, Jul 31, 2009 at 11:32 PM, Otis
> Gospodnetic<otis_gospodne...@yahoo.com> wrote:
>> Joe,
>>
>> Maybe we can take a step back first. Would it be better if your index was
>> cleaner and didn't have flagged duplicates in the first place? If so, have
>> you tried using http://wiki.apache.org/solr/Deduplication ?
>>
>> Otis
>> --
>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>
>>
>>
>> ----- Original Message ----
>>> From: Joe Calderon <calderon....@gmail.com>
>>> To: solr-user@lucene.apache.org
>>> Sent: Friday, July 31, 2009 5:06:48 PM
>>> Subject: dealing with duplicates
>>>
>>> hello all, i have a collection of a few million documents; i have many
>>> duplicates in this collection. they have been clustered with a simple
>>> algorithm, i have a field called 'duplicate' which is 0 or 1 and
>>> fields called 'description, tags, meta', documents are clustered on
>>> different criteria and the text i search against could be very
>>> different among members of a cluster.
>>>
>>> i'm currently using a dismax handler to search across the text fields
>>> with different boosts, and a filter query to restrict to masters
>>> (duplicate: 0)
>>>
>>> my question is then, how do i best query for documents which are
>>> masters OR match text but are not included in the matched set of
>>> masters?
>>>
>>> does this make sense?
>>
>>
>
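
for anyone who wants to try the relational version of the question, here is a
minimal sketch using python's built-in sqlite3 module. it rests on a few
assumptions that are not in the thread: the table layout and sample rows are
made up, the text column is named `description` (DESC is a reserved word in
SQL), and each duplicate carries a hypothetical `master_id` column pointing at
its master. that last column is what the subquery is compared against here,
because a row with is_dup = 1 can never have its own id in a set of rows with
is_dup = 0, so comparing on `id` would not exclude anything.

```python
import sqlite3

# Illustrative query: matching masters, plus matching duplicates whose
# master did NOT itself match the text criteria.
# `master_id` and `description` are assumed column names (see note above).
QUERY = """
SELECT v.id
FROM videos AS v
WHERE v.title LIKE :title
  AND v.description LIKE :descr
  AND ( v.is_dup = 0
        OR ( v.is_dup = 1
             AND v.master_id NOT IN ( SELECT m.id
                                      FROM videos AS m
                                      WHERE m.title LIKE :title
                                        AND m.description LIKE :descr
                                        AND m.is_dup = 0 ) ) )
ORDER BY v.views
LIMIT :limit
"""


def build_demo_db():
    """Create an in-memory table with a handful of made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE videos ("
        " id INTEGER PRIMARY KEY, title TEXT, description TEXT,"
        " is_dup INTEGER, master_id INTEGER, views INTEGER)"
    )
    rows = [
        # id, title, description, is_dup, master_id, views
        (1, "family guy", "stewie sings", 0, None, 50),       # matching master
        (2, "family guy", "stewie sings again", 1, 1, 10),    # dup of matching master
        (3, "family guy", "peter falls", 0, None, 70),        # master, text does not match
        (4, "family guy", "stewie vs peter", 1, 3, 30),       # dup of non-matching master
        (5, "american dad", "stewie cameo", 0, None, 90),     # different title
    ]
    conn.executemany("INSERT INTO videos VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def find_videos(conn, title, descr, limit=10):
    """Return the ids selected by QUERY, in the order the query yields them."""
    params = {"title": title, "descr": descr, "limit": limit}
    return [row[0] for row in conn.execute(QUERY, params)]


if __name__ == "__main__":
    conn = build_demo_db()
    print(find_videos(conn, "family guy", "stewie%"))  # prints [4, 1]
```

with the sample rows, id 2 is dropped because its master (id 1) is already in
the result, while id 4 is kept because its master (id 3) did not match the
text. this only pins down the relational semantics; it says nothing about how
to express the same thing in lucene or solr.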