Hmm, wouldn't this fall under the general problem of identifying duplicates?
Would something like this meet your needs? (untested)

select
  -- outer query finds the ids for the duplicates
  a.key
from (
  -- inner query lists the duplicated values
  select count(*) as cnt, value
  from foo
  group by value
  having count(*) > 1
) z
join foo a on (a.value = z.value);

Here, table foo is your table elements, key is your id, and value is your element.

On Thu, Mar 20, 2014 at 7:03 AM, Jeff Storey <storey.j...@gmail.com> wrote:

> I have a table with 10 million rows and 2 columns - id (int) and element
> (string). I am trying to do a self join that finds any ids where the
> element values are the same, and my query looks like:
>
> select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1 JOIN
> elements e2 on e1.element = e2.element WHERE e1.id < e2.id;
>
> I tested this at a smaller scale and it works well. The problem is that
> with 10 million rows, this becomes a bit large and I've let it run for 90
> minutes and it was up to 80GB of disk space and still going. The original
> input data was only 500MB.
>
> Is this something I can optimize in hive? Or should I be considering a
> different approach to the problem instead?
>
> Any guidance here would be helpful. Thank you.
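If you want to sanity-check the logic before pointing it at 10 million rows, the same pattern can be exercised on toy data with SQLite from Python (used here purely as a convenient stand-in for Hive; the table and column names follow the original post, and the sample rows are hypothetical). This also illustrates why the self-join blows up: a value repeated k times yields k*(k-1)/2 pair rows, while the group-then-join yields only k rows.

```python
import sqlite3

# Build a tiny elements(id, element) table: 'aaa' appears 3 times,
# 'bbb' twice, 'ccc' once.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table elements (id integer, element text)")
cur.executemany(
    "insert into elements values (?, ?)",
    [(1, "aaa"), (2, "aaa"), (3, "aaa"), (4, "bbb"), (5, "bbb"), (6, "ccc")],
)

# Group-then-join: list every id whose element value occurs more than once.
dedup_sql = """
select a.id, a.element
from (
    select element
    from elements
    group by element
    having count(*) > 1
) z
join elements a on (a.element = z.element)
order by a.id
"""
rows = cur.execute(dedup_sql).fetchall()
print(rows)  # ids 1-5; 'ccc' is unique, so id 6 is excluded

# The original self-join returns one row per *pair* of matching ids:
# 3 pairs for 'aaa' plus 1 pair for 'bbb'.
pair_sql = """
select count(*)
from elements e1 join elements e2
  on e1.element = e2.element
where e1.id < e2.id
"""
pair_count = cur.execute(pair_sql).fetchone()[0]
print(pair_count)  # 4
```

The quadratic growth per hot key is likely what produced the 80GB of intermediate data on the real table.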