Hmm, would this not fall under the general problem of identifying
duplicates?

Would something like this meet your needs? (untested)

select  -- outer query finds the ids for the duplicates
    a.key
from (  -- inner query lists the duplicated values
     select
        value,
        count(*) as cnt
     from
        foo
     group by
        value
     having
        count(*) > 1
     ) z
     join foo a on (a.value = z.value)
;

Here, table foo stands in for your elements table,
key for your id column, and
value for your element column.
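To show the shape of the result, here is a minimal sketch of the same inner/outer query run against a toy foo table. It uses SQLite via Python purely for illustration (the original question is about Hive, and the sample data is made up), but the SQL is the same idea:

```python
import sqlite3

# Toy "foo" table with the key/value columns described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (key INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "b")],
)

# Inner query lists values occurring more than once; the join then
# recovers every key that carries one of those duplicated values.
rows = conn.execute("""
    SELECT a.key
    FROM (SELECT value, COUNT(*) AS cnt
          FROM foo
          GROUP BY value
          HAVING COUNT(*) > 1) z
    JOIN foo a ON a.value = z.value
    ORDER BY a.key
""").fetchall()

print([k for (k,) in rows])  # keys of rows whose value is duplicated
```

The inner GROUP BY only materializes one row per duplicated value, so the join fan-out stays proportional to the number of duplicate rows rather than squaring the whole table the way a full self-join does.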


On Thu, Mar 20, 2014 at 7:03 AM, Jeff Storey <storey.j...@gmail.com> wrote:

> I have a table with 10 million rows and 2 columns - id (int) and element
> (string). I am trying to do a self join that finds any ids where the
> element values are the same, and my query looks like:
>
> select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1 JOIN
> elements e2 on e1.element = e2.element WHERE e1.id < e2.id;
>
> I tested this at a smaller scale and it works well. The problem is that
> with 10 million rows, this becomes a bit large and I've let it run for 90
> minutes and it was up to 80GB of disk space and still going. The original
> input data was only 500MB.
>
> Is this something I can optimize in hive? Or should I be considering a
> different approach to the problem instead?
>
> Any guidance here would be helpful. Thank you.
>
