Thank you all for your suggestions. This group is the best.

I am working with the different options you guys suggested.

One big question I have is -

I am good at writing Oracle SQL queries. But the syntax with Hive is different. 
Especially - wiritng multiple SELECT statements in a single Hive Query has 
become a challenge. Can the group suggest any good tutorial that explains the 
basics of "Syntax to develop complex queries in Hive".

Regards,
Rajendra





On Thursday, September 11, 2014 2:48 AM, vivek thakre <vivek.tha...@gmail.com> 
wrote:
 


Considering that the records only differ by one column i.e if the first two 
columns are are unique (distinct), then you simply use group by with max as 
aggregation function to eliminate duplicates i,e 

select cno, sqno, max (date) 
from table 
group by cno, sqno

If the above assumption is not true i.e if cno and sqno are not unique and for 
a particular cno, you want to get sqno with latest date, then you can do inner 
join with max select query something like

select a.cno, a.sqno, a.date
from table a 
join (select cno, max(date)  as max_date from table group by cno) b
on a.cno=b.cno
and a.date = b.max_date



On Wed, Sep 10, 2014 at 3:39 PM, Nishant Kelkar <nishant....@gmail.com> wrote:

Try something like this then:
>
>
>SELECT A.cno, A.sqno, A.sorted_dates[A.size-1] AS latest_date
>FROM 
>(
>SELECT cno, sqno,
>SORT_ARRAY(COLLECT_SET(date)) AS sorted_dates, SIZE(COLLECT_SET(date)) AS size 
>FROM table GROUP BY cno, sqno
>) A;
>
>
>
>There are better ways of doing this, but this one's quick and dirty :)
>
>
>Best Regards,
>Nishant Kelkar
>
>
>On Wed, Sep 10, 2014 at 12:48 PM, Raj Hadoop <hadoop...@yahoo.com> wrote:
>
>sort_array returns in ascending order. so the first element cannot be the 
>largest date. the last element is the largest date.
>>
>>
>>
>>
>>On Wednesday, September 10, 2014 3:38 PM, Nishant Kelkar 
>><nishant....@gmail.com> wrote:
>> 
>>
>>
>>Hi Raj,
>>
>>
>>You'll have to change the format of your date to something like YYYY-MM-DD. 
>>For example, for "2-oct-2013" it will be 2013-10-02.
>>
>>
>>Best Regards,
>>Nishant Kelkar
>>
>>
>>
>>
>>
>>On Wed, Sep 10, 2014 at 11:48 AM, Raj Hadoop <hadoop...@yahoo.com> wrote:
>>
>>The
>>>
>>>SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
>>>
>>>is returning the lowest date. I need the largest date.
>>>
>>>
>>>
>>>--------------------------------------------
>>>On Wed, 9/10/14, Raj Hadoop <hadoop...@yahoo.com> wrote:
>>>
>>> Subject: Re: Remove duplicate records in Hive
>>> To: user@hive.apache.org
>>> Date: Wednesday, September 10, 2014, 2:41 PM
>>>
>>>
>>> Thanks. I will try it.
>>> --------------------------------------------
>>> On Wed, 9/10/14, Nishant Kelkar <nishant....@gmail.com>
>>> wrote:
>>>
>>>  Subject: Re: Remove
>>> duplicate records in Hive
>>>  To: user@hive.apache.org,
>>> hadoop...@yahoo.com
>>>  Date: Wednesday, September 10, 2014, 1:59
>>> PM
>>>
>>>  Hi
>>>
>>> Raj, 
>>>  You can do something
>>>  along these lines: 
>>>
>>>  SELECT
>>>  cno, sqno,
>>> SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
>>>  FROM table GROUP BY cno, sqno;
>>>  However, you have to make sure your
>>>  date format is such that sorting it gives you
>>> the most
>>>  recent date. The best way to do
>>> that is to have it in
>>>  format:
>>> YYYY-MM-DD.
>>>  Hope this helps.
>>>  Best Regards,Nishant
>>>
>>> Kelkar
>>>  On Wed, Sep 10, 2014 at
>>>  10:04 AM, Raj Hadoop <hadoop...@yahoo.com>
>>>  wrote:
>>>
>>>
>>>  Hi,
>>>
>>>
>>>
>>>  I have a requirement in Hive
>>> to remove duplicate records (
>>>  they differ
>>> only by one column i.e a date column) and keep
>>>  the latest date record.
>>>
>>>
>>>
>>>  Sample
>>> :
>>>
>>>  Hive Table :
>>>
>>>   d2 is a higher
>>>
>>>  cno,sqno,date
>>>
>>>
>>>
>>>  100 1 1-oct-2013
>>>
>>>  101 2 1-oct-2013
>>>
>>>  100 1 2-oct-2013
>>>
>>>  102 2 2-oct-2013
>>>
>>>
>>>
>>>
>>>
>>>  Output needed:
>>>
>>>
>>>
>>>  100 1 2-oct-2013
>>>
>>>  101 2 1-oct-2013
>>>
>>>  102 2 2-oct-2013
>>>
>>>
>>>
>>>  I am using
>>> Hive 0.11
>>>
>>>
>>>
>>>  Any suggestions please ?
>>>
>>>
>>>
>>>  Regards,
>>>
>>>
>>> Raj
>>>
>>>
>>>
>>>
>>
>>
>>
>

Reply via email to