I think you need to separate out the logic that does your group by
aggregations from the logic of then retrieving all of the other columns for
a single row from that set.
Something like:
select
tbl.myKeyColumn1,
tbl.myKeyColumn2,
tbl.otherValueColumn1,
tbl.otherValueColumn2,
agg.rowCount
from myTable tbl
join (
select --only the actual group by columns go in the subquery
max(last_updated) as maxUpdateTime,
count(*) as rowCount,
myKeyColumn1,
myKeyColumn2
from myTable
group by myKeyColumn1, myKeyColumn2
) agg
on
agg.maxUpdateTime = tbl.last_updated
and agg.myKeyColumn1 = tbl.myKeyColumn1
and agg.myKeyColumn2 = tbl.myKeyColumn2
hth,
a
On Fri, Apr 11, 2014 at 1:22 AM, Mohit Durgapal <[email protected]>wrote:
> Hi Nitin,
>
> The ddl is as follows:
>
> CREATE EXTERNAL TABLE user_logs(
> users_iduuid string,
> siteid int,
> site_catid int,
> stext string,
> catg int, // CATEGORY
> scatg int, // SUBCATEGORY
> catgname string,
> scatgname string,
> brand string, // PRODUCT BRAND NAME
> prrange string,
> curr int,
> pname string, // product name
> pid int, // product ID
> price string, //Product Price
> prodnbr int,
> mrp string, //MRP
> prURL string, //Product url
> prIMGURL string, //Product Image URL
> opr string,
> oid string,
> txsucc string,
> last_updated string //timestamp
> )
>
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
>
>
> I am looking for an output where I have top 10 products from each
> subcategory(on the basis of count) with all their information like product
> name, price, url, imgurl. Any there will be multiple entries for the same
> products (pids) within the same subcategory, In that case I have to pick
> the product info that is latest(by last_updated field).
>
>
> I have written a query but it is considering a multiple entries of
> product as different products If price or any other info changes for that
> product.
>
>
>
> select siteid,site_catid,catg,scatg,COLLECT_SET(PRODDESC) from
> (
> select
> PRODDESC,displays,siteid,site_catid,catg,scatg,rank(siteid,site_catid,catg,scatg)
> as row_number from
> (
> select count(*) as
> displays,siteid,site_catid,catg,scatg,CONCAT('{','pname:',pname,',price:',price,',','mrp:',mrp,',curr:',curr,',pid:',pid,'}')
> as PRODDESC from
> user_logs group by siteid,site_catid,catg,scatg,pid,pname,price,mrp,curr
> order by siteid,site_catid,catg,scatg,displays desc
> ) A
> ) B
> WHERE row_number < 10
> group by siteid,site_catid,catg,scatg
> order by siteid,site_catid,catg,scatg desc;
>
> The rank() method simply helps in fetching top 10 within a subcategory.
> Every time it encounters the same combination of
> siteid,site_catid,catg,scatg it increments row_number goes till 10.
>
> The problem above is that I am forced to put product info such as
> "pname,price,mrp,...." in the group by clause otherwise I will not be able
> to get that information in "select". Therefore, even if someone changes
> just the price a product(this happens very frequently) it is considered a
> different product by the above query. And that is something I don't want.
>
> I hope I have made it a little more clear? Thanks for your reply :)
>
>
>
> On Fri, Apr 11, 2014 at 12:45 PM, Nitin Pawar <[email protected]>wrote:
>
>> may be you can share your table ddl, your query and what output r u
>> looking for
>>
>>
>> On Fri, Apr 11, 2014 at 12:26 PM, Mohit Durgapal <[email protected]
>> > wrote:
>>
>>> I have a hive table partitioned by dates. It contains ecomm data in the
>>> format siteid,sitecatid,catid,subcatgid,pid,pname,pprice,pmrp,pdesc....
>>>
>>>
>>>
>>> What I need to do is to run a query on table above in hive for top 10
>>> products(count wise) in each sub category. What adds a bit more complexity
>>> is that I need all the information of the product. Now when I do group by
>>> with only subcatg,pid, I can only select the same fields. But I want all
>>> the data for that product coming in the same row as subcatg & prodid like
>>> prodname, proddesc,price, mrp,imageurl. And since some information like
>>> price & proddesc of a product keep on changing I want to pick the latest
>>> column values(according to a date field) for a pid if we are able to do a
>>> group by on subcatg,pid.
>>>
>>>
>>> I am not able to find a solution to my problem in hive. Any help would
>>> be much appreciated.
>>>
>>>
>>> Regards
>>> Mohit
>>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>