Re: hive query to select top 10 product of each subcategory and select most recent product info

Adrian Hains Fri, 11 Apr 2014 06:20:48 -0700

I think you need to separate out the logic that does your group by
aggregations from the logic of then retrieving all of the other columns for
a single row from that set.


Something like:
   select
         tbl.myKeyColumn1,
         tbl.myKeyColumn2,
         tbl.otherValueColumn1,
         tbl.otherValueColumn2,
         agg.rowCount
   from myTable tbl
   join (
      select --only the actual group by columns go in the subquery
         max(last_updated) as maxUpdateTime,
         count(*) as rowCount,
         myKeyColumn1,
         myKeyColumn2
      from myTable
      group by myKeyColumn1, myKeyColumn2
   ) agg
   on
      agg.maxUpdateTime = tbl.last_updated
      and agg.myKeyColumn1 = tbl.myKeyColumn1
      and agg.myKeyColumn2 = tbl.myKeyColumn2

hth,
a




On Fri, Apr 11, 2014 at 1:22 AM, Mohit Durgapal <[email protected]>wrote:

> Hi Nitin,
>
> The ddl is as follows:
>
> CREATE EXTERNAL TABLE user_logs(
> users_iduuid    string,
> siteid  int,
> site_catid      int,
> stext   string,
> catg    int,   // CATEGORY
> scatg   int, // SUBCATEGORY
> catgname        string,
> scatgname       string,
> brand   string,    // PRODUCT BRAND NAME
> prrange string,
> curr    int,
> pname   string, // product name
> pid     int,  // product ID
> price   string,  //Product Price
> prodnbr int,
> mrp     string,  //MRP
> prURL string, //Product url
> prIMGURL string, //Product Image URL
> opr     string,
> oid     string,
> txsucc  string,
> last_updated    string //timestamp
> )
>
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
>
>
> I am looking for an output where I have top 10 products from each
> subcategory(on the basis of count) with all their information like product
> name, price, url, imgurl. Any there will be multiple entries for the same
> products (pids) within the same subcategory, In that case I have to pick
> the product info that is latest(by last_updated field).
>
>
> I have written a query but it is considering a multiple entries of
> product as  different products If price or any other info changes for that
> product.
>
>
>
> select siteid,site_catid,catg,scatg,COLLECT_SET(PRODDESC) from
> (
>   select
> PRODDESC,displays,siteid,site_catid,catg,scatg,rank(siteid,site_catid,catg,scatg)
> as row_number from
>   (
>   select count(*) as
> displays,siteid,site_catid,catg,scatg,CONCAT('{','pname:',pname,',price:',price,',','mrp:',mrp,',curr:',curr,',pid:',pid,'}')
> as PRODDESC from
>   user_logs group by siteid,site_catid,catg,scatg,pid,pname,price,mrp,curr
> order by siteid,site_catid,catg,scatg,displays desc
>   ) A
>   ) B
> WHERE row_number < 10
> group by siteid,site_catid,catg,scatg
> order by siteid,site_catid,catg,scatg desc;
>
> The rank() method simply helps in fetching top 10 within a subcategory.
> Every time it encounters the same combination of
> siteid,site_catid,catg,scatg it increments row_number goes till 10.
>
> The problem above is that I am forced to put product info such as
> "pname,price,mrp,...." in the group by clause otherwise I will not be able
> to get  that information  in "select". Therefore, even if someone changes
> just the price a product(this happens very frequently) it is considered a
> different product by the above query. And that is something I don't want.
>
> I hope I have made it a little more clear?  Thanks for your reply :)
>
>
>
> On Fri, Apr 11, 2014 at 12:45 PM, Nitin Pawar <[email protected]>wrote:
>
>> may be you can share your table ddl, your query and what output r u
>> looking for
>>
>>
>> On Fri, Apr 11, 2014 at 12:26 PM, Mohit Durgapal <[email protected]
>> > wrote:
>>
>>> I have a hive table partitioned by dates. It contains ecomm data in the
>>> format siteid,sitecatid,catid,subcatgid,pid,pname,pprice,pmrp,pdesc....
>>>
>>>
>>>
>>> What I need to do is to run a query on table above in hive for top 10
>>> products(count wise) in each sub category. What adds a bit more complexity
>>> is that I need all the information of the product. Now when I do group by
>>> with only subcatg,pid, I can only select the same fields. But I want all
>>> the data for that product coming in the same row as subcatg & prodid like
>>> prodname, proddesc,price, mrp,imageurl. And since some information like
>>> price & proddesc of a product keep on changing I want to pick the latest
>>> column values(according to a date field) for a pid if we are able to do a
>>> group by on subcatg,pid.
>>>
>>>
>>> I am not able to find a solution to my problem in hive. Any help would
>>> be much appreciated.
>>>
>>>
>>> Regards
>>> Mohit
>>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>

Re: hive query to select top 10 product of each subcategory and select most recent product info

Reply via email to