Help needed: Out of memory with windowing functions

2014-08-20 Thread Furcy Pin
Hi all,

I have an event table with (user_id, timestamp, event)
and I'm trying to write a query to get the first 10 events for each user.


My query goes like this (the comparison in the WHERE clause appears to have been
mangled by the archive; the intent is the first 10 events, i.e. `<=`):

SELECT user_id, event
FROM (
  SELECT
    user_id,
    event,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY time) AS rownum
  FROM eventTable
) T
WHERE rownum <= 10

However, the table may contain millions of events for the same user, and I'm
getting an OutOfMemoryError in the reduce phase, inside the following method:

org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber$RowNumberBuffer.incr(GenericUDAFRowNumber.java:80)



It seems that the windowing functions were designed to store a buffer
containing all results for each PARTITION, and to write everything out only
once all rows of that partition have been read.

This makes windowing in Hive not very scalable...
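A minimal sketch (not Hive's actual implementation) of the difference between the two strategies, for a single partition of already-sorted rows: the buffered variant's memory grows with the partition size, while the streaming variant only keeps a counter.

```python
def row_number_buffered(rows):
    """Buffer the whole partition before emitting anything, like the
    behavior described above: memory grows with the partition size."""
    buffer = []
    for row in rows:
        buffer.append(row)  # millions of rows per user -> OOM risk
    return [(row, i + 1) for i, row in enumerate(buffer)]

def row_number_streaming(rows):
    """Emit each row as soon as it is read: only a counter is kept."""
    for i, row in enumerate(rows, start=1):
        yield (row, i)

events = ["login", "click", "click", "logout"]
# Both produce the same row numbers; only the memory profile differs.
assert row_number_buffered(events) == list(row_number_streaming(events))
```

The streaming version is possible for ROW_NUMBER because the function only needs the rows seen so far, not the whole partition; functions such as RANK over the full partition, or LEAD, genuinely need look-ahead, which is presumably why a general buffering design was chosen.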


My questions are:


a) Is there a reason why it was implemented this way rather than in a
streaming fashion?


b) Do you know how I could rewrite the query to avoid the problem (if
possible without having to write my own UDF)?
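One possible workaround on older Hive versions (a sketch, not a verified answer) is to bypass windowing entirely and push the top-N logic into a Hadoop Streaming script invoked through Hive's TRANSFORM clause, after a subquery with DISTRIBUTE BY user_id SORT BY user_id, time so that each reducer sees one user's events contiguously and in order. A hypothetical reducer script keeping only the first 10 events per user in O(1) memory:

```python
#!/usr/bin/env python
# Hypothetical streaming reducer for use with Hive's TRANSFORM:
# reads tab-separated (user_id, event) rows from stdin, assumed
# already distributed by user_id and sorted by (user_id, time),
# and emits at most the first n events per user without buffering.
import sys

def first_n_per_user(lines, n=10):
    current_user, count = None, 0
    for line in lines:
        user_id, event = line.rstrip("\n").split("\t", 1)
        if user_id != current_user:   # new user: reset the counter
            current_user, count = user_id, 0
        count += 1
        if count <= n:                # stream out, never buffer
            yield "%s\t%s" % (user_id, event)

if __name__ == "__main__":
    for out in first_n_per_user(sys.stdin):
        print(out)
```

It could then be wired in with something like
SELECT TRANSFORM (user_id, event) USING 'python first10.py' AS (user_id, event)
FROM (SELECT user_id, event FROM eventTable DISTRIBUTE BY user_id SORT BY user_id, time) t;
(syntax sketched from Hive's TRANSFORM documentation; the script name and exact clause are illustrative and untested here).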



Thanks,


Furcy


Re: Help needed: Out of memory with windowing functions

2014-08-20 Thread Ashutosh Chauhan
Hi Furcy,

Sorry that you ran into this; it is a non-trivial problem. The good news is
that Harish has done some good work in this area. See the following links.
All of this will be available in the upcoming release, whose date is under
discussion on the dev list.

1. https://issues.apache.org/jira/browse/HIVE-6999
2. https://issues.apache.org/jira/browse/HIVE-7062
3. https://issues.apache.org/jira/browse/HIVE-7063
4. https://issues.apache.org/jira/browse/HIVE-7143
5. https://issues.apache.org/jira/browse/HIVE-7344

Hope it helps,
Ashutosh



Re: Help needed: Out of memory with windowing functions

2014-08-20 Thread Furcy Pin
Thank you very much for your answer, Ashutosh,

it seems non-trivial indeed!

