Help needed: Out of memory with windowing functions
Hi all,

I have an event table with (user_id, timestamp, event) and I'm trying to write a query that returns the first 10 events for each user. My query looks like this:

    SELECT user_id, event
    FROM (
      SELECT user_id, event,
             ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp) AS rownum
      FROM eventTable
    ) T
    WHERE rownum <= 10

However, the table may contain millions of events for the same user, and I'm getting an OutOfMemoryError in the reduce phase, inside the following method:

    org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber$RowNumberBuffer.incr(GenericUDAFRowNumber.java:80)

It seems that the windowing functions were designed to store a buffer containing all results for each PARTITION, writing everything out only once all rows of that partition have been read. This makes windowing with Hive not very scalable...

My questions are:
a) Is there a reason why it was implemented this way rather than in a streaming fashion?
b) Do you know how I could rewrite the query to avoid the problem (if possible, without having to write my own UDF)?

Thanks,
Furcy
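[Editor's note: the buffering behaviour described above can be illustrated outside Hive. The sketch below, plain Python with hypothetical function names, contrasts a buffered ROW_NUMBER (memory grows with the partition, as in the reported stack trace) with a streaming one (a single counter per partition). It assumes input rows already sorted by (user_id, timestamp), as the reducer would see them.]

```python
from itertools import groupby
from operator import itemgetter

def row_number_buffered(rows):
    """Mimics the behaviour described above: hold every row of a
    partition in memory, then emit row numbers once the partition is
    complete. Memory use grows with the largest partition."""
    out = []
    for _user_id, group in groupby(rows, key=itemgetter(0)):
        buffered = list(group)  # whole partition materialized in memory
        for i, row in enumerate(buffered, start=1):
            out.append(row + (i,))
    return out

def row_number_streaming(rows):
    """A streaming alternative: ROW_NUMBER only needs a per-partition
    counter, so each row can be emitted as soon as it is read."""
    out = []
    prev_user, counter = None, 0
    for row in rows:
        counter = counter + 1 if row[0] == prev_user else 1
        prev_user = row[0]
        out.append(row + (counter,))
    return out
```

On sorted input, both produce identical numbering; only their memory profiles differ, which is exactly the scalability point raised above.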
Re: Help needed: Out of memory with windowing functions
Hi Furcy,

Sorry that you ran into this; it's a non-trivial problem. The good news is that Harish has done some good work in this area. See the following links. All of this will be available in the upcoming release, whose date is under discussion on the dev list.

1. https://issues.apache.org/jira/browse/HIVE-6999
2. https://issues.apache.org/jira/browse/HIVE-7062
3. https://issues.apache.org/jira/browse/HIVE-7063
4. https://issues.apache.org/jira/browse/HIVE-7143
5. https://issues.apache.org/jira/browse/HIVE-7344

Hope it helps,
Ashutosh

On Wed, Aug 20, 2014 at 5:34 AM, Furcy Pin furcy@flaminem.com wrote:
> [...]
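[Editor's note: until the streaming windowing work linked above lands, one possible workaround for the specific "first N per user" case is to bypass ROW_NUMBER entirely with a streaming script, e.g. via Hive's TRANSFORM ... USING after a DISTRIBUTE BY user_id SORT BY user_id, timestamp. The sketch below is a hypothetical reducer for that setup, not code from the thread; it assumes tab-separated (user_id, timestamp, event) lines arriving already clustered and sorted, and holds only one counter in memory.]

```python
import sys

def first_n_per_user(lines, n=10):
    """Streaming reducer: emit at most n events per user.

    Assumes each line is tab-separated (user_id, timestamp, event) and
    that lines arrive grouped by user_id and sorted by timestamp, as
    DISTRIBUTE BY / SORT BY would arrange. Only a single counter and
    the previous user_id are kept in memory.
    """
    prev_user, count = None, 0
    for line in lines:
        user_id, _ts, event = line.rstrip("\n").split("\t")
        if user_id != prev_user:
            prev_user, count = user_id, 0
        count += 1
        if count <= n:
            yield f"{user_id}\t{event}"

# In an actual Hive TRANSFORM script this would iterate over sys.stdin:
#     for out in first_n_per_user(sys.stdin):
#         print(out)
```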
Re: Help needed: Out of memory with windowing functions
Thank you very much for your answer, Ashutosh. It seems non-trivial indeed!

2014-08-20 17:51 GMT+02:00 Ashutosh Chauhan hashut...@apache.org:
> [...]