Re: Issues with LAST_VALUE aggregation UDF

Abhishek Gupta Tue, 11 Sep 2018 22:11:44 -0700

Thank you so much for the response Josh.

Will check on 4.14 and get back to you with the results.


Thanks

On Tue, Sep 11, 2018 at 8:45 PM Josh Elser <[email protected]> wrote:

> The versions you provided in the description make it sound like you're
> actually using HDP's distribution of Apache Phoenix, not an official
> Apache Phoenix release.
>
> Please test against an Apache Phoenix release or contact Hortonworks for
> support. It would not be unheard of that this issue has already been fixed.
>
> On 9/11/18 10:01 AM, Abhishek Gupta wrote:
> > Hi,
> >
> > We are trying to use the LAST_VALUE aggregation UDF in our Phoenix SQL
> > queries that serve REST APIs over analytics data. However we are seeing
> > a number of issues/limitations in this function.
> > First, there seems to be a bug where it fails for sparse columns that
> > have NULLs. Have created a JIRA ticket for that containing the details
> > of the exception and steps to reproduce
> > https://issues.apache.org/jira/browse/PHOENIX-4898
> >
> > Second, the function works in a way that it ignores all the group by
> > columns except the column in the order by clause in finding the last
> > value. For example
> > Following is a dataset (pk1,pk2,pk3,pk4 and pk5 are Primary key columns
> > and m1-m3 are metric columns, I intend to do a SUM on m2 and m3 and a
> > last value on m1
> >
> >     pk1 .   | pk2                       | pk3 .    | pk4            |
> >     pk5 .   | m1         | m2 | m3
> >
>  
> ------------------------------------------------------------------------------------------------------
> >     | test    | 201808010000    | app1    | plat1          | lang1  | 1
> >                | 10  | 100  |
> >     | test    | 201808010000    | app1    | plat1          | lang2  | 2
> >                | 10  | 100  |
> >     | test    | 201808010000    | app1    | plat2          | lang1  | 3
> >                | 10  | 100  |
> >     | test    | 201808010000    | app2    | plat1          | lang1  | 4
> >                | 10  | 100  |
> >     | test    | 201808010030    | app1    | plat1          | lang1  |
> >     10           | 10  | 100  |
> >     | test    | 201808010030    | app1    | plat1          | lang2  |
> >     20           | 10  | 100  |
> >     | test    | 201808010030    | app1    | plat2          | lang1  |
> >     30           | 10  | 100  |
> >     | test    | 201808010030    | app2    | plat1          | lang1  |
> >     40           | 10  | 100  |
> >     | test    | 201808010100    | app1    | plat1          | lang1  |
> >     100          | 10  | 100  |
> >     | test    | 201808010100    | app1    | plat1          | lang2  |
> >     200          | 10  | 100  |
> >     | test    | 201808010100    | app1    | plat2          | lang1  |
> >     300          | 10  | 100  |
> >     | test    | 201808010100    | app2    | plat1          | lang1  |
> >     400          | 10  | 100  |
> >     | test    | 201808010130    | app1    | plat1          | lang1  |
> >     1000         | 10  | 100  |
> >     | test    | 201808010130    | app1    | plat1          | lang2  |
> >     2000         | 10  | 100  |
> >     | test    | 201808010130    | app1    | plat2          | lang1  |
> >     3000         | 10  | 100  |
> >     | test    | 201808010130    | app2    | plat1          | lang1  |
> >     4000         | 10  | 100  |
> >     | test    | 201808010200    | app1    | plat1          | lang1  |
> >     10000        | 10  | 100  |
> >     | test    | 201808010200    | app1    | plat1          | lang2  |
> >     20000        | 10  | 100  |
> >     | test    | 201808010200    | app1    | plat2          | lang1  |
> >     30000        | 10  | 100  |
> >     | test    | 201808010200    | app2    | plat1          | lang1  |
> >     40000        | 10  | 100  |
> >
> >
> > If I run the following query (using all primary key columns in group by)
> >
> >     select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as
> >     t,pk3,pk4,pk5, last_value(m1) within group (order by pk2 asc) as oi,
> >     sum(m2), sum(m3) from test_table group by pk1,t,pk3,pk4,pk5;
> >
> > I get the correct result for last value as for each pk2 value which is
> > used in order by there is only 1 row
> >
> >
>  
> +-------------+---------+----------------+--------+--------+----------+----------+
> >     |      T      | pk3  | pk4  |  pk5  |   OI   | SUM(M2)  | SUM(M3)  |
> >
>  
> +-------------+---------+----------------+--------+--------+----------+----------+
> >     | 2018080100  | app1    | plat1          | lang1  | 10     | 20
> >       | 200      |
> >     | 2018080100  | app1    | plat1          | lang2  | 20     | 20
> >       | 200      |
> >     | 2018080100  | app1    | plat2          | lang1  | 30     | 20
> >       | 200      |
> >     | 2018080100  | app2    | plat1          | lang1  | 40     | 20
> >       | 200      |
> >     | 2018080101  | app1    | plat1          | lang1  | 1000   | 20
> >       | 200      |
> >     | 2018080101  | app1    | plat1          | lang2  | 2000   | 20
> >       | 200      |
> >     | 2018080101  | app1    | plat2          | lang1  | 3000   | 20
> >       | 200      |
> >     | 2018080101  | app2    | plat1          | lang1  | 4000   | 20
> >       | 200      |
> >     | 2018080102  | app1    | plat1          | lang1  | 10000  | 20
> >       | 100      |
> >     | 2018080102  | app1    | plat1          | lang2  | 20000  | 10
> >       | 100      |
> >     | 2018080102  | app1    | plat2          | lang1  | 30000  | 10
> >       | 100      |
> >     | 2018080102  | app2    | plat1          | lang1  | 40000  | 10
> >       | 100      |
> >
>  
> +-------------+---------+----------------+--------+--------+----------+----------+
> >
> >
> > However if I do I group by on less than all the primary columns the
> > LAST_VALUE function ignores the rest of the group by columns in sorting
> > and returns incorrect last_value
> >
> >     select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as t, pk3,
> >     last_value(m1) within group (order by pk2 asc) as oi, sum(m2),
> >     sum(m3) from test_table group by pk1,t,pk3;
> >
> >
> >     +-------------+---------+--------+----------+----------+
> >     |      T      | pk3  |   OI   | SUM(M2)  | SUM(M3)  |
> >     +-------------+---------+--------+----------+----------+
> >     | 2018080100  | app1    | 10     | 60       | 600      |
> >     | 2018080100  | app2    | 40     | 20       | 200      |
> >     | 2018080101  | app1    | 1000   | 60       | 600      |
> >     | 2018080101  | app2    | 4000   | 20       | 200      |
> >     | 2018080102  | app1    | 10000  | 40       | 300      |
> >     | 2018080102  | app2    | 40000  | 10       | 100      |
> >     +-------------+---------+--------+----------+----------+
> >
> > So instead of taking the last value of the group formed by  2018080100
> > and app1 i.e 30 it is picking the first i.e 10.
> >   Thanks,
> > Abhishek
> >
>

Re: Issues with LAST_VALUE aggregation UDF

Reply via email to