Re: Issues with LAST_VALUE aggregation UDF
Thank you so much for the response Josh. Will check on 4.14 and get back to you with the results. Thanks On Tue, Sep 11, 2018 at 8:45 PM Josh Elser wrote: > The versions you provided in the description make it sound like you're > actually using HDP's distribution of Apache Phoenix, not an official > Apache Phoenix release. > > Please test against an Apache Phoenix release or contact Hortonworks for > support. It would not be unheard of that this issue has already been fixed. > > On 9/11/18 10:01 AM, Abhishek Gupta wrote: > > Hi, > > > > We are trying to use the LAST_VALUE aggregation UDF in our Phoenix SQL > > queries that serve REST APIs over analytics data. However we are seeing > > a number of issues/limitations in this function. > > First, there seems to be a bug where it fails for sparse columns that > > have NULLs. Have created a JIRA ticket for that containing the details > > of the exception and steps to reproduce > > https://issues.apache.org/jira/browse/PHOENIX-4898 > > > > Second, the function works in a way that it ignores all the group by > > columns except the column in the order by clause in finding the last > > value. For example > > Following is a dataset (pk1,pk2,pk3,pk4 and pk5 are Primary key columns > > and m1-m3 are metric columns, I intend to do a SUM on m2 and m3 and a > > last value on m1 > > > > pk1 . | pk2 | pk3 .| pk4| > > pk5 . | m1 | m2 | m3 > > > > -- > > | test| 20180801| app1| plat1 | lang1 | 1 > >| 10 | 100 | > > | test| 20180801| app1| plat1 | lang2 | 2 > >| 10 | 100 | > > | test| 20180801| app1| plat2 | lang1 | 3 > >| 10 | 100 | > > | test| 20180801| app2| plat1 | lang1 | 4 > >| 10 | 100 | > > | test| 201808010030| app1| plat1 | lang1 | > > 10 | 10 | 100 | > > | test| 201808010030| app1| plat1 | lang2 | > > 20 | 10 | 100 | > > | test| 201808010030| app1| plat2 | lang1 | > > 30 | 10 | 100 | > > | test| 201808010030| app2| plat1 | lang1 | > > 40 | 10 | 100 | > > | test| 201808010100| app1| plat1 | lang1 | > > 100 | 10 | 100 | > > | test| 201808010100| app1| plat1 | lang2 | > > 200 | 10 | 100 | > > | test| 201808010100| app1| plat2 | lang1 | > > 300 | 10 | 100 | > > | test| 201808010100| app2| plat1 | lang1 | > > 400 | 10 | 100 | > > | test| 201808010130| app1| plat1 | lang1 | > > 1000 | 10 | 100 | > > | test| 201808010130| app1| plat1 | lang2 | > > 2000 | 10 | 100 | > > | test| 201808010130| app1| plat2 | lang1 | > > 3000 | 10 | 100 | > > | test| 201808010130| app2| plat1 | lang1 | > > 4000 | 10 | 100 | > > | test| 201808010200| app1| plat1 | lang1 | > > 1| 10 | 100 | > > | test| 201808010200| app1| plat1 | lang2 | > > 2| 10 | 100 | > > | test| 201808010200| app1| plat2 | lang1 | > > 3| 10 | 100 | > > | test| 201808010200| app2| plat1 | lang1 | > > 4| 10 | 100 | > > > > > > If I run the following query (using all primary key columns in group by) > > > > select TO_CHAR(TO_DATE(pk2,'MMddHHmm'),'MMddHH') as > > t,pk3,pk4,pk5, last_value(m1) within group (order by pk2 asc) as oi, > > sum(m2), sum(m3) from test_table group by pk1,t,pk3,pk4,pk5; > > > > I get the correct result for last value as for each pk2 value which is > > used in order by there is only 1 row > > > > > > +-+-++++--+--+ > > | T | pk3 | pk4 | pk5 | OI | SUM(M2) | SUM(M3) | > > > > +-+-++++--+--+ > > | 2018080100 | app1| plat1 | lang1 | 10 | 20 > > | 200 | > > | 2018080100 | app1| plat1 | lang2 | 20 | 20 > > | 200 | > > | 2018080100 | app1| plat2 | lang1 | 30 | 20 > > | 200 | > > | 2018080100 | app2| plat1 | lang1 | 40 | 20 > > | 200 | > > | 2018080101 | app1| plat1 | lang1 | 1000 | 20 > > | 200 | > > | 2018080101 | app1| plat1 | lang2 |
Re: Issues with LAST_VALUE aggregation UDF
The versions you provided in the description make it sound like you're actually using HDP's distribution of Apache Phoenix, not an official Apache Phoenix release. Please test against an Apache Phoenix release or contact Hortonworks for support. It would not be unheard of that this issue has already been fixed. On 9/11/18 10:01 AM, Abhishek Gupta wrote: Hi, We are trying to use the LAST_VALUE aggregation UDF in our Phoenix SQL queries that serve REST APIs over analytics data. However we are seeing a number of issues/limitations in this function. First, there seems to be a bug where it fails for sparse columns that have NULLs. Have created a JIRA ticket for that containing the details of the exception and steps to reproduce https://issues.apache.org/jira/browse/PHOENIX-4898 Second, the function works in a way that it ignores all the group by columns except the column in the order by clause in finding the last value. For example Following is a dataset (pk1,pk2,pk3,pk4 and pk5 are Primary key columns and m1-m3 are metric columns, I intend to do a SUM on m2 and m3 and a last value on m1 pk1 . | pk2 | pk3 . | pk4 | pk5 . | m1 | m2 | m3 -- | test | 20180801 | app1 | plat1 | lang1 | 1 | 10 | 100 | | test | 20180801 | app1 | plat1 | lang2 | 2 | 10 | 100 | | test | 20180801 | app1 | plat2 | lang1 | 3 | 10 | 100 | | test | 20180801 | app2 | plat1 | lang1 | 4 | 10 | 100 | | test | 201808010030 | app1 | plat1 | lang1 | 10 | 10 | 100 | | test | 201808010030 | app1 | plat1 | lang2 | 20 | 10 | 100 | | test | 201808010030 | app1 | plat2 | lang1 | 30 | 10 | 100 | | test | 201808010030 | app2 | plat1 | lang1 | 40 | 10 | 100 | | test | 201808010100 | app1 | plat1 | lang1 | 100 | 10 | 100 | | test | 201808010100 | app1 | plat1 | lang2 | 200 | 10 | 100 | | test | 201808010100 | app1 | plat2 | lang1 | 300 | 10 | 100 | | test | 201808010100 | app2 | plat1 | lang1 | 400 | 10 | 100 | | test | 201808010130 | app1 | plat1 | lang1 | 1000 | 10 | 100 | | test | 201808010130 | app1 | plat1 | lang2 | 2000 | 10 | 100 | | test | 201808010130 | app1 | plat2 | lang1 | 3000 | 10 | 100 | | test | 201808010130 | app2 | plat1 | lang1 | 4000 | 10 | 100 | | test | 201808010200 | app1 | plat1 | lang1 | 1 | 10 | 100 | | test | 201808010200 | app1 | plat1 | lang2 | 2 | 10 | 100 | | test | 201808010200 | app1 | plat2 | lang1 | 3 | 10 | 100 | | test | 201808010200 | app2 | plat1 | lang1 | 4 | 10 | 100 | If I run the following query (using all primary key columns in group by) select TO_CHAR(TO_DATE(pk2,'MMddHHmm'),'MMddHH') as t,pk3,pk4,pk5, last_value(m1) within group (order by pk2 asc) as oi, sum(m2), sum(m3) from test_table group by pk1,t,pk3,pk4,pk5; I get the correct result for last value as for each pk2 value which is used in order by there is only 1 row +-+-++++--+--+ | T | pk3 | pk4 | pk5 | OI | SUM(M2) | SUM(M3) | +-+-++++--+--+ | 2018080100 | app1 | plat1 | lang1 | 10 | 20 | 200 | | 2018080100 | app1 | plat1 | lang2 | 20 | 20 | 200 | | 2018080100 | app1 | plat2 | lang1 | 30 | 20 | 200 | | 2018080100 | app2 | plat1 | lang1 | 40 | 20 | 200 | | 2018080101 | app1 | plat1 | lang1 | 1000 | 20 | 200 | | 2018080101 | app1 | plat1 | lang2 | 2000 | 20 | 200 | | 2018080101 | app1 | plat2 | lang1 | 3000 | 20 | 200 | | 2018080101 | app2 | plat1 | lang1 | 4000 | 20 | 200 | | 2018080102 | app1 | plat1 | lang1 | 1 | 20 | 100 | | 2018080102 | app1 | plat1 | lang2 | 2 | 10 | 100 | | 2018080102 | app1 | plat2 | lang1 |
Issues with LAST_VALUE aggregation UDF
Hi, We are trying to use the LAST_VALUE aggregation UDF in our Phoenix SQL queries that serve REST APIs over analytics data. However we are seeing a number of issues/limitations in this function. First, there seems to be a bug where it fails for sparse columns that have NULLs. Have created a JIRA ticket for that containing the details of the exception and steps to reproduce https://issues.apache.org/jira/browse/PHOENIX-4898 Second, the function works in a way that it ignores all the group by columns except the column in the order by clause in finding the last value. For example Following is a dataset (pk1,pk2,pk3,pk4 and pk5 are Primary key columns and m1-m3 are metric columns, I intend to do a SUM on m2 and m3 and a last value on m1 pk1 . | pk2 | pk3 .| pk4| pk5 . | > m1 | m2 | m3 > > -- > | test| 20180801| app1| plat1 | lang1 | 1 > | 10 | 100 | > | test| 20180801| app1| plat1 | lang2 | 2 > | 10 | 100 | > | test| 20180801| app1| plat2 | lang1 | 3 > | 10 | 100 | > | test| 20180801| app2| plat1 | lang1 | 4 > | 10 | 100 | > | test| 201808010030| app1| plat1 | lang1 | 10 > | 10 | 100 | > | test| 201808010030| app1| plat1 | lang2 | 20 > | 10 | 100 | > | test| 201808010030| app1| plat2 | lang1 | 30 > | 10 | 100 | > | test| 201808010030| app2| plat1 | lang1 | 40 > | 10 | 100 | > | test| 201808010100| app1| plat1 | lang1 | 100 > | 10 | 100 | > | test| 201808010100| app1| plat1 | lang2 | 200 > | 10 | 100 | > | test| 201808010100| app1| plat2 | lang1 | 300 > | 10 | 100 | > | test| 201808010100| app2| plat1 | lang1 | 400 > | 10 | 100 | > | test| 201808010130| app1| plat1 | lang1 | 1000 > | 10 | 100 | > | test| 201808010130| app1| plat1 | lang2 | 2000 > | 10 | 100 | > | test| 201808010130| app1| plat2 | lang1 | 3000 > | 10 | 100 | > | test| 201808010130| app2| plat1 | lang1 | 4000 > | 10 | 100 | > | test| 201808010200| app1| plat1 | lang1 | 1 > | 10 | 100 | > | test| 201808010200| app1| plat1 | lang2 | 2 > | 10 | 100 | > | test| 201808010200| app1| plat2 | lang1 | 3 > | 10 | 100 | > | test| 201808010200| app2| plat1 | lang1 | 4 > | 10 | 100 | If I run the following query (using all primary key columns in group by) select TO_CHAR(TO_DATE(pk2,'MMddHHmm'),'MMddHH') as t,pk3,pk4,pk5, > last_value(m1) within group (order by pk2 asc) as oi, sum(m2), sum(m3) from > test_table group by pk1,t,pk3,pk4,pk5; I get the correct result for last value as for each pk2 value which is used in order by there is only 1 row > > +-+-++++--+--+ > | T | pk3 | pk4 | pk5 | OI | SUM(M2) | SUM(M3) | > > +-+-++++--+--+ > | 2018080100 | app1| plat1 | lang1 | 10 | 20 | > 200 | > | 2018080100 | app1| plat1 | lang2 | 20 | 20 | > 200 | > | 2018080100 | app1| plat2 | lang1 | 30 | 20 | > 200 | > | 2018080100 | app2| plat1 | lang1 | 40 | 20 | > 200 | > | 2018080101 | app1| plat1 | lang1 | 1000 | 20 | > 200 | > | 2018080101 | app1| plat1 | lang2 | 2000 | 20 | > 200 | > | 2018080101 | app1| plat2 | lang1 | 3000 | 20 | > 200 | > | 2018080101 | app2| plat1 | lang1 | 4000 | 20 | > 200 | > | 2018080102 | app1| plat1 | lang1 | 1 | 20 | > 100 | > | 2018080102 | app1| plat1 | lang2 | 2 | 10 | > 100 | > | 2018080102 | app1| plat2 | lang1 | 3 | 10 | > 100 | > | 2018080102 | app2| plat1 | lang1 | 4 | 10 | > 100 | > > +-+-++++--+--+ However if I do I group by on less than all the primary columns the LAST_VALUE function ignores the rest of the group by columns in sorting and returns incorrect last_value select TO_CHAR(TO_DATE(pk2,'MMddHHmm'),'MMddHH') as t, pk3, > last_value(m1) within group (order by pk2 asc) as oi, sum(m2), sum(m3) from > test_table group by pk1,t,pk3; +-+-++--+--+ > |