Ah I think Thomas is right. I read the expected results and the query too quickly, so my comment about the asc/desc is probably wrong. Clearly my point about analytic functions being tricky holds true :)
On Wed, Jun 21, 2017 at 12:12 PM, Thomas Tauber-Marshall <[email protected]> wrote: > > > On Wed, Jun 21, 2017 at 1:52 PM Ravi Kanth <[email protected]> wrote: >> >> Thomas, >> >> The version of Impala we are using is 2.5.0 with CDH 5.7.0 and I see >> ignore nulls has been added in Impala 2.7.0. And, does adding ignore nulls >> would make a big difference in the expected result? > > > That's too bad. I think that 'ignore nulls' would give you what you want - > the problem with the query that you posted is that it eliminates rows that > don't match the where clause, so for example the row with "Zero" in it is > eliminated because it is filtered out by the "where a is not null", whereas > "ignore nulls" only affects the values that could be returned by the > specific analytic function that the ignore is applied to. > >> >> >> Ravi >> >> On 21 June 2017 at 11:20, Thomas Tauber-Marshall <[email protected]> >> wrote: >>> >>> Ravi, >>> >>> Instead of using the "where ... is not null", have you tried >>> 'last_value(... ignore nulls)' >>> >>> On Wed, Jun 21, 2017 at 1:08 PM Ravi Kanth <[email protected]> >>> wrote: >>>> >>>> Antoni, >>>> >>>> The problem in using last_value function() as far as I observed is, if I >>>> use it on multiple columns in a single query, its not retrieving results as >>>> expected. >>>> >>>> Input: >>>> >>>> ID (Int)Date_Time (timestamp)A (Int)B (String)C (String) >>>> 101NULLNULL >>>> 112HiNULL >>>> 134HelloHi >>>> 125NULLNULL >>>> 14NULLNULLZero >>>> >>>> Expected Output: >>>> >>>> >>>> >>>> ID (Int)A (Int)B (String)C (String) >>>> 14HelloZero >>>> >>>> >>>> Query executed: >>>> >>>> select id, last_value(a) over(partition by id order by date_time desc) >>>> as a, last_value(b) over(partition by id order by date_time desc) as b, >>>> last_value(c) over(partition by id order by date_time desc) as c from >>>> udf_test where a is not null and b is not null and c is not null; >>>> >>>> >>>> >>>> Output I am getting: >>>> >>>> +----+---+-------+----+ >>>> >>>> | id | a | b | c | >>>> >>>> +----+---+-------+----+ >>>> >>>> | 1 | 4 | Hello | Hi || >>>> >>>> +----+---+-------+----+ >>>> >>>> >>>> Hopefully, I am clear with the problem above. >>>> >>>> Thanks, >>>> Ravi >>>> >>>> On 20 June 2017 at 22:05, Ravi Kanth <[email protected]> wrote: >>>>> >>>>> Antoni, >>>>> >>>>> Thanks for the suggestion. Let me have a look at it and hopefully we >>>>> can use it in our use case. >>>>> >>>>> Thanks, >>>>> Ravi >>>>> >>>>> On Tue, Jun 20, 2017 at 21:53 Antoni Ivanov <[email protected]> wrote: >>>>>> >>>>>> Hi Ravi, >>>>>> >>>>>> I am curious why you are not using already existing last_value >>>>>> function in Impala to get "latest non null value for the column” >>>>>> >>>>>> e.g >>>>>> last_value(column_a ignore nulls) over(partition by ID order by >>>>>> Date_Time) >>>>>> >>>>>> Thanks, >>>>>> Antoni >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Jun 21, 2017, at 1:17 AM, Tim Armstrong <[email protected]> >>>>>> wrote: >>>>>> >>>>>> This was double-posted to >>>>>> http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Creating-Impala-UDA/m-p/56201/highlight/false#M3073 >>>>>> also. I'll continue the discussion here. >>>>>> >>>>>> > Can we have the flexibility of declaring the variable globally in >>>>>> > UDF? Globally, I mean outside the function? >>>>>> >>>>>> > And, the reason I am declaring a static variable is to restore the >>>>>> > value of timestamp for every record so that I can perform a comparison >>>>>> > of >>>>>> > the timestamps. Is there an alternative approach for this? >>>>>> >>>>>> Updating a global or static variable in a UDAF is guaranteed not to do >>>>>> what you expect - the function can be invoked concurrently by multiple >>>>>> threads. >>>>>> >>>>>> It seems like you probably want to store some additional state in the >>>>>> intermediate value. There are some sample UDAs here (see Avg()) where >>>>>> additional intermediate state is stored in a StringVal: >>>>>> https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.cc#L61 >>>>>> >>>>>> >>>>>> On Tue, Jun 20, 2017 at 2:40 PM, Ravi Kanth <[email protected]> >>>>>> wrote: >>>>>>> >>>>>>> Thanks Bharath. Can you check if the logic I am implementing is >>>>>>> correct or needed any modification in it as well? I am very new to >>>>>>> Impala >>>>>>> UDF & C++ and having some hard time figuring out the problems. >>>>>>> >>>>>>> On 20 June 2017 at 14:27, Bharath Vissapragada >>>>>>> <[email protected]> wrote: >>>>>>>> >>>>>>>> You need to allocate memory for tsTemp, else it can segfault. That >>>>>>>> could be the issue here. >>>>>>>> >>>>>>>> static TimestampVal* tsTemp; >>>>>>>> tsTemp->date = 0; >>>>>>>> tsTemp->time_of_day = 0; >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Jun 20, 2017 at 2:15 PM, Ravi Kanth >>>>>>>> <[email protected]> wrote: >>>>>>>>> >>>>>>>>> Hi All, >>>>>>>>> We are using Impala to do various processings in our systems. We >>>>>>>>> have a requirement recently, wherein we have to handle the updates on >>>>>>>>> the >>>>>>>>> events i.e, we have an 'e_update' table which has the partial updates >>>>>>>>> received for various events. The fields that are not updated are being >>>>>>>>> stored as NULL values. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Ex: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> ID (Int) Date_Time (timestamp) A (Int) B (String) C (String) >>>>>>>>> 1 0 1 NULL NULL >>>>>>>>> 1 1 2 Hi NULL >>>>>>>>> 1 3 4 Hello Hi >>>>>>>>> 1 2 5 NULL NULL >>>>>>>>> 1 4 NULL NULL Zero >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> P.S: Please consider Date_time as valid timestamp type values. For >>>>>>>>> easy understanding, mentioned them as 0,1,2,3,4,5 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> As seen in the above table, the events have a unique id and as we >>>>>>>>> get an update to a particular event, we are storing the date_time at >>>>>>>>> which >>>>>>>>> update has happened and also storing the partial updated values. >>>>>>>>> Apart from >>>>>>>>> the updated values, the rest are stored as NULL values. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> We are planning to mimic inplace updates on the table, so that it >>>>>>>>> would retrieve the resulting table as follows using the query below: >>>>>>>>> We >>>>>>>>> don't delete the data. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> > SELECT id, current_val(A,date_time) as A, >>>>>>>>> > current_val(B,date_time) as B, current_val(C,date_time) as C from >>>>>>>>> > e_update >>>>>>>>> > GROUP BY ID; >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> where, current_val is a custom impala UDA we are planning to >>>>>>>>> implement. i.e. get latest non null value for the column. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> ID (Int) A (Int) B (String) C (String) >>>>>>>>> 1 4 Hello Zero >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Implemented current_val UDA: >>>>>>>>> The below code is only for int type inputs: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> uda-currentval.h >>>>>>>>> >>>>>>>>> //This is a sample for retrieving the current value of e_update >>>>>>>>> table >>>>>>>>> // >>>>>>>>> void CurrentValueInit(FunctionContext* context, IntVal* val); >>>>>>>>> void CurrentValueUpdate(FunctionContext* context, const IntVal& >>>>>>>>> input, const TimestampVal& ts, IntVal* val); >>>>>>>>> void CurrentValueMerge(FunctionContext* context, const IntVal& src, >>>>>>>>> IntVal* dst); >>>>>>>>> IntVal CurrentValueFinalize(FunctionContext* context, const IntVal& >>>>>>>>> val); >>>>>>>>> >>>>>>>>> uda-currentval.cc >>>>>>>>> >>>>>>>>> // >>>>>>>>> ----------------------------------------------------------------------------------------------- >>>>>>>>> // This is a sample for retrieving the current value of e_update >>>>>>>>> table >>>>>>>>> >>>>>>>>> //----------------------------------------------------------------------------------------------- >>>>>>>>> void CurrentValueInit(FunctionContext* context, IntVal* val) { >>>>>>>>> val->is_null = false; >>>>>>>>> val->val = 0; >>>>>>>>> } >>>>>>>>> >>>>>>>>> void CurrentValueUpdate(FunctionContext* context, const IntVal& >>>>>>>>> input, const TimestampVal& ts, IntVal* val) { >>>>>>>>> static TimestampVal* tsTemp; >>>>>>>>> tsTemp->date = 0; >>>>>>>>> tsTemp->time_of_day = 0; >>>>>>>>> if(tsTemp->date==0 && tsTemp->time_of_day==0){ >>>>>>>>> tsTemp->date = ts.date; >>>>>>>>> tsTemp->time_of_day = ts.time_of_day; >>>>>>>>> val->val = input.val; >>>>>>>>> return; >>>>>>>>> } >>>>>>>>> if(ts.date > tsTemp->date && ts.time_of_day > >>>>>>>>> tsTemp->time_of_day){ >>>>>>>>> tsTemp->date = ts.date; >>>>>>>>> tsTemp->time_of_day = ts.time_of_day; >>>>>>>>> val->val = input.val; >>>>>>>>> return; >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> void CurrentValueMerge(FunctionContext* context, const IntVal& src, >>>>>>>>> IntVal* dst) { >>>>>>>>> dst->val += src.val; >>>>>>>>> } >>>>>>>>> >>>>>>>>> IntVal CurrentValueFinalize(FunctionContext* context, const IntVal& >>>>>>>>> val) { >>>>>>>>> return val; >>>>>>>>> } >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> We are able to build and create an aggregate function in impala, >>>>>>>>> but when trying to run the select query similar to the one above, it >>>>>>>>> is >>>>>>>>> bringing down couple of impala deamons and throwing the error below >>>>>>>>> and >>>>>>>>> getting terminated. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> WARNINGS: Cancelled due to unreachable impalad(s): >>>>>>>>> hadoop102.**.**.**.com:22000 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> We have impalad running on 14 instances. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Can someone help resolve us this problem and a better way to >>>>>>>>> achieve a solution for the scenario explained. >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>> >> >
