Re: [Spark SQL] Null when trying to use corr() with a Window

Sean Owen Mon, 28 Feb 2022 07:33:57 -0800

The results make sense, then. You want one correlation per group, right?
That is, a correlation over the per-ID sums within each group. In that case
currentRow is wrong; the frame needs to run from unbounded preceding to
unbounded following.
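
To illustrate the difference between the two frames, here is a minimal
plain-Scala sketch (no Spark involved). The names CorrFrames and pearson are
hypothetical helpers of mine, not Spark API. It takes the group2 values from
the quoted output below, and the assumption is that a correlation over fewer
than two points is undefined, which Spark shows as null in that output.

```scala
object CorrFrames {
  // Pearson correlation of paired samples; None when it is undefined
  // (fewer than two points, or zero variance in either column).
  def pearson(points: Seq[(Double, Double)]): Option[Double] = {
    val n = points.size
    if (n < 2) None
    else {
      val meanX = points.map(_._1).sum / n
      val meanY = points.map(_._2).sum / n
      val cov   = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
      val varX  = points.map { case (x, _) => math.pow(x - meanX, 2) }.sum
      val varY  = points.map { case (_, y) => math.pow(y - meanY, 2) }.sum
      if (varX == 0.0 || varY == 0.0) None
      else Some(cov / math.sqrt(varX * varY))
    }
  }

  // (count1Sum, count2Sum) for group2, in the row order of the quoted output.
  val group2: Seq[(Double, Double)] = Seq((21.0, 12.0), (5.0, 5.0), (4.0, 4.0))

  // Frame "unbounded preceding .. current row": one value per growing prefix.
  // Gives None, then about 1.0, then about 0.998.
  val running: Seq[Option[Double]] =
    (1 to group2.size).map(k => pearson(group2.take(k)))

  // Frame "unbounded preceding .. unbounded following": the whole group,
  // so every row of the group would carry this same value (about 0.998).
  val wholeGroup: Option[Double] = pearson(group2)

  def main(args: Array[String]): Unit = {
    println(s"running:    $running")
    println(s"wholeGroup: $wholeGroup")
  }
}
```

The first prefix holds a single point, so there is no variance to correlate
and the result is undefined; two points always lie on a line, hence 1.0.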



On Mon, Feb 28, 2022 at 9:22 AM Edgar H <kaotix...@gmail.com> wrote:

> Yes, the window is defined as you said: unboundedPreceding to currentRow,
> ordered by orderCountSum.
>
> val initialSetWindow = Window
>   .partitionBy("group")
>   .orderBy("orderCountSum")
>   .rowsBetween(Window.unboundedPreceding, Window.currentRow)
>
> Yes, I'm trying to obtain the correlation for each member of the group
> (or the cumulative correlation per element; I don't really know how to
> phrase that). The correlation is affected by the column used for the
> ordering, right? Top to bottom?
>
> P.S. Thank you so much for replying so fast!
>
> On Mon, Feb 28, 2022 at 15:56, Sean Owen (<sro...@gmail.com>) wrote:
>
>> How are you defining the window? It looks like it's something like "rows
>> unbounded preceding, current" or the reverse, as the correlation varies
>> across the elements of the group as if it's computing them on 1, then 2,
>> then 3 elements. Don't you want the correlation across the group? Otherwise
>> this answer is 'right' for what you're doing, it seems.
>>
>> On Mon, Feb 28, 2022 at 7:49 AM Edgar H <kaotix...@gmail.com> wrote:
>>
>>> My bad completely. I missed the mark with the example by a mile, sorry
>>> for that. Let me change a couple of things.
>>>
>>> - I need to add "id" to the initial grouping and also add more elements
>>> to the initial set:
>>>
>>> val sampleSet = Seq(
>>>   ("group1", "id1", 1, 1, 6),
>>>   ("group1", "id1", 4, 4, 6),
>>>   ("group1", "id2", 2, 2, 5),
>>>   ("group1", "id3", 3, 3, 4),
>>>   ("group2", "id1", 4, 4, 3),
>>>   ("group2", "id2", 5, 5, 2),
>>>   ("group2", "id3", 6, 6, 1),
>>>   ("group2", "id3", 15, 6, 1)
>>> )
>>>
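>>> // Assumes spark.implicits._ is in scope for toDF; the column names
>>> // follow the aggregation below.
>>> val groupedSet = sampleSet
>>>   .toDF("group", "id", "count1", "count2", "orderCount")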
>>>   .groupBy(
>>>     "group", "id"
>>>   ).agg(
>>>     sum("count1").as("count1Sum"),
>>>     sum("count2").as("count2Sum"),
>>>     sum("orderCount").as("orderCountSum")
>>>   )
>>>   .withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))
>>>
>>> With this in place, applying the correlation gives the following:
>>>
>>> +------+---+---------+---------+-------------+------------------+
>>> | group| id|count1Sum|count2Sum|orderCountSum|                cf|
>>> +------+---+---------+---------+-------------+------------------+
>>> |group1|id3|        3|        3|            4|              null|
>>> |group1|id2|        2|        2|            5|               1.0|
>>> |group1|id1|        5|        5|           12|               1.0|
>>> |group2|id3|       21|       12|            2|              null|
>>> |group2|id2|        5|        5|            2|               1.0|
>>> |group2|id1|        4|        4|            3|0.9980460957560549|
>>> +------+---+---------+---------+-------------+------------------+
>>>
>>> Taking into account what you just mentioned: even if the Window is
>>> partitioned only by "group", would it still be impossible to obtain a
>>> correlation? What I'm trying to do is:
>>>
>>> group1 = id1, id2, id3 (and their respective counts) - apply the
>>> correlation over the set of ids within the group (ignoring the fact
>>> that the values are sums)
>>> group2 = id1, id2, id3 (and their respective counts) - same as before
>>>
>>> However, the topmost element of each group is still null. When I change
>>> the rowsBetween call to .rowsBetween(Window.unboundedPreceding,
>>> Window.unboundedFollowing), it just calculates the correlation of the
>>> whole subset. Shouldn't the first element be correlated with itself?
>>>
>>
