My bad entirely, I missed the point of the example by a mile. Sorry about
that; let me change a couple of things.

- I need to add "id" to the initial grouping and also add more elements to
the initial set:

val sampleSet = Seq(
  ("group1", "id1", 1, 1, 6),
  ("group1", "id1", 4, 4, 6),
  ("group1", "id2", 2, 2, 5),
  ("group1", "id3", 3, 3, 4),
  ("group2", "id1", 4, 4, 3),
  ("group2", "id2", 5, 5, 2),
  ("group2", "id3", 6, 6, 1),
  ("group2", "id3", 15, 6, 1)
)

val groupedSet = initialSet
  .groupBy("group", "id")
  .agg(
    sum("count1").as("count1Sum"),
    sum("count2").as("count2Sum"),
    sum("orderCount").as("orderCountSum")
  )
  .withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))

With this in place, applying the correlation gives the following:

+------+---+---------+---------+-------------+------------------+
| group| id|count1Sum|count2Sum|orderCountSum|                cf|
+------+---+---------+---------+-------------+------------------+
|group1|id3|        3|        3|            4|              null|
|group1|id2|        2|        2|            5|               1.0|
|group1|id1|        5|        5|           12|               1.0|
|group2|id3|       21|       12|            2|              null|
|group2|id2|        5|        5|            2|               1.0|
|group2|id1|        4|        4|            3|0.9980460957560549|
+------+---+---------+---------+-------------+------------------+

Taking into account what you just mentioned: even if the Window is only
partitioned by "group", would it still be impossible to obtain a
correlation? What I'm trying to do is:

group1 = id1, id2, id3 (and their respective counts) - apply the
correlation over the set of ids within the group (without taking into
account they are a sum)
group2 = id1, id2, id3 (and their respective counts) - same as before

However, the first row of each group is still null. Changing the rowsBetween
call to .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
just calculates the correlation over the whole partition for every row.
Shouldn't the first row still get a correlation computed for itself?
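
To check my understanding of the frame, here is a minimal plain-Scala
sketch (no Spark involved; RunningCorr, pearson and runningCorr are names I
made up for illustration) of what I assume
rowsBetween(Window.unboundedPreceding, Window.currentRow) hands to corr for
the group2 rows above, one growing prefix per row:

```scala
object RunningCorr {
  // Pearson correlation of paired values. Returns None when it is undefined:
  // fewer than two points, or zero variance in either series.
  def pearson(points: Seq[(Double, Double)]): Option[Double] = {
    val n = points.size
    if (n < 2) None
    else {
      val meanX = points.map(_._1).sum / n
      val meanY = points.map(_._2).sum / n
      val sxy = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
      val sxx = points.map { case (x, _) => math.pow(x - meanX, 2) }.sum
      val syy = points.map { case (_, y) => math.pow(y - meanY, 2) }.sum
      if (sxx == 0.0 || syy == 0.0) None else Some(sxy / math.sqrt(sxx * syy))
    }
  }

  // Assumed equivalent of the window frame: row i sees rows 0..i only.
  def runningCorr(points: Seq[(Double, Double)]): Seq[Option[Double]] =
    points.indices.map(i => pearson(points.take(i + 1)))

  def main(args: Array[String]): Unit = {
    // (count1Sum, count2Sum) for group2 in the order shown: id3, id2, id1.
    val group2 = Seq((21.0, 12.0), (5.0, 5.0), (4.0, 4.0))
    runningCorr(group2).foreach(println)
  }
}
```

If this sketch is right, the first row's frame holds a single point, so
there is no variance to correlate and the result is undefined, which would
match the null in cf and what Sean described. The second row always sees
exactly two points, which explains the 1.0, and only the third row (roughly
0.998 here) reflects more than a straight line between two points.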

On Mon, 28 Feb 2022 at 14:12, Sean Owen (<sro...@gmail.com>) wrote:

> You're computing correlations of two series of values, but each series has
> one value, a sum. Correlation is not defined in this case (both variances
> are undefined). This is sample correlation, note.
>
> On Mon, Feb 28, 2022 at 7:06 AM Edgar H <kaotix...@gmail.com> wrote:
>
>> Morning all, been struggling with this for a while and can't really seem
>> to understand what I'm doing wrong...
>>
>> Given the following DataFrame, I want to apply the corr function over
>> it:
>>
>>     val sampleColumns = Seq("group", "id", "count1", "count2", "orderCount")
>>
>>     val sampleSet = Seq(
>>       ("group1", "id1", 1, 1, 6),
>>       ("group1", "id2", 2, 2, 5),
>>       ("group1", "id3", 3, 3, 4),
>>       ("group2", "id4", 4, 4, 3),
>>       ("group2", "id5", 5, 5, 2),
>>       ("group2", "id6", 6, 6, 1)
>>     )
>>
>>     val initialSet = sparkSession
>>       .createDataFrame(sampleSet)
>>       .toDF(sampleColumns: _*)
>>
>> ----- .show()
>>
>> +------+---+------+------+----------+
>> | group| id|count1|count2|orderCount|
>> +------+---+------+------+----------+
>> |group1|id1|     1|     1|         6|
>> |group1|id2|     2|     2|         5|
>> |group1|id3|     3|     3|         4|
>> |group2|id4|     4|     4|         3|
>> |group2|id5|     5|     5|         2|
>> |group2|id6|     6|     6|         1|
>> +------+---+------+------+----------+
>>
>>     val initialSetWindow = Window
>>       .partitionBy("group")
>>       .orderBy("orderCountSum")
>>       .rowsBetween(Window.unboundedPreceding, Window.currentRow)
>>
>>     val groupedSet = initialSet
>>       .groupBy(
>>         "group"
>>       ).agg(
>>         sum("count1").as("count1Sum"),
>>         sum("count2").as("count2Sum"),
>>         sum("orderCount").as("orderCountSum")
>>       )
>>       .withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))
>>
>> ----- .show()
>>
>> +------+---------+---------+-------------+----+
>> | group|count1Sum|count2Sum|orderCountSum|  cf|
>> +------+---------+---------+-------------+----+
>> |group1|        6|        6|           15|null|
>> |group2|       15|       15|            6|null|
>> +------+---------+---------+-------------+----+
>>
>> When trying to apply the corr function, some of the resulting values in
>> cf are null for some reason.
>>
>> The question is, *how can I apply corr to each of the rows within their
>> subgroup (Window)?* Would like to obtain the corr value per Row and
>> subgroup (group1 and group2), and even if more nested IDs were added (group
>> + id) it should still work.
>>
>
