Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Edgar H
Oh, I see now: using currentRow will give the cumulative correlation per ID
within the group, based on its ordering, and using unbounded on both sides
will give the overall correlation value for the whole group?
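
For anyone finding this thread later, here is a plain-Scala sketch (no Spark) of what each frame hands to corr, using the textbook sample correlation rather than Spark's implementation, and group1's sums from the show() output in this thread:

```scala
// Plain-Scala sketch (no Spark): textbook sample Pearson correlation,
// returning None where it is undefined (fewer than two points, or zero
// variance in either series). Not Spark's internal implementation.
def corrOpt(xs: Seq[Double], ys: Seq[Double]): Option[Double] = {
  val n = xs.length
  if (n < 2) return None
  val mx = xs.sum / n
  val my = ys.sum / n
  val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val vx  = xs.map(x => (x - mx) * (x - mx)).sum
  val vy  = ys.map(y => (y - my) * (y - my)).sum
  if (vx == 0.0 || vy == 0.0) None else Some(cov / math.sqrt(vx * vy))
}

// group1's (count1Sum, count2Sum) pairs, ordered by orderCountSum,
// taken from the show() output in this thread.
val c1 = Seq(3.0, 2.0, 5.0)
val c2 = Seq(3.0, 2.0, 5.0)

// Frame (unboundedPreceding, currentRow): corr over each growing prefix,
// so the first row's value is undefined (null in Spark).
val running = (1 to c1.length).map(k => corrOpt(c1.take(k), c2.take(k)))
assert(running.head.isEmpty)

// Frame (unboundedPreceding, unboundedFollowing): one value for the
// whole partition, repeated on every row.
val whole = corrOpt(c1, c2)
assert(whole.exists(v => math.abs(v - 1.0) < 1e-9))
```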


Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Sean Owen
The results make sense then. You want a correlation per group, right?
Because it's over the sums by ID within the group. Then currentRow is
wrong; it needs to be unbounded preceding and following.
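
In code, that would be something like the following sketch (a fragment, not a complete program; it assumes the column names used earlier in the thread and needs a SparkSession around it):

```scala
// Sketch of the corrected window: both bounds unbounded, so corr() sees
// every row of the group on every row, instead of a growing prefix.
// Column names are taken from the thread; this is a fragment and needs
// a running Spark application to execute.
import org.apache.spark.sql.expressions.Window

val initialSetWindow = Window
  .partitionBy("group")
  .orderBy("orderCountSum")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
```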


Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Edgar H
The window is defined as you said, yes: unboundedPreceding to currentRow,
ordered by orderCountSum.

val initialSetWindow = Window
  .partitionBy("group")
  .orderBy("orderCountSum")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

I'm trying to obtain the correlation for each of the members of the group,
yes (or the cumulative value per element; I don't really know how to phrase
that), and the correlation is affected by the counter column used for the
ordering, right? Top to bottom?

Ps. Thank you so much for replying so fast!


Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Sean Owen
How are you defining the window? It looks like it's something like "rows
unbounded preceding to current row" or the reverse, as the correlation varies
across the elements of the group as if it's computed on 1, then 2,
then 3 elements. Don't you want the correlation across the whole group?
Otherwise this answer is 'right' for what you're doing, it seems.


Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Edgar H
My bad completely, I missed the example by a mile; sorry for that. Let me
change a couple of things:

- Added "id" to the initial grouping and also added more elements to the
initial set:

val sampleSet = Seq(
  ("group1", "id1", 1, 1, 6),
  ("group1", "id1", 4, 4, 6),
  ("group1", "id2", 2, 2, 5),
  ("group1", "id3", 3, 3, 4),
  ("group2", "id1", 4, 4, 3),
  ("group2", "id2", 5, 5, 2),
  ("group2", "id3", 6, 6, 1),
  ("group2", "id3", 15, 6, 1)
)

val groupedSet = initialSet
  .groupBy(
"group", "id"
  ).agg(
sum("count1").as("count1Sum"),
sum("count2").as("count2Sum"),
sum("orderCount").as("orderCountSum")
)
  .withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))

Now, with this in place, applying the correlation shows the following:

+------+---+---------+---------+-------------+------------------+
| group| id|count1Sum|count2Sum|orderCountSum|                cf|
+------+---+---------+---------+-------------+------------------+
|group1|id3|        3|        3|            4|              null|
|group1|id2|        2|        2|            5|               1.0|
|group1|id1|        5|        5|           12|               1.0|
|group2|id3|       21|       12|            2|              null|
|group2|id2|        5|        5|            2|               1.0|
|group2|id1|        4|        4|            3|0.9980460957560549|
+------+---+---------+---------+-------------+------------------+

Taking into account what you just mentioned... even if the Window is only
partitioned by "group", would it still be impossible to obtain a
correlation? I'm trying to do something like...

group1 = id1, id2, id3 (and their respective counts) - apply the
correlation over the set of ids within the group (without taking into
account that they are sums)
group2 = id1, id2, id3 (and their respective counts) - same as before

However, the first element (the top row of each group) is still null. When
changing the rowsBetween call to .rowsBetween(Window.unboundedPreceding,
Window.unboundedFollowing), it just calculates the correlation over the
whole subset. Shouldn't the correlation for the first element be
calculable on its own?


[Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Edgar H
Morning all, I've been struggling with this for a while and can't really
seem to understand what I'm doing wrong...

Given the following DataFrame, I want to apply the corr function over it:

val sampleColumns = Seq("group", "id", "count1", "count2", "orderCount")

val sampleSet = Seq(
  ("group1", "id1", 1, 1, 6),
  ("group1", "id2", 2, 2, 5),
  ("group1", "id3", 3, 3, 4),
  ("group2", "id4", 4, 4, 3),
  ("group2", "id5", 5, 5, 2),
  ("group2", "id6", 6, 6, 1)
)

val initialSet = sparkSession
  .createDataFrame(sampleSet)
  .toDF(sampleColumns: _*)

initialSet.show():

+------+---+------+------+----------+
| group| id|count1|count2|orderCount|
+------+---+------+------+----------+
|group1|id1|     1|     1|         6|
|group1|id2|     2|     2|         5|
|group1|id3|     3|     3|         4|
|group2|id4|     4|     4|         3|
|group2|id5|     5|     5|         2|
|group2|id6|     6|     6|         1|
+------+---+------+------+----------+

val initialSetWindow = Window
  .partitionBy("group")
  .orderBy("orderCountSum")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val groupedSet = initialSet
  .groupBy(
"group"
  ).agg(
sum("count1").as("count1Sum"),
sum("count2").as("count2Sum"),
sum("orderCount").as("orderCountSum")
)
  .withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))

groupedSet.show():

+------+---------+---------+-------------+----+
| group|count1Sum|count2Sum|orderCountSum|  cf|
+------+---------+---------+-------------+----+
|group1|        6|        6|           15|null|
|group2|       15|       15|            6|null|
+------+---------+---------+-------------+----+

When applying the corr function, some of the resulting values in cf come
out null for some reason.

The question is, *how can I apply corr to each of the rows within their
subgroup (Window)?* I'd like to obtain the corr value per row and subgroup
(group1 and group2), and it should still work even if more nested IDs were
added (group + id).


Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Sean Owen
You're computing correlations of two series of values, but each series has
one value, a sum. Correlation is not defined in this case (both variances
are undefined). Note that this is the sample correlation.

