The results make sense then. You want a correlation per group right? because it's over the sums by ID within the group. Then currentRow is wrong; needs to be unbounded preceding and following.
On Mon, Feb 28, 2022 at 9:22 AM Edgar H <kaotix...@gmail.com> wrote: > The window is defined as you said yes, unboundedPreceding and currentRow > ordering by orderCountSum. > > val initialSetWindow = Window > .partitionBy("group") > .orderBy("orderCountSum") > .rowsBetween(Window.unboundedPreceding, Window.currentRow) > > I'm trying to obtain the correlation for each of the members of the group > yes (or the accumulative per element, don't really know how to phrase > that), and the correlation is affected by the counter used for the column, > right? Top to bottom? > > Ps. Thank you so much for replying so fast! > > El lun, 28 feb 2022 a las 15:56, Sean Owen (<sro...@gmail.com>) escribió: > >> How are you defining the window? It looks like it's something like "rows >> unbounded proceeding, current" or the reverse, as the correlation varies >> across the elements of the group as if it's computing them on 1, then 2, >> then 3 elements. Don't you want the correlation across the group? otherwise >> this answer is 'right' for what you're doing it seems. >> >> On Mon, Feb 28, 2022 at 7:49 AM Edgar H <kaotix...@gmail.com> wrote: >> >>> My bad completely, missed the example by a mile sorry for that, let me >>> change a couple of things. >>> >>> - Got to add "id" to the initial grouping and also add more elements to >>> the initial set; >>> >>> val sampleSet = Seq( >>> ("group1", "id1", 1, 1, 6), >>> ("group1", "id1", 4, 4, 6), >>> ("group1", "id2", 2, 2, 5), >>> ("group1", "id3", 3, 3, 4), >>> ("group2", "id1", 4, 4, 3), >>> ("group2", "id2", 5, 5, 2), >>> ("group2", "id3", 6, 6, 1), >>> ("group2", "id3", 15, 6, 1) >>> ) >>> >>> val groupedSet = initialSet >>> .groupBy( >>> "group", "id" >>> ).agg( >>> sum("count1").as("count1Sum"), >>> sum("count2").as("count2Sum"), >>> sum("orderCount").as("orderCountSum") >>> ) >>> .withColumn("cf", corr("count1Sum", >>> "count2Sum").over(initialSetWindow)) >>> >>> Now, with this in place, in case the correlation is applied, the >>> following is shown: >>> >>> +------+---+---------+---------+-------------+------------------+ >>> | group| id|count1Sum|count2Sum|orderCountSum| cf| >>> +------+---+---------+---------+-------------+------------------+ >>> |group1|id3| 3| 3| 4| null| >>> |group1|id2| 2| 2| 5| 1.0| >>> |group1|id1| 5| 5| 12| 1.0| >>> |group2|id3| 21| 12| 2| null| >>> |group2|id2| 5| 5| 2| 1.0| >>> |group2|id1| 4| 4| 3|0.9980460957560549| >>> +------+---+---------+---------+-------------+------------------+ >>> >>> Taking into account what you just mentioned... Even if the Window is >>> only partitioned by "group", would it still be impossible to obtain a >>> correlation? I'm trying to do like... >>> >>> group1 = id1, id2, id3 (and their respective counts) - apply the >>> correlation over the set of ids within the group (without taking into >>> account they are a sum) >>> group2 = id1, id2, id3 (and their respective counts) - same as before >>> >>> However, the highest element is still null. When changing the >>> rowsBetween call to .rowsBetween(Window.unboundedPreceding, >>> Window.unboundedFollowing) it will just calculate the whole subset >>> correlation. Shouldn't the first element of the correlation calculate >>> itself? >>> >>