I would suggest use window function with partitioning. select group1,group2,name,value, avg(value) over (partition group1,group2 order by name) m from t
On Thu, Mar 23, 2017 at 9:58 AM, Craig Ching <craigch...@gmail.com> wrote: > Are the elements count big per group? If not, you can group them and use > the code to calculate the median and diff. > > > They're not big, no. Any pointers on how I might do that? The part I'm > having trouble with is the grouping, I can't seem to see how to do the > median per group. For mean, we have the agg feature, but not for median > (and I understand the reasons for that). > > Yong > > ------------------------------ > *From:* Craig Ching <craigch...@gmail.com> > *Sent:* Wednesday, March 22, 2017 3:17 PM > *To:* user@spark.apache.org > *Subject:* calculate diff of value and median in a group > > Hi, > > When using pyspark, I'd like to be able to calculate the difference > between grouped values and their median for the group. Is this possible? > Here is some code I hacked up that does what I want except that it > calculates the grouped diff from mean. Also, please feel free to comment > on how I could make this better if you feel like being helpful :) > > from pyspark import SparkContext > from pyspark.sql import SparkSession > from pyspark.sql.types import ( > StringType, > LongType, > DoubleType, > StructField, > StructType > ) > from pyspark.sql import functions as F > > > sc = SparkContext(appName='myapp') > spark = SparkSession(sc) > > file_name = 'data.csv' > > fields = [ > StructField( > 'group2', > LongType(), > True), > StructField( > 'name', > StringType(), > True), > StructField( > 'value', > DoubleType(), > True), > StructField( > 'group1', > LongType(), > True) > ] > schema = StructType(fields) > > df = spark.read.csv( > file_name, header=False, mode="DROPMALFORMED", schema=schema > ) > df.show() > means = df.select([ > 'group1', > 'group2', > 'name', > 'value']).groupBy([ > 'group1', > 'group2' > ]).agg( > F.mean('value').alias('mean_value') > ).orderBy('group1', 'group2') > > cond = [df.group1 == means.group1, df.group2 == means.group2] > > means.show() > df = df.select([ > 'group1', > 'group2', > 'name', > 'value']).join( > means, > cond > ).drop( > df.group1 > ).drop( > df.group2 > ).select('group1', > 'group2', > 'name', > 'value', > 'mean_value') > > final = df.withColumn( > 'diff', > F.abs(df.value - df.mean_value)) > final.show() > > sc.stop() > > And here is an example dataset I'm playing with: > > 100,name1,0.43,0 > 100,name2,0.33,0 > 100,name3,0.73,0 > 101,name1,0.29,0 > 101,name2,0.96,0 > 101,name3,0.42,0 > 102,name1,0.01,0 > 102,name2,0.42,0 > 102,name3,0.51,0 > 103,name1,0.55,0 > 103,name2,0.45,0 > 103,name3,0.02,0 > 104,name1,0.93,0 > 104,name2,0.16,0 > 104,name3,0.74,0 > 105,name1,0.41,0 > 105,name2,0.65,0 > 105,name3,0.29,0 > 100,name1,0.51,1 > 100,name2,0.51,1 > 100,name3,0.43,1 > 101,name1,0.59,1 > 101,name2,0.55,1 > 101,name3,0.84,1 > 102,name1,0.01,1 > 102,name2,0.98,1 > 102,name3,0.44,1 > 103,name1,0.47,1 > 103,name2,0.16,1 > 103,name3,0.02,1 > 104,name1,0.83,1 > 104,name2,0.89,1 > 104,name3,0.31,1 > 105,name1,0.59,1 > 105,name2,0.77,1 > 105,name3,0.45,1 > > and here is what I'm trying to produce: > > group1,group2,name,value,median,diff > 0,100,name1,0.43,0.43,0.0 > 0,100,name2,0.33,0.43,0.10 > 0,100,name3,0.73,0.43,0.30 > 0,101,name1,0.29,0.42,0.13 > 0,101,name2,0.96,0.42,0.54 > 0,101,name3,0.42,0.42,0.0 > 0,102,name1,0.01,0.42,0.41 > 0,102,name2,0.42,0.42,0.0 > 0,102,name3,0.51,0.42,0.09 > 0,103,name1,0.55,0.45,0.10 > 0,103,name2,0.45,0.45,0.0 > 0,103,name3,0.02,0.45,0.43 > 0,104,name1,0.93,0.74,0.19 > 0,104,name2,0.16,0.74,0.58 > 0,104,name3,0.74,0.74,0.0 > 0,105,name1,0.41,0.41,0.0 > 0,105,name2,0.65,0.41,0.24 > 0,105,name3,0.29,0.41,0.24 > 1,100,name1,0.51,0.51,0.0 > 1,100,name2,0.51,0.51,0.0 > 1,100,name3,0.43,0.51,0.08 > 1,101,name1,0.59,0.59,0.0 > 1,101,name2,0.55,0.59,0.04 > 1,101,name3,0.84,0.59,0.25 > 1,102,name1,0.01,0.44,0.43 > 1,102,name2,0.98,0.44,0.54 > 1,102,name3,0.44,0.44,0.0 > 1,103,name1,0.47,0.16,0.31 > 1,103,name2,0.16,0.16,0.0 > 1,103,name3,0.02,0.16,0.14 > 1,104,name1,0.83,0.83,0.0 > 1,104,name2,0.89,0.83,0.06 > 1,104,name3,0.31,0.83,0.52 > 1,105,name1,0.59,0.59,0.0 > 1,105,name2,0.77,0.59,0.18 > 1,105,name3,0.45,0.59,0.14 > > Thanks for any help! > > Cheers, > Craig > > -- Best Regards, Ayan Guha