> Are the elements count big per group? If not, you can group them and use the 
> code to calculate the median and diff.
> 
> 
> 
They're not big, no.  Any pointers on how I might do that?  The part I'm having 
trouble with is the grouping, I can't seem to see how to do the median per 
group.  For mean, we have the agg feature, but not for median (and I understand 
the reasons for that).

> Yong
> 
> 
> From: Craig Ching <[email protected]>
> Sent: Wednesday, March 22, 2017 3:17 PM
> To: [email protected]
> Subject: calculate diff of value and median in a group
>  
> Hi,
> 
> When using pyspark, I'd like to be able to calculate the difference between 
> grouped values and their median for the group.  Is this possible?  Here is 
> some code I hacked up that does what I want except that it calculates the 
> grouped diff from mean.  Also, please feel free to comment on how I could 
> make this better if you feel like being helpful :)
> 
> from pyspark import SparkContext
> from pyspark.sql import SparkSession
> from pyspark.sql.types import (
>     StringType,
>     LongType,
>     DoubleType,
>     StructField,
>     StructType
> )
> from pyspark.sql import functions as F
> 
> 
> sc = SparkContext(appName='myapp')
> spark = SparkSession(sc)
> 
> file_name = 'data.csv'
> 
> fields = [
>     StructField(
>         'group2',
>         LongType(),
>         True),
>     StructField(
>         'name',
>         StringType(),
>         True),
>     StructField(
>         'value',
>         DoubleType(),
>         True),
>     StructField(
>         'group1',
>         LongType(),
>         True)
> ]
> schema = StructType(fields)
> 
> df = spark.read.csv(
>     file_name, header=False, mode="DROPMALFORMED", schema=schema
> )
> df.show()
> means = df.select([
>     'group1',
>     'group2',
>     'name',
>     'value']).groupBy([
>         'group1',
>         'group2'
>     ]).agg(
>         F.mean('value').alias('mean_value')
>     ).orderBy('group1', 'group2')
> 
> cond = [df.group1 == means.group1, df.group2 == means.group2]
> 
> means.show()
> df = df.select([
>     'group1',
>     'group2',
>     'name',
>     'value']).join(
>         means,
>         cond
>     ).drop(
>         df.group1
>     ).drop(
>         df.group2
>     ).select('group1',
>              'group2',
>              'name',
>              'value',
>              'mean_value')
> 
> final = df.withColumn(
>     'diff',
>     F.abs(df.value - df.mean_value))
> final.show()
> 
> sc.stop()
> 
> And here is an example dataset I'm playing with:
> 
> 100,name1,0.43,0
> 100,name2,0.33,0
> 100,name3,0.73,0
> 101,name1,0.29,0
> 101,name2,0.96,0
> 101,name3,0.42,0
> 102,name1,0.01,0
> 102,name2,0.42,0
> 102,name3,0.51,0
> 103,name1,0.55,0
> 103,name2,0.45,0
> 103,name3,0.02,0
> 104,name1,0.93,0
> 104,name2,0.16,0
> 104,name3,0.74,0
> 105,name1,0.41,0
> 105,name2,0.65,0
> 105,name3,0.29,0
> 100,name1,0.51,1
> 100,name2,0.51,1
> 100,name3,0.43,1
> 101,name1,0.59,1
> 101,name2,0.55,1
> 101,name3,0.84,1
> 102,name1,0.01,1
> 102,name2,0.98,1
> 102,name3,0.44,1
> 103,name1,0.47,1
> 103,name2,0.16,1
> 103,name3,0.02,1
> 104,name1,0.83,1
> 104,name2,0.89,1
> 104,name3,0.31,1
> 105,name1,0.59,1
> 105,name2,0.77,1
> 105,name3,0.45,1
> 
> and here is what I'm trying to produce:
> 
> group1,group2,name,value,median,diff
> 0,100,name1,0.43,0.43,0.0
> 0,100,name2,0.33,0.43,0.10
> 0,100,name3,0.73,0.43,0.30
> 0,101,name1,0.29,0.42,0.13
> 0,101,name2,0.96,0.42,0.54
> 0,101,name3,0.42,0.42,0.0
> 0,102,name1,0.01,0.42,0.41
> 0,102,name2,0.42,0.42,0.0
> 0,102,name3,0.51,0.42,0.09
> 0,103,name1,0.55,0.45,0.10
> 0,103,name2,0.45,0.45,0.0
> 0,103,name3,0.02,0.45,0.43
> 0,104,name1,0.93,0.74,0.19
> 0,104,name2,0.16,0.74,0.58
> 0,104,name3,0.74,0.74,0.0
> 0,105,name1,0.41,0.41,0.0
> 0,105,name2,0.65,0.41,0.24
> 0,105,name3,0.29,0.41,0.24
> 1,100,name1,0.51,0.51,0.0
> 1,100,name2,0.51,0.51,0.0
> 1,100,name3,0.43,0.51,0.08
> 1,101,name1,0.59,0.59,0.0
> 1,101,name2,0.55,0.59,0.04
> 1,101,name3,0.84,0.59,0.25
> 1,102,name1,0.01,0.44,0.43
> 1,102,name2,0.98,0.44,0.54
> 1,102,name3,0.44,0.44,0.0
> 1,103,name1,0.47,0.16,0.31
> 1,103,name2,0.16,0.16,0.0
> 1,103,name3,0.02,0.16,0.14
> 1,104,name1,0.83,0.83,0.0
> 1,104,name2,0.89,0.83,0.06
> 1,104,name3,0.31,0.83,0.52
> 1,105,name1,0.59,0.59,0.0
> 1,105,name2,0.77,0.59,0.18
> 1,105,name3,0.45,0.59,0.14
> 
> Thanks for any help!
> 
> Cheers,
> Craig

Reply via email to