Hi all, I am a student trying to learn Spark, and I have a question about converting rows to columns (a pivot/reshape of the data). I have some data in the following format (as either an RDD or a Spark DataFrame):
    from pyspark.sql import SQLContext
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    sqlContext = SQLContext(sc)

    rdd = sc.parallelize([('X01', 41, 'US', 3),
                          ('X01', 41, 'UK', 1),
                          ('X01', 41, 'CA', 2),
                          ('X02', 72, 'US', 4),
                          ('X02', 72, 'UK', 6),
                          ('X02', 72, 'CA', 7),
                          ('X02', 72, 'XX', 8)])

    # convert to a Spark DataFrame
    schema = StructType([StructField('ID', StringType(), True),
                         StructField('Age', IntegerType(), True),
                         StructField('Country', StringType(), True),
                         StructField('Score', IntegerType(), True)])
    df = sqlContext.createDataFrame(rdd, schema)

What I would like to do is 'reshape' the data, turning certain values in the Country column (specifically US, UK and CA) into columns of their own:

    ID     Age   US   UK   CA
    'X01'  41    3    1    2
    'X02'  72    4    6    7

Essentially, I need something along the lines of pandas' `pivot` workflow:

    categories = ['US', 'UK', 'CA']
    new_df = df[df['Country'].isin(categories)].pivot(index='ID',
                                                      columns='Country',
                                                      values='Score')

My dataset is rather large, so I can't really `collect()` the data into memory and do the reshaping in pandas on the driver. Is there a way to express something like pandas' `.pivot()` as a function I can invoke while mapping over either an RDD or a Spark DataFrame? (A rough sketch of the kind of thing I have in mind is at the end of this message.)

I initially posted this question on Stack Overflow here:
<http://stackoverflow.com/questions/30260015/reshaping-pivoting-data-in-spark-rdd-and-or-spark-dataframes>
but the one suggested solution is verbose, error prone, and probably not scalable either. Any help would be greatly appreciated!
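For concreteness, here is a rough, untested sketch of the kind of RDD-level pivot I'm imagining, keying on (ID, Age) and merging per-country scores. The helper names and the hard-coded column list are just placeholders of mine, and I don't know whether this is idiomatic or whether it would scale:

    categories = ['US', 'UK', 'CA']

    def to_key_value(record):
        # ('X01', 41, 'US', 3) -> (('X01', 41), {'US': 3})
        id_, age, country, score = record
        return ((id_, age), {country: score})

    def merge_dicts(a, b):
        # combine the per-country score dicts for one (ID, Age) key
        merged = dict(a)
        merged.update(b)
        return merged

    pivoted = (rdd
               .filter(lambda r: r[2] in categories)
               .map(to_key_value)
               .reduceByKey(merge_dicts)
               .map(lambda kv: kv[0] + tuple(kv[1].get(c) for c in categories)))

    # pivoted now holds one tuple per (ID, Age), e.g. ('X01', 41, 3, 1, 2),
    # which could presumably be fed back into sqlContext.createDataFrame
    # with a wider schema.

Is something along these lines the right direction, or is there a better way to do this on a DataFrame directly?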