Hi all,

I am a student learning Spark, and I have a question about converting
rows to columns (pivoting/reshaping data). I have some data in the
following format (as either an RDD or a Spark DataFrame):

    from pyspark.sql import SQLContext
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, IntegerType)

    # assumes `sc` is an existing SparkContext (e.g. the one the
    # PySpark shell provides)
    sqlContext = SQLContext(sc)

    rdd = sc.parallelize([('X01', 41, 'US', 3),
                          ('X01', 41, 'UK', 1),
                          ('X01', 41, 'CA', 2),
                          ('X02', 72, 'US', 4),
                          ('X02', 72, 'UK', 6),
                          ('X02', 72, 'CA', 7),
                          ('X02', 72, 'XX', 8)])

    # convert to a Spark DataFrame
    schema = StructType([StructField('ID', StringType(), True),
                         StructField('Age', IntegerType(), True),
                         StructField('Country', StringType(), True),
                         StructField('Score', IntegerType(), True)])

    df = sqlContext.createDataFrame(rdd, schema)

What I would like to do is 'reshape' the data, converting certain values of
Country (specifically US, UK, and CA) into columns:

    ID     Age  US  UK  CA
    'X01'  41   3   1   2
    'X02'  72   4   6   7

Essentially, I need something along the lines of pandas' `pivot` workflow:

    # here `df` is a pandas DataFrame, not a Spark one
    categories = ['US', 'UK', 'CA']
    new_df = df[df['Country'].isin(categories)].pivot(index='ID',
                                                      columns='Country',
                                                      values='Score')

My dataset is rather large, so I can't really `collect()` it and load the
data into memory to do the reshaping in Python itself. Is there a way to
turn pandas' `.pivot()` into a function I can invoke while mapping over
either an RDD or a Spark DataFrame? Any help would be appreciated!
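
To make the per-key logic concrete, here is a minimal sketch in plain
Python, so it runs without a cluster. The names `to_pair`, `merge`, and
`pivot_rows` are made up for illustration, and the sample rows are chosen
to agree with the desired output table above. The assumption is that
functions shaped like `to_pair` and `merge` could be handed to `rdd.map`
and `rdd.reduceByKey`; that part has not been run on Spark here.

```python
CATEGORIES = ('US', 'UK', 'CA')

def to_pair(row):
    """Map one (ID, Age, Country, Score) row to ((ID, Age), {Country: Score})."""
    id_, age, country, score = row
    return (id_, age), {country: score}

def merge(left, right):
    """Combine two partial {Country: Score} dicts for the same key.

    If a country appears twice for one key, the later value wins.
    """
    merged = dict(left)
    merged.update(right)
    return merged

def pivot_rows(rows, categories=CATEGORIES):
    """Local stand-in for a filter -> map -> reduce-by-key -> map pipeline."""
    grouped = {}
    wanted = (r for r in rows if r[2] in categories)
    for key, scores in map(to_pair, wanted):
        grouped[key] = merge(grouped.get(key, {}), scores)
    # One output tuple per (ID, Age); missing countries become None.
    return [key + tuple(scores.get(c) for c in categories)
            for key, scores in sorted(grouped.items())]

rows = [('X01', 41, 'US', 3), ('X01', 41, 'UK', 1), ('X01', 41, 'CA', 2),
        ('X02', 72, 'US', 4), ('X02', 72, 'UK', 6), ('X02', 72, 'CA', 7),
        ('X02', 72, 'XX', 8)]

result = pivot_rows(rows)
print(result)  # [('X01', 41, 3, 1, 2), ('X02', 72, 4, 6, 7)]
```

Since `merge` only combines two small dicts per key, nothing in this
sketch needs the whole dataset in one place, which is the property I am
after.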

I initially posted this question on Stack Overflow here:
<http://stackoverflow.com/questions/30260015/reshaping-pivoting-data-in-spark-rdd-and-or-spark-dataframes>
but the one suggested solution is verbose, error-prone, and probably not
scalable either.

Any help would be greatly appreciated. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-reshape-RDD-Spark-DataFrame-tp22909.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
