Hi, I have a set of data that I need to group by a specific key and then save as Parquet; see the code snippet below. I am querying the trade table and then grouping by date.
import org.apache.hadoop.fs.Path

val df = sqlContext.sql("SELECT * FROM trade")
val dfSchema = df.schema
val partitionKeyIndex = dfSchema.fieldNames.indexOf("date")

// group by date
val groupedByPartitionKey = df.rdd.groupBy { row => row.getString(partitionKeyIndex) }

val failure = groupedByPartitionKey.map { row =>
  // row._1 is the date, row._2 the rows for that date
  val rowDF = sqlContext.createDataFrame(sc.parallelize(row._2.toSeq), dfSchema)
  val fileName = config.getTempFileName(row._1)
  try {
    val dest = new Path(fileName)
    if (DefaultFileSystem.getFS.exists(dest)) {
      DefaultFileSystem.getFS.delete(dest, true)
    }
    rowDF.saveAsParquetFile(fileName)
    false
  } catch {
    case e: Throwable =>
      logError("Failed to save parquet file")
      true // flag the failure for this group
  }
}

This code doesn't work because it nests RDD operations: sqlContext.createDataFrame and sc.parallelize are called inside the map, i.e. on the executors, where SparkContext and SQLContext are not available. What is the best way to solve this problem?

Regards,
Gaurav
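P.S. For context, here is a minimal sketch of the alternative I am considering, which lets Spark split the output by the key itself instead of grouping the RDD by hand. This assumes Spark 1.4+ where DataFrameWriter.partitionBy is available, and outputPath is a placeholder for the real destination:

import org.apache.spark.sql.SaveMode

val trades = sqlContext.sql("SELECT * FROM trade")
// Each distinct value of "date" becomes its own date=<value>/ subdirectory
// of Parquet files, written in a single job with no nested RDDs.
trades.write
  .mode(SaveMode.Overwrite) // replaces existing output, like the manual delete above
  .partitionBy("date")
  .parquet(outputPath)

Is this the recommended pattern, or is there a better way on older versions?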