Best solution I've found so far (no shuffling, and as many threads as input dirs):
1. Create an RDD of input dirs, with as many partitions as input dirs.
2. Transform it into an RDD of input files (preserving the partitioning by dir).
3. Flat-map it with a custom CSV parser.
4. Convert the RDD to a DataFrame.
5. Write the DataFrame to a Parquet table partitioned by dir.

It requires writing your own parser: I could not find a way to preserve the partitioning using sc.textFile or the Databricks CSV parser. A sketch of the approach is below.
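Here is a minimal Scala sketch of those five steps, under some assumptions not in the original post: local-filesystem input paths (for HDFS you would list files via the Hadoop FileSystem API instead of java.io.File), a simple unquoted CSV format, and a hypothetical Record schema and paths that you would replace with your own.

    import java.io.File
    import org.apache.spark.sql.SparkSession

    object CsvToParquetPartitioned {
      // Hypothetical record type; adjust the fields to match your CSV schema.
      case class Record(dir: String, a: String, b: Int)

      // Minimal hand-rolled parser; assumes no quoted or escaped fields.
      def parseLine(dir: String, line: String): Record = {
        val cols = line.split(",", -1)
        Record(dir, cols(0), cols(1).toInt)
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()
        import spark.implicits._
        val sc = spark.sparkContext

        val inputDirs = Seq("/data/in/dir1", "/data/in/dir2") // hypothetical paths

        // 1) RDD of input dirs, one partition per dir (numSlices = #dirs),
        //    so each dir gets its own task and no shuffle is needed.
        val dirsRdd = sc.parallelize(inputDirs, inputDirs.size)

        // 2) RDD of (dir, file) pairs; mapPartitions keeps the per-dir partitioning.
        val filesRdd = dirsRdd.mapPartitions { dirs =>
          dirs.flatMap { dir =>
            new File(dir).listFiles()
              .filter(_.getName.endsWith(".csv"))
              .map(f => (dir, f.getPath))
          }
        }

        // 3) Flat-map each file's lines through the custom parser.
        val recordsRdd = filesRdd.flatMap { case (dir, path) =>
          scala.io.Source.fromFile(path).getLines().map(parseLine(dir, _))
        }

        // 4) RDD -> DataFrame, then 5) write Parquet partitioned by dir.
        recordsRdd.toDF().write.partitionBy("dir").parquet("/data/out/table")
      }
    }

Since each partition holds exactly one input dir, the parsing stays partition-local and the partitionBy("dir") write maps each task's output straight to its own Parquet partition directory.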