Hello, I'm trying to convert a bunch of CSV files to Parquet, with the interesting twist that the input CSV files are already "partitioned" by directory. All the input files have the same set of columns. The input file structure looks like:
/path/dir1/file1.csv
/path/dir1/file2.csv
/path/dir2/file3.csv
/path/dir3/file4.csv
/path/dir3/file5.csv
/path/dir3/file6.csv

I'd like to read those files and write their data to a Parquet table in HDFS, preserving the partitioning (partitioned by input directory) and producing a single output file per partition. The output file structure should look like:

hdfs://path/dir=dir1/part-r-xxx.gz.parquet
hdfs://path/dir=dir2/part-r-yyy.gz.parquet
hdfs://path/dir=dir3/part-r-zzz.gz.parquet

The best solution I have found so far is to loop over the input directories, loading the CSV files of each directory into a DataFrame and writing that DataFrame to the target partition. But this is not efficient: since I want a single output file per partition, each write to HDFS is a single task, and it blocks the loop until it finishes. I wonder how to achieve this with a maximum of parallelism (and without shuffling the data across the cluster).

Thanks!
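PS: to make the question concrete, here is roughly the loop I have today, with the per-directory writes submitted from a small thread pool as one possible direction. This is only a sketch, assuming Spark 2.x; the directory list, the CSV options and the output paths are placeholders for my real setup.

    import java.util.concurrent.Executors

    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, ExecutionContext, Future}

    import org.apache.spark.sql.SparkSession

    object CsvToParquetByDir {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("csv-to-parquet-by-dir")
          .getOrCreate()

        // Placeholder list of input directories; in practice it would be
        // discovered by listing /path on the filesystem.
        val inputDirs = Seq("dir1", "dir2", "dir3")

        // Submit the per-directory writes from a thread pool so the Spark
        // jobs run concurrently instead of blocking one another in the loop.
        implicit val ec: ExecutionContext =
          ExecutionContext.fromExecutor(Executors.newFixedThreadPool(inputDirs.size))

        val jobs = inputDirs.map { dir =>
          Future {
            spark.read
              .option("header", "true")          // adjust to the real CSV layout
              .csv(s"/path/$dir/*.csv")
              .coalesce(1)                       // one output file per partition, no full shuffle
              .write
              .mode("overwrite")
              .option("compression", "gzip")     // gzip-compressed Parquet output
              .parquet(s"hdfs://path/dir=$dir")  // Hive-style partition directory
          }
        }

        // Wait for all writes to finish before stopping the session.
        jobs.foreach(Await.result(_, Duration.Inf))
        spark.stop()
      }
    }

The alternative I see is reading everything in one go (spark.read.csv on /path/*/ plus input_file_name() to derive a "dir" column, then write.partitionBy("dir")), which gives the same layout in a single job, but getting exactly one file per partition that way seems to require a repartition on the partition column, i.e. the shuffle I'd like to avoid.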