Best solution I've found so far (no shuffling and as many threads as input dirs):
1. Create an RDD of input dirs, with as many partitions as input dirs
2. Transform it to an RDD of input files (preserving the partitioning by dir)
3. Flat-map it with a custom CSV parser
4. Convert the RDD to a dataframe
5. Write the dataframe to parquet
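A minimal PySpark sketch of the steps above, under a few assumptions of mine: the input directories sit on a filesystem visible to every executor, the column set and the helper names (`dir_to_rows`, `convert`) are hypothetical, and the parsing relies on Python's csv module:

```python
import csv
import os

COLUMNS = ["col_a", "col_b"]  # hypothetical shared column set

def dir_to_rows(dir_path):
    """Parse every CSV file in one input directory into tuples.

    Runs inside a single Spark partition, so each directory is
    handled by one task and no shuffle is needed at this stage.
    """
    rows = []
    for name in sorted(os.listdir(dir_path)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(dir_path, name), newline="") as f:
            for record in csv.reader(f):
                rows.append(tuple(record))
    return rows

def convert(input_dirs, output_path):
    """Sketch of the full pipeline; needs a Spark installation to run."""
    from pyspark.sql import SparkSession  # imported lazily on purpose
    spark = SparkSession.builder.getOrCreate()
    # One partition per input directory, as in the steps above.
    rdd = spark.sparkContext.parallelize(input_dirs, len(input_dirs))
    rows = rdd.flatMap(dir_to_rows)            # custom CSV parsing
    df = spark.createDataFrame(rows, COLUMNS)  # RDD -> dataframe
    df.write.parquet(output_path)              # write out as parquet
```

The flatMap keeps the one-partition-per-directory layout, which is what gives one thread per input dir.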
This is more or less how I'm doing it now.
The problem is that it creates shuffling in the cluster, because the input data are not collocated according to the partitioning scheme.
If I reload the output parquet files as a new dataframe, then everything is fine, but I'd like to avoid shuffling during the initial conversion as well.
Yes, by parsing the file content, it's possible to recover which directory they came from.
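If the content does encode the directory, the recovery step might look like this (purely an assumption of mine: the first field of each parsed row carries the directory key):

```python
def recover_dir(parsed_rows):
    """Hypothetical: infer the source directory key from row content,
    assuming every row's first field holds the partition key."""
    keys = {row[0] for row in parsed_rows}
    if len(keys) != 1:
        raise ValueError("rows from more than one directory: %r" % keys)
    return keys.pop()
```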
From: neil90 [via Apache Spark User List]
[mailto:ml-node+s1001560n28083...@n3.nabble.com]
Sent: mercredi 16 novembre 2016 17:41
To: Drooghaag, Benoit (Nokia - BE)
Subject: Re: CSV to parquet preserving partitions
Hello,
I'm trying to convert a bunch of csv files to parquet, with the interesting
case that the input csv files are already "partitioned" by directory.
All the input files have the same set of columns.
The input file structure looks like:
/path/dir1/file1.csv
/path/dir1/file2.csv
/path/dir2/fi
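Given a layout like this, the per-directory input list for the first step could be gathered on the driver (a sketch; the helper name is my own and `/path` stands for the example root above):

```python
import os

def list_input_dirs(root):
    """Return the immediate subdirectories of the input root,
    e.g. /path/dir1 and /path/dir2 in the layout above."""
    return sorted(
        entry.path for entry in os.scandir(root) if entry.is_dir()
    )
```

The resulting list is what would be passed to `sc.parallelize(dirs, len(dirs))` to get one partition per directory.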