To: user@spark.apache.org
Subject: Repartitioning by partition size, not by number of partitions.
Hi,
My input data consists of many very small files, each containing a single JSON record.
For performance reasons (I use PySpark) I have to repartition; currently I do:
sc.textFile(files).coalesce(100)
The problem is that I have to guess the number of partitions in such a way that
it's as fast as possible.
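One way to avoid guessing is to derive the partition count from the total input size instead of hard-coding it. A minimal sketch, assuming the inputs are local files you can stat with `os.path.getsize` and a target partition size of 128 MB (both assumptions, adjust for your storage and cluster):

```python
import os

def partitions_for(paths, target_bytes=128 * 1024 * 1024):
    """Return a partition count so each partition holds roughly target_bytes of input."""
    total = sum(os.path.getsize(p) for p in paths)
    # Ceiling division, with at least one partition.
    return max(1, -(-total // target_bytes))

# Usage (hypothetical, with `sc` an existing SparkContext and `files` a list of paths):
# rdd = sc.textFile(",".join(files)).coalesce(partitions_for(files))
```

For data on HDFS or S3 you would replace the `os.path.getsize` sum with a listing call from the corresponding filesystem API, but the idea is the same: size-driven rather than count-driven.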