subject:"Repartitioning by partition size, not by number of partitions."

RE: Repartitioning by partition size, not by number of partitions.

2014-10-31 Thread jan.zikes

e To: user@spark.apache.org Subject: Repartitioning by partition size, not by number of partitions. Hi, I have input data consisting of many very small files, each containing one .json. For performance reasons (I use PySpark) I have to repartition; currently I do: sc.textFile(files).coalesce(100) Problem

RE: Repartitioning by partition size, not by number of partitions.

2014-10-31 Thread Ganelin, Ilya

y partition size, not by number of partitions. Hi, I have input data consisting of many very small files, each containing one .json. For performance reasons (I use PySpark) I have to repartition; currently I do: sc.textFile(files).coalesce(100) Problem is that I have to guess the number of part

Repartitioning by partition size, not by number of partitions.

2014-10-31 Thread jan.zikes

Hi, I have input data consisting of many very small files, each containing one .json. For performance reasons (I use PySpark) I have to repartition; currently I do: sc.textFile(files).coalesce(100) Problem is that I have to guess the number of partitions in such a way that it's as fast as po
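
One way to avoid guessing the partition count is to derive it from the total input size and a target partition size. The following is only a rough sketch, not something taken from the thread: the helper names `estimate_num_partitions` and `total_size` are hypothetical, the 64 MB default is an assumed target, and it presumes the files are local paths whose sizes can be read with `os.path.getsize`.

```python
import os

# Assumed target size per partition; tune for your cluster and data.
DEFAULT_TARGET_BYTES = 64 * 1024 * 1024


def estimate_num_partitions(total_bytes, target_bytes=DEFAULT_TARGET_BYTES):
    """Return a partition count so each partition holds about target_bytes."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    # Integer ceiling division; always return at least one partition.
    return max(1, (total_bytes + target_bytes - 1) // target_bytes)


def total_size(paths):
    """Sum the on-disk sizes of local files (hypothetical helper)."""
    return sum(os.path.getsize(p) for p in paths)


# Possible usage with an existing SparkContext `sc` and a list `paths`:
#   num = estimate_num_partitions(total_size(paths))
#   rdd = sc.textFile(",".join(paths)).coalesce(num)
```

This still fixes a number of partitions up front, but the number now follows from the data size instead of a guess. For files on HDFS or S3 the sizes would have to come from that filesystem's own listing rather than `os.path.getsize`.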