Try rdd.coalesce(1).saveAsParquetFile(...)
http://spark.apache.org/docs/1.2.0/programming-guide.html#transformations
--- Original Message ---
From: "Manoj Samel"
Sent: January 29, 2015 9:28 AM
To: user@spark.apache.org
Subject: schemaRDD.saveAsParquetFile creates large number of small parquet
You can use coalesce or repartition to control the number of files output by
any Spark operation.
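
As a minimal sketch of that suggestion, assuming the Spark 1.2 SchemaRDD API and placeholder paths (the CSV path, schema, and output path below are hypothetical, not from the original post):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CoalesceParquetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CoalesceParquetSketch"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical input: a SchemaRDD built from a CSV, as in the original post.
    // Here we just reload an existing Parquet file as a stand-in source.
    val schemaRDD = sqlContext.parquetFile("hdfs:///tmp/input.parquet")

    // coalesce(n) merges existing partitions without a shuffle, so the write
    // produces at most n Parquet part-files; repartition(n) does the same but
    // with a full shuffle, and can also increase the partition count.
    schemaRDD.coalesce(8).saveAsParquetFile("hdfs:///tmp/output.parquet")

    sc.stop()
  }
}
```

Note that coalesce(1) funnels the whole write through a single task, so for a large input a small n > 1 is usually a better trade-off than 1.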
On Thu, Jan 29, 2015 at 9:27 AM, Manoj Samel
wrote:
> Spark 1.2 on Hadoop 2.3
>
> Read one big csv file, create a schemaRDD on it and saveAsParquetFile.
>
> It creates a large number of small (~1MB)