My bad, I realize my question was unclear. I did a partitionBy when using saveAsHadoopFile; my question was about doing the same thing for Parquet files. We were using Spark 1.3.x, and now that we've updated to 1.4.1 I'd totally forgotten that this becomes possible :-)
Thanks for the answer, then!

On 8 September 2015 at 12:58, Cheng Lian <lian.cs....@gmail.com> wrote:

> In Spark 1.4 and 1.5, you can do something like this:
>
> df.write.partitionBy("key").parquet("/datasink/output-parquets")
>
> BTW, I'm curious about how you did it without partitionBy using
> saveAsHadoopFile?
>
> Cheng
>
> On 9/8/15 2:34 PM, Adrien Mogenet wrote:
>
> Hi there,
>
> We've spent several hours splitting our input data into several parquet
> files (or several folders, i.e.
> /datasink/output-parquets/<key>/foobar.parquet), based on a
> low-cardinality key. This works very well when using saveAsHadoopFile,
> but we can't achieve a similar thing with Parquet files.
>
> The only working solution so far is to persist the RDD and then loop over
> it N times to write N files. That does not look acceptable...
>
> Do you guys have any suggestion for doing such an operation?
>
> --
> *Adrien Mogenet*
> Head of Backend/Infrastructure
> adrien.moge...@contentsquare.com
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris

--
*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.moge...@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris
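For readers landing on this thread: a minimal sketch of the Spark 1.4+ partitioned Parquet write Cheng describes, assuming an existing SparkContext `sc` (as in spark-shell) and a DataFrame with a low-cardinality column named "key"; the input path, output path, and column name are illustrative, not from the original thread.

    import org.apache.spark.sql.SQLContext

    // Assumes an existing SparkContext `sc` (e.g. in spark-shell).
    val sqlContext = new SQLContext(sc)

    // Illustrative input: any DataFrame with a low-cardinality "key" column.
    val df = sqlContext.read.json("/datasink/input.json")

    // Spark 1.4+: partitionBy writes one sub-directory per distinct key value,
    //   /datasink/output-parquets/key=<value>/part-*.parquet
    // so no manual loop over the RDD is needed.
    df.write
      .partitionBy("key")
      .parquet("/datasink/output-parquets")

Reading "/datasink/output-parquets" back with sqlContext.read.parquet will rediscover the key=<value> directories as a partition column.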