Fabian,

Not sure if we are on the same page. If I do something like the code below, it will hash-partition on field 0 and each task will write a separate part file in parallel.
    val sink = data1.join(data2)
      .where(1).equalTo(0) { (l, r) => (l._3, r._3) }
      .partitionByHash(0)
      .writeAsCsv(pathBase + "output/test", rowDelimiter = "\n",
        fieldDelimiter = "\t", writeMode = WriteMode.OVERWRITE)

This creates the folder ./output/test/ with one part file per parallel task (1, 2, 3, 4, ...).

What I was looking for is a Hive-style partitionBy that writes a folder structure like

    ./output/field0=1/file
    ./output/field0=2/file
    ./output/field0=3/file
    ./output/field0=4/file

assuming field0 is an Int with unique values 1, 2, 3 and 4.

Srikanth

On Mon, Feb 15, 2016 at 6:20 AM, Fabian Hueske <fhue...@gmail.com> wrote:

> Hi Srikanth,
>
> DataSet.partitionBy() will partition the data on the declared partition
> fields. If you append a DataSink with the same parallelism as the
> partition operator, the data will be written out with the defined
> partitioning. It should be possible to achieve the behavior you
> described using DataSet.partitionByHash() or partitionByRange().
>
> Best, Fabian
>
> 2016-02-12 20:53 GMT+01:00 Srikanth <srikanth...@gmail.com>:
>
>> Hello,
>>
>> Is there a Hive (or Spark DataFrame) partitionBy equivalent in Flink?
>> I'm looking to save the output as CSV files partitioned by two columns
>> (date and hour). The partitionBy DataSet API is more about partitioning
>> the data on a column for further processing.
>>
>> I'm thinking there is no direct API to do this, but what would be the
>> best way of achieving it?
>>
>> Srikanth
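
One possible workaround, sketched below, is to group on the partition column and write each group's file directly from a reduceGroup function. This is only a sketch under assumed names and conditions: the (Int, String) schema, the pathBase location, and a filesystem that every task manager can write to (a local run or a shared mount). Flink's DataSet API has no built-in equivalent of Spark's DataFrameWriter.partitionBy, so none of this is an official API.

    // Sketch only: assumes an (Int, String) schema and a pathBase that is
    // writable from every task (local run or shared mount).
    import java.io.{File, PrintWriter}
    import org.apache.flink.api.scala._
    import org.apache.flink.util.Collector

    object HiveStylePartitionedWrite {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
        val pathBase = "/tmp/flink/"   // assumed base path

        // Assumed sample data: (partition key, payload).
        val data = env.fromElements((1, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e"))

        data
          .groupBy(0)   // one group per distinct value of field 0
          .reduceGroup { (records: Iterator[(Int, String)], out: Collector[Int]) =>
            val rows = records.toSeq
            val key  = rows.head._1
            val dir  = new File(s"${pathBase}output/field0=$key")   // Hive-style directory
            dir.mkdirs()
            val writer = new PrintWriter(new File(dir, "part-00000"))
            try rows.foreach { case (k, v) => writer.println(s"$k\t$v") }
            finally writer.close()
            out.collect(key)   // emit the key so the job has a result to print
          }
          .print()   // forces execution
      }
    }

For HDFS, the java.io calls would be replaced with the Hadoop FileSystem API. Note also that each group is handled by a single task, so a skewed partition column concentrates the write on one worker.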