Thanks. I have to stream in the historical data, and its out-of-boundedness is much greater than that of the real-time data. I thought there was some elegant way using mapPartition that I wasn't seeing.
On Fri, Feb 9, 2018 at 5:10 AM, Fabian Hueske <fhue...@gmail.com> wrote:

> You can also partition by range, sort, and write each partition. Once
> all partitions have been written to files, you can concatenate the files.
> As Till said, it is not possible to sort in parallel and write in order
> to a single file.
>
> Best, Fabian
>
> 2018-02-09 10:35 GMT+01:00 Till Rohrmann <trohrm...@apache.org>:
>
>> Hi David,
>>
>> Flink only supports sorting within partitions. Thus, if you want to
>> write out a globally sorted dataset, you should set the parallelism to
>> 1, which effectively results in a single partition. Decreasing the
>> parallelism of an operator will cause the individual partitions to lose
>> their sort order, because the individual partitions are read in a
>> non-deterministic order.
>>
>> Cheers,
>> Till
>>
>> On Thu, Feb 8, 2018 at 8:07 PM, david westwood <
>> david.d.westw...@gmail.com> wrote:
>>
>>> Hi:
>>>
>>> I would like to sort historical data using the DataSet API.
>>>
>>> env.setParallelism(10)
>>>
>>> val dataset = [(Long, String)] ..
>>>   .partitionByRange(_._1)
>>>   .sortPartition(_._1, Order.ASCENDING)
>>>   .writeAsCsv("mydata.csv").setParallelism(1)
>>>
>>> The data is out of order (each partition is only locally sorted), but
>>> .print()
>>> prints the data in the correct order. I have run a small toy sample
>>> multiple times.
>>>
>>> Is there a way to sort the entire dataset with parallelism > 1 and
>>> write it to a single file in ascending order?
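
For the archives, Fabian's suggestion can be sketched roughly like this. This is an untested sketch against the Flink DataSet API (1.x era); the input values and the output path "mydata.csv" are illustrative, and the per-partition file naming described in the comments is how Flink's file sinks behave when the sink parallelism is greater than 1:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.api.common.operators.Order

object RangePartitionedSortedWrite {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(10)

    // Illustrative input; in practice this would come from a real source.
    val dataset: DataSet[(Long, String)] =
      env.fromElements((3L, "c"), (1L, "a"), (2L, "b"))

    // Range-partition so that all keys in partition i are smaller than all
    // keys in partition i+1, then sort within each partition. Because the
    // sink runs with parallelism > 1, Flink writes "mydata.csv" as a
    // directory with one file per subtask ("1", "2", ...).
    dataset
      .partitionByRange(_._1)
      .sortPartition(_._1, Order.ASCENDING)
      .writeAsCsv("mydata.csv")

    env.execute("range-partitioned sorted write")

    // After the job finishes, concatenating the per-subtask files in
    // subtask-number order yields one globally sorted file, e.g.:
    //   cat mydata.csv/1 mydata.csv/2 ... > sorted.csv
    // The key point: do NOT call .setParallelism(1) on the sink, as that
    // merges the sorted partitions in a non-deterministic order.
  }
}
```

The alternative from Till's reply is simply to run the sort-and-write with parallelism 1 (a single partition), which gives a single globally sorted file at the cost of all parallelism.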