Thanks. I have to stream in the historical data, and its out-of-boundedness is much greater than that of the real-time data. I thought there was some elegant way using mapPartition that I wasn't seeing.
On Fri, Feb 9, 2018 at 5:10 AM, Fabian Hueske <fhue...@gmail.com> wrote:

> You can also partition by range, sort, and write each partition. Once
> all partitions have been written to files, you can concatenate the files.
> As Till said, it is not possible to sort in parallel and write in order
> to a single file.
>
> Best, Fabian
>
> 2018-02-09 10:35 GMT+01:00 Till Rohrmann <trohrm...@apache.org>:
>
>> Hi David,
>>
>> Flink only supports sorting within partitions. Thus, if you want to
>> write out a globally sorted dataset, you should set the parallelism to
>> 1, which effectively results in a single partition. Decreasing the
>> parallelism of an operator will cause the individual partitions to lose
>> their sort order, because the individual partitions are read in a
>> non-deterministic order.
>>
>> Cheers,
>> Till
>>
>> On Thu, Feb 8, 2018 at 8:07 PM, david westwood <
>> david.d.westw...@gmail.com> wrote:
>>
>>> Hi:
>>>
>>> I would like to sort historical data using the DataSet API.
>>>
>>> env.setParallelism(10)
>>>
>>> val dataset = [(Long, String)] ..
>>>   .partitionByRange(_._1)
>>>   .sortPartition(_._1, Order.ASCENDING)
>>>   .writeAsCsv("mydata.csv").setParallelism(1)
>>>
>>> The data is out of order (each partition is only locally sorted), but
>>> .print()
>>> prints the data in the correct order. I have run a small toy sample
>>> multiple times.
>>>
>>> Is there a way to sort the entire dataset with parallelism > 1 and
>>> write it to a single file in ascending order?
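
For the archives, Fabian's suggestion can be sketched roughly like this. This is an untested sketch against the Flink DataSet API (1.x era); the input values and the output path "mydata.csv" are illustrative, and the per-partition file naming described in the comments is how Flink's file sinks behave when the sink parallelism is greater than 1:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.api.common.operators.Order

object RangePartitionedSortedWrite {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(10)

    // Illustrative input; in practice this would come from a real source.
    val dataset: DataSet[(Long, String)] =
      env.fromElements((3L, "c"), (1L, "a"), (2L, "b"))

    // Range-partition so that all keys in partition i are smaller than all
    // keys in partition i+1, then sort within each partition. Because the
    // sink runs with parallelism > 1, Flink writes "mydata.csv" as a
    // directory with one file per subtask ("1", "2", ...).
    dataset
      .partitionByRange(_._1)
      .sortPartition(_._1, Order.ASCENDING)
      .writeAsCsv("mydata.csv")

    env.execute("range-partitioned sorted write")

    // After the job finishes, concatenating the per-subtask files in
    // subtask-number order yields one globally sorted file, e.g.:
    //   cat mydata.csv/1 mydata.csv/2 ... > sorted.csv
    // The key point: do NOT call .setParallelism(1) on the sink, as that
    // merges the sorted partitions in a non-deterministic order.
  }
}
```

The alternative from Till's reply is simply to run the sort-and-write with parallelism 1 (a single partition), which gives a single globally sorted file at the cost of all parallelism.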