Hi,

Firstly, there is no need to use repartitionByRange. The repartition or
coalesce call can come after the sort and everything will be fine.
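For example, something along these lines (just a sketch, not tested; it
reuses the dataset, column and path names from Ivan's snippet below):

dataset
  .sort("long_repeated_string_in_this_column")
  .coalesce(5) // reduce to 5 output files after the sort
  .write
  .parquet(outputPath)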


Secondly, to limit the number of records per file there is no need to use
repartition. Just sort and then write out the files with the property
spark.sql.files.maxRecordsPerFile set; unless there is skew in the data,
things will work out fine.
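For example (again only a sketch; the 1,000,000 limit is an arbitrary value
you would tune for your data):

spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)

dataset
  .sort("long_repeated_string_in_this_column")
  .write
  .parquet(outputPath)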


Regards,
Gourav Sengupta

On Mon, Mar 8, 2021 at 4:01 PM m li <xiyunanmen...@gmail.com> wrote:

> Hi Ivan,
>
>
>
> If the error you are referring to is that the data comes out unordered, it
> may be due to the shuffle caused by “repartition”. You can try
> “repartitionByRange” instead:
>
> scala> val df = sc.parallelize(1 to 1000, 10).toDF("v")
>
> scala> df.repartitionByRange(5, column("v")).sortWithinPartitions("v").write.parquet(outputPath)
>
>
>
> Best Regards,
>
> m li
> Ivan Petrov wrote:
> > Ah... makes sense, thank you. I tried sortWithinPartitions before and
> > replaced it with sort. It was a mistake.
> >
> > Thu, 25 Feb 2021 at 15:25, Pietro Gentile <pietro.gentile89.developer@>:
> >
> >> Hi,
> >>
> >> It is because *repartition* comes before the *sort* invocation. If you
> >> reverse them you'll see 5 output files.
> >>
> >> Regards,
> >> Pietro
> >>
> >> On Wed, 24 Feb 2021 at 16:43, Ivan Petrov <capacytron@> wrote:
> >>
> >>> Hi, I'm trying to control the size and/or count of spark output.
> >>>
> >>> Here is my code. I expect to get 5 files, but I get dozens of small
> >>> files. Why?
> >>>
> >>> dataset
> >>> .repartition(5)
> >>> .sort("long_repeated_string_in_this_column") // should be better compressed with snappy
> >>> .write
> >>> .parquet(outputPath)
> >>>
> >>
>
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
