Sorry, for now what I can do is something like this:

    var df5 = spark.read.parquet("/user/devuser/testdata/df1").coalesce(1)
    df5 = df5.union(df5).union(df5).union(df5).union(df5)
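To keep the generation itself parallel instead of building unions on the driver, a minimal sketch along these lines should also work (assuming the usual spark-shell `spark` session; the output path, partition count, and column names below are placeholders, not from this thread):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, concat, lit}

    val spark = SparkSession.builder().appName("gen-test-data").getOrCreate()

    // spark.range materializes its partitions on the executors, so all
    // 100 million rows are produced in parallel across the cluster.
    val big = spark.range(0L, 100000000L, 1L, 200)   // 200 partitions
      .withColumn("payload", concat(lit("row-"), col("id").cast("string")))

    big.write.mode("overwrite").parquet("/user/devuser/testdata/big")   // placeholder path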
2018-12-14
lk_spark

From: 15313776907 <15313776...@163.com>
Sent: 2018-12-14 16:39
Subject: Re: how to generate a larg dataset paralleled
To: "lk_sp...@163.com" <lk_sp...@163.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>

I also have this problem and hope it can be solved here, thank you.

On 12/14/2018 10:38, lk_spark <lk_sp...@163.com> wrote:

hi, all:
    I want to generate some test data containing about one hundred million rows. I created a Dataset with ten rows and called df.union in a 'for' loop, but this causes the operation to happen only on the driver node. How can I do it on the whole cluster?

2018-12-14
lk_spark
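One way to scale a ten-row seed out on the whole cluster, rather than unioning in a driver-side loop, is a cross join against spark.range. A sketch, assuming the spark-shell `spark` session and reusing the seed parquet path from the reply above (the column name `copy_id` is just a placeholder):

    // Each of the 10 seed rows is paired with 10 million generated ids,
    // giving ~100 million rows built in parallel across 200 partitions.
    val seed = spark.read.parquet("/user/devuser/testdata/df1")
    val big  = seed.crossJoin(spark.range(0L, 10000000L, 1L, 200).toDF("copy_id"))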