Sorry, for now this is what I can do:

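// Read the seed data into a single partition.
var df5 = spark.read.parquet("/user/devuser/testdata/df1").coalesce(1)
// Union the seed with itself four times, giving 5x the original row count.
df5 = df5.union(df5).union(df5).union(df5).union(df5)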
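
One possible alternative to chaining unions, sketched here under assumptions rather than tested on this cluster: cross join the small seed DataFrame with a range that Spark generates on the executors, so the rows are multiplied in parallel. The helper name `multiply`, the column name `replica_id`, the partition count, and the output path are all made up for illustration, and the seed is assumed to hold ten rows as in the original question and to have no column already called `replica_id`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: repeat every row of `seed` `copies` times.
def multiply(spark: SparkSession, seed: DataFrame, copies: Long, partitions: Int): DataFrame = {
  // spark.range produces its rows on the executors, split into `partitions` slices.
  val replicas = spark.range(0L, copies, 1L, partitions).toDF("replica_id")
  // Each replica id is paired with every seed row; the helper column is then dropped.
  replicas.crossJoin(seed).drop("replica_id")
}

// Usage in spark-shell, where `spark` is already defined:
// val seed = spark.read.parquet("/user/devuser/testdata/df1")
// val big  = multiply(spark, seed, copies = 10000000L, partitions = 200) // 10 rows x 10,000,000
// big.write.mode("overwrite").parquet("/user/devuser/testdata/df_big")   // hypothetical output path
```

The partition count controls how many tasks generate the data, so it probably needs tuning to the number of executor cores available.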

2018-12-14 

lk_spark 



From: 15313776907 <[email protected]>
Sent: 2018-12-14 16:39
Subject: Re: how to generate a larg dataset paralleled
To: "[email protected]"<[email protected]>
Cc: "[email protected]"<[email protected]>



I have the same problem and hope it can be solved here. Thank you.
On 12/14/2018 10:38,lk_spark<[email protected]> wrote: 
Hi all,
    I want to generate some test data containing about one hundred 
million rows.
    I created a dataset with ten rows and call df.union in a 'for' 
loop, but this causes the operation to happen only on the driver node.
    How can I do this across the whole cluster?

2018-12-14


lk_spark 
