Actually, since you use Dataset's union API, it will, unlike RDD's union
API, break up the nested structure. So that should not be the issue.
The additional time introduced as the number of dataframes grows is spent
in the analysis stage. I think that, as the Union has a long children list,
the
.
From: Maciej Szymkiewicz [via Apache Spark Developers List]
[mailto:ml-node+s1001551n20395...@n3.nabble.com]
Sent: Thursday, December 29, 2016 7:39 PM
To: Mendelson, Assaf
Subject: Re: repeated unioning of dataframes take worse than O(N^2) time
Iterative union like this creates a deeply nested recursive structure.
Don't do that. Union them all at once with SparkContext.union.
On Thu, Dec 29, 2016, 17:21 assaf.mendelson wrote:
> Hi,
>
>
>
> I have been playing around with doing union between a large number of
> dataframes and saw that the performance of the actual union (not the
>
Iterative union like this creates a deeply nested recursive structure, in
a manner similar to the one described here: http://stackoverflow.com/q/34461804
You can try something like this: http://stackoverflow.com/a/37612978 but
there is of course an overhead of conversion between Dataset and RDD.
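
For illustration, a minimal sketch of the "union them all at once" idea in
PySpark. The helper name union_all is hypothetical, and it assumes that spark
is an existing SparkSession and that all dataframes share the same schema. It
goes through the underlying RDDs, so the Dataset/RDD conversion overhead
mentioned above still applies.

```python
def union_all(spark, dfs):
    """Union many DataFrames in one step instead of chaining df.union(...).

    `spark` is assumed to be a SparkSession; `dfs` is a non-empty list of
    DataFrames that all share the schema of the first one (an assumption
    of this sketch, not checked here).
    """
    if not dfs:
        raise ValueError("need at least one DataFrame")
    # One flat union over the underlying RDDs, rather than N nested unions.
    combined_rdd = spark.sparkContext.union([df.rdd for df in dfs])
    # Convert back to a DataFrame, reusing the first schema.
    return spark.createDataFrame(combined_rdd, dfs[0].schema)
```

With a list of dataframes named dfs, this would be called as
combined = union_all(spark, dfs), in place of a loop that repeatedly calls
union on the running result.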
On