Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-30 Thread Liang-Chi Hsieh
Actually, since you are using Dataset's union API, which, unlike RDD's union API, breaks up the nested structure, that should not be the issue. The additional time incurred as the number of dataframes grows is spent in the analysis stage. I think that because the Union has a long list of children, the

RE: repeated unioning of dataframes take worse than O(N^2) time

2016-12-29 Thread assaf.mendelson
From: Maciej Szymkiewicz [via Apache Spark Developers List] [mailto:ml-node+s1001551n20395...@n3.nabble.com]
Sent: Thursday, December 29, 2016 7:39 PM
To: Mendelson, Assaf
Subject: Re: repeated unioning of dataframes take worse than O(N^2) time

Iterative union like this creates a deeply nested

Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-29 Thread Sean Owen
Don't do that. Union them all at once with SparkContext.union

On Thu, Dec 29, 2016, 17:21 assaf.mendelson wrote:
> Hi,
>
> I have been playing around with doing union between a large number of
> dataframes and saw that the performance of the actual union (not the
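
For illustration, the one-shot union suggested above might look roughly like this in PySpark. This is only a sketch: the helper name `union_all` is made up for this example, it assumes `spark` is an active SparkSession and that every DataFrame has the same schema, and it goes through RDDs, so it pays a Dataset-to-RDD conversion cost.

```python
def union_all(spark, dfs):
    """Union many DataFrames in one step via SparkContext.union.

    Hypothetical helper (not part of Spark): assumes `spark` is an active
    SparkSession and that all DataFrames in `dfs` share one schema.
    """
    dfs = list(dfs)
    if not dfs:
        raise ValueError("need at least one DataFrame")
    # SparkContext.union takes a list of RDDs and builds a single union,
    # rather than a chain of pairwise unions.
    rdd = spark.sparkContext.union([df.rdd for df in dfs])
    # Rebuild a DataFrame from the combined RDD, reusing the first schema.
    return spark.createDataFrame(rdd, dfs[0].schema)
```

With a list of DataFrames `dfs`, a call would be `combined = union_all(spark, dfs)`, replacing a loop of `df = df.union(other)` calls.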

Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-29 Thread Maciej Szymkiewicz
Iterative union like this creates a deeply nested recursive structure, in a manner similar to the one described here: http://stackoverflow.com/q/34461804 You can try something like this: http://stackoverflow.com/a/37612978 but there is of course an overhead of conversion between Dataset and RDD. On
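
To make the nesting concrete, here is a toy model in plain Python. It is not Spark code: the `Union` class below is a stand-in invented for this sketch, and it only counts how deep the resulting tree gets when unions are chained pairwise versus built in a single step.

```python
from functools import reduce


class Union:
    """Toy plan node holding a list of children (a stand-in, not a Spark class)."""

    def __init__(self, children):
        self.children = list(children)


def depth(node):
    """Return the maximum nesting depth; plain leaves count as depth 0."""
    best, stack = 0, [(node, 0)]
    # Explicit stack instead of recursion, so deep chains cannot hit
    # Python's recursion limit.
    while stack:
        cur, d = stack.pop()
        best = max(best, d)
        if isinstance(cur, Union):
            stack.extend((child, d + 1) for child in cur.children)
    return best


def chained(leaves):
    # Pairwise, iterative union: ((a U b) U c) U d ...
    return reduce(lambda acc, leaf: Union([acc, leaf]), leaves)


def flat(leaves):
    # One union node over all inputs at once.
    return Union(leaves)


leaves = [f"df{i}" for i in range(1000)]
print(depth(chained(leaves)))  # 999
print(depth(flat(leaves)))     # 1
```

In this model the chained form grows one level per input, while the single-step form stays at depth 1 regardless of how many inputs there are.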