Re: generateTreeString causes huge performance problems on dataframe persistence

Jan-Paul Bultmann Wed, 17 Jun 2015 09:38:10 -0700

> It seems you're hitting the self-join case; currently Spark SQL won't cache any 
> result/logical tree for further analysis or computation of a self-join.


Do other joins not suffer from this problem?

> Since the logical tree is huge, it's reasonable that generating its tree 
> string recursively takes a long time.

The internal structure is basically a graph though, right?
One where equal cached subtrees are structurally shared by reference instead of 
being copied by value.

Is the `generateTreeString` result needed for anything other than giving the 
RDD a nice name?
It seems rather wasteful to compute a graph's unfolding into a tree for this.
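
As a minimal sketch of the concern (plain Python, not Spark's actual classes; `Node`, `tree_string_visits`, `distinct_nodes` and `self_join_plan` are hypothetical names used only for illustration): when each iteration joins the previous result with itself, the graph gains one node per iteration, but a naive recursive walk over it revisits the shared child twice at every level.

```python
class Node(object):
    """A plan node whose children are held by reference, so they may be shared."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def tree_string_visits(node):
    """Count the nodes a naive recursive tree walk touches (shared ones repeatedly)."""
    return 1 + sum(tree_string_visits(child) for child in node.children)


def distinct_nodes(node, seen=None):
    """Count the nodes of the underlying graph, visiting each shared node once."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(distinct_nodes(child, seen) for child in node.children)


def self_join_plan(iterations):
    """Build result_{i+1} = join(result_i, result_i), sharing the child node."""
    result = Node("result_0")
    for i in range(iterations):
        result = Node("result_%d" % (i + 1), [result, result])
    return result


plan = self_join_plan(10)
print(distinct_nodes(plan))      # 11 nodes in the graph
print(tree_string_visits(plan))  # 2047 visits when unfolded into a tree
```

With n iterations the graph has n + 1 nodes while the unfolded walk touches 2^(n+1) - 1, which would match the hours spent in string generation if the real plan is walked the same way.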

> And I also doubt the computation can finish within a reasonable time, as there 
> will probably be lots of partitions (growing exponentially) in the intermediate 
> result.
> 

Possibly, though so far the number of partitions has stayed the same.
But I didn’t run that many iterations due to the problem :).

> As a workaround, you can break the iterations into smaller ones and trigger 
> them manually in sequence.

You mean `write`ing them to disk after each iteration?
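
If so, a rough sketch of that reading might look like the following. It assumes the Spark 1.4 Python API (`DataFrame.write.parquet` and `SQLContext.read.parquet`); `iterate_via_disk`, `step` and `base_path` are made-up names, and this is an illustration of the idea rather than a tested job.

```python
def iterate_via_disk(sql_context, df, step, iterations, base_path):
    """Apply `step` repeatedly, round-tripping through Parquet after each iteration.

    Reading a result back yields a DataFrame whose plan is just a file scan,
    so the plan should no longer grow with the number of iterations.
    """
    for i in range(iterations):
        path = "%s/iteration_%d" % (base_path, i)
        # Materialize this iteration's result on disk (assumes Spark 1.4+ writer API).
        step(df).write.parquet(path)
        # Reload it; the new DataFrame does not reference the previous plan.
        df = sql_context.read.parquet(path)
    return df
```

Here `step` would hold the per-iteration logic, e.g. the self-join followed by `distinct()`.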

Thanks :), Jan

> -----Original Message-----
> From: Jan-Paul Bultmann [mailto:janpaulbultm...@me.com] 
> Sent: Wednesday, June 17, 2015 6:17 PM
> To: User
> Subject: generateTreeString causes huge performance problems on dataframe 
> persistence
> 
> Hey,
> I noticed that my code spends hours with `generateTreeString` even though the 
> actual dag/dataframe execution takes seconds.
> 
> I’m running a query that grows exponentially in the number of iterations when 
> evaluated without caching, but should be linear when caching previous results.
> 
> E.g.
> 
>    result_i+1 = distinct(join(result_i, result_i))
> 
> Which evaluates exponentially like this without caching.
> 
> Iteration | Dataframe Plan Tree
> 0            |        /\
> 1            |     /\    /\
> 2            |    /\/\  /\/\
> n            |    ……….
> 
> But should be linear with caching.
> 
> Iteration | Dataframe Plan Tree
> 0            |     /\
>              |     \/
> 1            |     /\
>              |     \/
> 2            |     /\
>              |     \/
> n            | ……….
> 
> 
> It seems that even though the DAG will have the latter form, 
> `generateTreeString` will walk the entire plan naively as if no caching had 
> been done.
> 
> The Spark web UI also shows no active jobs even though one CPU core is fully 
> used calculating that string.
> 
> Below is the piece of stacktrace that starts the entire walk.
> 
> ^
> |
> Thousands of calls to  `generateTreeString`.
> |
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(int, 
> StringBuilder) TreeNode.scala:431
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString() TreeNode.scala:400
> org.apache.spark.sql.catalyst.trees.TreeNode.toString() TreeNode.scala:397
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$buildBuffers$2.apply()
>  InMemoryColumnarTableScan.scala:164
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$buildBuffers$2.apply()
>  InMemoryColumnarTableScan.scala:164
> scala.Option.getOrElse(Function0) Option.scala:120
> org.apache.spark.sql.columnar.InMemoryRelation.buildBuffers() 
> InMemoryColumnarTableScan.scala:164
> org.apache.spark.sql.columnar.InMemoryRelation.<init>(Seq, boolean, int, 
> StorageLevel, SparkPlan, Option, RDD, Statistics, Accumulable) 
> InMemoryColumnarTableScan.scala:112
> org.apache.spark.sql.columnar.InMemoryRelation$.apply(boolean, int, 
> StorageLevel, SparkPlan, Option) InMemoryColumnarTableScan.scala:45
> org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply() 
> CacheManager.scala:102
> org.apache.spark.sql.execution.CacheManager.writeLock(Function0) 
> CacheManager.scala:70 
> org.apache.spark.sql.execution.CacheManager.cacheQuery(DataFrame, Option, 
> StorageLevel) CacheManager.scala:94
> org.apache.spark.sql.DataFrame.persist(StorageLevel) DataFrame.scala:1320
> ^
> |
> Application logic.
> |
> 
> Could someone confirm my suspicion?
> And does somebody know why it’s called while caching, and why it walks the 
> entire tree including cached results?
> 
> Cheers, Jan-Paul
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
> commands, e-mail: user-h...@spark.apache.org
> 

