> Seems you're hitting the self-join; currently Spark SQL won't cache any
> result/logical tree for further analyzing or computing for self-join.

Other joins don’t suffer from this problem?

> Since the logical tree is huge, it's reasonable to take a long time in
> generating its tree string recursively.

The internal structure is basically a graph though, right? Equal cached subtrees are structurally shared by reference instead of being copied by value.
Is the `generateTreeString` result needed for anything other than giving the RDD a nice name?
It seems rather wasteful to compute a graph's unfolding into a tree for this.

> And I also doubt the computing can finish within a reasonable time, as there
> will probably be lots of partitions (grows exponentially) of the intermediate
> result.

Possibly, though so far the number of partitions has stayed the same.
But I didn’t run that many iterations due to the problem :).

> As a workaround, you can break the iterations into smaller ones and trigger
> them manually in sequence.

You mean `write`ing them to disk after each iteration?

Thanks :),
Jan

> -----Original Message-----
> From: Jan-Paul Bultmann [mailto:janpaulbultm...@me.com]
> Sent: Wednesday, June 17, 2015 6:17 PM
> To: User
> Subject: generateTreeString causes huge performance problems on dataframe
> persistence
>
> Hey,
> I noticed that my code spends hours in `generateTreeString` even though the
> actual DAG/dataframe execution takes seconds.
>
> I’m running a query that grows exponentially in the number of iterations when
> evaluated without caching, but should be linear when caching previous results.
>
> E.g.
>
> result_i+1 = distinct(join(result_i, result_i))
>
> Which evaluates exponentially like this without caching.
>
> Iteration | Dataframe Plan Tree
> 0         | /\
> 1         | /\ /\
> 2         | /\/\ /\/\
> n         | ……….
>
> But should be linear with caching.
>
> Iteration | Dataframe Plan Tree
> 0         | /\
>           | \/
> 1         | /\
>           | \/
> 2         | /\
>           | \/
> n         | ……….
>
>
> It seems that even though the DAG will have the latter form,
> `generateTreeString` will walk the entire plan naively as if no caching was
> done.
>
> The Spark web UI also shows no active jobs even though my CPU uses one core
> fully, calculating that string.
>
> Below is the piece of stacktrace that starts the entire walk.
>
> ^
> |
> Thousands of calls to `generateTreeString`.
> |
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(int, StringBuilder) TreeNode.scala:431
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString() TreeNode.scala:400
> org.apache.spark.sql.catalyst.trees.TreeNode.toString() TreeNode.scala:397
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$buildBuffers$2.apply() InMemoryColumnarTableScan.scala:164
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$buildBuffers$2.apply() InMemoryColumnarTableScan.scala:164
> scala.Option.getOrElse(Function0) Option.scala:120
> org.apache.spark.sql.columnar.InMemoryRelation.buildBuffers() InMemoryColumnarTableScan.scala:164
> org.apache.spark.sql.columnar.InMemoryRelation.<init>(Seq, boolean, int, StorageLevel, SparkPlan, Option, RDD, Statistics, Accumulable) InMemoryColumnarTableScan.scala:112
> org.apache.spark.sql.columnar.InMemoryRelation$.apply(boolean, int, StorageLevel, SparkPlan, Option) InMemoryColumnarTableScan.scala:45
> org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply() CacheManager.scala:102
> org.apache.spark.sql.execution.CacheManager.writeLock(Function0) CacheManager.scala:70
> org.apache.spark.sql.execution.CacheManager.cacheQuery(DataFrame, Option, StorageLevel) CacheManager.scala:94
> org.apache.spark.sql.DataFrame.persist(StorageLevel) DataFrame.scala:1320
> ^
> |
> Application logic.
> |
>
> Could someone confirm my suspicion?
> And does somebody know why it’s called while caching, and why it walks the
> entire tree including cached results?
>
> Cheers, Jan-Paul
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
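
To make the suspected blow-up concrete, here is a minimal toy model in Python. It is not Spark's code: `Node`, `build_plan`, `tree_string_visits` and `unique_nodes` are hypothetical names invented for this sketch. It assumes the plan for result_i+1 = distinct(join(result_i, result_i)) shares result_i by reference, then compares a naive recursive walk (roughly what a tree-string printer would do) with a walk that visits each shared node once.

```python
class Node:
    """Toy plan node (hypothetical, not a Spark class); children may be shared by reference."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = tuple(children)


def build_plan(iterations):
    """Build result_{i+1} = distinct(join(result_i, result_i)), reusing the same result_i object."""
    plan = Node("scan")
    for i in range(iterations):
        join = Node(f"join_{i}", (plan, plan))  # both sides point at one shared subtree
        plan = Node(f"distinct_{i}", (join,))
    return plan


def tree_string_visits(node):
    """Count node visits of a naive recursive walk that unfolds the DAG into a tree."""
    return 1 + sum(tree_string_visits(child) for child in node.children)


def unique_nodes(node, seen=None):
    """Count distinct node objects, visiting each shared subtree only once."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(unique_nodes(child, seen) for child in node.children)


if __name__ == "__main__":
    for n in (1, 5, 10, 15):
        plan = build_plan(n)
        print(n, tree_string_visits(plan), unique_nodes(plan))
```

In this model the naive walk roughly doubles its visit count with every iteration, while the number of distinct nodes grows by two per iteration, which matches the exponential versus linear pictures in the original message. Whether Spark's plan objects are actually shared this way is exactly the open question in the thread.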