[ https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149227#comment-15149227 ]
Joseph K. Bradley commented on SPARK-13346:
-------------------------------------------

CC: [~andrewor14] [~joshrosen] whom I spoke with about this issue

> DataFrame caching is not handled well during planning or execution
> -------------------------------------------------------------------
>
>                 Key: SPARK-13346
>                 URL: https://issues.apache.org/jira/browse/SPARK-13346
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Joseph K. Bradley
>
> I have an iterative algorithm based on DataFrames, and the query plan grows
> very quickly with each iteration. Caching the current DataFrame at the end
> of an iteration does not fix the problem. However, converting the DataFrame
> to an RDD and back at the end of each iteration does fix the problem.
> Printing the query plans shows that the plan explodes quickly (10 lines, to
> several hundred lines, to several thousand lines, ...) with successive
> iterations.
> The desired behavior is for the analyzer to recognize that a big chunk of the
> query plan does not need to be computed since it is already cached. The
> computation on each iteration should be the same.
> If useful, I can push (complex) code to reproduce the issue. But it should
> be simple to see if you create an iterative algorithm which produces a new
> DataFrame from an old one on each iteration.
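For illustration, a minimal sketch of the kind of iterative loop described above. This is not the reporter's actual code; the object name, column name, and iteration count are made up, and the commented-out line is the RDD round-trip workaround mentioned in the description:

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object IterativePlanGrowth {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IterativePlanGrowth")
      .master("local[*]")
      .getOrCreate()

    // Start from a small DataFrame; the column name "value" is illustrative.
    var df: DataFrame = spark.range(0L, 1000L).toDF("value")

    for (i <- 1 to 10) {
      // Each iteration derives a new DataFrame from the previous one.
      df = df.withColumn("value", col("value") + 1)

      // Caching alone does not truncate the lineage; the analyzed plan keeps growing.
      df.cache()
      df.count() // materialize the cache

      // Workaround described above: round-trip through an RDD to cut the plan.
      // df = spark.createDataFrame(df.rdd, df.schema)

      val planLines = df.queryExecution.analyzed.toString.split("\n").length
      println(s"Iteration $i: analyzed plan has $planLines lines")
    }

    spark.stop()
  }
}
{code}

With the createDataFrame round-trip uncommented, the printed plan size should stay roughly constant across iterations, which matches the workaround reported in the description.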