[ https://issues.apache.org/jira/browse/SPARK-25377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Iverson Hu updated SPARK-25377:
-------------------------------
    Description:
When I use a Spark SQL DataFrame in my application, dataframe.cache appears to have no effect: the first action, such as count(), took about 40 seconds, and the second execution of the same action took just as long. When I use dataframe.rdd.cache instead, the second execution is faster than the first. I think this is a bug in Spark SQL DataFrames.

Below are my code and the console log. The result DataFrame had already been cached before this code ran.

This is my code:

    logger.info("start to consuming result count")
    logger.info(s"consuming ${result.count} output records")
    //result.show(false)
    logger.info("starting go to MysqlSink")
    logger.info(s"consuming ${result.count} output records")
    logger.info("starting go to MysqlSink")

And the console log is below:

    18/09/08 14:15:17 INFO MySQLRiskScenarioRunner: start to consuming result count
    18/09/08 14:15:49 INFO MySQLRiskScenarioRunner: consuming 5 output records
    18/09/08 14:15:49 INFO MySQLRiskScenarioRunner: starting go to MysqlSink
    18/09/08 14:16:22 INFO MySQLRiskScenarioRunner: consuming 5 output records
    18/09/08 14:16:22 INFO MySQLRiskScenarioRunner: starting go to MysqlSink

  was:
When I use a Spark SQL DataFrame in my application, dataframe.cache appears to have no effect: the first action, such as count(), took about 40 seconds, and the second execution of the same action took just as long. When I use dataframe.rdd.cache instead, the second execution is faster than the first. I think this is a bug in Spark SQL DataFrames.

Below are my code and the console log. The result DataFrame had already been cached before this code ran.

!image-2018-09-08-14-18-36-780.png!

!image-2018-09-08-14-18-07-759.png!

> spark sql dataframe cache is invalid
> ------------------------------------
>
>                 Key: SPARK-25377
>                 URL: https://issues.apache.org/jira/browse/SPARK-25377
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>         Environment: spark version 2.3.0
>                      scala version 2.1.8
>            Reporter: Iverson Hu
>            Priority: Major
>
> When I use a Spark SQL DataFrame in my application, dataframe.cache appears
> to have no effect: the first action, such as count(), took about 40 seconds,
> and the second execution of the same action took just as long. When I use
> dataframe.rdd.cache instead, the second execution is faster than the first.
> I think this is a bug in Spark SQL DataFrames.
>
> Below are my code and the console log. The result DataFrame had already been
> cached before this code ran.
>
> This is my code:
>
>     logger.info("start to consuming result count")
>     logger.info(s"consuming ${result.count} output records")
>     //result.show(false)
>     logger.info("starting go to MysqlSink")
>     logger.info(s"consuming ${result.count} output records")
>     logger.info("starting go to MysqlSink")
>
> And the console log is below:
>
>     18/09/08 14:15:17 INFO MySQLRiskScenarioRunner: start to consuming result count
>     18/09/08 14:15:49 INFO MySQLRiskScenarioRunner: consuming 5 output records
>     18/09/08 14:15:49 INFO MySQLRiskScenarioRunner: starting go to MysqlSink
>     18/09/08 14:16:22 INFO MySQLRiskScenarioRunner: consuming 5 output records
>     18/09/08 14:16:22 INFO MySQLRiskScenarioRunner: starting go to MysqlSink