[jira] [Commented] (SPARK-10287) After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table
[ https://issues.apache.org/jira/browse/SPARK-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717737#comment-14717737 ]

Yin Huai commented on SPARK-10287:
----------------------------------

We need to add the following release note:

JSON data source will not automatically load new files that are created by other applications (i.e. files that are not inserted to the dataset through Spark SQL). [SPARK-10287]

> After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table
>
> Key: SPARK-10287
> URL: https://issues.apache.org/jira/browse/SPARK-10287
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Yin Huai
> Assignee: Yin Huai
> Priority: Critical
> Labels: releasenotes
> Fix For: 1.5.1
>
> I have a partitioned JSON table with 1824 partitions.
> {code}
> val df = sqlContext.read.format("json").load(aPartitionedJsonData)
> val columnStr = df.schema.map(_.name).mkString(",")
> println(s"columns: $columnStr")
> val hash = df
>   .selectExpr(s"hash($columnStr) as hashValue")
>   .groupBy()
>   .sum("hashValue")
>   .head()
>   .getLong(0)
> {code}
> Looks like for JSON, we refresh metadata when we call buildScan. For a partitioned table, we call buildScan for every partition, so we will refresh this table 1824 times.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
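The issue description above says the metadata refresh happens inside buildScan, once per partition. A toy Scala sketch of that cost pattern (the class and method names here are illustrative stand-ins, not Spark's actual JSONRelation/HadoopFsRelation internals):

{code}
// Toy model of the reported behavior: a refresh triggered inside
// buildScan runs once for every partition scanned. Names are
// illustrative, not Spark's real classes.
class JsonRelationModel(partitionPaths: Seq[String]) {
  var refreshCount = 0

  // Simulates re-listing files and re-reading table metadata.
  private def refresh(): Unit = refreshCount += 1

  // The reported behavior: every buildScan call refreshes metadata.
  def buildScan(partitionPath: String): Unit = refresh()

  // A partitioned query calls buildScan once per partition.
  def scanAllPartitions(): Unit = partitionPaths.foreach(buildScan)
}

val relation = new JsonRelationModel((1 to 1824).map(i => s"part=$i"))
relation.scanAllPartitions()
println(relation.refreshCount) // 1824, one refresh per partition
{code}

With 1824 partitions this model refreshes 1824 times, matching the report; moving the refresh out of buildScan (so it runs once per query) reduces that to a single refresh.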
[jira] [Commented] (SPARK-10287) After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table
[ https://issues.apache.org/jira/browse/SPARK-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14715570#comment-14715570 ]

Apache Spark commented on SPARK-10287:
--------------------------------------

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8469

> After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table
>
> Key: SPARK-10287
> URL: https://issues.apache.org/jira/browse/SPARK-10287
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Yin Huai
> Priority: Critical