[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
[ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358402#comment-16358402 ]

Marco Gaido commented on SPARK-23373:
--------------------------------------

Then I think we can close this, thanks.

> Can not execute "count distinct" queries on parquet formatted table
> --------------------------------------------------------------------
>
>                 Key: SPARK-23373
>                 URL: https://issues.apache.org/jira/browse/SPARK-23373
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Wang, Gang
>            Priority: Major
>
> I failed to run the SQL "select count(distinct n_name) from nation". The table nation is stored as Parquet; the error trace is as follows.
>
> spark-sql> select count(distinct n_name) from nation;
> 18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select count(distinct n_name) from nation
> Error in query: Table or view not found: nation; line 1 pos 35
> spark-sql> select count(distinct n_name) from nation_parquet;
> 18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select count(distinct n_name) from nation_parquet
> 18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int
> 18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string
> 18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int
> 18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string
> 18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: array
> 18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:
> 18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:
> 18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:
> 18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: struct
> 18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:
> 18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms
> 18/02/09 03:55:39 INFO main HashAggregateExec:54 spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate.
> 18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms
> 18/02/09 03:55:39 INFO main HashAggregateExec:54 spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate.
> 18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms
> 18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is true
> 18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class
> 18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined
> 18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class
> 18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as values in memory (estimated size 305.0 KB, free 366.0 MB)
> 18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.6 KB, free 366.0 MB)
> 18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 366.3 MB)
> 18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from processCmd at CliDriver.java:376
> 18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after partition pruning:
> PartitionDirectory([empty row],ArrayBuffer(LocatedFileStatus{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0; isDirectory=false; length=3216; replication=3; blocksize=134217728; modification_time=1516619879024; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}))
> 18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
> 18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select count(distinct n_name) from nation_parquet]
> org.apache.spark.SparkException: Task not serializable
> at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
> at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at ...
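For anyone who wants to try the failing query outside of the reporter's cluster, a minimal self-contained sketch follows. The table name nation_parquet and the n_name column come from the report above; the sample rows, the n_nationkey column, and the local-mode session are illustrative assumptions rather than part of the original environment (which ran through the spark-sql CLI against a Hive warehouse).

{code:scala}
import org.apache.spark.sql.SparkSession

object CountDistinctRepro {
  def main(args: Array[String]): Unit = {
    // Local session; the original report used the spark-sql CLI instead.
    val spark = SparkSession.builder()
      .appName("SPARK-23373 count distinct repro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny Parquet-backed table standing in for nation_parquet.
    Seq((0, "ALGERIA"), (1, "ARGENTINA"), (2, "BRAZIL"))
      .toDF("n_nationkey", "n_name")
      .write
      .mode("overwrite")
      .format("parquet")
      .saveAsTable("nation_parquet")

    // The failing query from the report; on the reporter's 2.2.0 setup this
    // threw org.apache.spark.SparkException: Task not serializable.
    spark.sql("select count(distinct n_name) from nation_parquet").show()

    spark.stop()
  }
}
{code}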
[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
[ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358392#comment-16358392 ]

Yuming Wang commented on SPARK-23373:
--------------------------------------

I cannot reproduce it on current master either, as you mentioned.
[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
[ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358355#comment-16358355 ]

Wang, Gang commented on SPARK-23373:
-------------------------------------

Yes, it seems related to my test environment. However, I also tried it in a Spark test suite: in class PruneFileSourcePartitionsSuite, inside the test "SPARK-20986 Reset table's statistics after PruneFileSourcePartitions rule", I added

sql("select count(distinct id) from tbl").collect()

and got the same exception. Could you please give it a try on your side? (A sketch of the change is shown below.)
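To make the step above concrete, here is a rough sketch of how such a query slots into a Spark SQL test. The withTable/spark.range scaffolding is illustrative only and does not copy the real body of the SPARK-20986 test; only the added count(distinct) query comes from the comment above, and the helpers (withTable, spark, sql) are assumed to be those available to suites mixing in SQLTestUtils.

{code:scala}
// Illustrative fragment for a suite that mixes in SQLTestUtils.
withTable("tbl") {
  // Hypothetical partitioned Parquet table standing in for the one the real test builds.
  spark.range(10)
    .selectExpr("id", "id % 3 as p")
    .write
    .partitionBy("p")
    .format("parquet")
    .saveAsTable("tbl")

  // Line added per the comment above; on the reporter's 2.2.0 environment this
  // was reported to fail with org.apache.spark.SparkException: Task not serializable.
  sql("select count(distinct id) from tbl").collect()
}
{code}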
[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
[ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358315#comment-16358315 ]

Marco Gaido commented on SPARK-23373:
--------------------------------------

I cannot reproduce this on current master. Could you try and check whether the issue still exists?