[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358402#comment-16358402
 ] 

Marco Gaido commented on SPARK-23373:
-

Then I think we can close this, thanks.

> Can not execute "count distinct" queries on parquet formatted table
> ---
>
> Key: SPARK-23373
> URL: https://issues.apache.org/jira/browse/SPARK-23373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wang, Gang
>Priority: Major
>
> I failed to run the SQL query "select count(distinct n_name) from nation"; the table nation 
> is stored in Parquet format. The error trace is as follows.
> _spark-sql> select count(distinct n_name) from nation;_
>  _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_
>  _Error in query: Table or view not found: nation; line 1 pos 35_
>  _spark-sql> select count(distinct n_name) from nation_parquet;_
>  _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_parquet_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: 
> array_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: 
> struct_
>  _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is 
> true_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as 
> values in memory (estimated size 305.0 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored 
> as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added 
> broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 
> 366.3 MB)_
>  _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from 
> processCmd at CliDriver.java:376_
>  _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after 
> partition pruning:_
>  _PartitionDirectory([empty 
> row],ArrayBuffer(LocatedFileStatus{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0;
>  isDirectory=false; length=3216; replication=3; blocksize=134217728; 
> modification_time=1516619879024; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false}))_
>  _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes._
>  _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select 
> count(distinct n_name) from nation_parquet]_
>  *_org.apache.spark.SparkException: Task not serializable_*
>  _at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_
>  _at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_
>  _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_
>  _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_
>  _at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_
>  _at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_
>  _at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_
>  _at 
> 
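
For reference, a self-contained sketch of the reported scenario, written against the 
Scala API rather than the spark-sql CLI. The table DDL is not part of the report, so 
the column names other than n_name and the array element type below are assumptions, 
inferred from TPC-H-style naming and the CatalystSqlParser lines in the log 
(int, string, int, string, array):

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction; the real DDL of nation_parquet is not in the
// report. Column names and the ARRAY<STRING> element type are assumed.
val spark = SparkSession.builder()
  .appName("SPARK-23373-repro")
  .enableHiveSupport()
  .getOrCreate()

spark.sql(
  """CREATE TABLE IF NOT EXISTS nation_parquet (
    |  n_nationkey INT,
    |  n_name      STRING,
    |  n_regionkey INT,
    |  n_comment   STRING,
    |  n_tags      ARRAY<STRING>
    |) STORED AS PARQUET""".stripMargin)

// In the reporter's Spark 2.2.0 environment this query failed with
// org.apache.spark.SparkException: Task not serializable.
spark.sql("select count(distinct n_name) from nation_parquet").show()
{code}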

[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358392#comment-16358392
 ] 

Yuming Wang commented on SPARK-23373:
-

I cannot reproduce it on current master either, as you mentioned.


[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Wang, Gang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358355#comment-16358355
 ] 

Wang, Gang commented on SPARK-23373:


Yes, it seems related to my test environment.

Still, I tried it in a Spark test suite: in class _*PruneFileSourcePartitionsSuite*_, 
in the test method test("SPARK-20986 Reset table's statistics after 
PruneFileSourcePartitions rule"), adding

_sql("select count(distinct id) from tbl").collect()_

got the same exception. Could you please give it a try on your side?
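
For concreteness, a minimal sketch of that modification, assuming the suite builds a 
small partitioned table named tbl (the test's original assertions are elided; this is 
an illustration, not the actual suite code):

{code:scala}
// Sketch only: the real body of PruneFileSourcePartitionsSuite's test is not
// quoted in this thread, so the table setup below is assumed for illustration.
test("SPARK-20986 Reset table's statistics after PruneFileSourcePartitions rule") {
  withTable("tbl") {
    // Hypothetical setup: a small partitioned table named `tbl`.
    spark.range(10).selectExpr("id", "id % 3 AS p")
      .write.partitionBy("p").saveAsTable("tbl")

    // ... the test's existing statistics assertions go here ...

    // The added line: in the reporter's environment this threw
    // org.apache.spark.SparkException: Task not serializable.
    sql("select count(distinct id) from tbl").collect()
  }
}
{code}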


[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358315#comment-16358315
 ] 

Marco Gaido commented on SPARK-23373:
-

I cannot reproduce this on current master... Could you try and check whether the issue 
still exists?
