Raymond Tang created SPARK-40206:
------------------------------------

             Summary: Spark SQL Predicate Pushdown for Hive Bucketed Table
                 Key: SPARK-40206
                 URL: https://issues.apache.org/jira/browse/SPARK-40206
             Project: Spark
          Issue Type: Question
          Components: Spark Core
    Affects Versions: 3.3.0
            Reporter: Raymond Tang
Hi team,

I was testing out Hive bucketed table features. One of the benefits, as most documentation suggests, is that a bucketed Hive table can be used for query filter/predicate pushdown to improve query performance. However, through my exploration, that doesn't seem to be true. *Can you please clarify whether Spark SQL supports query optimizations when using a Hive bucketed table?*

How to reproduce the issue:

Create a Hive 3 table using the following DDL:
{code:java}
create table test_db.bucket_table(user_id int, key string)
comment 'A bucketed table'
partitioned by (country string)
clustered by (user_id) sorted by (key) into 10 buckets
stored as ORC;
{code}

Then insert into this table using the following PySpark script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive support enabled.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

# Prepare sample data for inserting into the Hive table.
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
    data.append([int(i), 'U' + str(i), countries[i % 2]])

# Column names follow the positional order of the generated rows
# (user_id, key, country); insertInto resolves columns by position.
df = spark.createDataFrame(data, ['user_id', 'key', 'country'])
df.show()

# Save df to Hive table test_db.bucket_table.
df.write.mode('append').insertInto('test_db.bucket_table')
{code}

Then query the table using the following script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive support enabled.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.sql("""select * from test_db.bucket_table
where country = 'AU' and user_id = 101""")
df.show()
df.explain(extended=True)
{code}

I expected Spark to read only one bucket file in HDFS, but instead it scanned all bucket files in the partition folder country=AU. Am I doing something wrong, or is it because Spark doesn't support it? Your guidance and help will be appreciated.
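For comparison, here is a minimal sketch of the same experiment using Spark's native bucketing via DataFrameWriter.bucketBy, which does support bucket pruning for equality filters on the bucket column in Spark 3.x. The table name test_db.spark_bucket_table is hypothetical, introduced only for this illustration:

{code:java}
from pyspark.sql import SparkSession

# Create Spark session with Hive support enabled.
spark = SparkSession.builder \
    .appName("Native Spark Bucketing Example") \
    .master("local") \
    .enableHiveSupport() \
    .getOrCreate()

# Re-read the data written by the reproduction script above.
df = spark.table('test_db.bucket_table')

# Rewrite it as a Spark-native bucketed table
# (hypothetical table name, for comparison only).
df.write \
    .bucketBy(10, 'user_id') \
    .sortBy('key') \
    .mode('overwrite') \
    .saveAsTable('test_db.spark_bucket_table')

# For a native bucketed table, the file scan in the physical plan
# should report something like "SelectedBucketsCount: 1 out of 10"
# when the equality filter on the bucket column allows pruning.
spark.sql(
    "select * from test_db.spark_bucket_table where user_id = 101"
).explain(extended=True)
{code}

If the explain output for the native table shows a SelectedBucketsCount of 1 while the Hive bucketed table scan shows no such pruning, that would suggest the limitation is specific to reading Hive-style bucketed tables rather than to bucketing in general.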