Raymond Tang created SPARK-40206:
------------------------------------

             Summary: Spark SQL Predicate Pushdown for Hive Bucketed Table
                 Key: SPARK-40206
                 URL: https://issues.apache.org/jira/browse/SPARK-40206
             Project: Spark
          Issue Type: Question
          Components: Spark Core
    Affects Versions: 3.3.0
            Reporter: Raymond Tang


Hi team,

I was testing out the Hive bucketed table feature. One of the benefits, as most 
documentation suggests, is that a bucketed Hive table can be used for query 
filter/predicate pushdown to improve query performance.

However, through my exploration, that doesn't seem to be true. *Can you please 
clarify whether Spark SQL supports query optimizations when using Hive 
bucketed tables?*

 

How to produce the issue:

Create a Hive 3 table using the following DDL:
{code:sql}
create table test_db.bucket_table(user_id int, key string) 
comment 'A bucketed table' 
partitioned by(country string) 
clustered by(user_id) sorted by (key) into 10 buckets
stored as ORC;{code}
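For background on what pruning would mean here: with bucketing, the predicate value determines a single bucket id, so a filter like {{user_id=101}} should in principle narrow the scan to one bucket file. Below is a minimal pure-Python sketch of that mapping, assuming Hive's default bucketing hash for integer columns (the value itself); this is an illustration of the idea, not actual Spark or Hive code:

```python
def hive_bucket_id(user_id: int, num_buckets: int = 10) -> int:
    # For integer columns, Hive's default bucketing hash is the value
    # itself, so the bucket id reduces to a simple modulo.
    return user_id % num_buckets

# A filter on user_id = 101 maps to exactly one of the 10 buckets,
# so ideally only that bucket's file would need to be scanned.
print(hive_bucket_id(101))  # 1
```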
And then insert into this table using the following PySpark script:
{code:python}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create a Spark session with Hive support enabled.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

# prepare sample data for inserting into hive table
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
    data.append([int(i), 'U' + str(i), countries[i % 2]])

# Column order matches the row layout above: user_id, key, country.
# insertInto resolves columns by position, so the partition column
# (country) must come last, as in the table definition.
df = spark.createDataFrame(data, ['user_id', 'key', 'country'])
df.show()

# Save df to Hive table test_db.bucket_table

df.write.mode('append').insertInto('test_db.bucket_table') {code}
Then query the table using the following script:
{code:python}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create a Spark session with Hive support enabled.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.sql("""select * from test_db.bucket_table
where country='AU' and user_id=101
""")
df.show()
df.explain(extended=True) {code}
I expected Spark to read from only one bucket file in HDFS, but instead it 
scanned all bucket files in the partition folder country=AU.
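To make the expectation concrete: Hive writes one file per bucket in each partition folder, named with a zero-padded bucket id (e.g. 000001_0), and bucket pruning would select only the file whose id matches the predicate. The following is a hypothetical sketch of that selection step; the helper and file list are illustrative assumptions, not actual Spark internals:

```python
def prune_bucket_files(files, user_id, num_buckets=10):
    # Hypothetical pruning: keep only the file whose leading bucket id
    # matches the bucket of user_id (for ints, the value itself mod
    # num_buckets), instead of scanning every file.
    target = user_id % num_buckets
    return [f for f in files if int(f.split('_')[0]) == target]

# The 10 bucket files as Hive would lay them out under country=AU/
files = [f"{b:06d}_0" for b in range(10)]
print(prune_bucket_files(files, 101))  # ['000001_0']
```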

Am I doing something wrong, or is it that Spark doesn't support this? Your 
guidance and help will be appreciated.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
