[Spark-sql 3.2.4] Wrong Statistic INFO From 'ANALYZE TABLE' Command

Nick Luo Fri, 24 Nov 2023 00:00:37 -0800

Hi, all

This question is about running the ANALYZE TABLE command from Spark on a
Hive table.


Question:
If I run the 'ANALYZE TABLE' command on the Hive client before running it
on the spark-sql client, wrong statistics show up.

For example:
 1. Run the analyze table command on the Hive client

 - create table test_anaylze (id int) partitioned by (dt string);
 - insert into test_anaylze partition (dt = "2023-11-24") values(1321);
 - analyze table  test_anaylze partition(dt = "2023-11-24") COMPUTE
STATISTICS;

 2. Run the analyze table command on the spark-sql client

- analyze table  test_anaylze partition(dt = "2023-11-24") COMPUTE
STATISTICS;
- DESC EXTENDED test_anaylze PARTITION (dt = "2023-11-24");

I got the correct information the first time. When I inserted another value
using spark-sql and ran 'ANALYZE TABLE' on the spark-sql client, I still
got the right numRows and totalSize. But when I inserted a third value into
the Hive table, ran 'ANALYZE TABLE' on the Hive client, and then ran
'ANALYZE TABLE' on the spark-sql client, I got wrong statistics in the
partition statistics. It seems that Spark checks whether the Hive
parameters in the Hive metastore (numRows, totalSize) are correct, and the
Spark parameters (spark.sql.statistics.numRows,
spark.sql.statistics.totalSize) are then not updated anymore.
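
To make the mismatch easier to spot, here is a minimal sketch of a helper
that compares the Hive-side and Spark-side values in one partition's
parameter map (for example, copied by hand from the DESC EXTENDED output).
The function name find_stale_stats and the sample values are hypothetical
and only for illustration; they are not taken from the screenshots.

```python
# Hypothetical helper: compare Hive-side and Spark-side statistics
# found in a single partition's parameter map.
HIVE_TO_SPARK = {
    "numRows": "spark.sql.statistics.numRows",
    "totalSize": "spark.sql.statistics.totalSize",
}


def find_stale_stats(params):
    """Return {hive_key: (hive_value, spark_value)} for each pair that differs."""
    mismatches = {}
    for hive_key, spark_key in HIVE_TO_SPARK.items():
        hive_val = params.get(hive_key)
        spark_val = params.get(spark_key)
        # Report only when both values are present and disagree.
        if hive_val is not None and spark_val is not None and hive_val != spark_val:
            mismatches[hive_key] = (hive_val, spark_val)
    return mismatches


# Made-up values for illustration only.
example = {
    "numRows": "3",
    "totalSize": "15",
    "spark.sql.statistics.numRows": "2",
    "spark.sql.statistics.totalSize": "10",
}
print(find_stale_stats(example))
```

An empty result would mean both sets of parameters agree (or one set is
missing); any entry in the result points at a key where the two disagree.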


Can anyone explain why this situation occurs?

[image: 1516f391eab71a4533593f0cf167c4e.png]

[image: 5b8b8067878e22875b524b49b39fa3c.png]
