[ 
https://issues.apache.org/jira/browse/SPARK-46996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46996:
-----------------------------------
    Labels: pull-request-available  (was: )

> Use AnalyzeTableCommand overwrite statistics information incorrectly
> --------------------------------------------------------------------
>
>                 Key: SPARK-46996
>                 URL: https://issues.apache.org/jira/browse/SPARK-46996
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 3.4.1
>            Reporter: Davy Xu
>            Priority: Major
>              Labels: pull-request-available
>
> When the size of an external table changes but its total row count does not, 
> running the SQL statement "analyze table student compute statistics" drops 
> the row count from the table statistics. Afterwards, the Statistics field 
> returned by "desc extended student" contains only the table size and no 
> row count.
> Steps to reproduce:
> {code:sql}
> -- Create the external table:
> create table student(id int, name string, age int) row format delimited
> fields terminated by ','
> lines terminated by '\n' location 'hdfs://nameservice/spark/student';
> -- Contents of the external table files:
> -- class1.txt:
> 1,'Jack',25
> 2,'Thompson',28
> -- class2.txt:
> 3,'Davy',30
> 4,'Thompson',35
> -- class3.txt:
> 5,'Curry',40
> 6,'Morgan',20
> -- Import the external table data (shell):
> hdfs dfs -put 1.txt /spark/student
> hdfs dfs -put 2.txt /spark/student
> -- Analyze the external table statistics:
> analyze table student compute statistics;
> desc extended student;
> -- Result:
> Type EXTERNAL
> Provider hive
> Table Properties [transient_lastDdlTime=1707265554]
> Statistics 56 bytes, 4 rows
> Location hdfs://nameservice/spark/student
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat org.apache.hadoop.mapred.TextInputFormat
> OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties [serialization.format=,, line.delim=
> , field.delim=,]
> Partition Provider Catalog
> -- Modify the external table (shell):
> hdfs dfs -rm /spark/student/student2.txt
> hdfs dfs -put student3.txt /spark/student
> -- Analyze the external table again:
> analyze table student compute statistics;
> desc extended student;
> -- Result:
> Type EXTERNAL
> Provider hive
> Table Properties [transient_lastDdlTime=1707265719]
> Statistics 55 bytes
> Location hdfs://nameservice/spark/student
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat org.apache.hadoop.mapred.TextInputFormat
> OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties [serialization.format=,, line.delim=
> , field.delim=,]
> Partition Provider Catalog
> {code}
> From the results above: after the table size changed while the number of 
> rows stayed the same, the second ANALYZE left only the new size (55 bytes) 
> in the statistics and dropped the row count. I would expect the statistics 
> to include both the new table size and the total number of rows (4 rows). 
> Is it correct to display only the table size here?
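
As a cross-check of the numbers in the report, the following minimal Python sketch recomputes the size and row count for both table states from the file contents listed above. It assumes each row ends with a single '\n', that the files contain only single-byte characters, and that the files named 1.txt, 2.txt, student2.txt and student3.txt in the hdfs commands correspond to class1.txt, class2.txt and class3.txt in the listing. The `FILES` dictionary and `table_stats` helper are illustrative names, not part of Spark.

```python
# File contents as listed in the report.
# Assumptions: LF line endings, no trailing spaces, single-byte characters.
FILES = {
    "class1.txt": ["1,'Jack',25", "2,'Thompson',28"],
    "class2.txt": ["3,'Davy',30", "4,'Thompson',35"],
    "class3.txt": ["5,'Curry',40", "6,'Morgan',20"],
}


def table_stats(file_names):
    """Return (size_in_bytes, row_count) for a table made of the given files."""
    rows = [row for name in file_names for row in FILES[name]]
    size = sum(len((row + "\n").encode("utf-8")) for row in rows)
    return size, len(rows)


# Table state at the first ANALYZE: class1.txt + class2.txt
first = table_stats(["class1.txt", "class2.txt"])
# Table state at the second ANALYZE: class1.txt + class3.txt
second = table_stats(["class1.txt", "class3.txt"])

print(first)   # (56, 4)
print(second)  # (55, 4)
```

Under these assumptions the sizes match the 56 and 55 bytes reported by "desc extended student", and the row count is 4 in both states, which is the value missing from the second result.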



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

