It would be useful to have a script that could be scheduled as a low-priority
background job to update stats at least where none are available, plus a report
in the Hive GUI on per-table stats.
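
A very rough sketch of such a job, assuming a HiveServer2 JDBC URL and relying
on DESCRIBE FORMATTED showing a numRows entry once basic stats exist (the URL
and database name below are placeholders):

#!/bin/bash
# Rough sketch: walk the tables in a database and gather basic stats only where
# none exist yet (DESCRIBE FORMATTED shows no numRows entry when a table has
# never been analyzed). The JDBC URL and database name are placeholders.
URL="jdbc:hive2://hiveserver:10000/default"
DB="default"

for t in $(beeline -u "$URL" --silent=true --showHeader=false --outputformat=tsv2 \
           -e "SHOW TABLES IN ${DB};"); do
  if ! beeline -u "$URL" --silent=true -e "DESCRIBE FORMATTED ${DB}.${t};" | grep -q numRows; then
    # nice keeps the client-side work low priority; schedule the whole loop from cron
    nice -n 19 beeline -u "$URL" -e "ANALYZE TABLE ${DB}.${t} COMPUTE STATISTICS;"
  fi
done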


I recently encountered a Tez out-of-memory issue caused by the lack of
automatically updated stats.
Cheers, Andrew

On Mon, Mar 28, 2016 at 2:27 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
Hi Alan,
Thanks for the clarification. I gather you are referring to the following notes 
in Jira
"Given the work that's going on in HIVE-11160 
[https://issues.apache.org/jira/browse/HIVE-11160] and HIVE-12763 
[https://issues.apache.org/jira/browse/HIVE-12763] I don't think it makes sense 
to continue down this path. These JIRAs will lay the groundwork for 
auto-gathering stats on data as it is inserted rather than having a background 
process do the work."
I concur; I am not a fan of automatic update statistics, although many RDBMS
vendors touted it in earlier days. The whole thing turned out to be a
hindrance, as UPDATE STATISTICS would fire in the middle of the business day
and take resources away from the workload.
Most vendors base the need to update/gather stats on the number of rows that
have changed, relying on some function, say datachange(). When the datachange()
function indicates that roughly 10% of the rows have changed, it is time for
update stats to run. Again, in my opinion this is rather arbitrary and devoid
of any scientific basis. For Hive the important operation is inserts; for
transactional tables there will be updates and deletes as well. My
understanding is that the classical approach is to report how many "row change
operations", say inserts, have been performed since the last time any kind of
analyze statistics was run.
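
Purely as an illustration of that heuristic, one could approximate it in Hive
by comparing the numRows recorded at the last ANALYZE with a fresh COUNT(*);
the JDBC URL and table name below are placeholders:

#!/bin/bash
# Illustration only: compare the numRows recorded at the last ANALYZE with a
# fresh COUNT(*) and re-analyze when the drift exceeds 10%.
# The JDBC URL and table name are placeholders.
URL="jdbc:hive2://hiveserver:10000/default"
TABLE="default.sales_orc"

old=$(beeline -u "$URL" --silent=true --showHeader=false --outputformat=tsv2 \
      -e "DESCRIBE FORMATTED ${TABLE};" | awk '$1 == "numRows" {print $2}')
new=$(beeline -u "$URL" --silent=true --showHeader=false --outputformat=tsv2 \
      -e "SELECT COUNT(*) FROM ${TABLE};")

if [ -n "$old" ] && [ "$old" -gt 0 ]; then
  drift=$(( (new - old) * 100 / old ))
  # take the absolute value, so bulk deletes count as well as inserts
  if [ "${drift#-}" -ge 10 ]; then
    beeline -u "$URL" -e "ANALYZE TABLE ${TABLE} COMPUTE STATISTICS;"
  fi
fi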

This came to my mind as I was using Spark to load CSV files and create and
populate Hive ORC tables. The problem I have is that running analyze statistics
through Spark fails. This is not a show stopper, as the load shell script
invokes beeline to log in to Hive and analyze statistics on the newly created
table. Although some proponents might argue for saving the data from Spark as
Parquet files, when one has millions and millions of rows the stats matter, and
that is where ORC adds its value.
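
For the record, that beeline step amounts to something like the following
sketch (the JDBC URL, user and table name are placeholders):

#!/bin/bash
# Sketch of the post-load step: Spark has already written the ORC table, so log
# in with beeline and gather table and column statistics there.
# The JDBC URL, user and table name are placeholders.
URL="jdbc:hive2://hiveserver:10000/default"
TABLE="default.sales_orc"

beeline -u "$URL" -n hduser -e "
  ANALYZE TABLE ${TABLE} COMPUTE STATISTICS;
  ANALYZE TABLE ${TABLE} COMPUTE STATISTICS FOR COLUMNS;
"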

Cheers

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 28 March 2016 at 18:43, Alan Gates <alanfga...@gmail.com> wrote:
I resolved that as Won’t Fix. See the last comment on the JIRA for my rationale.

Alan.

> On Mar 28, 2016, at 03:53, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Thanks. This does not seem to be implemented, although the Jira says resolved.
> It also mentions the timestamp of the last update stats, which I do not see yet.
>
> Regards,
>
> Mich
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 28 March 2016 at 06:19, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>
> > This might be a bit far-fetched, but is there any plan for background
> > ANALYZE STATISTICS to be performed on ORC tables?
>
>
> https://issues.apache.org/jira/browse/HIVE-12669
>
> Cheers,
> Gopal
>
>
>
