Re: Automatic Update statistics on ORC tables in Hive

2016-03-28 Thread Andrew Sears
It would be useful to have a script that could be scheduled as part of a low 
priority background job, to update stats at least where none are available, and 
a report in the Hive GUI on stats per table.


Encountered a Tez oo memory issue due to the lack of auto updated stats 
recently.
Cheers, Andrew

On Mon, Mar 28, 2016 at 2:27 PM, Mich Talebzadeh < mich.talebza...@gmail.com 
[mich.talebza...@gmail.com] > wrote:
Hi Alan,
Thanks for the clarification. I gather you are referring to the following notes 
in Jira
"Given the work that's going on in HIVE-11160 
[https://issues.apache.org/jira/browse/HIVE-11160] and HIVE-12763 
[https://issues.apache.org/jira/browse/HIVE-12763] I don't think it makes sense 
to continue down this path. These JIRAs will lay the groundwork for 
auto-gathering stats on data as it is inserted rather than having a background 
process do the work."
I concur that I am not a fan of automatic update statistics although many RDBMS 
vendor were touting about it in earlier days. The whole thing turned up to be a 
hindrance as UPDATE STATISTICS was being fired in the middle of the business 
day thus adding issues to the workload by taking resources away.
Most vendors base the need for update/gathering stats on the number of rows 
being changed by relying on some Function say datachange(). When datachange() 
function indicates changes by 10% so it is time for update stats to run. Again 
in my opinion rather arbitrary and void of any scientific base. For Hive the 
important one is Inserts. For transactional tables one will have Updates and 
Deletes as well. My understanding is that the classical approach is to report 
on how many "row change operations" say Inserts have been performed since the 
last time any kind of analyze statistics was run.

This came to my mind as I was using Spark to load CSV files and create and 
insert in Hive ORC tables. The problem I have is that Analyse statistics 
through Spark fails. This is not a show stopper as the load shell script 
invokes beeline to log in to Hive and Analyze statistics on the newly created 
table. Although some proponents might argue about saving data in Spark as 
Parquet file, when one has millions and millions of rows then stats matter and 
then ORC adds its value.




Cheers


Dr Mich Talebzadeh



LinkedIn 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 
[https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw]



http://talebzadehmich.wordpress.com [http://talebzadehmich.wordpress.com/]




On 28 March 2016 at 18:43, Alan Gates < alanfga...@gmail.com 
[alanfga...@gmail.com] > wrote:
I resolved that as Won’t Fix. See the last comment on the JIRA for my rationale.

Alan.

> On Mar 28, 2016, at 03:53, Mich Talebzadeh < mich.talebza...@gmail.com 
> [mich.talebza...@gmail.com] > wrote:
>
> Thanks. This does not seem to be implemented although the Jira says resolved. 
> It also mentions the timestamp of the last update stats. I do not see it yet.
>
> Regards,
>
> Mich
>
> Dr Mich Talebzadeh
>
> LinkedIn 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> [https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw]
>
> http://talebzadehmich.wordpress.com [http://talebzadehmich.wordpress.com]
>
>
> On 28 March 2016 at 06:19, Gopal Vijayaraghavan < gop...@apache.org 
> [gop...@apache.org] > wrote:
>
> > This might be a bit far fetched but is there any plan for background
> >ANALYZE STATISTICS to be performed on ORC tables
>
>
> https://issues.apache.org/jira/browse/HIVE-12669 
> [https://issues.apache.org/jira/browse/HIVE-12669]
>
> Cheers,
> Gopal
>
>
>

Re: Automatic Update statistics on ORC tables in Hive

2016-03-28 Thread Mich Talebzadeh
Hi Alan,

Thanks for the clarification. I gather you are referring to the following
notes in Jira

"Given the work that's going on in HIVE-11160
 and HIVE-12763
 I don't think it makes
sense to continue down this path. These JIRAs will lay the groundwork for
auto-gathering stats on data as it is inserted rather than having a
background process do the work."

I concur that I am not a fan of automatic update statistics although many
RDBMS vendor were touting about it in earlier days. The whole thing turned
up to be a hindrance as UPDATE STATISTICS was being fired in the middle of
the business day thus adding issues to the workload by taking resources
away.

Most vendors base the need for update/gathering stats on the number of
 rows being changed by relying on some Function say datachange(). When
datachange()  function indicates changes by 10% so it is time for update
stats to run. Again in my opinion rather arbitrary and void of any
scientific base. For Hive the important one is Inserts. For transactional
tables one will have Updates and Deletes as well. My understanding is that
the classical approach is to report on how many "row change operations" say
Inserts have been performed since the last time any kind of analyze
statistics was run.

This came to my mind as I was using Spark to load CSV files and create and
insert in Hive ORC tables. The problem I have is that Analyse statistics
through Spark fails. This is not a show stopper as the load shell
script invokes beeline to log in to Hive and Analyze statistics on the
newly created table. Although some proponents might argue about saving data
in Spark as Parquet file, when one has millions and millions of rows then
stats matter and then ORC adds its value.


Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 28 March 2016 at 18:43, Alan Gates  wrote:

> I resolved that as Won’t Fix.  See the last comment on the JIRA for my
> rationale.
>
> Alan.
>
> > On Mar 28, 2016, at 03:53, Mich Talebzadeh 
> wrote:
> >
> > Thanks. This does not seem to be implemented although the Jira says
> resolved. It also mentions the timestamp of the last update stats. I do not
> see it yet.
> >
> > Regards,
> >
> > Mich
> >
> > Dr Mich Talebzadeh
> >
> > LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > On 28 March 2016 at 06:19, Gopal Vijayaraghavan 
> wrote:
> >
> > > This might be a bit far fetched but is there any plan for background
> > >ANALYZE STATISTICS to be performed  on ORC tables
> >
> >
> > https://issues.apache.org/jira/browse/HIVE-12669
> >
> > Cheers,
> > Gopal
> >
> >
> >
>
>


Re: Automatic Update statistics on ORC tables in Hive

2016-03-28 Thread Alan Gates
I resolved that as Won’t Fix.  See the last comment on the JIRA for my 
rationale.

Alan.

> On Mar 28, 2016, at 03:53, Mich Talebzadeh  wrote:
> 
> Thanks. This does not seem to be implemented although the Jira says resolved. 
> It also mentions the timestamp of the last update stats. I do not see it yet.
> 
> Regards,
> 
> Mich
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
> On 28 March 2016 at 06:19, Gopal Vijayaraghavan  wrote:
> 
> > This might be a bit far fetched but is there any plan for background
> >ANALYZE STATISTICS to be performed  on ORC tables
> 
> 
> https://issues.apache.org/jira/browse/HIVE-12669
> 
> Cheers,
> Gopal
> 
> 
> 



Re: Automatic Update statistics on ORC tables in Hive

2016-03-28 Thread Mich Talebzadeh
Thanks. This does not seem to be implemented although the Jira says
resolved. It also mentions the timestamp of the last update stats. I do not
see it yet.

Regards,

Mich

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 28 March 2016 at 06:19, Gopal Vijayaraghavan  wrote:

>
> > This might be a bit far fetched but is there any plan for background
> >ANALYZE STATISTICS to be performed  on ORC tables
>
>
> https://issues.apache.org/jira/browse/HIVE-12669
>
> Cheers,
> Gopal
>
>
>


Re: Automatic Update statistics on ORC tables in Hive

2016-03-27 Thread Gopal Vijayaraghavan

> This might be a bit far fetched but is there any plan for background
>ANALYZE STATISTICS to be performed  on ORC tables


https://issues.apache.org/jira/browse/HIVE-12669

Cheers,
Gopal




Automatic Update statistics on ORC tables in Hive

2016-03-27 Thread Mich Talebzadeh
This might be a bit far fetched but is there any plan for background
ANALYZE STATISTICS to be performed  on ORC tables for example when it does
compaction etc.

Also I notice that "desc formatted  does not show details of
statistics run time. Could that be added in future releases as I think it
will be useful cause a frequent question when the query is runnig slow is
to ask whether the stats are up-to-date on the underkying table(s).

hive> desc formatted nw_10124772;
OK
# col_name  data_type   comment
transactiondate date
transactiontype string
description string
value   double
balance double
accountname string
accountnumber   int
# Detailed Table Information
Database:   accounts
Owner:  hduser
CreateTime: Sun Mar 27 17:29:53 BST 2016
LastAccessTime: UNKNOWN
Retention:  0
Location:
hdfs://rhes564:9000/user/hive/warehouse/accounts.db/nw_10124772
Table Type: MANAGED_TABLE
Table Parameters:

*   COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}*
comment from csv file from excel sheet
numFiles6
numRows 1447
orc.compressZLIB
rawDataSize 0
totalSize   36537
transient_lastDdlTime   1459121295


Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com