Re: Improve performance of Analyze table compute statistics

2018-09-13 Thread Prabhakar Reddy
Thank you Gopal for this Information.Currently I am using EMR to run this
query.As this operation is CPU intensive could you please let me know if
increasing the RAM/cores can speed up this process?

On Tue, Aug 28, 2018 at 8:56 PM Gopal Vijayaraghavan 
wrote:

>
> > Will it be referring to orc metadata or it will be loading the whole
> file and then counting the rows.
>
> Depends on the partial-scan setting or if it is computing full column
> stats (the full column stats does an nDV, which reads all rows).
>
> hive> analyze table compute statistics ... partialscan;
>
> https://issues.apache.org/jira/browse/HIVE-4177
>
> AFAIK, this got removed in Hive 3.x (because we really want autogather
> column stats on insert, not just basic stats from this).
>
> > Is there any place to cache this information so that I don't need to
> scan all the files every time.
>
> https://cwiki.apache.org/confluence/display/Hive/LLAP
>
> Cheers,
> Gopal
>
>
>


Re: Improve performance of Analyze table compute statistics

2018-08-28 Thread Gopal Vijayaraghavan


> Will it be referring to orc metadata or it will be loading the whole file and 
> then counting the rows.

Depends on the partial-scan setting or if it is computing full column stats 
(the full column stats does an nDV, which reads all rows).

hive> analyze table compute statistics ... partialscan;

https://issues.apache.org/jira/browse/HIVE-4177

AFAIK, this got removed in Hive 3.x (because we really want autogather column 
stats on insert, not just basic stats from this).

> Is there any place to cache this information so that I don't need to scan all 
> the files every time.

https://cwiki.apache.org/confluence/display/Hive/LLAP

Cheers,
Gopal




Re: Improve performance of Analyze table compute statistics

2018-08-28 Thread Prabhakar Reddy
Yeah partition level statistics are good.I see hive orc reader is reading
rows from s3 for each file in the hive server log.Will it be referring to
orc metadata or it will be loading the whole file and then counting the
rows.Is there any place to cache this information so that I don't need to
scan all the files every time.

On Sun, Aug 26, 2018, 4:24 PM Jörn Franke  wrote:

> You can partition it and only compute statistics for new partitions...
>
> On 26. Aug 2018, at 12:43, Prabhakar Reddy  wrote:
>
> Hello,
>
> Are there any properties that I can set to improve the performance of
> Analyze table compute statistics statement.My data sits in s3 and I see
> it's taking one second per file to read the schema of each file from s3.
>
> 2018-08-24T03:25:57,525 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC
> rows from s3://file_1
> 2018-08-24T03:25:57,526 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:(187)) -
> Reader schema not provided -- using file schema
>
> 2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC
> rows from s3://file_2
> 2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:(187)) -
> Reader schema not provided -- using file schema
>
> It takes around 80 seconds for 76 files with total size of 23 GB.
>
>
> 2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: exec.Task (SessionState.java:printInfo()) - Table
> dept_data_services.lc_credit_2018_08_20_temp stats: [numFiles=76,
> numRows=101341845, totalSize=26500166568, rawDataSize=294491898741]
> 2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: ql.Driver (Driver.java:execute(2050)) - Completed executing
> command(queryId=lcapp_20180824032545_aba21e71-ea4b-4214-8793-705a5e0367f0);
> Time taken: 81.169 seconds
> 2018-08-24T03:27:07,674 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: ql.Driver (SessionState.java:printInfo()) - OK
> 2018-08-24T03:27:07,681 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: CliDriver (SessionState.java:printInfo()) - Time taken:
> 81.992 seconds
>
> If I run the same command with few columns then the query runs 60%
> faster.Is there any property that I can modify to reduce the time taken for
> this read?
>
> Regards
> Prabhakar Reddy
>
>
>
>
>


Re: Improve performance of Analyze table compute statistics

2018-08-26 Thread Jörn Franke
You can partition it and only compute statistics for new partitions...

> On 26. Aug 2018, at 12:43, Prabhakar Reddy  wrote:
> 
> Hello,
> 
> Are there any properties that I can set to improve the performance of Analyze 
> table compute statistics statement.My data sits in s3 and I see it's taking 
> one second per file to read the schema of each file from s3.
> 
> 2018-08-24T03:25:57,525 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC 
> rows from s3://file_1
> 2018-08-24T03:25:57,526 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:(187)) - Reader 
> schema not provided -- using file schema 
> 
> 2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC 
> rows from s3://file_2
> 2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:(187)) - Reader 
> schema not provided -- using file schema
> 
> It takes around 80 seconds for 76 files with total size of 23 GB.
> 
> 
> 2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: exec.Task (SessionState.java:printInfo()) - Table 
> dept_data_services.lc_credit_2018_08_20_temp stats: [numFiles=76, 
> numRows=101341845, totalSize=26500166568, rawDataSize=294491898741]
> 2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: ql.Driver (Driver.java:execute(2050)) - Completed executing 
> command(queryId=lcapp_20180824032545_aba21e71-ea4b-4214-8793-705a5e0367f0); 
> Time taken: 81.169 seconds
> 2018-08-24T03:27:07,674 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: ql.Driver (SessionState.java:printInfo()) - OK
> 2018-08-24T03:27:07,681 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: CliDriver (SessionState.java:printInfo()) - Time taken: 81.992 
> seconds
> 
> If I run the same command with few columns then the query runs 60% faster.Is 
> there any property that I can modify to reduce the time taken for this read?
> 
> Regards
> Prabhakar Reddy
> 
> 
>