Hello everybody,

Did anything catch your eye in the two profiles attached to my last email?
We're still blocked and can't even run COMPUTE STATS once on our tables :-/
I am really curious why it OOMs instead of spilling to disk.
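In case it makes reproducing easier, here is the whole sequence condensed
into one impala-shell session. Note that inc_stats_size_limit_bytes is an
impalad startup flag rather than a query option, so on our cluster it went
into the daemon's command-line arguments and needed a restart:

    -- impalad startup flag (needs a daemon restart), raising the
    -- incremental stats metadata cap from the default 200MB to 1GB:
    --   --inc_stats_size_limit_bytes=1073741824

    -- Per-session query option in impala-shell: 32GB memory limit.
    SET MEM_LIMIT=34359738368;

    -- Attempt 1: fails at analysis once the incremental stats size
    -- estimate exceeds the cap (200MB by default, 1GB after the flag).
    COMPUTE INCREMENTAL STATS my_big_partitioned_parquet_table;

    -- Attempt 2: full (non-incremental) stats; this is what OOMs in
    -- the count/NDV child queries despite the 32GB process limit.
    COMPUTE STATS my_big_partitioned_parquet_table;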
Thanks,
Thoralf

On Fri, 19 Jan 2018 at 08:24 Thoralf Gutierrez <thoralfgutier...@gmail.com> wrote:

> Hey Mostafa,
>
> Here are two query profiles on two different tables where COMPUTE STATS
> OOMed at different steps. The first one OOMed on the first stats query
> (counts) and the second one OOMed on the second stats query (NDV, MAX,
> etc.).
>
> Don't be fooled by the corrupt Parquet error; you can still see the
> exceeded memory limit further down. I believe I am hitting
> https://issues.apache.org/jira/browse/IMPALA-5197
>
> Thanks a lot in advance!
> Thoralf
>
> On Thu, 18 Jan 2018 at 22:30 Alexander Behm <alex.b...@cloudera.com> wrote:
>
>> The documentation has a good overview of the limitations and caveats:
>>
>> https://impala.apache.org/docs/build/html/topics/impala_perf_stats.html#perf_stats_incremental
>>
>> On Thu, Jan 18, 2018 at 7:29 PM, Fawze Abujaber <fawz...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I didn't see any limitations mentioned in the documentation for
>>> incremental compute stats.
>>>
>>> Is it a size limit or a memory limit (200 MB)?
>>>
>>> Why should compute stats succeed and incremental compute stats not?
>>>
>>> I'm upgrading my cluster on Sunday, and incremental compute stats was
>>> one of the incentives :(
>>>
>>> On Fri, 19 Jan 2018 at 4:13 Mostafa Mokhtar <mmokh...@cloudera.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Do you mind sharing the query profile for the query that failed with
>>>> OOM? There should be some clues there as to why the OOM is happening.
>>>>
>>>> Thanks,
>>>> Mostafa
>>>>
>>>> On Thu, Jan 18, 2018 at 5:54 PM, Thoralf Gutierrez <
>>>> thoralfgutier...@gmail.com> wrote:
>>>>
>>>>> Hello everybody!
>>>>>
>>>>> (I am using Impala 2.8.0, out of Cloudera Express 5.11.1.)
>>>>>
>>>>> I now understand that we are _highly_ recommended to compute stats
>>>>> for our tables, so I have decided to make sure we do.
>>>>>
>>>>> On my quest to do so, I started with a first `COMPUTE INCREMENTAL
>>>>> STATS my_big_partitioned_parquet_table` and ran into:
>>>>>
>>>>> > HiveServer2Error: AnalysisException: Incremental stats size
>>>>> > estimate exceeds 200.00MB. Please try COMPUTE STATS instead.
>>>>>
>>>>> I found out that this limit can be increased, so I set
>>>>> inc_stats_size_limit_bytes to 1073741824 (1 GB), only to hit:
>>>>>
>>>>> > HiveServer2Error: AnalysisException: Incremental stats size
>>>>> > estimate exceeds 1.00GB. Please try COMPUTE STATS instead.
>>>>>
>>>>> So I ended up trying to COMPUTE STATS for the whole table instead of
>>>>> incrementally, but I still hit memory limits when computing counts,
>>>>> with my mem_limit at 34359738368 (32 GB):
>>>>>
>>>>> > Process: memory limit exceeded. Limit=32.00 GB Total=48.87 GB
>>>>> > Peak=51.97 GB
>>>>>
>>>>> 1. Am I correct to assume that even if I did not have enough memory,
>>>>> the query should spill to disk and just be slower, instead of OOMing?
>>>>> 2. Any other recommendations on how I could go about computing some
>>>>> stats on my big partitioned parquet table?
>>>>>
>>>>> Thanks a lot!
>>>>> Thoralf
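P.S. Regarding my question 2 in the quoted thread above: one stopgap we are
considering, based on the "setting statistics manually" part of the docs
page Alexander linked. This is only a sketch: the row count below is a
placeholder to be filled in from a manual COUNT(*), and it only gives the
planner table cardinality, not per-column NDVs:

    -- Stopgap sketch: compute the row count ourselves, then write it
    -- into the table properties so the planner at least has cardinality.
    SELECT COUNT(*) FROM my_big_partitioned_parquet_table;

    -- '1234567890' is a placeholder for the count returned above.
    ALTER TABLE my_big_partitioned_parquet_table
    SET TBLPROPERTIES (
      'numRows' = '1234567890',
      'STATS_GENERATED_VIA_STATS_TASK' = 'true'
    );

    -- Per the same docs page, this also works per partition if
    -- finer-grained counts are needed:
    -- ALTER TABLE my_big_partitioned_parquet_table
    --   PARTITION (part_col = '...') SET TBLPROPERTIES ('numRows' = '...');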