At first glance I believe you have too many partitions, which will cause you
problems down the line. The recommendation is to keep this number in the 10K
range; more partitions mean more load on the HMS and a larger metadata
footprint.
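
For reference, a quick way to see how many partitions a table currently has
(the table name below is just a placeholder):

    -- lists every partition with its row count, file count, and size
    SHOW PARTITIONS my_big_partitioned_parquet_table;

If the count is far above the 10K range, a coarser partitioning scheme
(e.g. by month instead of by day) is usually the first thing to look at.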

If the goal is to speed up read queries, I recommend you check this blog:
https://blog.cloudera.com/blog/2017/12/faster-performance-for-selective-queries/

Were you able to identify the corrupt file?

    Errors: Corrupt Parquet file 'xxxxxxxxxx': negative row count
-2081643451 in file metadata
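
In case it helps narrow it down, SHOW FILES lists every data file Impala
sees for the table, so the file name from the error can be matched against
it (table name and partition spec below are placeholders):

    SHOW FILES IN my_big_partitioned_parquet_table;
    -- or restricted to a single partition:
    SHOW FILES IN my_big_partitioned_parquet_table PARTITION (year=2018, month=1);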

On Tue, Jan 23, 2018 at 9:33 AM, Tim Armstrong <tarmstr...@cloudera.com>
wrote:

> It looks like there's a bunch of untracked memory (the profile shows that
> the tracked memory consumption of the query operators is actually quite
> low). Given the number of files, I suspect that there's some sort of
> metadata or control structure that is accumulating throughout query
> execution. I'd be interested to understand what is happening.
>
> Just to confirm, does the memory consumption go away after the query
> completes? You can look at the host:25000/memz debug webpage and see the
> "Process" memory consumption.
>
> On Tue, Jan 23, 2018 at 9:24 AM, Thoralf Gutierrez <
> thoralfgutier...@gmail.com> wrote:
>
>> Hello everybody,
>>
>> Did anything catch your eye in the two profiles attached to my last
>> email? We're still blocked and can't even run COMPUTE STATS once for our
>> tables :-/ I'm really curious why it OOMs instead of spilling to disk.
>>
>> Thanks,
>> Thoralf
>>
>> On Fri, 19 Jan 2018 at 08:24 Thoralf Gutierrez <
>> thoralfgutier...@gmail.com> wrote:
>>
>>> Hey Mostafa,
>>>
>>> Here are two query profiles, for two different tables, where COMPUTE STATS
>>> OOMed at different steps. The first one OOMed on the first stats query
>>> (counts) and the second one on the second stats query (NDV, MAX, etc.).
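>>>
>>> If I read the profiles right, these are the two child queries that
>>> COMPUTE STATS issues under the hood, roughly of the form (column names
>>> are placeholders):
>>>
>>>     SELECT COUNT(*) FROM my_big_partitioned_parquet_table;
>>>     SELECT NDV(col_a), MAX(LENGTH(col_b)), AVG(LENGTH(col_b))
>>>       FROM my_big_partitioned_parquet_table;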
>>>
>>> Don't be fooled by the corrupt Parquet error; you can still see the
>>> memory limit exceeded error further down. I believe I am hitting
>>> https://issues.apache.org/jira/browse/IMPALA-5197
>>>
>>> Thanks a lot in advance!
>>> Thoralf
>>>
>>> On Thu, 18 Jan 2018 at 22:30 Alexander Behm <alex.b...@cloudera.com>
>>> wrote:
>>>
>>>> The documentation has a good overview of the limitations and caveats:
>>>> https://impala.apache.org/docs/build/html/topics/impala_perf_stats.html#perf_stats_incremental
>>>>
>>>> On Thu, Jan 18, 2018 at 7:29 PM, Fawze Abujaber <fawz...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I didn't see any limitations mentioned in the documentation for
>>>>> incremental compute stats.
>>>>>
>>>>> Is it a size limit or a memory limit (200 MB)?
>>>>>
>>>>> Why should COMPUTE STATS succeed and COMPUTE INCREMENTAL STATS not?
>>>>>
>>>>> I'm upgrading my cluster on Sunday, as incremental compute stats was
>>>>> one of the incentives :(
>>>>>
>>>>> On Fri, 19 Jan 2018 at 4:13 Mostafa Mokhtar <mmokh...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Do you mind sharing the query profile for the query that failed with
>>>>>> OOM? There should be some clues in there as to why the OOM is happening.
>>>>>>
>>>>>> Thanks
>>>>>> Mostafa
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 18, 2018 at 5:54 PM, Thoralf Gutierrez <
>>>>>> thoralfgutier...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello everybody!
>>>>>>>
>>>>>>> (I am using Impala 2.8.0, out of Cloudera Express 5.11.1)
>>>>>>>
>>>>>>> I now understand that computing stats for our tables is _highly_
>>>>>>> recommended, so I have decided to make sure we do.
>>>>>>>
>>>>>>> On my quest to do so, I started with a first `COMPUTE INCREMENTAL
>>>>>>> STATS my_big_partitioned_parquet_table` and ran into:
>>>>>>>
>>>>>>> > HiveServer2Error: AnalysisException: Incremental stats size
>>>>>>> estimate exceeds 200.00MB. Please try COMPUTE STATS instead.
>>>>>>>
>>>>>>> I found out that this limit can be increased, so I set
>>>>>>> inc_stats_size_limit_bytes to 1073741824 (1 GB), but then got:
>>>>>>>
>>>>>>> > HiveServer2Error: AnalysisException: Incremental stats size
>>>>>>> estimate exceeds 1.00GB. Please try COMPUTE STATS instead.
>>>>>>>
>>>>>>> So I ended up trying to COMPUTE STATS for the whole table instead of
>>>>>>> incrementally, but I still hit memory limits when computing counts,
>>>>>>> with my mem_limit at 34359738368 (32 GB):
>>>>>>>
>>>>>>> > Process: memory limit exceeded. Limit=32.00 GB Total=48.87 GB
>>>>>>> Peak=51.97 GB
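>>>>>>>
>>>>>>> For reference, this is roughly what I'm running (the 32 GB value can
>>>>>>> be set either as the impalad process memory limit or as a per-query
>>>>>>> option from impala-shell, as sketched below):
>>>>>>>
>>>>>>>     SET MEM_LIMIT=34359738368;  -- 32 GB, per-query option
>>>>>>>     COMPUTE STATS my_big_partitioned_parquet_table;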
>>>>>>>
>>>>>>> 1. Am I correct to assume that even if I did not have enough memory,
>>>>>>> the query should spill to disk and just be slower instead of OOMing?
>>>>>>> 2. Any other recommendations on how I could go about computing some
>>>>>>> stats on my big partitioned Parquet table?
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>> Thoralf
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>
