Hi Alex,

Thanks for the feedback.

I will try the new version and the new way of computing stats when possible. In the meantime we are thinking of computing the stats manually. If we compute stats per partition and for the whole table, will we encounter the same memory limit? Should we compute stats for the whole table and disregard partition stats?

Best regards,
Miguel

On Wed, Mar 7, 2018 at 6:14 PM, Alexander Behm <[email protected]> wrote:

> Using incremental stats in your scenario is extremely dangerous and I
> highly recommend against it. That limitation was put in place to guard
> clusters against service downtime due to serializing huge tables and
> hitting JVM limits like the 2GB max array size.
>
> Even if the catalogd and impalads stay up, having such huge metadata will
> negatively impact the health and performance of your cluster.
>
> There's a blurb about this in the Impala docs; CTRL+F for "For a table
> with a huge number of partitions":
> https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html
>
> Impala caches and disseminates metadata at the table granularity. If
> anything in a table changes, the whole updated table is sent out to all
> coordinating impalads. Each such impalad caches the entire metadata of a
> table, including the incremental stats for all columns and all partitions.
>
> The below is copied from a different discussion thread discussing
> alternatives to incremental stats:
>
> Btw, you should also know that the following improvements in the upcoming
> 2.12 release might make "compute stats" more palatable on your huge tables.
> We'd love your feedback on COMPUTE STATS with TABLESAMPLE, in particular.
> COMPUTE STATS with TABLESAMPLE
> https://issues.apache.org/jira/browse/IMPALA-5310
>
> COMPUTE STATS on a subset of columns
> https://issues.apache.org/jira/browse/IMPALA-3562
>
> The following improvement should allow you to COMPUTE STATS less
> frequently by extrapolating the row count of partitions that were added or
> modified since the last COMPUTE STATS.
> https://issues.apache.org/jira/browse/IMPALA-2373
> https://issues.apache.org/jira/browse/IMPALA-6228
>
> In general, it would be great to get your feedback/ideas on how to make
> computing stats more practical for you.
>
> On Wed, Mar 7, 2018 at 8:48 AM, Miguel Figueiredo <[email protected]> wrote:
>
>> Hi,
>>
>> I have bumped into the 200MB limit when calculating incremental stats
>> (https://issues.apache.org/jira/browse/IMPALA-3552).
>>
>> I don't understand which data catalogd sends to the impalad each time the
>> incremental stats are calculated. Does it send only the new information
>> calculated for new partitions, or all the statistics data?
>>
>> In my case I have 387 tables with 2550 columns. I am creating a new
>> partition for each table every hour and calculating incremental stats for
>> these new partitions. If catalogd is sending serialized data only for the
>> new partitions and columns, it shouldn't amount to 200MB.
>>
>> I would appreciate it if someone could help me understand this concept or
>> point me to some documentation.
>>
>> Best regards,
>> Miguel
>>
>> --
>> Miguel Figueiredo
>> Software Developer
>>
>> "I'm a pretty lazy person and am prepared to work quite hard in order to
>> avoid work."
>> -- Martin Fowler

--
Miguel Figueiredo
Software Developer

"I'm a pretty lazy person and am prepared to work quite hard in order to
avoid work."
-- Martin Fowler
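
---

[Editorial sketch, not part of the original thread] The alternatives discussed above map to the following Impala SQL statements. The table name `sales`, the partition spec, and the 10 percent sample rate are illustrative placeholders; TABLESAMPLE support for COMPUTE STATS arrives in Impala 2.12 per IMPALA-5310.

```sql
-- Per-partition incremental stats: the approach that serializes
-- per-partition, per-column stats into table metadata and can hit
-- the 200MB / 2GB serialization limits on wide, heavily
-- partitioned tables.
COMPUTE INCREMENTAL STATS sales PARTITION (year=2018, month=3);

-- Full-table stats: discards the per-partition incremental stats
-- overhead and keeps metadata small.
COMPUTE STATS sales;

-- IMPALA-5310: sample a percentage of the data to make full-table
-- stats cheaper on huge tables (Impala 2.12+).
COMPUTE STATS sales TABLESAMPLE SYSTEM(10);

-- IMPALA-3562: compute stats for a subset of columns only
-- (hypothetical column names).
COMPUTE STATS sales (col1, col2);
```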
