Re: Incremental stats and catalogd data serialization

Alexander Behm Wed, 07 Mar 2018 10:14:54 -0800

Using incremental stats in your scenario is extremely dangerous and I
strongly recommend against it. The 200MB limit was put in place to guard
clusters against service downtime caused by serializing huge tables and
hitting JVM limits such as the 2GB maximum array size.

Even if the catalogd and impalads stay up, having such huge metadata will
negatively impact the health and performance of your cluster.

The Impala docs have a blurb about this. Search the page below for "For a
table with a huge number of partitions":
https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html

Impala caches and disseminates metadata at the table granularity. If
anything in a table changes, the whole updated table is sent out to all
coordinating impalads. Each such impalad caches the entire metadata of a
table including the incremental stats for all columns and all partitions.
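To make the consequence concrete, here is a rough back-of-the-envelope
sketch in Python. It assumes roughly 400 bytes of incremental stats per
column per partition (the approximate figure the Impala docs give; real
sizes vary) and treats the 200MB limit as 200 * 1024 * 1024 bytes. The
function names and the single-table column count are hypothetical and only
for illustration. The point is that the estimate multiplies by all
partitions of the table, not just the newly added ones.

```python
# Rough estimate of the incremental-stats metadata held for one table.
# ASSUMPTION: ~400 bytes per column per partition (approximate figure from
# the Impala COMPUTE STATS docs); actual sizes vary with the data.
BYTES_PER_COLUMN_PER_PARTITION = 400
# ASSUMPTION: the 200MB limit from IMPALA-3552 expressed in binary megabytes.
LIMIT_BYTES = 200 * 1024 * 1024


def incremental_stats_bytes(num_partitions: int, num_columns: int) -> int:
    """Approximate size of a table's incremental stats, in bytes.

    Counts every partition, because the whole table's metadata is sent
    whenever anything in the table changes.
    """
    return num_partitions * num_columns * BYTES_PER_COLUMN_PER_PARTITION


def partitions_until_limit(num_columns: int, limit_bytes: int = LIMIT_BYTES) -> int:
    """Largest partition count whose estimated stats fit within limit_bytes."""
    return limit_bytes // (num_columns * BYTES_PER_COLUMN_PER_PARTITION)


if __name__ == "__main__":
    columns = 2550  # hypothetical: one table holding all 2550 columns
    max_parts = partitions_until_limit(columns)
    print(f"{columns} columns: limit reached after ~{max_parts} partitions "
          f"(~{max_parts / 24:.1f} days of hourly partitions)")
```

Under these assumptions, a wide table with hourly partitions reaches the
limit within days, even though each individual new partition adds little.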

The following is copied from a different thread about alternatives to
incremental stats:

Btw, you should also know that the following improvements in the upcoming
2.12 release might make "compute stats" more palatable on your huge tables.
We'd love your feedback on COMPUTE STATS with TABLESAMPLE, in particular.

COMPUTE STATS with TABLESAMPLE
https://issues.apache.org/jira/browse/IMPALA-5310

COMPUTE STATS on a subset of columns
https://issues.apache.org/jira/browse/IMPALA-3562

The following improvement should allow you to COMPUTE STATS less frequently
by extrapolating the row count of partitions that were added or modified
since the last COMPUTE STATS.
https://issues.apache.org/jira/browse/IMPALA-2373
https://issues.apache.org/jira/browse/IMPALA-6228
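As an illustration of the first two improvements, here is a small Python
sketch that builds the corresponding statements. The statement shape
(optional column list, then TABLESAMPLE SYSTEM(percent)) follows my
understanding of the syntax from those JIRAs, so please verify it against
the docs for your release. Sampling may also require stats extrapolation
to be enabled. The helper name, table name, and column names are
hypothetical.

```python
from typing import Optional, Sequence


def compute_stats_sql(table: str,
                      columns: Optional[Sequence[str]] = None,
                      sample_percent: Optional[int] = None) -> str:
    """Build a COMPUTE STATS statement string.

    ASSUMPTION: syntax is
      COMPUTE STATS <table> [(col, ...)] [TABLESAMPLE SYSTEM(<percent>)]
    Check the Impala docs for your version before relying on it.
    """
    sql = f"COMPUTE STATS {table}"
    if columns:
        # Restrict stats to a subset of columns (IMPALA-3562).
        sql += " (" + ", ".join(columns) + ")"
    if sample_percent is not None:
        # Compute stats from a sample of the data (IMPALA-5310).
        sql += f" TABLESAMPLE SYSTEM({sample_percent})"
    return sql


if __name__ == "__main__":
    # Hypothetical table and columns, for illustration only.
    print(compute_stats_sql("db.events", ["user_id", "ts"], sample_percent=10))
```

The resulting string can be submitted through whatever client you already
use (impala-shell, JDBC, etc.).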

In general, it would be great to get your feedback and ideas on how to
make computing stats more practical for you.

On Wed, Mar 7, 2018 at 8:48 AM, Miguel Figueiredo <[email protected]>
wrote:

> Hi,
>
> I have bumped into the 200MB limit when calculating incremental stats (
> https://issues.apache.org/jira/browse/IMPALA-3552).
>
> I don't understand which data catalogd sends to the impalad each time the
> incremental stats are calculated. Does it send only the new information
> calculated for new partitions or all the statistics data?
>
> In my case I have 387 tables with 2550 columns. I am creating a new
> partition for each table every hour and calculating incremental stats for
> these new partitions. If catalogd is sending serialized data for the new
> partitions and columns, it shouldn't amount to 200MB.
>
> I would appreciate it if someone could help me understand this concept or
> point me to some documentation.
>
> Best regards,
> Miguel
>
> --
> Miguel Figueiredo
> Software Developer
>
> "I'm a pretty lazy person and am prepared to work quite hard in order to
> avoid work."
> -- Martin Fowler
>
