Using incremental stats in your scenario is extremely dangerous and I highly recommend against it. That limitation was put in place to guard clusters against service downtime due to serializing huge tables and hitting JVM limits like the 2GB max array size.
Even if the catalogd and impalads stay up, having such huge metadata will negatively impact the health and performance of your cluster. There's a blurb about this in the Impala docs; CTRL+F for "For a table with a huge number of partitions":
https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html

Impala caches and disseminates metadata at the table granularity. If anything in a table changes, the whole updated table is sent out to all coordinating impalads. Each such impalad caches the entire metadata of a table, including the incremental stats for all columns and all partitions.

The below is copied from a different thread discussing alternatives to incremental stats:

Btw, you should also know that the following improvements in the upcoming 2.12 release might make "compute stats" more palatable on your huge tables. We'd love your feedback on COMPUTE STATS with TABLESAMPLE, in particular.

COMPUTE STATS with TABLESAMPLE
https://issues.apache.org/jira/browse/IMPALA-5310

COMPUTE STATS on a subset of columns
https://issues.apache.org/jira/browse/IMPALA-3562

The following improvement should allow you to COMPUTE STATS less frequently by extrapolating the row count of partitions that were added or modified since the last COMPUTE STATS:
https://issues.apache.org/jira/browse/IMPALA-2373
https://issues.apache.org/jira/browse/IMPALA-6228

In general, it would be great to get your feedback/ideas on how to make computing stats more practical for you.

On Wed, Mar 7, 2018 at 8:48 AM, Miguel Figueiredo <ollliega...@gmail.com> wrote:
> Hi,
>
> I have bumped into the 200MB limit when calculating incremental stats
> (https://issues.apache.org/jira/browse/IMPALA-3552).
>
> I don't understand which data catalogd sends to the impalads each time the
> incremental stats are calculated. Does it send only the new information
> calculated for new partitions, or all the statistics data?
>
> In my case I have 387 tables with 2550 columns.
> I am creating a new partition for each table every hour and calculating
> incremental stats for these new partitions. If catalogd is sending
> serialized data for the new partitions and columns, it shouldn't amount
> to 200MB.
>
> I would appreciate if someone can help me understand this concept or point
> me to some documentation.
>
> Best regards,
> Miguel
>
> --
> Miguel Figueiredo
> Software Developer
>
> "I'm a pretty lazy person and am prepared to work quite hard in order to
> avoid work."
> -- Martin Fowler
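
P.S. For anyone following this thread, here is a rough sketch of what the statements from those JIRAs look like. The table and column names below are made up for illustration; the TABLESAMPLE clause and the column list both require the 2.12 features linked above:

    -- IMPALA-5310: compute stats on a sample of the data rather than a full
    -- scan; SYSTEM(10) samples roughly 10 percent of the table's data files.
    COMPUTE STATS huge_table TABLESAMPLE SYSTEM(10);

    -- IMPALA-3562: compute stats only for the columns the planner actually
    -- needs, e.g. join keys and common filter columns.
    COMPUTE STATS huge_table (join_key, event_date);

    -- For contrast, the existing per-partition incremental form whose
    -- serialized stats are subject to the 200MB limit:
    COMPUTE INCREMENTAL STATS huge_table PARTITION (event_date = '2018-03-07');

Sampled and column-restricted stats shrink both the work of computing stats and the amount of stats metadata that catalogd has to serialize and broadcast, which is exactly the pressure point behind the 200MB limit discussed above.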