Thanks, some responses inline. HTH.

On Fri, Oct 19, 2018 at 11:31 AM Fawze Abujaber <fawz...@gmail.com> wrote:

> Sorry, I missed mentioning this: I'm using Impala 2.10 with Cloudera
> Manager 5.13.0.
>
> On Fri, Oct 19, 2018 at 9:30 PM Bharath Vissapragada <
> bhara...@cloudera.com> wrote:
>
>> What version of Impala are you on?
>>
>> On Fri, Oct 19, 2018 at 11:14 AM Fawze Abujaber <fawz...@gmail.com>
>> wrote:
>>
>>> Hello Community,
>>>
>>> I have 400 Impala tables partitioned by year, month, and day, and the
>>> retention for these tables is 6 months.
>>>
>>> I would like to extend the partitioning of these tables by adding the
>>> first two digits of the account, which means the number of partitions in
>>> each table will grow by a factor of up to 100.
>>>
>>> Of course I will review these tables and make sure I do this for the
>>> large tables only.
>>>
>>> Is there a limit on the number of partitions per table? Theoretically
>>> no, but I am interested in the best practices; I know this will impact
>>> the metastore and the catalog server.
>>>
>>
There are no enforced limits or guardrails built into the product, but
generally the fewer partitions the better. There are some guidelines in this
cookbook [1]. A few things might help this scale better:

- Do not use incremental stats. In our experience they blow up memory
usage. We have made many stability improvements in this area on the master
branch (Impala 3.1.0, a future release).
- Separate coordinators and executors [2]. Look for "Controlling which
Hosts are Coordinators and Executors". We have noticed that this stabilizes
the cluster and the network quite a bit by limiting the number of nodes to
which the catalog is broadcast.

[1] https://www.slideshare.net/cloudera/the-impala-cookbook-42530186
[2]
https://impala.apache.org/docs/build3x/html/topics/impala_scalability.html
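As a minimal sketch of the role separation, these are the startup flags the
scalability doc [2] describes (in Cloudera Manager you would typically set
them through role groups and the impalad command-line safety valve rather
than invoking impalad by hand):

```shell
# Dedicated coordinator: plans queries and talks to clients, but runs no
# query fragments. Only these nodes need the full catalog broadcast.
impalad --is_coordinator=true --is_executor=false

# Dedicated executor: runs query fragments only; it does not need the
# full catalog cache.
impalad --is_coordinator=false --is_executor=true
```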


>
>>> What I'm looking for is:
>>>
>>> 1- How i can check the size for the metadata that each impala node store
>>> and the catalog server as a whole?
>>>
>>
You can check the /memz page and look at the JVM metrics near the bottom.
Metadata is stored and processed in the embedded JVM, so the JVM heap
metrics should give you a good idea.
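For example, a quick way to eyeball the heap figures (catalogd's debug
webserver defaults to port 25020; the /memz excerpt below is a made-up
sample for illustration, since the exact layout varies by version):

```shell
# Against a live cluster you would fetch the page directly, e.g.:
#   curl -s http://catalogd-host:25020/memz > /tmp/memz_sample.txt
# Here we write a hypothetical excerpt so the filtering step is
# self-contained.
cat > /tmp/memz_sample.txt <<'EOF'
Memory Usage
JVM: max heap size: 8.00 GB
JVM: current heap size: 2.10 GB
Process memory: 3.50 GB
EOF

# Pull out just the JVM heap lines.
grep -i 'heap' /tmp/memz_sample.txt
```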


>
>>> 2- Is there a linear relationship between the number of tables/partitions
>>> and the memory needed for the metastore and catalog server?
>>> In other words, if I make the change mentioned above, what changes should
>>> I make to the memory for the metastore, catalog, and Impala daemon to
>>> minimize the impact?
>>>
>>
The major contributors are incremental stats and the number of files and
blocks. Try to minimize these where possible; for example, avoid large
numbers of small files.
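To illustrate both points (the database, table, and column names below are
hypothetical; DROP INCREMENTAL STATS, COMPUTE STATS, SHOW FILES, and INSERT
OVERWRITE are standard Impala statements, but try them on a test table
first):

```shell
# Replace per-partition incremental stats with full table stats, which
# avoids the per-partition stats blobs held in the catalog/impalad heaps.
impala-shell -i coordinator-host -q \
  "DROP INCREMENTAL STATS my_db.my_table PARTITION (year=2018, month=10, day=19)"
impala-shell -i coordinator-host -q "COMPUTE STATS my_db.my_table"

# Check how many files a partition holds ...
impala-shell -i coordinator-host -q \
  "SHOW FILES IN my_db.my_table PARTITION (year=2018, month=10, day=19)"

# ... and compact a partition with many small files by rewriting it
# (select only the non-partition columns; col1/col2 are placeholders).
impala-shell -i coordinator-host -q \
  "INSERT OVERWRITE my_db.my_table PARTITION (year=2018, month=10, day=19)
   SELECT col1, col2 FROM my_db.my_table
   WHERE year = 2018 AND month = 10 AND day = 19"
```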


>
>>> 3- Is there a relationship between the DDL statements that I will run
>>> (mainly DROP PARTITION) and the memory of the metastore, the catalog, and
>>> the Impala daemons?
>>>
>>
There will be a temporary working-memory spike with DDLs. This is
essentially due to Thrift serialization and deserialization of the table
after the DDL runs. In 2.10.0, the table is the minimum unit of
serialization and is sent as the response to the DDL, so the working memory
varies with the size of the table.

We have made some improvements in this area as part of an ongoing catalog
redesign project; these ship in 3.1.0.


>
>>>
>>> 4- Is there any metric in Cloudera Manager that I can use to track the
>>> partitions and their impact on the three roles mentioned above?
>>>
>>
Maybe check with Cloudera on this. As I mentioned above, JVM heap usage is
a good indicator.


>
>>> 5- On a side note, on 200 of my Impala tables I have to run ALTER TABLE
>>> xxxx RECOVER PARTITIONS every 20 minutes, and DROP/CREATE tables twice a
>>> day.
>>> Which actions can I take to reduce the running time of these operations?
>>>
>>> I'm interested to know the actions I can take in terms of:
>>>
>>>  A) The number of Impala daemons in the cluster (adding more nodes).
>>>
>>
I don't think this matters much; it is more about the number of
coordinators vs. executors.


>  B) The number of nodes that can act as coordinators (I'm using a VIP for
>>> the coordinators, and I can drop and add nodes to this VIP).
>>>
>>
Maybe start with a lower value, say 5% of the cluster size (one or two
coordinators on a 20-node cluster), see how it works out, and then scale up?


>  C) The impala daemon memory limit.
>>>
>>  D) The catalog role memory and the Hive metastore memory.
>>>
>>
There is no easy way to calculate this. Try it out and see whether the
current limits work; if not, bump them up.


>
>>>
>>>
>>> --
>>> Take Care
>>> Fawze Abujaber
>>>
>>
>
> --
> Take Care
> Fawze Abujaber
>
