Hi Peter,

For starters, I would recommend the following overviews:

1. The Apache Impala website has a fairly comprehensive Impala guide; the
   2.10.0 version can be found at
   http://impala.apache.org/docs/build/impala-2.10.pdf
   Sizing considerations start on page 20.

2. Putting on my Cloudera hat for a moment: a good summary slide deck is
   the Impala Cookbook, created by Cloudera's Impala developers and field
   engineers, available on SlideShare:
   https://www.slideshare.net/cloudera/the-impala-cookbook-42530186

To answer your specific question: The statestore and the catalog are
usually recommended to run on their own dedicated hosts, separate from the
worker nodes. The catalog has significant memory requirements, as it has to
keep the complete metadata in memory (databases/tables/fields, the file
layout for the tables and the HDFS block layout of the files, and
optionally all security permissions from Sentry).

You can find sizing formulas both for the memory requirements and for
storage sizing in the above documents.

I'm sure the community would be able to offer more specific help given more
details about your setup and workload.

Hope this helps,

- LaszloG

On Tue, Feb 27, 2018 at 1:58 AM, Peter Horvath <[email protected]> wrote:

> Dear All,
>
> I am in the process of setting up a Hadoop cluster including Impala
> v2.10.0.
>
> I would like to configure Impala State Store and Catalog Service
> appropriately (maybe even on a dedicated host), however I cannot really
> find any documentation on the resource needs of these services or any
> other best practices regarding the sizing of the host machine.
>
> For example I do not know how much memory or disk space should I reserve
> for these services: based on my understanding Impala State Store and
> Catalog Service should be of relatively small footprint compared to other
> big data components, but I am not sure I would be able make a right
> estimation on my own.
>
> Could someone please point me into the right direction?
>
> Thank you,
> Peter
