Hi All,

I hope you are all staying safe in these tough times. I am using this time to learn about Impala. :)
I want to do the following:

1. Set up Impala on 5 nodes (1 master + 4 data).
2. I don't want to use prepackaged Impala from third-party vendors; instead, I strictly want to build it from scratch.

This is what I did:

1. Downloaded the latest release, Release-3.3.0, available at https://impala.apache.org/downloads.html
2. Observed that the download is a source project and not a binary, which means I need to build the source and generate the binaries.
3. After digging deeper and reading through the following docs, I understand that it's not straightforward to bring up an Impala cluster; there is a lot of pre-setup that needs to be done.
   https://github.com/apache/impala
   https://cwiki.apache.org/confluence/display/IMPALA/Building+Impala
   https://cwiki.apache.org/confluence/display/IMPALA/Impala+Build+Prerequisites

I already had the following set up and working, as they are mentioned as mandatory for building Impala from GitHub (The components needed to build Impala are Apache Hadoop, Hive, HBase, and Sentry):

1. Hadoop
2. Hive
3. Sentry

I also installed and configured HBase, but haven't brought up the service. (I don't understand why this is needed in the first place, but I still installed and configured it to keep the Impala build happy :))

Questions:

1. Is there well-written documentation on how to build the source code from scratch for multi-node environments? I understand that https://cwiki.apache.org/confluence/display/IMPALA/Building+Impala deals with building; however, it clearly mentions that it's for development purposes. Also, the opening line of the document, "*This page describes how to build Impala from source and how to configure and run Impala in a single node development environment.*", says it's intended for single-node development. The comments on this page don't sound positive either, which makes me wonder whether the steps really work. However, the page was last updated in Oct 2019, which is good.

2. The same build page also mentions that the source code is compatible with CentOS 7. However, if you look at bin/bootstrap_build.sh, it's all hardcoded to Ubuntu (this is also mentioned in the comments). So it seems I have to make some changes to the scripts to make them compatible with CentOS. Please correct me if I am wrong, and let me know if there is anything readily available. Unfortunately, I couldn't locate anything.

3. The same build page has a section titled *Installing and Configuring Impala (Obsolete)*. If it's obsolete, where can I find the latest installation and configuration document?

4. https://cwiki.apache.org/confluence/display/IMPALA/Impala+Build+Prerequisites mentions setting up PostgreSQL to bring up Impala. I am aware that Impala needs the Hive Metastore for metadata management, which in my case points to MySQL. So, do I still need Postgres?

So, to bring up Impala, it looks like we need a ton of other databases and technologies.

In short, I have heard great things about Impala's efficient analytical query processing on Parquet, and I am eager to play with it. However, the documentation is causing a lot of pain and is at times disappointing. Sorry about that.

Hoping to hear from some brilliant minds.

Thanks,
Rav