Hi All,

I hope you are all staying safe in these tough times. I am using this
time to learn about Impala. :)

I want to do the following:

1. Set up Impala on 5 nodes (1 master + 4 data)
2. I don't want to use prepackaged Impala from 3rd-party vendors; instead,
I strictly want to build it from scratch.

This is what I did:
1. Downloaded the latest Release-3.3.0 available at
https://impala.apache.org/downloads.html
2. Observed that the download is a source release and not a binary,
which means I need to build the source and generate the binaries.
3. Digging deeper and reading through the following docs, I understand
that it's not straightforward to bring up an Impala cluster; instead, there
is a lot of pre-setup that needs to be done.

https://github.com/apache/impala
https://cwiki.apache.org/confluence/display/IMPALA/Building+Impala
https://cwiki.apache.org/confluence/display/IMPALA/Impala+Build+Prerequisites

I already had the following set up and working, as they are listed as
mandatory for building Impala from GitHub (the components needed to build
Impala are Apache Hadoop, Hive, HBase, and Sentry):
1. Hadoop
2. Hive
3. Sentry
I also installed and configured HBase but haven't brought up its service.
(I don't understand why it is needed in the first place, but I still
installed and configured it to keep the Impala build happy :))

Questions:
1. Is there well-written documentation on how to build the source code
from scratch for multi-node environments?

I understand that
https://cwiki.apache.org/confluence/display/IMPALA/Building+Impala deals
with building; however, it clearly states that it is for development
purposes. Also, the opening line of the document, "*This page describes how
to build Impala from source and how to configure and run Impala in a single
node development environment.*", says it is intended for single-node
development.

Also, the comments on this page don't sound positive, which makes me
wonder whether the instructions really work. However, the page was last
updated in Oct 2019, which is good.

2. The same build page from the previous question also mentions that the
source code is compatible with CentOS 7. However, if you look at
bin/bootstrap_build.sh, it's all hardcoded for Ubuntu (as also mentioned in
its comments). So it seems I have to make some changes to the scripts to
make them compatible with CentOS. Please correct me if I am wrong, and let
me know if anything is readily available. Unfortunately, I couldn't locate
any.

3. The same build page mentions
*Installing and Configuring Impala (Obsolete)*
If it's obsolete, where can I find the latest installation and
configuration document?

4.
https://cwiki.apache.org/confluence/display/IMPALA/Impala+Build+Prerequisites
mentions setting up PostgreSQL to bring up Impala. I am aware that Impala
needs the Hive Metastore for metadata management, which in my case points
to MySQL. So, do I still need Postgres?

So, to bring up Impala it looks like we need a ton of other
databases/technologies.

In short, I have heard great things about Impala's efficient analytical
query processing based on Parquet, and I am eager to play with it.
However, the documentation is causing a lot of pain and is at times
disappointing. Sorry about that.

Hoping to hear from some brilliant minds.

Thanks,
Rav
