On Thu, Apr 2, 2020 at 6:08 PM ravi kanth <ravi....@gmail.com> wrote:

> Hi All,
>
> Hoping you all are staying safe in these tough times. And I am utilizing
> this time to learn about Impala. :)
>
> I want to do the following:
>
> 1. Setup Impala on 5 nodes (1 master + 4 data)
> 2. I don't want to use prepackaged Impala from 3rd party vendors; instead,
> I strictly want to do it from scratch.
>
> This is what I did:
> 1. Downloaded the latest Release-3.3.0 available at
> https://impala.apache.org/downloads.html
> 2. Observed that the download is a source project and not a binary,
> which means I need to build the source and generate the binaries.
> 3. So, digging deeper & reading through the following docs, I understand
> that it's not straightforward to bring up an Impala cluster; instead, there
> is a lot of pre-setup that needs to be done.
>
> https://github.com/apache/impala
> https://cwiki.apache.org/confluence/display/IMPALA/Building+Impala
>
> https://cwiki.apache.org/confluence/display/IMPALA/Impala+Build+Prerequisites
>
> I had the following already set up and working, as they were listed as
> mandatory for building Impala from GitHub (the components needed to build
> Impala are Apache Hadoop, Hive, HBase, and Sentry):
> 1. Hadoop
> 2. Hive
> 3. Sentry
> Also, I installed and configured HBase but haven't brought up the
> service. (I don't understand why this was needed in the first place, but
> I still installed & configured it to keep the Impala build happy :))
>
> Questions:
> 1. Is there well-written documentation on how to build the source code
> from scratch for multi-node environments?
>
> I understand that
> https://cwiki.apache.org/confluence/display/IMPALA/Building+Impala deals
> with building; however, it clearly mentions that it's for development
> purposes. Also, the opening line of the document, "*This page describes
> how to build Impala from source and how to configure and run Impala in a
> single node development environment.*", says it's intended for single-node
> development.
>

> Also, the comments on this page don't sound positive, which makes me
> wonder whether the instructions really work. However, it was last updated
> in Oct 2019, which is good.
>
The comment is just that
https://cwiki.apache.org/confluence/display/IMPALA/Bootstrapping+an+Impala+Development+Environment+From+Scratch
is the recommended approach, which is less manual. I don't see any comments
saying that it doesn't work. AFAIK the page you linked still works.

I'd suggest starting with the front page of the wiki if you want developer
docs; it's easier to find the most relevant stuff if you start there:
https://cwiki.apache.org/confluence/display/IMPALA/Impala+Home


> 2. The same build page from the previous question also mentions that the
> source code is compatible with CentOS 7. However, if you look at
> bin/bootstrap_build.sh, it's all hardcoded to Ubuntu (also mentioned in the
> comments). So, it seems like I have to make some changes to the scripts to
> make them compatible with CentOS. Please let me know if I am wrong and if
> there is anything readily available. Unfortunately, I couldn't locate any.
>
bootstrap_development.sh supports CentOS. bootstrap_build.sh is not really
used much, only in a Jenkins job AFAIK.

>
> 3. In the same build page, it was mentioned
> *Installing and Configuring Impala (Obsolete)*
> If it's obsolete, where can I find the latest installation & configuration
> document?
>
The wiki is mostly developer documentation; user-facing documentation is
here: https://impala.apache.org/docs/build/html/index.html

It does have some info about how you might run the different services, but
as of right now the Apache Impala project doesn't provide a multi-node
cluster management solution. Users that I know of tend to either use their
own scripts, use Docker containers, or use Cloudera Manager. The hardest
part is wiring it up to other services: you need the various Hive/Hadoop
configurations so that Impala can connect to the various storage and
metadata services.

At the moment we're in a similar position to, say, the core Linux kernel
project: Apache Impala as a project has been focused on the core technology
and not so much on packaging, distribution, orchestration, etc. That's been
left to others, similar to the relationship between the Linux kernel and
Red Hat, Debian, Ubuntu, etc. I think we'd all like to make it more
accessible, especially for people wanting to try it out, because the
project website is obviously the first place people will come and look.
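
To make the wiring-up point concrete, here is a minimal Python sketch of the
kind of check a home-grown deployment script might run on each node. It only
verifies that a config directory contains the usual Hadoop/Hive client
files. The file list and the helper name missing_configs are my assumptions
for illustration, not something the Impala project ships, so adjust them to
whatever your storage and metadata services actually need.

```python
from pathlib import Path

# Assumed set of client config files; extend or trim this for your services.
REQUIRED_CONFIGS = ("core-site.xml", "hdfs-site.xml", "hive-site.xml")


def missing_configs(conf_dir, required=REQUIRED_CONFIGS):
    """Return the names in `required` that are not regular files in conf_dir."""
    conf_path = Path(conf_dir)
    return [name for name in required if not (conf_path / name).is_file()]
```

You would call something like missing_configs("/path/to/your/conf") on every
node before starting the daemons and refuse to start if the returned list is
non-empty. The path here is a placeholder.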


> 4.
> https://cwiki.apache.org/confluence/display/IMPALA/Impala+Build+Prerequisites
> mentions setting up PostgreSQL to bring up Impala. I am aware
> that Impala needs the Hive Metastore for metadata management, which in my
> case is pointing to MySQL. So, do I still need Postgres?
>
Those instructions are for setting up a development environment. The
development environment includes its own versions of all dependencies
including HMS and will set them all up pointing at the Postgres instance.
If you want to point Impala at your own installation of HMS, etc, then
those instructions don't really apply.
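
As a rough illustration of pointing at an existing HMS: Hive clients
normally locate a remote metastore through the hive.metastore.uris property
in hive-site.xml. Below is a minimal Python sketch that reads that property
from a Hadoop-style config file. The sample XML, the host name, and the
helper name read_hadoop_properties are made up for the example, and I'm
assuming your hive-site.xml uses a remote metastore URI at all.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Parse a Hadoop-style <configuration> document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Made-up example content; the host name is a placeholder.
SAMPLE_HIVE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>
"""

props = read_hadoop_properties(SAMPLE_HIVE_SITE)
# prints thrift://metastore-host.example.com:9083
print(props.get("hive.metastore.uris", "<not set>"))
```

If the property comes back as "<not set>", the file is probably not
configured for a remote metastore, which is worth sorting out before
debugging anything on the Impala side.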


>
> So, to bring up Impala it looks like we need a ton of other
> databases/technologies.
>
Yeah, that's the nature of the big data ecosystem. There's good and bad
about it. Impala is focused on being a great query engine for data stored
in a bunch of different formats - the good is that we can focus on that one
problem, the bad is that it's not self-contained.


> In short, I heard great things about Impala for its efficient analytical
> query processing based on Parquet and I am eager to play with it.
> However, the documentation is creating a lot of pain and is at times
> disappointing. Sorry about that.
>
If you want to kick the tires on a single node setup, the Apache Kudu team
put together this Docker-based quickstart:
https://github.com/apache/kudu/blob/master/examples/quickstart/impala/README.adoc
It's not suitable for production deployments but it is self-contained. I
would highly recommend this because it sounds like it's addressing the pain
points you are hitting.

The development environment you get from running bootstrap_development.sh
is also good for playing around on a single node, but takes longer and has
more potential to hit snags because it's building from scratch:
https://cwiki.apache.org/confluence/display/IMPALA/Bootstrapping+an+Impala+Development+Environment+From+Scratch


>
> Hoping to hear from some brilliant minds.
>
> Thanks,
> Rav
>
