Hi Tim,

I configured all the dependencies and tried running buildall.sh with the
-release flag. However, the Maven build got stuck and then failed downloading:
https://repository.cloudera.com/content/repositories/third-party/org/apache/maven/plugins/maven-clean-plugin/2.5/maven-clean-plugin-2.5.pom

I looked up this pom and got a File Not Found response.

Thanks,
Rav


On Mon, Apr 13, 2020 at 11:01 AM Tim Armstrong <tarmstr...@cloudera.com>
wrote:

> For those following along, I created a code review to improve the README a
> bit: https://gerrit.cloudera.org/#/c/15719/
>
> Thanks Ravi for asking these questions, it helps us make the project
> better.
>
> On Mon, Apr 6, 2020 at 9:00 PM ravi kanth <ravi....@gmail.com> wrote:
>
>> Hi Tim,
>>
>> Thanks for taking the time and explaining everything in detail. I will
>> invest more time in building this cluster & will reach out to the
>> community if I face any issues.
>>
>> Thanks,
>> Rav
>>
>>
>> On Mon, Apr 6, 2020 at 9:45 AM Tim Armstrong <tarmstr...@cloudera.com>
>> wrote:
>>
>>> > I had the following already set up and working, as they were mentioned as
>>> mandatory for building Impala from GitHub (The components needed to
>>> build Impala are Apache Hadoop, Hive, HBase, and Sentry)
>>> We should probably remove some of that stuff from the README on GitHub;
>>> it's mainly confusing. The real dev docs are on the Apache wiki and the real
>>> user docs are elsewhere. Those are just some notes about how the
>>> development environment works that are not of general interest.
>>>
>>>
>>> > 1. Is there well-written documentation on how to build the source
>>> code from scratch for multi-node environments?
>>> The build scripts are all the same - the impalad, statestored, catalogd
>>> binaries used in the dev environment are deployable in production setups.
>>> For a production deployment you want a release build (pass in the -release
>>> flag to buildall.sh).
>>>
>>> On Mon, Apr 6, 2020 at 9:40 AM Tim Armstrong <tarmstr...@cloudera.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Apr 2, 2020 at 6:08 PM ravi kanth <ravi....@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Hoping you all are staying safe in these tough times. And I am
>>>>> utilizing this time to learn about Impala. :)
>>>>>
>>>>> I want to do the following:
>>>>>
>>>>> 1. Set up Impala on 5 nodes (1 master + 4 data).
>>>>> 2. I don't want to use prepackaged Impala from third-party
>>>>> vendors; instead, I want to build it strictly from scratch.
>>>>>
>>>>> This is what I did:
>>>>> 1. Downloaded the latest Release-3.3.0 available at
>>>>> https://impala.apache.org/downloads.html
>>>>> 2. Observed that the download is a source project and not a
>>>>> binary, which means I need to build the source and generate the binaries.
>>>>> 3. After digging deeper and reading through the following docs, I
>>>>> understand that it's not straightforward to bring up an Impala cluster;
>>>>> there is a lot of pre-setup that needs to be done.
>>>>>
>>>>> https://github.com/apache/impala
>>>>> https://cwiki.apache.org/confluence/display/IMPALA/Building+Impala
>>>>>
>>>>> https://cwiki.apache.org/confluence/display/IMPALA/Impala+Build+Prerequisites
>>>>>
>>>>> I had the following already set up and working, as they were mentioned as
>>>>> mandatory for building Impala from GitHub (The components needed to
>>>>> build Impala are Apache Hadoop, Hive, HBase, and Sentry):
>>>>> 1. Hadoop
>>>>> 2. Hive
>>>>> 3. Sentry
>>>>> I also installed and configured HBase but haven't brought up the
>>>>> service. (I don't understand why this was needed in the first place, but I
>>>>> still installed and configured it to keep the Impala build happy :))
>>>>>
>>>>> Questions:
>>>>> 1. Is there well-written documentation on how to build the source
>>>>> code from scratch for multi-node environments?
>>>>>
>>>>> I understand
>>>>> https://cwiki.apache.org/confluence/display/IMPALA/Building+Impala
>>>>> deals with building; however, it clearly mentions that it's for development
>>>>> purposes. Also, the first line of the document, "*This page
>>>>> describes how to build Impala from source and how to configure and run
>>>>> Impala in a single node development environment.*", says it's intended
>>>>> for single-node development.
>>>>>
>>>>
>>>>> Also, the comments on this page don't sound positive, which makes me
>>>>> wonder whether the instructions really work. However, it was last updated
>>>>> in October 2019, which is good.
>>>>>
>>>> The comment is just that
>>>> https://cwiki.apache.org/confluence/display/IMPALA/Bootstrapping+an+Impala+Development+Environment+From+Scratch
>>>> is the recommended approach, which is less manual. I don't see any comments
>>>> saying that it doesn't work. AFAIK the page you linked still works.
>>>>
>>>> I'd suggest starting with the front page of the wiki if you want
>>>> developer docs, it's easier finding the most relevant stuff if you start
>>>> there: https://cwiki.apache.org/confluence/display/IMPALA/Impala+Home
>>>>
>>>>
>>>>> 2. The same build page from the previous question also mentions that
>>>>> the source code is compatible with CentOS 7. However, if you look at
>>>>> bin/bootstrap_build.sh, it's all hardcoded to Ubuntu (also mentioned in the
>>>>> comments). So, it seems I have to make some changes to the scripts to
>>>>> make them compatible with CentOS. Please let me know if I am wrong and if
>>>>> there is anything readily available. Unfortunately, I couldn't locate any.
>>>>>
>>>> bootstrap_development.sh supports CentOS. bootstrap_build.sh is not
>>>> really used much, only in a Jenkins job AFAIK.
>>>>
>>>>>
>>>>> 3. The same build page mentions
>>>>> *Installing and Configuring Impala (Obsolete)*
>>>>> If it's obsolete, where can I find the latest installation and
>>>>> configuration document?
>>>>>
>>>> The wiki is mostly developer documentation, user-facing documentation
>>>> is here: https://impala.apache.org/docs/build/html/index.html.
>>>>
>>>> It does have some info about how you might run the different services,
>>>> but as of right now the Apache Impala project doesn't provide a multi-node
>>>> cluster management solution. Users that I know of tend to either use their
>>>> own scripts, use docker containers, or use Cloudera Manager. The hardest
>>>> part is wiring it up to other services - you need the various Hive/Hadoop
>>>> configurations so that Impala can connect to the various storage and
>>>> metadata services. At the moment we're in a similar position to, say, the
>>>> core Linux kernel project, where Apache Impala as a project has been
>>>> focused on the core technology and not so much on packaging, distribution,
>>>> orchestration, etc. - that's been left to others, similar to the
>>>> relationship between the Linux kernel and Red Hat, Debian, Ubuntu, etc. I
>>>> think we'd all like to make it more accessible, especially for people
>>>> wanting to try it out, because the project website is obviously the first
>>>> place people will come and look.
>>>>
>>>>
>>>>> 4.
>>>>> https://cwiki.apache.org/confluence/display/IMPALA/Impala+Build+Prerequisites
>>>>> mentions setting up PostgreSQL to bring up Impala. I am aware
>>>>> that Impala needs the Hive Metastore for metadata management, which in my
>>>>> case is pointing to MySQL. So, do I still need Postgres?
>>>>>
>>>> Those instructions are for setting up a development environment. The
>>>> development environment includes its own versions of all dependencies
>>>> including HMS and will set them all up pointing at the Postgres instance.
>>>> If you want to point it at your own installation of HMS, etc, then it
>>>> doesn't really apply.
>>>>
>>>>
>>>>>
>>>>> So, to bring up Impala it looks like we need a ton of other
>>>>> databases/technologies.
>>>>>
>>>> Yeah, that's the nature of the big data ecosystem. There's good and bad
>>>> about it. Impala is focused on being a great query engine for data stored
>>>> in a bunch of different formats - the good is that we can focus on that one
>>>> problem, the bad is that it's not self-contained.
>>>>
>>>>
>>>>> In short, I have heard great things about Impala's efficient analytical
>>>>> query processing based on Parquet, and I am eager to play with it.
>>>>> However, the documentation is causing a lot of pain and is at times
>>>>> disappointing. Sorry about that.
>>>>>
>>>> If you want to kick the tires on a single node setup, the Apache Kudu
>>>> team put together this docker-based quickstart:
>>>> https://github.com/apache/kudu/blob/master/examples/quickstart/impala/README.adoc.
>>>> It's not suitable for production deployments but it is self-contained. I
>>>> would highly recommend this because it sounds like it's addressing the pain
>>>> points you are hitting.
>>>>
>>>> The development environment you get from running
>>>> bootstrap_development.sh is also good for playing around on a single node,
>>>> but takes longer and has more potential to hit snags because it's building
>>>> from scratch:
>>>> https://cwiki.apache.org/confluence/display/IMPALA/Bootstrapping+an+Impala+Development+Environment+From+Scratch
>>>>
>>>>
>>>>>
>>>>> Hoping to hear from some brilliant minds.
>>>>>
>>>>> Thanks,
>>>>> Rav
>>>>>
>>>>
