Hi Tim,

I configured all the dependencies and tried a release build by running buildall.sh with the -release flag. However, the Maven build got stuck and then failed while downloading:
https://repository.cloudera.com/content/repositories/third-party/org/apache/maven/plugins/maven-clean-plugin/2.5/maven-clean-plugin-2.5.pom
I looked up this POM and got a File Not Found response.

Thanks,
Rav

On Mon, Apr 13, 2020 at 11:01 AM Tim Armstrong <tarmstr...@cloudera.com> wrote:

> For those following along, I created a code review to improve the README a
> bit: https://gerrit.cloudera.org/#/c/15719/
>
> Thanks Ravi for asking these questions, it helps us make the project
> better.
>
> On Mon, Apr 6, 2020 at 9:00 PM ravi kanth <ravi....@gmail.com> wrote:
>
>> Hi Tim,
>>
>> Thanks for taking the time and explaining everything in detail. I will
>> invest more time in building this cluster & will reach out to the
>> community if I face any issues.
>>
>> Thanks,
>> Rav
>>
>>
>> On Mon, Apr 6, 2020 at 9:45 AM Tim Armstrong <tarmstr...@cloudera.com>
>> wrote:
>>
>>> > I had the following already set up and working as they were mentioned
>>> as mandatory for building Impala from GitHub (The components needed to
>>> build Impala are Apache Hadoop, Hive, HBase, and Sentry)
>>> We should probably remove some of that stuff from the README on GitHub,
>>> it's mainly confusing - the real dev docs are on the Apache wiki and the
>>> real user docs are elsewhere. Those are just some notes about how the
>>> development environment works that are not of general interest.
>>>
>>>
>>> > 1. Is there well-written documentation on how to build the source
>>> code from scratch for multi-node environments?
>>> The build scripts are all the same - the impalad, statestored, catalogd
>>> binaries used in the dev environment are deployable in production setups.
>>> For a production deployment you want a release build (pass in the -release
>>> flag to buildall.sh).
>>>
>>> On Mon, Apr 6, 2020 at 9:40 AM Tim Armstrong <tarmstr...@cloudera.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Apr 2, 2020 at 6:08 PM ravi kanth <ravi....@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Hoping you all are staying safe in these tough times. And I am
>>>>> utilizing this time to learn about Impala.
>>>>> :)
>>>>>
>>>>> I want to do the following:
>>>>>
>>>>> 1. Set up Impala on 5 nodes (1 master + 4 data)
>>>>> 2. I don't want to use prepackaged Impala from 3rd party
>>>>> vendors; instead, I strictly want to do it from scratch.
>>>>>
>>>>> This is what I did:
>>>>> 1. Downloaded the latest Release-3.3.0 available at
>>>>> https://impala.apache.org/downloads.html
>>>>> 2. Observed that the download is a source project and not the
>>>>> binary, which means I need to build the source and generate the binaries.
>>>>> 3. So, digging deeper & reading through the following docs, I
>>>>> understand that it's not straightforward to bring up an Impala cluster;
>>>>> instead there is a lot of pre-setup that needs to be done.
>>>>>
>>>>> https://github.com/apache/impala
>>>>> https://cwiki.apache.org/confluence/display/IMPALA/Building+Impala
>>>>>
>>>>> https://cwiki.apache.org/confluence/display/IMPALA/Impala+Build+Prerequisites
>>>>>
>>>>> I had the following already set up and working as they were mentioned
>>>>> as mandatory for building Impala from GitHub (The components needed to
>>>>> build Impala are Apache Hadoop, Hive, HBase, and Sentry)
>>>>> 1. Hadoop
>>>>> 2. Hive
>>>>> 3. Sentry
>>>>> Also, installed and configured but haven't brought up the service for
>>>>> HBase. (I don't understand why this was needed in the first place but
>>>>> still installed & configured it to make the Impala build happy :))
>>>>>
>>>>> Questions:
>>>>> 1. Is there well-written documentation on how to build the source
>>>>> code from scratch for multi-node environments?
>>>>>
>>>>> I understand
>>>>> https://cwiki.apache.org/confluence/display/IMPALA/Building+Impala
>>>>> deals with building; however, it clearly mentions that it's for
>>>>> development purposes. Also, the starting line in the document "*This page
>>>>> describes how to build Impala from source and how to configure and run
>>>>> Impala in a single node development environment.*" says it's intended
>>>>> for single-node development.
>>>>>
>>>>
>>>>> Also, the comments on this page don't sound positive, which makes me
>>>>> wonder whether the instructions really work. However, it was last
>>>>> updated in Oct, 2019 which is good.
>>>>>
>>>> The comment is just that
>>>> https://cwiki.apache.org/confluence/display/IMPALA/Bootstrapping+an+Impala+Development+Environment+From+Scratch
>>>> is the recommended approach, which is less manual. I don't see any
>>>> comments saying that it doesn't work. AFAIK the page you linked still works.
>>>>
>>>> I'd suggest starting with the front page of the wiki if you want
>>>> developer docs, it's easier finding the most relevant stuff if you start
>>>> there: https://cwiki.apache.org/confluence/display/IMPALA/Impala+Home
>>>>
>>>>
>>>>> 2. The same build page from the previous question also mentions that
>>>>> the source code is compatible with CentOS 7. However, if you look at
>>>>> bin/bootstrap_build.sh, it's all hardcoded to Ubuntu (also mentioned in
>>>>> the comments). So, it seems like I have to make some changes to the
>>>>> scripts to make it compatible with CentOS. Please correct me if I am
>>>>> wrong and if there is anything readily available. Unfortunately, I
>>>>> couldn't locate any.
>>>>>
>>>> bootstrap_development.sh supports CentOS. bootstrap_build.sh is not
>>>> really used much, only in a Jenkins job AFAIK.
>>>>
>>>>>
>>>>> 3. In the same build page, it was mentioned
>>>>> *Installing and Configuring Impala (Obsolete)*
>>>>> If it's obsolete, where can I find the latest installation &
>>>>> configuration document?
>>>>>
>>>> The wiki is mostly developer documentation, user-facing documentation
>>>> is here: https://impala.apache.org/docs/build/html/index.html.
>>>>
>>>> It does have some info about how you might run the different services,
>>>> but as of right now the Apache Impala project doesn't provide a multi-node
>>>> cluster management solution. Users that I know of tend to either use their
>>>> own scripts, use docker containers, or use Cloudera Manager.
>>>> The hardest part is wiring it up to other services - you need the
>>>> various hive/hadoop configurations so that Impala can connect to the
>>>> various storage and metadata services. At the moment we're in a similar
>>>> position to, say, the core Linux kernel project, where Apache Impala as a
>>>> project has been focused on the core technology and not so much on
>>>> packaging, distribution, orchestration, etc - that's been left to others,
>>>> similar to the relationship between the Linux kernel and Red Hat, Debian,
>>>> Ubuntu, etc. I think we'd all like to make it more accessible, especially
>>>> for people wanting to try it out, cause the project website is obviously
>>>> the first place people will come and look.
>>>>
>>>>
>>>>> 4.
>>>>> https://cwiki.apache.org/confluence/display/IMPALA/Impala+Build+Prerequisites
>>>>> mentions setting up PostgreSQL to bring up Impala. I am aware
>>>>> that Impala needs Hive Metastore for metadata management, which in my
>>>>> case is pointing to MySQL. So, do I still need Postgres?
>>>>>
>>>> Those instructions are for setting up a development environment. The
>>>> development environment includes its own versions of all dependencies
>>>> including HMS and will set them all up pointing at the postgres instance.
>>>> If you want to point it at your own installation of HMS, etc, then it
>>>> doesn't really apply.
>>>>
>>>>
>>>>>
>>>>> So, to bring up Impala it looks like we need a ton of other
>>>>> databases/technologies.
>>>>>
>>>> Yeah, that's the nature of the big data ecosystem. There's good and bad
>>>> about it. Impala is focused on being a great query engine for data stored
>>>> in a bunch of different formats - the good is that we can focus on that one
>>>> problem, the bad is that it's not self-contained.
>>>>
>>>>
>>>>> In short, I heard great things about Impala for its efficient analytical
>>>>> query processing based on Parquet and I am eagerly waiting to play with
>>>>> it.
>>>>> However, the documentation is creating a lot of pain and at times is
>>>>> disappointing. Sorry about that.
>>>>>
>>>> If you want to kick the tires on a single node setup, the Apache Kudu
>>>> team put together this docker-based quickstart:
>>>> https://github.com/apache/kudu/blob/master/examples/quickstart/impala/README.adoc.
>>>> It's not suitable for production deployments but it is self-contained. I
>>>> would highly recommend this because it sounds like it's addressing the pain
>>>> points you are hitting.
>>>>
>>>> The development environment you get from running
>>>> bootstrap_development.sh also is good for playing around on a single node,
>>>> but takes longer and has more potential to hit snags cause it's building
>>>> from scratch:
>>>> https://cwiki.apache.org/confluence/display/IMPALA/Bootstrapping+an+Impala+Development+Environment+From+Scratch
>>>>
>>>>
>>>>>
>>>>> Hoping to hear from some brilliant minds.
>>>>>
>>>>> Thanks,
>>>>> Rav
>>>>>
>>>>
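
A note on the failed POM download reported at the top of this thread: the following is a minimal Python sketch for working out where an artifact's POM should live in a repository and checking whether it is reachable. It assumes the standard Maven repository layout (dots in the group ID become path separators, followed by the artifact ID, the version, and `artifactId-version.pom`). The helper names `pom_url` and `is_reachable` are hypothetical, and Maven Central is included only as a comparison point; the thread does not say which repositories the Impala build is meant to resolve this plugin from.

```python
import urllib.error
import urllib.request

# Repository base URL taken from the failed download quoted in this thread.
CLOUDERA_THIRD_PARTY = "https://repository.cloudera.com/content/repositories/third-party"
# Maven Central, used only for comparison (an assumption, not something the
# thread says the Impala build is configured to use).
MAVEN_CENTRAL = "https://repo.maven.apache.org/maven2"


def pom_url(repo_base: str, group_id: str, artifact_id: str, version: str) -> str:
    """Build a POM URL using the standard Maven repository layout."""
    group_path = group_id.replace(".", "/")
    base = repo_base.rstrip("/")
    return f"{base}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.pom"


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request succeeds, False on HTTP or network errors."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # HTTPError (e.g. 404) is a subclass of URLError; timeouts are OSError.
        return False
```

Calling `is_reachable(pom_url(base, "org.apache.maven.plugins", "maven-clean-plugin", "2.5"))` for each base URL shows whether the same artifact resolves from one repository but not the other. If only the Cloudera URL fails, that would point at repository availability and not at the local setup, though that is an inference and not something confirmed in the thread.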