Hi, I am new to Oozie and still exploring its capabilities. I have done some basic testing and it's working out fine with simple workflows.
Now I want to make sure we are on the right path before adopting Oozie in our environment; any feedback would be a great help.

Our current ETL environment runs on Linux using Perl, an RDBMS, etc. (a classic ETL setup). We also have a 12-node cluster dedicated to Hadoop (running CDH4). So far the two environments are independent of each other.

My understanding is that, as long as we can install the Oozie client on the ETL machine and point it at the cluster (with the correct Oozie server URL, NameNode, JobTracker, and related properties), we should be able to build a process flow that executes Hadoop jobs from the ETL machine when required. A very simple example:

1. ETL: extract, transform, and load a file into an RDBMS table.
2. From the same machine, execute an Oozie workflow that does the following on the Hadoop cluster:
   a. Use Sqoop to load this table (and some other tables) into Hive (or HBase).
   b. Run a Pig script on this data to create, say, summary data.
   c. Store the summary into, say, an HBase table.
   d. Return success or failure to the calling process.
3. Complete the ETL process.

Our goal is to isolate the cluster from the ETL development by using Oozie, such that the only tool the ETL machine needs is the Oozie client and nothing else from the Hadoop cluster (no Hadoop/Pig/Hive binaries, libraries, or configuration files). My understanding is that Oozie can provide such isolation while still giving us a way to interact with the cluster.

Is this understanding correct?

Thanks!
- Gaurav
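For what it's worth, the Hadoop-side part of the flow I described (Sqoop import, then a Pig summary) could be sketched as an Oozie workflow.xml roughly like the following. This is only a sketch: the app name, table name, script name, and the `${jobTracker}`/`${nameNode}`/`${jdbcUrl}` properties are placeholders I made up, and I'm assuming the "store summary into HBase" step would happen inside the Pig script itself (e.g. via HBaseStorage):

```xml
<workflow-app name="etl-summary" xmlns="uri:oozie:workflow:0.4">
    <start to="sqoop-import"/>

    <!-- Step (a): Sqoop the RDBMS staging table into Hive -->
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect ${jdbcUrl} --table STAGING_TABLE --hive-import</command>
        </sqoop>
        <ok to="pig-summary"/>
        <error to="fail"/>
    </action>

    <!-- Steps (b)/(c): Pig builds the summary (and could write to HBase) -->
    <action name="pig-summary">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>summary.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <!-- Step (d): a kill node surfaces failure to the caller -->
    <kill name="fail">
        <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

If my isolation assumption holds, the ETL machine would then only need the Oozie CLI (or even just curl against the Oozie REST API) plus a small job.properties file, submitting with something like `oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run`, where `oozie-host` is a placeholder for our Oozie server.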
