Gaurav,
Yes, I think Oozie is a perfect fit for isolation. As you mentioned, on the ETL machine you only need the Oozie client, which talks to the Oozie server over HTTP. The Oozie client can submit a job and check its status.

One thing you might want to think about is the deployment of Oozie applications (meaning workflow.xml/coordinator.xml and all jars) to HDFS. If it is a one-time thing and you keep using the same workflow/libraries, then it's fine to copy them to HDFS manually once. But if you change the Oozie application frequently and need to automate the deployment process, there could be cases where the ETL machine needs to access HDFS. (Using WebHDFS or similar, there should be a way around that.)

Thanks
Ryota

On 3/25/13 2:58 PM, "Gaurav Pandit" <[email protected]> wrote:

>Hi,
>
>I am new to Oozie and still exploring its capabilities. I have done some
>basic testing and it's working out fine with simple workflows.
>
>Now I want to make sure we are on the right path to use Oozie in our
>environment. Any feedback would be a great help.
>
>Our current ETL environment runs on Linux and uses Perl, an RDBMS, etc.
>(a classic ETL setup). We also have a 12-node cluster dedicated to Hadoop
>(using CDH4). Both environments are independent of each other so far.
>
>The way I understand it, as long as we are able to install the Oozie
>client on the ETL-side machine, and have it point to the cluster with the
>correct Oozie server URL, namenode, tasktracker, etc. properties, we
>should be able to build a process flow that executes Hadoop jobs from the
>ETL machine when required.
>
>One very simple example could be:
>
>1. ETL: extract, transform, and load a file into an RDBMS table.
>2. From the same machine, execute an Oozie workflow that does the
>following on the Hadoop cluster:
>   a. Use Sqoop to load this table (and some other tables) into Hive (or
>HBase).
>   b. Run a Pig script on this data to create, say, summary data.
>   c. Store the summary into, say, an HBase table.
>   d. Return success or failure to the calling process.
>3. Complete the ETL process.
>
>Our goal is to isolate the cluster from the ETL development by using
>Oozie, such that the only tool the ETL machine needs is an Oozie client
>(and nothing else from the Hadoop cluster - i.e. no Hadoop/Pig/Hive
>binaries, libraries, or configuration files).
>
>What I understand is that Oozie is able to provide such isolation while
>at the same time providing a way to interact with the cluster. Is this
>understanding correct?
>
>Thanks!
>- Gaurav
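The Sqoop-then-Pig pipeline described in the quoted message could map onto an Oozie workflow definition roughly like this. This is only a sketch: the application name, the jdbc connect string, the table name, and the Pig script name are illustrative assumptions, and the `${jobTracker}`/`${nameNode}` properties would come from the job.properties submitted by the Oozie client on the ETL machine.

```xml
<workflow-app name="etl-summary-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="sqoop-import"/>

    <!-- Step a: load the RDBMS table into Hive via Sqoop -->
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect jdbc:mysql://db-host/etl --table source_table --hive-import</command>
        </sqoop>
        <ok to="pig-summary"/>
        <error to="fail"/>
    </action>

    <!-- Steps b/c: build the summary data with Pig -->
    <action name="pig-summary">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>summary.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <!-- Step d: a failed action surfaces as a KILLED workflow,
         which the Oozie client on the ETL machine can observe -->
    <kill name="fail">
        <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The ETL process can then poll the workflow's status (e.g. `oozie job -info <job-id>`) to get the success/failure signal from step d.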

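On the automated-deployment point: the ETL machine can push an updated workflow.xml to HDFS over the WebHDFS REST API without any local Hadoop binaries. Below is a minimal sketch of building the WebHDFS CREATE request URL; the namenode host, port, user, and paths are illustrative assumptions, not values from this thread.

```python
"""Sketch: deploying an Oozie application file over WebHDFS.

WebHDFS file creation is a two-step PUT: the client first PUTs to the
namenode with op=CREATE, receives a 307 redirect to a datanode, and then
PUTs the file body to the redirect location.
"""
import urllib.parse


def webhdfs_create_url(namenode, hdfs_path, user, port=50070):
    """Build the step-1 CREATE URL sent to the namenode."""
    query = urllib.parse.urlencode({
        "op": "CREATE",          # WebHDFS operation
        "overwrite": "true",     # replace an existing deployment
        "user.name": user,       # identity for simple (non-Kerberos) auth
    })
    return "http://%s:%d/webhdfs/v1%s?%s" % (namenode, port, hdfs_path, query)


# Example: URL for uploading the workflow definition from the ETL machine.
url = webhdfs_create_url("namenode.example.com",
                         "/user/etl/apps/summary-wf/workflow.xml", "etl")
```

A deployment script would issue the PUT with any plain HTTP client, then submit the workflow with the Oozie client as usual.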