Thanks Ryota. That helps! And yes, I almost forgot about code deployment. It's a great point. I guess we will take care of it as part of a process that's independent of the general ETL workflow. As you said, something like WebHDFS should be handy.
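As a rough illustration of the WebHDFS route mentioned above, a deployment step on the ETL machine could push workflow files to HDFS over plain HTTP, with no Hadoop client installed. This is only a sketch: the namenode host, port 50070, user name, and paths are hypothetical placeholders, and WebHDFS must be enabled on the cluster.

```python
# Hypothetical sketch: upload an Oozie application file to HDFS via WebHDFS.
# Hostnames, port, user, and paths are placeholders, not from the thread.
import urllib.request
import urllib.error


def webhdfs_create_url(namenode, port, hdfs_path, user):
    """Build the WebHDFS CREATE URL for the initial namenode request."""
    return ("http://%s:%d/webhdfs/v1%s?op=CREATE&overwrite=true&user.name=%s"
            % (namenode, port, hdfs_path, user))


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    # WebHDFS answers CREATE with a 307 redirect to a datanode; we want to
    # read the Location header ourselves, so suppress automatic redirects.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def upload_to_hdfs(namenode, port, hdfs_path, local_file, user):
    """Two-step WebHDFS upload: ask the namenode, then PUT to the datanode."""
    url = webhdfs_create_url(namenode, port, hdfs_path, user)
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        opener.open(urllib.request.Request(url, method="PUT"))
        raise RuntimeError("expected a 307 redirect from the namenode")
    except urllib.error.HTTPError as e:
        if e.code != 307:
            raise
        datanode_url = e.headers["Location"]
    with open(local_file, "rb") as f:
        req = urllib.request.Request(datanode_url, data=f.read(), method="PUT")
        urllib.request.urlopen(req)  # WebHDFS returns 201 Created on success
```

A script like this (or plain `curl` doing the same two PUTs) would let the automated deployment run from the ETL box without shipping any Hadoop configuration there.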
Thanks,
Gaurav

On Mon, Mar 25, 2013 at 7:25 PM, Ryota Egashira <[email protected]> wrote:

> Gaurav
>
> Yes, I think Oozie is a perfect fit for isolation.
> As you mentioned, on the ETL machine you only need the Oozie client, which
> can talk to the Oozie server over HTTP.
> The Oozie client can submit a job and check its status.
>
> One thing that you might want to think about is the deployment of Oozie
> applications (meaning workflow/coordinator.xml and all jars) on HDFS.
> If it is a one-time thing and you keep using the same workflow/libraries,
> then it's fine to copy them to HDFS manually once.
> But if you change the Oozie application frequently and need to automate the
> deployment process, there could be cases where the ETL machine needs to
> access HDFS. (Using WebHDFS or similar, there should be a way around that.)
>
> Thanks
> Ryota
>
> On 3/25/13 2:58 PM, "Gaurav Pandit" <[email protected]> wrote:
>
> >Hi,
> >
> >I am new to Oozie and still exploring its capabilities. I have done some
> >basic testing and it's working out fine with simple workflows.
> >
> >Now I want to make sure we are on the right path to using Oozie in our
> >environment. Any feedback would be a great help.
> >
> >Our current ETL environment runs on Linux, using Perl, an RDBMS, etc.
> >(a classic ETL setup). We also have a 12-node cluster dedicated to Hadoop
> >(using CDH4). So far, the two environments are independent of each other.
> >
> >The way I understand it, as long as we are able to install the Oozie
> >client on the ETL machine and point it at the cluster with the correct
> >Oozie server URL, namenode, tasktracker, etc. properties, we should be
> >able to build a process flow that executes Hadoop jobs from the ETL
> >machine when required.
> >
> >One very simple example could be:
> >
> >1. ETL: extract, transform, and load a file into an RDBMS table.
> >2. From the same machine, execute an Oozie workflow that does the
> >following on the Hadoop cluster:
> >   a. Use Sqoop to load this table (and some other tables) into Hive (or
> >      HBase).
> >   b. Run a Pig script on this data to create, say, summary data.
> >   c. Store the summary in, say, an HBase table.
> >   d. Return success or failure to the calling process.
> >3. Complete the ETL process.
> >
> >Our goal is to isolate the cluster from the ETL development by using
> >Oozie, such that the only tool the ETL machine needs is an Oozie client
> >(and nothing else from the Hadoop cluster - i.e. no Hadoop/Pig/Hive
> >binaries, libraries, or configuration files).
> >
> >What I understand is that Oozie can provide such isolation while at the
> >same time providing a way to interact with the cluster. Is this
> >understanding correct?
> >
> >Thanks!
> >- Gaurav
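For reference, the client-only interaction discussed in this thread can be sketched against Oozie's REST (web services) API, which is what the Oozie client itself speaks over HTTP. The server URL, application path, and property values below are hypothetical placeholders, not taken from the thread.

```python
# Hypothetical sketch: submit and poll an Oozie workflow from the ETL
# machine using only HTTP. Server URL, paths, and values are placeholders.
import json
import urllib.request

OOZIE = "http://oozie.example.com:11000/oozie"  # assumed Oozie server URL


def job_config_xml(app_path, user, namenode, jobtracker):
    """Build the Hadoop-configuration XML body that Oozie's REST API expects."""
    props = {
        "oozie.wf.application.path": app_path,
        "user.name": user,
        "nameNode": namenode,
        "jobTracker": jobtracker,
    }
    body = "".join(
        "<property><name>%s</name><value>%s</value></property>" % (k, v)
        for k, v in props.items()
    )
    return "<configuration>%s</configuration>" % body


def submit_and_start(config_xml):
    """POST the config to /v1/jobs?action=start; returns the new job id."""
    req = urllib.request.Request(
        OOZIE + "/v1/jobs?action=start",
        data=config_xml.encode("utf-8"),
        headers={"Content-Type": "application/xml;charset=UTF-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]


def job_status(job_id):
    """GET job info; status is e.g. RUNNING, SUCCEEDED, or KILLED."""
    url = OOZIE + "/v1/job/" + job_id + "?show=info"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["status"]
```

The ETL process would call `submit_and_start`, then poll `job_status` until the workflow reports SUCCEEDED or a failure state (step 2d in the example above), with nothing from the Hadoop stack installed locally.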
