Thanks Ryota. That helps! And yes, I almost forgot about code deployment. It's a great point. I guess we will take care of it as part of a process that's independent of the general ETL workflow. As you said, something like WebHDFS should be handy.
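As a rough illustration of the WebHDFS route mentioned above, a deployment step on the ETL machine could push workflow files to HDFS over plain HTTP, with no Hadoop client installed. This is only a sketch: the namenode host, port 50070, user name, and paths are hypothetical placeholders, and WebHDFS must be enabled on the cluster.

```python
# Hypothetical sketch: upload an Oozie application file to HDFS via WebHDFS.
# Hostnames, port, user, and paths are placeholders, not from the thread.
import urllib.request
import urllib.error


def webhdfs_create_url(namenode, port, hdfs_path, user):
    """Build the WebHDFS CREATE URL for the initial namenode request."""
    return ("http://%s:%d/webhdfs/v1%s?op=CREATE&overwrite=true&user.name=%s"
            % (namenode, port, hdfs_path, user))


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    # WebHDFS answers CREATE with a 307 redirect to a datanode; we want to
    # read the Location header ourselves, so suppress automatic redirects.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def upload_to_hdfs(namenode, port, hdfs_path, local_file, user):
    """Two-step WebHDFS upload: ask the namenode, then PUT to the datanode."""
    url = webhdfs_create_url(namenode, port, hdfs_path, user)
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        opener.open(urllib.request.Request(url, method="PUT"))
        raise RuntimeError("expected a 307 redirect from the namenode")
    except urllib.error.HTTPError as e:
        if e.code != 307:
            raise
        datanode_url = e.headers["Location"]
    with open(local_file, "rb") as f:
        req = urllib.request.Request(datanode_url, data=f.read(), method="PUT")
        urllib.request.urlopen(req)  # WebHDFS returns 201 Created on success
```

A script like this (or plain `curl` doing the same two PUTs) would let the automated deployment run from the ETL box without shipping any Hadoop configuration there.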
Thanks,
Gaurav

On Mon, Mar 25, 2013 at 7:25 PM, Ryota Egashira <[email protected]> wrote:

> Gaurav
>
> Yes, I think Oozie is a perfect fit for isolation.
> As you mentioned, on the ETL machine you only need the Oozie client, which
> can talk to the Oozie server over HTTP.
> The Oozie client can submit a job and check its status.
>
> One thing that you might want to think about is the deployment of Oozie
> applications (meaning workflow/coordinator.xml and all jars) on HDFS.
> If it is a one-time thing and you keep using the same workflow/libraries,
> then it's fine to copy them to HDFS manually once.
> But if you change the Oozie application frequently and need to automate the
> deployment process, there could be cases where the ETL machine needs to
> access HDFS. (Using WebHDFS or similar, there should be a way around that.)
>
> Thanks
> Ryota
>
> On 3/25/13 2:58 PM, "Gaurav Pandit" <[email protected]> wrote:
>
> >Hi,
> >
> >I am new to Oozie and still exploring its capabilities. I have done some
> >basic testing and it's working out fine with simple workflows.
> >
> >Now I want to make sure we are on the right path to using Oozie in our
> >environment. Any feedback would be a great help.
> >
> >Our current ETL environment runs on Linux, using Perl, an RDBMS, etc.
> >(a classic ETL setup). We also have a 12-node cluster dedicated to Hadoop
> >(using CDH4). So far, the two environments are independent of each other.
> >
> >The way I understand it, as long as we are able to install the Oozie
> >client on the ETL machine and point it at the cluster with the correct
> >Oozie server URL, namenode, tasktracker, etc. properties, we should be
> >able to build a process flow that executes Hadoop jobs from the ETL
> >machine when required.
> >
> >One very simple example could be:
> >
> >1. ETL: extract, transform, and load a file into an RDBMS table.
> >2. From the same machine, execute an Oozie workflow that does the
> >following on the Hadoop cluster:
> >   a. Use Sqoop to load this table (and some other tables) into Hive (or
> >      HBase).
> >   b. Run a Pig script on this data to create, say, summary data.
> >   c. Store the summary in, say, an HBase table.
> >   d. Return success or failure to the calling process.
> >3. Complete the ETL process.
> >
> >Our goal is to isolate the cluster from the ETL development by using
> >Oozie, such that the only tool the ETL machine needs is an Oozie client
> >(and nothing else from the Hadoop cluster - i.e. no Hadoop/Pig/Hive
> >binaries, libraries, or configuration files).
> >
> >What I understand is that Oozie can provide such isolation while at the
> >same time providing a way to interact with the cluster. Is this
> >understanding correct?
> >
> >Thanks!
> >- Gaurav
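For reference, the client-only interaction discussed in this thread can be sketched against Oozie's REST (web services) API, which is what the Oozie client itself speaks over HTTP. The server URL, application path, and property values below are hypothetical placeholders, not taken from the thread.

```python
# Hypothetical sketch: submit and poll an Oozie workflow from the ETL
# machine using only HTTP. Server URL, paths, and values are placeholders.
import json
import urllib.request

OOZIE = "http://oozie.example.com:11000/oozie"  # assumed Oozie server URL


def job_config_xml(app_path, user, namenode, jobtracker):
    """Build the Hadoop-configuration XML body that Oozie's REST API expects."""
    props = {
        "oozie.wf.application.path": app_path,
        "user.name": user,
        "nameNode": namenode,
        "jobTracker": jobtracker,
    }
    body = "".join(
        "<property><name>%s</name><value>%s</value></property>" % (k, v)
        for k, v in props.items()
    )
    return "<configuration>%s</configuration>" % body


def submit_and_start(config_xml):
    """POST the config to /v1/jobs?action=start; returns the new job id."""
    req = urllib.request.Request(
        OOZIE + "/v1/jobs?action=start",
        data=config_xml.encode("utf-8"),
        headers={"Content-Type": "application/xml;charset=UTF-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]


def job_status(job_id):
    """GET job info; status is e.g. RUNNING, SUCCEEDED, or KILLED."""
    url = OOZIE + "/v1/job/" + job_id + "?show=info"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["status"]
```

The ETL process would call `submit_and_start`, then poll `job_status` until the workflow reports SUCCEEDED or a failure state (step 2d in the example above), with nothing from the Hadoop stack installed locally.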
