Hi, I am new to Oozie and still exploring its capabilities. I have done some basic testing and it's working out fine with simple workflows.
Now I want to make sure we are on the right path before adopting Oozie in our environment; any feedback would be a great help.

Our current ETL environment runs on Linux using Perl, an RDBMS, etc. (a classic ETL setup). We also have a 12-node cluster dedicated to Hadoop (running CDH4). So far the two environments are independent of each other.

My understanding is that, as long as we can install the Oozie client on the ETL machine and point it at the cluster (with the correct Oozie server URL, NameNode, JobTracker, and related properties), we should be able to build a process flow that executes Hadoop jobs from the ETL machine when required. A very simple example:

1. ETL: extract, transform, and load a file into an RDBMS table.
2. From the same machine, execute an Oozie workflow that does the following on the Hadoop cluster:
   a. Use Sqoop to load this table (and some other tables) into Hive (or HBase).
   b. Run a Pig script on this data to create, say, summary data.
   c. Store the summary into, say, an HBase table.
   d. Return success or failure to the calling process.
3. Complete the ETL process.

Our goal is to isolate the cluster from the ETL development by using Oozie, such that the only tool the ETL machine needs is the Oozie client and nothing else from the Hadoop cluster (no Hadoop/Pig/Hive binaries, libraries, or configuration files). My understanding is that Oozie can provide such isolation while still giving us a way to interact with the cluster.

Is this understanding correct?

Thanks!
- Gaurav
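For what it's worth, the Hadoop-side part of the flow I described (Sqoop import, then a Pig summary) could be sketched as an Oozie workflow.xml roughly like the following. This is only a sketch: the app name, table name, script name, and the `${jobTracker}`/`${nameNode}`/`${jdbcUrl}` properties are placeholders I made up, and I'm assuming the "store summary into HBase" step would happen inside the Pig script itself (e.g. via HBaseStorage):

```xml
<workflow-app name="etl-summary" xmlns="uri:oozie:workflow:0.4">
    <start to="sqoop-import"/>

    <!-- Step (a): Sqoop the RDBMS staging table into Hive -->
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect ${jdbcUrl} --table STAGING_TABLE --hive-import</command>
        </sqoop>
        <ok to="pig-summary"/>
        <error to="fail"/>
    </action>

    <!-- Steps (b)/(c): Pig builds the summary (and could write to HBase) -->
    <action name="pig-summary">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>summary.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <!-- Step (d): a kill node surfaces failure to the caller -->
    <kill name="fail">
        <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

If my isolation assumption holds, the ETL machine would then only need the Oozie CLI (or even just curl against the Oozie REST API) plus a small job.properties file, submitting with something like `oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run`, where `oozie-host` is a placeholder for our Oozie server.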
