1) Run the Pig job.
2) Store the results into HDFS using a known format.
3) Load into the database using Apache Sqoop, or, if the summarized data set is small, simply copy it to local disk and use your standard bulk-insert tools.
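
As a rough sketch of steps 2 and 3 (the relation name, paths, JDBC URL, and table name below are made up, so adjust them to your setup), the Pig side is just a STORE into a dated directory, and Sqoop then exports that same directory:

    -- in the Pig script: write the summarized relation as tab-delimited text
    STORE summary INTO '/data/summary/2012-01-10' USING PigStorage('\t');

    # then export that directory into the warehouse with Sqoop
    sqoop export \
      --connect jdbc:mysql://dbhost/warehouse \
      --username etl -P \
      --table daily_summary \
      --export-dir /data/summary/2012-01-10 \
      --input-fields-terminated-by '\t'

Because the summarized copy stays in HDFS, a failed export can simply be re-run against the same directory.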
There is an option 4 -- store directly into your DW from Pig -- but not all database vendors provide a Pig storage function (Vertica does), and in many cases those store funcs aren't atomic, so it's hard to recover from failures. I prefer having a copy of the summarized data in HDFS and loading it into the DW as a separate step.

D

On Tue, Jan 10, 2012 at 1:27 PM, Guy Bayes <[email protected]> wrote:
> you might want to take a look at zookeeper as a coordination mechanism
> for when to process what file
>
> On Tue, Jan 10, 2012 at 12:42 PM, rakesh sharma <[email protected]> wrote:
>
>> Hi All,
>> I am quite new to the hadoop world and am trying to work on a project
>> using hadoop and pig. The data is continuously being written into hadoop
>> by many producers. All producers concurrently write data to the same file
>> for a 30-minute duration. After 30 minutes, a new file is created and they
>> start writing to it. I need to run pig jobs to analyze the data from
>> hadoop incrementally and push the resulting data into an RDBMS. I am
>> wondering what would be the right way to implement this.
>>
>> Thanks,
>> RS
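
For the ZooKeeper coordination Guy suggests above, here is a very rough sketch of one way to signal which 30-minute file is ready to process (the znode paths, file names, and server address are made up, and this assumes the stock zkCli.sh client with the /closed-files parent znode already created):

    # when a 30-minute file is rolled, the producer side marks it closed
    zkCli.sh -server zk1:2181 create /closed-files/events-2012-01-10-1300 done

    # the incremental job lists closed files, runs Pig on each, then clears the marker
    zkCli.sh -server zk1:2181 ls /closed-files
    pig -param INPUT=/data/events-2012-01-10-1300 summarize.pig
    zkCli.sh -server zk1:2181 delete /closed-files/events-2012-01-10-1300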
