Depends on your RDBMS. You can use sqoop to load directly from HDFS to the DB, or you can get the file onto a local disk and use some bulk loading tool that comes with your DB, as Dmitriy has mentioned. There are two things to consider:

1) sqoop uses INSERT/UPDATE statements, which is slower than native bulk load tools (but for MySQL, sqoop supports direct load; check out http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_literal_sqoop_export_literal). You can also look at hiho, which is a similar tool (https://github.com/sonalgoyal/hiho).

2) How much parallelism can your DB take? In sqoop you need to set this parameter, and you can try to optimize it; it might end up faster or slower than the local disk + bulk load approach (see the sketch below).
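For what it's worth, a sqoop export for the MySQL case might look roughly like this (host, database, table, user and HDFS path are made-up placeholders, adjust for your setup). --num-mappers is the parallelism knob from point 2, and --direct is the MySQL fast path from point 1:

  sqoop export \
    --connect jdbc:mysql://dwhost/analytics \
    --username etl -P \
    --table daily_summary \
    --export-dir /output/summary \
    --input-fields-terminated-by '\t' \
    --num-mappers 4 \
    --direct

I'd start with a small --num-mappers and only raise it if the database keeps up; too many concurrent writers can easily make the export slower than the bulk-load route.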
On Wed, Jan 11, 2012 at 5:06 AM, Dmitriy Ryaboy <[email protected]> wrote:
> 1) Run pig job
> 2) Store results into HDFS using a known format
> 3) Load into database using Apache Sqoop, or if the summarized data
> set is small, simply copy to local disk and use your standard bulk
> data insert tools.
>
> There is an option 4 -- store directly into your DW from Pig -- but
> not all database vendors provide a Pig storage function (Vertica
> does), and in many cases those store funcs aren't atomic, so it's hard
> to recover from failures. I prefer having a copy of the summarized
> data in HDFS, and load into the DW as a separate step.
>
> D
>
> On Tue, Jan 10, 2012 at 1:27 PM, Guy Bayes <[email protected]> wrote:
> > you might want to take a look at zookeeper as a coordination mechanism
> > for when to process what file
> >
> > On Tue, Jan 10, 2012 at 12:42 PM, rakesh sharma <[email protected]> wrote:
> >>
> >> Hi All,
> >> I am quite new to the hadoop world and trying to work on a project using
> >> hadoop and pig. The data is continuously being written into hadoop by many
> >> producers. All producers concurrently write data to the same file for a
> >> 30-minute duration. After 30 minutes, a new file is created and they start
> >> writing to it. I need to run pig jobs to analyze the data from hadoop
> >> incrementally and push the resulting data into an RDBMS. I am wondering
> >> what will be the right way to implement it.
> >> Thanks, RS
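For the "copy to local disk and bulk load" route Dmitriy mentions (option 3), a rough sketch could look like this for MySQL; the script name, paths, table and credentials are hypothetical, and the Pig script itself is whatever produces your summary:

  pig -f summarize.pig                              # step 1: run the Pig job, its STORE writes /output/summary
  hadoop fs -getmerge /output/summary summary.tsv   # small result set: pull it down to local disk
  mysql --local-infile=1 -u etl -p dw \
    -e "LOAD DATA LOCAL INFILE 'summary.tsv' INTO TABLE daily_summary FIELDS TERMINATED BY '\t'"

Either way the summarized copy stays in HDFS, so a failed load can simply be re-run as a separate step.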
