Read the docs in the hiho link.. apparently it has a bulk export facility for Oracle too (which I presume is your DB from your other question)
On Wed, Jan 11, 2012 at 11:33 AM, IGZ Nick <[email protected]> wrote:
> Depends on your RDBMS. You can use sqoop to load directly from HDFS to the
> DB, or you can get the file to a local disk and use some bulk loading tool
> that comes with your db, as Dmitriy has mentioned above. There are two
> things to consider:
>
> 1) sqoop uses INSERT/UPDATE statements, which is slower than native bulk
> load tools (but for MySQL, sqoop supports direct load. Check out
> http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_literal_sqoop_export_literal).
> You can also look at hiho, which is a similar tool
> (https://github.com/sonalgoyal/hiho).
>
> 2) How much parallelism can your DB take? In sqoop you need to set this
> parameter, and you can try to optimize it; it might be faster or slower
> than the local disk + bulk load approach.
>
>
> On Wed, Jan 11, 2012 at 5:06 AM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> 1) Run pig job
>> 2) Store results into HDFS using a known format
>> 3) Load into database using Apache Sqoop, or if the summarized data
>> set is small, simply copy to local disk and use your standard bulk
>> data insert tools.
>>
>> There is an option 4 -- store directly into your DW from Pig -- but
>> not all database vendors provide a Pig storage function (Vertica
>> does), and in many cases those store funcs aren't atomic, so it's hard
>> to recover from failures. I prefer having a copy of the summarized
>> data in HDFS, and loading into the DW as a separate step.
>>
>> D
>>
>> On Tue, Jan 10, 2012 at 1:27 PM, Guy Bayes <[email protected]> wrote:
>> > you might want to take a look at zookeeper as a coordination mechanism
>> > for when to process what file
>> >
>> > On Tue, Jan 10, 2012 at 12:42 PM, rakesh sharma <[email protected]> wrote:
>> >
>> >> Hi All,
>> >> I am quite new to the hadoop world and am trying to work on a project
>> >> using hadoop and pig. The data is continuously being written to hadoop
>> >> by many producers. All producers concurrently write data to the same
>> >> file for a 30-minute duration. After 30 minutes, a new file is created
>> >> and they start writing to it. I need to run pig jobs to analyze the
>> >> data from hadoop incrementally and push the resulting data into an
>> >> RDBMS. I am wondering what will be the right way to implement it.
>> >> Thanks,
>> >> RS
>> >
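
For the sqoop path IGZ and Dmitriy describe above, a rough sketch of what steps 1-3 could look like against Oracle -- the Pig script name, connect string, table and HDFS paths below are made up, so adjust them for your setup and check the sqoop export docs for your version:

  # steps 1-2: run the pig job that stores the summarized, tab-delimited results on HDFS
  pig -f summarize.pig   # assumed script; writes to /user/hadoop/summary_out

  # step 3: export that directory into the DB; --num-mappers is the parallelism
  # knob IGZ mentioned, so tune it to what your DB can take
  sqoop export \
    --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
    --username dbuser -P \
    --table DAILY_SUMMARY \
    --export-dir /user/hadoop/summary_out \
    --input-fields-terminated-by '\t' \
    --num-mappers 4

(For MySQL you could add --direct so sqoop uses mysqlimport rather than INSERT statements. If you'd rather bulk load, replace the sqoop step with something like hadoop fs -getmerge /user/hadoop/summary_out summary.tsv and feed the file to your DB's loader, e.g. sqlldr for Oracle.)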
