Read the docs in the hiho link.. apparently it has a bulk export facility for Oracle too (which I presume is your DB from your other question)
On Wed, Jan 11, 2012 at 11:33 AM, IGZ Nick <[email protected]> wrote:
> Depends on your RDBMS. You can use sqoop to load directly from HDFS to the
> DB, or you can get the file to a local disk and use some bulk loading tool
> that comes with your db, as Dmitriy has mentioned above. There are two
> things to consider:
>
> 1) sqoop uses INSERT/UPDATE statements, which is slower than native bulk
> load tools (but for MySQL, sqoop supports direct load. Check out
> http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_literal_sqoop_export_literal).
> You can also look at hiho, which is a similar tool
> (https://github.com/sonalgoyal/hiho).
>
> 2) How much parallelism can your DB take? In sqoop you need to set this
> parameter, and you can try to optimize it; it might be faster or slower
> than the local disk + bulk load approach.
>
>
> On Wed, Jan 11, 2012 at 5:06 AM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> 1) Run pig job
>> 2) Store results into HDFS using a known format
>> 3) Load into database using Apache Sqoop, or if the summarized data
>> set is small, simply copy to local disk and use your standard bulk
>> data insert tools.
>>
>> There is an option 4 -- store directly into your DW from Pig -- but
>> not all database vendors provide a Pig storage function (Vertica
>> does), and in many cases those store funcs aren't atomic, so it's hard
>> to recover from failures. I prefer having a copy of the summarized
>> data in HDFS, and loading into the DW as a separate step.
>>
>> D
>>
>> On Tue, Jan 10, 2012 at 1:27 PM, Guy Bayes <[email protected]> wrote:
>> > you might want to take a look at zookeeper as a coordination mechanism
>> > for when to process what file
>> >
>> > On Tue, Jan 10, 2012 at 12:42 PM, rakesh sharma <[email protected]> wrote:
>> >
>> >> Hi All,
>> >> I am quite new to the hadoop world and am trying to work on a project
>> >> using hadoop and pig. The data is continuously being written to hadoop
>> >> by many producers. All producers concurrently write data to the same
>> >> file for a 30-minute duration. After 30 minutes, a new file is created
>> >> and they start writing to it. I need to run pig jobs to analyze the
>> >> data from hadoop incrementally and push the resulting data into an
>> >> RDBMS. I am wondering what will be the right way to implement it.
>> >> Thanks,
>> >> RS
>> >
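
For the sqoop path IGZ and Dmitriy describe above, a rough sketch of what steps 1-3 could look like against Oracle -- the Pig script name, connect string, table and HDFS paths below are made up, so adjust them for your setup and check the sqoop export docs for your version:

  # steps 1-2: run the pig job that stores the summarized, tab-delimited results on HDFS
  pig -f summarize.pig   # assumed script; writes to /user/hadoop/summary_out

  # step 3: export that directory into the DB; --num-mappers is the parallelism
  # knob IGZ mentioned, so tune it to what your DB can take
  sqoop export \
    --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
    --username dbuser -P \
    --table DAILY_SUMMARY \
    --export-dir /user/hadoop/summary_out \
    --input-fields-terminated-by '\t' \
    --num-mappers 4

(For MySQL you could add --direct so sqoop uses mysqlimport rather than INSERT statements. If you'd rather bulk load, replace the sqoop step with something like hadoop fs -getmerge /user/hadoop/summary_out summary.tsv and feed the file to your DB's loader, e.g. sqlldr for Oracle.)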
