1) Run the Pig job
2) Store the results into HDFS using a known format
3) Load into the database using Apache Sqoop, or, if the summarized
data set is small, simply copy it to local disk and use your standard
bulk insert tools (rough sketch below).
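
For example, steps 1 and 2 might look roughly like this (the paths,
schema, and aggregation below are made up purely for illustration):

-- step 1: load the current 30-minute file (hypothetical path and schema)
raw     = LOAD '/data/events/2012-01-10_1300' USING PigStorage('\t')
          AS (user_id:chararray, url:chararray, ts:long);
-- summarize: count hits per url
grouped = GROUP raw BY url;
summary = FOREACH grouped GENERATE group AS url, COUNT(raw) AS hits;
-- step 2: write the summary back to HDFS in a simple delimited format
STORE summary INTO '/data/summaries/2012-01-10_1300' USING PigStorage('\t');

Step 3 would then be along the lines of

  sqoop export --connect jdbc:mysql://dwhost/analytics --username etl -P \
    --table url_hits --export-dir /data/summaries/2012-01-10_1300 \
    --input-fields-terminated-by '\t'

with the connect string, table, and delimiters adjusted for your warehouse.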

There is an option 4 -- storing directly into your DW from Pig -- but
not all database vendors provide a Pig storage function (Vertica
does), and in many cases those store funcs aren't atomic, so it's hard
to recover from failures. I prefer having a copy of the summarized
data in HDFS and loading it into the DW as a separate step.
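
That said, if you do want to experiment with option 4, the generic
version of the pattern looks something like the snippet below, using
piggybank's DBStorage (vendor-specific storers such as Vertica's follow
the same STORE ... USING shape). The JDBC settings and table are made
up, the constructor arguments vary between piggybank versions, and the
non-atomicity caveat above still applies:

REGISTER piggybank.jar;
-- hypothetical JDBC driver, connection string, credentials, and table
STORE summary INTO 'url_hits' USING
  org.apache.pig.piggybank.storage.DBStorage(
    'com.mysql.jdbc.Driver', 'jdbc:mysql://dwhost/analytics',
    'etl', 'secret',
    'INSERT INTO url_hits (url, hits) VALUES (?, ?)');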

D

On Tue, Jan 10, 2012 at 1:27 PM, Guy Bayes <[email protected]> wrote:
> you might want to take a look at ZooKeeper as a coordination mechanism
> for when to process what file
>
> On Tue, Jan 10, 2012 at 12:42 PM, rakesh sharma <[email protected]>
> wrote:
>>
>> Hi All,
>> I am quite new to the Hadoop world and am trying to work on a project
>> using Hadoop and Pig. Data is continuously written into Hadoop by many
>> producers. All producers concurrently write to the same file for a
>> 30-minute window; after 30 minutes, a new file is created and they start
>> writing to it. I need to run Pig jobs to analyze the data from Hadoop
>> incrementally and push the resulting data into an RDBMS. I am wondering
>> what the right way to implement this would be.
>> Thanks,
>> RS
