Thanks Jarek.

Thanks Jarek.



Here is the use case.



1.      My Hive table contains detailed transaction level data and
continuously getting updated throughout the day (say every 15 minutes)

2.      I have to send summary data from Hive/HDFS to other systems like
EDW say twice a day.



This need to be automated and scheduled in production. I need to implement
incremental logic so that I can export only the changes every time. I was
reading about incremental options in Sqoop Import. It has kind of features
I am looking for; but I need them on Sqoop Export. Since export does not
provide that feature, I may have to track it myself.  Some how I need to
keep track of when was the last time export ran successfully and what data
has been added to Hive since then. Then I can do something like:



1.      Execute Hive Query to extract the data I need to send (summary and
only changes):

    Select fld1, fld,2, sum(fld3), …

    From tableA

    Where <HDFS_File_create_timestamp>   > <last_extract_timestamp>

    Group by fld, fld2, …



2.      Use SQOOP Export to export the result file to EDW



I am not sure where / how to get HDFS_File_create_timestamp and
last_extract_timestamp  values so it can be used dynamically inside Hive
query.



Any ideas??? Are there any other options?


Thanks for your help.



Sadu





On Thu, Oct 25, 2012 at 8:04 PM, Jarek Jarcec Cecho <[email protected]>wrote:

> Hi Sadu,
> unfortunately Sqoop export is taking entire input directory (--export-dir)
> and simply exporting it's content to the external database/warehouse
> system. I'm afraid that there isn't more sophisticated way of doing
> "incremental" exports then using different hdfs directories for each
> "incremental" part.
>
> If you could describe your use case, there might be other ways how to
> achieve similar results.
>
> Jarcec
>
> On Thu, Oct 25, 2012 at 07:38:21PM -0500, Sadananda Hegde wrote:
> > Hello,
> >
> > I am exploring sqoop to send data from hadoop to EDW. I don't want to
> send
> > the same data again and again. I need to identify the changes in HDFS and
> > send only the data that has changed since my previous export. What is the
> > best way to implement such incremental export logic?  I see that sqoop
> > import has incremental logic option; but can't see it in export.
> >
> > Any recomendations / suggestions would greatly be appreciated.
> >
> > Thanks,
> > Sadu
>

Reply via email to