Hello Experts,
I need to extract data from RDBMS tables into Hive tables on a regular 
schedule (daily, weekly, etc.).
My tables have a primary key but do NOT have a 'last-modified' column.
I plan to proceed as follows:
FIRST RUN: Use the sqoop-import-all-tables command to import all the tables 
into Hive tables in one go.
EACH SUBSEQUENT RUN: Use Sqoop incremental import mode to retrieve only rows 
newer than some previously imported set of rows.
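To make the plan concrete, here is a rough Python sketch that only builds and prints the two Sqoop command lines (it does not run them). The JDBC URL, user name, and password file path are made-up placeholders, and I am assuming append mode with CustomerId as the check column for the subsequent runs.

```python
import shlex

# Hypothetical connection details; replace with real values.
JDBC_URL = "jdbc:mysql://dbhost.example.com/sales"
DB_USER = "etl_user"
PASSWORD_FILE = "/user/etl/.db_password"


def common_args():
    """Connection arguments shared by both runs."""
    return ["--connect", JDBC_URL,
            "--username", DB_USER,
            "--password-file", PASSWORD_FILE]


def first_run_command():
    """FIRST RUN: import every table into Hive in one go."""
    return ["sqoop", "import-all-tables", *common_args(), "--hive-import"]


def incremental_command(table, check_column, last_value):
    """SUBSEQUENT RUN: append-mode import of rows above last_value."""
    return ["sqoop", "import", *common_args(),
            "--table", table,
            "--hive-import",
            "--incremental", "append",
            "--check-column", check_column,
            "--last-value", str(last_value)]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    print(shlex.join(first_run_command()))
    print(shlex.join(incremental_command("CUSTOMER", "CustomerId", 100)))
```

The commands are kept as argument lists so they could later be handed to subprocess.run without shell quoting problems.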
My question is: how can I get the rows that were updated between the FIRST 
RUN and the NEXT SUBSEQUENT RUN?
For example, say that in the FIRST RUN I fetched all tables. A CUSTOMER table 
with 100 records, with CustomerId as the primary key (1 to 100), is imported 
into the Hive CUSTOMER table.
Meanwhile, some rows in the source CUSTOMER table get updated. My NEXT 
SUBSEQUENT RUN will fetch only rows with CustomerId > 100, thus skipping the 
updated rows, whose CustomerId is <= 100.
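The gap can be reproduced with a small Python simulation. This is not Sqoop itself, just a sketch of the filter that append mode effectively applies (check column strictly greater than the last value). The City column, the updated row (CustomerId 42), and the five new rows are invented for illustration.

```python
def append_mode_import(source_rows, check_column, last_value):
    """Mimic append mode: keep rows whose check column exceeds last_value."""
    return [row for row in source_rows if row[check_column] > last_value]


# Source table at the time of the second run: the original 100 customers
# plus 5 new ones (CustomerId 101 to 105). All values are made up.
source = [{"CustomerId": i, "City": "Pune"} for i in range(1, 106)]

# CustomerId 42 was updated in the source after the first run.
source[41]["City"] = "Mumbai"

fetched = append_mode_import(source, "CustomerId", last_value=100)
fetched_ids = [row["CustomerId"] for row in fetched]

print(fetched_ids)        # only the new rows
print(42 in fetched_ids)  # the updated row is not picked up
```

Only the five new rows come back; the change to CustomerId 42 never reaches Hive.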
How can I get the updated rows on each subsequent run?
According to the Sqoop documentation, the following strategy works only if the 
table has a last-modified column (which mine does not):
"An alternate table update strategy supported by Sqoop is called lastmodified 
mode. You should use this when rows of the source table may be updated, and 
each such update will set the value of a last-modified column to the current 
timestamp. Rows where the check column holds a timestamp more recent than the 
timestamp specified with --last-value are imported."
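For comparison, here is the same kind of Python simulation for the lastmodified rule quoted above. It assumes a hypothetical last_modified column (which my tables lack), and the timestamps are invented purely for illustration.

```python
from datetime import datetime


def lastmodified_import(source_rows, check_column, last_value):
    """Mimic lastmodified mode: keep rows stamped more recently than last_value."""
    return [row for row in source_rows if row[check_column] > last_value]


# Hypothetical timestamp passed as --last-value on the second run.
previous_import = datetime(2024, 1, 1)

rows = [
    # Existing row updated after the previous import.
    {"CustomerId": 42, "last_modified": datetime(2024, 1, 3)},
    # Existing row that has not changed since before the previous import.
    {"CustomerId": 43, "last_modified": datetime(2023, 12, 20)},
    # Brand-new row inserted after the previous import.
    {"CustomerId": 101, "last_modified": datetime(2024, 1, 2)},
]

picked = lastmodified_import(rows, "last_modified", previous_import)
picked_ids = [row["CustomerId"] for row in picked]
print(picked_ids)
```

With such a column, both the updated row and the new row are selected, which is exactly what the key-based filter cannot do.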
Any suggestions on how to get the updated data on SUBSEQUENT RUNS using the 
Sqoop Incremental mode?
Thanks,
-RR
