Thanks Jarcec, it really works. I tested it on some tables using the second technique. However, it requires the staging (temporary) tables to be in the same database, which becomes difficult if you are exporting 100 tables with Sqoop: you also need 100 staging tables.
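
For context, a minimal sketch of the kind of per-table export this involves (the JDBC URL, credentials, table names, and HDFS paths below are placeholders, not my real ones):

    # Each exported table needs its own staging table, pre-created in the
    # same target database with the same schema as the destination table.
    sqoop export \
      --connect jdbc:mysql://dbhost/mydb \
      --username dbuser -P \
      --table orders \
      --staging-table orders_stage \
      --clear-staging-table \
      --export-dir /user/adarsh/export/orders

Looping this over 100 table names still means creating and maintaining 100 "<table>_stage" tables in the target database, which is exactly the overhead I was referring to.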
Now for the import process, I think only the first technique can save us from half-written or duplicate records in HDFS. We need to check whether the file already exists before running the import job, and follow some standards for file creation in HDFS (a minimal sketch of such a guard is at the end of this mail). Thanks once again.

On Sun, Sep 9, 2012 at 1:02 PM, Jarek Jarcec Cecho <[email protected]> wrote:

> Hi Adarsh,
> you can achieve similar functionality with Sqoop in several ways, depending
> on the connector that you use:
>
> 1) You can always manually (or by script) remove previously imported data
> if you know how to identify it easily prior to executing Sqoop. E.g. you
> might create a script that removes previously imported data (if present)
> and then executes Sqoop.
>
> 2) You can benefit from a staging table using the parameters --staging-table
> and --clear-staging-table. This way, Sqoop will first import your data in
> parallel into the staging table and promote it to the destination table only
> if all parallel execution threads succeed. Please note that the staging
> option is not available in all connectors (typically the direct connectors
> do not support it).
>
> 3) Lastly, you might use the "upsert" functionality. Some connectors (MySQL,
> Oracle) support --update-mode allowinsert, which will either insert a new
> row or update the existing one if it is already present in the table.
> Please note that this solution has the worst performance of the three.
>
> Jarcec
>
> On Sun, Sep 09, 2012 at 12:42:45PM +0530, Adarsh Sharma wrote:
> > Hi,
> >
> > I have been using Sqoop 1.4.2 for the past few days in a Hadoop cluster
> > of 10 nodes.
> > As per the Sqoop documentation (9.4, Export & Transactions), the export
> > operation is not atomic in the database because it creates separate
> > transactions to insert records.
> >
> > For example, if a map task failed to export its transaction while others
> > succeeded, it would lead to partial and incomplete results in the
> > database tables.
> >
> > I created a bash script to load data from a CSV (daily CSVs) of 500
> > thousand records into the database, in which I delete that day's records
> > before loading the CSV, so that if there is an issue while loading a
> > day's CSV, we get correct results by running the job again.
> >
> > Can we achieve the same functionality in Sqoop, so that if some map tasks
> > of a Sqoop job fail, we still get correct and complete (no duplicate)
> > records in the database?
> >
> >
> > Thanks
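
P.S. The pre-import guard I have in mind looks roughly like the sketch below; the paths, table name, and connection string are placeholders, and the HDFS commands may need adjusting for your Hadoop version:

    # Never start an import if a previous (possibly partial) copy of the
    # data is still sitting in the target directory.
    TARGET=/user/adarsh/import/orders/$(date +%Y-%m-%d)

    if hadoop fs -test -e "$TARGET"; then
      # A previous attempt left data behind (complete or partial); remove it
      # so the re-run starts from a clean directory.
      hadoop fs -rmr "$TARGET"
    fi

    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username dbuser -P \
      --table orders \
      --target-dir "$TARGET"

Combined with a consistent, date-based directory naming convention, this keeps re-runs idempotent on the HDFS side, similar to the delete-then-load approach in my bash script.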
