Yogesh, please try to put $CONDITIONS after your WHERE clause.
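For reference, with the table and options from your command, that would look roughly like the sketch below. This is only a sketch: the -libjars jar path and the --connect URL (${SYBASE_URL}) are placeholders I am assuming, since they were not in your mail.

sqoop import -libjars /path/to/sybase-jdbc.jar \
  --driver com.sybase.jdbc3.jdbc.SybDriver \
  --connect "${SYBASE_URL}" \
  --username "${SYBASE_USERNAME}" \
  --password "${SYBASE_PASSWORD}" \
  --query "select * from EMP where SAL > 201401200 and SAL <= 201401204 and \$CONDITIONS" \
  --check-column Unique_value \
  --incremental append \
  --last-value 201401200 \
  --split-by DEPT \
  --fields-terminated-by ',' \
  --target-dir "${TARGET_DIR}/${INC}"

Sqoop replaces \$CONDITIONS with a different range predicate on the --split-by column (DEPT) for each mapper, so each of the parallel mappers pulls its own slice of the query result.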
Check out the examples in the blog below:
http://jugnu-life.blogspot.com/2012/03/sqoop-free-form-query-example.html?m=1

On Jan 13, 2014 7:04 AM, "yogesh kumar" <[email protected]> wrote:

> Hello Jarcec,
>
> I found the issue; I hope this is the cause. I am getting data loss when
> doing the incremental pull.
>
> I have cross-checked it and found the following. I run:
>
> sqoop import -libjars \
>   --driver com.sybase.jdbc3.jdbc.SybDriver \
>   --query "select * from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204" \
>   --check-column Unique_value \
>   --incremental append \
>   --last-value 201401200 \
>   --split-by DEPT \
>   --fields-terminated-by ',' \
>   --target-dir ${TARGET_DIR}/${INC} \
>   --username ${SYBASE_USERNAME} \
>   --password ${SYBASE_PASSWORD}
>
> Now I have imported the newly inserted data from the RDBMS into HDFS, but
> when I run
>
> select count(*), Unique_value from EMP group by Unique_value
>
> (both in the RDBMS and in Hive), I can see a huge data loss.
>
> 1) In the RDBMS:
>
> Count(*)   Unique_value
> 1000       201401201
> 5000       201401202
> 10000      201401203
>
> 2) In Hive:
>
> Count(*)   Unique_value
> 189        201401201
> 421        201401202
> 50         201401203
>
> If I do
>
> select Unique_value from EMP;
>
> the result is:
>
> 201401201
> 201401201
> 201401201
> 201401201
> 201401201
> .
> .
> 201401202
> .
> .
> and so on...
>
> Please help and suggest why this is so.
>
> Many thanks in advance,
> Yogesh Kumar
>
> On Sun, Jan 12, 2014 at 11:08 PM, Jarek Jarcec Cecho <[email protected]> wrote:
>
>> Hi Yogesh,
>> I would start by verifying the imported data. If there are duplicates,
>> then that suggests some misconfiguration of Sqoop; otherwise you might
>> have an inconsistency somewhere down the pipeline.
>>
>> Jarcec
>>
>> On Sat, Jan 11, 2014 at 11:01:22PM +0530, yogesh kumar wrote:
>> > Hello All,
>> >
>> > I am working on a use case where I have to run a process on a daily
>> > basis that does the following:
>> >
>> > 1) Pull each day's newly inserted data from the RDBMS tables to HDFS.
>> > 2) Keep an external table in Hive pointing to the HDFS directory where
>> >    the data is pulled by Sqoop.
>> > 3) Perform some Hive queries (joins) and create a final internal table
>> >    in Hive (say, Hive_Table_Final).
>> >
>> > What I am doing:
>> >
>> > I am migrating a process from the RDBMS to Hadoop (the same process is
>> > currently executed as an RDBMS procedure and stored in a final table,
>> > say Rdbms_Table_Final).
>> >
>> > The issue I am facing:
>> >
>> > Every time I do an incremental import, after processing I find the
>> > values in the final Hive table multiplied by the number of incremental
>> > imports I have done. If I perform an incremental import once a day for
>> > 4 days, the data in the final Hive table (Hive_Table_Final) ends up
>> > multiplied by 4 with respect to the final table in the RDBMS
>> > (Rdbms_Table_Final).
>> >
>> > Like this:
>> >
>> > 1) The first time, I pulled the data from the RDBMS based on the months
>> > (like from 2013-12-01 to 2013-01-01) and processed it, and got perfect
>> > results: the data in the final Hive table (Hive_Table_Final) matched
>> > the processed data in the RDBMS (Rdbms_Table_Final).
>> >
>> > 2) I then did an incremental import to bring the new data from the
>> > RDBMS to HDFS using this command:
>> >
>> > sqoop import -libjars \
>> >   --driver com.sybase.jdbc3.jdbc.SybDriver \
>> >   --query "select * from EMP where \$CONDITIONS and SAL > 50000 and SAL <= 80000" \
>> >   --check-column Unique_value \
>> >   --incremental append \
>> >   --last-value 201401200 \
>> >   --split-by DEPT \
>> >   --fields-terminated-by ',' \
>> >   --target-dir ${TARGET_DIR}/${INC} \
>> >   --username ${SYBASE_USERNAME} \
>> >   --password ${SYBASE_PASSWORD}
>> >
>> > Note -- the field Unique_value is unique every time; it is like a
>> > primary key.
>> >
>> > Now that I have pulled just the new records from the RDBMS tables into
>> > HDFS, I get a major data-mismatch issue after the processing
>> > (Hive_Table_Final).
>> >
>> > My major issue is with Sqoop incremental import: however many times I
>> > do an incremental import, I find the data in my final table multiplied
>> > by the number of incremental imports I have done.
>> >
>> > Please suggest what I am doing wrong and what I am missing.
>> > Please help me out.
>> >
>> > Thanks & Regards,
>> > Yogesh Kumar
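Also, since you are already comparing counts per Unique_value on both sides, it may help to script that check right after each pull, before the Hive joins run. A rough sketch follows; sqoop eval runs the query on the Sybase side, and ${SYBASE_URL} and the Hive table name emp_ext are assumptions on my part, since the thread does not name the connect string or the external table:

# Source side (Sybase), via sqoop eval:
sqoop eval \
  --driver com.sybase.jdbc3.jdbc.SybDriver \
  --connect "${SYBASE_URL}" \
  --username "${SYBASE_USERNAME}" \
  --password "${SYBASE_PASSWORD}" \
  --query "select count(*), Unique_value from EMP group by Unique_value"

# HDFS side, through the external Hive table:
hive -e "select count(*), Unique_value from emp_ext group by Unique_value"

If the counts already diverge at this point, the problem is in the Sqoop import itself; if they match here and only Hive_Table_Final is multiplied, the duplication is being introduced by the join/processing step, as Jarcec suggested.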

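For completeness, the external table from step 2 of the pipeline would look roughly like this. The column names and types are only assumptions based on the columns mentioned in the thread (EMP very likely has more), and the location path is a placeholder for the Sqoop target directory:

CREATE EXTERNAL TABLE emp_ext (
  Unique_value BIGINT,
  DEPT STRING,
  SAL BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/target_dir';

The FIELDS TERMINATED BY ',' clause has to match the --fields-terminated-by ',' used in the import, otherwise Hive will not split the columns correctly.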