Hello Abe,

Thanks a zillion for your response.

Yes, the unique_value here is SAL. I do have the option to test with a
different directory, as it holds only a small amount of data (one month's
worth). The interesting fact is that when I did a Sqoop pull on a monthly
basis (with no incremental import), the data matched very well; but for the
daily pull I have to run an incremental import every day, and that is where
I find some data loss.

One of the Sqoop folks suggested changing how \$CONDITIONS is placed within
the query, from:

    ... from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204

to:

    ... from EMP where SAL > 201401200 and SAL <= 201401204 and \$CONDITIONS

Please suggest a way forward, as I have to deliver this output to my client.

On Mon, Jan 13, 2014 at 11:13 PM, Abraham Elmahrek <[email protected]> wrote:

> Yogesh,
>
> Is unique_value in this case SAL? I'm a bit confused about your query.
>
> Do you have the option of running this query on a separate database
> somewhere to find the issue? I think it would be interesting to see the
> initial state and then the state after running an incremental import. That
> would tell us how many results are being imported after Sqoop has run, and
> we can validate each step. Also, please use the --verbose flag to get the
> most out of the logs.
>
> -Abe
>
> On Mon, Jan 13, 2014 at 5:15 AM, Sharath Punreddy <[email protected]> wrote:
>
>> Yogesh,
>>
>> Please try to put $CONDITIONS after your where clause.
>>
>> Check out the examples in the blog below:
>>
>> http://jugnu-life.blogspot.com/2012/03/sqoop-free-form-query-example.html?m=1
>>
>> On Jan 13, 2014 7:04 AM, "yogesh kumar" <[email protected]> wrote:
>>
>>> Hello Jarcec,
>>>
>>> I got to the issue; I hope this is the cause.
>>> I got data loss by doing an incremental pull.
>>>
>>> I have cross-checked it and found that with:
>>>
>>> sqoop import -libjars \
>>>   --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>   --query "select * from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204" \
>>>   --check-column Unique_value \
>>>   --incremental append \
>>>   --last-value 201401200 \
>>>   --split-by DEPT \
>>>   --fields-terminated-by ',' \
>>>   --target-dir ${TARGET_DIR}/${INC} \
>>>   --username ${SYBASE_USERNAME} \
>>>   --password ${SYBASE_PASSWORD}
>>>
>>> I have now imported the newly inserted data from the RDBMS to HDFS.
>>>
>>> But when I run
>>>
>>> select count(*), unique_value from EMP group by unique_value
>>>
>>> (both in the RDBMS and in Hive), I can see a huge data loss:
>>>
>>> 1) In the RDBMS:
>>>
>>> Count(*)   Unique_value
>>> 1000       201401201
>>> 5000       201401202
>>> 10000      201401203
>>>
>>> 2) In Hive:
>>>
>>> Count(*)   Unique_value
>>> 189        201401201
>>> 421        201401202
>>> 50         201401203
>>>
>>> If I run
>>>
>>> select Unique_value from EMP;
>>>
>>> the result is:
>>>
>>> 201401201
>>> 201401201
>>> 201401201
>>> 201401201
>>> 201401201
>>> .
>>> .
>>> 201401202
>>> .
>>> .
>>> and so on...
>>>
>>> Please help and suggest why this is so.
>>>
>>> Many thanks in advance,
>>> Yogesh Kumar
>>>
>>> On Sun, Jan 12, 2014 at 11:08 PM, Jarek Jarcec Cecho <[email protected]> wrote:
>>>
>>>> Hi Yogesh,
>>>> I would start by verifying the imported data. If there are duplicates,
>>>> that suggests some misconfiguration of Sqoop; otherwise you might have
>>>> some inconsistencies further down the pipeline.
>>>>
>>>> Jarcec
>>>>
>>>> On Sat, Jan 11, 2014 at 11:01:22PM +0530, yogesh kumar wrote:
>>>> > Hello All,
>>>> >
>>>> > I am working on a use case where I have to run a process on a daily
>>>> > basis which will do these:
>>>> >
>>>> > 1) Pull every day's newly inserted data from the RDBMS tables to HDFS.
>>>> > 2) Have an external table in Hive pointing to the HDFS directory
>>>> > where the data is pulled by Sqoop.
>>>> > 3) Perform some Hive queries (joins) and create a final internal
>>>> > table in Hive (say, Hive_Table_Final).
>>>> >
>>>> > What I am doing:
>>>> >
>>>> > I am migrating a process from the RDBMS to Hadoop (the same process
>>>> > is being executed as an RDBMS procedure and stored in a final table,
>>>> > say Rdbms_Table_Final).
>>>> >
>>>> > The issue I am facing:
>>>> >
>>>> > Every time I do an incremental import, after processing I find the
>>>> > values in the final Hive table (Hive_Table_Final) multiplied by the
>>>> > number of incremental imports I have done. If I perform an
>>>> > incremental import once a day for 4 days, the data in the final Hive
>>>> > table is multiplied by 4 with respect to the final table in the
>>>> > RDBMS (Rdbms_Table_Final).
>>>> >
>>>> > Like:
>>>> >
>>>> > 1) The first time, I pulled the data from the RDBMS by month (from
>>>> > 2013-12-01 to 2014-01-01) and processed it, and got perfect results:
>>>> > the data in the final Hive table (Hive_Table_Final) matched the
>>>> > processed data in the RDBMS (Rdbms_Table_Final).
>>>> >
>>>> > 2) I did an incremental import to bring new data from the RDBMS to
>>>> > HDFS using this command:
>>>> >
>>>> > sqoop import -libjars \
>>>> >   --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>> >   --query "select * from EMP where \$CONDITIONS and SAL > 50000 and SAL <= 80000" \
>>>> >   --check-column Unique_value \
>>>> >   --incremental append \
>>>> >   --last-value 201401200 \
>>>> >   --split-by DEPT \
>>>> >   --fields-terminated-by ',' \
>>>> >   --target-dir ${TARGET_DIR}/${INC} \
>>>> >   --username ${SYBASE_USERNAME} \
>>>> >   --password ${SYBASE_PASSWORD}
>>>> >
>>>> > "Note -- The field Unique_value is unique every time; it is like a
>>>> > primary key."
>>>> >
>>>> > So now I have pulled into HDFS just the new records that were
>>>> > inserted into the RDBMS tables.
>>>> >
>>>> > Now I have a major data-mismatch issue after the processing
>>>> > (Hive_Table_Final).
>>>> >
>>>> > My major issue is with Sqoop incremental import: however many times
>>>> > I run the incremental import, I find the data in my final table
>>>> > multiplied by that number of runs.
>>>> >
>>>> > Please suggest what I am doing wrong and what I am missing.
>>>> > Please help me out.
>>>> >
>>>> > Thanks & Regards
>>>> > Yogesh Kumar
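
[Editor's note] Putting the advice in the thread together, a minimal sketch of the free-form incremental import with \$CONDITIONS moved to the end of the WHERE clause (Sharath's suggestion) and --verbose enabled (Abe's suggestion). The jar path is a hypothetical placeholder, and switching --split-by from DEPT to the unique check column is my assumption, not something the thread confirms fixes the loss:

```shell
# Sketch only: /path/to/sybase-jdbc.jar is a placeholder, and
# --split-by Unique_value is an assumption (the thread used DEPT).
sqoop import -libjars /path/to/sybase-jdbc.jar \
  --driver com.sybase.jdbc3.jdbc.SybDriver \
  --query "select * from EMP where SAL > 201401200 and SAL <= 201401204 and \$CONDITIONS" \
  --check-column Unique_value \
  --incremental append \
  --last-value 201401200 \
  --split-by Unique_value \
  --fields-terminated-by ',' \
  --target-dir ${TARGET_DIR}/${INC} \
  --username ${SYBASE_USERNAME} \
  --password ${SYBASE_PASSWORD} \
  --verbose
```

To follow Jarcec's verification step, a duplicate check on the Hive side could be as simple as `select unique_value, count(*) from EMP group by unique_value having count(*) > 1;`, compared against the same query on the RDBMS.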
