Yogesh, please try to put $CONDITIONS after your WHERE clause.
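For reference, with the table and options from your command, that would look roughly like the sketch below. This is only a sketch: the -libjars jar path and the --connect URL (${SYBASE_URL}) are placeholders I am assuming, since they were not in your mail.

sqoop import -libjars /path/to/sybase-jdbc.jar \
  --driver com.sybase.jdbc3.jdbc.SybDriver \
  --connect "${SYBASE_URL}" \
  --username "${SYBASE_USERNAME}" \
  --password "${SYBASE_PASSWORD}" \
  --query "select * from EMP where SAL > 201401200 and SAL <= 201401204 and \$CONDITIONS" \
  --check-column Unique_value \
  --incremental append \
  --last-value 201401200 \
  --split-by DEPT \
  --fields-terminated-by ',' \
  --target-dir "${TARGET_DIR}/${INC}"

Sqoop replaces \$CONDITIONS with a different range predicate on the --split-by column (DEPT) for each mapper, so each of the parallel mappers pulls its own slice of the query result.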
Check out the examples in the blog below:
http://jugnu-life.blogspot.com/2012/03/sqoop-free-form-query-example.html?m=1

On Jan 13, 2014 7:04 AM, "yogesh kumar" <[email protected]> wrote:

> Hello Jarcec,
>
> I found the issue; I hope this is the cause. I am getting data loss when
> doing the incremental pull.
>
> I have cross-checked it and found the following. I run:
>
> sqoop import -libjars \
>   --driver com.sybase.jdbc3.jdbc.SybDriver \
>   --query "select * from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204" \
>   --check-column Unique_value \
>   --incremental append \
>   --last-value 201401200 \
>   --split-by DEPT \
>   --fields-terminated-by ',' \
>   --target-dir ${TARGET_DIR}/${INC} \
>   --username ${SYBASE_USERNAME} \
>   --password ${SYBASE_PASSWORD}
>
> Now I have imported the newly inserted data from the RDBMS into HDFS, but
> when I run
>
> select count(*), Unique_value from EMP group by Unique_value
>
> (both in the RDBMS and in Hive), I can see a huge data loss.
>
> 1) In the RDBMS:
>
> Count(*)   Unique_value
> 1000       201401201
> 5000       201401202
> 10000      201401203
>
> 2) In Hive:
>
> Count(*)   Unique_value
> 189        201401201
> 421        201401202
> 50         201401203
>
> If I do
>
> select Unique_value from EMP;
>
> the result is:
>
> 201401201
> 201401201
> 201401201
> 201401201
> 201401201
> .
> .
> 201401202
> .
> .
> and so on...
>
> Please help and suggest why this is so.
>
> Many thanks in advance,
> Yogesh Kumar
>
> On Sun, Jan 12, 2014 at 11:08 PM, Jarek Jarcec Cecho <[email protected]> wrote:
>
>> Hi Yogesh,
>> I would start by verifying the imported data. If there are duplicates,
>> then that suggests some misconfiguration of Sqoop; otherwise you might
>> have an inconsistency somewhere down the pipeline.
>>
>> Jarcec
>>
>> On Sat, Jan 11, 2014 at 11:01:22PM +0530, yogesh kumar wrote:
>> > Hello All,
>> >
>> > I am working on a use case where I have to run a process on a daily
>> > basis that does the following:
>> >
>> > 1) Pull each day's newly inserted data from the RDBMS tables to HDFS.
>> > 2) Keep an external table in Hive pointing to the HDFS directory where
>> >    the data is pulled by Sqoop.
>> > 3) Perform some Hive queries (joins) and create a final internal table
>> >    in Hive (say, Hive_Table_Final).
>> >
>> > What I am doing:
>> >
>> > I am migrating a process from the RDBMS to Hadoop (the same process is
>> > currently executed as an RDBMS procedure and stored in a final table,
>> > say Rdbms_Table_Final).
>> >
>> > The issue I am facing:
>> >
>> > Every time I do an incremental import, after processing I find the
>> > values in the final Hive table multiplied by the number of incremental
>> > imports I have done. If I perform an incremental import once a day for
>> > 4 days, the data in the final Hive table (Hive_Table_Final) ends up
>> > multiplied by 4 with respect to the final table in the RDBMS
>> > (Rdbms_Table_Final).
>> >
>> > Like this:
>> >
>> > 1) The first time, I pulled the data from the RDBMS based on the months
>> > (like from 2013-12-01 to 2013-01-01) and processed it, and got perfect
>> > results: the data in the final Hive table (Hive_Table_Final) matched
>> > the processed data in the RDBMS (Rdbms_Table_Final).
>> >
>> > 2) I then did an incremental import to bring the new data from the
>> > RDBMS to HDFS using this command:
>> >
>> > sqoop import -libjars \
>> >   --driver com.sybase.jdbc3.jdbc.SybDriver \
>> >   --query "select * from EMP where \$CONDITIONS and SAL > 50000 and SAL <= 80000" \
>> >   --check-column Unique_value \
>> >   --incremental append \
>> >   --last-value 201401200 \
>> >   --split-by DEPT \
>> >   --fields-terminated-by ',' \
>> >   --target-dir ${TARGET_DIR}/${INC} \
>> >   --username ${SYBASE_USERNAME} \
>> >   --password ${SYBASE_PASSWORD}
>> >
>> > Note -- the field Unique_value is unique every time; it is like a
>> > primary key.
>> >
>> > Now that I have pulled just the new records from the RDBMS tables into
>> > HDFS, I get a major data-mismatch issue after the processing
>> > (Hive_Table_Final).
>> >
>> > My major issue is with Sqoop incremental import: however many times I
>> > do an incremental import, I find the data in my final table multiplied
>> > by the number of incremental imports I have done.
>> >
>> > Please suggest what I am doing wrong and what I am missing.
>> > Please help me out.
>> >
>> > Thanks & Regards,
>> > Yogesh Kumar
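Also, since you are already comparing counts per Unique_value on both sides, it may help to script that check right after each pull, before the Hive joins run. A rough sketch follows; sqoop eval runs the query on the Sybase side, and ${SYBASE_URL} and the Hive table name emp_ext are assumptions on my part, since the thread does not name the connect string or the external table:

# Source side (Sybase), via sqoop eval:
sqoop eval \
  --driver com.sybase.jdbc3.jdbc.SybDriver \
  --connect "${SYBASE_URL}" \
  --username "${SYBASE_USERNAME}" \
  --password "${SYBASE_PASSWORD}" \
  --query "select count(*), Unique_value from EMP group by Unique_value"

# HDFS side, through the external Hive table:
hive -e "select count(*), Unique_value from emp_ext group by Unique_value"

If the counts already diverge at this point, the problem is in the Sqoop import itself; if they match here and only Hive_Table_Final is multiplied, the duplication is being introduced by the join/processing step, as Jarcec suggested.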

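For completeness, the external table from step 2 of the pipeline would look roughly like this. The column names and types are only assumptions based on the columns mentioned in the thread (EMP very likely has more), and the location path is a placeholder for the Sqoop target directory:

CREATE EXTERNAL TABLE emp_ext (
  Unique_value BIGINT,
  DEPT STRING,
  SAL BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/target_dir';

The FIELDS TERMINATED BY ',' clause has to match the --fields-terminated-by ',' used in the import, otherwise Hive will not split the columns correctly.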