Hello Abe,

Thanks a zillion for your response.

Yes, the unique_value here is SAL. I do have the option to test with a
different directory, as it holds only a small amount of data (one month's
worth). The interesting fact is that when I did a Sqoop pull on a monthly
basis (with no incremental import), the data matched very well; but for the
daily pull I have to run an incremental import every day, and that is where
I find some data loss.

One of the Sqoop folks suggested changing how \$CONDITIONS is placed within
the query, from:

    ... from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204

to:

    ... from EMP where SAL > 201401200 and SAL <= 201401204 and \$CONDITIONS

Please suggest a way forward, as I have to deliver this output to my client.

On Mon, Jan 13, 2014 at 11:13 PM, Abraham Elmahrek <[email protected]> wrote:

> Yogesh,
>
> Is unique_value in this case SAL? I'm a bit confused about your query.
>
> Do you have the option of running this query on a separate database
> somewhere to find the issue? I think it would be interesting to see the
> initial state and then the state after running an incremental import. That
> would tell us how many results are being imported after Sqoop has run, and
> we can validate each step. Also, please use the --verbose flag to get the
> most out of the logs.
>
> -Abe
>
> On Mon, Jan 13, 2014 at 5:15 AM, Sharath Punreddy <[email protected]> wrote:
>
>> Yogesh,
>>
>> Please try to put $CONDITIONS after your where clause.
>>
>> Check out the examples in the blog below:
>>
>> http://jugnu-life.blogspot.com/2012/03/sqoop-free-form-query-example.html?m=1
>>
>> On Jan 13, 2014 7:04 AM, "yogesh kumar" <[email protected]> wrote:
>>
>>> Hello Jarcec,
>>>
>>> I got to the issue; I hope this is the cause.
>>> I got data loss by doing an incremental pull.
>>>
>>> I have cross-checked it and found that with:
>>>
>>> sqoop import -libjars \
>>>   --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>   --query "select * from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204" \
>>>   --check-column Unique_value \
>>>   --incremental append \
>>>   --last-value 201401200 \
>>>   --split-by DEPT \
>>>   --fields-terminated-by ',' \
>>>   --target-dir ${TARGET_DIR}/${INC} \
>>>   --username ${SYBASE_USERNAME} \
>>>   --password ${SYBASE_PASSWORD}
>>>
>>> I have now imported the newly inserted data from the RDBMS to HDFS.
>>>
>>> But when I run
>>>
>>> select count(*), unique_value from EMP group by unique_value
>>>
>>> (both in the RDBMS and in Hive), I can see a huge data loss:
>>>
>>> 1) In the RDBMS:
>>>
>>> Count(*)   Unique_value
>>> 1000       201401201
>>> 5000       201401202
>>> 10000      201401203
>>>
>>> 2) In Hive:
>>>
>>> Count(*)   Unique_value
>>> 189        201401201
>>> 421        201401202
>>> 50         201401203
>>>
>>> If I run
>>>
>>> select Unique_value from EMP;
>>>
>>> the result is:
>>>
>>> 201401201
>>> 201401201
>>> 201401201
>>> 201401201
>>> 201401201
>>> .
>>> .
>>> 201401202
>>> .
>>> .
>>> and so on...
>>>
>>> Please help and suggest why this is so.
>>>
>>> Many thanks in advance,
>>> Yogesh Kumar
>>>
>>> On Sun, Jan 12, 2014 at 11:08 PM, Jarek Jarcec Cecho <[email protected]> wrote:
>>>
>>>> Hi Yogesh,
>>>> I would start by verifying the imported data. If there are duplicates,
>>>> that suggests some misconfiguration of Sqoop; otherwise you might have
>>>> some inconsistencies further down the pipeline.
>>>>
>>>> Jarcec
>>>>
>>>> On Sat, Jan 11, 2014 at 11:01:22PM +0530, yogesh kumar wrote:
>>>> > Hello All,
>>>> >
>>>> > I am working on a use case where I have to run a process on a daily
>>>> > basis which will do these:
>>>> >
>>>> > 1) Pull every day's newly inserted data from the RDBMS tables to HDFS.
>>>> > 2) Have an external table in Hive pointing to the HDFS directory
>>>> > where the data is pulled by Sqoop.
>>>> > 3) Perform some Hive queries (joins) and create a final internal
>>>> > table in Hive (say, Hive_Table_Final).
>>>> >
>>>> > What I am doing:
>>>> >
>>>> > I am migrating a process from the RDBMS to Hadoop (the same process
>>>> > is being executed as an RDBMS procedure and stored in a final table,
>>>> > say Rdbms_Table_Final).
>>>> >
>>>> > The issue I am facing:
>>>> >
>>>> > Every time I do an incremental import, after processing I find the
>>>> > values in the final Hive table (Hive_Table_Final) multiplied by the
>>>> > number of incremental imports I have done. If I perform an
>>>> > incremental import once a day for 4 days, the data in the final Hive
>>>> > table is multiplied by 4 with respect to the final table in the
>>>> > RDBMS (Rdbms_Table_Final).
>>>> >
>>>> > Like:
>>>> >
>>>> > 1) The first time, I pulled the data from the RDBMS by month (from
>>>> > 2013-12-01 to 2014-01-01) and processed it, and got perfect results:
>>>> > the data in the final Hive table (Hive_Table_Final) matched the
>>>> > processed data in the RDBMS (Rdbms_Table_Final).
>>>> >
>>>> > 2) I did an incremental import to bring new data from the RDBMS to
>>>> > HDFS using this command:
>>>> >
>>>> > sqoop import -libjars \
>>>> >   --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>> >   --query "select * from EMP where \$CONDITIONS and SAL > 50000 and SAL <= 80000" \
>>>> >   --check-column Unique_value \
>>>> >   --incremental append \
>>>> >   --last-value 201401200 \
>>>> >   --split-by DEPT \
>>>> >   --fields-terminated-by ',' \
>>>> >   --target-dir ${TARGET_DIR}/${INC} \
>>>> >   --username ${SYBASE_USERNAME} \
>>>> >   --password ${SYBASE_PASSWORD}
>>>> >
>>>> > "Note -- The field Unique_value is unique every time; it is like a
>>>> > primary key."
>>>> >
>>>> > So now I have pulled into HDFS just the new records that were
>>>> > inserted into the RDBMS tables.
>>>> >
>>>> > Now I have a major data-mismatch issue after the processing
>>>> > (Hive_Table_Final).
>>>> >
>>>> > My major issue is with Sqoop incremental import: however many times
>>>> > I run the incremental import, I find the data in my final table
>>>> > multiplied by that number of runs.
>>>> >
>>>> > Please suggest what I am doing wrong and what I am missing.
>>>> > Please help me out.
>>>> >
>>>> > Thanks & Regards
>>>> > Yogesh Kumar
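
[Editor's note] Putting the advice in the thread together, a minimal sketch of the free-form incremental import with \$CONDITIONS moved to the end of the WHERE clause (Sharath's suggestion) and --verbose enabled (Abe's suggestion). The jar path is a hypothetical placeholder, and switching --split-by from DEPT to the unique check column is my assumption, not something the thread confirms fixes the loss:

```shell
# Sketch only: /path/to/sybase-jdbc.jar is a placeholder, and
# --split-by Unique_value is an assumption (the thread used DEPT).
sqoop import -libjars /path/to/sybase-jdbc.jar \
  --driver com.sybase.jdbc3.jdbc.SybDriver \
  --query "select * from EMP where SAL > 201401200 and SAL <= 201401204 and \$CONDITIONS" \
  --check-column Unique_value \
  --incremental append \
  --last-value 201401200 \
  --split-by Unique_value \
  --fields-terminated-by ',' \
  --target-dir ${TARGET_DIR}/${INC} \
  --username ${SYBASE_USERNAME} \
  --password ${SYBASE_PASSWORD} \
  --verbose
```

To follow Jarcec's verification step, a duplicate check on the Hive side could be as simple as `select unique_value, count(*) from EMP group by unique_value having count(*) > 1;`, compared against the same query on the RDBMS.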
