Re: What is the recommended way to import data into Hive incrementally

Jarek Jarcec Cecho Wed, 06 Jun 2012 22:28:58 -0700

Hi Jason,
your procedure is correct use case for the incremental import and I do not see 
any issues. You've also correctly guessed that it's the file ordering that mess 
with row "order".


This behaviour is definitely expected, so you don't have to worry about that. 
Hive do not care about internal order of the files/rows and you are need to use 
"order by" clause if you need to sort the output result set in any way. Other 
RDBMS have the same requirement (including MySQL).

Jarcec

On Thu, Jun 07, 2012 at 12:06:30PM +0800, jason Yang wrote:
> Hi, all
> 
> Recently I'm trying to use sqoop to import data into Hive incrementally.
> However, I have encountered some weird problems.
> 
> Here is my scenario:
> ----
> 1. I have a table t_student in MySQL , which schema is:
>     schema(student) = (sID integer PRIMARY KEY, sName vchar(30) NOT NULL);
> 
> 2. Initially, I have 1000 records ,and I import all of these records into
> hive by using: sqoop --import ... --hive-import. and It works fine.
> 
> 3. Suppose that now I get some new records(sID from 1001 to 2000), I try to
> import those new records into hive incrementally by the following command:
>     sqoop --import ... --check-column sID --last-value 1000 --incremental
> append --hive-import
> 
> 4. This command is executed without any error, but when I try to select the
> data in the Hive table, I have got a weird result:
>      query    : hive> select sID from t_student;
>      Result   : I got all the sID in the table ,but the order of those
> records is like: 0~500, 1001~1500, 501~1000, 1501~2000. I thought it should
> order by the sID ascendantly.
> 
> 5. According to the manual of Hive, The data in Hive is actually stored in
> the warehouse directory as HDFS files. so, I looked up this directory path
> and I found a sub-directory named as t_student. In this
> $HIVE_DIR\t_student, there're some files named like:
>     part-m-00000
>     part-m-00000-copy
>     part-m-00001
>     part-m-00001-copy
>     ....
> It seems that the records are stored in such files, and the order of select
> result is exactly the order of file name. In my case, the student records
> which ID are from 0 to 500 are stored in the part-m-00000, and the student
> records with ID from 1001 to 1500 are stored in the part-m-000000-copy, so
> I got the result described before.
> -----
> 
> I'm not sure whether this kind of result is OK or not, and I was wondering
> what is the recommended way to import data into Hive incrementally?
> 
> Any suggestion would be appreciated.
> 
> -- 
> YANG, Lin

signature.asc
Description: Digital signature

Re: What is the recommended way to import data into Hive incrementally

Reply via email to