Hi, all

Recently I'm trying to use sqoop to import data into Hive incrementally.
However, I have encountered some weird problems.

Here is my scenario:
----
1. I have a table t_student in MySQL , which schema is:
    schema(student) = (sID integer PRIMARY KEY, sName vchar(30) NOT NULL);

2. Initially, I have 1000 records ,and I import all of these records into
hive by using: sqoop --import ... --hive-import. and It works fine.

3. Suppose that now I get some new records(sID from 1001 to 2000), I try to
import those new records into hive incrementally by the following command:
    sqoop --import ... --check-column sID --last-value 1000 --incremental
append --hive-import

4. This command is executed without any error, but when I try to select the
data in the Hive table, I have got a weird result:
     query    : hive> select sID from t_student;
     Result   : I got all the sID in the table ,but the order of those
records is like: 0~500, 1001~1500, 501~1000, 1501~2000. I thought it should
order by the sID ascendantly.

5. According to the manual of Hive, The data in Hive is actually stored in
the warehouse directory as HDFS files. so, I looked up this directory path
and I found a sub-directory named as t_student. In this
$HIVE_DIR\t_student, there're some files named like:
    part-m-00000
    part-m-00000-copy
    part-m-00001
    part-m-00001-copy
    ....
It seems that the records are stored in such files, and the order of select
result is exactly the order of file name. In my case, the student records
which ID are from 0 to 500 are stored in the part-m-00000, and the student
records with ID from 1001 to 1500 are stored in the part-m-000000-copy, so
I got the result described before.
-----

I'm not sure whether this kind of result is OK or not, and I was wondering
what is the recommended way to import data into Hive incrementally?

Any suggestion would be appreciated.

-- 
YANG, Lin

Reply via email to