All right, thank you very much~

2012/6/7 Jarek Jarcec Cecho <[email protected]>

> Hi Jason,
> your procedure is a correct use case for the incremental import and I do not
> see any issues. You've also correctly guessed that it's the file ordering
> that messes with the row "order".
>
> This behaviour is definitely expected, so you don't have to worry about
> it. Hive does not care about the internal order of the files/rows, and you
> need to use an "order by" clause if you need to sort the output result set in
> any way. Other RDBMSs have the same requirement (including MySQL).
>
> Jarcec
>
> On Thu, Jun 07, 2012 at 12:06:30PM +0800, jason Yang wrote:
> > Hi, all
> >
> > Recently I've been trying to use Sqoop to import data into Hive
> > incrementally. However, I have encountered some weird problems.
> >
> > Here is my scenario:
> > ----
> > 1. I have a table t_student in MySQL, whose schema is:
> >    schema(student) = (sID integer PRIMARY KEY, sName varchar(30) NOT NULL);
> >
> > 2. Initially, I have 1000 records, and I import all of them into
> >    Hive by using: sqoop import ... --hive-import. It works fine.
> >
> > 3. Suppose that I now get some new records (sID from 1001 to 2000). I try
> >    to import those new records into Hive incrementally with the following
> >    command:
> >    sqoop import ... --check-column sID --last-value 1000 --incremental
> >    append --hive-import
> >
> > 4. This command executes without any error, but when I select the
> >    data in the Hive table, I get a weird result:
> >    query  : hive> select sID from t_student;
> >    Result : I get all the sIDs in the table, but the order of the
> >    records is: 0~500, 1001~1500, 501~1000, 1501~2000. I thought they
> >    should be ordered by sID ascending.
> >
> > 5. According to the Hive manual, the data in Hive is actually stored in
> >    the warehouse directory as HDFS files. So I looked up this directory
> >    path and found a sub-directory named t_student. In this
> >    $HIVE_DIR\t_student, there are some files named like:
> >    part-m-00000
> >    part-m-00000-copy
> >    part-m-00001
> >    part-m-00001-copy
> >    ....
> >    It seems that the records are stored in these files, and the order of
> >    the select result is exactly the order of the file names. In my case,
> >    the student records whose IDs are from 0 to 500 are stored in
> >    part-m-00000, and the student records with IDs from 1001 to 1500 are
> >    stored in part-m-00000-copy, so I got the result described above.
> > -----
> >
> > I'm not sure whether this kind of result is OK or not, and I was
> > wondering what the recommended way to import data into Hive
> > incrementally is.
> >
> > Any suggestion would be appreciated.
> >
> > --
> > YANG, Lin
>

--
YANG, Lin
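
Note on the ordering discussed above: the following is only a minimal Python sketch of the effect, not how Hive actually reads its files. It assumes the four part files from the listing hold the sID ranges described in the thread (the dictionary contents and the helper name scan_in_file_order are hypothetical), visits them in file-name order, and shows that an explicit sort is what restores ascending order. In Hive the equivalent fix is the "order by" clause Jarcec mentions, e.g. select sID from t_student order by sID;

```python
# Illustrative stand-in for the files under the t_student warehouse directory.
# File names come from the thread; the sID ranges are assumed from the
# reported result (0~500, 1001~1500, 501~1000, 1501~2000).
part_files = {
    "part-m-00000": range(0, 501),
    "part-m-00001": range(501, 1001),
    "part-m-00000-copy": range(1001, 1501),
    "part-m-00001-copy": range(1501, 2001),
}


def scan_in_file_order(files):
    """Concatenate rows file by file, visiting files in name order."""
    rows = []
    for name in sorted(files):
        rows.extend(files[name])
    return rows


# Without an explicit sort, the rows come back in file order.
unordered = scan_in_file_order(part_files)

# The analogue of "order by sID": sort the result set explicitly.
ordered = sorted(unordered)

print(unordered[499:503])  # rows around the first file boundary
print(ordered[499:503])    # the same positions after sorting
```

The point of the sketch is that the row order is a side effect of how the files happen to be named and read, so any query that needs a particular order has to ask for it.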
