Re: What is the recommended way to import data into Hive incrementally

jason Yang Thu, 07 Jun 2012 01:53:42 -0700

all right, Thank you very much~

2012/6/7 Jarek Jarcec Cecho <[email protected]>


> Hi Jason,
> your procedure is correct use case for the incremental import and I do not
> see any issues. You've also correctly guessed that it's the file ordering
> that mess with row "order".
>
> This behaviour is definitely expected, so you don't have to worry about
> that. Hive do not care about internal order of the files/rows and you are
> need to use "order by" clause if you need to sort the output result set in
> any way. Other RDBMS have the same requirement (including MySQL).
>
> Jarcec
>
> On Thu, Jun 07, 2012 at 12:06:30PM +0800, jason Yang wrote:
> > Hi, all
> >
> > Recently I'm trying to use sqoop to import data into Hive incrementally.
> > However, I have encountered some weird problems.
> >
> > Here is my scenario:
> > ----
> > 1. I have a table t_student in MySQL , which schema is:
> >     schema(student) = (sID integer PRIMARY KEY, sName vchar(30) NOT
> NULL);
> >
> > 2. Initially, I have 1000 records ,and I import all of these records into
> > hive by using: sqoop --import ... --hive-import. and It works fine.
> >
> > 3. Suppose that now I get some new records(sID from 1001 to 2000), I try
> to
> > import those new records into hive incrementally by the following
> command:
> >     sqoop --import ... --check-column sID --last-value 1000 --incremental
> > append --hive-import
> >
> > 4. This command is executed without any error, but when I try to select
> the
> > data in the Hive table, I have got a weird result:
> >      query    : hive> select sID from t_student;
> >      Result   : I got all the sID in the table ,but the order of those
> > records is like: 0~500, 1001~1500, 501~1000, 1501~2000. I thought it
> should
> > order by the sID ascendantly.
> >
> > 5. According to the manual of Hive, The data in Hive is actually stored
> in
> > the warehouse directory as HDFS files. so, I looked up this directory
> path
> > and I found a sub-directory named as t_student. In this
> > $HIVE_DIR\t_student, there're some files named like:
> >     part-m-00000
> >     part-m-00000-copy
> >     part-m-00001
> >     part-m-00001-copy
> >     ....
> > It seems that the records are stored in such files, and the order of
> select
> > result is exactly the order of file name. In my case, the student records
> > which ID are from 0 to 500 are stored in the part-m-00000, and the
> student
> > records with ID from 1001 to 1500 are stored in the part-m-000000-copy,
> so
> > I got the result described before.
> > -----
> >
> > I'm not sure whether this kind of result is OK or not, and I was
> wondering
> > what is the recommended way to import data into Hive incrementally?
> >
> > Any suggestion would be appreciated.
> >
> > --
> > YANG, Lin
>



-- 
YANG, Lin

Re: What is the recommended way to import data into Hive incrementally

Reply via email to