Thanks Jarcec, you've probably identified the problem immediately. In fact, I checked the date field, and I think the problem is that my data contain some "limit" values like '0000-00-00' (damn whoever inserted those). The rest of the data are evenly distributed over 2 months (from 2012-04-01 to 2012-06-01): as you said, with a parallelism of 3, two mappers will take basically no data while the other will do the "true" job, right?
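For reference, I'm thinking of working around the sentinel rows along these lines (an untested sketch, assuming a Sqoop version that supports --boundary-query; <server>, <db>, <usr>, <table> and <date-field> are the same placeholders as in my original mail):

sqoop import \
  --connect "jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round" \
  --username <usr> -P \
  --query "SELECT ... FROM <table> WHERE <date-field> >= '2012-04-01' AND \$CONDITIONS" \
  --split-by <date-field> \
  --boundary-query "SELECT MIN(<date-field>), MAX(<date-field>) FROM <table> WHERE <date-field> >= '2012-04-01'" \
  --hbase-table "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"

The WHERE clause keeps the '0000-00-00' rows out of the import itself, and the boundary query makes Sqoop compute its splits over the real range (2012-04-01 to 2012-06-01) rather than a range that starts at year zero, so each of the 3 mappers should get roughly a third of the two months.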
So now my question becomes: the other field that I could use to split the job is a hash (string). How does Sqoop divide this type of field? Lexicographic order?

Alberto

On 5 September 2012 09:57, Jarek Jarcec Cecho <[email protected]> wrote:
> Hi Alberto,
> taking into account that you have 910 million records and your job was
> able to get to 75% in a matter of 8 minutes and then slowed down
> significantly, I have a feeling that your splits were not equally divided.
> Based on your command line it seems that you're dividing the data by some date
> field. Is this date field uniformly distributed? E.g. is there roughly the same
> number of rows for each date, or do you have more rows on more recent days?
>
> Because Sqoop has no idea how exactly the data are distributed in your
> database, it assumes a uniform distribution. Let me explain why that matters
> with the following example. Let's consider a table with one row on 2012-01-01,
> a second row on 2012-02-01 and 1M rows on 2012-03-01, and let's assume that we
> use three mappers (--num-mappers 3). In this case, Sqoop will create
> three splits: 2012-01-01 up to 2012-01-31, 2012-02-01 up to 2012-02-28 and
> 2012-03-01 up to 2012-03-31. Because the first two mappers have just one
> row each to move, they will finish almost instantly and bring the job to 66%
> done (2 out of 3 mappers are done); however, the last mapper will be running
> for some time, as it needs to move 1M rows. To an external observer it would
> appear that Sqoop has stopped, but what really happened is that the data were
> not uniformly distributed across all mappers.
>
> Jarcec
>
> On Wed, Sep 05, 2012 at 09:37:49AM +0200, Alberto Cordioli wrote:
>> Hi all,
>>
>> I am using Sqoop to import a big MySQL table (around 910 million
>> records) into HBase.
>> The command line that I'm using is something like:
>> sqoop import --connect
>> jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round --username <usr>
>> -P --query '<query>' --split-by <date-field> --hbase-table
>> "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"
>>
>> The strange thing is that it takes a long time to complete the last part of
>> the map. This is part of the log:
>>
>> [...]
>> 12/09/04 17:16:45 INFO mapred.JobClient: Running job: job_201209031227_0007
>> 12/09/04 17:16:46 INFO mapred.JobClient: map 0% reduce 0%
>> 12/09/04 17:24:20 INFO mapred.JobClient: map 25% reduce 0%
>> 12/09/04 17:24:21 INFO mapred.JobClient: map 50% reduce 0%
>> 12/09/04 17:24:23 INFO mapred.JobClient: map 75% reduce 0%
>>
>> As you can see, it does not take much time to go from start to 75%, but
>> the last part hasn't finished (although it has been running continuously
>> for a day).
>> Is there something wrong? I've tried to take a look at the logs, but they
>> seem to be ok.
>>
>>
>> Thanks,
>> Alberto
>>
>>
>>
>> --
>> Alberto Cordioli

--
Alberto Cordioli
