Thanks Jarcec,
you've probably identified the problem immediately. In fact, I checked
the date field, and I think the problem is that my data contains some
"limit" values like '0000-00-00' (damn whoever inserted those).
The rest of the data is evenly distributed over 2 months (from
2012-04-01 to 2012-06-01): as you said, with a parallelism of 3, two
mappers will get basically no data while the third does the "true" job, right?
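To picture it with my actual range, here is a rough sketch of the boundary computation (my own approximation, not Sqoop's real code; I'm assuming it picks equal-width intervals between MIN and MAX of the split column):

```python
from datetime import date, timedelta

def uniform_splits(lo, hi, num_mappers):
    """Divide [lo, hi] into num_mappers equal-width date intervals,
    roughly the way equal-width split boundaries would be chosen."""
    span = (hi - lo).days
    step = span / num_mappers
    bounds = [lo + timedelta(days=round(i * step)) for i in range(num_mappers)]
    bounds.append(hi)
    return list(zip(bounds, bounds[1:]))

# '0000-00-00' is not representable in Python; date.min (0001-01-01)
# is close enough to show the effect of the outlier minimum.
for lo, hi in uniform_splits(date.min, date(2012, 6, 1), 3):
    print(lo, "->", hi)
# All the real rows (2012-04-01 .. 2012-06-01) land in the last
# interval only; the first two intervals cover centuries of empty dates.
```

So with the outlier as MIN, two of my three mappers scan empty centuries and the third gets all 910 million rows.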

So now my question becomes: the other field I could use to split the
job is a hash (string). How does Sqoop divide this type of field? By
lexicographic order?
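To make my question concrete: I would imagine a lexicographic splitter maps each string to a fraction in [0, 1) by treating its characters as digits after the radix point, divides that numeric range evenly, and converts the boundaries back into strings. A rough Python sketch of that idea (just my guess at the scheme, not Sqoop's actual text-splitter code):

```python
def string_to_fraction(s, alphabet_size=128, max_chars=8):
    # Treat the first max_chars characters as base-`alphabet_size`
    # digits after the radix point: "b" -> 98/128, "ba" -> 98/128 + 97/128^2.
    f = 0.0
    for i, ch in enumerate(s[:max_chars]):
        f += ord(ch) / (alphabet_size ** (i + 1))
    return f

def fraction_to_string(f, alphabet_size=128, max_chars=8):
    # Inverse of the above: peel off one "digit" per character.
    out = []
    for _ in range(max_chars):
        f *= alphabet_size
        digit = int(f)
        f -= digit
        out.append(chr(digit))
    return "".join(out).rstrip("\x00")

def text_splits(lo, hi, num_splits):
    # Evenly divide the numeric interval and map boundaries back to strings.
    a, b = string_to_fraction(lo), string_to_fraction(hi)
    step = (b - a) / num_splits
    bounds = [fraction_to_string(a + i * step) for i in range(num_splits)]
    return list(zip(bounds, bounds[1:] + [hi]))
```

If the hashes are uniformly distributed (which they should be, being hashes), a scheme like this would give well-balanced splits, but I'd like to know what Sqoop really does.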


Alberto



On 5 September 2012 09:57, Jarek Jarcec Cecho <[email protected]> wrote:
> Hi Alberto,
> taking into account that you have 910 million records and your job was
> able to get to 75% in a matter of 8 minutes and then slowed down
> significantly, I have a feeling that your splits were not equally divided.
> Based on your command line it seems that you're dividing the data by some
> date field. Is this date field uniformly distributed? E.g. is there roughly
> the same number of rows for each date, or do you have more rows in more
> recent days?
>
> Because Sqoop has no idea how the data are actually distributed in your
> database, it assumes a uniform distribution. Let me explain why that
> matters with the following example. Consider a table with one row on
> 2012-01-01, a second row on 2012-02-01 and 1M rows on 2012-03-01, and
> let's assume we use three mappers (--num-mappers 3). In this case, Sqoop
> will create three splits: 2012-01-01 up to 2012-01-31, 2012-02-01 up to
> 2012-02-28 and 2012-03-01 up to 2012-03-31. Because the first two mappers
> have just one row each to move, they will finish almost instantly and
> bring the job to 66% done (2 out of 3 mappers finished), while the last
> mapper will keep running for some time as it needs to move 1M rows. To an
> external observer it would appear that Sqoop has stopped, but what really
> happened is that the data were not uniformly distributed across the mappers.
>
> Jarcec
>
> On Wed, Sep 05, 2012 at 09:37:49AM +0200, Alberto Cordioli wrote:
>> Hi all,
>>
>> I am using Sqoop to import a big MySQL table (around 910 million
>> records) into HBase.
>> The command line that I'm using is something like:
>> sqoop import --connect
>> jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round --username <usr>
>> -P --query '<query>' --split-by <date-field> --hbase-table
>> "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"
>>
>> The strange thing is that it takes a long time to complete the last
>> part of the map phase. This is part of the log:
>>
>> [...]
>> 12/09/04 17:16:45 INFO mapred.JobClient: Running job: job_201209031227_0007
>> 12/09/04 17:16:46 INFO mapred.JobClient:  map 0% reduce 0%
>> 12/09/04 17:24:20 INFO mapred.JobClient:  map 25% reduce 0%
>> 12/09/04 17:24:21 INFO mapred.JobClient:  map 50% reduce 0%
>> 12/09/04 17:24:23 INFO mapred.JobClient:  map 75% reduce 0%
>>
>> As you can see it does not take much time to get from start to 75%,
>> but the last part never finishes (although it has been running
>> continuously for a day).
>> Is there something wrong? I've taken a look at the logs but they
>> seem to be fine.
>>
>>
>> Thanks,
>> Alberto
>>
>>
>>
>> --
>> Alberto Cordioli



-- 
Alberto Cordioli
