RE: Re: the confusion of --split-by parameter

David Robson Tue, 09 Sep 2014 20:33:58 -0700

Regarding Oracle: with the addition of the direct connector you can split 
by ROWID or by partition. This is much faster than using min/max boundaries.


I do not know the internals of MySQL, but limit/offset queries would most 
likely need to sort the data to return consistent chunks, so they could carry 
additional overhead.

What database are you using? I guess the current approach of splitting by the 
minimum and maximum value of the column could be considered the generic way of 
doing it, and each database could then implement its own custom method. That is 
why we wrote the direct connector for Oracle: to take advantage of Oracle's 
features and improve on the generic approach. If someone could work out a 
better way of doing it for, say, MySQL or PostgreSQL, they could enhance the 
connector for that particular database. I know there are connectors for various 
databases, but I can't comment on whether they could split more efficiently, as 
I have only focused on the Oracle connector. You could try enhancing the 
connector for the database you are looking at and submit it as a patch if you 
find a more efficient method.

If you are using Oracle, you should try the direct connector in Sqoop 1.4.5 
(formerly known as OraOop), as it doesn't require a split-by column.

From: Abraham Elmahrek [mailto:[email protected]]
Sent: Wednesday, 10 September 2014 12:00 PM
To: [email protected]
Subject: Re: Re: the confusion of --split-by parameter

Good point. The only thing I can think of is that offsets might be slower 
(since the DB has to scan and keep a count internally) and the expectation that 
certain ranges of data end up in certain files (though I doubt this one). I'll 
defer this one to the broader community as I'm not sure myself.
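
The scan cost mentioned above can be put into rough numbers. This is only a 
back-of-the-envelope sketch in Python: it assumes the database reads and 
discards every row before the offset (no index shortcut) and that the row count 
divides evenly among the mappers. The function name and the figures are 
hypothetical.

```python
def rows_walked(total_rows, num_maps):
    """Rows each mapper's query touches if OFFSET rows are read and discarded.

    Assumption: total_rows is divisible by num_maps, and the database has to
    walk past all skipped rows before returning its chunk.
    """
    chunk = total_rows // num_maps
    # Mapper i skips i * chunk rows, then reads its own chunk.
    return [i * chunk + chunk for i in range(num_maps)]


walked = rows_walked(1_000_000, 4)
print(walked)       # rows touched per mapper
print(sum(walked))  # total rows touched across all mappers
```

Under these assumptions the four queries touch 2,500,000 rows in total to 
import 1,000,000, and the last mapper does four times the work of the first.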

On Tue, Sep 9, 2014 at 5:31 PM, [email protected] wrote:


Hey, brother.
  Glad to hear from you! I think we can use limit/offset (if the database 
supports this operation), or we can use a subquery (if the database does not 
support limit/offset).
For example:
For MySQL: select * from table limit 0,5; select * from table limit 5,5; ...
For Oracle we can use rownum.
I just cannot understand why Sqoop overrides the operation above. This override 
can lead to data skew.
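
The limit/offset idea above can be sketched as follows. This is an illustration 
in Python, not anything Sqoop does: the table name `my_table`, the ordering 
column `id`, and both function names are hypothetical, and the total row count 
is assumed to come from a prior select count(*). An ORDER BY is included 
because without one the database is free to return rows in a different order 
for each query, so chunks could overlap or miss rows.

```python
def offset_chunks(total_rows, num_maps):
    """Return (offset, limit) pairs that cover total_rows exactly once."""
    base, extra = divmod(total_rows, num_maps)
    chunks, offset = [], 0
    for i in range(num_maps):
        # Spread the remainder over the first `extra` mappers.
        limit = base + (1 if i < extra else 0)
        chunks.append((offset, limit))
        offset += limit
    return chunks


def chunk_queries(table, order_col, total_rows, num_maps):
    """Build one LIMIT/OFFSET query per mapper (MySQL/PostgreSQL syntax)."""
    return [
        f"SELECT * FROM {table} ORDER BY {order_col} LIMIT {limit} OFFSET {offset}"
        for offset, limit in offset_chunks(total_rows, num_maps)
    ]


for query in chunk_queries("my_table", "id", 10, 4):
    print(query)
```

Each mapper gets an almost equal number of rows regardless of how the values in 
any column are distributed, at the price of a count query up front and a sort 
per chunk.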

From: Abraham Elmahrek <[email protected]>
Date: 2014-09-10 00:38
To: [email protected]
Subject: Re: the confusion of --split-by parameter
Hey there,

For databases, there needs to be a way to actually infer boundaries for a 
particular column. Simply performing a "select *" would not be enough because 
we would not know how to query the database.
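
To make the boundary idea concrete, here is a minimal sketch in Python of how 
range predicates could be derived once the minimum and maximum of the split 
column are known (for example from a SELECT MIN(col), MAX(col) query). It is 
not Sqoop's actual code; the function name, the column `id`, and the bounds are 
hypothetical, and only integer columns are handled.

```python
def range_predicates(column, lo, hi, num_maps):
    """Build one WHERE predicate per mapper from integer bounds lo..hi."""
    step = (hi - lo) / num_maps
    # Cut points between lo and hi; the last one is pinned to hi exactly.
    cuts = [lo + round(i * step) for i in range(num_maps)] + [hi]
    preds = []
    for i in range(num_maps):
        # Only the final range is closed on the right, so hi itself is imported.
        op = "<=" if i == num_maps - 1 else "<"
        preds.append(f"{column} >= {cuts[i]} AND {column} {op} {cuts[i + 1]}")
    return preds


for pred in range_predicates("id", 0, 100, 4):
    print(pred)
```

Each mapper can then run its own query with one of these predicates, which is 
why a plain "select *" with no boundaries gives nothing to divide the work by.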

-Abe

On Mon, Sep 8, 2014 at 8:33 PM, [email protected] wrote:
Hi, all.
   In Sqoop we can specify the --split-by parameter, which determines which 
field is used to split the records among the map tasks.
But if the data in the split field is skewed, the workload between maps will be 
imbalanced. I want to know why Sqoop does not use 
select count(*) from table / num-maps to determine each map's workload. As far 
as I know, a base class of DataDrivenDBInputFormat has an implementation of 
select count(*) from table / num-maps. So why does Sqoop override this?
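
The skew described above is easy to reproduce with a small sketch in Python. 
The id values below are made up for illustration, and the splitting logic is a 
simplified model of equal-width ranges between the minimum and maximum, not 
Sqoop's actual splitter.

```python
def range_split_counts(values, num_maps):
    """Count rows per split when [min, max] is cut into equal-width ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_maps
    counts = [0] * num_maps
    for v in values:
        # The last range is closed on the right so that max(values) is included.
        idx = min(int((v - lo) / width), num_maps - 1) if width else 0
        counts[idx] += 1
    return counts


# Hypothetical skewed key column: 97 small ids plus three large outliers.
ids = list(range(1, 98)) + [5000, 9000, 10000]
print(range_split_counts(ids, 4))  # -> [97, 1, 0, 2]
```

With 100 rows and 4 mappers, a count-based split would give each mapper 25 
rows; here the first mapper gets 97 and the third gets none.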
