Re: [sqoop-user] Support for partitioning during export into HDFS

Ken Krugler Mon, 05 Sep 2011 16:42:17 -0700

Hi David,

On Sep 4, 2011, at 9:01pm, David Robson wrote:


> Hi Ken,
>  
> We thought you wanted each mapper to run against an individual partition – if 
> you want to just pull out a specific partition say the JAN2010 partition - 
> you can achieve that in OraOop currently using a WHERE clause that would only 
> pick that partition.

> OraOop would then split the blocks of the entire table across all the mappers 
> – the problem of course a lot of those blocks would not match the WHERE 
> clause and hence not be selecting any rows. From my testing – Oracle seems to 
> be smart enough that if you request rowid ranges and also specify a WHERE 
> clause on the partition key such that no rows could match – it realises this 
> and does not actually read the blocks. So OraOop would get maximum 
> performance in this case (aside from the slight overhead to work out there 
> were no blocks in that range – which shouldn’t be much in the scheme of a big 
> import job).
>  
> The main problem here is because OraOop is splitting across all the 
> partitions – it could potentially be unbalanced and some mappers would 
> process more rows than others. Are you experiencing problems with mappers 
> being unbalanced? You can try changing the oraoop.block.allocation property 
> to RANDOM (by default it is ROUNDROBIN).

No, I don't believe so - at least not when OraOop is actually getting used.

> Also Oracle should not spawn any parallel queries for OraOop queries – we 
> have specifically disabled it as it would cause overhead on the database.

OK, great - that's good to know.

> So in summary – to get the best possible performance you should run one 
> OraOop job that fetches all the rows you want – and OraOop should balance the 
> data across all the mappers and try to minimise the load on the DB by only 
> reading each block once. The ROWIDs we scan also have the partition ID in 
> them so we only ever scan one partition at a time – then UNION ALL the 
> results together – as you have probably seen if you trace a session.
>  
> If there is a particular import job you are trying to tune you could send us 
> some details about it – some screenshots from OEM and log files etc and we 
> could have a look at it with you.

Thanks for that offer.

We're improving performance (up to 50M rows/min best case) by tuning other 
parameters, so feeling better about the status.

Regards,

-- Ken


> From: Ken Krugler [mailto:kkrugler_li...@transpac.com] 
> Sent: Sunday, 4 September 2011 6:49 AM
> To: sqoop-user@incubator.apache.org
> Subject: Re: [sqoop-user] Support for partitioning during export into HDFS
>  
> Hi Peter,
>  
> On Sep 1, 2011, at 4:52pm, Peter Hall wrote:
> 
> 
> Hi Ken,
> 
> We did initially consider an approach similar to what you suggest, but 
> decided not to go with it due to complexities when the number of mappers is 
> different to the number of partitions.
>  
> OK - but note that what I'm asking for is the ability to restrict a given 
> Sqoop import request to one partition.
>  
> We can run multiple of these in parallel, if that would improve our 
> throughput.
>  
> Given that we've got billions of rows coming from a single DB, we're looking 
> to maximize performance here, thus the interest in this topic.
> 
> 
> Instead we are breaking up the blocks in the table and spreading them across 
> all the mappers and doing ROWID range scans. So all mappers could be reading 
> from all partitions - but they would only be reading part of each.
>  
> Since the ROWID range scans are not partition specific, wouldn't this cause 
> Oracle to spawn 16 parallel queries (one per partition)?
>  
> Also, typically wouldn't a range of ROWIDs be in one partition (or maybe 
> two), if we have num mappers == num partitions?
>  
> So the queries in all the other partitions would match nothing.
>  
> Just wanting to make sure I understand what's happening under the hood, here.
>  
> Thanks,
>  
> -- Ken
> 
> 
> Splitting by PARTITION may provide slightly better performance, but we don't 
> believe it would be a huge difference.
> 
> Regards,
> Peter Hall
> Quest Software
>  
> Hi there,
> 
> For maximum performance when pulling data, it seems like we'd want to run 
> multiple Sqoops in parallel against the available partitions in a table.
> 
> That would require adding 'PARTITION <partition_name> to the select 
> statement, something like:
> 
> select * from <table_name> PARTITION <partition_name> where <condition>;
> 
> 1. Does this make sense, both for general Sqoop and specifically OraOop?
> 
> 2. Is there a way to do this now, or would Sqoop (and OraOop) need to be 
> extended?
> 
> Thanks,
> 
> -- Ken
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
> 
> 
>  
> --------------------------------------------
> http://about.me/kkrugler
> +1 530-210-6378
>  
>  
> 
>  

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

Re: [sqoop-user] Support for partitioning during export into HDFS

Reply via email to