Hi David, On Sep 4, 2011, at 9:01pm, David Robson wrote:
> Hi Ken, > > We thought you wanted each mapper to run against an individual partition – if > you want to just pull out a specific partition say the JAN2010 partition - > you can achieve that in OraOop currently using a WHERE clause that would only > pick that partition. > OraOop would then split the blocks of the entire table across all the mappers > – the problem of course a lot of those blocks would not match the WHERE > clause and hence not be selecting any rows. From my testing – Oracle seems to > be smart enough that if you request rowid ranges and also specify a WHERE > clause on the partition key such that no rows could match – it realises this > and does not actually read the blocks. So OraOop would get maximum > performance in this case (aside from the slight overhead to work out there > were no blocks in that range – which shouldn’t be much in the scheme of a big > import job). > > The main problem here is because OraOop is splitting across all the > partitions – it could potentially be unbalanced and some mappers would > process more rows than others. Are you experiencing problems with mappers > being unbalanced? You can try changing the oraoop.block.allocation property > to RANDOM (by default it is ROUNDROBIN). No, I don't believe so - at least not when OraOop is actually getting used. > Also Oracle should not spawn any parallel queries for OraOop queries – we > have specifically disabled it as it would cause overhead on the database. OK, great - that's good to know. > So in summary – to get the best possible performance you should run one > OraOop job that fetches all the rows you want – and OraOop should balance the > data across all the mappers and try to minimise the load on the DB by only > reading each block once. The ROWIDs we scan also have the partition ID in > them so we only ever scan one partition at a time – then UNION ALL the > results together – as you have probably seen if you trace a session. > > If there is a particular import job you are trying to tune you could send us > some details about it – some screenshots from OEM and log files etc and we > could have a look at it with you. Thanks for that offer. We're improving performance (up to 50M rows/min best case) by tuning other parameters, so feeling better about the status. Regards, -- Ken > From: Ken Krugler [mailto:kkrugler_li...@transpac.com] > Sent: Sunday, 4 September 2011 6:49 AM > To: sqoop-user@incubator.apache.org > Subject: Re: [sqoop-user] Support for partitioning during export into HDFS > > Hi Peter, > > On Sep 1, 2011, at 4:52pm, Peter Hall wrote: > > > Hi Ken, > > We did initially consider an approach similar to what you suggest, but > decided not to go with it due to complexities when the number of mappers is > different to the number of partitions. > > OK - but note that what I'm asking for is the ability to restrict a given > Sqoop import request to one partition. > > We can run multiple of these in parallel, if that would improve our > throughput. > > Given that we've got billions of rows coming from a single DB, we're looking > to maximize performance here, thus the interest in this topic. > > > Instead we are breaking up the blocks in the table and spreading them across > all the mappers and doing ROWID range scans. So all mappers could be reading > from all partitions - but they would only be reading part of each. > > Since the ROWID range scans are not partition specific, wouldn't this cause > Oracle to spawn 16 parallel queries (one per partition)? > > Also, typically wouldn't a range of ROWIDs be in one partition (or maybe > two), if we have num mappers == num partitions? > > So the queries in all the other partitions would match nothing. > > Just wanting to make sure I understand what's happening under the hood, here. > > Thanks, > > -- Ken > > > Splitting by PARTITION may provide slightly better performance, but we don't > believe it would be a huge difference. > > Regards, > Peter Hall > Quest Software > > Hi there, > > For maximum performance when pulling data, it seems like we'd want to run > multiple Sqoops in parallel against the available partitions in a table. > > That would require adding 'PARTITION <partition_name> to the select > statement, something like: > > select * from <table_name> PARTITION <partition_name> where <condition>; > > 1. Does this make sense, both for general Sqoop and specifically OraOop? > > 2. Is there a way to do this now, or would Sqoop (and OraOop) need to be > extended? > > Thanks, > > -- Ken > -------------------------- > Ken Krugler > +1 530-210-6378 > http://bixolabs.com > custom big data solutions & training > Hadoop, Cascading, Mahout & Solr > > > > -------------------------------------------- > http://about.me/kkrugler > +1 530-210-6378 > > > > -------------------------- Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr