Hi Ken,

We thought you wanted each mapper to run against an individual partition. If 
you just want to pull out a specific partition - say the JAN2010 partition - 
you can achieve that in OraOop today by using a WHERE clause that only selects 
rows from that partition.
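
For example, something like the following - the connection string, SALES 
table, SALE_DATE column and target directory are all made-up names, assuming 
the table is range-partitioned on SALE_DATE:

  sqoop import --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username scott --password tiger \
      --table SALES \
      --where "SALE_DATE >= DATE '2010-01-01' AND SALE_DATE < DATE '2010-02-01'" \
      --target-dir /data/sales/jan2010

The WHERE predicate is on the partition key, so it only matches rows in the 
JAN2010 partition.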

OraOop would then split the blocks of the entire table across all the mappers. 
The problem, of course, is that a lot of those blocks would not match the WHERE 
clause and hence would not return any rows. From my testing, Oracle seems to be 
smart enough that if you request ROWID ranges and also specify a WHERE clause 
on the partition key such that no rows could match, it realises this and does 
not actually read the blocks. So OraOop would get maximum performance in this 
case (aside from the slight overhead of working out that there were no matching 
blocks in a range - which shouldn't be much in the scheme of a big import job).

The main problem here is that because OraOop is splitting across all the 
partitions, the work could end up unbalanced, with some mappers processing more 
rows than others. Are you experiencing problems with mappers being unbalanced? 
If so, you can try changing the oraoop.block.allocation property to RANDOM (by 
default it is ROUNDROBIN).
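
For example, you can pass it as a Hadoop property on the command line (the 
connection details and table name below are just placeholders):

  sqoop import -Doraoop.block.allocation=RANDOM \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username scott --password tiger \
      --table SALES --target-dir /data/sales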

Also, Oracle should not spawn any parallel queries for OraOop queries - we have 
specifically disabled parallel query, as it would cause extra overhead on the 
database.

So, in summary: to get the best possible performance, you should run one OraOop 
job that fetches all the rows you want. OraOop will balance the data across all 
the mappers and minimise the load on the DB by reading each block only once. 
The ROWIDs we scan also encode the partition ID, so each range scan only ever 
touches one partition at a time; we then UNION ALL the results together, as you 
have probably seen if you trace a session.
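
To give you an idea of the shape of the statement you would see in a trace - 
the table name and ROWID bounds below are just placeholders for the values 
OraOop generates from the table's data blocks:

  SELECT * FROM SALES
  WHERE ROWID >= CHARTOROWID('<range1_lo>') AND ROWID <= CHARTOROWID('<range1_hi>')
  UNION ALL
  SELECT * FROM SALES
  WHERE ROWID >= CHARTOROWID('<range2_lo>') AND ROWID <= CHARTOROWID('<range2_hi>')

Each branch of the UNION ALL covers one ROWID range, and each range falls 
entirely within a single partition.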

If there is a particular import job you are trying to tune, you could send us 
some details about it - screenshots from OEM, log files, etc. - and we could 
have a look at it with you.

David

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Sunday, 4 September 2011 6:49 AM
To: sqoop-user@incubator.apache.org
Subject: Re: [sqoop-user] Support for partitioning during export into HDFS

Hi Peter,

On Sep 1, 2011, at 4:52pm, Peter Hall wrote:


Hi Ken,

We did initially consider an approach similar to what you suggest, but decided 
not to go with it due to complexities when the number of mappers is different 
to the number of partitions.

OK - but note that what I'm asking for is the ability to restrict a given Sqoop 
import request to one partition.

We can run several of these imports in parallel, if that would improve our throughput.

Given that we've got billions of rows coming from a single DB, we're looking to 
maximize performance here, thus the interest in this topic.
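
Concretely, I'm imagining something like this (made-up table/column names, one 
import job per monthly partition, all run concurrently):

  sqoop import --connect $JDBC_URL --username scott --password tiger \
      --table SALES \
      --where "SALE_DATE >= DATE '2010-01-01' AND SALE_DATE < DATE '2010-02-01'" \
      --target-dir /data/sales/JAN2010 &
  sqoop import --connect $JDBC_URL --username scott --password tiger \
      --table SALES \
      --where "SALE_DATE >= DATE '2010-02-01' AND SALE_DATE < DATE '2010-03-01'" \
      --target-dir /data/sales/FEB2010 &
  wait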


Instead we are breaking up the blocks in the table and spreading them across 
all the mappers and doing ROWID range scans. So all mappers could be reading 
from all partitions - but they would only be reading part of each.

Since the ROWID range scans are not partition specific, wouldn't this cause 
Oracle to spawn 16 parallel queries (one per partition)?

Also, typically wouldn't a range of ROWIDs be in one partition (or maybe two), 
if we have num mappers == num partitions?

So the queries in all the other partitions would match nothing.

Just wanting to make sure I understand what's happening under the hood, here.

Thanks,

-- Ken


Splitting by PARTITION may provide slightly better performance, but we don't 
believe it would be a huge difference.

Regards,
Peter Hall
Quest Software

Hi there,

For maximum performance when pulling data, it seems like we'd want to run 
multiple Sqoops in parallel against the available partitions in a table.

That would require adding a PARTITION (<partition_name>) clause to the select 
statement, something like:

select * from <table_name> PARTITION (<partition_name>) where <condition>;
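
For instance, with plain Sqoop's free-form query import, I imagine it would 
look something like this (hypothetical names; --query needs the $CONDITIONS 
token and a --split-by column):

  sqoop import --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username scott --password tiger \
      --query "SELECT * FROM SALES PARTITION (JAN2010) WHERE \$CONDITIONS" \
      --split-by SALE_ID \
      --target-dir /data/sales/jan2010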

1. Does this make sense, both for general Sqoop and specifically OraOop?

2. Is there a way to do this now, or would Sqoop (and OraOop) need to be 
extended?

Thanks,

-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr


