Re: parallel processing with JDBC

Mich Talebzadeh Sun, 14 Aug 2016 13:37:45 -0700

If you have primary keys on these tables then you can parallelise the
process of reading the data.


You have to be careful not to set the number of partitions too high.
There is a balance between the number of partitions supplied to
JDBC and the load on the network and the source DB.

Assuming that your underlying table has a primary key ID, then this will
open up to 20 parallel connections to the Oracle DB:

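 // HiveContext is assumed to be an existing HiveContext instance, and maxID
 // a numeric value holding the highest ID (upperBound must be a number).
 // ID must be among the columns selected in the dbtable subquery.
 val d = HiveContext.read.format("jdbc").options(
 Map("url" -> _ORACLEserver,
 "dbtable" -> "(SELECT <COL1>, <COL2>, .... FROM <TABLE>)",
 "partitionColumn" -> "ID",
 "lowerBound" -> "1",
 "upperBound" -> maxID.toString,
 "numPartitions" -> "20",
 "user" -> _username,
 "password" -> _password)).load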

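assuming maxID holds the upper bound on ID. Note that lowerBound and
upperBound only decide how the ID range is split across partitions; they
do not filter out any rows.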
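
One way to get the bounds is to ask the database for them before setting up the parallel read. The sketch below is only an illustration using plain java.sql from the driver side; BoundsSketch, boundsQuery and fetchBounds are hypothetical names, and the table and column names are whatever you pass in.

```scala
import java.sql.Connection

object BoundsSketch {
  // Builds the MIN/MAX query; table and column are caller-supplied placeholders.
  def boundsQuery(table: String, column: String): String =
    s"SELECT MIN($column), MAX($column) FROM $table"

  // Runs the query over an already opened JDBC connection and returns (min, max).
  // For an empty table both aggregates are NULL and getLong returns 0.
  def fetchBounds(conn: Connection, table: String, column: String): (Long, Long) = {
    val stmt = conn.createStatement()
    try {
      val rs = stmt.executeQuery(boundsQuery(table, column))
      rs.next() // an aggregate query without GROUP BY returns exactly one row
      (rs.getLong(1), rs.getLong(2))
    } finally {
      stmt.close()
    }
  }
}
```

The second value of the returned pair could then be used as maxID in the read above.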


This will open multiple connections to RDBMS, each getting a subset of data
that you want.
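
To make the "subset of data" concrete, here is a rough sketch of how a numeric range can be split into one WHERE clause per partition. It approximates the behaviour of the JDBC source and is not Spark's actual code; PartitionSketch and predicates are hypothetical names.

```scala
object PartitionSketch {
  // Returns one predicate per partition for the given column and bounds.
  // The first partition has no lower limit (and picks up NULLs) and the last
  // has no upper limit, so rows outside the bounds are still read.
  def predicates(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
    require(numPartitions >= 1, "numPartitions must be at least 1")
    require(lower < upper, "lower bound must be below upper bound")
    val stride = upper / numPartitions - lower / numPartitions
    require(stride >= 1, "range is too small for this many partitions")
    (0 until numPartitions).map { i =>
      val lo = lower + i * stride
      val hi = lo + stride
      if (numPartitions == 1) "1=1" // single partition: read everything
      else if (i == 0) s"$column < $hi OR $column IS NULL"
      else if (i == numPartitions - 1) s"$column >= $lo"
      else s"$column >= $lo AND $column < $hi"
    }
  }

  def main(args: Array[String]): Unit =
    predicates("ID", 1L, 100L, 4).foreach(println)
}
```

Each predicate corresponds to one query sent over its own connection, which is why skewed ID values give unevenly sized partitions.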

You need to test it to find the optimum numPartitions and
ensure that you don't overload any component.
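
Before testing, a back-of-the-envelope estimate can help pick a starting value. The sketch below assumes IDs are spread evenly and uses the table sizes mentioned in this thread; LoadEstimate and its methods are hypothetical names, and the number of connections actually open at once is also capped by the executor cores available to Spark.

```scala
object LoadEstimate {
  // Ceiling division: rows each partition reads if IDs are spread evenly.
  def rowsPerPartition(totalRows: Long, numPartitions: Int): Long = {
    require(numPartitions >= 1, "numPartitions must be at least 1")
    (totalRows + numPartitions - 1) / numPartitions
  }

  // Upper limit on simultaneous JDBC connections if several tables are
  // loaded at the same time with the same numPartitions.
  def maxConnections(tablesInParallel: Int, numPartitions: Int): Int =
    tablesInParallel * numPartitions

  def main(args: Array[String]): Unit = {
    println(rowsPerPartition(100000000L, 20)) // largest table in the thread
    println(maxConnections(4, 20))            // all four tables at once
  }
}
```

If the resulting connection count looks too high for the single Oracle node you are allowed to use, load the tables one after another or lower numPartitions.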

HTH


Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 August 2016 at 21:15, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:

> Hi,
>
> There are 4 tables ranging from 10 million to 100 million rows but they
> all have primary keys.
>
> The network is fine but our Oracle is RAC and we can only connect to a
> designated Oracle node (where we have a DQ account only).
>
> We have a limited time window of a few hours to get the required data out.
>
> Thanks
>
>
> On Sunday, 14 August 2016, 21:07, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> How big are your tables, and is there any issue with the network between
> your Spark nodes and your Oracle DB that adds to the problem?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> On 14 August 2016 at 20:50, Ashok Kumar <ashok34...@yahoo.com.invalid>
> wrote:
>
> Hi Gurus,
>
> I have a few large tables in an RDBMS (ours is Oracle). We want to access these
> tables through Spark JDBC.
>
> What is the quickest way of getting the data into a Spark DataFrame, say with
> multiple connections from Spark?
>
> thanking you
>
>
>
>
>
>
