Ok Rajesh

This is standalone.

In that case it ought to be at least 4 connections, as each executor will use
one worker.

I am hesitant to say "at least" with certainty because, as with Standalone
mode, you may end up with more than one executor on each worker.

But try it and see whether "numPartitions" -> "4" is good, or change
this to something higher.
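
As a concrete sketch of the above (hypothetical: the URL, credentials and the
scratchpad.dummy table and ID bounds are placeholders borrowed from examples
later in this thread, not a verified setup):

```scala
// Sketch: a JDBC read split into 4 partitions, one ID range per connection.
// _ORACLEserver, _username and _password are assumed to be defined elsewhere.
val df = HiveContext.read.format("jdbc").options(
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT to_char(ID) AS ID, RANDOM_STRING, SMALL_VC FROM scratchpad.dummy)",
      "partitionColumn" -> "ID",
      "lowerBound" -> "1",
      "upperBound" -> "100000000",
      "numPartitions" -> "4",   // start with one per worker; raise if it helps
      "user" -> _username,
      "password" -> _password)).load
```

Then verify the actual concurrency on the Oracle side (e.g. by querying the
active sessions) rather than trusting numPartitions alone.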


HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 August 2016 at 12:19, Madabhattula Rajesh Kumar <mrajaf...@gmail.com>
wrote:

> Hi Mich,
>
> Thank you for detailed explanation. One more question
>
> In my cluster, I have one master and 4 workers. In this case, will 4
> connections be opened to Oracle?
>
> Regards,
> Rajesh
>
> On Mon, Aug 15, 2016 at 3:59 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> It happens that the number of parallel connections opened from Spark to the
>> RDBMS is determined by the number of executors.
>>
>> I just tested this.
>>
>> With Yarn client mode using two executors, I see two connections to the RDBMS:
>>
>>
>> EXECUTIONS USERNAME       SID SERIAL# USERS_EXECUTING SQL_TEXT
>> ---------- ---------- ------- ------- --------------- --------------------------------------------------
>>          1 SCRATCHPAD     443   62565               1 SELECT "RANDOMISED","RANDOM_STRING","PADDING","CLUSTERED","ID","SCATTERED","SMALL_VC"
>>                                                       FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
>>                                                       to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
>>                                                       RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)
>>                                                       WHERE ID >= 23000001 AND ID < 24000001
>>          1 SCRATCHPAD     406   46793               1 SELECT "RANDOMISED","RANDOM_STRING","PADDING","CLUSTERED","ID","SCATTERED","SMALL_VC"
>>                                                       FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
>>                                                       to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
>>                                                       RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)
>>                                                       WHERE ID >= 24000001 AND ID < 25000001
>>
>> So it sounds like (can someone else independently confirm this?) that,
>> regardless of what one specifies in "numPartitions", one ends up with one
>> connection from each Spark executor to the RDBMS.
>>
>> HTH
>>
>>
>>
>>
>>
>> On 15 August 2016 at 09:12, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Hi.
>>>
>>> This is a very good question.
>>>
>>> I did some tests on this.
>>>
>>> If you are joining two tables then you are creating a result set based
>>> on some conditions. In this case what I normally do is specify an ID
>>> column from one of the tables and base my partitioning on that ID column.
>>> This is pretty straightforward: bring back your ID column and base your
>>> lower and upper bounds on that ID value:
>>>
>>> "partitionColumn" -> "ID",
>>> "lowerBound" -> "1",
>>> "upperBound" -> "100000000",
>>> "numPartitions" -> "100",
>>>
>>>
>>>
>>> Also I have noticed that regardless of the number of partitions you
>>> specify, on the RDBMS side the number of parallel connections will be
>>> limited and the result set will be partitioned accordingly. For example
>>> with numPartitions=100, I see only 8 connections in Oracle coming from
>>> Spark.
>>>
>>> scala> val s = HiveContext.read.format("jdbc").options(
>>>      | Map("url" -> _ORACLEserver,
>>>      | "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS
>>> CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS
>>> RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>>      | "partitionColumn" -> "ID",
>>>      | "lowerBound" -> "1",
>>>      | "upperBound" -> "100000000",
>>>      | "numPartitions" -> "100",
>>>      | "user" -> _username,
>>>      | "password" -> _password)).load
>>> s: org.apache.spark.sql.DataFrame = [ID: string, CLUSTERED: string ... 5
>>> more fields]
>>> scala> s.toJavaRDD.partitions.size()
>>> res1: Int = 100
>>>
>>> This also seems to set the number of partitions. I still think that the
>>> emphasis has to be on getting data out of the RDBMS as quickly as possible.
>>> The partitioning does work: below, the login SCRATCHPAD has multiple
>>> connections to Oracle and does the range selection OK:
>>>
>>>          1 SCRATCHPAD      45   43048               1 SELECT "SMALL_VC","CLUSTERED","PADDING","RANDOM_STRING","ID","SCATTERED","RANDOMISED"
>>>                                                       FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
>>>                                                       to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
>>>                                                       RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)
>>>                                                       *WHERE ID >= 16000001 AND ID < 17000001*
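>>>
>>> The per-partition predicates above follow from simple stride arithmetic. A
>>> small illustrative sketch (my own approximation for this thread, not
>>> Spark's actual implementation) of how lowerBound/upperBound/numPartitions
>>> become one WHERE clause per partition:

```scala
// Illustration only: approximate how one WHERE clause per partition is
// derived from lowerBound, upperBound and numPartitions.
def partitionPredicates(column: String, lower: Long, upper: Long, num: Int): Seq[String] = {
  val stride = (upper - lower + 1) / num   // IDs per partition
  (0 until num).map { i =>
    val start = lower + i * stride
    if (i == num - 1) s"$column >= $start" // last partition takes the remainder
    else s"$column >= $start AND $column < ${start + stride}"
  }
}

// With lowerBound=1, upperBound=100000000 and numPartitions=100, partition 16
// covers exactly the range seen in the Oracle session above:
//   ID >= 16000001 AND ID < 17000001
```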
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 15 August 2016 at 08:18, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I would suggest you also look at Sqoop. Essentially, you provide a
>>>> split-by/partition-by column, and the data will be distributed among your
>>>> stated number of mappers.
>>>>
>>>> On Mon, Aug 15, 2016 at 5:07 PM, Madabhattula Rajesh Kumar <
>>>> mrajaf...@gmail.com> wrote:
>>>>
>>>>> Hi Mich,
>>>>>
>>>>> I have a below question.
>>>>>
>>>>> I want to join two tables and return the result based on an input
>>>>> value. In this case, how do we specify the lower bound and upper bound
>>>>> values?
>>>>>
>>>>> select t1.id, t1.name, t2.course, t2.qualification from t1, t2 where
>>>>> t1.transactionid=*11111* and t1.id = t2.id
>>>>>
>>>>> *11111 => dynamic input value.*
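>>>>>
>>>>> A sketch of one possible approach (assumptions throughout: placeholder
>>>>> server/credentials, and real bounds would come from MIN/MAX of t1.id):
>>>>> push the filtered join down as the "dbtable" subquery, then partition
>>>>> the joined result on the join key.

```scala
// Sketch: push the filtered join into the JDBC subquery; Spark then
// partitions the joined result set on t1.id. transactionId is the
// dynamic input value.
val transactionId = 11111
val joinSQL =
  s"""(SELECT t1.id AS ID, t1.name, t2.course, t2.qualification
     |FROM t1, t2
     |WHERE t1.transactionid = $transactionId AND t1.id = t2.id)""".stripMargin

val joined = HiveContext.read.format("jdbc").options(
  Map("url" -> _ORACLEserver,
      "dbtable" -> joinSQL,
      "partitionColumn" -> "ID",
      "lowerBound" -> "1",            // placeholder: use SELECT MIN(id) FROM t1
      "upperBound" -> "100000000",    // placeholder: use SELECT MAX(id) FROM t1
      "numPartitions" -> "10",
      "user" -> _username,
      "password" -> _password)).load
```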
>>>>>
>>>>> Regards,
>>>>> Rajesh
>>>>>
>>>>> On Mon, Aug 15, 2016 at 12:05 PM, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> If you have your RDBMS table partitioned, then you need to consider
>>>>>> how much data you want to extract, in other words the result set
>>>>>> returned by the JDBC call.
>>>>>>
>>>>>> If you want all the data, then the number of partitions specified in
>>>>>> the JDBC call should be equal to the number of partitions in your RDBMS
>>>>>> table.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 14 August 2016 at 21:44, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>>>>>
>>>>>>> Thank you very much sir.
>>>>>>>
>>>>>>> I forgot to mention that two of these Oracle tables are range
>>>>>>> partitioned. In that case, what would be the optimum number of
>>>>>>> partitions, if you can share?
>>>>>>>
>>>>>>> Warmest
>>>>>>>
>>>>>>>
>>>>>>> On Sunday, 14 August 2016, 21:37, Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> If you have primary keys on these tables then you can parallelise
>>>>>>> the process reading data.
>>>>>>>
>>>>>>> You have to be careful not to set the number of partitions too high.
>>>>>>> Certainly there is a balance between the number of partitions supplied
>>>>>>> to JDBC and the load on the network and the source DB.
>>>>>>>
>>>>>>> Assuming that your underlying table has primary key ID, then this
>>>>>>> will create 20 parallel processes to Oracle DB
>>>>>>>
>>>>>>>  val d = HiveContext.read.format("jdbc").options(
>>>>>>>  Map("url" -> _ORACLEserver,
>>>>>>>  "dbtable" -> "(SELECT <COL1>, <COL2>, ... FROM <TABLE>)",
>>>>>>>  "partitionColumn" -> "ID",
>>>>>>>  "lowerBound" -> "1",
>>>>>>>  "upperBound" -> maxID.toString,
>>>>>>>  "numPartitions" -> "20",
>>>>>>>  "user" -> _username,
>>>>>>>  "password" -> _password)).load
>>>>>>>
>>>>>>> assuming your upper bound on ID is held in a variable maxID (note it
>>>>>>> must be passed as a string value, not the literal text "maxID").
>>>>>>>
>>>>>>>
>>>>>>> This will open multiple connections to RDBMS, each getting a subset
>>>>>>> of data that you want.
>>>>>>>
>>>>>>> You need to test it to ensure that you get the numPartitions optimum
>>>>>>> and you don't overload any component.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 14 August 2016 at 21:15, Ashok Kumar <
>>>>>>> ashok34...@yahoo.com.invalid> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> There are 4 tables ranging from 10 million to 100 million rows but
>>>>>>> they all have primary keys.
>>>>>>>
>>>>>>> The network is fine but our Oracle is RAC and we can only connect to
>>>>>>> a designated Oracle node (where we have a DQ account only).
>>>>>>>
>>>>>>> We have a limited time window of a few hours to get the required data
>>>>>>> out.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>> On Sunday, 14 August 2016, 21:07, Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> How big are your tables, and is there any issue with the network
>>>>>>> between your Spark nodes and your Oracle DB that could add to delays?
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 14 August 2016 at 20:50, Ashok Kumar <
>>>>>>> ashok34...@yahoo.com.invalid> wrote:
>>>>>>>
>>>>>>> Hi Gurus,
>>>>>>>
>>>>>>> I have a few large tables in an RDBMS (ours is Oracle). We want to
>>>>>>> access these tables through Spark JDBC.
>>>>>>>
>>>>>>> What is the quickest way of getting data into a Spark DataFrame, say
>>>>>>> with multiple connections from Spark?
>>>>>>>
>>>>>>> thanking you
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>
>
