Re: spark session jdbc performance

Gourav Sengupta Wed, 25 Oct 2017 13:21:49 -0700

Hi Lucas,

So if I am assuming things, can you please explain why the UI is showing
only one partition to run the query?



Regards,
Gourav Sengupta

On Wed, Oct 25, 2017 at 6:03 PM, lucas.g...@gmail.com <lucas.g...@gmail.com>
wrote:

> Gourav, I'm assuming you misread the code.  It's 30 partitions, which
> isn't a ridiculous value.  Maybe you misread the upperBound for the
> partitions?  (That would be ridiculous)
>
> Why not use the PK as the partition column?  Obviously it depends on the
> downstream queries.  If you're going to be performing joins (which I assume
> is the case) then partitioning on the join column would be advisable, but
> what about the case where the join column would be heavily skewed?
>
> Thanks!
>
> Gary
>
> On 24 October 2017 at 23:41, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi Naveen,
>>
>> I do not think that it is prudent to use the PK as the partitionColumn.
>> That is too many partitions for any system to handle. With JDBC,
>> numPartitions behaves very differently.
>>
>> Please keep me updated on how things go.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Tue, Oct 24, 2017 at 10:54 PM, Naveen Madhire <vmadh...@umail.iu.edu>
>> wrote:
>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> I am trying to fetch data from an Oracle DB using a subquery and
>>> am experiencing a lot of performance issues.
>>>
>>>
>>>
>>> Below is the query I am using:
>>>
>>>
>>>
>>> Using Spark 2.0.2
>>>
>>>
>>>
>>> val df = spark_session.read.format("jdbc")
>>>   .option("driver", "oracle.jdbc.OracleDriver")
>>>   .option("url", jdbc_url)
>>>   .option("user", user)
>>>   .option("password", pwd)
>>>   .option("dbtable", "subquery")
>>>   .option("partitionColumn", "id")  // primary key column, uniformly distributed
>>>   .option("lowerBound", "1")
>>>   .option("upperBound", "500000")
>>>   .option("numPartitions", 30)
>>>   .load()
>>>
>>>
>>>
>>> The above query is set up to use 30 partitions, but when I look at the
>>> UI it is only using 1 partition to run the query.
>>>
>>>
>>>
>>> Can anyone tell me if I am missing anything, or whether I need to do
>>> anything else to tune the performance of the query?
>>>
>>> Thanks
>>>
>>
>>
>
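
For reference, the partitioning options in the quoted snippet (partitionColumn, lowerBound, upperBound, numPartitions) can be read roughly as follows. This is a minimal sketch in plain Scala, assuming the stride-based splitting that the Spark JDBC documentation describes; it is not Spark's actual source, and the object and method names (JdbcPartitionSketch, predicates) are hypothetical. It only illustrates that the number of generated range predicates is bounded by numPartitions, regardless of how many distinct values the column holds, and that the bounds set the stride rather than filter rows.

```scala
// Simplified illustration only; names are hypothetical, not Spark APIs.
object JdbcPartitionSketch {

  /** Builds one WHERE predicate per partition for a numeric column. */
  def predicates(column: String,
                 lowerBound: Long,
                 upperBound: Long,
                 numPartitions: Int): Seq[String] = {
    require(numPartitions >= 1, "numPartitions must be positive")
    require(lowerBound < upperBound, "lowerBound must be below upperBound")

    // Never create more partitions than there are values in the range.
    val n = math.min(numPartitions.toLong, upperBound - lowerBound).toInt
    // Width of each range; the bounds are used only to derive this stride.
    val stride = upperBound / n - lowerBound / n

    (0 until n).map { i =>
      val lo = lowerBound + i * stride
      val hi = lo + stride
      // First and last ranges are open-ended, so rows outside the
      // bounds still land in some partition instead of being dropped.
      val lower = if (i == 0) None else Some(s"$column >= $lo")
      val upper = if (i == n - 1) None else Some(s"$column < $hi")
      (lower, upper) match {
        case (Some(l), Some(u)) => s"$l AND $u"
        case (None, Some(u))    => s"$u OR $column IS NULL"
        case (Some(l), None)    => l
        case (None, None)       => "1 = 1" // single partition, no filter
      }
    }
  }
}
```

Calling JdbcPartitionSketch.predicates("id", 1L, 500000L, 30) with the values from the snippet yields 30 predicates, one per partition. On a real DataFrame, df.rdd.getNumPartitions reports how many partitions Spark actually planned, which can be compared against what the UI shows.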
