Hi Lucas, so if I am assuming things, can you please explain why the UI is showing only one partition to run the query?
Regards, Gourav Sengupta On Wed, Oct 25, 2017 at 6:03 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote: > Gourav, I'm assuming you misread the code. It's 30 partitions, which > isn't a ridiculous value. Maybe you misread the upperBound for the > partitions? (That would be ridiculous) > > Why not use the PK as the partition column? Obviously it depends on the > downstream queries. If you're going to be performing joins (which I assume > is the case) then partitioning on the join column would be advisable, but > what about the case where the join column would be heavily skewed? > > Thanks! > > Gary > > On 24 October 2017 at 23:41, Gourav Sengupta <gourav.sengu...@gmail.com> > wrote: > >> Hi Naveen, >> >> I do not think that it is prudent to use the PK as the partitionColumn. >> That is too many partitions for any system to handle. The numPartitions >> will be valid in case of JDBC very differently. >> >> Please keep me updated on how things go. >> >> >> Regards, >> Gourav Sengupta >> >> On Tue, Oct 24, 2017 at 10:54 PM, Naveen Madhire <vmadh...@umail.iu.edu> >> wrote: >> >>> >>> Hi, >>> >>> >>> >>> I am trying to fetch data from Oracle DB using a subquery and >>> experiencing lot of performance issues. >>> >>> >>> >>> Below is the query I am using, >>> >>> >>> >>> *Using Spark 2.0.2* >>> >>> >>> >>> *val *df = spark_session.read.format(*"jdbc"*) >>> .option(*"driver"*,*"*oracle.jdbc.OracleDriver*"*) >>> .option(*"url"*, jdbc_url) >>> .option(*"user"*, user) >>> .option(*"password"*, pwd) >>> .option(*"dbtable"*, *"subquery"*) >>> .option(*"partitionColumn"*, *"id"*) //primary key column uniformly >>> distributed >>> .option(*"lowerBound"*, *"1"*) >>> .option(*"upperBound"*, *"500000"*) >>> .option(*"numPartitions"*, 30) >>> .load() >>> >>> >>> >>> The above query is running using the 30 partitions, but when I see the >>> UI it is only using 1 partiton to run the query. >>> >>> >>> >>> Can anyone tell if I am missing anything or do I need to anything else >>> to tune the performance of the query. >>> >>> *Thanks* >>> >> >> >