On 24 September 2015 at 17:48, Deenar Toraskar <deenar.toras...@thinkreactive.co.uk> wrote:
> you are interpreting the JdbcRDD API incorrectly. If you want to use
> partitions, the column used to partition must be numeric and present in the
> where clause, and the lower bound and upper bound must be the min and max
> values of that column. Spark will distribute the range equally over the
> number of partitions selected, and bind each partition's sub-range endpoints
> to the two placeholders. Since your query is "LIMIT ? OFFSET ?" and your call
> passes lowerBound = 100 and upperBound = 0, the first placeholder receives
> the LIMIT value and the second the OFFSET.
>
> numPartitions 1 - your query is called once with placeholders (100, 0); this
> explains how you get 100 rows:
>
>   select * from schema.Table LIMIT 100 OFFSET 0    -- 100 rows
>
> numPartitions 2 - your query is called twice, with placeholders (100, 50) and
> (51, 0); this explains why you get 151 records:
>
>   select * from schema.Table LIMIT 100 OFFSET 50   -- 100 rows
>   select * from schema.Table LIMIT 51 OFFSET 0     --  51 rows
>
> numPartitions 3 - your query is called three times:
>
>   select * from schema.Table LIMIT 100 OFFSET 66   -- 100 rows
>   select * from schema.Table LIMIT 67 OFFSET 33    --  67 rows
>   select * from schema.Table LIMIT 34 OFFSET 0     --  34 rows
>
> That explains why you get 201 records: the slices overlap, so some rows are
> fetched by more than one partition. You need to amend the query and provide
> correct lower and upper bounds aligned to the column used in the where clause.
>
> See
> http://www.sparkexpert.com/2015/01/02/load-database-data-into-spark-using-jdbcrdd-in-java/
>
> Deenar
>
> Think Reactive Ltd
> deenar.toras...@thinkreactive.co.uk
> 07714140812
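For reference, a minimal sketch of the inclusive-range pattern described above. The books table, id column, bounds, connection URL, and credentials below are illustrative placeholders, not values from this thread:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.JdbcRDD

    val sc = new SparkContext(new SparkConf().setAppName("JdbcRDDExample"))
    val url = "jdbc:oracle:thin:@//dbhost:1521/service"  // placeholder URL
    val (user, pass) = ("user", "secret")                // placeholder credentials

    // Both placeholders form an inclusive range on a numeric column; Spark
    // binds each partition's sub-range of [lowerBound, upperBound] to them,
    // so every row is read exactly once regardless of numPartitions.
    val books = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url, user, pass),
      "select title, author from books where ? <= id and id <= ?",
      1,     // lowerBound: min(id) in the table
      1000,  // upperBound: max(id) in the table
      3,     // numPartitions
      (r: ResultSet) => (r.getString("title"), r.getString("author")))
    println(books.count())  // stable count, whatever numPartitions is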
> On 24 September 2015 at 11:55, satish chandra j <jsatishchan...@gmail.com> wrote:
>
>> HI Deenar,
>>
>> Please find the SQL query below:
>>
>>   var SQL_RDD = new JdbcRDD(
>>     sc,
>>     () => DriverManager.getConnection(url, user, pass),
>>     "select col1, col2, col3 ...... col37 from schema.Table LIMIT ? OFFSET ?",
>>     100, 0, 1,
>>     (r: ResultSet) => (r.getInt("col1"), r.getInt("col2"), ...... r.getInt("col37")))
>>
>> When I have the above 100, 0, 1 I get SQL_RDD.count as 100.
>> When set to 100, 0, 2 I get SQL_RDD.count as 151.
>> When set to 100, 0, 3 I get SQL_RDD.count as 201.
>>
>> I expect the count to be 100 on every execution; let me know if I am
>> missing anything here.
>>
>> Regards,
>> Satish Chandra
>>
>> On Thu, Sep 24, 2015 at 12:48 AM, Deenar Toraskar <deenar.toras...@thinkreactive.co.uk> wrote:
>>
>>> Satish
>>>
>>> Can you post the SQL query you are using?
>>>
>>> The SQL query must have 2 placeholders, and together they must form an
>>> inclusive range (>= and <=), e.g.
>>>
>>>   select title, author from books where ? <= id and id <= ?
>>>
>>> Are you doing this?
>>>
>>> Deenar
>>>
>>> Think Reactive Ltd
>>> deenar.toras...@thinkreactive.co.uk
>>> 07714140812
>>>
>>> On 23 September 2015 at 13:47, satish chandra j <jsatishchan...@gmail.com> wrote:
>>>
>>>> HI,
>>>> Could anybody provide inputs if they have come across a similar issue?
>>>>
>>>> @Rishitesh
>>>> Could you provide some sample code for using JdbcRDDSuite?
>>>>
>>>> Regards,
>>>> Satish Chandra
>>>>
>>>> On Wed, Sep 23, 2015 at 5:14 PM, Rishitesh Mishra <rishi80.mis...@gmail.com> wrote:
>>>>
>>>>> I am using Spark 1.5. I always get count = 100, irrespective of the
>>>>> number of partitions.
>>>>>
>>>>> On Wed, Sep 23, 2015 at 5:00 PM, satish chandra j <jsatishchan...@gmail.com> wrote:
>>>>>
>>>>>> HI,
>>>>>> We are currently using Spark 1.2.2. Could you please let me know the
>>>>>> count you get when running JdbcRDDSuite?
>>>>>>
>>>>>> Regards,
>>>>>> Satish Chandra
>>>>>>
>>>>>> On Wed, Sep 23, 2015 at 4:02 PM, Rishitesh Mishra <rishi80.mis...@gmail.com> wrote:
>>>>>>
>>>>>>> Which version of Spark are you using? I can get correct results using
>>>>>>> JdbcRDD. In fact there is a test suite precisely for this
>>>>>>> (JdbcRDDSuite). I changed it according to your input and got correct
>>>>>>> results from that test suite.
>>>>>>>
>>>>>>> On Wed, Sep 23, 2015 at 11:00 AM, satish chandra j <jsatishchan...@gmail.com> wrote:
>>>>>>>
>>>>>>>> HI All,
>>>>>>>>
>>>>>>>> The JdbcRDD constructor has the following parameters:
>>>>>>>>
>>>>>>>>   JdbcRDD(SparkContext sc,
>>>>>>>>           scala.Function0<java.sql.Connection> getConnection,
>>>>>>>>           String sql,
>>>>>>>>           long lowerBound,
>>>>>>>>           long upperBound,
>>>>>>>>           int numPartitions,
>>>>>>>>           scala.Function1<java.sql.ResultSet,T> mapRow,
>>>>>>>>           scala.reflect.ClassTag<T> evidence$1)
>>>>>>>>
>>>>>>>> (see
>>>>>>>> https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/rdd/JdbcRDD.html)
>>>>>>>>
>>>>>>>> where lowerBound refers to the lower boundary of the entire data,
>>>>>>>> upperBound refers to the upper boundary of the entire data, and
>>>>>>>> numPartitions refers to the number of partitions.
>>>>>>>>
>>>>>>>> The source table from which JdbcRDD fetches data in an Oracle DB has
>>>>>>>> more than 500 records, but the results are confusing when I try
>>>>>>>> several executions changing the numPartitions parameter:
>>>>>>>>
>>>>>>>> LowerBound, UpperBound, numPartitions : Output Count
>>>>>>>> 0, 100, 1 : 100
>>>>>>>> 0, 100, 2 : 151
>>>>>>>> 0, 100, 3 : 201
>>>>>>>>
>>>>>>>> Please help me understand why the output count is 151 if
>>>>>>>> numPartitions is 2 and 201 if numPartitions is 3.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Satish Chandra
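The counts reported above follow directly from how JdbcRDD splits [lowerBound, upperBound] into inclusive per-partition ranges (JdbcRDD.getPartitions in Spark 1.x). Below is a standalone sketch of that arithmetic; note that the call posted earlier in the thread passes the bounds as (100, 0), so with "LIMIT ? OFFSET ?" the first placeholder becomes the LIMIT and the second the OFFSET:

    // Mirrors the per-partition bound computation in Spark 1.x's JdbcRDD:
    // length is upper - lower + 1 because both bounds are inclusive.
    def partitionBounds(lower: Long, upper: Long, numPartitions: Int): Seq[(Long, Long)] = {
      val length = BigInt(1) + upper - lower
      (0 until numPartitions).map { i =>
        val start = lower + (BigInt(i) * length / numPartitions).toLong
        val end   = lower + (BigInt(i + 1) * length / numPartitions).toLong - 1
        (start, end)
      }
    }

    partitionBounds(100, 0, 1)  // Vector((100,0))
                                //   LIMIT 100 OFFSET 0 -> 100 rows
    partitionBounds(100, 0, 2)  // Vector((100,50), (51,0))
                                //   100 rows + 51 rows -> 151, row 51 read twice
    partitionBounds(100, 0, 3)  // Vector((100,66), (67,33), (34,0))
                                //   100 + 67 + 34 -> 201, with overlapping rows

With a proper inclusive-range query on a numeric column, each partition reads a disjoint slice of the table and the total count stays at the true row count.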