On 24 September 2015 at 17:48, Deenar Toraskar <deenar.toras...@thinkreactive.co.uk> wrote:
> you are interpreting the JdbcRDD API incorrectly. If you want to use
> partitions, the column used to partition must be numeric and present in the
> where clause, and the lower bound and upper bound must be the min and max
> values of that column. Spark will distribute the range equally over the
> number of partitions selected, and bind each partition's sub-range endpoints
> to the two placeholders. Since your query is "LIMIT ? OFFSET ?" and your call
> passes lowerBound = 100 and upperBound = 0, the first placeholder receives
> the LIMIT value and the second the OFFSET.
>
> numPartitions 1 - your query is called once with placeholders (100, 0); this
> explains how you get 100 rows:
>
>   select * from schema.Table LIMIT 100 OFFSET 0    -- 100 rows
>
> numPartitions 2 - your query is called twice, with placeholders (100, 50) and
> (51, 0); this explains why you get 151 records:
>
>   select * from schema.Table LIMIT 100 OFFSET 50   -- 100 rows
>   select * from schema.Table LIMIT 51 OFFSET 0     --  51 rows
>
> numPartitions 3 - your query is called three times:
>
>   select * from schema.Table LIMIT 100 OFFSET 66   -- 100 rows
>   select * from schema.Table LIMIT 67 OFFSET 33    --  67 rows
>   select * from schema.Table LIMIT 34 OFFSET 0     --  34 rows
>
> That explains why you get 201 records: the slices overlap, so some rows are
> fetched by more than one partition. You need to amend the query and provide
> correct lower and upper bounds aligned to the column used in the where clause.
>
> See
> http://www.sparkexpert.com/2015/01/02/load-database-data-into-spark-using-jdbcrdd-in-java/
>
> Deenar
>
> Think Reactive Ltd
> deenar.toras...@thinkreactive.co.uk
> 07714140812
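For reference, a minimal sketch of the inclusive-range pattern described above. The books table, id column, bounds, connection URL, and credentials below are illustrative placeholders, not values from this thread:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.JdbcRDD

    val sc = new SparkContext(new SparkConf().setAppName("JdbcRDDExample"))
    val url = "jdbc:oracle:thin:@//dbhost:1521/service"  // placeholder URL
    val (user, pass) = ("user", "secret")                // placeholder credentials

    // Both placeholders form an inclusive range on a numeric column; Spark
    // binds each partition's sub-range of [lowerBound, upperBound] to them,
    // so every row is read exactly once regardless of numPartitions.
    val books = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url, user, pass),
      "select title, author from books where ? <= id and id <= ?",
      1,     // lowerBound: min(id) in the table
      1000,  // upperBound: max(id) in the table
      3,     // numPartitions
      (r: ResultSet) => (r.getString("title"), r.getString("author")))
    println(books.count())  // stable count, whatever numPartitions is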
> On 24 September 2015 at 11:55, satish chandra j <jsatishchan...@gmail.com> wrote:
>
>> HI Deenar,
>>
>> Please find the SQL query below:
>>
>>   var SQL_RDD = new JdbcRDD(
>>     sc,
>>     () => DriverManager.getConnection(url, user, pass),
>>     "select col1, col2, col3 ...... col37 from schema.Table LIMIT ? OFFSET ?",
>>     100, 0, 1,
>>     (r: ResultSet) => (r.getInt("col1"), r.getInt("col2"), ...... r.getInt("col37")))
>>
>> When I have the above 100, 0, 1 I get SQL_RDD.count as 100.
>> When set to 100, 0, 2 I get SQL_RDD.count as 151.
>> When set to 100, 0, 3 I get SQL_RDD.count as 201.
>>
>> I expect the count to be 100 on every execution; let me know if I am
>> missing anything here.
>>
>> Regards,
>> Satish Chandra
>>
>> On Thu, Sep 24, 2015 at 12:48 AM, Deenar Toraskar <deenar.toras...@thinkreactive.co.uk> wrote:
>>
>>> Satish
>>>
>>> Can you post the SQL query you are using?
>>>
>>> The SQL query must have 2 placeholders, and together they must form an
>>> inclusive range (>= and <=), e.g.
>>>
>>>   select title, author from books where ? <= id and id <= ?
>>>
>>> Are you doing this?
>>>
>>> Deenar
>>>
>>> Think Reactive Ltd
>>> deenar.toras...@thinkreactive.co.uk
>>> 07714140812
>>>
>>> On 23 September 2015 at 13:47, satish chandra j <jsatishchan...@gmail.com> wrote:
>>>
>>>> HI,
>>>> Could anybody provide inputs if they have come across a similar issue?
>>>>
>>>> @Rishitesh
>>>> Could you provide some sample code for using JdbcRDDSuite?
>>>>
>>>> Regards,
>>>> Satish Chandra
>>>>
>>>> On Wed, Sep 23, 2015 at 5:14 PM, Rishitesh Mishra <rishi80.mis...@gmail.com> wrote:
>>>>
>>>>> I am using Spark 1.5. I always get count = 100, irrespective of the
>>>>> number of partitions.
>>>>>
>>>>> On Wed, Sep 23, 2015 at 5:00 PM, satish chandra j <jsatishchan...@gmail.com> wrote:
>>>>>
>>>>>> HI,
>>>>>> We are currently using Spark 1.2.2. Could you please let me know the
>>>>>> count you get when running JdbcRDDSuite?
>>>>>>
>>>>>> Regards,
>>>>>> Satish Chandra
>>>>>>
>>>>>> On Wed, Sep 23, 2015 at 4:02 PM, Rishitesh Mishra <rishi80.mis...@gmail.com> wrote:
>>>>>>
>>>>>>> Which version of Spark are you using? I can get correct results using
>>>>>>> JdbcRDD. In fact there is a test suite precisely for this
>>>>>>> (JdbcRDDSuite). I changed it according to your input and got correct
>>>>>>> results from that test suite.
>>>>>>>
>>>>>>> On Wed, Sep 23, 2015 at 11:00 AM, satish chandra j <jsatishchan...@gmail.com> wrote:
>>>>>>>
>>>>>>>> HI All,
>>>>>>>>
>>>>>>>> The JdbcRDD constructor has the following parameters:
>>>>>>>>
>>>>>>>>   JdbcRDD(SparkContext sc,
>>>>>>>>           scala.Function0<java.sql.Connection> getConnection,
>>>>>>>>           String sql,
>>>>>>>>           long lowerBound,
>>>>>>>>           long upperBound,
>>>>>>>>           int numPartitions,
>>>>>>>>           scala.Function1<java.sql.ResultSet,T> mapRow,
>>>>>>>>           scala.reflect.ClassTag<T> evidence$1)
>>>>>>>>
>>>>>>>> (see
>>>>>>>> https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/rdd/JdbcRDD.html)
>>>>>>>>
>>>>>>>> where lowerBound refers to the lower boundary of the entire data,
>>>>>>>> upperBound refers to the upper boundary of the entire data, and
>>>>>>>> numPartitions refers to the number of partitions.
>>>>>>>>
>>>>>>>> The source table from which JdbcRDD fetches data in an Oracle DB has
>>>>>>>> more than 500 records, but the results are confusing when I try
>>>>>>>> several executions changing the numPartitions parameter:
>>>>>>>>
>>>>>>>> LowerBound, UpperBound, numPartitions : Output Count
>>>>>>>> 0, 100, 1 : 100
>>>>>>>> 0, 100, 2 : 151
>>>>>>>> 0, 100, 3 : 201
>>>>>>>>
>>>>>>>> Please help me understand why the output count is 151 if
>>>>>>>> numPartitions is 2 and 201 if numPartitions is 3.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Satish Chandra
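The counts reported above follow directly from how JdbcRDD splits [lowerBound, upperBound] into inclusive per-partition ranges (JdbcRDD.getPartitions in Spark 1.x). Below is a standalone sketch of that arithmetic; note that the call posted earlier in the thread passes the bounds as (100, 0), so with "LIMIT ? OFFSET ?" the first placeholder becomes the LIMIT and the second the OFFSET:

    // Mirrors the per-partition bound computation in Spark 1.x's JdbcRDD:
    // length is upper - lower + 1 because both bounds are inclusive.
    def partitionBounds(lower: Long, upper: Long, numPartitions: Int): Seq[(Long, Long)] = {
      val length = BigInt(1) + upper - lower
      (0 until numPartitions).map { i =>
        val start = lower + (BigInt(i) * length / numPartitions).toLong
        val end   = lower + (BigInt(i + 1) * length / numPartitions).toLong - 1
        (start, end)
      }
    }

    partitionBounds(100, 0, 1)  // Vector((100,0))
                                //   LIMIT 100 OFFSET 0 -> 100 rows
    partitionBounds(100, 0, 2)  // Vector((100,50), (51,0))
                                //   100 rows + 51 rows -> 151, row 51 read twice
    partitionBounds(100, 0, 3)  // Vector((100,66), (67,33), (34,0))
                                //   100 + 67 + 34 -> 201, with overlapping rows

With a proper inclusive-range query on a numeric column, each partition reads a disjoint slice of the table and the total count stays at the true row count.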