[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555647#comment-14555647 ]

Josh Rosen commented on SPARK-7804:
-----------------------------------

If possible, we should hide the internal JDBCRDD (all caps) from the Javadoc; I've filed SPARK-7821 so that we remember to follow up on this. Slightly confusingly, Spark also has another class called JdbcRDD (note the different capitalization) which _is_ a public API: https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/rdd/JdbcRDD.html. Perhaps you meant to use that instead?

There might be a way to address your use case while continuing to use the public DataFrame APIs, but I don't know enough about your use case or the Spark SQL APIs to give a great answer. The Spark Users mailing list would probably be a better place for that discussion, though. In the meantime, I'm going to resolve this JIRA ticket as Not an Issue.

Incorrect results from JDBCRDD -- one record repeatly
-----------------------------------------------------
                Key: SPARK-7804
                URL: https://issues.apache.org/jira/browse/SPARK-7804
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.3.0, 1.3.1
           Reporter: Paul Wu
             Labels: JDBCRDD, sql

Getting only one record repeated in the RDD, with repeated field values. I have a table like:

{code}
attuid  name  email
12      john  j...@appp.com
23      tom   t...@appp.com
34      tony  t...@appp.com
{code}

My code:

{code}
JavaSparkContext sc = new JavaSparkContext(sparkConf);
String url = "";  // JDBC URL omitted in the report
java.util.Properties prop = new Properties();
List<JDBCPartition> partitionList = new ArrayList<JDBCPartition>();
partitionList.add(new JDBCPartition("1=1", 0));
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataTypes.createStructField("attuid", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("email", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);
JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(),
    JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop),
    schema,
    "USERS",
    new String[]{"attuid", "name", "email"},
    new Filter[]{},
    partitionList.toArray(new JDBCPartition[0]));
System.out.println("count before to Java RDD=" + jdbcRDD.cache().count());
JavaRDD<Row> jrdd = jdbcRDD.toJavaRDD();
System.out.println("count=" + jrdd.count());
List<Row> lr = jrdd.collect();
for (Row r : lr) {
    for (int ii = 0; ii < r.length(); ii++) {
        System.out.println(r.getString(ii));
    }
}
{code}

=== result is:

{code}
34
tony
t...@appp.com
34
tony
t...@appp.com
34
tony
t...@appp.com
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555680#comment-14555680 ]

Paul Wu commented on SPARK-7804:
--------------------------------

Unfortunately, JdbcRDD is limited by design: its lowerBound and upperBound parameters are longs, which is too restrictive. One of my team members implemented a more general version based on the same idea, but some of the team are worried about relying on a home-made solution. When we saw JDBCRDD, it looked like exactly what we wanted. In fact, I hope JDBCRDD can be made public, or JdbcRDD can be redesigned to handle the general case the way JDBCRDD does.
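For context on the long-bounds limitation: JdbcRDD splits the closed interval [lowerBound, upperBound] of a numeric key into numPartitions contiguous ranges, each of which becomes one query. The sketch below is a simplified, hypothetical re-implementation in plain Java of that range-splitting idea, not Spark's actual code; any key that cannot be expressed as a numeric range (string keys, arbitrary WHERE clauses) falls outside this scheme, which is what makes JDBCRDD's per-partition WHERE clauses attractive.

```java
import java.util.Arrays;

public class JdbcBoundsSketch {
    // Divide the closed interval [lower, upper] into numPartitions
    // contiguous ranges, in the style of JdbcRDD's partitioning.
    // Hypothetical helper for illustration only.
    public static long[][] splitBounds(long lower, long upper, int numPartitions) {
        long length = upper - lower + 1;
        long[][] bounds = new long[numPartitions][2];
        for (int i = 0; i < numPartitions; i++) {
            bounds[i][0] = lower + (i * length) / numPartitions;
            bounds[i][1] = lower + ((i + 1) * length) / numPartitions - 1;
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Each range would back one query of the form:
        //   SELECT ... WHERE id >= start AND id <= end
        for (long[] b : splitBounds(1, 100, 4)) {
            System.out.println(Arrays.toString(b));
            // prints [1, 25], [26, 50], [51, 75], [76, 100]
        }
    }
}
```

A more general design would let the caller supply the per-partition predicates directly, which is effectively what JDBCRDD's JDBCPartition list does.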
[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555154#comment-14555154 ]

Josh Rosen commented on SPARK-7804:
-----------------------------------

(I'm a bit new to Spark SQL internals, so please forgive me if this is off-base.)

I think that all-caps {{JDBCRDD}} is an internal SQL class that's not designed to be used by end users; it's marked as {{private[sql]}} in the code (https://github.com/apache/spark/blob/v1.3.1/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L208). Spark SQL's internal RDDs have {{compute()}} methods that return iterators which hand back the same mutable row object on each {{next()}} call. When you call {{cache()}} on this RDD, you end up with a cached RDD that contains that one mutable row object repeated many times, which leads to the duplicate records you're seeing here.

In a nutshell, I don't think the example given here is valid, because it uses an internal Spark SQL class in a way that it does not support. If you want to load data from JDBC and access it as an RDD, I think the right way is to use SQLContext.load to load the data from JDBC into a DataFrame, then call {{toJavaRDD}} on the resulting DataFrame. See https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases for more details.
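The mutable-object-reuse pitfall described above can be illustrated with a plain Java iterator. This is a toy stand-in, not Spark code: the iterator reuses one buffer across {{next()}} calls, so collecting its results without copying stores the same object repeatedly, exactly the failure mode in this ticket.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class MutableRowReuse {
    // Toy stand-in for an iterator that reuses one mutable buffer
    // across next() calls, as Spark SQL's internal row iterators do.
    public static class ReusingIterator implements Iterator<int[]> {
        private final int[] buffer = new int[1]; // reused on every call
        private int i = 0;
        public boolean hasNext() { return i < 3; }
        public int[] next() { buffer[0] = ++i; return buffer; }
    }

    public static void main(String[] args) {
        // Storing the rows without copying keeps three references to the
        // SAME object, so every element shows the last value written.
        List<int[]> collected = new ArrayList<>();
        Iterator<int[]> it = new ReusingIterator();
        while (it.hasNext()) collected.add(it.next());
        for (int[] row : collected) System.out.println(row[0]); // prints 3, 3, 3

        // Copying each row before storing it preserves the values.
        List<int[]> copied = new ArrayList<>();
        Iterator<int[]> it2 = new ReusingIterator();
        while (it2.hasNext()) copied.add(it2.next().clone());
        for (int[] row : copied) System.out.println(row[0]); // prints 1, 2, 3
    }
}
```

This is why {{cache()}} and {{collect()}} misbehave on such an RDD while {{foreach}} works: {{foreach}} consumes each row before the buffer is overwritten, whereas caching or collecting retains references past the next {{next()}} call.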
[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatly
[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555602#comment-14555602 ]

Paul Wu commented on SPARK-7804:
--------------------------------

Thanks -- you are right. The cache() was a problem, and I also cannot use {{List<Row> lr = jrdd.collect();}}. But

{code}
jrdd.foreach((Row r) -> {
    System.out.println(r.get(0) + "." + r.get(1) + " " + r.get(2));
});
{code}

or foreachPartition will work.

We really wanted to use DataFrame, but it does not have the partitioning options we need to improve performance. Using this class, we can take advantage of sending multiple queries, one per DB partition, at the same time. But since, as you said, this is internal code (I cannot see it in the Javadoc), I'm not sure what I can do now. I guess you can close this ticket. Thanks again!