[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatly

2015-05-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555647#comment-14555647
 ] 

Josh Rosen commented on SPARK-7804:
---

If possible, we should be hiding the internal JDBCRDD (all-caps) from the 
Javadoc; I've filed SPARK-7821 so that we remember to follow up on this.

Slightly confusingly, Spark also has another class called JdbcRDD (note the 
different capitalization) which _is_ a public API: 
https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/rdd/JdbcRDD.html. 
 Perhaps you meant to use that instead?

There might be a way to address your use-case while continuing to use the 
public DataFrame APIs, but I don't know enough about your use-case or Spark SQL 
APIs to provide a great answer.  The Spark Users mailing list would probably be 
a better place to have that discussion, though.

In the meantime, I'm going to resolve this JIRA ticket as Not an Issue.

 Incorrect results from JDBCRDD -- one record repeatly
 -

 Key: SPARK-7804
 URL: https://issues.apache.org/jira/browse/SPARK-7804
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Paul Wu
  Labels: JDBCRDD, sql

 Getting only one record repeated in the RDD, with repeated field values:
  
 I have a table like:
 {code}
 attuid  name email
 12  john   j...@appp.com
 23  tom   t...@appp.com
 34  tony  t...@appp.com
 {code}
 My code:
 {code}
 JavaSparkContext sc = new JavaSparkContext(sparkConf);
 String url = "...";  // JDBC URL elided in the original report
 java.util.Properties prop = new Properties();
 List<JDBCPartition> partitionList = new ArrayList<JDBCPartition>();
 partitionList.add(new JDBCPartition("1=1", 0));

 List<StructField> fields = new ArrayList<StructField>();
 fields.add(DataTypes.createStructField("attuid", DataTypes.StringType, true));
 fields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
 fields.add(DataTypes.createStructField("email", DataTypes.StringType, true));
 StructType schema = DataTypes.createStructType(fields);
 JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(),
         JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop),
         schema,
         "USERS",
         new String[]{"attuid", "name", "email"},
         new Filter[]{},
         partitionList.toArray(new JDBCPartition[0])
 );

 System.out.println("count before to Java RDD=" + jdbcRDD.cache().count());
 JavaRDD<Row> jrdd = jdbcRDD.toJavaRDD();
 System.out.println("count=" + jrdd.count());
 List<Row> lr = jrdd.collect();
 for (Row r : lr) {
     for (int ii = 0; ii < r.length(); ii++) {
         System.out.println(r.getString(ii));
     }
 }
 {code}
 ===
 result is :
 {code}
 34
 tony
  t...@appp.com
 34
 tony
  t...@appp.com
 34
 tony 
  t...@appp.com
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatly

2015-05-22 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555680#comment-14555680
 ] 

Paul Wu commented on SPARK-7804:


Unfortunately, JdbcRDD was poorly designed: its lowerBound and upperBound 
parameters are long values, which is too limiting. One of my team members 
implemented a general version based on the same idea, but some of the team are 
worried about a home-made solution. When we saw JDBCRDD, it looked like exactly 
what we wanted. In fact, I hope JDBCRDD can be made public, or JdbcRDD can be 
redesigned to handle the general case the way JDBCRDD does.









[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatly

2015-05-21 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555154#comment-14555154
 ] 

Josh Rosen commented on SPARK-7804:
---

(I'm a bit new to Spark SQL internals, so please forgive me if this is off-base)

I think that all-caps {{JDBCRDD}} is an internal SQL class that's not designed 
to be used by end users; it's marked as {{private\[sql]}} in the code 
(https://github.com/apache/spark/blob/v1.3.1/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L208).

Spark SQL's internal RDDs have {{compute()}} methods that return iterators 
which reuse the same mutable row object on each {{next()}} call.  When you call 
{{cache()}} on this RDD, you end up with a cached RDD that contains that one 
mutable row object repeated many times, leading to the duplicate records that 
you're seeing here.

In a nutshell, I don't think that the example given here is valid because it's 
using an internal Spark SQL class in a way that it does not support.
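The row-reuse pitfall described above can be illustrated without Spark at all. 
The following is a minimal, hypothetical Java sketch ({{MutableRow}} and 
{{reusingIterator}} are stand-ins invented for this example, not Spark classes): 
an iterator that overwrites one shared object on each {{next()}} call makes any 
collection of its results degenerate into the last row repeated, while copying 
each row before storing it preserves the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class RowReuseDemo {
    // Hypothetical mutable row: mimics an iterator that reuses one buffer.
    static class MutableRow {
        String value;
        MutableRow copy() {
            MutableRow r = new MutableRow();
            r.value = value;
            return r;
        }
        @Override
        public String toString() { return value; }
    }

    // Returns an iterator that hands back the SAME object every time,
    // mutating its field on each next() call.
    static Iterator<MutableRow> reusingIterator(String[] data) {
        MutableRow shared = new MutableRow();
        Iterator<String> it = Arrays.asList(data).iterator();
        return new Iterator<MutableRow>() {
            public boolean hasNext() { return it.hasNext(); }
            public MutableRow next() { shared.value = it.next(); return shared; }
        };
    }

    public static void main(String[] args) {
        String[] names = {"john", "tom", "tony"};

        // Materializing the raw iterator stores three references to one object.
        List<MutableRow> cached = new ArrayList<>();
        reusingIterator(names).forEachRemaining(cached::add);
        System.out.println(cached);  // [tony, tony, tony] -- last row repeated

        // Defensive copies keep each row's values distinct.
        List<MutableRow> copied = new ArrayList<>();
        reusingIterator(names).forEachRemaining(r -> copied.add(r.copy()));
        System.out.println(copied);  // [john, tom, tony]
    }
}
```

This is why caching or collecting such an RDD shows one repeated record, while 
per-element processing (printing inside a loop as each row arrives) appears to 
work.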

If you want to load data from JDBC and access it as an RDD, I think the right 
way to do this is to use {{SQLContext.load}} to load the data from JDBC into a 
DataFrame, then to call {{toRDD}} on the resulting DataFrame.  See 
https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
 for more details.







[jira] [Commented] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatly

2015-05-21 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555602#comment-14555602
 ] 

Paul Wu commented on SPARK-7804:


Thanks -- you are right. The cache() was the problem, and I also cannot use 
{{List<Row> lr = jrdd.collect();}}. But

{code}
jrdd.foreach((Row r) -> {
    System.out.println(r.get(0) + " . " + r.get(1) + " " + r.get(2));
});
{code}

or foreachPartition will work.

We really wanted to use DataFrame, but it does not have the partition options 
that we need to improve performance. Using this class, we can take advantage of 
sending multiple queries to each db partition at the same time. But as you said, 
this is internal code (I cannot see it in the Javadoc), so I'm not sure what I 
can do now.

I guess you guys can close this ticket. Thanks again!



