[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-20 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20679083
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +135,20 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Returns a new RDD of JSON strings, one string per row
+  *
+  * @group schema
+  */
+  def toJSON: RDD[String] = {
+    val rowSchema = this.schema
+    this.mapPartitions { iter =>
+      val jsonFactory = new JsonFactory()
+      iter.map(JsonRDD.rowToJSON(rowSchema, jsonFactory))
+    }
+
--- End diff --

Extra line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-20 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20679092
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -35,6 +37,8 @@ import org.apache.spark.sql.catalyst.analysis._
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.{Inner, JoinType}
 import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.types.UserDefinedType
--- End diff --

Unused imports.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-20 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20679073
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSchemaRDD.scala ---
@@ -126,6 +126,12 @@ class JavaSchemaRDD(
   // Transformations (return a new RDD)
 
   /**
+   * Return a new RDD that is the schema transformed to JSON
--- End diff --

`Returns an RDD with each row transformed to a JSON string.`





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-20 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63884435
  
Since we are about to cut a 1.2 preview I'll make the final changes while 
merging.  Thanks for working on this!  I think it'll be a pretty popular 
feature.  I used it last night :)

Merging to master and 1.2





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-20 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63884730
  
Michael, thanks for being willing to pick up the final changes!

I'm happy to get a chance to contribute again.  Hopefully the next PR won't
require so much of your time.

Cheers,
Dan

On Thu, Nov 20, 2014 at 1:38 PM, Michael Armbrust notificati...@github.com wrote:

> Since we are about to cut a 1.2 preview I'll make the final changes while
> merging. Thanks for working on this! I think it'll be a pretty popular
> feature. I used it last night :)
>
> Merging to master and 1.2







[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3213





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63716980
  
Is this good to merge?





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20609235
  
--- Diff: python/pyspark/sql.py ---
@@ -1870,6 +1870,10 @@ def limit(self, num):
         rdd = self._jschema_rdd.baseSchemaRDD().limit(num).toJavaSchemaRDD()
         return SchemaRDD(rdd, self.sql_ctx)
 
+    def toJSON(self, use_unicode=False):
+        rdd = self._jschema_rdd.baseSchemaRDD().toJSON()
--- End diff --

Could you add some simple tests here? They will also serve as examples in the docs.
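
A doctest of the kind requested here might look roughly like the sketch below. It is a plain-Python stand-in, not the PySpark method: `rows_to_json` and its `field_names` parameter are hypothetical names used only for illustration, and a real doctest in `sql.py` would need a live `SQLContext` rather than tuples. It only shows the shape of the expected result, one JSON string per row.

```python
import json


def rows_to_json(rows, field_names):
    """Return one JSON string per row (hypothetical stand-in for toJSON).

    >>> rows_to_json([("Alice", 2), ("Bob", None)], ["name", "age"])
    ['{"name": "Alice", "age": 2}', '{"name": "Bob", "age": null}']
    """
    # Pair each field name with the value in the same position, then
    # serialize the resulting mapping; None becomes JSON null.
    return [json.dumps(dict(zip(field_names, row))) for row in rows]
```

Saved to a file, the docstring example can be checked with `python -m doctest <file>`.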





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread dwmclary
Github user dwmclary commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20610021
  
--- Diff: python/pyspark/sql.py ---
@@ -1870,6 +1870,10 @@ def limit(self, num):
         rdd = self._jschema_rdd.baseSchemaRDD().limit(num).toJavaSchemaRDD()
         return SchemaRDD(rdd, self.sql_ctx)
 
+    def toJSON(self, use_unicode=False):
+        rdd = self._jschema_rdd.baseSchemaRDD().toJSON()
--- End diff --

Sure thing.

On Wed, Nov 19, 2014 at 1:34 PM, Davies Liu notificati...@github.com wrote:

> In python/pyspark/sql.py:
>
> > @@ -1870,6 +1870,10 @@ def limit(self, num):
> >          rdd = self._jschema_rdd.baseSchemaRDD().limit(num).toJavaSchemaRDD()
> >          return SchemaRDD(rdd, self.sql_ctx)
> >
> > +    def toJSON(self, use_unicode=False):
> > +        rdd = self._jschema_rdd.baseSchemaRDD().toJSON()
>
> Could you put some simple tests (also will be examples in docs) here?







[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63727367
  
[Test build #23637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23637/consoleFull) for PR 3213 at commit [`f9471d3`](https://github.com/apache/spark/commit/f9471d345a58ba3b76b298b3ebadf2674b0f79bb).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63738691
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23637/
Test FAILed.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63738684
  
[Test build #23637 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23637/consoleFull) for PR 3213 at commit [`f9471d3`](https://github.com/apache/spark/commit/f9471d345a58ba3b76b298b3ebadf2674b0f79bb).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63744673
  
This may be an intermittent diff; it's not in the code path modified in
this PR.

On Wed, Nov 19, 2014 at 4:03 PM, UCB AMPLab notificati...@github.com wrote:

> Test FAILed.
> Refer to this link for build results (access rights to CI server needed):
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23637/
> Test FAILed.







[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63748417
  
Ugh, yeah, just wasn't paying attention.  Fixed now.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63748781
  
[Test build #23650 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23650/consoleFull) for PR 3213 at commit [`cac2879`](https://github.com/apache/spark/commit/cac2879694668f392f9163ee2264874d3376e9ac).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63748886
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23650/
Test FAILed.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63748882
  
[Test build #23650 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23650/consoleFull) for PR 3213 at commit [`cac2879`](https://github.com/apache/spark/commit/cac2879694668f392f9163ee2264874d3376e9ac).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63755364
  
[Test build #23656 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23656/consoleFull) for PR 3213 at commit [`d714e1d`](https://github.com/apache/spark/commit/d714e1dcfaed40efdd137b5a83f1c334f7ae940f).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63761834
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23656/
Test PASSed.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63761828
  
[Test build #23656 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23656/consoleFull) for PR 3213 at commit [`d714e1d`](https://github.com/apache/spark/commit/d714e1dcfaed40efdd137b5a83f1c334f7ae940f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20462697
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +134,69 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Transforms a single Row to JSON using Jackson
+    *
+    * @param jsonFactory a JsonFactory object to construct a JsonGenerator
+    * @param rowSchema the schema object used for conversion
+    * @param row The row to convert
+    */
+  private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = {
+    val writer = new StringWriter()
+    val gen = jsonFactory.createGenerator(writer)
+
+    def valWriter: (DataType, Any) => Unit = {

Yeah, I think for a user-defined type it will just be

```scala
case (udt: UserDefinedType[_], v) => valWriter(udt.sqlType, v)
```





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20462712
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +134,80 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Transforms a single Row to JSON using Jackson
+    *
+    * @param jsonFactory a JsonFactory object to construct a JsonGenerator
+    * @param rowSchema the schema object used for conversion
+    * @param row The row to convert
+    */
+  private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = {
+    val writer = new StringWriter()
+    val gen = jsonFactory.createGenerator(writer)
+
+    def valWriter: (DataType, Any) => Unit = {
+      case(_, null) => gen.writeNull() //writing null could break some parsers
+      case(StringType, v: String) => gen.writeString(v)
+      case(TimestampType, v: java.sql.Timestamp) => gen.writeString(v.toString)
+      case(IntegerType, v: Int) => gen.writeNumber(v)

Space between `case` and `(`.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20462726
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +134,80 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Transforms a single Row to JSON using Jackson
+    *
+    * @param jsonFactory a JsonFactory object to construct a JsonGenerator
+    * @param rowSchema the schema object used for conversion
+    * @param row The row to convert
+    */
+  private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = {
+    val writer = new StringWriter()
+    val gen = jsonFactory.createGenerator(writer)
+
+    def valWriter: (DataType, Any) => Unit = {
+      case(_, null) => gen.writeNull() //writing null could break some parsers
+      case(StringType, v: String) => gen.writeString(v)
+      case(TimestampType, v: java.sql.Timestamp) => gen.writeString(v.toString)
+      case(IntegerType, v: Int) => gen.writeNumber(v)
+      case(ShortType, v: Short) => gen.writeNumber(v)
+      case(FloatType, v: Float) => gen.writeNumber(v)
+      case(DoubleType, v: Double) => gen.writeNumber(v)
+      case(LongType, v: Long) => gen.writeNumber(v)
+      case(DecimalType(), v: java.math.BigDecimal) => gen.writeNumber(v)
+      case(ByteType, v: Byte) => gen.writeNumber(v.toInt)
+      case(BinaryType, v: Array[Byte]) => gen.writeBinary(v)
+      case(BooleanType, v: Boolean) => gen.writeBoolean(v)
+      case(DateType, v) => gen.writeString(v.toString)
+
+
+      case(ArrayType(ty, _), v: Seq[_] ) =>
+         gen.writeStartArray()

Indent only two spaces from `case`





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20462808
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +134,80 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Transforms a single Row to JSON using Jackson
+    *
+    * @param jsonFactory a JsonFactory object to construct a JsonGenerator
+    * @param rowSchema the schema object used for conversion
+    * @param row The row to convert
+    */
+  private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = {
+    val writer = new StringWriter()
+    val gen = jsonFactory.createGenerator(writer)
+
+    def valWriter: (DataType, Any) => Unit = {
+      case(_, null) => gen.writeNull() //writing null could break some parsers
+      case(StringType, v: String) => gen.writeString(v)
+      case(TimestampType, v: java.sql.Timestamp) => gen.writeString(v.toString)
+      case(IntegerType, v: Int) => gen.writeNumber(v)
+      case(ShortType, v: Short) => gen.writeNumber(v)
+      case(FloatType, v: Float) => gen.writeNumber(v)
+      case(DoubleType, v: Double) => gen.writeNumber(v)
+      case(LongType, v: Long) => gen.writeNumber(v)
+      case(DecimalType(), v: java.math.BigDecimal) => gen.writeNumber(v)
+      case(ByteType, v: Byte) => gen.writeNumber(v.toInt)
+      case(BinaryType, v: Array[Byte]) => gen.writeBinary(v)
+      case(BooleanType, v: Boolean) => gen.writeBoolean(v)
+      case(DateType, v) => gen.writeString(v.toString)
+
+
+      case(ArrayType(ty, _), v: Seq[_] ) =>
+         gen.writeStartArray()
+         v.foreach(valWriter(ty,_))
+         gen.writeEndArray()
+
+      case(MapType(kv,vv, _), v: Map[_,_]) =>
+        gen.writeStartObject
+        v.foreach(p => {
+          gen.writeFieldName(p._1.toString)
+          valWriter(vv,p._2)
+          }
+        )

Format multi-line lambdas like this:

```scala
v.foreach { p =>
  gen.writeFieldName(p._1.toString)
  valWriter(vv,p._2)
}
```





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20462981
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +134,80 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Transforms a single Row to JSON using Jackson
+    *
+    * @param jsonFactory a JsonFactory object to construct a JsonGenerator
+    * @param rowSchema the schema object used for conversion
+    * @param row The row to convert
+    */
+  private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = {
--- End diff --

I think it would be better to put this in the `object` in the `json` 
package to avoid more `SQLContext` bloat.  (Admittedly we have not been very 
good about this in the past, but I'd like to avoid it getting worse)





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20463233
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +134,80 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Transforms a single Row to JSON using Jackson
+    *
+    * @param jsonFactory a JsonFactory object to construct a JsonGenerator
+    * @param rowSchema the schema object used for conversion
+    * @param row The row to convert
+    */
+  private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = {
+    val writer = new StringWriter()
+    val gen = jsonFactory.createGenerator(writer)
+
+    def valWriter: (DataType, Any) => Unit = {
+      case(_, null) => gen.writeNull() //writing null could break some parsers
+      case(StringType, v: String) => gen.writeString(v)
+      case(TimestampType, v: java.sql.Timestamp) => gen.writeString(v.toString)
+      case(IntegerType, v: Int) => gen.writeNumber(v)
+      case(ShortType, v: Short) => gen.writeNumber(v)
+      case(FloatType, v: Float) => gen.writeNumber(v)
+      case(DoubleType, v: Double) => gen.writeNumber(v)
+      case(LongType, v: Long) => gen.writeNumber(v)
+      case(DecimalType(), v: java.math.BigDecimal) => gen.writeNumber(v)
+      case(ByteType, v: Byte) => gen.writeNumber(v.toInt)
+      case(BinaryType, v: Array[Byte]) => gen.writeBinary(v)
+      case(BooleanType, v: Boolean) => gen.writeBoolean(v)
+      case(DateType, v) => gen.writeString(v.toString)
+
+
+      case(ArrayType(ty, _), v: Seq[_] ) =>
+         gen.writeStartArray()
+         v.foreach(valWriter(ty,_))
+         gen.writeEndArray()
+
+      case(MapType(kv,vv, _), v: Map[_,_]) =>
+        gen.writeStartObject
+        v.foreach(p => {
+          gen.writeFieldName(p._1.toString)
+          valWriter(vv,p._2)
+          }
+        )
+        gen.writeEndObject
+
+      case(StructType(ty), v: Seq[_]) =>
+         gen.writeStartObject()
+         ty.zip(v).foreach {
+           case(_, null) =>
+           case(field, v) =>
+            gen.writeFieldName(field.name)
+             valWriter(field.dataType, v)
+         }
+
+        gen.writeEndObject()
+

Indenting and line spacing are kinda messed up here.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63372210
  
A few minor style comments.  Also, regarding testing: it seems like a
really good way to get better coverage with little effort would be to just
round-trip a bunch of our existing datasets through this code path.  Basically,
just compare the results of json data -> `jsonRDD` with json data ->
`jsonRDD` -> `toJSON` -> `jsonRDD` using the `checkAnswer` function.
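
The round trip described above can be illustrated outside Spark. This is a minimal plain-Python sketch under stated assumptions: the standard `json` module stands in for both `jsonRDD` (parsing) and `toJSON` (serializing), and the names `parse_lines`, `to_json_lines`, and `round_trip_matches` are hypothetical, not part of Spark or of this PR. The real test would compare query results with `checkAnswer` instead of comparing Python objects.

```python
import json


def parse_lines(lines):
    # Stand-in for jsonRDD: parse each JSON string into a Python object.
    return [json.loads(line) for line in lines]


def to_json_lines(records):
    # Stand-in for toJSON: emit one JSON string per record.
    return [json.dumps(record) for record in records]


def round_trip_matches(lines):
    # Compare "json data -> parse" with
    # "json data -> parse -> serialize -> parse".
    direct = parse_lines(lines)
    round_tripped = parse_lines(to_json_lines(direct))
    return direct == round_tripped


# Sample records covering nulls, arrays, and nested objects.
sample = [
    '{"name": "a", "tags": ["x", "y"], "nested": {"n": 1}}',
    '{"name": null, "tags": [], "nested": {"n": 2.5}}',
]
print(round_trip_matches(sample))
```

The appeal of this style of test is that any existing dataset becomes a test case without hand-writing expected JSON output.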





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63376043
  
Thanks -- I'll clean up the style issues straight away.  I'm glad to see
this getting close to finished.

As for additional tests, I'd been thinking along the same lines.  However,
I'm not sure where they should live: SQLQuerySuite or JsonSuite.  I think
if it's JsonRDD compared with rehydrated toJSON output, it should be in
JsonSuite.

Cheers,
Dan







[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread dwmclary
Github user dwmclary commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20466740
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +134,80 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Transforms a single Row to JSON using Jackson
+*
+* @param jsonFactory a JsonFactory object to construct a JsonGenerator
+* @param rowSchema the schema object used for conversion
+* @param row The row to convert
+*/
+  private def rowToJSON(rowSchema: StructType, jsonFactory: 
JsonFactory)(row: Row): String = {
--- End diff --

I'm not sure I understand: would that be placing the rowToJSON method in 
JsonRDD, or as a separate object for import into SchemaRDD?





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20467259
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +134,80 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Transforms a single Row to JSON using Jackson
+*
+* @param jsonFactory a JsonFactory object to construct a JsonGenerator
+* @param rowSchema the schema object used for conversion
+* @param row The row to convert
+*/
+  private def rowToJSON(rowSchema: StructType, jsonFactory: 
JsonFactory)(row: Row): String = {
--- End diff --

I was just thinking of moving the `rowToJSON` method to JsonRDD.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63381046
  
I'd prefer either `JsonSuite` or a new `ToJsonSuite`; 
`SQLQuerySuite` is already way too big.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63386849
  
I'm going to go with JsonSuite.  I don't think it's big enough to warrant a 
whole suite.  I'm putting rowToJSON in JsonRDD right after asRow.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63396570
  
ok to test





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63396987
  
  [Test build #23514 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23514/consoleFull)
 for   PR 3213 at commit 
[`4387dd5`](https://github.com/apache/spark/commit/4387dd589425310cef7db7a5423aad5aa4a706f3).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63397114
  
  [Test build #23514 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23514/consoleFull)
 for   PR 3213 at commit 
[`4387dd5`](https://github.com/apache/spark/commit/4387dd589425310cef7db7a5423aad5aa4a706f3).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63397119
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23514/
Test FAILed.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63404468
  
  [Test build #23519 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23519/consoleFull)
 for   PR 3213 at commit 
[`4a651f0`](https://github.com/apache/spark/commit/4a651f0fcf359739d392a1a34dc2effef640dda1).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20478602
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -591,6 +591,30 @@ class SQLQuerySuite extends QueryTest with 
BeforeAndAfterAll {
 clear()
   }
 
+  test("SPARK-4228 SchemaRDD to JSON")
--- End diff --

Do we still need it?





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread dwmclary
Github user dwmclary commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20478747
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -591,6 +591,30 @@ class SQLQuerySuite extends QueryTest with 
BeforeAndAfterAll {
 clear()
   }
 
+  test("SPARK-4228 SchemaRDD to JSON")
--- End diff --

Nope; removed.







[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63405701
  
  [Test build #23521 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23521/consoleFull)
 for   PR 3213 at commit 
[`1a5fd30`](https://github.com/apache/spark/commit/1a5fd30b4cf85a84bf8b4db68913091b1246ef95).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20479487
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala 
---
@@ -779,4 +780,52 @@ class JsonSuite extends QueryTest {
   Seq(null, null, null, Seq(Seq(null, Seq(1, 2, 3 :: Nil
 )
   }
+
+  test("SPARK-4228 SchemaRDD to JSON")
+  {
--- End diff --

Move the opening brace to the line above.

Also, can you make changes according to @marmbrus's comments on tests 
(round-trip our existing datasets with this code path)?





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread dwmclary
Github user dwmclary commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20479917
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala 
---
@@ -779,4 +780,52 @@ class JsonSuite extends QueryTest {
   Seq(null, null, null, Seq(Seq(null, Seq(1, 2, 3 :: Nil
 )
   }
+
+  test("SPARK-4228 SchemaRDD to JSON")
+  {
--- End diff --

I'm already working on it.  Shouldn't take too much longer.







[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63414062
  
  [Test build #23519 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23519/consoleFull)
 for   PR 3213 at commit 
[`4a651f0`](https://github.com/apache/spark/commit/4a651f0fcf359739d392a1a34dc2effef640dda1).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63414067
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23519/
Test PASSed.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63414844
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23521/
Test PASSed.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63414839
  
  [Test build #23521 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23521/consoleFull)
 for   PR 3213 at commit 
[`1a5fd30`](https://github.com/apache/spark/commit/1a5fd30b4cf85a84bf8b4db68913091b1246ef95).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63419169
  
OK, pulled in the bulk of the tests for primitive and complex types from
other parts of JsonSuite.  I think we're pretty heavily exercising the code
at this point.

Cheers,
D







[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63419571
  
  [Test build #23535 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23535/consoleFull)
 for   PR 3213 at commit 
[`6598cee`](https://github.com/apache/spark/commit/6598ceeca5b1194a6758e8d51e20028fc9bb9b56).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63425759
  
  [Test build #23535 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23535/consoleFull)
 for   PR 3213 at commit 
[`6598cee`](https://github.com/apache/spark/commit/6598ceeca5b1194a6758e8d51e20028fc9bb9b56).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63425765
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23535/
Test PASSed.





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-16 Thread dwmclary
Github user dwmclary commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20410937
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +134,69 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Transforms a single Row to JSON using Jackson
+*
+* @param jsonFactory a JsonFactory object to construct a JsonGenerator
+* @param rowSchema the schema object used for conversion
+* @param row The row to convert
+*/
+  private def rowToJSON(rowSchema: StructType, jsonFactory: 
JsonFactory)(row: Row): String = {
+val writer = new StringWriter()
+val gen = jsonFactory.createGenerator(writer)
+
+def valWriter: (DataType, Any) => Unit = {
+  case(_, null)  => //do nothing
+  case(StringType, v: String) => gen.writeString(v)
+  case(TimestampType, v: java.sql.Timestamp) => 
gen.writeString(v.toString)
--- End diff --

@yhuai
This is pretty close.  The Java APIs are in, and DateType and MapType are
in.  However, a few things:

   - If the next logical action is saveAsTextFile, I'd presume this is mainly
   for readers.  In that case, I think the string output for timestamps is
   preferable.
   - UserDefinedType is proving a little tricky.  I would think that we
   would just add a case for UserDefinedType[_], but that's an unsupported
   pattern.
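
On the UserDefinedType point: a bare `UserDefinedType[_]` is a type, not a pattern, which is likely why the compiler rejects it. One possible way around it is a typed pattern that binds the instance. The sketch below uses a hypothetical, simplified hierarchy (`DataType`, `UserDefinedType`, `PointUDT`, and `write` are stand-ins defined here, not Spark's classes) to show the shape of such a case.

```scala
object UdtPatternSketch {
  // Simplified stand-ins for the type hierarchy (assumptions, not Spark's API).
  sealed trait DataType
  case object StringType extends DataType
  abstract class UserDefinedType[T] extends DataType {
    def sqlType: DataType          // underlying storage type
    def serialize(obj: Any): Any   // convert a user object to that storage type
  }

  // A hypothetical user type and its UDT.
  final case class Point(x: Int, y: Int)
  object PointUDT extends UserDefinedType[Point] {
    def sqlType: DataType = StringType
    def serialize(obj: Any): Any = obj match {
      case Point(x, y) => s"$x,$y"
      case other       => other.toString
    }
  }

  // Writes a value as text, dispatching on (type, value) pairs.
  def write(dataType: DataType, value: Any): String = (dataType, value) match {
    case (_, null)               => "" // nothing is written for nulls
    case (StringType, s: String) => "\"" + s + "\""
    // Typed pattern: binds the UDT instance, then recurses on its storage type.
    case (udt: UserDefinedType[_], v) => write(udt.sqlType, udt.serialize(v))
    case (other, _) =>
      throw new IllegalArgumentException(s"unsupported type: $other")
  }
}
```

With these stand-ins, `UdtPatternSketch.write(UdtPatternSketch.PointUDT, UdtPatternSketch.Point(1, 2))` serializes the point and then falls through to the `StringType` case.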



On Sat, Nov 15, 2014 at 6:03 PM, Yin Huai notificati...@github.com wrote:

 In sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala:

  @@ -131,6 +134,69 @@ class SchemaRDD(
  */
 lazy val schema: StructType = queryExecution.analyzed.schema
 
  +  /** Transforms a single Row to JSON using Jackson
  +*
  +* @param jsonFactory a JsonFactory object to construct a 
JsonGenerator
  +* @param rowSchema the schema object used for conversion
  +* @param row The row to convert
  +*/
  +  private def rowToJSON(rowSchema: StructType, jsonFactory: 
JsonFactory)(row: Row): String = {
  +val writer = new StringWriter()
  +val gen = jsonFactory.createGenerator(writer)
  +
  +def valWriter: (DataType, Any) => Unit = {
  +  case(_, null)  => //do nothing
  +  case(StringType, v: String) => gen.writeString(v)
  +  case(TimestampType, v: java.sql.Timestamp) => 
gen.writeString(v.toString)

 If we use string for a timestamp value, the meaning of the time can be
 changed (e.g. the data is generated by a developer in a time zone and then
 it is read by another developer in another time zone). I feel using
 getTime is better (it is not very reader friendly though).
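
The time zone concern can be illustrated with a small sketch. `java.sql.Timestamp.toString` renders the instant in the JVM's default time zone, while `getTime` returns epoch milliseconds regardless of zone. `TimestampSketch` and `renderIn` are hypothetical helpers written for this illustration; temporarily swapping the JVM default is done only to simulate two developers in different zones.

```scala
import java.sql.Timestamp
import java.util.TimeZone

object TimestampSketch {
  // Render a timestamp as a string as if the JVM default zone were `zoneId`.
  // The previous default is restored afterwards.
  def renderIn(zoneId: String, ts: Timestamp): String = {
    val saved = TimeZone.getDefault
    TimeZone.setDefault(TimeZone.getTimeZone(zoneId))
    try ts.toString
    finally TimeZone.setDefault(saved)
  }

  // The Unix epoch: the same instant everywhere.
  val epoch = new Timestamp(0L)
}
```

Calling `TimestampSketch.renderIn("UTC", TimestampSketch.epoch)` and `TimestampSketch.renderIn("GMT-08:00", TimestampSketch.epoch)` gives two different strings for the same instant, whereas `TimestampSketch.epoch.getTime` is the same value in both cases.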

 @marmbrus https://github.com/marmbrus What do you think?

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/3213/files#r20406598.






[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-15 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63194531
  
@davies -- that's much cleaner; thanks!  I think unicode should be default,
but optional for the deserializer so I added that to the method.

@yhuai, I followed @NathanHowell's approach; if we're ever going to have
columns which are Scala collections, it's better than using ObjectMapper on
a Java HashMap.

This should be close to ready.

Also, I'm not writing nulls to the JSON string.  There's too much
variability in JSON parsers around nulls; I believe it's better to avoid
the issue altogether.

Cheers,
Dan

On Fri, Nov 14, 2014 at 8:03 PM, Davies Liu notificati...@github.com
wrote:

 I think it should be

 def toJsonRDD(self):
  rdd = self._jschema_rdd.baseSchemaRDD().toJsonRDD()
  return RDD(rdd.toJavaRDD(), self.ctx, UTF8Deserializer())

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/3213#issuecomment-63159199.






[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-15 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63199746
  
@dwmclary I would like to have str as the default instead, which uses less 
memory and gives better performance (it avoids decoding and encoding in some 
cases).





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-15 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63200033
  
@dwmclary Can you reformat the code to follow the [2-space 
convention](http://docs.scala-lang.org/style/indentation.html)?





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-15 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20406459
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +134,68 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Transforms a single Row to JSON using Jackson
+*
+* @param jf a JsonFactory object to construct a JsonGenerator
+* @param rowSchema the schema object used for conversion
+* @param row The row to convert
+*/
+  private def rowToJSON(rowSchema: StructType, jf: JsonFactory)(row: Row): 
String = {
--- End diff --

How about renaming `jf` to `jsonFactory`?





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-15 Thread dwmclary
Github user dwmclary commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20406513
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +134,68 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Transforms a single Row to JSON using Jackson
+*
+* @param jf a JsonFactory object to construct a JsonGenerator
+* @param rowSchema the schema object used for conversion
+* @param row The row to convert
+*/
+  private def rowToJSON(rowSchema: StructType, jf: JsonFactory)(row: Row): 
String = {
--- End diff --

Done and done @davies, @yhuai







[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-15 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20406534
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +134,69 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Transforms a single Row to JSON using Jackson
+*
+* @param jsonFactory a JsonFactory object to construct a JsonGenerator
+* @param rowSchema the schema object used for conversion
+* @param row The row to convert
+*/
+  private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = {
+    val writer = new StringWriter()
+    val gen = jsonFactory.createGenerator(writer)
+
+    def valWriter: (DataType, Any) => Unit = {
+      case(_, null)  => //do nothing
+      case(StringType, v: String) => gen.writeString(v)
+      case(TimestampType, v: java.sql.Timestamp) => gen.writeString(v.toString)
+      case(IntegerType, v: Int) => gen.writeNumber(v)
+      case(ShortType, v: Short) => gen.writeNumber(v)
+      case(FloatType, v: Float) => gen.writeNumber(v)
+      case(DoubleType, v: Double) => gen.writeNumber(v)
+      case(LongType, v: Long) => gen.writeNumber(v)
+      case(DecimalType(), v: java.math.BigDecimal) => gen.writeNumber(v)
+      case(ByteType, v: Byte) => gen.writeNumber(v.toInt)
+      case(BinaryType, v: Array[Byte]) => gen.writeBinary(v)
+      case(BooleanType, v: Boolean) => gen.writeBoolean(v)
+
+      case(ArrayType(ty, _), v: Seq[_] ) =>
+        gen.writeStartArray()
+        v.foreach(valWriter(ty,_))
--- End diff --

I think we need to keep `null`s in an array. Otherwise, the information carried by the array (its length and the position of each element) will be changed.
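
To make the concern concrete, here is a minimal sketch using Python's standard `json` module rather than the Jackson code in this diff. The two helper names are hypothetical and exist only for this illustration; the first imitates a writer that skips `null` elements, the second writes an explicit JSON `null` for each of them.

```python
import json


def write_array_dropping_nulls(values):
    # Imitates a writer that does nothing for null elements:
    # they vanish from the output and the array gets shorter.
    return json.dumps([v for v in values if v is not None])


def write_array_keeping_nulls(values):
    # Emits an explicit JSON null for each missing element,
    # so length and element positions survive a round trip.
    return json.dumps(list(values))


row_values = [1, None, 3]
dropped = write_array_dropping_nulls(row_values)
kept = write_array_keeping_nulls(row_values)
```

With the first helper a reader of the JSON can no longer tell that the second element was missing; with the second, parsing the output gives back the original three-element array.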





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-15 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20406537
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -591,6 +591,30 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll {
 clear()
   }
 
+  test("SPARK-4228 SchemaRDD to JSON")
--- End diff --

Can we also add tests for complex data types?





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-15 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20406544
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -17,9 +17,12 @@
 
 package org.apache.spark.sql
 
-import java.util.{List => JList}
-
+import java.util.{Map => JMap, List => JList, HashMap => JHMap}
+import java.io.StringWriter
 import org.apache.spark.api.python.SerDeUtil
+import org.apache.spark.storage.StorageLevel
+import com.fasterxml.jackson.databind.ObjectMapper
+import com.fasterxml.jackson.core.JsonFactory
--- End diff --

Sort imports (see 
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports).





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-15 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3213#discussion_r20406568
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -131,6 +134,69 @@ class SchemaRDD(
*/
   lazy val schema: StructType = queryExecution.analyzed.schema
 
+  /** Transforms a single Row to JSON using Jackson
+*
+* @param jsonFactory a JsonFactory object to construct a JsonGenerator
+* @param rowSchema the schema object used for conversion
+* @param row The row to convert
+*/
+  private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = {
+    val writer = new StringWriter()
+    val gen = jsonFactory.createGenerator(writer)
+
+    def valWriter: (DataType, Any) => Unit = {
--- End diff --

It seems we are missing cases for `MapType`, `DateType`, and `UserDefinedType`.

@marmbrus Should we write the string representation (returned by `toString`) for a `UserDefinedType`?
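
As a rough analogy only, the sketch below shows the same three gaps in Python's standard `json` module: a nested map, a date, and a user-defined type written via its string form. `Point` and `fallback` are hypothetical names made up for this illustration; none of this is Spark or Jackson API, and writing a user-defined type as a string is the open question above, not a settled decision.

```python
import datetime
import json


class Point:
    """Hypothetical user-defined type, used only for this illustration."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __str__(self):
        return "Point(%s, %s)" % (self.x, self.y)


def fallback(value):
    # json.dumps calls this only for values it cannot serialize natively.
    if isinstance(value, datetime.date):
        return value.isoformat()   # dates become ISO-8601 strings
    return str(value)              # user-defined types fall back to str()


row = {
    "tags": {"a": 1, "b": None},            # map-like column
    "day": datetime.date(2014, 11, 15),     # date column
    "location": Point(1, 2),                # user-defined type column
}
encoded = json.dumps(row, default=fallback, sort_keys=True)
```

One consequence of the string fallback is that it is lossy: the JSON carries no hint that `location` was ever anything other than a string, so it cannot be read back into the original type without extra schema information.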





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-15 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63200778
  
Seems we also need the Java API (`JavaSchemaRDD`).





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-14 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63139261
  
Thanks for working on this.  I have two high-level comments:
 - I think it would be better to have a single implementation in Scala with a wrapper in Python.  This way we don't have to serialize and ship the objects to Python, which seems like it might be expensive, especially if the next thing you are going to do is something like `saveAsTextFile`.
 - It would also be better to use Jackson to generate the JSON string, as there are a lot of tricky edge cases around escaping that we would need to handle if we did it ourselves.  For example, I think this version will fail if a column name contains a quote character.
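
The escaping pitfall in the second point can be sketched as follows. This is an illustration in Python's standard `json` module, not the code from this pull request; `naive_row_to_json` and `parses` are hypothetical helpers written for the example.

```python
import json


def naive_row_to_json(row):
    # Hand-rolled string concatenation with no escaping at all.
    fields = ['"%s": "%s"' % (name, value) for name, value in row.items()]
    return "{" + ", ".join(fields) + "}"


def parses(text):
    # True if `text` is valid JSON, False otherwise.
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


# A column name containing quote characters and a value containing a backslash.
row = {'say "hi"': "back\\slash"}

naive = naive_row_to_json(row)   # unescaped quotes break the document
safe = json.dumps(row)           # the library escapes keys and values
```

The hand-built string is not parseable JSON, while the library output round-trips to the original row. The same reasoning applies to control characters and non-ASCII text, which is why delegating to a JSON library is the safer default.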

/cc @yhuai





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-14 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63145286
  
Happy to help; these changes should be quick.

   - Sure, the wrapper for PySpark makes more sense; I hadn't considered
   that we'd be shipping the objects back and forth through py4j.
   - Jackson should be a simple change; I'll just go hash-to-JSON via
   ObjectMapper if that makes sense.

Cheers,
Dan







[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-14 Thread NathanHowell
Github user NathanHowell commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63151280
  
Another approach is to use a `JsonGenerator` instead of an `ObjectMapper`. 
This is the implementation I've been using for a while: 
https://gist.github.com/NathanHowell/0a15f0bd23cd940becb3
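
For readers unfamiliar with the distinction, here is a small sketch of the streaming idea in Python. It is an analogy under stated assumptions, not the gist's code and not Jackson's API: the toy schema encoding (a dict for a struct, a one-element list for an array, anything else for a scalar) and the names `write_value` and `row_to_json` are invented for this example. The point is that values are written straight to an output buffer while walking the schema, without first copying the row into an intermediate collection.

```python
import io
import json


def write_value(out, schema, value):
    """Write one value to `out`, dispatching on a toy schema description."""
    if value is None:
        out.write("null")
    elif isinstance(schema, list):           # [element_schema] stands for an array
        out.write("[")
        for i, item in enumerate(value):
            if i:
                out.write(",")
            write_value(out, schema[0], item)
        out.write("]")
    elif isinstance(schema, dict):           # {field: schema} stands for a struct
        out.write("{")
        for i, (name, field_schema) in enumerate(schema.items()):
            if i:
                out.write(",")
            out.write(json.dumps(name))      # let the library escape the key
            out.write(":")
            write_value(out, field_schema, value.get(name))
        out.write("}")
    else:                                    # scalar leaf: str, int, float, bool
        out.write(json.dumps(value))


def row_to_json(schema, row):
    out = io.StringIO()
    write_value(out, schema, row)
    return out.getvalue()


schema = {"name": "string", "scores": ["int"], "address": {"city": "string"}}
row = {"name": "Ann", "scores": [1, None, 3], "address": {"city": "Oslo"}}
encoded = row_to_json(schema, row)
```

Because the writer recurses on the schema, nested arrays and structs need no special conversion step, which is the property that makes the generator approach attractive for complex types.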





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-14 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63152791
  
I pushed up a Jackson version, which cuts down the size quite a bit.  At
present we're not handling complex types, correct?

What I'm a bit stuck on is getting the results of the Scala method back
into PySpark.  If I call:

newJsonRDD = someSchemaRDD._jschema_rdd.baseSchemaRDD().toJSON()

I'm not sure how to deserialize it back on the Python side.

My intuition would be that I'd just do
RDD(newJsonRDD, sc), but that doesn't seem to give me back a valid RDD.

-D







[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-14 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63156674
  
@dwmclary Thank you for working on it. I think we need to handle complex 
types. The approach @NathanHowell is using looks good since we do not need to 
convert Scala collections to Java collections to generate the JSON string.

Regarding the Python API, `toJSON` will return a Scala RDD, and I think you can use `jrdd = self._jvm.JavaRDD.fromRDD` to generate a Java RDD and then [create an RDD on the Python side](https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L123) (`RDD(jrdd, self._sc, self._jrdd_deserializer)`). @davies is this the right way?





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-14 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63157839
  
Yin,

  Thanks for jumping in.  I'll run some complex types through ObjectMapper
and see how it compares to JsonFactory.  I figure object creation overhead
is probably equivalent.

  I think you're correct about the python approach.  I was hoping to pick
up javaToPython, but since I have a MappedRDD at the end of the
transformation I can't get it.  javaToPython should stay private, so I
probably have to go after the _jvm.

Cheers,
Dan







[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-14 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-63159199
  
I think it should be
```
def toJsonRDD(self):
    rdd = self._jschema_rdd.baseSchemaRDD().toJsonRDD()
    return RDD(rdd.toJavaRDD(), self.ctx, UTF8Deserializer())
```





[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-11 Thread dwmclary
GitHub user dwmclary opened a pull request:

https://github.com/apache/spark/pull/3213

 SPARK-4228 SchemaRDD to JSON

Here's a simple fix for SchemaRDD to JSON.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dwmclary/spark SPARK-4228

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3213.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3213


commit d6d19e9f3e64b7c5b38994e56e0c16393f480e76
Author: Dan McClary dan.mccl...@gmail.com
Date:   2014-07-22T16:38:05Z

pr example

commit 5d34e371b1af42febb9c8d8a0bb5ff5764577e92
Author: Dan McClary dan.mccl...@gmail.com
Date:   2014-10-10T18:17:30Z

merge resolved

commit f7d166aff68772179fcde2c46f1ef4a935b20628
Author: Dan McClary dan.mccl...@gmail.com
Date:   2014-11-11T07:37:06Z

added toJSON method

commit 626a5b1b589c5d11df05bccea3ac6db9c17960f1
Author: Dan McClary dan.mccl...@gmail.com
Date:   2014-11-11T15:51:00Z

added toJSON to SchemaRDD

commit 424f130419d4a4727d1dc594d8d15e15a1293c55
Author: Dan McClary dan.mccl...@gmail.com
Date:   2014-11-11T20:25:45Z

tests pass, ready for pull and PR

commit 319e3bac4c448119c3ac42343c14e6e52caf8dbe
Author: Dan McClary dan.mccl...@gmail.com
Date:   2014-11-11T22:50:36Z

updated to upstream master with merged SPARK-4228







[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3213#issuecomment-62636889
  
Can one of the admins verify this patch?

