[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20679083 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -131,6 +135,20 @@ class SchemaRDD( */ lazy val schema: StructType = queryExecution.analyzed.schema + /** Returns a new RDD of JSON strings, one string per row + * + * @group schema + */ + def toJSON: RDD[String] = { +val rowSchema = this.schema +this.mapPartitions { iter => + val jsonFactory = new JsonFactory() + iter.map(JsonRDD.rowToJSON(rowSchema, jsonFactory)) +} + --- End diff -- Extra line. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
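The per-partition pattern in the diff above (building one JsonFactory per partition and reusing it for every row, instead of constructing one per row) can be sketched in plain Python. This is an invented stand-in, not Spark or Jackson code; `rows_to_json_partition` and the dict-shaped rows are illustrative names only:

```python
import json

def rows_to_json_partition(rows):
    """Serialize an iterator of rows (dicts) to JSON strings.

    Mirrors the mapPartitions pattern: one encoder is built per
    partition and reused for every row in it, rather than creating
    a fresh encoder for each row.
    """
    encoder = json.JSONEncoder(sort_keys=True)  # one "factory" per partition
    for row in rows:
        yield encoder.encode(row)

partition = [{"name": "alice", "age": 1}, {"name": "bob", "age": 2}]
print(list(rows_to_json_partition(partition)))
# -> ['{"age": 1, "name": "alice"}', '{"age": 2, "name": "bob"}']
```

The same amortization motivates putting the factory construction inside `mapPartitions` rather than inside the per-row closure.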
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20679092 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -35,6 +37,8 @@ import org.apache.spark.sql.catalyst.analysis._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.{Inner, JoinType} import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.types.UserDefinedType --- End diff -- Unused imports.
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20679073 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSchemaRDD.scala --- @@ -126,6 +126,12 @@ class JavaSchemaRDD( // Transformations (return a new RDD) /** + * Return a new RDD that is the schema transformed to JSON --- End diff -- `Returns an RDD with each row transformed to a JSON string.`
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63884435 Since we are about to cut a 1.2 preview, I'll make the final changes while merging. Thanks for working on this! I think it'll be a pretty popular feature. I used it last night :) Merging to master and 1.2.
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63884730 Michael, thanks for being willing to pick up the final changes! I'm happy to get a chance to contribute again. Hopefully the next PR won't require so much of your time. Cheers, Dan
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3213
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63716980 Is this good to merge?
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20609235 --- Diff: python/pyspark/sql.py --- @@ -1870,6 +1870,10 @@ def limit(self, num): rdd = self._jschema_rdd.baseSchemaRDD().limit(num).toJavaSchemaRDD() return SchemaRDD(rdd, self.sql_ctx) +def toJSON(self, use_unicode=False): +rdd = self._jschema_rdd.baseSchemaRDD().toJSON() --- End diff -- Could you put some simple tests (also will be examples in docs) here?
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20610021 --- Diff: python/pyspark/sql.py --- @@ -1870,6 +1870,10 @@ def limit(self, num): rdd = self._jschema_rdd.baseSchemaRDD().limit(num).toJavaSchemaRDD() return SchemaRDD(rdd, self.sql_ctx) +def toJSON(self, use_unicode=False): +rdd = self._jschema_rdd.baseSchemaRDD().toJSON() --- End diff -- Sure thing.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63727367 [Test build #23637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23637/consoleFull) for PR 3213 at commit [`f9471d3`](https://github.com/apache/spark/commit/f9471d345a58ba3b76b298b3ebadf2674b0f79bb). * This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63738691 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23637/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63738684 [Test build #23637 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23637/consoleFull) for PR 3213 at commit [`f9471d3`](https://github.com/apache/spark/commit/f9471d345a58ba3b76b298b3ebadf2674b0f79bb). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63744673 This may be an intermittent diff; it's not in the code path modified in this PR.
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63748417 Ugh, yeah, just wasn't paying attention. Fixed now.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63748781 [Test build #23650 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23650/consoleFull) for PR 3213 at commit [`cac2879`](https://github.com/apache/spark/commit/cac2879694668f392f9163ee2264874d3376e9ac). * This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63748886 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23650/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63748882 [Test build #23650 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23650/consoleFull) for PR 3213 at commit [`cac2879`](https://github.com/apache/spark/commit/cac2879694668f392f9163ee2264874d3376e9ac). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63755364 [Test build #23656 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23656/consoleFull) for PR 3213 at commit [`d714e1d`](https://github.com/apache/spark/commit/d714e1dcfaed40efdd137b5a83f1c334f7ae940f). * This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63761834 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23656/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63761828 [Test build #23656 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23656/consoleFull) for PR 3213 at commit [`d714e1d`](https://github.com/apache/spark/commit/d714e1dcfaed40efdd137b5a83f1c334f7ae940f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20462697 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -131,6 +134,69 @@ class SchemaRDD( */ lazy val schema: StructType = queryExecution.analyzed.schema + /** Transforms a single Row to JSON using Jackson +* +* @param jsonFactory a JsonFactory object to construct a JsonGenerator +* @param rowSchema the schema object used for conversion +* @param row The row to convert +*/ + private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = { +val writer = new StringWriter() +val gen = jsonFactory.createGenerator(writer) + +def valWriter: (DataType, Any) => Unit = { --- End diff -- Yeah, I think for a user-defined type it will just be

```scala
case (udt: UserDefinedType[_], v) => valWriter(udt.sqlType, v)
```
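The suggested UDT case simply delegates to the value's underlying SQL-type representation before serializing. A minimal Python analogue of that delegation, where `Point` and `to_sql_value` are invented stand-ins for a user-defined type and its `sqlType` form:

```python
import json

class Point:
    """Hypothetical user-defined type with an underlying "SQL" form."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def to_sql_value(self):
        # Underlying representation: an array of doubles.
        return [self.x, self.y]

def write_value(v):
    """Serialize a value; UDT-like values are first converted to their
    underlying representation, mirroring
    `case (udt, v) => valWriter(udt.sqlType, v)`."""
    if isinstance(v, Point):
        return write_value(v.to_sql_value())
    return json.dumps(v)

print(write_value(Point(1.0, 2.0)))  # -> [1.0, 2.0]
```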
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20462712 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -131,6 +134,80 @@ class SchemaRDD( */ lazy val schema: StructType = queryExecution.analyzed.schema + /** Transforms a single Row to JSON using Jackson +* +* @param jsonFactory a JsonFactory object to construct a JsonGenerator +* @param rowSchema the schema object used for conversion +* @param row The row to convert +*/ + private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = { +val writer = new StringWriter() +val gen = jsonFactory.createGenerator(writer) + +def valWriter: (DataType, Any) => Unit = { + case(_, null) => gen.writeNull() //writing null could break some parsers + case(StringType, v: String) => gen.writeString(v) + case(TimestampType, v: java.sql.Timestamp) => gen.writeString(v.toString) + case(IntegerType, v: Int) => gen.writeNumber(v) --- End diff -- Space between `case` and `(`.
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20462726 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -131,6 +134,80 @@ class SchemaRDD( */ lazy val schema: StructType = queryExecution.analyzed.schema + /** Transforms a single Row to JSON using Jackson +* +* @param jsonFactory a JsonFactory object to construct a JsonGenerator +* @param rowSchema the schema object used for conversion +* @param row The row to convert +*/ + private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = { +val writer = new StringWriter() +val gen = jsonFactory.createGenerator(writer) + +def valWriter: (DataType, Any) => Unit = { + case(_, null) => gen.writeNull() //writing null could break some parsers + case(StringType, v: String) => gen.writeString(v) + case(TimestampType, v: java.sql.Timestamp) => gen.writeString(v.toString) + case(IntegerType, v: Int) => gen.writeNumber(v) + case(ShortType, v: Short) => gen.writeNumber(v) + case(FloatType, v: Float) => gen.writeNumber(v) + case(DoubleType, v: Double) => gen.writeNumber(v) + case(LongType, v: Long) => gen.writeNumber(v) + case(DecimalType(), v: java.math.BigDecimal) => gen.writeNumber(v) + case(ByteType, v: Byte) => gen.writeNumber(v.toInt) + case(BinaryType, v: Array[Byte]) => gen.writeBinary(v) + case(BooleanType, v: Boolean) => gen.writeBoolean(v) + case(DateType, v) => gen.writeString(v.toString) + + + case(ArrayType(ty, _), v: Seq[_] ) => + gen.writeStartArray() --- End diff -- Indent only two spaces from `case`.
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20462808 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -131,6 +134,80 @@ class SchemaRDD( */ lazy val schema: StructType = queryExecution.analyzed.schema + /** Transforms a single Row to JSON using Jackson +* +* @param jsonFactory a JsonFactory object to construct a JsonGenerator +* @param rowSchema the schema object used for conversion +* @param row The row to convert +*/ + private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = { +val writer = new StringWriter() +val gen = jsonFactory.createGenerator(writer) + +def valWriter: (DataType, Any) => Unit = { + case(_, null) => gen.writeNull() //writing null could break some parsers + case(StringType, v: String) => gen.writeString(v) + case(TimestampType, v: java.sql.Timestamp) => gen.writeString(v.toString) + case(IntegerType, v: Int) => gen.writeNumber(v) + case(ShortType, v: Short) => gen.writeNumber(v) + case(FloatType, v: Float) => gen.writeNumber(v) + case(DoubleType, v: Double) => gen.writeNumber(v) + case(LongType, v: Long) => gen.writeNumber(v) + case(DecimalType(), v: java.math.BigDecimal) => gen.writeNumber(v) + case(ByteType, v: Byte) => gen.writeNumber(v.toInt) + case(BinaryType, v: Array[Byte]) => gen.writeBinary(v) + case(BooleanType, v: Boolean) => gen.writeBoolean(v) + case(DateType, v) => gen.writeString(v.toString) + + + case(ArrayType(ty, _), v: Seq[_] ) => + gen.writeStartArray() + v.foreach(valWriter(ty,_)) + gen.writeEndArray() + + case(MapType(kv,vv, _), v: Map[_,_]) => +gen.writeStartObject +v.foreach(p => { + gen.writeFieldName(p._1.toString) + valWriter(vv,p._2) + } +) --- End diff -- Format multi-line lambdas like this:

```scala
v.foreach { p =>
  gen.writeFieldName(p._1.toString)
  valWriter(vv, p._2)
}
```
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20462981 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -131,6 +134,80 @@ class SchemaRDD( */ lazy val schema: StructType = queryExecution.analyzed.schema + /** Transforms a single Row to JSON using Jackson +* +* @param jsonFactory a JsonFactory object to construct a JsonGenerator +* @param rowSchema the schema object used for conversion +* @param row The row to convert +*/ + private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = { --- End diff -- I think it would be better to put this in the `object` in the `json` package to avoid more `SQLContext` bloat. (Admittedly we have not been very good about this in the past, but I'd like to avoid it getting worse.)
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20463233 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -131,6 +134,80 @@ class SchemaRDD( */ lazy val schema: StructType = queryExecution.analyzed.schema + /** Transforms a single Row to JSON using Jackson +* +* @param jsonFactory a JsonFactory object to construct a JsonGenerator +* @param rowSchema the schema object used for conversion +* @param row The row to convert +*/ + private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = { +val writer = new StringWriter() +val gen = jsonFactory.createGenerator(writer) + +def valWriter: (DataType, Any) => Unit = { + case(_, null) => gen.writeNull() //writing null could break some parsers + case(StringType, v: String) => gen.writeString(v) + case(TimestampType, v: java.sql.Timestamp) => gen.writeString(v.toString) + case(IntegerType, v: Int) => gen.writeNumber(v) + case(ShortType, v: Short) => gen.writeNumber(v) + case(FloatType, v: Float) => gen.writeNumber(v) + case(DoubleType, v: Double) => gen.writeNumber(v) + case(LongType, v: Long) => gen.writeNumber(v) + case(DecimalType(), v: java.math.BigDecimal) => gen.writeNumber(v) + case(ByteType, v: Byte) => gen.writeNumber(v.toInt) + case(BinaryType, v: Array[Byte]) => gen.writeBinary(v) + case(BooleanType, v: Boolean) => gen.writeBoolean(v) + case(DateType, v) => gen.writeString(v.toString) + + + case(ArrayType(ty, _), v: Seq[_] ) => + gen.writeStartArray() + v.foreach(valWriter(ty,_)) + gen.writeEndArray() + + case(MapType(kv,vv, _), v: Map[_,_]) => +gen.writeStartObject +v.foreach(p => { + gen.writeFieldName(p._1.toString) + valWriter(vv,p._2) + } +) +gen.writeEndObject + + case(StructType(ty), v: Seq[_]) => + gen.writeStartObject() + ty.zip(v).foreach { + case(_, null) => + case(field, v) => +gen.writeFieldName(field.name) + valWriter(field.dataType, v) + } + +gen.writeEndObject() + --- End diff -- Indenting and line spacing are kinda messed up here.
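The `valWriter` dispatch reviewed above is a type-directed recursive serializer: each schema node decides how its value is written, and compound types recurse on their element schemas. A plain-Python sketch of that dispatch (not the Spark implementation; the tuple-based schema encoding such as `("array", "int")` is invented for illustration):

```python
import json

def row_to_json(schema, value):
    """Recursively serialize `value` according to `schema`.

    Mirrors valWriter: nulls are written explicitly, primitives are
    emitted directly, and arrays/maps/structs recurse on the schema
    of their contents. Null struct fields are skipped, as in the diff.
    """
    kind = schema[0] if isinstance(schema, tuple) else schema
    if value is None:
        return "null"                       # write null rather than omit it
    if kind in ("string", "timestamp", "date"):
        return json.dumps(str(value))
    if kind in ("int", "long", "double", "boolean"):
        return json.dumps(value)
    if kind == "array":                     # element schema applied to each item
        elem = schema[1]
        return "[" + ",".join(row_to_json(elem, v) for v in value) + "]"
    if kind == "map":                       # keys stringified, value schema recursed
        vs = schema[1]
        return "{" + ",".join(json.dumps(str(k)) + ":" + row_to_json(vs, v)
                              for k, v in value.items()) + "}"
    if kind == "struct":                    # field schemas zipped with row values
        fields = schema[1]
        parts = [json.dumps(name) + ":" + row_to_json(fs, v)
                 for (name, fs), v in zip(fields, value) if v is not None]
        return "{" + ",".join(parts) + "}"
    raise TypeError(f"unsupported schema: {schema!r}")

schema = ("struct", [("name", "string"), ("scores", ("array", "int"))])
print(row_to_json(schema, ["alice", [1, 2]]))
# -> {"name":"alice","scores":[1,2]}
```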
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63372210 A few minor style comments. Also, regarding testing: it seems like a really good way to get better coverage with little effort would be to just round-trip a bunch of our existing datasets through this code path. Basically, just compare the results of json data -> `jsonRDD` with json data -> `jsonRDD` -> `toJSON` -> `jsonRDD` using the `checkAnswer` function.
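The round-trip check suggested above can be sketched in miniature with plain Python's `json` module, as a stand-in for `jsonRDD`/`toJSON`/`checkAnswer` (this is not Spark code): parse the data once, re-serialize it, parse again, and compare.

```python
import json

def round_trips(json_lines):
    """Return True if parse -> serialize -> parse yields the same
    records as parsing once, i.e. serialization loses nothing."""
    parsed_once = [json.loads(line) for line in json_lines]
    re_serialized = [json.dumps(rec) for rec in parsed_once]
    parsed_twice = [json.loads(line) for line in re_serialized]
    return parsed_once == parsed_twice

data = ['{"a": 1, "b": [true, null]}', '{"a": 2, "b": {"k": "v"}}']
print(round_trips(data))  # -> True
```

Running existing JSON test datasets through such a check exercises every type case in the serializer with almost no new test code.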
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63376043 Thanks -- I'll clean up the style issues straight away. I'm glad to see this getting close to finished. As for additional tests, I'd been thinking along the same lines. However, I'm not sure where they should live: SQLQuerySuite or JsonSuite. I think if it's JsonRDD compared with rehydrated toJSON output, it should be in JsonSuite. Cheers, Dan
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20466740 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -131,6 +134,80 @@ class SchemaRDD( */ lazy val schema: StructType = queryExecution.analyzed.schema + /** Transforms a single Row to JSON using Jackson +* +* @param jsonFactory a JsonFactory object to construct a JsonGenerator +* @param rowSchema the schema object used for conversion +* @param row The row to convert +*/ + private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = { --- End diff -- I'm not sure I understand: would that be placing the rowToJSON method in JsonRDD, or as a separate object for import into SchemaRDD?
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20467259 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -131,6 +134,80 @@ class SchemaRDD( */ lazy val schema: StructType = queryExecution.analyzed.schema + /** Transforms a single Row to JSON using Jackson +* +* @param jsonFactory a JsonFactory object to construct a JsonGenerator +* @param rowSchema the schema object used for conversion +* @param row The row to convert +*/ + private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = { --- End diff -- I was just thinking of moving the `rowToJSON` method to JsonRDD.
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63381046 I'd prefer either `JSONSuite` or a new `ToJsonSuite`; `SQLQuerySuite` is already way too big.
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63386849 I'm going to go with JSONSuite; I don't think this feature is big enough to warrant a whole new suite. I'm putting rowToJSON in JsonRDD right after asRow.
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63396570 ok to test
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63396987 [Test build #23514 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23514/consoleFull) for PR 3213 at commit [`4387dd5`](https://github.com/apache/spark/commit/4387dd589425310cef7db7a5423aad5aa4a706f3). * This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63397114 [Test build #23514 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23514/consoleFull) for PR 3213 at commit [`4387dd5`](https://github.com/apache/spark/commit/4387dd589425310cef7db7a5423aad5aa4a706f3). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63397119 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23514/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63404468 [Test build #23519 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23519/consoleFull) for PR 3213 at commit [`4a651f0`](https://github.com/apache/spark/commit/4a651f0fcf359739d392a1a34dc2effef640dda1). * This patch merges cleanly.
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20478602 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -591,6 +591,30 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll { clear() } + test("SPARK-4228 SchemaRDD to JSON") --- End diff -- Do we still need it?
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20478747 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -591,6 +591,30 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll { clear() } + test("SPARK-4228 SchemaRDD to JSON") --- End diff -- Nope; removed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63405701 [Test build #23521 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23521/consoleFull) for PR 3213 at commit [`1a5fd30`](https://github.com/apache/spark/commit/1a5fd30b4cf85a84bf8b4db68913091b1246ef95). * This patch merges cleanly.
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20479487 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala --- @@ -779,4 +780,52 @@ class JsonSuite extends QueryTest { Seq(null, null, null, Seq(Seq(null, Seq(1, 2, 3 :: Nil ) } + + test("SPARK-4228 SchemaRDD to JSON") + { --- End diff -- Move it to the line above. Also, can you make changes according to @marmbrus's comments on tests (round-trip our existing datasets with this code path)?
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20479917 I'm already working on it. Shouldn't take too much longer.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63414062 [Test build #23519 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23519/consoleFull) for PR 3213 at commit [`4a651f0`](https://github.com/apache/spark/commit/4a651f0fcf359739d392a1a34dc2effef640dda1). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63414067 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23519/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63414844 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23521/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63414839 [Test build #23521 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23521/consoleFull) for PR 3213 at commit [`1a5fd30`](https://github.com/apache/spark/commit/1a5fd30b4cf85a84bf8b4db68913091b1246ef95). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63419169 OK, pulled in the bulk of the tests for primitive and complex types from other parts of JsonSuite. I think we're pretty heavily exercising the code at this point. Cheers, D
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63419571 [Test build #23535 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23535/consoleFull) for PR 3213 at commit [`6598cee`](https://github.com/apache/spark/commit/6598ceeca5b1194a6758e8d51e20028fc9bb9b56). * This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63425759 [Test build #23535 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23535/consoleFull) for PR 3213 at commit [`6598cee`](https://github.com/apache/spark/commit/6598ceeca5b1194a6758e8d51e20028fc9bb9b56). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63425765 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23535/
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20410937

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---

    + /** Transforms a single Row to JSON using Jackson
    +   *
    +   * @param jsonFactory a JsonFactory object to construct a JsonGenerator
    +   * @param rowSchema the schema object used for conversion
    +   * @param row The row to convert
    +   */
    + private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = {
    +   val writer = new StringWriter()
    +   val gen = jsonFactory.createGenerator(writer)
    +
    +   def valWriter: (DataType, Any) => Unit = {
    +     case (_, null) => // do nothing
    +     case (StringType, v: String) => gen.writeString(v)
    +     case (TimestampType, v: java.sql.Timestamp) => gen.writeString(v.toString)

--- End diff --

@yhuai This is pretty close. The Java APIs are in, and DateType and MapType are in. However, a few things:

- If the next logical action is saveAsTextFile, I'd presume this is mainly for readers. In that case, I think the string output for timestamps is preferable.
- UserDefinedType is proving a little tricky. I would think that we would just add a case for UserDefinedType[_], but that's an unsupported pattern.

On Sat, Nov 15, 2014 at 6:03 PM, Yin Huai wrote:
> If we use a string for a timestamp value, the meaning of the time can be changed (e.g. the data is generated by a developer in one time zone and then read by another developer in another time zone). I feel using getTime is better (though it is not very reader friendly). @marmbrus, what do you think?
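The time-zone concern debated above can be seen with nothing but the JDK. A hedged, Spark-free sketch (the helper names are mine, not the PR's):

```scala
import java.sql.Timestamp

// Two candidate JSON representations for a timestamp, as discussed in
// this thread. `toString` renders in the JVM's default time zone, so the
// same instant can print differently on different readers' machines;
// `getTime` is the unambiguous epoch-millisecond value, at the cost of
// human readability.
def timestampAsString(ts: Timestamp): String = ts.toString // zone-dependent, e.g. "1970-01-01 00:00:00.0" when the JVM zone is UTC
def timestampAsMillis(ts: Timestamp): Long = ts.getTime    // zone-independent epoch millis
```

The PR keeps the string form for readability; a consumer needing time-zone safety can only recover the instant if the writer's zone is known out of band.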
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63194531 @davies -- that's much cleaner; thanks! I think unicode should be the default, but optional for the deserializer, so I added that to the method. @yhuai, I followed @NathanHowell's approach; if we're ever going to have columns which are Scala collections, it's better than using ObjectMapper on a Java HashMap. This should be close to ready. Also, I'm not writing nulls to the JSON string. There's too much variability in JSON parsers around nulls; I believe it's better to avoid the issue altogether. Cheers, Dan

On Fri, Nov 14, 2014 at 8:03 PM, Davies Liu wrote:
> I think it should be
>
>     def toJsonRDD(self):
>         rdd = self._jschema_rdd.baseSchemaRDD().toJsonRDD()
>         return RDD(rdd.toJavaRDD(), self.ctx, UTF8Deserializer())
Github user davies commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63199746 @dwmclary I would prefer to have str as the default instead; it uses less memory and gives better performance (avoiding decode and encode in some cases).
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63200033 @dwmclary Can you reformat the code to follow the [2-space convention](http://docs.scala-lang.org/style/indentation.html)?
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20406459

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---

    + /** Transforms a single Row to JSON using Jackson
    +   *
    +   * @param jf a JsonFactory object to construct a JsonGenerator
    +   * @param rowSchema the schema object used for conversion
    +   * @param row The row to convert
    +   */
    + private def rowToJSON(rowSchema: StructType, jf: JsonFactory)(row: Row): String = {

--- End diff --

How about renaming `jf` to `jsonFactory`?
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20406513 Done and done, @davies, @yhuai.
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20406534

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---

    + /** Transforms a single Row to JSON using Jackson
    +   *
    +   * @param jsonFactory a JsonFactory object to construct a JsonGenerator
    +   * @param rowSchema the schema object used for conversion
    +   * @param row The row to convert
    +   */
    + private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = {
    +   val writer = new StringWriter()
    +   val gen = jsonFactory.createGenerator(writer)
    +
    +   def valWriter: (DataType, Any) => Unit = {
    +     case (_, null) => // do nothing
    +     case (StringType, v: String) => gen.writeString(v)
    +     case (TimestampType, v: java.sql.Timestamp) => gen.writeString(v.toString)
    +     case (IntegerType, v: Int) => gen.writeNumber(v)
    +     case (ShortType, v: Short) => gen.writeNumber(v)
    +     case (FloatType, v: Float) => gen.writeNumber(v)
    +     case (DoubleType, v: Double) => gen.writeNumber(v)
    +     case (LongType, v: Long) => gen.writeNumber(v)
    +     case (DecimalType(), v: java.math.BigDecimal) => gen.writeNumber(v)
    +     case (ByteType, v: Byte) => gen.writeNumber(v.toInt)
    +     case (BinaryType, v: Array[Byte]) => gen.writeBinary(v)
    +     case (BooleanType, v: Boolean) => gen.writeBoolean(v)
    +
    +     case (ArrayType(ty, _), v: Seq[_]) =>
    +       gen.writeStartArray()
    +       v.foreach(valWriter(ty, _))

--- End diff --

I think we need to keep `null`s in an array. Otherwise, the information carried by the array will be changed.
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20406537 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -591,6 +591,30 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll { clear() } + test("SPARK-4228 SchemaRDD to JSON") --- End diff -- Can we also add complex data types?
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20406544 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -17,9 +17,12 @@ package org.apache.spark.sql
-import java.util.{List => JList}
-
+import java.util.{Map => JMap, List => JList, HashMap => JHMap}
+import java.io.StringWriter
 import org.apache.spark.api.python.SerDeUtil
+import org.apache.spark.storage.StorageLevel
+import com.fasterxml.jackson.databind.ObjectMapper
+import com.fasterxml.jackson.core.JsonFactory
--- End diff -- Sort imports (see https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports).
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20406568 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -131,6 +134,69 @@ class SchemaRDD( */ lazy val schema: StructType = queryExecution.analyzed.schema
+ /** Transforms a single Row to JSON using Jackson
+ *
+ * @param jsonFactory a JsonFactory object to construct a JsonGenerator
+ * @param rowSchema the schema object used for conversion
+ * @param row The row to convert
+ */
+ private def rowToJSON(rowSchema: StructType, jsonFactory: JsonFactory)(row: Row): String = {
+   val writer = new StringWriter()
+   val gen = jsonFactory.createGenerator(writer)
+
+   def valWriter: (DataType, Any) => Unit = {
--- End diff -- Seems we are missing `MapType`, `DateType`, and `UserDefinedType`. @marmbrus Should we write the string representation (returned by `toString`) for a `UserDefinedType`?
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63200778 Seems we also need the Java API (`JavaSchemaRDD`).
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63139261 Thanks for working on this. I have two high-level comments: - I think it would be better to have a single implementation in Scala with a wrapper in Python. This way we don't have to serialize / ship the objects to Python, which seems like it might be expensive, especially if the next thing you are going to do is something like `saveAsTextFile`. - It would also be better to use Jackson to generate the JSON string, as there are a lot of tricky edge cases around escaping that we would need to handle if we did it ourselves. For example, I think this version will fail if a column name contains a quote character. /cc @yhuai
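The escaping concern is easy to demonstrate with a stdlib-Python sketch (the column name below is made up for illustration):

```python
import json

# A column name containing a quote is legal in a schema but breaks
# naive string interpolation into a JSON template.
col = 'he said "hi"'

naive = '{"%s": 1}' % col  # no escaping applied -> malformed JSON
try:
    json.loads(naive)
    naive_is_valid = True
except json.JSONDecodeError:
    naive_is_valid = False  # this branch is taken

safe = json.dumps({col: 1})  # the library escapes the inner quotes
assert naive_is_valid is False
assert json.loads(safe) == {col: 1}
```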
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63145286 Happy to help; these changes should be quick. - Sure, the wrapper for pyspark makes more sense; I hadn't considered that we'd be shipping the objects back and forth to py4j. - Jackson should be a simple change; I'll just go hash-to-JSON via ObjectMapper if that makes sense. Cheers, Dan
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63151280 Another approach is to use a `JsonGenerator` instead of an `ObjectMapper`. This is the implementation I've been using for a while: https://gist.github.com/NathanHowell/0a15f0bd23cd940becb3
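The difference is that a `JsonGenerator` streams tokens directly to the output instead of first building an object tree for an `ObjectMapper` to serialize. A rough stdlib-Python analogy of the token-streaming style (illustrative only, not Jackson's API):

```python
import io
import json

# Emit JSON tokens straight into a buffer, one value at a time, rather
# than materializing an intermediate dict and serializing it in one shot.
buf = io.StringIO()
buf.write("{")
buf.write(json.dumps("name"))   # json.dumps on each scalar handles escaping
buf.write(":")
buf.write(json.dumps('Al"ice'))
buf.write("}")
result = buf.getvalue()
assert json.loads(result) == {"name": 'Al"ice'}
```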
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63152791 I pushed up a Jackson version, which cuts down the size quite a bit. At present we're not handling complex types, correct? What I'm a bit stuck on is getting the results of the Scala method back into pyspark. If I call: newJsonRDD = someSchemaRDD._jschema_rdd.baseSchemaRDD().toJSON() I'm not sure how to deserialize it back on the Python side. My intuition would be that I'd just do RDD(newJsonRDD, sc), but that doesn't seem to give me back a valid RDD. -D
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63156674 @dwmclary Thank you for working on it. I think we need to handle complex types. The approach @NathanHowell is using looks good since we do not need to convert Scala collections to Java collections to generate the JSON string. Regarding the Python API, `toJSON` will return a Scala RDD, and I think you can use `jrdd = self._jvm.JavaRDD.fromRDD` to generate a Java RDD and then [create an RDD at the Python side](https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L123) (`RDD(jrdd, self._sc, self._jrdd_deserializer)`). @davies is this the right way?
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63157839 Yin, Thanks for jumping in. I'll run some complex types through ObjectMapper and see how it compares to JsonFactory. I figure object-creation overhead is probably equivalent. I think you're correct about the Python approach. I was hoping to pick up javaToPython, but since I have a MappedRDD at the end of the transformation I can't get it. javaToPython should stay private, so I probably have to go after the _jvm. Cheers, Dan
Github user davies commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63159199 I think it should be
```
def toJsonRDD(self):
    rdd = self._jschema_rdd.baseSchemaRDD().toJsonRDD()
    return RDD(rdd.toJavaRDD(), self.ctx, UTF8Deserializer())
```
GitHub user dwmclary opened a pull request: https://github.com/apache/spark/pull/3213 SPARK-4228 SchemaRDD to JSON Here's a simple fix for SchemaRDD to JSON. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dwmclary/spark SPARK-4228 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3213.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3213 commit d6d19e9f3e64b7c5b38994e56e0c16393f480e76 Author: Dan McClary dan.mccl...@gmail.com Date: 2014-07-22T16:38:05Z pr example commit 5d34e371b1af42febb9c8d8a0bb5ff5764577e92 Author: Dan McClary dan.mccl...@gmail.com Date: 2014-10-10T18:17:30Z merge resolved commit f7d166aff68772179fcde2c46f1ef4a935b20628 Author: Dan McClary dan.mccl...@gmail.com Date: 2014-11-11T07:37:06Z added toJSON method commit 626a5b1b589c5d11df05bccea3ac6db9c17960f1 Author: Dan McClary dan.mccl...@gmail.com Date: 2014-11-11T15:51:00Z added toJSON to SchemaRDD commit 424f130419d4a4727d1dc594d8d15e15a1293c55 Author: Dan McClary dan.mccl...@gmail.com Date: 2014-11-11T20:25:45Z tests pass, ready for pull and PR commit 319e3bac4c448119c3ac42343c14e6e52caf8dbe Author: Dan McClary dan.mccl...@gmail.com Date: 2014-11-11T22:50:36Z updated to upstream master with merged SPARK-4228
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-62636889 Can one of the admins verify this patch?