[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1774#issuecomment-54113233 @yhuai can you close this now? I think it was fixed in another PR --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1774#issuecomment-54113827 @pwendell seems it is not a part of our sql programming guide. I can update it next week (I am out of town this week).
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1774#issuecomment-54114283 I plan to use this branch as the starting point for the documentation I'll be writing this week. On Sep 1, 2014 11:28 PM, Yin Huai notificati...@github.com wrote: @pwendell seems it is not a part of our sql programming guide. I can update it next week (I am out of town this week).
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1774#issuecomment-54240312 @marmbrus should I close it now or wait until you have the new pr for our sql programming guide?
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1774#issuecomment-54240358 You can close it. On Sep 2, 2014 6:13 PM, Yin Huai notificati...@github.com wrote: @marmbrus should I close it now or wait until you have the new pr for our sql programming guide?
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user yhuai closed the pull request at: https://github.com/apache/spark/pull/1774
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user chutium commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15862768
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: SparkContext)
     new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   * Example:
+   * {{{
+   *  import org.apache.spark.sql._
+   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+   *
+   *  val schema =
+   *    StructType(
+   *      StructField("name", StringType, false) ::
+   *      StructField("age", IntegerType, true) :: Nil)
+   *
--- End diff --
good, i merged the change and used this API `applySchema(rowRDD, appliedSchema)` in #1612
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user chutium commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15799720
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: SparkContext)
     new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   * Example:
+   * {{{
+   *  import org.apache.spark.sql._
+   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+   *
+   *  val schema =
+   *    StructType(
+   *      StructField("name", StringType, false) ::
+   *      StructField("age", IntegerType, true) :: Nil)
+   *
--- End diff --
o, yep, StructType is needed, i mean `def applySchema(rowRDD: RDD[Row], schema: StructType): SchemaRDD` could be `def applySchema(rowRDD: RDD[Row], schema: Seq[StructField]): SchemaRDD`, then we do not need to always use `schema.fields.map(f => AttributeReference...)`, we can directly use `schema.map(f => AttributeReference...)`
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15801894
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: SparkContext)
     new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   * Example:
+   * {{{
+   *  import org.apache.spark.sql._
+   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+   *
+   *  val schema =
+   *    StructType(
+   *      StructField("name", StringType, false) ::
+   *      StructField("age", IntegerType, true) :: Nil)
+   *
--- End diff --
This might be crazy... but if `StructType <: Seq[StructField]` then we could pass in either `StructType` or `Seq[StructField]`. Should be possible to do this fairly easily.
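Note on the suggestion above: the idea of letting a `StructType` be passed wherever a sequence of fields is expected can be illustrated outside Spark. What follows is a minimal Python sketch under stated assumptions, not Spark's or PySpark's implementation; `StructField`, `StructType`, and `field_names` are simplified stand-ins defined only for this illustration.

```python
from collections.abc import Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class StructField:
    """Toy stand-in for a schema field (not the PySpark class)."""
    name: str
    data_type: str
    nullable: bool = True


class StructType(Sequence):
    """Toy struct type that also behaves like a sequence of its fields."""

    def __init__(self, fields):
        self.fields = list(fields)

    def __getitem__(self, index):
        return self.fields[index]

    def __len__(self):
        return len(self.fields)


def field_names(schema):
    """Hypothetical helper: accepts a StructType or a plain list of fields."""
    return [f.name for f in schema]


fields = [StructField("name", "string", False), StructField("age", "int")]
names_from_struct = field_names(StructType(fields))
names_from_list = field_names(fields)
```

Because the toy `StructType` implements the sequence protocol, a caller written against "a sequence of fields" does not need to reach into `schema.fields`, which is the convenience discussed in the comments above.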
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1774#discussion_r15827707
--- Diff: python/pyspark/sql.py ---
@@ -269,7 +269,7 @@ def __repr__(self):
 class StructType(DataType):
     """Spark SQL StructType
-    The data type representing rows.
+    The data type representing tuple or list values.
--- End diff --
This inconsistency is introduced by the difference between the JVM Row and Python Row. For a JVM Row (both Scala and Java), fields in it are nameless and we need to extract values by providing ordinals. However, a field in a Python Row has its name. Right now, in Python, if users have an `RDD[Row]`, they need to use `inferSchema` to create a `SchemaRDD`. If they have an `RDD[tuple]` or `RDD[list]`, they need to use `applySchema`.
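Note on the distinction above: the difference between a nameless, ordinal-indexed row and a row whose fields carry names can be sketched in plain Python. This is only an illustration and does not use PySpark; `infer_field_names` and `apply_field_names` are hypothetical helpers that mimic, very loosely, why a named row can have its schema inferred while a bare tuple or list needs one applied.

```python
from collections import namedtuple

# Positional record: values are reached by ordinal, like the JVM Row described above.
jvm_style_row = ("Alice", 30)

# Named record: each value carries a field name, like the Python Row described above.
Person = namedtuple("Person", ["name", "age"])
python_style_row = Person(name="Alice", age=30)


def infer_field_names(row):
    """Hypothetical helper: field names can be read off a named record."""
    return list(row._fields)


def apply_field_names(row, names):
    """Hypothetical helper: a bare tuple or list needs names supplied separately."""
    if len(row) != len(names):
        raise ValueError("row length does not match the number of field names")
    return dict(zip(names, row))


inferred = infer_field_names(python_style_row)
applied = apply_field_names(jvm_style_row, ["name", "age"])
```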
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1774#issuecomment-51276850 QA tests have started for PR 1774. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17960/consoleFull
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1774#issuecomment-51281876 QA results for PR 1774:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17960/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user chutium commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15766362
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: SparkContext)
     new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   * Example:
+   * {{{
+   *  import org.apache.spark.sql._
+   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+   *
+   *  val schema =
+   *    StructType(
+   *      StructField("name", StringType, false) ::
+   *      StructField("age", IntegerType, true) :: Nil)
+   *
--- End diff --
Hi @yhuai, why do we need to define schema as a StructType, but not directly as a Seq[StructField]? i tried to build a Seq[StructField] from JDBC metadata in #1612 https://github.com/apache/spark/pull/1612/files#diff-3 (it followed the code of your JsonRDD :) it seems we do not need this StructType anywhere.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15767384
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: SparkContext)
     new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   * Example:
+   * {{{
+   *  import org.apache.spark.sql._
+   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+   *
+   *  val schema =
+   *    StructType(
+   *      StructField("name", StringType, false) ::
+   *      StructField("age", IntegerType, true) :: Nil)
+   *
--- End diff --
For the completeness of our data types, we need `StructType` (`Seq[StructField]` is not a data type). For example, if the type of a field is a struct, we need to have a way to describe that the type of this field is a struct. Also, because a row is basically a struct value, it is natural to use `StructType` to represent a schema.
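Note on the argument above: the point that a struct must itself be a data type, so that a field's type can be a struct, can be shown with a small Python sketch. The classes below are simplified stand-ins written for this note, not the Spark or PySpark classes, and `depth` is a hypothetical helper; primitive types are represented by plain strings as an assumption.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class StructField:
    """Toy field: its type is either a primitive type name or a nested struct."""
    name: str
    data_type: Union[str, "StructType"]
    nullable: bool = True


@dataclass(frozen=True)
class StructType:
    """Toy struct type: usable both as a whole schema and as a field's type."""
    fields: Tuple[StructField, ...]


def depth(data_type):
    """Nesting depth: 0 for a primitive type name, 1 + deepest field for a struct."""
    if isinstance(data_type, StructType):
        return 1 + max((depth(f.data_type) for f in data_type.fields), default=0)
    return 0


address = StructType((
    StructField("city", "string"),
    StructField("zip", "string"),
))
person = StructType((
    StructField("name", "string", nullable=False),
    StructField("address", address),  # a field whose type is itself a struct
))
person_depth = depth(person)
```

A bare list of fields could describe `person` at the top level, but it could not serve as the type of the `address` field; that is the role the struct type plays here.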
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/1774
[SPARK-2179] [SQL] Public API for DataTypes and Schema (Draft update for SQL programming guide)
This is the draft update for SQL programming guide. It adds doc for the data type and schema APIs. You can access it at http://yhuai.github.io/site/sql-programming-guide.html.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yhuai/spark dataTypeDoc
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1774.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1774
commit 29bc6688943b5639c2e2705cb65d6d1ceca881c0
Author: Yin Huai h...@cse.ohio-state.edu
Date: 2014-08-05T00:19:47Z
Draft doc for data type and schema APIs.
commit 31ba240ac37280072d97422275d4b2c2bf5f04a5
Author: Yin Huai h...@cse.ohio-state.edu
Date: 2014-08-05T00:20:07Z
Merge remote-tracking branch 'upstream/master' into dataTypeDoc
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1774#issuecomment-51135882 QA tests have started for PR 1774. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17895/consoleFull
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1774#discussion_r15790159
--- Diff: python/pyspark/sql.py ---
@@ -269,7 +269,7 @@ def __repr__(self):
 class StructType(DataType):
     """Spark SQL StructType
-    The data type representing rows.
+    The data type representing tuple or list values.
--- End diff --
What's up with this change?
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1774#discussion_r15790226
--- Diff: docs/sql-programming-guide.md ---
@@ -152,6 +152,41 @@ val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age
 teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
 {% endhighlight %}
+Another way to turns an RDD to table is to use `applySchema`. Here is an example.
--- End diff --
It would be good to provide some motivation here. Perhaps talk about programmatically creating a schema when it is not possible to statically define classes ahead of time. Related: an example where the schema is determined statically might make more sense (i.e. read from the first row of the file?) but maybe that is too complicated... Minor: Usually we just say "For example".
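Note on the motivation above: one case where classes cannot be defined ahead of time is when the column names are only known once the file is read. The Python sketch below illustrates that idea without Spark. `schema_from_header` and `rows_from_lines` are hypothetical helpers, the sample lines are made up for the illustration, and treating every column as a nullable string is an assumption of the sketch.

```python
def schema_from_header(header_line, sep=","):
    """Hypothetical helper: build (name, type_name, nullable) triples from a header row."""
    names = [name.strip() for name in header_line.split(sep)]
    if any(not name for name in names):
        raise ValueError("header contains an empty column name")
    # Assumption: every column is treated as a nullable string until told otherwise.
    return [(name, "string", True) for name in names]


def rows_from_lines(lines, width, sep=","):
    """Split each data line into a tuple and check it matches the schema width."""
    rows = []
    for line in lines:
        values = tuple(value.strip() for value in line.split(sep))
        if len(values) != width:
            raise ValueError("row %r does not match schema width %d" % (line, width))
        rows.append(values)
    return rows


# Made-up sample input: the first line names the columns, the rest are records.
lines = ["name, age", "Michael, 29", "Andy, 30"]
schema = schema_from_header(lines[0])
rows = rows_from_lines(lines[1:], len(schema))
```

The width check mirrors the warning in the API docs quoted elsewhere in this thread: every row has to match the structure of the schema that is applied to it.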
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1774#discussion_r15790282
--- Diff: python/pyspark/sql.py ---
@@ -269,7 +269,7 @@ def __repr__(self):
 class StructType(DataType):
     """Spark SQL StructType
-    The data type representing rows.
+    The data type representing tuple or list values.
--- End diff --
@davies told me that we only accept tuples or lists as values of `StructType` for `applySchema`. We need to finalize what are acceptable value types before the release. https://issues.apache.org/jira/browse/SPARK-2854 is used to track it.
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1774#discussion_r15790429
--- Diff: docs/sql-programming-guide.md ---
@@ -152,6 +152,41 @@ val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age
 teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
 {% endhighlight %}
+Another way to turns an RDD to table is to use `applySchema`. Here is an example.
--- End diff --
to turn
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1774#discussion_r15790441
--- Diff: docs/sql-programming-guide.md ---
@@ -152,6 +152,41 @@ val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age
 teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
 {% endhighlight %}
+Another way to turns an RDD to table is to use `applySchema`. Here is an example.
+{% highlight scala %}
+// sc is an existing SparkContext.
+val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+
+// Create an RDD
+val people = sc.textFile("examples/src/main/resources/people.txt")
+
+// Import Spark SQL data types and Row.
+import org.apache.spark.sql._
+
+// Define the schema that will be applied to the RDD.
+val schema =
+  StructType(
+    StructField("name", StringType, true) ::
+    StructField("age", IntegerType, true) :: Nil)
+
+// Convert records of the RDD (people) to rows.
--- End diff --
to Rows?
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1774#discussion_r15790528
--- Diff: docs/sql-programming-guide.md ---
@@ -225,6 +260,54 @@ List<String> teenagerNames = teenagers.map(new Function<Row, String>() {
 {% endhighlight %}
+Another way to turns an RDD to table is to use `applySchema`. Here is an example.
--- End diff --
to turn; to a table
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1774#discussion_r15790538
--- Diff: docs/sql-programming-guide.md ---
@@ -259,6 +342,40 @@ for teenName in teenNames.collect():
   print teenName
 {% endhighlight %}
+Another way to turns an RDD to table is to use `applySchema`. Here is an example.
--- End diff --
Same - maybe do a replaceAll
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1774#issuecomment-51138896 QA results for PR 1774:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17895/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50581580 QA results for PR 1346:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17423/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50582008 Thanks for working on this! Merged to master.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1346
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50590014 Thank you @yhuai for the explanation.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50649902 Maven build is failing. https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/244/console I am looking at it.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50443467 QA tests have started for PR 1346. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17344/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15510908
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@ from py4j.protocol import Py4JError
-__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+__all__ = [
+    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
+    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
+    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
+    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+class PrimitiveTypeSingleton(type):
+    _instances = {}
+    def __call__(cls):
+        if cls not in cls._instances:
+            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
+        return cls._instances[cls]
+
+class StringType(object):
--- End diff --
I think PEP8 requires two blank lines to separate top level classes. Better run the pep8 checker on files changed by this PR since most other files are now pep8 clean, and we will add a pep8 checker to jenkins.
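Note on the diff above: the `PrimitiveTypeSingleton` metaclass makes every call to a primitive type class return one shared instance. The sketch below restates that pattern as a self-contained example, with two blank lines between top-level definitions as the PEP 8 comment asks. It is an illustration rather than the code in the PR: it assumes Python 3, so it uses the `metaclass=` keyword where the diff uses the Python 2 `__metaclass__` attribute, and it defines only two toy type classes.

```python
class PrimitiveTypeSingleton(type):
    """Metaclass that caches one instance per class."""

    _instances = {}

    def __call__(cls):
        # Create the instance on first use, then hand back the cached one.
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__()
        return cls._instances[cls]


class StringType(metaclass=PrimitiveTypeSingleton):
    """Toy stand-in for a string data type."""

    def __repr__(self):
        return "StringType"


class IntegerType(metaclass=PrimitiveTypeSingleton):
    """Toy stand-in for an integer data type."""

    def __repr__(self):
        return "IntegerType"


first = StringType()
second = StringType()
same_instance = first is second
```

With a shared instance per type, two schemas that both mention `StringType()` refer to the identical object, so identity and equality checks on primitive types agree.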
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15510935
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@ from py4j.protocol import Py4JError
-__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+__all__ = [
+    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
+    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
+    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
+    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+class PrimitiveTypeSingleton(type):
+    _instances = {}
+    def __call__(cls):
+        if cls not in cls._instances:
+            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
+        return cls._instances[cls]
+
+class StringType(object):
+    """Spark SQL StringType
+
+    The data type representing string values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "StringType"
+
+class BinaryType(object):
+    """Spark SQL BinaryType
+
+    The data type representing bytearray values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "BinaryType"
+
+class BooleanType(object):
+    """Spark SQL BooleanType
+
+    The data type representing bool values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "BooleanType"
+
+class TimestampType(object):
+    """Spark SQL TimestampType
+
+    The data type representing datetime.datetime values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "TimestampType"
+
+class DecimalType(object):
+    """Spark SQL DecimalType
+
+    The data type representing decimal.Decimal values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "DecimalType"
+
+class DoubleType(object):
+    """Spark SQL DoubleType
+
+    The data type representing float values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "DoubleType"
+
+class FloatType(object):
+    """Spark SQL FloatType
+
+    For now, please use L{DoubleType} instead of using L{FloatType}.
+    Because query evaluation is done in Scala, java.lang.Double will be be used
+    for Python float numbers. Because the underlying JVM type of FloatType is
+    java.lang.Float (in Java) and Float (in scala), there will be a java.lang.ClassCastException
+    if FloatType (Python) is used.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "FloatType"
+
+class ByteType(object):
+    """Spark SQL ByteType
+
+    For now, please use L{IntegerType} instead of using L{ByteType}.
+    Because query evaluation is done in Scala, java.lang.Integer will be be used
+    for Python int numbers. Because the underlying JVM type of ByteType is
+    java.lang.Byte (in Java) and Byte (in scala), there will be a java.lang.ClassCastException
+    if ByteType (Python) is used.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "ByteType"
+
+class IntegerType(object):
+    """Spark SQL IntegerType
+
+    The data type representing int values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "IntegerType"
+
+class LongType(object):
+    """Spark SQL LongType
+
+    The data type representing long values. If the any value is beyond the range of
+    [-9223372036854775808, 9223372036854775807], please use DecimalType.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "LongType"
+
+class ShortType(object):
+    """Spark SQL ShortType
+
+    For now, please use L{IntegerType} instead of using L{ShortType}.
--- End diff --
I don't get the problem after reading the comment here. Can you clarify?
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15510995
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/WrapDynamic.scala ---
@@ -21,7 +21,9 @@ import scala.language.dynamics
 import org.apache.spark.sql.catalyst.types.DataType
-case object DynamicType extends DataType
+case object DynamicType extends DataType {
--- End diff --
Do you mind adding scaladoc to explain what DynamicType is used for? (While you are at it, also add scaladoc for WrapDynamic and DynamicRow)
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15511092
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---
@@ -201,47 +231,139 @@ object FractionalType {
   }
 }
 abstract class FractionalType extends NumericType {
-  val fractional: Fractional[JvmType]
+  private[sql] val fractional: Fractional[JvmType]
 }
 case object DecimalType extends FractionalType {
-  type JvmType = BigDecimal
-  @transient lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
-  val numeric = implicitly[Numeric[BigDecimal]]
-  val fractional = implicitly[Fractional[BigDecimal]]
-  val ordering = implicitly[Ordering[JvmType]]
+  private[sql] type JvmType = BigDecimal
+  @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
+  private[sql] val numeric = implicitly[Numeric[BigDecimal]]
+  private[sql] val fractional = implicitly[Fractional[BigDecimal]]
+  private[sql] val ordering = implicitly[Ordering[JvmType]]
+  def simpleString: String = "decimal"
 }
 case object DoubleType extends FractionalType {
-  type JvmType = Double
-  @transient lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
-  val numeric = implicitly[Numeric[Double]]
-  val fractional = implicitly[Fractional[Double]]
-  val ordering = implicitly[Ordering[JvmType]]
+  private[sql] type JvmType = Double
+  @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
+  private[sql] val numeric = implicitly[Numeric[Double]]
+  private[sql] val fractional = implicitly[Fractional[Double]]
+  private[sql] val ordering = implicitly[Ordering[JvmType]]
+  def simpleString: String = "double"
 }
 case object FloatType extends FractionalType {
-  type JvmType = Float
-  @transient lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
-  val numeric = implicitly[Numeric[Float]]
-  val fractional = implicitly[Fractional[Float]]
-  val ordering = implicitly[Ordering[JvmType]]
+  private[sql] type JvmType = Float
+  @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
+  private[sql] val numeric = implicitly[Numeric[Float]]
+  private[sql] val fractional = implicitly[Fractional[Float]]
+  private[sql] val ordering = implicitly[Ordering[JvmType]]
+  def simpleString: String = "float"
+}
+
+object ArrayType {
+  /** Construct a [[ArrayType]] object with the given element type. The `containsNull` is false. */
+  def apply(elementType: DataType): ArrayType = ArrayType(elementType, false)
+}
+
+case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType {
+  private[sql] def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
+    builder.append(
+      s"${prefix}-- element: ${elementType.simpleString} (containsNull = ${containsNull})\n")
+    DataType.buildFormattedString(elementType, s"$prefix|", builder)
+  }
+
+  def simpleString: String = "array"
 }
-case class ArrayType(elementType: DataType) extends DataType
+case class StructField(name: String, dataType: DataType, nullable: Boolean) {
--- End diff --
Add scaladoc to define the semantics of nullable (nullable keys vs nullable values vs both)
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15511210
--- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
@@ -0,0 +1,212 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.api.java.types;
+
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * The base type of all Spark SQL data types.
+ *
+ * To get/create specific data type, users should use singleton objects and factory methods
+ * provided by this class.
+ */
+public abstract class DataType {
+
+  /**
+   * Gets the StringType object.
+   */
+  public static final StringType StringType = new StringType();
+
+  /**
+   * Gets the BinaryType object.
+   */
+  public static final BinaryType BinaryType = new BinaryType();
+
+  /**
+   * Gets the BooleanType object.
+   */
+  public static final BooleanType BooleanType = new BooleanType();
+
+  /**
+   * Gets the TimestampType object.
+   */
+  public static final TimestampType TimestampType = new TimestampType();
+
+  /**
+   * Gets the DecimalType object.
+   */
+  public static final DecimalType DecimalType = new DecimalType();
+
+  /**
+   * Gets the DoubleType object.
+   */
+  public static final DoubleType DoubleType = new DoubleType();
+
+  /**
+   * Gets the FloatType object.
+   */
+  public static final FloatType FloatType = new FloatType();
+
+  /**
+   * Gets the ByteType object.
+   */
+  public static final ByteType ByteType = new ByteType();
+
+  /**
+   * Gets the IntegerType object.
+   */
+  public static final IntegerType IntegerType = new IntegerType();
+
+  /**
+   * Gets the LongType object.
+   */
+  public static final LongType LongType = new LongType();
+
+  /**
+   * Gets the ShortType object.
+   */
+  public static final ShortType ShortType = new ShortType();
+
+  /**
+   * Creates an ArrayType by specifying the data type of elements ({@code elementType}).
+   * The field of {@code containsNull} is set to {@code false}.
+   *
+   * @param elementType
+   * @return
+   */
+  public static ArrayType createArrayType(DataType elementType) {
+    if (elementType == null) {
+      throw new IllegalArgumentException("elementType should not be null.");
+    }
+
+    return new ArrayType(elementType, false);
+  }
+
+  /**
+   * Creates an ArrayType by specifying the data type of elements ({@code elementType}) and
+   * whether the array contains null values ({@code containsNull}).
+   * @param elementType
+   * @param containsNull
+   * @return
+   */
+  public static ArrayType createArrayType(DataType elementType, boolean containsNull) {
+    if (elementType == null) {
+      throw new IllegalArgumentException("elementType should not be null.");
+    }
+
+    return new ArrayType(elementType, containsNull);
+  }
+
+  /**
+   * Creates a MapType by specifying the data type of keys ({@code keyType}) and values
+   * ({@code keyType}). The field of {@code valueContainsNull} is set to {@code true}.
+   *
+   * @param keyType
+   * @param valueType
+   * @return
--- End diff --
Actually, the same goes for the @param tags: if you don't explain any of them, just remove them.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15511199
--- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
@@ -0,0 +1,212 @@
[...]
+  /**
+   * Creates a MapType by specifying the data type of keys ({@code keyType}) and values
+   * ({@code keyType}). The field of {@code valueContainsNull} is set to {@code true}.
+   *
+   * @param keyType
+   * @param valueType
+   * @return
--- End diff --
Remove the @return tag if you are not going to say anything about it. Also remove it for the other functions in this PR.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15511259
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +90,45 @@ class SQLContext(@transient val sparkContext: SparkContext)
     new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd)))
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   *
+   * @group userf
--- End diff --
Would be great to give an inline example. Just wrap it with
```scala
{{{
// example code here
}}}
```
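The warning in the quoted scaladoc (every row must match the supplied schema, or a runtime exception follows) can be illustrated without Spark. The sketch below is plain Python under stated assumptions: `Field`, `check_row`, and the tuple-based rows are hypothetical stand-ins invented for illustration, not the API added by this pull request.

```python
from collections import namedtuple

# Hypothetical stand-in for a schema field: name, expected Python type, nullability.
Field = namedtuple("Field", ["name", "py_type", "nullable"])


def check_row(row, schema):
    """Raise ValueError if `row` does not have the structure described by `schema`."""
    if len(row) != len(schema):
        raise ValueError("expected %d columns, got %d" % (len(schema), len(row)))
    for value, field in zip(row, schema):
        if value is None:
            # A missing value is only acceptable in a nullable column.
            if not field.nullable:
                raise ValueError("column %r is not nullable" % field.name)
        elif not isinstance(value, field.py_type):
            raise ValueError(
                "column %r expects %s" % (field.name, field.py_type.__name__))


schema = [Field("name", str, False), Field("age", int, True)]
check_row(("Alice", None), schema)  # matches the schema, so nothing is raised
```

Checking rows up front like this turns a late failure during query evaluation into an early, readable error; whether such a check is worth its cost on large inputs is a separate design question.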
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15511405
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
@@ -259,8 +268,12 @@ private[sql] object JsonRDD extends Logging {
       // the ObjectMapper will take the last value associated with this duplicate key.
       // For example: for {"key": 1, "key":2}, we will get "key"->2.
       val mapper = new ObjectMapper()
-      iter.map(record => mapper.readValue(record, classOf[java.util.Map[String, Any]]))
-    }).map(scalafy).map(_.asInstanceOf[Map[String, Any]])
+      iter.map {
+        record =>
--- End diff --
Move `record =>` to the previous line and indent the whole thing one level less.
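The duplicate-key behavior described in the quoted code comment (the last value wins) matches what Python's standard `json` module does, which makes for a quick illustration. This is only an analogy in Python; it does not exercise the Jackson `ObjectMapper` used in `JsonRDD`.

```python
import json

# A JSON object that repeats the same key twice.
record = '{"key": 1, "key": 2}'

# json.loads keeps the value of the last name-value pair for a repeated key.
parsed = json.loads(record)
print(parsed)  # {'key': 2}
```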
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15511457
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -140,10 +142,12 @@ private[parquet] object ParquetTypesConverter extends Logging {
           assert(keyValueGroup.getFields.apply(0).getRepetition == Repetition.REQUIRED)
           val valueType = toDataType(keyValueGroup.getFields.apply(1))
           assert(keyValueGroup.getFields.apply(1).getRepetition == Repetition.REQUIRED)
-          new MapType(keyType, valueType)
+          // TODO: set valueContainsNull explicitly instead of assuming valueContainsNull is true
+          // at here.
+          MapType(keyType, valueType)
         } else if (correspondsToArray(groupType)) { // ArrayType
           val elementType = toDataType(groupType.getFields.apply(0))
-          new ArrayType(elementType)
+          ArrayType(elementType, false)
--- End diff --
here too
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15511453
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -116,7 +116,7 @@ private[parquet] object ParquetTypesConverter extends Logging {
       case ParquetOriginalType.LIST => { // TODO: check enums!
         assert(groupType.getFieldCount == 1)
         val field = groupType.getFields.apply(0)
-        new ArrayType(toDataType(field))
+        ArrayType(toDataType(field), false)
--- End diff --
For boolean arguments, pass them as named arguments, e.g.
```scala
ArrayType(toDataType(field), nullable = false) // the parameter may be named containsNull instead
```
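The same readability point carries over to the Python side of an API like this one. A minimal sketch, assuming a hypothetical, simplified `ArrayType` class (not the one in `python/pyspark/sql.py`): making the boolean keyword-only forces every call site to spell out what `False` means.

```python
class ArrayType(object):
    """Hypothetical, simplified stand-in used only to illustrate the call style."""

    def __init__(self, element_type, *, contains_null):
        # The bare `*` makes contains_null keyword-only, so callers must name it.
        self.element_type = element_type
        self.contains_null = contains_null

    def __repr__(self):
        return "ArrayType(%s, contains_null=%s)" % (
            self.element_type, self.contains_null)


# Reads unambiguously at the call site; ArrayType("string", False) raises TypeError.
strings = ArrayType("string", contains_null=False)
```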
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15511498
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/types/util/DataTypeConversions.scala ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.types.util
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.api.java.types.{DataType => JDataType, StructField => JStructField}
+
+import scala.collection.JavaConverters._
+
+protected[sql] object DataTypeConversions {
+
+  /**
+   * Returns the equivalent StructField in Scala for the given StructField in Java.
+   */
+  def asJavaStructField(scalaStructField: StructField): JStructField = {
+    org.apache.spark.sql.api.java.types.DataType.createStructField(
+      scalaStructField.name,
+      asJavaDataType(scalaStructField.dataType),
+      scalaStructField.nullable)
+  }
+
+  /**
+   * Returns the equivalent DataType in Java for the given DataType in Scala.
+   */
+  def asJavaDataType(scalaDataType: DataType): JDataType = scalaDataType match {
+    case StringType =>
+      org.apache.spark.sql.api.java.types.DataType.StringType
--- End diff --
Why not just `JDataType.StringType` instead of typing out the fully qualified names?
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50451426
@yhuai can you say a little more about `containsNull` for `ArrayType` and `MapType`? In my understanding, `Map` and `Array` values contain nulls in most cases at runtime, so why not just keep the previous implementation? Would something go wrong when producing the RDD schema if the constraint were not added? Sorry if I missed some discussion here.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50452464
QA results for PR 1346:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17344/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50531062
QA tests have started for PR 1346. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17372/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15548770
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@ from py4j.protocol import Py4JError
[...]
+class ShortType(object):
+    """Spark SQL ShortType
+
+    For now, please use L{IntegerType} instead of using L{ShortType}.
--- End diff --
If we have a ShortType column, the expression evaluator will try to cast it as a `Short` (`asInstanceOf[Short]`). However, the cast will fail because the data is `java.lang.Integer`. I will add more doc.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15548865
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@
 from py4j.protocol import Py4JError
-__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+__all__ = [
+    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
+    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
+    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
+    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+class PrimitiveTypeSingleton(type):
+    _instances = {}
+    def __call__(cls):
+        if cls not in cls._instances:
+            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
+        return cls._instances[cls]
+
+class StringType(object):
+    """Spark SQL StringType
+
+    The data type representing string values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "StringType"
+
+class BinaryType(object):
+    """Spark SQL BinaryType
+
+    The data type representing bytearray values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "BinaryType"
+
+class BooleanType(object):
+    """Spark SQL BooleanType
+
+    The data type representing bool values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "BooleanType"
+
+class TimestampType(object):
+    """Spark SQL TimestampType
+
+    The data type representing datetime.datetime values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "TimestampType"
+
+class DecimalType(object):
+    """Spark SQL DecimalType
+
+    The data type representing decimal.Decimal values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "DecimalType"
+
+class DoubleType(object):
+    """Spark SQL DoubleType
+
+    The data type representing float values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "DoubleType"
+
+class FloatType(object):
+    """Spark SQL FloatType
+
+    For now, please use L{DoubleType} instead of using L{FloatType}.
+    Because query evaluation is done in Scala, java.lang.Double will be be used
+    for Python float numbers. Because the underlying JVM type of FloatType is
+    java.lang.Float (in Java) and Float (in scala), there will be a java.lang.ClassCastException
+    if FloatType (Python) is used.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "FloatType"
+
+class ByteType(object):
+    """Spark SQL ByteType
+
+    For now, please use L{IntegerType} instead of using L{ByteType}.
+    Because query evaluation is done in Scala, java.lang.Integer will be be used
+    for Python int numbers. Because the underlying JVM type of ByteType is
+    java.lang.Byte (in Java) and Byte (in scala), there will be a java.lang.ClassCastException
+    if ByteType (Python) is used.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "ByteType"
+
+class IntegerType(object):
+    """Spark SQL IntegerType
+
+    The data type representing int values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "IntegerType"
+
+class LongType(object):
+    """Spark SQL LongType
+
+    The data type representing long values. If the any value is beyond the range of
+    [-9223372036854775808, 9223372036854775807], please use DecimalType.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "LongType"
+
+class ShortType(object):
+    """Spark SQL ShortType
+
+    For now, please use L{IntegerType} instead of using L{ShortType}.
--- End diff --
We could also convert the type to the correct type on the way in from Python.
---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
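For readers following the quoted diff: it attaches a `PrimitiveTypeSingleton` metaclass to each primitive type so that every call to a type's constructor returns one shared instance. The diff uses the Python 2 `__metaclass__` attribute; the block below is only a minimal sketch of the same idea in Python 3 syntax (the `metaclass=` keyword is my substitution, not what the PR contains), trimmed to two types.

```python
class PrimitiveTypeSingleton(type):
    """Metaclass that hands back one shared instance per class."""

    # Maps each class to its single instance.
    _instances = {}

    def __call__(cls):
        # Build the instance on first use, then keep returning it.
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__()
        return cls._instances[cls]


class StringType(metaclass=PrimitiveTypeSingleton):
    """Stand-in for the StringType in the diff."""

    def __repr__(self):
        return "StringType"


class IntegerType(metaclass=PrimitiveTypeSingleton):
    """Stand-in for the IntegerType in the diff."""

    def __repr__(self):
        return "IntegerType"
```

With this in place, `StringType() is StringType()` holds, while `StringType()` and `IntegerType()` remain distinct objects, so types can be compared by identity.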
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50532980
QA tests have started for PR 1346. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17374/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50533105
QA results for PR 1346:
- This patch FAILED unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17374/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15550525
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@
[...]
+class ShortType(object):
+    """Spark SQL ShortType
+
+    For now, please use L{IntegerType} instead of using L{ShortType}.
--- End diff --
JsonRDD already has this kind of conversion. I am not sure we want to do the conversions in Java and Scala. In Scala and Java, users can actually use `Short`, `Byte`, and `Float` values.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15551776
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@
[...]
+class ShortType(object):
+    """Spark SQL ShortType
+
+    For now, please use L{IntegerType} instead of using L{ShortType}.
--- End diff --
In Java/Scala, when users load data from a CSV file, they need to do this kind of type conversion; it would be better if we could do it for them automatically.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15554197
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@
[...]
+class ShortType(object):
+    """Spark SQL ShortType
+
+    For now, please use L{IntegerType} instead of using L{ShortType}.
--- End diff --
Yes, we should provide convenient methods for users. But we will provide methods for users to load CSV files, and we will use mutable projection to do the type conversions (by using `Cast`). Considering the size of this PR and the fact that it is blocking other people's work, it is better to think about this later.
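To make the "convert on the way in from Python" idea from this thread concrete, here is a rough sketch of a range check that could run before a Python int is handed to a narrower integral type. It is not what the PR or Spark does: `INTEGRAL_RANGES` and `check_integral` are hypothetical names, and the bounds are simply those of the corresponding JVM types (the `LongType` bounds match the range quoted in the docstring above).

```python
# Hypothetical table of inclusive bounds, keyed by Spark SQL type name.
INTEGRAL_RANGES = {
    "ByteType": (-2**7, 2**7 - 1),
    "ShortType": (-2**15, 2**15 - 1),
    "IntegerType": (-2**31, 2**31 - 1),
    "LongType": (-2**63, 2**63 - 1),
}


def check_integral(value, type_name):
    """Return value unchanged if it fits the declared type, else raise."""
    low, high = INTEGRAL_RANGES[type_name]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("%r is not an integer" % (value,))
    if not low <= value <= high:
        raise ValueError("%r does not fit in %s" % (value, type_name))
    return value
```

For example, `check_integral(127, "ByteType")` returns 127, while `check_integral(128, "ByteType")` raises `ValueError`. An actual conversion to `java.lang.Byte` or `java.lang.Short` would still have to happen on the JVM side, which is the part the thread defers.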
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50576308
@chenghao-intel `containsNull` and `valueContainsNull` can be used for further optimization. For example, let's say we have an `ArrayType` column and the element type is `IntegerType`. If elements of those arrays do not have `null` values, we can use a primitive array internally. Since we will expose data types to users, we need to introduce these two booleans with this PR. It can be hard to add them once users start to use these APIs.
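As an illustration of the point about `containsNull`, the sketch below models an `ArrayType` whose equality takes the flag into account, plus a check for when a primitive array would be safe. The class is a simplified stand-in using plain strings for element types, and `can_use_primitive_array` and `PRIMITIVE_ELEMENT_TYPES` are hypothetical names, not part of the PR.

```python
# Hypothetical set of element types that map to JVM primitives.
PRIMITIVE_ELEMENT_TYPES = {"IntegerType", "LongType", "DoubleType", "BooleanType"}


class ArrayType:
    """Simplified stand-in: element type name plus the containsNull flag."""

    def __init__(self, elementType, containsNull):
        self.elementType = elementType
        self.containsNull = containsNull

    def __eq__(self, other):
        # Two array types are equal only if the flag matches as well.
        return (isinstance(other, self.__class__)
                and self.elementType == other.elementType
                and self.containsNull == other.containsNull)

    def __hash__(self):
        return hash((self.elementType, self.containsNull))


def can_use_primitive_array(array_type):
    """True when no element can be null and the element type is primitive."""
    return (not array_type.containsNull
            and array_type.elementType in PRIMITIVE_ELEMENT_TYPES)
```

Under these assumptions, `ArrayType("IntegerType", False)` qualifies for the primitive-array path and `ArrayType("IntegerType", True)` does not, which is why the flag has to be part of the public type from the start.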
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50576339
QA tests have started for PR 1346. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17423/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15481080
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,413 @@
 from py4j.protocol import Py4JError
-__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+__all__ = [
+    "StringType", "BinaryType", "BooleanType", "DecimalType", "DoubleType",
+    "FloatType", "ByteType", "IntegerType", "LongType", "ShortType",
+    "ArrayType", "MapType", "StructField", "StructType",
+    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+class PrimitiveTypeSingleton(type):
+    _instances = {}
+    def __call__(cls):
+        if cls not in cls._instances:
+            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
+        return cls._instances[cls]
+
+class StringType(object):
+    """Spark SQL StringType
+
+    The data type representing string values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "StringType"
+
+class BinaryType(object):
+    """Spark SQL BinaryType
+
+    The data type representing bytes values and bytearray values.
--- End diff --
We probably just want to say byte arrays here since we have a separate type for byte.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15481293
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,413 @@
[...]
+class TimestampType(object):
+    """Spark SQL TimestampType"""
--- End diff --
We should also list the Python types that are expected when it's not obvious.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15481312
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,413 @@
[...]
+class DoubleType(object):
+    """Spark SQL DoubleType
+
+    The data type representing float values. Because a float value
--- End diff --
This comment isn't finished.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15481390
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,413 @@
[...]
+class FloatType(object):
+    """Spark SQL FloatType
+
+    For PySpark, please use L{DoubleType} instead of using L{FloatType}.
--- End diff --
Why? What if they know the values are limited to the float range and want to use less memory?
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15481537
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,413 @@
 from py4j.protocol import Py4JError
-__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+__all__ = [
+    "StringType", "BinaryType", "BooleanType", "DecimalType", "DoubleType",
+    "FloatType", "ByteType", "IntegerType", "LongType", "ShortType",
+    "ArrayType", "MapType", "StructField", "StructType",
+    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+class PrimitiveTypeSingleton(type):
+    _instances = {}
+    def __call__(cls):
+        if cls not in cls._instances:
+            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
+        return cls._instances[cls]
+
+class StringType(object):
+    """Spark SQL StringType
+
+    The data type representing string values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "StringType"
+
+class BinaryType(object):
+    """Spark SQL BinaryType
+
+    The data type representing bytes values and bytearray values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "BinaryType"
+
+class BooleanType(object):
+    """Spark SQL BooleanType
+
+    The data type representing bool values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "BooleanType"
+
+class TimestampType(object):
+    """Spark SQL TimestampType"""
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "TimestampType"
+
+class DecimalType(object):
+    """Spark SQL DecimalType
+
+    The data type representing decimal.Decimal values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "DecimalType"
+
+class DoubleType(object):
+    """Spark SQL DoubleType
+
+    The data type representing float values. Because a float value
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "DoubleType"
+
+class FloatType(object):
+    """Spark SQL FloatType
+
+    For PySpark, please use L{DoubleType} instead of using L{FloatType}.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "FloatType"
+
+class ByteType(object):
+    """Spark SQL ByteType
+
+    For PySpark, please use L{IntegerType} instead of using L{ByteType}.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "ByteType"
+
+class IntegerType(object):
+    """Spark SQL IntegerType
+
+    The data type representing int values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "IntegerType"
+
+class LongType(object):
+    """Spark SQL LongType
+
+    The data type representing long values. If the any value is beyond the range of
+    [-9223372036854775808, 9223372036854775807], please use DecimalType.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "LongType"
+
+class ShortType(object):
+    """Spark SQL ShortType
+
+    For PySpark, please use L{IntegerType} instead of using L{ShortType}.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "ShortType"
+
+class ArrayType(object):
+    """Spark SQL ArrayType
+
+    The data type representing list values.
+
+    """
+    def __init__(self, elementType, containsNull):
+        """Creates an ArrayType
+
+        :param elementType: the data type of elements.
+        :param containsNull: indicates whether the list contains null values.
+        :return:
+
+        >>> ArrayType(StringType, True) == ArrayType(StringType, False)
+        False
+        >>> ArrayType(StringType, True) == ArrayType(StringType, True)
+        True
+        """
+        self.elementType = elementType
+        self.containsNull = containsNull
+
+    def _get_scala_type_string(self):
+        return "ArrayType(" + self.elementType._get_scala_type_string() + "," + \
+            str(self.containsNull).lower() + ")"
+
+    def __eq__(self, other):
+        return (isinstance(other, self.__class__)
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15481675
--- Diff: python/pyspark/sql.py ---
@@ -107,6 +512,25 @@ def inferSchema(self, rdd):
         srdd = self._ssql_ctx.inferSchema(jrdd.rdd())
         return SchemaRDD(srdd, self)
+    def applySchema(self, rdd, schema):
+        """Applies the given schema to the given RDD of L{dict}s.
--- End diff --
Are we still allowing dicts? I thought there was at least going to be a warning. Or is this going to change with @davies's PR?
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15481834
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala ---
@@ -17,11 +17,12 @@
 package org.apache.spark.sql.catalyst.expressions
+import com.typesafe.scalalogging.slf4j.Logging
--- End diff --
We should use either Spark Logging or Spark SQL logging. (Ideally we will be removing catalyst's dependence on Spark solely for the logging code, but I'm okay with either ATM.) We shouldn't hard code this library here though.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15482163
--- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/BooleanType.java ---
@@ -0,0 +1,22 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.api.java.types;
+
+public class BooleanType extends DataType {
--- End diff --
Missing Javadoc.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15482279 --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java --- @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.api.java.types; + +import java.util.HashSet; +import java.util.List; +import java.util.Set; + +/** + * The base type of all Spark SQL data types. --- End diff -- I'd also talk about how this class contains singletons and factory methods for constructing datatypes.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15482239 --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/BooleanType.java --- @@ -0,0 +1,22 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.api.java.types; + +public class BooleanType extends DataType { --- End diff -- Also perhaps the Java doc should make it clear that users don't instantiate these themselves, but instead get the singletons from the DataType class.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15482765 --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/package-info.java --- @@ -0,0 +1,22 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + + +/** + * Allows users to get and create Spark SQL data types. + */ +package org.apache.spark.sql.api.java.types; --- End diff -- Newline at end of file.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15482947 --- Diff: python/pyspark/sql.py --- @@ -107,6 +512,25 @@ def inferSchema(self, rdd): srdd = self._ssql_ctx.inferSchema(jrdd.rdd()) return SchemaRDD(srdd, self) +def applySchema(self, rdd, schema): +Applies the given schema to the given RDD of L{dict}s. --- End diff -- Right, @davies will make the change in his PR.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15483028 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: SparkContext) new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd))) } + /** + * Returns the equivalent StructField in Scala for the given StructField in Java. + */ + protected def asJavaStructField(scalaStructField: StructField): JStructField = { --- End diff -- Should this be here or in the JavaSQLContext?
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15483058 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: SparkContext) new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd))) } + /** + * Returns the equivalent StructField in Scala for the given StructField in Java. + */ + protected def asJavaStructField(scalaStructField: StructField): JStructField = { --- End diff -- Same for the functions below.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15483514 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/package.scala --- @@ -0,0 +1,401 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import org.apache.spark.annotation.DeveloperApi + +/** + * Allows the execution of relational queries, including those expressed in SQL using Spark. + * + * @groupname dataType Data types + * @groupdesc Spark SQL data types. + * @groupprio dataType -3 + * @groupname field Field + * @groupprio field -2 + * @groupname row Row + * @groupprio row -1 + */ +package object sql { + + protected[sql] type Logging = com.typesafe.scalalogging.slf4j.Logging + + /** + * :: DeveloperApi :: + * + * Represents one row of output from a relational operator. + * @group row + */ + @DeveloperApi + type Row = catalyst.expressions.Row + + /** + * :: DeveloperApi :: + * + * A [[Row]] object can be constructed by providing field values. Example: + * {{{ + * import org.apache.spark.sql._ + * + * // Create a Row from values. + * Row(value1, value2, value3, ...) + * // Create a Row from a Seq of values. 
+ * Row.fromSeq(Seq(value1, value2, ...)) + * }}} + * + * A value of a row can be accessed through both generic access by ordinal, + * which will incur boxing overhead for primitives, as well as native primitive access. + * An example of generic access by ordinal: + * {{{ + * import org.apache.spark.sql._ + * + * val row = Row(1, true, a string, null) + * // row: Row = [1,true,a string,null] + * val firstValue = row(0) + * // firstValue: Any = 1 + * val fourthValue = row(3) + * // fourthValue: Any = null + * }}} + * + * For native primitive access, it is invalid to use the native primitive interface to retrieve + * a value that is null, instead a user must check `isNullAt` before attempting to retrieve a + * value that might be null. + * An example of native primitive access: + * {{{ + * // using the row from the previous example. + * val firstValue = row.getInt(0) + * // firstValue: Int = 1 + * val isNull = row.isNullAt(3) + * // isNull: Boolean = true + * }}} + * + * Interfaces related to native primitive access are: + * + * `isNullAt(i: Int): Boolean` + * + * `getInt(i: Int): Int` + * + * `getLong(i: Int): Long` + * + * `getDouble(i: Int): Double` + * + * `getFloat(i: Int): Float` + * + * `getBoolean(i: Int): Boolean` + * + * `getShort(i: Int): Short` + * + * `getByte(i: Int): Byte` + * + * `getString(i: Int): String` + * + * Fields in a [[Row]] object can be extracted in a pattern match. Example: + * {{{ + * import org.apache.spark.sql._ + * + * val pairs = sql(SELECT key, value FROM src).rdd.map { + * case Row(key: Int, value: String) = + * key - value + * } + * }}} + * + * @group row + */ + @DeveloperApi + val Row = catalyst.expressions.Row + + /** + * :: DeveloperApi :: + * + * The base type of all Spark SQL data types. 
+ * + * @group dataType + */ + @DeveloperApi + type DataType = catalyst.types.DataType + + /** + * :: DeveloperApi :: + * + * The data type representing `String` values + * + * @group dataType + */ + @DeveloperApi + val StringType = catalyst.types.StringType + + /** + * :: DeveloperApi :: + * + * The data type representing `Array[Byte]` values. + * + * @group dataType + */ + @DeveloperApi + val BinaryType = catalyst.types.BinaryType + + /** + *
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15483619 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/package.scala --- @@ -0,0 +1,401 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import org.apache.spark.annotation.DeveloperApi + +/** + * Allows the execution of relational queries, including those expressed in SQL using Spark. + * + * @groupname dataType Data types + * @groupdesc Spark SQL data types. + * @groupprio dataType -3 + * @groupname field Field + * @groupprio field -2 + * @groupname row Row + * @groupprio row -1 + */ +package object sql { + + protected[sql] type Logging = com.typesafe.scalalogging.slf4j.Logging + + /** + * :: DeveloperApi :: + * + * Represents one row of output from a relational operator. + * @group row + */ + @DeveloperApi + type Row = catalyst.expressions.Row + + /** + * :: DeveloperApi :: + * + * A [[Row]] object can be constructed by providing field values. Example: + * {{{ + * import org.apache.spark.sql._ + * + * // Create a Row from values. + * Row(value1, value2, value3, ...) + * // Create a Row from a Seq of values. 
+ * Row.fromSeq(Seq(value1, value2, ...)) + * }}} + * + * A value of a row can be accessed through both generic access by ordinal, + * which will incur boxing overhead for primitives, as well as native primitive access. + * An example of generic access by ordinal: + * {{{ + * import org.apache.spark.sql._ + * + * val row = Row(1, true, a string, null) + * // row: Row = [1,true,a string,null] + * val firstValue = row(0) + * // firstValue: Any = 1 + * val fourthValue = row(3) + * // fourthValue: Any = null + * }}} + * + * For native primitive access, it is invalid to use the native primitive interface to retrieve + * a value that is null, instead a user must check `isNullAt` before attempting to retrieve a + * value that might be null. + * An example of native primitive access: + * {{{ + * // using the row from the previous example. + * val firstValue = row.getInt(0) + * // firstValue: Int = 1 + * val isNull = row.isNullAt(3) + * // isNull: Boolean = true + * }}} + * + * Interfaces related to native primitive access are: + * + * `isNullAt(i: Int): Boolean` + * + * `getInt(i: Int): Int` + * + * `getLong(i: Int): Long` + * + * `getDouble(i: Int): Double` + * + * `getFloat(i: Int): Float` + * + * `getBoolean(i: Int): Boolean` + * + * `getShort(i: Int): Short` + * + * `getByte(i: Int): Byte` + * + * `getString(i: Int): String` + * + * Fields in a [[Row]] object can be extracted in a pattern match. Example: + * {{{ + * import org.apache.spark.sql._ + * + * val pairs = sql(SELECT key, value FROM src).rdd.map { + * case Row(key: Int, value: String) = + * key - value + * } + * }}} + * + * @group row + */ + @DeveloperApi + val Row = catalyst.expressions.Row + + /** + * :: DeveloperApi :: + * + * The base type of all Spark SQL data types. 
+ * + * @group dataType + */ + @DeveloperApi + type DataType = catalyst.types.DataType + + /** + * :: DeveloperApi :: + * + * The data type representing `String` values + * + * @group dataType + */ + @DeveloperApi + val StringType = catalyst.types.StringType + + /** + * :: DeveloperApi :: + * + * The data type representing `Array[Byte]` values. + * + * @group dataType + */ + @DeveloperApi + val BinaryType = catalyst.types.BinaryType + + /** + *
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15483710 --- Diff: python/pyspark/sql.py --- @@ -20,8 +20,413 @@ from py4j.protocol import Py4JError -__all__ = [SQLContext, HiveContext, LocalHiveContext, TestHiveContext, SchemaRDD, Row] +__all__ = [ +StringType, BinaryType, BooleanType, DecimalType, DoubleType, +FloatType, ByteType, IntegerType, LongType, ShortType, +ArrayType, MapType, StructField, StructType, +SQLContext, HiveContext, LocalHiveContext, TestHiveContext, SchemaRDD, Row] +class PrimitiveTypeSingleton(type): +_instances = {} +def __call__(cls): +if cls not in cls._instances: +cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__() +return cls._instances[cls] + +class StringType(object): +Spark SQL StringType + +The data type representing string values. + + +__metaclass__ = PrimitiveTypeSingleton + +def _get_scala_type_string(self): +return StringType + +class BinaryType(object): +Spark SQL BinaryType + +The data type representing bytes values and bytearray values. + + +__metaclass__ = PrimitiveTypeSingleton + +def _get_scala_type_string(self): +return BinaryType + +class BooleanType(object): +Spark SQL BooleanType + +The data type representing bool values. + + +__metaclass__ = PrimitiveTypeSingleton + +def _get_scala_type_string(self): +return BooleanType + +class TimestampType(object): +Spark SQL TimestampType +__metaclass__ = PrimitiveTypeSingleton + +def _get_scala_type_string(self): +return TimestampType + +class DecimalType(object): +Spark SQL DecimalType + +The data type representing decimal.Decimal values. + + +__metaclass__ = PrimitiveTypeSingleton + +def _get_scala_type_string(self): +return DecimalType + +class DoubleType(object): +Spark SQL DoubleType + +The data type representing float values. 
Because a float value + + +__metaclass__ = PrimitiveTypeSingleton + +def _get_scala_type_string(self): +return DoubleType + +class FloatType(object): +Spark SQL FloatType + +For PySpark, please use L{DoubleType} instead of using L{FloatType}. + + +__metaclass__ = PrimitiveTypeSingleton + +def _get_scala_type_string(self): +return FloatType + +class ByteType(object): +Spark SQL ByteType + +For PySpark, please use L{IntegerType} instead of using L{ByteType}. + + +__metaclass__ = PrimitiveTypeSingleton + +def _get_scala_type_string(self): +return ByteType + +class IntegerType(object): +Spark SQL IntegerType + +The data type representing int values. + + +__metaclass__ = PrimitiveTypeSingleton + +def _get_scala_type_string(self): +return IntegerType + +class LongType(object): +Spark SQL LongType + +The data type representing long values. If the any value is beyond the range of +[-9223372036854775808, 9223372036854775807], please use DecimalType. + + +__metaclass__ = PrimitiveTypeSingleton + +def _get_scala_type_string(self): +return LongType + +class ShortType(object): +Spark SQL ShortType + +For PySpark, please use L{IntegerType} instead of using L{ShortType}. + + +__metaclass__ = PrimitiveTypeSingleton + +def _get_scala_type_string(self): +return ShortType + +class ArrayType(object): +Spark SQL ArrayType + +The data type representing list values. + + +def __init__(self, elementType, containsNull): --- End diff -- Should we have the same default value for containsNull that we have in Scala?
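The question above can be pictured with a minimal sketch. It is not the PySpark implementation: it restates the quoted `PrimitiveTypeSingleton` metaclass in Python 3 syntax and gives `ArrayType` a keyword default for `containsNull`. The default value used here (`False`) is an assumption for illustration only; the real default would need to mirror whatever the Scala `ArrayType` uses.

```python
class PrimitiveTypeSingleton(type):
    """Metaclass that hands back one shared instance per primitive type class."""
    _instances = {}

    def __call__(cls):
        # Create the instance on first use, then keep returning the same one.
        if cls not in cls._instances:
            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
        return cls._instances[cls]


class StringType(metaclass=PrimitiveTypeSingleton):
    """The data type representing string values."""


class IntegerType(metaclass=PrimitiveTypeSingleton):
    """The data type representing int values."""


class ArrayType(object):
    """The data type representing list values."""

    # ASSUMPTION: False is a placeholder default, not necessarily Spark's value.
    def __init__(self, elementType, containsNull=False):
        self.elementType = elementType
        self.containsNull = containsNull


# Callers may omit the flag when the default is what they want.
ints = ArrayType(IntegerType())
nullable_strings = ArrayType(StringType(), containsNull=True)
```

With a default in place, callers only pass `containsNull` when they need the non-default behaviour, which keeps the Python call shape close to the Scala one.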
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15483749 --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java --- @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.api.java.types; + +import java.util.HashSet; +import java.util.List; +import java.util.Set; + +/** + * The base type of all Spark SQL data types. + */ +public abstract class DataType { + + /** + * Gets the StringType object. + */ + public static final StringType StringType = new StringType(); + + /** + * Gets the BinaryType object. + */ + public static final BinaryType BinaryType = new BinaryType(); + + /** + * Gets the BooleanType object. + */ + public static final BooleanType BooleanType = new BooleanType(); + + /** + * Gets the TimestampType object. + */ + public static final TimestampType TimestampType = new TimestampType(); + + /** + * Gets the DecimalType object. + */ + public static final DecimalType DecimalType = new DecimalType(); + + /** + * Gets the DoubleType object. + */ + public static final DoubleType DoubleType = new DoubleType(); + + /** + * Gets the FloatType object. 
+ */ + public static final FloatType FloatType = new FloatType(); + + /** + * Gets the ByteType object. + */ + public static final ByteType ByteType = new ByteType(); + + /** + * Gets the IntegerType object. + */ + public static final IntegerType IntegerType = new IntegerType(); + + /** + * Gets the LongType object. + */ + public static final LongType LongType = new LongType(); + + /** + * Gets the ShortType object. + */ + public static final ShortType ShortType = new ShortType(); + + /** + * Creates an ArrayType by specifying the data type of elements ({@code elementType}) and + * whether the array contains null values ({@code containsNull}). + * @param elementType + * @param containsNull + * @return + */ + public static ArrayType createArrayType(DataType elementType, boolean containsNull) { --- End diff -- Add another method that has a default for containsNull?
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15492345 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: SparkContext) new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd))) } + /** + * Returns the equivalent StructField in Scala for the given StructField in Java. + */ + protected def asJavaStructField(scalaStructField: StructField): JStructField = { --- End diff -- Originally, I put it in `JavaSQLContext`. But I found that I need access to `asJavaDataType` in `JavaSchemaRDD`, which only has a `SQLContext` instead of a `JavaSQLContext`. I guess we want to refactor `JavaSchemaRDD` to use `JavaSQLContext` instead of `SQLContext`?
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15492409 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: SparkContext) new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd))) } + /** + * Returns the equivalent StructField in Scala for the given StructField in Java. + */ + protected def asJavaStructField(scalaStructField: StructField): JStructField = { --- End diff -- Oh, I see. These are all static functions right? Maybe we could put them all in a python support object.
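The idea in this exchange can be sketched as follows. Because the conversion helpers read no state from the context, they could live in a standalone helper that any caller uses directly, whether or not it holds a Java-specific context. Everything below is hypothetical and written in Python purely for illustration: `ScalaField`, `JavaField`, `TypeConversions`, and `SchemaRDDWrapper` are made-up stand-ins, not Spark classes.

```python
from collections import namedtuple

# Hypothetical stand-ins for the Scala-side and Java-side StructField classes.
ScalaField = namedtuple("ScalaField", ["name", "dataType", "nullable"])
JavaField = namedtuple("JavaField", ["name", "dataType", "nullable"])


class TypeConversions(object):
    """Hypothetical support object holding stateless conversion helpers."""

    @staticmethod
    def as_java_struct_field(scala_field):
        # No context instance is needed, so any caller can use this directly.
        return JavaField(scala_field.name, scala_field.dataType, scala_field.nullable)


class SchemaRDDWrapper(object):
    """Hypothetical wrapper that has no Java-specific context available."""

    def __init__(self, fields):
        self._fields = fields

    def schema(self):
        # Delegates to the shared helper instead of a method on a context.
        return [TypeConversions.as_java_struct_field(f) for f in self._fields]


wrapper = SchemaRDDWrapper([ScalaField("key", "IntegerType", False)])
converted = wrapper.schema()
```

The design point is only that the helper has no dependency on either context class, so moving it out removes the need for the wrapper to hold one.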
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15493024 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: SparkContext) new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd))) } + /** + * Returns the equivalent StructField in Scala for the given StructField in Java. + */ + protected def asJavaStructField(scalaStructField: StructField): JStructField = { --- End diff -- Will move them to a better place.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15499882 --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java --- @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.api.java.types; + +import java.util.HashSet; +import java.util.List; +import java.util.Set; + +/** + * The base type of all Spark SQL data types. + */ +public abstract class DataType { + + /** + * Gets the StringType object. + */ + public static final StringType StringType = new StringType(); + + /** + * Gets the BinaryType object. + */ + public static final BinaryType BinaryType = new BinaryType(); + + /** + * Gets the BooleanType object. + */ + public static final BooleanType BooleanType = new BooleanType(); + + /** + * Gets the TimestampType object. + */ + public static final TimestampType TimestampType = new TimestampType(); + + /** + * Gets the DecimalType object. + */ + public static final DecimalType DecimalType = new DecimalType(); + + /** + * Gets the DoubleType object. + */ + public static final DoubleType DoubleType = new DoubleType(); + + /** + * Gets the FloatType object. 
+ */ + public static final FloatType FloatType = new FloatType(); + + /** + * Gets the ByteType object. + */ + public static final ByteType ByteType = new ByteType(); + + /** + * Gets the IntegerType object. + */ + public static final IntegerType IntegerType = new IntegerType(); + + /** + * Gets the LongType object. + */ + public static final LongType LongType = new LongType(); + + /** + * Gets the ShortType object. + */ + public static final ShortType ShortType = new ShortType(); + + /** + * Creates an ArrayType by specifying the data type of elements ({@code elementType}) and + * whether the array contains null values ({@code containsNull}). + * @param elementType + * @param containsNull + * @return + */ + public static ArrayType createArrayType(DataType elementType, boolean containsNull) { --- End diff -- Done.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15499885 --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/package-info.java --- @@ -0,0 +1,22 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + + +/** + * Allows users to get and create Spark SQL data types. + */ +package org.apache.spark.sql.api.java.types; --- End diff -- Done.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1346#discussion_r15499886 --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java --- @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.api.java.types; + +import java.util.HashSet; +import java.util.List; +import java.util.Set; + +/** + * The base type of all Spark SQL data types. --- End diff -- Done.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50420253 QA tests have started for PR 1346. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17319/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50420304
QA results for PR 1346:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17319/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50420623 QA tests have started for PR 1346. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17320/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50423690 @yhuai @marmbrus I am not sure if this has been discussed before, but what do you guys think about adding a version of `applySchema(RDD[Array[String]], StructType)`? The use case I have in mind is TPC-DS data preparation. Currently I have a bunch of text files, from which I can easily create an `RDD[String]`; by splitting each line on some separator I get an `RDD[Array[String]]`. Now, in TPC-DS the tables easily have 15+ columns, and I don't want to manually create a `Row` for each `Array[String]`.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50423851 To add to this: for my own purpose, I can certainly hack something together based off this branch in a custom Spark build, but just want to throw this thought out there as I think it does have some generality (large number of columns, avoid writing `.map(p => Row(p(0), p(1), ..., p(LARGE_NUM)))`).
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50424563 @concretevitamin There is another way to create a row, which is `Row.fromSeq(values: Seq[Any])`. Or, you can expand the array by using `:_*`.
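The two options mentioned in this reply can be sketched roughly as follows. This is a minimal illustration, not code from the PR: the object `RowFromArrayExample`, the helpers `parseLine`, `toRowFromSeq`, and `toRowVarargs`, and the sample `|`-separated line are all hypothetical, and the sketch assumes Spark SQL is on the classpath so that `org.apache.spark.sql.Row` resolves.

```scala
import org.apache.spark.sql.Row

// Hypothetical example object; none of these helper names come from the PR.
object RowFromArrayExample {

  // Split one delimited text line into its raw string fields.
  // The separator is a regex; the -1 limit keeps trailing empty fields.
  def parseLine(line: String, sep: String = "\\|"): Array[String] =
    line.split(sep, -1)

  // Option 1: build the Row from a sequence of values.
  def toRowFromSeq(fields: Array[String]): Row =
    Row.fromSeq(fields.toSeq)

  // Option 2: expand the array into Row's varargs constructor with `: _*`.
  def toRowVarargs(fields: Array[String]): Row =
    Row(fields: _*)

  def main(args: Array[String]): Unit = {
    // Made-up sample line, assumed to be pipe-separated.
    val fields = parseLine("1|AAAAAAAABAAAAAAA|1997-10-27")
    println(toRowFromSeq(fields))
    println(toRowVarargs(fields))
  }
}
```

With an `RDD[Array[String]]` named, say, `lines`, either helper could be passed to `lines.map(...)` to obtain an `RDD[Row]` without spelling out each column. Both helpers leave every field as a `String`, so the fields would presumably need converting first if the `StructType` given to `applySchema` declares non-string columns.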
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50426883
QA results for PR 1346:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17320/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50432240 I am reviewing it. Will have an update soon.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50291054 QA tests have started for PR 1346. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17263/consoleFull
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50292497 @yhuai awesome! I will update my diff to use this API.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50293500
QA results for PR 1346:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17263/consoleFull