[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-09-02 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1774#issuecomment-54113233
  
@yhuai can you close this now? I think it was fixed in another PR





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-09-02 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1774#issuecomment-54113827
  
@pwendell seems it is not part of our SQL programming guide. I can update it next week (I am out of town this week).





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-09-02 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1774#issuecomment-54114283
  
I plan to use this branch as the starting point for the documentation I'll
be writing this week.
On Sep 1, 2014 11:28 PM, Yin Huai notificati...@github.com wrote:

 @pwendell https://github.com/pwendell seems it is not part of our SQL
 programming guide. I can update it next week (I am out of town this week).

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1774#issuecomment-54113827.






[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-09-02 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1774#issuecomment-54240312
  
@marmbrus should I close it now or wait until you have the new PR for our SQL programming guide?





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-09-02 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1774#issuecomment-54240358
  
You can close it.
On Sep 2, 2014 6:13 PM, Yin Huai notificati...@github.com wrote:

 @marmbrus https://github.com/marmbrus should I close it now or wait
 until you have the new PR for our SQL programming guide?

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1774#issuecomment-54240312.






[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-09-02 Thread yhuai
Github user yhuai closed the pull request at:

https://github.com/apache/spark/pull/1774





[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-08-06 Thread chutium
Github user chutium commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15862768
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, 
SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
 
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by 
applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of 
the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   * Example:
+   * {{{
+   *  import org.apache.spark.sql._
+   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+   *
+   *  val schema =
+   *    StructType(
+   *      StructField("name", StringType, false) ::
+   *      StructField("age", IntegerType, true) :: Nil)
+   *
--- End diff --

Good, I merged the change and used this API ```applySchema(rowRDD, appliedSchema)``` in #1612





[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-08-05 Thread chutium
Github user chutium commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15799720
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, 
SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
 
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by 
applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of 
the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   * Example:
+   * {{{
+   *  import org.apache.spark.sql._
+   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+   *
+   *  val schema =
+   *    StructType(
+   *      StructField("name", StringType, false) ::
+   *      StructField("age", IntegerType, true) :: Nil)
+   *
--- End diff --

Oh, yep, StructType is needed. I mean
```def applySchema(rowRDD: RDD[Row], schema: StructType): SchemaRDD```
could be
```def applySchema(rowRDD: RDD[Row], schema: Seq[StructField]): SchemaRDD```

then we would not need to always use ```schema.fields.map(f => AttributeReference...)```

we could directly use ```schema.map(f => AttributeReference...)```







[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-08-05 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15801894
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, 
SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
 
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by 
applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of 
the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   * Example:
+   * {{{
+   *  import org.apache.spark.sql._
+   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+   *
+   *  val schema =
+   *    StructType(
+   *      StructField("name", StringType, false) ::
+   *      StructField("age", IntegerType, true) :: Nil)
+   *
--- End diff --

This might be crazy... but if `StructType <: Seq[StructField]` then we could pass in either `StructType` or `Seq[StructField]`.  Should be possible to do this fairly easily
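
A minimal sketch of the subtyping idea, using simplified stand-in types rather than Spark's actual classes (`dataType` is just a `String` here):

```scala
// Sketch only: stand-in definitions, not Spark's real ones.
case class StructField(name: String, dataType: String, nullable: Boolean)

// Extending Seq[StructField] only requires apply, length, and iterator.
case class StructType(fields: Seq[StructField]) extends Seq[StructField] {
  def apply(idx: Int): StructField = fields(idx)
  def length: Int = fields.length
  def iterator: Iterator[StructField] = fields.iterator
}

// Code written against Seq[StructField] then accepts both forms:
def fieldNames(schema: Seq[StructField]): Seq[String] = schema.map(_.name)

val schema = StructType(Seq(StructField("name", "string", false)))
fieldNames(schema)        // a StructType passes directly
fieldNames(schema.fields) // so does a plain Seq[StructField]
```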





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-05 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1774#discussion_r15827707
  
--- Diff: python/pyspark/sql.py ---
@@ -269,7 +269,7 @@ def __repr__(self):
 class StructType(DataType):
     """Spark SQL StructType
 
-    The data type representing rows.
+    The data type representing tuple or list values.
--- End diff --

This inconsistency is introduced by the difference between the JVM Row and 
Python Row. For a JVM Row (both Scala and Java), fields in it are nameless and 
we need to extract values by providing ordinals. However, a field in a Python 
Row has its name. Right now, in Python, if users have an `RDD[Row]`, they need 
to use `inferSchema` to create a `SchemaRDD`. If they have an `RDD[tuple]` or 
`RDD[list]`, they need to use `applySchema`.
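
To make the JVM side of that asymmetry concrete, a small sketch of ordinal-based access (assuming the public `Row` factory and typed getters this PR exposes):

```scala
import org.apache.spark.sql._

// JVM Row fields are nameless; values come out by position.
val row = Row("Alice", 30)
val name = row.getString(0) // ordinal access; there is no row("name") on the JVM
val age  = row.getInt(1)
```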





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1774#issuecomment-51276850
  
QA tests have started for PR 1774. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17960/consoleFull





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1774#issuecomment-51281876
  
QA results for PR 1774:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17960/consoleFull





[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-08-04 Thread chutium
Github user chutium commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15766362
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, 
SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
 
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by 
applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of 
the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   * Example:
+   * {{{
+   *  import org.apache.spark.sql._
+   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+   *
+   *  val schema =
+   *    StructType(
+   *      StructField("name", StringType, false) ::
+   *      StructField("age", IntegerType, true) :: Nil)
+   *
--- End diff --

Hi @yhuai, why do we need to define the schema as a StructType rather than directly as a Seq[StructField]? I tried to build a Seq[StructField] from JDBC metadata in #1612 https://github.com/apache/spark/pull/1612/files#diff-3 (it followed the code of your JsonRDD :)

It seems we do not need this StructType anywhere.





[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-08-04 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15767384
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, 
SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
 
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by 
applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of 
the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   * Example:
+   * {{{
+   *  import org.apache.spark.sql._
+   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+   *
+   *  val schema =
+   *    StructType(
+   *      StructField("name", StringType, false) ::
+   *      StructField("age", IntegerType, true) :: Nil)
+   *
--- End diff --

For the completeness of our data types, we need `StructType` (`Seq[StructField]` is not a data type). For example, if the type of a field is a struct, we need to have a way to describe that the type of this field is a struct. Also, because a row is basically a struct value, it is natural to use `StructType` to represent a schema.
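
A short sketch of the nested case being described, in the style of the scaladoc example above: a field whose own type is a struct can only be expressed if the struct shape is itself a `DataType`, which is what `StructType` provides and a bare `Seq[StructField]` does not.

```scala
import org.apache.spark.sql._

// The dataType slot of a StructField must hold a DataType,
// so a nested shape has to be a StructType.
val addressType =
  StructType(
    StructField("city", StringType, true) ::
    StructField("zipCode", StringType, true) :: Nil)

val personType =
  StructType(
    StructField("name", StringType, false) ::
    StructField("address", addressType, true) :: Nil)
```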





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread yhuai
GitHub user yhuai opened a pull request:

https://github.com/apache/spark/pull/1774

[SPARK-2179] [SQL] Public API for DataTypes and Schema (Draft update for 
SQL programming guide)

This is the draft update for the SQL programming guide. It adds docs for the data type and schema APIs. You can access it at http://yhuai.github.io/site/sql-programming-guide.html.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yhuai/spark dataTypeDoc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1774.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1774


commit 29bc6688943b5639c2e2705cb65d6d1ceca881c0
Author: Yin Huai h...@cse.ohio-state.edu
Date:   2014-08-05T00:19:47Z

Draft doc for data type and schema APIs.

commit 31ba240ac37280072d97422275d4b2c2bf5f04a5
Author: Yin Huai h...@cse.ohio-state.edu
Date:   2014-08-05T00:20:07Z

Merge remote-tracking branch 'upstream/master' into dataTypeDoc







[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1774#issuecomment-51135882
  
QA tests have started for PR 1774. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17895/consoleFull





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1774#discussion_r15790159
  
--- Diff: python/pyspark/sql.py ---
@@ -269,7 +269,7 @@ def __repr__(self):
 class StructType(DataType):
     """Spark SQL StructType
 
-    The data type representing rows.
+    The data type representing tuple or list values.
--- End diff --

What's up with this change?





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1774#discussion_r15790226
  
--- Diff: docs/sql-programming-guide.md ---
@@ -152,6 +152,41 @@ val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
 teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
 {% endhighlight %}
 
+Another way to turns an RDD to table is to use `applySchema`. Here is an 
example.
--- End diff --

It would be good to provide some motivation here. Perhaps talk about programmatically creating a schema when it is not possible to statically define classes ahead of time.

Related: an example where the schema is determined dynamically might make more sense (i.e. read from the first row of the file?) but maybe that is too complicated...

Minor: Usually we just say "For example."
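
As a sketch of that motivation (illustrative names only, not text proposed for the guide), a schema might be built at runtime from a header line when no classes can be defined ahead of time:

```scala
import org.apache.spark.sql._

// Derive a StructType from a comma-separated header read at runtime,
// e.g. the first row of a file.
def schemaFromHeader(header: String): StructType =
  StructType(
    header.split(",").map(columnName =>
      StructField(columnName.trim, StringType, true)).toSeq)

val schema = schemaFromHeader("name,age,city")
```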





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1774#discussion_r15790282
  
--- Diff: python/pyspark/sql.py ---
@@ -269,7 +269,7 @@ def __repr__(self):
 class StructType(DataType):
     """Spark SQL StructType
 
-    The data type representing rows.
+    The data type representing tuple or list values.
--- End diff --

@davies told me that we only accept tuples or lists as values of `StructType` for `applySchema`. We need to finalize what the acceptable value types are before the release. https://issues.apache.org/jira/browse/SPARK-2854 is used to track it.





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1774#discussion_r15790429
  
--- Diff: docs/sql-programming-guide.md ---
@@ -152,6 +152,41 @@ val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
 teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
 {% endhighlight %}
 
+Another way to turns an RDD to table is to use `applySchema`. Here is an 
example.
--- End diff --

to turn





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1774#discussion_r15790441
  
--- Diff: docs/sql-programming-guide.md ---
@@ -152,6 +152,41 @@ val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
 teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
 {% endhighlight %}
 
+Another way to turns an RDD to table is to use `applySchema`. Here is an 
example.
+{% highlight scala %}
+// sc is an existing SparkContext.
+val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+
+// Create an RDD
+val people = sc.textFile("examples/src/main/resources/people.txt")
+
+// Import Spark SQL data types and Row.
+import org.apache.spark.sql._
+
+// Define the schema that will be applied to the RDD.
+val schema =
+  StructType(
+    StructField("name", StringType, true) ::
+    StructField("age", IntegerType, true) :: Nil)
+
+// Convert records of the RDD (people) to rows.
--- End diff --

to Rows?





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1774#discussion_r15790528
  
--- Diff: docs/sql-programming-guide.md ---
@@ -225,6 +260,54 @@ List<String> teenagerNames = teenagers.map(new Function<Row, String>() {
 
 {% endhighlight %}
 
+Another way to turns an RDD to table is to use `applySchema`. Here is an 
example.
--- End diff --

to turn; to a table





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1774#discussion_r15790538
  
--- Diff: docs/sql-programming-guide.md ---
@@ -259,6 +342,40 @@ for teenName in teenNames.collect():
   print teenName
 {% endhighlight %}
 
+Another way to turns an RDD to table is to use `applySchema`. Here is an 
example.
--- End diff --

Same - maybe do a replaceAll





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1774#issuecomment-51138896
  
QA results for PR 1774:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17895/consoleFull





[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50581580
  
QA results for PR 1346:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17423/consoleFull




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-30 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50582008
  
Thanks for working on this!  Merged to master.




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1346




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-30 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50590014
  
Thank you @yhuai for the explanation.




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-30 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50649902
  
Maven build is failing:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/244/console
I am looking at it.




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50443467
  
QA tests have started for PR 1346. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17344/consoleFull




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15510908
  
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@
 
 from py4j.protocol import Py4JError
 
-__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+__all__ = [
+    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
+    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
+    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
+    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
 
+class PrimitiveTypeSingleton(type):
+    _instances = {}
+    def __call__(cls):
+        if cls not in cls._instances:
+            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
+        return cls._instances[cls]
+
+class StringType(object):
--- End diff --

I think PEP8 requires two blank lines to separate top level classes.

Better run the pep8 checker on files changed by this PR since most other 
files are now pep8 clean, and we will add a pep8 checker to jenkins. 




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15510935
  
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@
 
 from py4j.protocol import Py4JError
 
-__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+__all__ = [
+    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
+    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
+    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
+    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
 
+class PrimitiveTypeSingleton(type):
+    _instances = {}
+    def __call__(cls):
+        if cls not in cls._instances:
+            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
+        return cls._instances[cls]
+
+class StringType(object):
+    """Spark SQL StringType
+
+    The data type representing string values.
+    """
+
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "StringType"
+
+class BinaryType(object):
+    """Spark SQL BinaryType
+
+    The data type representing bytearray values.
+    """
+
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "BinaryType"
+
+class BooleanType(object):
+    """Spark SQL BooleanType
+
+    The data type representing bool values.
+    """
+
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "BooleanType"
+
+class TimestampType(object):
+    """Spark SQL TimestampType
+
+    The data type representing datetime.datetime values.
+    """
+
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "TimestampType"
+
+class DecimalType(object):
+    """Spark SQL DecimalType
+
+    The data type representing decimal.Decimal values.
+    """
+
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "DecimalType"
+
+class DoubleType(object):
+    """Spark SQL DoubleType
+
+    The data type representing float values.
+    """
+
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "DoubleType"
+
+class FloatType(object):
+    """Spark SQL FloatType
+
+    For now, please use L{DoubleType} instead of using L{FloatType}.
+    Because query evaluation is done in Scala, java.lang.Double will be used
+    for Python float numbers. Because the underlying JVM type of FloatType is
+    java.lang.Float (in Java) and Float (in scala), there will be a java.lang.ClassCastException
+    if FloatType (Python) is used.
+    """
+
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "FloatType"
+
+class ByteType(object):
+    """Spark SQL ByteType
+
+    For now, please use L{IntegerType} instead of using L{ByteType}.
+    Because query evaluation is done in Scala, java.lang.Integer will be used
+    for Python int numbers. Because the underlying JVM type of ByteType is
+    java.lang.Byte (in Java) and Byte (in scala), there will be a java.lang.ClassCastException
+    if ByteType (Python) is used.
+    """
+
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "ByteType"
+
+class IntegerType(object):
+    """Spark SQL IntegerType
+
+    The data type representing int values.
+    """
+
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "IntegerType"
+
+class LongType(object):
+    """Spark SQL LongType
+
+    The data type representing long values. If any value is beyond the range of
+    [-9223372036854775808, 9223372036854775807], please use DecimalType.
+    """
+
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def __repr__(self):
+        return "LongType"
+
+class ShortType(object):
+    """Spark SQL ShortType
+
+    For now, please use L{IntegerType} instead of using L{ShortType}.
--- End diff --

I don't get the problem after reading the comment here. Can you clarify?




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15510995
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/WrapDynamic.scala ---
@@ -21,7 +21,9 @@ import scala.language.dynamics
 
 import org.apache.spark.sql.catalyst.types.DataType
 
-case object DynamicType extends DataType
+case object DynamicType extends DataType {
--- End diff --

Do you mind adding scaladoc to explain what DynamicType is used for? (While 
you are at it, also add scaladoc for WrapDynamic and DynamicRow)




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15511092
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---
@@ -201,47 +231,139 @@ object FractionalType {
   }
 }
 abstract class FractionalType extends NumericType {
-  val fractional: Fractional[JvmType]
+  private[sql] val fractional: Fractional[JvmType]
 }
 
 case object DecimalType extends FractionalType {
-  type JvmType = BigDecimal
-  @transient lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
-  val numeric = implicitly[Numeric[BigDecimal]]
-  val fractional = implicitly[Fractional[BigDecimal]]
-  val ordering = implicitly[Ordering[JvmType]]
+  private[sql] type JvmType = BigDecimal
+  @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
+  private[sql] val numeric = implicitly[Numeric[BigDecimal]]
+  private[sql] val fractional = implicitly[Fractional[BigDecimal]]
+  private[sql] val ordering = implicitly[Ordering[JvmType]]
+  def simpleString: String = "decimal"
 }
 
 case object DoubleType extends FractionalType {
-  type JvmType = Double
-  @transient lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
-  val numeric = implicitly[Numeric[Double]]
-  val fractional = implicitly[Fractional[Double]]
-  val ordering = implicitly[Ordering[JvmType]]
+  private[sql] type JvmType = Double
+  @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
+  private[sql] val numeric = implicitly[Numeric[Double]]
+  private[sql] val fractional = implicitly[Fractional[Double]]
+  private[sql] val ordering = implicitly[Ordering[JvmType]]
+  def simpleString: String = "double"
 }
 
 case object FloatType extends FractionalType {
-  type JvmType = Float
-  @transient lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
-  val numeric = implicitly[Numeric[Float]]
-  val fractional = implicitly[Fractional[Float]]
-  val ordering = implicitly[Ordering[JvmType]]
+  private[sql] type JvmType = Float
+  @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
+  private[sql] val numeric = implicitly[Numeric[Float]]
+  private[sql] val fractional = implicitly[Fractional[Float]]
+  private[sql] val ordering = implicitly[Ordering[JvmType]]
+  def simpleString: String = "float"
+}
+
+object ArrayType {
+  /** Construct a [[ArrayType]] object with the given element type. The `containsNull` is false. */
+  def apply(elementType: DataType): ArrayType = ArrayType(elementType, false)
+}
+
+case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType {
+  private[sql] def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
+    builder.append(
+      s"${prefix}-- element: ${elementType.simpleString} (containsNull = ${containsNull})\n")
+    DataType.buildFormattedString(elementType, s"$prefix    |", builder)
+  }
+
+  def simpleString: String = "array"
 }
 
-case class ArrayType(elementType: DataType) extends DataType
+case class StructField(name: String, dataType: DataType, nullable: Boolean) {
--- End diff --

Add scaladoc to define the semantics of nullable (nullable keys vs nullable 
values vs both)
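
One possible shape for that scaladoc (the wording here is a sketch, not the text that eventually landed):

```scala
/**
 * A field inside a [[StructType]].
 * @param name the name of this field
 * @param dataType the data type of this field
 * @param nullable indicates whether values of this field can be null
 */
case class StructField(name: String, dataType: DataType, nullable: Boolean)
```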




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15511210
  
--- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
@@ -0,0 +1,212 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.api.java.types;
+
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * The base type of all Spark SQL data types.
+ *
+ * To get/create specific data type, users should use singleton objects 
and factory methods
+ * provided by this class.
+ */
+public abstract class DataType {
+
+  /**
+   * Gets the StringType object.
+   */
+  public static final StringType StringType = new StringType();
+
+  /**
+   * Gets the BinaryType object.
+   */
+  public static final BinaryType BinaryType = new BinaryType();
+
+  /**
+   * Gets the BooleanType object.
+   */
+  public static final BooleanType BooleanType = new BooleanType();
+
+  /**
+   * Gets the TimestampType object.
+   */
+  public static final TimestampType TimestampType = new TimestampType();
+
+  /**
+   * Gets the DecimalType object.
+   */
+  public static final DecimalType DecimalType = new DecimalType();
+
+  /**
+   * Gets the DoubleType object.
+   */
+  public static final DoubleType DoubleType = new DoubleType();
+
+  /**
+   * Gets the FloatType object.
+   */
+  public static final FloatType FloatType = new FloatType();
+
+  /**
+   * Gets the ByteType object.
+   */
+  public static final ByteType ByteType = new ByteType();
+
+  /**
+   * Gets the IntegerType object.
+   */
+  public static final IntegerType IntegerType = new IntegerType();
+
+  /**
+   * Gets the LongType object.
+   */
+  public static final LongType LongType = new LongType();
+
+  /**
+   * Gets the ShortType object.
+   */
+  public static final ShortType ShortType = new ShortType();
+
+  /**
+   * Creates an ArrayType by specifying the data type of elements ({@code elementType}).
+   * The field of {@code containsNull} is set to {@code false}.
+   *
+   * @param elementType
+   * @return
+   */
+  public static ArrayType createArrayType(DataType elementType) {
+    if (elementType == null) {
+      throw new IllegalArgumentException("elementType should not be null.");
+    }
+
+    return new ArrayType(elementType, false);
+  }
+
+  /**
+   * Creates an ArrayType by specifying the data type of elements ({@code elementType}) and
+   * whether the array contains null values ({@code containsNull}).
+   * @param elementType
+   * @param containsNull
+   * @return
+   */
+  public static ArrayType createArrayType(DataType elementType, boolean containsNull) {
+    if (elementType == null) {
+      throw new IllegalArgumentException("elementType should not be null.");
+    }
+
+    return new ArrayType(elementType, containsNull);
+  }
+
+  /**
+   * Creates a MapType by specifying the data type of keys ({@code keyType}) and values
+   * ({@code keyType}). The field of {@code valueContainsNull} is set to {@code true}.
+   *
+   * @param keyType
+   * @param valueType
+   * @return
--- End diff --

Actually, also the params: if you don't explain any of them, just remove them.




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15511199
  
--- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
@@ -0,0 +1,212 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.api.java.types;
+
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * The base type of all Spark SQL data types.
+ *
+ * To get/create specific data type, users should use singleton objects 
and factory methods
+ * provided by this class.
+ */
+public abstract class DataType {
+
+  /**
+   * Gets the StringType object.
+   */
+  public static final StringType StringType = new StringType();
+
+  /**
+   * Gets the BinaryType object.
+   */
+  public static final BinaryType BinaryType = new BinaryType();
+
+  /**
+   * Gets the BooleanType object.
+   */
+  public static final BooleanType BooleanType = new BooleanType();
+
+  /**
+   * Gets the TimestampType object.
+   */
+  public static final TimestampType TimestampType = new TimestampType();
+
+  /**
+   * Gets the DecimalType object.
+   */
+  public static final DecimalType DecimalType = new DecimalType();
+
+  /**
+   * Gets the DoubleType object.
+   */
+  public static final DoubleType DoubleType = new DoubleType();
+
+  /**
+   * Gets the FloatType object.
+   */
+  public static final FloatType FloatType = new FloatType();
+
+  /**
+   * Gets the ByteType object.
+   */
+  public static final ByteType ByteType = new ByteType();
+
+  /**
+   * Gets the IntegerType object.
+   */
+  public static final IntegerType IntegerType = new IntegerType();
+
+  /**
+   * Gets the LongType object.
+   */
+  public static final LongType LongType = new LongType();
+
+  /**
+   * Gets the ShortType object.
+   */
+  public static final ShortType ShortType = new ShortType();
+
+  /**
+   * Creates an ArrayType by specifying the data type of elements ({@code elementType}).
+   * The field of {@code containsNull} is set to {@code false}.
+   *
+   * @param elementType
+   * @return
+   */
+  public static ArrayType createArrayType(DataType elementType) {
+    if (elementType == null) {
+      throw new IllegalArgumentException("elementType should not be null.");
+    }
+
+    return new ArrayType(elementType, false);
+  }
+
+  /**
+   * Creates an ArrayType by specifying the data type of elements ({@code elementType}) and
+   * whether the array contains null values ({@code containsNull}).
+   * @param elementType
+   * @param containsNull
+   * @return
+   */
+  public static ArrayType createArrayType(DataType elementType, boolean containsNull) {
+    if (elementType == null) {
+      throw new IllegalArgumentException("elementType should not be null.");
+    }
+
+    return new ArrayType(elementType, containsNull);
+  }
+
+  /**
+   * Creates a MapType by specifying the data type of keys ({@code keyType}) and values
+   * ({@code keyType}). The field of {@code valueContainsNull} is set to {@code true}.
+   *
+   * @param keyType
+   * @param valueType
+   * @return
--- End diff --

Remove the @return tag if you are not going to say anything about it. Also remove it for other functions in this PR.




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15511259
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +90,45 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd)))
 
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by 
applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of 
the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   *
+   * @group userf
--- End diff --

It would be great to give an inline example. Just wrap it with
```scala
{{{
  // example code here
}}}
```
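
Such an inline example might read as follows (a sketch assuming an existing `sc`; it mirrors the snippet quoted elsewhere in this thread):

```scala
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val schema =
  StructType(
    StructField("name", StringType, false) ::
    StructField("age", IntegerType, true) :: Nil)

// Every Row must line up with the schema, or a runtime exception follows.
val rowRDD = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Row(p(0), p(1).trim.toInt))

val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
```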




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15511405
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
@@ -259,8 +268,12 @@ private[sql] object JsonRDD extends Logging {
   // the ObjectMapper will take the last value associated with this duplicate key.
   // For example: for {"key": 1, "key": 2}, we will get "key" -> 2.
   val mapper = new ObjectMapper()
-  iter.map(record => mapper.readValue(record, classOf[java.util.Map[String, Any]]))
-  }).map(scalafy).map(_.asInstanceOf[Map[String, Any]])
+  iter.map {
+    record =>
--- End diff --

move record to the previous line and indent the whole thing one level less
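
In other words, the suggested layout would presumably read (behavior unchanged):

```scala
iter.map { record =>
  mapper.readValue(record, classOf[java.util.Map[String, Any]])
}
```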




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15511457
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -140,10 +142,12 @@ private[parquet] object ParquetTypesConverter extends Logging {
 assert(keyValueGroup.getFields.apply(0).getRepetition == Repetition.REQUIRED)
 val valueType = toDataType(keyValueGroup.getFields.apply(1))
 assert(keyValueGroup.getFields.apply(1).getRepetition == Repetition.REQUIRED)
-new MapType(keyType, valueType)
+// TODO: set valueContainsNull explicitly instead of assuming valueContainsNull is true
+// at here.
+MapType(keyType, valueType)
   } else if (correspondsToArray(groupType)) { // ArrayType
 val elementType = toDataType(groupType.getFields.apply(0))
-new ArrayType(elementType)
+ArrayType(elementType, false)
--- End diff --

here too




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15511453
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -116,7 +116,7 @@ private[parquet] object ParquetTypesConverter extends Logging {
 case ParquetOriginalType.LIST => { // TODO: check enums!
   assert(groupType.getFieldCount == 1)
   val field = groupType.getFields.apply(0)
-  new ArrayType(toDataType(field))
+  ArrayType(toDataType(field), false)
--- End diff --

For boolean arguments, make them named arguments, e.g.
```scala
ArrayType(toDataType(field), nullable = false)  // maybe it was containsNull
```




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15511498
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/types/util/DataTypeConversions.scala ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.types.util
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.api.java.types.{DataType => JDataType, 
StructField => JStructField}
+
+import scala.collection.JavaConverters._
+
+protected[sql] object DataTypeConversions {
+
+  /**
+   * Returns the equivalent StructField in Java for the given StructField 
in Scala.
+   */
+  def asJavaStructField(scalaStructField: StructField): JStructField = {
+org.apache.spark.sql.api.java.types.DataType.createStructField(
+  scalaStructField.name,
+  asJavaDataType(scalaStructField.dataType),
+  scalaStructField.nullable)
+  }
+
+  /**
+   * Returns the equivalent DataType in Java for the given DataType in 
Scala.
+   */
+  def asJavaDataType(scalaDataType: DataType): JDataType = scalaDataType 
match {
+case StringType =>
+  org.apache.spark.sql.api.java.types.DataType.StringType
--- End diff --

Why not just ```JDataType.StringType``` instead of typing out the full names?
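
For what it's worth, this file already aliases the Java class at import time,
so the short form is available — a sketch under that assumption (only a couple
of cases shown):

```scala
import org.apache.spark.sql._
import org.apache.spark.sql.api.java.types.{DataType => JDataType}

// With the alias in scope, the fully qualified name is unnecessary:
def asJavaDataType(scalaDataType: DataType): JDataType = scalaDataType match {
  case StringType  => JDataType.StringType
  case BooleanType => JDataType.BooleanType
  // ... remaining cases elided
}
```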


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50451426
  
@yhuai can you describe a little more about the `containsNull` for 
`ArrayType` and `MapType`? In my understanding, `Map` and `Array` contain 
null in most cases at runtime, so why not just keep the previous 
implementation? Will something go wrong when producing the RDD schema if the 
constraint is not added?

Sorry, if I missed some discussion here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50452464
  
QA results for PR 1346:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17344/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50531062
  
QA tests have started for PR 1346. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17372/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15548770
  
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@
 
 from py4j.protocol import Py4JError
 
-__all__ = [SQLContext, HiveContext, LocalHiveContext, 
TestHiveContext, SchemaRDD, Row]
+__all__ = [
+StringType, BinaryType, BooleanType, TimestampType, 
DecimalType,
+DoubleType, FloatType, ByteType, IntegerType, LongType,
+ShortType, ArrayType, MapType, StructField, StructType,
+SQLContext, HiveContext, LocalHiveContext, TestHiveContext, 
SchemaRDD, Row]
 
+class PrimitiveTypeSingleton(type):
+_instances = {}
+def __call__(cls):
+if cls not in cls._instances:
+cls._instances[cls] = super(PrimitiveTypeSingleton, 
cls).__call__()
+return cls._instances[cls]
+
+class StringType(object):
+Spark SQL StringType
+
+The data type representing string values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return StringType
+
+class BinaryType(object):
+Spark SQL BinaryType
+
+The data type representing bytearray values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return BinaryType
+
+class BooleanType(object):
+Spark SQL BooleanType
+
+The data type representing bool values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return BooleanType
+
+class TimestampType(object):
+Spark SQL TimestampType
+
+The data type representing datetime.datetime values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return TimestampType
+
+class DecimalType(object):
+Spark SQL DecimalType
+
+The data type representing decimal.Decimal values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return DecimalType
+
+class DoubleType(object):
+Spark SQL DoubleType
+
+The data type representing float values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return DoubleType
+
+class FloatType(object):
+Spark SQL FloatType
+
+For now, please use L{DoubleType} instead of using L{FloatType}.
+Because query evaluation is done in Scala, java.lang.Double will be 
used
+for Python float numbers. Because the underlying JVM type of FloatType 
is
+java.lang.Float (in Java) and Float (in scala), there will be a 
java.lang.ClassCastException
+if FloatType (Python) is used.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return FloatType
+
+class ByteType(object):
+Spark SQL ByteType
+
+For now, please use L{IntegerType} instead of using L{ByteType}.
+Because query evaluation is done in Scala, java.lang.Integer will be 
used
+for Python int numbers. Because the underlying JVM type of ByteType is
+java.lang.Byte (in Java) and Byte (in scala), there will be a 
java.lang.ClassCastException
+if ByteType (Python) is used.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return ByteType
+
+class IntegerType(object):
+Spark SQL IntegerType
+
+The data type representing int values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return IntegerType
+
+class LongType(object):
+Spark SQL LongType
+
+The data type representing long values. If any value is beyond the 
range of
+[-9223372036854775808, 9223372036854775807], please use DecimalType.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return LongType
+
+class ShortType(object):
+Spark SQL ShortType
+
+For now, please use L{IntegerType} instead of using L{ShortType}.
--- End diff --

If we have a ShortType column, the expression evaluator will try to cast the 
value to a `Short` (`asInstanceOf[Short]`). However, the cast will fail 
because the data is a `java.lang.Integer`. I will add more docs.
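
A minimal Scala illustration of the failure mode (the boxed value stands in
for what Py4J delivers from Python):

```scala
// Py4J hands a Python int to the JVM as a boxed java.lang.Integer.
val fromPython: Any = Integer.valueOf(1)

// Evaluating a ShortType column effectively does asInstanceOf[Short]; the
// unboxing casts to java.lang.Short first, so this throws ClassCastException.
val narrowed = fromPython.asInstanceOf[Short]
```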


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15548865
  
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@
 
 from py4j.protocol import Py4JError
 
-__all__ = [SQLContext, HiveContext, LocalHiveContext, 
TestHiveContext, SchemaRDD, Row]
+__all__ = [
+StringType, BinaryType, BooleanType, TimestampType, 
DecimalType,
+DoubleType, FloatType, ByteType, IntegerType, LongType,
+ShortType, ArrayType, MapType, StructField, StructType,
+SQLContext, HiveContext, LocalHiveContext, TestHiveContext, 
SchemaRDD, Row]
 
+class PrimitiveTypeSingleton(type):
+_instances = {}
+def __call__(cls):
+if cls not in cls._instances:
+cls._instances[cls] = super(PrimitiveTypeSingleton, 
cls).__call__()
+return cls._instances[cls]
+
+class StringType(object):
+Spark SQL StringType
+
+The data type representing string values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return StringType
+
+class BinaryType(object):
+Spark SQL BinaryType
+
+The data type representing bytearray values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return BinaryType
+
+class BooleanType(object):
+Spark SQL BooleanType
+
+The data type representing bool values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return BooleanType
+
+class TimestampType(object):
+Spark SQL TimestampType
+
+The data type representing datetime.datetime values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return TimestampType
+
+class DecimalType(object):
+Spark SQL DecimalType
+
+The data type representing decimal.Decimal values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return DecimalType
+
+class DoubleType(object):
+Spark SQL DoubleType
+
+The data type representing float values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return DoubleType
+
+class FloatType(object):
+Spark SQL FloatType
+
+For now, please use L{DoubleType} instead of using L{FloatType}.
+Because query evaluation is done in Scala, java.lang.Double will be 
used
+for Python float numbers. Because the underlying JVM type of FloatType 
is
+java.lang.Float (in Java) and Float (in scala), there will be a 
java.lang.ClassCastException
+if FloatType (Python) is used.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return FloatType
+
+class ByteType(object):
+Spark SQL ByteType
+
+For now, please use L{IntegerType} instead of using L{ByteType}.
+Because query evaluation is done in Scala, java.lang.Integer will be 
used
+for Python int numbers. Because the underlying JVM type of ByteType is
+java.lang.Byte (in Java) and Byte (in scala), there will be a 
java.lang.ClassCastException
+if ByteType (Python) is used.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return ByteType
+
+class IntegerType(object):
+Spark SQL IntegerType
+
+The data type representing int values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return IntegerType
+
+class LongType(object):
+Spark SQL LongType
+
+The data type representing long values. If any value is beyond the 
range of
+[-9223372036854775808, 9223372036854775807], please use DecimalType.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return LongType
+
+class ShortType(object):
+Spark SQL ShortType
+
+For now, please use L{IntegerType} instead of using L{ShortType}.
--- End diff --

We could also convert values to the correct type on the way in from Python.
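
A sketch of that idea — a hypothetical `coerce` helper (not part of this PR)
that narrows boxed values to the declared type as they arrive from Python:

```scala
import org.apache.spark.sql._

// Hypothetical: narrow incoming boxed values to the declared numeric type,
// instead of asking Python users to avoid Short/Byte/Float entirely.
def coerce(value: Any, dataType: DataType): Any = (value, dataType) match {
  case (i: java.lang.Integer, ShortType) => i.shortValue()
  case (i: java.lang.Integer, ByteType)  => i.byteValue()
  case (d: java.lang.Double, FloatType)  => d.floatValue()
  case _ => value
}
```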


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50532980
  
QA tests have started for PR 1346. This patch DID NOT merge cleanly!
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17374/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50533105
  
QA results for PR 1346:
- This patch FAILED unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17374/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15550525
  
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@
 
 from py4j.protocol import Py4JError
 
-__all__ = [SQLContext, HiveContext, LocalHiveContext, 
TestHiveContext, SchemaRDD, Row]
+__all__ = [
+StringType, BinaryType, BooleanType, TimestampType, 
DecimalType,
+DoubleType, FloatType, ByteType, IntegerType, LongType,
+ShortType, ArrayType, MapType, StructField, StructType,
+SQLContext, HiveContext, LocalHiveContext, TestHiveContext, 
SchemaRDD, Row]
 
+class PrimitiveTypeSingleton(type):
+_instances = {}
+def __call__(cls):
+if cls not in cls._instances:
+cls._instances[cls] = super(PrimitiveTypeSingleton, 
cls).__call__()
+return cls._instances[cls]
+
+class StringType(object):
+Spark SQL StringType
+
+The data type representing string values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return StringType
+
+class BinaryType(object):
+Spark SQL BinaryType
+
+The data type representing bytearray values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return BinaryType
+
+class BooleanType(object):
+Spark SQL BooleanType
+
+The data type representing bool values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return BooleanType
+
+class TimestampType(object):
+Spark SQL TimestampType
+
+The data type representing datetime.datetime values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return TimestampType
+
+class DecimalType(object):
+Spark SQL DecimalType
+
+The data type representing decimal.Decimal values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return DecimalType
+
+class DoubleType(object):
+Spark SQL DoubleType
+
+The data type representing float values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return DoubleType
+
+class FloatType(object):
+Spark SQL FloatType
+
+For now, please use L{DoubleType} instead of using L{FloatType}.
+Because query evaluation is done in Scala, java.lang.Double will be 
used
+for Python float numbers. Because the underlying JVM type of FloatType 
is
+java.lang.Float (in Java) and Float (in scala), there will be a 
java.lang.ClassCastException
+if FloatType (Python) is used.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return FloatType
+
+class ByteType(object):
+Spark SQL ByteType
+
+For now, please use L{IntegerType} instead of using L{ByteType}.
+Because query evaluation is done in Scala, java.lang.Integer will be 
used
+for Python int numbers. Because the underlying JVM type of ByteType is
+java.lang.Byte (in Java) and Byte (in scala), there will be a 
java.lang.ClassCastException
+if ByteType (Python) is used.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return ByteType
+
+class IntegerType(object):
+Spark SQL IntegerType
+
+The data type representing int values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return IntegerType
+
+class LongType(object):
+Spark SQL LongType
+
+The data type representing long values. If any value is beyond the 
range of
+[-9223372036854775808, 9223372036854775807], please use DecimalType.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return LongType
+
+class ShortType(object):
+Spark SQL ShortType
+
+For now, please use L{IntegerType} instead of using L{ShortType}.
--- End diff --

JsonRDD already has this kind of conversion. I am not sure we want to do 
the conversions in Java and Scala; there, users can actually use 
`Short`, `Byte`, and `Float` values directly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15551776
  
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@
 
 from py4j.protocol import Py4JError
 
-__all__ = [SQLContext, HiveContext, LocalHiveContext, 
TestHiveContext, SchemaRDD, Row]
+__all__ = [
+StringType, BinaryType, BooleanType, TimestampType, 
DecimalType,
+DoubleType, FloatType, ByteType, IntegerType, LongType,
+ShortType, ArrayType, MapType, StructField, StructType,
+SQLContext, HiveContext, LocalHiveContext, TestHiveContext, 
SchemaRDD, Row]
 
+class PrimitiveTypeSingleton(type):
+_instances = {}
+def __call__(cls):
+if cls not in cls._instances:
+cls._instances[cls] = super(PrimitiveTypeSingleton, 
cls).__call__()
+return cls._instances[cls]
+
+class StringType(object):
+Spark SQL StringType
+
+The data type representing string values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return StringType
+
+class BinaryType(object):
+Spark SQL BinaryType
+
+The data type representing bytearray values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return BinaryType
+
+class BooleanType(object):
+Spark SQL BooleanType
+
+The data type representing bool values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return BooleanType
+
+class TimestampType(object):
+Spark SQL TimestampType
+
+The data type representing datetime.datetime values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return TimestampType
+
+class DecimalType(object):
+Spark SQL DecimalType
+
+The data type representing decimal.Decimal values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return DecimalType
+
+class DoubleType(object):
+Spark SQL DoubleType
+
+The data type representing float values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return DoubleType
+
+class FloatType(object):
+Spark SQL FloatType
+
+For now, please use L{DoubleType} instead of using L{FloatType}.
+Because query evaluation is done in Scala, java.lang.Double will be 
used
+for Python float numbers. Because the underlying JVM type of FloatType 
is
+java.lang.Float (in Java) and Float (in scala), there will be a 
java.lang.ClassCastException
+if FloatType (Python) is used.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return FloatType
+
+class ByteType(object):
+Spark SQL ByteType
+
+For now, please use L{IntegerType} instead of using L{ByteType}.
+Because query evaluation is done in Scala, java.lang.Integer will be 
used
+for Python int numbers. Because the underlying JVM type of ByteType is
+java.lang.Byte (in Java) and Byte (in scala), there will be a 
java.lang.ClassCastException
+if ByteType (Python) is used.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return ByteType
+
+class IntegerType(object):
+Spark SQL IntegerType
+
+The data type representing int values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return IntegerType
+
+class LongType(object):
+Spark SQL LongType
+
+The data type representing long values. If any value is beyond the 
range of
+[-9223372036854775808, 9223372036854775807], please use DecimalType.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return LongType
+
+class ShortType(object):
+Spark SQL ShortType
+
+For now, please use L{IntegerType} instead of using L{ShortType}.
--- End diff --

In Java/Scala, when a user loads data from a CSV file, they need to do this 
kind of type conversion; it would be better if we could do it for them 
automatically.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15554197
  
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,457 @@
 
 from py4j.protocol import Py4JError
 
-__all__ = [SQLContext, HiveContext, LocalHiveContext, 
TestHiveContext, SchemaRDD, Row]
+__all__ = [
+StringType, BinaryType, BooleanType, TimestampType, 
DecimalType,
+DoubleType, FloatType, ByteType, IntegerType, LongType,
+ShortType, ArrayType, MapType, StructField, StructType,
+SQLContext, HiveContext, LocalHiveContext, TestHiveContext, 
SchemaRDD, Row]
 
+class PrimitiveTypeSingleton(type):
+_instances = {}
+def __call__(cls):
+if cls not in cls._instances:
+cls._instances[cls] = super(PrimitiveTypeSingleton, 
cls).__call__()
+return cls._instances[cls]
+
+class StringType(object):
+Spark SQL StringType
+
+The data type representing string values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return StringType
+
+class BinaryType(object):
+Spark SQL BinaryType
+
+The data type representing bytearray values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return BinaryType
+
+class BooleanType(object):
+Spark SQL BooleanType
+
+The data type representing bool values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return BooleanType
+
+class TimestampType(object):
+Spark SQL TimestampType
+
+The data type representing datetime.datetime values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return TimestampType
+
+class DecimalType(object):
+Spark SQL DecimalType
+
+The data type representing decimal.Decimal values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return DecimalType
+
+class DoubleType(object):
+Spark SQL DoubleType
+
+The data type representing float values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return DoubleType
+
+class FloatType(object):
+Spark SQL FloatType
+
+For now, please use L{DoubleType} instead of using L{FloatType}.
+Because query evaluation is done in Scala, java.lang.Double will be 
used
+for Python float numbers. Because the underlying JVM type of FloatType 
is
+java.lang.Float (in Java) and Float (in scala), there will be a 
java.lang.ClassCastException
+if FloatType (Python) is used.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return FloatType
+
+class ByteType(object):
+Spark SQL ByteType
+
+For now, please use L{IntegerType} instead of using L{ByteType}.
+Because query evaluation is done in Scala, java.lang.Integer will be 
used
+for Python int numbers. Because the underlying JVM type of ByteType is
+java.lang.Byte (in Java) and Byte (in scala), there will be a 
java.lang.ClassCastException
+if ByteType (Python) is used.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return ByteType
+
+class IntegerType(object):
+Spark SQL IntegerType
+
+The data type representing int values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return IntegerType
+
+class LongType(object):
+Spark SQL LongType
+
+The data type representing long values. If any value is beyond the 
range of
+[-9223372036854775808, 9223372036854775807], please use DecimalType.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def __repr__(self):
+return LongType
+
+class ShortType(object):
+Spark SQL ShortType
+
+For now, please use L{IntegerType} instead of using L{ShortType}.
--- End diff --

Yes, we should provide convenient methods for users. But we will provide 
methods for users to load CSV files, and we will use a mutable projection to 
do the type conversions (by using `Cast`).

Considering the size of this PR and the fact that it is blocking other 
people's work, it is better to think about this later.
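
For readers unfamiliar with the mechanism: `Cast` is a Catalyst expression
that converts its child's result to a target type, and a mutable projection
would apply one such `Cast` per column. A rough sketch against the internal
API of this era (internal signatures, so treat this as illustrative only):

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.catalyst.types.ShortType

// Convert an Int literal to Short the way a projection over a CSV row would.
val narrowed = Cast(Literal(1), ShortType).eval(null)  // 1: Short
```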


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50576308
  
@chenghao-intel `containsNull` and `valueContainsNull` can be used for 
further optimization. For example, let's say we have an `ArrayType` column and 
the element type is `IntegerType`. If elements of those arrays do not have 
`null` values, we can use a primitive array internally. Since we will expose 
data types to users, we need to introduce these two booleans with this PR. It 
can be hard to add them once users start to use these APIs.
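
Concretely, the flag is just the second argument of the type constructor; the
primitive-array optimization itself would be future work. A sketch:

```scala
import org.apache.spark.sql._

// Elements guaranteed non-null: the engine could back this column with a
// primitive Array[Int] instead of boxed java.lang.Integers.
val dense = ArrayType(IntegerType, containsNull = false)

// Elements may be null: boxing (or a null bitmap) is required.
val sparse = ArrayType(IntegerType, containsNull = true)
```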


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50576339
  
QA tests have started for PR 1346. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17423/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15481080
  
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,413 @@
 
 from py4j.protocol import Py4JError
 
-__all__ = [SQLContext, HiveContext, LocalHiveContext, 
TestHiveContext, SchemaRDD, Row]
+__all__ = [
+StringType, BinaryType, BooleanType, DecimalType, DoubleType,
+FloatType, ByteType, IntegerType, LongType, ShortType,
+ArrayType, MapType, StructField, StructType,
+SQLContext, HiveContext, LocalHiveContext, TestHiveContext, 
SchemaRDD, Row]
 
+class PrimitiveTypeSingleton(type):
+_instances = {}
+def __call__(cls):
+if cls not in cls._instances:
+cls._instances[cls] = super(PrimitiveTypeSingleton, 
cls).__call__()
+return cls._instances[cls]
+
+class StringType(object):
+Spark SQL StringType
+
+The data type representing string values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return StringType
+
+class BinaryType(object):
+Spark SQL BinaryType
+
+The data type representing bytes values and bytearray values.
--- End diff --

We probably just want to say "byte arrays" here since we have a separate type 
for byte.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15481293
  
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,413 @@
 
 from py4j.protocol import Py4JError
 
-__all__ = [SQLContext, HiveContext, LocalHiveContext, 
TestHiveContext, SchemaRDD, Row]
+__all__ = [
+StringType, BinaryType, BooleanType, DecimalType, DoubleType,
+FloatType, ByteType, IntegerType, LongType, ShortType,
+ArrayType, MapType, StructField, StructType,
+SQLContext, HiveContext, LocalHiveContext, TestHiveContext, 
SchemaRDD, Row]
 
+class PrimitiveTypeSingleton(type):
+_instances = {}
+def __call__(cls):
+if cls not in cls._instances:
+cls._instances[cls] = super(PrimitiveTypeSingleton, 
cls).__call__()
+return cls._instances[cls]
+
+class StringType(object):
+Spark SQL StringType
+
+The data type representing string values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return StringType
+
+class BinaryType(object):
+Spark SQL BinaryType
+
+The data type representing bytes values and bytearray values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return BinaryType
+
+class BooleanType(object):
+Spark SQL BooleanType
+
+The data type representing bool values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return BooleanType
+
+class TimestampType(object):
+Spark SQL TimestampType
--- End diff --

We should also list the Python types that are expected when it's not obvious.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15481312
  
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,413 @@
 
 from py4j.protocol import Py4JError
 
-__all__ = [SQLContext, HiveContext, LocalHiveContext, 
TestHiveContext, SchemaRDD, Row]
+__all__ = [
+StringType, BinaryType, BooleanType, DecimalType, DoubleType,
+FloatType, ByteType, IntegerType, LongType, ShortType,
+ArrayType, MapType, StructField, StructType,
+SQLContext, HiveContext, LocalHiveContext, TestHiveContext, 
SchemaRDD, Row]
 
+class PrimitiveTypeSingleton(type):
+_instances = {}
+def __call__(cls):
+if cls not in cls._instances:
+cls._instances[cls] = super(PrimitiveTypeSingleton, 
cls).__call__()
+return cls._instances[cls]
+
+class StringType(object):
+Spark SQL StringType
+
+The data type representing string values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return StringType
+
+class BinaryType(object):
+Spark SQL BinaryType
+
+The data type representing bytes values and bytearray values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return BinaryType
+
+class BooleanType(object):
+Spark SQL BooleanType
+
+The data type representing bool values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return BooleanType
+
+class TimestampType(object):
+Spark SQL TimestampType
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return TimestampType
+
+class DecimalType(object):
+Spark SQL DecimalType
+
+The data type representing decimal.Decimal values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return DecimalType
+
+class DoubleType(object):
+Spark SQL DoubleType
+
+The data type representing float values. Because a float value
--- End diff --

This comment isn't finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15481390
  
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,413 @@
 
 from py4j.protocol import Py4JError
 
-__all__ = [SQLContext, HiveContext, LocalHiveContext, 
TestHiveContext, SchemaRDD, Row]
+__all__ = [
+StringType, BinaryType, BooleanType, DecimalType, DoubleType,
+FloatType, ByteType, IntegerType, LongType, ShortType,
+ArrayType, MapType, StructField, StructType,
+SQLContext, HiveContext, LocalHiveContext, TestHiveContext, 
SchemaRDD, Row]
 
+class PrimitiveTypeSingleton(type):
+_instances = {}
+def __call__(cls):
+if cls not in cls._instances:
+cls._instances[cls] = super(PrimitiveTypeSingleton, 
cls).__call__()
+return cls._instances[cls]
+
+class StringType(object):
+Spark SQL StringType
+
+The data type representing string values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return StringType
+
+class BinaryType(object):
+Spark SQL BinaryType
+
+The data type representing bytes values and bytearray values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return BinaryType
+
+class BooleanType(object):
+Spark SQL BooleanType
+
+The data type representing bool values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return BooleanType
+
+class TimestampType(object):
+Spark SQL TimestampType
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return TimestampType
+
+class DecimalType(object):
+Spark SQL DecimalType
+
+The data type representing decimal.Decimal values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return DecimalType
+
+class DoubleType(object):
+Spark SQL DoubleType
+
+The data type representing float values. Because a float value
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return DoubleType
+
+class FloatType(object):
+Spark SQL FloatType
+
+For PySpark, please use L{DoubleType} instead of using L{FloatType}.
--- End diff --

Why?  What if they know the values are limited to the float range and want 
to use less memory?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15481537
  
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,413 @@
 
 from py4j.protocol import Py4JError
 
-__all__ = [SQLContext, HiveContext, LocalHiveContext, 
TestHiveContext, SchemaRDD, Row]
+__all__ = [
+StringType, BinaryType, BooleanType, DecimalType, DoubleType,
+FloatType, ByteType, IntegerType, LongType, ShortType,
+ArrayType, MapType, StructField, StructType,
+SQLContext, HiveContext, LocalHiveContext, TestHiveContext, 
SchemaRDD, Row]
 
+class PrimitiveTypeSingleton(type):
+_instances = {}
+def __call__(cls):
+if cls not in cls._instances:
+cls._instances[cls] = super(PrimitiveTypeSingleton, 
cls).__call__()
+return cls._instances[cls]
+
+class StringType(object):
+Spark SQL StringType
+
+The data type representing string values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return StringType
+
+class BinaryType(object):
+Spark SQL BinaryType
+
+The data type representing bytes values and bytearray values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return BinaryType
+
+class BooleanType(object):
+Spark SQL BooleanType
+
+The data type representing bool values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return BooleanType
+
+class TimestampType(object):
+Spark SQL TimestampType
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return TimestampType
+
+class DecimalType(object):
+Spark SQL DecimalType
+
+The data type representing decimal.Decimal values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return DecimalType
+
+class DoubleType(object):
+Spark SQL DoubleType
+
+The data type representing float values. Because a float value
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return DoubleType
+
+class FloatType(object):
+Spark SQL FloatType
+
+For PySpark, please use L{DoubleType} instead of using L{FloatType}.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return FloatType
+
+class ByteType(object):
+Spark SQL ByteType
+
+For PySpark, please use L{IntegerType} instead of using L{ByteType}.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return ByteType
+
+class IntegerType(object):
+Spark SQL IntegerType
+
+The data type representing int values.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return IntegerType
+
+class LongType(object):
+Spark SQL LongType
+
+The data type representing long values. If any value is beyond the 
range of
+[-9223372036854775808, 9223372036854775807], please use DecimalType.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return LongType
+
+class ShortType(object):
+Spark SQL ShortType
+
+For PySpark, please use L{IntegerType} instead of using L{ShortType}.
+
+
+__metaclass__ = PrimitiveTypeSingleton
+
+def _get_scala_type_string(self):
+return ShortType
+
+class ArrayType(object):
+Spark SQL ArrayType
+
+The data type representing list values.
+
+
+def __init__(self, elementType, containsNull):
+Creates an ArrayType
+
+:param elementType: the data type of elements.
+:param containsNull: indicates whether the list contains null 
values.
+:return:
+
+>>> ArrayType(StringType, True) == ArrayType(StringType, False)
+False
+>>> ArrayType(StringType, True) == ArrayType(StringType, True)
+True
+
+self.elementType = elementType
+self.containsNull = containsNull
+
+def _get_scala_type_string(self):
+return "ArrayType(" + self.elementType._get_scala_type_string() + "," + \
+   str(self.containsNull).lower() + ")"
+
+def __eq__(self, other):
+return (isinstance(other, self.__class__) 

[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15481675
  
--- Diff: python/pyspark/sql.py ---
@@ -107,6 +512,25 @@ def inferSchema(self, rdd):
 srdd = self._ssql_ctx.inferSchema(jrdd.rdd())
 return SchemaRDD(srdd, self)
 
+def applySchema(self, rdd, schema):
+Applies the given schema to the given RDD of L{dict}s.
--- End diff --

Are we still allowing dicts?  I thought there was at least going to be a 
warning? Or is this going to change with @davies' PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15481834
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
 ---
@@ -17,11 +17,12 @@
 
 package org.apache.spark.sql.catalyst.expressions
 
+import com.typesafe.scalalogging.slf4j.Logging
--- End diff --

We should use either Spark Logging or Spark SQL logging. (Ideally we will 
be removing catalyst's dependence on Spark solely for the logging code, but I'm 
okay with either ATM.)  We shouldn't hard-code this library here though.
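
A minimal sketch of the first option, assuming Spark 1.x's
`org.apache.spark.Logging` trait (the object name and message here are
placeholders):

```scala
import org.apache.spark.Logging

// Mix in Spark's own Logging trait instead of hard-coding scalalogging.
object BoundAttributeExample extends Logging {
  def run(): Unit = logDebug("binding attribute references")
}
```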


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15482163
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/api/java/types/BooleanType.java ---
@@ -0,0 +1,22 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.api.java.types;
+
+public class BooleanType extends DataType {
--- End diff --

Missing Java Doc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15482279
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.api.java.types;
+
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * The base type of all Spark SQL data types.
--- End diff --

I'd also talk about how this class contains singletons and factory methods 
for constructing data types.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15482239
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/api/java/types/BooleanType.java ---
@@ -0,0 +1,22 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.api.java.types;
+
+public class BooleanType extends DataType {
--- End diff --

Also perhaps the Java doc should make it clear that users don't instantiate 
these themselves, but instead get the singletons from the DataType class.
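
A sketch of the usage pattern that Java doc could spell out — singleton plus
factory method, no `new` (calling the Java API from Scala here for brevity):

```scala
import org.apache.spark.sql.api.java.types.{DataType => JDataType}

// No `new BooleanType()`: the singleton and the factory methods both live
// on the DataType class.
val bool  = JDataType.BooleanType
val field = JDataType.createStructField("flag", bool, true)
```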


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15482765
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/api/java/types/package-info.java ---
@@ -0,0 +1,22 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+/**
+ * Allows users to get and create Spark SQL data types.
+ */
+package org.apache.spark.sql.api.java.types;
--- End diff --

Newline at end of file.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15482947
  
--- Diff: python/pyspark/sql.py ---
@@ -107,6 +512,25 @@ def inferSchema(self, rdd):
 srdd = self._ssql_ctx.inferSchema(jrdd.rdd())
 return SchemaRDD(srdd, self)
 
+def applySchema(self, rdd, schema):
+Applies the given schema to the given RDD of L{dict}s.
--- End diff --

Right, @davies will make the change in his PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15483028
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd)))
   }
 
+  /**
+   * Returns the equivalent StructField in Java for the given StructField 
in Scala.
+   */
+  protected def asJavaStructField(scalaStructField: StructField): 
JStructField = {
--- End diff --

Should this be here or in the JavaSQLContext?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15483058
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd)))
   }
 
+  /**
+   * Returns the equivalent StructField in Java for the given StructField 
in Scala.
+   */
+  protected def asJavaStructField(scalaStructField: StructField): 
JStructField = {
--- End diff --

Same for the functions below.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15483514
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/package.scala ---
@@ -0,0 +1,401 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * Allows the execution of relational queries, including those expressed 
in SQL using Spark.
+ *
+ *  @groupname dataType Data types
+ *  @groupdesc Spark SQL data types.
+ *  @groupprio dataType -3
+ *  @groupname field Field
+ *  @groupprio field -2
+ *  @groupname row Row
+ *  @groupprio row -1
+ */
+package object sql {
+
+  protected[sql] type Logging = com.typesafe.scalalogging.slf4j.Logging
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Represents one row of output from a relational operator.
+   * @group row
+   */
+  @DeveloperApi
+  type Row = catalyst.expressions.Row
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * A [[Row]] object can be constructed by providing field values. 
Example:
+   * {{{
+   * import org.apache.spark.sql._
+   *
+   * // Create a Row from values.
+   * Row(value1, value2, value3, ...)
+   * // Create a Row from a Seq of values.
+   * Row.fromSeq(Seq(value1, value2, ...))
+   * }}}
+   *
+   * A value of a row can be accessed through both generic access by 
ordinal,
+   * which will incur boxing overhead for primitives, as well as native 
primitive access.
+   * An example of generic access by ordinal:
+   * {{{
+   * import org.apache.spark.sql._
+   *
+   * val row = Row(1, true, "a string", null)
+   * // row: Row = [1,true,a string,null]
+   * val firstValue = row(0)
+   * // firstValue: Any = 1
+   * val fourthValue = row(3)
+   * // fourthValue: Any = null
+   * }}}
+   *
+   * For native primitive access, it is invalid to use the native 
primitive interface to retrieve
+   * a value that is null, instead a user must check `isNullAt` before 
attempting to retrieve a
+   * value that might be null.
+   * An example of native primitive access:
+   * {{{
+   * // using the row from the previous example.
+   * val firstValue = row.getInt(0)
+   * // firstValue: Int = 1
+   * val isNull = row.isNullAt(3)
+   * // isNull: Boolean = true
+   * }}}
+   *
+   * Interfaces related to native primitive access are:
+   *
+   * `isNullAt(i: Int): Boolean`
+   *
+   * `getInt(i: Int): Int`
+   *
+   * `getLong(i: Int): Long`
+   *
+   * `getDouble(i: Int): Double`
+   *
+   * `getFloat(i: Int): Float`
+   *
+   * `getBoolean(i: Int): Boolean`
+   *
+   * `getShort(i: Int): Short`
+   *
+   * `getByte(i: Int): Byte`
+   *
+   * `getString(i: Int): String`
+   *
+   * Fields in a [[Row]] object can be extracted in a pattern match. 
Example:
+   * {{{
+   * import org.apache.spark.sql._
+   *
+   * val pairs = sql("SELECT key, value FROM src").rdd.map {
+   *   case Row(key: Int, value: String) =>
+   *     key -> value
+   * }
+   * }}}
+   *
+   * @group row
+   */
+  @DeveloperApi
+  val Row = catalyst.expressions.Row
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * The base type of all Spark SQL data types.
+   *
+   * @group dataType
+   */
+  @DeveloperApi
+  type DataType = catalyst.types.DataType
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * The data type representing `String` values
+   *
+   * @group dataType
+   */
+  @DeveloperApi
+  val StringType = catalyst.types.StringType
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * The data type representing `Array[Byte]` values.
+   *
+   * @group dataType
+   */
+  @DeveloperApi
+  val BinaryType = catalyst.types.BinaryType
+
+  /**
+   * 

[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15483619
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/package.scala ---
@@ -0,0 +1,401 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * Allows the execution of relational queries, including those expressed 
in SQL using Spark.
+ *
+ *  @groupname dataType Data types
+ *  @groupdesc Spark SQL data types.
+ *  @groupprio dataType -3
+ *  @groupname field Field
+ *  @groupprio field -2
+ *  @groupname row Row
+ *  @groupprio row -1
+ */
+package object sql {
+
+  protected[sql] type Logging = com.typesafe.scalalogging.slf4j.Logging
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Represents one row of output from a relational operator.
+   * @group row
+   */
+  @DeveloperApi
+  type Row = catalyst.expressions.Row
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * A [[Row]] object can be constructed by providing field values. 
Example:
+   * {{{
+   * import org.apache.spark.sql._
+   *
+   * // Create a Row from values.
+   * Row(value1, value2, value3, ...)
+   * // Create a Row from a Seq of values.
+   * Row.fromSeq(Seq(value1, value2, ...))
+   * }}}
+   *
+   * A value of a row can be accessed either through generic access by ordinal, which
+   * incurs boxing overhead for primitives, or through native primitive access.
+   * An example of generic access by ordinal:
+   * {{{
+   * import org.apache.spark.sql._
+   *
+   * val row = Row(1, true, "a string", null)
+   * // row: Row = [1,true,a string,null]
+   * val firstValue = row(0)
+   * // firstValue: Any = 1
+   * val fourthValue = row(3)
+   * // fourthValue: Any = null
+   * }}}
+   *
+   * For native primitive access, it is invalid to use the native primitive interface
+   * to retrieve a value that is null; instead, a user must check `isNullAt` before
+   * attempting to retrieve a value that might be null.
+   * An example of native primitive access:
+   * {{{
+   * // using the row from the previous example.
+   * val firstValue = row.getInt(0)
+   * // firstValue: Int = 1
+   * val isNull = row.isNullAt(3)
+   * // isNull: Boolean = true
+   * }}}
+   *
+   * Interfaces related to native primitive access are:
+   *
+   * `isNullAt(i: Int): Boolean`
+   *
+   * `getInt(i: Int): Int`
+   *
+   * `getLong(i: Int): Long`
+   *
+   * `getDouble(i: Int): Double`
+   *
+   * `getFloat(i: Int): Float`
+   *
+   * `getBoolean(i: Int): Boolean`
+   *
+   * `getShort(i: Int): Short`
+   *
+   * `getByte(i: Int): Byte`
+   *
+   * `getString(i: Int): String`
+   *
+   * Fields in a [[Row]] object can be extracted in a pattern match. 
Example:
+   * {{{
+   * import org.apache.spark.sql._
+   *
+   * val pairs = sql("SELECT key, value FROM src").rdd.map {
+   *   case Row(key: Int, value: String) =>
+   *     key -> value
+   * }
+   * }}}
+   *
+   * @group row
+   */
+  @DeveloperApi
+  val Row = catalyst.expressions.Row
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * The base type of all Spark SQL data types.
+   *
+   * @group dataType
+   */
+  @DeveloperApi
+  type DataType = catalyst.types.DataType
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * The data type representing `String` values.
+   *
+   * @group dataType
+   */
+  @DeveloperApi
+  val StringType = catalyst.types.StringType
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * The data type representing `Array[Byte]` values.
+   *
+   * @group dataType
+   */
+  @DeveloperApi
+  val BinaryType = catalyst.types.BinaryType
+
+  /**
+   * 

[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15483710
  
--- Diff: python/pyspark/sql.py ---
@@ -20,8 +20,413 @@
 
 from py4j.protocol import Py4JError
 
-__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+__all__ = [
+    "StringType", "BinaryType", "BooleanType", "DecimalType", "DoubleType",
+    "FloatType", "ByteType", "IntegerType", "LongType", "ShortType",
+    "ArrayType", "MapType", "StructField", "StructType",
+    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
+
+# Metaclass that caches a single shared instance for each primitive type class.
+class PrimitiveTypeSingleton(type):
+    _instances = {}
+
+    def __call__(cls):
+        if cls not in cls._instances:
+            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
+        return cls._instances[cls]
+
+class StringType(object):
+    """Spark SQL StringType
+
+    The data type representing string values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "StringType"
+
+class BinaryType(object):
+    """Spark SQL BinaryType
+
+    The data type representing bytes values and bytearray values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "BinaryType"
+
+class BooleanType(object):
+    """Spark SQL BooleanType
+
+    The data type representing bool values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "BooleanType"
+
+class TimestampType(object):
+    """Spark SQL TimestampType"""
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "TimestampType"
+
+class DecimalType(object):
+    """Spark SQL DecimalType
+
+    The data type representing decimal.Decimal values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "DecimalType"
+
+class DoubleType(object):
+    """Spark SQL DoubleType
+
+    The data type representing float values. Because a float value
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "DoubleType"
+
+class FloatType(object):
+    """Spark SQL FloatType
+
+    For PySpark, please use L{DoubleType} instead of using L{FloatType}.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "FloatType"
+
+class ByteType(object):
+    """Spark SQL ByteType
+
+    For PySpark, please use L{IntegerType} instead of using L{ByteType}.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "ByteType"
+
+class IntegerType(object):
+    """Spark SQL IntegerType
+
+    The data type representing int values.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "IntegerType"
+
+class LongType(object):
+    """Spark SQL LongType
+
+    The data type representing long values. If any value is beyond the range of
+    [-9223372036854775808, 9223372036854775807], please use DecimalType.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "LongType"
+
+class ShortType(object):
+    """Spark SQL ShortType
+
+    For PySpark, please use L{IntegerType} instead of using L{ShortType}.
+
+    """
+    __metaclass__ = PrimitiveTypeSingleton
+
+    def _get_scala_type_string(self):
+        return "ShortType"
+
+class ArrayType(object):
+    """Spark SQL ArrayType
+
+    The data type representing list values.
+
+    """
+    def __init__(self, elementType, containsNull):
--- End diff --

Should we have the same default value for containsNull that we have in 
Scala?
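
For reference, a minimal Scala sketch of the pattern being referred to; the
`DataType` stub and the `containsNull = false` default are illustrative
assumptions here, not the actual catalyst source:

    sealed trait DataType
    case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

    object ArrayType {
      // Companion apply supplying the assumed default, so ArrayType(elementType)
      // behaves like ArrayType(elementType, containsNull = false).
      def apply(elementType: DataType): ArrayType =
        ArrayType(elementType, containsNull = false)
    }

Mirroring that default in the Python `__init__` would keep the two APIs consistent.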


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15483749
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.api.java.types;
+
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * The base type of all Spark SQL data types.
+ */
+public abstract class DataType {
+
+  /**
+   * Gets the StringType object.
+   */
+  public static final StringType StringType = new StringType();
+
+  /**
+   * Gets the BinaryType object.
+   */
+  public static final BinaryType BinaryType = new BinaryType();
+
+  /**
+   * Gets the BooleanType object.
+   */
+  public static final BooleanType BooleanType = new BooleanType();
+
+  /**
+   * Gets the TimestampType object.
+   */
+  public static final TimestampType TimestampType = new TimestampType();
+
+  /**
+   * Gets the DecimalType object.
+   */
+  public static final DecimalType DecimalType = new DecimalType();
+
+  /**
+   * Gets the DoubleType object.
+   */
+  public static final DoubleType DoubleType = new DoubleType();
+
+  /**
+   * Gets the FloatType object.
+   */
+  public static final FloatType FloatType = new FloatType();
+
+  /**
+   * Gets the ByteType object.
+   */
+  public static final ByteType ByteType = new ByteType();
+
+  /**
+   * Gets the IntegerType object.
+   */
+  public static final IntegerType IntegerType = new IntegerType();
+
+  /**
+   * Gets the LongType object.
+   */
+  public static final LongType LongType = new LongType();
+
+  /**
+   * Gets the ShortType object.
+   */
+  public static final ShortType ShortType = new ShortType();
+
+  /**
+   * Creates an ArrayType by specifying the data type of elements ({@code elementType})
+   * and whether the array contains null values ({@code containsNull}).
+   * @param elementType the data type of the array's elements.
+   * @param containsNull whether the array may contain null values.
+   * @return the created ArrayType object.
+   */
+  public static ArrayType createArrayType(DataType elementType, boolean containsNull) {
--- End diff --

Add another method that has a default for containsNull?
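
A sketch of the overload pattern being suggested, written as Scala for brevity
(in the Java class these would be two static factory methods); the stub types
and the `containsNull = false` default are assumptions, not the final API:

    sealed trait DataType
    case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

    object DataTypes {
      def createArrayType(elementType: DataType, containsNull: Boolean): ArrayType =
        ArrayType(elementType, containsNull)

      // Convenience overload filling in the assumed default for containsNull.
      def createArrayType(elementType: DataType): ArrayType =
        createArrayType(elementType, containsNull = false)
    }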


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15492345
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd)))
   }
 
+  /**
+   * Returns the equivalent StructField in Java for the given StructField in Scala.
+   */
+  protected def asJavaStructField(scalaStructField: StructField): JStructField = {
--- End diff --

Originally, I put it in `JavaSQLContext`. But I found I need access to 
`asJavaDataType` in `JavaSchemaRDD`, which only has a `SQLContext` instead of a 
`JavaSQLContext`. I guess we want to refactor `JavaSchemaRDD` to use 
`JavaSQLContext` instead of `SQLContext`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15492409
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd)))
   }
 
+  /**
+   * Returns the equivalent StructField in Java for the given StructField in Scala.
+   */
+  protected def asJavaStructField(scalaStructField: StructField): JStructField = {
--- End diff --

Oh, I see.  These are all static functions, right?  Maybe we could put them 
all in a python support object.
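
A rough sketch of that idea; the simplified field types and the object name
`DataTypeConversions` are hypothetical stand-ins, not the actual classes:

    // Stand-ins for illustration; the real StructField/JStructField carry a DataType.
    case class StructField(name: String, dataType: String, nullable: Boolean)
    case class JStructField(name: String, dataType: String, nullable: Boolean)

    // Because the converters are all static, a plain object lets both SQLContext
    // and JavaSchemaRDD call them without holding a JavaSQLContext.
    object DataTypeConversions {
      def asJavaStructField(f: StructField): JStructField =
        JStructField(f.name, f.dataType, f.nullable)
    }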


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15493024
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd)))
   }
 
+  /**
+   * Returns the equivalent StructField in Java for the given StructField in Scala.
+   */
+  protected def asJavaStructField(scalaStructField: StructField): JStructField = {
--- End diff --

Will move them to a better place.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15499882
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.api.java.types;
+
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * The base type of all Spark SQL data types.
+ */
+public abstract class DataType {
+
+  /**
+   * Gets the StringType object.
+   */
+  public static final StringType StringType = new StringType();
+
+  /**
+   * Gets the BinaryType object.
+   */
+  public static final BinaryType BinaryType = new BinaryType();
+
+  /**
+   * Gets the BooleanType object.
+   */
+  public static final BooleanType BooleanType = new BooleanType();
+
+  /**
+   * Gets the TimestampType object.
+   */
+  public static final TimestampType TimestampType = new TimestampType();
+
+  /**
+   * Gets the DecimalType object.
+   */
+  public static final DecimalType DecimalType = new DecimalType();
+
+  /**
+   * Gets the DoubleType object.
+   */
+  public static final DoubleType DoubleType = new DoubleType();
+
+  /**
+   * Gets the FloatType object.
+   */
+  public static final FloatType FloatType = new FloatType();
+
+  /**
+   * Gets the ByteType object.
+   */
+  public static final ByteType ByteType = new ByteType();
+
+  /**
+   * Gets the IntegerType object.
+   */
+  public static final IntegerType IntegerType = new IntegerType();
+
+  /**
+   * Gets the LongType object.
+   */
+  public static final LongType LongType = new LongType();
+
+  /**
+   * Gets the ShortType object.
+   */
+  public static final ShortType ShortType = new ShortType();
+
+  /**
+   * Creates an ArrayType by specifying the data type of elements ({@code elementType})
+   * and whether the array contains null values ({@code containsNull}).
+   * @param elementType the data type of the array's elements.
+   * @param containsNull whether the array may contain null values.
+   * @return the created ArrayType object.
+   */
+  public static ArrayType createArrayType(DataType elementType, boolean containsNull) {
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15499885
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/api/java/types/package-info.java ---
@@ -0,0 +1,22 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+/**
+ * Allows users to get and create Spark SQL data types.
+ */
+package org.apache.spark.sql.api.java.types;
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15499886
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.api.java.types;
+
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * The base type of all Spark SQL data types.
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50420253
  
QA tests have started for PR 1346. This patch merges cleanly. View progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17319/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50420304
  
QA results for PR 1346:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17319/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50420623
  
QA tests have started for PR 1346. This patch merges cleanly. View progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17320/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50423690
  
@yhuai @marmbrus I am not sure if this has been discussed before, but what 
do you guys think about adding a version of `applySchema(RDD[Array[String]], 
StructType)`? 

The use case I have in mind is TPC-DS data preparation. Currently I have a 
bunch of text files, from which I can easily create an `RDD[String]`; by 
splitting each line on some separator I get an `RDD[Array[String]]`. Now, in 
TPC-DS the tables easily have 15+ columns, and I don't want to manually create 
a `Row` for each `Array[String]`. 
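
A sketch of what such a convenience overload could look like inside SQLContext,
delegating to the existing `applySchema(RDD[Row], StructType)`; this is
illustrative only, not the merged API:

    // Every field arrives as a String; Row.fromSeq builds each Row without
    // spelling out all 15+ columns by hand.
    def applySchema(stringRDD: RDD[Array[String]], schema: StructType): SchemaRDD =
      applySchema(stringRDD.map(fields => Row.fromSeq(fields)), schema)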


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50423851
  
To add to this: for my own purpose, I can certainly hack something together 
based off this branch in a custom Spark build, but just want to throw this 
thought out there as I think it does have some generality (large number of 
columns, avoid writing `.map(p => Row(p(0), p(1), ..., p(LARGE_NUM)))`).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50424563
  
@concretevitamin There is another way to create a row: 
`Row.fromSeq(values: Seq[Any])`. Or, you can expand the array by using `:_*`.
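
Concretely, on an `RDD[Array[String]]` both suggestions look like this
(assuming existing `lines`, `schema`, and `sqlContext` values):

    val rows = lines.map(fields => Row.fromSeq(fields))   // build each Row from a Seq
    val rows2 = lines.map(fields => Row(fields: _*))      // or expand the array as varargs
    val schemaRDD = sqlContext.applySchema(rows, schema)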


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50426883
  
QA results for PR 1346:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17320/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50432240
  
I am reviewing it. Will have an update soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50291054
  
QA tests have started for PR 1346. This patch merges cleanly. View progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17263/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-27 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50292497
  
@yhuai awesome! I will update my diff to use this API.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50293500
  
QA results for PR 1346:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17263/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---