[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-29 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1601#issuecomment-50526602
  
I've merged this into master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1601




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1601#issuecomment-50307889
  
QA tests have started for PR 1601. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17278/consoleFull




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1601#issuecomment-50315422
  
QA results for PR 1601:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17278/consoleFull




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-28 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1601#issuecomment-50372083
  
Spark SQL does not support Set/List, so we would have to treat all sets from
Python as Seq, and then they can't be converted back. Alternatively, we could
drop set support right now.

@mateiz @marmbrus Do we need to clean these up in this PR, or do it later
in a separate issue?
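
To illustrate the concern about sets, here is a minimal Python sketch. It is not Spark code: `to_seq` is a hypothetical helper that stands in for the conversion to a Seq. Once a set has been flattened into a sequence, the result looks exactly like one built from a list, so nothing is left to tell the reverse conversion that the value used to be a set.

```python
def to_seq(value):
    """Hypothetical stand-in for turning a Python collection into a Seq-like list."""
    if isinstance(value, (set, frozenset)):
        # Sort for a deterministic order; the fact that this was a set is lost here.
        return sorted(value)
    if isinstance(value, (list, tuple)):
        return list(value)
    raise TypeError("unsupported collection type: %s" % type(value).__name__)


from_set = to_seq({1, 2, 3})
from_list = to_seq([1, 2, 3])

# Both results are plain lists with the same contents, so the
# original container type cannot be recovered from the data alone.
same_after_conversion = (from_set == from_list)
```

Rejecting sets up front avoids silently returning a different type than the one the user stored.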




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-28 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1601#issuecomment-50372949
  
Let's just remove it now. It should be as easy as adding an error and
removing the tests in question.




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1601#issuecomment-50383239
  
QA tests have started for PR 1601. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17299/consoleFull




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1601#issuecomment-50398161
  
QA results for PR 1601:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17299/consoleFull




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-27 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1601#discussion_r15440847
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -357,16 +357,52 @@ class SQLContext(@transient val sparkContext: SparkContext)
   case c: java.util.Map[_, _] =>
 val (key, value) = c.head
 MapType(typeFor(key), typeFor(value))
+  case c: java.util.Calendar => TimestampType
   case c if c.getClass.isArray =>
 val elem = c.asInstanceOf[Array[_]].head
 ArrayType(typeFor(elem))
   case c => throw new Exception(s"Object of type $c cannot be used")
 }
-val schema = rdd.first().map { case (fieldName, obj) =>
+val firstRow = rdd.first()
+val schema = firstRow.map { case (fieldName, obj) =>
   AttributeReference(fieldName, typeFor(obj), true)()
 }.toSeq
 
-val rowRdd = rdd.mapPartitions { iter =>
+def needTransform(obj: Any): Boolean = obj match {
+  case c: java.util.List[_] => c.exists(needTransform)
+  case c: java.util.Set[_] => c.exists(needTransform)
+  case c: java.util.Map[_, _] => c.exists {
+case (key, value) => needTransform(key) || needTransform(value)
+  }
+  case c if c.getClass.isArray =>
+c.asInstanceOf[Array[_]].exists(needTransform)
+  case c: java.util.Calendar => true
+  case c => false
+}
+
+def transform(obj: Any): Any = obj match {
+  case c: java.util.List[_] => c.map(transform)
+  case c: java.util.Set[_] => c.map(transform)
+  case c: java.util.Map[_, _] => c.map {
+case (key, value) => (transform(key), transform(value))
+  }
--- End diff --

Spark SQL expects Scala Maps and Seqs internally.




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-27 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1601#discussion_r15444580
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -357,16 +357,52 @@ class SQLContext(@transient val sparkContext: SparkContext)
   case c: java.util.Map[_, _] =>
 val (key, value) = c.head
 MapType(typeFor(key), typeFor(value))
+  case c: java.util.Calendar => TimestampType
   case c if c.getClass.isArray =>
 val elem = c.asInstanceOf[Array[_]].head
 ArrayType(typeFor(elem))
   case c => throw new Exception(s"Object of type $c cannot be used")
 }
-val schema = rdd.first().map { case (fieldName, obj) =>
+val firstRow = rdd.first()
+val schema = firstRow.map { case (fieldName, obj) =>
   AttributeReference(fieldName, typeFor(obj), true)()
 }.toSeq
 
-val rowRdd = rdd.mapPartitions { iter =>
+def needTransform(obj: Any): Boolean = obj match {
+  case c: java.util.List[_] => c.exists(needTransform)
+  case c: java.util.Set[_] => c.exists(needTransform)
+  case c: java.util.Map[_, _] => c.exists {
+case (key, value) => needTransform(key) || needTransform(value)
+  }
+  case c if c.getClass.isArray =>
+c.asInstanceOf[Array[_]].exists(needTransform)
+  case c: java.util.Calendar => true
+  case c => false
+}
+
+def transform(obj: Any): Any = obj match {
+  case c: java.util.List[_] => c.map(transform)
+  case c: java.util.Set[_] => c.map(transform)
+  case c: java.util.Map[_, _] => c.map {
+case (key, value) => (transform(key), transform(value))
+  }
--- End diff --

Ah, okay. Then this looks good to me if the SQL part seems okay.




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-26 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/1601

[SPARK-2674] [SQL] [PySpark] support datetime type for SchemaRDD

Python datetime and time objects are converted into java.util.Calendar after
serialization, and then into java.sql.Timestamp during inferSchema().

In javaToPython(), each Timestamp is converted into a Calendar, which becomes
a Python datetime after pickling.
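
As a rough illustration of what such a round trip can preserve, here is a minimal Python sketch. It does not use Spark or the PR's code: `to_millis` and `from_millis` are hypothetical helpers, and the sketch assumes a naive datetime treated as UTC and a millisecond-precision intermediate value, which is the precision java.util.Calendar holds. Under that assumption, any sub-millisecond part of a Python datetime does not survive the trip.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def to_millis(dt):
    """Naive datetime (treated as UTC) -> whole milliseconds since the epoch."""
    delta = dt - EPOCH
    return (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000


def from_millis(ms):
    """Whole milliseconds since the epoch -> naive datetime."""
    return EPOCH + timedelta(milliseconds=ms)


original = datetime(2014, 7, 26, 6, 59, 39, 123456)
restored = from_millis(to_millis(original))

# The date and time of day survive; microseconds below 1 ms are truncated.
lost = original - restored
```

Time zones are deliberately left out of this sketch; how the actual conversion handles them is not stated in the description above.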

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark date

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1601.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1601


commit 96db384f9eba821cad803ee80e3e00e1dea50085
Author: Davies Liu davies@gmail.com
Date:   2014-07-26T06:59:39Z

support datetime type for SchemaRDD






[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1601#issuecomment-50225682
  
QA tests have started for PR 1601. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17219/consoleFull




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1601#issuecomment-50228401
  
QA results for PR 1601:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17219/consoleFull




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-26 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1601#discussion_r15434945
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -395,6 +395,11 @@ class SchemaRDD(
   arr.asInstanceOf[Array[Any]].map {
 element => rowToMap(element.asInstanceOf[Row], struct)
   }
+case t: java.sql.Timestamp => {
--- End diff --

This block can be removed.




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-26 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1601#discussion_r15435519
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -357,16 +357,52 @@ class SQLContext(@transient val sparkContext: SparkContext)
   case c: java.util.Map[_, _] =>
 val (key, value) = c.head
 MapType(typeFor(key), typeFor(value))
+  case c: java.util.Calendar => TimestampType
   case c if c.getClass.isArray =>
 val elem = c.asInstanceOf[Array[_]].head
 ArrayType(typeFor(elem))
   case c => throw new Exception(s"Object of type $c cannot be used")
 }
-val schema = rdd.first().map { case (fieldName, obj) =>
+val firstRow = rdd.first()
+val schema = firstRow.map { case (fieldName, obj) =>
   AttributeReference(fieldName, typeFor(obj), true)()
 }.toSeq
 
-val rowRdd = rdd.mapPartitions { iter =>
+def needTransform(obj: Any): Boolean = obj match {
+  case c: java.util.List[_] => c.exists(needTransform)
+  case c: java.util.Set[_] => c.exists(needTransform)
+  case c: java.util.Map[_, _] => c.exists {
+case (key, value) => needTransform(key) || needTransform(value)
+  }
+  case c if c.getClass.isArray =>
+c.asInstanceOf[Array[_]].exists(needTransform)
+  case c: java.util.Calendar => true
+  case c => false
+}
+
+def transform(obj: Any): Any = obj match {
+  case c: java.util.List[_] => c.map(transform)
+  case c: java.util.Set[_] => c.map(transform)
+  case c: java.util.Map[_, _] => c.map {
+case (key, value) => (transform(key), transform(value))
+  }
--- End diff --

FYI, this will return a Scala Map, not a Java one. Same with the maps on 
List, Set, etc. Will the rest of the code know how to deal with this?
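
The point raised here, that mapping over a collection can change its container type, can be illustrated outside Spark. The following is a minimal Python sketch, not the PR's Scala code: `need_transform` and `transform` are hypothetical names that mirror the structure of the diff, and `datetime.isoformat()` merely stands in for the Calendar-to-Timestamp conversion.

```python
from datetime import datetime


def need_transform(obj):
    """Return True if obj is, or contains at any depth, a datetime."""
    if isinstance(obj, (list, tuple, set)):
        return any(need_transform(x) for x in obj)
    if isinstance(obj, dict):
        return any(need_transform(k) or need_transform(v) for k, v in obj.items())
    return isinstance(obj, datetime)


def transform(obj):
    """Recursively convert datetimes; sequence-like containers come back as lists."""
    if isinstance(obj, (list, tuple, set)):
        # The input may be a tuple or set, but the output is always a list,
        # analogous to a Java collection coming back as a Scala one.
        return [transform(x) for x in obj]
    if isinstance(obj, dict):
        return {transform(k): transform(v) for k, v in obj.items()}
    if isinstance(obj, datetime):
        return obj.isoformat()
    return obj


row = {"when": datetime(2014, 7, 26), "tags": ("a", "b")}
converted = transform(row) if need_transform(row) else row
```

In this sketch the `tags` tuple comes back as a list, so any later code that checks for the original container type has to be prepared for the new one.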




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1601#issuecomment-50253534
  
QA tests have started for PR 1601. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17236/consoleFull




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1601#issuecomment-50254990
  
QA results for PR 1601:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17236/consoleFull

