[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1601#issuecomment-50526602 I've merged this into master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1601
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1601#issuecomment-50307889 QA tests have started for PR 1601. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17278/consoleFull
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1601#issuecomment-50315422
QA results for PR 1601:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17278/consoleFull
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1601#issuecomment-50372083 Spark SQL does not support Set/List, so we would have to treat every set from Python as a Seq, which then cannot be converted back into a set. Alternatively, we could drop set support right now. @mateiz @marmbrus Do we need to clean this up in this PR, or do it later in another issue?
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1601#issuecomment-50372949 Let's just remove it now. It should be as easy as adding an error and removing the tests in question.
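To make the "adding an error" suggestion concrete, here is a minimal Python sketch of type inference that rejects sets up front. It is an illustration only, not the code from this pull request: `infer_type_name` and the type-name strings it returns are hypothetical, and the real inference happens in Scala inside `SQLContext`.

```python
from datetime import datetime


def infer_type_name(obj):
    """Return a simple type name for obj, or raise for unsupported types.

    Hypothetical helper for illustration; not part of Spark.
    """
    # Sets are rejected outright: they would have to become a sequence,
    # and a sequence cannot be turned back into a set on the way out.
    if isinstance(obj, (set, frozenset)):
        raise TypeError("set is not supported; convert it to a list first")
    if isinstance(obj, bool):  # check bool before int, since bool is an int
        return "boolean"
    if isinstance(obj, int):
        return "integer"
    if isinstance(obj, float):
        return "double"
    if isinstance(obj, str):
        return "string"
    if isinstance(obj, datetime):
        return "timestamp"
    if isinstance(obj, (list, tuple)):
        if not obj:
            raise ValueError("cannot infer the element type of an empty sequence")
        # Like the diff under review, look only at the first element.
        return "array<%s>" % infer_type_name(obj[0])
    if isinstance(obj, dict):
        if not obj:
            raise ValueError("cannot infer the key/value types of an empty dict")
        key, value = next(iter(obj.items()))
        return "map<%s,%s>" % (infer_type_name(key), infer_type_name(value))
    raise TypeError("object of type %s cannot be used" % type(obj).__name__)
```

Failing early with a clear message avoids a silent set-to-sequence conversion that could not be reversed when rows are handed back to Python.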
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1601#issuecomment-50383239 QA tests have started for PR 1601. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17299/consoleFull
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1601#issuecomment-50398161
QA results for PR 1601:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17299/consoleFull
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1601#discussion_r15440847
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -357,16 +357,52 @@ class SQLContext(@transient val sparkContext: SparkContext)
       case c: java.util.Map[_, _] =>
         val (key, value) = c.head
         MapType(typeFor(key), typeFor(value))
+      case c: java.util.Calendar => TimestampType
       case c if c.getClass.isArray =>
         val elem = c.asInstanceOf[Array[_]].head
         ArrayType(typeFor(elem))
       case c => throw new Exception(s"Object of type $c cannot be used")
     }
-    val schema = rdd.first().map { case (fieldName, obj) =>
+    val firstRow = rdd.first()
+    val schema = firstRow.map { case (fieldName, obj) =>
       AttributeReference(fieldName, typeFor(obj), true)()
     }.toSeq
-    val rowRdd = rdd.mapPartitions { iter =>
+    def needTransform(obj: Any): Boolean = obj match {
+      case c: java.util.List[_] => c.exists(needTransform)
+      case c: java.util.Set[_] => c.exists(needTransform)
+      case c: java.util.Map[_, _] => c.exists {
+        case (key, value) => needTransform(key) || needTransform(value)
+      }
+      case c if c.getClass.isArray =>
+        c.asInstanceOf[Array[_]].exists(needTransform)
+      case c: java.util.Calendar => true
+      case c => false
+    }
+
+    def transform(obj: Any): Any = obj match {
+      case c: java.util.List[_] => c.map(transform)
+      case c: java.util.Set[_] => c.map(transform)
+      case c: java.util.Map[_, _] => c.map {
+        case (key, value) => (transform(key), transform(value))
+      }
--- End diff --
Spark SQL expects Scala Maps and Seqs internally.
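The check-then-convert pattern in the quoted hunk can be sketched in Python as a rough analogue. Everything here is an assumption made for illustration: `Timestamp` is a hypothetical stand-in for `java.sql.Timestamp`, `need_transform`, `transform`, and `convert_rows` are invented names, and deciding from the first row alone is a guess, because the quoted diff is cut off before the call site.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in for java.sql.Timestamp (milliseconds since the epoch).
Timestamp = namedtuple("Timestamp", ["millis"])
EPOCH = datetime(1970, 1, 1)


def need_transform(obj):
    """Return True if obj is, or contains, a value that must be converted."""
    if isinstance(obj, dict):
        return any(need_transform(k) or need_transform(v) for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        return any(need_transform(x) for x in obj)
    return isinstance(obj, datetime)


def transform(obj):
    """Recursively convert datetimes, rebuilding the containers on the way."""
    if isinstance(obj, dict):
        return {transform(k): transform(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Every sequence comes back as a list, much as the Scala code
        # returns Scala collections rather than the Java ones it received.
        return [transform(x) for x in obj]
    if isinstance(obj, datetime):
        delta = obj - EPOCH
        millis = (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000
        return Timestamp(millis)
    return obj


def convert_rows(rows):
    """Convert all rows only if the first row needs it (assumed behaviour)."""
    rows = list(rows)
    if rows and need_transform(rows[0]):
        return [transform(row) for row in rows]
    return rows
```

The point of the separate predicate is to skip the per-row rebuild entirely when no row contains a date value. Note that the output container types differ from the input ones, which is the concern raised in the review.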
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1601#discussion_r15444580
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -357,16 +357,52 @@ class SQLContext(@transient val sparkContext: SparkContext)
       case c: java.util.Map[_, _] =>
         val (key, value) = c.head
         MapType(typeFor(key), typeFor(value))
+      case c: java.util.Calendar => TimestampType
       case c if c.getClass.isArray =>
         val elem = c.asInstanceOf[Array[_]].head
         ArrayType(typeFor(elem))
       case c => throw new Exception(s"Object of type $c cannot be used")
     }
-    val schema = rdd.first().map { case (fieldName, obj) =>
+    val firstRow = rdd.first()
+    val schema = firstRow.map { case (fieldName, obj) =>
       AttributeReference(fieldName, typeFor(obj), true)()
     }.toSeq
-    val rowRdd = rdd.mapPartitions { iter =>
+    def needTransform(obj: Any): Boolean = obj match {
+      case c: java.util.List[_] => c.exists(needTransform)
+      case c: java.util.Set[_] => c.exists(needTransform)
+      case c: java.util.Map[_, _] => c.exists {
+        case (key, value) => needTransform(key) || needTransform(value)
+      }
+      case c if c.getClass.isArray =>
+        c.asInstanceOf[Array[_]].exists(needTransform)
+      case c: java.util.Calendar => true
+      case c => false
+    }
+
+    def transform(obj: Any): Any = obj match {
+      case c: java.util.List[_] => c.map(transform)
+      case c: java.util.Set[_] => c.map(transform)
+      case c: java.util.Map[_, _] => c.map {
+        case (key, value) => (transform(key), transform(value))
+      }
--- End diff --
Ah, okay. Then this looks good to me if the SQL part seems okay.
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/1601

[SPARK-2674] [SQL] [PySpark] support datetime type for SchemaRDD

Datetime and time values in Python are converted into java.util.Calendar after serialization, and then into java.sql.Timestamp during inferSchema(). In javaToPython(), each Timestamp is converted back into a Calendar, which then becomes a datetime in Python after pickling.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark date

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1601.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1601

commit 96db384f9eba821cad803ee80e3e00e1dea50085
Author: Davies Liu davies@gmail.com
Date: 2014-07-26T06:59:39Z

    support datetime type for SchemaRDD
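The round trip in the description (Python datetime to a JVM date value and back) can be modelled in plain Python. This is a sketch under stated assumptions, not the serialization code itself: `to_epoch_millis` and `from_epoch_millis` are hypothetical helpers, and the model assumes the intermediate value keeps only millisecond precision, as java.util.Calendar does.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def to_epoch_millis(dt):
    """Model the Python -> JVM step: keep whole milliseconds since the epoch."""
    delta = dt - EPOCH
    return (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000


def from_epoch_millis(millis):
    """Model the JVM -> Python step: rebuild a datetime from milliseconds."""
    return EPOCH + timedelta(milliseconds=millis)


original = datetime(2014, 7, 26, 6, 59, 39, 123456)
round_tripped = from_epoch_millis(to_epoch_millis(original))
```

Under that assumption, `round_tripped` keeps the date and time of `original` but carries 123000 microseconds instead of 123456, so values compared after a round trip should be compared at millisecond granularity.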
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1601#issuecomment-50225682 QA tests have started for PR 1601. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17219/consoleFull
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1601#issuecomment-50228401
QA results for PR 1601:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17219/consoleFull
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1601#discussion_r15434945
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -395,6 +395,11 @@ class SchemaRDD(
           arr.asInstanceOf[Array[Any]].map {
             element => rowToMap(element.asInstanceOf[Row], struct)
           }
+        case t: java.sql.Timestamp => {
--- End diff --
This block can be removed.
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1601#discussion_r15435519
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -357,16 +357,52 @@ class SQLContext(@transient val sparkContext: SparkContext)
       case c: java.util.Map[_, _] =>
         val (key, value) = c.head
         MapType(typeFor(key), typeFor(value))
+      case c: java.util.Calendar => TimestampType
       case c if c.getClass.isArray =>
         val elem = c.asInstanceOf[Array[_]].head
         ArrayType(typeFor(elem))
       case c => throw new Exception(s"Object of type $c cannot be used")
     }
-    val schema = rdd.first().map { case (fieldName, obj) =>
+    val firstRow = rdd.first()
+    val schema = firstRow.map { case (fieldName, obj) =>
       AttributeReference(fieldName, typeFor(obj), true)()
     }.toSeq
-    val rowRdd = rdd.mapPartitions { iter =>
+    def needTransform(obj: Any): Boolean = obj match {
+      case c: java.util.List[_] => c.exists(needTransform)
+      case c: java.util.Set[_] => c.exists(needTransform)
+      case c: java.util.Map[_, _] => c.exists {
+        case (key, value) => needTransform(key) || needTransform(value)
+      }
+      case c if c.getClass.isArray =>
+        c.asInstanceOf[Array[_]].exists(needTransform)
+      case c: java.util.Calendar => true
+      case c => false
+    }
+
+    def transform(obj: Any): Any = obj match {
+      case c: java.util.List[_] => c.map(transform)
+      case c: java.util.Set[_] => c.map(transform)
+      case c: java.util.Map[_, _] => c.map {
+        case (key, value) => (transform(key), transform(value))
+      }
--- End diff --
FYI, this will return a Scala Map, not a Java one. The same goes for the map calls on List, Set, etc. Will the rest of the code know how to deal with this?
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1601#issuecomment-50253534 QA tests have started for PR 1601. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17236/consoleFull
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1601#issuecomment-50254990
QA results for PR 1601:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17236/consoleFull