[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61550257 Thanks! Merged to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2716
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61429974 Sorry for the delay here. If you can merge I'll try to squeeze this into 1.2. Thanks!
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61435364 @marmbrus waiting for jenkins
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61435441 [Test build #22792 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22792/consoleFull) for PR 2716 at commit [`e678f6d`](https://github.com/apache/spark/commit/e678f6d856a31231c9bff4381a19bdf3cd6166e2). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61435474 [Test build #508 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/508/consoleFull) for PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e). * This patch **does not merge cleanly**.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61438114 [Test build #508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/508/consoleFull) for PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds the following public classes _(experimental)_: * `class NullType(PrimitiveType):`
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61439288 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22792/ Test PASSed.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61439280 [Test build #22792 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22792/consoleFull) for PR 2716 at commit [`e678f6d`](https://github.com/apache/spark/commit/e678f6d856a31231c9bff4381a19bdf3cd6166e2). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class NullType(PrimitiveType):` * `// in some cases, such as when a class is enclosed in an object (in which case` * `abstract class UserDefinedType[UserType] extends DataType with Serializable ` * `public abstract class UserDefinedType<UserType> extends DataType implements Serializable `
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61440930 @marmbrus It's ready to go
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61314256 Ping dashboard
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61194470 [Test build #495 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/495/consoleFull) for PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61203599 [Test build #495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/495/consoleFull) for PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61026632 [Test build #22496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22496/consoleFull) for PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61033023 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22496/ Test FAILed.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-61033022 [Test build #22496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22496/consoleFull) for PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class AddWebUIFilter(filterName:String, filterParams: Map[String, String], proxyBase: String)` * ` case class RequestExecutors(requestedTotal: Int) extends CoarseGrainedClusterMessage` * ` case class KillExecutors(executorIds: Seq[String]) extends CoarseGrainedClusterMessage` * `class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val actorSystem: ActorSystem)` * `class NullType(PrimitiveType):`
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60554724 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22273/ Test PASSed.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60554714 [Test build #22273 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22273/consoleFull) for PR 2716 at commit [`567dc60`](https://github.com/apache/spark/commit/567dc60d7ce2c43ec7c1e24e47dc515ab5056ac0). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class NullType(PrimitiveType):`
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/2716#discussion_r19384056

--- Diff: python/pyspark/sql.py ---
@@ -995,19 +1038,22 @@ def registerFunction(self, name, f, returnType=StringType()):
                                      self._sc._javaAccumulator,
                                      returnType.json())
 
-    def inferSchema(self, rdd):
+    def inferSchema(self, rdd, samplingRatio=None):
         Infer and apply a schema to an RDD of L{Row}.
 
-        We peek at the first row of the RDD to determine the fields' names
-        and types. Nested collections are supported, which include array,
-        dict, list, Row, tuple, namedtuple, or object.
+        If `samplingRatio` is presented, it infer schema by all of the sampled
+        dataset.
 
-        All the rows in `rdd` should have the same type with the first one,
-        or it will cause runtime exceptions.
+        Otherwise, it peeks first few rows of the RDD to determine the fields'
+        names and types. Nested collections are supported, which include array,
+        dict, list, Row, tuple, namedtuple, or object.
 
         Each row could be L{pyspark.sql.Row} object or namedtuple or objects,
         using dict is deprecated.
 
+        If some of rows has different types with inferred types, it may cause
+        runtime exceptions.
--- End diff --

When `samplingRatio` is specified, the schema is inferred by looking at the types of each row in the sampled dataset. Otherwise, the first 100 rows of the RDD are inspected. Nested collections are supported, which can include array, dict, list, Row, tuple, namedtuple, or object.

Each row could be L{pyspark.sql.Row} object or namedtuple or objects. Using top level dicts is deprecated, as this datatype is used to represent Maps.

If a single column has multiple distinct inferred types, it may cause runtime exceptions.
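The last point in the suggested wording, that a column with several distinct inferred types may cause runtime exceptions, can be checked ahead of time on a sample. The following is only a rough sketch in plain Python, not PySpark's implementation: `sample_rows` and `conflicting_columns` are hypothetical helper names, rows are modelled as plain dicts for brevity, and type names are just Python class names.

```python
import random


def sample_rows(rows, sampling_ratio, seed=0):
    """Keep each row independently with probability `sampling_ratio` (hypothetical helper)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < sampling_ratio]


def conflicting_columns(rows):
    """Return {column: set of type names} for columns with more than one non-None type."""
    seen = {}
    for row in rows:
        for name, value in row.items():
            if value is not None:  # None carries no type information
                seen.setdefault(name, set()).add(type(value).__name__)
    return {name: types for name, types in seen.items() if len(types) > 1}


rows = [{"id": 1, "score": 1.5}, {"id": "2", "score": None}, {"id": 3, "score": 2.0}]
# A ratio of 1.0 keeps every row, so the mixed int/str "id" column is reported.
conflicts = conflicting_columns(sample_rows(rows, sampling_ratio=1.0))
```

A smaller ratio trades accuracy for speed: a conflict that only appears in rows outside the sample goes unnoticed.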
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60534136 Minor comment on documentation wording. Otherwise this LGTM!
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60550377 @marmbrus fixed, thanks!
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60550499 [Test build #22273 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22273/consoleFull) for PR 2716 at commit [`567dc60`](https://github.com/apache/spark/commit/567dc60d7ce2c43ec7c1e24e47dc515ab5056ac0). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60473675 [Test build #22197 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22197/consoleFull) for PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60473680 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22197/ Test FAILed.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60473788 failed:
```
[info] - sorting without aggregation, with spill *** FAILED ***
[info]   java.io.FileNotFoundException: /tmp/spark-local-20141024230838-6b0e/07/temp_shuffle_79289879-f38b-46f0-9f49-99c962fca570 (No such file or directory)
[info]   at java.io.FileOutputStream.open(Native Method)
[info]   at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
[info]   at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
[info]   at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
[info]   at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:300)
[info]   at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:251)
[info]   at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:83)
[info]   at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:77)
[info]   at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:83)
[info]   at org.apache.spark.util.collection.ExternalSorter.maybeSpillCollection(ExternalSorter.scala:238)
```
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60473797 [Test build #433 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/433/consoleFull) for PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60475531 [Test build #433 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/433/consoleFull) for PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60491420 [Test build #451 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/451/consoleFull) for PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60495230 [Test build #451 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/451/consoleFull) for PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class NullType(PrimitiveType):`
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60472563 [Test build #22197 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22197/consoleFull) for PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58615814 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/344/consoleFull) for PR 2716 at commit [`29e94d5`](https://github.com/apache/spark/commit/29e94d5764d6b9d1877fd16a9041f6b0ad61b347). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class NullType(PrimitiveType):`
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58615995 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21572/ Test PASSed.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58615990 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21572/consoleFull) for PR 2716 at commit [`e48d7fb`](https://github.com/apache/spark/commit/e48d7fb0800946a50922caae0062805d0fd4c371). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class NullType(PrimitiveType):`
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58611722 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/344/consoleFull) for PR 2716 at commit [`29e94d5`](https://github.com/apache/spark/commit/29e94d5764d6b9d1877fd16a9041f6b0ad61b347). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58611800 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21572/consoleFull) for PR 2716 at commit [`e48d7fb`](https://github.com/apache/spark/commit/e48d7fb0800946a50922caae0062805d0fd4c371). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2716

[SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling

This patch tries to infer the schema of an RDD whose first row contains empty values (None, [], {}). It examines the first 100 rows and merges their types into the schema. If the schema still contains a NullType, it shows a warning telling the user to retry with sampling. If a sampling ratio is given, it infers the schema from all rows that remain after sampling. The patch also adds samplingRatio to jsonFile() and jsonRDD().

You can merge this pull request into a Git repository by running: $ git pull https://github.com/davies/spark infer

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2716.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2716

commit 3603e00852f94568523bc641c756cde881616017
Author: Davies Liu davies@gmail.com
Date: 2014-10-08T19:29:00Z

take more rows to infer schema, or infer the schema by sampling the RDD
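The inference strategy described in the pull request above can be illustrated with a minimal pure-Python sketch. This is not the PySpark implementation: the names `infer_type`, `merge_types`, `infer_schema`, and the `NULL` placeholder are hypothetical, rows are assumed to be plain `dict`s in a list rather than an RDD, and types are represented by their Python type names.

```python
import random
import warnings

NULL = "null"  # hypothetical stand-in for NullType: no type information yet


def infer_type(value):
    """Return a type name for one value; empty values (None, [], {}) map to NULL."""
    if value is None or value == [] or value == {}:
        return NULL
    return type(value).__name__


def merge_types(a, b):
    """Merge two inferred types; NULL gives way to any concrete type."""
    if a == NULL:
        return b
    if b == NULL or a == b:
        return a
    # Conflicting concrete types are out of scope for this sketch.
    raise TypeError("cannot merge types %s and %s" % (a, b))


def infer_schema(rows, sampling_ratio=None, first_n=100, seed=0):
    """Infer {field: type name} from a list of dict rows."""
    if sampling_ratio is None:
        # Default path: look only at the first `first_n` rows.
        candidates = rows[:first_n]
    else:
        # Sampling path: use every row that survives the sampling.
        rng = random.Random(seed)
        candidates = [r for r in rows if rng.random() < sampling_ratio]

    schema = {}
    for row in candidates:
        for field, value in row.items():
            schema[field] = merge_types(schema.get(field, NULL), infer_type(value))

    if NULL in schema.values():
        warnings.warn("some fields are still untyped; try again with a sampling ratio")
    return schema
```

For example, `infer_schema([{"a": None}, {"a": 1}])` resolves field `a` from the second row even though the first row is empty. As noted later in this thread, the patch does not handle different types for one field; the sketch simply raises a `TypeError` in that case.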
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58414816 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21480/consoleFull) for PR 2716 at commit [`3603e00`](https://github.com/apache/spark/commit/3603e00852f94568523bc641c756cde881616017). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58415442 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21481/consoleFull) for PR 2716 at commit [`f93fd84`](https://github.com/apache/spark/commit/f93fd84ce4ce7fd69ba8e12ff5343cd46116f78d). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58424360 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21480/ Test PASSed.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58424351 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21480/consoleFull) for PR 2716 at commit [`3603e00`](https://github.com/apache/spark/commit/3603e00852f94568523bc641c756cde881616017).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class NullType(DataType):`
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58425002 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21481/ Test PASSed.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58424997 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21481/consoleFull) for PR 2716 at commit [`f93fd84`](https://github.com/apache/spark/commit/f93fd84ce4ce7fd69ba8e12ff5343cd46116f78d).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class NullType(DataType):`
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58437772 @davies I believe this PR also relates to the features discussed in [SPARK-2870](https://issues.apache.org/jira/browse/SPARK-2870). Since you are already doing schema inference over multiple rows with an optional sampling ratio in this PR, how far are we from being able to do full schema inference on RDDs of `dict`s?
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58440322 @nchammas This PR only fixes the problem of empty values in the first few rows; it cannot handle different types for one field (as json() does). Maybe we could support optional fields.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58444525 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21489/consoleFull) for PR 2716 at commit [`540d1d5`](https://github.com/apache/spark/commit/540d1d5ecfc1a3678e453bcfacd3bbeac4ce). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58449106 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21489/consoleFull) for PR 2716 at commit [`540d1d5`](https://github.com/apache/spark/commit/540d1d5ecfc1a3678e453bcfacd3bbeac4ce).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class NullType(DataType):`
[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58449107 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21489/ Test PASSed.