[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-11-03 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61550257
  
Thanks!  Merged to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-11-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2716





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-11-02 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61429974
  
Sorry for the delay here.  If you can merge, I'll try to squeeze this into 
1.2.  Thanks!





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-11-02 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61435364
  
@marmbrus waiting for jenkins





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61435441
  
[Test build #22792 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22792/consoleFull) for PR 2716 at commit [`e678f6d`](https://github.com/apache/spark/commit/e678f6d856a31231c9bff4381a19bdf3cd6166e2).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61435474
  
[Test build #508 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/508/consoleFull) for PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e).
 * This patch **does not merge cleanly**.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61438114
  
[Test build #508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/508/consoleFull) for PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds the following public classes _(experimental)_:
  * `class NullType(PrimitiveType):`






[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61439288
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22792/
Test PASSed.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61439280
  
[Test build #22792 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22792/consoleFull) for PR 2716 at commit [`e678f6d`](https://github.com/apache/spark/commit/e678f6d856a31231c9bff4381a19bdf3cd6166e2).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class NullType(PrimitiveType):`
  * `//   in some cases, such as when a class is enclosed in an object (in which case`
  * `abstract class UserDefinedType[UserType] extends DataType with Serializable `
  * `public abstract class UserDefinedType<UserType> extends DataType implements Serializable `






[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-11-02 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61440930
  
@marmbrus It's ready to go





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-31 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61314256
  
Ping dashboard





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61194470
  
[Test build #495 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/495/consoleFull) for PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61203599
  
[Test build #495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/495/consoleFull) for PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61026632
  
[Test build #22496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22496/consoleFull) for PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61033023
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22496/
Test FAILed.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-61033022
  
[Test build #22496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22496/consoleFull) for PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class AddWebUIFilter(filterName:String, filterParams: Map[String, String], proxyBase: String)`
  * `  case class RequestExecutors(requestedTotal: Int) extends CoarseGrainedClusterMessage`
  * `  case class KillExecutors(executorIds: Seq[String]) extends CoarseGrainedClusterMessage`
  * `class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val actorSystem: ActorSystem)`
  * `class NullType(PrimitiveType):`






[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-60554724
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22273/
Test PASSed.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-60554714
  
[Test build #22273 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22273/consoleFull) for PR 2716 at commit [`567dc60`](https://github.com/apache/spark/commit/567dc60d7ce2c43ec7c1e24e47dc515ab5056ac0).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class NullType(PrimitiveType):`






[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-26 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/2716#discussion_r19384056
  
--- Diff: python/pyspark/sql.py ---
@@ -995,19 +1038,22 @@ def registerFunction(self, name, f, returnType=StringType()):
   self._sc._javaAccumulator,
   returnType.json())
 
-def inferSchema(self, rdd):
+def inferSchema(self, rdd, samplingRatio=None):
 Infer and apply a schema to an RDD of L{Row}.
 
-We peek at the first row of the RDD to determine the fields' names
-and types. Nested collections are supported, which include array,
-dict, list, Row, tuple, namedtuple, or object.
+If `samplingRatio` is presented, it infer schema by all of the sampled
+dataset.
 
-All the rows in `rdd` should have the same type with the first one,
-or it will cause runtime exceptions.
+Otherwise, it peeks first few rows of the RDD to determine the fields'
+names and types. Nested collections are supported, which include array,
+dict, list, Row, tuple, namedtuple, or object.
 
 Each row could be L{pyspark.sql.Row} object or namedtuple or objects,
 using dict is deprecated.
 
+If some of rows has different types with inferred types, it may cause
+runtime exceptions.
--- End diff --

When `samplingRatio` is specified, the schema is inferred by looking at the 
types of each row in the sampled dataset.  Otherwise, the first 100 rows of the 
RDD are inspected. Nested collections are supported, which can include array, 
dict, list, Row, tuple, namedtuple, or object.

Each row could be L{pyspark.sql.Row} object or namedtuple or objects.  
Using top level dicts is deprecated, as this datatype is used to represent Maps.
 
If a single column has multiple distinct inferred types, it may cause 
runtime exceptions.
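
To make the two inference modes in the wording above concrete, here is a minimal plain-Python sketch. It is not the PySpark implementation: the helper names (`field_types`, `merge_types`, `infer_schema`), the `first_n` and `seed` parameters, and the use of Python type names in place of Spark SQL types are all assumptions made for illustration. It only mirrors the described behaviour: inspect the first 100 rows by default, or every sampled row when a sampling ratio is given, and fail when one column shows two distinct types.

```python
import random
from collections import namedtuple


def field_types(row):
    """Return {field name: Python type name} for one namedtuple row."""
    return {name: type(getattr(row, name)).__name__ for name in row._fields}


def merge_types(schema, row_types):
    """Fold one row's types into the schema inferred so far.

    Assumption of this sketch: a None value says nothing about a column,
    so "NoneType" gives way to any concrete type. Two different concrete
    types for the same column are treated as a conflict.
    """
    merged = dict(schema)
    for name, type_name in row_types.items():
        seen = merged.get(name, "NoneType")
        if seen == "NoneType":
            merged[name] = type_name
        elif type_name != "NoneType" and type_name != seen:
            raise TypeError(
                "column %r has conflicting types: %s and %s"
                % (name, seen, type_name)
            )
    return merged


def infer_schema(rows, sampling_ratio=None, first_n=100, seed=0):
    """Infer column types from the first rows or from a random sample."""
    if sampling_ratio is None:
        # Default mode: only the leading rows are inspected.
        candidates = rows[:first_n]
    else:
        # Sampling mode: each row is kept with probability sampling_ratio.
        rng = random.Random(seed)
        candidates = [r for r in rows if rng.random() < sampling_ratio]
    if not candidates:
        raise ValueError("no rows available to infer a schema from")
    schema = {}
    for row in candidates:
        schema = merge_types(schema, field_types(row))
    return schema


# Hypothetical data: the only non-None name appears after row 100.
Person = namedtuple("Person", ["id", "name"])
rows = [Person(i, None) for i in range(150)] + [Person(150, "Alice")]

print(infer_schema(rows))                      # first 100 rows only
print(infer_schema(rows, sampling_ratio=1.0))  # every row is sampled
```

In this sketch the first call leaves `name` unresolved, because all of the first 100 values are `None`, while the second call sees the later string value. That is the kind of case where sampling a fraction of the whole dataset can help.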






[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-26 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-60534136
  
Minor comment on documentation wording.  Otherwise this LGTM!





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-26 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-60550377
  
@marmbrus fixed, thanks!





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-60550499
  
[Test build #22273 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22273/consoleFull) for PR 2716 at commit [`567dc60`](https://github.com/apache/spark/commit/567dc60d7ce2c43ec7c1e24e47dc515ab5056ac0).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-60473675
  
[Test build #22197 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22197/consoleFull) for PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-60473680
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22197/
Test FAILed.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-25 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-60473788
  
failed: 
```
[info] - sorting without aggregation, with spill *** FAILED ***
[info]   java.io.FileNotFoundException: /tmp/spark-local-20141024230838-6b0e/07/temp_shuffle_79289879-f38b-46f0-9f49-99c962fca570 (No such file or directory)
[info]   at java.io.FileOutputStream.open(Native Method)
[info]   at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
[info]   at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
[info]   at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
[info]   at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:300)
[info]   at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:251)
[info]   at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:83)
[info]   at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:77)
[info]   at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:83)
[info]   at org.apache.spark.util.collection.ExternalSorter.maybeSpillCollection(ExternalSorter.scala:238)
```





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-60473797
  
[Test build #433 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/433/consoleFull) for PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-60475531
  
[Test build #433 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/433/consoleFull) for PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-60491420
  
[Test build #451 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/451/consoleFull) for PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-60495230
  
[Test build #451 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/451/consoleFull) for PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class NullType(PrimitiveType):`






[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-60472563
  
[Test build #22197 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22197/consoleFull) for PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58615814
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/344/consoleFull) for PR 2716 at commit [`29e94d5`](https://github.com/apache/spark/commit/29e94d5764d6b9d1877fd16a9041f6b0ad61b347).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class NullType(PrimitiveType):`






[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58615995
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21572/
Test PASSed.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58615990
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21572/consoleFull) for PR 2716 at commit [`e48d7fb`](https://github.com/apache/spark/commit/e48d7fb0800946a50922caae0062805d0fd4c371).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class NullType(PrimitiveType):`






[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58611722
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/344/consoleFull) for PR 2716 at commit [`29e94d5`](https://github.com/apache/spark/commit/29e94d5764d6b9d1877fd16a9041f6b0ad61b347).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58611800
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21572/consoleFull) for PR 2716 at commit [`e48d7fb`](https://github.com/apache/spark/commit/e48d7fb0800946a50922caae0062805d0fd4c371).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/2716

[SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling

This patch tries to infer the schema for an RDD that has an empty value
(None, [], {}) in the first row. It examines the first 100 rows and merges
their types into the schema. If the schema still contains a NullType, it
shows a warning telling the user to try sampling.

If a sampling ratio is provided, it infers the schema from all the rows in
the sample.

It also adds samplingRatio to jsonFile() and jsonRDD().
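
The inference described above can be sketched in plain Python. This is a minimal illustration, not the code from the patch: the helper names (`infer_type`, `merge_types`, `has_null`, `infer_schema`) and the string-based type descriptions are hypothetical, and no Spark API is used. It merges the types seen in the first 100 rows (or in a sampled subset), lets a concrete type replace an empty one, and warns if an empty type remains.

```python
import random
import warnings
from itertools import islice

NULL = "null"  # hypothetical stand-in for NullType in this sketch


def infer_type(value):
    """Return a simplified type description for one Python value."""
    if value is None:
        return NULL
    if isinstance(value, dict):
        if not value:
            return NULL  # empty dict: nothing to infer yet
        return {key: infer_type(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        if not value:
            return NULL  # empty list: element type unknown
        return [infer_type(value[0])]
    return type(value).__name__


def merge_types(left, right):
    """Merge two descriptions; a concrete type replaces NULL."""
    if left == NULL:
        return right
    if right == NULL:
        return left
    if isinstance(left, dict) and isinstance(right, dict):
        merged = dict(left)
        for key, item in right.items():
            merged[key] = merge_types(merged[key], item) if key in merged else item
        return merged
    if isinstance(left, list) and isinstance(right, list):
        return [merge_types(left[0], right[0])]
    if left != right:
        # Conflicting concrete types are not reconciled in this sketch.
        raise TypeError("cannot merge %r and %r" % (left, right))
    return left


def has_null(schema):
    """Report whether any part of the description is still NULL."""
    if schema == NULL:
        return True
    if isinstance(schema, dict):
        return any(has_null(item) for item in schema.values())
    if isinstance(schema, list):
        return any(has_null(item) for item in schema)
    return False


def infer_schema(rows, sampling_ratio=None, limit=100, seed=0):
    """Infer from the first `limit` rows, or from a random sample of all rows."""
    if sampling_ratio is None:
        candidates = islice(rows, limit)
    else:
        rng = random.Random(seed)
        candidates = (row for row in rows if rng.random() < sampling_ratio)
    schema = NULL
    for row in candidates:
        schema = merge_types(schema, infer_type(row))
    if has_null(schema):
        warnings.warn("some types are still unknown; try a sampling ratio")
    return schema
```

For example, `infer_schema([{"a": None, "b": []}, {"a": 1, "b": [2.5]}])` fills in both fields from the second row even though the first row is empty. Two rows that disagree on a concrete type raise `TypeError` in this sketch, since it makes no attempt to reconcile them.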

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark infer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2716.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2716


commit 3603e00852f94568523bc641c756cde881616017
Author: Davies Liu davies@gmail.com
Date:   2014-10-08T19:29:00Z

take more rows to infer schema, or infer the schema by sampling the RDD







[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58414816
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21480/consoleFull) for PR 2716 at commit [`3603e00`](https://github.com/apache/spark/commit/3603e00852f94568523bc641c756cde881616017).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58415442
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21481/consoleFull) for PR 2716 at commit [`f93fd84`](https://github.com/apache/spark/commit/f93fd84ce4ce7fd69ba8e12ff5343cd46116f78d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58424360
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21480/
Test PASSed.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58424351
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21480/consoleFull) for PR 2716 at commit [`3603e00`](https://github.com/apache/spark/commit/3603e00852f94568523bc641c756cde881616017).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class NullType(DataType):`






[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58425002
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21481/
Test PASSed.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58424997
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21481/consoleFull) for PR 2716 at commit [`f93fd84`](https://github.com/apache/spark/commit/f93fd84ce4ce7fd69ba8e12ff5343cd46116f78d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class NullType(DataType):`






[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread nchammas
Github user nchammas commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58437772
  
@davies I believe this PR also relates to the features discussed in 
[SPARK-2870](https://issues.apache.org/jira/browse/SPARK-2870). Since you are 
already doing schema inference over multiple rows with an optional sampling 
ratio in this PR, how far are we from being able to do full schema inference on 
RDDs of `dict`s?





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58440322
  
@nchammas This PR only fixes the problem of empty values in the first few 
rows; it cannot handle different types for one field (as json() already 
does).

Maybe we could support optional fields.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58444525
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21489/consoleFull) for PR 2716 at commit [`540d1d5`](https://github.com/apache/spark/commit/540d1d5ecfc1a3678e453bcfacd3bbeac4ce).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58449106
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21489/consoleFull) for PR 2716 at commit [`540d1d5`](https://github.com/apache/spark/commit/540d1d5ecfc1a3678e453bcfacd3bbeac4ce).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class NullType(DataType):`






[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58449107
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21489/
Test PASSed.

