[GitHub] spark issue #17495: [SPARK-20172][Core] Add file permission check when listi...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17495
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17495: [SPARK-20172][Core] Add file permission check when listi...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17495
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75684/
Test PASSed.





[GitHub] spark issue #17495: [SPARK-20172][Core] Add file permission check when listi...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17495
  
**[Test build #75684 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75684/testReport)** for PR 17495 at commit [`1d1440b`](https://github.com/apache/spark/commit/1d1440bcd956f0f80e299edea72bf54e1f0d6b0d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "spark.m...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17436
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75685/
Test PASSed.





[GitHub] spark issue #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "spark.m...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17436
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "spark.m...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17436
  
**[Test build #75685 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75685/testReport)** for PR 17436 at commit [`adada35`](https://github.com/apache/spark/commit/adada35ecaaaff4d10bcfd329df5e0090c51f403).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17596: [SPARK-12837][CORE] reduce the serialized size of accumu...

2017-04-10 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/17596
  
My comments were based on a fix in 1.6; a lot of the values were actually observed to be 0 in a lot of cases - just a few were not (even here that is relevant - resultSize, gcTime, the various bytes spilled, etc.). The bitmask ends up being a single long for the cardinality of metrics we have, which typically gets encoded in a byte or two in reality.

In addition, things like input/output/shuffle metrics, accumulator updates, block updates, etc. (present in TaskMetrics in 1.6) can all be omitted from the serialized stream when not present. When present, I used readExternal/writeExternal on those classes to directly encode the values.

IIRC this is relevant not just for the final result, but for heartbeats too - so the serde savings helped a lot more than initially expected.
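The zero-skipping encoding described above can be sketched as follows. This is purely an illustration of the technique, not Spark's actual wire format: a single 64-bit bitmask marks which metrics are non-zero, and only those values are written.

```python
import struct

def encode_metrics(values):
    """Pack up to 64 long-valued metrics: a 64-bit bitmask whose set bits
    mark the non-zero entries, followed by only the non-zero values.
    Zero-valued metrics cost nothing beyond their bitmask bit."""
    assert len(values) <= 64
    mask = 0
    payload = []
    for i, v in enumerate(values):
        if v != 0:
            mask |= 1 << i
            payload.append(v)
    return struct.pack("<Q", mask) + struct.pack(f"<{len(payload)}q", *payload)

def decode_metrics(data, n):
    """Inverse of encode_metrics for n metrics."""
    (mask,) = struct.unpack_from("<Q", data, 0)
    payload = struct.unpack_from(f"<{bin(mask).count('1')}q", data, 8)
    it = iter(payload)
    return [next(it) if mask >> i & 1 else 0 for i in range(n)]

metrics = [0, 0, 1024, 0, 0, 0, 37, 0]  # a mostly-zero metrics vector
blob = encode_metrics(metrics)
assert decode_metrics(blob, len(metrics)) == metrics
assert len(blob) == 8 + 2 * 8  # bitmask + two non-zero longs
```

When most values are zero, the payload collapses to the bitmask alone, which is the saving the comment describes.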





[GitHub] spark issue #17581: [SPARK-20248][ SQL]Spark SQL add limit parameter to enha...

2017-04-10 Thread shaolinliu
Github user shaolinliu commented on the issue:

https://github.com/apache/spark/pull/17581
  
Ok, I have modified the description.





[GitHub] spark issue #17596: [SPARK-12837][CORE] reduce the serialized size of accumu...

2017-04-10 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/17596
  
@mridulm I actually have the same plan. I think it's overkill to implement TaskMetrics with accumulators; we don't need to merge the accumulator updates at the driver side for TaskMetrics accumulators. We should send TaskMetrics back directly with heartbeats, task failure, and task finish - then we can just send a bunch of `long`s and compress them.

One thing I don't 100% agree with you on is the bitmask. According to my experiment, most of the task metrics will not be 0, so the bitmask may not be very useful.
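The "bunch of `long`s and compress" alternative can be illustrated with a pure-Python sketch (hypothetical, not Spark code): pack the metrics as fixed-width longs and run a general-purpose compressor over the buffer.

```python
import struct
import zlib

# A hypothetical flat vector of task metrics, all serialized as raw longs.
metrics = [123456, 0, 8, 9917, 0, 42, 1 << 33, 0, 0, 7]

packed = struct.pack(f"<{len(metrics)}q", *metrics)  # 8 bytes per metric
compressed = zlib.compress(packed)

# Round-trip back to the original values.
restored = list(struct.unpack(f"<{len(metrics)}q", zlib.decompress(compressed)))
assert restored == metrics
```

Whether this beats a bitmask depends on the data, which is exactly the disagreement here: compression wins when runs of identical bytes (e.g. zeros) exist, and buys little when most metrics are non-zero and distinct.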





[GitHub] spark pull request #17602: [MINOR][DOCS] JSON APIs related documentation fix...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17602#discussion_r110814617
  
--- Diff: docs/sql-programming-guide.md ---
@@ -883,7 +883,7 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 
 
 Spark SQL can automatically infer the schema of a JSON dataset and load it 
as a `Dataset[Row]`.
-This conversion can be done using `SparkSession.read.json()` on either an 
RDD of String,
+This conversion can be done using `SparkSession.read.json()` on either a 
`Dataset[String]`,
--- End diff --

Output:

![2017-04-11 1 43 06](https://cloud.githubusercontent.com/assets/6477701/24893164/dbbd4300-1ebc-11e7-91f4-45d6a48f2da1.png)

Example:

![2017-04-11 1 43 10](https://cloud.githubusercontent.com/assets/6477701/24893165/dbe61b0e-1ebc-11e7-9ab6-1a12ef351bb2.png)






[GitHub] spark pull request #17602: [MINOR][DOCS] JSON APIs related documentation fix...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17602#discussion_r110814635
  
--- Diff: docs/sql-programming-guide.md ---
@@ -897,7 +897,7 @@ For a regular multi-line JSON file, set the `wholeFile` 
option to `true`.
 
 
 Spark SQL can automatically infer the schema of a JSON dataset and load it 
as a `Dataset`.
-This conversion can be done using `SparkSession.read().json()` on either 
an RDD of String,
+This conversion can be done using `SparkSession.read().json()` on either a 
`Dataset`,
--- End diff --

Output:
![2017-04-11 1 43 15](https://cloud.githubusercontent.com/assets/6477701/24893173/ee6fdb66-1ebc-11e7-85cf-fe5605d5a7c5.png)

Example:
![2017-04-11 1 43 18](https://cloud.githubusercontent.com/assets/6477701/24893175/f174490a-1ebc-11e7-8434-55f45fa8805b.png)







[GitHub] spark pull request #17581: [SPARK-20248][ SQL]Spark SQL add limit parameter ...

2017-04-10 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/17581#discussion_r110813819
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -359,6 +359,16 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val THRIFTSERVER_RESULT_LIMIT =
+buildConf("spark.sql.thriftserver.retainedResults")
+  .internal()
+  .doc("The number of sql results returned by Thrift Server when 
running a query " +
--- End diff --

What about `The maximum number of rows` instead of `The number of sql 
results`?





[GitHub] spark issue #17602: [MINOR][DOCS] JSON APIs related documentation fixes

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17602
  
**[Test build #75692 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75692/testReport)** for PR 17602 at commit [`3f60861`](https://github.com/apache/spark/commit/3f6086123ec2f2b34360f9494df03ea4f466f510).





[GitHub] spark pull request #17602: [MINOR][DOCS] JSON APIs related documentation fix...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17602#discussion_r110812638
  
--- Diff: docs/sql-programming-guide.md ---
@@ -897,7 +897,7 @@ For a regular multi-line JSON file, set the `wholeFile` 
option to `true`.
 
 
 Spark SQL can automatically infer the schema of a JSON dataset and load it 
as a `Dataset`.
-This conversion can be done using `SparkSession.read().json()` on either 
an RDD of String,
+This conversion can be done using `SparkSession.read().json()` on either 
an `Dataset`,
--- End diff --

Java example uses `Dataset` as below:

![2017-04-11 1 14 54](https://cloud.githubusercontent.com/assets/6477701/24892622/fcad75ac-1eb8-11e7-8141-d0ea59d66cfb.png)

Output:

![2017-04-11 1 14 57](https://cloud.githubusercontent.com/assets/6477701/24892623/ff6a93a6-1eb8-11e7-994a-c1d4654a767e.png)






[GitHub] spark issue #17602: [MINOR][DOCS] JSON APIs related documentation fixes

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17602
  
**[Test build #75691 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75691/testReport)** for PR 17602 at commit [`9043f01`](https://github.com/apache/spark/commit/9043f01ae7c215da84cf12e449497ca7988329ba).





[GitHub] spark issue #17600: [MINOR][SQL] Fix the @since tag when backporting SPARK-1...

2017-04-10 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/17600
  
Merged into master.





[GitHub] spark issue #17601: [MINOR][SQL] Fix the @since tag when backporting SPARK-1...

2017-04-10 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/17601
  
Merged into master.





[GitHub] spark pull request #17602: [MINOR][DOCS] JSON APIs related documentation fix...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17602#discussion_r110811804
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -634,7 +634,9 @@ def saveAsTable(self, name, format=None, mode=None, 
partitionBy=None, **options)
 
 @since(1.4)
 def json(self, path, mode=None, compression=None, dateFormat=None, 
timestampFormat=None):
-"""Saves the content of the :class:`DataFrame` in JSON format at 
the specified path.
+"""Saves the content of the :class:`DataFrame` in JSON format
+(`JSON Lines text format or newline-delimited JSON <[http://jsonlines.org/>`_) at the
--- End diff --

**Before**
![2017-04-11 10 02 21](https://cloud.githubusercontent.com/assets/6477701/24892210/c53d6f9e-1eb5-11e7-9360-7fc172089ae4.png)

**After**

![2017-04-11 12 49 38](https://cloud.githubusercontent.com/assets/6477701/24892184/8d72b5e2-1eb5-11e7-8f34-c6edc562c37f.png)

Note that this is not consistent with Scala/Java ones:

![2017-04-11 12 50 13](https://cloud.githubusercontent.com/assets/6477701/24892182/8c0e0080-1eb5-11e7-847b-3df347b3e5c1.png)





[GitHub] spark issue #17602: [MINOR][DOCS] JSON APIs related documentation fixes

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17602
  
**[Test build #75690 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75690/testReport)** for PR 17602 at commit [`82aadaa`](https://github.com/apache/spark/commit/82aadaa7fee5a5db6cbbfe0e5d1a11ddebb14c6b).





[GitHub] spark pull request #17602: [MINOR][DOCS] JSON APIs related documentation fix...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17602#discussion_r110811551
  
--- Diff: docs/sql-programming-guide.md ---
@@ -883,7 +883,7 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 
 
 Spark SQL can automatically infer the schema of a JSON dataset and load it 
as a `Dataset[Row]`.
-This conversion can be done using `SparkSession.read.json()` on either an 
RDD of String,
+This conversion can be done using `SparkSession.read.json()` on either an 
`Dataset[String]`,
--- End diff --

Scala example uses `Dataset` as below:

![2017-04-11 10 21 12](https://cloud.githubusercontent.com/assets/6477701/24892046/c9c5dfac-1eb4-11e7-938c-fe6be4ef8b39.png)






[GitHub] spark pull request #17596: [SPARK-12837][SQL] reduce the serialized size of ...

2017-04-10 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17596#discussion_r110811597
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/InternalLongAccumulator.scala ---
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+/**
+ * A simpler version of [[LongAccumulator]], which doesn't track the value 
count and only for
+ * internal usage.
--- End diff --

This may be useful for Spark applications too. It could be made non-internal.





[GitHub] spark pull request #17602: [MINOR][DOCS] JSON APIs related documentation fix...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17602#discussion_r110811091
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -268,8 +268,8 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
   }
 
   /**
-   * Loads a JSON file (<a href="http://jsonlines.org/">JSON Lines text format or
-   * newline-delimited JSON</a>) and returns the result as a `DataFrame`.
+   * Loads a JSON file and returns the results as a `DataFrame`.
+   *
--- End diff --

This de-duplicates the documentation, as it points out the overloaded `json()` below.

**Before**

![2017-04-11 10 33 18](https://cloud.githubusercontent.com/assets/6477701/24892234/ff72e70c-1eb5-11e7-9096-dc29f2ed6a4d.png)

**After**

![2017-04-11 12 36 03](https://cloud.githubusercontent.com/assets/6477701/24892237/0215a68e-1eb6-11e7-813d-e1451d542655.png)






[GitHub] spark pull request #17602: [MINOR][DOCS] JSON APIs related documentation fix...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17602#discussion_r110810961
  
--- Diff: python/pyspark/sql/streaming.py ---
@@ -405,8 +405,8 @@ def json(self, path, schema=None, 
primitivesAsString=None, prefersDecimal=None,
 """
 Loads a JSON file stream and returns the results as a 
:class:`DataFrame`.
 
-`JSON Lines <http://jsonlines.org/>`_(newline-delimited JSON) is supported by default.
-For JSON (one record per file), set the `wholeFile` parameter to ``true``.
+`JSON Lines <http://jsonlines.org/>`_ (newline-delimited JSON) is supported by default.
+For JSON (one record per file), set the ``wholeFile`` parameter to ``true``.
--- End diff --

**Before**

![2017-04-11 10 10 08](https://cloud.githubusercontent.com/assets/6477701/24892218/d3b2cbf0-1eb5-11e7-8f0d-8071e7c65832.png)

**After**

![2017-04-11 10 11 46](https://cloud.githubusercontent.com/assets/6477701/24892223/dea9f240-1eb5-11e7-9137-c74960d2bf6d.png)
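For reference, the JSON Lines format these docstrings link to is simply one complete JSON document per line. A minimal plain-Python sketch (standard library only, not the PySpark reader) shows the shape of the data:

```python
import io
import json

# Newline-delimited JSON: each line is a self-contained JSON object.
jsonl = io.StringIO('{"name": "Alice", "age": 30}\n{"name": "Bob", "age": 25}\n')

records = [json.loads(line) for line in jsonl if line.strip()]
assert records[0]["name"] == "Alice"
assert records[1]["age"] == 25
```

This line-per-record structure is why the readers default to it: files can be split and parsed line by line without a full-document parse.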






[GitHub] spark pull request #17602: [MINOR][DOCS] JSON APIs related documentation fix...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17602#discussion_r110810827
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -634,7 +634,9 @@ def saveAsTable(self, name, format=None, mode=None, 
partitionBy=None, **options)
 
 @since(1.4)
 def json(self, path, mode=None, compression=None, dateFormat=None, 
timestampFormat=None):
-"""Saves the content of the :class:`DataFrame` in JSON format at 
the specified path.
+"""Saves the content of the :class:`DataFrame` in JSON format
+(`JSON Lines text format or newline-delimited JSON 
<[http://jsonlines.org/>`_) at the
+specified path.
--- End diff --

**Before**

![2017-04-11 10 11 46](https://cloud.githubusercontent.com/assets/6477701/24892138/3c52f686-1eb5-11e7-8aae-c698c762bb8b.png)

**After**

![2017-04-11 12 49 38](https://cloud.githubusercontent.com/assets/6477701/24892184/8d72b5e2-1eb5-11e7-8f34-c6edc562c37f.png)

Note that this is not consistent with Scala/Java ones:

![2017-04-11 12 50 13](https://cloud.githubusercontent.com/assets/6477701/24892182/8c0e0080-1eb5-11e7-847b-3df347b3e5c1.png)





[GitHub] spark pull request #17596: [SPARK-12837][SQL] reduce the serialized size of ...

2017-04-10 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17596#discussion_r110810746
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/InternalLongAccumulator.scala ---
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+/**
+ * A simpler version of [[LongAccumulator]], which doesn't track the value 
count and only for
+ * internal usage.
+ */
+private[spark] class InternalLongAccumulator extends AccumulatorV2[Long, 
Long] {
+  private[spark] var _value = 0L
--- End diff --

Original name `sum` seems better.





[GitHub] spark pull request #17602: [MINOR][DOCS] JSON APIs related documentation fix...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17602#discussion_r110810554
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -173,8 +173,8 @@ def json(self, path, schema=None, 
primitivesAsString=None, prefersDecimal=None,
 """
 Loads JSON files and returns the results as a :class:`DataFrame`.
 
-`JSON Lines <http://jsonlines.org/>`_(newline-delimited JSON) is supported by default.
-For JSON (one record per file), set the `wholeFile` parameter to ``true``.
+`JSON Lines <http://jsonlines.org/>`_ (newline-delimited JSON) is supported by default.
+For JSON (one record per file), set the ``wholeFile`` parameter to ``true``.
--- End diff --

**Before**

![2017-04-11 10 10 08](https://cloud.githubusercontent.com/assets/6477701/24892123/215f27fa-1eb5-11e7-8587-3c873ce4a895.png)

**After**

![2017-04-11 10 06 33](https://cloud.githubusercontent.com/assets/6477701/24892110/1587d06c-1eb5-11e7-8f7c-1aca568713cc.png)






[GitHub] spark issue #17602: [MINOR][DOCS] JSON APIs related documentation fixes

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17602
  
**[Test build #75689 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75689/testReport)** for PR 17602 at commit [`fd64e49`](https://github.com/apache/spark/commit/fd64e49cf715ca8a5e04321415adacdb955dad5a).





[GitHub] spark pull request #17602: [MINOR][DOCS] JSON APIs related documentation fix...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17602#discussion_r110810413
  
--- Diff: docs/sql-programming-guide.md ---
@@ -897,7 +897,7 @@ For a regular multi-line JSON file, set the `wholeFile` 
option to `true`.
 
 
 Spark SQL can automatically infer the schema of a JSON dataset and load it 
as a `Dataset`.
-This conversion can be done using `SparkSession.read().json()` on either 
an RDD of String,
+This conversion can be done using `SparkSession.read().json()` on either 
an Dataset of String,
--- End diff --

Java example uses `Dataset` as below:

![2017-04-11 10 21 
18](https://cloud.githubusercontent.com/assets/6477701/24892067/e34b0538-1eb4-11e7-96cf-933388bc3937.png)






[GitHub] spark pull request #17602: [MINOR][DOCS] JSON APIs related documentation fix...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17602#discussion_r110810364
  
--- Diff: docs/sql-programming-guide.md ---
@@ -883,7 +883,7 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 
 
 Spark SQL can automatically infer the schema of a JSON dataset and load it 
as a `Dataset[Row]`.
-This conversion can be done using `SparkSession.read.json()` on either an 
RDD of String,
+This conversion can be done using `SparkSession.read.json()` on either an 
Dataset of String,
--- End diff --

Scala example uses `Dataset` as below:

![2017-04-11 10 21 
12](https://cloud.githubusercontent.com/assets/6477701/24892046/c9c5dfac-1eb4-11e7-938c-fe6be4ef8b39.png)






[GitHub] spark pull request #17602: [MINOR][DOCS] JSON APIs related documentation fix...

2017-04-10 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17602

[MINOR][DOCS] JSON APIs related documentation fixes

## What changes were proposed in this pull request?

This PR proposes corrections related to the JSON APIs, including rendering 
links in the Python documentation, replacing `RDD` with `Dataset` in the programming 
guide, adding the missing description about JSON Lines consistently to 
`DataFrameReader.json` in the Python API, and de-duplicating a bit of 
`DataFrameReader.json` in the Scala/Java API.

## How was this patch tested?

Manually built the documentation via `jekyll build`. Corresponding 
snapshots will be left on the code.

Note that currently there are Javadoc8 breaks in several places. These are 
proposed to be handled in https://github.com/apache/spark/pull/17477. So, this 
PR does not fix those.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark minor-json-documentation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17602.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17602


commit fd64e49cf715ca8a5e04321415adacdb955dad5a
Author: hyukjinkwon 
Date:   2017-04-11T03:37:01Z

JSON related documentation fixes







[GitHub] spark pull request #17599: [SPARK-17564][Tests]Fix flaky RequestTimeoutInteg...

2017-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17599





[GitHub] spark issue #17491: [SPARK-20175][SQL] Exists should not be evaluated in Joi...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17491
  
**[Test build #75687 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75687/testReport)**
 for PR 17491 at commit 
[`329f067`](https://github.com/apache/spark/commit/329f067d1a38469675367f1e29330034f6d923e8).





[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistices to improve...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16677
  
**[Test build #75688 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75688/testReport)**
 for PR 16677 at commit 
[`b8a2275`](https://github.com/apache/spark/commit/b8a22755bfdef8f1ab78016aea6914155ada67c1).





[GitHub] spark issue #17599: [SPARK-17564][Tests]Fix flaky RequestTimeoutIntegrationS...

2017-04-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17599
  
Merging in master/branch-2.1.






[GitHub] spark issue #17491: [SPARK-20175][SQL] Exists should not be evaluated in Joi...

2017-04-10 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17491
  
retest this please.





[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16781
  
**[Test build #75686 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75686/testReport)**
 for PR 16781 at commit 
[`75e8579`](https://github.com/apache/spark/commit/75e8579551c2f1ae7ae958853754fbbc5a589dd4).





[GitHub] spark issue #17491: [SPARK-20175][SQL] Exists should not be evaluated in Joi...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17491
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75683/
Test FAILed.





[GitHub] spark issue #17491: [SPARK-20175][SQL] Exists should not be evaluated in Joi...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17491
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17491: [SPARK-20175][SQL] Exists should not be evaluated in Joi...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17491
  
**[Test build #75683 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75683/testReport)**
 for PR 17491 at commit 
[`329f067`](https://github.com/apache/spark/commit/329f067d1a38469675367f1e29330034f6d923e8).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17581: [SPARK-20248][ SQL]Spark SQL add limit parameter to enha...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17581
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75682/
Test PASSed.





[GitHub] spark issue #17581: [SPARK-20248][ SQL]Spark SQL add limit parameter to enha...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17581
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17581: [SPARK-20248][ SQL]Spark SQL add limit parameter to enha...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17581
  
**[Test build #75682 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75682/testReport)**
 for PR 17581 at commit 
[`59a1b1a`](https://github.com/apache/spark/commit/59a1b1ae96809076916849c9a1d396dc7d40251d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17599: [SPARK-17564][Tests]Fix flaky RequestTimeoutIntegrationS...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17599
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17599: [SPARK-17564][Tests]Fix flaky RequestTimeoutIntegrationS...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17599
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75678/
Test PASSed.





[GitHub] spark issue #17582: [SPARK-20239][Core] Improve HistoryServer's ACL mechanis...

2017-04-10 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/17582
  
@tgravescs sorry for the confusion.

>if base URL's ACL (spark.acls.enable) is enabled but user A has no view 
permission. User "A" cannot see the app list but could still access details of 
it's own app.

There are actually two ACL lists here. One is controlled by 
`spark.acls.enabled`: if user "A" is not added to that list, then user "A" 
cannot see the app list (`<base-url>/api/v1/applications`). But if 
an app was run by user "A", then user "A" can still see the details of that app, 
like (`<base-url>/api/v1/applications/<app-id>/jobs`); this ACL is 
controlled by "spark.history.ui.acls.enabled", and user "A" is automatically in 
that ACL list (because the app was run by him).

> if ACLs of base URL (spark.acls.enable) is disabled. Then user "A" could 
see the summary of all the apps, even some apps didn't run by user "A", but can 
only access its own app's details.

If "spark.acls.enabled" is disabled, then the `SecurityFilter` does not take 
effect, so user "A" can access `<base-url>/api/v1/applications`, which 
means user "A" can see all the applications, even ones not run by him.

This `/api/v1/applications` endpoint does not touch 
`spark.history.ui.acls.enabled`.

> if ACLs of base URL (spark.acls.enable) is disabled, then user "A" could 
download any application's event log, even it is not run by user "A".

This is the same issue as above. 
`<base-url>/api/v1/applications/<app-id>/logs` is controlled only by 
"spark.acls.enable", not "spark.history.ui.acls.enable". So anyone can 
download any event log if "spark.acls.enable" is disabled.

So basically what I fixed is:

1. Disable the `spark.acls.enable` check, which means the `SecurityFilter` is 
not applied.
2. Use `spark.history.ui.acls.enable` to filter applications, application 
summaries, and application logs based on the users who ran the apps.

So the result of my PR is:

1. A history admin user can see/download/access any app.
2. A normal user can see/download/access only the apps run by him.

@vanzin your suggestion is to only disable ACLs on the listing. That 
definitely simplifies the fix, but IMO that "all or nothing" solution is not so 
ideal:

1. Any user could list all the apps, though he cannot access the details of 
apps not run by him. Given the sensitivity, isn't it better to not even show the 
apps not run by him?
2. Currently, if ACLs on the listing are disabled, anyone can download event 
logs, which in turn exposes a security hole to other users.

So IMO filtering based on users is better than an "all or nothing" solution. 
Also, it doesn't increase the code complexity much.









[GitHub] spark issue #17599: [SPARK-17564][Tests]Fix flaky RequestTimeoutIntegrationS...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17599
  
**[Test build #75678 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75678/testReport)**
 for PR 17599 at commit 
[`35e2116`](https://github.com/apache/spark/commit/35e2116082d5e821fa9a659996507d532c68675f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17556
  
@srowen Hi, I forgot the unit tests in Python and R. Where can I find documentation 
about setting up a development environment? Thanks.





[GitHub] spark issue #17600: [MINOR][SQL] Fix the @since tag when backporting SPARK-1...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17600
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75679/
Test PASSed.





[GitHub] spark issue #17600: [MINOR][SQL] Fix the @since tag when backporting SPARK-1...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17600
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17600: [MINOR][SQL] Fix the @since tag when backporting SPARK-1...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17600
  
**[Test build #75679 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75679/testReport)**
 for PR 17600 at commit 
[`61e2497`](https://github.com/apache/spark/commit/61e2497f36a9e38b06b512ae9552a24f26448a9e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17495: [SPARK-20172][Core] Add file permission check when listi...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17495
  
**[Test build #75684 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75684/testReport)**
 for PR 17495 at commit 
[`1d1440b`](https://github.com/apache/spark/commit/1d1440bcd956f0f80e299edea72bf54e1f0d6b0d).





[GitHub] spark issue #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "spark.m...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17436
  
**[Test build #75685 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75685/testReport)**
 for PR 17436 at commit 
[`adada35`](https://github.com/apache/spark/commit/adada35ecaaaff4d10bcfd329df5e0090c51f403).





[GitHub] spark pull request #17480: [SPARK-20079][Core][yarn] Re registration of AM h...

2017-04-10 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/17480#discussion_r110804952
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -249,7 +249,14 @@ private[spark] class ExecutorAllocationManager(
* yarn-client mode when AM re-registers after a failure.
*/
   def reset(): Unit = synchronized {
-initializing = true
+/**
+ * When some tasks need to be scheduled and initial executor = 0, 
resetting the initializing
+ * field may cause it to not be set to false in yarn.
+ * SPARK-20079: https://issues.apache.org/jira/browse/SPARK-20079
+ */
+if (maxNumExecutorsNeeded() == 0) {
+  initializing = true
--- End diff --

I think the purpose of "initializing" is to avoid an unnecessary executor 
ramp-down before the stage is submitted or the executors time out. For example, if the min 
executor number is 0 and the initial number is 10: if "initializing" is set to false, 
the executor number will ramp down to 0 immediately, and if a stage is submitted 
during this window, an unnecessary executor ramp-up is then required to meet that 
stage's requirement.

In the AM restart scenario, if we set "initializing" to false in `reset`, 
then we may also hit the situation mentioned above; I think that's 
possible.







[GitHub] spark issue #17601: [MINOR][SQL] Fix the @since tag when backporting SPARK-1...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17601
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75680/
Test PASSed.





[GitHub] spark pull request #17480: [SPARK-20079][Core][yarn] Re registration of AM h...

2017-04-10 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/17480#discussion_r110804557
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -249,7 +249,14 @@ private[spark] class ExecutorAllocationManager(
* yarn-client mode when AM re-registers after a failure.
*/
   def reset(): Unit = synchronized {
-initializing = true
+/**
+ * When some tasks need to be scheduled and initial executor = 0, 
resetting the initializing
+ * field may cause it to not be set to false in yarn.
+ * SPARK-20079: https://issues.apache.org/jira/browse/SPARK-20079
+ */
+if (maxNumExecutorsNeeded() == 0) {
+  initializing = true
--- End diff --

Shouldn't the following code have a similar effect?

```scala
numExecutorsTarget = initialNumExecutors // The default value is 0
numExecutorsToAdd = 1
```
The successive arguments passed to the `client.requestTotalExecutors` method are 
1, 2, 4, 8, 16...
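The exponential ramp-up mentioned above can be sketched as follows (plain Scala, no Spark dependencies). This simulates only the doubling of `numExecutorsToAdd` after each round; it is a simplified assumption about the behavior, not the actual `ExecutorAllocationManager` implementation.

```scala
object RampUpSketch {
  // Successive increments requested after a reset: numExecutorsToAdd starts
  // at 1 and doubles each round, yielding 1, 2, 4, 8, 16, ...
  def increments(rounds: Int): Seq[Int] = {
    var numExecutorsToAdd = 1
    (1 to rounds).map { _ =>
      val current = numExecutorsToAdd
      numExecutorsToAdd *= 2 // double the next request, as discussed above
      current
    }
  }
}
```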





[GitHub] spark issue #17601: [MINOR][SQL] Fix the @since tag when backporting SPARK-1...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17601
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17601: [MINOR][SQL] Fix the @since tag when backporting SPARK-1...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17601
  
**[Test build #75680 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75680/consoleFull)**
 for PR 17601 at commit 
[`d3aa8bd`](https://github.com/apache/spark/commit/d3aa8bddac7ac738cb3de2cee0eadf1a9740eaa6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17563: [SPARK-20005][WEB UI]fix 'There is no "Newline" in UI in...

2017-04-10 Thread guoxiaolongzte
Github user guoxiaolongzte commented on the issue:

https://github.com/apache/spark/pull/17563
  
Is nobody going to deal with this PR? @srowen





[GitHub] spark pull request #17480: [SPARK-20079][Core][yarn] Re registration of AM h...

2017-04-10 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17480#discussion_r110803588
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -249,7 +249,14 @@ private[spark] class ExecutorAllocationManager(
* yarn-client mode when AM re-registers after a failure.
*/
   def reset(): Unit = synchronized {
-initializing = true
+/**
+ * When some tasks need to be scheduled and initial executor = 0, 
resetting the initializing
+ * field may cause it to not be set to false in yarn.
+ * SPARK-20079: https://issues.apache.org/jira/browse/SPARK-20079
+ */
+if (maxNumExecutorsNeeded() == 0) {
+  initializing = true
--- End diff --

`updateAndSyncNumExecutorsTarget` is a weird method. It returns a value 
that is never used anywhere; the actual variables it sets internally are what 
matter...

But I still don't understand why, when the AM restarts, should 
`updateAndSyncNumExecutorsTarget` be a no-op except in this case. What is 
different about this case that makes it an exception? Shouldn't 
`updateAndSyncNumExecutorsTarget` be called instead from `reset()` or very soon 
after, so the code can update its internal state to match the current status of 
the app?

The thing I don't understand is why is it ever ok for 
`updateAndSyncNumExecutorsTarget` to just do nothing.





[GitHub] spark issue #17599: [SPARK-17564][Tests]Fix flaky RequestTimeoutIntegrationS...

2017-04-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17599
  
LGTM pending Jenkins.





[GitHub] spark pull request #17480: [SPARK-20079][Core][yarn] Re registration of AM h...

2017-04-10 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/17480#discussion_r110802758
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -249,7 +249,14 @@ private[spark] class ExecutorAllocationManager(
* yarn-client mode when AM re-registers after a failure.
*/
   def reset(): Unit = synchronized {
-initializing = true
+/**
+ * When some tasks need to be scheduled and initial executor = 0, 
resetting the initializing
+ * field may cause it to not be set to false in yarn.
+ * SPARK-20079: https://issues.apache.org/jira/browse/SPARK-20079
+ */
+if (maxNumExecutorsNeeded() == 0) {
+  initializing = true
--- End diff --

@vanzin sorry, I don't think I explained it well.

If the `initializing` flag is set to false during initialization, 
`updateAndSyncNumExecutorsTarget` will recalculate the required executor number 
and ramp down the executors when there is no job running at that time. Then, 
when the first job is submitted, it still has to ramp the executors back up to 
meet the demand.

The AM restart scenario is, I think, similar to initialization. The one 
exception is the scenario mentioned here, where we should ramp up quickly to 
meet the demand.





[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17540
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75681/
Test FAILed.





[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17540
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17540
  
**[Test build #75681 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75681/testReport)**
 for PR 17540 at commit 
[`7910825`](https://github.com/apache/spark/commit/7910825ce7e0cc7d8ba1eb9f6e534b087815f294).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17480: [SPARK-20079][Core][yarn] Re registration of AM h...

2017-04-10 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17480#discussion_r110801458
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -249,7 +249,14 @@ private[spark] class ExecutorAllocationManager(
* yarn-client mode when AM re-registers after a failure.
*/
   def reset(): Unit = synchronized {
-initializing = true
+/**
+ * When some tasks need to be scheduled and initial executor = 0, 
resetting the initializing
+ * field may cause it to not be set to false in yarn.
+ * SPARK-20079: https://issues.apache.org/jira/browse/SPARK-20079
+ */
+if (maxNumExecutorsNeeded() == 0) {
+  initializing = true
--- End diff --

Sorry but that doesn't really explain much. Why is it bad to ramp up 
quickly? At which point are things not "initializing" anymore?

Isn't the AM restarting the definition of "I should ramp up quickly because 
I might be in the middle of a big job being run"?





[GitHub] spark issue #17480: [SPARK-20079][Core][yarn] Re registration of AM hangs sp...

2017-04-10 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/17480
  
Sorry but that doesn't really explain much. Why is it bad to ramp up 
quickly? At which point are things not "initializing" anymore?

Isn't the AM restarting the definition of "I should ramp up quickly because 
I might be in the middle of a big job being run"?





[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...

2017-04-10 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17436#discussion_r110801228
  
--- Diff: core/src/main/java/org/apache/spark/memory/MemoryConsumer.java ---
@@ -41,7 +41,7 @@ protected MemoryConsumer(TaskMemoryManager 
taskMemoryManager, long pageSize, Mem
   }
 
   protected MemoryConsumer(TaskMemoryManager taskMemoryManager) {
--- End diff --

Yes, you are right.





[GitHub] spark issue #17491: [SPARK-20175][SQL] Exists should not be evaluated in Joi...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17491
  
**[Test build #75683 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75683/testReport)**
 for PR 17491 at commit 
[`329f067`](https://github.com/apache/spark/commit/329f067d1a38469675367f1e29330034f6d923e8).





[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17469#discussion_r110800502
  
--- Diff: python/pyspark/sql/column.py ---
@@ -303,8 +342,27 @@ def isin(self, *cols):
 desc = _unary_op("desc", "Returns a sort expression based on the"
  " descending order of the given column name.")
 
-isNull = _unary_op("isNull", "True if the current expression is null.")
-isNotNull = _unary_op("isNotNull", "True if the current expression is 
not null.")
+_isNull_doc = """
+True if the current expression is null. Often combined with
+:func:`DataFrame.filter` to select rows with null values.
+
+>>> df2.collect()
--- End diff --

Also, it seems it can't find `df2` here in the doctest, just as the 
error message says.
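For readers following along: a doctest can only resolve names that already exist in its globals, which is why an example referencing `df2` fails on CI unless `df2` is injected there (the way PySpark's `_test()` helper injects `df`). A minimal stdlib-only sketch — the string stand-in for `df2` is hypothetical:

```python
import doctest

# A doctest source that references an external name, mirroring the failing example.
src = ">>> df2\n'stub'\n"
parser = doctest.DocTestParser()

def run_with(globs):
    # Build and run the doctest against the given globals; return the failure count.
    test = parser.get_doctest(src, globs, "example", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    runner.run(test, out=lambda s: None)  # silence the failure report
    return runner.failures

# Without `df2` in the globals the example raises NameError -> one failure;
# with it injected, the example passes.
print(run_with({}), run_with({"df2": "stub"}))  # 1 0
```

This is why adding the doctest alone is not enough; the fixture it references has to be wired into the doctest globals as well.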





[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17469#discussion_r110800343
  
--- Diff: python/pyspark/sql/column.py ---
@@ -250,11 +250,50 @@ def __iter__(self):
 raise TypeError("Column is not iterable")
 
 # string methods
+_rlike_doc = """
+Return a Boolean :class:`Column` based on a regex match.
+
+:param other: an extended regex expression
+
+>>> df.filter(df.name.rlike('ice$')).collect()
+[Row(name=u'Alice', age=1)]
--- End diff --

It sounds like the test failure is here. If you click the link, the full logs 
can be checked (e.g., 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75622/console).

To my knowledge, Row in Python sorts the field names, so I 
guess it should be `[Row(age=1, name=u'Alice')]`. 
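A minimal stdlib sketch of the behavior described above — `make_row` here is a hypothetical stand-in for `pyspark.sql.Row`, which at the time sorted keyword-argument field names on construction:

```python
from collections import namedtuple

def make_row(**kwargs):
    # Sort field names alphabetically, as legacy pyspark.sql.Row did for
    # keyword arguments, so the display order ignores call-site order.
    fields = sorted(kwargs)
    return namedtuple("Row", fields)(*(kwargs[f] for f in fields))

# Constructed as name=..., age=..., but rendered with fields sorted:
print(make_row(name="Alice", age=1))  # Row(age=1, name='Alice')
```

So a doctest written as `[Row(name=u'Alice', age=1)]` will mismatch the actual repr even though the data is identical.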





[GitHub] spark pull request #17495: [SPARK-20172][Core] Add file permission check whe...

2017-04-10 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/17495#discussion_r110799889
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
 ---
@@ -571,6 +572,34 @@ class FsHistoryProviderSuite extends SparkFunSuite 
with BeforeAndAfter with Matc
 }
  }
 
+  test("log without read permission should be filtered out before actual 
reading") {
+class TestFsHistoryProvider extends 
FsHistoryProvider(createTestConf()) {
--- End diff --

The difference is that this unit test checks whether the file is 
filtered out during the permission check, while the SPARK-3697 UT only checks the 
final result, so the file could also have been filtered out during read. Let me 
merge these two UTs.





[GitHub] spark issue #17524: [SPARK-19235] [SQL] [TEST] [FOLLOW-UP] Enable Test Cases...

2017-04-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17524
  
retest this please






[GitHub] spark issue #17593: [SPARK-20279][WEB-UI]In web ui,'Only showing 200' should...

2017-04-10 Thread guoxiaolongzte
Github user guoxiaolongzte commented on the issue:

https://github.com/apache/spark/pull/17593
  
Can sorting change this? I do not think so.
Even if sorted, only the last 200 are shown, which does not contradict the 
issue I raised.
The last 200 are conceptually a single batch of data.





[GitHub] spark pull request #17480: [SPARK-20079][Core][yarn] Re registration of AM h...

2017-04-10 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/17480#discussion_r110797585
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -249,7 +249,14 @@ private[spark] class ExecutorAllocationManager(
* yarn-client mode when AM re-registers after a failure.
*/
   def reset(): Unit = synchronized {
-initializing = true
+/**
+ * When some tasks need to be scheduled and initial executor = 0, 
resetting the initializing
+ * field may cause it to not be set to false in yarn.
+ * SPARK-20079: https://issues.apache.org/jira/browse/SPARK-20079
+ */
+if (maxNumExecutorsNeeded() == 0) {
+  initializing = true
--- End diff --

In the original design of dynamic executor allocation, this flag is set to 
`true` to avoid a sudden executor ramp-up (caused by the first job submission) 
during initialization. You can check the comment.

As for the AM restart scenario: because all the executors will be re-spawned, 
it is similar to the AM first-start scenario (if a job is submitted during the 
restart), so we set this flag to true.
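For readers following the thread, here is a minimal Python restatement of the guarded `reset()` being proposed; the class and field names are hypothetical simplifications of `ExecutorAllocationManager`, not the actual Scala API:

```python
class AllocationManagerSketch:
    """Toy model: only the `initializing` flag and a pending-work count."""

    def __init__(self):
        self.initializing = True
        self.pending_tasks = 0

    def max_num_executors_needed(self):
        return self.pending_tasks

    def reset(self):
        # SPARK-20079 proposal: re-enter "initializing" only when no work is
        # queued, so a restarted AM with queued tasks can still ramp up.
        if self.max_num_executors_needed() == 0:
            self.initializing = True

m = AllocationManagerSketch()
m.initializing = False
m.pending_tasks = 3
m.reset()
print(m.initializing)  # False: queued work keeps ramp-up enabled
```

The unguarded original unconditionally set `initializing = True` on reset, which is exactly the case the sketch's `pending_tasks = 3` branch avoids.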





[GitHub] spark pull request #17480: [SPARK-20079][Core][yarn] Re registration of AM h...

2017-04-10 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/17480#discussion_r110796578
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -249,7 +249,14 @@ private[spark] class ExecutorAllocationManager(
* yarn-client mode when AM re-registers after a failure.
*/
   def reset(): Unit = synchronized {
-initializing = true
+/**
+ * When some tasks need to be scheduled and initial executor = 0, 
resetting the initializing
+ * field may cause it to not be set to false in yarn.
+ * SPARK-20079: https://issues.apache.org/jira/browse/SPARK-20079
+ */
+if (maxNumExecutorsNeeded() == 0) {
+  initializing = true
--- End diff --

@jerryshao Can you explain the following comment? I do not understand it.
```scala
 if (initializing) {
  // Do not change our target while we are still initializing,
  // Otherwise the first job may have to ramp up unnecessarily
  0
} else if (maxNeeded < numExecutorsTarget) {
```
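For readers following the thread, the quoted branch can be restated in Python; the function and parameter names below are hypothetical simplifications of `updateAndSyncNumExecutorsTarget`, which in the real code returns a delta against the current target:

```python
def next_target_delta(initializing, max_needed, current_target):
    # While "initializing", request no change so the first job does not
    # force an unnecessary ramp-up; otherwise ramp down toward demand.
    if initializing:
        return 0  # delta of zero: leave the current target untouched
    elif max_needed < current_target:
        return max_needed - current_target  # negative delta: ramp down
    else:
        return 0  # simplified: the real ramp-up path is omitted here

print(next_target_delta(True, 5, 1), next_target_delta(False, 2, 6))  # 0 -4
```

The question in the thread is precisely when the `initializing` short-circuit (the first branch) is appropriate after an AM restart.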





[GitHub] spark issue #17581: [SPARK-20248][ SQL]Spark SQL add limit parameter to enha...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17581
  
**[Test build #75682 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75682/testReport)**
 for PR 17581 at commit 
[`59a1b1a`](https://github.com/apache/spark/commit/59a1b1ae96809076916849c9a1d396dc7d40251d).





[GitHub] spark issue #17581: [SPARK-20248][ SQL]Spark SQL add limit parameter to enha...

2017-04-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17581
  
test this please





[GitHub] spark pull request #17581: [SPARK-20248][ SQL]Spark SQL add limit parameter ...

2017-04-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17581#discussion_r110795341
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -359,6 +359,15 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val THRIFTSERVER_RESULT_LIMIT =
+buildConf("spark.sql.thriftserver.retainedResults")
+  .internal()
+  .doc("The number of sql results returned by Thrift Server when 
running a query " +
+"without a limit, and when a query with a limit or this is set to 
0, " +
+"we don't change user's behavior." )
+  .intConf
--- End diff --

Please add `checkValue`
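A hedged sketch of what adding `checkValue` buys — `build_int_conf` and its arguments are hypothetical Python stand-ins for SQLConf's `intConf.checkValue(...)` chain: an invalid value is rejected when the conf is defined, not later when it is read.

```python
def build_int_conf(key, default, check, message):
    # Validate eagerly, checkValue-style, so a bad value fails fast with a
    # clear message instead of surfacing deep inside query execution.
    if not isinstance(default, int) or not check(default):
        raise ValueError(f"{key}: {message}")
    return {key: default}

conf = build_int_conf("spark.sql.thriftserver.retainedResults", 0,
                      lambda v: v >= 0, "must be a non-negative int")
print(conf)  # {'spark.sql.thriftserver.retainedResults': 0}
```

For this particular conf, a `v >= 0` check would rule out negative limits while still allowing the documented "0 means unchanged behavior" sentinel.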





[GitHub] spark issue #17469: [SPARK-20132][Docs] Add documentation for column string ...

2017-04-10 Thread map222
Github user map222 commented on the issue:

https://github.com/apache/spark/pull/17469
  
@HyukjinKwon The Jenkins test failed. I'm having trouble running the tests 
locally (I can't build Spark yet), and I can't decipher the Jenkins error 
messages. Does something jump out to you?





[GitHub] spark issue #17581: [SPARK-20248][ SQL]Spark SQL add limit parameter to enha...

2017-04-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17581
  
ok to test






[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17540
  
**[Test build #75681 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75681/testReport)**
 for PR 17540 at commit 
[`7910825`](https://github.com/apache/spark/commit/7910825ce7e0cc7d8ba1eb9f6e534b087815f294).





[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17077
  
(I think we need @holdenk's sign-off and further review.)





[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r110794303
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
 df = self.spark.createDataFrame(data, schema=schema)
 df.collect()
 
+def test_bucketed_write(self):
+data = [
+(1, "foo", 3.0), (2, "foo", 5.0),
+(3, "bar", -1.0), (4, "bar", 6.0),
+]
+df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+# Test write with one bucketing column
+df.write.bucketBy(3, 
"x").mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(
+len([c for c in 
self.spark.catalog.listColumns("pyspark_bucket")
+ if c.name == "x" and c.isBucket]),
+1
+)
+self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect()))
+
+# Test write two bucketing columns
+df.write.bucketBy(3, "x", 
"y").mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(
+len([c for c in 
self.spark.catalog.listColumns("pyspark_bucket")
+ if c.name in ("x", "y") and c.isBucket]),
--- End diff --

Thank you for taking my opinion into account. Yes, we should remove or 
change the version. I meant to follow the rest of the contents.

To my knowledge, the contents of the documentation have generally been kept in 
sync across the APIs in different languages. I don't think this is a hard 
requirement, but I think it is safer, to avoid future blame and confusion for 
the users.

I have seen several minor PRs fixing documentation (e.g., typos) that then had 
to be fixed identically for the same APIs in other languages, and I have also 
made some PRs to match the documentation, e.g., 
https://github.com/apache/spark/pull/17429





[GitHub] spark pull request #17495: [SPARK-20172][Core] Add file permission check whe...

2017-04-10 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/17495#discussion_r110794078
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -320,14 +321,15 @@ private[history] class FsHistoryProvider(conf: 
SparkConf, clock: Clock)
 .filter { entry =>
   try {
 val prevFileSize = 
fileToAppInfo.get(entry.getPath()).map{_.fileSize}.getOrElse(0L)
+fs.access(entry.getPath, FsAction.READ)
--- End diff --

I see, let me change it. Thanks.
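The idea in the diff — filter out unreadable event logs while listing, instead of failing later during read — can be sketched locally with `os.access`, used here as a stand-in for the HDFS-side `fs.access(path, FsAction.READ)` call:

```python
import os
import tempfile

def readable_only(paths):
    # Drop entries the current user cannot read, so the later parsing step
    # never hits an AccessControlException-style failure mid-scan.
    return [p for p in paths if os.access(p, os.R_OK)]

with tempfile.NamedTemporaryFile() as f:
    # Only the real, readable file survives the listing filter;
    # the missing/unreadable path is silently dropped.
    kept = readable_only([f.name, "/no/such/eventlog"])
```

Note that `os.access` (like `fs.access`) is a check-then-act pattern: permissions can still change between the check and the read, so the read path should stay defensive regardless.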





[GitHub] spark issue #17600: [MINOR][SQL] Fix the @since tag when backporting critica...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17600
  
**[Test build #75679 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75679/testReport)**
 for PR 17600 at commit 
[`61e2497`](https://github.com/apache/spark/commit/61e2497f36a9e38b06b512ae9552a24f26448a9e).





[GitHub] spark issue #17601: [MINOR][SQL] Fix the @since tag when backporting critica...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17601
  
**[Test build #75680 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75680/consoleFull)**
 for PR 17601 at commit 
[`d3aa8bd`](https://github.com/apache/spark/commit/d3aa8bddac7ac738cb3de2cee0eadf1a9740eaa6).





[GitHub] spark pull request #17601: [MINOR][SQL] Fix the @since tag when backporting ...

2017-04-10 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/17601

[MINOR][SQL] Fix the @since tag when backporting critical bugs from 2.2 
branch into 2.0 branch

## What changes were proposed in this pull request?

Fix the @since tag when backporting critical bugs from 2.2 branch into 2.0 
branch.

## How was this patch tested?

N/A

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17601.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17601


commit d3aa8bddac7ac738cb3de2cee0eadf1a9740eaa6
Author: DB Tsai 
Date:   2017-04-11T00:29:33Z

fix since tag version







[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17540
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17540
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75677/
Test FAILed.





[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17540
  
**[Test build #75677 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75677/testReport)**
 for PR 17540 at commit 
[`ce0dbe7`](https://github.com/apache/spark/commit/ce0dbe7c6080dc3cdcee30ac404222ad3480b2ca).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r110792594
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2167,6 +2167,56 @@ def test_BinaryType_serialization(self):
 df = self.spark.createDataFrame(data, schema=schema)
 df.collect()
 
+def test_bucketed_write(self):
+data = [
+(1, "foo", 3.0), (2, "foo", 5.0),
+(3, "bar", -1.0), (4, "bar", 6.0),
+]
+df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+def count_bucketed_cols(names, table="pyspark_bucket"):
+"""Given a sequence of column names and a table name,
+query the catalog and return the number of columns which
+are used for bucketing
+"""
+cols = self.spark.catalog.listColumns(table)
+num = len([c for c in cols if c.name in names and c.isBucket])
+return num
+
+# Test write with one bucketing column
+df.write.bucketBy(3, 
"x").mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(count_bucketed_cols(["x"]), 1)
+self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect()))
+
+# Test write with two bucketing columns
+df.write.bucketBy(3, "x", 
"y").mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(count_bucketed_cols(["x", "y"]), 2)
+self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect()))
+
+# Test write with bucket and sort
+df.write.bucketBy(2, 
"x").sortBy("z").mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(count_bucketed_cols(["x"]), 1)
+self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect()))
+
+# Test write with a list of columns
+df.write.bucketBy(3, ["x", 
"y"]).mode("overwrite").saveAsTable("pyspark_bucket")
+self.assertEqual(count_bucketed_cols(["x", "y"]), 2)
+self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect()))
+
+# Test write with bucket and sort with a list of columns
+(df.write.bucketBy(2, "x")
+.sortBy(["y", "z"])
+.mode("overwrite").saveAsTable("pyspark_bucket"))
+self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect()))
+
+# Test write with bucket and sort with multiple columns
+(df.write.bucketBy(2, "x")
+.sortBy("y", "z")
+.mode("overwrite").saveAsTable("pyspark_bucket"))
+self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect()))
+
+self.spark.sql("DROP TABLE IF EXISTS pyspark_bucket")
--- End diff --

Yea, I think this is a correct way to drop the table.





[GitHub] spark pull request #17600: [MINOR][SQL] Fix the @since tag when backporting ...

2017-04-10 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/17600

[MINOR][SQL] Fix the @since tag when backporting critical bugs from 2.2 
branch into 2.1 branch

## What changes were proposed in this pull request?

Fix the @since tag when backporting critical bugs from 2.2 branch into 2.1 
branch.

## How was this patch tested?

N/A

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark branch-2.1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17600.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17600


commit 61e2497f36a9e38b06b512ae9552a24f26448a9e
Author: DB Tsai 
Date:   2017-04-11T00:21:23Z

update spark version







[GitHub] spark issue #17599: [SPARK-17564][Tests]Fix flaky RequestTimeoutIntegrationS...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17599
  
**[Test build #75678 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75678/testReport)**
 for PR 17599 at commit 
[`35e2116`](https://github.com/apache/spark/commit/35e2116082d5e821fa9a659996507d532c68675f).





[GitHub] spark issue #17599: [SPARK-17564][Tests]Fix flaky RequestTimeoutIntegrationS...

2017-04-10 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/17599
  
cc @rxin





[GitHub] spark issue #17596: [SPARK-12837][SQL] reduce the serialized size of accumul...

2017-04-10 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/17596
  
The approach I took for this was slightly different:
* Create a bitmask indicating which accumulators are required in 
TaskMetrics - that is, have non-zero values - and emit this first.
* Instead of relying on default serialization, do custom 
serialization for all internal accumulators - directly emit the longs (based 
on the bitmask for writing/reading).
* Encode longs/ints so that they take less than 8/4 bytes (currently this 
code sits inside graphx iirc - essentially the same code as kryo's 
optimizePositive).

For 1.6, this brought down the size from about 1.6k on average to 200+ 
bytes.
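The bitmask-plus-varint scheme described above can be sketched as follows (a minimal Python sketch, not the actual Spark/Kryo code; all function names here are made up for illustration):

```python
def encode_varint(value):
    """Encode a non-negative integer using 7 bits per byte, with the high bit
    as a continuation flag (the same idea as Kryo's optimizePositive varints)."""
    out = bytearray()
    while True:
        b = value & 0x7F
        value >>= 7
        if value:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)


def decode_varint(buf, pos=0):
    """Decode one varint from buf starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, pos
        shift += 7


def serialize_counters(values):
    """Emit a bitmask of the non-zero slots first, then varints for those slots only."""
    bitmask = sum(1 << i for i, v in enumerate(values) if v)
    out = encode_varint(bitmask)
    for v in values:
        if v:
            out += encode_varint(v)
    return out


def deserialize_counters(buf, num_slots):
    """Rebuild the fixed-order counter list, filling the skipped slots with zero."""
    bitmask, pos = decode_varint(buf)
    values = []
    for i in range(num_slots):
        if bitmask & (1 << i):
            v, pos = decode_varint(buf, pos)
            values.append(v)
        else:
            values.append(0)
    return values
```

Most per-task counters are zero, so skipping them entirely, plus the shorter encoding of the small non-zero values, is where a reduction on the order of 1.6 kB down to a couple hundred bytes would come from.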





[GitHub] spark pull request #17599: [SPARK-17564][Tests]Fix flaky RequestTimeoutInteg...

2017-04-10 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/17599

[SPARK-17564][Tests]Fix flaky 
RequestTimeoutIntegrationSuite.furtherRequestsDelay

## What changes were proposed in this pull request?

This PR fixes the following failure:
```
sbt.ForkMain$ForkError: java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.spark.network.RequestTimeoutIntegrationSuite.furtherRequestsDelay(RequestTimeoutIntegrationSuite.java:230)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
at com.novocode.junit.JUnitRunner$1.execute(JUnitRunner.java:132)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```

It happens several times per month on 
[Jenkins](http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.network.RequestTimeoutIntegrationSuite_name=furtherRequestsDelay).
 The failure is because `callback1` may not be called before 
`assertTrue(callback1.failure instanceof IOException);`. It's pretty easy to 
reproduce this error by adding a sleep before this line: 
https://github.com/apache/spark/blob/379b0b0bbdbba2278ce3bcf471bd75f6ffd9cf0d/common/network-common/src/test/java/org/apache/spark/network/RequestTimeoutIntegrationSuite.java#L267

The fix is straightforward: just use the latch to wait until `callback1` is 
called. 
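The race and its latch-based fix can be sketched in Python (a hypothetical analogue of the Java CountDownLatch change; the class and method names are made up for illustration):

```python
import threading


class RecordingCallback:
    """Test callback that records the failure and signals when it has fired,
    so the test can wait for it instead of asserting before the callback runs."""

    def __init__(self):
        self.failure = None
        self._fired = threading.Event()

    def on_failure(self, exc):
        self.failure = exc
        self._fired.set()  # release anyone blocked in await_fired

    def await_fired(self, timeout_secs):
        """Return True once on_failure has been called, False on timeout."""
        return self._fired.wait(timeout_secs)


def deliver_failure_later(callback, delay_secs):
    """Simulate the transport layer failing the request on another thread."""
    t = threading.Timer(delay_secs, callback.on_failure,
                        args=(IOError("request timed out"),))
    t.start()
    return t
```

With this shape, the flaky `assertTrue(callback1.failure instanceof IOException)`-style check becomes safe because it is only reached after `await_fired` returns.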

## How was this patch tested?

Jenkins

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark SPARK-17564

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17599.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17599


commit 35e2116082d5e821fa9a659996507d532c68675f
Author: Shixiong Zhu 
Date:   2017-04-11T00:02:50Z

Fix flaky RequestTimeoutIntegrationSuite.furtherRequestsDelay




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at 

[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17540
  
**[Test build #75677 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75677/testReport)**
 for PR 17540 at commit 
[`ce0dbe7`](https://github.com/apache/spark/commit/ce0dbe7c6080dc3cdcee30ac404222ad3480b2ca).





[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17546
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75674/
Test PASSed.





[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17546
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17546
  
**[Test build #75674 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75674/testReport)**
 for PR 17546 at commit 
[`7a5d1d0`](https://github.com/apache/spark/commit/7a5d1d0615a98fbee3c58934f92314bb92a97354).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.




