[GitHub] spark issue #21007: [SPARK-23942][PYTHON][SQL] Makes collect in PySpark as a...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21007

What's wrong with the description and PR title, and what should be documented? Do you mean the first sentence, `This PR proposes to add collect to a query executor as an action.`, is wrong because it doesn't mention PySpark specifically? I think the PR title and the rest of the words explain what this PR changes.

It's already documented - https://github.com/apache/spark/blob/bd4eb9ce57da7bacff69d9ed958c94f349b7e6fb/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala#L44

Should we document that `collect` in PySpark is now recognised as a query execution action by a query execution listener?
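For context, a `QueryExecutionListener` is called back with the name of the action that completed. A minimal Scala sketch of registering one — the listener class and app name here are illustrative, not from this PR:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Logs which action name fired the callback and how long it ran.
class ActionLoggingListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"action '$funcName' succeeded in ${durationNs / 1e6} ms")
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"action '$funcName' failed: ${exception.getMessage}")
}

val spark = SparkSession.builder().appName("listener-demo").master("local[*]").getOrCreate()
spark.listenerManager.register(new ActionLoggingListener)
spark.range(10).collect()  // in Scala this reports funcName "collect"
```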
[GitHub] spark issue #20451: [SPARK-23146][WIP] Support client mode for Kubernetes cl...
Github user echarles commented on the issue: https://github.com/apache/spark/pull/20451

Now that #20910 has been merged, I will update this PR to take the refactoring into account. @inetfuture Once these changes are pushed, the review process still needs to happen, so it's difficult to give you a timeline.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

Yup, that should reduce some overhead like this. I would like to hear what you guys think. cc @srowenn, @vanzin, @felixcheung, @holdenk too.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

I do think we should clearly document the rules about what we can backport. I do not think we should make an exception for this PR. cc @rxin @marmbrus @yhuai @cloud-fan @ueshin
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

How about we formally document that in the guide? I have always put more importance on practice, and I personally think we are fine to make a backport if it's a bug and the fix is straightforward. IMHO, principle is the baseline, but we should put more importance on practice. Even if I take your words, I would then like to make this an exception, since it fixes actual use cases from our customers.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

If this can be treated as a bug to backport, we have many behavior-change PRs that could be backported. We are building system software. We have to be more principled.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

I agree that it's better to avoid a behaviour change, but this one is clearly a bug and the fix is straightforward. I am puzzled why this PR specifically prompted your concern. I wouldn't revert it unless there's a specific worry about this patch.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

This is just the basic backport rule we follow for each PR. We should not make an exception for this PR.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

I am a bit puzzled because `QueryExecutionListener` should call the callback for actions, and `collect` triggers it in Scala and R but not in PySpark specifically. It sounds like a bug, and the fix is relatively straightforward. The previous behaviour was that the callback was not called at all, which didn't make sense.

I agree that it's discouraged to make a behaviour change in a maintenance release, sure. However, I was thinking it makes sense to backport when the fix is not complicated and it quite clearly looks like a bug. I don't think we should call this an improvement in this case. Were actual apps or test cases broken somewhere?
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

Users' apps should not be blamed in this case. If they want this change, they should upgrade to a newer release. Basically, we should not introduce any external behavior change in a maintenance release if possible.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

I guess the behaviour change here is that a custom query execution listener can now recognise the action `collect` in PySpark, which the other language APIs already detected. Mind explaining how that breaks external apps? If an app expects the callback not to be called for `collect` specifically, while it is called for other actions like `show` in PySpark, I would say the app is to blame.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

This will introduce a behavior change, and it is not a regression. The changes we made in this PR could break external apps. We should not do it in a maintenance release.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

Hm, I would say it's a bug, since an action that is supposed to trigger the callback is not detected. The test is a bit complicated, but the fix is relatively straightforward.
[GitHub] spark issue #21073: [SPARK-23936][SQL][WIP] Implement map_concat
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21073

**[Test build #89369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89369/testReport)** for PR 21073 at commit [`d04893b`](https://github.com/apache/spark/commit/d04893bccbd2772eb937125895519228588e74b4).
[GitHub] spark issue #21073: [SPARK-23936][SQL][WIP] Implement map_concat
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21073

Can one of the admins verify this patch?
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

Since this is not a bug fix, I plan to revert this PR. WDYT? @HyukjinKwon @BryanCutler
[GitHub] spark pull request #21073: [SPARK-23936][SQL][WIP] Implement map_concat
GitHub user bersprockets opened a pull request: https://github.com/apache/spark/pull/21073

[SPARK-23936][SQL][WIP] Implement map_concat

## What changes were proposed in this pull request?

Implement the map_concat higher-order function. This is a work in progress. There's no code generation yet.

My current implementation of MapConcat.checkInputDataTypes does not allow valueContainsNull to vary between the input maps. Not sure what the requirements are here.

I am using a java.util.Map implementation to merge all the maps together, since that is a very straightforward implementation. I chose java.util.LinkedHashMap because:
- It allows for predictable tuple order (essentially, the original insertion order) when building the result MapData. Tests that rely on tuple order, like pyspark's doctests, will work across Java versions.
- java.util.LinkedHashMap seems to be about as fast as java.util.HashMap, at least when used to concatenate big (500+ key/value) maps for 150k rows, and it's much faster than the Scala LinkedHashMap implementation.

## How was this patch tested?

New tests. Manual tests. Ran all sbt SQL tests. Ran all pyspark sql tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bersprockets/spark SPARK-23936

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21073.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21073

commit 707330cd88b269cb1bbee83b9b6476d05c8d177c
Author: Bruce Robbins
Date: 2018-04-14T23:52:37Z

    Initial commit

commit 84d696313972a237691eb46cad6a478167dbabee
Author: Bruce Robbins
Date: 2018-04-15T02:04:45Z

    Remove unused variable in test

commit d04893bccbd2772eb937125895519228588e74b4
Author: Bruce Robbins
Date: 2018-04-15T03:35:47Z

    Cleanup
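To make the design choice concrete, here is a minimal standalone Scala sketch of the merge approach described above — last-wins semantics on duplicate keys with insertion-ordered output. This is an illustration of the idea, not the PR's actual MapConcat code:

```
import java.util.{LinkedHashMap => JLinkedHashMap}
import scala.collection.JavaConverters._

// Merge several maps; later maps win on duplicate keys, and the result keeps
// first-insertion order, which is what java.util.LinkedHashMap guarantees.
def concatMaps[K, V](maps: Seq[Map[K, V]]): Seq[(K, V)] = {
  val merged = new JLinkedHashMap[K, V]()
  for (m <- maps; (k, v) <- m) merged.put(k, v)
  merged.asScala.toSeq
}

// concatMaps(Seq(Map(1 -> "a", 2 -> "b"), Map(2 -> "c"))) == Seq(1 -> "a", 2 -> "c")
```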
[GitHub] spark issue #21007: [SPARK-23942][PYTHON][SQL] Makes collect in PySpark as a...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21007

@HyukjinKwon @BryanCutler @viirya @felixcheung The first sentence of this PR really scared me after reading the PR description. Since the PR description will be part of our change log, please be careful to ensure it is right.
[GitHub] spark pull request #21007: [SPARK-23942][PYTHON][SQL] Makes collect in PySpa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21007#discussion_r181569225

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
```
@@ -3189,10 +3189,10 @@ class Dataset[T] private[sql](
   private[sql] def collectToPython(): Int = {
     EvaluatePython.registerPicklers()
-    withNewExecutionId {
+    withAction("collectToPython", queryExecution) { plan =>
```
--- End diff --

These changes can cause behavior changes. Please submit a PR to document them.
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21057

+1 for ^.
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21057

I think you may want to create a minor JIRA ticket for this.
[GitHub] spark pull request #21057: 2 Improvements to Pyspark docs
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21057#discussion_r181568263

--- Diff: python/pyspark/streaming/listener.py ---
```
@@ -22,6 +22,10 @@ class StreamingListener(object):
 
     def __init__(self):
         pass
 
+    def onStreamingStarted(self, streamingStarted):
```
--- End diff --

Can you add a test to `pyspark.streaming.tests.StreamingListenerTests`?
[GitHub] spark pull request #18378: [SPARK-21163][SQL] DataFrame.toPandas should resp...
Github user edlee123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18378#discussion_r181567770

--- Diff: python/pyspark/sql/dataframe.py ---
```
@@ -1750,6 +1761,24 @@ def _to_scala_map(sc, jm):
     return sc._jvm.PythonUtils.toScalaMap(jm)
 
 
+def _to_corrected_pandas_type(dt):
+    """
+    When converting Spark SQL records to Pandas DataFrame, the inferred data type may be wrong.
+    This method gets the corrected data type for Pandas if that type may be inferred uncorrectly.
+    """
+    import numpy as np
+    if type(dt) == ByteType:
+        return np.int8
+    elif type(dt) == ShortType:
+        return np.int16
+    elif type(dt) == IntegerType:
+        return np.int32
+    elif type(dt) == FloatType:
+        return np.float32
+    else:
```
--- End diff --

As far as I can tell, so far it's just some of our unit tests where we assert expected pandas DataFrames. I think float may also be affected... Should I create a ticket in JIRA?
[GitHub] spark pull request #18378: [SPARK-21163][SQL] DataFrame.toPandas should resp...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18378#discussion_r181567408

--- Diff: python/pyspark/sql/dataframe.py ---
```
@@ -1750,6 +1761,24 @@ def _to_scala_map(sc, jm):
     return sc._jvm.PythonUtils.toScalaMap(jm)
 
 
+def _to_corrected_pandas_type(dt):
+    """
+    When converting Spark SQL records to Pandas DataFrame, the inferred data type may be wrong.
+    This method gets the corrected data type for Pandas if that type may be inferred uncorrectly.
+    """
+    import numpy as np
+    if type(dt) == ByteType:
+        return np.int8
+    elif type(dt) == ShortType:
+        return np.int16
+    elif type(dt) == IntegerType:
+        return np.int32
+    elif type(dt) == FloatType:
+        return np.float32
+    else:
```
--- End diff --

Yup, it was unfortunate, but it was a bug that we should fix. Does that cause an actual break, or just a unit test failure?
[GitHub] spark pull request #18378: [SPARK-21163][SQL] DataFrame.toPandas should resp...
Github user edlee123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18378#discussion_r181565606

--- Diff: python/pyspark/sql/dataframe.py ---
```
@@ -1750,6 +1761,24 @@ def _to_scala_map(sc, jm):
     return sc._jvm.PythonUtils.toScalaMap(jm)
 
 
+def _to_corrected_pandas_type(dt):
+    """
+    When converting Spark SQL records to Pandas DataFrame, the inferred data type may be wrong.
+    This method gets the corrected data type for Pandas if that type may be inferred uncorrectly.
+    """
+    import numpy as np
+    if type(dt) == ByteType:
+        return np.int8
+    elif type(dt) == ShortType:
+        return np.int16
+    elif type(dt) == IntegerType:
+        return np.int32
+    elif type(dt) == FloatType:
+        return np.float32
+    else:
```
--- End diff --

Had a question: in Spark 2.2.1, if I do a .toPandas on a Spark DataFrame with an integer-typed column, the dtype in pandas is int64, whereas in Spark 2.3.0 the ints are converted to int32. I ran the below in Spark 2.2.1 and 2.3.0 (the `sf` alias assumes `import pyspark.sql.functions as sf`, added here for completeness):

```
import pyspark.sql.functions as sf

df = spark.sparkContext.parallelize([(i, ) for i in [1, 2, 3]]) \
    .toDF(["a"]).select(sf.col('a').cast('int')).toPandas()
df.dtypes
```

Is this intended? We ran into this because we have unit tests in a project that passed in Spark 2.2.1 but fail in Spark 2.3.0.
[GitHub] spark pull request #21070: SPARK-23972: Update Parquet to 1.10.0.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r181564129

--- Diff: pom.xml ---
```
@@ -129,7 +129,7 @@
     <hive.version.short>1.2.1</hive.version.short>
     <derby.version>10.12.1.1</derby.version>
-    <parquet.version>1.8.2</parquet.version>
+    <parquet.version>1.10.0</parquet.version>
```
--- End diff --

Unlike [last time](https://github.com/apache/spark/commit/26a4cba3ffaadf382ca14980378965704ccef9ab), it seems that this PR also touches the `commons-pool` dependency. Can we avoid this?

```
-commons-pool-1.5.4.jar
+commons-pool-1.6.jar
```
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r181563918

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
```
@@ -736,12 +736,15 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }
 
 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if the next operator is a Sort itself
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case Sort(orders, true, child) if SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
       child
+    case s @ Sort(_, _, Sort(_, _, child)) => s.copy(child = child)
```
--- End diff --

Thanks for doing this! It might be useful to generalise this to any pair of sorts separated by 0 or more projections or filters. I did this for my SPARK-23975 PR, see: https://github.com/henryr/spark/commit/bb992c2058863322a9183b2985806a87729e4168#diff-a636a87d8843eeccca90140be91d4fafR322 What do you think?
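A rough Scala sketch of that generalisation — recursing through deterministic projections and filters to strip inner sorts. The rule and helper names are hypothetical; this is an illustration of the idea, not the code in the linked commit:

```
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: for each Sort, drop any Sort below it that is separated
// only by deterministic Projects/Filters, since the outer Sort re-orders anyway.
object RemoveInnerSorts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case s: Sort => s.copy(child = stripSorts(s.child))
  }

  private def stripSorts(plan: LogicalPlan): LogicalPlan = plan match {
    case Sort(_, _, child) => stripSorts(child)
    case p: Project if p.projectList.forall(_.deterministic) =>
      p.copy(child = stripSorts(p.child))
    case f: Filter if f.condition.deterministic =>
      f.copy(child = stripSorts(f.child))
    case other => other
  }
}
```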
[GitHub] spark pull request #21070: SPARK-23972: Update Parquet to 1.10.0.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r181563672

--- Diff: pom.xml ---
```
@@ -129,7 +129,7 @@
     <hive.version.short>1.2.1</hive.version.short>
     <derby.version>10.12.1.1</derby.version>
-    <parquet.version>1.8.2</parquet.version>
+    <parquet.version>1.10.0</parquet.version>
```
--- End diff --

@rdblue . To see the Jenkins result, could you resolve the dependency check failure with the following?

```
./dev/test-dependencies.sh --replace-manifest
```
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21072

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89367/ Test PASSed.
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21072

Merged build finished. Test PASSed.
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21072

**[Test build #89367 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89367/testReport)** for PR 21072 at commit [`6ba4186`](https://github.com/apache/spark/commit/6ba418619dcfd139b6b3fa81cf8a29d54ca62681).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21056

Merged build finished. Test PASSed.
[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21056

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89366/ Test PASSed.
[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21056

**[Test build #89366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89366/testReport)** for PR 21056 at commit [`5c1d2b2`](https://github.com/apache/spark/commit/5c1d2b2179a1609b17c8e47b2abeef8a298c1643).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21057

Merged build finished. Test FAILed.
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21057

**[Test build #89368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89368/testReport)** for PR 21057 at commit [`e040a68`](https://github.com/apache/spark/commit/e040a68151c308a1e46cda6723fed2b7023a7331).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21057

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89368/ Test FAILed.
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21057

**[Test build #89368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89368/testReport)** for PR 21057 at commit [`e040a68`](https://github.com/apache/spark/commit/e040a68151c308a1e46cda6723fed2b7023a7331).
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21057

BTW, mind fixing the PR title to `[MINOR][PYTHON] ... ` and making the title more descriptive? Not a big deal, but it's good to match it with other PRs.
[GitHub] spark pull request #21057: 2 Improvements to Pyspark docs
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21057#discussion_r181553378

--- Diff: python/pyspark/streaming/kafka.py ---
```
@@ -104,7 +104,7 @@ def createDirectStream(ssc, topics, kafkaParams, fromOffsets=None,
     :param topics: list of topic_name to consume.
     :param kafkaParams: Additional params for Kafka.
     :param fromOffsets: Per-topic/partition Kafka offsets defining the (inclusive) starting
-        point of the stream.
+        point of the stream (Dict with keys of type TopicAndPartition and int values).
```
--- End diff --

Shall we fix it like so?
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21057

ok to test
[GitHub] spark pull request #20611: [SPARK-23425][SQL]Support wildcard in HDFS path f...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20611#discussion_r181552961

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
```
@@ -304,45 +304,14 @@ case class LoadDataCommand(
       }
     }
 
-    val loadPath =
+    val loadPath = {
       if (isLocal) {
         val uri = Utils.resolveURI(path)
-        val file = new File(uri.getPath)
-        val exists = if (file.getAbsolutePath.contains("*")) {
-          val fileSystem = FileSystems.getDefault
-          val dir = file.getParentFile.getAbsolutePath
-          if (dir.contains("*")) {
-            throw new AnalysisException(
-              s"LOAD DATA input path allows only filename wildcard: $path")
-          }
-
-          // Note that special characters such as "*" on Windows are not allowed as a path.
```
--- End diff --

OK, I am fine.
[GitHub] spark pull request #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes coll...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/21060
[GitHub] spark issue #21070: SPARK-23972: Update Parquet to 1.10.0.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21070

@rdblue, BTW mind fixing the title to `[SPARK-23972][...] ...`? It's actually spelled out in the contribution guide.
[GitHub] spark issue #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIndex
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21004

(Let's avoid PR descriptions that just say "improvement" next time.)
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

Merged to branch-2.3. Thanks for reviewing this @BryanCutler.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20894

jenkins, retest this, please
[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20937

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89364/ Test PASSed.
[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20937

Merged build finished. Test PASSed.
[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20937

**[Test build #89364 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89364/testReport)** for PR 20937 at commit [`36253f4`](https://github.com/apache/spark/commit/36253f4bcf5caaf1945fe526deeca61551e13da4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21072

Merged build finished. Test PASSed.
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21072

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2329/ Test PASSed.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20894

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89365/ Test FAILed.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20894

Merged build finished. Test FAILed.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20894

**[Test build #89365 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89365/testReport)** for PR 20894 at commit [`b43a7c7`](https://github.com/apache/spark/commit/b43a7c7ec50e03aaf4990e9bbb6989cdb2c076ef).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21072

**[Test build #89367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89367/testReport)** for PR 21072 at commit [`6ba4186`](https://github.com/apache/spark/commit/6ba418619dcfd139b6b3fa81cf8a29d54ca62681).
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21072

cc @cloud-fan @henryr
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/21072

[SPARK-23973][SQL] Remove consecutive Sorts

## What changes were proposed in this pull request?

In SPARK-23375 we introduced the ability to remove a `Sort` operation during query optimization if the data is already sorted. In this follow-up we also remove a `Sort` which is followed by another `Sort`: in this case the first sort is not needed and can be safely removed.

The PR starts from @henryr's comment: https://github.com/apache/spark/pull/20560#discussion_r180601594. So credit should be given to him.

## How was this patch tested?

added UT

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-23973

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21072.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21072

commit 6ba418619dcfd139b6b3fa81cf8a29d54ca62681
Author: Marco Gaido
Date: 2018-04-14T10:11:27Z

    [SPARK-23973][SQL] Remove consecutive Sorts
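To illustrate the pattern this optimization targets, a minimal spark-shell sketch (the query itself is illustrative): the outer global sort re-orders everything, so the inner sort does no useful work.

```
import spark.implicits._

val df = spark.range(100).toDF("id")
// The inner sort is immediately overwritten by the outer one.
val query = df.sort($"id".asc).sort($"id".desc)

// With this rule applied, the optimized plan should keep only the outer Sort.
query.explain(extended = true)
```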
[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21056

**[Test build #89366 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89366/testReport)** for PR 21056 at commit [`5c1d2b2`](https://github.com/apache/spark/commit/5c1d2b2179a1609b17c8e47b2abeef8a298c1643).
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20629

@holdenk I am not sure about requiring or not requiring cluster centers for this metric. On one side, since the `ClusteringEvaluator` should be a general interface for all clustering models and some of them don't provide cluster centers, it may be a good idea to compute them if necessary. On the other, does this metric make sense for any model other than KMeans? And computing the centers of the test dataset would lead to different results than the old API we are replacing. So I am not sure it is the right thing to do.

Honestly, the more we go on, the more my feeling is that we don't really need to move this metric here. We can just deprecate it, saying that there are better metrics for evaluating a clustering available in the `ClusteringEvaluator` (namely the silhouette). In this way people can move away from using this metric. Moreover, sklearn - which is one of the most widespread tools - doesn't offer the ability to compute such a cost (http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). The only thing sklearn offers is what it calls `inertia` (https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/cluster/k_means_.py#L265), i.e. the cost computed on the training set.

So, I think the best option would be to follow what sklearn does:
1) Introduce in the `KMeansSummary` (or `KMeansModel` if you prefer) a cost attribute on the training set.
2) Deprecate this method, redirecting to `ClusteringEvaluator` for better metrics and/or to the cost attribute introduced.

What do you think?
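For reference, the suggested migration would look roughly like this in Scala — a sketch only, where `trainingData` and `testData` are assumed DataFrames with the default `features` column:

```
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val model = new KMeans().setK(3).setSeed(1L).fit(trainingData)
val predictions = model.transform(testData)

// The API being deprecated: sum of squared distances to the nearest center.
val cost = model.computeCost(testData)

// The suggested replacement: silhouette score from ClusteringEvaluator
// (ranges in [-1, 1]; higher is better).
val silhouette = new ClusteringEvaluator()
  .setMetricName("silhouette")
  .evaluate(predictions)
```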
[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20629#discussion_r181547119

--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
```
@@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
   /**
    * param for metric name in evaluation
-   * (supports `"silhouette"` (default))
+   * (supports `"silhouette"` (default), `"kmeansCost"`)
```
--- End diff --

But does it make sense to introduce something which is already considered legacy when it is introduced? I think this brings up the question again: shall we maintain a metric which was introduced only temporarily as a fallback due to the lack of better metrics?
[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20629#discussion_r181547107

--- Diff: python/pyspark/ml/clustering.py ---
```
@@ -322,7 +323,11 @@ def computeCost(self, dataset):
         """
         Return the K-means cost (sum of squared distances of points to their nearest center)
         for this model on the given data.
+
+        .. note:: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
         """
+        warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator"
```
--- End diff --

Yes, or I can also update it here once we establish for sure what the new API has to look like, as you prefer.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20894

**[Test build #89365 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89365/testReport)** for PR 20894 at commit [`b43a7c7`](https://github.com/apache/spark/commit/b43a7c7ec50e03aaf4990e9bbb6989cdb2c076ef).
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20894

jenkins, retest this, please
[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20937

**[Test build #89364 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89364/testReport)** for PR 20937 at commit [`36253f4`](https://github.com/apache/spark/commit/36253f4bcf5caaf1945fe526deeca61551e13da4).
[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes
Github user stoader commented on the issue: https://github.com/apache/spark/pull/21067

@mccheah

> But whether or not the driver should be relaunchable should be determined by the application submitter, and not necessarily done all the time. Can we make this behavior configurable?

This should be easy by configuring the [Pod backoff failure policy](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy) of the job such that it executes the pod only once.

> We don't have a solid story for checkpointing streaming computation right now

We've done work for this to store checkpoints on a persistent volume, but we thought that should be a separate PR as it's not strictly linked to this change.

> you'll certainly lose all progress from batch jobs

Agreed that the batch job would be rerun from scratch. Still, I think there is value in being able to run a batch job unattended without intervening in case of machine failure, as the batch job will be rescheduled to another node.
[GitHub] spark pull request #20611: [SPARK-23425][SQL]Support wildcard in HDFS path f...
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/20611#discussion_r181543985

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
```
@@ -304,45 +304,14 @@ case class LoadDataCommand(
       }
     }
 
-    val loadPath =
+    val loadPath = {
      if (isLocal) {
        val uri = Utils.resolveURI(path)
-        val file = new File(uri.getPath)
-        val exists = if (file.getAbsolutePath.contains("*")) {
-          val fileSystem = FileSystems.getDefault
-          val dir = file.getParentFile.getAbsolutePath
-          if (dir.contains("*")) {
-            throw new AnalysisException(
-              s"LOAD DATA input path allows only filename wildcard: $path")
-          }
-
-          // Note that special characters such as "*" on Windows are not allowed as a path.
```
--- End diff --

@wzhfy All test cases in the Windows query suite are passing. I think in the previous code, to read files based on a wildcard character, we first read the file's parent directory, listed all the files inside that folder, and then matched the pattern against each file in the directory. So, to get the parent path, we needed to explicitly check that it does not contain unsupported characters like '*'. Now we pass the path with the wildcard character directly to HDFS's globStatus() API, which should pattern-match regardless of directory/files. In the globStatus API I can see they have special handling for Windows paths; I will look into this in more detail.

![image](https://user-images.githubusercontent.com/12999161/38765114-20a47d7e-3fd9-11e8-9863-59d179c4a2d8.png)

Thanks all for the valuable feedback.
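For reference, a minimal sketch of the Hadoop API in question (the path here is illustrative):

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

val pattern = new Path("hdfs:///warehouse/load_input/part-*")
val fs = pattern.getFileSystem(new Configuration())

// globStatus expands the wildcard itself, so no manual parent-directory
// listing or per-file pattern matching is needed; a glob that matches
// nothing yields an empty array.
val matches: Array[FileStatus] = fs.globStatus(pattern)
matches.foreach(status => println(status.getPath))
```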