[GitHub] spark issue #21007: [SPARK-23942][PYTHON][SQL] Makes collect in PySpark as a...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21007

What's wrong with the description and PR title, and what should be documented? Do you mean the first sentence, `This PR proposes to add collect to a query executor as an action.`, is wrong because it doesn't mention PySpark specifically? I think the PR title and the rest of the words explain what this PR changes.

It's already documented - https://github.com/apache/spark/blob/bd4eb9ce57da7bacff69d9ed958c94f349b7e6fb/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala#L44

Should we document that `collect` in PySpark is now recognised as a query execution action by a query execution listener?
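For context, a `QueryExecutionListener` is called back with the name of the action that completed. A minimal Scala sketch of registering one — the listener class and app name here are illustrative, not from this PR:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Logs which action name fired the callback and how long it ran.
class ActionLoggingListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"action '$funcName' succeeded in ${durationNs / 1e6} ms")
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"action '$funcName' failed: ${exception.getMessage}")
}

val spark = SparkSession.builder().appName("listener-demo").master("local[*]").getOrCreate()
spark.listenerManager.register(new ActionLoggingListener)
spark.range(10).collect()  // in Scala this reports funcName "collect"
```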
[GitHub] spark issue #20451: [SPARK-23146][WIP] Support client mode for Kubernetes cl...
Github user echarles commented on the issue: https://github.com/apache/spark/pull/20451

Now that #20910 has been merged, I will update this PR to take the refactoring into account. @inetfuture Once these changes are pushed, the review process still needs to happen, so it's difficult to give you a timeline.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

Yup, that should reduce some overhead like this. I would like to hear what you guys think. cc @srowenn, @vanzin, @felixcheung, @holdenk too.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

I do think we should clearly document the rules about what we can backport. I do not think we should make an exception for this PR. cc @rxin @marmbrus @yhuai @cloud-fan @ueshin
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

How about we formally document that in the guide? I have always put more importance on practice, and I personally think we are fine to make a backport if it's a bug and the fix is straightforward. IMHO, principle is the baseline, but we should put more importance on practice. Even if I take your words, I would then like to make this an exception, since it fixes actual use cases from our customers.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

If this can be treated as a bug to backport, we have many behavior-change PRs that could be backported. We are building system software. We have to be more principled.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

I agree that it's better to avoid a behaviour change, but this one is clearly a bug and the fix is straightforward. I am puzzled why this PR specifically prompted your concern. I wouldn't revert it unless there's a specific worry about this patch.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

This is just the basic backport rule we follow for each PR. We should not make an exception for this PR.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

I am a bit puzzled because `QueryExecutionListener` should call the callback for actions, and `collect` triggers it in Scala and R but not in PySpark specifically. It sounds like a bug, and the fix is relatively straightforward. The previous behaviour was that the callback was not called at all, which didn't make sense.

I agree that it's discouraged to make a behaviour change in a maintenance release, sure. However, I was thinking it makes sense to backport when the fix is not complicated and it quite clearly looks like a bug. I don't think we should call this an improvement in this case. Were actual apps or test cases broken somewhere?
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

Users' apps should not be blamed in this case. If they want this change, they should upgrade to a newer release. Basically, we should not introduce any external behavior change in a maintenance release if possible.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

I guess the behaviour change here is that a custom query execution listener can now recognise the action `collect` in PySpark, which the other language APIs already detected. Mind explaining how that breaks external apps? If an app expects the callback not to be called for `collect` specifically, while it is called for other actions like `show` in PySpark, I would say the app is to blame.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

This will introduce a behavior change, and it is not a regression. The changes we made in this PR could break external apps. We should not do it in a maintenance release.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

Hm, I would say it's a bug, since an action that is supposed to trigger the callback is not detected. The test is a bit complicated, but the fix is relatively straightforward.
[GitHub] spark issue #21073: [SPARK-23936][SQL][WIP] Implement map_concat
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21073

**[Test build #89369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89369/testReport)** for PR 21073 at commit [`d04893b`](https://github.com/apache/spark/commit/d04893bccbd2772eb937125895519228588e74b4).
[GitHub] spark issue #21073: [SPARK-23936][SQL][WIP] Implement map_concat
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21073

Can one of the admins verify this patch?
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

Since this is not a bug fix, I plan to revert this PR. WDYT? @HyukjinKwon @BryanCutler
[GitHub] spark pull request #21073: [SPARK-23936][SQL][WIP] Implement map_concat
GitHub user bersprockets opened a pull request: https://github.com/apache/spark/pull/21073

[SPARK-23936][SQL][WIP] Implement map_concat

## What changes were proposed in this pull request?

Implement the map_concat higher-order function. This is a work in progress. There's no code generation yet.

My current implementation of MapConcat.checkInputDataTypes does not allow valueContainsNull to vary between the input maps. Not sure what the requirements are here.

I am using a java.util.Map implementation to merge all the maps together, since that is a very straightforward implementation. I chose java.util.LinkedHashMap because:
- It allows for predictable tuple order (essentially, the original insertion order) when building the result MapData. Tests that rely on tuple order, like pyspark's doctests, will work across Java versions.
- java.util.LinkedHashMap seems to be about as fast as java.util.HashMap, at least when used to concatenate big (500+ key/value) maps for 150k rows, and it's much faster than the Scala LinkedHashMap implementation.

## How was this patch tested?

New tests. Manual tests. Ran all sbt SQL tests. Ran all pyspark sql tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bersprockets/spark SPARK-23936

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21073.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21073

commit 707330cd88b269cb1bbee83b9b6476d05c8d177c
Author: Bruce Robbins
Date: 2018-04-14T23:52:37Z

    Initial commit

commit 84d696313972a237691eb46cad6a478167dbabee
Author: Bruce Robbins
Date: 2018-04-15T02:04:45Z

    Remove unused variable in test

commit d04893bccbd2772eb937125895519228588e74b4
Author: Bruce Robbins
Date: 2018-04-15T03:35:47Z

    Cleanup
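To make the design choice concrete, here is a minimal standalone Scala sketch of the merge approach described above — last-wins semantics on duplicate keys with insertion-ordered output. This is an illustration of the idea, not the PR's actual MapConcat code:

```
import java.util.{LinkedHashMap => JLinkedHashMap}
import scala.collection.JavaConverters._

// Merge several maps; later maps win on duplicate keys, and the result keeps
// first-insertion order, which is what java.util.LinkedHashMap guarantees.
def concatMaps[K, V](maps: Seq[Map[K, V]]): Seq[(K, V)] = {
  val merged = new JLinkedHashMap[K, V]()
  for (m <- maps; (k, v) <- m) merged.put(k, v)
  merged.asScala.toSeq
}

// concatMaps(Seq(Map(1 -> "a", 2 -> "b"), Map(2 -> "c"))) == Seq(1 -> "a", 2 -> "c")
```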
[GitHub] spark issue #21007: [SPARK-23942][PYTHON][SQL] Makes collect in PySpark as a...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21007

@HyukjinKwon @BryanCutler @viirya @felixcheung The first sentence of this PR really scared me after reading the PR description. Since the PR description will be part of our change log, please be careful to ensure it is right.
[GitHub] spark pull request #21007: [SPARK-23942][PYTHON][SQL] Makes collect in PySpa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21007#discussion_r181569225

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
```
@@ -3189,10 +3189,10 @@ class Dataset[T] private[sql](
   private[sql] def collectToPython(): Int = {
     EvaluatePython.registerPicklers()
-    withNewExecutionId {
+    withAction("collectToPython", queryExecution) { plan =>
```
--- End diff --

These changes can cause behavior changes. Please submit a PR to document them.
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21057

+1 for ^.
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21057

I think you may want to create a minor JIRA ticket for this.
[GitHub] spark pull request #21057: 2 Improvements to Pyspark docs
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21057#discussion_r181568263

--- Diff: python/pyspark/streaming/listener.py ---
```
@@ -22,6 +22,10 @@ class StreamingListener(object):
 
     def __init__(self):
         pass
 
+    def onStreamingStarted(self, streamingStarted):
```
--- End diff --

Can you add a test to `pyspark.streaming.tests.StreamingListenerTests`?
[GitHub] spark pull request #18378: [SPARK-21163][SQL] DataFrame.toPandas should resp...
Github user edlee123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18378#discussion_r181567770

--- Diff: python/pyspark/sql/dataframe.py ---
```
@@ -1750,6 +1761,24 @@ def _to_scala_map(sc, jm):
     return sc._jvm.PythonUtils.toScalaMap(jm)
 
 
+def _to_corrected_pandas_type(dt):
+    """
+    When converting Spark SQL records to Pandas DataFrame, the inferred data type may be wrong.
+    This method gets the corrected data type for Pandas if that type may be inferred uncorrectly.
+    """
+    import numpy as np
+    if type(dt) == ByteType:
+        return np.int8
+    elif type(dt) == ShortType:
+        return np.int16
+    elif type(dt) == IntegerType:
+        return np.int32
+    elif type(dt) == FloatType:
+        return np.float32
+    else:
```
--- End diff --

As far as I can tell, so far it's just some of our unit tests where we assert expected pandas DataFrames. I think float may also be affected... Should I create a ticket in JIRA?
[GitHub] spark pull request #18378: [SPARK-21163][SQL] DataFrame.toPandas should resp...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18378#discussion_r181567408

--- Diff: python/pyspark/sql/dataframe.py ---
```
@@ -1750,6 +1761,24 @@ def _to_scala_map(sc, jm):
     return sc._jvm.PythonUtils.toScalaMap(jm)
 
 
+def _to_corrected_pandas_type(dt):
+    """
+    When converting Spark SQL records to Pandas DataFrame, the inferred data type may be wrong.
+    This method gets the corrected data type for Pandas if that type may be inferred uncorrectly.
+    """
+    import numpy as np
+    if type(dt) == ByteType:
+        return np.int8
+    elif type(dt) == ShortType:
+        return np.int16
+    elif type(dt) == IntegerType:
+        return np.int32
+    elif type(dt) == FloatType:
+        return np.float32
+    else:
```
--- End diff --

Yup, it was unfortunate, but it was a bug that we should fix. Does that cause an actual break, or just a unit test failure?
[GitHub] spark pull request #18378: [SPARK-21163][SQL] DataFrame.toPandas should resp...
Github user edlee123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18378#discussion_r181565606

--- Diff: python/pyspark/sql/dataframe.py ---
```
@@ -1750,6 +1761,24 @@ def _to_scala_map(sc, jm):
     return sc._jvm.PythonUtils.toScalaMap(jm)
 
 
+def _to_corrected_pandas_type(dt):
+    """
+    When converting Spark SQL records to Pandas DataFrame, the inferred data type may be wrong.
+    This method gets the corrected data type for Pandas if that type may be inferred uncorrectly.
+    """
+    import numpy as np
+    if type(dt) == ByteType:
+        return np.int8
+    elif type(dt) == ShortType:
+        return np.int16
+    elif type(dt) == IntegerType:
+        return np.int32
+    elif type(dt) == FloatType:
+        return np.float32
+    else:
```
--- End diff --

Had a question: in Spark 2.2.1, if I do a .toPandas on a Spark DataFrame with an integer-typed column, the dtype in pandas is int64, whereas in Spark 2.3.0 the ints are converted to int32. I ran the below in Spark 2.2.1 and 2.3.0 (the `sf` alias assumes `import pyspark.sql.functions as sf`, added here for completeness):

```
import pyspark.sql.functions as sf

df = spark.sparkContext.parallelize([(i, ) for i in [1, 2, 3]]) \
    .toDF(["a"]).select(sf.col('a').cast('int')).toPandas()
df.dtypes
```

Is this intended? We ran into this because we have unit tests in a project that passed in Spark 2.2.1 but fail in Spark 2.3.0.
[GitHub] spark pull request #21070: SPARK-23972: Update Parquet to 1.10.0.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r181564129

--- Diff: pom.xml ---
```
@@ -129,7 +129,7 @@
     <hive.version.short>1.2.1</hive.version.short>
     <derby.version>10.12.1.1</derby.version>
-    <parquet.version>1.8.2</parquet.version>
+    <parquet.version>1.10.0</parquet.version>
```
--- End diff --

Unlike [last time](https://github.com/apache/spark/commit/26a4cba3ffaadf382ca14980378965704ccef9ab), it seems that this PR also touches the `commons-pool` dependency. Can we avoid this?

```
-commons-pool-1.5.4.jar
+commons-pool-1.6.jar
```
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r181563918

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
```
@@ -736,12 +736,15 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }
 
 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if the next operator is a Sort itself
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case Sort(orders, true, child) if SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
       child
+    case s @ Sort(_, _, Sort(_, _, child)) => s.copy(child = child)
```
--- End diff --

Thanks for doing this! It might be useful to generalise this to any pair of sorts separated by 0 or more projections or filters. I did this for my SPARK-23975 PR, see: https://github.com/henryr/spark/commit/bb992c2058863322a9183b2985806a87729e4168#diff-a636a87d8843eeccca90140be91d4fafR322 What do you think?
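A rough Scala sketch of that generalisation — recursing through deterministic projections and filters to strip inner sorts. The rule and helper names are hypothetical; this is an illustration of the idea, not the code in the linked commit:

```
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: for each Sort, drop any Sort below it that is separated
// only by deterministic Projects/Filters, since the outer Sort re-orders anyway.
object RemoveInnerSorts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case s: Sort => s.copy(child = stripSorts(s.child))
  }

  private def stripSorts(plan: LogicalPlan): LogicalPlan = plan match {
    case Sort(_, _, child) => stripSorts(child)
    case p: Project if p.projectList.forall(_.deterministic) =>
      p.copy(child = stripSorts(p.child))
    case f: Filter if f.condition.deterministic =>
      f.copy(child = stripSorts(f.child))
    case other => other
  }
}
```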
[GitHub] spark pull request #21070: SPARK-23972: Update Parquet to 1.10.0.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r181563672

--- Diff: pom.xml ---
```
@@ -129,7 +129,7 @@
     <hive.version.short>1.2.1</hive.version.short>
     <derby.version>10.12.1.1</derby.version>
-    <parquet.version>1.8.2</parquet.version>
+    <parquet.version>1.10.0</parquet.version>
```
--- End diff --

@rdblue . To see the Jenkins result, could you resolve the dependency check failure with the following?

```
./dev/test-dependencies.sh --replace-manifest
```
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21072

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89367/ Test PASSed.
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21072

Merged build finished. Test PASSed.
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21072

**[Test build #89367 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89367/testReport)** for PR 21072 at commit [`6ba4186`](https://github.com/apache/spark/commit/6ba418619dcfd139b6b3fa81cf8a29d54ca62681).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21056

Merged build finished. Test PASSed.
[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21056

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89366/ Test PASSed.
[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21056

**[Test build #89366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89366/testReport)** for PR 21056 at commit [`5c1d2b2`](https://github.com/apache/spark/commit/5c1d2b2179a1609b17c8e47b2abeef8a298c1643).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21057

Merged build finished. Test FAILed.
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21057

**[Test build #89368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89368/testReport)** for PR 21057 at commit [`e040a68`](https://github.com/apache/spark/commit/e040a68151c308a1e46cda6723fed2b7023a7331).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21057

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89368/ Test FAILed.
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21057

**[Test build #89368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89368/testReport)** for PR 21057 at commit [`e040a68`](https://github.com/apache/spark/commit/e040a68151c308a1e46cda6723fed2b7023a7331).
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21057

BTW, mind fixing the PR title to `[MINOR][PYTHON] ... ` and making the title more descriptive? Not a big deal, but it's good to match it with other PRs.
[GitHub] spark pull request #21057: 2 Improvements to Pyspark docs
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21057#discussion_r181553378

--- Diff: python/pyspark/streaming/kafka.py ---
```
@@ -104,7 +104,7 @@ def createDirectStream(ssc, topics, kafkaParams, fromOffsets=None,
     :param topics: list of topic_name to consume.
     :param kafkaParams: Additional params for Kafka.
     :param fromOffsets: Per-topic/partition Kafka offsets defining the (inclusive) starting
-        point of the stream.
+        point of the stream (Dict with keys of type TopicAndPartition and int values).
```
--- End diff --

Shall we fix it like so?
[GitHub] spark issue #21057: 2 Improvements to Pyspark docs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21057

ok to test
[GitHub] spark pull request #20611: [SPARK-23425][SQL]Support wildcard in HDFS path f...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20611#discussion_r181552961

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
```
@@ -304,45 +304,14 @@ case class LoadDataCommand(
       }
     }
 
-    val loadPath =
+    val loadPath = {
       if (isLocal) {
         val uri = Utils.resolveURI(path)
-        val file = new File(uri.getPath)
-        val exists = if (file.getAbsolutePath.contains("*")) {
-          val fileSystem = FileSystems.getDefault
-          val dir = file.getParentFile.getAbsolutePath
-          if (dir.contains("*")) {
-            throw new AnalysisException(
-              s"LOAD DATA input path allows only filename wildcard: $path")
-          }
-
-          // Note that special characters such as "*" on Windows are not allowed as a path.
```
--- End diff --

OK, I am fine.
[GitHub] spark pull request #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes coll...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/21060
[GitHub] spark issue #21070: SPARK-23972: Update Parquet to 1.10.0.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21070

@rdblue, BTW mind fixing the title to `[SPARK-23972][...] ...`? It's actually spelled out in the contribution guide.
[GitHub] spark issue #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIndex
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21004

(Let's avoid PR descriptions that just say "improvement" next time.)
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

Merged to branch-2.3. Thanks for reviewing this @BryanCutler.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20894

jenkins, retest this, please
[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20937

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89364/ Test PASSed.
[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20937

Merged build finished. Test PASSed.
[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20937

**[Test build #89364 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89364/testReport)** for PR 20937 at commit [`36253f4`](https://github.com/apache/spark/commit/36253f4bcf5caaf1945fe526deeca61551e13da4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21072

Merged build finished. Test PASSed.
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21072

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2329/ Test PASSed.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20894

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89365/ Test FAILed.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20894

Merged build finished. Test FAILed.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20894

**[Test build #89365 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89365/testReport)** for PR 20894 at commit [`b43a7c7`](https://github.com/apache/spark/commit/b43a7c7ec50e03aaf4990e9bbb6989cdb2c076ef).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21072

**[Test build #89367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89367/testReport)** for PR 21072 at commit [`6ba4186`](https://github.com/apache/spark/commit/6ba418619dcfd139b6b3fa81cf8a29d54ca62681).
[GitHub] spark issue #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21072

cc @cloud-fan @henryr
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/21072

[SPARK-23973][SQL] Remove consecutive Sorts

## What changes were proposed in this pull request?

In SPARK-23375 we introduced the ability to remove a `Sort` operation during query optimization if the data is already sorted. In this follow-up we also remove a `Sort` which is followed by another `Sort`: in this case the first sort is not needed and can be safely removed.

The PR starts from @henryr's comment: https://github.com/apache/spark/pull/20560#discussion_r180601594. So credit should be given to him.

## How was this patch tested?

added UT

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-23973

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21072.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21072

commit 6ba418619dcfd139b6b3fa81cf8a29d54ca62681
Author: Marco Gaido
Date: 2018-04-14T10:11:27Z

    [SPARK-23973][SQL] Remove consecutive Sorts
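To illustrate the pattern this optimization targets, a minimal spark-shell sketch (the query itself is illustrative): the outer global sort re-orders everything, so the inner sort does no useful work.

```
import spark.implicits._

val df = spark.range(100).toDF("id")
// The inner sort is immediately overwritten by the outer one.
val query = df.sort($"id".asc).sort($"id".desc)

// With this rule applied, the optimized plan should keep only the outer Sort.
query.explain(extended = true)
```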
[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21056

**[Test build #89366 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89366/testReport)** for PR 21056 at commit [`5c1d2b2`](https://github.com/apache/spark/commit/5c1d2b2179a1609b17c8e47b2abeef8a298c1643).
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20629

@holdenk I am not sure about requiring or not requiring cluster centers for this metric. On one side, since the `ClusteringEvaluator` should be a general interface for all clustering models and some of them don't provide cluster centers, it may be a good idea to compute them if necessary. On the other, does this metric make sense for any model other than KMeans? And computing the centers of the test dataset would lead to different results than the old API we are replacing. So I am not sure it is the right thing to do.

Honestly, the more we go on, the more my feeling is that we don't really need to move this metric here. We can just deprecate it, saying that there are better metrics for evaluating a clustering available in the `ClusteringEvaluator` (namely the silhouette). In this way people can move away from using this metric. Moreover, sklearn - which is one of the most widespread tools - doesn't offer the ability to compute such a cost (http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). The only thing sklearn offers is what it calls `inertia` (https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/cluster/k_means_.py#L265), i.e. the cost computed on the training set.

So, I think the best option would be to follow what sklearn does:
1) Introduce in the `KMeansSummary` (or `KMeansModel` if you prefer) a cost attribute on the training set.
2) Deprecate this method, redirecting to `ClusteringEvaluator` for better metrics and/or to the cost attribute introduced.

What do you think?
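For reference, the suggested migration would look roughly like this in Scala — a sketch only, where `trainingData` and `testData` are assumed DataFrames with the default `features` column:

```
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val model = new KMeans().setK(3).setSeed(1L).fit(trainingData)
val predictions = model.transform(testData)

// The API being deprecated: sum of squared distances to the nearest center.
val cost = model.computeCost(testData)

// The suggested replacement: silhouette score from ClusteringEvaluator
// (ranges in [-1, 1]; higher is better).
val silhouette = new ClusteringEvaluator()
  .setMetricName("silhouette")
  .evaluate(predictions)
```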
[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20629#discussion_r181547119

--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
```
@@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
   /**
    * param for metric name in evaluation
-   * (supports `"silhouette"` (default))
+   * (supports `"silhouette"` (default), `"kmeansCost"`)
```
--- End diff --

But does it make sense to introduce something which is already considered legacy when it is introduced? I think this brings up the question again: shall we maintain a metric which was introduced only temporarily as a fallback due to the lack of better metrics?
[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20629#discussion_r181547107

--- Diff: python/pyspark/ml/clustering.py ---
```
@@ -322,7 +323,11 @@ def computeCost(self, dataset):
         """
         Return the K-means cost (sum of squared distances of points to their nearest center)
         for this model on the given data.
+
+        .. note:: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
         """
+        warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator"
```
--- End diff --

Yes, or I can also update it here once we establish for sure what the new API has to look like, as you prefer.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20894

**[Test build #89365 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89365/testReport)** for PR 20894 at commit [`b43a7c7`](https://github.com/apache/spark/commit/b43a7c7ec50e03aaf4990e9bbb6989cdb2c076ef).
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/20894

jenkins, retest this, please
[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20937

**[Test build #89364 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89364/testReport)** for PR 20937 at commit [`36253f4`](https://github.com/apache/spark/commit/36253f4bcf5caaf1945fe526deeca61551e13da4).
[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes
Github user stoader commented on the issue: https://github.com/apache/spark/pull/21067

@mccheah

> But whether or not the driver should be relaunchable should be determined by the application submitter, and not necessarily done all the time. Can we make this behavior configurable?

This should be easy by configuring the [Pod backoff failure policy](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy) of the job such that it executes the pod only once.

> We don't have a solid story for checkpointing streaming computation right now

We've done work for this to store checkpoints on a persistent volume, but we thought that should be a separate PR as it's not strictly linked to this change.

> you'll certainly lose all progress from batch jobs

Agreed that the batch job would be rerun from scratch. Still, I think there is value in being able to run a batch job unattended without intervening in case of machine failure, as the batch job will be rescheduled to another node.
[GitHub] spark pull request #20611: [SPARK-23425][SQL]Support wildcard in HDFS path f...
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/20611#discussion_r181543985

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
```
@@ -304,45 +304,14 @@ case class LoadDataCommand(
       }
     }
 
-    val loadPath =
+    val loadPath = {
      if (isLocal) {
        val uri = Utils.resolveURI(path)
-        val file = new File(uri.getPath)
-        val exists = if (file.getAbsolutePath.contains("*")) {
-          val fileSystem = FileSystems.getDefault
-          val dir = file.getParentFile.getAbsolutePath
-          if (dir.contains("*")) {
-            throw new AnalysisException(
-              s"LOAD DATA input path allows only filename wildcard: $path")
-          }
-
-          // Note that special characters such as "*" on Windows are not allowed as a path.
```
--- End diff --

@wzhfy All test cases in the Windows query suite are passing. I think in the previous code, to read files based on a wildcard character, we first read the file's parent directory, listed all the files inside that folder, and then matched the pattern against each file in the directory. So, to get the parent path, we needed to explicitly check that it does not contain unsupported characters like '*'. Now we pass the path with the wildcard character directly to HDFS's globStatus() API, which should pattern-match regardless of directory/files. In the globStatus API I can see they have special handling for Windows paths; I will look into this in more detail.

![image](https://user-images.githubusercontent.com/12999161/38765114-20a47d7e-3fd9-11e8-9863-59d179c4a2d8.png)

Thanks all for the valuable feedback.
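For reference, a minimal sketch of the Hadoop API in question (the path here is illustrative):

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

val pattern = new Path("hdfs:///warehouse/load_input/part-*")
val fs = pattern.getFileSystem(new Configuration())

// globStatus expands the wildcard itself, so no manual parent-directory
// listing or per-file pattern matching is needed; a glob that matches
// nothing yields an empty array.
val matches: Array[FileStatus] = fs.globStatus(pattern)
matches.foreach(status => println(status.getPath))
```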