[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fall back to non-Arr...

2018-02-12 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20567
  
A quick nit: fallback is a single word.


---




[GitHub] spark pull request #20490: [SPARK-23323][SQL]: Support commit coordinator fo...

2018-02-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20490#discussion_r167137165
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java
 ---
@@ -62,6 +62,16 @@
*/
   DataWriterFactory createWriterFactory();
 
+  /**
+   * Returns whether Spark should use the commit coordinator to ensure 
that only one attempt for
--- End diff --

This is actually not a guarantee, is it?



---




[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....

2018-02-07 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20499
  
I'd fix this in 2.3, and 2.2.1 as well.



---




[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-02-07 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20535#discussion_r166701501
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java 
---
@@ -27,6 +27,39 @@
 /**
  * An immutable string-to-string map in which keys are case-insensitive. 
This is used to represent
  * data source options.
+ *
+ * Each data source implementation can define its own options and teach 
its users how to set them.
+ * Spark doesn't have any restrictions about what options a data source 
should or should not have.
+ * Instead Spark defines some standard options that data sources can 
optionally adopt. It's possible
+ * that some options are very common and many data sources use them. 
However different data
+ * sources may define the common options(key and meaning) differently, 
which is quite confusing to
+ * end users.
+ *
+ * The standard options defined by Spark:
+ * 
+ *   
+ * Option key
+ * Option value
+ *   
+ *   
+ * path
+ * A comma separated paths string of the data files/directories, 
like
+ * path1,/absolute/file2,path3/*. Each path can either be 
relative or absolute,
+ * points to either file or directory, and can contain wildcards. This 
option is commonly used
+ * by file-based data sources.
+ *   
+ *   
+ * table
+ * A table name string representing the table name directly 
without any interpretation.
--- End diff --

what do you mean by "without any interpretation"?


---




[GitHub] spark issue #20491: [SQL] Minor doc update: Add an example in DataFrameReade...

2018-02-02 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20491
  
This should also go into branch-2.3.



---




[GitHub] spark pull request #20491: [SQL] Minor doc update: Add an example in DataFra...

2018-02-02 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/20491

[SQL] Minor doc update: Add an example in DataFrameReader.schema

## What changes were proposed in this pull request?
This patch adds a small example to the schema-string definition of the schema function. It isn't obvious how to use it, so an example would be useful.

## How was this patch tested?
N/A - doc only.
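For context, the usage being documented looks roughly like this (a hedged sketch assuming the DDL-string overload of DataFrameReader.schema; the exact snippet in the patch may differ, and the path is a placeholder):

```
import org.apache.spark.sql.SparkSession

object SchemaStringExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("schema-string").getOrCreate()
    // Pass the schema as a DDL-formatted string instead of building a StructType by hand.
    val df = spark.read.schema("a INT, b STRING").csv("/tmp/people.csv")
    df.printSchema()
    spark.stop()
  }
}
```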


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark schema-doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20491.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20491


commit 69193dbd64e9e0002abd9a8cd6fe60c1c87bc471
Author: Reynold Xin <rxin@...>
Date:   2018-02-02T23:00:39Z

[SQL] Minor doc update: Add an example in DataFrameReader.schema

commit e5e5e0b44e22f58736dd27e5c048395670574f18
Author: Reynold Xin <rxin@...>
Date:   2018-02-02T23:02:26Z

fix typo




---




[GitHub] spark issue #16793: [SPARK-19454][PYTHON][SQL] DataFrame.replace improvement...

2018-02-02 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16793
  
Also, the implementation doesn't match what was proposed in https://issues.apache.org/jira/browse/SPARK-19454

Having a null value as the default in a function called replace is too risky and error-prone.



---




[GitHub] spark issue #16793: [SPARK-19454][PYTHON][SQL] DataFrame.replace improvement...

2018-02-02 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16793
  
Sorry, I object to this change. Why would we put null as the default replacement value in a function called replace? That seems very counterintuitive and error-prone.


---




[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection

2018-01-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20219
  
But it is possible to generate NullType data, right?
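For example, a literal null with no explicit type already yields a NullType column (a small hypothetical sketch, not part of the PR):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object NullTypeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nulltype").getOrCreate()
    // An untyped null literal produces a column whose data type is NullType.
    val df = spark.range(1).select(lit(null).as("n"))
    df.printSchema() // n: null (nullable = true)
    spark.stop()
  }
}
```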


---




[GitHub] spark issue #20152: [SPARK-22957] ApproxQuantile breaks if the number of row...

2018-01-04 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20152
  
cc @gatorsmile @cloud-fan 


---




[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

2018-01-03 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20072#discussion_r159573530
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -261,6 +261,17 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
+"org.apache.spark.sql.execution.datasources.fileDataSizeFactor")
--- End diff --

shouldn't we call this something like compressionFactor?



---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20076
  
Thanks for the PR. Why are we complicating the PR by doing the rename? Does 
this actually gain anything other than minor cosmetic changes? It makes the 
simple PR pretty long ...



---




[GitHub] spark issue #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documentation

2017-12-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19946
  
Merging in master.



---




[GitHub] spark pull request #19973: [SPARK-22779] FallbackConfigEntry's default value...

2017-12-21 Thread rxin
Github user rxin closed the pull request at:

https://github.com/apache/spark/pull/19973


---




[GitHub] spark pull request #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documen...

2017-12-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19946#discussion_r158205893
  
--- Diff: docs/building-spark.md ---
@@ -49,7 +49,7 @@ To create a Spark distribution like those distributed by 
the
 to be runnable, use `./dev/make-distribution.sh` in the project root 
directory. It can be configured
 with Maven profile settings and so on like the direct Maven build. Example:
 
-./dev/make-distribution.sh --name custom-spark --pip --r --tgz 
-Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn
+./dev/make-distribution.sh --name custom-spark --pip --r --tgz 
-Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
--- End diff --

Yeah, I don't think you need to block this PR on this.



---




[GitHub] spark issue #20014: [SPARK-22827][CORE] Avoid throwing OutOfMemoryError in c...

2017-12-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20014
  
Overall, the change LGTM.



---




[GitHub] spark pull request #20014: [SPARK-22827][CORE] Avoid throwing OutOfMemoryErr...

2017-12-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20014#discussion_r157673852
  
--- Diff: 
core/src/main/java/org/apache/spark/memory/SparkOutOfMemoryError.java ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.memory;
+
+/**
+ * This exception is thrown when a task can not acquire memory from the 
Memory manager.
+ * Instead of throwing {@link OutOfMemoryError}, which kills the executor,
+ * we should use throw this exception, which will just kill the current 
task.
+ */
+public final class SparkOutOfMemoryError extends OutOfMemoryError {
--- End diff --

Is this an internal class? If yes, perhaps we should label it.
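For example, labeling it could look roughly like this, assuming the @Private annotation in org.apache.spark.annotation is the intended marker (the class name below is made up for illustration):

```
import org.apache.spark.annotation.Private

// Mark the class as internal so it is excluded from the public API surface.
@Private
class InternalOnlySketchError(message: String) extends Error(message)
```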



---




[GitHub] spark issue #19973: [SPARK-22779] FallbackConfigEntry's default value should...

2017-12-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19973
  
@vanzin do you have a minute to submit a patch?


---




[GitHub] spark pull request #19946: [SPARK-22648] [Scheduler] Spark on Kubernetes - D...

2017-12-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19946#discussion_r156821519
  
--- Diff: docs/building-spark.md ---
@@ -49,7 +49,7 @@ To create a Spark distribution like those distributed by 
the
 to be runnable, use `./dev/make-distribution.sh` in the project root 
directory. It can be configured
 with Maven profile settings and so on like the direct Maven build. Example:
 
-./dev/make-distribution.sh --name custom-spark --pip --r --tgz 
-Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn
+./dev/make-distribution.sh --name custom-spark --pip --r --tgz 
-Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
--- End diff --

Should we use k8s? I keep bringing this up because I can never spell Kubernetes properly.


---




[GitHub] spark issue #19973: [SPARK-22779] FallbackConfigEntry's default value should...

2017-12-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19973
  
That's what the "default" is, isn't it?


---




[GitHub] spark issue #19973: [SPARK-22779] ConfigEntry's default value should actuall...

2017-12-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19973
  
The issue is in

```
/**
 * Return the `string` value of Spark SQL configuration property for the given key.
 * If the key is not set yet, return `defaultValue`.
 */
def getConfString(key: String, defaultValue: String): String = {
  if (defaultValue != null && defaultValue != "") {
    val entry = sqlConfEntries.get(key)
    if (entry != null) {
      // Only verify configs in the SQLConf object
      entry.valueConverter(defaultValue)
    }
  }
  Option(settings.get(key)).getOrElse(defaultValue)
}
```

The value converter gets applied to this generated string, which is not a real value, and it will fail.
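A minimal, self-contained sketch of that failure mode, using made-up stand-ins rather than the real ConfigEntry/SQLConf classes:

```
object FallbackDefaultSketch {
  // A fallback config's "default" here is a human-readable description, not a parseable value.
  val defaultValueString: String = "<value of spark.buffer.size>"

  // The value converter a numeric config would use.
  val valueConverter: String => Int = _.toInt

  def main(args: Array[String]): Unit = {
    try {
      valueConverter(defaultValueString)
    } catch {
      case e: NumberFormatException =>
        // This is the failure described above: the converter chokes on the descriptive string.
        println(s"Conversion failed as expected: $e")
    }
  }
}
```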


---




[GitHub] spark pull request #19973: [SPARK-22779] ConfigEntry's default value should ...

2017-12-13 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/19973

[SPARK-22779] ConfigEntry's default value should actually be a value

## What changes were proposed in this pull request?
A ConfigEntry's default value right now shows a human-readable message. In some places in SQL we actually rely on the default value being a real, usable value.

## How was this patch tested?
Tested manually.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-22779

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19973.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19973


commit 385c300c14a382654c2a1f94ccd2813487dbe26a
Author: Reynold Xin <r...@databricks.com>
Date:   2017-12-13T22:43:55Z

[SPARK-22779] ConfigEntry's default value should actually be a value




---




[GitHub] spark issue #19973: [SPARK-22779] ConfigEntry's default value should actuall...

2017-12-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19973
  
cc @vanzin @gatorsmile 


---




[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...

2017-12-07 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19861#discussion_r155693977
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ConfigSupport.scala
 ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import java.util.regex.Pattern
+
+import scala.collection.JavaConverters._
+import scala.collection.immutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.v2.ConfigSupport
+
+private[sql] object DataSourceV2ConfigSupport extends Logging {
+
+  /**
+   * Helper method to propagate session configs with config key that 
matches at least one of the
+   * given prefixes to the corresponding data source options.
+   *
+   * @param cs the session config propagate help class
+   * @param source the data source format
+   * @param conf the session conf
+   * @return an immutable map that contains all the session configs that 
should be propagated to
+   * the data source.
+   */
+  def withSessionConfig(
+  cs: ConfigSupport,
+  source: String,
+  conf: SQLConf): immutable.Map[String, String] = {
+val prefixes = cs.getConfigPrefixes
+require(prefixes != null, "The config key-prefixes cann't be null.")
+val mapping = cs.getConfigMapping.asScala
+val validOptions = cs.getValidOptions
+require(validOptions != null, "The valid options list cann't be null.")
--- End diff --

double n


---




[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...

2017-12-07 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19861#discussion_r155693966
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ConfigSupport.scala
 ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import java.util.regex.Pattern
+
+import scala.collection.JavaConverters._
+import scala.collection.immutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.v2.ConfigSupport
+
+private[sql] object DataSourceV2ConfigSupport extends Logging {
+
+  /**
+   * Helper method to propagate session configs with config key that 
matches at least one of the
+   * given prefixes to the corresponding data source options.
+   *
+   * @param cs the session config propagate help class
+   * @param source the data source format
+   * @param conf the session conf
+   * @return an immutable map that contains all the session configs that 
should be propagated to
+   * the data source.
+   */
+  def withSessionConfig(
+  cs: ConfigSupport,
+  source: String,
+  conf: SQLConf): immutable.Map[String, String] = {
+val prefixes = cs.getConfigPrefixes
+require(prefixes != null, "The config key-prefixes cann't be null.")
--- End diff --

double n


---




[GitHub] spark issue #19905: [SPARK-22710] ConfigBuilder.fallbackConf should trigger ...

2017-12-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19905
  
cc @vanzin 


---




[GitHub] spark pull request #19905: [SPARK-22710] ConfigBuilder.fallbackConf should t...

2017-12-05 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/19905

[SPARK-22710] ConfigBuilder.fallbackConf should trigger onCreate function

## What changes were proposed in this pull request?
I was looking at the config code today and found that configs defined using ConfigBuilder.fallbackConf didn't trigger the onCreate function. This patch fixes it.

This doesn't require backporting since we currently have no configs that 
use it.

## How was this patch tested?
Added a test case for all the config final creator functions in 
ConfigEntrySuite.
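To illustrate the class of bug being fixed, here is a minimal sketch with made-up builder classes (not the real ConfigBuilder): a creation callback that fires on the normal path but is skipped by the fallback path.

```
final class ConfigEntrySketch(val key: String)

final class ConfigBuilderSketch(key: String, onCreate: ConfigEntrySketch => Unit) {
  def createWithDefault(default: Int): ConfigEntrySketch = {
    val entry = new ConfigEntrySketch(key)
    onCreate(entry) // normal path: the creation callback fires
    entry
  }

  // Buggy fallback path: it never invokes the callback, mirroring the issue described above.
  def fallbackConf(parent: ConfigEntrySketch): ConfigEntrySketch = new ConfigEntrySketch(key)
}

object OnCreateSketch {
  def main(args: Array[String]): Unit = {
    var registered = List.empty[String]
    val register: ConfigEntrySketch => Unit = e => registered ::= e.key

    val parent = new ConfigBuilderSketch("spark.parent", register).createWithDefault(1)
    new ConfigBuilderSketch("spark.child", register).fallbackConf(parent)

    println(registered) // List(spark.parent): the fallback entry was never registered
  }
}
```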

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-22710

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19905.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19905


commit 028e0faf87f15c9a25acc3e4f6755d3d49bb7904
Author: Reynold Xin <r...@databricks.com>
Date:   2017-12-06T07:10:38Z

[SPARK-22710] ConfigBuilder.fallbackConf should trigger onCreate function




---




[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...

2017-11-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19468
  
For future pull requests, can you create subtasks under 
https://issues.apache.org/jira/browse/SPARK-18278 ?



---




[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...

2017-11-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19468
  
Thanks - merging in master!



---




[GitHub] spark pull request #11994: [SPARK-14151] Expose metrics Source and Sink inte...

2017-11-27 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/11994#discussion_r153321690
  
--- Diff: core/src/main/scala/org/apache/spark/metrics/sink/Sink.scala ---
@@ -17,8 +17,48 @@
 
 package org.apache.spark.metrics.sink
 
-private[spark] trait Sink {
+import java.util.{Locale, Properties}
+import java.util.concurrent.TimeUnit
+
+import com.codahale.metrics.MetricRegistry
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.metrics.MetricsSystem
+
+/**
+ * :: DeveloperApi ::
+ * The abstract class of metrics Sink, by achiving the methods and 
registered through metrics
+ * .properties user could register customer metrics Sink into 
MetricsSystem.
+ *
+ * @param properties Properties related this specific Sink, properties are 
read from
+ *   configuration file, user could define their own 
configurations and get
+ *   from this parameter.
+ * @param metricRegistry The MetricRegistry for you to dump the collected 
metrics.
+ */
+@DeveloperApi
+abstract class Sink(properties: Properties, metricRegistry: 
MetricRegistry) {
+
+  protected val pollPeriod = properties.getProperty("period", "10").toInt
+
+  protected val pollUnit = Option(properties.getProperty("unit"))
+.map(s => TimeUnit.valueOf(s.toUpperCase(Locale.ROOT)))
+.getOrElse(TimeUnit.SECONDS)
+
+  MetricsSystem.checkMinimalPollingPeriod(pollUnit, pollPeriod)
+
+  /**
+   * Start this metrics Sink, this will be called by MetricsSystem. If 
this [[Sink]] is failed to
--- End diff --

fails to start


---




[GitHub] spark issue #11994: [SPARK-14151] Expose metrics Source and Sink interface

2017-11-27 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/11994
  
Hey, so my main question is whether we should expose the Codahale metrics library directly. In the past, we have done this and it has come back to bite us. For example, exposing the Hadoop Configuration object directly in Spark core has caused a lot of issues, including not being able to remove Hadoop as a dependency in cases where Hadoop is not needed (which creates a huge jar).

It might be OK if we view this as a developer API that we can break in the future.


---




[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...

2017-11-27 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19468
  
I went through the changes to make sure the non-k8s changes are ok. They do 
look ok to me. From that perspective, LGTM.



---




[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

2017-11-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19623#discussion_r148605519
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
 ---
@@ -50,28 +53,34 @@
 
   /**
* Creates a writer factory which will be serialized and sent to 
executors.
+   *
+   * If this method fails (by throwing an exception), the action would 
fail and no Spark job was
+   * submitted.
*/
   DataWriterFactory createWriterFactory();
 
   /**
* Commits this writing job with a list of commit messages. The commit 
messages are collected from
-   * successful data writers and are produced by {@link 
DataWriter#commit()}. If this method
-   * fails(throw exception), this writing job is considered to be failed, 
and
-   * {@link #abort(WriterCommitMessage[])} will be called. The written 
data should only be visible
-   * to data source readers if this method succeeds.
+   * successful data writers and are produced by {@link 
DataWriter#commit()}.
+   *
+   * If this method fails (by throwing an exception), this writing job is 
considered to to have been
+   * failed, and {@link #abort(WriterCommitMessage[])} would be called. 
The state of the destination
+   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be 
able to deal with it.
*
* Note that, one partition may have multiple committed data writers 
because of speculative tasks.
* Spark will pick the first successful one and get its commit message. 
Implementations should be
--- End diff --

@steveloughran it is provably not possible. Your algorithm doesn't work.


---




[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

2017-11-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19623#discussion_r148553358
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
 ---
@@ -50,28 +53,34 @@
 
   /**
* Creates a writer factory which will be serialized and sent to 
executors.
+   *
+   * If this method fails (by throwing an exception), the action would 
fail and no Spark job was
+   * submitted.
*/
   DataWriterFactory createWriterFactory();
 
   /**
* Commits this writing job with a list of commit messages. The commit 
messages are collected from
-   * successful data writers and are produced by {@link 
DataWriter#commit()}. If this method
-   * fails(throw exception), this writing job is considered to be failed, 
and
-   * {@link #abort(WriterCommitMessage[])} will be called. The written 
data should only be visible
-   * to data source readers if this method succeeds.
+   * successful data writers and are produced by {@link 
DataWriter#commit()}.
+   *
+   * If this method fails (by throwing an exception), this writing job is 
considered to to have been
+   * failed, and {@link #abort(WriterCommitMessage[])} would be called. 
The state of the destination
+   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be 
able to deal with it.
*
* Note that, one partition may have multiple committed data writers 
because of speculative tasks.
* Spark will pick the first successful one and get its commit message. 
Implementations should be
--- End diff --

The only way to guarantee that no more than one task can commit is if the underlying storage system guarantees that. There is no way to design something generic. It is simply not possible in a distributed system where network partitioning or message loss can occur.
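As a rough illustration (a hypothetical sketch, not Spark's actual OutputCommitCoordinator): a driver-side coordinator can deny later attempts it hears from, but it cannot stop an attempt it already authorized and then lost contact with, so it is only best-effort.

```
import scala.collection.mutable

// Made-up, driver-side, best-effort coordinator: the first attempt to ask for a partition wins.
final class BestEffortCommitCoordinator {
  private val winners = mutable.Map.empty[Int, Int] // partitionId -> winning attemptNumber

  def canCommit(partitionId: Int, attemptNumber: Int): Boolean = synchronized {
    winners.get(partitionId) match {
      case None         => winners(partitionId) = attemptNumber; true // first asker wins
      case Some(winner) => winner == attemptNumber // only the winning attempt may proceed
    }
  }
}

object CoordinatorSketch {
  def main(args: Array[String]): Unit = {
    val coordinator = new BestEffortCommitCoordinator
    println(coordinator.canCommit(partitionId = 0, attemptNumber = 0)) // true: first attempt authorized
    println(coordinator.canCommit(partitionId = 0, attemptNumber = 1)) // false: later attempt denied
    // An attempt that was authorized and then partitioned away may still write its commit,
    // so an exactly-one-committer guarantee ultimately needs an atomic primitive in the storage system.
  }
}
```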



---




[GitHub] spark issue #19596: [SPARK-22369][PYTHON][DOCS] Exposes catalog API document...

2017-11-02 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19596
  
Merging in master.



---




[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

2017-11-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19623#discussion_r148545942
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
 ---
@@ -50,28 +53,34 @@
 
   /**
* Creates a writer factory which will be serialized and sent to 
executors.
+   *
+   * If this method fails (by throwing an exception), the action would 
fail and no Spark job was
+   * submitted.
*/
   DataWriterFactory createWriterFactory();
 
   /**
* Commits this writing job with a list of commit messages. The commit 
messages are collected from
-   * successful data writers and are produced by {@link 
DataWriter#commit()}. If this method
-   * fails(throw exception), this writing job is considered to be failed, 
and
-   * {@link #abort(WriterCommitMessage[])} will be called. The written 
data should only be visible
-   * to data source readers if this method succeeds.
+   * successful data writers and are produced by {@link 
DataWriter#commit()}.
+   *
+   * If this method fails (by throwing an exception), this writing job is 
considered to to have been
+   * failed, and {@link #abort(WriterCommitMessage[])} would be called. 
The state of the destination
+   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be 
able to deal with it.
*
* Note that, one partition may have multiple committed data writers 
because of speculative tasks.
* Spark will pick the first successful one and get its commit message. 
Implementations should be
--- End diff --

You are not taking into account network partitioning. 


---




[GitHub] spark issue #19629: [SPARK-22408][SQL] RelationalGroupedDataset's distinct p...

2017-11-02 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19629
  
Merging in master.



---




[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

2017-11-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19623#discussion_r148527451
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
 ---
@@ -50,28 +53,34 @@
 
   /**
* Creates a writer factory which will be serialized and sent to 
executors.
+   *
+   * If this method fails (by throwing an exception), the action would 
fail and no Spark job was
+   * submitted.
*/
   DataWriterFactory createWriterFactory();
 
   /**
* Commits this writing job with a list of commit messages. The commit 
messages are collected from
-   * successful data writers and are produced by {@link 
DataWriter#commit()}. If this method
-   * fails(throw exception), this writing job is considered to be failed, 
and
-   * {@link #abort(WriterCommitMessage[])} will be called. The written 
data should only be visible
-   * to data source readers if this method succeeds.
+   * successful data writers and are produced by {@link 
DataWriter#commit()}.
+   *
+   * If this method fails (by throwing an exception), this writing job is 
considered to to have been
+   * failed, and {@link #abort(WriterCommitMessage[])} would be called. 
The state of the destination
+   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be 
able to deal with it.
*
* Note that, one partition may have multiple committed data writers 
because of speculative tasks.
* Spark will pick the first successful one and get its commit message. 
Implementations should be
--- End diff --

@steveloughran it is not really possible to enforce exclusivity this way in a general commit system. That would just be a best-effort thing to avoid duplicate work.


---




[GitHub] spark issue #19629: [SPARK-22408][SQL] RelationalGroupedDataset's distinct p...

2017-11-01 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19629
  
Jenkins, test this please.



---




[GitHub] spark issue #19626: [minor] Data source v2 docs update.

2017-11-01 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19626
  
Merging in master.



---




[GitHub] spark pull request #19626: [minor] Data source v2 docs update.

2017-11-01 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/19626

[minor] Data source v2 docs update.

## What changes were proposed in this pull request?
This patch includes some doc updates for data source API v2. I was reading 
the code and noticed some minor issues.

## How was this patch tested?
This is a doc only change.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark dsv2-update

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19626.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19626


commit 51bb2774609a6a8070c261e748ed6355761a6141
Author: Reynold Xin <r...@databricks.com>
Date:   2017-11-01T13:46:57Z

[minor] Data source v2 docs update.




---




[GitHub] spark issue #19626: [minor] Data source v2 docs update.

2017-11-01 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19626
  
cc @cloud-fan 


---




[GitHub] spark issue #19596: [SPARK-22369][PYTHON][DOCS] Exposes catalog API document...

2017-10-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19596
  
Yea definitely.



---




[GitHub] spark issue #19592: [SPARK-22347][SQL][PySpark] Support optionally running P...

2017-10-27 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19592
  
Is this complexity worth it? Can we just document it as a behavior that users need to be careful with?



---




[GitHub] spark pull request #18828: [SPARK-21619][SQL] Fail the execution of canonica...

2017-10-27 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18828#discussion_r147544081
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkPlanSuite.scala ---
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.test.SharedSQLContext
+
+class SparkPlanSuite extends QueryTest with SharedSQLContext {
+
+  test("SPARK-21619 execution of a canonicalized plan should fail") {
+val plan = spark.range(10).queryExecution.executedPlan.canonicalized
+
+intercept[IllegalStateException] { plan.execute() }
+intercept[IllegalStateException] { plan.executeCollect() }
+intercept[IllegalStateException] { plan.executeCollectPublic() }
+intercept[IllegalStateException] { plan.executeToIterator() }
+intercept[IllegalStateException] { plan.executeBroadcast() }
+intercept[IllegalStateException] { plan.executeTake(1) }
--- End diff --

That's not an issue with this test, is it? It's just how execution is done.



---




[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...

2017-10-23 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19468#discussion_r146429113
  
--- Diff: pom.xml ---
@@ -2649,6 +2649,13 @@
 
 
 
+  kubernetes
+  
+resource-managers/kubernetes/core
--- End diff --

That (keeping them separate) is actually pretty useful for SBT.



---




[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

2017-10-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19498
  
cc @tdas 


---




[GitHub] spark issue #19535: [SPARK-22313][PYTHON] Mark/print deprecation warnings as...

2017-10-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19535
  
Looks good at a high level.



---




[GitHub] spark issue #19512: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...

2017-10-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19512
  
Seems fine to backport into 2.2.



---




[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path

2017-10-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19269#discussion_r145579513
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java
 ---
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.writer;
+
+import java.io.Serializable;
+
+import org.apache.spark.annotation.InterfaceStability;
+
+/**
+ * A factory of {@link DataWriter} returned by {@link 
DataSourceV2Writer#createWriterFactory()},
+ * which is responsible for creating and initializing the actual data 
writer at executor side.
+ *
+ * Note that, the writer factory will be serialized and sent to executors, 
then the data writer
+ * will be created on executors and do the actual writing. So {@link 
DataWriterFactory} must be
+ * serializable and {@link DataWriter} doesn't need to be.
+ */
+@InterfaceStability.Evolving
+public interface DataWriterFactory extends Serializable {
+
+  /**
+   * Returns a data writer to do the actual writing work.
+   *
+   * @param stageId The id of the Spark stage that runs the returned 
writer.
+   * @param partitionId The id of the RDD partition that the returned 
writer will process.
+   * @param attemptNumber The attempt number of the Spark task that runs 
the returned writer, which
+   *  is usually 0 if the task is not a retried task 
or a speculative task.
+   */
+  DataWriter createWriter(int stageId, int partitionId, int 
attemptNumber);
--- End diff --

Not sure why we have stageId here. I'd make it more generic, e.g. a string for some job id, and then some numeric value (64-bit long) for the epoch.
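A hypothetical sketch of that more generic shape (illustrative only; the trait names are made up and the eventual Spark API may differ):

```
// Only the createWriter signature is the point here.
trait DataWriterSketch[T] {
  def write(record: T): Unit
}

trait DataWriterFactorySketch[T] extends Serializable {
  // An opaque job id plus a 64-bit epoch, instead of a Spark stage id.
  def createWriter(jobId: String, partitionId: Int, epochId: Long): DataWriterSketch[T]
}
```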


---




[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1

2017-10-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19521
  
LGTM


---




[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...

2017-10-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19524
  
seems fine.



---




[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...

2017-10-16 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19419
  
Yeah, in general for security features it seems like it's good to turn them on by default.



---




[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...

2017-10-16 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19419
  
Is there a reason why this cannot be always enabled?



---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-15 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18732
  
Grouped UDFs, or Grouped Vectorized UDFs.



---




[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-14 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19451#discussion_r144687542
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1242,6 +1244,51 @@ object ReplaceIntersectWithSemiJoin extends 
Rule[LogicalPlan] {
 }
 
 /**
+ * If one or both of the datasets in the logical [[Except]] operator are 
purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a 
[[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 
WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before 
flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case Except(left, right) if isEligible(left, right) =>
+  val filterCondition = 
combineFilters(right).asInstanceOf[Filter].condition
+
+  val attributeNameMap: Map[String, Attribute] = left.output.map(x => 
(x.name, x)).toMap
+  val transformedCondition = filterCondition transform { case a : 
AttributeReference =>
+attributeNameMap(a.name)
+  }
+
+  Distinct(Filter(Not(transformedCondition), left))
+  }
+
+  private def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = 
(left, right) match {
+case (left, right: Filter) => 
nonFilterChild(left).sameResult(nonFilterChild(right))
+case _ => false
+  }
+
+  private def nonFilterChild(plan: LogicalPlan) = 
plan.find(!_.isInstanceOf[Filter]).getOrElse {
+throw new IllegalStateException("LogicalPlan with no Local or Logical 
Relation")
--- End diff --

I left a comment before: we shouldn't throw an AnalysisException if it's an internal bug.


---




[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19451
  
If we have to do this all over again, I'd put all rules in their own files. "Replace" isn't really a great high-level category, because all rules at some level replace something.



---




[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-12 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19451
  
Actually, you already have it in the classdoc, so please just update the PR description with it.



---




[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19451#discussion_r144461898
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends 
Rule[LogicalPlan] {
 }
 
 /**
+ * If one or both of the datasets in the logical [[Except]] operator are 
purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a 
[[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 
WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before 
flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case Except(left, right) if isEligible(left, right) =>
+  val filterCondition = 
combineFilters(right).asInstanceOf[Filter].condition
+  Distinct(
+Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+  )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, 
right) match {
+case (left, right: Filter) => 
nonFilterChild(left).sameResult(nonFilterChild(right))
+case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = 
plan.find(!_.isInstanceOf[Filter]).get
--- End diff --

It shouldn't be an AnalysisException if there is a bug in Catalyst.



---




[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19451#discussion_r144461913
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends 
Rule[LogicalPlan] {
 }
 
 /**
+ * If one or both of the datasets in the logical [[Except]] operator are 
purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a 
[[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 
WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before 
flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case Except(left, right) if isEligible(left, right) =>
+  val filterCondition = 
combineFilters(right).asInstanceOf[Filter].condition
+  Distinct(
+Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+  )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, 
right) match {
+case (left, right: Filter) => 
nonFilterChild(left).sameResult(nonFilterChild(right))
+case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = 
plan.find(!_.isInstanceOf[Filter]).get
--- End diff --

But I agree we should throw a more meaningful error message in case there is a bug.


---




[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19451#discussion_r144461813
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends 
Rule[LogicalPlan] {
 }
 
 /**
+ * If one or both of the datasets in the logical [[Except]] operator are 
purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a 
[[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 
WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before 
flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
--- End diff --

I'd put this in a new file. The optimizer file is getting too long.



---




[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-12 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19451
  
Can you update the PR description with an example plan before/after this optimization, and also put that example in the comment section of the doc?



---




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-10-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18805
  
Does the package include a binary distribution for Linux?


---




[GitHub] spark issue #6751: [SPARK-8300] DataFrame hint for broadcast join.

2017-10-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/6751
  
Isn't the hint available in SQL?
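For reference, a hedged sketch of both forms of the hint (assuming the SQL hint syntax available in Spark 2.2+; table and column names are placeholders):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastHintExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("broadcast-hint").getOrCreate()
    import spark.implicits._

    val large = Seq((1, "a"), (2, "b")).toDF("id", "v")
    val small = Seq((1, "x")).toDF("id", "w")
    large.createOrReplaceTempView("large")
    small.createOrReplaceTempView("small")

    // DataFrame hint: mark the right side as broadcastable.
    large.join(broadcast(small), "id").show()

    // Equivalent SQL hint.
    spark.sql("SELECT /*+ BROADCAST(small) */ * FROM large JOIN small USING (id)").show()

    spark.stop()
  }
}
```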



---




[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...

2017-10-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19454
  
Honestly I don't think it is worth doing this.



---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18732
  
I'm OK with the naming. We can change them later if needed before the 
release.



---




[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...

2017-10-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19454
  
I actually think this can be confusing on Dataset[T], when the Dataset is just untyped, i.e. a DataFrame. Do we throw a runtime exception there?



---




[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...

2017-10-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19454
  
Is this worth doing?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-07 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r143340681
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -519,3 +519,18 @@ case class CoGroup(
 outputObjAttr: Attribute,
 left: LogicalPlan,
 right: LogicalPlan) extends BinaryNode with ObjectProducer
+
+case class FlatMapGroupsInPandas(
--- End diff --

I'd also put this somewhere else that's not object.scala 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143243338
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: 
Expression)
 }
 
 /**
+ * This modifies a timestamp to show how the display time changes going 
from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not 
meant to be exposed to
+ * users; it's only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(
--- End diff --

What I'm saying is the analysis rule can just determine the delta, and then 
just do a simple add/delete.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143122895
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/TimestampTableTimeZone.scala
 ---
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.{AnalysisException, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.analysis.UnresolvedException
+import org.apache.spark.sql.catalyst.catalog.CatalogTable
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types.{StringType, TimestampType}
+
+/**
+ * Apply a correction to data loaded from, or saved to, Parquet, so that 
its timestamps can be read
+ * like TIMESTAMP WITHOUT TIMEZONE.  This gives correct behavior if you 
process data with
+ * machines in different timezones, or if you access the data from 
multiple SQL engines.
+ */
+private[sql] case class TimestampTableTimeZone(sparkSession: SparkSession)
--- End diff --

I need to take a look at this again to make sure I understand what's 
happening. Conceptually what you are doing is pretty simple, and it doesn't seem 
right that it needs so many lines of code, but maybe I'm missing something.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143122657
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -230,6 +230,13 @@ case class AlterTableSetPropertiesCommand(
 isView: Boolean)
   extends RunnableCommand {
 
+  if (isView) {
+properties.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { _ =>
--- End diff --

Is there even a meaning to setting properties for views? We should either 
drop this check or have a more general check.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143122503
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -266,6 +267,10 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* @since 1.4.0
*/
   def insertInto(tableName: String): Unit = {
+extraOptions.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { 
tz =>
--- End diff --

We don't seem to be doing this type of validity check in general; otherwise 
we'd need to add a lot more checks here.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143122396
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -1015,6 +1020,10 @@ object DateTimeUtils {
 guess
   }
 
+  def convertTz(ts: SQLTimestamp, fromZone: String, toZone: String): 
SQLTimestamp = {
+convertTz(ts, getTimeZone(fromZone), getTimeZone(toZone))
--- End diff --

Performance is going to suck here.
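
To make the concern concrete, a hedged sketch of keeping the zone-id lookup out of the per-row path by resolving the TimeZone objects once (zone ids and timestamp values below are illustrative):

    import java.util.TimeZone
    import org.apache.spark.sql.catalyst.util.DateTimeUtils

    // Resolve the zones once, outside the per-row loop
    val fromTz = TimeZone.getTimeZone("UTC")
    val toTz   = TimeZone.getTimeZone("America/Los_Angeles")

    // Per row: use the TimeZone-based overload, so no string lookup sits in the hot path
    val timestamps: Array[Long] = Array(0L, 1000000L)   // SQLTimestamp (microsecond) values
    val converted = timestamps.map(ts => DateTimeUtils.convertTz(ts, fromTz, toTz))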


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143122317
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: 
Expression)
 }
 
 /**
+ * This modifies a timestamp to show how the display time changes going 
from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not 
meant to be exposed to
+ * users; it's only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(
--- End diff --

Do we need a whole expression for this? Can't we just reuse existing 
expressions? It's just simple arithmetic, isn't it?
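
A hedged sketch of what reusing existing expressions could look like, composing the ToUTCTimestamp/FromUTCTimestamp pair that already exists instead of introducing a new node (tableTz and sessionTz are illustrative placeholders):

    import org.apache.spark.sql.catalyst.expressions.{Expression, FromUTCTimestamp, Literal, ToUTCTimestamp}

    // Re-render a timestamp from the table's zone to the session's zone by going through UTC.
    def timezoneCorrection(ts: Expression, tableTz: String, sessionTz: String): Expression =
      FromUTCTimestamp(ToUTCTimestamp(ts, Literal(tableTz)), Literal(sessionTz))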


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19394: [SPARK-22170][SQL] Reduce memory consumption in broadcas...

2017-10-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19394
  
What's the other value?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19394: [SPARK-22170][SQL] Reduce memory consumption in broadcas...

2017-10-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19394
  
Not sure - maybe print the chi-square values from the test and see if they make 
sense. If they do, we can change the threshold.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-09-30 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18732
  
What's the difference between this one and the transform function you also 
proposed? I'm trying to see if all the naming makes sense when considered 
together.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19393: [SPARK-21644][SQL] LocalLimit.maxRows is defined incorre...

2017-09-30 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19393
  
LGTM but I wrote most of the code so perhaps we should find somebody else 
to review.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-09-30 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18732
  
Is this just a mapGroups function?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19387: [SPARK-22160][SQL] Make sample points per partiti...

2017-09-28 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19387#discussion_r141786663
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -108,11 +108,21 @@ class HashPartitioner(partitions: Int) extends 
Partitioner {
 class RangePartitioner[K : Ordering : ClassTag, V](
 partitions: Int,
 rdd: RDD[_ <: Product2[K, V]],
-private var ascending: Boolean = true)
+private var ascending: Boolean = true,
+val samplePointsPerPartitionHint: Int = 20)
   extends Partitioner {
 
+  // A constructor declared in order to maintain backward compatibility 
for Java, when we add the
+  // 4th constructor parameter samplePointsPerPartitionHint. See 
SPARK-22160.
+  // This is added to make sure from a bytecode point of view, there is 
still a 3-arg ctor.
+  def this(partitions: Int, rdd: RDD[_ <: Product2[K, V]], ascending: 
Boolean) = {
+this(partitions, rdd, ascending, samplePointsPerPartitionHint = 20)
--- End diff --

That one has been there for much longer so I'd rather change the SQL 
default first and see what happens.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19387: [SPARK-22160][SQL] Make sample points per partition (in ...

2017-09-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19387
  
Merging in master.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19387: [SPARK-22160][SQL] Make sample points per partition (in ...

2017-09-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19387
  
I put up a comment saying this test result should be deterministic, since 
the sampling uses a fixed seed based on partition id.
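
A simplified, hedged sketch of the deterministic seeding being referred to (modeled on RangePartitioner's sampling, not the exact code; the helper name is illustrative):

    import scala.util.hashing.byteswap32

    // The same (rddId, partitionIndex) pair always yields the same seed, so repeated
    // runs sample the same points and the resulting partition histogram is reproducible.
    def partitionSeed(rddId: Int, partitionIndex: Int): Int =
      byteswap32(partitionIndex ^ (rddId << 16))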



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19387: [SPARK-22160][SQL] Make sample points per partiti...

2017-09-28 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19387#discussion_r141764431
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ConfigBehaviorSuite.scala ---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.commons.math3.stat.inference.ChiSquareTest
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+
+class ConfigBehaviorSuite extends QueryTest with SharedSQLContext {
+
+  import testImplicits._
+
+  test("SPARK-22160 
spark.sql.execution.rangeExchange.sampleSizePerPartition") {
+// In this test, we run a sort and compute the histogram for partition 
size post shuffle.
+// With a high sample count, the partition size should be more evenly 
distributed, and has a
+// low chi-sq test value.
+
+val numPartitions = 4
+
+def computeChiSquareTest(): Double = {
+  val n = 10000
+  // Trigger a sort
+  val data = spark.range(0, n, 1, 1).sort('id)
+.selectExpr("SPARK_PARTITION_ID() pid", "id").as[(Int, 
Long)].collect()
+
+  // Compute histogram for the number of records per partition post 
sort
+  val dist = data.groupBy(_._1).map(_._2.length.toLong).toArray
+  assert(dist.length == 4)
+
+  new ChiSquareTest().chiSquare(
+Array.fill(numPartitions) { n.toDouble / numPartitions },
+dist)
+}
+
+withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> numPartitions.toString) {
+  // The default chi-sq value should be low
+  assert(computeChiSquareTest() < 100)
+
+  withSQLConf(SQLConf.RANGE_EXCHANGE_SAMPLE_SIZE_PER_PARTITION.key -> 
"1") {
+// If we only sample one point, the range boundaries will be 
pretty bad and the
+// chi-sq value would be very high.
+assert(computeChiSquareTest() > 1000)
--- End diff --

The value I got from my laptop was 1800.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19387: [SPARK-22160][SQL] Make sample points per partiti...

2017-09-28 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19387#discussion_r141764415
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ConfigBehaviorSuite.scala ---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.commons.math3.stat.inference.ChiSquareTest
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+
+class ConfigBehaviorSuite extends QueryTest with SharedSQLContext {
+
+  import testImplicits._
+
+  test("SPARK-22160 
spark.sql.execution.rangeExchange.sampleSizePerPartition") {
+// In this test, we run a sort and compute the histogram for partition 
size post shuffle.
+// With a high sample count, the partition size should be more evenly 
distributed, and has a
+// low chi-sq test value.
+
+val numPartitions = 4
+
+def computeChiSquareTest(): Double = {
+  val n = 10000
+  // Trigger a sort
+  val data = spark.range(0, n, 1, 1).sort('id)
+.selectExpr("SPARK_PARTITION_ID() pid", "id").as[(Int, 
Long)].collect()
+
+  // Compute histogram for the number of records per partition post 
sort
+  val dist = data.groupBy(_._1).map(_._2.length.toLong).toArray
+  assert(dist.length == 4)
+
+  new ChiSquareTest().chiSquare(
+Array.fill(numPartitions) { n.toDouble / numPartitions },
+dist)
+}
+
+withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> numPartitions.toString) {
+  // The default chi-sq value should be low
+  assert(computeChiSquareTest() < 100)
--- End diff --

100 - which is pretty high.

The actual value computed on my laptop is around 10, so 1000 is already 
two orders of magnitude larger.
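
For a sense of scale, a hedged worked example of the statistic involved, chi^2 = sum over partitions of (observed - expected)^2 / expected, taking 10,000 rows over 4 partitions for illustration (partition counts below are made up):

    val expected = 2500.0                               // 10,000 rows / 4 partitions
    def chiSq(observed: Array[Double]): Double =
      observed.map(o => (o - expected) * (o - expected) / expected).sum

    chiSq(Array(2510.0, 2490.0, 2505.0, 2495.0))        // ~0.1,   comfortably below 100
    chiSq(Array(4000.0, 1000.0, 3000.0, 2000.0))        // 2000.0, well above 1000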


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19387: [SPARK-22160][SQL] Make sample points per partiti...

2017-09-28 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19387#discussion_r141755874
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -108,9 +108,17 @@ class HashPartitioner(partitions: Int) extends 
Partitioner {
 class RangePartitioner[K : Ordering : ClassTag, V](
 partitions: Int,
 rdd: RDD[_ <: Product2[K, V]],
-private var ascending: Boolean = true)
+private var ascending: Boolean = true,
+val samplePointsPerPartitionHint: Int = 20)
--- End diff --

done


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19387: [SPARK-22160][SQL] Allow changing sample points p...

2017-09-28 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/19387

[SPARK-22160][SQL] Allow changing sample points per partition in range 
shuffle exchange

## What changes were proposed in this pull request?
Spark's RangePartitioner hard-codes the number of sampling points per 
partition to be 20. This is sometimes too low. This ticket makes it 
configurable, via spark.sql.execution.rangeExchange.sampleSizePerPartition, and 
raises the default in Spark SQL to 100.

## How was this patch tested?
Added a pretty sophisticated test based on a chi-square test ...
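
A hedged usage sketch of the new knob (the value and output path are illustrative):

    // Raise the per-partition sample size before a range-partitioned sort of skewed data
    spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", 300)
    spark.range(0, 1000000).sort("id").write.parquet("/tmp/sorted-output")   // illustrative path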


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-22160

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19387.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19387


commit 843721b38e0a2385253053d475a855161dbc451c
Author: Reynold Xin <r...@databricks.com>
Date:   2017-09-28T21:36:34Z

[SPARK-22160][SQL] Allow changing sample points per partition in range 
shuffle exchange

(cherry picked from commit 8e51ae52b6d54ed46a3441bbb83a8e93ba214410)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit b46c92bf73b486ebc494b44be3c392f4bcd0a7c9
Author: Reynold Xin <r...@databricks.com>
Date:   2017-09-28T22:51:04Z

Add a test




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19384: [SPARK-22159][SQL] Make config names consistently end wi...

2017-09-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19384
  
I reverted the 2nd commit. Should be good for merge now.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19384: [SPARK-22159][SQL] Make config names consistently end wi...

2017-09-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19384
  
hm the 2nd commit is not meant for this one.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19384: [SPARK-22159][SQL] Make config names consistently...

2017-09-28 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/19384

[SPARK-22159][SQL] Make config names consistently end with "enabled".

## What changes were proposed in this pull request?
spark.sql.execution.arrow.enable and 
spark.sql.codegen.aggregate.map.twolevel.enable -> enabled

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-22159

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19384.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19384


commit d41e34720073be568053f220946cb01ea2ff61ef
Author: Reynold Xin <r...@databricks.com>
Date:   2017-09-28T18:42:41Z

[SPARK-22159][SQL] Make config names consistently end with "enabled".

spark.sql.execution.arrow.enable and 
spark.sql.codegen.aggregate.map.twolevel.enable -> enabled




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19376: [SPARK-22153][SQL] Rename ShuffleExchange -> Shuf...

2017-09-27 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/19376

[SPARK-22153][SQL] Rename ShuffleExchange -> ShuffleExchangeExec

## What changes were proposed in this pull request?
For some reason when we added the Exec suffix to all physical operators, we 
missed this one. I was looking for this physical operator today and couldn't 
find it, because I was looking for ExchangeExec.

## How was this patch tested?
This is a simple rename and should be covered by existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-22153

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19376.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19376


commit 9e9627dba1157499798a06f21d5aedf83dbe9acd
Author: Reynold Xin <r...@databricks.com>
Date:   2017-09-28T04:18:07Z

[SPARK-22153][SQL] Rename ShuffleExchange -> ShuffleExchangeExec

commit 1de6165cbe6abe3af5dee7882b8cd9185493
Author: Reynold Xin <r...@databricks.com>
Date:   2017-09-28T04:19:43Z

Fix 100 char




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19362: [SPARK-22141][SQL] Propagate empty relation befor...

2017-09-27 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19362#discussion_r141403817
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -136,6 +134,8 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
 Batch("LocalRelation", fixedPoint,
   ConvertToLocalRelation,
   PropagateEmptyRelation) ::
+Batch("Check Cartesian Products", Once,
--- End diff --

We should also add a comment here about the positioning ...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19269: [SPARK-22026][SQL][WIP] data source v2 write path

2017-09-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19269#discussion_r140379971
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
 ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.writer;
+
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.sources.v2.DataSourceV2Options;
+import org.apache.spark.sql.sources.v2.WriteSupport;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * A data source writer that is returned by
+ * {@link WriteSupport#createWriter(StructType, SaveMode, 
DataSourceV2Options)}.
+ * It can mix in various writing optimization interfaces to speed up the 
data saving. The actual
+ * writing logic is delegated to {@link WriteTask} that is returned by 
{@link #createWriteTask()}.
+ *
+ * The writing procedure is:
+ *   1. Create a write task by {@link #createWriteTask()}, serialize and 
send it to all the
+ *  partitions of the input data(RDD).
+ *   2. For each partition, create a data writer with the write task, and 
write the data of the
--- End diff --

That doesn't really help if one of the tasks fails and gets relaunched.
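
A hedged sketch, using hypothetical names (CommitArbiter and AttemptWriter are not real Spark classes), of the failure mode being pointed out: a retried task leaves two attempts that both finish writing, and without a central arbiter both would go on to commit:

    // Hypothetical interfaces, for illustration only.
    trait AttemptWriter { def commit(): Unit; def abort(): Unit }
    trait CommitArbiter { def canCommit(partitionId: Int, attemptNumber: Int): Boolean }

    // Only one attempt per partition should be allowed through; the losing attempt aborts.
    def finishAttempt(arbiter: CommitArbiter, writer: AttemptWriter,
                      partitionId: Int, attemptNumber: Int): Unit = {
      if (arbiter.canCommit(partitionId, attemptNumber)) writer.commit() else writer.abort()
    }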



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19269: [SPARK-22026][SQL][WIP] data source v2 write path

2017-09-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19269#discussion_r139889741
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Command.scala
 ---
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.spark.{SparkException, TaskContext}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, 
RowEncoder}
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.command.RunnableCommand
+import org.apache.spark.sql.sources.v2.writer._
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.Utils
+
+case class WriteToDataSourceV2Command(writer: DataSourceV2Writer, query: 
LogicalPlan)
+  extends RunnableCommand {
--- End diff --

RunnableCommand simply means it's both a logical plan and a physical plan.

We should fix the UI issue separately (which on its own is super annoying).
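
For readers unfamiliar with the trait, a minimal hedged sketch of that shape (run() signature as of Spark 2.x; IllustrativeCommand is not a real Spark class):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.execution.command.RunnableCommand

    // A logical plan node whose run() is executed eagerly; the same node also
    // stands in for the (trivial) physical side.
    case class IllustrativeCommand(message: String) extends RunnableCommand {
      override def run(sparkSession: SparkSession): Seq[Row] = {
        sparkSession.sparkContext.setJobDescription(message)   // arbitrary side effect
        Seq.empty[Row]
      }
    }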



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19269: [SPARK-22026][SQL][WIP] data source v2 write path

2017-09-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19269#discussion_r139889045
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/SupportsWriteUnsafeRow.java
 ---
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.writer;
+
+import org.apache.spark.annotation.Experimental;
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+
+/**
+ * A mix-in interface for {@link DataSourceV2Writer}. Data source writers 
can implement this
+ * interface to write {@link UnsafeRow} directly and avoid the row copy at 
Spark side.
+ * This is an experimental and unstable interface, as {@link UnsafeRow} is 
not public and may get
+ * changed in the future Spark versions.
+ */
+
+@InterfaceStability.Evolving
+@Experimental
+@InterfaceStability.Unstable
+public interface SupportsWriteUnsafeRow extends DataSourceV2Writer {
--- End diff --

Why do we need this?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19269: [SPARK-22026][SQL][WIP] data source v2 write path

2017-09-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19269#discussion_r13951
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriter.java 
---
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.writer;
+
+import org.apache.spark.annotation.InterfaceStability;
+
+/**
+ * A data writer returned by {@link WriteTask#createWriter(int, int, int)} 
that is responsible for
+ * writing data for an input RDD partition.
+ *
+ * Note that currently the type `T` can only be {@link 
org.apache.spark.sql.Row} for normal data
+ * source writers, or {@link 
org.apache.spark.sql.catalyst.expressions.UnsafeRow} for data source
+ * writers that mix in {@link SupportsWriteUnsafeRow}.
+ */
+@InterfaceStability.Evolving
+public interface DataWriter<T> {
+
+  void write(T record);
--- End diff --

not allowed :)

We should specify all the exception behavior in all the APIs.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18704: [SPARK-20783][SQL] Create ColumnVector to abstract exist...

2017-09-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18704
  
cc @michal-databricks any thoughts on this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19261: [SPARK-22040] Add current_date function with timezone id

2017-09-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19261
  
What does this even mean?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19136: [SPARK-15689][SQL] data source v2 read path

2017-09-14 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19136
  
LGTM. 

Still some feedback that can be addressed later. We should also document 
all the APIs as Evolving.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2 read path

2017-09-14 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138947707
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
 ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.spark.{InterruptibleIterator, Partition, SparkContext, 
TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow
+import org.apache.spark.sql.sources.v2.reader.ReadTask
+
+class DataSourceRDDPartition(val index: Int, val readTask: 
ReadTask[UnsafeRow])
+  extends Partition with Serializable
+
+class DataSourceRDD(
+sc: SparkContext,
+@transient private val generators: java.util.List[ReadTask[UnsafeRow]])
--- End diff --

Why is this called generators?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


