date:20171107

[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19390
  
**[Test build #83548 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83548/testReport)**
 for PR 19390 at commit 
[`dbf7875`](https://github.com/apache/spark/commit/dbf787593d72afe16cadd9731a1dd1cba608451c).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...

2017-11-07 Thread skonto

Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/19390#discussion_r149367812
  
--- Diff: 
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala
 ---
@@ -451,15 +465,20 @@ trait MesosSchedulerUtils extends Logging {
   }
 
   /** Creates a mesos resource for a specific port number. */
-  private def createResourcesFromPorts(portsAndRoles: List[(Long, 
String)]) : List[Resource] = {
-portsAndRoles.flatMap{ case (port, role) =>
-  createMesosPortResource(List((port, port)), Some(role))}
+  private def createResourcesFromPorts(
+   portsAndResourcesInfo: List[(Long, (String, AllocationInfo, 
Option[ReservationInfo]))])
+: List[Resource] = {
+portsAndResourcesInfo.flatMap{ case (port, rInfo) =>
+  createMesosPortResource(List((port, port)), Option(rInfo._1),
--- End diff --

fixed


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19640: [SPARK-16986][CORE][WEB-UI] Support configure his...

2017-11-07 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19640#discussion_r149365816
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2742,6 +2742,11 @@ private[spark] object Utils extends Logging {
 }
   }
 
+  def getTimeZone: TimeZone = {
+val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
--- End diff --

Can we make SparkConf as a input param instead of create a new instance for 
every function call?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19479
  
**[Test build #83547 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83547/testReport)**
 for PR 19479 at commit 
[`fa338dd`](https://github.com/apache/spark/commit/fa338ddcb655f6e421b1be35fdd8dcd5cd866df0).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19684: [SPARK-22461][ML] Move Spark ML model summaries into a d...

2017-11-07 Thread mgaido91

Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/19684
  
@srowen , sorry, I haven't seen it. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19684: [SPARK-22461][ML] Move Spark ML model summaries i...

2017-11-07 Thread mgaido91

Github user mgaido91 closed the pull request at:

https://github.com/apache/spark/pull/19684


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19684: [SPARK-22461][ML] Move Spark ML model summaries into a d...

2017-11-07 Thread srowen

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19684
  
This duplicates https://github.com/apache/spark/pull/19680 @mgaido91 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19680: [SPARK-22641][ML] Refactor Spark ML model summaries

2017-11-07 Thread srowen

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19680
  
Oh, this is tagged to the wrong JIRA. Should be SPARK-22461


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19684: [SPARK-22461][ML] Move Spark ML model summaries i...

2017-11-07 Thread mgaido91

GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/19684

[SPARK-22461][ML] Move Spark ML model summaries into a dedicated package

## What changes were proposed in this pull request?

Added a common abstraction (the `Summary` trait) for all the summaries in 
ML and moved them all to a new `summary` package.

## How was this patch tested?

Existing UTs and manual execution of the changed examples

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-22461

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19684.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19684


commit a663b7bbcb1591cd123b95478bb9771ed3674135
Author: Marco Gaido 
Date:   2017-11-07T11:58:27Z

[SPARK-22461][ML] Move Spark ML model summaries into a dedicated package




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...

2017-11-07 Thread ArtRand

Github user ArtRand commented on a diff in the pull request:

https://github.com/apache/spark/pull/19631#discussion_r149361297
  
--- Diff: core/src/main/scala/org/apache/spark/SecurityManager.scala ---
@@ -542,7 +496,55 @@ private[spark] class SecurityManager(
* Gets the secret key.
* @return the secret key as a String if authentication is enabled, 
otherwise returns null
*/
-  def getSecretKey(): String = secretKey
+  def getSecretKey(): String = {
+if (isAuthenticationEnabled) {
+  Option(sparkConf.getenv(ENV_AUTH_SECRET))
+.orElse(sparkConf.getOption(SPARK_AUTH_SECRET_CONF))
+.getOrElse {
+  throw new IllegalArgumentException(
+s"A secret key must be specified via the 
$SPARK_AUTH_SECRET_CONF config")
+}
+} else {
+  null
+}
+  }
+
+  /**
+   * Initialize the configuration object held by this class for 
authentication.
+   *
+   * If authentication is disabled, do nothing.
+   *
+   * In YARN mode, generate a secret key and store it in the configuration 
object, setting it up to
+   * also be propagated to executors using an env variable.
+   *
+   * In other modes, assert that the auth secret is set in the 
configuration.
+   */
+  def initializeAuth(): Unit = {
+if (!sparkConf.get(NETWORK_AUTH_ENABLED)) {
+  return
+}
+
+if (sparkConf.get(SparkLauncher.SPARK_MASTER, null) != "yarn") {
+  require(sparkConf.contains(SPARK_AUTH_SECRET_CONF),
+s"A secret key must be specified via the $SPARK_AUTH_SECRET_CONF 
config.")
+  return
+}
+
+// In YARN, force creation of a new secret if this is client mode. 
This ensures each
--- End diff --

Is there a reason this _has_ to be unique to YARN? Will this solve the 
problem (in Mesos currently) where when the Executors 
[bootstrap](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L193)
 they do so without security (unless you "bake" the secret and secret config 
into the container image)?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19649
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19649
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83538/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19649
  
**[Test build #83538 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83538/testReport)**
 for PR 19649 at commit 
[`c79c314`](https://github.com/apache/spark/commit/c79c31452d285c3b36b8adda55b9daccd6cea0d4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class AlterTablePreEvent(`
  * `case class AlterTableEvent(`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19674
  
@ueshin, now I am getting what you meant by DST. I think you roughly knew 
this problem but let me describe it more given my debugging.

Looks the problem is in `mktime` and sounds `mktime` is platform-dependent 
(roughly assuming from codes and docs).

I made a minimised reproducer:

```python
import time
import os
from datetime import datetime
os.environ["TZ"] = "America/Los_Angeles"
time.tzset()
time.mktime(datetime(2100, 4, 4, 4, 4, 4).timetuple())
```

My local, it prints: 

```
4110523444.0
```

On Unbuntu 14.04:

```
4110519844.0
```

Jenkins, it prints:

```
4110519844.0
```

I am not sure if this is easily fixable within Spark as it looks dependent 
on Python implementation and/or the underlying C library, up to my knowledge 
and from my reading some docs.

Could you maybe avoid this time within DST for now in your PR? I currently 
don't have a good idea to fix this with a simple and surgical fix


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19674
  
**[Test build #83546 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83546/testReport)**
 for PR 19674 at commit 
[`d03f768`](https://github.com/apache/spark/commit/d03f7689d531a1421af2f90b6c7735f0184d8b76).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19674
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83546/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19674
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19208
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83534/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19208
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19208
  
**[Test build #83534 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83534/testReport)**
 for PR 19208 at commit 
[`654e4d5`](https://github.com/apache/spark/commit/654e4d580889dcd2fcf7c0bea2060349190faaac).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19674
  
**[Test build #83546 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83546/testReport)**
 for PR 19674 at commit 
[`d03f768`](https://github.com/apache/spark/commit/d03f7689d531a1421af2f90b6c7735f0184d8b76).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19667: [SPARK-21127][SQL][followup] fix a config name typo

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19667
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83537/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19667: [SPARK-21127][SQL][followup] fix a config name typo

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19667
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19667: [SPARK-21127][SQL][followup] fix a config name typo

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19667
  
**[Test build #83537 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83537/testReport)**
 for PR 19667 at commit 
[`55b949e`](https://github.com/apache/spark/commit/55b949e0e039ea981aeb69d6c3699c829071e368).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19683
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

2017-11-07 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19479#discussion_r149344559
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala
 ---
@@ -275,6 +317,118 @@ object ColumnStat extends Logging {
   avgLen = row.getLong(4),
   maxLen = row.getLong(5)
 )
+if (row.isNullAt(6)) {
+  cs
+} else {
+  val ndvs = row.getArray(6).toLongArray()
+  assert(percentiles.get.length == ndvs.length + 1)
+  val endpoints = percentiles.get.map(_.toString.toDouble)
+  // Construct equi-height histogram
+  val buckets = ndvs.zipWithIndex.map { case (ndv, i) =>
+EquiHeightBucket(endpoints(i), endpoints(i + 1), ndv)
+  }
+  val nonNullRows = rowCount - cs.nullCount
+  val ehHistogram = EquiHeightHistogram(nonNullRows.toDouble / 
ndvs.length, buckets)
+  cs.copy(histogram = Some(ehHistogram))
+}
+  }
+
+}
+
+/**
+ * There are a few types of histograms in state-of-the-art estimation 
methods. E.g. equi-width
+ * histogram, equi-height histogram, frequency histogram (value-frequency 
pairs) and hybrid
+ * histogram, etc.
+ * Currently in Spark, we support equi-height histogram since it is good 
at handling skew
+ * distribution, and also provides reasonable accuracy in other cases.
+ * We can add other histograms in the future, which will make estimation 
logic more complicated.
+ * This is because we will have to deal with computation between different 
types of histograms in
+ * some cases, e.g. for join columns.
+ */
+trait Histogram
+
+/**
+ * Equi-height histogram represents column value distribution by a 
sequence of buckets. Each bucket
+ * has a value range and contains approximately the same number of rows.
+ * @param height number of rows in each bucket
+ * @param ehBuckets equi-height histogram buckets
+ */
+case class EquiHeightHistogram(height: Double, ehBuckets: 
Array[EquiHeightBucket])
+  extends Histogram {
+
+  // Only for histogram equality test.
+  override def equals(other: Any): Boolean = other match {
+case otherEHH: EquiHeightHistogram =>
+  height == otherEHH.height && 
ehBuckets.sameElements(otherEHH.ehBuckets)
+case _ => false
+  }
+
+  override def hashCode(): Int = super.hashCode()
+}
+
+/**
+ * A bucket in an equi-height histogram. We use double type for 
lower/higher bound for simplicity.
+ * @param lo lower bound of the value range in this bucket
+ * @param hi higher bound of the value range in this bucket
+ * @param ndv approximate number of distinct values in this bucket
+ */
+case class EquiHeightBucket(lo: Double, hi: Double, ndv: Long)
+
+object HistogramSerializer {
+  // A flag to indicate the type of histogram
+  val EQUI_HEIGHT_HISTOGRAM_TYPE: Byte = 1
+
+  /**
+   * Serializes a given histogram to a string. For advanced statistics 
like histograms, sketches,
+   * etc, we don't provide readability for their serialized formats in 
metastore (as
+   * string-to-string table properties). This is because it's hard or 
unnatural for these
+   * statistics to be human readable. For example, histogram is probably 
split into multiple
+   * key-value properties, instead of a single, self-described property. 
And for
+   * count-min-sketch, it's essentially unnatural to make it a readable 
string.
+   */
+  final def serialize(histogram: Histogram): String = histogram match {
+case h: EquiHeightHistogram =>
+  // type + numBuckets + height + numBuckets * (lo + hi + ndv)
--- End diff --

Thanks for the advice. The default number of buckets is 254. Tests showed 
that after compression, the serialized length is reduced by more than 50%.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...

2017-11-07 Thread uzadude

GitHub user uzadude opened a pull request:

https://github.com/apache/spark/pull/19683

[SPARK-21657][SQL] optimize explode quadratic memory consumpation

## What changes were proposed in this pull request?

The issue has been raised in two Jira tickets: 
[SPARK-21657](https://issues.apache.org/jira/browse/SPARK-21657), 
[SPARK-16998](https://issues.apache.org/jira/browse/SPARK-16998). Basically, 
what happens is that in collection generators like explode/inline we create 
many rows from each row. Currently each exploded row contains also the column 
on which it was created. This causes, for example, if we have a 10k array in 
one row that this array will get copy 10k times - to each of the row. this 
results a qudratic memory consumption. However, it is a common case that the 
original column gets projected out after the explode, so we can avoid 
duplicating it.
In this solution we propose to identify this situation in the optimizer and 
turn on a flag for omitting the original column in the generation process.

## How was this patch tested?

1. We added a benchmark test to MiscBenchmark that shows x16 improvement in 
runtimes.
2. We ran some of the other tests in MiscBenchmark and they show 15% 
improvements.
3. We ran this code on a specific case from our production data with rows 
containing arrays of size ~200k and it reduced the runtime from 6 hours to 3 
mins.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uzadude/spark optimize_explode

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19683.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19683


commit ce7c3694a99584348957dc756234bb667466be4e
Author: oraviv 
Date:   2017-11-07T11:34:21Z

[SPARK-21657][SQL] optimize explode quadratic memory consumpation




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19657
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19657
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83544/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19657
  
**[Test build #83544 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83544/testReport)**
 for PR 19657 at commit 
[`ca5349b`](https://github.com/apache/spark/commit/ca5349bfc0dae03c2402b104e51c78a841541b09).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19657
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83542/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19657
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19657
  
**[Test build #83542 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83542/testReport)**
 for PR 19657 at commit 
[`31f3bd0`](https://github.com/apache/spark/commit/31f3bd06cc7d2b7bf482eddfe2f2738244cfbca7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19479
  
**[Test build #83545 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83545/testReport)**
 for PR 19479 at commit 
[`804b375`](https://github.com/apache/spark/commit/804b37565b2f5d61edd492d415475a59afec41f5).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

2017-11-07 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19479#discussion_r149341693
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala
 ---
@@ -216,65 +218,61 @@ object ColumnStat extends Logging {
 }
   }
 
-  /**
-   * Constructs an expression to compute column statistics for a given 
column.
-   *
-   * The expression should create a single struct column with the 
following schema:
-   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, 
maxLen: Long
-   *
-   * Together with [[rowToColumnStat]], this function is used to create 
[[ColumnStat]] and
-   * as a result should stay in sync with it.
-   */
-  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
-def struct(exprs: Expression*): CreateNamedStruct = 
CreateStruct(exprs.map { expr =>
-  expr.transformUp { case af: AggregateFunction => 
af.toAggregateExpression() }
-})
-val one = Literal(1, LongType)
+  private def convertToHistogram(s: String): EquiHeightHistogram = {
+val idx = s.indexOf(",")
+if (idx <= 0) {
+  throw new AnalysisException("Failed to parse histogram.")
+}
+val height = s.substring(0, idx).toDouble
+val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r
+val buckets = pattern.findAllMatchIn(s).map { m =>
+  EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, 
m.group(3).toLong)
+}.toSeq
+EquiHeightHistogram(height, buckets)
+  }
 
-// the approximate ndv (num distinct value) should never be larger 
than the number of rows
-val numNonNulls = if (col.nullable) Count(col) else Count(one)
-val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
-val numNulls = Subtract(Count(one), numNonNulls)
-val defaultSize = Literal(col.dataType.defaultSize, LongType)
+}
 
-def fixedLenTypeStruct(castType: DataType) = {
-  // For fixed width types, avg size should be the same as max size.
-  struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), 
numNulls, defaultSize,
-defaultSize)
-}
+/**
+ * There are a few types of histograms in state-of-the-art estimation 
methods. E.g. equi-width
+ * histogram, equi-height histogram, frequency histogram (value-frequency 
pairs) and hybrid
+ * histogram, etc.
+ * Currently in Spark, we support equi-height histogram since it is good 
at handling skew
+ * distribution, and also provides reasonable accuracy in other cases.
+ * We can add other histograms in the future, which will make estimation 
logic more complicated.
--- End diff --

OK. Removed the trait `Histogram`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19661: [SPARK-22450][Core][Mllib]safely register class f...

2017-11-07 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19661#discussion_r14934
  
--- Diff: 
core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -178,10 +178,40 @@ class KryoSerializer(conf: SparkConf)
 
kryo.register(Utils.classForName("scala.collection.immutable.Map$EmptyMap$"))
 kryo.register(classOf[ArrayBuffer[Any]])
 
+// We can't load those class directly in order to avoid unnecessary 
jar dependencies.
+// We load them safely, ignore it if the class not found.
+Seq("org.apache.spark.mllib.linalg.Vector",
+  "org.apache.spark.mllib.linalg.DenseVector",
+  "org.apache.spark.mllib.linalg.SparseVector",
+  "org.apache.spark.mllib.linalg.Matrix",
+  "org.apache.spark.mllib.linalg.DenseMatrix",
+  "org.apache.spark.mllib.linalg.SparseMatrix",
+  "org.apache.spark.ml.linalg.Vector",
+  "org.apache.spark.ml.linalg.DenseVector",
+  "org.apache.spark.ml.linalg.SparseVector",
+  "org.apache.spark.ml.linalg.Matrix",
+  "org.apache.spark.ml.linalg.DenseMatrix",
+  "org.apache.spark.ml.linalg.SparseMatrix",
+  "org.apache.spark.ml.feature.Instance",
+  "org.apache.spark.ml.feature.OffsetInstance"
+).flatMap(safeClassLoader(_)).foreach(kryo.register(_))
--- End diff --

please inline this `safeClassLoader`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19156: [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of d...

2017-11-07 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19156
  
the SQL part LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19674
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83541/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19674
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19674
  
**[Test build #83541 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83541/testReport)**
 for PR 19674 at commit 
[`3ef8ec0`](https://github.com/apache/spark/commit/3ef8ec0acb23cd6e10346dbbb6d2d6510e019ac1).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19682: [SPARK-22464] [SQL] No pushdown for Hive metastore parti...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19682
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19682: [SPARK-22464] [SQL] No pushdown for Hive metastore parti...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19682
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83535/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19682: [SPARK-22464] [SQL] No pushdown for Hive metastore parti...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19682
  
**[Test build #83535 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83535/testReport)**
 for PR 19682 at commit 
[`be0e276`](https://github.com/apache/spark/commit/be0e276a371319eb19467707a105d968d685a0c3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83532/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19657
  
**[Test build #83544 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83544/testReport)**
 for PR 19657 at commit 
[`ca5349b`](https://github.com/apache/spark/commit/ca5349bfc0dae03c2402b104e51c78a841541b09).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #83532 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83532/testReport)**
 for PR 16578 at commit 
[`48a509e`](https://github.com/apache/spark/commit/48a509e8602ed44a4a0fd5268d91d917bb8e0748).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19651
  
**[Test build #83543 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83543/testReport)**
 for PR 19651 at commit 
[`f644c6a`](https://github.com/apache/spark/commit/f644c6a88b4f24376c67028d0e927a2ee49fedbe).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19657: [SPARK-22344][SPARKR] clean up install dir if running te...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19657
  
**[Test build #83542 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83542/testReport)**
 for PR 19657 at commit 
[`31f3bd0`](https://github.com/apache/spark/commit/31f3bd06cc7d2b7bf482eddfe2f2738244cfbca7).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19657: [SPARK-22344][SPARKR] clean up install dir if run...

2017-11-07 Thread felixcheung

Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19657#discussion_r149331874
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -1183,3 +1183,24 @@ env | map
 ```{r, echo=FALSE}
 sparkR.session.stop()
 ```
+
+```{r cleanup, include=FALSE}
+# clean up if Spark was downloaded
+# get0 not supported before R 3.2.0
+sparkDownloaded <- mget(".sparkDownloaded"[1L],
--- End diff --

since this needs to go into 2.2, let's not add a public method for now, we 
could revisit this for 2.3


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19662: [SPARK-22446][SQL][ML] Declare StringIndexerModel...

2017-11-07 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19662#discussion_r149331633
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -75,6 +75,7 @@ abstract class Expression extends TreeNode[Expression] {
* - it relies on some mutable internal state, or
* - it relies on some implicit input that is not part of the children 
expression list.
* - it has non-deterministic child or children.
+   * - it is an UDF that can cause runtime exception on some specific 
input.
--- End diff --

Ok.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-07 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r149331329
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
 ---
@@ -0,0 +1,370 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import java.io.IOException
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.io._
+import org.apache.orc.{OrcFile, TypeDescription}
+import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp}
+import org.apache.orc.storage.common.`type`.HiveDecimal
+import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable}
+
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.Resolver
+import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
+import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+object OrcUtils extends Logging {
+
+  def listOrcFiles(pathStr: String, conf: Configuration): Seq[Path] = {
+val origPath = new Path(pathStr)
+val fs = origPath.getFileSystem(conf)
+val paths = SparkHadoopUtil.get.listLeafStatuses(fs, origPath)
+  .filterNot(_.isDirectory)
+  .map(_.getPath)
+  .filterNot(_.getName.startsWith("_"))
+  .filterNot(_.getName.startsWith("."))
+paths
+  }
+
+  private[orc] def readSchema(file: Path, conf: Configuration): 
Option[TypeDescription] = {
+try {
+  val fs = file.getFileSystem(conf)
+  val readerOptions = OrcFile.readerOptions(conf).filesystem(fs)
+  val reader = OrcFile.createReader(file, readerOptions)
+  val schema = reader.getSchema
+  if (schema.getFieldNames.size == 0) {
+None
+  } else {
+Some(schema)
+  }
+} catch {
+  case _: IOException => None
+}
+  }
+
+  private[orc] def readSchema(sparkSession: SparkSession, files: 
Seq[FileStatus])
+  : Option[StructType] = {
+val conf = sparkSession.sparkContext.hadoopConfiguration
+files.map(_.getPath).flatMap(readSchema(_, conf)).headOption.map { 
schema =>
+  logDebug(s"Reading schema from file $files, got Hive schema string: 
$schema")
+  
CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
+}
+  }
+
+  private[orc] def getSchemaString(schema: StructType): String = {
+schema.fields.map(f => 
s"${f.name}:${f.dataType.catalogString}").mkString("struct<", ",", ">")
+  }
+
+  private[orc] def getTypeDescription(dataType: DataType) = dataType match 
{
+case st: StructType => TypeDescription.fromString(getSchemaString(st))
+case _ => TypeDescription.fromString(dataType.catalogString)
+  }
+
+  /**
+   * Return a missing schema in a give ORC file.
+   */
+  private[orc] def getMissingSchema(
+  resolver: Resolver,
+  dataSchema: StructType,
+  partitionSchema: StructType,
+  file: Path,
+  conf: Configuration): Option[StructType] = {
+try {
+  val fs = file.getFileSystem(conf)
+  val readerOptions = OrcFile.readerOptions(conf).filesystem(fs)
+  val reader = OrcFile.createReader(file, readerOptions)
+  val schema = reader.getSchema
+  if (schema.getFieldNames.size == 0) {
+None
+  } else {
+val orcSchema = if 
(schema.getFieldNames.asScala.forall(_.startsWith("_col"))) {
+

[GitHub] spark pull request #19665: [SPARK-22376][TESTS] Makes dev/run-tests.py scrip...

2017-11-07 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19665


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19665: [SPARK-22376][TESTS] Makes dev/run-tests.py script compa...

2017-11-07 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19665
  
Thank you @srowen for reviewing this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19665: [SPARK-22376][TESTS] Makes dev/run-tests.py script compa...

2017-11-07 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19665
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19674
  
**[Test build #83541 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83541/testReport)**
 for PR 19674 at commit 
[`3ef8ec0`](https://github.com/apache/spark/commit/3ef8ec0acb23cd6e10346dbbb6d2d6510e019ac1).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-07 Thread facaiy

Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/19666
  
I believe that unordered features will benefit a lot from the idea, however 
I have two questions:
1. I'm a little confused by 964L in `traverseUnorderedSplits`.  Is it a 
backtracking algorithm?
```scala
dfs(binIndex + 1, combNumber, stats) 
```
2. `traverseUnorderedSplits` succeed in decoupling an abstraction from its 
implementation. Cool! However, I just wonder whether we can write a simpler 
function, say, remove `seqOp`, `finalizer`, `T` and collect all logic together 
in one place?

Anyway, thanks for your good work, @WeichenXu123.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-07 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19651
  
Right. @HyukjinKwon . I'll follow the final decision on this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-07 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19651
  
@dongjoon-hyun, btw, if I understood correctly, 

> Note that this PR intentionally removes old ORCFileFormat to demonstrate 
a complete replacement. We will bring back the old ORCFileFormat and make them 
switchable in SPARK-20728

we don't necessarily remove the old 
`sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala` 
itself when it's ready for merging?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19674
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83540/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19674
  
**[Test build #83540 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83540/testReport)**
 for PR 19674 at commit 
[`e64a95e`](https://github.com/apache/spark/commit/e64a95e21a8b06b077369e12a9e28c5d90d507d2).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19674
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19674
  
**[Test build #83540 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83540/testReport)**
 for PR 19674 at commit 
[`e64a95e`](https://github.com/apache/spark/commit/e64a95e21a8b06b077369e12a9e28c5d90d507d2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19664: [SPARK-22442][SQL] ScalaReflection should produce correc...

2017-11-07 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19664
  
good catch!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19664: [SPARK-22442][SQL] ScalaReflection should produce...

2017-11-07 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19664#discussion_r149316667
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
 ---
@@ -335,4 +338,17 @@ class ScalaReflectionSuite extends SparkFunSuite {
 assert(linkedHashMapDeserializer.dataType == 
ObjectType(classOf[LHMap[_, _]]))
   }
 
+  test("SPARK-22442: Generate correct field names for special characters") 
{
+val serializer = serializerFor[SpecialCharAsFieldData](BoundReference(
+  0, ObjectType(classOf[SpecialCharAsFieldData]), nullable = false))
+val deserializer = deserializerFor[SpecialCharAsFieldData]
+assert(serializer.dataType(0).name == "field.1")
+assert(serializer.dataType(1).name == "field 2")
+
+val argumentsFields = 
deserializer.asInstanceOf[NewInstance].arguments.flatMap { _.collect {
+  case UpCast(u: UnresolvedAttribute, _, _) => u.name
+}}
+assert(argumentsFields(0) == "`field.1`")
--- End diff --

why it has backticks?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent

2017-11-07 Thread jerryshao

Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19649#discussion_r149315115
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/events.scala 
---
@@ -110,7 +120,31 @@ case class RenameTableEvent(
   extends TableEvent
 
 /**
- * Event fired when a function is created, dropped or renamed.
+ * Enumeration to indicate which part of table is altered. If a plain 
alterTable API is called, then
+ * type will generally be Table.
+ */
+object AlterTableKind extends Enumeration {
+  val Table, DataSchema, Stats = Value
--- End diff --

I'm OK to use String, but I'd prefer strong type to avoid nasty issues.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19664: [SPARK-22442][SQL] ScalaReflection should produce...

2017-11-07 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19664#discussion_r149316254
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -214,11 +215,13 @@ case class Invoke(
   override def eval(input: InternalRow): Any =
 throw new UnsupportedOperationException("Only code-generated 
evaluation is supported.")
 
+  private lazy val encodedFunctionName = 
TermName(functionName).encodedName.toString
--- End diff --

does `StaticInvoke` have some issue?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19674
  
**[Test build #83539 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83539/testReport)**
 for PR 19674 at commit 
[`2e4823d`](https://github.com/apache/spark/commit/2e4823d0991de438b82f4c274f19515d5c193075).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19674
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83539/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19674
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent

2017-11-07 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19649#discussion_r149314315
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/events.scala 
---
@@ -110,7 +120,31 @@ case class RenameTableEvent(
   extends TableEvent
 
 /**
- * Event fired when a function is created, dropped or renamed.
+ * Enumeration to indicate which part of table is altered. If a plain 
alterTable API is called, then
+ * type will generally be Table.
+ */
+object AlterTableKind extends Enumeration {
+  val Table, DataSchema, Stats = Value
--- End diff --

shall we just use string?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19674
  
**[Test build #83539 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83539/testReport)**
 for PR 19674 at commit 
[`2e4823d`](https://github.com/apache/spark/commit/2e4823d0991de438b82f4c274f19515d5c193075).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent

2017-11-07 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19649#discussion_r149315470
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/events.scala 
---
@@ -110,7 +120,31 @@ case class RenameTableEvent(
   extends TableEvent
 
 /**
- * Event fired when a function is created, dropped or renamed.
+ * Enumeration to indicate which part of table is altered. If a plain 
alterTable API is called, then
+ * type will generally be Table.
+ */
+object AlterTableKind extends Enumeration {
+  val Table, DataSchema, Stats = Value
--- End diff --

String is better for backward compatibility, but easier to make mistake. I 
don't have a strong preference, cc @hvanhovell @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-07 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r149315182
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
 ---
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument, 
SearchArgumentFactory}
+import org.apache.orc.storage.ql.io.sarg.SearchArgument.Builder
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types._
+
+/**
+ * Utility functions to convert Spark data source filters to ORC filters.
+ */
+private[orc] object OrcFilters {
+
+  /**
+   * Create ORC filter as a SearchArgument instance.
+   */
+  def createFilter(schema: StructType, filters: Seq[Filter]): 
Option[SearchArgument] = {
+val dataTypeMap = schema.map(f => f.name -> f.dataType).toMap
+
+val convertibleFilters = for {
+  filter <- filters
+  _ <- buildSearchArgument(dataTypeMap, filter, 
SearchArgumentFactory.newBuilder())
--- End diff --

This is a two-step approach which validates each individual filter is 
convertible.
I'll add the comment of 
[SPARK-12218](https://github.com/apache/spark/commit/8e23d8db7f28a97e2f4394cdf9d4c4260abbd750#diff-6cac9bc2656e3782b0312dceb8c55d47R60).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19682: [SPARK-22464] [SQL] No pushdown for Hive metastor...

2017-11-07 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19682#discussion_r149314957
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
@@ -592,6 +592,19 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
 }
   }
 
+
+  /**
+   * An extractor that matches all binary comparison operators except 
null-safe equality.
+   *
+   * null-safe equality is not supported by Hive metastore partition 
predicate pushdown
+   */
+  object OperatorsInMetastorePartitionFPD {
--- End diff --

This name looks weird... maybe just `SpecialBinaryComparison` and explain 
the detail in the document.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19649
  
**[Test build #83538 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83538/testReport)**
 for PR 19649 at commit 
[`c79c314`](https://github.com/apache/spark/commit/c79c31452d285c3b36b8adda55b9daccd6cea0d4).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19674
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83536/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19674
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19674
  
**[Test build #83536 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83536/testReport)**
 for PR 19674 at commit 
[`32d4796`](https://github.com/apache/spark/commit/32d4796c5e029cdbdf54d2625365e774d05b148c).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19674
  
I can directly add anything if you want to check, @ueshin. Just feel free 
to let me know, or I also support to open another PR to figure out if you are 
willing to. Meanwhile, let me print out few things I want to check.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-07 Thread facaiy

Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149313427
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -631,6 +614,42 @@ class RandomForestSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 val expected = Map(0 -> 1.0 / 3.0, 2 -> 2.0 / 3.0)
 assert(mapToVec(map.toMap) ~== mapToVec(expected) relTol 0.01)
   }
+
+  test("traverseUnorderedSplits") {
+
--- End diff --

Since `traverseUnorderedSplits` is a private method, I wonder whether we 
can check the unorder splits on DecisonTree directly? For example, create a 
tiny dataset and generate a shallow tree (depth = 1?). I know the test case is 
difficult (maybe impossible) to design, however it focuses on behavior instead 
of implementation.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-07 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r149311408
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala ---
@@ -22,9 +22,10 @@ import org.apache.hadoop.fs.Path
 import org.apache.hadoop.hive.ql.io.orc.{OrcFile, Reader}
 import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
 
-import org.apache.spark.deploy.SparkHadoopUtil
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
+import org.apache.spark.sql.execution.datasources.orc.OrcUtils
+import org.apache.spark.sql.hive.HiveShim
 import org.apache.spark.sql.types.StructType
 
 private[hive] object OrcFileOperator extends Logging {
--- End diff --

`OrcFileOperator` defines functions depending on Hive. We cannot merge 
these functions into `sql/core`.
```
import org.apache.hadoop.hive.ql.io.orc.{OrcFile, Reader}
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-07 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r149310995
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
 ---
@@ -39,3 +58,134 @@ private[sql] object OrcFileFormat {
 names.foreach(checkFieldName)
   }
 }
+
+class DefaultSource extends OrcFileFormat
+
+/**
+ * New ORC File Format based on Apache ORC 1.4.1 and above.
+ */
+class OrcFileFormat
+  extends FileFormat
+  with DataSourceRegister
+  with Serializable {
+
+  override def shortName(): String = "orc"
+
+  override def toString: String = "ORC_1.4"
+
+  override def hashCode(): Int = getClass.hashCode()
+
+  override def equals(other: Any): Boolean = 
other.isInstanceOf[OrcFileFormat]
+
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = {
+OrcUtils.readSchema(sparkSession, files)
+  }
+
+  override def prepareWrite(
+  sparkSession: SparkSession,
+  job: Job,
+  options: Map[String, String],
+  dataSchema: StructType): OutputWriterFactory = {
+val orcOptions = new OrcOptions(options, 
sparkSession.sessionState.conf)
+
+val conf = job.getConfiguration
+
+conf.set(MAPRED_OUTPUT_SCHEMA.getAttribute, 
OrcUtils.getSchemaString(dataSchema))
+
+conf.set(COMPRESS.getAttribute, orcOptions.compressionCodec)
+
+conf.asInstanceOf[JobConf]
+  
.setOutputFormat(classOf[org.apache.orc.mapred.OrcOutputFormat[OrcStruct]])
+
+new OutputWriterFactory {
+  override def newInstance(
+  path: String,
+  dataSchema: StructType,
+  context: TaskAttemptContext): OutputWriter = {
+new OrcOutputWriter(path, dataSchema, context)
+  }
+
+  override def getFileExtension(context: TaskAttemptContext): String = 
{
+val compressionExtension: String = {
+  val name = context.getConfiguration.get(COMPRESS.getAttribute)
+  OrcOptions.extensionsForCompressionCodecNames.getOrElse(name, "")
+}
+
+compressionExtension + ".orc"
+  }
+}
+  }
+
+  override def isSplitable(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  path: Path): Boolean = {
+true
+  }
+
+  override def buildReaderWithPartitionValues(
--- End diff --

Yep. I see. It was because I preferred to be consistent with 
`ParquetFileFormat` here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-07 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r149311068
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOptions.scala
 ---
@@ -67,4 +67,11 @@ object OrcOptions {
 "snappy" -> "SNAPPY",
 "zlib" -> "ZLIB",
 "lzo" -> "LZO")
+
+  // The extensions for ORC compression codecs
+  val extensionsForCompressionCodecNames = Map(
--- End diff --

It's moved to OrcUtils.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-07 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r149310585
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
 ---
@@ -0,0 +1,370 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import java.io.IOException
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.io._
+import org.apache.orc.{OrcFile, TypeDescription}
+import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp}
+import org.apache.orc.storage.common.`type`.HiveDecimal
+import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable}
+
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.Resolver
+import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
+import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+object OrcUtils extends Logging {
+
+  def listOrcFiles(pathStr: String, conf: Configuration): Seq[Path] = {
+val origPath = new Path(pathStr)
+val fs = origPath.getFileSystem(conf)
+val paths = SparkHadoopUtil.get.listLeafStatuses(fs, origPath)
+  .filterNot(_.isDirectory)
+  .map(_.getPath)
+  .filterNot(_.getName.startsWith("_"))
+  .filterNot(_.getName.startsWith("."))
+paths
+  }
+
+  private[orc] def readSchema(file: Path, conf: Configuration): 
Option[TypeDescription] = {
+try {
+  val fs = file.getFileSystem(conf)
+  val readerOptions = OrcFile.readerOptions(conf).filesystem(fs)
+  val reader = OrcFile.createReader(file, readerOptions)
+  val schema = reader.getSchema
+  if (schema.getFieldNames.size == 0) {
+None
+  } else {
+Some(schema)
+  }
+} catch {
+  case _: IOException => None
+}
+  }
+
+  private[orc] def readSchema(sparkSession: SparkSession, files: 
Seq[FileStatus])
+  : Option[StructType] = {
+val conf = sparkSession.sparkContext.hadoopConfiguration
+files.map(_.getPath).flatMap(readSchema(_, conf)).headOption.map { 
schema =>
--- End diff --

Later, I will implement schema merging in a parallel manner like Parquet.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-07 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r149309798
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
 ---
@@ -0,0 +1,370 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import java.io.IOException
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.io._
+import org.apache.orc.{OrcFile, TypeDescription}
+import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp}
+import org.apache.orc.storage.common.`type`.HiveDecimal
+import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable}
+
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.Resolver
+import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
+import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+object OrcUtils extends Logging {
+
+  def listOrcFiles(pathStr: String, conf: Configuration): Seq[Path] = {
+val origPath = new Path(pathStr)
+val fs = origPath.getFileSystem(conf)
+val paths = SparkHadoopUtil.get.listLeafStatuses(fs, origPath)
+  .filterNot(_.isDirectory)
+  .map(_.getPath)
+  .filterNot(_.getName.startsWith("_"))
+  .filterNot(_.getName.startsWith("."))
+paths
+  }
+
+  private[orc] def readSchema(file: Path, conf: Configuration): 
Option[TypeDescription] = {
+try {
+  val fs = file.getFileSystem(conf)
+  val readerOptions = OrcFile.readerOptions(conf).filesystem(fs)
+  val reader = OrcFile.createReader(file, readerOptions)
+  val schema = reader.getSchema
+  if (schema.getFieldNames.size == 0) {
+None
+  } else {
+Some(schema)
+  }
+} catch {
+  case _: IOException => None
+}
+  }
+
+  private[orc] def readSchema(sparkSession: SparkSession, files: 
Seq[FileStatus])
+  : Option[StructType] = {
+val conf = sparkSession.sparkContext.hadoopConfiguration
--- End diff --

Sure!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19674
  
**[Test build #83536 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83536/testReport)**
 for PR 19674 at commit 
[`32d4796`](https://github.com/apache/spark/commit/32d4796c5e029cdbdf54d2625365e774d05b148c).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19667: [SPARK-21127][SQL][followup] fix a config name typo

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19667
  
**[Test build #83537 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83537/testReport)**
 for PR 19667 at commit 
[`55b949e`](https://github.com/apache/spark/commit/55b949e0e039ea981aeb69d6c3699c829071e368).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19662: [SPARK-22446][SQL][ML] Declare StringIndexerModel...

2017-11-07 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19662#discussion_r149308545
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -75,6 +75,7 @@ abstract class Expression extends TreeNode[Expression] {
* - it relies on some mutable internal state, or
* - it relies on some implicit input that is not part of the children 
expression list.
* - it has non-deterministic child or children.
+   * - it is an UDF that can cause runtime exception on some specific 
input.
--- End diff --

how about `it assumes the input satisfies some certain condition via the 
child operator`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19662: [SPARK-22446][SQL][ML] Declare StringIndexerModel indexe...

2017-11-07 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19662
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19674
  
Mind was:

```
Before ('KST', 'KST')
After ('PST', 'PDT')
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19674
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83533/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19674
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19674: [DO-NOT-MERGE] Investigate test failures related with SP...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19674
  
**[Test build #83533 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83533/testReport)**
 for PR 19674 at commit 
[`16e0614`](https://github.com/apache/spark/commit/16e0614300b092688e1d9f09a4575918e5d7358a).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19673: [SPARK-21640][SQL][PYTHON][R][FOLLOWUP] Add errorifexist...

2017-11-07 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19673
  
Oh, the test and code looked not printing out. It actually does :) (newline 
inserted for readability) :

```r
> write.df(createDataFrame(iris), path = "/tmp/foo", mode = "abc")
...
Error in mode : illegal argument - Unknown save mode: abc. 
Accepted save modes are 'overwrite', 'append', 'ignore', 'error', 
'errorifexists'.
```

Thank you for your review @felixcheung.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19682: [SPARK-22464] [SQL] No pushdown for Hive metastore parti...

2017-11-07 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19682
  
**[Test build #83535 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83535/testReport)**
 for PR 19682 at commit 
[`be0e276`](https://github.com/apache/spark/commit/be0e276a371319eb19467707a105d968d685a0c3).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17953: [SPARK-20680][SQL] Spark-sql do not support for v...

2017-11-07 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17953


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16803: [SPARK-19458][BUILD]load hive jars from local rep...

2017-11-07 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16803


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17402: [SPARK-7200] Check that memory is not leaked in T...

2017-11-07 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17402


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14158: [SPARK-13547] [SQL] [WEBUI] Add SQL query in web ...

2017-11-07 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14158


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

< 1 2 3 4 5 >

301 - 400 of 430 matches

Mail list logo