[GitHub] spark issue #16204: [SPARK-18775][SQL] Limit the max number of records writt...

2016-12-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16204
  
@hvanhovell don't forget this one!



[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16296
  
**[Test build #70398 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70398/testReport)** for PR 16296 at commit [`4049645`](https://github.com/apache/spark/commit/4049645f9a251d6cb8db27f7d2341aab3a1a5596).


[GitHub] spark issue #16233: [SPARK-18801][SQL] Add `View` operator to help resolve a...

2016-12-19 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/16233
  
I think we all agree that a wrapper is needed to handle the case of nested views. It could be an `AnalysisContext` in `Analyzer`, a `viewContext` in `CatalogTable`, or an operator node such as `View` or `SubqueryAlias` (the operator option is sketched below). Perhaps we should ask @hvanhovell to share his opinion on this issue?
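As an editorial illustration of the `View`-operator option (a minimal sketch under assumed, simplified types: `LogicalPlan` here is a stand-in for Spark's actual class, and the field names are hypothetical, not the PR's API):

```scala
// Simplified stand-in for Spark's LogicalPlan, only to keep the sketch self-contained.
trait LogicalPlan { def children: Seq[LogicalPlan] }

// The wrapper records the database that was current when the view was created,
// so the analyzer can resolve unqualified relations inside `child` against that
// database instead of against the session's current one.
case class View(
    viewName: String,
    viewDefaultDatabase: Option[String],
    child: LogicalPlan) extends LogicalPlan {
  override def children: Seq[LogicalPlan] = child :: Nil
}
```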


[GitHub] spark issue #16232: [SPARK-18800][SQL] Fix UnsafeKVExternalSorter by correct...

2016-12-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16232
  
@davies Ok. I got it. I will update the assert.


[GitHub] spark issue #16233: [SPARK-18801][SQL] Add `View` operator to help resolve a...

2016-12-19 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16233
  
I'm thinking about whether we really need the wrapper (the `View` operator). Given a table/view identifier, the steps to resolve it are:
1. If the database is specified, get the table/view metadata from that database.
2. If the database is not specified, try to resolve it as a temp view first.
3. If it's not a temp view, get the table/view metadata from the current database.

For nested views, it's a different story. The sub-plan-tree of the nested view may have a different "currentDatabase"; it's effectively under a different analysis context. Wrapping the sub-plan-tree with a `View` operator can solve this problem, but I have a simpler proposal:
```
def lookupRelation(...) = {
  ...
  if (table.tableType == CatalogTableType.VIEW) {
    val viewContext = table.viewContext
    val viewText = table.viewText
    sparkSession.sessionState.sqlParser.parsePlan(viewText).transform {
      case u @ UnresolvedRelation(tableIdent) if tableIdent.database.isEmpty =>
        u.copy(tableIdent = tableIdent.copy(database = Some(viewContext.currentDatabase)))
    }
    ...
  }
  ...
}
```


[GitHub] spark issue #16232: [SPARK-18800][SQL] Fix UnsafeKVExternalSorter by correct...

2016-12-19 Thread davies
Github user davies commented on the issue:

https://github.com/apache/spark/pull/16232
  
@viirya Without a repro, I don't think this is the root cause. The error could also have been caused by random corruption.


[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12775
  
**[Test build #70397 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70397/testReport)** for PR 12775 at commit [`9778cef`](https://github.com/apache/spark/commit/9778cefce3e152d559e53cd4e2f5a113e561f0ff).


[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread lirui-intel
Github user lirui-intel commented on the issue:

https://github.com/apache/spark/pull/12775
  
Sure. Updated patch to not catch Throwable.


[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/12775
  
Ok that's fine with me -- @lirui-intel can you make that change?


[GitHub] spark issue #16232: [SPARK-18800][SQL] Fix UnsafeKVExternalSorter by correct...

2016-12-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16232
  
@davies Actually this PR is motivated by an error reported on the dev mailing list at http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-IllegalStateException-There-is-no-space-for-new-record-tc20108.html

So if the array size is not enough, don't we need to allocate a big enough array for the sorter, as the current change does (sketched below)?

The reporter doesn't have a repro, but I think this is the only place that can cause this error.
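As an editorial illustration of the sizing idea (a minimal sketch: `LongArray` is a stand-in type, and the two-longs-per-record layout is the assumption under discussion, not a quote of Spark's actual internals):

```scala
// Stand-in for Spark's unsafe LongArray, to keep the sketch self-contained.
final case class LongArray(size: Long)

// The in-memory sorter needs one record pointer and one key prefix (two longs)
// per record, so grow the reused array when it cannot hold that many slots.
def pointerArrayForSort(
    numRecords: Long,
    reusable: LongArray,
    allocate: Long => LongArray): LongArray = {
  val requiredSlots = numRecords * 2L
  val array = if (reusable.size < requiredSlots) allocate(requiredSlots) else reusable
  assert(array.size >= requiredSlots) // tighter than asserting on the reused size alone
  array
}
```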


[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16296
  
Build finished. Test FAILed.


[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16296
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70395/
Test FAILed.


[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16296
  
**[Test build #70395 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70395/testReport)** for PR 16296 at commit [`631edf7`](https://github.com/apache/spark/commit/631edf75ed83a9e7598b746dc81c46d9a7761e09).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds the following public classes _(experimental)_:
  * `class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] `


[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery

2016-12-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16337
  
I actually don't mind having 01, 02, 03, etc., but some higher-level grouping would still be useful.



[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/12775
  
@kayousterhout Exactly. The logError is already handled elsewhere (and the throwable is not ignored there).


[GitHub] spark pull request #16330: [SPARK-18817][SPARKR][SQL] change derby log outpu...

2016-12-19 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16330#discussion_r93176308
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
@@ -104,6 +104,12 @@ class SparkHadoopUtil extends Logging {
   }
   val bufferSize = conf.get("spark.buffer.size", "65536")
   hadoopConf.set("io.file.buffer.size", bufferSize)
+
+  if (conf.contains("spark.sql.default.derby.dir")) {
--- End diff --

Why do we need to introduce this flag?


[GitHub] spark issue #16308: [SPARK-18936][SQL] Infrastructure for session local time...

2016-12-19 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/16308
  
@rxin I see, created.


[GitHub] spark issue #16189: [SPARK-18761][CORE] Introduce "task reaper" to oversee t...

2016-12-19 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16189
  
@mridulm Sure. Also, please feel free to leave more comments :) 


[GitHub] spark issue #16308: [SPARK-18936][SQL] Infrastructure for session local time...

2016-12-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16308
  
Can you create a subtask at https://issues.apache.org/jira/browse/SPARK-18350 ?


[GitHub] spark issue #16308: [SPARK-18936][SQL] Infrastructure for session local time...

2016-12-19 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/16308
  
@rxin I'd like to have a follow-up PR related to partition values. I didn't include it in this PR, but I think we need it.


[GitHub] spark issue #16308: [SPARK-18350][SQL] Support session local timezone.

2016-12-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16308
  
Thanks, this looks great.

A couple of things:

1. Can you change the referenced JIRA to https://issues.apache.org/jira/browse/SPARK-18936

2. We should do a more detailed pass to make sure there isn't any performance issue in the impacted expressions (e.g. don't create a new timezone object or do hash lookups per row; see the sketch below).
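As an editorial illustration of the per-row concern (a minimal sketch; `ToSessionLocalTime` and its `eval` are hypothetical, not Spark's actual expression API):

```scala
import java.util.TimeZone

// Resolve the session time zone once, when the expression is constructed,
// rather than calling TimeZone.getTimeZone (an internal lookup) for every row.
class ToSessionLocalTime(sessionTzId: String) {
  private val tz: TimeZone = TimeZone.getTimeZone(sessionTzId) // hoisted, done once

  // Per-row work only touches the cached instance.
  def eval(epochMillis: Long): Long = epochMillis + tz.getOffset(epochMillis)
}
```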



[GitHub] spark issue #16308: [SPARK-18350][SQL] Support session local timezone.

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16308
  
**[Test build #70396 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70396/testReport)** for PR 16308 at commit [`4b6900c`](https://github.com/apache/spark/commit/4b6900cf6d182d87a545d736d320c6229fb8251d).


[GitHub] spark issue #16308: [SPARK-18350][SQL] Support session local timezone.

2016-12-19 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/16308
  
@rxin I updated the description. Is it enough for you?


[GitHub] spark pull request #15721: [SPARK-17772][ML][TEST] Add test functions for ML...

2016-12-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15721#discussion_r93174061
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
@@ -224,4 +208,139 @@ object MLTestingUtils extends SparkFunSuite {
     }.toDF()
     (overSampledData, weightedData)
   }
+
+  /**
+   * Generates a linear prediction function where the coefficients are generated randomly.
+   * The function produces a continuous (numClasses = 0) or categorical (numClasses > 0) label.
+   */
+  def getRandomLinearPredictionFunction(
+      numFeatures: Int,
+      numClasses: Int,
+      seed: Long): (Vector => Double) = {
+    val rng = new scala.util.Random(seed)
+    val trueNumClasses = if (numClasses == 0) 1 else numClasses
+    val coefArray = Array.fill(numFeatures * trueNumClasses)(rng.nextDouble - 0.5)
+    (features: Vector) => {
+      if (numClasses == 0) {
+        BLAS.dot(features, new DenseVector(coefArray))
+      } else {
+        val margins = new DenseVector(new Array[Double](numClasses))
+        val coefMat = new DenseMatrix(numClasses, numFeatures, coefArray)
+        BLAS.gemv(1.0, coefMat, features, 1.0, margins)
+        margins.argmax.toDouble
+      }
+    }
+  }
+
+  /**
+   * A helper function to generate synthetic data. Generates random feature values,
+   * both categorical and continuous, according to `categoricalFeaturesInfo`. The label is
+   * generated from a random prediction function, and noise is added to the true label.
+   *
+   * @param numPoints The number of data points to generate.
+   * @param numClasses The number of classes the outcome can take on. 0 for continuous labels.
+   * @param numFeatures The number of features in the data.
+   * @param categoricalFeaturesInfo Map of (featureIndex -> numCategories) for categorical features.
+   * @param seed Random seed.
+   * @param noiseLevel A number in [0.0, 1.0] indicating how much noise to add to the label.
+   * @return Generated sequence of noisy instances.
+   */
+  def generateNoisyData(
+      numPoints: Int,
+      numClasses: Int,
+      numFeatures: Int,
+      categoricalFeaturesInfo: Map[Int, Int],
+      seed: Long,
+      noiseLevel: Double = 0.3): Seq[Instance] = {
+    require(noiseLevel >= 0.0 && noiseLevel <= 1.0, "noiseLevel must be in range [0.0, 1.0]")
+    val rng = new scala.util.Random(seed)
+    val predictionFunc = getRandomLinearPredictionFunction(numFeatures, numClasses, seed)
+    Range(0, numPoints).map { i =>
+      val features = Vectors.dense(Array.tabulate(numFeatures) { j =>
+        val numCategories = categoricalFeaturesInfo.getOrElse(j, 0)
+        if (numCategories > 0) {
+          rng.nextInt(numCategories)
+        } else {
+          rng.nextDouble() - 0.5
+        }
+      })
+      val label = predictionFunc(features)
+      val noisyLabel = if (numClasses > 0) {
+        // with probability equal to noiseLevel, select a random class instead of the true class
+        if (rng.nextDouble < noiseLevel) rng.nextInt(numClasses) else label
+      } else {
+        // add noise to the label proportional to the noise level
+        label + noiseLevel * rng.nextGaussian()
+      }
+      Instance(noisyLabel, 1.0, features)
+    }
+  }
+
+  /**
+   * Helper function for testing sample weights. Tests that oversampling each point is equivalent
+   * to assigning a sample weight proportional to the number of samples for each point.
+   */
+  def testOversamplingVsWeighting[M <: Model[M], E <: Estimator[M]](
+      spark: SparkSession,
+      estimator: E with HasWeightCol with HasLabelCol with HasFeaturesCol,
+      categoricalFeaturesInfo: Map[Int, Int],
+      numPoints: Int,
+      numClasses: Int,
+      numFeatures: Int,
+      modelEquals: (M, M) => Unit,
+      seed: Long): Unit = {
+    import spark.implicits._
+    val df = generateNoisyData(numPoints, numClasses, numFeatures, categoricalFeaturesInfo,
+      seed).toDF()
+    val (overSampledData, weightedData) = genEquivalentOversampledAndWeightedInstances(
+      df, estimator.getLabelCol, estimator.getFeaturesCol, seed)
+    val weightedModel = estimator.set(estimator.weightCol, "weight").fit(weightedData)
+    val overSampledModel = estimator.set(estimator.weightCol, "").fit(overSampledData)
+    modelEquals(weightedModel, overSampledModel)
+  }
+
+  /**
+   * Helper function for testing sample weights. Tests that injecting a large number of outliers
+   * with very small sample weights does not affect fitting. The predictor should learn the 

[GitHub] spark pull request #15721: [SPARK-17772][ML][TEST] Add test functions for ML...

2016-12-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15721#discussion_r93172081
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
[quoted diff elided; it repeats the hunk quoted in full above]
+import spark.implicits._
+val df = generateNoisyData(numPoints, numClasses, numFeatures, categoricalFeaturesInfo,
--- End diff --

If we add noise in the native data generators (see my comment above), we should remove this line and pass in the generated dataset (which already includes noise) directly.


[GitHub] spark pull request #15721: [SPARK-17772][ML][TEST] Add test functions for ML...

2016-12-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15721#discussion_r93172224
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
[quoted diff elided; it repeats the hunk quoted in full above]

[GitHub] spark pull request #15721: [SPARK-17772][ML][TEST] Add test functions for ML...

2016-12-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15721#discussion_r93171343
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
[quoted diff elided; it repeats the hunk quoted in full above]
+  def generateNoisyData(
--- End diff --

I am a bit worried about whether we should provide this general noisy-data generation function:
* It would be better to generate data following the model of each specific algorithm; for example, for `LogisticRegression` users could provide the coefficients and the mean and variance of the generated features.
* Some generators, such as [`LinearDataGenerator.generateLinearInput`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/LinearDataGenerator.scala#L97), have already considered the noise level.

Just like `LinearDataGenerator.generateLinearInput`, I think we should add an `eps` argument to the other generators, such as `LogisticRegressionSuite.generateLogisticInput`, `LogisticRegressionSuite.generateMultinomialLogisticInput`, and `NaiveBayesSuite.generateNaiveBayesInput`, to make them output noisy labels natively (a sketch follows).
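As an editorial illustration of adding native noise to a generator (a minimal, self-contained sketch: the `LabeledPoint` case class, the exact signature, and the flip-with-probability-`eps` scheme are illustrative assumptions, not the suite's actual code):

```scala
import scala.util.Random

case class LabeledPoint(label: Double, features: Array[Double])

// Thread an `eps` noise argument through the generator so labels are noisy
// natively, in the spirit of LinearDataGenerator.generateLinearInput.
def generateLogisticInput(
    offset: Double,
    scale: Double,
    nPoints: Int,
    seed: Int,
    eps: Double = 0.0): Seq[LabeledPoint] = {
  val rnd = new Random(seed)
  Seq.fill(nPoints) {
    val x = rnd.nextGaussian()
    val p = 1.0 / (1.0 + math.exp(-(offset + scale * x))) // P(y = 1 | x)
    val y = if (rnd.nextDouble() < p) 1.0 else 0.0
    val noisy = if (rnd.nextDouble() < eps) 1.0 - y else y // flip label with prob. eps
    LabeledPoint(noisy, Array(x))
  }
}
```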


[GitHub] spark pull request #15721: [SPARK-17772][ML][TEST] Add test functions for ML...

2016-12-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15721#discussion_r93172654
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
[quoted diff elided; it repeats the hunk quoted in full above]

[GitHub] spark pull request #15721: [SPARK-17772][ML][TEST] Add test functions for ML...

2016-12-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15721#discussion_r93172182
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
[quoted diff elided; it repeats the hunk quoted in full above]
+  def testOversamplingVsWeighting[M <: Model[M], E <: Estimator[M]](
+spark: SparkSession,
--- End diff --

Indent.


[GitHub] spark issue #16232: [SPARK-18800][SQL] Fix UnsafeKVExternalSorter by correct...

2016-12-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16232
  
OK. I will update accordingly.


[GitHub] spark issue #16240: [SPARK-16792][SQL] Dataset containing a Case Class with ...

2016-12-19 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16240
  
The overall strategy LGTM.

> I had to alter and add new implicit encoders into SQLImplicits. The new encoders are for the Seq with Product combination (essentially only List) to disambiguate between the Seq and Product encoders.

Does Scala have a clear definition for this case? I.e., if we have implicits for both type `A` and type `B`, which implicit will be picked for type `A with B`? (See the sketch below.)

For the optimization, we can do it in a follow-up.
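For reference: when implicits for both `A` and `B` are eligible and neither is more specific, scalac reports an ambiguity rather than picking one. A common way to disambiguate is the low-priority parent-trait pattern, sketched below with a hypothetical `Enc` type class (not Spark's actual encoders): implicits defined in a subclass take precedence over inherited ones.

```scala
trait Enc[T]

// Implicits in a parent trait have lower priority than those in the subclass.
trait LowPriorityImplicits {
  implicit def seqEnc[T]: Enc[Seq[T]] = new Enc[Seq[T]] {}
}

object Implicits extends LowPriorityImplicits {
  // More specific: only applies when the element type is a Product.
  implicit def productSeqEnc[T <: Product]: Enc[Seq[T]] = new Enc[Seq[T]] {}
}

object Demo {
  import Implicits._
  def enc[T](implicit e: Enc[T]): Enc[T] = e

  case class P(x: Int)
  val a: Enc[Seq[P]] = enc[Seq[P]]     // picks productSeqEnc: the subclass wins
  val b: Enc[Seq[Int]] = enc[Seq[Int]] // falls back to seqEnc: Int is not a Product
}
```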


[GitHub] spark issue #16232: [SPARK-18800][SQL] Fix UnsafeKVExternalSorter by correct...

2016-12-19 Thread davies
Github user davies commented on the issue:

https://github.com/apache/spark/pull/16232
  
That makes sense; we should update the assert. But this is still not a bug, and the other changes are not needed.


[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery

2016-12-19 Thread nsyca
Github user nsyca commented on the issue:

https://github.com/apache/spark/pull/16337
  
Let me try to summarize the comments around the structure of the test files here:
1. A single file of 200+ test cases is too big. We prefer smaller files with logical groupings.
2. A file name with a serial number is not the way Spark names files.

I'd like to generate more discussion before we come to a conclusion.
- It is possible to group test cases, and what we tried is to loosely group them by naming them in groups within the test file, like TC 01.xx, where 01 is effectively the group number. We can easily change to put one group in each file.
- Sometimes grouping rigidly is not desirable, or impossible. Does a test case of 'EXISTS .. OR NOT IN' go to the 'EXISTS' group, the 'NOT IN' group, or the 'disjunctive subquery' group? Does a test case of 'EXISTS ( .. ) UNION EXISTS ( .. )' go into the same group as 'EXISTS ( .. UNION .. )', or does the first go to the 'UNION' suite and the latter to the 'subquery' suite? Shall we have test cases with one classification go to the "simple" set and the ones with more than one way to classify go to the "complex" set? Over time, people will pile most of them into the "complex" set, it will become bloated, and we will end up with "complex-1", "complex-2", etc.
- Arguably we have a purpose when writing a test case, but sometimes it triggers an unrelated problem. If a test case is intended to test a subquery functionality but ends up revealing a missed opportunity in join reordering, should we move it into the 'join reordering' suite or leave it in the 'subquery' suite?
- With the current one-level flat structure in sql/core/src/test/resources/sql-tests/inputs/, we could end up with thousands of files in the (near) future if a file contains only a handful of test cases. What is a good solution? Should we create a subdirectory named subquery/ and break the test cases up into small files under this directory (one possible layout is sketched below)?

I don't think we have a silver bullet for this kind of problem. Let's brainstorm here. I (or someone else) could moderate the discussion. Eventually we will need to pick one way or the other, and if we need to change it in the future, we will pay the price for it.
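As an editorial illustration of the subdirectory idea (directory and file names are hypothetical):

```
sql/core/src/test/resources/sql-tests/inputs/
  subquery/
    in-subquery.sql
    not-in-subquery.sql
    exists-subquery.sql
    disjunctive-subquery.sql
```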


[GitHub] spark issue #16346: [SPARK-16654][CORE] Add UI coverage for Application Leve...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16346
  
Merged build finished. Test PASSed.


[GitHub] spark issue #16346: [SPARK-16654][CORE] Add UI coverage for Application Leve...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16346
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70393/
Test PASSed.


[GitHub] spark issue #16346: [SPARK-16654][CORE] Add UI coverage for Application Leve...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16346
  
**[Test build #70393 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70393/testReport)** for PR 16346 at commit [`20ff7dd`](https://github.com/apache/spark/commit/20ff7dddea72bf8fc9330f464992b19e1bf1c59e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class SparkListenerExecutorBlacklisted(`
  * `case class SparkListenerExecutorUnblacklisted(time: Long, executorId: String)`
  * `case class SparkListenerNodeBlacklisted(`
  * `case class SparkListenerNodeUnblacklisted(time: Long, nodeId: String)`


[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/12775
  
@mridulm I see -- so you're saying to keep the finally block but remove 
catching the Throwable?  So eliminate the logError, but otherwise the 
functionality is the same?
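As an editorial illustration of that shape (a minimal sketch; both helpers are hypothetical stand-ins, not the scheduler's actual code):

```scala
object TaskFailureSketch {
  def handleFailedTask(taskId: Long): Unit = () // stand-in for the real work; may throw
  def cleanupTaskState(taskId: Long): Unit = () // cleanup that must always run

  def onTaskFailure(taskId: Long): Unit = {
    // No catch clause: a Throwable thrown by the handler propagates to the caller
    // instead of being logged and swallowed; the finally still guarantees cleanup.
    try {
      handleFailedTask(taskId)
    } finally {
      cleanupTaskState(taskId)
    }
  }
}
```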


[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16296
  
**[Test build #70395 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70395/testReport)** for PR 16296 at commit [`631edf7`](https://github.com/apache/spark/commit/631edf75ed83a9e7598b746dc81c46d9a7761e09).


[GitHub] spark pull request #16313: [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refa...

2016-12-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16313


[GitHub] spark issue #16313: [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor th...

2016-12-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16313
  
Thanks! Merging to master/2.1.


[GitHub] spark pull request #16348: Branch 2.0.4399

2016-12-19 Thread laixiaohang
Github user laixiaohang closed the pull request at:

https://github.com/apache/spark/pull/16348


[GitHub] spark issue #16348: Branch 2.0.4399

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16348
  
Can one of the admins verify this patch?





[GitHub] spark pull request #16348: Branch 2.0.4399

2016-12-19 Thread laixiaohang
GitHub user laixiaohang opened a pull request:

https://github.com/apache/spark/pull/16348

Branch 2.0.4399

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/laixiaohang/spark branch-2.0.4399

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16348.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16348


commit c9c36fa0c7bccefde808bdbc32b04e8555356001
Author: Davies Liu 
Date:   2016-09-02T22:10:12Z

[SPARK-17230] [SQL] Should not pass optimized query into QueryExecution in 
DataFrameWriter

Some analyzer rules make assumptions about logical plans, and the optimizer 
may break those assumptions. We should not pass an optimized query plan into 
QueryExecution (where it will be analyzed again), otherwise we may hit some 
weird bugs.

For example, we have a rule for decimal calculation that promotes the 
precision before binary operations, using PromotePrecision as a placeholder to 
indicate that the rule should not apply twice. But an optimizer rule removes 
this placeholder, which breaks the assumption; the rule is then applied twice 
and produces a wrong result.

Ideally, we should make all the analyzer rules idempotent, but that may 
require lots of effort to double-check them one by one (which may not be easy).

An easier approach is to never feed an optimized plan into the Analyzer. This 
PR fixes the case for RunnableCommand: these commands are optimized, and during 
execution the wrapped `query` is passed into QueryExecution again. This PR makes 
these `query` plans not part of the command's children, so they will not be 
optimized and analyzed again.

Right now, we cannot tell whether a logical plan has already been optimized; 
we could introduce a flag for that, and make sure an optimized logical plan is 
never analyzed again.
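
A minimal sketch of the trick (made-up `Plan` types, not Spark's actual 
TreeNode API): because `query` is held as a plain field rather than a child, 
any traversal that recurses through `children` never revisits the 
already-optimized plan.

```scala
// Illustrative only -- simplified stand-ins for Spark's LogicalPlan hierarchy.
sealed trait Plan { def children: Seq[Plan] }
case class Relation(name: String) extends Plan { val children = Seq.empty[Plan] }
// `query` is a field, NOT a child, so rule traversals skip it.
case class RunCommand(query: Plan) extends Plan { val children = Seq.empty[Plan] }

// A traversal that follows `children`, as analyzer/optimizer rules do.
def visit(plan: Plan): Seq[Plan] = plan +: plan.children.flatMap(visit)

val cmd = RunCommand(Relation("t"))
assert(visit(cmd) == Seq(cmd)) // Relation("t") is never visited, so never re-analyzed
```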

Added regression tests.

Author: Davies Liu 

Closes #14797 from davies/fix_writer.

(cherry picked from commit ed9c884dcf925500ceb388b06b33bd2c95cd2ada)
Signed-off-by: Davies Liu 

commit a3930c3b9afa9f7eba2a5c8b8f279ca38e348e9b
Author: Sameer Agarwal 
Date:   2016-09-02T22:16:16Z

[SPARK-16334] Reusing same dictionary column for decoding consecutive row 
groups shouldn't throw an error

This patch fixes a bug in the vectorized parquet reader that's caused by 
re-using the same dictionary column vector while reading consecutive row 
groups. Specifically, this issue manifests for a certain distribution of 
dictionary/plain encoded data while we read/populate the underlying bit packed 
dictionary data into a column-vector based data structure.

Manually tested on datasets provided by the community. Thanks to Chris 
Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal 

Closes #14941 from sameeragarwal/parquet-exception-2.

(cherry picked from commit a2c9acb0e54b2e38cb8ee6431f1ea0e0b4cd959a)
Signed-off-by: Davies Liu 

commit b8f65dad7be22231e982aaec3bbd69dbeacc20da
Author: Davies Liu 
Date:   2016-09-02T22:40:02Z

Fix build

commit c0ea7707127c92ecb51794b96ea40d7cdb28b168
Author: Davies Liu 
Date:   2016-09-02T23:05:37Z

Revert "[SPARK-16334] Reusing same dictionary column for decoding 
consecutive row groups shouldn't throw an error"

This reverts commit a3930c3b9afa9f7eba2a5c8b8f279ca38e348e9b.

commit 12a2e2a5ab5db12f39a7b591e914d52058e1581b
Author: Junyang Qian 
Date:   2016-09-03T04:11:57Z

[SPARKR][MINOR] Fix docs for sparkR.session and count

## What changes were proposed in this pull request?

This PR tries to add some more explanation to `sparkR.session`. It also 
modifies the doc for `count` so that, when grouped into one doc, the 
description doesn't confuse users.

## How was this patch tested?

Manual test.

![screen shot 2016-09-02 at 1 21 36 
pm](https://cloud.githubusercontent.com/assets/15318264/18217198/409613ac-7110-11e6-8dae-cb0c8df557bf.png)

Author: Junyang Qian 

Closes #14942 from junyangq/fixSparkRSessionDoc.

(cherry picked from commit d2fde6b72c4aede2e7edb4a7e6653fb1e7b19924)

[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/12775
  

@kayousterhout As @lirui-intel mentioned above, there are two parts to this 
change.
One is moving handleFailedTask to finally - that is a correct change.

The other is catching Throwable, logging it, and ignoring it.
This is an incorrect practice. Specifically in this context, since the 
runnable is already within Utils.logUncaughtExceptions, the logging is already 
handled.
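
To make the two parts concrete, here is a minimal sketch (hypothetical helper 
names, not the actual scheduler code) of the `finally` pattern that everyone 
agrees on:

```scala
// Illustrative sketch only: the cleanup runs in `finally`, so the task is
// marked as failed even if deserializing the failure reason throws.
def deserializeFailureReason(taskId: Long): String =
  throw new RuntimeException(s"corrupt task result for task $taskId")
def handleFailedTask(taskId: Long): Unit =
  println(s"task $taskId marked as failed")

def processFailedTask(taskId: Long): Unit =
  try {
    val reason = deserializeFailureReason(taskId) // may throw any Throwable
    println(s"task $taskId failed: $reason")
  } finally {
    handleFailedTask(taskId) // without this, the job would hang on the lost task
  }
```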





[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/12775
  
@mridulm what's the scenario you're imagining where it's worse to catch the 
exception?  I'm imagining one of two scenarios:

(1) There's a recoverable exception, in which case we should properly 
register the task as failed (otherwise the job will hang) and log the exception 
(which is what this PR does).

(2) There's an irrecoverable exception.  My understanding is that this 
change only impacts the logging in that case (since the relevant thread is 
going to die anyway).





[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread lirui-intel
Github user lirui-intel commented on the issue:

https://github.com/apache/spark/pull/12775
  
Hi @kayousterhout and @mridulm, to clarify, I think the error won't 
disappear if we don't catch it, because the runnable is wrapped in 
Utils.logUncaughtExceptions, so the error will be logged eventually. But either 
way, I think we should handle the failed task in a finally block.





[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/12775
  

If the intent is only to log, why not register an uncaughtException handler 
for that purpose, instead of catching Throwable and ignoring it after logging?
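
For reference, a minimal sketch of that alternative using the plain JVM API 
(nothing Spark-specific assumed):

```scala
// Log via an UncaughtExceptionHandler instead of a catch-all: the thread
// still dies, but the error is recorded rather than silently swallowed.
val worker = new Thread(new Runnable {
  override def run(): Unit = throw new OutOfMemoryError("simulated")
})
worker.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
  override def uncaughtException(t: Thread, e: Throwable): Unit =
    System.err.println(s"Uncaught exception in ${t.getName}: $e")
})
worker.start()
```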





[GitHub] spark issue #16313: [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor th...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16313
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16313: [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor th...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16313
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70392/
Test PASSed.





[GitHub] spark issue #16313: [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor th...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16313
  
**[Test build #70392 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70392/testReport)**
 for PR 16313 at commit 
[`32857e6`](https://github.com/apache/spark/commit/32857e6c5fa89094b84d4ed78469217af8c515c7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/12775
  
@mridulm  My thought here was we might as well catch it, since the thread 
is about to die anyway.  The alternative is that we don't catch it, the thread 
dies (so the error disappears / we never see it), and then the VM is in the 
same inconsistent state.  At least the error message from catching it might 
provide a useful hint about what happened.





[GitHub] spark issue #16189: [SPARK-18761][CORE] Introduce "task reaper" to oversee t...

2016-12-19 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/16189
  
Sounds good @JoshRosen.
In general, @yhuai, it would have been better to give reviewers some more 
time to get to ongoing conversations before committing a patch under active 
review, unless it is a hotfix. Thanks.





[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/12775
  
Just saw this - catching Throwable is problematic: it could be any 
system-related Error too, which might leave the VM in an inconsistent state if 
not properly handled, like an OOM or a link error. Are we sure ignoring 
Throwable is the right approach here? It is not just the current thread that 
might be at risk. If there is a more specific subset of exceptions which is 
relevant, it would be more appropriate to catch those.





[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-19 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/16344
  
Jenkins, test this please.





[GitHub] spark pull request #16189: [SPARK-18761][CORE] Introduce "task reaper" to ov...

2016-12-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16189





[GitHub] spark issue #16189: [SPARK-18761][CORE] Introduce "task reaper" to oversee t...

2016-12-19 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16189
  
LGTM!





[GitHub] spark issue #16347: [SPARK-18934][SQL] Writing to dynamic partitions does no...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16347
  
Can one of the admins verify this patch?





[GitHub] spark issue #16189: [SPARK-18761][CORE] Introduce "task reaper" to oversee t...

2016-12-19 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16189
  
Thank you for those comments. I am merging this to master.





[GitHub] spark pull request #16189: [SPARK-18761][CORE] Introduce "task reaper" to ov...

2016-12-19 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16189#discussion_r93162832
  
--- Diff: core/src/test/scala/org/apache/spark/JobCancellationSuite.scala 
---
@@ -209,6 +209,83 @@ class JobCancellationSuite extends SparkFunSuite with Matchers with BeforeAndAft
     assert(jobB.get() === 100)
   }
 
+  test("task reaper kills JVM if killed tasks keep running for too long") {
+    val conf = new SparkConf()
+      .set("spark.task.reaper.enabled", "true")
+      .set("spark.task.reaper.killTimeout", "5s")
+    sc = new SparkContext("local-cluster[2,1,1024]", "test", conf)
+
+    // Add a listener to release the semaphore once any tasks are launched.
+    val sem = new Semaphore(0)
+    sc.addSparkListener(new SparkListener {
+      override def onTaskStart(taskStart: SparkListenerTaskStart) {
+        sem.release()
+      }
+    })
+
+    // jobA is the one to be cancelled.
+    val jobA = Future {
+      sc.setJobGroup("jobA", "this is a job to be cancelled", interruptOnCancel = true)
+      sc.parallelize(1 to 1, 2).map { i =>
+        while (true) { }
+      }.count()
+    }
+
+    // Block until both tasks of job A have started and cancel job A.
+    sem.acquire(2)
+    // Small delay to ensure tasks actually start executing the task body
+    Thread.sleep(1000)
+
+    sc.clearJobGroup()
+    val jobB = sc.parallelize(1 to 100, 2).countAsync()
+    sc.cancelJobGroup("jobA")
+    val e = intercept[SparkException] { ThreadUtils.awaitResult(jobA, 15.seconds) }.getCause
+    assert(e.getMessage contains "cancel")
+
+    // Once A is cancelled, job B should finish fairly quickly.
+    assert(ThreadUtils.awaitResult(jobB, 60.seconds) === 100)
+  }
+
+  test("task reaper will not kill JVM if spark.task.killTimeout == -1") {
+    val conf = new SparkConf()
+      .set("spark.task.reaper.enabled", "true")
+      .set("spark.task.reaper.killTimeout", "-1")
+      .set("spark.task.reaper.PollingInterval", "1s")
+      .set("spark.deploy.maxExecutorRetries", "1")
--- End diff --

We set it to 1 to make sure that we will not kill the JVM, right? (If we 
kill the JVM, we will remove the application, because 
spark.deploy.maxExecutorRetries is 1.)





[GitHub] spark issue #16343: [FLAKY-TEST][DO NOT MERGE] InputStreamsSuite.socket inpu...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16343
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70391/
Test FAILed.





[GitHub] spark issue #16343: [FLAKY-TEST][DO NOT MERGE] InputStreamsSuite.socket inpu...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16343
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16343: [FLAKY-TEST][DO NOT MERGE] InputStreamsSuite.socket inpu...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16343
  
**[Test build #70391 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70391/testReport)**
 for PR 16343 at commit 
[`04fa2f7`](https://github.com/apache/spark/commit/04fa2f709d034841a0828bd110e5561198b000ea).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16341: [SQL] [WIP] Switch internal catalog types to use URI ins...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16341
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16341: [SQL] [WIP] Switch internal catalog types to use URI ins...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16341
  
**[Test build #70394 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70394/testReport)**
 for PR 16341 at commit 
[`bcdac16`](https://github.com/apache/spark/commit/bcdac1691c46395410eb090cd7e0805ed4d58f14).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16341: [SQL] [WIP] Switch internal catalog types to use URI ins...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16341
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70394/
Test FAILed.





[GitHub] spark pull request #16347: [SPARK-18934][SQL] Writing to dynamic partitions ...

2016-12-19 Thread junegunn
GitHub user junegunn opened a pull request:

https://github.com/apache/spark/pull/16347

[SPARK-18934][SQL] Writing to dynamic partitions does not preserve sort 
order if spills occur

## What changes were proposed in this pull request?

Make the dynamic partition writer perform a stable sort by the partition key, 
so that the sort order within each partition, specified via 
`sortWithinPartitions` or `SORT BY`, is preserved even when spills occur.
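
To illustrate the property being relied on, a plain-Scala sketch (not the 
writer's actual code): a stable sort by the partition key keeps records that 
share a key in their pre-existing order.

```scala
// Scala's sortBy is stable, so rows already ordered by value keep that
// order within each partition key after the sort.
val rows = Seq((1, 3), (0, 5), (0, 7), (1, 10))  // already sorted by value
val byPartition = rows.sortBy(_._1)              // stable sort by partition key
// byPartition == List((0,5), (0,7), (1,3), (1,10)) -- value order kept per key
```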

## How was this patch tested?

Manually tested with the following code snippet and orcdump.

```scala
// FileFormatWriter
sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
  .repartition(1, 'part).sortWithinPartitions("value")
  .write.mode("overwrite").format("orc").partitionBy("part")
  .saveAsTable("test_sort_within")

spark.read.table("test_sort_within").filter('part === 0).show
spark.read.table("test_sort_within").filter('part === 1).show

// SparkHiveDynamicPartitionWriterContainer
//   Insert into an existing Hive table with dynamic partitions
// CREATE TABLE TEST_SORT_WITHIN (VALUE INT) PARTITIONED BY (PART INT) 
STORED AS ORC
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
  .repartition(1, 'part).sortWithinPartitions("value")
  .write.mode("overwrite").insertInto("test_sort_within_hive")

spark.read.table("test_sort_within_hive").filter('part === 0).show
spark.read.table("test_sort_within_hive").filter('part === 1).show
```

It was not straightforward to come up with a unit test, as the problem is 
only reproducible when a spill occurs due to memory constraints. I'd appreciate 
any suggestions or pointers.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/junegunn/spark 
dynamic-partition-writer-stable-sort

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16347.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16347


commit bfeccd80ef032cab3525037be3d3e42519619493
Author: Junegunn Choi 
Date:   2016-12-19T05:54:42Z

[SPARK-18934][SQL] Writing to dynamic partitions does not preserve sort 
order if spills occur







[GitHub] spark issue #16345: [SPARK-17755][Core]Use workerRef to send RegisterWorkerR...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16345
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16345: [SPARK-17755][Core]Use workerRef to send RegisterWorkerR...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16345
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70390/
Test PASSed.





[GitHub] spark issue #16345: [SPARK-17755][Core]Use workerRef to send RegisterWorkerR...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16345
  
**[Test build #70390 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70390/testReport)**
 for PR 16345 at commit 
[`b4b5552`](https://github.com/apache/spark/commit/b4b55528edc5e9c92f28cf81ea81e72748790100).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16341: [SQL] [WIP] Switch internal catalog types to use URI ins...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16341
  
**[Test build #70394 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70394/testReport)**
 for PR 16341 at commit 
[`bcdac16`](https://github.com/apache/spark/commit/bcdac1691c46395410eb090cd7e0805ed4d58f14).





[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12775
  
Merged build finished. Test PASSed.





[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12775
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70388/
Test PASSed.





[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12775
  
**[Test build #70388 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70388/testReport)**
 for PR 12775 at commit 
[`699730b`](https://github.com/apache/spark/commit/699730b592e8d913e728e0097e140c710c201dce).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16346: [SPARK-16654][CORE] Add UI coverage for Application Leve...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16346
  
**[Test build #70393 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70393/testReport)**
 for PR 16346 at commit 
[`20ff7dd`](https://github.com/apache/spark/commit/20ff7dddea72bf8fc9330f464992b19e1bf1c59e).





[GitHub] spark issue #16346: [SPARK-16654][CORE] Add UI coverage for Application Leve...

2016-12-19 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/16346
  
ok to test





[GitHub] spark issue #16346: [SPARK-16654][CORE] Add UI coverage for Application Leve...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16346
  
Can one of the admins verify this patch?





[GitHub] spark pull request #16346: [SPARK-16654][CORE] Add UI coverage for Applicati...

2016-12-19 Thread jsoltren
GitHub user jsoltren opened a pull request:

https://github.com/apache/spark/pull/16346

[SPARK-16654][CORE] Add UI coverage for Application Level Blacklisting

Builds on top of work in SPARK-8425 to update Application Level 
Blacklisting in the scheduler.

## What changes were proposed in this pull request?

Adds a UI to these patches by:
- defining new listener events for blacklisting and unblacklisting, nodes 
and executors;
- sending said events at the relevant points in BlacklistTracker;
- adding JSON (de)serialization code for these events;
- augmenting the Executors UI page to show which, and how many, executors 
are blacklisted;
- adding a unit test to make sure events are being fired;
- adding HistoryServerSuite coverage to verify that the SHS reads these 
events correctly.
- updates the Executor UI to show Blacklisted/Active/Dead as a tri-state in 
Executors Status

Updates .rat-excludes to pass tests.
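
For instance, a listener consuming the new events might look like the sketch 
below; the event and method names are assumed to match this patch's additions 
and should be treated as illustrative:

```scala
import org.apache.spark.scheduler._

// Hedged sketch: assumes the patch adds SparkListenerExecutorBlacklisted /
// SparkListenerExecutorUnblacklisted events carrying an `executorId` field.
class BlacklistCounter extends SparkListener {
  @volatile private var blacklisted = Set.empty[String]
  override def onExecutorBlacklisted(e: SparkListenerExecutorBlacklisted): Unit =
    blacklisted += e.executorId
  override def onExecutorUnblacklisted(e: SparkListenerExecutorUnblacklisted): Unit =
    blacklisted -= e.executorId
  def count: Int = blacklisted.size // what the Executors page would display
}
```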

@username squito

(Please fill in changes proposed in this fix)

## How was this patch tested?

./dev/run-tests
testOnly org.apache.spark.util.JsonProtocolSuite
testOnly org.apache.spark.scheduler.BlacklistTrackerSuite
testOnly org.apache.spark.deploy.history.HistoryServerSuite

https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh

![blacklist-20161219](https://cloud.githubusercontent.com/assets/1208477/21335321/9eda320a-c623-11e6-8b8c-9c912a73c276.jpg)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jsoltren/spark SPARK-16654-submit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16346.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16346


commit 20ff7dddea72bf8fc9330f464992b19e1bf1c59e
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-10-14T21:09:44Z

[SPARK-16654][CORE] Add UI coverage for Application Level Blacklisting

Builds on top of work in SPARK-8425 to update
Application Level Blacklisting in the scheduler.

Adds a UI to these patches by:
- defining new listener events for blacklisting and unblacklisting,
  nodes and executors;
- sending said events at the relevant points in BlacklistTracker;
- adding JSON (de)serialization code for these events;
- augmenting the Executors UI page to show which, and how many,
  executors are blacklisted;
- adding a unit test to make sure events are being fired;
- adding HistoryServerSuite coverage to verify that the SHS reads
  these events correctly.
- updates the Executor UI to show Blacklisted/Active/Dead
  as a tri-state in Executors Status

Updates .rat-excludes to pass tests.

@username squito







[GitHub] spark pull request #16325: [SPARK-18703] [SPARK-18675] [SQL] [BACKPORT-2.1] ...

2016-12-19 Thread gatorsmile
Github user gatorsmile closed the pull request at:

https://github.com/apache/spark/pull/16325





[GitHub] spark issue #16325: [SPARK-18703] [SPARK-18675] [SQL] [BACKPORT-2.1] CTAS fo...

2016-12-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16325
  
Sure, will do it. Thanks!





[GitHub] spark issue #16326: [SPARK-18915] [SQL] Automatic Table Repair when Creating...

2016-12-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16326
  
We really need to improve the documentation, I think.





[GitHub] spark pull request #16326: [SPARK-18915] [SQL] Automatic Table Repair when C...

2016-12-19 Thread gatorsmile
Github user gatorsmile closed the pull request at:

https://github.com/apache/spark/pull/16326





[GitHub] spark issue #16326: [SPARK-18915] [SQL] Automatic Table Repair when Creating...

2016-12-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16326
  
Based on the discussion in https://github.com/apache/spark/pull/15983, we 
do not plan to add automatic table repairing. Let me close it first. 





[GitHub] spark issue #16313: [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor th...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16313
  
**[Test build #70392 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70392/testReport)**
 for PR 16313 at commit 
[`32857e6`](https://github.com/apache/spark/commit/32857e6c5fa89094b84d4ed78469217af8c515c7).





[GitHub] spark issue #16343: [FLAKY-TEST][DO NOT MERGE] InputStreamsSuite.socket inpu...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16343
  
**[Test build #70387 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70387/testReport)**
 for PR 16343 at commit 
[`92144e4`](https://github.com/apache/spark/commit/92144e428aa1919ed86e989f4015eb6f85186ea2).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16343: [FLAKY-TEST][DO NOT MERGE] InputStreamsSuite.socket inpu...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16343
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70387/
Test FAILed.





[GitHub] spark issue #16343: [FLAKY-TEST][DO NOT MERGE] InputStreamsSuite.socket inpu...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16343
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16338: [SPARK-18837][WEBUI] Very long stage descriptions do not...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16338
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70383/
Test PASSed.





[GitHub] spark issue #16232: [SPARK-18800][SQL] Fix UnsafeKVExternalSorter by correct...

2016-12-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16232
  
@davies What I just said is not accurate. I didn't mean the values have 
their own entries in the array; I meant each key/value pair will occupy two 
entries in the array.

We iterate over all key/value pairs in the `BytesToBytesMap` and call 
`UnsafeInMemorySorter.insertRecord`, which inserts a record pointer and a key 
prefix for each key/value pair in 
https://github.com/apache/spark/blob/5857b9ac2d9808d9b89a5b29620b5052e2beebf5/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L241

So you will have `map.numKeys() * 2` entries in the array, because each pair 
contributes one entry for the record pointer and one entry for the key prefix.
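
As a back-of-envelope sketch of the accounting (plain Scala, hypothetical 
names, not the real `UnsafeInMemorySorter`):

```scala
// Each inserted record consumes two longs -- a record pointer and a key
// prefix -- so a map with numKeys entries needs numKeys * 2 array slots.
val numKeys = 1000
val pointerArray = new Array[Long](numKeys * 2)
var pos = 0
def insertRecord(recordPointer: Long, keyPrefix: Long): Unit = {
  pointerArray(pos) = recordPointer
  pointerArray(pos + 1) = keyPrefix
  pos += 2
}
```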





[GitHub] spark issue #16338: [SPARK-18837][WEBUI] Very long stage descriptions do not...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16338
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16338: [SPARK-18837][WEBUI] Very long stage descriptions do not...

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16338
  
**[Test build #70383 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70383/testReport)**
 for PR 16338 at commit 
[`c86dc72`](https://github.com/apache/spark/commit/c86dc72f553855843812151ff12e92fa779a5b37).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16313: [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor th...

2016-12-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16313
  
retest this please





[GitHub] spark issue #15983: [SPARK-18544] [SQL] Append with df.saveAsTable writes da...

2016-12-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15983
  
I see the plan, but the behavior difference will still be affected by the 
value of `spark.sql.hive.manageFilesourcePartitions`, right? 

I might need more time to chew over it to find out the potential impacts. 





[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-19 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/16344
  
ok to test





[GitHub] spark issue #16314: [SPARK-18900][FLAKY-TEST] StateStoreSuite.maintenance

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16314
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16314: [SPARK-18900][FLAKY-TEST] StateStoreSuite.maintenance

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16314
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70384/
Test PASSed.





[GitHub] spark issue #16314: [SPARK-18900][FLAKY-TEST] StateStoreSuite.maintenance

2016-12-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16314
  
**[Test build #70384 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70384/testReport)**
 for PR 16314 at commit 
[`6775639`](https://github.com/apache/spark/commit/67756391e11b4ad0ed38fec9cbe99bd7e8b2ce63).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...

2016-12-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15996
  
Could we update the PR description and add the test case in 
`PartitionProviderCompatibilitySuite.scala` to reflect the external behavior 
changes of CTAS on partitioned data source tables? 





[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-19 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/16344
  
Jenkins, add to whitelist





[GitHub] spark issue #16342: [SPARK-18927][SS] MemorySink for StructuredStreaming can...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16342
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70382/
Test PASSed.





[GitHub] spark issue #16342: [SPARK-18927][SS] MemorySink for StructuredStreaming can...

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16342
  
Merged build finished. Test PASSed.




