[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18880
  
**[Test build #80385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80385/testReport)** for PR 18880 at commit [`96ca90e`](https://github.com/apache/spark/commit/96ca90e775da3ee5a762cc5b03b0fcf13bee5363).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18849#discussion_r131894824
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
@@ -2356,18 +2356,9 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
         }.getMessage
         assert(e.contains("Found duplicate column(s)"))
       } else {
-        if (isUsingHiveMetastore) {
-          // hive catalog will still complains that c1 is duplicate column name because hive
-          // identifiers are case insensitive.
-          val e = intercept[AnalysisException] {
-            sql("ALTER TABLE t1 ADD COLUMNS (C1 string)")
-          }.getMessage
-          assert(e.contains("HiveException"))
-        } else {
-          sql("ALTER TABLE t1 ADD COLUMNS (C1 string)")
-          assert(spark.table("t1").schema
-            .equals(new StructType().add("c1", IntegerType).add("C1", StringType)))
-        }
+        sql("ALTER TABLE t1 ADD COLUMNS (C1 string)")
+        assert(spark.table("t1").schema
+          .equals(new StructType().add("c1", IntegerType).add("C1", StringType)))
--- End diff --

`.equals` -> `==`
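
A minimal sketch of the suggested change (the schema values here are assumed, not from the diff): in Scala, `==` on `StructType` delegates to `equals` and also handles `null` operands, so the shorter form reads better in tests.

```
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val expected = new StructType().add("c1", IntegerType).add("C1", StringType)
val actual = new StructType().add("c1", IntegerType).add("C1", StringType)

// Both checks pass; == is the idiomatic spelling and is null-safe.
assert(actual == expected)
assert(actual.equals(expected))
```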





[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...

2017-08-08 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18865#discussion_r131895717
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
@@ -114,7 +114,16 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
     }

     (file: PartitionedFile) => {
-      val parser = new JacksonParser(actualSchema, parsedOptions)
+      // SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`,
--- End diff --

Yeah, we've also considered this possibility. But currently we think that a 
query like `dfFromFile.select($"_corrupt_record")` showing all nulls is 
counterintuitive, because the query is intended to get the corrupt records 
among the JSON records. That is why we provide a fix like the current one. IMO, 
it's meaningless to have all nulls for `_corrupt_record`.
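
A hedged sketch of the query under discussion (the file path and schema here are assumptions, not from the thread): with the proposed fix, selecting only `_corrupt_record` from a JSON source surfaces the malformed rows instead of returning all nulls.

```
import org.apache.spark.sql.types.{StringType, StructType}
import spark.implicits._

// Assumed schema: one data column plus the corrupt-record column.
val schema = new StructType()
  .add("a", StringType)
  .add("_corrupt_record", StringType)

val dfFromFile = spark.read.schema(schema).json("/tmp/records.json")

// The query the comment refers to: intended to show the raw malformed JSON lines.
dfFromFile.select($"_corrupt_record").show(false)
```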






[GitHub] spark issue #18814: [SPARK-21608][SPARK-9221][SQL] Window rangeBetween() API...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18814
  
**[Test build #80386 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80386/testReport)** for PR 18814 at commit [`3e3b58c`](https://github.com/apache/spark/commit/3e3b58c8911e266b2af985da3dec53418a608d2b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-08 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18538#discussion_r131891836
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/CosineSilhouette.scala ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.SparkContext
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.linalg.{DenseVector, Vector, VectorElementWiseSum}
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions.{col, count}
+
+private[evaluation] object CosineSilhouette {
--- End diff --

There are no clustering algorithms using distance metrics other than squared 
euclidean distance currently. I'd suggest removing the ```CosineSilhouette``` 
implementation first; we can add it back when it's needed. This also makes 
this PR easier to review.





[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-08 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18538#discussion_r131891038
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouette.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.SparkContext
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.linalg.{Vector, VectorElementWiseSum}
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions.{col, count, sum}
+
+private[evaluation] object SquaredEuclideanSilhouette {
+
+  private[this] var kryoRegistrationPerformed: Boolean = false
+
+  /**
+   * This method registers the class
+   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
+   * for kryo serialization.
+   *
+   * @param sc `SparkContext` to be used
+   */
+  def registerKryoClasses(sc: SparkContext): Unit = {
+if (! kryoRegistrationPerformed) {
+  sc.getConf.registerKryoClasses(
+Array(
+  classOf[SquaredEuclideanSilhouette.ClusterStats]
+)
+  )
+  kryoRegistrationPerformed = true
+}
+  }
+
+  case class ClusterStats(Y: Vector, psi: Double, count: Long)
+
+  def computeCsi(vector: Vector): Double = {
+var sumOfSquares = 0.0
+vector.foreachActive((_, v) => {
+  sumOfSquares += v * v
+})
+sumOfSquares
+  }
+
+  def computeYVectorPsiAndCount(
+  df: DataFrame,
+  predictionCol: String,
+  featuresCol: String): DataFrame = {
+val Yudaf = new VectorElementWiseSum()
+df.groupBy(predictionCol)
+  .agg(
+count("*").alias("count"),
+sum("csi").alias("psi"),
+Yudaf(col(featuresCol)).alias("y")
--- End diff --

Please rename ```csi``` to ```squaredNorm```, ```psi``` to 
```squaredNormSum```, and ```y``` to ```featureSum```, if I haven't 
misunderstood. We should use more descriptive names.





[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18880
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80385/
Test FAILed.





[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18880
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18873: Fixing python 2.6 tests for jenkings

2017-08-08 Thread dmvieira
Github user dmvieira commented on the issue:

https://github.com/apache/spark/pull/18873
  
Hey guys, I just opened this PR because I spent a lot of time trying to fix 
Jenkins tests in my last PR, when the error was actually in the test script with 
Python 2.6... I can close it, but I know more people will spend time trying to 
fix it again. Ok @felixcheung





[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18880
  
**[Test build #80391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80391/testReport)** for PR 18880 at commit [`5a4e859`](https://github.com/apache/spark/commit/5a4e859d434b0a5b8539cab2ef0d9346ce81c08e).





[GitHub] spark issue #18814: [SPARK-21608][SPARK-9221][SQL] Window rangeBetween() API...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18814
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18814: [SPARK-21608][SPARK-9221][SQL] Window rangeBetween() API...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18814
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80386/
Test PASSed.





[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-08 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18538#discussion_r131890318
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouette.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.SparkContext
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.linalg.{Vector, VectorElementWiseSum}
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions.{col, count, sum}
+
+private[evaluation] object SquaredEuclideanSilhouette {
+
+  private[this] var kryoRegistrationPerformed: Boolean = false
+
+  /**
+   * This method registers the class
+   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
+   * for kryo serialization.
+   *
+   * @param sc `SparkContext` to be used
+   */
+  def registerKryoClasses(sc: SparkContext): Unit = {
+if (! kryoRegistrationPerformed) {
+  sc.getConf.registerKryoClasses(
+Array(
+  classOf[SquaredEuclideanSilhouette.ClusterStats]
+)
+  )
+  kryoRegistrationPerformed = true
+}
+  }
+
+  case class ClusterStats(Y: Vector, psi: Double, count: Long)
+
+  def computeCsi(vector: Vector): Double = {
--- End diff --

Can we use ```Vectors.norm(vector, 2.0)```? It should be more efficient for 
both dense and sparse vectors. Actually, we can remove this function entirely 
if you refactor the code as I suggested below.
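
A small sketch of the suggestion (the helper name is illustrative): `Vectors.norm` special-cases `p = 2.0`, so squaring it computes the same quantity as the hand-rolled loop in `computeCsi`, for both dense and sparse vectors.

```
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Squared L2 norm via Vectors.norm; equivalent to summing v(i) * v(i).
def squaredL2Norm(v: Vector): Double = math.pow(Vectors.norm(v, 2.0), 2.0)
```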





[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-08 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18538#discussion_r131868145
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
+import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.functions.{avg, col}
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * Evaluator for clustering results.
+ * At the moment, the supported metrics are:
+ *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
+ *  cosineSilhouette: silhouette measure using the cosine distance.
+ * The implementation follows the proposal explained
+ * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
+ * in this document</a>.
+ */
+@Experimental
+class ClusteringEvaluator (val uid: String)
+  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
+
+  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
+
+  override def copy(pMap: ParamMap): Evaluator = this.defaultCopy(pMap)
+
+  override def isLargerBetter: Boolean = true
+
+  /** @group setParam */
+  def setPredictionCol(value: String): this.type = set(predictionCol, value)
+
+  /** @group setParam */
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /**
+   * param for metric name in evaluation
+   * (supports `"squaredSilhouette"` (default), `"cosineSilhouette"`)
+   * @group param
+   */
+  val metricName: Param[String] = {
+val allowedParams = ParamValidators.inArray(Array("squaredSilhouette", "cosineSilhouette"))
+new Param(
+  this,
+  "metricName",
+  "metric name in evaluation (squaredSilhouette|cosineSilhouette)",
+  allowedParams
+)
+  }
+
+  /** @group getParam */
+  def getMetricName: String = $(metricName)
+
+  /** @group setParam */
+  def setMetricName(value: String): this.type = set(metricName, value)
+
+  setDefault(metricName -> "squaredSilhouette")
+
+  override def evaluate(dataset: Dataset[_]): Double = {
+SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
+SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
+
+val metric: Double = $(metricName) match {
+  case "squaredSilhouette" =>
+computeSquaredSilhouette(dataset)
+  case "cosineSilhouette" =>
+computeCosineSilhouette(dataset)
+}
+metric
+  }
+
+  private[this] def computeCosineSilhouette(dataset: Dataset[_]): Double = {
+CosineSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)
+
+val computeCsi = dataset.sparkSession.udf.register("computeCsi",
--- End diff --

Could we use a more descriptive name? We can't tell what this function does 
from its name.





[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-08 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18538#discussion_r131892095
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouette.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.SparkContext
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.linalg.{Vector, VectorElementWiseSum}
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions.{col, count, sum}
+
+private[evaluation] object SquaredEuclideanSilhouette {
--- End diff --

Let's move this into the ```ClusteringEvaluator``` file.





[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-08 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18538#discussion_r131889868
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouette.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.SparkContext
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.linalg.{Vector, VectorElementWiseSum}
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions.{col, count, sum}
+
+private[evaluation] object SquaredEuclideanSilhouette {
+
+  private[this] var kryoRegistrationPerformed: Boolean = false
+
+  /**
+   * This method registers the class
+   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
+   * for kryo serialization.
+   *
+   * @param sc `SparkContext` to be used
+   */
+  def registerKryoClasses(sc: SparkContext): Unit = {
+if (! kryoRegistrationPerformed) {
+  sc.getConf.registerKryoClasses(
+Array(
+  classOf[SquaredEuclideanSilhouette.ClusterStats]
+)
+  )
+  kryoRegistrationPerformed = true
+}
+  }
+
+  case class ClusterStats(Y: Vector, psi: Double, count: Long)
+
+  def computeCsi(vector: Vector): Double = {
+var sumOfSquares = 0.0
+vector.foreachActive((_, v) => {
+  sumOfSquares += v * v
+})
+sumOfSquares
+  }
+
+  def computeYVectorPsiAndCount(
+  df: DataFrame,
+  predictionCol: String,
+  featuresCol: String): DataFrame = {
+val Yudaf = new VectorElementWiseSum()
+df.groupBy(predictionCol)
+  .agg(
+count("*").alias("count"),
+sum("csi").alias("psi"),
+Yudaf(col(featuresCol)).alias("y")
+  )
--- End diff --

Aggregate function performance is not ideal for columns of non-primitive 
types (like the vector type here), so we would still use an RDD-based 
aggregate. You can refactor this part of the code following 
[```NaiveBayes```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala#L161) like:
```
import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vectors}
import org.apache.spark.sql.functions._

val numFeatures = ...
val squaredNorm = udf { features: Vector => math.pow(Vectors.norm(features, 2.0), 2.0) }

df.select(col(predictionCol), col(featuresCol))
  .withColumn("squaredNorm", squaredNorm(col(featuresCol)))
  .rdd
  .map { row => (row.getDouble(0), (row.getAs[Vector](1), row.getDouble(2))) }
  .aggregateByKey[(DenseVector, Double)]((Vectors.zeros(numFeatures).toDense, 0.0))(
    seqOp = {
      case ((featureSum: DenseVector, squaredNormSum: Double), (features, squaredNorm)) =>
        BLAS.axpy(1.0, features, featureSum)
        (featureSum, squaredNormSum + squaredNorm)
    },
    combOp = {
      case ((featureSum1, squaredNormSum1), (featureSum2, squaredNormSum2)) =>
        BLAS.axpy(1.0, featureSum2, featureSum1)
        (featureSum1, squaredNormSum1 + squaredNormSum2)
    }).collect()
```
With this suggestion you can compute ```csi``` and ```y``` in a single data 
pass, which should be more efficient.





[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18882
  
**[Test build #80390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80390/testReport)** for PR 18882 at commit [`dc2112f`](https://github.com/apache/spark/commit/dc2112f6a26125cd6a67eac79cef91751ac639f8).





[GitHub] spark issue #18878: fix issue with spark-submit with java

2017-08-08 Thread brandonJY
Github user brandonJY commented on the issue:

https://github.com/apache/spark/pull/18878
  
Running Spark 2.2.0 in standalone mode on Ubuntu 16.04 LTS. It interprets the 
classpath with the quotes, e.g. for classname=`foo` it interprets it as `"foo"`.
Maybe it is just an environment issue?





[GitHub] spark issue #18881: [SPARK-20433][BUILD] Bump jackson from 2.6.5 to 2.6.7.1

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18881
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80384/
Test PASSed.





[GitHub] spark issue #18881: [SPARK-20433][BUILD] Bump jackson from 2.6.5 to 2.6.7.1

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18881
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #18880: [SPARK-21665][Core]Need to close resources after ...

2017-08-08 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18880#discussion_r131897551
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -754,6 +754,7 @@ private[spark] class Client(
       val writer = new OutputStreamWriter(confStream, StandardCharsets.UTF_8)
       props.store(writer, "Spark configuration.")
       writer.flush()
+      writer.close()
--- End diff --

This should be moved after `confStream.closeEntry()`
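
A hedged sketch of the suggested ordering (the surrounding `Client.scala` code is assumed from the diff context): finish the current zip entry before closing the writer, since closing the writer also closes the underlying stream.

```
val writer = new OutputStreamWriter(confStream, StandardCharsets.UTF_8)
props.store(writer, "Spark configuration.")
writer.flush()
confStream.closeEntry()  // finalize the zip entry first...
writer.close()           // ...then release the writer (and the stream beneath it)
```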





[GitHub] spark pull request #18880: [SPARK-21665][Core]Need to close resources after ...

2017-08-08 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18880#discussion_r131897597
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -754,6 +754,7 @@ private[spark] class Client(
       val writer = new OutputStreamWriter(confStream, StandardCharsets.UTF_8)
       props.store(writer, "Spark configuration.")
       writer.flush()
+      writer.close()
--- End diff --

This should be moved after `confStream.closeEntry()`





[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18544
  
**[Test build #80392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80392/testReport)** for PR 18544 at commit [`bb454b1`](https://github.com/apache/spark/commit/bb454b1d89e239f0cfb1dfd98189ec3fdb59cff6).





[GitHub] spark issue #18492: [SPARK-19326] Speculated task attempts do not get launch...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18492
  
**[Test build #80387 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80387/testReport)** for PR 18492 at commit [`77b4729`](https://github.com/apache/spark/commit/77b47293d9655408b0dbe4d5896492b366575b3a).





[GitHub] spark issue #18867: MapOutputTrackerSuite Utest

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18867
  
**[Test build #80388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80388/testReport)** for PR 18867 at commit [`b0c58c7`](https://github.com/apache/spark/commit/b0c58c7ff4f0be063e31d9b2f3b0b8b01094c51d).





[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18882
  
**[Test build #80389 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80389/testReport)** for PR 18882 at commit [`d253e40`](https://github.com/apache/spark/commit/d253e40788b9e3408c106eff0ba84ae97d715cbb).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18882
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18882
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80389/
Test FAILed.





[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18855#discussion_r131889921
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -1415,6 +1415,79 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
   super.fetchBlockSync(host, port, execId, blockId)
 }
   }
+
+  def testGetOrElseUpdateForLargeBlock(storageLevel : StorageLevel) {
+store = makeBlockManager(6L * 1024 * 1024 * 1024, "exec1")
+def mkBlobs() = {
+  val rng = new java.util.Random(42)
+  val buff = new Array[Byte](1024 * 1024)
+  rng.nextBytes(buff)
+  Iterator.fill(2 * 1024 + 1) {
+buff
+  }
+}
+val res1 = store.getOrElseUpdate(
+  RDDBlockId(42, 0),
+  storageLevel,
+  implicitly[ClassTag[Array[Byte]]],
+  mkBlobs _
+)
+withClue(res1) {
--- End diff --

does `res1` have a reasonable string representation?
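
If it does not, one hedged alternative (the clue text and assertion body here are placeholders, not the PR's actual check) is to derive a readable clue from the `Either` before asserting:

```
// Hypothetical: map the result to a short descriptive string for the failure message.
val clue = res1.fold(blockResult => s"BlockResult: $blockResult", _ => "iterator of values")
withClue(clue) {
  // original assertions on res1 go here
}
```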





[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18798
  
**[Test build #80407 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80407/testReport)** for PR 18798 at commit [`b02db42`](https://github.com/apache/spark/commit/b02db420132f62299900a5089fb54810ec21a3d5).





[GitHub] spark pull request #18886: [SPARK-21671][core] Move kvstore to "util" sub-pa...

2017-08-08 Thread vanzin
GitHub user vanzin opened a pull request:

https://github.com/apache/spark/pull/18886

[SPARK-21671][core] Move kvstore to "util" sub-package, add private 
annotation.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vanzin/spark SPARK-21671

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18886.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18886


commit 1f86189bb71507c73c662bb1681166bc54db7dd3
Author: Marcelo Vanzin 
Date:   2017-08-08T18:07:17Z

[SPARK-21671][core] Move kvstore to "util" sub-package, add private 
annotation.







[GitHub] spark issue #18886: [SPARK-21671][core] Move kvstore to "util" sub-package, ...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18886
  
**[Test build #80408 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80408/testReport)** for PR 18886 at commit [`1f86189`](https://github.com/apache/spark/commit/1f86189bb71507c73c662bb1681166bc54db7dd3).





[GitHub] spark issue #17373: [SPARK-12664] Expose probability in mlp model

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17373
  
**[Test build #80409 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80409/testReport)** for PR 17373 at commit [`bcb44af`](https://github.com/apache/spark/commit/bcb44af65c3c7f9c0ead6cff5706243da80f88bc).





[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-08-08 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r131992917
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
@@ -83,6 +83,36 @@ class MultilayerPerceptronClassifierSuite
 }
   }
 
+  test("strong dataset test") {
+val layers = Array[Int](4, 5, 5, 4)
+
+val rnd = new scala.util.Random(1234L)
+
+val strongDataset = Seq.tabulate(4) { index =>
+  (Vectors.dense(
+rnd.nextGaussian(),
+rnd.nextGaussian() * 2.0,
+rnd.nextGaussian() * 3.0,
+rnd.nextGaussian() * 2.0
+  ), (index % 4).toDouble)
+}.toDF("features", "label")
+val trainer = new MultilayerPerceptronClassifier()
+  .setLayers(layers)
+  .setBlockSize(1)
+  .setSeed(123L)
+  .setMaxIter(100)
+  .setSolver("l-bfgs")
+val model = trainer.fit(strongDataset)
+val result = model.transform(strongDataset)
+model.setProbabilityCol("probability")
+MLTestingUtils.checkCopyAndUids(trainer, model)
+// result.select("probability").show(false)
+val predictionAndLabels = result.select("prediction", "label").collect()
+predictionAndLabels.foreach { case Row(p: Double, l: Double) =>
+  assert(p == l)
+}
+  }
--- End diff --

@MrBago 
How do you like this test? The probabilities it generates are:

```
+-------------------------------------------------------------------------------------+
|probability                                                                          |
+-------------------------------------------------------------------------------------+
|[0.9917274999513315,0.001511626318489583,0.004831796668307991,0.0019290770618710876] |
|[4.2392735713619E-12,0.99955336,1.8369996605279208E-14,2.0871629225077174E-13]       |
|[1.8975708749716946E-4,5.191732707447977E-22,0.5010860788259045,0.49872416408659836] |
|[1.6776134471360903E-4,3.9309610969078615E-22,0.49629577580941386,0.5035364628458726]|
+-------------------------------------------------------------------------------------+
```

It contains some values near 0.5.






[GitHub] spark issue #18881: [SPARK-20433][BUILD] Bump jackson from 2.6.5 to 2.6.7.1

2017-08-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18881
  
cc @zsxwing 





[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18875
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18875
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80399/
Test PASSed.





[GitHub] spark pull request #18709: [SPARK-21504] [SQL] Add spark version info into t...

2017-08-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18709#discussion_r131995378
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -700,6 +704,9 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
        table = restoreDataSourceTable(table, provider)
    }

+    // Restore version info
+    val version: String = table.properties.getOrElse(CREATED_SPARK_VERSION, "")
--- End diff --

I will change this to `2.2 or prior`
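
A one-line sketch of that follow-up (the constant comes from the diff above; the default string is the one stated):

```
val version: String = table.properties.getOrElse(CREATED_SPARK_VERSION, "2.2 or prior")
```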





[GitHub] spark issue #18709: [SPARK-21504] [SQL] Add spark version info into table me...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18709
  
**[Test build #80410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80410/testReport)** for PR 18709 at commit [`9d7c577`](https://github.com/apache/spark/commit/9d7c577c24f320d9b7f35ae827f83ff65be5019f).





[GitHub] spark issue #18881: [SPARK-20433][BUILD] Bump jackson from 2.6.5 to 2.6.7.1

2017-08-08 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/18881
  
Can you add "Closes #18789" to automate the closing of the other PR?





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18421
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80401/
Test PASSed.





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18421
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18421
  
**[Test build #80402 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80402/testReport)** for PR 18421 at commit [`ba48073`](https://github.com/apache/spark/commit/ba4807342db2943d72c32e62a0bf043efeb573a3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18421
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18421
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80402/
Test PASSed.





[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18818#discussion_r132002008
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---
@@ -453,6 +453,14 @@ case class Or(left: Expression, right: Expression) extends BinaryOperator with P
 
 abstract class BinaryComparison extends BinaryOperator with Predicate {
 
+  override def inputType: AbstractDataType = AnyDataType
--- End diff --

We have to write code comments explaining it; given this assumption, `inputType` might be misused in the future.

In addition, this PR makes a pretty critical change. We really need to check all the new data types we support in this PR:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ordering.scala#L90-L97

Could you write comprehensive test case coverage for this?
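
For instance, the coverage could look something like this (a rough sketch, not from the PR; it assumes the ordering semantics this change proposes and the usual `checkAnswer` helper from `QueryTest`):

```
test("binary comparisons on newly orderable types") {
  // structs are assumed to compare field by field, left to right
  checkAnswer(sql("SELECT struct(1, 'a') < struct(1, 'b')"), Row(true))
  // arrays are assumed to compare element by element
  checkAnswer(sql("SELECT array(1, 2) <= array(1, 3)"), Row(true))
}
```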





[GitHub] spark issue #18640: [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0

2017-08-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18640
  
Sure. Thank you so much, @omalley !





[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-08 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/18798
  
Jenkins, test this please.





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-08-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r131993682
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -256,6 +257,201 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
     }
   }
 
+  test("analyze single partition") {
+    val tableName = "analyzeTable_part"
+
+    def queryStats(ds: String): CatalogStatistics = {
+      val partition =
+        spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds))
+      partition.stats.get
+    }
+
+    def createPartition(ds: String, query: String): Unit = {
+      sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query")
+    }
+
+    withTable(tableName) {
+      sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING)")
+
+      createPartition("2010-01-01", "SELECT '1', 'A' from src")
+      createPartition("2010-01-02", "SELECT '1', 'A' from src UNION ALL SELECT '1', 'A' from src")
+      createPartition("2010-01-03", "SELECT '1', 'A' from src")
+
+      sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN")
+
+      sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS NOSCAN")
+
+      assert(queryStats("2010-01-01").rowCount === None)
+      assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+      assert(queryStats("2010-01-02").rowCount === None)
+      assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+
+      sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS")
+
+      sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS")
+
+      assert(queryStats("2010-01-01").rowCount.get === 500)
+      assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+      assert(queryStats("2010-01-02").rowCount.get === 2*500)
+      assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+    }
+  }
+
+  test("analyze a set of partitions") {
+    val tableName = "analyzeTable_part"
+
+    def queryStats(ds: String, hr: String): Option[CatalogStatistics] = {
+      val tableId = TableIdentifier(tableName)
+      val partition =
+        spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, "hr" -> hr))
+      partition.stats
+    }
+
+    def assertPartitionStats(
+        ds: String,
+        hr: String,
+        rowCount: Option[BigInt],
+        sizeInBytes: BigInt): Unit = {
+      val stats = queryStats(ds, hr).get
+      assert(stats.rowCount === rowCount)
+      assert(stats.sizeInBytes === sizeInBytes)
+    }
+
+    def createPartition(ds: String, hr: Int, query: String): Unit = {
+      sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds', hr=$hr) $query")
+    }
+
+    withTable(tableName) {
+      sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING, hr INT)")
+
+      createPartition("2010-01-01", 10, "SELECT '1', 'A' from src")
+      createPartition("2010-01-01", 11, "SELECT '1', 'A' from src")
+      createPartition("2010-01-02", 10, "SELECT '1', 'A' from src")
+      createPartition("2010-01-02", 11,
+        "SELECT '1', 'A' from src UNION ALL SELECT '1', 'A' from src")
+
+      sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN")
+
+      assertPartitionStats("2010-01-01", "10", rowCount = None, sizeInBytes = 2000)
+      assertPartitionStats("2010-01-01", "11", rowCount = None, sizeInBytes = 2000)
+      assert(queryStats("2010-01-02", "10") === None)
+      assert(queryStats("2010-01-02", "11") === None)
+
+      sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS NOSCAN")
+
+      assertPartitionStats("2010-01-01", "10", rowCount = None, sizeInBytes = 2000)
+      assertPartitionStats("2010-01-01", "11", rowCount = None, sizeInBytes = 2000)
+      assertPartitionStats("2010-01-02", "10", rowCount = None, sizeInBytes = 2000)
+      assertPartitionStats("2010-01-02", "11", rowCount = None, sizeInBytes = 2*2000)
+
+      sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS")
+
+      assertPartitionStats("2010-01-01", "10", rowCount = Some(500), sizeInBytes = 2000)
+      assertPartitionStats("2010-01-01", "11", rowCount = Some(500), sizeInBytes = 2000)
+      assertPartitionStats("2010-01-02", "10", rowCount = None, sizeInBytes = 2000)
+      assertPartitionStats("2010-01-02", "11", rowCount = None, sizeInBytes = 2*2000)
+
+      sql(s"ANALYZE TABLE $tableName 

[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-08-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r131993628
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -107,6 +109,7 @@ case class CatalogTablePartition(
     if (parameters.nonEmpty) {
       map.put("Partition Parameters", s"{${parameters.map(p => p._1 + "=" + p._2).mkString(", ")}}")
     }
+    stats.foreach(s => map.put("Partition Statistics", s.simpleString))
--- End diff --

Yes, please add it. Try it and we should expose it to the external users.





[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18875
  
**[Test build #80399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80399/testReport)** for PR 18875 at commit [`4cf4212`](https://github.com/apache/spark/commit/4cf42127a7c1044263a69bc7c2562f198558eb27).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18709: [SPARK-21504] [SQL] Add spark version info into t...

2017-08-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18709#discussion_r131995780
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -217,6 +218,7 @@ case class CatalogTable(
     owner: String = "",
     createTime: Long = System.currentTimeMillis,
     lastAccessTime: Long = -1,
+    createVersion: String = "",
--- End diff --

The default will also impact the temporary view. How about just keeping it 
unchanged?





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18421
  
**[Test build #80401 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80401/testReport)** for PR 18421 at commit [`64efad2`](https://github.com/apache/spark/commit/64efad20eaa5253a08cd69f89e11550fd0246187).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18709: [SPARK-21504] [SQL] Add spark version info into table me...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18709
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18709: [SPARK-21504] [SQL] Add spark version info into table me...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18709
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80398/
Test FAILed.





[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18880
  
**[Test build #80396 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80396/testReport)** for PR 18880 at commit [`04054d0`](https://github.com/apache/spark/commit/04054d0baa5431e3e145ef42cea24c2ae15b6911).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18798
  
**[Test build #80403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80403/testReport)** for PR 18798 at commit [`b02db42`](https://github.com/apache/spark/commit/b02db420132f62299900a5089fb54810ec21a3d5).





[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

2017-08-08 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18884
  
Jenkins, add to white list.






[GitHub] spark pull request #18837: [Spark-20812][Mesos] Add secrets support to the d...

2017-08-08 Thread ArtRand
Github user ArtRand commented on a diff in the pull request:

https://github.com/apache/spark/pull/18837#discussion_r131986331
  
--- Diff: docs/running-on-mesos.md ---
@@ -479,6 +479,35 @@ See the [configuration page](configuration.html) for information on Spark config

  spark.mesos.driver.secret.envkey
  (none)
    If set, the contents of the secret referenced by
    spark.mesos.driver.secret.name will be written to the provided
    environment variable in the driver's process.

  spark.mesos.driver.secret.filename
  (none)
    If set, the contents of the secret referenced by
    spark.mesos.driver.secret.name will be written to the provided
    file. Relative paths are relative to the container's work
    directory. Absolute paths must already exist. Consult the Mesos Secret
    protobuf for more information.

  spark.mesos.driver.secret.name
--- End diff --

Hey @susanxhuynh, so the way it works now is you can specify a secret as a `REFERENCE` or as a `VALUE` type by using the `spark.mesos.driver.secret.name` or `spark.mesos.driver.secret.value` configs, respectively. These secrets are then made file-based and/or env-based depending on the contents of `spark.mesos.driver.secret.filename` and `spark.mesos.driver.secret.envkey`. I allow for multiple secrets (but not multiple types) as comma-separated lists (like Mesos URIs).
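
For illustration, a single reference-type secret exposed both as a file and as an environment variable could be configured like this (a sketch using the config keys from the docs diff above; the secret name and values are made up):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.driver.secret.name", "/mysecret")        // REFERENCE-type secret
  .set("spark.mesos.driver.secret.filename", "secret.data")  // also write it to a file
  .set("spark.mesos.driver.secret.envkey", "MY_SECRET")      // and expose it as an env variable
```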





[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18880
  
**[Test build #80400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80400/testReport)** for PR 18880 at commit [`7398e29`](https://github.com/apache/spark/commit/7398e2968fba8d7132772db186f3b3f4b15e0f46).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18880
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80400/
Test FAILed.





[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18880
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18867: MapOutputTrackerSuite Utest

2017-08-08 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/18867
  
This is by design; otherwise the test case would have failed.
Could you please close this PR? Thanks!





[GitHub] spark issue #18492: [SPARK-19326] Speculated task attempts do not get launch...

2017-08-08 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/18492
  
retest this please





[GitHub] spark issue #18867: MapOutputTrackerSuite Utest

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18867
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18867: MapOutputTrackerSuite Utest

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18867
  
It may be OK not to have a ticket for this, but please update the PR title to be more concrete.





[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18855#discussion_r131890077
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -1415,6 +1415,79 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
       super.fetchBlockSync(host, port, execId, blockId)
     }
   }
+
+  def testGetOrElseUpdateForLargeBlock(storageLevel : StorageLevel) {
+    store = makeBlockManager(6L * 1024 * 1024 * 1024, "exec1")
+    def mkBlobs() = {
+      val rng = new java.util.Random(42)
+      val buff = new Array[Byte](1024 * 1024)
+      rng.nextBytes(buff)
+      Iterator.fill(2 * 1024 + 1) {
+        buff
+      }
+    }
+    val res1 = store.getOrElseUpdate(
+      RDDBlockId(42, 0),
+      storageLevel,
+      implicitly[ClassTag[Array[Byte]]],
+      mkBlobs _
+    )
+    withClue(res1) {
+      assert(res1.isLeft)
+      assert(res1.left.get.data.zipAll(mkBlobs(), null, null).forall {
+        case (a, b) =>
--- End diff --

just `a === b`?





[GitHub] spark issue #18867: MapOutputTrackerSuite Utest

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18867
  
**[Test build #80388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80388/testReport)** for PR 18867 at commit [`b0c58c7`](https://github.com/apache/spark/commit/b0c58c7ff4f0be063e31d9b2f3b0b8b01094c51d).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18867: MapOutputTrackerSuite Utest

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18867
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80388/
Test FAILed.





[GitHub] spark issue #18648: [SPARK-21428] Turn IsolatedClientLoader off while using ...

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18648
  
OK to test





[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18855#discussion_r131889556
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -1415,6 +1415,79 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
       super.fetchBlockSync(host, port, execId, blockId)
     }
   }
+
+  def testGetOrElseUpdateForLargeBlock(storageLevel : StorageLevel) {
--- End diff --

nit: `storageLevel: StorageLevel`





[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18855#discussion_r131890565
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -1415,6 +1415,79 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
       super.fetchBlockSync(host, port, execId, blockId)
     }
   }
+
+  def testGetOrElseUpdateForLargeBlock(storageLevel : StorageLevel) {
+    store = makeBlockManager(6L * 1024 * 1024 * 1024, "exec1")
+    def mkBlobs() = {
+      val rng = new java.util.Random(42)
+      val buff = new Array[Byte](1024 * 1024)
+      rng.nextBytes(buff)
+      Iterator.fill(2 * 1024 + 1) {
+        buff
+      }
+    }
+    val res1 = store.getOrElseUpdate(
+      RDDBlockId(42, 0),
+      storageLevel,
+      implicitly[ClassTag[Array[Byte]]],
+      mkBlobs _
+    )
+    withClue(res1) {
+      assert(res1.isLeft)
+      assert(res1.left.get.data.zipAll(mkBlobs(), null, null).forall {
+        case (a, b) =>
+          a != null &&
+            b != null &&
+            a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq
+      })
+    }
+    val getResult = store.get(RDDBlockId(42, 0))
+    withClue(getResult) {
+      assert(getResult.isDefined)
+      assert(getResult.get.data.zipAll(mkBlobs(), null, null).forall {
+        case (a, b) =>
+          a != null &&
+            b != null &&
+            a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq
+      })
+    }
+    val getBlockRes = store.getBlockData(RDDBlockId(42, 0))
+    withClue(getBlockRes) {
+      try {
+        assert(getBlockRes.size() >= 2 * 1024 * 1024 * 1024)
+        Utils.tryWithResource(getBlockRes.createInputStream()) { inpStrm =>
+          val iter = store
+            .serializerManager
+            .dataDeserializeStream(RDDBlockId(42, 0), inpStrm)(implicitly[ClassTag[Array[Byte]]])
+          assert(iter.zipAll(mkBlobs(), null, null).forall {
+            case (a, b) =>
+              a != null &&
+                b != null &&
+                a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq
+          })
+        }
+      } finally {
+        getBlockRes.release()
+      }
+    }
+  }
+
+  test("getOrElseUpdate > 2gb, storage level = disk only") {
--- End diff --

Oh, we already have one. Then why do we have these tests?





[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18880
  
**[Test build #80385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80385/testReport)** for PR 18880 at commit [`96ca90e`](https://github.com/apache/spark/commit/96ca90e775da3ee5a762cc5b03b0fcf13bee5363).





[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18820
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80382/
Test PASSed.





[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...

2017-08-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18820
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18820
  
**[Test build #80382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80382/testReport)** for PR 18820 at commit [`a09d3e9`](https://github.com/apache/spark/commit/a09d3e987dda4fe5b97a13e76cf9855f346b3eb8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18797: [SPARK-21523][ML] update breeze to 0.13.2 for an emergen...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18797
  
**[Test build #3884 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3884/testReport)** for PR 18797 at commit [`5063758`](https://github.com/apache/spark/commit/5063758c8b1903000e3718d8085b6ef1af3b37f3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18867: MapOutputTrackerSuite Utest

2017-08-08 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/18867
  
This looks like a valid fix, cc @cloud-fan 





[GitHub] spark issue #18867: MapOutputTrackerSuite Utest

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18867
  
ok to test





[GitHub] spark pull request #18882: [SPARK-21652][SQL] Filter out meaningless constra...

2017-08-08 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/18882

[SPARK-21652][SQL] Filter out meaningless constraints inferred in 
inferAdditionalConstraints

## What changes were proposed in this pull request?
This pr added code to filter out meaningless constraints inferred in `inferAdditionalConstraints` (e.g., given the constraints `a = 1`, `b = 1`, `a = c`, and `b = c`, we inferred `a = b`, a predicate that is trivially true). These constraints possibly cause some `Optimizer` overhead, for example:
```
scala> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
scala> Seq(1, 2).toDF("col").write.saveAsTable("t2")
scala> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col").explain(true)
```

In this query, `InferFiltersFromConstraints` infers a new constraint '(col2#33 = col1#32)' that is appended to the join condition, then `PushPredicateThroughJoin` pushes it down, `ConstantPropagation` replaces '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, `ConstantFolding` replaces '1 = 1' with 'true', and `BooleanSimplification` finally removes this predicate. However, `InferFiltersFromConstraints` will again infer '(col2#33 = col1#32)' on the next iteration, and the process will continue until the limit of iterations is reached.
See below for more details

```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
Before:
Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34)))
:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (col2#33 = col1#32)
:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:     +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (col2#33 = col1#32)
:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33)))
:     +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
:  +- Relation[col1#32,col2#33] parquet
+- Filter ((1 = col#34) && isnotnull(col#34))
   +- Relation[col#34] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation ===
Before:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
:  +- Relation[col1#32,col2#33] parquet
After:
Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (1 = 1))
:  +- Relation[col1#32,col2#33] parquet
```

[GitHub] spark pull request #18648: [SPARK-21428] Turn IsolatedClientLoader off while...

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18648#discussion_r131886506
  
--- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveCliSessionStateSuite.scala ---
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.thriftserver
+
+import org.apache.hadoop.hive.cli.CliSessionState
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hadoop.hive.ql.session.SessionState
+
+import org.apache.spark.{SparkConf, SparkFunSuite}
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.sql.hive.HiveUtils
+
+class HiveCliSessionStateSuite extends SparkFunSuite {
+
+  test("CliSessionState will be reused") {
+    val hiveConf = new HiveConf(classOf[SessionState])
+    HiveUtils.newTemporaryConfiguration(useInMemoryDerby = false).foreach {
+      case (key, value) => hiveConf.set(key, value)
+    }
+    val sessionState: SessionState = new CliSessionState(hiveConf)
+    SessionState.start(sessionState)
+    val s1 = SessionState.get
+    val sparkConf = new SparkConf()
+    val hadoopConf = SparkHadoopUtil.get.newConfiguration(sparkConf)
+    val s2 = HiveUtils.newClientForMetadata(sparkConf, hadoopConf).getState
--- End diff --

How about `HiveUtils.newClientForMetadata(sparkConf, hadoopConf).asInstanceOf[HiveClientImpl].state`? Then we don't need to add `getState`.





[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18882
  
**[Test build #80389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80389/testReport)** for PR 18882 at commit [`d253e40`](https://github.com/apache/spark/commit/d253e40788b9e3408c106eff0ba84ae97d715cbb).





[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18865#discussion_r131887972
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
@@ -114,7 +114,16 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
     }
 
     (file: PartitionedFile) => {
-      val parser = new JacksonParser(actualSchema, parsedOptions)
+      // SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`,
--- End diff --

> The column _corrupted_record is different when the selected columns are different

If `_corrupted_record` is designed to have different values for different selected columns, it may make sense to set `_corrupted_record` to null if no columns are selected.
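
For reference, the query shape under discussion is roughly the following (a sketch; the schema and path are made up):

```
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("field", IntegerType)
  .add("_corrupt_record", StringType)

val df = spark.read.schema(schema).json("/tmp/records.json")
// Only `_corrupt_record` ends up in requiredSchema here: should this show the
// corrupt records, or all nulls because no data columns were selected?
df.select("_corrupt_record").show()
```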





[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18855#discussion_r131889407
  
--- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala ---
@@ -165,6 +147,62 @@ private[spark] class DiskStore(
 
 }
 
+private class DiskBlockData(
+    conf: SparkConf,
--- End diff --

we can pass in `minMemoryMapBytes` directly.
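
That is, something along these lines (a sketch; the other parameters are illustrative, not the PR's exact signature):

```
// The caller resolves the threshold once, e.g.
// conf.getSizeAsBytes("spark.storage.memoryMapThreshold", "2m"),
// and passes the resulting Long instead of the whole SparkConf.
private class DiskBlockData(
    minMemoryMapBytes: Long,
    file: java.io.File,
    blockSize: Long) {
  // ...
}
```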





[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18855#discussion_r131890389
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -1415,6 +1415,79 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
       super.fetchBlockSync(host, port, execId, blockId)
     }
   }
+
+  def testGetOrElseUpdateForLargeBlock(storageLevel : StorageLevel) {
+    store = makeBlockManager(6L * 1024 * 1024 * 1024, "exec1")
+    def mkBlobs() = {
+      val rng = new java.util.Random(42)
+      val buff = new Array[Byte](1024 * 1024)
+      rng.nextBytes(buff)
+      Iterator.fill(2 * 1024 + 1) {
+        buff
+      }
+    }
+    val res1 = store.getOrElseUpdate(
+      RDDBlockId(42, 0),
+      storageLevel,
+      implicitly[ClassTag[Array[Byte]]],
+      mkBlobs _
+    )
+    withClue(res1) {
+      assert(res1.isLeft)
+      assert(res1.left.get.data.zipAll(mkBlobs(), null, null).forall {
+        case (a, b) =>
+          a != null &&
+            b != null &&
+            a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq
+      })
+    }
+    val getResult = store.get(RDDBlockId(42, 0))
+    withClue(getResult) {
+      assert(getResult.isDefined)
+      assert(getResult.get.data.zipAll(mkBlobs(), null, null).forall {
+        case (a, b) =>
+          a != null &&
+            b != null &&
+            a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq
+      })
+    }
+    val getBlockRes = store.getBlockData(RDDBlockId(42, 0))
+    withClue(getBlockRes) {
+      try {
+        assert(getBlockRes.size() >= 2 * 1024 * 1024 * 1024)
+        Utils.tryWithResource(getBlockRes.createInputStream()) { inpStrm =>
+          val iter = store
+            .serializerManager
+            .dataDeserializeStream(RDDBlockId(42, 0), inpStrm)(implicitly[ClassTag[Array[Byte]]])
+          assert(iter.zipAll(mkBlobs(), null, null).forall {
+            case (a, b) =>
+              a != null &&
+                b != null &&
+                a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq
+          })
+        }
+      } finally {
+        getBlockRes.release()
+      }
+    }
+  }
+
+  test("getOrElseUpdate > 2gb, storage level = disk only") {
--- End diff --

shall we just write a test in `DiskStoreSuite`?





[GitHub] spark pull request #18814: [SPARK-21608][SPARK-9221][SQL] Window rangeBetwee...

2017-08-08 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18814#discussion_r131876058
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala ---
@@ -89,7 +89,11 @@ case class WindowSpecDefinition(
     elements.mkString("(", " ", ")")
   }
 
-  private def isValidFrameType(ft: DataType): Boolean = orderSpec.head.dataType == ft
+  private def isValidFrameType(ft: DataType): Boolean = (orderSpec.head.dataType, ft) match {
+    case (DateType, IntegerType) => true
--- End diff --

Yea, we should allow this, but it is a bit of a corner case and not very straightforward to implement; maybe we can leave it as a follow-up?





[GitHub] spark issue #18814: [SPARK-21608][SPARK-9221][SQL] Window rangeBetween() API...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18814
  
**[Test build #80386 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80386/testReport)** for PR 18814 at commit [`3e3b58c`](https://github.com/apache/spark/commit/3e3b58c8911e266b2af985da3dec53418a608d2b).





[GitHub] spark issue #18801: SPARK-10878 Fix race condition when multiple clients res...

2017-08-08 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/18801
  
Your test case doesn't reflect the change you made to support multiple clients resolving artifacts at the same time; could you add a new test case or test that manually? Thanks!





[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18855#discussion_r131890999
  
--- Diff: core/src/test/scala/org/apache/spark/storage/DiskStoreSuite.scala ---
@@ -92,6 +92,31 @@ class DiskStoreSuite extends SparkFunSuite {
     assert(diskStore.getSize(blockId) === 0L)
   }
 
+  test("blocks larger than 2gb") {
+    val conf = new SparkConf()
+    val diskBlockManager = new DiskBlockManager(conf, deleteFilesOnStop = true)
+    val diskStore = new DiskStore(conf, diskBlockManager, new SecurityManager(conf))
+
+    val mb = 1024 * 1024
+    val gb = 1024L * mb
+
+    val blockId = BlockId("rdd_1_2")
+    diskStore.put(blockId) { chan =>
+      val arr = new Array[Byte](mb)
+      for {
+        _ <- 0 until 2048
+      } {
+        val buf = ByteBuffer.wrap(arr)
+        while (buf.hasRemaining()) {
+          chan.write(buf)
+        }
+      }
+    }
+
+    val blockData = diskStore.getBytes(blockId)
+    assert(blockData.size == 2 * gb)
--- End diff --

test with 3gb to be more explicit that it's larger than 2gb?





[GitHub] spark issue #18878: fix issue with spark-submit with java

2017-08-08 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/18878
  
It works fine for me in, for example:

```
spark2-submit --class "org.apache.spark.examples.SparkPi" --deploy-mode client --master yarn /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_2.11-2.2.0.cloudera1.jar
```

The thing is the quotes never even make it to the submit script, right? 
bash interprets them. That's why I can't see how this would be an issue.  
What's your exact command line?





[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

2017-08-08 Thread aokolnychyi
Github user aokolnychyi commented on the issue:

https://github.com/apache/spark/pull/18692
  
@gatorsmile I updated the rule to cover cross join cases. Regarding the 
case with the redundant condition mentioned by you, I opened 
[SPARK-21652](https://issues.apache.org/jira/browse/SPARK-21652). It is an 
existing issue and is not caused by the proposed rule. BTW, I can try to fix it 
once we agree on a solution.





[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2017-08-08 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18544
  
@gatorsmile 
Some test cases have been added.
Thanks for reviewing.





[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-08 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/18818#discussion_r131913185
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---
@@ -453,6 +453,14 @@ case class Or(left: Expression, right: Expression) extends BinaryOperator with P
 
 abstract class BinaryComparison extends BinaryOperator with Predicate {
 
+  override def inputType: AbstractDataType = AnyDataType
--- End diff --

We have to define `inputType` because it extends `BinaryOperator`. Previously, the `LessThan`-like operators declared an `inputType` that was a subset of what they could actually support. This PR fixes that, but since the supported types cannot be finitely specified as a type collection (there are countably infinitely many legal `StructType`s), we have to declare a superset of what is actually supported as `inputType` and then do the real recursive check in `checkInputDataTypes`. This is much like how the `EqualTo` and `EqualNullSafe` operators were previously implemented; this PR just moves that logic up to `BinaryComparison`, as it is really the same for the equality and inequality operators.
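
In code, the pattern looks roughly like this (a condensed sketch against catalyst's internal types, not a verbatim copy of the diff):

```
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.util.TypeUtils
import org.apache.spark.sql.types.{AbstractDataType, AnyDataType}

abstract class BinaryComparison extends BinaryOperator with Predicate {

  // `BinaryOperator` demands a concrete `inputType`, but the orderable
  // types (including arbitrarily nested structs) cannot be enumerated,
  // so we accept a superset here...
  override def inputType: AbstractDataType = AnyDataType

  // ...and enforce the real, recursive constraint here instead.
  override def checkInputDataTypes(): TypeCheckResult = {
    super.checkInputDataTypes() match {
      case TypeCheckResult.TypeCheckSuccess =>
        TypeUtils.checkForOrderingExpr(left.dataType, s"operator $symbol")
      case failure => failure
    }
  }
}
```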

Did that answer your concerns? 





[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18544
  
**[Test build #80393 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80393/testReport)** for PR 18544 at commit [`9fa7078`](https://github.com/apache/spark/commit/9fa70788aa78bb23a10a04a6c99953d2050bab79).





[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...

2017-08-08 Thread eyalfa
Github user eyalfa commented on a diff in the pull request:

https://github.com/apache/spark/pull/18855#discussion_r131913940
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -1415,6 +1415,79 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
       super.fetchBlockSync(host, port, execId, blockId)
     }
   }
+
+  def testGetOrElseUpdateForLargeBlock(storageLevel: StorageLevel) {
+    store = makeBlockManager(6L * 1024 * 1024 * 1024, "exec1")
+    def mkBlobs() = {
+      val rng = new java.util.Random(42)
+      val buff = new Array[Byte](1024 * 1024)
+      rng.nextBytes(buff)
+      Iterator.fill(2 * 1024 + 1) {
+        buff
+      }
+    }
+    val res1 = store.getOrElseUpdate(
+      RDDBlockId(42, 0),
+      storageLevel,
+      implicitly[ClassTag[Array[Byte]]],
+      mkBlobs _
+    )
+    withClue(res1) {
+      assert(res1.isLeft)
+      assert(res1.left.get.data.zipAll(mkBlobs(), null, null).forall {
+        case (a, b) =>
+          a != null &&
+            b != null &&
+            a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq
+      })
+    }
+    val getResult = store.get(RDDBlockId(42, 0))
+    withClue(getResult) {
+      assert(getResult.isDefined)
+      assert(getResult.get.data.zipAll(mkBlobs(), null, null).forall {
+        case (a, b) =>
+          a != null &&
+            b != null &&
+            a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq
+      })
+    }
+    val getBlockRes = store.getBlockData(RDDBlockId(42, 0))
+    withClue(getBlockRes) {
+      try {
+        // 2L to avoid Int overflow: 2 * 1024 * 1024 * 1024 wraps to a negative Int.
+        assert(getBlockRes.size() >= 2L * 1024 * 1024 * 1024)
+        Utils.tryWithResource(getBlockRes.createInputStream()) { inpStrm =>
+          val iter = store
+            .serializerManager
+            .dataDeserializeStream(RDDBlockId(42, 0), inpStrm)(implicitly[ClassTag[Array[Byte]]])
+          assert(iter.zipAll(mkBlobs(), null, null).forall {
+            case (a, b) =>
+              a != null &&
+                b != null &&
+                a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq
+          })
+        }
+      } finally {
+        getBlockRes.release()
+      }
+    }
+  }
+
+  test("getOrElseUpdate > 2gb, storage level = disk only") {
--- End diff --

These tests cover more than just the `DiskOnly` storage level; they were crafted when I had bigger ambitions of solving the entire 2GB issue :sunglasses:, before I saw some ~100-file pull requests get abandoned or rejected.
Aside from that, these tests also exercise the entire orchestration done by `BlockManager` when an `RDD` requests a cached partition; note that they intentionally make two calls to the `BlockManager` in order to hit both code paths (cache-hit and cache-miss), as sketched below.
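
A condensed sketch of that two-call pattern (using the names from the test above; `storageLevel` comes in as the test parameter):

```
// First call: nothing is cached yet, so the block manager computes the
// block via mkBlobs and stores it. This is the cache-miss path.
val miss = store.getOrElseUpdate(
  RDDBlockId(42, 0), storageLevel, implicitly[ClassTag[Array[Byte]]], mkBlobs _)

// Second call: the block is now cached, so it must be served from storage.
// This is the cache-hit path.
val hit = store.get(RDDBlockId(42, 0))
assert(hit.isDefined)
```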





[GitHub] spark pull request #18883: [SPARK-21276][CORE] Update lz4-java to the latest...

2017-08-08 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/18883#discussion_r131920841
  
--- Diff: core/src/main/java/org/apache/spark/io/LZ4BlockInputStream.java ---
@@ -1,260 +0,0 @@
-/*
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.spark.io;
-
-import java.io.EOFException;
-import java.io.FilterInputStream;
-import java.io.IOException;
-import java.io.InputStream;
-import java.util.zip.Checksum;
-
-import net.jpountz.lz4.LZ4Exception;
-import net.jpountz.lz4.LZ4Factory;
-import net.jpountz.lz4.LZ4FastDecompressor;
-import net.jpountz.util.SafeUtils;
-import net.jpountz.xxhash.XXHashFactory;
-
-/**
- * {@link InputStream} implementation to decode data written with
- * {@link net.jpountz.lz4.LZ4BlockOutputStream}. This class is not thread-safe and does not
- * support {@link #mark(int)}/{@link #reset()}.
- * @see net.jpountz.lz4.LZ4BlockOutputStream
- *
- * This is based on net.jpountz.lz4.LZ4BlockInputStream
- *
- * changes: https://github.com/davies/lz4-java/commit/cc1fa940ac57cc66a0b937300f805d37e2bf8411
- *
- * TODO: merge this into upstream
- */
-public final class LZ4BlockInputStream extends FilterInputStream {
--- End diff --

Yeah I guess this needs a MiMa exclude. It's technically public but nobody 
should have ever referenced this directly. The codec class is separate.
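
For reference, the exclude would be a one-liner in `project/MimaExcludes.scala`, something along these lines (a sketch; the exact `Seq` it lands in depends on the release branch):

```
import com.typesafe.tools.mima.core._

// [SPARK-21276] Spark's fork of LZ4BlockInputStream was removed in favor of
// the upstream lz4-java class, so tell MiMa the missing class is intentional.
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.io.LZ4BlockInputStream")
```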





[GitHub] spark issue #18883: [SPARK-21276][CORE] Update lz4-java to the latest (v1.4....

2017-08-08 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/18883
  
I'll update soon.





[GitHub] spark issue #18492: [SPARK-19326] Speculated task attempts do not get launch...

2017-08-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18492
  
**[Test build #80387 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80387/testReport)** for PR 18492 at commit [`77b4729`](https://github.com/apache/spark/commit/77b47293d9655408b0dbe4d5896492b366575b3a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes fails fo...

2017-08-08 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18855
  
ok to test





[GitHub] spark pull request #18880: [SPARK-21665][Core]Need to close resources after ...

2017-08-08 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/18880#discussion_r131907109
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -207,9 +207,16 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
       uriScheme match {
         case "file" =>
           try {
-            val jar = new JarFile(uri.getPath)
-            // Note that this might still return null if no main-class is set; we catch that later
+            var jar: JarFile = null
+            Utils.tryWithSafeFinally {
--- End diff --

Ah, pardon, I meant `Utils.tryWithResource`; `tryWithSafeFinally` doesn't help close the resource. I don't know which one is actually clearer; up to you.
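
For comparison, a sketch of the `tryWithResource` variant, which closes the jar for you (`JarFile` is `Closeable`, which is all `Utils.tryWithResource` requires):

```
mainClass = Utils.tryWithResource(new JarFile(uri.getPath)) { jar =>
  // This might still return null if no Main-Class is set; we catch that later.
  jar.getManifest.getMainAttributes.getValue("Main-Class")
}
```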




