[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18880 **[Test build #80385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80385/testReport)** for PR 18880 at commit [`96ca90e`](https://github.com/apache/spark/commit/96ca90e775da3ee5a762cc5b03b0fcf13bee5363). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18849#discussion_r131894824 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -2356,18 +2356,9 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { }.getMessage assert(e.contains("Found duplicate column(s)")) } else { -if (isUsingHiveMetastore) { - // hive catalog will still complains that c1 is duplicate column name because hive - // identifiers are case insensitive. - val e = intercept[AnalysisException] { -sql("ALTER TABLE t1 ADD COLUMNS (C1 string)") - }.getMessage - assert(e.contains("HiveException")) -} else { - sql("ALTER TABLE t1 ADD COLUMNS (C1 string)") - assert(spark.table("t1").schema -.equals(new StructType().add("c1", IntegerType).add("C1", StringType))) -} +sql("ALTER TABLE t1 ADD COLUMNS (C1 string)") +assert(spark.table("t1").schema + .equals(new StructType().add("c1", IntegerType).add("C1", StringType))) --- End diff -- `.equals` -> `==` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18865#discussion_r131895717 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala --- @@ -114,7 +114,16 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister { } (file: PartitionedFile) => { - val parser = new JacksonParser(actualSchema, parsedOptions) + // SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`, --- End diff -- Yeah, we've also considered this possibility. But currently we thought a query like `dfFromFile.select($"_corrupt_record")` shows all nulls looks countering intuition because the query is intended to get the corrupt records among the JSON records. That is why we provide a fix like current one. IMO, it's meaningless to have all nulls for `_corrupted_record`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18814: [SPARK-21608][SPARK-9221][SQL] Window rangeBetween() API...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18814 **[Test build #80386 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80386/testReport)** for PR 18814 at commit [`3e3b58c`](https://github.com/apache/spark/commit/3e3b58c8911e266b2af985da3dec53418a608d2b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r131891836 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/CosineSilhouette.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{DenseVector, Vector, VectorElementWiseSum} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.{col, count} + +private[evaluation] object CosineSilhouette { --- End diff -- There is no clustering algorithms using other distance metrics except for squared euclidean distance currently. I'd suggest to remove the ```CosineSilhouette``` implementation firstly, we can add it back when it's needed. This can also make this PR more easy to review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r131891038 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouette.scala --- @@ -0,0 +1,115 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{Vector, VectorElementWiseSum} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.{col, count, sum} + +private[evaluation] object SquaredEuclideanSilhouette { + + private[this] var kryoRegistrationPerformed: Boolean = false + + /** + * This method registers the class + * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]] + * for kryo serialization. + * + * @param sc `SparkContext` to be used + */ + def registerKryoClasses(sc: SparkContext): Unit = { +if (! kryoRegistrationPerformed) { + sc.getConf.registerKryoClasses( +Array( + classOf[SquaredEuclideanSilhouette.ClusterStats] +) + ) + kryoRegistrationPerformed = true +} + } + + case class ClusterStats(Y: Vector, psi: Double, count: Long) + + def computeCsi(vector: Vector): Double = { +var sumOfSquares = 0.0 +vector.foreachActive((_, v) => { + sumOfSquares += v * v +}) +sumOfSquares + } + + def computeYVectorPsiAndCount( + df: DataFrame, + predictionCol: String, + featuresCol: String): DataFrame = { +val Yudaf = new VectorElementWiseSum() +df.groupBy(predictionCol) + .agg( +count("*").alias("count"), +sum("csi").alias("psi"), +Yudaf(col(featuresCol)).alias("y") --- End diff -- Please rename ```csi``` to ```squaredNorm```, ```psi``` to ```squaredNormSum```, ```y``` to ```featureSum``` if I don't have misunderstand. We should use more descriptive name. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18880 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80385/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18880 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18873: Fixing python 2.6 tests for jenkings
Github user dmvieira commented on the issue: https://github.com/apache/spark/pull/18873 Hey guys, I just opened this PR because I spent a lot of time trying to fix Jenkins tests in my last PR when the error was in test script with python 2.6... I can close it, but I know that more guys will spend more time trying to fix it again. Ok @felixcheung --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18880 **[Test build #80391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80391/testReport)** for PR 18880 at commit [`5a4e859`](https://github.com/apache/spark/commit/5a4e859d434b0a5b8539cab2ef0d9346ce81c08e). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18814: [SPARK-21608][SPARK-9221][SQL] Window rangeBetween() API...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18814 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18814: [SPARK-21608][SPARK-9221][SQL] Window rangeBetween() API...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18814 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80386/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r131890318 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouette.scala --- @@ -0,0 +1,115 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{Vector, VectorElementWiseSum} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.{col, count, sum} + +private[evaluation] object SquaredEuclideanSilhouette { + + private[this] var kryoRegistrationPerformed: Boolean = false + + /** + * This method registers the class + * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]] + * for kryo serialization. + * + * @param sc `SparkContext` to be used + */ + def registerKryoClasses(sc: SparkContext): Unit = { +if (! kryoRegistrationPerformed) { + sc.getConf.registerKryoClasses( +Array( + classOf[SquaredEuclideanSilhouette.ClusterStats] +) + ) + kryoRegistrationPerformed = true +} + } + + case class ClusterStats(Y: Vector, psi: Double, count: Long) + + def computeCsi(vector: Vector): Double = { --- End diff -- Can we use ```Vectors.norm(vector, 2.0)```? It should be more efficient for both dense and sparse vector. Actually we can remove this function if you refactor code as my suggested below. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r131868145 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils} +import org.apache.spark.sql.Dataset +import org.apache.spark.sql.functions.{avg, col} +import org.apache.spark.sql.types.IntegerType + +/** + * Evaluator for clustering results. + * At the moment, the supported metrics are: + * squaredSilhouette: silhouette measure using the squared Euclidean distance; + * cosineSilhouette: silhouette measure using the cosine distance. + * The implementation follows the proposal explained + * https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view;> + * in this document. + */ +@Experimental +class ClusteringEvaluator (val uid: String) + extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable { + + def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette")) + + override def copy(pMap: ParamMap): Evaluator = this.defaultCopy(pMap) + + override def isLargerBetter: Boolean = true + + /** @group setParam */ + def setPredictionCol(value: String): this.type = set(predictionCol, value) + + /** @group setParam */ + def setFeaturesCol(value: String): this.type = set(featuresCol, value) + + /** + * param for metric name in evaluation + * (supports `"squaredSilhouette"` (default), `"cosineSilhouette"`) + * @group param + */ + val metricName: Param[String] = { +val allowedParams = ParamValidators.inArray(Array("squaredSilhouette", "cosineSilhouette")) +new Param( + this, + "metricName", + "metric name in evaluation (squaredSilhouette|cosineSilhouette)", + allowedParams +) + } + + /** @group getParam */ + def getMetricName: String = $(metricName) + + /** @group setParam */ + def setMetricName(value: String): this.type = set(metricName, value) + + setDefault(metricName -> "squaredSilhouette") + + override def evaluate(dataset: Dataset[_]): Double = { +SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT) +SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType) + +val metric: Double = $(metricName) match { + case "squaredSilhouette" => +computeSquaredSilhouette(dataset) + case "cosineSilhouette" => +computeCosineSilhouette(dataset) +} +metric + } + + private[this] def computeCosineSilhouette(dataset: Dataset[_]): Double = { +CosineSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext) + +val computeCsi = dataset.sparkSession.udf.register("computeCsi", --- End diff -- Could we use more descriptive name? We can't get what this function does from its name. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r131892095 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouette.scala --- @@ -0,0 +1,115 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{Vector, VectorElementWiseSum} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.{col, count, sum} + +private[evaluation] object SquaredEuclideanSilhouette { --- End diff -- Let's move this to file ```ClusteringEvaluator```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r131889868 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouette.scala --- @@ -0,0 +1,115 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{Vector, VectorElementWiseSum} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.{col, count, sum} + +private[evaluation] object SquaredEuclideanSilhouette { + + private[this] var kryoRegistrationPerformed: Boolean = false + + /** + * This method registers the class + * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]] + * for kryo serialization. + * + * @param sc `SparkContext` to be used + */ + def registerKryoClasses(sc: SparkContext): Unit = { +if (! kryoRegistrationPerformed) { + sc.getConf.registerKryoClasses( +Array( + classOf[SquaredEuclideanSilhouette.ClusterStats] +) + ) + kryoRegistrationPerformed = true +} + } + + case class ClusterStats(Y: Vector, psi: Double, count: Long) + + def computeCsi(vector: Vector): Double = { +var sumOfSquares = 0.0 +vector.foreachActive((_, v) => { + sumOfSquares += v * v +}) +sumOfSquares + } + + def computeYVectorPsiAndCount( + df: DataFrame, + predictionCol: String, + featuresCol: String): DataFrame = { +val Yudaf = new VectorElementWiseSum() +df.groupBy(predictionCol) + .agg( +count("*").alias("count"), +sum("csi").alias("psi"), +Yudaf(col(featuresCol)).alias("y") + ) --- End diff -- Aggregate function performance is not ideal for column of non-primitive type(like here is vector type). So we would still use RDD-based aggregate. You can factor this part of code following [```NaiveBayes```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala#L161) like: ``` import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vectors} import org.apache.spark.sql.functions._ val numFeatures = ... val squaredNorm = udf { features: Vector => math.pow(Vectors.norm(features, 2.0), 2.0) } df.select(col(predictionCol), col(featuresCol)) .withColumn("squaredNorm", squaredNorm(col(featuresCol))) .rdd .map { row => (row.getDouble(0), (row.getAs[Vector](1), row.getDouble(2))) } .aggregateByKey[(DenseVector, Double)]((Vectors.zeros(numFeatures).toDense, 0.0))( seqOp = { case ((featureSum: DenseVector, squaredNormSum: Double), (features, squaredNorm)) => BLAS.axpy(1.0, features, featureSum) (featureSum, squaredNormSum + squaredNorm) }, combOp = { case ((featureSum1, squaredNormSum1), (featureSum2, squaredNormSum2)) => BLAS.axpy(1.0, featureSum2, featureSum1) (featureSum1, squaredNormSum1 + squaredNormSum2) }).collect() ``` In my suggestion, you can compute ```csi``` and ```y``` in a single data pass, which should be more efficient. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18882 **[Test build #80390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80390/testReport)** for PR 18882 at commit [`dc2112f`](https://github.com/apache/spark/commit/dc2112f6a26125cd6a67eac79cef91751ac639f8). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18878: fix issue with spark-submit with java
Github user brandonJY commented on the issue: https://github.com/apache/spark/pull/18878 Running spark 2.2.0 standalone mode in Ubuntu 16.04 LTS. It interprets classpath with the quote. E.g. for classname=`foo`, it interprets it as `"foo"` Probably it is just a environment issue? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18881: [SPARK-20433][BUILD] Bump jackson from 2.6.5 to 2.6.7.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18881 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80384/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18881: [SPARK-20433][BUILD] Bump jackson from 2.6.5 to 2.6.7.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18881 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18880: [SPARK-21665][Core]Need to close resources after ...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/18880#discussion_r131897551 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala --- @@ -754,6 +754,7 @@ private[spark] class Client( val writer = new OutputStreamWriter(confStream, StandardCharsets.UTF_8) props.store(writer, "Spark configuration.") writer.flush() + writer.close() --- End diff -- This should be moved after `confStream.closeEntry()` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18880: [SPARK-21665][Core]Need to close resources after ...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/18880#discussion_r131897597 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala --- @@ -754,6 +754,7 @@ private[spark] class Client( val writer = new OutputStreamWriter(confStream, StandardCharsets.UTF_8) props.store(writer, "Spark configuration.") writer.flush() + writer.close() --- End diff -- This should be moved to after `confStream.closeEntry()` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18544 **[Test build #80392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80392/testReport)** for PR 18544 at commit [`bb454b1`](https://github.com/apache/spark/commit/bb454b1d89e239f0cfb1dfd98189ec3fdb59cff6). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18492: [SPARK-19326] Speculated task attempts do not get launch...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18492 **[Test build #80387 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80387/testReport)** for PR 18492 at commit [`77b4729`](https://github.com/apache/spark/commit/77b47293d9655408b0dbe4d5896492b366575b3a). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18867: MapOutputTrackerSuite Utest
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18867 **[Test build #80388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80388/testReport)** for PR 18867 at commit [`b0c58c7`](https://github.com/apache/spark/commit/b0c58c7ff4f0be063e31d9b2f3b0b8b01094c51d). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18882 **[Test build #80389 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80389/testReport)** for PR 18882 at commit [`d253e40`](https://github.com/apache/spark/commit/d253e40788b9e3408c106eff0ba84ae97d715cbb). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18882 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18882 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80389/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18855#discussion_r131889921 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala --- @@ -1415,6 +1415,79 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE super.fetchBlockSync(host, port, execId, blockId) } } + + def testGetOrElseUpdateForLargeBlock(storageLevel : StorageLevel) { +store = makeBlockManager(6L * 1024 * 1024 * 1024, "exec1") +def mkBlobs() = { + val rng = new java.util.Random(42) + val buff = new Array[Byte](1024 * 1024) + rng.nextBytes(buff) + Iterator.fill(2 * 1024 + 1) { +buff + } +} +val res1 = store.getOrElseUpdate( + RDDBlockId(42, 0), + storageLevel, + implicitly[ClassTag[Array[Byte]]], + mkBlobs _ +) +withClue(res1) { --- End diff -- does `res1` have a reasonable string representation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18798 **[Test build #80407 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80407/testReport)** for PR 18798 at commit [`b02db42`](https://github.com/apache/spark/commit/b02db420132f62299900a5089fb54810ec21a3d5). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18886: [SPARK-21671][core] Move kvstore to "util" sub-pa...
GitHub user vanzin opened a pull request: https://github.com/apache/spark/pull/18886 [SPARK-21671][core] Move kvstore to "util" sub-package, add private annotation. You can merge this pull request into a Git repository by running: $ git pull https://github.com/vanzin/spark SPARK-21671 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18886.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18886 commit 1f86189bb71507c73c662bb1681166bc54db7dd3 Author: Marcelo VanzinDate: 2017-08-08T18:07:17Z [SPARK-21671][core] Move kvstore to "util" sub-package, add private annotation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18886: [SPARK-21671][core] Move kvstore to "util" sub-package, ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18886 **[Test build #80408 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80408/testReport)** for PR 18886 at commit [`1f86189`](https://github.com/apache/spark/commit/1f86189bb71507c73c662bb1681166bc54db7dd3). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17373: [SPARK-12664] Expose probability in mlp model
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17373 **[Test build #80409 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80409/testReport)** for PR 17373 at commit [`bcb44af`](https://github.com/apache/spark/commit/bcb44af65c3c7f9c0ead6cff5706243da80f88bc). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r131992917 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala --- @@ -83,6 +83,36 @@ class MultilayerPerceptronClassifierSuite } } + test("strong dataset test") { +val layers = Array[Int](4, 5, 5, 4) + +val rnd = new scala.util.Random(1234L) + +val strongDataset = Seq.tabulate(4) { index => + (Vectors.dense( +rnd.nextGaussian(), +rnd.nextGaussian() * 2.0, +rnd.nextGaussian() * 3.0, +rnd.nextGaussian() * 2.0 + ), (index % 4).toDouble) +}.toDF("features", "label") +val trainer = new MultilayerPerceptronClassifier() + .setLayers(layers) + .setBlockSize(1) + .setSeed(123L) + .setMaxIter(100) + .setSolver("l-bfgs") +val model = trainer.fit(strongDataset) +val result = model.transform(strongDataset) +model.setProbabilityCol("probability") +MLTestingUtils.checkCopyAndUids(trainer, model) +// result.select("probability").show(false) +val predictionAndLabels = result.select("prediction", "label").collect() +predictionAndLabels.foreach { case Row(p: Double, l: Double) => + assert(p == l) +} + } --- End diff -- @MrBago How do you like this test ? The probability it generate is +--+ |probability | +--+ |[0.9917274999513315,0.001511626318489583,0.004831796668307991,0.0019290770618710876] | |[4.2392735713619E-12,0.99955336,1.8369996605279208E-14,2.0871629225077174E-13]| |[1.8975708749716946E-4,5.191732707447977E-22,0.5010860788259045,0.49872416408659836] | |[1.6776134471360903E-4,3.9309610969078615E-22,0.49629577580941386,0.5035364628458726] | +--+ it contains some values near 0.5 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18881: [SPARK-20433][BUILD] Bump jackson from 2.6.5 to 2.6.7.1
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18881 cc @zsxwing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18875 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18875 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80399/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18709: [SPARK-21504] [SQL] Add spark version info into t...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18709#discussion_r131995378 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -700,6 +704,9 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat table = restoreDataSourceTable(table, provider) } +// Restore version info +val version: String = table.properties.getOrElse(CREATED_SPARK_VERSION, "") --- End diff -- I will change this to `2.2 or prior` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18709: [SPARK-21504] [SQL] Add spark version info into table me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18709 **[Test build #80410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80410/testReport)** for PR 18709 at commit [`9d7c577`](https://github.com/apache/spark/commit/9d7c577c24f320d9b7f35ae827f83ff65be5019f). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18881: [SPARK-20433][BUILD] Bump jackson from 2.6.5 to 2.6.7.1
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/18881 Can you add "Closes #18789" to automate the closing of the other PR? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80401/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18421 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18421 **[Test build #80402 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80402/testReport)** for PR 18421 at commit [`ba48073`](https://github.com/apache/spark/commit/ba4807342db2943d72c32e62a0bf043efeb573a3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18421 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80402/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18818#discussion_r132002008 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -453,6 +453,14 @@ case class Or(left: Expression, right: Expression) extends BinaryOperator with P abstract class BinaryComparison extends BinaryOperator with Predicate { + override def inputType: AbstractDataType = AnyDataType --- End diff -- We have to write the code comments for explaining it. Given this assumption, `inputType` might be misused in the future. In addition, this PR has a pretty critical change. We really need to check all the new data types we support in this PR. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ordering.scala#L90-L97 Could you write a comprehensive test case coverage for this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18640: [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18640 Sure. Thank you so much, @omalley ! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18798 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r131993682 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -256,6 +257,201 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } + test("analyze single partition") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): CatalogStatistics = { + val partition = + spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds)) + partition.stats.get +} + +def createPartition(ds: String, query: String): Unit = { + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query") +} + +withTable(tableName) { + sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING)") + + createPartition("2010-01-01", "SELECT '1', 'A' from src") + createPartition("2010-01-02", "SELECT '1', 'A' from src UNION ALL SELECT '1', 'A' from src") + createPartition("2010-01-03", "SELECT '1', 'A' from src") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS NOSCAN") + + assert(queryStats("2010-01-01").rowCount === None) + assert(queryStats("2010-01-01").sizeInBytes === 2000) + + assert(queryStats("2010-01-02").rowCount === None) + assert(queryStats("2010-01-02").sizeInBytes === 2*2000) + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS") + + assert(queryStats("2010-01-01").rowCount.get === 500) + assert(queryStats("2010-01-01").sizeInBytes === 2000) + + assert(queryStats("2010-01-02").rowCount.get === 2*500) + assert(queryStats("2010-01-02").sizeInBytes === 2*2000) +} + } + + test("analyze a set of partitions") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String, hr: String): Option[CatalogStatistics] = { + val tableId = TableIdentifier(tableName) + val partition = +spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, "hr" -> hr)) + partition.stats +} + +def assertPartitionStats( +ds: String, +hr: String, +rowCount: Option[BigInt], +sizeInBytes: BigInt): Unit = { + val stats = queryStats(ds, hr).get + assert(stats.rowCount === rowCount) + assert(stats.sizeInBytes === sizeInBytes) +} + +def createPartition(ds: String, hr: Int, query: String): Unit = { + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds', hr=$hr) $query") +} + +withTable(tableName) { + sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING, hr INT)") + + createPartition("2010-01-01", 10, "SELECT '1', 'A' from src") + createPartition("2010-01-01", 11, "SELECT '1', 'A' from src") + createPartition("2010-01-02", 10, "SELECT '1', 'A' from src") + createPartition("2010-01-02", 11, +"SELECT '1', 'A' from src UNION ALL SELECT '1', 'A' from src") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN") + + assertPartitionStats("2010-01-01", "10", rowCount = None, sizeInBytes = 2000) + assertPartitionStats("2010-01-01", "11", rowCount = None, sizeInBytes = 2000) + assert(queryStats("2010-01-02", "10") === None) + assert(queryStats("2010-01-02", "11") === None) + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS NOSCAN") + + assertPartitionStats("2010-01-01", "10", rowCount = None, sizeInBytes = 2000) + assertPartitionStats("2010-01-01", "11", rowCount = None, sizeInBytes = 2000) + assertPartitionStats("2010-01-02", "10", rowCount = None, sizeInBytes = 2000) + assertPartitionStats("2010-01-02", "11", rowCount = None, sizeInBytes = 2*2000) + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS") + + assertPartitionStats("2010-01-01", "10", rowCount = Some(500), sizeInBytes = 2000) + assertPartitionStats("2010-01-01", "11", rowCount = Some(500), sizeInBytes = 2000) + assertPartitionStats("2010-01-02", "10", rowCount = None, sizeInBytes = 2000) + assertPartitionStats("2010-01-02", "11", rowCount = None, sizeInBytes = 2*2000) + + sql(s"ANALYZE TABLE $tableName
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r131993628 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -107,6 +109,7 @@ case class CatalogTablePartition( if (parameters.nonEmpty) { map.put("Partition Parameters", s"{${parameters.map(p => p._1 + "=" + p._2).mkString(", ")}}") } +stats.foreach(s => map.put("Partition Statistics", s.simpleString)) --- End diff -- Yes, please add it. Try it and we should expose it to the external users. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875 **[Test build #80399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80399/testReport)** for PR 18875 at commit [`4cf4212`](https://github.com/apache/spark/commit/4cf42127a7c1044263a69bc7c2562f198558eb27). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18709: [SPARK-21504] [SQL] Add spark version info into t...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18709#discussion_r131995780 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -217,6 +218,7 @@ case class CatalogTable( owner: String = "", createTime: Long = System.currentTimeMillis, lastAccessTime: Long = -1, +createVersion: String = "", --- End diff -- The default will also impact the temporary view. How about just keeping it unchanged? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18421 **[Test build #80401 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80401/testReport)** for PR 18421 at commit [`64efad2`](https://github.com/apache/spark/commit/64efad20eaa5253a08cd69f89e11550fd0246187). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18709: [SPARK-21504] [SQL] Add spark version info into table me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18709 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18709: [SPARK-21504] [SQL] Add spark version info into table me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18709 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80398/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18880 **[Test build #80396 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80396/testReport)** for PR 18880 at commit [`04054d0`](https://github.com/apache/spark/commit/04054d0baa5431e3e145ef42cea24c2ae15b6911). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18798 **[Test build #80403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80403/testReport)** for PR 18798 at commit [`b02db42`](https://github.com/apache/spark/commit/b02db420132f62299900a5089fb54810ec21a3d5). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18884 Jenkins, add to white list. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18837: [Spark-20812][Mesos] Add secrets support to the d...
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/18837#discussion_r131986331 --- Diff: docs/running-on-mesos.md --- @@ -479,6 +479,35 @@ See the [configuration page](configuration.html) for information on Spark config + spark.mesos.driver.secret.envkey + (none) + +If set, the contents of the secret referenced by +spark.mesos.driver.secret.name will be written to the provided +environment variable in the driver's process. + + + +spark.mesos.driver.secret.filename + (none) + +If set, the contents of the secret referenced by +spark.mesos.driver.secret.name will be written to the provided +file. Relative paths are relative to the container's work +directory. Absolute paths must already exist. Consult the Mesos Secret +protobuf for more information. + + + + spark.mesos.driver.secret.name --- End diff -- Hey @susanxhuynh, so the way it works now is you can specify a secret as a `REFERENCE` or as a `VALUE` type by using the `spark.mesos.driver.secret.name` or `spark.mesos.driver.secret.value` configs, respectively. These secrets are then made file-based and/or env-based depending on the contents of `spark.mesos.driver.secret.filename` and `spark.mesos.driver.secret.envkey`. I allow for multiple secrets (but not multiple types) as comma-seperated lists (like Mesos URIs). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18880 **[Test build #80400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80400/testReport)** for PR 18880 at commit [`7398e29`](https://github.com/apache/spark/commit/7398e2968fba8d7132772db186f3b3f4b15e0f46). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18880 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80400/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18880 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18867: MapOutputTrackerSuite Utest
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/18867 This is by design, otherwise the test case would have failed. Could you please close this PR? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18492: [SPARK-19326] Speculated task attempts do not get launch...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/18492 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18867: MapOutputTrackerSuite Utest
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18867 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18867: MapOutputTrackerSuite Utest
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18867 This may be ok to not have a ticket, but please update the PR title to be more concrete. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18855#discussion_r131890077 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala --- @@ -1415,6 +1415,79 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE super.fetchBlockSync(host, port, execId, blockId) } } + + def testGetOrElseUpdateForLargeBlock(storageLevel : StorageLevel) { +store = makeBlockManager(6L * 1024 * 1024 * 1024, "exec1") +def mkBlobs() = { + val rng = new java.util.Random(42) + val buff = new Array[Byte](1024 * 1024) + rng.nextBytes(buff) + Iterator.fill(2 * 1024 + 1) { +buff + } +} +val res1 = store.getOrElseUpdate( + RDDBlockId(42, 0), + storageLevel, + implicitly[ClassTag[Array[Byte]]], + mkBlobs _ +) +withClue(res1) { + assert(res1.isLeft) + assert(res1.left.get.data.zipAll(mkBlobs(), null, null).forall { +case (a, b) => --- End diff -- just `a === b`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18867: MapOutputTrackerSuite Utest
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18867 **[Test build #80388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80388/testReport)** for PR 18867 at commit [`b0c58c7`](https://github.com/apache/spark/commit/b0c58c7ff4f0be063e31d9b2f3b0b8b01094c51d). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18867: MapOutputTrackerSuite Utest
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18867 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80388/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18648: [SPARK-21428] Turn IsolatedClientLoader off while using ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18648 OK to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18855#discussion_r131889556 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala --- @@ -1415,6 +1415,79 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE super.fetchBlockSync(host, port, execId, blockId) } } + + def testGetOrElseUpdateForLargeBlock(storageLevel : StorageLevel) { --- End diff -- nit: `storageLevel: StorageLevel` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18855#discussion_r131890565 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala --- @@ -1415,6 +1415,79 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE super.fetchBlockSync(host, port, execId, blockId) } } + + def testGetOrElseUpdateForLargeBlock(storageLevel : StorageLevel) { +store = makeBlockManager(6L * 1024 * 1024 * 1024, "exec1") +def mkBlobs() = { + val rng = new java.util.Random(42) + val buff = new Array[Byte](1024 * 1024) + rng.nextBytes(buff) + Iterator.fill(2 * 1024 + 1) { +buff + } +} +val res1 = store.getOrElseUpdate( + RDDBlockId(42, 0), + storageLevel, + implicitly[ClassTag[Array[Byte]]], + mkBlobs _ +) +withClue(res1) { + assert(res1.isLeft) + assert(res1.left.get.data.zipAll(mkBlobs(), null, null).forall { +case (a, b) => + a != null && +b != null && +a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq + }) +} +val getResult = store.get(RDDBlockId(42, 0)) +withClue(getResult) { + assert(getResult.isDefined) + assert(getResult.get.data.zipAll(mkBlobs(), null, null).forall { +case (a, b) => + a != null && +b != null && +a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq + }) +} +val getBlockRes = store.getBlockData(RDDBlockId(42, 0)) +withClue(getBlockRes) { + try { +assert(getBlockRes.size() >= 2 * 1024 * 1024 * 1024) +Utils.tryWithResource(getBlockRes.createInputStream()) { inpStrm => + val iter = store +.serializerManager +.dataDeserializeStream(RDDBlockId(42, 0) + , inpStrm)(implicitly[ClassTag[Array[Byte]]]) + assert(iter.zipAll(mkBlobs(), null, null).forall { +case (a, b) => + a != null && +b != null && +a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq + }) +} + } finally { +getBlockRes.release() + } +} + } + + test("getOrElseUpdate > 2gb, storage level = disk only") { --- End diff -- oh we already have, then why we have these tests? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18880 **[Test build #80385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80385/testReport)** for PR 18880 at commit [`96ca90e`](https://github.com/apache/spark/commit/96ca90e775da3ee5a762cc5b03b0fcf13bee5363). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18820 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80382/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18820 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18820 **[Test build #80382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80382/testReport)** for PR 18820 at commit [`a09d3e9`](https://github.com/apache/spark/commit/a09d3e987dda4fe5b97a13e76cf9855f346b3eb8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18797: [SPARK-21523][ML] update breeze to 0.13.2 for an emergen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18797 **[Test build #3884 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3884/testReport)** for PR 18797 at commit [`5063758`](https://github.com/apache/spark/commit/5063758c8b1903000e3718d8085b6ef1af3b37f3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18867: MapOutputTrackerSuite Utest
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/18867 This looks like a valid fix, cc @cloud-fan --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18867: MapOutputTrackerSuite Utest
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18867 ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18882: [SPARK-21652][SQL] Filter out meaningless constra...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/18882 [SPARK-21652][SQL] Filter out meaningless constraints inferred in inferAdditionalConstraints ## What changes were proposed in this pull request? This pr added code to filter out meaningless constraints inferred in `inferAdditionalConstraints` (e.g., given constraint `a = 1`, `b = 1`, `a = c`, and `b = c`, we inferred `a = b` and this predicate was trivially true). These constraints possibly cause some `Optimizer` overhead and, for example; ``` scala> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1") scala> Seq(1, 2).toDF("col").write.saveAsTable("t2") scala> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col").explain(true) ``` In this query, `InferFiltersFromConstraints` infers a new constraint '(col2#33 = col1#32)' that is appended to the join condition, then `PushPredicateThroughJoin` pushes it down, `ConstantPropagation` replaces '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, `ConstantFolding` replaces '1 = 1' with 'true and `BooleanSimplification` finally removes this predicate. However, `InferFiltersFromConstraints` will again infer '(col2#33 = col1#32)' on the next iteration and the process will continue until the limit of iterations is reached. See below for more details ``` === Applying Rule org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints === !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34))) :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) : +- Relation[col1#32,col2#33] parquet : +- Relation[col1#32,col2#33] parquet +- Filter ((1 = col#34) && isnotnull(col#34)) +- Filter ((1 = col#34) && isnotnull(col#34)) +- Relation[col#34] parquet +- Relation[col#34] parquet === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin === !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = col#34))) Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) :- Filter (col2#33 = col1#32) !: +- Relation[col1#32,col2#33] parquet : +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) !+- Filter ((1 = col#34) && isnotnull(col#34)) : +- Relation[col1#32,col2#33] parquet ! +- Relation[col#34] parquet +- Filter ((1 = col#34) && isnotnull(col#34)) ! +- Relation[col#34] parquet === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters === Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) !:- Filter (col2#33 = col1#32) :- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32)) !: +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) : +- Relation[col1#32,col2#33] parquet !: +- Relation[col1#32,col2#33] parquet +- Filter ((1 = col#34) && isnotnull(col#34)) !+- Filter ((1 = col#34) && isnotnull(col#34)) +- Relation[col#34] parquet ! +- Relation[col#34] parquet === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation === Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32)) :- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (1 = 1)) : +- Relation[col1#32,col2#33] parquet
[GitHub] spark pull request #18648: [SPARK-21428] Turn IsolatedClientLoader off while...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18648#discussion_r131886506 --- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveCliSessionStateSuite.scala --- @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.thriftserver + +import org.apache.hadoop.hive.cli.CliSessionState +import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.ql.session.SessionState + +import org.apache.spark.{SparkConf, SparkFunSuite} +import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.sql.hive.HiveUtils + +class HiveCliSessionStateSuite extends SparkFunSuite { + + test("CliSessionState will be reused") { +val hiveConf = new HiveConf(classOf[SessionState]) +HiveUtils.newTemporaryConfiguration(useInMemoryDerby = false).foreach { + case (key, value) => hiveConf.set(key, value) +} +val sessionState: SessionState = new CliSessionState(hiveConf) +SessionState.start(sessionState) +val s1 = SessionState.get +val sparkConf = new SparkConf() +val hadoopConf = SparkHadoopUtil.get.newConfiguration(sparkConf) +val s2 = HiveUtils.newClientForMetadata(sparkConf, hadoopConf).getState --- End diff -- how about `HiveUtils.newClientForMetadata(sparkConf, hadoopConf).asInstanceOf[HiveClientImpl].state`? then we don't need to add `getState` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18882 **[Test build #80389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80389/testReport)** for PR 18882 at commit [`d253e40`](https://github.com/apache/spark/commit/d253e40788b9e3408c106eff0ba84ae97d715cbb). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18865#discussion_r131887972 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala --- @@ -114,7 +114,16 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister { } (file: PartitionedFile) => { - val parser = new JacksonParser(actualSchema, parsedOptions) + // SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`, --- End diff -- > The column _corrupted_record is different when the selected columns are different If `_corrupted_record` is designed to have different values for different selected columns, it may makes sense to set `_corrupted_record` to null if no columns are selected. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18855#discussion_r131889407 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala --- @@ -165,6 +147,62 @@ private[spark] class DiskStore( } +private class DiskBlockData( +conf: SparkConf, --- End diff -- we can pass in `minMemoryMapBytes` directly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18855#discussion_r131890389 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala --- @@ -1415,6 +1415,79 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE super.fetchBlockSync(host, port, execId, blockId) } } + + def testGetOrElseUpdateForLargeBlock(storageLevel : StorageLevel) { +store = makeBlockManager(6L * 1024 * 1024 * 1024, "exec1") +def mkBlobs() = { + val rng = new java.util.Random(42) + val buff = new Array[Byte](1024 * 1024) + rng.nextBytes(buff) + Iterator.fill(2 * 1024 + 1) { +buff + } +} +val res1 = store.getOrElseUpdate( + RDDBlockId(42, 0), + storageLevel, + implicitly[ClassTag[Array[Byte]]], + mkBlobs _ +) +withClue(res1) { + assert(res1.isLeft) + assert(res1.left.get.data.zipAll(mkBlobs(), null, null).forall { +case (a, b) => + a != null && +b != null && +a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq + }) +} +val getResult = store.get(RDDBlockId(42, 0)) +withClue(getResult) { + assert(getResult.isDefined) + assert(getResult.get.data.zipAll(mkBlobs(), null, null).forall { +case (a, b) => + a != null && +b != null && +a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq + }) +} +val getBlockRes = store.getBlockData(RDDBlockId(42, 0)) +withClue(getBlockRes) { + try { +assert(getBlockRes.size() >= 2 * 1024 * 1024 * 1024) +Utils.tryWithResource(getBlockRes.createInputStream()) { inpStrm => + val iter = store +.serializerManager +.dataDeserializeStream(RDDBlockId(42, 0) + , inpStrm)(implicitly[ClassTag[Array[Byte]]]) + assert(iter.zipAll(mkBlobs(), null, null).forall { +case (a, b) => + a != null && +b != null && +a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq + }) +} + } finally { +getBlockRes.release() + } +} + } + + test("getOrElseUpdate > 2gb, storage level = disk only") { --- End diff -- shall we just write a test in `DiskStoreSuite`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18814: [SPARK-21608][SPARK-9221][SQL] Window rangeBetwee...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/18814#discussion_r131876058 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala --- @@ -89,7 +89,11 @@ case class WindowSpecDefinition( elements.mkString("(", " ", ")") } - private def isValidFrameType(ft: DataType): Boolean = orderSpec.head.dataType == ft + private def isValidFrameType(ft: DataType): Boolean = (orderSpec.head.dataType, ft) match { +case (DateType, IntegerType) => true --- End diff -- Yea, we should allow this, but this is a bit corner case and not very straight-forward to implement, maybe we can leave this as a follow-up? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18814: [SPARK-21608][SPARK-9221][SQL] Window rangeBetween() API...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18814 **[Test build #80386 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80386/testReport)** for PR 18814 at commit [`3e3b58c`](https://github.com/apache/spark/commit/3e3b58c8911e266b2af985da3dec53418a608d2b). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18801: SPARK-10878 Fix race condition when multiple clients res...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/18801 Your test case don't reflect the change you made to support multiple clients resolves artifacts at the same time, could you add a new test case or manually test that? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18855#discussion_r131890999 --- Diff: core/src/test/scala/org/apache/spark/storage/DiskStoreSuite.scala --- @@ -92,6 +92,31 @@ class DiskStoreSuite extends SparkFunSuite { assert(diskStore.getSize(blockId) === 0L) } + test("blocks larger than 2gb") { +val conf = new SparkConf() +val diskBlockManager = new DiskBlockManager(conf, deleteFilesOnStop = true) +val diskStore = new DiskStore(conf, diskBlockManager, new SecurityManager(conf)) + +val mb = 1024 * 1024 +val gb = 1024L * mb + +val blockId = BlockId("rdd_1_2") +diskStore.put(blockId) { chan => + val arr = new Array[Byte](mb) + for { +_ <- 0 until 2048 + } { +val buf = ByteBuffer.wrap(arr) +while (buf.hasRemaining()) { + chan.write(buf) +} + } +} + +val blockData = diskStore.getBytes(blockId) +assert(blockData.size == 2 * gb) --- End diff -- test with 3gb to be more explicit that it's larger than 2gb? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18878: fix issue with spark-submit with java
Github user srowen commented on the issue: https://github.com/apache/spark/pull/18878 It works fine for me in, for example: ``` spark2-submit --class "org.apache.spark.examples.SparkPi" --deploy-mode client --master yarn /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_2.11-2.2.0.cloudera1.jar ``` The thing is the quotes never even make it to the submit script, right? bash interprets them. That's why I can't see how this would be an issue. What's your exact command line? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/18692 @gatorsmile I updated the rule to cover cross join cases. Regarding the case with the redundant condition mentioned by you, I opened [SPARK-21652](https://issues.apache.org/jira/browse/SPARK-21652). It is an existing issue and is not caused by the proposed rule. BTW, I can try to fix it once we agree on a solution. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18544 @gatorsmile Some test cases have been added. Thanks for reviewing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18818#discussion_r131913185 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -453,6 +453,14 @@ case class Or(left: Expression, right: Expression) extends BinaryOperator with P abstract class BinaryComparison extends BinaryOperator with Predicate { + override def inputType: AbstractDataType = AnyDataType --- End diff -- We have to define `inputType` because it extends `BinaryOperator`. Previously the `LessThan`-like operators defined `inputType` was a subset of what they could actually support. This PR fixes that, but since the supported types can not be finitely specified as a type collection (there are a countably infinite number of legal `StructType`'s), we need to give a superset of what is actually supported for `inputType` and then do the real recursive check in `checkInputDataTypes`. This is much like how the `EqualTo` and `EqualNullSafe` operators were previously implemented. In this PR we just move that logic up to `BinaryComparison` as it's really the same for equality and inequality operators. Did that answer your concerns? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18544 **[Test build #80393 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80393/testReport)** for PR 18544 at commit [`9fa7078`](https://github.com/apache/spark/commit/9fa70788aa78bb23a10a04a6c99953d2050bab79). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes f...
Github user eyalfa commented on a diff in the pull request: https://github.com/apache/spark/pull/18855#discussion_r131913940 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala --- @@ -1415,6 +1415,79 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE super.fetchBlockSync(host, port, execId, blockId) } } + + def testGetOrElseUpdateForLargeBlock(storageLevel : StorageLevel) { +store = makeBlockManager(6L * 1024 * 1024 * 1024, "exec1") +def mkBlobs() = { + val rng = new java.util.Random(42) + val buff = new Array[Byte](1024 * 1024) + rng.nextBytes(buff) + Iterator.fill(2 * 1024 + 1) { +buff + } +} +val res1 = store.getOrElseUpdate( + RDDBlockId(42, 0), + storageLevel, + implicitly[ClassTag[Array[Byte]]], + mkBlobs _ +) +withClue(res1) { + assert(res1.isLeft) + assert(res1.left.get.data.zipAll(mkBlobs(), null, null).forall { +case (a, b) => + a != null && +b != null && +a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq + }) +} +val getResult = store.get(RDDBlockId(42, 0)) +withClue(getResult) { + assert(getResult.isDefined) + assert(getResult.get.data.zipAll(mkBlobs(), null, null).forall { +case (a, b) => + a != null && +b != null && +a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq + }) +} +val getBlockRes = store.getBlockData(RDDBlockId(42, 0)) +withClue(getBlockRes) { + try { +assert(getBlockRes.size() >= 2 * 1024 * 1024 * 1024) +Utils.tryWithResource(getBlockRes.createInputStream()) { inpStrm => + val iter = store +.serializerManager +.dataDeserializeStream(RDDBlockId(42, 0) + , inpStrm)(implicitly[ClassTag[Array[Byte]]]) + assert(iter.zipAll(mkBlobs(), null, null).forall { +case (a, b) => + a != null && +b != null && +a.asInstanceOf[Array[Byte]].seq == b.asInstanceOf[Array[Byte]].seq + }) +} + } finally { +getBlockRes.release() + } +} + } + + test("getOrElseUpdate > 2gb, storage level = disk only") { --- End diff -- these tests cover more than just the `DiskOnly` storage level, they were crafted when I had bigger ambitions of solving the entire 2GB issue :sunglasses: , that was before seeing some ~100 files pull requests being abandoned or rejected. aside, these tests also test the entire orchestration done by `BlockManager` when an `RDD` requests a cached partition, notice that these tests intentionally makes two calls to the BlockManager in order to simulate both code paths (cache-hit, cache-miss). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18883: [SPARK-21276][CORE] Update lz4-java to the latest...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/18883#discussion_r131920841 --- Diff: core/src/main/java/org/apache/spark/io/LZ4BlockInputStream.java --- @@ -1,260 +0,0 @@ -/* - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.spark.io; - -import java.io.EOFException; -import java.io.FilterInputStream; -import java.io.IOException; -import java.io.InputStream; -import java.util.zip.Checksum; - -import net.jpountz.lz4.LZ4Exception; -import net.jpountz.lz4.LZ4Factory; -import net.jpountz.lz4.LZ4FastDecompressor; -import net.jpountz.util.SafeUtils; -import net.jpountz.xxhash.XXHashFactory; - -/** - * {@link InputStream} implementation to decode data written with - * {@link net.jpountz.lz4.LZ4BlockOutputStream}. This class is not thread-safe and does not - * support {@link #mark(int)}/{@link #reset()}. - * @see net.jpountz.lz4.LZ4BlockOutputStream - * - * This is based on net.jpountz.lz4.LZ4BlockInputStream - * - * changes: https://github.com/davies/lz4-java/commit/cc1fa940ac57cc66a0b937300f805d37e2bf8411 - * - * TODO: merge this into upstream - */ -public final class LZ4BlockInputStream extends FilterInputStream { --- End diff -- Yeah I guess this needs a MiMa exclude. It's technically public but nobody should have ever referenced this directly. The codec class is separate. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18883: [SPARK-21276][CORE] Update lz4-java to the latest (v1.4....
Github user maropu commented on the issue: https://github.com/apache/spark/pull/18883 I'll update soon. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18492: [SPARK-19326] Speculated task attempts do not get launch...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18492 **[Test build #80387 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80387/testReport)** for PR 18492 at commit [`77b4729`](https://github.com/apache/spark/commit/77b47293d9655408b0dbe4d5896492b366575b3a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes fails fo...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18855 ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18880: [SPARK-21665][Core]Need to close resources after ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/18880#discussion_r131907109 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -207,9 +207,16 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S uriScheme match { case "file" => try { -val jar = new JarFile(uri.getPath) -// Note that this might still return null if no main-class is set; we catch that later -mainClass = jar.getManifest.getMainAttributes.getValue("Main-Class") +var jar: JarFile = null +Utils.tryWithSafeFinally { --- End diff -- Ah pardon, I meant `Utils.tryWithResource`. This one doesn't help close the resource. I don't know which one is actually clearer; up to you. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org