[GitHub] spark pull request #15513: [SPARK-17963][SQL][Documentation] Add examples (e...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15513#discussion_r84590609 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CallMethodViaReflection.scala ---
@@ -43,11 +43,20 @@ import org.apache.spark.util.Utils
 * and the second element should be a literal string for the method name,
 * and the remaining are input arguments to the Java method.
 */
-// scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "_FUNC_(class,method[,arg1[,arg2..]]) calls method with reflection",
-  extended = "> SELECT _FUNC_('java.util.UUID', 'randomUUID');\n c33fb387-8500-4bfa-81d2-6e0e3e930df2")
-// scalastyle:on line.size.limit
+  usage = "_FUNC_(class, method[, arg1[, arg2 ..]]) - Calls method with reflection.",
+  extended = """
+    Arguments:
+      class - a string literal that represents a fully-qualified class name.
+      method - a string literal that represents a method name.
+      arg - a string literal that represents arguments for the method.
--- End diff --
Oh, it seems `arg` is not. Let me try to find such cases here.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
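For context, the function being documented can be exercised from Spark SQL roughly like this (an illustrative sketch only; it assumes an already-created `SparkSession` named `spark`, and uses `reflect`, the SQL name registered for `CallMethodViaReflection`):

```scala
// Sketch: invoking the reflection function whose docs are edited above.
// Assumes `spark` is an existing SparkSession.
val df = spark.sql("SELECT reflect('java.util.UUID', 'randomUUID')")
df.show(truncate = false)
```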
[GitHub] spark pull request #15513: [SPARK-17963][SQL][Documentation] Add examples (e...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15513#discussion_r84590562 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala ---
@@ -125,7 +129,7 @@ case class DescribeFunctionCommand(
       if (isExtended) {
         result :+
-          Row(s"Extended Usage:\n${replaceFunctionName(info.getExtended, info.getName)}")
+          Row(s"Extended Usage:${replaceFunctionName(info.getExtended, info.getName)}")
--- End diff --
I don't think stripMargin works in annotations (at least in one version of the Scala we support, perhaps 2.10).
[GitHub] spark pull request #15595: [SPARK-18058][SQL] Comparing column types ignorin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15595#discussion_r84590334 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala ---
@@ -377,4 +377,14 @@ class AnalysisSuite extends AnalysisTest {
     assertExpressionType(sum(Divide(Decimal(1), 2.0)), DoubleType)
     assertExpressionType(sum(Divide(1.0, Decimal(2.0))), DoubleType)
   }
+
+  test("SPARK-18058: union operations shall not care about the nullability of columns") {
--- End diff --
+1 (actually, it'd be nicer if it had both a unit test and an end-to-end test).
[GitHub] spark issue #15219: [SPARK-14098][SQL] Generate Java code to build CachedCol...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/15219 @davies @rxin, would it be possible to review this?
[GitHub] spark issue #15588: [SPARK-18039][Scheduler] fix bug maxRegisteredWaitingTim...
Github user Astralidea commented on the issue: https://github.com/apache/spark/pull/15588 @lw-lin Thanks for your reply. Running Spark in my private cluster is a little different (I start the driver & executors myself). I have tried maxRegisteredWaitingTime, but I have not tried minRegisteredResourcesRatio; I assumed minRegisteredResourcesRatio would not work if maxRegisteredWaitingTime doesn't. Maybe it works, though. I will try spark.scheduler.minRegisteredResourcesRatio tomorrow.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15575 LGTM, since the scope of this PR is just refactoring. Let me first post the existing code for `outputPartitioning` in `ExpandExec`:
```Scala
// The GroupExpressions can output data with arbitrary partitioning, so set it
// as UNKNOWN partitioning
override def outputPartitioning: Partitioning = UnknownPartitioning(0)
```
It makes sense to set it to either `UnknownPartitioning` or `child.outputPartitioning`. However, the above code sets a wrong number of partitions. We need to correct it regardless of whether we use the number or not.
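A minimal sketch of the kind of correction being suggested, assuming one keeps `UnknownPartitioning` but propagates the child's partition count (hypothetical, not the actual patch):

```scala
// Hypothetical fix sketch: keep UnknownPartitioning but report the child's
// number of partitions instead of the hard-coded 0.
override def outputPartitioning: Partitioning =
  UnknownPartitioning(child.outputPartitioning.numPartitions)
```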
[GitHub] spark pull request #15361: [SPARK-17765][SQL] Support for writing out user-d...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15361#discussion_r84590191 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -91,6 +91,16 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
     }
   }
+  test("Read/write UserDefinedType") {
+    withTempPath { path =>
+      val data = Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25))))
+      val udtDF = data.toDF("id", "vectors")
+      udtDF.write.orc(path.getAbsolutePath)
+      val readBack = spark.read.schema(udtDF.schema).orc(path.getAbsolutePath)
--- End diff --
It seems fine for reading because it refers to the schema from ORC (detecting the fields via field names).
[GitHub] spark pull request #15361: [SPARK-17765][SQL] Support for writing out user-d...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15361#discussion_r84590147 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala ---
@@ -246,6 +246,9 @@ private[hive] trait HiveInspectors {
   * Wraps with Hive types based on object inspector.
   */
  protected def wrapperFor(oi: ObjectInspector, dataType: DataType): Any => Any = oi match {
+    case _ if dataType.isInstanceOf[UserDefinedType[_]] =>
--- End diff --
> This codepath is shared by many things apart from ORC. Won't those be affected?

It seems this path is being used in `hiveUDFs.scala` and `hiveWriterContainers.scala`. Actually, it'd be fine for a value converter for a UDT to use the converter for the equivalent type (its inner SQL type). It is a common pattern for other data sources as well.

> I would put this case at the very end. The reason being UserDefinedType are not that common compared to other types (esp. primitive types). So putting it below in the switch case will be better for perf.

Cool :)
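The "UDT delegates to its inner SQL type" pattern described here might look roughly like the following sketch (an assumption about the shape of the change, not the actual patch; `wrapperFor`, `ObjectInspector`, and `UserDefinedType` are the Spark/Hive types named in the diff, but the body below is illustrative):

```scala
// Sketch: a UDT value is first serialized to its underlying SQL type,
// then handed to the wrapper for that inner type. Illustrative only.
case _ if dataType.isInstanceOf[UserDefinedType[_]] =>
  val udt = dataType.asInstanceOf[UserDefinedType[Any]]
  val innerWrapper = wrapperFor(oi, udt.sqlType)
  (value: Any) => innerWrapper(udt.serialize(value))
```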
[GitHub] spark issue #15441: [SPARK-4411] [Web UI] Add "kill" link for jobs in the UI
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/15441 @srowen I addressed most of your comments except the one about the try-finally I commented on above
[GitHub] spark issue #15441: [SPARK-4411] [Web UI] Add "kill" link for jobs in the UI
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15441 **[Test build #67406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67406/consoleFull)** for PR 15441 at commit [`b1e77ba`](https://github.com/apache/spark/commit/b1e77baaff2bae12e745d623ea27e7cb2ad5e2be).
[GitHub] spark issue #15588: [SPARK-18039][Scheduler] fix bug maxRegisteredWaitingTim...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/15588 Spark Streaming would run a very simple dummy job to ensure that all slaves have registered before scheduling the `Receiver`s; please see https://github.com/apache/spark/blob/v2.0.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L436-L447. @Astralidea, `spark.scheduler.minRegisteredResourcesRatio` is the minimum ratio of registered resources to wait for before the dummy job begins. In our private clusters, configuring it to `0.9` or even `1.0` helps a lot to balance our 100+ `Receiver`s. Maybe you could also give it a try.
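The two registration gates mentioned in this thread can be set together; a hedged example (the config keys are real Spark settings, the values are illustrative):

```scala
import org.apache.spark.SparkConf

// Wait until 90% of requested resources have registered, but give up
// after 60 seconds, whichever comes first. Values are illustrative.
val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "0.9")
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s")
```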
[GitHub] spark pull request #15441: [SPARK-4411] [Web UI] Add "kill" link for jobs in...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/15441#discussion_r84589829 --- Diff: core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala ---
@@ -651,6 +671,15 @@ class UISeleniumSuite extends SparkFunSuite with WebBrowser with Matchers with B
     }
   }
+  def getResponseCode(url: URL, method: String): Int = {
+    val connection = url.openConnection().asInstanceOf[HttpURLConnection]
+    connection.setRequestMethod(method)
+    connection.connect()
+    val code = connection.getResponseCode()
+    connection.disconnect()
--- End diff --
It might just be because it's late and I'm tired, but I'm not quite sure where you think the try-finally should be.
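One plausible placement of the try-finally being discussed: open the connection outside the `try`, then guarantee `disconnect()` runs even if the request throws (a sketch, not necessarily what the reviewer intended):

```scala
import java.net.{HttpURLConnection, URL}

// Sketch: disconnect() runs whether or not the request succeeds.
def getResponseCode(url: URL, method: String): Int = {
  val connection = url.openConnection().asInstanceOf[HttpURLConnection]
  try {
    connection.setRequestMethod(method)
    connection.connect()
    connection.getResponseCode()
  } finally {
    connection.disconnect()
  }
}
```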
[GitHub] spark issue #15484: [SPARK-17868][SQL] Do not use bitmasks during parsing an...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/15484 @tejasapatil @rxin I've addressed most of your comments, thanks for reviewing this!
[GitHub] spark pull request #15484: [SPARK-17868][SQL] Do not use bitmasks during par...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/15484#discussion_r84589691 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -216,10 +216,16 @@ class Analyzer(
      * Group Count: N + 1 (N is the number of group expressions)
      *
      * We need to get all of its subsets for the rule described above, the subset is
-     * represented as the bit masks.
+     * represented as sequence of expressions.
      */
-    def bitmasks(r: Rollup): Seq[Int] = {
-      Seq.tabulate(r.groupByExprs.length + 1)(idx => (1 << idx) - 1)
+    def rollupExprs(exprs: Seq[Expression]): Seq[Seq[Expression]] = {
+      val buffer = ArrayBuffer.empty[Seq[Expression]]
--- End diff --
The use of `ArrayBuffer` makes this piece of code more concise, and since `exprs` is not usually very long, performance is probably not the major concern here. I'd prefer to keep this one, is it OK? @hvanhovell
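Since the rollup grouping sets are exactly the prefixes of the group-by expressions (the old bitmasks `(1 << idx) - 1` encode precisely those prefixes), the same result can also be produced without a mutable buffer; a small generic sketch:

```scala
// Sketch: rollup grouping sets are the prefixes of the input sequence,
// from the empty prefix (grand total) up to the full group-by list.
def rollupExprs[A](exprs: Seq[A]): Seq[Seq[A]] =
  exprs.inits.toSeq.reverse
```

For example, `rollupExprs(Seq("a", "b"))` yields `Seq(Seq(), Seq("a"), Seq("a", "b"))`.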
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15582 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67403/ Test PASSed.
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15582 Merged build finished. Test PASSed.
[GitHub] spark issue #15484: [SPARK-17868][SQL] Do not use bitmasks during parsing an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15484 **[Test build #67405 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67405/consoleFull)** for PR 15484 at commit [`a47cc68`](https://github.com/apache/spark/commit/a47cc687d9606d8a22d0de9d9c9762fef44f897d).
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15582 **[Test build #67403 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67403/consoleFull)** for PR 15582 at commit [`5acbd6c`](https://github.com/apache/spark/commit/5acbd6ce3a1d8becc84c4e53b7f175b13bb8b7bf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13705: [SPARK-15472][SQL] Add support for writing in `csv` form...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13705 closing this in favor of SPARK-17924
[GitHub] spark pull request #13705: [SPARK-15472][SQL] Add support for writing in `cs...
Github user lw-lin closed the pull request at: https://github.com/apache/spark/pull/13705
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15582 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67404/ Test PASSed.
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15582 Merged build finished. Test PASSed.
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15582 **[Test build #67404 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67404/consoleFull)** for PR 15582 at commit [`3066efc`](https://github.com/apache/spark/commit/3066efc6b54111e0ec69dcd6110f32b8e7f56dbf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15573: [SPARK-18035] [SQL] Introduce performant and memo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15573
[GitHub] spark issue #15573: [SPARK-18035] [SQL] Introduce performant and memory effi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15573 Merging to master! Thanks!
[GitHub] spark issue #15573: [SPARK-18035] [SQL] Introduce performant and memory effi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15573 Merged build finished. Test PASSed.
[GitHub] spark issue #15573: [SPARK-18035] [SQL] Introduce performant and memory effi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15573 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67402/ Test PASSed.
[GitHub] spark issue #15573: [SPARK-18035] [SQL] Introduce performant and memory effi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15573 **[Test build #67402 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67402/consoleFull)** for PR 15573 at commit [`b263278`](https://github.com/apache/spark/commit/b263278573adc00fcc3f9fc72604b573936a5516).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84589297 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RandomProjectionSuite.scala --- @@ -0,0 +1,148 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import breeze.numerics.{cos, sin} +import breeze.numerics.constants.Pi + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext + +class RandomProjectionSuite extends SparkFunSuite with MLlibTestSparkContext { + test("RandomProjection") { +val data = { + for (i <- -5 until 5; j <- -5 until 5) yield Vectors.dense(i.toDouble, j.toDouble) +} +val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys") + +// Project from 2 dimensional Euclidean Space to 1 dimensions +val rp = new RandomProjection() + .setOutputDim(1) + .setInputCol("keys") + .setOutputCol("values") + .setBucketLength(1.0) + .setSeed(12345) + +val (falsePositive, falseNegative) = LSHTest.calculateLSHProperty(df, rp, 8.0, 2.0) +assert(falsePositive < 0.05) +assert(falseNegative < 0.06) + } + + test("RandomProjection with high dimension data") { +val numDim = 100 +val data = { + for (i <- 0 until numDim; j <- Seq(-2, -1, 1, 2)) +yield Vectors.sparse(numDim, Seq((i, j.toDouble))) +} +val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys") + +// Project from 100 dimensional Euclidean Space to 10 dimensions +val rp = new RandomProjection() + .setOutputDim(10) + .setInputCol("keys") + .setOutputCol("values") + .setBucketLength(2.5) + .setSeed(12345) + +val (falsePositive, falseNegative) = LSHTest.calculateLSHProperty(df, rp, 3.0, 2.0) +assert(falsePositive == 0.0) +assert(falseNegative < 0.05) + } + + test("approxNearestNeighbors for random projection") { +val data = { + for (i <- -10 until 10; j <- -10 until 10) yield Vectors.dense(i.toDouble, j.toDouble) +} +val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys") +val key = Vectors.dense(1.2, 3.4) + +val rp = new RandomProjection() + .setOutputDim(2) + .setInputCol("keys") + .setOutputCol("values") + .setBucketLength(4.0) + .setSeed(12345) + +val (precision, recall) = 
LSHTest.calculateApproxNearestNeighbors(rp, df, key, 100, + singleProbing = true) +assert(precision >= 0.6) +assert(recall >= 0.6) + } + + test("approxNearestNeighbors with multiple probing") { --- End diff -- If the goal here is to ensure multiple probing is a strict improvement, then I'd combine the unit tests to ensure that the data and Param settings remain the same. I see the Params are already different, but perhaps they should be made identical.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84588285

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala ---
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.ml.param.shared.HasSeed
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * :: Experimental ::
+ * Model produced by [[MinHash]]
+ * @param hashFunctions An array of hash functions, mapping elements to their hash values.
+ */
+@Experimental
+@Since("2.1.0")
+class MinHashModel private[ml] (override val uid: String, hashFunctions: Array[Int => Long])
+  extends LSHModel[MinHashModel] {
+
+  @Since("2.1.0")
+  override protected[this] val hashFunction: Vector => Vector = {
+    elems: Vector =>
+      require(elems.numNonzeros > 0, "Must have at least 1 non-zero entry.")
+      val elemsList = elems.toSparse.indices.toList
+      Vectors.dense(hashFunctions.map(func => elemsList.map(func).min.toDouble))
+  }
+
+  @Since("2.1.0")
+  override protected[ml] def keyDistance(x: Vector, y: Vector): Double = {
+    val xSet = x.toSparse.indices.toSet
+    val ySet = y.toSparse.indices.toSet
+    val intersectionSize = xSet.intersect(ySet).size.toDouble
+    val unionSize = xSet.size + ySet.size - intersectionSize
+    assert(unionSize > 0, "The union of two input sets must have at least 1 element")
+    1 - intersectionSize / unionSize
+  }
+
+  @Since("2.1.0")
+  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+    // Since it's generated by hashing, it will be a pair of dense vectors.
+    x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
+  }
+}
+
+/**
+ * :: Experimental ::
+ * LSH class for Jaccard distance.
+ *

--- End diff --

Could you please link to Wikipedia? That tends to be useful: [https://en.wikipedia.org/wiki/MinHash]
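The `hashFunction` and `keyDistance` under review implement classic MinHash over Jaccard distance. The idea can be sketched outside Spark; below is a minimal Python illustration (not the Spark API — the affine hash form `(a*x + b) mod p` and the helper names are assumptions for the sketch):

```python
import random

PRIME = 2038074743  # an arbitrary large prime used as the hash modulus


def make_hash_functions(num_hashes, seed=0):
    """Generate (a, b) pairs for random affine hashes h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]


def minhash_signature(indices, hash_functions):
    """Signature entry i = minimum of hash function i over the set's element indices,
    mirroring `elemsList.map(func).min` in the PR."""
    return [min((a * i + b) % PRIME for i in indices) for (a, b) in hash_functions]


def jaccard_distance(x, y):
    """keyDistance in the PR: 1 - |x intersect y| / |x union y|."""
    x, y = set(x), set(y)
    union = len(x) + len(y) - len(x & y)
    assert union > 0, "The union of two input sets must have at least 1 element"
    return 1.0 - len(x & y) / union
```

The LSH property being tested is that the probability two sets share a min-hash value equals their Jaccard similarity, so close sets collide often.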
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84588545

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashSuite.scala ---
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+
+class MinHashSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test("MinHash") {

--- End diff --

Name the test more specifically: "MinHash: test of LSH property"
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84589114

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashSuite.scala ---
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+
+class MinHashSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test("MinHash") {
+    val data = {
+      for (i <- 0 to 95) yield Vectors.sparse(100, (i until i + 5).map((_, 1.0)))

--- End diff --

If you're reusing data across tests, then I'd put it in a class member val. See example: [https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala#L40]
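The reviewer's suggestion — hoist data shared across tests into a class-level member instead of rebuilding it in each test body — can be sketched outside ScalaTest. A hypothetical Python `unittest` analogue (the suite name and assertions are illustrative, not part of the PR):

```python
import unittest


class MinHashStyleSuite(unittest.TestCase):
    """Illustrates the review suggestion: shared fixture data lives on the
    class, built once, rather than inside each test method."""

    @classmethod
    def setUpClass(cls):
        # Analogue of `val data = for (i <- 0 to 95) yield ...` as a class member:
        # 96 overlapping 5-element sets over a universe of 100 indices.
        cls.data = [set(range(i, i + 5)) for i in range(96)]

    def test_data_shape(self):
        self.assertEqual(len(self.data), 96)

    def test_each_set_has_five_elements(self):
        self.assertTrue(all(len(s) == 5 for s in self.data))
```

This keeps the fixture identical across tests, which also addresses the earlier comment about making Params and data the same when comparing single and multiple probing.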
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15582

It'd be great to move those as well!
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575

In practice, setting the `outputPartitioning` of a physical plan like `ExpandExec` to `child.outputPartitioning` doesn't cause any real problem, even though the physical plan doesn't preserve the row distribution of its child. That is because if the physical plan changes its output, it will have different output attributes, e.g., `col` becomes `col'` as @tejasapatil pointed out. If its parent plan requires a distribution, say `HashPartition`, that distribution will be bound to the physical plan's output `col'`, not to its child plan's `col`. So even though the physical plan uses `child.outputPartitioning`, `EnsureRequirements` will step in and inject an extra shuffle exchange of `HashPartition(col')` to satisfy the requirement. That is how it works, as per my understanding. However, it doesn't mean the physical plan's output partitioning is exactly the same as its child's, i.e., `HashPartition(col)`, because it doesn't have the output `col`. This part might be confusing to some people, so I think it would be better to explain it more. That is my understanding; if I am wrong, please kindly point it out.
[GitHub] spark pull request #15588: [SPARK-18039][Scheduler] fix bug maxRegisteredWai...
Github user Astralidea commented on a diff in the pull request: https://github.com/apache/spark/pull/15588#discussion_r84588811

--- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala ---
@@ -440,7 +430,10 @@ class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false
       rcvr
     }

-    runDummySparkJob()
+    while ((System.currentTimeMillis() - createTime) < maxRegisteredWaitingTimeMs) {}

--- End diff --

You're right, but I think it only wastes a little time. How can I write this more gracefully? I'd like to improve it but don't know how.
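The diff above busy-waits in a tight loop, which burns a CPU core for the whole waiting period. One common, more graceful pattern is to sleep between checks of a readiness condition. A minimal sketch (in Python rather than Scala; `predicate` is a hypothetical readiness check, not anything from `ReceiverTracker`):

```python
import time


def wait_until(predicate, timeout_s, poll_interval_s=0.05):
    """Wait up to `timeout_s` seconds for `predicate()` to become true,
    sleeping between checks instead of spinning. Returns the final
    predicate value, so callers can distinguish success from timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval_s)
    return predicate()  # one last check at the deadline
```

In JVM code the equivalent tools would be `Thread.sleep` in the loop, or better, a latch/condition variable signalled when executors register, which avoids polling entirely.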
[GitHub] spark issue #15588: [SPARK-18039][Scheduler] fix bug maxRegisteredWaitingTim...
Github user Astralidea commented on the issue: https://github.com/apache/spark/pull/15588

@srowen But in my cluster I tested this 10 times: 9 succeeded and 1 failed. Why is it not necessary? Receiver scheduling balance affects performance: if a new executor is registered with the driver too late, the receivers won't be scheduled again. Or is there another solution?
[GitHub] spark issue #15354: [SPARK-17764][SQL] Add `to_json` supporting to convert n...
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/15354

It would be really nice to fail in analysis rather than execution. What if it only fails after hours of computation? As a user I'd be upset. I'm also concerned they will think it's a Spark bug.
[GitHub] spark pull request #15361: [SPARK-17765][SQL] Support for writing out user-d...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15361#discussion_r84588700

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -91,6 +91,16 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
     }
   }

+  test("Read/write UserDefinedType") {
+    withTempPath { path =>
+      val data = Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25))))
+      val udtDF = data.toDF("id", "vectors")
+      udtDF.write.orc(path.getAbsolutePath)
+      val readBack = spark.read.schema(udtDF.schema).orc(path.getAbsolutePath)

--- End diff --

Curious: how does this work? You added support for `UserDefinedType` on the `wrapper` side, but on the `unwrapper` side I don't see `UserDefinedType` being handled.
[GitHub] spark pull request #15361: [SPARK-17765][SQL] Support for writing out user-d...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15361#discussion_r84588678

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala ---
@@ -246,6 +246,9 @@ private[hive] trait HiveInspectors {
    * Wraps with Hive types based on object inspector.
    */
  protected def wrapperFor(oi: ObjectInspector, dataType: DataType): Any => Any = oi match {
+    case _ if dataType.isInstanceOf[UserDefinedType[_]] =>

--- End diff --

- This codepath is shared by many things apart from ORC. Won't those be affected?
- I would put this case at the very end. `UserDefinedType`s are not as common as the other types (esp. primitive types), so putting them lower in the match will be better for perf.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575

@tejasapatil Yeah, that is correct. However, I am wondering whether we can say this `ExpandExec` has the same distribution of rows as its child, because it doesn't even have the `col`...
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15148

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67401/

Test PASSed.
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15148

Merged build finished. Test PASSed.
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15148

**[Test build #67401 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67401/consoleFull)** for PR 15148 at commit [`e14f73e`](https://github.com/apache/spark/commit/e14f73e8a49d409e09a6ed541d4b40f07dc81013).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/15582

@rxin I've moved the test cases added in this PR to a query file test. Do we need to move the other test cases for `ROLLUP/CUBE/GROUPING-SETS` too? Currently in `SQLQuerySuite` we have the following:
```
test("rollup")
test("grouping sets when aggregate functions containing groupBy columns")
test("cube")
test("grouping sets")
test("grouping and grouping_id")
test("grouping and grouping_id in having")
test("grouping and grouping_id in sort")
```
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15582

**[Test build #67404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67404/consoleFull)** for PR 15582 at commit [`3066efc`](https://github.com/apache/spark/commit/3066efc6b54111e0ec69dcd6110f32b8e7f56dbf).
[GitHub] spark issue #15463: [SPARK-17894] [CORE] Ensure uniqueness of TaskSetManager...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15463

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67399/

Test PASSed.
[GitHub] spark issue #15463: [SPARK-17894] [CORE] Ensure uniqueness of TaskSetManager...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15463

Merged build finished. Test PASSed.
[GitHub] spark issue #15463: [SPARK-17894] [CORE] Ensure uniqueness of TaskSetManager...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15463

**[Test build #67399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67399/consoleFull)** for PR 15463 at commit [`cd6d240`](https://github.com/apache/spark/commit/cd6d240c8972e843a1abf586c6d324bff8beefd5).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15595: [SPARK-18058][SQL] Comparing column types ignorin...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15595#discussion_r84588354

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala ---
@@ -377,4 +377,14 @@ class AnalysisSuite extends AnalysisTest {
     assertExpressionType(sum(Divide(Decimal(1), 2.0)), DoubleType)
     assertExpressionType(sum(Divide(1.0, Decimal(2.0))), DoubleType)
   }
+
+
+  test("SPARK-18058: union operations shall not care about the nullability of columns") {

--- End diff --

This PR also affects `SetOperation`. Could you please also add tests for that?
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15582

**[Test build #67403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67403/consoleFull)** for PR 15582 at commit [`5acbd6c`](https://github.com/apache/spark/commit/5acbd6ce3a1d8becc84c4e53b7f175b13bb8b7bf).
[GitHub] spark pull request #15595: [SPARK-18058][SQL] Comparing column types ignorin...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15595#discussion_r84588308

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala ---
@@ -377,4 +377,14 @@ class AnalysisSuite extends AnalysisTest {
     assertExpressionType(sum(Divide(Decimal(1), 2.0)), DoubleType)
     assertExpressionType(sum(Divide(1.0, Decimal(2.0))), DoubleType)
   }
+

--- End diff --

nit: delete extra newline
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575

@viirya: As per my understanding, if the child operator emits `col`, then after applying `ExpandExec` the output is `col'`. The original child partitioning is over `col`, and `ExpandExec` does not seem to alter that. The table above was to summarise the state of things before this PR; I did not change any semantics in this PR since it's pure refactoring.
[GitHub] spark issue #15573: [SPARK-18035] [SQL] Unwrapping java maps in HiveInspecto...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15573

**[Test build #67402 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67402/consoleFull)** for PR 15573 at commit [`b263278`](https://github.com/apache/spark/commit/b263278573adc00fcc3f9fc72604b573936a5516).
[GitHub] spark issue #15573: [SPARK-18035] [SQL] Unwrapping java maps in HiveInspecto...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15573

LGTM pending test
[GitHub] spark pull request #15573: [SPARK-18035] [SQL] Unwrapping java maps in HiveI...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15573#discussion_r84588059

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala ---
@@ -433,18 +413,12 @@ object CatalystTypeConverters {
     case seq: Seq[Any] => new GenericArrayData(seq.map(convertToCatalyst).toArray)
     case r: Row => InternalRow(r.toSeq.map(convertToCatalyst): _*)
     case arr: Array[Any] => new GenericArrayData(arr.map(convertToCatalyst))
-    case m: Map[_, _] =>
-      val length = m.size
-      val convertedKeys = new Array[Any](length)
-      val convertedValues = new Array[Any](length)
-
-      var i = 0
-      for ((key, value) <- m) {
-        convertedKeys(i) = convertToCatalyst(key)
-        convertedValues(i) = convertToCatalyst(value)
-        i += 1
-      }
-      ArrayBasedMapData(convertedKeys, convertedValues)
+    case map: Map[_, _] =>
+      ArrayBasedMapData(
+        map.iterator,
+        map.size,
+        (key) => convertToCatalyst(key),
+        (value) => convertToCatalyst(value))

--- End diff --

changed
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15148

**[Test build #67401 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67401/consoleFull)** for PR 15148 at commit [`e14f73e`](https://github.com/apache/spark/commit/e14f73e8a49d409e09a6ed541d4b40f07dc81013).
[GitHub] spark pull request #15573: [SPARK-18035] [SQL] Unwrapping java maps in HiveI...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15573#discussion_r84587828

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala ---
@@ -433,18 +413,12 @@ object CatalystTypeConverters {
     case seq: Seq[Any] => new GenericArrayData(seq.map(convertToCatalyst).toArray)
     case r: Row => InternalRow(r.toSeq.map(convertToCatalyst): _*)
     case arr: Array[Any] => new GenericArrayData(arr.map(convertToCatalyst))
-    case m: Map[_, _] =>
-      val length = m.size
-      val convertedKeys = new Array[Any](length)
-      val convertedValues = new Array[Any](length)
-
-      var i = 0
-      for ((key, value) <- m) {
-        convertedKeys(i) = convertToCatalyst(key)
-        convertedValues(i) = convertToCatalyst(value)
-        i += 1
-      }
-      ArrayBasedMapData(convertedKeys, convertedValues)
+    case map: Map[_, _] =>
+      ArrayBasedMapData(
+        map.iterator,
+        map.size,
+        (key) => convertToCatalyst(key),
+        (value) => convertToCatalyst(value))

--- End diff --

It just looks weird to use different apply functions in the same file. How about this?
```Scala
ArrayBasedMapData(
  map,
  (key: Any) => convertToCatalyst(key),
  (value: Any) => convertToCatalyst(value))
```
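Both variants discussed above share one shape: walk a map once, applying a key converter and a value converter, and accumulate parallel key/value arrays. A hedged Python sketch of that shape (the function name is hypothetical; it only mirrors the structure of `ArrayBasedMapData(map.iterator, map.size, keyConverter, valueConverter)`):

```python
def convert_map(m, key_converter, value_converter):
    """Build parallel key/value lists from a map, converting each side.
    A single pass, like the for-comprehension the PR replaces."""
    keys, values = [], []
    for k, v in m.items():
        keys.append(key_converter(k))
        values.append(value_converter(v))
    return keys, values
```

Keeping keys and values in index-aligned arrays is what lets the catalyst side reconstruct entry i as (keys[i], values[i]) without storing pair objects.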
[GitHub] spark issue #15595: [SPARK-18058][SQL] Comparing column types ignoring Nulla...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15595

Merged build finished. Test PASSed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547

**[Test build #67400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67400/consoleFull)** for PR 14547 at commit [`66d3396`](https://github.com/apache/spark/commit/66d33963fcba05b4303d34891635607f54e10364).

* This patch **fails MiMa tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67400/ Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test FAILed.
[GitHub] spark issue #15595: [SPARK-18058][SQL] Comparing column types ignoring Nulla...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15595 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67397/ Test PASSed.
[GitHub] spark issue #15595: [SPARK-18058][SQL] Comparing column types ignoring Nulla...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15595 **[Test build #67397 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67397/consoleFull)** for PR 15595 at commit [`e7b5a9b`](https://github.com/apache/spark/commit/e7b5a9b32328c5896e676284db1638819530b6dc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @rxin yeah, I am curious why `ExpandExec` and `GenerateExec` have different `outputPartitioning`...
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @tejasapatil I see there is a 1:1 mapping between the output partitions of the child operator and the output partitions of `ExpandExec`. For example, suppose we apply an Expand to a data set with col: [1, 2, 3], the projections are col, col + 1, col + 2, and the data set is partitioned by HashPartition(col). We have three partitions: p1: [1] p2: [2] p3: [3] After the Expand, the data set becomes: p1: [1, 2, 3] p2: [2, 3, 4] p3: [3, 4, 5] Is it still valid for HashPartition(col)? It looks like it isn't. I think that is why there is a comment on ExpandExec at the code position you link to. BTW, in your table `ExpandExec`'s `outputPartitioning` is `UnknownPartitioning`, right? If it doesn't change the child's partitioning, why don't we set it to the child's outputPartitioning?
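viirya's example can be checked with a tiny standalone sketch. This is plain Python, not Spark; `hash_partition`, `expand`, and `satisfies_hash_partitioning` are hypothetical helpers used only to illustrate why the expanded output no longer satisfies the hash-partitioning invariant, even though no rows move between partitions:

```python
def hash_partition(rows, num_partitions):
    """Assign each value to a partition by hashing it (like HashPartitioning)."""
    parts = [[] for _ in range(num_partitions)]
    for v in rows:
        parts[hash(v) % num_partitions].append(v)
    return parts

def expand(parts, projections):
    """Apply every projection to every row, within each partition (no shuffle)."""
    return [[proj(v) for v in part for proj in projections] for part in parts]

def satisfies_hash_partitioning(parts):
    """Check the invariant: every value lives in the partition its hash selects."""
    n = len(parts)
    return all(hash(v) % n == i for i, part in enumerate(parts) for v in part)

parts = hash_partition([1, 2, 3], 3)
projections = [lambda c: c, lambda c: c + 1, lambda c: c + 2]
expanded = expand(parts, projections)
# The input satisfies the hash-partitioning invariant; the expanded output
# does not, because projected values like col + 1 hash to other partitions.
```

This matches both sides of the thread: the 1:1 mapping of partitions is preserved (no shuffle happens), yet reporting `HashPartitioning(col)` downstream would be unsound, so `UnknownPartitioning` is the safe answer.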
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #67400 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67400/consoleFull)** for PR 14547 at commit [`66d3396`](https://github.com/apache/spark/commit/66d33963fcba05b4303d34891635607f54e10364).
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @viirya >> However, if its child has certain partition such as HashPartition, after ExpandExec it becomes a UnknownPartitioning The notion of `Partitioning` in Spark is the distribution of rows across tasks. Even if the child's output has `HashPartitioning`, there is a 1:1 mapping between the output partitions of the child operator and the output partitions of `ExpandExec`. So, applying `ExpandExec` does not alter the partitioning of the child's output.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 The current thing LGTM. cc @yhuai do you have any other feedback?
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84587197 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashSuite.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.linalg.{Vector, Vectors} +import org.apache.spark.mllib.util.MLlibTestSparkContext + +class MinHashSuite extends SparkFunSuite with MLlibTestSparkContext { + test("MinHash") { +val data = { + for (i <- 0 to 95) yield Vectors.sparse(100, (i until i + 5).map((_, 1.0))) +} +val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys") + +val mh = new MinHash() + .setOutputDim(1) + .setInputCol("keys") + .setOutputCol("values") + .setSeed(0) --- End diff -- Here and elsewhere --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84587191 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,340 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util.SchemaUtils +import org.apache.spark.sql._ +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[LSH]]. + */ +@Since("2.1.0") +private[ml] trait LSHParams extends HasInputCol with HasOutputCol { + /** + * Param for the dimension of LSH OR-amplification. + * + * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The + * higher the dimension is, the lower the false negative rate. 
+ * @group param + */ + @Since("2.1.0") + final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" + +"increasing dimensionality lowers the false negative rate, and decreasing dimensionality" + +" improves the running performance", ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getOutputDim: Int = $(outputDim) + + setDefault(outputDim -> 1, outputCol -> "lshFeatures") + + /** + * Transform the Schema for LSH + * @param schema The schema of the input dataset without [[outputCol]] + * @return A derived schema with [[outputCol]] added + */ + @Since("2.1.0") + protected[this] final def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT) + } +} + +/** + * Model produced by [[LSH]]. + */ +@Since("2.1.0") +private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams { + self: T => + + @Since("2.1.0") + override def copy(extra: ParamMap): T = defaultCopy(extra) + + /** + * The hash function of LSH, mapping a predefined KeyType to a Vector + * @return The mapping of LSH function. + */ + @Since("2.1.0") + protected[this] val hashFunction: Vector => Vector + + /** + * Calculate the distance between two different keys using the distance metric corresponding + * to the hashFunction + * @param x One input vector in the metric space + * @param y One input vector in the metric space + * @return The distance between x and y + */ + @Since("2.1.0") + protected[ml] def keyDistance(x: Vector, y: Vector): Double + + /** + * Calculate the distance between two different hash Vectors. 
+ * + * @param x One of the hash vector + * @param y Another hash vector + * @return The distance between hash vectors x and y + */ + @Since("2.1.0") + protected[ml] def hashDistance(x: Vector, y: Vector): Double + + @Since("2.1.0") + override def transform(dataset: Dataset[_]): DataFrame = { +transformSchema(dataset.schema, logging = true) +val transformUDF = udf(hashFunction, new VectorUDT) +dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol + } + + @Since("2.1.0") + override def transformSchema(schema: StructType): StructType = { +validateAndTransformSchema(schema) + } + + /** + * Given a large dataset and an item, approximately find at most k items which have the closest + * distance to the item. If the [[outputCol]] is missing, the method will transform the data; if + * the [[outputCol]] exists, it will use the
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84587195 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashSuite.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.linalg.{Vector, Vectors} +import org.apache.spark.mllib.util.MLlibTestSparkContext + +class MinHashSuite extends SparkFunSuite with MLlibTestSparkContext { + test("MinHash") { +val data = { + for (i <- 0 to 95) yield Vectors.sparse(100, (i until i + 5).map((_, 1.0))) +} +val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys") + +val mh = new MinHash() + .setOutputDim(1) + .setInputCol("keys") + .setOutputCol("values") + .setSeed(0) --- End diff -- Use seed != 0 as a habit --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
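For context on the suite under review: MinHash hashes a set of active indices through a random affine function and keeps the minimum, so sets with high Jaccard similarity tend to produce the same hash. A minimal standalone sketch (plain Python, not the Spark ML implementation; the helper name and the choice of prime are illustrative assumptions, and it uses a non-zero seed, per the review comment):

```python
import random

def min_hash(active_indices, seed, prime=2038074743):
    """One MinHash function: h(x) = (a*x + b) % prime, minimized over the set."""
    rng = random.Random(seed)  # seed != 0, as the review suggests as a habit
    a = rng.randrange(1, prime)
    b = rng.randrange(0, prime)
    return min((a * x + b) % prime for x in active_indices)

# Overlapping index sets, analogous to the suite's sparse vectors with 5
# consecutive active positions starting at i and i + 1 (4 of 5 indices shared).
s1 = set(range(0, 5))
s2 = set(range(1, 6))
```

Over many independent seeds, `min_hash(s1, seed) == min_hash(s2, seed)` holds with probability approximately equal to the Jaccard similarity of the two sets (here 4/6), which is the property the estimator relies on.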
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15148 Merged build finished. Test PASSed.
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15148 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67398/ Test PASSed.
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15148 **[Test build #67398 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67398/consoleFull)** for PR 15148 at commit [`cad4ecb`](https://github.com/apache/spark/commit/cad4ecb3cea47e16b9c1073d30d8fd57bc397621). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @viirya >> In the table in the description, CoalesceExec output UnknownPartitioning Yes. Since partitions == 1 is a corner case, I did not put that in the table. If you look at the code, it's doing the right thing: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L490
[GitHub] spark pull request #15600: [SPARK-17698] [SQL] Join predicates should not co...
Github user tejasapatil closed the pull request at: https://github.com/apache/spark/pull/15600
[GitHub] spark issue #15463: [SPARK-17894] [CORE] Ensure uniqueness of TaskSetManager...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15463 **[Test build #67399 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67399/consoleFull)** for PR 15463 at commit [`cd6d240`](https://github.com/apache/spark/commit/cd6d240c8972e843a1abf586c6d324bff8beefd5).
[GitHub] spark issue #15600: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15600 Thanks - merging in. Can you close this?
[GitHub] spark issue #15463: [SPARK-17894] [CORE] Ensure uniqueness of TaskSetManager...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15463 Jenkins, retest this please
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 @viirya of course if you say coalesce(1) it is a single partition -- any operator that reduces the output to one partition yields a single partition. For Expand, isn't it just the same as Generate?
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @rxin In the table in the description, `CoalesceExec` outputs `UnknownPartitioning`; actually it can be `SinglePartition` if what you do is `coalesce(1)`. `ExpandExec` doesn't actually move rows across partitions, as @tejasapatil pointed out. However, if its child has a certain partitioning such as `HashPartition`, after `ExpandExec` it becomes an `UnknownPartitioning`. I am not sure whether it changes the partitioning or not. From the perspective of the physical plan's output partitioning, it is changed indeed.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84586831 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util.SchemaUtils +import org.apache.spark.sql._ +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[LSH]]. + */ +@Since("2.1.0") +private[ml] trait LSHParams extends HasInputCol with HasOutputCol { + /** + * Param for the dimension of LSH OR-amplification. + * + * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The + * higher the dimension is, the lower the false negative rate. 
+ * @group param + */ + @Since("2.1.0") + final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" + +"increasing dimensionality lowers the false negative rate, and decreasing dimensionality" + +" improves the running performance", ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getOutputDim: Int = $(outputDim) + + setDefault(outputDim -> 1, outputCol -> "lshFeatures") + + /** + * Transform the Schema for LSH + * @param schema The schema of the input dataset without [[outputCol]] + * @return A derived schema with [[outputCol]] added + */ + @Since("2.1.0") + protected[this] final def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT) + } +} + +/** + * Model produced by [[LSH]]. + */ +@Since("2.1.0") +private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams { + self: T => + + @Since("2.1.0") + override def copy(extra: ParamMap): T = defaultCopy(extra) + + /** + * The hash function of LSH, mapping a predefined KeyType to a Vector + * @return The mapping of LSH function. + */ + @Since("2.1.0") + protected[this] val hashFunction: Vector => Vector + + /** + * Calculate the distance between two different keys using the distance metric corresponding + * to the hashFunction + * @param x One input vector in the metric space + * @param y One input vector in the metric space + * @return The distance between x and y + */ + @Since("2.1.0") + protected[ml] def keyDistance(x: Vector, y: Vector): Double + + /** + * Calculate the distance between two different hash Vectors. 
+ * + * @param x One of the hash vector + * @param y Another hash vector + * @return The distance between hash vectors x and y + */ + @Since("2.1.0") + protected[ml] def hashDistance(x: Vector, y: Vector): Double + + @Since("2.1.0") + override def transform(dataset: Dataset[_]): DataFrame = { +transformSchema(dataset.schema, logging = true) +val transformUDF = udf(hashFunction, new VectorUDT) +dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol + } + + @Since("2.1.0") + override def transformSchema(schema: StructType): StructType = { +validateAndTransformSchema(schema) + } + + /** + * Given a large dataset and an item, approximately find at most k items which have the closest + * distance to the item. If the [[outputCol]] is missing, the method will transform the data; if + * the [[outputCol]] exists, it will use the
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84586829

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[LSH]].
+ */
+@Since("2.1.0")
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for the dimension of LSH OR-amplification.
+   *
+   * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The
+   * higher the dimension is, the lower the false negative rate.
+   * @group param
+   */
+  @Since("2.1.0")
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" +
+    " increasing dimensionality lowers the false negative rate, and decreasing dimensionality" +
+    " improves the running performance", ParamValidators.gt(0))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  final def getOutputDim: Int = $(outputDim)
+
+  setDefault(outputDim -> 1, outputCol -> "lshFeatures")
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without [[outputCol]]
+   * @return A derived schema with [[outputCol]] added
+   */
+  @Since("2.1.0")
+  protected[this] final def validateAndTransformSchema(schema: StructType): StructType = {
+    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  }
+}
+
+/**
+ * Model produced by [[LSH]].
+ */
+@Since("2.1.0")
+private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams {
+  self: T =>
+
+  @Since("2.1.0")
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+
+  /**
+   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * @return The mapping of LSH function.
+   */
+  @Since("2.1.0")
+  protected[this] val hashFunction: Vector => Vector
+
+  /**
+   * Calculate the distance between two different keys using the distance metric corresponding
+   * to the hashFunction
+   * @param x One input vector in the metric space
+   * @param y One input vector in the metric space
+   * @return The distance between x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def keyDistance(x: Vector, y: Vector): Double
+
+  /**
+   * Calculate the distance between two different hash vectors.
+   *
+   * @param x One of the hash vectors
+   * @param y Another hash vector
+   * @return The distance between hash vectors x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double
+
+  @Since("2.1.0")
+  override def transform(dataset: Dataset[_]): DataFrame = {
+    transformSchema(dataset.schema, logging = true)
+    val transformUDF = udf(hashFunction, new VectorUDT)
+    dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol))))
+  }
+
+  @Since("2.1.0")
+  override def transformSchema(schema: StructType): StructType = {
+    validateAndTransformSchema(schema)
+  }
+
+  /**
+   * Given a large dataset and an item, approximately find at most k items which have the closest
+   * distance to the item. If the [[outputCol]] is missing, the method will transform the data; if
+   * the [[outputCol]] exists, it will use the
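The OR-amplification described in the `outputDim` doc above can be sketched outside Spark with a toy random-hyperplane hash family. All names below (`LshSketch`, `isCandidatePair`) are hypothetical and not the PR's API; the point is only the mechanism: two items are candidate neighbors if *any* of the `outputDim` hash bits agree, so a larger `outputDim` lowers the false negative rate at the cost of more false positives.

```scala
import scala.util.Random

object LshSketch {
  // Toy LSH family: sign-of-dot-product ("random hyperplane") hashing.
  // Returns a hash function mapping a dim-dimensional vector to outputDim bits.
  def hashFunction(dim: Int, outputDim: Int, seed: Long): Array[Double] => Array[Int] = {
    val rng = new Random(seed)
    // One random hyperplane per output dimension.
    val planes = Array.fill(outputDim, dim)(rng.nextGaussian())
    (v: Array[Double]) =>
      planes.map { plane =>
        val dot = plane.zip(v).map { case (p, x) => p * x }.sum
        if (dot >= 0) 1 else 0
      }
  }

  // OR-amplification: a candidate pair if ANY of the hash bits agree.
  def isCandidatePair(hx: Array[Int], hy: Array[Int]): Boolean =
    hx.zip(hy).exists { case (a, b) => a == b }
}
```

Identical inputs always hash identically, so they are always a candidate pair; dissimilar vectors only need to miss on *every* bit to be pruned, which becomes harder as `outputDim` grows.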
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15148 Thanks @jkbradley. I have removed BitSampling and SignRandomProjection for a follow-up PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15148 **[Test build #67398 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67398/consoleFull)** for PR 15148 at commit [`cad4ecb`](https://github.com/apache/spark/commit/cad4ecb3cea47e16b9c1073d30d8fd57bc397621).
[GitHub] spark issue #14529: [TRIVIAL][SQL] Match the name of OrcRelation companion o...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14529 Thanks, I am closing this!
[GitHub] spark pull request #14529: [TRIVIAL][SQL] Match the name of OrcRelation comp...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/14529
[GitHub] spark issue #15600: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15600 Merged build finished. Test PASSed.
[GitHub] spark issue #15600: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15600 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67396/ Test PASSed.
[GitHub] spark issue #15600: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15600 **[Test build #67396 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67396/consoleFull)** for PR 15600 at commit [`df50838`](https://github.com/apache/spark/commit/df5083894198e1a85fb17544fc596a3869a9e1b6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15595: [SPARK-18058][SQL] Comparing column types ignoring Nulla...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15595 **[Test build #67397 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67397/consoleFull)** for PR 15595 at commit [`e7b5a9b`](https://github.com/apache/spark/commit/e7b5a9b32328c5896e676284db1638819530b6dc).
[GitHub] spark issue #15541: [SPARK-17637][Scheduler]Packed scheduling for Spark task...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15541 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67395/ Test PASSed.
[GitHub] spark issue #15541: [SPARK-17637][Scheduler]Packed scheduling for Spark task...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15541 Merged build finished. Test PASSed.
[GitHub] spark issue #15541: [SPARK-17637][Scheduler]Packed scheduling for Spark task...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15541 **[Test build #67395 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67395/consoleFull)** for PR 15541 at commit [`dd2b207`](https://github.com/apache/spark/commit/dd2b2077430bbb07047e928d20c1ad8fe940827a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15484: [SPARK-17868][SQL] Do not use bitmasks during par...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15484#discussion_r84590609

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -255,98 +265,125 @@ class Analyzer(
       expr transform {
         case e: GroupingID =>
           if (e.groupByExprs.isEmpty || e.groupByExprs == groupByExprs) {
-            gid
+            Alias(gid, toPrettySQL(e))()
           } else {
             throw new AnalysisException(
               s"Columns of grouping_id (${e.groupByExprs.mkString(",")}) does not match " +
                 s"grouping columns (${groupByExprs.mkString(",")})")
           }
-        case Grouping(col: Expression) =>
+        case e @ Grouping(col: Expression) =>
           val idx = groupByExprs.indexOf(col)
           if (idx >= 0) {
-            Cast(BitwiseAnd(ShiftRight(gid, Literal(groupByExprs.length - 1 - idx)),
-              Literal(1)), ByteType)
+            Alias(Cast(BitwiseAnd(ShiftRight(gid, Literal(groupByExprs.length - 1 - idx)),
+              Literal(1)), ByteType), toPrettySQL(e))()
           } else {
             throw new AnalysisException(s"Column of grouping ($col) can't be found " +
               s"in grouping columns ${groupByExprs.mkString(",")}")
           }
       }
     }

-    // This require transformUp to replace grouping()/grouping_id() in resolved Filter/Sort
-    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-      case a if !a.childrenResolved => a // be sure all of the children are resolved.
-      case p if p.expressions.exists(hasGroupingAttribute) =>
-        failAnalysis(
-          s"${VirtualColumn.hiveGroupingIdName} is deprecated; use grouping_id() instead")
-
-      case Aggregate(Seq(c @ Cube(groupByExprs)), aggregateExpressions, child) =>
-        GroupingSets(bitmasks(c), groupByExprs, child, aggregateExpressions)
-      case Aggregate(Seq(r @ Rollup(groupByExprs)), aggregateExpressions, child) =>
-        GroupingSets(bitmasks(r), groupByExprs, child, aggregateExpressions)
+    /*
+     * Create new alias for all group by expressions for `Expand` operator.
+     */
+    private def constructGroupByAlias(groupByExprs: Seq[Expression]): Seq[Alias] = {
+      groupByExprs.map {
+        case e: NamedExpression => Alias(e, e.name)()
+        case other => Alias(other, other.toString)()
+      }
+    }

-      // Ensure all the expressions have been resolved.
-      case x: GroupingSets if x.expressions.forall(_.resolved) =>
-        val gid = AttributeReference(VirtualColumn.groupingIdName, IntegerType, false)()
-
-        // Expand works by setting grouping expressions to null as determined by the bitmasks. To
-        // prevent these null values from being used in an aggregate instead of the original value
-        // we need to create new aliases for all group by expressions that will only be used for
-        // the intended purpose.
-        val groupByAliases: Seq[Alias] = x.groupByExprs.map {
-          case e: NamedExpression => Alias(e, e.name)()
-          case other => Alias(other, other.toString)()
+    /*
+     * Construct [[Expand]] operator with grouping sets.
+     */
+    private def constructExpand(
+        selectedGroupByExprs: Seq[Seq[Expression]],
+        child: LogicalPlan,
+        groupByAliases: Seq[Alias],
+        gid: Attribute): LogicalPlan = {
+      // Change the nullability of group by aliases if necessary. For example, if we have
+      // GROUPING SETS ((a,b), a), we do not need to change the nullability of a, but we
+      // should change the nullability of b to be TRUE.
+      // TODO: For Cube/Rollup just set nullability to be `true`.
+      val expandedAttributes = groupByAliases.zipWithIndex.map { case (a, idx) =>
--- End diff --

+1. Looking at it more, I feel `zipWithIndex` is not needed at all and the `map` would suffice.
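The reviewer's point can be illustrated with toy stand-ins for Catalyst's `Alias` and `Attribute` (hypothetical types, just to show the shape of the change): since `idx` is never used in the body, `zipWithIndex` can be dropped in favor of a plain `map`.

```scala
// Hypothetical stand-ins for Catalyst's Alias and Attribute:
case class Attr(name: String, nullable: Boolean)
case class GroupAlias(child: String, name: String) {
  def toAttribute: Attr = Attr(name, nullable = false)
}

// Plain map: the index from zipWithIndex was never used in the body.
def expandedAttributes(groupByAliases: Seq[GroupAlias],
                       selectedGroupByExprs: Seq[Seq[String]]): Seq[Attr] =
  groupByAliases.map { a =>
    // Nullable iff some grouping set omits this expression
    // (Expand null-fills it for that set).
    if (selectedGroupByExprs.exists(!_.contains(a.child)))
      a.toAttribute.copy(nullable = true)
    else
      a.toAttribute
  }
```

For `GROUPING SETS ((a, b), (a))`, `a` appears in every set and stays non-nullable, while `b` is missing from `(a)` and becomes nullable, matching the comment in the diff.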
[GitHub] spark pull request #15484: [SPARK-17868][SQL] Do not use bitmasks during par...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15484#discussion_r84585577

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -216,10 +216,16 @@ class Analyzer(
      * Group Count: N + 1 (N is the number of group expressions)
      *
      * We need to get all of its subsets for the rule described above, the subset is
-     * represented as the bit masks.
+     * represented as sequence of expressions.
      */
-    def bitmasks(r: Rollup): Seq[Int] = {
-      Seq.tabulate(r.groupByExprs.length + 1)(idx => (1 << idx) - 1)
+    def rollupExprs(exprs: Seq[Expression]): Seq[Seq[Expression]] = {
+      val buffer = ArrayBuffer.empty[Seq[Expression]]
--- End diff --

Avoid using `ArrayBuffer` as insertions would lead to expansion of underlying array and copying of data to the new one. Since you know the size upfront, you could create an `Array` of required size.
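The reviewer's suggestion — size the result upfront instead of growing an `ArrayBuffer` — can be sketched generically (an illustration, not the patch's final code). A rollup over n expressions always yields exactly n + 1 prefixes, so `Seq.tabulate` fits naturally:

```scala
// Rollup over (a, b, c) produces the prefixes (), (a), (a, b), (a, b, c).
// Seq.tabulate allocates all n + 1 slots upfront, mirroring the old
// bitmask version Seq.tabulate(n + 1)(idx => (1 << idx) - 1),
// with no intermediate buffer growth or copying.
def rollupExprs[T](exprs: Seq[T]): Seq[Seq[T]] =
  Seq.tabulate(exprs.length + 1)(i => exprs.take(i))
```

The known-size allocation is the whole point of the review comment: `ArrayBuffer` doubles and copies its backing array as it grows, which is wasted work when the length is fixed in advance.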
[GitHub] spark pull request #15484: [SPARK-17868][SQL] Do not use bitmasks during par...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15484#discussion_r84585881

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -255,98 +265,125 @@ class Analyzer(
       expr transform {
         case e: GroupingID =>
           if (e.groupByExprs.isEmpty || e.groupByExprs == groupByExprs) {
-            gid
+            Alias(gid, toPrettySQL(e))()
           } else {
             throw new AnalysisException(
               s"Columns of grouping_id (${e.groupByExprs.mkString(",")}) does not match " +
                 s"grouping columns (${groupByExprs.mkString(",")})")
           }
-        case Grouping(col: Expression) =>
+        case e @ Grouping(col: Expression) =>
           val idx = groupByExprs.indexOf(col)
           if (idx >= 0) {
-            Cast(BitwiseAnd(ShiftRight(gid, Literal(groupByExprs.length - 1 - idx)),
-              Literal(1)), ByteType)
+            Alias(Cast(BitwiseAnd(ShiftRight(gid, Literal(groupByExprs.length - 1 - idx)),
+              Literal(1)), ByteType), toPrettySQL(e))()
           } else {
             throw new AnalysisException(s"Column of grouping ($col) can't be found " +
               s"in grouping columns ${groupByExprs.mkString(",")}")
           }
       }
     }

-    // This require transformUp to replace grouping()/grouping_id() in resolved Filter/Sort
-    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-      case a if !a.childrenResolved => a // be sure all of the children are resolved.
-      case p if p.expressions.exists(hasGroupingAttribute) =>
-        failAnalysis(
-          s"${VirtualColumn.hiveGroupingIdName} is deprecated; use grouping_id() instead")
-
-      case Aggregate(Seq(c @ Cube(groupByExprs)), aggregateExpressions, child) =>
-        GroupingSets(bitmasks(c), groupByExprs, child, aggregateExpressions)
-      case Aggregate(Seq(r @ Rollup(groupByExprs)), aggregateExpressions, child) =>
-        GroupingSets(bitmasks(r), groupByExprs, child, aggregateExpressions)
+    /*
+     * Create new alias for all group by expressions for `Expand` operator.
+     */
+    private def constructGroupByAlias(groupByExprs: Seq[Expression]): Seq[Alias] = {
+      groupByExprs.map {
+        case e: NamedExpression => Alias(e, e.name)()
+        case other => Alias(other, other.toString)()
+      }
+    }

-      // Ensure all the expressions have been resolved.
-      case x: GroupingSets if x.expressions.forall(_.resolved) =>
-        val gid = AttributeReference(VirtualColumn.groupingIdName, IntegerType, false)()
-
-        // Expand works by setting grouping expressions to null as determined by the bitmasks. To
-        // prevent these null values from being used in an aggregate instead of the original value
-        // we need to create new aliases for all group by expressions that will only be used for
-        // the intended purpose.
-        val groupByAliases: Seq[Alias] = x.groupByExprs.map {
-          case e: NamedExpression => Alias(e, e.name)()
-          case other => Alias(other, other.toString)()
+    /*
+     * Construct [[Expand]] operator with grouping sets.
+     */
+    private def constructExpand(
+        selectedGroupByExprs: Seq[Seq[Expression]],
+        child: LogicalPlan,
+        groupByAliases: Seq[Alias],
+        gid: Attribute): LogicalPlan = {
+      // Change the nullability of group by aliases if necessary. For example, if we have
+      // GROUPING SETS ((a,b), a), we do not need to change the nullability of a, but we
+      // should change the nullability of b to be TRUE.
+      // TODO: For Cube/Rollup just set nullability to be `true`.
+      val expandedAttributes = groupByAliases.zipWithIndex.map { case (a, idx) =>
+        if (selectedGroupByExprs.exists(!_.contains(a.child))) {
+          a.toAttribute.withNullability(true)
+        } else {
+          a.toAttribute
+        }
+      }

-        val nonNullBitmask = x.bitmasks.reduce(_ & _)
-
-        val expandedAttributes = groupByAliases.zipWithIndex.map { case (a, idx) =>
-          a.toAttribute.withNullability((nonNullBitmask & 1 << idx) == 0)
+      val groupingSetsAttributes = selectedGroupByExprs.map { groupingSetExprs =>
+        groupingSetExprs.map { expr =>
+          val alias = groupByAliases.find(_.child.semanticEquals(expr)).getOrElse(
+            failAnalysis(s"$expr doesn't show up in the GROUP BY list"))
--- End diff --

can you also display the GROUP BY list in the message ?
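A minimal sketch of the message change the reviewer asks for (the function name is hypothetical; in the real code this string is passed to `failAnalysis`):

```scala
// Include the full GROUP BY list in the error, not just the missing expression,
// so the user can see what the expression was compared against.
def missingExprMessage(expr: String, groupByExprs: Seq[String]): String =
  s"$expr doesn't show up in the GROUP BY list (${groupByExprs.mkString(", ")})"
```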
[GitHub] spark issue #15600: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15600 **[Test build #67396 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67396/consoleFull)** for PR 15600 at commit [`df50838`](https://github.com/apache/spark/commit/df5083894198e1a85fb17544fc596a3869a9e1b6).
[GitHub] spark issue #15272: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15272 @rxin : Here is the backport for the 2.0 branch: https://github.com/apache/spark/pull/15600
[GitHub] spark pull request #15600: [SPARK-17698] [SQL] Join predicates should not co...
GitHub user tejasapatil opened a pull request: https://github.com/apache/spark/pull/15600

[SPARK-17698] [SQL] Join predicates should not contain filter clauses

## What changes were proposed in this pull request?

This is a backport of https://github.com/apache/spark/pull/15272 to the 2.0 branch.

Jira: https://issues.apache.org/jira/browse/SPARK-17698

`ExtractEquiJoinKeys` is incorrectly using filter predicates as the join condition for joins. `canEvaluate` [0] tries to see if an `Expression` can be evaluated using the output of a given `Plan`. In the case of filter predicates (e.g. `a.id='1'`), the `Expression` passed for the right hand side (i.e. `'1'`) is a `Literal` which does not have any attribute references. Thus `expr.references` is an empty set, which theoretically is a subset of any set. This leads to `canEvaluate` returning `true`, so `a.id='1'` is treated as a join predicate. While this does not lead to incorrect results, in the case of bucketed + sorted tables we might miss out on avoiding an unnecessary shuffle + sort. See the example below:

[0]: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L91

e.g.

```
val df = (1 until 10).toDF("id").coalesce(1)
hc.sql("DROP TABLE IF EXISTS table1").collect
df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table1")
hc.sql("DROP TABLE IF EXISTS table2").collect
df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table2")

sqlContext.sql("""
  SELECT a.id, b.id
  FROM table1 a
  FULL OUTER JOIN table2 b
  ON a.id = b.id AND a.id='1' AND b.id='1'
""").explain(true)
```

BEFORE: This is doing shuffle + sort over the table scan outputs, which is not needed as both tables are bucketed and sorted on the same columns and have the same number of buckets. This should be a single-stage job.

```
SortMergeJoin [id#38, cast(id#38 as double), 1.0], [id#39, 1.0, cast(id#39 as double)], FullOuter
:- *Sort [id#38 ASC NULLS FIRST, cast(id#38 as double) ASC NULLS FIRST, 1.0 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#38, cast(id#38 as double), 1.0, 200)
:     +- *FileScan parquet default.table1[id#38] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct
+- *Sort [id#39 ASC NULLS FIRST, 1.0 ASC NULLS FIRST, cast(id#39 as double) ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#39, 1.0, cast(id#39 as double), 200)
      +- *FileScan parquet default.table2[id#39] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct
```

AFTER:

```
SortMergeJoin [id#32], [id#33], FullOuter, ((cast(id#32 as double) = 1.0) && (cast(id#33 as double) = 1.0))
:- *FileScan parquet default.table1[id#32] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct
+- *FileScan parquet default.table2[id#33] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct
```

## How was this patch tested?

- Added a new test case for this scenario: `SPARK-17698 Join predicates should not contain filter clauses`
- Ran all the tests in `BucketedReadSuite`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark SPARK-17698_2.0_backport

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15600.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15600

commit df5083894198e1a85fb17544fc596a3869a9e1b6
Author: Tejas Patil
Date: 2016-10-22T20:16:40Z

    Backport to 2.0 : [SPARK-17698] [SQL] Join predicates should not contain filter clauses
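The root cause described in the PR above — `canEvaluate` reduces to a subset check on attribute references, which is vacuously true for literal-only predicates — can be sketched with a simplified model (these are not Catalyst's actual classes, just the shape of the check):

```scala
// Simplified model of an expression carrying its attribute references.
case class Expr(references: Set[String])

// Catalyst's canEvaluate boils down to: expr.references subsetOf plan.outputSet.
def canEvaluate(expr: Expr, planOutput: Set[String]): Boolean =
  expr.references.subsetOf(planOutput)

// a.id = '1' compares a left-side attribute with a literal. The literal side
// has no references, so the empty set is "evaluable" against ANY plan --
// which is why the filter clause was misclassified as an equi-join key.
val leftSide = Expr(Set("a.id"))
val literal  = Expr(Set.empty)
```

With this model, `canEvaluate(literal, rightOutput)` is true for any right-side output, while `canEvaluate(leftSide, rightOutput)` is correctly false, which is exactly the asymmetry the fix exploits.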