[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18620

I don't understand why `sorted` is slower than `sortBy` - `sortBy` uses `sorted` in its implementation:
```scala
def sortBy[B](f: A => B)(implicit ord: Ordering[B]): Repr = sorted(ord on f)
```
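For reference, a standalone sketch (not from the PR; the values are illustrative) showing that `sortBy` is just `sorted` with a derived `Ordering`, so the delegation by itself should not make one path slower than the other:

```scala
object SortByVsSorted {
  def main(args: Array[String]): Unit = {
    val xs = Seq(3, 1, 2)
    // sortBy(f) expands to sorted(ord on f), so these two calls are equivalent:
    val viaSortBy = xs.sortBy(x => -x)
    val viaSorted = xs.sorted(Ordering.Int.on((x: Int) => -x))
    assert(viaSortBy == viaSorted) // both yield Seq(3, 2, 1)
  }
}
```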
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18655 Merged build finished. Test FAILed.
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/18655 Jenkins, retest this please.
[GitHub] spark issue #18637: [SPARK-15526][ML][FOLLOWUP][test-maven] Make JPMML provi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18637 **[Test build #79701 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79701/testReport)** for PR 18637 at commit [`9c3ab05`](https://github.com/apache/spark/commit/9c3ab057ad1ff89ab726ea86774692ef22151b49).
[GitHub] spark pull request #18635: [SPARK-21415] Triage scapegoat warnings, part 1
Github user srowen closed the pull request at: https://github.com/apache/spark/pull/18635
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127903537

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
+          // We can't push down non-deterministic conditions.
+          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
--- End diff --

> The other preceding join conditions before the equi-join condition could also impact it. It could be skipped if the preceding join condition is false, right?

No. We evaluate the joining keys first to find matching/not-matching rows, and then evaluate the other join conditions.
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18620 That would make sense. There must be something else going on. Overall, I don't think it is compelling enough evidence to make the `poll` change. (Though as mentioned it's not a huge deal so if others want to do it, no objection)
[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18513 **[Test build #79699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79699/testReport)** for PR 18513 at commit [`990b816`](https://github.com/apache/spark/commit/990b816428f8e5b94c08749650be05a3f52d07db).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18632: [SPARK-21412][SQL] Reset BufferHolder while initialize a...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18632 ok to test
[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18513 Merged build finished. Test PASSed.
[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18513 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79699/ Test PASSed.
[GitHub] spark pull request #18669: tfidf-new edit
Github user chlyzzo closed the pull request at: https://github.com/apache/spark/pull/18669
[GitHub] spark pull request #18659: [SPARK-21404][PYSPARK][WIP] Simple Python Vectori...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18659#discussion_r127913117

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala ---
@@ -132,6 +135,61 @@ private[sql] object ArrowConverters {
     }
   }
+  private[sql] def fromPayloadIterator(iter: Iterator[ArrowPayload]): Iterator[InternalRow] = {
+    new Iterator[InternalRow] {
+      private val _allocator = new RootAllocator(Long.MaxValue)
+      private var _reader: ArrowFileReader = _
+      private var _root: VectorSchemaRoot = _
+      private var _index = 0
+
+      loadNextBatch()
+
+      override def hasNext: Boolean = _root != null && _index < _root.getRowCount
+
+      override def next(): InternalRow = {
+        val fields = _root.getFieldVectors.asScala
+
+        val genericRowData = fields.map { field =>
+          field.getAccessor.getObject(_index)
+        }.toArray[Any]
--- End diff --

How about using `SpecificInternalRow`? I think it could eliminate some boxing/unboxing. The following is a snippet of this usage:
```scala
val fieldTypes = fields.map { field =>
  field match {
    case _: NullableIntVector => IntegerType
    case _: NullableFloat8Vector => DoubleType
    ...
  }
}
val row = new SpecificInternalRow(fieldTypes)
fields.zipWithIndex.foreach { case (field, i) =>
  field match {
    case v: NullableIntVector => row.setInt(i, v.getAccessor.get(_index))
    case v: NullableFloat8Vector => row.setDouble(i, v.getAccessor.get(_index))
    ...
  }
}
```
[GitHub] spark issue #18665: [SPARK-21446] [SQL] Fix setAutoCommit never executed
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18665 **[Test build #3844 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3844/testReport)** for PR 18665 at commit [`9ba431a`](https://github.com/apache/spark/commit/9ba431a838a16a8371b3d3f6ef028158576f85d2).
[GitHub] spark pull request #15471: [SPARK-17919] Make timeout to RBackend configurab...
Github user QCTW commented on a diff in the pull request: https://github.com/apache/spark/pull/15471#discussion_r127771891

--- Diff: R/pkg/R/backend.R ---
@@ -108,13 +108,27 @@ invokeJava <- function(isStatic, objId, methodName, ...) {
   conn <- get(".sparkRCon", .sparkREnv)
   writeBin(requestMessage, conn)
-  # TODO: check the status code to output error information
   returnStatus <- readInt(conn)
+  handleErrors(returnStatus, conn)
+
+  # Backend will send -1 as keep alive value to prevent various connection timeouts
+  # on very long running jobs. See spark.r.heartBeatInterval
+  while (returnStatus == 1) {
--- End diff --

Shouldn't the returnStatus check have a retry limit to avoid an infinite loop? I hit an infinite loop when it is called by Toree's sparkr_runner.R, with the error message "Failed to connect JVM: Error in socketConnection(host = hostname, port = port, server = FALSE, : argument "timeout" is missing, with no default".
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127897413

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
--- End diff --

> One row in side A could match multiple rows in side B. The join conditions could be also evaluated multiple times for the same row in side A, right? Then, if we push it down to the side A, it could also break the number of rand calls, right?

No. Joining keys are evaluated once on each side, and then we simply match the evaluated results.
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18656 No. I meant if there's a CodegenFallback expression, wholestage codegen will not be enabled.
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/18654 retest this please
[GitHub] spark issue #18633: [SPARK-21411][YARN] Lazily create FS within kerberized U...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/18633 LGTM
[GitHub] spark issue #18669: tfidf-new edit
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18669 Can one of the admins verify this patch?
[GitHub] spark issue #18635: [SPARK-21415] Triage scapegoat warnings, part 1
Github user srowen commented on the issue: https://github.com/apache/spark/pull/18635 Merged to master
[GitHub] spark pull request #18637: [SPARK-15526][ML][FOLLOWUP][test-maven] Make JPMM...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/18637#discussion_r127903284

--- Diff: mllib/pom.xml ---
@@ -139,8 +133,38 @@
+ target/scala-${scala.binary.version}/classes
  target/scala-${scala.binary.version}/test-classes
+ org.apache.maven.plugins
+ maven-dependency-plugin
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127918499

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
--- End diff --

Most RDBMS systems allow non-deterministic join conditions. To support this correctly in Spark, we need to check how the other systems behave. Once we decide on the rule, we can't break it, so we have to be very careful designing the initial version. At the current stage, I do not think we have the bandwidth to make it perfect. If you want to continue the PR, could you just check how Hive works? Adding an extra flag for Hive users can simplify their migration task. By default, turn it off.
[GitHub] spark pull request #18639: [SPARK-21408][core] Better default number of RPC ...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/18639#discussion_r127898848

--- Diff: core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala ---
@@ -33,7 +33,7 @@ import org.apache.spark.util.ThreadUtils
 /**
  * A message dispatcher, responsible for routing RPC messages to the appropriate endpoint(s).
  */
-private[netty] class Dispatcher(nettyEnv: NettyRpcEnv) extends Logging {
+private[netty] class Dispatcher(nettyEnv: NettyRpcEnv, numUsableCores: Int) extends Logging {
--- End diff --

Should we document the behavior when `numUsableCores` is set to 0 in the comment above?
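One common way to document and implement that convention, sketched here as an assumption rather than taken from the patch, is to treat a non-positive value as "fall back to the JVM's available processors":

```scala
// Hedged sketch: assumes numUsableCores <= 0 means "use all processors visible to the JVM".
private def numDispatcherThreads(numUsableCores: Int): Int = {
  val available =
    if (numUsableCores > 0) numUsableCores
    else Runtime.getRuntime.availableProcessors()
  math.max(1, available)
}
```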
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127901508

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
--- End diff --

Could you check the behavior of DB2 and Oracle? This is about the semantics, not just performance; we need to check what the correct behavior is. BTW, `EnsureRequirements` could also add an extra `Sort` below the join. Our implementation was never designed with this support in mind, and many factors could break the assumption.
[GitHub] spark pull request #18667: Fix the simpleString used in error messages
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/18667#discussion_r127903964

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/LongType.scala ---
@@ -43,7 +43,7 @@ class LongType private() extends IntegralType {
   */
  override def defaultSize: Int = 8
- override def simpleString: String = "bigint"
+ override def simpleString: String = "long"
--- End diff --

I don't think so. bigint is the SQL type for an 8-byte integer, right?
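A quick spark-shell check of that mapping, using Spark's public `org.apache.spark.sql.types` constants:

```scala
import org.apache.spark.sql.types.{IntegerType, LongType}

// Catalyst's Long-backed type reports the SQL name for an 8-byte integer.
println(LongType.simpleString)    // "bigint"
println(IntegerType.simpleString) // "int"
```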
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 I am ok to close this. Thanks @MLnick
[GitHub] spark pull request #18632: [SPARK-21412][SQL] Reset BufferHolder while initi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18632#discussion_r127904688

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java ---
@@ -51,6 +51,7 @@ public UnsafeRowWriter(BufferHolder holder, int numFields) {
     this.nullBitsSize = UnsafeRow.calculateBitSetWidthInBytes(numFields);
     this.fixedSize = nullBitsSize + 8 * numFields;
     this.startingOffset = holder.cursor;
+    holder.reset();
--- End diff --

I don't think we guarantee that every call to this `UnsafeRowWriter` constructor happens at the start of a new incoming record. It is possible that we pass a `BufferHolder` into this constructor that has already been written with some data, and we want to continue writing from the current cursor.
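To illustrate the concern, a hedged sketch (the class names are Spark's codegen helpers, but the usage pattern here is illustrative, not code from this PR): an outer row writer and an inner struct writer can share one `BufferHolder`, so an unconditional reset in the `UnsafeRowWriter` constructor would clobber bytes the outer writer has already written.

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeRowWriter}

val row = new UnsafeRow(2)
val holder = new BufferHolder(row)

val outer = new UnsafeRowWriter(holder, 2) // top-level row: resetting here would be fine
outer.write(0, 42L)

// A nested struct writer reuses the same holder and continues from the current cursor;
// calling holder.reset() in this constructor would wipe what `outer` has written.
val inner = new UnsafeRowWriter(holder, 1)
inner.write(0, 1.5d)
```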
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/18620 My benchmarks locally said poll() is a little faster on moderately large collections, like 100 elements in the queue. I'm really neutral. If it affords a little help, that's great. It's a natural method for a queue to have and no extra implementation cost.
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 Thanks @srowen, my test also said pq.poll is a little faster in some cases. One possible benefit here is that if we provide pq.poll, users' first choice may be pq.poll rather than pq.toArray.sorted, which can cause a performance regression, as I encountered in https://github.com/apache/spark/pull/18624
[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18513 **[Test build #79699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79699/testReport)** for PR 18513 at commit [`990b816`](https://github.com/apache/spark/commit/990b816428f8e5b94c08749650be05a3f52d07db).
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127903005

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
+          // We can't push down non-deterministic conditions.
+          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
--- End diff --

I do not think we can have an easy solution that ensures it always works as you expect. `EnsureRequirements` is just one of the rules that could break it. The other preceding join conditions before the equi-join condition could also impact it. It could be skipped if the preceding join condition is false, right?
[GitHub] spark pull request #18667: Fix the simpleString used in error messages
Github user fxbonnet commented on a diff in the pull request: https://github.com/apache/spark/pull/18667#discussion_r127905854

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/LongType.scala ---
@@ -43,7 +43,7 @@ class LongType private() extends IntegralType {
   */
  override def defaultSize: Int = 8
- override def simpleString: String = "bigint"
+ override def simpleString: String = "long"
--- End diff --

When you try to read a CSV and map it to a case class with a Long, you get a message like this one:

__EXCEPTION__: org.apache.spark.sql.AnalysisException: Cannot up cast linked_docs.`MR_NUMBER_OF_DOCS_UPLOADED` from string to bigint as it may truncate
The type path of the target object is:
- field (class: "scala.Long", name: "MR_NUMBER_OF_DOCS_UPLOADED")

Getting a message that talks about bigint while you are trying to cast a String to a Long looks confusing to me. I thought this was a typo.
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127907280

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
+          // We can't push down non-deterministic conditions.
+          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
--- End diff --

We do not support non-deterministic join conditions. Thus, the current execution order in our join implementation might not behave correctly. If we really need to support this, we have to check what the right behavior is in traditional DB systems.
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127909294

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
--- End diff --

I just did a simple test on Oracle. It looks like it allows the following query:

    SELECT * FROM test1 JOIN test2
    ON test1.a + FLOOR(DBMS_RANDOM.VALUE()) = test2.b + FLOOR(DBMS_RANDOM.VALUE());

Furthermore, it also doesn't disallow non-deterministic functions in join conditions other than the joining keys.
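For comparison, a hedged sketch of the analogous query in Spark (`test1` and `test2` are assumed temp views with integer columns `a` and `b`); as of this discussion, Spark rejects such a query during analysis because the join condition is non-deterministic, which is the behavior this PR revisits:

```scala
// Illustrative only; Spark currently rejects this at analysis time because nondeterministic
// expressions are only allowed in Project, Filter, Aggregate or Window operators.
spark.sql(
  """
    |SELECT *
    |FROM test1 JOIN test2
    |  ON test1.a + FLOOR(RAND()) = test2.b + FLOOR(RAND())
  """.stripMargin)
```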
[GitHub] spark issue #18632: [SPARK-21412][SQL] Reset BufferHolder while initialize a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18632 **[Test build #79702 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79702/testReport)** for PR 18632 at commit [`a098540`](https://github.com/apache/spark/commit/a0985404363f2975bf673e37306d0bd1c700a4d0).
[GitHub] spark issue #18468: [SPARK-20873][SQL] Creat CachedBatchColumnVector to abst...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18468 **[Test build #79703 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79703/testReport)** for PR 18468 at commit [`0aa1b78`](https://github.com/apache/spark/commit/0aa1b785a0ed0038cc6a30dbb9334a0ce98992d5).
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18654 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79695/ Test FAILed.
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18656 Yeah, CodegenFallback just provides a fallback mode. However, in this case, SortMergeJoinExec passes an incomplete row as input to a Hive UDF that implements CodegenFallback.
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12646 Merged build finished. Test FAILed.
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18655 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79696/ Test FAILed.
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18654 **[Test build #79695 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79695/testReport)** for PR 18654 at commit [`d118d68`](https://github.com/apache/spark/commit/d118d685374242599a12d6536675ba7aeae4bfb7).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12646 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79697/ Test FAILed.
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127896217

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
+          // We can't push down non-deterministic conditions.
+          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
--- End diff --

The whole thing does not make sense to me at all. Here, I think we are just trying to behave consistently with Hive, although this looks like a bug to me. We really should check how Hive works before supporting it.
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18654 Merged build finished. Test FAILed.
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12646 **[Test build #79697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79697/testReport)** for PR 12646 at commit [`9bb80ea`](https://github.com/apache/spark/commit/9bb80eaf8e0b4339850d8c48e221c8ad1e477552).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18655 **[Test build #79696 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79696/testReport)** for PR 18655 at commit [`8ffedda`](https://github.com/apache/spark/commit/8ffedda9f05d379d700aef95dca049a751374f87).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 I am also very confused about this. You can change https://github.com/apache/spark/pull/18624 to use sorted and test.
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127897096

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
--- End diff --

How do `rand(a)` and `rand(b)` share the same state? They are different expression instances.
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18655 **[Test build #79698 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79698/testReport)** for PR 18655 at commit [`8ffedda`](https://github.com/apache/spark/commit/8ffedda9f05d379d700aef95dca049a751374f87).
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127898565

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
--- End diff --

From this line of discussion, it seems to me you are still treating joining keys and other join conditions together. However, pushing down non-deterministic joining keys doesn't actually change the join results, as I said above. I am not sure why it doesn't make sense.
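A hedged DataFrame sketch of that idea (illustrative names, not the PR's code): materialize the non-deterministic key once per row in a projection below the join, then join on the materialized column; each side's key values are computed exactly once per row, so the match result is unchanged.

```scala
import org.apache.spark.sql.functions._

// Conceptually what pulling the non-deterministic joining key below the join means:
val left  = spark.range(100).withColumn("k", floor(rand(1) * 10)) // key computed once per row
val right = spark.range(10).withColumnRenamed("id", "b")
val joined = left.join(right, left("k") === right("b"))           // join on the materialized key
```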
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/18655 @BryanCutler I'd like to share the motivation for refactoring `ArrowConverters` and `ColumnWriter`. For `ColumnWriter`, at first I'd like to support complex types like `ArrayType` and `StructType`, so I refactored it based on your `ColumnWriter` implementation. I then renamed and moved the package so that we can also use it for pandas UDFs, as @cloud-fan mentioned. As you might have seen before, I'll introduce `ArrowColumnVector` as a reader for Arrow vectors as well. For `ArrowConverters`, I thought we could skip the intermediate `ArrowRecordBatch` creation in `ArrowConverters.toPayloadIterator()`. What do you think about that? Thanks!
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 My micro benchmark (a standalone program that only tests pq.toArray.sorted, pq.toArray.sortBy, and pq.poll) did not find a significant performance difference. Only in the Spark job is there a big difference. Confused.
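For what it's worth, a minimal standalone sketch of such a micro benchmark, using `java.util.PriorityQueue` as a stand-in for Spark's `BoundedPriorityQueue` (which is backed by it); the sizes and timing helper are illustrative:

```scala
import java.util.PriorityQueue
import scala.collection.JavaConverters._
import scala.util.Random

def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.3f ms")
  result
}

val values = Seq.fill(100)(Random.nextInt()).map(Int.box)

// Drain via repeated poll(): the heap hands elements back already in priority order.
val pollQueue = new PriorityQueue[Integer](values.asJava)
val byPoll = time("poll")(Iterator.continually(pollQueue.poll()).takeWhile(_ != null).toList)

// Copy out and sort: same result, but pays for a copy plus a full comparison sort.
val sortQueue = new PriorityQueue[Integer](values.asJava)
val bySorted = time("toArray.sorted")(sortQueue.asScala.toList.sortBy(_.intValue()))

assert(byPoll == bySorted)
```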
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18654 **[Test build #79700 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79700/testReport)** for PR 18654 at commit [`d118d68`](https://github.com/apache/spark/commit/d118d685374242599a12d6536675ba7aeae4bfb7).
[GitHub] spark pull request #18669: tfidf-new edit
GitHub user chlyzzo opened a pull request: https://github.com/apache/spark/pull/18669

tfidf-new edit

## What changes were proposed in this pull request?

I added a TfIdf.scala that can compute the TF-IDF vector of documents. I have a use case computing document similarity, so I used Spark MLlib; the code is as follows:
~~~scala
val hashingTF = new HashingTF()
val tf = hashingTF.transform(dataSeg)
val idfIgnore = new IDF().fit(tf)
val tfidfIgnore = idfIgnore.transform(tf)
val data = docIds.zip(tfidfIgnore) // RDD[(String, Vector)]
~~~
On a small dataset it produces a result but takes a long time; on a big dataset (25 documents) it does not work, and the job does not produce a result within 1 hour. The Spark configuration is:
~~~bash
--driver-memory 8G
--conf spark.yarn.executor.memoryOverhead=6144
--conf spark.akka.frameSize=300
num-executors=20 executor-cores=5 executor-memory=10g
~~~
So I wrote the TF-IDF method myself and tested it on the dataset (25 documents); it produces the result.

## How was this patch tested?

I wrote TfIdf.scala; it computes document TF-IDF values and converts them to vectors, after which you can use cosine similarity.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18669.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18669

commit 7cb566abc27d41d5816dee16c6ecb749da2adf46
Author: Yuming Wang
Date: 2017-05-05T10:31:59Z
[SPARK-19660][SQL] Replace the deprecated property name fs.default.name to fs.defaultFS that newly introduced
## What changes were proposed in this pull request? Replace the deprecated property name `fs.default.name` to `fs.defaultFS` that newly introduced. ## How was this patch tested? Existing tests Author: Yuming Wang Closes #17856 from wangyum/SPARK-19660. (cherry picked from commit 37cdf077cd3f436f777562df311e3827b0727ce7) Signed-off-by: Sean Owen

commit dbb54a7b39568cc9e8046a86113b98c3c69b7d11
Author: jyu00
Date: 2017-05-05T10:36:51Z
[SPARK-20546][DEPLOY] spark-class gets syntax error in posix mode
## What changes were proposed in this pull request? Updated spark-class to turn off posix mode so the process substitution doesn't cause a syntax error. ## How was this patch tested? Existing unit tests, manual spark-shell testing with posix mode on Author: jyu00 Closes #17852 from jyu00/master. (cherry picked from commit 5773ab121d5d7cbefeef17ff4ac6f8af36cc1251) Signed-off-by: Sean Owen

commit 1fa3c86a740e072957a2104dbd02ca3c158c508d
Author: Jarrett Meyer
Date: 2017-05-05T15:30:42Z
[SPARK-20613] Remove excess quotes in Windows executable
## What changes were proposed in this pull request? Quotes are already added to the RUNNER variable on line 54. There is no need to put quotes on line 67. If you do, you will get an error when launching Spark. '""C:\Program' is not recognized as an internal or external command, operable program or batch file. ## How was this patch tested? Tested manually on Windows 10. Author: Jarrett Meyer Closes #17861 from jarrettmeyer/fix-windows-cmd. (cherry picked from commit b9ad2d1916af5091c8585d06ccad8219e437e2bc) Signed-off-by: Felix Cheung

commit f71aea6a0be6eda24623d8563d971687ecd04caf
Author: Yucai
Date: 2017-05-05T16:51:57Z
[SPARK-20381][SQL] Add SQL metrics of numOutputRows for ObjectHashAggregateExec
## What changes were proposed in this pull request? ObjectHashAggregateExec is missing numOutputRows, add this metrics for it. ## How was this patch tested? Added unit tests for the new metrics. Author: Yucai Closes #17678 from yucai/objectAgg_numOutputRows. (cherry picked from commit 41439fd52dd263b9f7d92e608f027f193f461777) Signed-off-by: Xiao Li

commit 24fffacad709c553e0f24ae12a8cca3ab980af3c
Author: Shixiong Zhu
Date: 2017-05-05T18:08:26Z
[SPARK-20603][SS][TEST] Set default number of topic partitions to 1 to reduce the load
## What changes were proposed in this pull request? I checked the logs of https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.2-test-maven-hadoop-2.7/47/ and found it took several seconds to create Kafka internal
[GitHub] spark issue #18669: tfidf-new edit
Github user srowen commented on the issue: https://github.com/apache/spark/pull/18669 @chlyzzo close this
[GitHub] spark pull request #18632: [SPARK-21412][SQL] Reset BufferHolder while initi...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/18632#discussion_r127907518 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java --- @@ -51,6 +51,7 @@ public UnsafeRowWriter(BufferHolder holder, int numFields) { this.nullBitsSize = UnsafeRow.calculateBitSetWidthInBytes(numFields); this.fixedSize = nullBitsSize + 8 * numFields; this.startingOffset = holder.cursor; +holder.reset(); --- End diff -- What do you mean by 'writer is for inner struct'? @cloud-fan --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18632: [SPARK-21412][SQL] Reset BufferHolder while initi...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/18632#discussion_r127908258 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java --- @@ -51,6 +51,7 @@ public UnsafeRowWriter(BufferHolder holder, int numFields) { this.nullBitsSize = UnsafeRow.calculateBitSetWidthInBytes(numFields); this.fixedSize = nullBitsSize + 8 * numFields; this.startingOffset = holder.cursor; +holder.reset(); --- End diff -- @cloud-fan @viirya For your worries, maybe we can move the `holder.reset()` to `BufferHolder`'s constructor. Then the holder will be reset only once, and it's also OK to continue writing from a buffer's current cursor. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18669: tfidf-new edit
Github user chlyzzo commented on the issue: https://github.com/apache/spark/pull/18669 closed, - Original message - From: Sean Owen To: apache/spark Cc: chlyzzo, Mention Subject: Re: [apache/spark] tfidf-new edit (#18669) Date: 2017-07-18 15:41 @chlyzzo close this You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18624 Hi @srowen @MLnick @jkbradley @mengxr @yanboliang Is this change acceptable? If it is, I will update the ALS ML code following this method, and also update the test suite, which is currently too simple to detect ALS errors. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127891910 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- For different join types, I think the joining keys are used to find matching/non-matching rows. Currently I can't think of a case where we can't push down non-deterministic joining keys. Maybe you can show an example? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18656 Will CodegenFallback be used in whole-stage codegen? I think it's not supported. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127894313 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- IIUC, for joining keys, it actually satisfies what you said: they are evaluated in the same order and the same number of times as when we don't push them down. I can't think of an example where that doesn't hold, so may I ask if you have one? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127888746 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempPath { dir => --- End diff -- More clear :) No need to create source files in real. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties from s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18668 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127893543 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- The major point here is that pushing down a non-deterministic join condition is safe only when the results are exactly the same before and after the push-down. After we push it down, it will basically be evaluated for each row of that side. Will it be evaluated in the same order and the same number of times if we do not push it down? We can find many different scenarios that break this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12646 **[Test build #79697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79697/testReport)** for PR 12646 at commit [`9bb80ea`](https://github.com/apache/spark/commit/9bb80eaf8e0b4339850d8c48e221c8ad1e477552). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127892847 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- What is the join key? Any definition? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...
GitHub user yaooqinn opened a pull request: https://github.com/apache/spark/pull/18668 [SPARK-21451][SQL]get `spark.hadoop.*` properties from sysProps to hiveconf ## What changes were proposed in this pull request? get `spark.hadoop.*` properties from sysProps to hiveconf ## How was this patch tested? UT You can merge this pull request into a Git repository by running: $ git pull https://github.com/yaooqinn/spark SPARK-21451 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18668.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18668 commit 89d9b86616196fde5d0b3a08fb284e6af6afe588 Author: Kent Yao Date: 2017-07-18T06:41:24Z HiveConf in SparkSQLCLIDriver doesn't respect spark.hadoop.some.hive.variables --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
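A rough sketch of the behavior this PR appears to describe, for illustration only: system properties prefixed with `spark.hadoop.` are copied into the Hive/Hadoop configuration with the prefix stripped. The property names and the plain `Map` stand-in below are assumptions; the PR's actual wiring into `HiveConf` may differ:

```scala
object SparkHadoopProps {
  // Keep only spark.hadoop.* entries and strip the prefix, yielding Hadoop/Hive keys.
  def hadoopPropsFromSysProps(sysProps: Map[String, String]): Map[String, String] =
    sysProps.collect {
      case (k, v) if k.startsWith("spark.hadoop.") => k.stripPrefix("spark.hadoop.") -> v
    }

  def main(args: Array[String]): Unit = {
    val props = Map(
      "spark.hadoop.hive.exec.dynamic.partition" -> "true", // example property, not from the PR
      "spark.master" -> "local[*]"
    )
    println(hadoopPropsFromSysProps(props))
    // Map(hive.exec.dynamic.partition -> true)
  }
}
```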
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127895586 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- However, `rand(a)` and `rand(b)` could share the same state inside of `rand`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18656 Hi, @cloud-fan, @vanzin , could you help to take a look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127893995 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. + // We can't push down non-deterministic conditions. + case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _) --- End diff -- Supporting only equi-join does not sound reasonable here. The join condition can be any predicate. How about adding a SQLConf flag for controlling it? We could simply push it down no matter whether its semantics stay the same or not, to make it consistent with Hive. By default, the flag would be turned off. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127895248 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. + // We can't push down non-deterministic conditions. + case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _) --- End diff -- Joining keys can only be equi-join. It is exactly the use case discussed in the dev mailing list, and it's actually useful for those use cases. A general non-deterministic join condition pushdown doesn't make a lot of sense. Predicates like `rand(1) > 0 && rand(11) < 0` can be a serious concern: the join results can be different before and after pushdown. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127895399 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- `rand(a)` and `rand(b)` belong to individual tables, so they are evaluated individually on different tables. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127895419 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- One row on side A could match multiple rows on side B, so the join condition could also be evaluated multiple times for the same row on side A, right? Then, if we push it down to side A, it could also change the number of `rand` calls, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
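To make the evaluation-count concern above concrete, here is a toy, pure-Scala nested-loop "join" (not Spark code; all names are illustrative). A non-deterministic key expression evaluated inside the join condition runs once per compared pair, while the pushed-down version materializes it once per left row, so the two plans invoke the expression a different number of times:

```scala
import scala.util.Random

object NonDeterministicKeyPushdown {
  def main(args: Array[String]): Unit = {
    val left  = Seq(1, 2, 3)
    val right = Seq(1, 1, 1, 2)   // one left row can match several right rows
    var evalsInCondition = 0
    var evalsPushedDown  = 0

    // Key expression evaluated inside the join condition: once per compared pair.
    val rng1 = new Random(42)
    val matches1 =
      for (a <- left; b <- right; if { evalsInCondition += 1; a + rng1.nextInt(2) == b }) yield (a, b)

    // "Pushed down": the key is materialized once per left row before joining.
    val rng2 = new Random(42)
    val leftKeyed = left.map { a => evalsPushedDown += 1; (a, a + rng2.nextInt(2)) }
    val matches2 = for ((a, k) <- leftKeyed; b <- right; if k == b) yield (a, b)

    println(s"evaluations in condition: $evalsInCondition -> ${matches1.size} matches") // 12 evaluations
    println(s"evaluations when pushed:  $evalsPushedDown -> ${matches2.size} matches")  // 3 evaluations
  }
}
```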
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18654 **[Test build #79695 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79695/testReport)** for PR 18654 at commit [`d118d68`](https://github.com/apache/spark/commit/d118d685374242599a12d6536675ba7aeae4bfb7). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 Hi @MLnick , @srowen . My tests show that pq.poll is not significantly faster than pq.toArray.sortBy, but it is significantly faster than pq.toArray.sorted. It seems that not every pq.toArray.sorted (such as the one used in topByKey) can be replaced by pq.toArray.sortBy, so replacing pq.toArray.sorted with pq.poll is still a benefit. You can compare the performance of pq.sorted, pq.sortBy, and pq.poll using: https://github.com/apache/spark/pull/18624 The performance of pq.toArray.sortBy is about the same as pq.poll, roughly a 20% improvement over pq.toArray.sorted. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
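For readers who want to see the three approaches side by side, here is a standalone sketch. Spark's `BoundedPriorityQueue` is internal, so `scala.collection.mutable.PriorityQueue` stands in for it here (an assumption); the sketch only shows the shapes of the calls being compared, not their relative performance:

```scala
import scala.collection.mutable.PriorityQueue

object PqDrainSketch {
  def main(args: Array[String]): Unit = {
    type Rec = (Int, Double)                                   // (id, score)
    val byScore: Ordering[Rec] = Ordering.by[Rec, Double](_._2)

    // A top-K queue usually keeps the smallest score at the head; the reversed
    // ordering makes this PriorityQueue behave like that min-heap.
    val pq = PriorityQueue((1, 0.3), (2, 0.9), (3, 0.1))(byScore.reverse)

    // 1) toArray.sorted: needs an Ordering for the whole element type.
    val viaSorted = pq.clone().toArray.sorted(byScore)

    // 2) toArray.sortBy: sorts by an extracted key, here the score.
    val viaSortBy = pq.clone().toArray.sortBy(_._2)

    // 3) poll-style drain: dequeue repeatedly; elements come out in ascending score order.
    val viaDrain = Array.fill(pq.size)(pq.dequeue())

    println(viaSorted.toList)
    println(viaSortBy.toList)
    println(viaDrain.toList)
  }
}
```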
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18655 **[Test build #79696 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79696/testReport)** for PR 18655 at commit [`8ffedda`](https://github.com/apache/spark/commit/8ffedda9f05d379d700aef95dca049a751374f87). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127893174 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- We use `ExtractEquiJoinKeys` to extract joining keys. You can check it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18555: [SPARK-21353][CORE]add checkValue in spark.internal.conf...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/18555 @gatorsmile Could you please review this code again? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127894772 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- Even for equi-join, what about `rand(a) = rand(b)`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18468: [SPARK-20873][SQL] Creat CachedBatchColumnVector ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18468#discussion_r127962023 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/CachedBatchColumnVector.java --- @@ -0,0 +1,421 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.vectorized; + +import java.nio.ByteBuffer; + +import org.apache.spark.memory.MemoryMode; +import org.apache.spark.sql.execution.columnar.*; +import org.apache.spark.sql.types.*; +import org.apache.spark.unsafe.types.UTF8String; + +/** + * A column backed by an in memory JVM array. + */ +public final class CachedBatchColumnVector extends ColumnVector implements java.io.Serializable { + + // keep compressed data + private byte[] buffer; + + // whether a row is already extracted or not. If extractTo() is called, set true + // e.g. when isNullAt() and getInt() ara called, extractTo() must be called only once + private boolean[] calledExtractTo; + + // accessor for a column + private transient ColumnAccessor columnAccessor; + + // a row where the compressed data is extracted + private transient ColumnVector columnVector; + + // an accessor uses only row 0 in columnVector + private final int ROWID = 0; + + + public CachedBatchColumnVector(byte[] buffer, int numRows, DataType type) { +super(numRows, DataTypes.NullType, MemoryMode.ON_HEAP); +initialize(buffer, type); +reserveInternal(numRows); +reset(); + } + + @Override + public long valuesNativeAddress() { +throw new RuntimeException("Cannot get native address for on heap column"); + } + @Override + public long nullsNativeAddress() { +throw new RuntimeException("Cannot get native address for on heap column"); + } + + @Override + public void close() { + } + + private void setColumnAccessor() { +ByteBuffer byteBuffer = ByteBuffer.wrap(buffer); +columnAccessor = ColumnAccessor$.MODULE$.apply(type, byteBuffer); +calledExtractTo = new boolean[capacity]; + } + + // call extractTo() before getting actual data + private void prepareAccess(int rowId) { +if (!calledExtractTo[rowId]) { + assert (columnAccessor.hasNext()); + columnAccessor.extractTo(columnVector, ROWID); + calledExtractTo[rowId] = true; +} + } + + // + // APIs dealing with nulls + // + + @Override + public void putNotNull(int rowId) { +throw new UnsupportedOperationException(); + } + + @Override + public void putNull(int rowId) { +throw new UnsupportedOperationException(); + } + + @Override + public void putNulls(int rowId, int count) { +throw new UnsupportedOperationException(); + } + + @Override + public void putNotNulls(int rowId, int count) { +throw new UnsupportedOperationException(); + } + + @Override + public boolean isNullAt(int rowId) { +prepareAccess(rowId); +return columnVector.isNullAt(ROWID); + } + 
+ // + // APIs dealing with Booleans + // + + @Override + public void putBoolean(int rowId, boolean value) { +throw new UnsupportedOperationException(); + } + + @Override + public void putBooleans(int rowId, int count, boolean value) { +throw new UnsupportedOperationException(); + } + + @Override + public boolean getBoolean(int rowId) { --- End diff -- We do not support reading values in a random order. This is because implementation of `CompressionScheme` (e.g. `IntDelta`) supports only sequential access. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working,
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18656 I think the check for `SortMergeJoinExec` in `insertInputAdapter` should be corrected to:
  private def insertInputAdapter(plan: SparkPlan): SparkPlan = plan match {
    case p if !supportCodegen(p) =>
      // collapse them recursively
      InputAdapter(insertWholeStageCodegen(p))
    case j @ SortMergeJoinExec(_, _, _, _, left, right) =>
      // The children of SortMergeJoin should do codegen separately.
      j.copy(left = InputAdapter(insertWholeStageCodegen(left)),
        right = InputAdapter(insertWholeStageCodegen(right)))
    case p =>
      p.withNewChildren(p.children.map(insertInputAdapter))
  }
Can you try it? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127965749 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- Sure. I agreed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127965550 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. + // We can't push down non-deterministic conditions. + case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _) --- End diff -- cc @cloud-fan and @hvanhovell if you have more insights that can be shared with us about this part. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18468: [SPARK-20873][SQL] Creat CachedBatchColumnVector ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18468#discussion_r127969028 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/CachedBatchColumnVector.java --- @@ -0,0 +1,421 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.vectorized; + +import java.nio.ByteBuffer; + +import org.apache.spark.memory.MemoryMode; +import org.apache.spark.sql.execution.columnar.*; +import org.apache.spark.sql.types.*; +import org.apache.spark.unsafe.types.UTF8String; + +/** + * A column backed by an in memory JVM array. + */ +public final class CachedBatchColumnVector extends ColumnVector implements java.io.Serializable { + + // keep compressed data + private byte[] buffer; + + // whether a row is already extracted or not. If extractTo() is called, set true + // e.g. when isNullAt() and getInt() ara called, extractTo() must be called only once + private boolean[] calledExtractTo; + + // accessor for a column + private transient ColumnAccessor columnAccessor; + + // a row where the compressed data is extracted + private transient ColumnVector columnVector; + + // an accessor uses only row 0 in columnVector + private final int ROWID = 0; + + + public CachedBatchColumnVector(byte[] buffer, int numRows, DataType type) { +super(numRows, DataTypes.NullType, MemoryMode.ON_HEAP); +initialize(buffer, type); +reserveInternal(numRows); +reset(); + } + + @Override + public long valuesNativeAddress() { +throw new RuntimeException("Cannot get native address for on heap column"); + } + @Override + public long nullsNativeAddress() { +throw new RuntimeException("Cannot get native address for on heap column"); + } + + @Override + public void close() { + } + + private void setColumnAccessor() { +ByteBuffer byteBuffer = ByteBuffer.wrap(buffer); +columnAccessor = ColumnAccessor$.MODULE$.apply(type, byteBuffer); +calledExtractTo = new boolean[capacity]; + } + + // call extractTo() before getting actual data + private void prepareAccess(int rowId) { +if (!calledExtractTo[rowId]) { + assert (columnAccessor.hasNext()); + columnAccessor.extractTo(columnVector, ROWID); + calledExtractTo[rowId] = true; +} + } + + // + // APIs dealing with nulls + // + + @Override + public void putNotNull(int rowId) { +throw new UnsupportedOperationException(); + } + + @Override + public void putNull(int rowId) { +throw new UnsupportedOperationException(); + } + + @Override + public void putNulls(int rowId, int count) { +throw new UnsupportedOperationException(); + } + + @Override + public void putNotNulls(int rowId, int count) { +throw new UnsupportedOperationException(); + } + + @Override + public boolean isNullAt(int rowId) { +prepareAccess(rowId); +return columnVector.isNullAt(ROWID); + } + 
+ // + // APIs dealing with Booleans + // + + @Override + public void putBoolean(int rowId, boolean value) { +throw new UnsupportedOperationException(); + } + + @Override + public void putBooleans(int rowId, int count, boolean value) { +throw new UnsupportedOperationException(); + } + + @Override + public boolean getBoolean(int rowId) { --- End diff -- I see. I will add code to track access order for each getter. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18654 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79704/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18654 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18641: [SPARK-21413][SQL] Fix 64KB JVM bytecode limit pr...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18641#discussion_r127982825 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala --- @@ -273,12 +274,26 @@ case class CaseWhenCodegen( val cases = branches.map { case (condExpr, valueExpr) => val cond = condExpr.genCode(ctx) val res = valueExpr.genCode(ctx) + val (condFunc, condIsNull, condValue, resFunc, resIsNull, resValue ) = +if ((cond.code.length + res.code.length) > 1024 && --- End diff -- Ah, got it. You mean that we have to split super deeply-nested if-then-else statements into multiple methods, too. I will work for that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
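As an illustration of the splitting idea being discussed, and not the actual Spark codegen API, the toy generator below emits each branch body as its own helper method and keeps only small calls inside the nested conditional, which is the general way to keep any one generated method under the 64KB bytecode limit; all names here are hypothetical:

```scala
object SplitNestedBranches {
  // branches: (conditionCode, valueCode) pairs; returns (helper method declarations, dispatch expression).
  def genCaseWhen(branches: Seq[(String, String)], elseValue: String): (String, String) = {
    // One small helper method per branch body, so no single method grows too large.
    val helpers = branches.zipWithIndex.map { case ((_, value), i) =>
      s"private double branch_$i() { return $value; }"
    }
    // The nested conditional now only contains cheap calls to the helpers.
    val dispatch = branches.zipWithIndex.foldRight(elseValue) { (branchWithIdx, rest) =>
      val ((cond, _), i) = branchWithIdx
      s"($cond ? branch_$i() : $rest)"
    }
    (helpers.mkString("\n"), dispatch)
  }

  def main(args: Array[String]): Unit = {
    val (decls, expr) = genCaseWhen(Seq("x > 0" -> "x * 2.0", "x < -10" -> "0.0"), "x")
    println(decls)
    println(s"double result = $expr;")
  }
}
```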
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18654 **[Test build #79704 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79704/testReport)** for PR 18654 at commit [`d118d68`](https://github.com/apache/spark/commit/d118d685374242599a12d6536675ba7aeae4bfb7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18670: [SPARK-21455][CORE]RpcFailure should be call on RpcRespo...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18670 ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18670: [SPARK-21455][CORE]RpcFailure should be call on RpcRespo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18670 **[Test build #79706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79706/testReport)** for PR 18670 at commit [`962b605`](https://github.com/apache/spark/commit/962b6059bcc9f5b54a4e01351993982ef7bab9f1). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18670: [SPARK-21455][CORE]RpcFailure should be call on R...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18670#discussion_r127988207 --- Diff: core/src/test/scala/org/apache/spark/rpc/RpcEnvSuite.scala --- @@ -624,7 +624,9 @@ abstract class RpcEnvSuite extends SparkFunSuite with BeforeAndAfterAll { val e = intercept[SparkException] { ThreadUtils.awaitResult(f, 1 seconds) } - assert(e.getCause.isInstanceOf[NotSerializableException]) + assert(e.getCause.isInstanceOf[RuntimeException]) --- End diff -- why the exception type changed? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18654 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18654 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79700/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r127934107 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -598,8 +598,23 @@ class LogisticRegression @Since("1.2.0") ( val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) val bcFeaturesStd = instances.context.broadcast(featuresStd) -val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), - $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial, +val getAggregatorFunc = new LogisticAggregator(bcFeaturesStd, numClasses, $(fitIntercept), + multinomial = isMultinomial)(_) +val getFeaturesStd = (j: Int) => if (j >= 0 && j < numCoefficientSets * numFeatures) { + featuresStd(j / numCoefficientSets) +} else { + 0.0 +} + +val regularization = if (regParamL2 != 0.0) { + val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets --- End diff -- The intercepts are appended to the coefficient vectors, so the `idx` for intercept will be `>= numFeatures * numCoefficientSets`. Hence this function ignores intercept reg. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
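A small standalone sketch of how the quoted snippet lays out the flattened coefficients (this layout is my reading of the code above, an assumption rather than a verified description of the ML internals): there are `numCoefficientSets` entries per feature, and the intercepts, when fit, occupy the trailing indices `>= numFeatures * numCoefficientSets`, which `shouldApply` filters out of regularization:

```scala
object CoefficientLayout {
  def main(args: Array[String]): Unit = {
    val numFeatures = 3
    val numCoefficientSets = 2           // number of classes for a multinomial model
    val fitIntercept = true
    val totalLen = numFeatures * numCoefficientSets + (if (fitIntercept) numCoefficientSets else 0)

    // Mirrors the predicates in the quoted diff.
    val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets
    val featuresStd = Array(0.5, 1.0, 2.0)
    val getFeaturesStd = (j: Int) =>
      if (j >= 0 && j < numCoefficientSets * numFeatures) featuresStd(j / numCoefficientSets) else 0.0

    (0 until totalLen).foreach { idx =>
      val kind = if (shouldApply(idx)) s"feature ${idx / numCoefficientSets}" else "intercept"
      println(s"idx=$idx -> $kind, std=${getFeaturesStd(idx)}")
    }
    // Indices 0..5 map to features and are regularized; indices 6 and 7 are intercepts and are skipped.
  }
}
```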
[GitHub] spark issue #18305: [SPARK-20988][ML] Logistic regression uses aggregator hi...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18305 @sethah IMO we should back out the test-related bc var explicit destroy code as it complicates things. I hear that this _may_ help catch bugs... but frankly I'm not convinced. Because the code setup & path in the source may not be quite the same as in the tests (almost never I'd say), I don't believe you will necessarily catch bugs such as the one mentioned by Yanbo. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18632: [SPARK-21412][SQL] Reset BufferHolder while initialize a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18632 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79702/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18632: [SPARK-21412][SQL] Reset BufferHolder while initialize a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18632 **[Test build #79702 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79702/testReport)** for PR 18632 at commit [`a098540`](https://github.com/apache/spark/commit/a0985404363f2975bf673e37306d0bd1c700a4d0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18632: [SPARK-21412][SQL] Reset BufferHolder while initialize a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18632 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org