[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16386 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73030/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16386 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16386 **[Test build #73030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73030/testReport)** for PR 16386 at commit [`58118f2`](https://github.com/apache/spark/commit/58118f2762a3abfb5ca82c156eabb9622844b764). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16968: [SPARK-19337] [ML] [Dcoc] Documentation and examp...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16968#discussion_r101683048 --- Diff: docs/ml-classification-regression.md --- @@ -363,6 +363,51 @@ Refer to the [R API docs](api/R/spark.mlp.html) for more details. +## Linear Support Vector Machine + +A [support vector machine](https://en.wikipedia.org/wiki/Support_vector_machine) constructs a hyperplane +or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, +regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has +the largest distance to the nearest training-data point of any class (so-called functional margin), +since in general the larger the margin the lower the generalization error of the classifier. LinearSVC +in Spark ML supports binary calssification with linear SVM. Internally, it optimizes the --- End diff -- calssification -> classification --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16968: [SPARK-19337] [ML] [Dcoc] Documentation and examp...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16968#discussion_r101683098 --- Diff: docs/ml-classification-regression.md --- @@ -363,6 +363,51 @@ Refer to the [R API docs](api/R/spark.mlp.html) for more details. +## Linear Support Vector Machine + +A [support vector machine](https://en.wikipedia.org/wiki/Support_vector_machine) constructs a hyperplane +or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, +regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has +the largest distance to the nearest training-data point of any class (so-called functional margin), --- End diff -- "largest distance" -> "longest distance"? I think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinea...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16969#discussion_r101683604 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -564,6 +566,26 @@ model <- spark.logit(df, Species ~ ., regParam = 0.056) summary(model) ``` + Linear Support Vector Machine (SVM) Classifier + +[Linear Support Vector Machine (SVM)](https://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM) classifier is an SVM classifier with linear kernel. +This is a binary classifier. Multi-class classification can be achieved by one-vs-the-rest strategy. We use a simple example to show how to use `spark.svmLinear` +for binary classification. + +```{r} +# load training data and create a DataFrame +t <- as.data.frame(Titanic) +training <- createDataFrame(t) +# fit a Linear SVM classifier model +model <- spark.svmLinear(training, Survived ~ ., regParam = 0.01) --- End diff -- should it go with `maxIter = 10` here too? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinea...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16969#discussion_r101683480 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -564,6 +566,26 @@ model <- spark.logit(df, Species ~ ., regParam = 0.056) summary(model) ``` + Linear Support Vector Machine (SVM) Classifier + +[Linear Support Vector Machine (SVM)](https://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM) classifier is an SVM classifier with linear kernel. +This is a binary classifier. Multi-class classification can be achieved by one-vs-the-rest strategy. We use a simple example to show how to use `spark.svmLinear` --- End diff -- we actually don't have support for `one-vs-the-rest strategy` in R at the moment (existing JIRA or design still open), so perhaps it's best we don't reference that here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinea...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16969#discussion_r101683657 --- Diff: examples/src/main/r/ml/svmLinear.R --- @@ -0,0 +1,41 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# To run this example use +# ./bin/spark-submit examples/src/main/r/ml/svmLinear.R + +# Load SparkR library into your R session +library(SparkR) + +# Initialize SparkSession +sparkR.session(appName = "SparkR-ML-svmLinear-example") + +# $example on$ +# load training data +t <- as.data.frame(Titanic) +training <- createDataFrame(t) + +# fit linearSvc model --- End diff -- `linearSvc` -> `svmLinear`? or `Linear SVM`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinea...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16969#discussion_r101683410 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -564,6 +566,26 @@ model <- spark.logit(df, Species ~ ., regParam = 0.056) summary(model) ``` + Linear Support Vector Machine (SVM) Classifier + +[Linear Support Vector Machine (SVM)](https://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM) classifier is an SVM classifier with linear kernel. --- End diff -- `linear kernels`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinea...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16969#discussion_r101683369 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -471,6 +471,8 @@ SparkR supports the following machine learning models and algorithms. * Logistic Regression +* Linear Support Vector Machine (SVM) Classifier --- End diff -- shouldn't `Linear` go before `Logistic`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16386 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73029/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16386 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16386 **[Test build #73029 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73029/testReport)** for PR 16386 at commit [`e323317`](https://github.com/apache/spark/commit/e32331794a63e8fcfe60a0901672e69dcfd6fe15). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinear examp...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16969 are we merging this after #16968? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16972: [SPARK-19556][CORE][WIP] Broadcast data is not en...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/16972#discussion_r101682505 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala --- @@ -73,17 +81,52 @@ private[spark] class DiskStore(conf: SparkConf, diskManager: DiskBlockManager) e } def putBytes(blockId: BlockId, bytes: ChunkedByteBuffer): Unit = { +val bytesToStore = if (serializerManager.encryptionEnabled) { + try { +val data = bytes.toByteBuffer +val in = new ByteBufferInputStream(data, true) +val byteBufOut = new ByteBufferOutputStream(data.remaining()) +val out = CryptoStreamUtils.createCryptoOutputStream(byteBufOut, conf, + serializerManager.encryptionKey.get) +try { + ByteStreams.copy(in, out) +} finally { + in.close() + out.close() +} +new ChunkedByteBuffer(byteBufOut.toByteBuffer) + } finally { +bytes.dispose() + } +} else { + bytes +} + put(blockId) { fileOutputStream => val channel = fileOutputStream.getChannel Utils.tryWithSafeFinally { -bytes.writeFully(channel) +bytesToStore.writeFully(channel) } { channel.close() } } } def getBytes(blockId: BlockId): ChunkedByteBuffer = { +val bytes = readBytes(blockId) + +val in = serializerManager.wrapForEncryption(bytes.toInputStream(dispose = true)) +new ChunkedByteBuffer(ByteBuffer.wrap(IOUtils.toByteArray(in))) + } + + def getBytesAsValues[T](blockId: BlockId, classTag: ClassTag[T]): Iterator[T] = { +val bytes = readBytes(blockId) + +serializerManager + .dataDeserializeStream(blockId, bytes.toInputStream(dispose = true))(classTag) + } + + private[storage] def readBytes(blockId: BlockId): ChunkedByteBuffer = { --- End diff -- abstract it for unit test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16972: [SPARK-19556][CORE][WIP] Broadcast data is not en...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/16972#discussion_r101682451 --- Diff: core/src/test/scala/org/apache/spark/storage/DiskStoreSuite.scala --- @@ -39,27 +40,27 @@ class DiskStoreSuite extends SparkFunSuite { val blockId = BlockId("rdd_1_2") val diskBlockManager = new DiskBlockManager(new SparkConf(), deleteFilesOnStop = true) -val diskStoreMapped = new DiskStore(new SparkConf().set(confKey, "0"), diskBlockManager) +val conf = new SparkConf() +val serializer = new KryoSerializer(conf) +val serializerManager = new SerializerManager(serializer, conf) + +conf.set(confKey, "0") +val diskStoreMapped = new DiskStore(conf, serializerManager, diskBlockManager) diskStoreMapped.putBytes(blockId, byteBuffer) -val mapped = diskStoreMapped.getBytes(blockId) +val mapped = diskStoreMapped.readBytes(blockId) assert(diskStoreMapped.remove(blockId)) -val diskStoreNotMapped = new DiskStore(new SparkConf().set(confKey, "1m"), diskBlockManager) +conf.set(confKey, "1m") +val diskStoreNotMapped = new DiskStore(conf, serializerManager, diskBlockManager) diskStoreNotMapped.putBytes(blockId, byteBuffer) -val notMapped = diskStoreNotMapped.getBytes(blockId) +val notMapped = diskStoreNotMapped.readBytes(blockId) // Not possible to do isInstanceOf due to visibility of HeapByteBuffer assert(notMapped.getChunks().forall(_.getClass.getName.endsWith("HeapByteBuffer")), "Expected HeapByteBuffer for un-mapped read") assert(mapped.getChunks().forall(_.isInstanceOf[MappedByteBuffer]), "Expected MappedByteBuffer for mapped read") -def arrayFromByteBuffer(in: ByteBuffer): Array[Byte] = { - val array = new Array[Byte](in.remaining()) --- End diff -- remove unused --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16972: [SPARK-19556][CORE][WIP] Broadcast data is not en...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/16972#discussion_r101682663 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala --- @@ -21,17 +21,25 @@ import java.io.{FileOutputStream, IOException, RandomAccessFile} import java.nio.ByteBuffer import java.nio.channels.FileChannel.MapMode -import com.google.common.io.Closeables +import scala.reflect.ClassTag + +import com.google.common.io.{ByteStreams, Closeables} +import org.apache.commons.io.IOUtils -import org.apache.spark.SparkConf import org.apache.spark.internal.Logging -import org.apache.spark.util.Utils +import org.apache.spark.SparkConf +import org.apache.spark.security.CryptoStreamUtils +import org.apache.spark.serializer.SerializerManager +import org.apache.spark.util.{ByteBufferInputStream, ByteBufferOutputStream, Utils} import org.apache.spark.util.io.ChunkedByteBuffer /** * Stores BlockManager blocks on disk. */ -private[spark] class DiskStore(conf: SparkConf, diskManager: DiskBlockManager) extends Logging { +private[spark] class DiskStore( +conf: SparkConf, +serializerManager: SerializerManager, --- End diff -- add `serializerManager ` to do decryption work in `DiskStore` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16972: [SPARK-19556][CORE][WIP] Broadcast data is not en...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/16972#discussion_r101682730 --- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala --- @@ -344,7 +370,7 @@ private[spark] class MemoryStore( val serializationStream: SerializationStream = { val autoPick = !blockId.isInstanceOf[StreamBlockId] val ser = serializerManager.getSerializer(classTag, autoPick).newInstance() - ser.serializeStream(serializerManager.wrapStream(blockId, redirectableStream)) + ser.serializeStream(serializerManager.wrapForCompression(blockId, redirectableStream)) } --- End diff -- `MemoryStore` will not do encryption work --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16972: [SPARK-19556][CORE][WIP] Broadcast data is not encrypted...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/16972 @vanzin I will add some unit test in future. But could you please review this first? I think I may be missing something. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16972: [SPARK-19556][CORE][WIP] Broadcast data is not encrypted...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16972 **[Test build #73036 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73036/testReport)** for PR 16972 at commit [`f9a91d6`](https://github.com/apache/spark/commit/f9a91d63af3191b853ef88bd48293bcc19f3ec4c). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16972: [SPARK-19556][CORE][WIP] Broadcast data is not en...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/16972 [SPARK-19556][CORE][WIP] Broadcast data is not encrypted when I/O encryption is on ## What changes were proposed in this pull request? `TorrentBroadcast` uses a couple of "back doors" into the block manager to write and read data. The thing these block manager methods have in common is that they bypass the encryption code; so broadcast data is stored unencrypted in the block manager, causing unencrypted data to be written to disk if those blocks need to be evicted from memory. The correct fix here is actually not to change `TorrentBroadcast`, but to fix the block manager so that: - data stored in memory is not encrypted - data written to disk is encrypted This would simplify the code paths that use BlockManager / SerializerManager APIs. ## How was this patch tested? update and add unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uncleGen/spark SPARK-19556 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16972.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16972 commit 63d909b4a0cc108fd8756436b21c65614abb9466 Author: uncleGenDate: 2017-02-15T14:12:47Z cp commit f9a91d63af3191b853ef88bd48293bcc19f3ec4c Author: uncleGen Date: 2017-02-17T03:54:44Z refactor blockmanager: data stored in memory is not encrypted, data written to disk is encrypted --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101682138 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1018,7 +1025,9 @@ private[spark] class BlockManager( try { replicate(blockId, bytesToReplicate, level, remoteClassTag) } finally { -bytesToReplicate.dispose() +if (!level.useOffHeap) { --- End diff -- As `StorageUtils.dispose` only cleans up a memory-mapped `ByteBuffer`, I don't think calling `bytesToReplicate.dispose()` here would be a problem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101680844 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -813,7 +813,14 @@ private[spark] class BlockManager( false } } else { - memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes) + val memoryMode = level.memoryMode + memoryStore.putBytes(blockId, size, memoryMode, () => { +if (memoryMode == MemoryMode.OFF_HEAP) { --- End diff -- I did a search. Looks like the buffers passed to `putBytes` (the only caller of private `doPutBytes`) across Spark are all duplicated. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16923: [SPARK-19038][Hive][YARN] Correctly figure out ke...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/16923#discussion_r101680633 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -106,21 +106,33 @@ private[hive] class HiveClientImpl( // Set up kerberos credentials for UserGroupInformation.loginUser within // current class loader -// Instead of using the spark conf of the current spark context, a new -// instance of SparkConf is needed for the original value of spark.yarn.keytab -// and spark.yarn.principal set in SparkSubmit, as yarn.Client resets the -// keytab configuration for the link name in distributed cache if (sparkConf.contains("spark.yarn.principal") && sparkConf.contains("spark.yarn.keytab")) { val principalName = sparkConf.get("spark.yarn.principal") - val keytabFileName = sparkConf.get("spark.yarn.keytab") - if (!new File(keytabFileName).exists()) { -throw new SparkException(s"Keytab file: ${keytabFileName}" + - " specified in spark.yarn.keytab does not exist") - } else { -logInfo("Attempting to login to Kerberos" + - s" using principal: ${principalName} and keytab: ${keytabFileName}") -UserGroupInformation.loginUserFromKeytab(principalName, keytabFileName) + val keytabFileName = { +val keytab = sparkConf.get("spark.yarn.keytab") +if (new File(keytab).exists()) { + keytab +} else { + // Instead of using the spark conf of the current spark context, a new + // instance of SparkConf is needed for the original value of spark.yarn.keytab + // set in SparkSubmit, as yarn.Client resets the keytab configuration for the link name + // in distributed cache, and this will make Spark driver fail to get correct keytab + // path in yarn client mode. + val originKeytab = new SparkConf().get("spark.yarn.keytab") + require(originKeytab != null, +"spark.yarn.keytab is not configured, this is unexpected") + if (new File(originKeytab).exists()) { +originKeytab + } else { +throw new SparkException(s"Keytab file: $originKeytab " + + s"specified in spark.yarn.keytab does not exist") + } +} } + + logInfo("Attempting to login to Kerberos" + +s" using principal: ${principalName} and keytab: ${keytabFileName}") + UserGroupInformation.loginUserFromKeytab(principalName, keytabFileName) --- End diff -- Perhaps the reason is described in [here](https://github.com/apache/spark/pull/9272). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16923: [SPARK-19038][Hive][YARN] Correctly figure out ke...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/16923#discussion_r101680479 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -106,21 +106,33 @@ private[hive] class HiveClientImpl( // Set up kerberos credentials for UserGroupInformation.loginUser within // current class loader -// Instead of using the spark conf of the current spark context, a new -// instance of SparkConf is needed for the original value of spark.yarn.keytab -// and spark.yarn.principal set in SparkSubmit, as yarn.Client resets the -// keytab configuration for the link name in distributed cache if (sparkConf.contains("spark.yarn.principal") && sparkConf.contains("spark.yarn.keytab")) { val principalName = sparkConf.get("spark.yarn.principal") - val keytabFileName = sparkConf.get("spark.yarn.keytab") - if (!new File(keytabFileName).exists()) { -throw new SparkException(s"Keytab file: ${keytabFileName}" + - " specified in spark.yarn.keytab does not exist") - } else { -logInfo("Attempting to login to Kerberos" + - s" using principal: ${principalName} and keytab: ${keytabFileName}") -UserGroupInformation.loginUserFromKeytab(principalName, keytabFileName) + val keytabFileName = { +val keytab = sparkConf.get("spark.yarn.keytab") +if (new File(keytab).exists()) { + keytab +} else { + // Instead of using the spark conf of the current spark context, a new + // instance of SparkConf is needed for the original value of spark.yarn.keytab + // set in SparkSubmit, as yarn.Client resets the keytab configuration for the link name + // in distributed cache, and this will make Spark driver fail to get correct keytab + // path in yarn client mode. + val originKeytab = new SparkConf().get("spark.yarn.keytab") + require(originKeytab != null, +"spark.yarn.keytab is not configured, this is unexpected") + if (new File(originKeytab).exists()) { +originKeytab + } else { +throw new SparkException(s"Keytab file: $originKeytab " + + s"specified in spark.yarn.keytab does not exist") + } +} } + + logInfo("Attempting to login to Kerberos" + +s" using principal: ${principalName} and keytab: ${keytabFileName}") + UserGroupInformation.loginUserFromKeytab(principalName, keytabFileName) --- End diff -- Not clearly sure why here login from keytab is required. It is the behavior required by Hive? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16785 **[Test build #73035 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73035/testReport)** for PR 16785 at commit [`278c31c`](https://github.com/apache/spark/commit/278c31cf8aa27c71e0f5178bebcb426ec5fba6ce). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16970 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73028/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16970 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16970 **[Test build #73028 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73028/testReport)** for PR 16970 at commit [`63a7f4c`](https://github.com/apache/spark/commit/63a7f4c62b2da32351d008f9719d513e14562e56). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class Deduplication(` * `trait WatermarkSupport extends SparkPlan ` * `case class DeduplicationExec(` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16971 **[Test build #73033 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73033/testReport)** for PR 16971 at commit [`d5e79a8`](https://github.com/apache/spark/commit/d5e79a809b1edd91a7e0c1d8046bb8bfec2ba4c9). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16785 **[Test build #73034 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73034/testReport)** for PR 16785 at commit [`4ba93fe`](https://github.com/apache/spark/commit/4ba93fecfcbdeebecc9526a90b5800c98b3f35a7). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16785 > this looks like a very big hammer to solve this problem. Can't we try a different approach? I think we should try to avoid optimizing already optimized code snippets, you might be able to do this using some kind of a fence. It would even be better if we would have a recursive node. @cloud-fan @hvanhovell Ok. I've figured out to add a check to reduce the candidates of aliased constraints. It can achieve same speed-up (cut of half running time in benchmark) without parallel collection hammer. Can you have time to look at it? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16851: [SPARK-19508][Core] Improve error message when binding s...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16851 gentle ping @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/16971 [SPARK-19573][SQL] Make NaN/null handling consistent in approxQuantile ## What changes were proposed in this pull request? update `StatFunctions.multipleApproxQuantiles` to handle NaN/null (Please fill in changes proposed in this fix) ## How was this patch tested? existing tests and added tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhengruifeng/spark quantiles_nan Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16971.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16971 commit d5e79a809b1edd91a7e0c1d8046bb8bfec2ba4c9 Author: Zheng RuiFengDate: 2017-02-17T03:18:43Z create pr --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16923: [SPARK-19038][Hive][YARN] Correctly figure out ke...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/16923#discussion_r101678941 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -106,21 +106,33 @@ private[hive] class HiveClientImpl( // Set up kerberos credentials for UserGroupInformation.loginUser within // current class loader -// Instead of using the spark conf of the current spark context, a new -// instance of SparkConf is needed for the original value of spark.yarn.keytab -// and spark.yarn.principal set in SparkSubmit, as yarn.Client resets the -// keytab configuration for the link name in distributed cache if (sparkConf.contains("spark.yarn.principal") && sparkConf.contains("spark.yarn.keytab")) { val principalName = sparkConf.get("spark.yarn.principal") - val keytabFileName = sparkConf.get("spark.yarn.keytab") - if (!new File(keytabFileName).exists()) { -throw new SparkException(s"Keytab file: ${keytabFileName}" + - " specified in spark.yarn.keytab does not exist") - } else { -logInfo("Attempting to login to Kerberos" + - s" using principal: ${principalName} and keytab: ${keytabFileName}") -UserGroupInformation.loginUserFromKeytab(principalName, keytabFileName) + val keytabFileName = { +val keytab = sparkConf.get("spark.yarn.keytab") +if (new File(keytab).exists()) { + keytab +} else { + // Instead of using the spark conf of the current spark context, a new + // instance of SparkConf is needed for the original value of spark.yarn.keytab + // set in SparkSubmit, as yarn.Client resets the keytab configuration for the link name + // in distributed cache, and this will make Spark driver fail to get correct keytab + // path in yarn client mode. + val originKeytab = new SparkConf().get("spark.yarn.keytab") + require(originKeytab != null, +"spark.yarn.keytab is not configured, this is unexpected") + if (new File(originKeytab).exists()) { +originKeytab + } else { +throw new SparkException(s"Keytab file: $originKeytab " + + s"specified in spark.yarn.keytab does not exist") + } +} } + + logInfo("Attempting to login to Kerberos" + +s" using principal: ${principalName} and keytab: ${keytabFileName}") + UserGroupInformation.loginUserFromKeytab(principalName, keytabFileName) --- End diff -- I probably should look at the history of this file, but I'm a little puzzled about why is this login necessary. `SparkSubmit.scala` already logs in the user with the provided keytab, before YARN's `Client.scala` has a chance to mess with it. So it seems to me like this code is redundant? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16951: [SPARK-18285][SPARKR] SparkR approxQuantile suppo...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16951#discussion_r101676637 --- Diff: R/pkg/R/stats.R --- @@ -149,15 +149,18 @@ setMethod("freqItems", signature(x = "SparkDataFrame", cols = "character"), #' This method implements a variation of the Greenwald-Khanna algorithm (with some speed #' optimizations). The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670 #' Space-efficient Online Computation of Quantile Summaries]] by Greenwald and Khanna. +#' Note that rows containing any NA values will be removed before calculation. #' #' @param x A SparkDataFrame. -#' @param col The name of the numerical column. +#' @param cols The names of the numerical columns. --- End diff -- Done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16951: [SPARK-18285][SPARKR] SparkR approxQuantile suppo...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16951#discussion_r101676617 --- Diff: R/pkg/R/stats.R --- @@ -149,15 +149,18 @@ setMethod("freqItems", signature(x = "SparkDataFrame", cols = "character"), #' This method implements a variation of the Greenwald-Khanna algorithm (with some speed #' optimizations). The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670 #' Space-efficient Online Computation of Quantile Summaries]] by Greenwald and Khanna. +#' Note that rows containing any NA values will be removed before calculation. --- End diff -- Added test cases. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16949: [SPARK-16122][CORE] Add rest api for job environment
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/16949 Sure, this PR is fine, I'd just prefer some minor API adjustments to bring it closer to the code I linked above. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16951: [SPARK-18285][SPARKR] SparkR approxQuantile supports inp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16951 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16951: [SPARK-18285][SPARKR] SparkR approxQuantile supports inp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16951 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73031/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16951: [SPARK-18285][SPARKR] SparkR approxQuantile supports inp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16951 **[Test build #73031 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73031/testReport)** for PR 16951 at commit [`27b2cd6`](https://github.com/apache/spark/commit/27b2cd6aa3b27e44792cda4a23123b8ad3ef58aa). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101675669 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1018,7 +1025,9 @@ private[spark] class BlockManager( try { replicate(blockId, bytesToReplicate, level, remoteClassTag) } finally { -bytesToReplicate.dispose() +if (!level.useOffHeap) { --- End diff -- So maybe use `putBlockStatus.storageLevel` instead? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101675576 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -813,7 +813,14 @@ private[spark] class BlockManager( false } } else { - memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes) + val memoryMode = level.memoryMode + memoryStore.putBytes(blockId, size, memoryMode, () => { +if (memoryMode == MemoryMode.OFF_HEAP) { --- End diff -- Is it safe to store a ref to `bytes` if the memory is stored off-heap? If the caller changes the values in that memory or frees it, the buffer we put in the memory store will be affected. We don't want that kind of side-effect. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16964: [SPARK-19534][TESTS] Convert Java tests to use lambdas, ...
Github user zzcclp commented on the issue: https://github.com/apache/spark/pull/16964 @srowen after update to master, in Eclipse IDE, there is an error in JavaConsumerStrategySuite.java line 52: `final Mapoffsets = new HashMap<>(); offsets.put(tp1, 23L); final scala.collection.Map sOffsets = JavaConverters.mapAsScalaMapConverter(offsets).asScala().mapValues( new scala.runtime.AbstractFunction1 () { @Override public Object apply(Long x) { return (Object) x; } } );` Error is as follows: **The method mapValues(Function1 ) is ambiguous for the type Map ** Is it related to updating Java 7 to 8? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user sameeragarwal commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101664624 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -813,7 +813,14 @@ private[spark] class BlockManager( false } } else { - memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes) + val memoryMode = level.memoryMode + memoryStore.putBytes(blockId, size, memoryMode, () => { +if (memoryMode == MemoryMode.OFF_HEAP) { --- End diff -- This condition can just check for `level.useOffHeap` right? But more generally, I had the same question that @viirya has. Is there a way to check if `bytes` is already off-heap and avoid the defensive copy? Or will that never be the case? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user sameeragarwal commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101674560 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1018,7 +1025,9 @@ private[spark] class BlockManager( try { replicate(blockId, bytesToReplicate, level, remoteClassTag) } finally { -bytesToReplicate.dispose() +if (!level.useOffHeap) { --- End diff -- Similarly, aren't we deciding whether we should dispose `bytesToReplicate` based on the 'target' `StorageLevel`? Shouldn't we make that decision based on whether `bytesToReplicate` were stored off-heap or not? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16826 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16826 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73027/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16790: [SPARK-19450] Replace askWithRetry with askSync.
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/16790 https://github.com/apache/spark/pull/16690#discussion_r101616883 causes the build to produce lots of deprecation warnings. @srowen @vanzin How do you think about this ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16826 **[Test build #73027 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73027/testReport)** for PR 16826 at commit [`e2bbfa8`](https://github.com/apache/spark/commit/e2bbfa8c81f91c57f5628e771f42d414a1031d57). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16690: [SPARK-19347] ReceiverSupervisorImpl can add block to Re...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/16690 @srowen How do you think about https://github.com/apache/spark/pull/16790? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16476: [SPARK-19084][SQL] Implement expression field
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/16476#discussion_r101673768 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala --- @@ -340,3 +341,91 @@ object CaseKeyWhen { CaseWhen(cases, elseValue) } } + +/** + * A function that returns the index of str in (str1, str2, ...) list or 0 if not found. + * It takes at least 2 parameters, and all parameters' types should be subtypes of AtomicType. + */ +@ExpressionDescription( + usage = "_FUNC_(str, str1, str2, ...) - Returns the index of str in the str1,str2,... or 0 if not found.", + extended = """ +Examples: + > SELECT _FUNC_(10, 9, 3, 10, 4); + 3 + """) +case class Field(children: Seq[Expression]) extends Expression { + + override def nullable: Boolean = false + override def foldable: Boolean = children.forall(_.foldable) + + private lazy val ordering = TypeUtils.getInterpretedOrdering(children(0).dataType) + + override def checkInputDataTypes(): TypeCheckResult = { +if (children.length <= 1) { + TypeCheckResult.TypeCheckFailure(s"FIELD requires at least 2 arguments") +} else if (!children.forall(_.dataType.isInstanceOf[AtomicType])) { + TypeCheckResult.TypeCheckFailure(s"FIELD requires all arguments to be of AtomicType") +} else + TypeCheckResult.TypeCheckSuccess + } + + override def dataType: DataType = IntegerType + + override def eval(input: InternalRow): Any = { +val target = children.head.eval(input) +val targetDataType = children.head.dataType +def findEqual(target: Any, params: Seq[Expression], index: Int): Int = { + params.toList match { +case Nil => 0 +case head::tail if targetDataType == head.dataType + && head.eval(input) != null && ordering.equiv(target, head.eval(input)) => index +case _ => findEqual(target, params.tail, index + 1) + } +} +if(target == null) + 0 +else + findEqual(target, children.tail, 1) + } + + protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { +val evalChildren = children.map(_.genCode(ctx)) +val target = evalChildren(0) +val targetDataType = children(0).dataType +val rest = evalChildren.drop(1) +val restDataType = children.drop(1).map(_.dataType) + +def updateEval(evalWithIndex: ((ExprCode, DataType), Int)): String = { + val ((eval, dataType), index) = evalWithIndex + s""" +${eval.code} +if (${dataType.equals(targetDataType)} + && ${ctx.genEqual(targetDataType, eval.value, target.value)}) { + ${ev.value} = ${index}; +} + """ +} + +def genIfElseStructure(code1: String, code2: String): String = { --- End diff -- For now, I use `reduceRight`, which I think is a 'special case' function of `foldRight`. If I understand your meaning of floating `else` right(could you please explain it a little bit?), these 2 functions both can't avoid floating `else`, because we need nested `else` in `else` block, like this: `if (xxx) else { if (xxx) else { ... } } `, so if we avoid floating `else` in `genIfElseStructure`, `else` should be in `updateEval`, which will make the code unclear and complicated. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16949: [SPARK-16122][CORE] Add rest api for job environment
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/16949 @vanzin I opened a jira (https://issues.apache.org/jira/browse/SPARK-19642) to research and address the potential security flaws. Do you mind if I continue this pr? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16951: [SPARK-18285][SPARKR] SparkR approxQuantile supports inp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16951 **[Test build #73031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73031/testReport)** for PR 16951 at commit [`27b2cd6`](https://github.com/apache/spark/commit/27b2cd6aa3b27e44792cda4a23123b8ad3ef58aa). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16386 **[Test build #73032 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73032/testReport)** for PR 16386 at commit [`b801ab0`](https://github.com/apache/spark/commit/b801ab096c3bf426c7d2044291785359ace4922f). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16826 Build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16826 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73022/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16826 **[Test build #73022 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73022/testReport)** for PR 16826 at commit [`f423f74`](https://github.com/apache/spark/commit/f423f7481348c021d9a27986064dfbe389c5de77). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16386 **[Test build #73030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73030/testReport)** for PR 16386 at commit [`58118f2`](https://github.com/apache/spark/commit/58118f2762a3abfb5ca82c156eabb9622844b764). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16968: [SPARK-19337] [ML] [Dcoc] Documentation and examples for...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/16968 I see. I will drop the R example here, whichever PR goes in later can finish the document update. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16386 **[Test build #73029 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73029/testReport)** for PR 16386 at commit [`e323317`](https://github.com/apache/spark/commit/e32331794a63e8fcfe60a0901672e69dcfd6fe15). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r101671453 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1764,4 +1769,117 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { val df2 = spark.read.option("PREfersdecimaL", "true").json(records) assert(df2.schema == schema) } + + test("SPARK-18352: Parse normal multi-line JSON files (compressed)") { +withTempPath { dir => + val path = dir.getCanonicalPath + primitiveFieldAndType +.toDF("value") +.write +.option("compression", "GzIp") +.text(path) + + assert(new File(path).listFiles().exists(_.getName.endsWith(".gz"))) + + val jsonDF = spark.read.option("wholeFile", true).json(path) --- End diff -- I rewrote these tests. Please take a look @gatorsmile and @cloud-fan. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 @cloud-fan When implementing tests for the other modes I've uncovered an existing bug in schema inference in `DROPMALFORMED` mode: https://issues.apache.org/jira/browse/SPARK-19641. Since it is not introduced in this set of patches I will open a new pull request once this is one merged. You can inspect the fix here: https://github.com/NathanHowell/spark/commit/e233fd03346a73b3b447fa4c24f3b12c8b2e53ae --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16962: [SPARK-18120 ][SPARK-19557][SQL] Call QueryExecutionList...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16962 **[Test build #73023 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73023/testReport)** for PR 16962 at commit [`d35fac3`](https://github.com/apache/spark/commit/d35fac34247d6d6f2593b5284bbaba906a8dd033). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16962: [SPARK-18120 ][SPARK-19557][SQL] Call QueryExecutionList...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16962 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16962: [SPARK-18120 ][SPARK-19557][SQL] Call QueryExecutionList...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16962 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73023/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16970 **[Test build #73028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73028/testReport)** for PR 16970 at commit [`63a7f4c`](https://github.com/apache/spark/commit/63a7f4c62b2da32351d008f9719d513e14562e56). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/16970 [SPARK-19497][SS]Implement streaming deduplication ## What changes were proposed in this pull request? This PR adds a special streaming deduplication operator to support `dropDuplicates` with `aggregation` and watermark. It reuses the `dropDuplicates` API but creates new logical plan `Deduplication` and new physical plan `DeduplicationExec`. The following cases are supported: - dropDuplicates() (one `dropDuplicates` with any output mode) - withWatermark(...).dropDuplicates().groupBy(...)outputMode("append") - dropDuplicates().groupBy(...)outputMode("update") Not supported cases: - dropDuplicates().dropDuplicates() (multiple `dropDuplicates`s) - groupBy(...).dropDuplicates() (`dropDuplicates` after `aggregation`) - dropDuplicates().groupBy(...)outputMode("complete") ## How was this patch tested? The new unite tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark dedup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16970.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16970 commit 63a7f4c62b2da32351d008f9719d513e14562e56 Author: Shixiong ZhuDate: 2017-02-15T19:01:57Z Implement deduplication --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16938: [SPARK-19583][SQL]CTAS for data source table with a crea...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16938 One more case: 5. `CREATE TABLE` or `CTAS` without the location spec: if the default path exists, should we succeed or fail? After we finishing the TABLE-level DDLs, we also need to do the same things for DATABASE-level DDLs and PARTITION-level DDLs. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 LGTM. @felixcheung are we good to merge? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15125 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73019/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15125 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15125 **[Test build #73019 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73019/testReport)** for PR 15125 at commit [`f2efef6`](https://github.com/apache/spark/commit/f2efef6afe5692a70325bc788892f5b675755f08). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16826 **[Test build #73027 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73027/testReport)** for PR 16826 at commit [`e2bbfa8`](https://github.com/apache/spark/commit/e2bbfa8c81f91c57f5628e771f42d414a1031d57). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16962: [SPARK-18120 ][SPARK-19557][SQL] Call QueryExecutionList...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16962 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16962: [SPARK-18120 ][SPARK-19557][SQL] Call QueryExecut...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16962#discussion_r101666288 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} +import org.apache.spark.sql.catalyst.plans.QueryPlan +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.command.RunnableCommand + +/** + * Saves the results of `query` in to a data source. + * + * Note that this command is different from [[InsertIntoDataSourceCommand]]. This command will call + * `CreatableRelationProvider.createRelation` to write out the data, while + * [[InsertIntoDataSourceCommand]] calls `InsertableRelation.insert`. Ideally these 2 data source + * interfaces should do the same thing, but as we've already published these 2 interfaces and the + * implementations may have different logic, we have to keep these 2 different commands. + */ +case class SaveIntoDataSourceCommand( +query: LogicalPlan, +provider: String, +partitionColumns: Seq[String], +options: Map[String, String], +mode: SaveMode) extends RunnableCommand { + + override protected def innerChildren: Seq[QueryPlan[_]] = Seq(query) + + override def run(sparkSession: SparkSession): Seq[Row] = { +DataSource( + sparkSession, + className = provider, + partitionColumns = partitionColumns, + options = options).write(mode, Dataset.ofRows(sparkSession, query)) + --- End diff -- : ) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101666251 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/PowerIterationClusteringSuite.scala --- @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import scala.collection.mutable + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.util.DefaultReadWriteTest +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession} + +class PowerIterationClusteringSuite extends SparkFunSuite + with MLlibTestSparkContext with DefaultReadWriteTest { + + @transient var data: Dataset[_] = _ + final val r1 = 1.0 + final val n1 = 10 + final val r2 = 4.0 + final val n2 = 40 + + override def beforeAll(): Unit = { +super.beforeAll() + +data = PowerIterationClusteringSuite.generatePICData(spark, r1, r2, n1, n2) + } + + test("default parameters") { +val pic = new PowerIterationClustering() + +assert(pic.getK === 2) +assert(pic.getMaxIter === 20) +assert(pic.getInitMode === "random") +assert(pic.getFeaturesCol === "features") +assert(pic.getPredictionCol === "prediction") +assert(pic.getIdCol === "id") + } + + test("set parameters") { +val pic = new PowerIterationClustering() + .setK(9) + .setMaxIter(33) + .setInitMode("degree") + .setFeaturesCol("test_feature") + .setPredictionCol("test_prediction") + .setIdCol("test_id") + +assert(pic.getK === 9) +assert(pic.getMaxIter === 33) +assert(pic.getInitMode === "degree") +assert(pic.getFeaturesCol === "test_feature") +assert(pic.getPredictionCol === "test_prediction") +assert(pic.getIdCol === "test_id") + } + + test("parameters validation") { +intercept[IllegalArgumentException] { + new PowerIterationClustering().setK(1) +} +intercept[IllegalArgumentException] { + new PowerIterationClustering().setInitMode("no_such_a_mode") +} + } + + test("power iteration clustering") { --- End diff -- can you also add a test with a dataframe that has some extra data in it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16611: [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for arra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16611 Sure, I will rebase and update. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101665899 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType} + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasFeaturesCol with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be > 1. Default: 2. + * @group param + */ + @Since("2.2.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.2.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use normalized sum similarities. Default: random. + */ + @Since("2.2.0") + final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " + +"Supported options: 'random' and 'degree'.", +(value: String) => validateInitMode(value)) + + private[spark] def validateInitMode(initMode: String): Boolean = { +initMode match { + case "random" => true + case "degree" => true + case _ => false +} + } + + /** @group expertGetParam */ + @Since("2.2.0") + def getInitMode: String = $(initMode) + + /** + * Param for the column name for ids returned by [[PowerIterationClustering.transform()]]. + * Default: "id" + * @group param + */ + val idCol = new Param[String](this, "idCol", "column name for ids.") + + /** @group getParam */ + def getIdCol: String = $(idCol) + + /** + * Validates the input schema + * @param schema input schema + */ + protected def validateSchema(schema: StructType): Unit = { +SchemaUtils.checkColumnType(schema, $(idCol), LongType) +SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType) + } +} + +/** + * :: Experimental :: + * Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by + * [[http://www.icml2010.org/papers/387.pdf Lin and Cohen]]. From the abstract: PIC finds a very + * low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise + * similarity matrix of the data. + * + * Note that we implement [[PowerIterationClustering]] as a transformer. The [[transform]] is an + * expensive operation, because it uses PIC algorithm to cluster the whole input dataset. + * + * @see [[http://en.wikipedia.org/wiki/Spectral_clustering Spectral clustering (Wikipedia)]] + */ +@Since("2.2.0") +@Experimental +class PowerIterationClustering private[clustering] ( +@Since("2.2.0") override val uid: String) + extends Transformer with PowerIterationClusteringParams with DefaultParamsWritable { + + setDefault( +k -> 2, +maxIter -> 20, +initMode ->
[GitHub] spark issue #16923: [SPARK-19038][Hive][YARN] Correctly figure out keytab fi...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/16923 @vanzin , would you mind helping to review this PR, thanks a lot. IIUC the issue was introduced in #11510 . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101664268 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType} + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasFeaturesCol with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be > 1. Default: 2. + * @group param + */ + @Since("2.2.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.2.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use normalized sum similarities. Default: random. + */ + @Since("2.2.0") + final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " + +"Supported options: 'random' and 'degree'.", +(value: String) => validateInitMode(value)) + + private[spark] def validateInitMode(initMode: String): Boolean = { +initMode match { + case "random" => true + case "degree" => true + case _ => false +} + } + + /** @group expertGetParam */ + @Since("2.2.0") + def getInitMode: String = $(initMode) + + /** + * Param for the column name for ids returned by [[PowerIterationClustering.transform()]]. + * Default: "id" + * @group param + */ + val idCol = new Param[String](this, "idCol", "column name for ids.") --- End diff -- Instead of making an 'id' column, which does not convey much information, we should follow the example of `K-Means` and call it `prediction`. You already include the trait for that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101663790 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType} + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasFeaturesCol with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be > 1. Default: 2. + * @group param + */ + @Since("2.2.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.2.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use normalized sum similarities. Default: random. + */ + @Since("2.2.0") + final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " + +"Supported options: 'random' and 'degree'.", +(value: String) => validateInitMode(value)) + + private[spark] def validateInitMode(initMode: String): Boolean = { +initMode match { + case "random" => true + case "degree" => true + case _ => false +} + } + + /** @group expertGetParam */ + @Since("2.2.0") + def getInitMode: String = $(initMode) + + /** + * Param for the column name for ids returned by [[PowerIterationClustering.transform()]]. + * Default: "id" + * @group param + */ + val idCol = new Param[String](this, "idCol", "column name for ids.") + + /** @group getParam */ + def getIdCol: String = $(idCol) + + /** + * Validates the input schema + * @param schema input schema + */ + protected def validateSchema(schema: StructType): Unit = { --- End diff -- Instead of just validating the schema, we should validate and transform. You can follow the example in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L92 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinear examp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16969 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinear examp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16969 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73025/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinear examp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16969 **[Test build #73025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73025/testReport)** for PR 16969 at commit [`6e54a4d`](https://github.com/apache/spark/commit/6e54a4d0c2f05cb017b3ce4ce105a723d03a9306). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16826 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16826 **[Test build #73026 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73026/testReport)** for PR 16826 at commit [`2cee190`](https://github.com/apache/spark/commit/2cee190eb6c6902d39f68c25d928fbd5aaa522bc). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16826 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73026/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662332 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType} + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasFeaturesCol with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be > 1. Default: 2. + * @group param + */ + @Since("2.2.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.2.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use normalized sum similarities. Default: random. + */ + @Since("2.2.0") + final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " + +"Supported options: 'random' and 'degree'.", +(value: String) => validateInitMode(value)) + + private[spark] def validateInitMode(initMode: String): Boolean = { +initMode match { + case "random" => true + case "degree" => true + case _ => false +} + } + + /** @group expertGetParam */ + @Since("2.2.0") + def getInitMode: String = $(initMode) + + /** + * Param for the column name for ids returned by [[PowerIterationClustering.transform()]]. + * Default: "id" + * @group param + */ + val idCol = new Param[String](this, "idCol", "column name for ids.") + + /** @group getParam */ + def getIdCol: String = $(idCol) + + /** + * Validates the input schema + * @param schema input schema + */ + protected def validateSchema(schema: StructType): Unit = { +SchemaUtils.checkColumnType(schema, $(idCol), LongType) +SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType) + } +} + +/** + * :: Experimental :: + * Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by + * [[http://www.icml2010.org/papers/387.pdf Lin and Cohen]]. From the abstract: PIC finds a very + * low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise + * similarity matrix of the data. + * + * Note that we implement [[PowerIterationClustering]] as a transformer. The [[transform]] is an + * expensive operation, because it uses PIC algorithm to cluster the whole input dataset. + * + * @see [[http://en.wikipedia.org/wiki/Spectral_clustering Spectral clustering (Wikipedia)]] + */ +@Since("2.2.0") +@Experimental +class PowerIterationClustering private[clustering] ( +@Since("2.2.0") override val uid: String) --- End diff -- indentation --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662298 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType} + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasFeaturesCol with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be > 1. Default: 2. + * @group param + */ + @Since("2.2.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.2.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use normalized sum similarities. Default: random. + */ + @Since("2.2.0") + final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " + +"Supported options: 'random' and 'degree'.", +(value: String) => validateInitMode(value)) + + private[spark] def validateInitMode(initMode: String): Boolean = { --- End diff -- No need with comment above --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662273 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType} + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasFeaturesCol with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be > 1. Default: 2. + * @group param + */ + @Since("2.2.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.2.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use normalized sum similarities. Default: random. + */ + @Since("2.2.0") + final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " + +"Supported options: 'random' and 'degree'.", +(value: String) => validateInitMode(value)) --- End diff -- You do not need to use write a function as you do below after that, it will allow more user-friendly error messages in the future. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662038 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} --- End diff -- same here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662018 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} --- End diff -- no need for brackets --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16826 **[Test build #73026 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73026/testReport)** for PR 16826 at commit [`2cee190`](https://github.com/apache/spark/commit/2cee190eb6c6902d39f68c25d928fbd5aaa522bc). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r101660601 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -0,0 +1,569 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans.logical.statsEstimation + +import java.sql.{Date, Timestamp} + +import scala.collection.immutable.{HashSet, Map} +import scala.collection.mutable + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * @param catalystConf a configuration showing if CBO is enabled + */ +case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Logging { + + /** + * We use a mutable colStats because we need to update the corresponding ColumnStat + * for a column after we apply a predicate condition. For example, column c has + * [min, max] value as [0, 100]. In a range condition such as (c > 40 AND c <= 50), + * we need to set the column's [min, max] value to [40, 100] after we evaluate the + * first condition c > 40. We need to set the column's [min, max] value to [40, 50] + * after we evaluate the second condition c <= 50. + */ + private var mutableColStats: mutable.Map[ExprId, ColumnStat] = mutable.Map.empty + + /** + * Returns an option of Statistics for a Filter logical plan node. + * For a given compound expression condition, this method computes filter selectivity + * (or the percentage of rows meeting the filter condition), which + * is used to compute row count, size in bytes, and the updated statistics after a given + * predicated is applied. + * + * @return Option[Statistics] When there is no statistics collected, it returns None. + */ + def estimate: Option[Statistics] = { +// We first copy child node's statistics and then modify it based on filter selectivity. +val stats: Statistics = plan.child.stats(catalystConf) +if (stats.rowCount.isEmpty) return None + +// save a mutable copy of colStats so that we can later change it recursively +mutableColStats = mutable.Map(stats.attributeStats.map(kv => (kv._1.exprId, kv._2)).toSeq: _*) + +// estimate selectivity of this filter predicate +val filterSelectivity: Double = calculateConditions(plan.condition) + +// attributeStats has mapping Attribute-to-ColumnStat. +// mutableColStats has mapping ExprId-to-ColumnStat. +// We use an ExprId-to-Attribute map to facilitate the mapping Attribute-to-ColumnStat +val expridToAttrMap: Map[ExprId, Attribute] = + stats.attributeStats.map(kv => (kv._1.exprId, kv._1)) +// copy mutableColStats contents to an immutable AttributeMap. +val mutableAttributeStats: mutable.Map[Attribute, ColumnStat] = + mutableColStats.map(kv => expridToAttrMap(kv._1) -> kv._2) +val newColStats = AttributeMap(mutableAttributeStats.toSeq) + +val filteredRowCount: BigInt = + EstimationUtils.ceil(BigDecimal(stats.rowCount.get) * filterSelectivity) +val filteredSizeInBytes: BigInt = EstimationUtils.ceil(BigDecimal( +EstimationUtils.getOutputSize(plan.output, filteredRowCount, newColStats) +)) + +Some(stats.copy(sizeInBytes = filteredSizeInBytes, rowCount = Some(filteredRowCount), + attributeStats = newColStats)) + } + + /** + * Returns a percentage of rows meeting a compound condition in Filter node. + * A compound condition is decomposed into multiple single conditions linked with AND, OR, NOT. + *
[GitHub] spark issue #16967: [MINOR][PYTHON] Fix typo docstring: 'top' -> 'topic'
Github user rolando commented on the issue: https://github.com/apache/spark/pull/16967 I've (rip)grep'ed over all `.py` and `.md` files in the repository searching for ` topi? ` (case insensitive regex) and haven't seen other case of this typo (writing `top` instead of `topic`). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16938: [SPARK-19583][SQL]CTAS for data source table with a crea...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16938 ok let's discuss it case by case: 1. `CREATE TABLE ... LOCATION path` works if path exists, it's expected 2. `CREATE TABLE ... LOCATION path` fails if path doesn't exist, is it expected? 3. `CREATE TABLE ... LOCATION path AS SELECT ...`, shall we fail if path exists? 4. `ALTER TABLE ... SET LOCATION path`, shall we fail if path not exist? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r101659663 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -0,0 +1,569 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans.logical.statsEstimation + +import java.sql.{Date, Timestamp} + +import scala.collection.immutable.{HashSet, Map} +import scala.collection.mutable + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * @param catalystConf a configuration showing if CBO is enabled + */ +case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Logging { + + /** + * We use a mutable colStats because we need to update the corresponding ColumnStat + * for a column after we apply a predicate condition. For example, column c has + * [min, max] value as [0, 100]. In a range condition such as (c > 40 AND c <= 50), + * we need to set the column's [min, max] value to [40, 100] after we evaluate the + * first condition c > 40. We need to set the column's [min, max] value to [40, 50] + * after we evaluate the second condition c <= 50. + */ + private var mutableColStats: mutable.Map[ExprId, ColumnStat] = mutable.Map.empty + + /** + * Returns an option of Statistics for a Filter logical plan node. + * For a given compound expression condition, this method computes filter selectivity + * (or the percentage of rows meeting the filter condition), which + * is used to compute row count, size in bytes, and the updated statistics after a given + * predicated is applied. + * + * @return Option[Statistics] When there is no statistics collected, it returns None. + */ + def estimate: Option[Statistics] = { +// We first copy child node's statistics and then modify it based on filter selectivity. +val stats: Statistics = plan.child.stats(catalystConf) +if (stats.rowCount.isEmpty) return None + +// save a mutable copy of colStats so that we can later change it recursively +mutableColStats = mutable.Map(stats.attributeStats.map(kv => (kv._1.exprId, kv._2)).toSeq: _*) + +// estimate selectivity of this filter predicate +val filterSelectivity: Double = calculateConditions(plan.condition) + +// attributeStats has mapping Attribute-to-ColumnStat. +// mutableColStats has mapping ExprId-to-ColumnStat. +// We use an ExprId-to-Attribute map to facilitate the mapping Attribute-to-ColumnStat +val expridToAttrMap: Map[ExprId, Attribute] = + stats.attributeStats.map(kv => (kv._1.exprId, kv._1)) +// copy mutableColStats contents to an immutable AttributeMap. +val mutableAttributeStats: mutable.Map[Attribute, ColumnStat] = + mutableColStats.map(kv => expridToAttrMap(kv._1) -> kv._2) +val newColStats = AttributeMap(mutableAttributeStats.toSeq) + +val filteredRowCount: BigInt = + EstimationUtils.ceil(BigDecimal(stats.rowCount.get) * filterSelectivity) +val filteredSizeInBytes: BigInt = EstimationUtils.ceil(BigDecimal( +EstimationUtils.getOutputSize(plan.output, filteredRowCount, newColStats) +)) + +Some(stats.copy(sizeInBytes = filteredSizeInBytes, rowCount = Some(filteredRowCount), + attributeStats = newColStats)) + } + + /** + * Returns a percentage of rows meeting a compound condition in Filter node. + * A compound condition is decomposed into multiple single conditions linked with AND, OR, NOT. + * For
[GitHub] spark issue #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinear examp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16969 **[Test build #73025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73025/testReport)** for PR 16969 at commit [`6e54a4d`](https://github.com/apache/spark/commit/6e54a4d0c2f05cb017b3ce4ce105a723d03a9306). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16967: [MINOR][PYTHON] Fix typo docstring: 'top' -> 'topic'
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16967 Sure, it's worth a search for similar instances because sometimes typos spread via copy and paste. Could you make a pass over related code? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org