[GitHub] spark issue #19293: [SPARK-22079][SQL] Serializer in HiveOutputWriter miss l...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19293 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19293: [SPARK-22079][SQL] Serializer in HiveOutputWriter miss l...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19293 **[Test build #82126 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82126/testReport)** for PR 19293 at commit [`45477fb`](https://github.com/apache/spark/commit/45477fbf00558066e3733a34e1d59ce22c192ee2). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19293: [SPARK-22079][SQL] Serializer in HiveOutputWriter miss l...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19293 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82126/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19294 @szhem You are correct, currently it fails in the driver itself. So failures in executor are not seen - since job submission fails. With this pr, the job submission should succeed - but the subsequent execution in sql could fail (since sql uses some of the methods which have not been patched in this PR if I am not wrong - newTaskTempFileAbsPath, newTaskTempFile, etc). A testcase to validate that would clarify things. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19332: [SPARK-22093][TESTS] Fixes `assume` in `UtilsSuite` and ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19332 Merged to master. Thank you @srowen and @dongjoon-hyun. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19332: [SPARK-22093][TESTS] Fixes `assume` in `UtilsSuit...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19332 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19326 That's fine. I believe we don't usually need a JIRA for a trivial change though. Would you mind double checking if there are similar instances in the PySpark doc? Also, it'd be great if we add `Closes #19283` in the PR description so that `#19283` can be automatically closed when this PR is merged. It looks `#19283` has gone inactive. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19277: [SPARK-22058][CORE]the BufferedInputStream will n...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19277 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19307: [SPARK-22087][SPARK-14650][WIP][BUILD][REPL][CORE] Compi...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19307 Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19307: [SPARK-22087][SPARK-14650][WIP][BUILD][REPL][CORE...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19307 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19184 @viirya @jerryshao To take a step back here. This specific issue is applicable to window operations and not to shuffle. In shuffle, you a much larger volume of data written per file vs 4k records per file for window operation. To get to 9k files with shuffle, you are typically processing a TB or more data per shuffle task (unless executor is strapped of memory and spilt large number of files). On other hand, with 4k window size (the default in spark), getting to 9k files is possible within a single task. From what I see, there is actually no functional/performance reason to keep all the files opened, unlike in shuffle. Having said that, there is an additional cost we pay with this approach. With N files, we incur an additional cost of `N * cost(open + read int + close)` Practically though, the impact is much lower when compared to current code (since re-open + read will be disk read friendly). While getting it fixed for all cases would be ideal, the solution for window operation does not transfer to shuffle (and vice versa) due to the difference in the nature of how files are used in both. In case I missed something here, please let me know. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18747: [WIP][SPARK-20822][SQL] Generate code to directly get va...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18747 **[Test build #82127 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82127/testReport)** for PR 18747 at commit [`c476e87`](https://github.com/apache/spark/commit/c476e87c5216db8e0dbaa2d0382b1cc1d44cab07). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest commit sh...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19290 **[Test build #82128 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82128/testReport)** for PR 19290 at commit [`387228d`](https://github.com/apache/spark/commit/387228d9fd5290df4a99026f9446821bce60e779). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest co...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19290#discussion_r140651399 --- Diff: R/pkg/R/column.R --- @@ -238,8 +238,10 @@ setMethod("between", signature(x = "Column"), #' @param x a Column. #' @param dataType a character object describing the target data type. #'See +# nolint start --- End diff -- I just double checked the links. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19290: [WIP][SPARK-22063][R] Upgrades lintr to latest co...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19290#discussion_r140651387 --- Diff: dev/lint-r.R --- @@ -24,10 +24,16 @@ if (! library(SparkR, lib.loc = LOCAL_LIB_LOC, logical.return = TRUE)) { stop("You should install SparkR in a local directory with `R/install-dev.sh`.") } -# Installs lintr from Github in a local directory. -# NOTE: The CRAN's version is too old to adapt to our rules. -if ("lintr" %in% row.names(installed.packages()) == FALSE) { - devtools::install_github("jimhester/lintr@a769c0b") +# Installs lintr from Github in a local directory if lintr is not installed already or +# the existing lintr is not the specified version. +# +# Note that, the CRAN's version is too old to adapt to our rules. Therefore, we try to +# install lintr from the latest commit in Github, rather than a specific tag or release. +# The current latest is jimhester/lintr@5431140 (see SPARK-22063), the dev version, +# '1.0.1.9000'. +if ("lintr" %in% row.names(installed.packages()) == FALSE || +packageVersion("lintr") != "1.0.1.9000") { --- End diff -- I checked this - when downgrading, upgrading and not installed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19290 **[Test build #82129 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82129/testReport)** for PR 19290 at commit [`ee7eb9d`](https://github.com/apache/spark/commit/ee7eb9dd925cf3892ba02ee5c116e3fe52274ebe). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290 Ugh.. it failed to install due to permission issue ... ``` Downloading GitHub repo jimhester/lintr@5431140 from URL https://api.github.com/repos/jimhester/lintr/zipball/5431140 Installing lintr Downloading GitHub repo hadley/xml2@master from URL https://api.github.com/repos/hadley/xml2/zipball/master Installing xml2 '/usr/lib64/R/bin/R' --no-site-file --no-environ --no-save --no-restore \ --quiet CMD INSTALL \ '/tmp/RtmprFnaBu/devtools438c516b7717/r-lib-xml2-5799cd9' \ --library='/usr/lib64/R/library' --install-tests Error: ERROR: no permission to install to directory '/usr/lib64/R/library' Installation failed: Command failed (1) '/usr/lib64/R/bin/R' --no-site-file --no-environ --no-save --no-restore \ --quiet CMD INSTALL \ '/tmp/RtmprFnaBu/devtools438c286c3792/jimhester-lintr-5431140' \ --library='/usr/lib64/R/library' --install-tests Error: ERROR: no permission to install to directory '/usr/lib64/R/library' Installation failed: Command failed (1) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19311: [SPARK-22083][CORE] Release locks in MemoryStore....
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/19311#discussion_r140651513 --- Diff: core/src/test/scala/org/apache/spark/storage/MemoryStoreSuite.scala --- @@ -407,4 +407,119 @@ class MemoryStoreSuite }) assert(memoryStore.getSize(blockId) === 1) } + + test("SPARK-22083: Release all locks in evictBlocksToFreeSpace") { +// Setup a memory store with many blocks cached, and then one request which leads to multiple +// blocks getting evicted. We'll make the eviction throw an exception, and make sure that +// all locks are released. +val ct = implicitly[ClassTag[Array[Byte]]] +def testFailureOnNthDrop(failAfterDroppingNBlocks: Int, readLockAfterDrop: Boolean): Unit = { --- End diff -- Nit: failAfterDroppingNBlocks -> numValidBlocks, readLockAfterDrop -> validBlock ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19311: [SPARK-22083][CORE] Release locks in MemoryStore....
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/19311#discussion_r140651741 --- Diff: core/src/test/scala/org/apache/spark/storage/MemoryStoreSuite.scala --- @@ -407,4 +407,119 @@ class MemoryStoreSuite }) assert(memoryStore.getSize(blockId) === 1) } + + test("SPARK-22083: Release all locks in evictBlocksToFreeSpace") { +// Setup a memory store with many blocks cached, and then one request which leads to multiple +// blocks getting evicted. We'll make the eviction throw an exception, and make sure that +// all locks are released. +val ct = implicitly[ClassTag[Array[Byte]]] +def testFailureOnNthDrop(failAfterDroppingNBlocks: Int, readLockAfterDrop: Boolean): Unit = { + val tc = TaskContext.empty() + val memManager = new StaticMemoryManager(conf, Long.MaxValue, 100, numCores = 1) + val blockInfoManager = new BlockInfoManager + blockInfoManager.registerTask(tc.taskAttemptId) + var droppedSoFar = 0 + val blockEvictionHandler = new BlockEvictionHandler { +var memoryStore: MemoryStore = _ + +override private[storage] def dropFromMemory[T: ClassTag]( +blockId: BlockId, +data: () => Either[Array[T], ChunkedByteBuffer]): StorageLevel = { + if (droppedSoFar < failAfterDroppingNBlocks) { +droppedSoFar += 1 +memoryStore.remove(blockId) +if (readLockAfterDrop) { + // for testing purposes, we act like another thread gets the read lock on the new + // block + StorageLevel.DISK_ONLY +} else { + StorageLevel.NONE +} + } else { +throw new RuntimeException(s"Mock error dropping block $droppedSoFar") + } +} + } + val memoryStore = new MemoryStore(conf, blockInfoManager, serializerManager, memManager, + blockEvictionHandler) { +override def afterDropAction(blockId: BlockId): Unit = { + if (readLockAfterDrop) { +// pretend that we get a read lock on the block (now on disk) in another thread +TaskContext.setTaskContext(tc) +blockInfoManager.lockForReading(blockId) +TaskContext.unset() + } +} + } + + blockEvictionHandler.memoryStore = memoryStore + memManager.setMemoryStore(memoryStore) + + // Put in some small blocks to fill up the memory store + val initialBlocks = (1 to 10).map { id => +val blockId = BlockId(s"rdd_1_$id") +val blockInfo = new BlockInfo(StorageLevel.MEMORY_ONLY, ct, tellMaster = false) +val initialWriteLock = blockInfoManager.lockNewBlockForWriting(blockId, blockInfo) +assert(initialWriteLock) +val success = memoryStore.putBytes(blockId, 10, MemoryMode.ON_HEAP, () => { + new ChunkedByteBuffer(ByteBuffer.allocate(10)) +}) +assert(success) +blockInfoManager.unlock(blockId, None) + } + assert(blockInfoManager.size === 10) + + + // Add one big block, which will require evicting everything in the memorystore. However our + // mock BlockEvictionHandler will throw an exception -- make sure all locks are cleared. + val largeBlockId = BlockId(s"rdd_2_1") + val largeBlockInfo = new BlockInfo(StorageLevel.MEMORY_ONLY, ct, tellMaster = false) + val initialWriteLock = blockInfoManager.lockNewBlockForWriting(largeBlockId, largeBlockInfo) + assert(initialWriteLock) + if (failAfterDroppingNBlocks < 10) { +val exc = intercept[RuntimeException] { + memoryStore.putBytes(largeBlockId, 100, MemoryMode.ON_HEAP, () => { +new ChunkedByteBuffer(ByteBuffer.allocate(100)) + }) +} +assert(exc.getMessage().startsWith("Mock error dropping block"), exc) +// BlockManager.doPut takes care of releasing the lock for the newly written block -- not +// testing that here, so do it manually +blockInfoManager.removeBlock(largeBlockId) + } else { +memoryStore.putBytes(largeBlockId, 100, MemoryMode.ON_HEAP, () => { + new ChunkedByteBuffer(ByteBuffer.allocate(100)) +}) +// BlockManager.doPut takes care of releasing the lock for the newly written block -- not +// testing that here, so do it manually +blockInfoManager.unlock(largeBlockId) + } + + val largeBlockInMemory = if (failAfterDroppingNBlocks == 10) 1 else 0 + val expBlocks = 10 + +(if (re
[GitHub] spark pull request #19311: [SPARK-22083][CORE] Release locks in MemoryStore....
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/19311#discussion_r140651608 --- Diff: core/src/test/scala/org/apache/spark/storage/MemoryStoreSuite.scala --- @@ -407,4 +407,119 @@ class MemoryStoreSuite }) assert(memoryStore.getSize(blockId) === 1) } + + test("SPARK-22083: Release all locks in evictBlocksToFreeSpace") { +// Setup a memory store with many blocks cached, and then one request which leads to multiple +// blocks getting evicted. We'll make the eviction throw an exception, and make sure that +// all locks are released. +val ct = implicitly[ClassTag[Array[Byte]]] +def testFailureOnNthDrop(failAfterDroppingNBlocks: Int, readLockAfterDrop: Boolean): Unit = { + val tc = TaskContext.empty() + val memManager = new StaticMemoryManager(conf, Long.MaxValue, 100, numCores = 1) + val blockInfoManager = new BlockInfoManager + blockInfoManager.registerTask(tc.taskAttemptId) + var droppedSoFar = 0 + val blockEvictionHandler = new BlockEvictionHandler { +var memoryStore: MemoryStore = _ + +override private[storage] def dropFromMemory[T: ClassTag]( +blockId: BlockId, +data: () => Either[Array[T], ChunkedByteBuffer]): StorageLevel = { + if (droppedSoFar < failAfterDroppingNBlocks) { +droppedSoFar += 1 +memoryStore.remove(blockId) +if (readLockAfterDrop) { + // for testing purposes, we act like another thread gets the read lock on the new + // block + StorageLevel.DISK_ONLY +} else { + StorageLevel.NONE +} + } else { +throw new RuntimeException(s"Mock error dropping block $droppedSoFar") + } +} + } + val memoryStore = new MemoryStore(conf, blockInfoManager, serializerManager, memManager, + blockEvictionHandler) { +override def afterDropAction(blockId: BlockId): Unit = { + if (readLockAfterDrop) { +// pretend that we get a read lock on the block (now on disk) in another thread +TaskContext.setTaskContext(tc) +blockInfoManager.lockForReading(blockId) +TaskContext.unset() + } +} + } + + blockEvictionHandler.memoryStore = memoryStore + memManager.setMemoryStore(memoryStore) + + // Put in some small blocks to fill up the memory store + val initialBlocks = (1 to 10).map { id => --- End diff -- To piggy back on @vanzin's comment, sizePerBlock also please (so that 100 goes away) ? Thx --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290 Hi @shaneknapp. I am sorry but it's me again ... Here, this PR tries to upgrade an R package, [lintr](https://github.com/jimhester/lintr) for static Code analysis for R, which is ran via `./dev/lint-r` -> `dev/lint-r.R` as a part of Jenkins PR build. This upgrade catches many meaningful instances that should have caught before in SparkR codes and if I understood correctly, we decided to upgrade this if it does not cause a problem. I tried to fix the `dev/lint-r.R` script so that `lintr` can be upgraded if this package is not installed or not the specific version (1.0.1.9000) after testing this in my local, but looks failed in Jenkins due to a permission issue. I believe the bash command below installs the package correctly: ```R Rscript -e "devtools::install_github('jimhester/lintr@5431140')" ``` Maybe, I should have cc'ed you here first before trying this one. I appologise and I will cc you in such cases in the future. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19294: [SPARK-21549][CORE] Respect OutputFormats with no...
Github user szhem commented on a diff in the pull request: https://github.com/apache/spark/pull/19294#discussion_r140652214 --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala --- @@ -130,17 +135,21 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String) val filesToMove = taskCommits.map(_.obj.asInstanceOf[Map[String, String]]) .foldLeft(Map[String, String]())(_ ++ _) logDebug(s"Committing files staged for absolute locations $filesToMove") -val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration) -for ((src, dst) <- filesToMove) { - fs.rename(new Path(src), new Path(dst)) +if (hasAbsPathFiles) { + val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration) + for ((src, dst) <- filesToMove) { +fs.rename(new Path(src), new Path(dst)) + } + fs.delete(absPathStagingDir, true) } -fs.delete(absPathStagingDir, true) --- End diff -- Wouldn't it be better to fix it in separate PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19295: [SPARK-22080][SQL] Adds support for allowing user to add...
Github user sathiyapk commented on the issue: https://github.com/apache/spark/pull/19295 @gatorsmile thanks for your comments. Here are my thoughts, thanks for correcting me if i'm wrong. (sorry for the big comment though :)) 1. This PR don't change any existing API, it adds a new one. 2. In the usual cases, for the people who don't use `ExperimentalMethods`, it don't affect or break anything. 3. For the people who use `ExperimentalMethods`, irrespective of whether it is pre-optimizer or post-optimizer rule, it will break anyway if they do it wrong. 4. One of the advantages of this PR `sparkSession.experimental.extraPreOptimizations` is that the user provided rule can get further optimizer by the native rules of spark, which is not possible with `sparkSession.experimental.extraOptimizations`. I'm writing a blog post regarding this with an example, i will post the link soon. 5. Last but not least, one of the main intention of the spark catalyst optimizer, as mentioned in its sigmod paper, is it's simplicity in defining new optimization rules and plug it into the query optimizer during runtime, so we should consider not to limit it even if it only concerns a rare case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19295: [SPARK-22080][SQL] Adds support for allowing user...
Github user sathiyapk commented on a diff in the pull request: https://github.com/apache/spark/pull/19295#discussion_r140652720 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ExperimentalMethods.scala --- @@ -44,11 +44,14 @@ class ExperimentalMethods private[sql]() { */ @volatile var extraStrategies: Seq[Strategy] = Nil + @volatile var extraPreOptimizations: Seq[Rule[LogicalPlan]] = Nil + @volatile var extraOptimizations: Seq[Rule[LogicalPlan]] = Nil --- End diff -- Yes, i agree with @gatorsmile, renaming `extraOptimizations ` to `extraPostOptimizations` will be symmetric with `extraPreOptimizations`, but doing so may affect the existing API calls. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19324: [SPARK-22103] Move HashAggregateExec parent consume to a...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19324 @juliuszsompolski Thanks for pinging me. #18931 is an attempt to separate the consume function as it can as possible. With long chain of any operators, you can have a long consume function and fail JIT, this is the one reason it tries to split into functions at the root of codegen support, instead of in some operators. I'd avoid to duplicate the separate logic in all operators, IMO. For the explicit delaying evaluation of projection, currently the strategy I take is not going to split it. I guess that you mean the evaluation that can be delayed by the compiler, I personally think it should not be a big impact under the whole-stage codegen framework. The reason is those evaluation are actually needed and can't be avoided in most of (if not all) cases. From the benchmark we can see there is no negative impact even in the cases where no long consume function exists. Yeah, I think the simplifies for the use of `continue` is a good thing, if it is possible. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18747: [WIP][SPARK-20822][SQL] Generate code to directly get va...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18747 **[Test build #82127 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82127/testReport)** for PR 18747 at commit [`c476e87`](https://github.com/apache/spark/commit/c476e87c5216db8e0dbaa2d0382b1cc1d44cab07). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18747: [WIP][SPARK-20822][SQL] Generate code to directly get va...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18747 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82127/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18747: [WIP][SPARK-20822][SQL] Generate code to directly get va...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18747 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19317 @ConeyLiu Yes tree aggregate introduce extra shuffle. But it is possible to improve perf when driver total collecting data size from executors are large and there're many partitions. But I think we can keep the same with `reduceByKeyLocally` for now. This is possible optimization which can be done in future. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82130 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82130/testReport)** for PR 19294 at commit [`ae0ba0a`](https://github.com/apache/spark/commit/ae0ba0a49454e67497aff4b2dd59e3986e5c6a03). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19294: [SPARK-21549][CORE] Respect OutputFormats with no...
Github user szhem commented on a diff in the pull request: https://github.com/apache/spark/pull/19294#discussion_r140654204 --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala --- @@ -57,6 +57,11 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String) */ private def absPathStagingDir: Path = new Path(path, "_temporary-" + jobId) + /** + * Checks whether there are files to be committed to an absolute output location. + */ + private def hasAbsPathFiles: Boolean = addedAbsPathFiles != null && addedAbsPathFiles.nonEmpty --- End diff -- Good catch, thank you! According to the `FileCommitProtocol`, `addedAbsPathFiles` is always null on driver, so we will not be able to commit or remove these files. Replaced it with private def hasAbsPathFiles: Boolean = path != null --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19295: [SPARK-22080][SQL] Adds support for allowing user to add...
Github user sathiyapk commented on the issue: https://github.com/apache/spark/pull/19295 I pushed a new commit that addresses @wzhfy review comments.. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19290 **[Test build #82128 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82128/testReport)** for PR 19290 at commit [`387228d`](https://github.com/apache/spark/commit/387228d9fd5290df4a99026f9446821bce60e779). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19290 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82128/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19290 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19290 **[Test build #82129 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82129/testReport)** for PR 19290 at commit [`ee7eb9d`](https://github.com/apache/spark/commit/ee7eb9dd925cf3892ba02ee5c116e3fe52274ebe). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19290 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19290 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82129/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82131 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82131/testReport)** for PR 19294 at commit [`3429de5`](https://github.com/apache/spark/commit/3429de551b96ab60fa6c9b211a2c534ce71df398). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82131 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82131/testReport)** for PR 19294 at commit [`3429de5`](https://github.com/apache/spark/commit/3429de551b96ab60fa6c9b211a2c534ce71df398). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82131/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user szhem commented on the issue: https://github.com/apache/spark/pull/19294 @mridulm Updated `FileFormatWriterSuite` [to cover](https://github.com/apache/spark/pull/19294/files#diff-bc98a3d91cf4f95f4f473146400044aa) both branches of the [committer calling](https://github.com/apache/spark/blob/3f958a99921d149fb9fdf7ba7e78957afdad1405/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L498) - for `newTaskTempFile` as well as for `newTaskTempFileAbsPath`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19317 OK, just keep it. Does this need more test or more improvements ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19222 **[Test build #82132 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82132/testReport)** for PR 19222 at commit [`7ec26f6`](https://github.com/apache/spark/commit/7ec26f678ecb08779c69ac10155d180ec8e58864). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19317 It is better adding more perf test for `OpenHashSet` replacement to avoid perf regression. And I found `reduceByKeyLocally` also use `JHashSet`, I am not sure whether there is some special reason. ping @cloud-fan Can you help confirm this ? I cannot find the original author for that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82133 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82133/testReport)** for PR 19294 at commit [`7963b58`](https://github.com/apache/spark/commit/7963b58e1a3e6b1687268e4b663ce4d5ffd821be). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82134 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82134/testReport)** for PR 19294 at commit [`1e42e83`](https://github.com/apache/spark/commit/1e42e83f62c9eb7819662b6363d77e9eefb916ad). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19229 @WeichenXu123 Have any more comments on this? Thanks. I think the ML part is straightforward. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82135 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82135/testReport)** for PR 19294 at commit [`0cf4724`](https://github.com/apache/spark/commit/0cf47248b84916092d85743d679969c26b0be54b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82130 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82130/testReport)** for PR 19294 at commit [`ae0ba0a`](https://github.com/apache/spark/commit/ae0ba0a49454e67497aff4b2dd59e3986e5c6a03). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82130/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19294: [SPARK-21549][CORE] Respect OutputFormats with no...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/19294#discussion_r140658582 --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala --- @@ -130,17 +135,21 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String) val filesToMove = taskCommits.map(_.obj.asInstanceOf[Map[String, String]]) .foldLeft(Map[String, String]())(_ ++ _) logDebug(s"Committing files staged for absolute locations $filesToMove") -val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration) -for ((src, dst) <- filesToMove) { - fs.rename(new Path(src), new Path(dst)) +if (hasAbsPathFiles) { + val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration) + for ((src, dst) <- filesToMove) { +fs.rename(new Path(src), new Path(dst)) + } + fs.delete(absPathStagingDir, true) } -fs.delete(absPathStagingDir, true) --- End diff -- can do, now you've got a little mock committer in someone can just extend it to optionally throw an IOE in abort(). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82136 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82136/testReport)** for PR 19294 at commit [`7c58abb`](https://github.com/apache/spark/commit/7c58abb5d01a0ff30e001eb87051b3aa091ae805). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82133 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82133/testReport)** for PR 19294 at commit [`7963b58`](https://github.com/apache/spark/commit/7963b58e1a3e6b1687268e4b663ce4d5ffd821be). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82133/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82137 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82137/testReport)** for PR 19294 at commit [`11445b4`](https://github.com/apache/spark/commit/11445b4a0741257b5ac546513b288f98d4d6cdac). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82134 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82134/testReport)** for PR 19294 at commit [`1e42e83`](https://github.com/apache/spark/commit/1e42e83f62c9eb7819662b6363d77e9eefb916ad). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82134/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82135 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82135/testReport)** for PR 19294 at commit [`0cf4724`](https://github.com/apache/spark/commit/0cf47248b84916092d85743d679969c26b0be54b). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19222 **[Test build #82132 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82132/testReport)** for PR 19222 at commit [`7ec26f6`](https://github.com/apache/spark/commit/7ec26f678ecb08779c69ac10155d180ec8e58864). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class MemoryBlockSuite ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19222 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82135/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19222 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82132/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82136 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82136/testReport)** for PR 19294 at commit [`7c58abb`](https://github.com/apache/spark/commit/7c58abb5d01a0ff30e001eb87051b3aa091ae805). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82136/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart
Github user jgoleary commented on the issue: https://github.com/apache/spark/pull/19326 Updated description. The only other mentions of `as()` I can find in the docs are in Java examples, and the method appears to exist on the Java side. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19222 **[Test build #82138 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82138/testReport)** for PR 19222 at commit [`0714ddc`](https://github.com/apache/spark/commit/0714ddcab6d83a489e791536775630e75e8fe5c6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 @hvanhovell @rednaxelafx After running a benchmark program, I took a polymorphic approach (i.e. each subclass has `getInt()`/`putInt()` methods. Then, I got better performance than monomorphic approach (i.e. only `MemoryBlock` class has `final` `getInt()`/`putInt()` methods. **The root cause for better performance is to pass a concrete type to the first argument of `Platform.getInt()/putInt()` instead of virtual call.** I run [this benchmark program](https://gist.github.com/kiszk/94f75b506c93a663bbbc372ffe8f05de) using [the commit](https://github.com/apache/spark/commit/0714ddcab6d83a489e791536775630e75e8fe5c6). I got the following results: ``` OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz Memory access benchmarks:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative IntArrayMemoryBlock423 / 445634.1 1.6 1.0X ByteArrayMemoryBlock 433 / 443620.3 1.6 1.0X Platform 431 / 436622.7 1.6 1.0X Platform Object 1004 / 1055267.4 3.7 0.4X Platform copyMemory 45 / 48 5903.9 0.2 9.3X Platform copyMemory Object 45 / 47 6004.0 0.2 9.5X ``` This result shows three facts: 1. According to the first three results, To have `getInt()/putInt()` in subclasses of `MemoryBlock` can achieve comparable performance to the current implementation (`Platform` in a table). 2. According to the third and forth results, even if we use `Platform.getInt()/putInt(), we achieve more than 2x worse performance (`Platform Object` in a table) when we pass `Object` to the first argument instead of concrete type (i.e. `byte[]`). For example, `byte[] b; Platform.getInt(b, 0);` can achieve better performance than `Object o; Platform.getInt(o, 0);` 3. According to the fifth and sixth results, for Platform.copy(), to pass `Object` can achieve the same performance as to pass `byte[]`. From fact 2., I used polymorphic approach to pass the concrete type for each subclass of `MemoryBlock`. As a result, we can achieve the same performance if the current Spark uses a concrete type for the first argument of `Platform.getInt()/putInt()`. If the current Spark uses `Object` (e.g. [here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java#L61)), this PR can achieve better performance. Probably, @rednaxelafx can explain this very well :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19324: [SPARK-22103] Move HashAggregateExec parent consu...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19324#discussion_r140664499 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala --- @@ -328,10 +325,11 @@ case class BroadcastHashJoinExec( | UnsafeRow $matched = $matches != null && $matches.hasNext() ? |(UnsafeRow) $matches.next() : null; | ${checkCondition.trim} --- End diff -- ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19324: [SPARK-22103] Move HashAggregateExec parent consu...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19324#discussion_r140664550 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala --- @@ -328,10 +325,11 @@ case class BroadcastHashJoinExec( | UnsafeRow $matched = $matches != null && $matches.hasNext() ? |(UnsafeRow) $matches.next() : null; | ${checkCondition.trim} --- End diff -- nvm. This is for outer join. The same name but different value. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19324: [SPARK-22103] Move HashAggregateExec parent consu...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19324#discussion_r140664581 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala --- @@ -186,8 +186,7 @@ case class BroadcastHashJoinExec( */ private def getJoinCondition( ctx: CodegenContext, - input: Seq[ExprCode], - anti: Boolean = false): (String, String, Seq[ExprCode]) = { --- End diff -- I like this change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19324: [SPARK-22103] Move HashAggregateExec parent consume to a...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19324 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82137 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82137/testReport)** for PR 19294 at commit [`11445b4`](https://github.com/apache/spark/commit/11445b4a0741257b5ac546513b288f98d4d6cdac). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82137/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19326 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19326 **[Test build #82139 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82139/testReport)** for PR 19326 at commit [`c6bf156`](https://github.com/apache/spark/commit/c6bf156116e347e07a92cf75b07443cb327ce4ab). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19326 **[Test build #82139 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82139/testReport)** for PR 19326 at commit [`c6bf156`](https://github.com/apache/spark/commit/c6bf156116e347e07a92cf75b07443cb327ce4ab). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19326 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19326 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82139/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19222 **[Test build #82138 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82138/testReport)** for PR 19222 at commit [`0714ddc`](https://github.com/apache/spark/commit/0714ddcab6d83a489e791536775630e75e8fe5c6). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public final class ByteArrayMemoryBlock extends MemoryBlock ` * `public final class IntArrayMemoryBlock extends MemoryBlock ` * `public final class LongArrayMemoryBlock extends MemoryBlock ` * `public class OffHeapMemoryBlock extends MemoryBlock ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19222 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82138/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19222 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16548: [SPARK-19158][SPARKR][EXAMPLES] Fix ml.R example fails d...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/16548 So there is something similar in the fulltests for R `./R/pkg/tests/fulltests/test_mllib.R` (found while working on packaging). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19326: [SPARK-22107] Change as to alias in python quickstart
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19326 Merged to master and branch-2.2. @jgoleary, I merged this considering the first contribution but let's do this in a batch if possible in the future. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19326: [SPARK-22107] Change as to alias in python quicks...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19326 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19283: Update quickstart python dataset example
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19283 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID
Github user shaneknapp commented on the issue: https://github.com/apache/spark/pull/19290 @HyukjinKwon -- you will absolutely not have builds install packages on the build system. this is a really bad idea. is this absolutely required, or just to fix a warning in the build output? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290 @shaneknapp Sure, it was my bad. I will be careful next time. It is required to fix an actual issue in order to to detect R codes that do not follow project's R style. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Upgrades lintr to latest commit sha1 ID
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290 Do you maybe have some worries about this? If that worry is quite crucial, I think we could also consider an option, not upgrading this, leaving `lint-r.R` script as was, and only fixing the instances detected by the upgrade here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19229#discussion_r140675967 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -223,20 +223,18 @@ class ImputerModel private[ml] ( override def transform(dataset: Dataset[_]): DataFrame = { transformSchema(dataset.schema, logging = true) -var outputDF = dataset val surrogates = surrogateDF.select($(inputCols).map(col): _*).head().toSeq -$(inputCols).zip($(outputCols)).zip(surrogates).foreach { +val newCols = $(inputCols).zip($(outputCols)).zip(surrogates).map { case ((inputCol, outputCol), surrogate) => val inputType = dataset.schema(inputCol).dataType val ic = col(inputCol) -outputDF = outputDF.withColumn(outputCol, - when(ic.isNull, surrogate) +when(ic.isNull, surrogate) --- End diff -- style: indent --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19321: [SPARK-22100] [SQL] Make percentile_approx support numer...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19321 **[Test build #82140 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82140/testReport)** for PR 19321 at commit [`1d26f50`](https://github.com/apache/spark/commit/1d26f501c6e9870969f5312b2a91b05c7e97cef3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19229#discussion_r140678574 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -223,20 +223,18 @@ class ImputerModel private[ml] ( override def transform(dataset: Dataset[_]): DataFrame = { transformSchema(dataset.schema, logging = true) -var outputDF = dataset val surrogates = surrogateDF.select($(inputCols).map(col): _*).head().toSeq -$(inputCols).zip($(outputCols)).zip(surrogates).foreach { +val newCols = $(inputCols).zip($(outputCols)).zip(surrogates).map { case ((inputCol, outputCol), surrogate) => val inputType = dataset.schema(inputCol).dataType val ic = col(inputCol) -outputDF = outputDF.withColumn(outputCol, - when(ic.isNull, surrogate) +when(ic.isNull, surrogate) --- End diff -- This `when` is not a call of previous line. I think it doesn't need to indent? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19229#discussion_r140678630 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -2102,6 +2102,55 @@ class Dataset[T] private[sql]( } /** + * Returns a new Dataset by adding columns or replacing the existing columns that has + * the same names. + */ + private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = { --- End diff -- ping @cloud-fan or @gatorsmile Can you check the SQL part? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org