[GitHub] spark issue #17009: [SPARK-19674][SQL]Ignore driver accumulator updates don'...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/17009 I created a Spark 2.1 backport at #17418. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17418: [SPARK-19674][SQL] Ignore driver accumulator upda...
GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/17418 [SPARK-19674][SQL] Ignore driver accumulator updates don't belong to the execution when merging all accumulator updates N.B. This is a backport to branch-2.1 of #17009. ## What changes were proposed in this pull request? In SQLListener.getExecutionMetrics, driver accumulator updates that don't belong to the execution should be ignored when merging all accumulator updates, to prevent a NoSuchElementException. ## How was this patch tested? Updated unit test. Author: Carson Wang You can merge this pull request into a Git repository by running: $ git pull https://github.com/VideoAmp/spark-public spark-19674-backport_2.1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17418.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17418
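The fix described above can be illustrated with a small standalone sketch. The names here (`AccumulatorMergeSketch`, `mergeAccumulatorUpdates`, `registeredIds`) are hypothetical, not Spark's actual `SQLListener` API; the point is only the filtering step: driver updates whose accumulator id is not registered for the execution are dropped before merging, which is where the `NoSuchElementException` previously arose.

```scala
// Illustrative sketch only -- not Spark's actual SQLListener code.
object AccumulatorMergeSketch {
  // registeredIds: accumulator ids known to belong to this execution.
  // Updates are (accumulatorId, value) pairs.
  def mergeAccumulatorUpdates(
      registeredIds: Set[Long],
      driverUpdates: Seq[(Long, Long)],
      executorUpdates: Seq[(Long, Long)]): Map[Long, Long] = {
    // Drop driver updates that do not belong to this execution. Without this
    // filter, a later lookup on an unknown id would throw NoSuchElementException.
    val relevantDriverUpdates =
      driverUpdates.filter { case (id, _) => registeredIds.contains(id) }
    // Merge the remaining updates by accumulator id.
    (relevantDriverUpdates ++ executorUpdates)
      .groupBy(_._1)
      .map { case (id, updates) => id -> updates.map(_._2).sum }
  }
}
```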
[GitHub] spark issue #17009: [SPARK-19674][SQL]Ignore driver accumulator updates don'...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/17009 It looks like this will fix a bug we're experiencing in Spark 2.1. Given that this PR is a bug fix, any chance we can get a backport into `branch-2.1`? I can work on it myself if @carsonwang is unable.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 Rebased to latest master.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 I'm still working actively on this PR (as I have time), but I wanted to share that I will be away and unavailable from tonight, March 24th until Tuesday, April 11th. If you post a comment to this PR during that timeframe I'll respond as soon as I can after I return. Cheers.
[GitHub] spark pull request #17390: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman closed the pull request at: https://github.com/apache/spark/pull/17390
[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16499 Backport PR is #17390
[GitHub] spark issue #17390: [SPARK-17204][CORE] Fix replicated off heap storage
Github user mallman commented on the issue: https://github.com/apache/spark/pull/17390 This is a backport of #16499 to branch-2.0
[GitHub] spark pull request #17390: [SPARK-17204][CORE] Fix replicated off heap stora...
GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/17390 [SPARK-17204][CORE] Fix replicated off heap storage (Jira: https://issues.apache.org/jira/browse/SPARK-17204) There are a couple of bugs in the `BlockManager` with respect to support for replicated off-heap storage. First, the locally-stored off-heap byte buffer is disposed of when it is replicated. It should not be. Second, the replica byte buffers are stored as heap byte buffers instead of direct byte buffers even when the storage level memory mode is off-heap. This PR addresses both of these problems. `BlockManagerReplicationSuite` was enhanced to fill in the coverage gaps. It now fails if either of the bugs fixed in this PR exists. You can merge this pull request into a Git repository by running: $ git pull https://github.com/VideoAmp/spark-public spark-17204-replicated_off_heap_storage-2.0_backport Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17390.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17390
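The second bug can be sketched in isolation (this is an illustration, not Spark's actual `BlockManager` code; `ensureDirect` is a hypothetical helper): when the target memory mode is off-heap, a heap-backed buffer should be copied into a direct buffer before being stored as a replica, rather than stored on the heap as-is.

```scala
import java.nio.ByteBuffer

// Illustrative sketch only -- not the actual Spark fix.
object OffHeapReplicaSketch {
  // Return a buffer suitable for the requested memory mode: copy heap-backed
  // data into a direct buffer when off-heap storage is requested; otherwise
  // return the buffer unchanged.
  def ensureDirect(buffer: ByteBuffer, offHeap: Boolean): ByteBuffer = {
    if (offHeap && !buffer.isDirect) {
      val direct = ByteBuffer.allocateDirect(buffer.remaining())
      direct.put(buffer.duplicate()) // duplicate so the source position is untouched
      direct.flip()
      direct
    } else {
      buffer // already direct, or heap mode: no copy needed
    }
  }
}
```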
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 @felixcheung We haven't heard from @jkbradley or @ankurdave in a week. Should we give them more time or can we merge to master?
[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16499 > @mallman can you send a new PR for 2.0? thanks! Will do. Do I need to open a new JIRA ticket for that?
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r107028767 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1048,7 +1065,7 @@ private[spark] class BlockManager( try { replicate(blockId, bytesToReplicate, level, remoteClassTag) } finally { -bytesToReplicate.dispose() +bytesToReplicate.unmap() --- End diff -- The best I think we can expect from such a flag is a hint. The constructor of a `ChunkedByteBuffer` will not always know if the underlying byte buffers are memory mapped or not. For example, see https://github.com/apache/spark/blob/bec6b16c1900fe93def89cc5eb51cbef498196cb/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L326. In this case, `data.nioByteBuffer()` might or might not be memory-mapped. I still think the current patch set is the best overall amongst the other options we've considered. I can add a unit test for `StorageUtils.unmap` to ensure it works as expected (only disposing memory-mapped buffers). I can also add an `if` clause around the call to `bytesToReplicate.unmap()` to ensure this is only called when the replication storage level is off-heap. This will ensure the reflective call on the `fd` field only occurs for off-heap replication. Given that off-heap replication is currently broken, I doubt anyone will notice a performance degradation... Besides that, I suspect that network and disk IO performance will dominate the reflective method call performance.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 @viirya A month has gone by since my last update. I've added much more comprehensive coverage to the `SelectedFieldSuite`, however I haven't yet fixed the `SelectedField` extractor to pass all of the tests. All of the failures are related to handling path expressions including `GetArrayStructFields` extractors. There are many complicated cases, and they are proving quite a challenge to resolve comprehensively. I hope to spend some more time on this by the end of next week. I would love to push an update by then. After next week I will be away for two weeks. Cheers.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r106552361 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1048,7 +1065,7 @@ private[spark] class BlockManager( try { replicate(blockId, bytesToReplicate, level, remoteClassTag) } finally { -bytesToReplicate.dispose() +bytesToReplicate.unmap() --- End diff -- @cloud-fan I explored the approach of making the `MemoryStore` return a `ChunkedByteBuffer` that cannot be disposed, however I don't think there's a clean way to safely support that behavior. In essence, if the memory manager marks a buffer as indisposable when it returns it to the block manager, then that buffer cannot be evicted later. Adding additional code to handle this other behavior correctly was looking rather messy, and I abandoned the effort. At this point, I think that explicitly separating `unmap` and `dispose` methods is still the best way to resolve this issue.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r104770778 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1048,7 +1065,7 @@ private[spark] class BlockManager( try { replicate(blockId, bytesToReplicate, level, remoteClassTag) } finally { -bytesToReplicate.dispose() +bytesToReplicate.unmap() --- End diff -- I would like to explore this further to ensure it would really work well, but I like your idea, with one caveat. I think we should avoid using the `disposed` var for this purpose. Using it this way would introduce ambiguity in its meaning when `disposed` is true. In addition to its current meaning, it could also mean "this buffer is not disposed, but we don't want to dispose it". Instead, I suggest keeping the usage of `disposed` as-is and adding an additional var, e.g. `indisposable`, which makes the `dispose` method itself a no-op.
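The `indisposable` idea could look roughly like this. The class below is a hypothetical stand-in for `ChunkedByteBuffer`, written only to show the two flags side by side: `disposed` keeps its current meaning, while `indisposable` turns `dispose()` into a no-op.

```scala
import java.nio.ByteBuffer

// Illustrative stand-in for ChunkedByteBuffer -- not the actual Spark class.
class IndisposableBufferSketch(val chunks: Array[ByteBuffer]) {
  var disposed: Boolean = false     // existing meaning: memory has been freed
  var indisposable: Boolean = false // new flag: dispose() becomes a no-op

  def dispose(): Unit = {
    if (!disposed && !indisposable) {
      // The real implementation would free the underlying memory here.
      disposed = true
    }
  }
}
```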
[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16499 I looked into simply cleaning up the `StorageUtils.dispose` method to only dispose memory-mapped buffers. However, I did find legitimate uses of that method to dispose of direct/non-memory-mapped buffers. So I kept the behavior of that method as-is. Instead, I added a new method, `unmap`, which will dispose of memory-mapped buffers *only*, and added calls to that method where appropriate. At the end of the day, I only found one case where we specifically wanted an "unmap" behavior instead of the other broader disposal behavior. (That case being the one that was causing corruption of replicated blocks in the first place.) I also found a new memory management bug in `BlockManager` introduced by the encryption support. In the original codebase, it disposes of a buffer unsafely. I think part of the problem is that the documentation of the `ChunkedByteBuffer.toByteBuffer` method uses the word "copy" in describing what that method does. I expanded that method's documentation and made it more precise, to clarify that sometimes that method *does not* return a copy of the data. In those cases, it is not safe to dispose the returned buffer. I found that there were no uses of `ByteBufferInputStream` where automatic buffer disposal was called for. Therefore, I dropped that support from that class to guard against unsafe usage. If someone _really_ wants to use automatic buffer disposal in `ByteBufferInputStream`, they can, carefully, re-add that support. I think that it's generally unsafe. And, like I said, nothing in the codebase was using it anyway, except where it was used incorrectly.
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 @felixcheung Can you take another look and merge if LGTY? I think we've addressed all of the open reviewer requests.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r103602156 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1018,7 +1025,9 @@ private[spark] class BlockManager( try { replicate(blockId, bytesToReplicate, level, remoteClassTag) } finally { -bytesToReplicate.dispose() +if (!level.useOffHeap) { --- End diff -- Ok. I'll see what else is calling that method to validate that a fix there won't break something else, and I'll add a unit test to validate that calling `StorageUtils.dispose` on a direct byte buffer that isn't memory mapped doesn't actually dispose it.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r103509816 --- Diff: docs/graphx-programming-guide.md --- @@ -708,7 +708,9 @@ messages remaining. > messaging function. These constraints allow additional optimization within GraphX. The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch* -of its implementation (note calls to graph.cache have been removed): +of its implementation (note: to avoid stackOverflowError due to long lineage chains, graph and --- End diff -- Personally, I don't think we need to include the checkpointer in the implementation sketch. I think it's more an implementation detail than an essential part of the algorithm. I think it's enough to simply document that GraphX's implementation of pregel includes checkpointing to avoid unbounded RDD lineages.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r103509038 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1018,7 +1025,9 @@ private[spark] class BlockManager( try { replicate(blockId, bytesToReplicate, level, remoteClassTag) } finally { -bytesToReplicate.dispose() +if (!level.useOffHeap) { --- End diff -- Hi guys. Sorry for the delay on the update. I started down the path that I proposed, and it resulted in too many awkward changes in method signatures downstream. I don't think this is a viable step forward. As another option, we could dispose the buffer if and only if it has a non-null `fd` field. Since that field is private, we would have to access it by reflection. I'd also include a unit test to validate that the field exists as expected, to guard against internal changes in future versions of Java. On a broader level, I wonder if callers of the `ChunkedByteBuffer.dispose` method understand that it will dispose of non-memory-mapped direct buffers. The documentation of that method suggests it's only supposed to dispose of memory-mapped files in the strict sense (those actually memory mapped against a file descriptor by the OS). If other methods are accidentally calling this method on non-memory-mapped direct buffers, that suggests to me we need to push the fix to that method (or actually the StorageUtils.dispose() method). What do you think of that?
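A minimal sketch of the reflective `fd` check proposed here. This relies on JDK internals: plain direct buffers and truly memory-mapped buffers share the `MappedByteBuffer` hierarchy, but only buffers actually mapped against a file carry a non-null private `fd` field. The object name is illustrative, and on JDKs where the field is missing or reflection is denied, the sketch conservatively reports `false` (which is exactly why the comment above suggests a guarding unit test).

```scala
import java.nio.{ByteBuffer, MappedByteBuffer}
import scala.util.control.NonFatal

// Illustrative sketch of the proposed check -- not production code.
object MappedCheckSketch {
  // True only for buffers actually memory-mapped against a file descriptor.
  def isActuallyMapped(buffer: ByteBuffer): Boolean =
    buffer match {
      case m: MappedByteBuffer =>
        try {
          // The private `fd` field is null for plain direct buffers and
          // non-null for buffers mapped via FileChannel.map.
          val fdField = classOf[MappedByteBuffer].getDeclaredField("fd")
          fdField.setAccessible(true)
          fdField.get(m) != null
        } catch {
          // Field missing or inaccessible on this JDK: report not mapped.
          case NonFatal(_) => false
        }
      case _ => false // heap buffers are never memory-mapped
    }
}
```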
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102293681 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -317,6 +317,9 @@ private[spark] class BlockManager( /** * Put the block locally, using the given storage level. + * + * '''Important!''' Callers must not mutate or release the data buffer underlying `bytes`. Doing --- End diff -- I've explicitly documented the fact that callers must not mutate the data buffers underlying `bytes`.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102293219 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -843,7 +852,15 @@ private[spark] class BlockManager( false } } else { - memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes) + val memoryMode = level.memoryMode + memoryStore.putBytes(blockId, size, memoryMode, () => { +if (memoryMode == MemoryMode.OFF_HEAP && +bytes.chunks.exists(buffer => !buffer.isDirect)) { --- End diff -- I've refined this check for copying `bytes` to skip copying when the underlying buffers are already direct.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102292537 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala --- @@ -122,27 +125,39 @@ object Pregel extends Logging { require(maxIterations > 0, s"Maximum number of iterations must be greater than 0," + s" but got ${maxIterations}") -var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache() +val checkpointInterval = graph.vertices.sparkContext.getConf + .getInt("spark.graphx.pregel.checkpointInterval", 10) --- End diff -- I would also suggest incorporating this change into the Spark 2.2 release notes under a section for GraphX, but I don't see where these notes are maintained. The release notes for 2.1 are published at http://spark.apache.org/releases/spark-release-2-1-0.html, but I can't find them in the repo. Anybody know how these are generated or how to contribute to them? Is there another repo for this documentation?
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102290972 --- Diff: docs/graphx-programming-guide.md --- @@ -708,7 +708,9 @@ messages remaining. > messaging function. These constraints allow additional optimization within GraphX. The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch* -of its implementation (note calls to graph.cache have been removed): +of its implementation (note: to avoid stackOverflowError due to long lineage chains, graph and --- End diff -- > I guess it would help reviewers to understand if you could explain those changes in the impl sketch is required to use checkpoint Sorry @felixcheung, I don't understand what you mean here. Can you elaborate?
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102272981 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -813,7 +813,14 @@ private[spark] class BlockManager( false } } else { - memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes) + val memoryMode = level.memoryMode + memoryStore.putBytes(blockId, size, memoryMode, () => { +if (memoryMode == MemoryMode.OFF_HEAP) { --- End diff -- So do a copy if and only if `memoryMode == MemoryMode.OFF_HEAP` and `bytes` is not direct and `bytes` is not a memory mapped file?
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102271763 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1018,7 +1025,9 @@ private[spark] class BlockManager( try { replicate(blockId, bytesToReplicate, level, remoteClassTag) } finally { -bytesToReplicate.dispose() +if (!level.useOffHeap) { --- End diff -- There does not appear to be a robust way to check if `bytesToReplicate` is a mmapped file or not. Perhaps `doGetLocalBytes` should return a tuple `(ChunkedByteBuffer, boolean)` where the second element of the tuple is `true` if and only if the buffer is a mmapped file. Thoughts?
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 @dding3, thank you for your continued patience and dedication to this PR, despite the continued change requests. We are getting closer to a merge.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102057156 --- Diff: docs/graphx-programming-guide.md --- @@ -708,7 +708,9 @@ messages remaining. > messaging function. These constraints allow additional optimization within GraphX. The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch* -of its implementation (note calls to graph.cache have been removed): +of its implementation (note: to avoid stackOverflowError due to long lineage chains, graph and --- End diff -- Also, we should document the default value and that checkpointing can be disabled by setting this config property to -1.
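The configuration semantics under discussion can be sketched as follows. The object and method names are illustrative (the actual logic lives in `Pregel.scala` and `PeriodicCheckpointer`): the interval defaults to 10, and a non-positive value such as -1 disables checkpointing entirely.

```scala
// Illustrative sketch of the spark.graphx.pregel.checkpointInterval semantics.
object CheckpointIntervalSketch {
  val DefaultInterval = 10 // iterations between checkpoints

  // A value of -1 (or any non-positive value) disables checkpointing.
  def shouldCheckpoint(iteration: Int, interval: Int = DefaultInterval): Boolean =
    interval > 0 && iteration % interval == 0
}
```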
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102056438

--- Diff: docs/graphx-programming-guide.md ---
```diff
@@ -708,7 +708,9 @@ messages remaining.
 > messaging function. These constraints allow additional optimization within GraphX.
 
 The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch*
-of its implementation (note calls to graph.cache have been removed):
+of its implementation (note: to avoid stackOverflowError due to long lineage chains, graph and
```
--- End diff --

I think @viirya is suggesting we remove all references to checkpointing from this implementation sketch. @viirya, correct me if I'm wrong.

I suggest reverting these changes to `graphx-programming-guide.md`. Just add a brief note that GraphX periodically checkpoints the graph and message lineages, and that this checkpoint interval can be configured with the `spark.graphx.pregel.checkpointInterval` Spark configuration property.

@viirya @felixcheung Should we document this config property in the Spark configuration document as well?
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102053462

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
```diff
@@ -154,7 +169,9 @@ object Pregel extends Logging {
       // count the iteration
       i += 1
     }
-    messages.unpersist(blocking = false)
+    messageCheckpointer.unpersistDataSet()
```
--- End diff --

Sorry, I don't understand this change. Why do we replace

```scala
messages.unpersist(blocking = false)
```

with

```scala
messageCheckpointer.unpersistDataSet()
```

Especially because this adds a new public method to `PeriodicCheckpointer` that no other code has needed before.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r101819321

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala ---
```diff
@@ -87,10 +87,10 @@
   override protected def persist(data: Graph[VD, ED]): Unit = {
     if (data.vertices.getStorageLevel == StorageLevel.NONE) {
-      data.vertices.persist()
```
--- End diff --

We need to use `cache` because `persist` does not honor the default storage level requested when constructing the graph. Only `cache` does that. It's confusing, but true. To verify this for yourself, change these values to `persist` and run the `PeriodicGraphCheckpointerSuite` tests.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r101818789

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala ---
```diff
@@ -362,12 +362,14 @@ class GraphOps[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]) extends Serializable {
   def pregel[A: ClassTag](
       initialMsg: A,
       maxIterations: Int = Int.MaxValue,
+      checkpointInterval: Int = 25,
```
--- End diff --

Good point in both cases. I'm wondering if the periodic Pregel checkpointing operation should be controlled by a config value instead. Suppose, for example, we create a config key `spark.graphx.pregel.checkpointInterval`. If it's set to the default value (0, to retain existing functionality), checkpointing is disabled. If it's set to a positive integer, checkpointing is performed every that many iterations. Other values are invalid.
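The interval semantics proposed above can be sketched as a small standalone helper. This is a hypothetical illustration of the rule being described (non-positive disables checkpointing, a positive value checkpoints every N iterations), not GraphX's actual implementation, which lives in `PeriodicCheckpointer`:

```java
// Hypothetical sketch of the proposed checkpoint-interval semantics.
public class CheckpointIntervalSketch {
    // Returns true when the given iteration should trigger a checkpoint.
    static boolean shouldCheckpoint(int checkpointInterval, int iteration) {
        if (checkpointInterval <= 0) {
            return false; // 0 (or -1) disables checkpointing entirely
        }
        return iteration > 0 && iteration % checkpointInterval == 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldCheckpoint(0, 25));  // false: disabled
        System.out.println(shouldCheckpoint(25, 25)); // true: every 25th iteration
        System.out.println(shouldCheckpoint(25, 26)); // false
    }
}
```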
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101809872

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
```diff
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
       try {
         replicate(blockId, bytesToReplicate, level, remoteClassTag)
       } finally {
-        bytesToReplicate.dispose()
+        if (!level.useOffHeap) {
```
--- End diff --

Allocating a direct byte buffer creates a `java.nio.DirectByteBuffer`, which is in turn a subclass of `java.nio.MappedByteBuffer`. So calling `dispose()` will dispose direct buffers, too.
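The class relationship described above is easy to verify in plain Java, with no Spark involved:

```java
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;

// In the JDK, ByteBuffer.allocateDirect returns a java.nio.DirectByteBuffer,
// which extends java.nio.MappedByteBuffer, so a direct buffer passes an
// instanceof MappedByteBuffer check while a plain heap buffer does not.
public class DirectBufferCheck {
    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        ByteBuffer heap = ByteBuffer.allocate(16);
        System.out.println(direct instanceof MappedByteBuffer); // true
        System.out.println(heap instanceof MappedByteBuffer);   // false
    }
}
```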
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101809099

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
```diff
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
           false
         }
       } else {
-        memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes)
+        val memoryMode = level.memoryMode
+        memoryStore.putBytes(blockId, size, memoryMode, () => {
+          if (memoryMode == MemoryMode.OFF_HEAP) {
```
--- End diff --

I'm not sure what you're suggesting I do here.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101675669

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
```diff
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
       try {
         replicate(blockId, bytesToReplicate, level, remoteClassTag)
       } finally {
-        bytesToReplicate.dispose()
+        if (!level.useOffHeap) {
```
--- End diff --

So maybe use `putBlockStatus.storageLevel` instead?
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101675576

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
```diff
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
           false
         }
       } else {
-        memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes)
+        val memoryMode = level.memoryMode
+        memoryStore.putBytes(blockId, size, memoryMode, () => {
+          if (memoryMode == MemoryMode.OFF_HEAP) {
```
--- End diff --

Is it safe to store a ref to `bytes` if the memory is stored off-heap? If the caller changes the values in that memory or frees it, the buffer we put in the memory store will be affected. We don't want that kind of side-effect.
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125

LGTM. @felixcheung are we good to merge?
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125

@dding3 I submitted a PR against your `cp2_pregel` branch. If you merge that PR into your branch, it will be reflected in this PR. This is my PR: https://github.com/dding3/spark/pull/1.
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125

> I think @mallman is saying he would merge changes to @dding3 branch

Yes, or I could do them in a follow up PR. Or @dding3 could do them without my PR. I'm not hung up on getting credit.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101602604

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
```diff
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
           false
         }
       } else {
-        memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes)
+        val memoryMode = level.memoryMode
+        memoryStore.putBytes(blockId, size, memoryMode, () => {
+          if (memoryMode == MemoryMode.OFF_HEAP) {
```
--- End diff --

Never mind about checking `memoryMode.useOffHeap`. I got that confused with `StorageLevel`. There are actually only two values of `MemoryMode`: `MemoryMode.OFF_HEAP` and `MemoryMode.ON_HEAP`.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101602014

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
```diff
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
           false
         }
       } else {
-        memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes)
+        val memoryMode = level.memoryMode
+        memoryStore.putBytes(blockId, size, memoryMode, () => {
+          if (memoryMode == MemoryMode.OFF_HEAP) {
```
--- End diff --

(Actually, I think we need to check `memoryMode.useOffHeap` here.)

Assume `memoryMode.useOffHeap` is true. We have two cases to consider:

1. `bytes` is on-heap. In this case, we need to copy it into a new direct buffer, and that's what we're doing here.
2. `bytes` is off-heap. In this case, we assume that the caller upstream is managing the memory underlying `bytes`, and `bytes.copy(Platform.allocateDirectBuffer)` becomes a defensive copy. If the caller is not managing this memory, I would call that a bug in the caller's behavior.

In either case, I believe we should be calling `bytes.copy(Platform.allocateDirectBuffer)` when `memoryMode.useOffHeap` is true.

BTW, in my experience tracing this code in the debugger, `bytes` has always been an on-heap buffer.
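The defensive-copy argument in case 2 can be illustrated in plain Java. This is only a sketch of the idea, not Spark's actual `ChunkedByteBuffer.copy(Platform.allocateDirectBuffer)`:

```java
import java.nio.ByteBuffer;

// Copying the caller's bytes into a fresh direct buffer means later
// mutations of the caller's buffer cannot affect what was stored.
public class DefensiveCopy {
    static ByteBuffer copyToDirect(ByteBuffer src) {
        ByteBuffer copy = ByteBuffer.allocateDirect(src.capacity());
        for (int i = 0; i < src.capacity(); i++) {
            copy.put(i, src.get(i)); // absolute get/put: positions untouched
        }
        return copy;
    }

    public static void main(String[] args) {
        ByteBuffer callerBytes = ByteBuffer.allocate(4);
        callerBytes.put(0, (byte) 1);
        ByteBuffer stored = copyToDirect(callerBytes);
        callerBytes.put(0, (byte) 99); // caller mutates its buffer afterwards
        System.out.println(stored.get(0)); // prints 1: the copy is unaffected
    }
}
```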
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101592331

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
```diff
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
       try {
         replicate(blockId, bytesToReplicate, level, remoteClassTag)
       } finally {
-        bytesToReplicate.dispose()
+        if (!level.useOffHeap) {
```
--- End diff --

I think the name of the `ChunkedByteBuffer.dispose()` method is confusing. It actually only attempts to dispose a so-called "memory mapped" buffer. On-heap buffers are not memory mapped, therefore this is a no-op for them. On the other hand, when the storage level uses off-heap memory in this context, `bytesToReplicate` is a reference to the actual off-heap memory buffer. Disposing of this buffer will erase it from the local memory store. Obviously, this is not the desired behavior. So we add the guard for off-heap memory buffers here.

As far as I can tell, there is no storage level for which `bytesToReplicate.dispose()` would actually do anything. However, technically if `bytesToReplicate` were memory-mapped but not direct, this would dispose of that memory. Would we even want that behavior?

Overall, this `finally` clause is attempting to destroy the data we get from `doGetLocalBytes()`. This does not seem to be safe or correct, because we do not want to destroy the data stored locally. Therefore, we should consider getting rid of this `finally` clause entirely. What do you think?
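The guard under discussion can be sketched as a small decision helper. Everything here is hedged: `storageLevelUsesOffHeap` stands in for Spark's `level.useOffHeap`, and the actual freeing (which relies on JDK internals) is elided; the code only models which buffers `dispose()` would touch:

```java
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;

// Models the dispose decision: skip the off-heap storage level entirely
// (so the memory store's own buffer is never erased), and note that
// dispose is a no-op for on-heap buffers, since only MappedByteBuffers
// (which include direct buffers) are eligible for disposal.
public class DisposeGuardSketch {
    static boolean wouldDispose(ByteBuffer buf, boolean storageLevelUsesOffHeap) {
        if (storageLevelUsesOffHeap) {
            return false; // guard: never free the local memory store's buffer
        }
        return buf instanceof MappedByteBuffer; // on-heap buffers: no-op
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(16);
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        System.out.println(wouldDispose(heap, false));   // false: on-heap no-op
        System.out.println(wouldDispose(direct, true));  // false: guarded
        System.out.println(wouldDispose(direct, false)); // true
    }
}
```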
[GitHub] spark issue #16942: [SPARK-19611][SQL] Introduce configurable table schema i...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16942

Weird. I think I've seen that behavior once before. But I think the only time I force push on a PR is to rebase. Maybe that's the only kind of force push allowed for Github PRs.
[GitHub] spark issue #16942: [SPARK-19611][SQL] Introduce configurable table schema i...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16942

Force pushing your branch shouldn't close the PR. You didn't close it manually?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578

@viirya I've added a commit to address some of your feedback. I will have another commit to address the others, but I'm not sure when I'll have it in. Hopefully by the end of next week.
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125

Our connected components computation completed successfully, with performance as expected. I've created a PR against @dding3's PR branch to incorporate a couple simple things. Then I think we're good to go.
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125

@dding3 These latest changes look great. I'll run our big connected components job today and report back.
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125

@viirya @dding3 I'm going to rerun our big connected components computation with the changes I've suggested to validate that it still performs and completes as expected. Given the time required to complete this calculation (10+ hours), I might not get to it until next week.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100641170

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala ---
```diff
@@ -87,10 +88,7 @@
   override protected def persist(data: Graph[VD, ED]): Unit = {
     if (data.vertices.getStorageLevel == StorageLevel.NONE) {
-      data.vertices.persist()
-    }
-    if (data.edges.getStorageLevel == StorageLevel.NONE) {
-      data.edges.persist()
+      data.persist()
```
--- End diff --

I enhanced the persistence tests in `PeriodicGraphCheckpointerSuite` to check that the storage level requested in the graph construction is the storage level seen after persistence. Both this version and the original version of this method failed that unit test.

The graph's vertex and edge rdds are somewhat peculiar in that `.cache()` and `.persist()` do not do the same thing, unlike other RDDs. And while `.cache()` honors the default storage level specified at graph construction time, `.persist()` always caches with the `MEMORY_ONLY` storage level. At any rate, getting the `PeriodicGraphCheckpointer` to honor the default storage level specified at graph construction time requires changing these method calls from `persist()` to `cache()`.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100640256

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
```diff
@@ -123,16 +127,25 @@ object Pregel extends Logging {
       s" but got ${maxIterations}")
 
     var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
+    val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+      checkpointInterval, graph.vertices.sparkContext)
+    graphCheckpointer.update(g)
+
     // compute the messages
-    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg).cache()
+    val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+      checkpointInterval, graph.vertices.sparkContext)
+    messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
```
--- End diff --

Actually, for the sake of simplicity and consistency, I'm going to suggest we keep the checkpointer update calls but remove all `.cache()` calls. The `update` calls persist the underlying data, making the calls to `.cache()` unnecessary.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100638292

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
```diff
@@ -123,16 +127,25 @@ object Pregel extends Logging {
       s" but got ${maxIterations}")
 
     var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
+    val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+      checkpointInterval, graph.vertices.sparkContext)
+    graphCheckpointer.update(g)
+
     // compute the messages
-    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg).cache()
+    val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+      checkpointInterval, graph.vertices.sparkContext)
+    messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
     var activeMessages = messages.count()
+
     // Loop
     var prevG: Graph[VD, ED] = null
     var i = 0
     while (activeMessages > 0 && i < maxIterations) {
       // Receive the messages and update the vertices.
       prevG = g
       g = g.joinVertices(messages)(vprog).cache()
```
--- End diff --

Ok. I agree with your observation here, @viirya.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100638130

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala ---
```diff
@@ -76,7 +77,7 @@ import org.apache.spark.storage.StorageLevel
  *
  * TODO: Move this out of MLlib?
```
--- End diff --

This comment should be removed.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100632148

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/impl/PeriodicRDDCheckpointerSuite.scala ---
```diff
@@ -23,7 +23,7 @@
 import org.apache.spark.{SparkContext, SparkFunSuite}
 import org.apache.spark.mllib.util.MLlibTestSparkContext
```
--- End diff --

Is there a reason this file isn't moved into the core codebase?
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100631975

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/impl/PeriodicGraphCheckpointerSuite.scala ---
```diff
@@ -21,6 +21,7 @@
 import org.apache.hadoop.fs.Path
```
--- End diff --

Is there a reason this test suite isn't moved into the GraphX codebase?
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100612840

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
```diff
@@ -123,16 +127,25 @@ object Pregel extends Logging {
       s" but got ${maxIterations}")
 
     var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
+    val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+      checkpointInterval, graph.vertices.sparkContext)
+    graphCheckpointer.update(g)
+
     // compute the messages
-    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg).cache()
+    val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+      checkpointInterval, graph.vertices.sparkContext)
+    messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
     var activeMessages = messages.count()
+
     // Loop
     var prevG: Graph[VD, ED] = null
     var i = 0
     while (activeMessages > 0 && i < maxIterations) {
       // Receive the messages and update the vertices.
       prevG = g
       g = g.joinVertices(messages)(vprog).cache()
```
--- End diff --

Actually, this may be more subtle than I thought. I'm going to think through this again.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100609529

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
```diff
@@ -123,16 +127,25 @@ object Pregel extends Logging {
       s" but got ${maxIterations}")
 
     var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
+    val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+      checkpointInterval, graph.vertices.sparkContext)
+    graphCheckpointer.update(g)
+
     // compute the messages
-    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg).cache()
+    val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+      checkpointInterval, graph.vertices.sparkContext)
+    messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
     var activeMessages = messages.count()
+
     // Loop
     var prevG: Graph[VD, ED] = null
     var i = 0
     while (activeMessages > 0 && i < maxIterations) {
       // Receive the messages and update the vertices.
       prevG = g
       g = g.joinVertices(messages)(vprog).cache()
```
--- End diff --

I agree.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100608839 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala --- @@ -123,16 +127,25 @@ object Pregel extends Logging { s" but got ${maxIterations}") var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache() +val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED]( + checkpointInterval, graph.vertices.sparkContext) +graphCheckpointer.update(g) + // compute the messages -var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg) +var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg).cache() +val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)]( + checkpointInterval, graph.vertices.sparkContext) +messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]]) --- End diff -- I agree. What do you think, @dding3?
[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16797 BTW @budde, given that this represents a regression in behavior from previous versions of Spark, I think it is too generous of you to label the Jira issue as an "improvement" instead of a "bug". I would support you changing the type to "bug" if you want to.
[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16797 >> Like you said, users can still create a hive table with mixed-case-schema parquet/orc files, by hive or other systems like presto. This table is readable for hive, and for Spark prior to 2.1, because of the runtime schema inference But this is not intentional, and Spark should not support it as the data file schema and table schema mismatch. > > I will continue to argue strongly against reducing the number of usecases Spark SQL supports out of the box. While offering a migration command can offer a helpful optimization I don't think it is acceptable as the only option for the reasons I've detailed here. > > Simply put, I think relying on the presence of Spark-specific key/value pairs in the table properties in order for Spark SQL to function properly and assuming that Spark (or Spark users) can easily alter those properties to add the table schema is too brittle for large-scale production use. I would have to agree with @budde in this case. In versions of Spark prior to 2.1, an effort was made to reconcile metastore and file format case mismatching using the method `ParquetFileFormat.mergeMetastoreParquetSchema`. The code docs for that method state that here: https://github.com/apache/spark/blob/1b02f8820ddaf3f2a0e7acc9a7f27afc20683cca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L711-L719. I don't see anything here that suggests this was a "hack" or was intended to be removed in a later version. It seems we've simply broken compatibility with a certain class of Hive tables in Spark 2.1. Schema inference is very expensive, and doing it at query time on large tables was painful in versions prior to Spark 2.1 because all metadata files were read. But it seems some people were using it nonetheless and found it useful. At least in Spark 2.1, only the files for partitions read in a query will be read for schema inference. 
That would significantly enhance the schema inference performance at query time for partitioned tables. Incidentally, what happens when a program outside of Spark (such as Hive) updates the Hive metastore schema of a table with the embedded Spark SQL schema? Does Spark detect that change and update the embedded schema? Does it have to redo the schema inference across all files in the table?
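For readers following along, here is a toy sketch of the kind of case-insensitive reconciliation that `ParquetFileFormat.mergeMetastoreParquetSchema` performed. It uses plain name/type pairs in place of Spark's `StructType`; the object and method names below are illustrative, not Spark's actual API:

```scala
// Toy model: a schema is an ordered list of (name, dataType) pairs.
object SchemaReconcile {
  type Field = (String, String)

  // Keep the metastore's (lower-cased) view of the table, but take the
  // exact-cased name from the Parquet file schema so that column lookup
  // against the physical file succeeds.
  def reconcile(metastore: Seq[Field], parquet: Seq[Field]): Seq[Field] = {
    val byLower = parquet.map { case (n, _) => n.toLowerCase -> n }.toMap
    metastore.map { case (n, t) => (byLower.getOrElse(n.toLowerCase, n), t) }
  }
}
```

For example, `reconcile(Seq(("firstname", "string")), Seq(("firstName", "string")))` yields `Seq(("firstName", "string"))`, recovering the mixed-case name the file actually uses.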
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16775 @viirya I believe this PR meshes with the refactoring and application to pregel GraphX algorithms in #15125. Basically, it moves the periodic checkpointing code from mllib into core and uses it in GraphX to checkpoint long lineages. This is essential to scale GraphX to huge graphs, as described in my comment in the PR, and solves a very real problem for us. Can you take a look at that PR?
[GitHub] spark pull request #16785: [SPARK-19443][SQL] The function to generate const...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16785#discussion_r100364260 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala --- @@ -314,19 +322,29 @@ abstract class UnaryNode extends LogicalPlan { * expressions with the corresponding alias */ protected def getAliasedConstraints(projectList: Seq[NamedExpression]): Set[Expression] = { -var allConstraints = child.constraints.asInstanceOf[Set[Expression]] -projectList.foreach { - case a @ Alias(e, _) => -// For every alias in `projectList`, replace the reference in constraints by its attribute. -allConstraints ++= allConstraints.map(_ transform { - case expr: Expression if expr.semanticEquals(e) => -a.toAttribute -}) -allConstraints += EqualNullSafe(e, a.toAttribute) - case _ => // Don't change. -} - -allConstraints -- child.constraints +val relativeReferences = AttributeSet(projectList.collect { + case a: Alias => a +}.flatMap(_.references)) +val parAllConstraints = child.constraints.asInstanceOf[Set[Expression]].filter { constraint => + constraint.references.intersect(relativeReferences).nonEmpty +}.par +parAllConstraints.tasksupport = UnaryNode.taskSupport --- End diff -- Why are we using a custom task support instead of the default (which uses the global fork-join executor)?
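For context on the question, this is how a custom `TaskSupport` replaces the default global fork-join executor on a Scala parallel collection (Scala 2.11-era standard library, matching Spark at the time; the pool size here is arbitrary, chosen only for illustration):

```scala
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// By default, .par schedules tasks on the global fork-join executor.
// Assigning a custom TaskSupport bounds parallelism to a dedicated pool.
val data = (1 to 100).par
data.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))

// This map/sum now runs on the 4-thread pool instead of the global one.
val total = data.map(_ * 2).sum
```

The trade-off under discussion: a dedicated pool isolates this work from other users of the global executor, at the cost of extra threads and configuration.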
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r100360523 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala --- @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField} +import org.apache.spark.sql.types.StructField + +/** + * A Scala extractor that extracts the child expression and struct field from a [[GetStructField]]. + * This is in contrast to the [[GetStructField]] case class extractor which returns the field + * ordinal instead of the field itself. + */ +private[planning] object GetStructField2 { --- End diff -- Let's go with `GetStructFieldObject`.
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r100229358 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala --- @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField} +import org.apache.spark.sql.types.StructField + +/** + * A Scala extractor that extracts the child expression and struct field from a [[GetStructField]]. + * This is in contrast to the [[GetStructField]] case class extractor which returns the field + * ordinal instead of the field itself. + */ +private[planning] object GetStructField2 { --- End diff -- How about `GetStructFieldObject`? Or `GetStructFieldRef`?
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r100229300 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/SelectedField.scala --- @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.types._ + +/** + * A Scala extractor that builds a [[StructField]] from a Catalyst complex type + * extractor. This is like the opposite of [[ExtractValue#apply]]. + */ +object SelectedField { + def unapply(expr: Expression): Option[StructField] = { +// If this expression is an alias, work on its child instead +val unaliased = expr match { + case Alias(child, _) => child + case expr => expr +} +selectField(unaliased, None) + } + + /** + * Converts some chain of complex type extractors into a [[StructField]]. 
+ * + * @param expr the top-level complex type extractor + * @param fieldOpt the subfield of [[expr]], where relevent + */ + private def selectField(expr: Expression, fieldOpt: Option[StructField]): Option[StructField] = +expr match { + case AttributeReference(name, _, nullable, _) => +fieldOpt.map(field => StructField(name, StructType(Array(field)), nullable)) + case GetArrayItem(GetStructField2(child, field @ StructField(name, + ArrayType(_, arrayNullable), fieldNullable, _)), _) => +val childField = fieldOpt.map(field => StructField(name, ArrayType( + StructType(Array(field)), arrayNullable), fieldNullable)).getOrElse(field) +selectField(child, Some(childField)) + case GetArrayStructFields(child, --- End diff -- I've spent some time this week developing a few different solutions to this problem; however, none of them are very easy to understand or verify. I'm going to spend some more time working on a simpler solution before posting something back.
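As background on the extractor pattern in this thread, a companion-style `unapply` can expose a different view of a value than its case-class extractor does, which is exactly the role `GetStructField2` plays for `GetStructField`. A minimal standalone analogue on a toy AST (the classes below are illustrative stand-ins, not Spark's):

```scala
case class Field(name: String, dataType: String)
case class Struct(fields: Seq[Field])
// The generated case-class extractor yields (child: Struct, ordinal: Int).
case class GetField(child: Struct, ordinal: Int)

// A companion-style extractor yielding the Field itself rather than its
// ordinal, resolving the ordinal against the child's schema.
object GetFieldObject {
  def unapply(g: GetField): Option[(Struct, Field)] =
    g.child.fields.lift(g.ordinal).map(f => (g.child, f))
}

val hit = GetField(Struct(Seq(Field("first", "string"), Field("last", "string"))), 1) match {
  case GetFieldObject(_, f) => f.name
}
```

Here `hit` is `"last"`: the pattern match sees the resolved field, not the raw ordinal, which makes recursive schema-walking code like `selectField` much easier to write.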
[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16797 The proposal to restore schema inference with finer grained control on when it is performed sounds reasonable to me. The case I'm most interested in is turning off schema inference entirely, because we do not use parquet files with upper-case characters in their column names. BTW, what behavior do we expect if a parquet file has two columns whose lower-cased names are identical?
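On that last question, the ambiguity is easy to surface. A quick sketch (plain Scala, illustrative only; how Spark should resolve such files is exactly the open question) that finds column names colliding after lower-casing:

```scala
// Group names by their lower-cased form and keep only ambiguous groups.
def lowerCaseCollisions(names: Seq[String]): Map[String, Seq[String]] =
  names.groupBy(_.toLowerCase).filter(_._2.size > 1)

// e.g. lowerCaseCollisions(Seq("userId", "userid", "ts"))
//   == Map("userid" -> Seq("userId", "userid"))
```

Any case-insensitive resolution strategy has to either reject such files or pick a winner among the colliding columns.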
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r99174674 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/SelectedField.scala --- @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.types._ + +/** + * A Scala extractor that builds a [[StructField]] from a Catalyst complex type + * extractor. This is like the opposite of [[ExtractValue#apply]]. + */ +object SelectedField { + def unapply(expr: Expression): Option[StructField] = { +// If this expression is an alias, work on its child instead +val unaliased = expr match { + case Alias(child, _) => child + case expr => expr +} +selectField(unaliased, None) + } + + /** + * Converts some chain of complex type extractors into a [[StructField]]. 
+ * + * @param expr the top-level complex type extractor + * @param fieldOpt the subfield of [[expr]], where relevent + */ + private def selectField(expr: Expression, fieldOpt: Option[StructField]): Option[StructField] = +expr match { + case AttributeReference(name, _, nullable, _) => +fieldOpt.map(field => StructField(name, StructType(Array(field)), nullable)) + case GetArrayItem(GetStructField2(child, field @ StructField(name, + ArrayType(_, arrayNullable), fieldNullable, _)), _) => +val childField = fieldOpt.map(field => StructField(name, ArrayType( + StructType(Array(field)), arrayNullable), fieldNullable)).getOrElse(field) +selectField(child, Some(childField)) + case GetArrayStructFields(child, --- End diff -- I believe I have a fix for this, but I probably won't be able to post a new commit until early next week, as I'm working on a proposal for the Spark Summit RFP. Cheers.
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r98920657 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala --- @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField} +import org.apache.spark.sql.types.StructField + +/** + * A Scala extractor that extracts the child expression and struct field from a [[GetStructField]]. + * This is in contrast to the [[GetStructField]] case class extractor which returns the field + * ordinal instead of the field itself. + */ +private[planning] object GetStructField2 { --- End diff -- Agreed. I think the best name in this context is `GetStructField`, but that's already taken. I'll keep thinking about a good alternative.
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r98819150 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala --- @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField} +import org.apache.spark.sql.types.StructField + +/** + * A Scala extractor that extracts the child expression and struct field from a [[GetStructField]]. + * This is in contrast to the [[GetStructField]] case class extractor which returns the field + * ordinal instead of the field itself. + */ +private[planning] object GetStructField2 { --- End diff -- What do you mean by combining it with the existing case class extractor?
[GitHub] spark issue #16751: [SPARK-19409][BUILD] Bump parquet version to 1.8.2
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16751 FYI, there are at least two workarounds in the Spark codebase which can potentially be removed as a consequence of this upgrade. For example: https://github.com/apache/spark/blob/5de1737b02710e36f6804d2ae243d1aeb30a0b32/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L549-L558 and https://github.com/apache/spark/blob/ca6391637212814b7c0bd14c434a6737da17b258/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L175-L178 These come immediately to mind. There may be others. I think this PR would have been a good opportunity to remove these workarounds, but it's been closed and merged so that's water under the bridge.
[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16281 FYI, we've been using 1.9.0 patched with a fix for https://issues.apache.org/jira/browse/PARQUET-783 without problem.
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 LGTM. @srowen, can you recommend an mllib committer to review these changes? I'm not familiar with that team.
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 Hi @dding3. Thanks for working on this! I was able to rebase and apply your patch to our build of Spark 2.1 to successfully compute the connected components of a graph with 5.2 billion vertices, 3.7 billion edges and 2.4 billion connected components. It took 832 iterations, but without your patch it stalled around iteration number 330. Can you please rebase your patch? I will take up a review of this PR when you've had a chance to rebase. Cheers.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 > Does this take over #14957? If so, we might need Closes #14957 in the PR description for the merge script to close that one or let the author know this takes over that. I don't know. @xuanyuanking, how do you feel about this?
[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16499 Josh, can you take a look at this when you have a chance?
[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16499 @rxin, can you recommend someone I reach out to for help reviewing this PR?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 cc @rxin @ericl @cloud-fan @marmbrus I would love to get your feedback on this if you have the time.
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/16578 [SPARK-4502][SQL] Parquet nested column pruning (Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502)

## What changes were proposed in this pull request?

One of the hallmarks of a column-oriented data storage format is the ability to read data from a subset of columns, efficiently skipping reads from other columns. Spark has long had support for pruning unneeded top-level schema fields from the scan of a parquet file. For example, consider a table, `contacts`, backed by parquet with the following Spark SQL schema:

```
root
 |-- name: struct
 |    |-- first: string
 |    |-- last: string
 |-- address: string
```

Parquet stores this table's data in three physical columns: `name.first`, `name.last` and `address`. To answer the query

```SQL
select address from contacts
```

Spark will read only from the `address` column of parquet data. However, to answer the query

```SQL
select name.first from contacts
```

Spark will read `name.first` and `name.last` from parquet. This PR modifies Spark SQL to support a finer grain of schema pruning. With this patch, Spark reads only the `name.first` column to answer the previous query.

### Implementation

There are three main components of this patch. First, there is a `ParquetSchemaPruning` optimizer rule for gathering the required schema fields of a `PhysicalOperation` over a parquet file, constructing a new schema based on those required fields and rewriting the plan in terms of that pruned schema. The pruned schema fields are pushed down to the parquet requested read schema. `ParquetSchemaPruning` uses a new `ProjectionOverSchema` extractor for rewriting a catalyst expression in terms of a pruned schema.

Second, the `ParquetRowConverter` has been patched to ensure the ordinals of the parquet columns read are correct for the pruned schema.
`ParquetReadSupport` has been patched to address a compatibility mismatch between Spark's built in vectorized reader and the parquet-mr library's reader. Third, we introduce two new catalyst query transformations, `AggregateFieldExtractionPushdown` and `JoinFieldExtractionPushdown`, to support schema pruning in aggregation and join query plans. These rules extract field references in aggregations and joins respectively, push down aliases to those references and replace them with references to the pushed down aliases. They use a new `SelectedField` extractor that transforms a catalyst complex type extractor (the "selected field") into a corresponding `StructField`. ### Performance The performance difference in executing queries with this patch compared to master is related to the depth of the table schema and the query itself. At VideoAmp, one of our biggest tables stores OpenRTB bid requests we receive from our exchange partners. Our bid request table's schema closely follows the OpenRTB bid request object schema. Additionally, when we bid we save our response along with the request in the same table. We store these two objects as two top-level fields in our table. Therefore, all bid request and response data are contained within nested fields. For the purposes of measuring the performance impact of this patch, we ran some queries on our bid request table with the un-patched and patched master. We measured query execution time and the amount of data read from the underlying parquet files. I'll focus on a couple of benchmarks. (All benchmarks were run on an AWS EC2 cluster with four c3.8xl workers.) The first query I'll highlight is ```SQL select count(request.device.ip) from event.bid_request where ds=20161128 and h=0 ``` (Hopefully it's obvious what this query means.) On the un-patched master, this query ran in 2.7 minutes and read 34.3 GB of data. On the patched master, this query ran in 4 seconds and read 987.3 MB of data. 
We also ran a reporting-oriented query benchmark. I won't reproduce the query here, but it reads a larger subset of the bid request fields and joins against another table with a deeply nested schema. In addition to a join, we perform several aggregations in this query. On the un-patched master, this query ran in 3.4 minutes and read 34.6 GB of data. On the patched master, this query ran in 59 seconds and read 2.6 GB of data. ### Limitation Among the complex Spark SQL data types, this patch supports parquet column pruning of nested sequences of struct fields only. ## How was this patch tested? Care has been taken to ensure correctness and prevent regressions. This patch introduces over two dozen new unit tests and has been running on a production Spark 1.5 clust
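To make the pruning idea concrete, here is a minimal, self-contained Scala sketch of the schema transformation the optimizer rule performs. The `DataType`/`StructType`/`prune` names and the recursive algorithm below are simplified stand-ins for illustration only; they are not Spark's actual classes or its `ParquetSchemaPruning` rule.

```scala
// A toy model of nested-schema pruning. These types are simplified
// stand-ins, NOT Spark's DataType hierarchy.
sealed trait DataType
case object StringType extends DataType
case class StructType(fields: List[(String, DataType)]) extends DataType

// Keep only the fields reachable from the selected (dotted) paths,
// recursing into structs when a path selects a nested leaf.
def prune(schema: StructType, paths: List[List[String]]): StructType =
  StructType(schema.fields.flatMap { case (name, dt) =>
    val sub = paths.collect { case `name` :: rest => rest }
    if (sub.isEmpty) None // field not referenced by any path: drop it
    else dt match {
      case s: StructType if sub.exists(_.nonEmpty) =>
        Some(name -> prune(s, sub.filter(_.nonEmpty))) // prune nested fields
      case other =>
        Some(name -> other) // leaf (or whole struct) selected: keep it
    }
  })

// The `contacts` schema from the description above.
val contacts = StructType(List(
  "name" -> StructType(List("first" -> StringType, "last" -> StringType)),
  "address" -> StringType))

// `select name.first from contacts`: only name.first survives pruning.
println(prune(contacts, List(List("name", "first"))))
```

With this model, pruning for `select name.first` yields a schema containing only `name.first`, which mirrors what the patch pushes down to the parquet requested read schema.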
[GitHub] spark pull request #16514: [SPARK-19128] [SQL] Refresh Cache after Set Locat...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16514#discussion_r95490619

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala ---

```diff
@@ -555,6 +557,61 @@ class HiveDDLSuite
     }
   }

+  test("Alter Table Set Location") {
+    withTable("tab1", "tab2") {
+      val catalog = spark.sessionState.catalog
+      sql("CREATE TABLE tab1 using parquet AS SELECT 1 as a")
+      sql("CREATE TABLE tab2 using parquet AS SELECT 2 as a")
+      checkAnswer(spark.table("tab1"), Seq(Row(1)))
+      checkAnswer(spark.table("tab2"), Seq(Row(2)))
+      val metadataTab1 = catalog.getTableMetadata(TableIdentifier("tab1"))
+      val locTab1 = metadataTab1.storage.locationUri
+      val metadataTab2 = catalog.getTableMetadata(TableIdentifier("tab2"))
+      val locTab2 = metadataTab2.storage.locationUri
+      assert(locTab1.isDefined)
+      assert(locTab2.isDefined)
+      try {
+        sql(s"ALTER TABLE tab2 SET LOCATION '${locTab1.get}'")
+        checkAnswer(spark.table("tab2"), Seq(Row(1)))
+      } finally {
+        val root = new Path(locTab2.get)
+        val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
+        fs.delete(root, true)
```

--- End diff --

Why aren't we deleting `locTab1`, too?
[GitHub] spark pull request #16514: [SPARK-19128] [SQL] Refresh Cache after Set Locat...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16514#discussion_r95489960

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---

```diff
@@ -119,7 +119,30 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
       qualifiedTableName.database, qualifiedTableName.name)

     if (DDLUtils.isDatasourceTable(table)) {
-      val dataSourceTable = cachedDataSourceTables(qualifiedTableName)
+      val dataSourceTable =
+        cachedDataSourceTables(qualifiedTableName) match {
+          case l @ LogicalRelation(relation: HadoopFsRelation, _, _) =>
+            // Ignore the scheme difference when comparing the paths
+            val isSamePath =
+              table.storage.locationUri.isDefined && relation.location.rootPaths.size == 1 &&
+                table.storage.locationUri.get == relation.location.rootPaths.head.toUri.getPath
+            // If we have the same paths, same schema, and same partition spec,
+            // we will use the cached relation.
+            val useCached =
+              isSamePath &&
+                l.schema == table.schema &&
+                relation.bucketSpec == table.bucketSpec &&
+                relation.partitionSchema == table.partitionSchema
+            if (useCached) {
+              l
+            } else {
+              // If the cached relation is not updated, we invalidate it right away.
+              cachedDataSourceTables.invalidate(qualifiedTableName)
+              // Reload it from the external catalog
+              cachedDataSourceTables(qualifiedTableName)
+            }
+          case o => o
+        }
```

--- End diff --

Understood.
[GitHub] spark issue #16514: [SPARK-19128] [SQL] Refresh Cache after Set Location
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16514

> A good suggestion. Will do the code changes tomorrow.

Thanks! I look forward to seeing this. Thanks for taking this on.
[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16500#discussion_r95206030

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---

```diff
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
```

--- End diff --

Why is it safe to restrict this call to the case where `partition.nonEmpty`?
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r95066452

--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala ---

```diff
@@ -375,7 +375,8 @@ class BlockManagerReplicationSuite extends SparkFunSuite
       // Put the block into one of the stores
       val blockId = new TestBlockId(
         "block-with-" + storageLevel.description.replace(" ", "-").toLowerCase)
-      stores(0).putSingle(blockId, new Array[Byte](blockSize), storageLevel)
+      val testValue = Array.fill[Byte](blockSize)(1)
```

--- End diff --

Using an array of 1s instead of an array of 0s is my silly, paranoid, OCD way of adding a little extra entropy to the test. I think the chance that this change in test value will actually affect the outcome of this test is about 0%. I will revert to the original test value on request.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r95066296

--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala ---

```diff
@@ -387,12 +388,23 @@ class BlockManagerReplicationSuite extends SparkFunSuite
         testStore => blockLocations.contains(testStore.blockManagerId.executorId)
       }.foreach { testStore =>
         val testStoreName = testStore.blockManagerId.executorId
-        assert(
-          testStore.getLocalValues(blockId).isDefined, s"$blockId was not found in $testStoreName")
-        testStore.releaseLock(blockId)
```

--- End diff --

N.B. We no longer need the `releaseLock` call because we exhaust the iterator returned by `getLocalValues`.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/16499

[SPARK-17204][CORE] Fix replicated off heap storage (Jira: https://issues.apache.org/jira/browse/SPARK-17204)

## What changes were proposed in this pull request?

There are a couple of bugs in the `BlockManager` with respect to support for replicated off-heap storage. First, the locally-stored off-heap byte buffer is disposed of when it is replicated. It should not be. Second, the replica byte buffers are stored as heap byte buffers instead of direct byte buffers even when the storage level memory mode is off-heap. This PR addresses both of these problems.

## How was this patch tested?

`BlockManagerReplicationSuite` was enhanced to fill in the coverage gaps. It now fails if either of the bugs in this PR exist.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public spark-17204-replicated_off_heap_storage

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16499.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16499

commit e49aeca23ff463fbd9a9cc4db99078c466bfbd56
Author: Michael Allman
Date: 2017-01-02T01:28:12Z
Fix a couple of bugs in replicated off-heap storage

commit 40b6b97ca9013544702433dc5dc388c054daf41a
Author: Michael Allman
Date: 2017-01-07T21:17:33Z
Shore-up BlockManagerReplicationSuite to identify a couple of bugs with off-heap storage replication
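The heap-versus-direct distinction behind the second bug can be sketched with plain `java.nio` buffers. This is an illustrative model only: `MemoryMode` and `allocateReplica` are invented names, not `BlockManager` code. The point is that a replica of an off-heap block should be materialized as a direct buffer rather than silently defaulting to a heap one.

```scala
import java.nio.ByteBuffer

// Illustrative memory modes standing in for Spark's storage-level memory mode.
sealed trait MemoryMode
case object OnHeap extends MemoryMode
case object OffHeap extends MemoryMode

// Allocate a replica's buffer according to the block's memory mode.
// The bug described above amounts to always taking the OnHeap branch.
def allocateReplica(data: Array[Byte], mode: MemoryMode): ByteBuffer = {
  val buf = mode match {
    case OffHeap => ByteBuffer.allocateDirect(data.length) // off-heap (direct)
    case OnHeap  => ByteBuffer.allocate(data.length)       // on-heap
  }
  buf.put(data)
  buf.flip()
  buf
}

val offHeapReplica = allocateReplica(Array.fill[Byte](16)(1), OffHeap)
println(offHeapReplica.isDirect) // true
```

`ByteBuffer.isDirect` is the observable difference: a direct buffer lives outside the JVM heap, which is what an off-heap storage level asks for.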
[GitHub] spark issue #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrdering` gro...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15480

Hi @lw-lin. Just FYI we use this patch at VideoAmp and would love to see it merged in. I notice this PR has gone a little cold. I'm sorry I can't offer much concrete help, but I wanted to check with you to see if you'll be able to pick this up again soon. Cheers.
[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16281

I'd love to see frequent, conservative patch releases. From my experience, parquet bugs cause significant trouble for downstream consumers. For example, we encountered a data corruption bug writing parquet format V2 that made a good deal of that data unreadable. (I can't recall the issue number, but it was reported and fixed.) As another example, we recently upgraded some of our Spark clusters to use parquet-mr 1.9.0, because PARQUET-363 introduced a bug into one of our Spark 2.x patches. When we switched to 1.9, we found https://issues.apache.org/jira/browse/PARQUET-783, which breaks things in a different way. We needed a fix, so we forked 1.9 internally. FWIW, we haven't found any other issues using parquet-mr 1.9.0 with Spark 2.1.
[GitHub] spark issue #16274: [SPARK-18853][SQL] Project (UnaryNode) is way too aggres...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16274

Outside of some comment grooming, LGTM.
[GitHub] spark pull request #16274: [SPARK-18853][SQL] Project (UnaryNode) is way too...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16274#discussion_r92448788

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/ArrayType.scala ---

```diff
@@ -78,10 +78,10 @@ case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataT
     ("containsNull" -> containsNull)

   /**
-   * The default size of a value of the ArrayType is 100 * the default size of the element type.
-   * (We assume that there are 100 elements).
+   * The default size of a value of the ArrayType is 1 * the default size of the element type.
```

--- End diff --

Suggest: `The default size of a value of the ArrayType is the default size of the element type.`
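The sizing convention under review can be sketched as follows. This is an illustrative model of default-size estimation, with invented `DT`/`ArrayT` names, not Spark's `DataType.defaultSize` implementation; it only shows the change from a 100-element assumption to a single-element one.

```scala
// Illustrative default-size model: under the change being reviewed, an
// array is estimated at one element's size rather than 100 elements'.
sealed trait DT { def defaultSize: Int }
case object IntT extends DT { val defaultSize = 4 }
case class ArrayT(element: DT) extends DT {
  // Old estimate was 100 * element.defaultSize; new estimate is one element.
  def defaultSize: Int = element.defaultSize
}

println(ArrayT(IntT).defaultSize) // 4
```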
[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16122

@wangyum Thank you for this important bug fix! :)
[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16122

I believe that may be the case, unfortunately. At least, I have no immediate ideas otherwise.

> On Dec 7, 2016, at 5:25 PM, Eric Liang wrote:
>
> I see. In that case I think manual testing may be sufficient.
>
> > On Wed, Dec 7, 2016, 5:00 PM Michael Allman wrote:
> >
> > I think that's exactly what I tried and got the `NoSuchMethodException`.
> >
> > > On Dec 7, 2016, at 3:35 PM, Eric Liang wrote:
> > >
> > > I did some digging into HiveClientImpl and Hive.java, and I think it would be pretty safe to call Hive.get() within the scope of a client.withHiveState { }. That call sets the thread-local hive to that used by the client before executing the passed function block.
[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16122

I think that's exactly what I tried and got the `NoSuchMethodException`.

> On Dec 7, 2016, at 3:35 PM, Eric Liang wrote:
>
> I did some digging into HiveClientImpl and Hive.java, and I think it would be pretty safe to call Hive.get() within the scope of a client.withHiveState { }. That call sets the thread-local hive to that used by the client before executing the passed function block.
[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16122

I'm not sure configuration-level rollback will guarantee an absence of interactions with other tests. For one thing, I think we need to create and clean up an independent metastore directory for Derby. Otherwise we may end up with https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69669/testReport/junit/org.apache.spark.sql.hive/HiveExternalCatalogSuite/_It_is_not_a_test_/. I guess I'd like a stronger assurance that we're doing this the right way. The only call to `Hive.get` in the Hive codebase I can find is in a private method in `HiveClientImpl`. Spark seems to be going to great lengths to discourage calling that method directly. I've also thought this might be better suited to testing in the thriftserver codebase, where there seems to be a stronger sense of client/server separation in testing. I haven't gone too far in investigating that direction.
[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16122

It does, yes. My concern around that test is that its behavior doesn't seem to be independent of other tests. For example, the value of `hive.getConf.getBoolean(tryDirectSqlConfVar.varname, tryDirectSqlConfVar.defaultBoolVal)` seemed to vary depending on what it was set to in the other test in that suite.
[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16122

There may be a way to do it, but the classloader tricks being used in the hive client implementation are beyond my comprehension.
[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16122

I tried that but got a `NoSuchMethodException` in the call to `getMSC`.
[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16122

I haven't been able to get a proper unit test environment running where the embedded metastore conf is different from the client conf. I did validate that Spark without this patch failed to execute a query on a table with an integer type partition column, filtering on that column, where the metastore has direct sql access disabled, whereas Spark with this patch works and behaves as expected.

@wangyum I don't believe your test uses Hive in a way that's compatible with Spark. Can you please remove it?

@ericl Any ideas on how to unit test the case where the client and metastore have different configurations?
[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16122

@wangyum I'm going to see if I can help with the unit testing on this PR.
[GitHub] spark issue #15998: [SPARK-18572][SQL] Add a method `listPartitionNames` to ...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15998

I suspect this is a spurious, unrelated test failure. Can we get a rebuild, please?
[GitHub] spark issue #15998: [SPARK-18572][SQL] Add a method `listPartitionNames` to ...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15998

@gatorsmile I've applied your patch and reverted the change I made in the previous commit to work around that defect. The failed test now passes for me. Let's see what Jenkins says.
[GitHub] spark issue #15998: [SPARK-18572][SQL] Add a method `listPartitionNames` to ...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15998

> #15998 (comment) found a bug. If this PR will not be merged to Spark 2.1 branch, I think we need to submit a separate PR for resolving the bug.

I would like to get this patch into Spark 2.1 as it's a scalability issue for partitioned tables.