[GitHub] spark issue #17009: [SPARK-19674][SQL] Ignore driver accumulator updates don'...

2017-03-24 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/17009
  
I created a Spark 2.1 backport at #17418.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17418: [SPARK-19674][SQL] Ignore driver accumulator upda...

2017-03-24 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/17418

[SPARK-19674][SQL] Ignore driver accumulator updates don't belong to …

[SPARK-19674][SQL] Ignore driver accumulator updates that don't belong to the 
execution when merging all accumulator updates

N.B. This is a backport to branch-2.1 of #17009.

## What changes were proposed in this pull request?
In `SQLListener.getExecutionMetrics`, driver accumulator updates that don't 
belong to the execution should be ignored when merging all accumulator updates, 
to prevent a `NoSuchElementException`.
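The essence of the fix can be sketched outside of Spark (illustrative names; the real code lives in `SQLListener.getExecutionMetrics`): drop any accumulator update whose id is not registered with the execution before merging, so the later lookup cannot throw.

```scala
// Simplified stand-in for the merge in SQLListener.getExecutionMetrics.
// `knownAccumulatorIds` models the accumulators registered with the
// execution; updates from stray driver accumulators are filtered out
// instead of triggering a NoSuchElementException on lookup.
object AccumulatorMergeSketch {
  def mergeMetrics(
      knownAccumulatorIds: Set[Long],
      updates: Seq[(Long, Long)]): Map[Long, Long] = {
    updates
      .filter { case (accId, _) => knownAccumulatorIds.contains(accId) }
      .groupBy { case (accId, _) => accId }
      .map { case (accId, vs) => accId -> vs.map(_._2).sum }
  }
}
```

Merging `Seq((1L, 10L), (2L, 5L), (99L, 7L))` against known ids `Set(1L, 2L)` yields `Map(1L -> 10L, 2L -> 5L)`; the update for accumulator 99 is ignored rather than crashing the merge.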

## How was this patch tested?
Updated unit test.

Author: Carson Wang 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VideoAmp/spark-public spark-19674-backport_2.1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17418.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17418









[GitHub] spark issue #17009: [SPARK-19674][SQL] Ignore driver accumulator updates don'...

2017-03-24 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/17009
  
It looks like this will fix a bug we're experiencing in Spark 2.1. Given 
that this PR is a bug fix, any chance we can get a backport into `branch-2.1`? 
I can work on it myself if @carsonwang is unable.





[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-03-24 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
Rebased to latest master.





[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-03-24 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
I'm still working actively on this PR (as I have time), but I wanted to 
share that I will be away and unavailable from tonight, March 24th, until 
Tuesday, April 11th. If you post a comment to this PR during that timeframe, 
I'll respond as soon as I can after I return.

Cheers.





[GitHub] spark pull request #17390: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-03-24 Thread mallman
Github user mallman closed the pull request at:

https://github.com/apache/spark/pull/17390





[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage

2017-03-22 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16499
  
Backport PR is #17390 





[GitHub] spark issue #17390: [SPARK-17204][CORE] Fix replicated off heap storage

2017-03-22 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/17390
  
This is a backport of #16499 to `branch-2.0`.





[GitHub] spark pull request #17390: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-03-22 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/17390

[SPARK-17204][CORE] Fix replicated off heap storage

(Jira: https://issues.apache.org/jira/browse/SPARK-17204)

There are a couple of bugs in the `BlockManager` with respect to support 
for replicated off-heap storage. First, the locally-stored off-heap byte buffer 
is disposed of when it is replicated. It should not be. Second, the replica 
byte buffers are stored as heap byte buffers instead of direct byte buffers 
even when the storage level memory mode is off-heap. This PR addresses both of 
these problems.

`BlockManagerReplicationSuite` was enhanced to fill in the coverage gaps. 
It now fails if either of the bugs fixed in this PR is present.
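The second bug can be illustrated with plain `java.nio` (a sketch, not the actual `BlockManager` code; the `offHeap` flag stands in for the storage level's memory mode): when the replica is stored off-heap, heap-backed bytes must be copied into a direct buffer.

```scala
import java.nio.ByteBuffer

// Illustrative sketch: store a replica's bytes, copying into a direct
// buffer when the target memory mode is off-heap and the incoming buffer
// is heap-backed (the bug was storing heap buffers in that case).
object ReplicaStoreSketch {
  def storeReplica(bytes: ByteBuffer, offHeap: Boolean): ByteBuffer =
    if (offHeap && !bytes.isDirect) {
      val copy = ByteBuffer.allocateDirect(bytes.remaining())
      copy.put(bytes.duplicate()) // duplicate() so the caller's position is untouched
      copy.flip()
      copy
    } else {
      bytes
    }
}
```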

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VideoAmp/spark-public spark-17204-replicated_off_heap_storage-2.0_backport

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17390.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17390









[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-03-22 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
@felixcheung We haven't heard from @jkbradley or @ankurdave in a week. 
Should we give them more time or can we merge to master?





[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage

2017-03-21 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16499
  
> @mallman can you send a new PR for 2.0? thanks!

Will do. Do I need to open a new JIRA ticket for that?





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-03-20 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r107028767
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1048,7 +1065,7 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+bytesToReplicate.unmap()
--- End diff --

The best I think we can expect from such a flag is a hint. The constructor 
of a `ChunkedByteBuffer` will not always know if the underlying byte buffers 
are memory mapped or not. For example, see 
https://github.com/apache/spark/blob/bec6b16c1900fe93def89cc5eb51cbef498196cb/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L326.

In this case, `data.nioByteBuffer()` might or might not be memory-mapped.

I still think the current patch set is the best overall of the options 
we've considered. I can add a unit test for `StorageUtils.unmap` to ensure it 
works as expected (only disposing memory-mapped buffers). I can also add an 
`if` clause around the call to `bytesToReplicate.unmap()` to ensure it is only 
called when the replication storage level is off-heap. This will ensure the 
reflective call on the `fd` field only occurs for off-heap replication. Given 
that off-heap replication is currently broken, I doubt anyone will notice a 
performance degradation... Besides that, I suspect that network and disk IO 
performance will dominate the reflective method call's performance.





[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-03-16 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
@viirya A month has gone by since my last update. I've added much more 
comprehensive coverage to the `SelectedFieldSuite`; however, I haven't yet fixed 
the `SelectedField` extractor to pass all of the tests. All of the failures are 
related to handling path expressions that include `GetArrayStructFields` 
extractors. There are many complicated cases, and they are proving quite a 
challenge to resolve comprehensively.

I hope to spend some more time on this by the end of next week. I would 
love to push an update by then. After next week I will be away for two weeks.

Cheers.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-03-16 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r106552361
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1048,7 +1065,7 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+bytesToReplicate.unmap()
--- End diff --

@cloud-fan I explored the approach of making the `MemoryStore` return a 
`ChunkedByteBuffer` that cannot be disposed; however, I don't think there's a 
clean way to safely support that behavior. In essence, if the memory manager 
marks a buffer as indisposable when it returns it to the block manager, then 
that buffer cannot be evicted later. Adding code to handle this behavior 
correctly was looking rather messy, so I abandoned the effort.

At this point, I think that explicitly separating `unmap` and `dispose` 
methods is still the best way to resolve this issue.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-03-07 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r104770778
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1048,7 +1065,7 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+bytesToReplicate.unmap()
--- End diff --

I would like to explore this further to ensure it would really work well, 
but I like your idea, with one caveat: I think we should avoid using the 
`disposed` var for this purpose. Using it this way would introduce ambiguity in 
its meaning when `disposed` is true. In addition to its current meaning, it 
could also mean "this buffer is not disposed, but we don't want to dispose it".

Instead, I suggest keeping the usage of `disposed` as-is and adding an 
additional var, e.g. `indisposable`, which makes the `dispose` method itself a 
no-op.
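A minimal sketch of the `indisposable` idea (names illustrative; the real `ChunkedByteBuffer` carries the actual chunk state): the flag turns `dispose` into a no-op without overloading the meaning of `disposed`.

```scala
// Sketch of a buffer wrapper where `indisposable` makes dispose() a no-op,
// leaving `disposed` unambiguous: it is true only if memory was released.
class IndisposableBufferSketch(val indisposable: Boolean = false) {
  private var _disposed = false
  def disposed: Boolean = _disposed

  def dispose(): Unit = {
    if (!indisposable && !_disposed) {
      // the real implementation would release the underlying chunks here
      _disposed = true
    }
  }
}
```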





[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage

2017-03-07 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16499
  
I looked into simply cleaning up the `StorageUtils.dispose` method to only 
dispose memory-mapped buffers. However, I found legitimate uses of that method 
to dispose of direct, non-memory-mapped buffers, so I kept the behavior of that 
method as-is. Instead, I added a new method, `unmap`, which disposes of 
memory-mapped buffers *only*, and added calls to that method where appropriate. 
At the end of the day, I found only one case where we specifically wanted 
"unmap" behavior instead of the broader disposal behavior. (That case was the 
one causing corruption of replicated blocks in the first place.)

I also found a new memory management bug in `BlockManager`, introduced by 
the encryption support: the original code disposes of a buffer unsafely. I 
think part of the problem is that the documentation of the 
`ChunkedByteBuffer.toByteBuffer` method uses the word "copy" in describing what 
that method does. I expanded that method's documentation and made it more 
precise, to clarify that sometimes the method *does not* return a copy of the 
data. In those cases, it is not safe to dispose the returned buffer.
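The aliasing hazard can be illustrated with a simplified stand-in for `ChunkedByteBuffer.toByteBuffer` (a sketch, not the actual Spark class): with a single chunk, the returned buffer shares memory with the original, so disposing it would corrupt the source.

```scala
import java.nio.ByteBuffer

// Simplified stand-in: a single chunk is returned as a duplicate (shared
// memory, NOT a copy, so it is unsafe to dispose); multiple chunks are
// merged into a fresh heap buffer (a true copy).
class ChunksSketch(chunks: Array[ByteBuffer]) {
  def toByteBuffer: ByteBuffer =
    if (chunks.length == 1) {
      chunks(0).duplicate() // view over the same memory
    } else {
      val merged = ByteBuffer.allocate(chunks.map(_.remaining()).sum)
      chunks.foreach(c => merged.put(c.duplicate()))
      merged.flip()
      merged
    }
}
```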

I found that there were no uses of `ByteBufferInputStream` where automatic 
buffer disposal was called for. Therefore, I dropped that support from the 
class to guard against unsafe usage. If someone _really_ wants to use automatic 
buffer disposal in `ByteBufferInputStream`, they can, carefully, re-add that 
support. I think it's generally unsafe, and, like I said, nothing in the 
codebase was using it anyway, except where it was used incorrectly.





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-03-07 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
@felixcheung Can you take another look and merge if LGTY? I think we've 
addressed all of the open reviewer requests.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-28 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r103602156
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+if (!level.useOffHeap) {
--- End diff --

Ok. I'll see what else is calling that method to validate that a fix there 
won't break something else, and I'll add a unit test to validate that calling 
`StorageUtils.dispose` on a direct byte buffer that isn't memory mapped doesn't 
actually dispose it.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-28 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r103509816
  
--- Diff: docs/graphx-programming-guide.md ---
@@ -708,7 +708,9 @@ messages remaining.
 > messaging function.  These constraints allow additional optimization 
within GraphX.
 
 The following is the type signature of the [Pregel 
operator][GraphOps.pregel] as well as a *sketch*
-of its implementation (note calls to graph.cache have been removed):
+of its implementation (note: to avoid stackOverflowError due to long 
lineage chains, graph and 
--- End diff --

Personally, I don't think we need to include the checkpointer in the 
implementation sketch. I think it's more an implementation detail than an 
essential part of the algorithm. I think it's enough to simply document that 
GraphX's implementation of Pregel includes checkpointing to avoid unbounded RDD 
lineages.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-28 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r103509038
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+if (!level.useOffHeap) {
--- End diff --

Hi guys. Sorry for the delay on the update. I started down the path that I 
proposed and it resulted in too many awkward changes in method signatures 
downstream. I don't think this is a viable step forward.

As another option, we could dispose the buffer if and only if it has a 
non-null `fd` field. Since that field is private, we would have to access it by 
reflection. I'd also include a unit test to validate that the field exists as 
expected, to guard against internal changes in future versions of Java.

On a broader level, I wonder whether callers of the `ChunkedByteBuffer.dispose` 
method understand that it will dispose of non-memory-mapped direct buffers. The 
documentation of that method suggests it's only supposed to dispose of 
memory-mapped files in the strict sense (those actually memory-mapped against a 
file descriptor by the OS). If other methods are accidentally calling this 
method on non-memory-mapped direct buffers, that suggests to me we need to push 
the fix into that method (or, rather, the `StorageUtils.dispose()` method). 
What do you think of that?
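The `fd`-based check proposed above might look like the following sketch. The field name is JDK-internal, and the reflective access can fail under the module system on newer JDKs, which is exactly why a guarding unit test is suggested.

```scala
import java.nio.{ByteBuffer, MappedByteBuffer}

// On HotSpot, plain direct buffers also extend MappedByteBuffer, so an
// instanceof check alone cannot tell them apart from true file mappings;
// the private `fd` field (null for plain direct buffers) can.
object MappedCheckSketch {
  def isActuallyMapped(buffer: ByteBuffer): Boolean =
    buffer.isInstanceOf[MappedByteBuffer] && {
      try {
        val fdField = classOf[MappedByteBuffer].getDeclaredField("fd")
        fdField.setAccessible(true)
        fdField.get(buffer) != null
      } catch {
        // JDK 9+ may refuse the reflective access; fall back to "not mapped"
        case _: Exception => false
      }
    }
}
```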





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r102293681
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -317,6 +317,9 @@ private[spark] class BlockManager(
 
   /**
* Put the block locally, using the given storage level.
+   *
+   * '''Important!''' Callers must not mutate or release the data buffer 
underlying `bytes`. Doing
--- End diff --

I've explicitly documented the fact that callers must not mutate the data 
buffers underlying `bytes`.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r102293219
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -843,7 +852,15 @@ private[spark] class BlockManager(
   false
   }
 } else {
-  memoryStore.putBytes(blockId, size, level.memoryMode, () => 
bytes)
+  val memoryMode = level.memoryMode
+  memoryStore.putBytes(blockId, size, memoryMode, () => {
+if (memoryMode == MemoryMode.OFF_HEAP &&
+bytes.chunks.exists(buffer => !buffer.isDirect)) {
--- End diff --

I've refined this check for copying `bytes` to skip copying when the 
underlying buffers are already direct.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r102292537
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -122,27 +125,39 @@ object Pregel extends Logging {
 require(maxIterations > 0, s"Maximum number of iterations must be 
greater than 0," +
   s" but got ${maxIterations}")
 
-var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, 
initialMsg)).cache()
+val checkpointInterval = graph.vertices.sparkContext.getConf
+  .getInt("spark.graphx.pregel.checkpointInterval", 10)
--- End diff --

I would also suggest incorporating this change into the Spark 2.2 release 
notes under a section for GraphX, but I don't see where these notes are 
maintained. The release notes for 2.1 are published at 
http://spark.apache.org/releases/spark-release-2-1-0.html, but I can't find 
them in the repo. Anybody know how these are generated or how to contribute to 
them? Is there another repo for this documentation?





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r102290972
  
--- Diff: docs/graphx-programming-guide.md ---
@@ -708,7 +708,9 @@ messages remaining.
 > messaging function.  These constraints allow additional optimization 
within GraphX.
 
 The following is the type signature of the [Pregel 
operator][GraphOps.pregel] as well as a *sketch*
-of its implementation (note calls to graph.cache have been removed):
+of its implementation (note: to avoid stackOverflowError due to long 
lineage chains, graph and 
--- End diff --

> I guess it would help reviewers to understand if you could explain those 
changes in the impl sketch is required to use checkpoint

Sorry @felixcheung, I don't understand what you mean here. Can you 
elaborate?





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r102272981
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
   false
   }
 } else {
-  memoryStore.putBytes(blockId, size, level.memoryMode, () => 
bytes)
+  val memoryMode = level.memoryMode
+  memoryStore.putBytes(blockId, size, memoryMode, () => {
+if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

So do a copy if and only if `memoryMode == MemoryMode.OFF_HEAP` and `bytes` 
is not direct and `bytes` is not a memory-mapped file?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r102271763
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+if (!level.useOffHeap) {
--- End diff --

There does not appear to be a robust way to check whether `bytesToReplicate` is 
an mmapped file or not. Perhaps `doGetLocalBytes` should return a tuple 
`(ChunkedByteBuffer, Boolean)` where the second element of the tuple is `true` 
if and only if the buffer is an mmapped file. Thoughts?





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-20 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
@dding3, thank you for your continued patience and dedication to this PR, 
despite the continued change requests. We are getting closer to a merge.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-20 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r102057156
  
--- Diff: docs/graphx-programming-guide.md ---
@@ -708,7 +708,9 @@ messages remaining.
 > messaging function.  These constraints allow additional optimization 
within GraphX.
 
 The following is the type signature of the [Pregel 
operator][GraphOps.pregel] as well as a *sketch*
-of its implementation (note calls to graph.cache have been removed):
+of its implementation (note: to avoid stackOverflowError due to long 
lineage chains, graph and 
--- End diff --

Also, we should document the default value and that checkpointing can be 
disabled by setting this config property to -1.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-20 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r102056438
  
--- Diff: docs/graphx-programming-guide.md ---
@@ -708,7 +708,9 @@ messages remaining.
 > messaging function.  These constraints allow additional optimization 
within GraphX.
 
 The following is the type signature of the [Pregel 
operator][GraphOps.pregel] as well as a *sketch*
-of its implementation (note calls to graph.cache have been removed):
+of its implementation (note: to avoid stackOverflowError due to long 
lineage chains, graph and 
--- End diff --

I think @viirya is suggesting we remove all references to checkpointing 
from this implementation sketch. @viirya, correct me if I'm wrong. I suggest 
reverting these changes to `graphx-programming-guide.md`. Just add a brief note 
that GraphX periodically checkpoints the graph and message lineages, and this 
checkpoint interval can be configured with the 
`spark.graphx.pregel.checkpointInterval` Spark configuration property.

@viirya @felixcheung Should we document this config property in the Spark 
configuration document as well?





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-20 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r102053462
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -154,7 +169,9 @@ object Pregel extends Logging {
   // count the iteration
   i += 1
 }
-messages.unpersist(blocking = false)
+messageCheckpointer.unpersistDataSet()
--- End diff --

Sorry, I don't understand this change. Why do we replace

```scala
messages.unpersist(blocking = false)
```

with

```scala
messageCheckpointer.unpersistDataSet()
```

Especially because this adds a new public method to `PeriodicCheckpointer` 
that no other code has needed before.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-17 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r101819321
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala
 ---
@@ -87,10 +87,10 @@ private[mllib] class PeriodicGraphCheckpointer[VD, ED](
 
   override protected def persist(data: Graph[VD, ED]): Unit = {
 if (data.vertices.getStorageLevel == StorageLevel.NONE) {
-  data.vertices.persist()
--- End diff --

We need to use `cache` because `persist` does not honor the default storage 
level requested when constructing the graph. Only `cache` does that. It's 
confusing, but true. To verify this for yourself, change these values to 
`persist` and run the `PeriodicGraphCheckpointerSuite` tests.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-17 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r101818789
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala ---
@@ -362,12 +362,14 @@ class GraphOps[VD: ClassTag, ED: ClassTag](graph: 
Graph[VD, ED]) extends Seriali
   def pregel[A: ClassTag](
   initialMsg: A,
   maxIterations: Int = Int.MaxValue,
+  checkpointInterval: Int = 25,
--- End diff --

Good point in both cases. I'm wondering if the periodic Pregel 
checkpointing operation should be controlled by a config value instead. 
Suppose, for example, we create a config key 
`spark.graphx.pregel.checkpointInterval`. If it's set to the default value (0, 
to retain existing functionality), checkpointing is disabled. If it's set to a 
positive integer `n`, checkpointing is performed every `n` iterations. 
Other values are invalid.
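
The proposed interval semantics can be sketched as a small standalone check. The class and method names below are hypothetical, and a real implementation would read `spark.graphx.pregel.checkpointInterval` from `SparkConf` rather than take an int parameter:

```java
// Hypothetical sketch of the proposed checkpoint-interval semantics:
// 0 (the default) disables checkpointing, a positive value n checkpoints
// every n iterations, and anything else is rejected.
public class CheckpointIntervalSketch {
    static boolean shouldCheckpoint(int interval, int iteration) {
        if (interval < 0) {
            // Only 0 (disabled) or a positive interval is valid.
            throw new IllegalArgumentException(
                "checkpointInterval must be >= 0, got " + interval);
        }
        return interval > 0 && iteration % interval == 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldCheckpoint(0, 100));  // false: disabled
        System.out.println(shouldCheckpoint(25, 50));  // true: every 25th iteration
        System.out.println(shouldCheckpoint(25, 26));  // false
    }
}
```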





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-17 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101809872
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+if (!level.useOffHeap) {
--- End diff --

Allocating a direct byte buffer creates a `java.nio.DirectByteBuffer`, 
which is in turn a subclass of `java.nio.MappedByteBuffer`. So calling 
`dispose()` will dispose direct buffers, too.
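
This class relationship is easy to confirm outside Spark. In OpenJDK the internal `DirectByteBuffer` class extends `MappedByteBuffer`, so a direct buffer passes an `instanceof MappedByteBuffer` check — an implementation detail of the JDK, not a documented guarantee:

```java
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;

public class BufferHierarchy {
    public static void main(String[] args) {
        // A direct buffer allocated on native memory...
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        // ...is, in OpenJDK, an instance of MappedByteBuffer, because the
        // internal DirectByteBuffer class extends MappedByteBuffer.
        System.out.println(direct.isDirect());                   // true
        System.out.println(direct instanceof MappedByteBuffer);  // true
    }
}
```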





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-17 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101809099
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
   false
   }
 } else {
-  memoryStore.putBytes(blockId, size, level.memoryMode, () => 
bytes)
+  val memoryMode = level.memoryMode
+  memoryStore.putBytes(blockId, size, memoryMode, () => {
+if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

I'm not sure what you're suggesting I do here.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101675669
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+if (!level.useOffHeap) {
--- End diff --

So maybe use `putBlockStatus.storageLevel` instead?





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101675576
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
   false
   }
 } else {
-  memoryStore.putBytes(blockId, size, level.memoryMode, () => 
bytes)
+  val memoryMode = level.memoryMode
+  memoryStore.putBytes(blockId, size, memoryMode, () => {
+if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

Is it safe to store a ref to `bytes` if the memory is stored off-heap? If 
the caller changes the values in that memory or frees it, the buffer we put in 
the memory store will be affected. We don't want that kind of side-effect.





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-16 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
LGTM. @felixcheung are we good to merge?





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-16 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
@dding3 I submitted a PR against your `cp2_pregel` branch. If you merge 
that PR into your branch, it will be reflected in this PR. This is my PR: 
https://github.com/dding3/spark/pull/1.





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-16 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
> I think @mallman is saying he would merge changes to @dding3 branch

Yes, or I could do them in a follow up PR. Or @dding3 could do them without 
my PR. I'm not hung up on getting credit.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101602604
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
   false
   }
 } else {
-  memoryStore.putBytes(blockId, size, level.memoryMode, () => 
bytes)
+  val memoryMode = level.memoryMode
+  memoryStore.putBytes(blockId, size, memoryMode, () => {
+if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

Never mind about checking `memoryMode.useOffHeap`; I got that confused with 
`StorageLevel`. There are actually only two values of `MemoryMode`: 
`MemoryMode.OFF_HEAP` and `MemoryMode.ON_HEAP`.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101602014
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
   false
   }
 } else {
-  memoryStore.putBytes(blockId, size, level.memoryMode, () => 
bytes)
+  val memoryMode = level.memoryMode
+  memoryStore.putBytes(blockId, size, memoryMode, () => {
+if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

(Actually, I think we need to check `memoryMode.useOffHeap` here.)

Assume `memoryMode.useOffHeap` is true. We have two cases to consider:

1. `bytes` is on-heap. In this case, we need to copy it into a new direct 
buffer, and that's what we're doing here.
2. `bytes` is off-heap. In this case, we assume that the caller upstream is 
managing the memory underlying `bytes`, and 
`bytes.copy(Platform.allocateDirectBuffer)` becomes a defensive copy. If the 
caller is not managing this memory, I would call that a bug in the caller's 
behavior.

In either case, I believe we should be calling 
`bytes.copy(Platform.allocateDirectBuffer)` when `memoryMode.useOffHeap` is 
true.

BTW, in my experience tracing this code in the debugger, `bytes` has always 
been an on-heap buffer.
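
The defensive-copy argument in case 2 can be illustrated with plain `java.nio` buffers. The helper below is a hypothetical stand-in for `bytes.copy(Platform.allocateDirectBuffer)`: after the copy, mutations to the caller's buffer no longer reach the stored bytes:

```java
import java.nio.ByteBuffer;

public class DefensiveCopy {
    // Hypothetical helper mirroring bytes.copy(Platform.allocateDirectBuffer):
    // copy the source buffer's contents into a freshly allocated direct buffer.
    static ByteBuffer copyToDirect(ByteBuffer src) {
        ByteBuffer copy = ByteBuffer.allocateDirect(src.remaining());
        copy.put(src.duplicate());  // duplicate() leaves src's position untouched
        copy.flip();
        return copy;
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.allocate(4);
        src.put(0, (byte) 1);
        ByteBuffer copy = copyToDirect(src);
        src.put(0, (byte) 99);           // caller later mutates its buffer...
        System.out.println(copy.get(0)); // ...but the defensive copy still reads 1
    }
}
```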





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101592331
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+if (!level.useOffHeap) {
--- End diff --

I think the name of the `ChunkedByteBuffer.dispose()` method is confusing. 
It actually only attempts to dispose a so-called "memory mapped" buffer. 
On-heap buffers are not memory mapped, therefore this is a no-op for them.

On the other hand, when the storage level uses off-heap memory in this 
context, `bytesToReplicate` is a reference to the actual off-heap memory 
buffer. Disposing of this buffer will erase it from the local memory store. 
Obviously, this is not the desired behavior. So we add the guard for off-heap 
memory buffers here.

As far as I can tell, there is no storage level for which 
`bytesToReplicate.dispose()` would actually do anything. However, technically, 
if `bytesToReplicate` were memory-mapped but not direct, this would dispose of 
that memory. Would we even want that behavior? Overall, this `finally` clause 
attempts to destroy the data we get from `doGetLocalBytes()`. That does not 
seem safe or correct, because we do not want to destroy the data stored 
locally. Therefore, we should consider removing this `finally` clause 
entirely. What do you think?





[GitHub] spark issue #16942: [SPARK-19611][SQL] Introduce configurable table schema i...

2017-02-15 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16942
  
Weird. I think I've seen that behavior once before. But I think the only 
time I force-push on a PR is to rebase. Maybe that's the only kind of force 
push allowed for GitHub PRs.





[GitHub] spark issue #16942: [SPARK-19611][SQL] Introduce configurable table schema i...

2017-02-15 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16942
  
Force pushing your branch shouldn't close the PR. You didn't close it 
manually?





[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-14 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
@viirya I've added a commit to address some of your feedback. I will have 
another commit to address the others, but I'm not sure when I'll have it in. 
Hopefully by the end of next week.





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-14 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
Our connected components computation completed successfully, with 
performance as expected. I've created a PR against @dding3's PR branch to 
incorporate a couple of simple things. Then I think we're good to go.





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-13 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
@dding3 These latest changes look great. I'll run our big connected 
components job today and report back.





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-10 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
@viirya @dding3 I'm going to rerun our big connected components computation 
with the changes I've suggested to validate that it still performs and 
completes as expected. Given the time required to complete this calculation 
(10+ hours), I might not get to it until next week.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100641170
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala
 ---
@@ -87,10 +88,7 @@ private[mllib] class PeriodicGraphCheckpointer[VD, ED](
 
   override protected def persist(data: Graph[VD, ED]): Unit = {
 if (data.vertices.getStorageLevel == StorageLevel.NONE) {
-  data.vertices.persist()
-}
-if (data.edges.getStorageLevel == StorageLevel.NONE) {
-  data.edges.persist()
+  data.persist()
--- End diff --

I enhanced the persistence tests in `PeriodicGraphCheckpointSuite` to check 
that the storage level requested in the graph construction is the storage level 
seen after persistence. Both this version and the original version of this 
method failed that unit test.

The graph's vertex and edge RDDs are somewhat peculiar in that `.cache()` 
and `.persist()` do not do the same thing, unlike other RDDs. While 
`.cache()` honors the default storage level specified at graph construction 
time, `.persist()` always caches with the `MEMORY_ONLY` storage level.

At any rate, getting the `PeriodicGraphCheckpointer` to honor the default 
storage level specified at graph construction time requires changing these 
method calls from `persist()` to `cache()`.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100640256
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
   s" but got ${maxIterations}")
 
 var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, 
initialMsg)).cache()
+val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+  checkpointInterval, graph.vertices.sparkContext)
+graphCheckpointer.update(g)
+
 // compute the messages
-var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, 
mergeMsg).cache()
+val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+  checkpointInterval, graph.vertices.sparkContext)
+messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
--- End diff --

Actually, for the sake of simplicity and consistency, I'm going to suggest 
we keep the checkpointer update calls but remove all `.cache()` calls. The 
`update` calls persist the underlying data, making the calls to `.cache()` 
unnecessary.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100638292
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
   s" but got ${maxIterations}")
 
 var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, 
initialMsg)).cache()
+val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+  checkpointInterval, graph.vertices.sparkContext)
+graphCheckpointer.update(g)
+
 // compute the messages
-var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, 
mergeMsg).cache()
+val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+  checkpointInterval, graph.vertices.sparkContext)
+messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
 var activeMessages = messages.count()
+
 // Loop
 var prevG: Graph[VD, ED] = null
 var i = 0
 while (activeMessages > 0 && i < maxIterations) {
   // Receive the messages and update the vertices.
   prevG = g
   g = g.joinVertices(messages)(vprog).cache()
--- End diff --

Ok. I agree with your observation here, @viirya.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100638130
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala
 ---
@@ -76,7 +77,7 @@ import org.apache.spark.storage.StorageLevel
  *
  * TODO: Move this out of MLlib?
--- End diff --

This comment should be removed.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100632148
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/impl/PeriodicRDDCheckpointerSuite.scala
 ---
@@ -23,7 +23,7 @@ import org.apache.spark.{SparkContext, SparkFunSuite}
 import org.apache.spark.mllib.util.MLlibTestSparkContext
--- End diff --

Is there a reason this file isn't moved into the core codebase?



[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100631975
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/impl/PeriodicGraphCheckpointerSuite.scala
 ---
@@ -21,6 +21,7 @@ import org.apache.hadoop.fs.Path
 
--- End diff --

Is there a reason this test suite isn't moved into the GraphX codebase?



[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100612840
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
   s" but got ${maxIterations}")
 
 var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, 
initialMsg)).cache()
+val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+  checkpointInterval, graph.vertices.sparkContext)
+graphCheckpointer.update(g)
+
 // compute the messages
-var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, 
mergeMsg).cache()
+val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+  checkpointInterval, graph.vertices.sparkContext)
+messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
 var activeMessages = messages.count()
+
 // Loop
 var prevG: Graph[VD, ED] = null
 var i = 0
 while (activeMessages > 0 && i < maxIterations) {
   // Receive the messages and update the vertices.
   prevG = g
   g = g.joinVertices(messages)(vprog).cache()
--- End diff --

Actually, this may be more subtle than I thought. I'm going to think 
through this again.



[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100609529
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
   s" but got ${maxIterations}")
 
 var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, 
initialMsg)).cache()
+val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+  checkpointInterval, graph.vertices.sparkContext)
+graphCheckpointer.update(g)
+
 // compute the messages
-var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, 
mergeMsg).cache()
+val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+  checkpointInterval, graph.vertices.sparkContext)
+messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
 var activeMessages = messages.count()
+
 // Loop
 var prevG: Graph[VD, ED] = null
 var i = 0
 while (activeMessages > 0 && i < maxIterations) {
   // Receive the messages and update the vertices.
   prevG = g
   g = g.joinVertices(messages)(vprog).cache()
--- End diff --

I agree.



[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100608839
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
   s" but got ${maxIterations}")
 
 var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, 
initialMsg)).cache()
+val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+  checkpointInterval, graph.vertices.sparkContext)
+graphCheckpointer.update(g)
+
 // compute the messages
-var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, 
mergeMsg).cache()
+val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+  checkpointInterval, graph.vertices.sparkContext)
+messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
--- End diff --

I agree. What do you think, @dding3?



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-09 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16797
  
BTW @budde, given that this represents a regression in behavior from 
previous versions of Spark, I think it is too generous of you to label the Jira 
issue as an "improvement" instead of a "bug". I would support you changing the 
type to "bug" if you want to.



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-09 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16797
  
>> Like you said, users can still create a Hive table with 
mixed-case-schema Parquet/ORC files, by Hive or other systems like Presto. This 
table is readable for Hive, and for Spark prior to 2.1, because of the runtime 
schema inference. But this is not intentional, and Spark should not support it, 
as the data file schema and table schema mismatch.
>
> I will continue to argue strongly against reducing the number of use cases 
Spark SQL supports out of the box. While offering a migration command can offer 
a helpful optimization, I don't think it is acceptable as the only option for 
the reasons I've detailed here.
>
> Simply put, I think relying on the presence of Spark-specific key/value 
pairs in the table properties in order for Spark SQL to function properly and 
assuming that Spark (or Spark users) can easily alter those properties to add 
the table schema is too brittle for large-scale production use.

I would have to agree with @budde in this case. In versions of Spark prior 
to 2.1, an effort was made to reconcile metastore and file format case 
mismatching using the method `ParquetFileFormat.mergeMetastoreParquetSchema`. 
The code docs for that method are here: 
https://github.com/apache/spark/blob/1b02f8820ddaf3f2a0e7acc9a7f27afc20683cca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L711-L719.
 I don't see anything there that suggests this was a "hack" or was intended to 
be removed in a later version. It seems we've simply broken compatibility with 
a certain class of Hive tables in Spark 2.1.

Schema inference is very expensive, and doing it at query time on large 
tables was painful in versions prior to Spark 2.1 because all metadata files 
were read. But it seems some people were using it nonetheless and found it 
useful. At least in Spark 2.1, only the files for partitions read in a query 
will be read for schema inference. That would significantly enhance the schema 
inference performance at query time for partitioned tables.
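For illustration only, the kind of case reconciliation `mergeMetastoreParquetSchema` performed can be sketched in plain Scala (no Spark APIs; the function name and signature here are hypothetical): match each lower-cased metastore column name against the mixed-case name preserved in the Parquet footer, and fall back to the metastore name when there is no match or the match is ambiguous.

```scala
// Hypothetical sketch, not Spark's actual method: reconcile lower-cased
// metastore column names with the mixed-case names from a Parquet footer.
def reconcileNames(
    metastoreNames: Seq[String],
    parquetNames: Seq[String]): Seq[String] = {
  // Index the Parquet names by their lower-cased form.
  val byLower = parquetNames.groupBy(_.toLowerCase)
  metastoreNames.map { m =>
    byLower.get(m.toLowerCase) match {
      case Some(Seq(single)) => single // unambiguous match: use the Parquet casing
      case Some(_)           => m      // names differing only in case: keep the metastore name
      case None              => m      // no Parquet counterpart: keep the metastore name
    }
  }
}

// Metastore stores lower case; the Parquet file preserved the original casing.
val reconciled = reconcileNames(Seq("userid", "eventtime"), Seq("userId", "eventTime"))
```

Resolving against the file schema this way is exactly what requires reading footers, which is why it is expensive at query time.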

Incidentally, what happens when a program outside of Spark (such as Hive) 
updates the Hive metastore schema of a table with the embedded Spark SQL 
schema? Does Spark detect that change and update the embedded schema? Does it 
have to redo the schema inference across all files in the table?



[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-09 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16775
  
@viirya I believe this PR meshes with the refactoring and application to 
Pregel-based GraphX algorithms in #15125. Basically, it moves the periodic 
checkpointing code from mllib into core and uses it in GraphX to checkpoint 
long lineages. This is essential to scale GraphX to huge graphs, as described 
in my comment in the PR, and solves a very real problem for us. Can you take a 
look at that PR?
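For readers following along, the pattern behind `PeriodicGraphCheckpointer`/`PeriodicRDDCheckpointer` can be sketched without Spark. This is a hypothetical, simplified class: the real checkpointers also unpersist older datasets and clean up old checkpoint files.

```scala
// Simplified analogue of Spark's periodic checkpointers: every
// `checkpointInterval` updates, checkpoint the newest dataset so the
// lineage can be truncated instead of growing with each iteration.
class PeriodicCheckpointer[T](checkpointInterval: Int, doCheckpoint: T => Unit) {
  private var updateCount = 0

  def update(newData: T): Unit = {
    updateCount += 1
    if (checkpointInterval > 0 && updateCount % checkpointInterval == 0) {
      doCheckpoint(newData) // in Spark this would call newData.checkpoint() and materialize it
    }
  }

  def numUpdates: Int = updateCount
}

// Demo: with interval 3, iterations 3, 6 and 9 trigger a checkpoint.
val checkpointed = scala.collection.mutable.ArrayBuffer.empty[Int]
val cp = new PeriodicCheckpointer[Int](3, n => checkpointed += n)
(1 to 10).foreach(i => cp.update(i))
```

Bounding the lineage this way is what keeps long iterative jobs (like the 832-iteration connected components run mentioned above) from stalling.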



[GitHub] spark pull request #16785: [SPARK-19443][SQL] The function to generate const...

2017-02-09 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16785#discussion_r100364260
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -314,19 +322,29 @@ abstract class UnaryNode extends LogicalPlan {
* expressions with the corresponding alias
*/
   protected def getAliasedConstraints(projectList: Seq[NamedExpression]): 
Set[Expression] = {
-var allConstraints = child.constraints.asInstanceOf[Set[Expression]]
-projectList.foreach {
-  case a @ Alias(e, _) =>
-// For every alias in `projectList`, replace the reference in 
constraints by its attribute.
-allConstraints ++= allConstraints.map(_ transform {
-  case expr: Expression if expr.semanticEquals(e) =>
-a.toAttribute
-})
-allConstraints += EqualNullSafe(e, a.toAttribute)
-  case _ => // Don't change.
-}
-
-allConstraints -- child.constraints
+val relativeReferences = AttributeSet(projectList.collect {
+  case a: Alias => a
+}.flatMap(_.references))
+val parAllConstraints = 
child.constraints.asInstanceOf[Set[Expression]].filter { constraint =>
+  constraint.references.intersect(relativeReferences).nonEmpty
+}.par
+parAllConstraints.tasksupport = UnaryNode.taskSupport
--- End diff --

Why are we using a custom task support instead of the default (which uses 
the global fork-join executor)?



[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-09 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r100360523
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala
 ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
GetStructField}
+import org.apache.spark.sql.types.StructField
+
+/**
+ * A Scala extractor that extracts the child expression and struct field 
from a [[GetStructField]].
+ * This is in contrast to the [[GetStructField]] case class extractor 
which returns the field
+ * ordinal instead of the field itself.
+ */
+private[planning] object GetStructField2 {
--- End diff --

Let's go with `GetStructFieldObject`.
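To illustrate the distinction for readers, here is a self-contained sketch with toy classes (not Catalyst's) of an extractor that resolves the ordinal to the field itself:

```scala
// Toy model: the case-class extractor for GetField yields the raw ordinal,
// while a custom unapply can resolve it to the field, which is what the
// pruning logic actually wants to inspect.
case class Field(name: String, dataType: String)
case class Struct(fields: Vector[Field])
case class GetField(child: Struct, ordinal: Int) // mimics Catalyst's GetStructField

// Analogue of the proposed GetStructFieldObject: extract (child, field).
object GetFieldObject {
  def unapply(g: GetField): Option[(Struct, Field)] =
    g.child.fields.lift(g.ordinal).map(f => (g.child, f))
}

val struct = Struct(Vector(Field("id", "long"), Field("name", "string")))
val fieldName = GetField(struct, 1) match {
  case GetFieldObject(_, field) => field.name // the pattern binds the field, not an Int
  case _ => "<missing>"
}
```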



[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-08 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r100229358
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala
 ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
GetStructField}
+import org.apache.spark.sql.types.StructField
+
+/**
+ * A Scala extractor that extracts the child expression and struct field 
from a [[GetStructField]].
+ * This is in contrast to the [[GetStructField]] case class extractor 
which returns the field
+ * ordinal instead of the field itself.
+ */
+private[planning] object GetStructField2 {
--- End diff --

How about `GetStructFieldObject`? Or `GetStructFieldRef`?



[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-08 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r100229300
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/SelectedField.scala
 ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+/**
+ * A Scala extractor that builds a [[StructField]] from a Catalyst complex 
type
+ * extractor. This is like the opposite of [[ExtractValue#apply]].
+ */
+object SelectedField {
+  def unapply(expr: Expression): Option[StructField] = {
+// If this expression is an alias, work on its child instead
+val unaliased = expr match {
+  case Alias(child, _) => child
+  case expr => expr
+}
+selectField(unaliased, None)
+  }
+
+  /**
+   * Converts some chain of complex type extractors into a [[StructField]].
+   *
+   * @param expr the top-level complex type extractor
+   * @param fieldOpt the subfield of [[expr]], where relevant
+   */
+  private def selectField(expr: Expression, fieldOpt: 
Option[StructField]): Option[StructField] =
+expr match {
+  case AttributeReference(name, _, nullable, _) =>
+fieldOpt.map(field => StructField(name, StructType(Array(field)), 
nullable))
+  case GetArrayItem(GetStructField2(child, field @ StructField(name,
+  ArrayType(_, arrayNullable), fieldNullable, _)), _) =>
+val childField = fieldOpt.map(field => StructField(name, ArrayType(
+  StructType(Array(field)), arrayNullable), 
fieldNullable)).getOrElse(field)
+selectField(child, Some(childField))
+  case GetArrayStructFields(child,
--- End diff --

I've spent some time this week developing a few different solutions to this 
problem, however none of them are very easy to understand or verify. I'm going 
to spend some more time working on a simpler solution before posting something 
back.



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16797
  
The proposal to restore schema inference with finer grained control on when 
it is performed sounds reasonable to me. The case I'm most interested in is 
turning off schema inference entirely, because we do not use parquet files with 
upper-case characters in their column names.

BTW, what behavior do we expect if a parquet file has two columns whose 
lower-cased names are identical?
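The ambiguous case can be detected mechanically. A minimal sketch (a hypothetical helper, not a Spark API):

```scala
// Group column names by their lower-cased form and keep only the groups
// that collide, i.e. the names a case-insensitive resolver cannot
// disambiguate without extra policy (error out, pick first, etc.).
def lowerCaseCollisions(columnNames: Seq[String]): Map[String, Seq[String]] =
  columnNames.groupBy(_.toLowerCase).filter { case (_, names) => names.length > 1 }

val dupes = lowerCaseCollisions(Seq("userId", "userid", "eventTime"))
```

Whatever option is added would need to define its behavior for such files.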



[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-02 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r99174674
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/SelectedField.scala
 ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+/**
+ * A Scala extractor that builds a [[StructField]] from a Catalyst complex 
type
+ * extractor. This is like the opposite of [[ExtractValue#apply]].
+ */
+object SelectedField {
+  def unapply(expr: Expression): Option[StructField] = {
+// If this expression is an alias, work on its child instead
+val unaliased = expr match {
+  case Alias(child, _) => child
+  case expr => expr
+}
+selectField(unaliased, None)
+  }
+
+  /**
+   * Converts some chain of complex type extractors into a [[StructField]].
+   *
+   * @param expr the top-level complex type extractor
+   * @param fieldOpt the subfield of [[expr]], where relevant
+   */
+  private def selectField(expr: Expression, fieldOpt: 
Option[StructField]): Option[StructField] =
+expr match {
+  case AttributeReference(name, _, nullable, _) =>
+fieldOpt.map(field => StructField(name, StructType(Array(field)), 
nullable))
+  case GetArrayItem(GetStructField2(child, field @ StructField(name,
+  ArrayType(_, arrayNullable), fieldNullable, _)), _) =>
+val childField = fieldOpt.map(field => StructField(name, ArrayType(
+  StructType(Array(field)), arrayNullable), 
fieldNullable)).getOrElse(field)
+selectField(child, Some(childField))
+  case GetArrayStructFields(child,
--- End diff --

I believe I have a fix for this, but I probably won't be able to post a new 
commit until early next week—I'm working on a proposal for the Spark Summit 
RFP.

Cheers.



[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-01 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r98920657
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala
 ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
GetStructField}
+import org.apache.spark.sql.types.StructField
+
+/**
+ * A Scala extractor that extracts the child expression and struct field 
from a [[GetStructField]].
+ * This is in contrast to the [[GetStructField]] case class extractor 
which returns the field
+ * ordinal instead of the field itself.
+ */
+private[planning] object GetStructField2 {
--- End diff --

Agreed. I think the best name in this context is `GetStructField`, but 
that's already taken. I'll keep thinking about a good alternative.



[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-01-31 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r98819150
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala
 ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
GetStructField}
+import org.apache.spark.sql.types.StructField
+
+/**
+ * A Scala extractor that extracts the child expression and struct field 
from a [[GetStructField]].
+ * This is in contrast to the [[GetStructField]] case class extractor 
which returns the field
+ * ordinal instead of the field itself.
+ */
+private[planning] object GetStructField2 {
--- End diff --

What do you mean by combining it with the existing case class extractor?



[GitHub] spark issue #16751: [SPARK-19409][BUILD] Bump parquet version to 1.8.2

2017-01-31 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16751
  
FYI, there are at least two workarounds in the Spark codebase which can 
potentially be removed as a consequence of this upgrade. For example:


https://github.com/apache/spark/blob/5de1737b02710e36f6804d2ae243d1aeb30a0b32/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L549-L558

and


https://github.com/apache/spark/blob/ca6391637212814b7c0bd14c434a6737da17b258/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L175-L178

These come immediately to mind. There may be others.

I think this PR would have been a good opportunity to remove these 
workarounds, but it's been closed and merged so that's water under the bridge.



[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2017-01-31 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16281
  
FYI, we've been using 1.9.0 patched with a fix for 
https://issues.apache.org/jira/browse/PARQUET-783 without problem.



[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-01-18 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
LGTM. @srowen, can you recommend an MLlib committer to review these 
changes? I'm not familiar with that team.





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-01-17 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
Hi @dding3. Thanks for working on this! I was able to rebase and apply your 
patch to our build of Spark 2.1 to successfully compute the connected 
components of a graph with 5.2 billion vertices, 3.7 billion edges and 2.4 
billion connected components. It took 832 iterations, but without your patch it 
stalled around iteration number 330.

Can you please rebase your patch? I will take up a review of this PR when 
you've had a chance to rebase.

Cheers.





[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-01-16 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
> Does this take over #14957? If so, we might need Closes #14957 in the PR 
description for the merge script to close that one or let the author know this 
takes over that.

I don't know. @xuanyuanking, how do you feel about this?





[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage

2017-01-13 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16499
  
Josh, can you take a look at this when you have a chance?





[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage

2017-01-13 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16499
  
@rxin, can you recommend someone I reach out to for help reviewing this PR?





[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-01-13 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
cc @rxin @ericl @cloud-fan 

@marmbrus I would love to get your feedback on this if you have the time.





[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-01-13 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/16578

[SPARK-4502][SQL] Parquet nested column pruning

(Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502)

## What changes were proposed in this pull request?

One of the hallmarks of a column-oriented data storage format is the 
ability to read data from a subset of columns, efficiently skipping reads from 
other columns. Spark has long had support for pruning unneeded top-level schema 
fields from the scan of a parquet file. For example, consider a table, 
`contacts`, backed by parquet with the following Spark SQL schema:

```
root
 |-- name: struct
 ||-- first: string
 ||-- last: string
 |-- address: string
```

Parquet stores this table's data in three physical columns: `name.first`, 
`name.last` and `address`. To answer the query

```SQL
select address from contacts
```

Spark will read only from the `address` column of parquet data. However, to 
answer the query

```SQL
select name.first from contacts
```

Spark will read `name.first` and `name.last` from parquet.

This PR modifies Spark SQL to support finer-grained schema pruning. With 
this patch, Spark reads only the `name.first` column to answer the previous 
query.
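The pruning described above can be sketched in a few lines. This is an 
illustrative model only, not Spark's actual implementation (which is Scala 
operating on Catalyst `StructType` expressions); `prune_schema` and the 
dict-based schema encoding are hypothetical:

```python
# Illustrative sketch of nested schema pruning: given a nested schema and
# the dotted field paths a query selects, build the pruned schema that a
# Parquet reader would be asked for. Schemas are modeled as dicts mapping
# field name -> nested dict (struct) or leaf type string.

def prune_schema(schema, selected_paths):
    """Return the minimal sub-schema covering the selected paths."""
    pruned = {}
    for path in selected_paths:
        parts = path.split(".")
        src, dst = schema, pruned
        for i, part in enumerate(parts):
            node = src[part]
            if i == len(parts) - 1 or not isinstance(node, dict):
                dst[part] = node  # leaf field, or a whole struct selected
                break
            dst = dst.setdefault(part, {})  # descend, creating the struct
            src = node
    return pruned

contacts = {
    "name": {"first": "string", "last": "string"},
    "address": "string",
}

# select name.first from contacts -> only name.first is read
print(prune_schema(contacts, ["name.first"]))
# -> {'name': {'first': 'string'}}
```

Selecting `address` or the whole `name` struct works the same way: only the 
named subtree survives into the requested read schema.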

### Implementation

There are three main components of this patch. First, there is a 
`ParquetSchemaPruning` optimizer rule for gathering the required schema fields 
of a `PhysicalOperation` over a parquet file, constructing a new schema based 
on those required fields and rewriting the plan in terms of that pruned schema. 
The pruned schema fields are pushed down to the parquet requested read schema. 
`ParquetSchemaPruning` uses a new `ProjectionOverSchema` extractor for 
rewriting a catalyst expression in terms of a pruned schema.

Second, the `ParquetRowConverter` has been patched to ensure the ordinals 
of the parquet columns read are correct for the pruned schema. 
`ParquetReadSupport` has been patched to address a compatibility mismatch 
between Spark's built in vectorized reader and the parquet-mr library's reader.

Third, we introduce two new catalyst query transformations, 
`AggregateFieldExtractionPushdown` and `JoinFieldExtractionPushdown`, to 
support schema pruning in aggregation and join query plans. These rules extract 
field references in aggregations and joins respectively, push down aliases to 
those references and replace them with references to the pushed down aliases. 
They use a new `SelectedField` extractor that transforms a catalyst complex 
type extractor (the "selected field") into a corresponding `StructField`.

### Performance

The performance difference in executing queries with this patch compared to 
master is related to the depth of the table schema and the query itself. At 
VideoAmp, one of our biggest tables stores OpenRTB bid requests we receive from 
our exchange partners. Our bid request table's schema closely follows the 
OpenRTB bid request object schema. Additionally, when we bid we save our 
response along with the request in the same table. We store these two objects 
as two top-level fields in our table. Therefore, all bid request and response 
data are contained within nested fields.

For the purposes of measuring the performance impact of this patch, we ran 
some queries on our bid request table with the un-patched and patched master. 
We measured query execution time and the amount of data read from the 
underlying parquet files. I'll focus on a couple of benchmarks. (All benchmarks 
were run on an AWS EC2 cluster with four c3.8xl workers.) The first query I'll 
highlight is

```SQL
select count(request.device.ip) from event.bid_request where ds=20161128 
and h=0
```

(Hopefully it's obvious what this query means.) On the un-patched master, 
this query ran in 2.7 minutes and read 34.3 GB of data. On the patched master, 
this query ran in 4 seconds and read 987.3 MB of data.

We also ran a reporting-oriented query benchmark. I won't reproduce the 
query here, but it reads a larger subset of the bid request fields and joins 
against another table with a deeply nested schema. In addition to a join, we 
perform several aggregations in this query. On the un-patched master, this 
query ran in 3.4 minutes and read 34.6 GB of data. On the patched master, this 
query ran in 59 seconds and read 2.6 GB of data.

### Limitation

Among the complex Spark SQL data types, this patch supports parquet column 
pruning of nested sequences of struct fields only.

## How was this patch tested?

Care has been taken to ensure correctness and prevent regressions. This 
patch introduces over two dozen new unit tests and has been running on a 
production Spark 1.5 cluster.

[GitHub] spark pull request #16514: [SPARK-19128] [SQL] Refresh Cache after Set Locat...

2017-01-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16514#discussion_r95490619
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -555,6 +557,61 @@ class HiveDDLSuite
 }
   }
 
+  test("Alter Table Set Location") {
+withTable("tab1", "tab2") {
+  val catalog = spark.sessionState.catalog
+  sql("CREATE TABLE tab1 using parquet AS SELECT 1 as a")
+  sql("CREATE TABLE tab2 using parquet AS SELECT 2 as a")
+  checkAnswer(spark.table("tab1"), Seq(Row(1)))
+  checkAnswer(spark.table("tab2"), Seq(Row(2)))
+  val metadataTab1 = catalog.getTableMetadata(TableIdentifier("tab1"))
+  val locTab1 = metadataTab1.storage.locationUri
+  val metadataTab2 = catalog.getTableMetadata(TableIdentifier("tab2"))
+  val locTab2 = metadataTab2.storage.locationUri
+  assert(locTab1.isDefined)
+  assert(locTab2.isDefined)
+  try {
+sql(s"ALTER TABLE tab2 SET LOCATION '${locTab1.get}'")
+checkAnswer(spark.table("tab2"), Seq(Row(1)))
+  } finally {
+val root = new Path(locTab2.get)
+val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
+fs.delete(root, true)
--- End diff --

Why aren't we deleting `locTab1`, too?





[GitHub] spark pull request #16514: [SPARK-19128] [SQL] Refresh Cache after Set Locat...

2017-01-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16514#discussion_r95489960
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -119,7 +119,30 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
   qualifiedTableName.database, qualifiedTableName.name)
 
 if (DDLUtils.isDatasourceTable(table)) {
-  val dataSourceTable = cachedDataSourceTables(qualifiedTableName)
+  val dataSourceTable =
+cachedDataSourceTables(qualifiedTableName) match {
+  case l @ LogicalRelation(relation: HadoopFsRelation, _, _) =>
+// Ignore the scheme difference when comparing the paths
+val isSamePath =
+  table.storage.locationUri.isDefined && 
relation.location.rootPaths.size == 1 &&
+table.storage.locationUri.get == 
relation.location.rootPaths.head.toUri.getPath
+// If we have the same paths, same schema, and same partition 
spec,
+// we will use the cached relation.
+val useCached =
+  isSamePath &&
+  l.schema == table.schema &&
+  relation.bucketSpec == table.bucketSpec &&
+  relation.partitionSchema == table.partitionSchema
+if (useCached) {
+  l
+} else {
+  // If the cached relation is not updated, we invalidate it 
right away.
+  cachedDataSourceTables.invalidate(qualifiedTableName)
+  // Reload it from the external catalog
+  cachedDataSourceTables(qualifiedTableName)
+}
+  case o => o
+}
--- End diff --

Understood.





[GitHub] spark issue #16514: [SPARK-19128] [SQL] Refresh Cache after Set Location

2017-01-09 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16514
  
> A good suggestion. Will do the code changes tomorrow. Thanks!

I look forward to seeing this. Thanks for taking this on.





[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...

2017-01-09 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16500#discussion_r95206030
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
 
 // Invalidate the cache.
 sqlContext.sharedState.cacheManager.invalidateCache(table)
-
sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+if (partition.nonEmpty) {
+  
sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+}
--- End diff --

Why is it safe to restrict this call to the case where `partition.nonEmpty`?





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-01-07 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r95066452
  
--- Diff: 
core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala 
---
@@ -375,7 +375,8 @@ class BlockManagerReplicationSuite extends SparkFunSuite
   // Put the block into one of the stores
   val blockId = new TestBlockId(
 "block-with-" + storageLevel.description.replace(" ", 
"-").toLowerCase)
-  stores(0).putSingle(blockId, new Array[Byte](blockSize), 
storageLevel)
+  val testValue = Array.fill[Byte](blockSize)(1)
--- End diff --

Using an array of 1s instead of an array of 0s is my silly, paranoid, OCD 
way of adding a little extra entropy to the test. I think the chance that this 
change in test value will actually affect the outcome of this test is about 0%. 
I will revert to the original test value on request.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-01-07 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r95066296
  
--- Diff: 
core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala 
---
@@ -387,12 +388,23 @@ class BlockManagerReplicationSuite extends 
SparkFunSuite
 testStore => 
blockLocations.contains(testStore.blockManagerId.executorId)
   }.foreach { testStore =>
 val testStoreName = testStore.blockManagerId.executorId
-assert(
-  testStore.getLocalValues(blockId).isDefined, s"$blockId was not 
found in $testStoreName")
-testStore.releaseLock(blockId)
--- End diff --

N.B. We no longer need the `releaseLock` call because we exhaust the 
iterator returned by `getLocalValues`.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-01-07 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/16499

[SPARK-17204][CORE] Fix replicated off heap storage

(Jira: https://issues.apache.org/jira/browse/SPARK-17204)

## What changes were proposed in this pull request?

There are a couple of bugs in the `BlockManager` with respect to support 
for replicated off-heap storage. First, the locally-stored off-heap byte buffer 
is disposed of when it is replicated. It should not be. Second, the replica 
byte buffers are stored as heap byte buffers instead of direct byte buffers 
even when the storage level memory mode is off-heap. This PR addresses both of 
these problems.
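A toy model of the invariant the fix restores may help. Everything here is 
a hypothetical stand-in (Spark's `BlockManager` is Scala and works with 
`ChunkedByteBuffer`s); `ChunkedBuffer` and `replicate` exist only to 
illustrate the two bugs:

```python
# Sketch of the replication invariant: replicating a block must not dispose
# of the locally stored buffer, and the replica must be allocated in the
# same memory mode (on-heap vs off-heap) as the block's storage level.

from dataclasses import dataclass


@dataclass
class ChunkedBuffer:
    data: bytes
    off_heap: bool       # stand-in for direct vs heap allocation
    disposed: bool = False

    def dispose(self):
        self.disposed = True


def replicate(local: ChunkedBuffer, use_off_heap: bool) -> ChunkedBuffer:
    # The buggy behavior was roughly: dispose the local buffer after
    # shipping its bytes, and always build the replica on-heap. The fixed
    # behavior keeps the local buffer alive and honors the memory mode.
    return ChunkedBuffer(local.data, off_heap=use_off_heap)


block = ChunkedBuffer(b"\x01" * 16, off_heap=True)
replica = replicate(block, use_off_heap=True)
assert not block.disposed          # local buffer survives replication
assert replica.off_heap            # replica matches the storage level
```

The enhanced `BlockManagerReplicationSuite` checks exactly these two 
properties: the block remains readable locally after replication, and the 
replica is stored with the requested memory mode.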

## How was this patch tested?

`BlockManagerReplicationSuite` was enhanced to fill in the coverage gaps. 
It now fails if either of the bugs in this PR exist.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VideoAmp/spark-public 
spark-17204-replicated_off_heap_storage

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16499.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16499


commit e49aeca23ff463fbd9a9cc4db99078c466bfbd56
Author: Michael Allman 
Date:   2017-01-02T01:28:12Z

Fix a couple of bugs in replicated off-heap storage

commit 40b6b97ca9013544702433dc5dc388c054daf41a
Author: Michael Allman 
Date:   2017-01-07T21:17:33Z

Shore-up BlockManagerReplicationSuite to identify a couple of bugs with
off-heap storage replication







[GitHub] spark issue #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrdering` gro...

2017-01-04 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15480
  
Hi @lw-lin. Just FYI we use this patch at VideoAmp and would love to see it 
merged in. I notice this PR has gone a little cold. I'm sorry I can't offer 
much concrete help, but I wanted to check with you to see if you'll be able to 
pick this up again soon. Cheers.





[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16281
  
I'd love to see frequent, conservative patch releases. From my experience, 
parquet bugs cause significant trouble for downstream consumers. For example, 
we encountered a data corruption bug writing parquet format V2 that made a good 
deal of that data unreadable. (I can't recall the issue number, but it was 
reported and fixed.)

As another example, we recently upgraded some of our Spark clusters to use 
parquet-mr 1.9.0, because PARQUET-363 introduced a bug into one of our Spark 
2.x patches. When we switched to 1.9, we found 
https://issues.apache.org/jira/browse/PARQUET-783, which breaks things in a 
different way. We needed a fix, so we forked 1.9 internally.

FWIW, we haven't found any other issues using parquet-mr 1.9.0 with Spark 
2.1.





[GitHub] spark issue #16274: [SPARK-18853][SQL] Project (UnaryNode) is way too aggres...

2016-12-14 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16274
  
Outside of some comment grooming, LGTM.





[GitHub] spark pull request #16274: [SPARK-18853][SQL] Project (UnaryNode) is way too...

2016-12-14 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16274#discussion_r92448788
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/ArrayType.scala ---
@@ -78,10 +78,10 @@ case class ArrayType(elementType: DataType, 
containsNull: Boolean) extends DataT
   ("containsNull" -> containsNull)
 
   /**
-   * The default size of a value of the ArrayType is 100 * the default 
size of the element type.
-   * (We assume that there are 100 elements).
+   * The default size of a value of the ArrayType is 1 * the default size 
of the element type.
--- End diff --

Suggest:

```The default size of a value of the ArrayType is the default size of the 
element type.```





[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...

2016-12-13 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16122
  
@wangyum Thank you for this important bug fix! :)





[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...

2016-12-07 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16122
  
I believe that may be the case, unfortunately. At least, I have no 
immediate ideas otherwise.

> On Dec 7, 2016, at 5:25 PM, Eric Liang wrote:
> 
> I see. In that case I think manual testing may be sufficient.
> 
> On Wed, Dec 7, 2016, 5:00 PM Michael Allman wrote:
> 
> > I think that's exactly what I tried and got the `NoSuchMethodException`.
> >
> > On Dec 7, 2016, at 3:35 PM, Eric Liang wrote:
> >
> > I did some digging into HiveClientImpl and Hive.java, and I think it
> > would be pretty safe to call Hive.get() within the scope of a
> > client.withHiveState { }. That call sets the thread-local hive to that
> > used by the client before executing the passed function block.






[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...

2016-12-07 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16122
  
I think that's exactly what I tried and got the `NoSuchMethodException`.

On Dec 7, 2016, at 3:35 PM, Eric Liang  wrote:

I did some digging into HiveClientImpl and Hive.java, and I think it would
be pretty safe to call Hive.get() within the scope of a client.withHiveState
{ }. That call sets the thread-local hive to that used by the client before
executing the passed function block.







[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...

2016-12-07 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16122
  
I'm not sure configuration-level rollback will guarantee an absence of 
interactions with other tests. For one thing, I think we need to create and 
clean up an independent metastore directory for Derby. Otherwise we may end up 
with 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69669/testReport/junit/org.apache.spark.sql.hive/HiveExternalCatalogSuite/_It_is_not_a_test_/.

I guess I'd like a stronger assurance that we're doing this the right way. 
The only call to `Hive.get` in the Hive codebase I can find is in a private 
method in `HiveClientImpl`. Spark seems to be going to great lengths to 
discourage calling that method directly.

I've also thought this might be better suited to testing in the 
thriftserver codebase, where there seems to be a stronger sense of 
client/server separation in testing. I haven't gone too far in investigating 
that direction.





[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...

2016-12-07 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16122
  
It does, yes. My concern around that test is that its behavior doesn't seem 
to be independent of other tests. For example, the value of

```hive.getConf.getBoolean(tryDirectSqlConfVar.varname, 
tryDirectSqlConfVar.defaultBoolVal)```

seemed to vary depending on what it was set to in the other test in that 
suite.





[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...

2016-12-07 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16122
  
There may be a way to do it, but the classloader tricks being used in the 
hive client implementation are beyond my comprehension.





[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...

2016-12-07 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16122
  
I tried that but got a `NoSuchMethodException` in the call to `getMSC`.





[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...

2016-12-06 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16122
  
I haven't been able to get a proper unit test environment running where the 
embedded metastore conf is different from the client conf. I did validate 
that, without this patch, Spark fails to execute a query that filters on an 
integer-typed partition column when the metastore has direct SQL access 
disabled, whereas Spark with this patch works and behaves as expected.
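
For context, the manual validation described above can be sketched roughly 
as follows (the table and column names here are hypothetical, and this 
assumes a Spark session backed by a Hive metastore configured with 
`hive.metastore.try.direct.sql=false`):

```scala
// Sketch only: assumes a Hive metastore with direct SQL access disabled
// (hive.metastore.try.direct.sql=false).
spark.sql("CREATE TABLE t (value STRING) PARTITIONED BY (part INT)")
spark.sql("INSERT INTO t PARTITION (part = 1) VALUES ('a')")

// Without the fix, pushing the filter on the integer-typed partition column
// down to the metastore fails; with it, the query runs as expected.
spark.sql("SELECT * FROM t WHERE part = 1").show()
```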

@wangyum I don't believe your test uses Hive in a way that's compatible 
with Spark. Can you please remove it?

@ericl Any ideas on how to unit test the case where the client and 
metastore have different configurations?





[GitHub] spark issue #16122: [SPARK-18681][SQL] Fix filtering to compatible with part...

2016-12-06 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16122
  
@wangyum I'm going to see if I can help with the unit testing on this PR.





[GitHub] spark issue #15998: [SPARK-18572][SQL] Add a method `listPartitionNames` to ...

2016-12-05 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15998
  
I suspect this is a spurious, unrelated test failure. Can we get a rebuild, 
please?





[GitHub] spark issue #15998: [SPARK-18572][SQL] Add a method `listPartitionNames` to ...

2016-12-05 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15998
  
@gatorsmile I've applied your patch and reverted the change I made in the 
previous commit to work around that defect. The failed test now passes for 
me. Let's see what Jenkins says.





[GitHub] spark issue #15998: [SPARK-18572][SQL] Add a method `listPartitionNames` to ...

2016-12-05 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15998
  
> #15998 (comment) found a bug. If this PR will not be merged to Spark 2.1 
branch, I think we need to submit a separate PR for resolving the bug.

I would like to get this patch into Spark 2.1, as it addresses a scalability 
issue for partitioned tables.




