[GitHub] spark pull request #22995: [SPARK-25998] [CORE] Change TorrentBroadcast to h...

2018-11-27 Thread bkrieger
Github user bkrieger commented on a diff in the pull request:

https://github.com/apache/spark/pull/22995#discussion_r236775302
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -92,8 +95,15 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: 
T, id: Long)
   /** The checksum for all the blocks. */
   private var checksums: Array[Int] = _
 
-  override protected def getValue() = {
-_value
+  override protected def getValue() = synchronized {
--- End diff --

done


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22995: [SPARK-25998] [CORE] Change TorrentBroadcast to h...

2018-11-27 Thread bkrieger
Github user bkrieger commented on a diff in the pull request:

https://github.com/apache/spark/pull/22995#discussion_r236724749
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -92,8 +95,15 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: 
T, id: Long)
   /** The checksum for all the blocks. */
   private var checksums: Array[Int] = _
 
-  override protected def getValue() = {
-_value
+  override protected def getValue() = synchronized {
--- End diff --

Do you mean switching `TorrentBroadcast.synchronized` to 
`broadcastCache.synchronized` inside `readBroadcastBlock`, or changing 
`this.synchronized` to `broadcastCache.synchronized` inside `getValue()` (and 
getting rid of the lock in `readBroadcastBlock`?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22995: [SPARK-25998] [CORE] Change TorrentBroadcast to h...

2018-11-27 Thread bkrieger
Github user bkrieger commented on a diff in the pull request:

https://github.com/apache/spark/pull/22995#discussion_r23671
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -92,8 +95,15 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: 
T, id: Long)
   /** The checksum for all the blocks. */
   private var checksums: Array[Int] = _
 
-  override protected def getValue() = {
-_value
+  override protected def getValue() = synchronized {
--- End diff --

I think we can remove the `TorrentBroadcast.synchronized` in 
`readBroadcastBlock`, since we're already synchronizing in its only caller? 
Though I'm not sure why it was necessary in the first place, as 
`readBroadcastBlock` should only have been called once before this PR.

Regardless, I agree that the perf hit should be ok. Let me know if you want 
any of this changed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22995: [SPARK-25998] [CORE] Change TorrentBroadcast to hold wea...

2018-11-26 Thread bkrieger
Github user bkrieger commented on the issue:

https://github.com/apache/spark/pull/22995
  
@srowen @mridulm for some reason it looks like tests aren't being 
triggered, can one of you trigger?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22995: [SPARK-25998] [CORE] Change TorrentBroadcast to h...

2018-11-26 Thread bkrieger
Github user bkrieger commented on a diff in the pull request:

https://github.com/apache/spark/pull/22995#discussion_r236391043
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -93,7 +96,14 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: 
T, id: Long)
   private var checksums: Array[Int] = _
 
   override protected def getValue() = {
-_value
+val memoized: T = if (_value == null) null.asInstanceOf[T] else 
_value.get
--- End diff --

Good catch. I'll make it synchronized, so it only loads one at a time.

Re: WeakReference, sure, I can change it to SoftReference. That'll be 
closer to the original behavior, and should still give the improvement we want.

When I try with `.asInstanceOf[T]` it fails to compile with:
```
[error] 
/Users/bkrieger/Documents/git/spark/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala:98:
 type mismatch;
[error]  found   : Null(null)
[error]  required: T
[error] val memoized: T = if (_value == null) null else _value.get
[error]   ^
[info] Null(null) <: T?
[info] false
[error] one error found
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22995: [SPARK-25998] [CORE] Change TorrentBroadcast to h...

2018-11-09 Thread bkrieger
GitHub user bkrieger opened a pull request:

https://github.com/apache/spark/pull/22995

[SPARK-25998] [CORE] Change TorrentBroadcast to hold weak reference of 
broadcast object

## What changes were proposed in this pull request?

This PR changes the broadcast object in TorrentBroadcast from a strong 
reference to a weak reference. This allows it to be garbage collected even if 
the Dataset is held in memory. This is ok, because the broadcast object can 
always be re-read.

## How was this patch tested?

Tested in Spark shell by taking a heap dump, full repro steps listed in 
https://issues.apache.org/jira/browse/SPARK-25998.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bkrieger/spark bk/torrent-broadcast-weak

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22995.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22995


commit a2683b62985fc9c7d15fb92f3bb170a4b5225058
Author: Brandon Krieger 
Date:   2018-11-08T23:04:06Z

use weak reference for torrent broadcast

commit 99fbeecf43a289648a56d178fa55e188ce75bdb7
Author: Brandon Krieger 
Date:   2018-11-09T21:04:51Z

fix compile

commit 5e0a179c168a70b0166abe4bb51a1d26a2f1d666
Author: Brandon Krieger 
Date:   2018-11-09T21:33:22Z

fix

commit 1908b5b8dfa6c0b55db3bd9a90e21ca713e5bf25
Author: Brandon Krieger 
Date:   2018-11-09T21:48:44Z

no npe

commit 24183e5b8b63e0b4e117856ab4de7eb1b0ea6c9a
Author: Brandon Krieger 
Date:   2018-11-09T21:52:21Z

no option

commit f212da322242386ce3b71e9961a964e60b587287
Author: Brandon Krieger 
Date:   2018-11-09T22:08:23Z

typo




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21508: [SPARK-24488] [SQL] Fix issue when generator is aliased ...

2018-07-20 Thread bkrieger
Github user bkrieger commented on the issue:

https://github.com/apache/spark/pull/21508
  
@gatorsmile @hvanhovell any chance you can take a look at this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21508: [SPARK-24488] [SQL] Fix issue when generator is aliased ...

2018-06-21 Thread bkrieger
Github user bkrieger commented on the issue:

https://github.com/apache/spark/pull/21508
  
@gatorsmile @hvanhovell can you take another look at this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21508: [SPARK-24488] [SQL] Fix issue when generator is aliased ...

2018-06-18 Thread bkrieger
Github user bkrieger commented on the issue:

https://github.com/apache/spark/pull/21508
  
@gatorsmile @hvanhovell Gentle ping. Let me know if there's someone else 
who would be better to review.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21508: [SPARK-24488] [SQL] Fix issue when generator is aliased ...

2018-06-15 Thread bkrieger
Github user bkrieger commented on the issue:

https://github.com/apache/spark/pull/21508
  
@gatorsmile @hvanhovell can you take a last look at this? I think it's good 
to merge.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21508: [SPARK-24488] [SQL] Fix issue when generator is aliased ...

2018-06-13 Thread bkrieger
Github user bkrieger commented on the issue:

https://github.com/apache/spark/pull/21508
  
@gatorsmile @hvanhovell is this good to merge?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21508: [SPARK-24488] [SQL] Fix issue when generator is aliased ...

2018-06-11 Thread bkrieger
Github user bkrieger commented on the issue:

https://github.com/apache/spark/pull/21508
  
The test failure looks like a flake to me?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21508: [SPARK-24488] [SQL] Fix issue when generator is a...

2018-06-11 Thread bkrieger
Github user bkrieger commented on a diff in the pull request:

https://github.com/apache/spark/pull/21508#discussion_r194539275
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1568,11 +1568,13 @@ class Analyzer(
   expr.find(_.isInstanceOf[Generator]).isDefined
 }
 
-private def hasNestedGenerator(expr: NamedExpression): Boolean = expr 
match {
-  case UnresolvedAlias(_: Generator, _) => false
-  case Alias(_: Generator, _) => false
-  case MultiAlias(_: Generator, _) => false
-  case other => hasGenerator(other)
+private def hasNestedGenerator(expr: NamedExpression): Boolean = {
+  CleanupAliases.trimNonTopLevelAliases(expr) match {
+case UnresolvedAlias(_: Generator, _) => false
--- End diff --

`hasNestedGenerator` already handled `UnresolvedAlias`. I'll change 
`CleanupAliases` back to only handling resolved aliases.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21508: [SPARK-24488] [SQL] Fix issue when generator is a...

2018-06-11 Thread bkrieger
Github user bkrieger commented on a diff in the pull request:

https://github.com/apache/spark/pull/21508#discussion_r194539357
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1568,11 +1568,13 @@ class Analyzer(
   expr.find(_.isInstanceOf[Generator]).isDefined
 }
 
-private def hasNestedGenerator(expr: NamedExpression): Boolean = expr 
match {
-  case UnresolvedAlias(_: Generator, _) => false
-  case Alias(_: Generator, _) => false
-  case MultiAlias(_: Generator, _) => false
-  case other => hasGenerator(other)
+private def hasNestedGenerator(expr: NamedExpression): Boolean = {
+  CleanupAliases.trimNonTopLevelAliases(expr) match {
+case UnresolvedAlias(_: Generator, _) => false
--- End diff --

done


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21508: [SPARK-24488] [SQL] Fix issue when generator is a...

2018-06-11 Thread bkrieger
Github user bkrieger commented on a diff in the pull request:

https://github.com/apache/spark/pull/21508#discussion_r194467198
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1568,11 +1568,32 @@ class Analyzer(
   expr.find(_.isInstanceOf[Generator]).isDefined
 }
 
-private def hasNestedGenerator(expr: NamedExpression): Boolean = expr 
match {
-  case UnresolvedAlias(_: Generator, _) => false
-  case Alias(_: Generator, _) => false
-  case MultiAlias(_: Generator, _) => false
-  case other => hasGenerator(other)
+private def hasNestedGenerator(expr: NamedExpression): Boolean = {
+  trimNonTopLevelAliases(expr) match {
+case UnresolvedAlias(_: Generator, _) => false
+case Alias(_: Generator, _) => false
+case MultiAlias(_: Generator, _) => false
+case other => hasGenerator(other)
+  }
+}
+
+def trimNonTopLevelAliases(e: Expression): Expression = e match {
+  case a: UnresolvedAlias =>
--- End diff --

I don't think it'll hurt to handle it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21508: [SPARK-24488] [SQL] Fix issue when generator is a...

2018-06-11 Thread bkrieger
Github user bkrieger commented on a diff in the pull request:

https://github.com/apache/spark/pull/21508#discussion_r194436381
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1568,11 +1568,32 @@ class Analyzer(
   expr.find(_.isInstanceOf[Generator]).isDefined
 }
 
-private def hasNestedGenerator(expr: NamedExpression): Boolean = expr 
match {
-  case UnresolvedAlias(_: Generator, _) => false
-  case Alias(_: Generator, _) => false
-  case MultiAlias(_: Generator, _) => false
-  case other => hasGenerator(other)
+private def hasNestedGenerator(expr: NamedExpression): Boolean = {
+  trimNonTopLevelAliases(expr) match {
+case UnresolvedAlias(_: Generator, _) => false
+case Alias(_: Generator, _) => false
+case MultiAlias(_: Generator, _) => false
+case other => hasGenerator(other)
+  }
+}
+
+def trimNonTopLevelAliases(e: Expression): Expression = e match {
--- End diff --

done


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21508: [SPARK-24488] [SQL] Fix issue when generator is a...

2018-06-10 Thread bkrieger
Github user bkrieger commented on a diff in the pull request:

https://github.com/apache/spark/pull/21508#discussion_r194274619
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1568,11 +1568,32 @@ class Analyzer(
   expr.find(_.isInstanceOf[Generator]).isDefined
 }
 
-private def hasNestedGenerator(expr: NamedExpression): Boolean = expr 
match {
-  case UnresolvedAlias(_: Generator, _) => false
-  case Alias(_: Generator, _) => false
-  case MultiAlias(_: Generator, _) => false
-  case other => hasGenerator(other)
+private def hasNestedGenerator(expr: NamedExpression): Boolean = {
+  trimNonTopLevelAliases(expr) match {
+case UnresolvedAlias(_: Generator, _) => false
+case Alias(_: Generator, _) => false
+case MultiAlias(_: Generator, _) => false
+case other => hasGenerator(other)
+  }
+}
+
+def trimNonTopLevelAliases(e: Expression): Expression = e match {
+  case a: UnresolvedAlias =>
--- End diff --

In my use case, no. But I wasn't sure if another use case would care. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21508: [SPARK-24488] [SQL] Fix issue when generator is a...

2018-06-10 Thread bkrieger
Github user bkrieger commented on a diff in the pull request:

https://github.com/apache/spark/pull/21508#discussion_r194274604
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1568,11 +1568,32 @@ class Analyzer(
   expr.find(_.isInstanceOf[Generator]).isDefined
 }
 
-private def hasNestedGenerator(expr: NamedExpression): Boolean = expr 
match {
-  case UnresolvedAlias(_: Generator, _) => false
-  case Alias(_: Generator, _) => false
-  case MultiAlias(_: Generator, _) => false
-  case other => hasGenerator(other)
+private def hasNestedGenerator(expr: NamedExpression): Boolean = {
+  trimNonTopLevelAliases(expr) match {
+case UnresolvedAlias(_: Generator, _) => false
+case Alias(_: Generator, _) => false
+case MultiAlias(_: Generator, _) => false
+case other => hasGenerator(other)
+  }
+}
+
+def trimNonTopLevelAliases(e: Expression): Expression = e match {
--- End diff --

Sure- I didn't want to break any existing functionality, but I can do that 
instead. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21508: [SPARK-24488] [SQL] Fix issue when generator is a...

2018-06-08 Thread bkrieger
Github user bkrieger commented on a diff in the pull request:

https://github.com/apache/spark/pull/21508#discussion_r194082850
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1568,11 +1568,13 @@ class Analyzer(
   expr.find(_.isInstanceOf[Generator]).isDefined
 }
 
-private def hasNestedGenerator(expr: NamedExpression): Boolean = expr 
match {
-  case UnresolvedAlias(_: Generator, _) => false
-  case Alias(_: Generator, _) => false
-  case MultiAlias(_: Generator, _) => false
-  case other => hasGenerator(other)
+private def hasNestedGenerator(expr: NamedExpression): Boolean = {
+  CleanupAliases.trimNonTopLevelAliases(expr) match {
--- End diff --

Updated to handle the `MultiAlias` and `UnresolvedAlias`, and updated the 
unit test to test all 3.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21508: [SPARK-24488] [SQL] Fix issue when generator is a...

2018-06-07 Thread bkrieger
GitHub user bkrieger opened a pull request:

https://github.com/apache/spark/pull/21508

[SPARK-24488] [SQL] Fix issue when generator is aliased multiple times

## What changes were proposed in this pull request?

Currently, the Analyzer throws an exception if your try to nest a 
generator. However, it special cases generators "nested" in an alias, and 
allows that. If you try to alias a generator twice, it is not caught by the 
special case, so an exception is thrown.

This PR trims the unnecessary, non-top-level aliases, so that the generator 
is allowed.


## How was this patch tested?

new tests in AnalysisSuite.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bkrieger/spark bk/SPARK-24488

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21508.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21508


commit 44ae34d0387f763936cddeceae64ee98b7bb279f
Author: Brandon Krieger 
Date:   2018-06-07T20:09:09Z

SPARK-24488




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org