[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23263
  
> Visualizing a workflow is nice, but Spark's Pipelines are typically 
pretty straightforward and linear. I could imagine producing a nicer 
visualization than what you get from reading the Spark UI, although of course 
we already have some degree of history and data there.

Another point worth considering is that these events can be combined with 
other SQL events, for instance to find out where the input `Dataset` 
originated. With current Apache Spark, I can already visualise SQL 
operations as below:

![screen shot 2018-12-10 at 9 41 36 am](https://user-images.githubusercontent.com/6477701/49706269-d9bdfe00-fc5f-11e8-943a-3309d1856ba5.png)

I think we can combine those existing lineages to easily understand where 
the data comes from and where it goes.

(BTW, I hope it doesn't sound like I'm pushing this only for my one 
specific case. I think this hook-like feature can be useful in many ways. I 
currently have one explicit example to show, so I'm referring to my case.)

> These are just the hooks, right? someone would have to implement 
something to use these events. I see the value in the API to some degree, but 
with no concrete implementation, does it add anything for Spark users out of 
the box?

Yes, right. It does not add anything for Spark users out of the box. It 
needs a custom listener implementation. For instance, with the custom 
listener below:

```scala
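import org.apache.spark.ml.MLEvent // event trait introduced by this PR
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

class CustomSparkListener extends SparkListener {
  // Events without a dedicated callback are delivered to onOtherEvent.
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: MLEvent => // do something
    case _ => // pass
  }
}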
```

There are two (existing) ways to use this.

```scala
spark.sparkContext.addSparkListener(new CustomSparkListener)
```

```bash
spark-submit \
  --conf spark.extraListeners=CustomSparkListener \
  ...
```
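
A slightly fuller sketch of such a listener, for illustration only. It assumes the event classes keep the fields proposed in this PR (`estimator`, `transformer`, `path`); the class name `MLLineageListener` and the `record` callback are hypothetical, and field names may differ in a released Spark version.

```scala
import org.apache.spark.ml.{FitEnd, FitStart, MLEvent, SaveInstanceEnd, TransformStart}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Hypothetical listener that turns ML events into one-line records.
// The accessed fields follow the case classes proposed in this PR (assumption).
class MLLineageListener(record: String => Unit) extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: FitStart[_]     => record(s"fit started: ${e.estimator.uid}")
    case e: FitEnd[_]       => record(s"fit ended: ${e.estimator.uid}")
    case e: TransformStart  => record(s"transform started: ${e.transformer.uid}")
    case e: SaveInstanceEnd => record(s"saved to: ${e.path}")
    case _: MLEvent         => // other ML events are ignored in this sketch
    case _                  => // not an ML event
  }
}
```

It can be registered in either of the two ways shown above, for example with `spark.sparkContext.addSparkListener(new MLLineageListener(msg => println(msg)))`. Note that `spark.extraListeners` needs a zero-argument constructor (or one taking a `SparkConf`), so a listener with a callback parameter like this one fits the programmatic route better.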

It's also similar to other existing implementations on the SQL side (the 
catalog events described above in the PR description).

One actual example I had with a SQL query listener was that I had to 
close a connection after every SQL execution.
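
The SQL-side case (cleaning up after every execution) could be sketched roughly as follows, using the existing `QueryExecutionListener` interface. The class name `CleanupListener` and the `cleanup` callback are hypothetical stand-ins for whatever closes the connection; this is not the actual code from that use case.

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Hypothetical listener: runs a caller-supplied cleanup action (for example,
// closing a connection) after every query execution, successful or not.
class CleanupListener(cleanup: () => Unit) extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    cleanup()

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    cleanup()
}
```

Such a listener would be registered with `spark.listenerManager.register(new CleanupListener(() => connection.close()))`, where `connection` is whatever resource the application holds.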

> Is that what someone would likely do? or would someone likely have to run 
Atlas to use this? If that's a good example of the use case, and Atlas is 
really about lineage and governance, is that the thrust of this change, to help 
with something to do with model lineage and reproducibility?

There are two reasons.

1. I think people in general would likely use this feature like other 
event listeners. At least, I can see some interest in such listeners outside 
the project:

- SQL Listener
  - https://stackoverflow.com/questions/46409339/spark-listener-to-an-sql-query
  - http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-Custom-Query-Execution-listener-via-conf-properties-td30979.html

- Streaming Query Listener
  - https://jhui.github.io/2017/01/15/Apache-Spark-Streaming/
  - http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-with-Watermark-td25413.html#a25416

2. Someone would likely run this via Atlas to handle lineage and 
governance, yes, as you said. I'm trying to show integrated lineages in 
Apache Spark, but this is a missing piece. This change is needed to fill 
that gap.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-09 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23263
  
My first impression is that it's a big change, which is reason for caution 
here.

Visualizing a workflow is nice, but Spark's Pipelines are typically pretty 
straightforward and linear. I could imagine producing a nicer visualization 
than what you get from reading the Spark UI, although of course we already have 
some degree of history and data there.

These are just the hooks, right? someone would have to implement something 
to use these events. I see the value in the API to some degree, but with no 
concrete implementation, does it add anything for Spark users out of the box?

It seems like the history this generates would belong in the history 
server, although that already has a pretty particular purpose, storing granular 
history of events in Spark. Is that what someone would likely do? or would 
someone likely have to run Atlas to use this? If that's a good example of the 
use case, and Atlas is really about lineage and governance, is that the thrust 
of this change, to help with something to do with model lineage and 
reproducibility?

It's good that the API changes little, though it does change a bit.

I think I mostly have questions right now.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23263
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23263
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99873/
Test PASSed.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23263
  
**[Test build #99873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99873/testReport)** for PR 23263 at commit [`ba9db6e`](https://github.com/apache/spark/commit/ba9db6eb2c98a1c84f982a093ff982a030b9eab7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23263
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99872/
Test PASSed.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23263
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23263
  
**[Test build #99872 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99872/testReport)** for PR 23263 at commit [`4f2fda2`](https://github.com/apache/spark/commit/4f2fda2272ee7e7d62df139c2f994ef7a122bf7c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23263
  
**[Test build #99873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99873/testReport)** for PR 23263 at commit [`ba9db6e`](https://github.com/apache/spark/commit/ba9db6eb2c98a1c84f982a093ff982a030b9eab7).


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23263
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5889/
Test PASSed.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23263
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23263
  
**[Test build #99872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99872/testReport)** for PR 23263 at commit [`4f2fda2`](https://github.com/apache/spark/commit/4f2fda2272ee7e7d62df139c2f994ef7a122bf7c).


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23263
  
The tests pass in my local environment. I'll fix them shortly.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23263
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23263
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99868/
Test FAILed.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23263
  
**[Test build #99868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99868/testReport)** for PR 23263 at commit [`a9112f3`](https://github.com/apache/spark/commit/a9112f33ff8fbfb66bad76bff6898abdef5b6881).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class TransformStart(transformer: Transformer, input: Dataset[_]) extends MLEvent`
  * `case class TransformEnd(transformer: Transformer, output: Dataset[_]) extends MLEvent`
  * `case class FitStart[M <: Model[M]](estimator: Estimator[M], dataset: Dataset[_]) extends MLEvent`
  * `case class FitEnd[M <: Model[M]](estimator: Estimator[M], model: M) extends MLEvent`
  * `case class LoadInstanceStart[T](reader: MLReader[T], path: String) extends MLEvent`
  * `case class LoadInstanceEnd[T](reader: MLReader[T], instance: T) extends MLEvent`
  * `case class SaveInstanceStart(writer: MLWriter, path: String) extends MLEvent`
  * `case class SaveInstanceEnd(writer: MLWriter, path: String) extends MLEvent`


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23263
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5885/
Test PASSed.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23263
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23263
  
cc @srowen, @cloud-fan (since it mimics SQL's event listener), @jkbradley, 
@mengxr and @yanboliang. Mind if I ask you to take a look, please? WDYT about this?


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23263
  
**[Test build #99868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99868/testReport)** for PR 23263 at commit [`a9112f3`](https://github.com/apache/spark/commit/a9112f33ff8fbfb66bad76bff6898abdef5b6881).


---
