Re: [SPARK-34806] Observable Metrics on Spark Datasets

2021-03-20 Thread Enrico Minack
The PR can be found here: https://github.com/apache/spark/pull/31905 On 19.03.21 at 10:55, Enrico Minack wrote: I'll sketch out a PR so we can talk code and move the discussion there. On 18.03.21 at 14:55, Wenchen Fan wrote: I think a listener-based API makes sense for streaming (since

Re: Observable Metrics on Spark Datasets

2021-03-19 Thread Enrico Minack
I'll sketch out a PR so we can talk code and move the discussion there. On 18.03.21 at 14:55, Wenchen Fan wrote: I think a listener-based API makes sense for streaming (since you need to keep watching the result), but may not be very reasonable for batch queries (you only get the result

Re: Observable Metrics on Spark Datasets

2021-03-18 Thread Wenchen Fan
I think a listener-based API makes sense for streaming (since you need to keep watching the result), but may not be very reasonable for batch queries (you only get the result once). The idea of Observation looks good, but we should define what happens if `observation.get` is called before the

Re: Observable Metrics on Spark Datasets

2021-03-16 Thread Jungtaek Lim
Please follow up on the discussion in the original PR: https://github.com/apache/spark/pull/26127 Dataset.observe() relies on the query listener for batch queries, which is an "unstable" API - that's why we decided not to add an example for batch queries. For streaming queries, it relies on the

Re: Observable Metrics on Spark Datasets

2021-03-16 Thread Enrico Minack
I am focusing on batch mode, not streaming mode. I would argue that Dataset.observe() is equally useful for large batch processing. If you need some motivating use cases, please let me know. Anyhow, the documentation of observe states that it works for both batch and streaming. And in batch mode,

Re: Observable Metrics on Spark Datasets

2021-03-15 Thread Jungtaek Lim
If I remember correctly, the major audience of the "observe" API is Structured Streaming in micro-batch mode. Judging from the example, the abstraction in 2 does not work with Structured Streaming. It could still be done with a callback, but the question remains how much complexity is hidden from

Observable Metrics on Spark Datasets

2021-03-15 Thread Enrico Minack
Hi Spark-Devs, the observable metrics that have been added to the Dataset API in 3.0.0 are a great improvement over the Accumulator APIs that seem to have much weaker guarantees. I have two questions regarding follow-up contributions: *1. Add observe to Python DataFrame* As I can see