Re: Observable Metrics on Spark Datasets

2021-03-16 Thread Jungtaek Lim
Please follow the discussion in the original PR: https://github.com/apache/spark/pull/26127. Dataset.observe() relies on the query listener for the batch query, which is an "unstable" API - that's why we decided not to add an example for the batch query. For streaming query, it relies on the
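For reference, a minimal Scala sketch of the streaming path mentioned above, assuming Spark 3.x and a SparkSession `spark` in scope; the metric name "my_metrics" and the column "cnt" are illustrative, not taken from the PR:

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    // Receive Dataset.observe() metrics for a streaming query through
    // StreamingQueryListener; they arrive with each progress update.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = {}
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        // observedMetrics is a java.util.Map keyed by the name passed to observe()
        Option(event.progress.observedMetrics.get("my_metrics")).foreach { row =>
          println(s"observed row count: ${row.getAs[Long]("cnt")}")
        }
      }
    })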

Re: Observable Metrics on Spark Datasets

2021-03-16 Thread Enrico Minack
I am focusing on batch mode, not streaming mode. I would argue that Dataset.observe() is equally useful for large batch processing. If you need some motivating use cases, please let me know. Anyhow, the documentation of observe states it works for both batch and streaming. And in batch mode,
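To make the batch-mode behavior concrete, a minimal Scala sketch, assuming Spark 3.x and a SparkSession `spark` in scope; the metric name "stats" and the aggregates are illustrative. QueryExecutionListener is the @Unstable API referred to above:

    import org.apache.spark.sql.QueryExecution
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.util.QueryExecutionListener

    // Batch observe() metrics are delivered through QueryExecutionListener.
    spark.listenerManager.register(new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
        // observedMetrics maps each observe() name to a Row of its aggregates
        qe.observedMetrics.get("stats").foreach { row =>
          println(s"cnt=${row.getAs[Long]("cnt")}, maxId=${row.getAs[Long]("maxId")}")
        }
      }
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {}
    })

    val df = spark.range(100)
      .observe("stats", count(lit(1)).as("cnt"), max(col("id")).as("maxId"))
    df.collect() // the metrics are reported once the action completes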

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Takeshi Yamamuro
+1; the pandas interfaces are pretty popular and supporting them in pyspark looks promising, I think. One question I have: what's the initial goal of the proposal? Is it to port all the pandas interfaces that Koalas has already implemented, or just a basic set of them? On Tue, Mar 16, 2021 at 1:44

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Wenchen Fan
+1, it's great to have pandas support in Spark out of the box. On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro wrote: > +1; the pandas interfaces are pretty popular and supporting them in > pyspark looks promising, I think. > One question I have: what's the initial goal of the proposal? > Is

Determine global watermark via StreamingQueryProgress eventTime watermark String

2021-03-16 Thread dwichman
Hi Spark Developers, Is it possible to reliably determine the current global watermark of a streaming query from the eventTime watermark string in StreamingQueryProgress, as delivered to StreamingQueryListener.onQueryProgress?
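For what it's worth, a small Scala sketch of where that string surfaces, assuming a running StreamingQuery handle `query` on Spark 3.x; whether it reliably reflects the watermark used internally is exactly the question, see the reply further down:

    // eventTime is a java.util.Map[String, String]; once a watermark is
    // defined, the "watermark" key holds an ISO-8601 timestamp string.
    val watermark = Option(query.lastProgress) // null until the first progress
      .flatMap(p => Option(p.eventTime.get("watermark")))
    watermark.foreach(w => println(s"current global watermark: $w"))

The same map is also delivered asynchronously via StreamingQueryListener.onQueryProgress.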

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Andrew Melo
Hi, Integrating Koalas with pyspark might help enable a richer integration between the two. Something that would be useful with a tighter integration is support for custom column array types. Currently, Spark takes dataframes, converts them to Arrow buffers, then transmits them over the socket to

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Bryan Cutler
+1, the proposal sounds good to me. Having a familiar API built in will really help new users who might only have pandas experience get started with Spark. It sounds like maintenance costs should be manageable once the hurdle of setting up tests is cleared. Just out of curiosity, does Koalas pretty

Re: Determine global watermark via StreamingQueryProgress eventTime watermark String

2021-03-16 Thread Jungtaek Lim
There was a similar question (though with a different approach) and I've explained the current status a bit: https://lists.apache.org/thread.html/r89a61a10df71ccac132ce5d50b8fe405635753db7fa2aeb79f82fb77%40%3Cuser.spark.apache.org%3E I guess this would answer your question as well. At least for now,

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Hyukjin Kwon
Thank you guys for all your feedback. I will start working on the SPIP with the Koalas team. I expect the SPIP can be sent out late this week or early next week. I have inlined and answered the previously unanswered questions below: Is the community developing the pandas API layer for Spark interested in being