Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2019-05-20 Thread Arun Mahadevan
Heres the proposal for supporting it in "append" mode - https://github.com/apache/spark/pull/23576. You could see if it addresses your requirement and post your feedback in the PR. For "update" mode its going to be much harder to support this without first adding support for "retractions",

Re: Spark 2.4.2

2019-04-19 Thread Arun Mahadevan
+1 to upgrade Jackson. It has come up multiple times due to CVEs and the back port has worked out but it may be good to include if its not going to delay the release. On Thu, 18 Apr 2019 at 19:53, Wenchen Fan wrote: > I've cut RC1. If people think we must upgrade Jackson in 2.4, I can cut > RC2

Re: Closing a SparkSession stops the SparkContext

2019-04-02 Thread Arun Mahadevan
I am not sure how would it cause a leak though. When a spark session or the underlying context is stopped it should clean up everything. The getOrCreate is supposed to return the active thread local or the global session. May be if you keep creating new sessions after explicitly clearing the

Re: Request review for long-standing PRs

2019-02-26 Thread Arun Mahadevan
Yes, I agree thats its a valid concern and leads to individual contributors giving up on new ideas or major improvements. On Tue, 26 Feb 2019 at 15:24, Jungtaek Lim wrote: > Adding one more, it implicitly leads individual contributors to give up > with challenging major things and just focus on

Re: [SS] Allowing stream Sink metadata as part of checkpoint?

2019-02-25 Thread Arun Mahadevan
Unless its some sink metadata to be maintained by the framework (e.g sink state that needs to be passed back to the sink etc), would it make sense to keep it under the checkpoint dir ? Maybe I am missing the motivation of the proposed approach but I guess the sink mostly needs to store the last

Re: Welcome Jose Torres as a Spark committer

2019-01-29 Thread Arun Mahadevan
Congrats Jose! Well deserved. On Tue, 29 Jan 2019 at 11:15, Jules Damji wrote: > Congrats Jose! > > Sent from my iPhone > Pardon the dumb thumb typos :) > > On Jan 29, 2019, at 11:07 AM, shane knapp wrote: > > congrats, and welcome! > > On Tue, Jan 29, 2019 at 11:07 AM Dean Wampler > wrote: >

Re: Support SqlStreaming in spark

2018-12-21 Thread Arun Mahadevan
There has been efforts to come up with a unified syntax for streaming (see [1] [2]), but I guess there will be differences based on the streaming features supported by a system. Agree it needs a detailed design and it can be as close to the Spark batch SQL syntax as possible. Also I am not sure

Re: Implementation for exactly-once streaming sink

2018-12-05 Thread Arun Mahadevan
I guess thats roughly it. As of now theres no in-built support to co-ordinate the commits across the executors in an atomic way. So you need to commit the batch (global commit) at the driver. And when the batch is replayed and if any of the intermediate operations are not idempotent or can cause

Re: DataSourceV2 sync tomorrow

2018-11-13 Thread Arun Mahadevan
IMO, the currentOffset should not be optional. For continuous mode I assume this offset gets periodically check pointed (so mandatory) ? For the micro batch mode the currentOffset would be the start offset for a micro-batch. And if the micro-batch could be executed without knowing the 'latest'

Re: DataSourceV2 hangouts sync

2018-10-31 Thread Arun Mahadevan
Thanks for bringing up the custom metrics API in the list, its something that needs to be addressed. A couple more items worth considering, 1. Possibility to unify the batch, micro-batch and continuous sources. (similar to SPARK-25000) Right now now there is significant code duplication even

Re: queryable state & streaming

2018-10-24 Thread Arun Mahadevan
I don't think separate API or RPCs etc might be necessary for queryable state if the state can be exposed as just another datasource. Then the sql queries can be issued against it just like executing sql queries against any other data source. For now I think the "memory" sink could be used as a

Re: welcome a new batch of committers

2018-10-03 Thread Arun Mahadevan
Congratulations everyone! On Wed, 3 Oct 2018 at 09:16, Dilip Biswal wrote: > Congratulations to Shane Knapp, Dongjoon Hyun, Kazuaki Ishizaki, Xingbo > Jiang, Yinan Li and Takeshi Yamamuro !! > > Regards, > Dilip Biswal > Tel: 408-463-4980 > dbis...@us.ibm.com > > > > - Original message

Re: DataSourceWriter V2 Api questions

2018-09-11 Thread Arun Mahadevan
tasks after they have prepared and bring up the tasks during the commit >>>> phase. >>>> >>>> I guess we already got into too much details here, but if it is based >>>> on client transaction Spark must assign "commit" tasks to the ex

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Arun Mahadevan
is also not a trivial one to get it correctly with current execution >>>> model: Spark doesn't require writer tasks to run at the same time but to >>>> achieve 2PC they should run until end of transaction (closing client before >>>> transaction ends normally means

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Arun Mahadevan
In some cases the implementations may be ok with eventual consistency (and does not care if the output is written out atomically) XA can be one option for datasources that supports it and requires atomicity but I am not sure how would one implement it with the current API. May be we need to

Re: Branch 2.4 is cut

2018-09-10 Thread Arun Mahadevan
Ryan's proposal makes a lot of sense. Its better not to release half-baked changes in 2.4 which not only breaks a lot of the APIs released in 2.3, but also expected to change further due redesigns before 3.0 so don't see much value releasing it in 2.4. On Sun, 9 Sep 2018 at 22:42, Wenchen Fan

Re: [Proposal] New feature: reconfigurable number of partitions on stateful operators in Structured Streaming

2018-08-03 Thread Arun Mahadevan
t/d/1DEOW3WQcPUq0YFgazkZx6Ei6 >>>> EOdj_3pXEsyq4LGpyNs/edit?usp=sharing >>>> >>>> Please note that I just copied the content to the google docs, so >>>> someone could point out lack of details. I would like to start with >>>> explanation of the concept,

Re: [Proposal] New feature: reconfigurable number of partitions on stateful operators in Structured Streaming

2018-08-03 Thread Arun Mahadevan
Can you share this in a google doc to make the discussions easier.? Thanks for coming up with ideas to improve upon the current restrictions with the SS state store. If I understood correctly, the plan is to introduce a logical partitioning scheme for state storage (based on keys)

Re: Sorting on a streaming dataframe

2018-04-24 Thread Arun Mahadevan
I guess sorting would make sense only when you have the complete data set. In streaming you don’t know what record is coming next so doesn’t make sense to sort it (except in the aggregated complete output mode where the entire result table is emitted each time and the results can be sorted).