Re: should we dump a warning if we drop batches due to window move?

2018-08-03 Thread Peter Liu
Hello there, I have a quick question about the following case. Situation: a Spark consumer is able to process 5 batches in 10 sec (where the batch interval is zero by default; correct me if this is wrong). The window size is 10 sec (sliding with zero overlap). There are some fluctuations in the

Re: Spark model serving

2018-08-03 Thread Saikat Kanjilal
@holdenK et al, ping on next steps. On Jul 12, 2018, at 3:47 PM, Saikat Kanjilal (sxk1...@hotmail.com) wrote: Thanks so much for responding, Maximiliano. I didn't want this discussion to disappear in the wilderness of dev emails :). Here's what I would like to see

Re: [Proposal] New feature: reconfigurable number of partitions on stateful operators in Structured Streaming

2018-08-03 Thread Joseph Torres
I'd agree it might make sense to bundle this into an API. We'd have to think about whether it's a common enough use case to justify the API complexity. It might be worth exploring decoupling state and partitions, but I wouldn't want to start making decisions based on it without a clearer design

Re: [Proposal] New feature: reconfigurable number of partitions on stateful operators in Structured Streaming

2018-08-03 Thread Arun Mahadevan
coalesce might work. Say "spark.sql.shuffle.partitions" = 200; then "input.readStream.map.filter.groupByKey(..).coalesce(2)" would still create 200 instances for state but execute just 2 tasks. However, I think further groupByKey operations downstream would need a similar coalesce. And

SPIP: Executor Plugin (SPARK-24918)

2018-08-03 Thread Imran Rashid
I'd like to propose adding a plugin API for Executors, primarily for instrumentation and debugging (https://issues.apache.org/jira/browse/SPARK-24918). The changes are small, but as it's adding a new API, it might be SPIP-worthy. I mentioned it as well in a recent email I sent about memory

Re: [Proposal] New feature: reconfigurable number of partitions on stateful operators in Structured Streaming

2018-08-03 Thread Joseph Torres
A coalesced RDD will definitely maintain any within-partition invariants that the original RDD maintained. It pretty much just runs its input partitions sequentially. There'd still be some DataFrame API work needed to get the coalesce operation where you want it to be, but this is much simpler

Re: [Proposal] New feature: reconfigurable number of partitions on stateful operators in Structured Streaming

2018-08-03 Thread Jungtaek Lim
I’m afraid I don’t know the details of coalesce(), but from the resources I found on it, it looks like it helps reduce the actual number of partitions. For streaming aggregation, state for all partitions (by default, 200) must be initialized and committed even if it is unchanged. Otherwise error

Re: [Proposal] New feature: reconfigurable number of partitions on stateful operators in Structured Streaming

2018-08-03 Thread Joseph Torres
Scheduling multiple partitions in the same task is basically what coalesce() does. Is there a reason that doesn't work here? On Fri, Aug 3, 2018 at 5:55 AM, Jungtaek Lim wrote: > Here's a link for Google docs (anyone can comment): >

Re: [Proposal] New feature: reconfigurable number of partitions on stateful operators in Structured Streaming

2018-08-03 Thread Jungtaek Lim
Here's a link to the Google Doc (anyone can comment): https://docs.google.com/document/d/1DEOW3WQcPUq0YFgazkZx6Ei6EOdj_3pXEsyq4LGpyNs/edit?usp=sharing Please note that I just copied the content into the Google Doc, so someone may point out a lack of detail. I would like to start with an explanation of

Spark sql syntax checker

2018-08-03 Thread Alessandro Liparoti
Hi everyone, I am building a framework on top of Spark in which users specify SQL queries and we analyze them in order to extract some metadata. Moreover, SQL queries can be composed, meaning that if a user writes a query X to build a dataset, another user can use X in his own query to refer to

Spark kafka streaming failure recovery scenario

2018-08-03 Thread sujith71955
Hi all, I have a scenario where my streaming application is running and, partway through, the application is restarted. Basically, I want to start from the most recent data and then go back to the old data. Is there any way we can do this in Spark, or are we planning to provide this kind of flexibility in

Re: [Proposal] New feature: reconfigurable number of partitions on stateful operators in Structured Streaming

2018-08-03 Thread Arun Mahadevan
Can you share this in a Google Doc to make the discussions easier? Thanks for coming up with ideas to improve upon the current restrictions with the SS state store. If I understood correctly, the plan is to introduce a logical partitioning scheme for state storage (based on keys)

[Proposal] New feature: reconfigurable number of partitions on stateful operators in Structured Streaming

2018-08-03 Thread Jungtaek Lim
Hi Spark devs, I have a new feature to propose and would like to hear the community's opinions. I'm not sure it is a big enough change to warrant a SPIP, so I'm posting to the dev mailing list instead. > Feature Reconfigurable number of partitions on state operators in Structured Streaming > Rationalization