Checkpointing in Spark - Cleaning files and Support across app attempts

2016-10-11 Thread dhruve ashar
While checkpointing RDDs as a part of an application that doesn't use spark-streaming, I observed that the checkpointed files are not being cleaned up even after the application completes successfully. Is it because we assume that checkpointing would be primarily used for spark-streaming applicati

Re: Spark Improvement Proposals

2016-10-11 Thread Ryan Blue
I don't think we will have trouble with whatever rule is adopted for accepting proposals. Considering committers' votes binding (if that is what we choose) is an established practice as long as it isn't for specific votes, like a release vote. From the Apache docs: "Who is permitted to vote is

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-11 Thread Shivaram Venkataraman
Thanks Fred - that is very helpful.
> Delivering low latency, high throughput, and stability simultaneously: Right now, our own tests indicate you can get at most two of these characteristics out of Spark Streaming at the same time. I know of two parties that have abandoned Spark Streaming b

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-11 Thread Reynold Xin
On Tue, Oct 11, 2016 at 10:55 AM, Michael Armbrust wrote:
>> *Complex event processing and state management:* Several groups I've talked to want to run a large number (tens or hundreds of thousands now, millions in the near future) of state machines over low-rate partitions of a high-rate

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-11 Thread Michael Armbrust
This is super helpful, thanks for writing it up!
> *Delivering low latency, high throughput, and stability simultaneously:* Right now, our own tests indicate you can get at most two of these characteristics out of Spark Streaming at the same time. I know of two parties that have abandoned S

StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-11 Thread Fred Reiss
On Thu, Oct 6, 2016 at 12:37 PM, Michael Armbrust wrote:
> [snip!]
> Relatedly, I'm curious to hear more about the types of questions you are getting. I think the dev list is a good place to discuss applications and if/how structured streaming can handle them.
Details are difficult to s

Spark Streaming deletes checkpointed RDD then tries to load it after restart

2016-10-11 Thread Cosmin Ciobanu
This is a follow-up to this unanswered October 2015 issue: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-streaming-failed-recovery-from-checkpoint-td14832.html The issue is that the Spark driver checkpoints an RDD, deletes it, the job restarts, and *the new driver tries to load

Re: Looking for a Spark-Python expert

2016-10-11 Thread Hyukjin Kwon
Just as one of those subscribed to the dev/user mailing lists, I would like to avoid receiving a flood of emails about job recruiting. In my personal opinion, allowing them would effectively mean letting this list be used as a means of profit for an organisation. On 7 Oct 2016 5:05