[Structured Streaming] OOM on ConsoleSink with large inputs

2017-08-11 Thread Gerard Maas
Devs, while investigating another issue, I came across this OOM error when using the Console Sink with any source that can be larger than the available driver memory. In my case, I was using the File source and I had a 14G file in the monitored dir. I traced the issue back to a `df.collect` in

Re: [SS] watermark, eventTime and "StreamExecution: Streaming query made progress"

2017-08-11 Thread Michael Armbrust
The point here is to tell you what watermark value was used when executing this batch. You don't know the new watermark until the batch is over and we don't want to do two passes over the data. In general the semantics of the watermark are designed to be conservative (i.e. just because data is
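
A minimal sketch of the behaviour described above, in plain Python rather than Spark code. It is a toy model, not Spark's implementation: `run_batches`, `BatchReport`, and `delay` are hypothetical names, event times are plain non-negative integers, and the rule "watermark = largest event time seen so far minus a fixed delay" is a simplifying assumption. It shows that each batch is filtered and reported with the watermark that was in effect when it started, and that the new value only becomes known once the single pass over the batch has finished.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BatchReport:
    watermark_used: int   # watermark in effect while this batch ran
    kept: List[int]       # event times that were not dropped as late
    max_event_time: int   # largest event time seen so far, after this batch


def run_batches(batches: List[List[int]], delay: int) -> List[BatchReport]:
    """Toy model (assumption, not Spark code): one pass per batch."""
    watermark = 0   # nothing observed yet
    max_seen = 0    # largest event time observed across all batches
    reports = []
    for events in batches:
        kept = []
        for t in events:  # single pass: filter and track the max together
            max_seen = max(max_seen, t)
            if t >= watermark:
                kept.append(t)
        # The report carries the watermark this batch was executed with.
        reports.append(BatchReport(watermark, kept, max_seen))
        # Only now is the next watermark known; it never moves backwards.
        watermark = max(watermark, max_seen - delay)
    return reports


reports = run_batches([[10, 12, 20], [5, 18, 25], [14, 30]], delay=5)
```

With these inputs the first batch runs with watermark 0 and keeps everything, including 10 and 12, even though it also observes 20. The watermark of 15 derived from that batch is applied, and reported, one batch later, where the event at 5 is dropped.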

[build system] jenkins back up and building

2017-08-11 Thread shane knapp
there was some network work being done last night (~945pm PDT) at our colo, and it had the unintended consequence of kicking a lot of services off the network. jenkins was affected, and the connection to github was lost. i just kicked the jenkins master and things are happily building again.

Any committer interested in speaking at Solutions.Hamburg

2017-08-11 Thread Christofer Dutz
Hi all, I am looking for someone to speak as part of a 3-day Apache track at Solutions.Hamburg (https://solutions.hamburg/) next month (06.-08.09.2017). We were planning on having a dev-oriented "Spark Structured Streaming" talk. Unfortunately the original volunteer seems to be unable to

Re: Use Apache ORC in Apache Spark 2.3

2017-08-11 Thread Sean Owen
-private@ list for future replies. This is not a PMC conversation. On Fri, Aug 11, 2017 at 3:17 AM Andrew Ash wrote: > @Reynold no I don't use the HiveCatalog -- I'm using a custom > implementation of ExternalCatalog instead. > > On Thu, Aug 10, 2017 at 3:34 PM, Dong Joon

[SS] watermark, eventTime and "StreamExecution: Streaming query made progress"

2017-08-11 Thread Jacek Laskowski
Hi, I'm curious why the watermark is updated in the next streaming batch after it's been observed [1]. The report (from ProgressReporter/StreamExecution) does not look right to me, as avg/max/min are already calculated according to the watermark [2]. My recommendation would be to do the update [2] in the