I am not a PySpark person, but from the errors I can tell that your Spark application is running into memory issues. Are you collecting the results to the driver at any point, or have you configured too little memory for the nodes? Also, if you are using DataFrames, there is an issue raised in JIRA related to `java.net.SocketTimeoutException: Accept timed out`.
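As a rough illustration of the memory-configuration side (the master, sizes, and file name below are placeholders, not values from your setup), driver and executor memory can be raised at submit time:

```shell
# Placeholder values -- tune to your cluster and workload.
spark-submit \
  --master yarn \
  --driver-memory 4g \
  --executor-memory 4g \
  --conf spark.driver.maxResultSize=2g \
  your_streaming_app.py
```

Raising `spark.driver.maxResultSize` only papers over the problem if the app really is collecting large results to the driver; avoiding the collect in the first place is usually the better fix.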
Hope this helps.

Thanks,
Divya

On 16 December 2016 at 16:53, Russell Jurney <russell.jur...@gmail.com> wrote:

> I have created a PySpark Streaming application that uses Spark ML to
> classify flight delays into three categories: on-time, slightly late, very
> late. After an hour or so, something times out and the whole thing crashes.
>
> The code and error are in a gist here:
> https://gist.github.com/rjurney/17d471bc98fd1ec925c37d141017640d
>
> While I am interested in why I am getting an exception, I am more
> interested in understanding what the correct deployment model is, because
> long-running processes will have new and varied errors and exceptions.
> Right now, with what I've built, Spark is a highly dependable distributed
> system, but in streaming mode the entire thing depends on one Python PID
> staying up. This can't be how apps are deployed in the wild, because it
> would never be very reliable, right? But I don't see anything about this in
> the docs, so I am confused.
>
> Note that I use this to run the app; maybe that is the problem?
>
>     ssc.start()
>     ssc.awaitTermination()
>
> What is the actual deployment model for Spark Streaming? All I know to do
> right now is to restart the PID. I'm new to Spark, and the docs don't
> really explain this (that I can see).
>
> Thanks!
> --
> Russell Jurney  twitter.com/rjurney  russell.jur...@gmail.com  relato.io
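On the deployment question quoted above: beyond cluster-level supervision (e.g. `spark-submit --supervise` on a standalone or Mesos cluster), one common pattern is a restart loop around the job itself. The sketch below is generic Python, not Spark-specific; `start_job` is a hypothetical placeholder for whatever builds the StreamingContext and calls `ssc.start()` / `ssc.awaitTermination()`:

```python
import time


def run_with_restarts(start_job, max_restarts=5, backoff_seconds=1):
    """Re-invoke start_job() when it raises, up to max_restarts times.

    start_job is a placeholder for the function that builds the
    StreamingContext, starts it, and awaits termination.
    """
    attempts = 0
    while True:
        try:
            start_job()
            return  # clean termination, no restart needed
        except Exception as exc:
            attempts += 1
            if attempts > max_restarts:
                raise  # give up after too many failures
            print(f"Job failed ({exc!r}); restart {attempts}/{max_restarts}")
            time.sleep(backoff_seconds * attempts)  # linear backoff
```

This only restarts the process; to resume from where the stream left off rather than from scratch, the app would also need Spark Streaming checkpointing enabled so state can be recovered on restart.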