Hi,

I'm implementing a Spark Streaming + ML application. The data arrives in a
Kafka topic in JSON format. The Spark Kafka connector reads it from the topic
as a DStream. After several preprocessing steps, the input DStream is
transformed into a feature DStream, which is fed into a Spark ML pipeline
model.
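
For context, the ingestion side looks roughly like this (the broker address,
topic name, and batch interval are placeholders, and this assumes the
kafka-0-8 direct stream connector for the DStream API):

import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-ml-stream")
ssc = StreamingContext(sc, batchDuration=10)

raw_stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["events"],
    kafkaParams={"metadata.broker.list": "broker1:9092"})

# each Kafka record is a (key, value) pair; the value is the JSON payload
json_stream = raw_stream.map(lambda kv: json.loads(kv[1]))

# ... several preprocessing steps turn json_stream into feature_stream ...

The code sample below shows how the feature DStream interacts with the
pipeline model.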

def predict_rdd(rdd, pipeline_model):
    if rdd is not None and not rdd.isEmpty():
        try:
            df = rdd.toDF()
            predictions = pipeline_model.transform(df)
            return predictions.rdd
        except Exception as e:
            print("Unable to make predictions: %s" % e)
            return None
    else:
        return None

prediction_stream = feature_stream.transform(
    lambda rdd: predict_rdd(rdd, pipeline_model))

Here is the problem. If pipeline_model.transform(df) fails because of data
issues in some rows of df, the try...except block cannot catch the exception:
transform() only builds a lazy plan, and the actual failure happens later, on
the executors, when the streaming output operation materializes the
predictions. As a result, the exception bubbles up to Spark and the streaming
application is terminated.
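
To make that concrete, here is a tiny standalone example (nothing to do with
my actual job; the UDF and data are made up) showing that a try/except around
a lazy transformation never fires:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.master("local[2]").appName("lazy-demo").getOrCreate()

@udf(DoubleType())
def inverse(x):
    return 1.0 / x          # fails with ZeroDivisionError for x == 0

df = spark.createDataFrame([(1.0,), (0.0,)], ["x"])

try:
    with_inv = df.withColumn("inv", inverse("x"))   # lazy; nothing runs here
except Exception:
    print("never reached")

with_inv.show()   # the failure only surfaces here, when the plan executes

In the streaming job, that materialization happens inside Spark's output
operation, so there is no obvious place to put my own try...except.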

I want the exception to be caught in some way so that the streaming
application is not terminated and keeps processing incoming data. Is this
possible?

I know one solution could be more thorough data validation in the
preprocessing step (something like the sketch below). However, some form of
error handling should still be in place around the pipeline model's transform
method, in case something unexpected slips through.
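
To be clear about what I mean by validation: something along these lines, run
on the executors before any rows reach transform(). The feature names and the
validation rule here are made up for illustration:

def looks_valid(record):
    # hypothetical rule: every expected feature field is present and numeric
    try:
        return all(isinstance(record[f], (int, float)) for f in ("f1", "f2", "f3"))
    except Exception:
        return False

clean_stream = feature_stream.filter(looks_valid)

Because filter() runs per record on the executors, a single bad record gets
dropped instead of failing the whole micro-batch.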


Thanks,



