Handling exceptions this way means handling errors on the driver side,
which may or may not be what you want. You can also write functions
with exception handling inside the transformations themselves, which
can make more sense in some cases, for example to ignore bad records
or to count them.
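
For instance, here is a minimal sketch of that executor-side style
(the path, the parse logic, and the accumulator name are illustrative,
not from this thread; sc is the usual SparkContext):

import scala.util.{Try, Success, Failure}

// Count bad records with an accumulator while dropping them from the result.
val badRecords = sc.accumulator(0L, "bad records")

val parsed = sc.textFile("hdfs:///data/input.csv").flatMap { line =>
  Try(line.split(",")(1).toInt) match {
    case Success(v) => Some(v)
    case Failure(_) =>
      badRecords += 1   // incremented on the executors
      None              // skip the bad record
  }
}

Note the accumulator's value is only meaningful on the driver after an
action has actually run over the data.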

If you want to handle errors at every step on the driver side, you
have to force each RDD to materialize to see whether it "works". You
can do that with .count() or .take(1).length > 0. But to avoid
recomputing the RDD afterwards, it has to be cached first, so there is
non-trivial overhead to approaching it this way.
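
A sketch of that driver-side check might look like this (buildMyRdd is
a hypothetical stand-in for whatever builds your RDD):

import scala.util.{Try, Success, Failure}

// Cache first so the check below doesn't force a recompute later.
val rdd = buildMyRdd(sc).cache()

Try(rdd.count()) match {   // count() forces the RDD to fully materialize
  case Success(n) => println(s"materialized $n records")
  case Failure(e) =>
    // handle the failure for this step here, e.g. log and rethrow
    throw e
}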

If you go this way, consider materializing only a few key RDDs in your
flow, not every one.

The most natural approach is indeed to handle exceptions where the
action occurs, since laziness means that is where they will surface.
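
To avoid repeating that boilerplate at every first action, one option
(a sketch, not an established Spark pattern) is to factor the handling
into a small reusable helper:

import scala.util.{Try, Success, Failure}
import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on pre-1.3 Spark)

// Hypothetical helper: defines the handling once and reuses it.
def withErrorHandling[T](body: => T): Option[T] =
  Try(body) match {
    case Success(result) => Some(result)
    case Failure(e) =>
      println(s"Action failed: ${e.getMessage}")   // your handling logic goes here
      None
  }

// Usage: wrap the action, where the exception actually surfaces.
val counts = withErrorHandling {
  sc.textFile("hdfs:///data/input.txt")
    .map(line => (line, 1))
    .reduceByKey(_ + _)
    .collect()
}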


On Wed, Mar 11, 2015 at 1:51 PM, Michal Klos <michal.klo...@gmail.com> wrote:
> Hi Spark Community,
>
> We would like to define exception handling behavior on RDD instantiation /
> build. Since the RDD is lazily evaluated, it seems like we are forced to put
> all exception handling in the first action call?
>
> This is an example of something that would be nice:
>
> def myRDD = {
>   Try {
>     val rdd = sc.textFile(...)
>   } match {
>     case Failure(e) => Handle ...
>   }
> }
>
> myRDD.reduceByKey(...) //don't need to worry about that exception here
>
> The reason being that we want to try to avoid having to copy paste exception
> handling boilerplate on every first action. We would love to define this
> once somewhere for the RDD build code and just re-use.
>
> Is there a best practice for this? Are we missing something here?
>
> thanks,
> Michal
