Hi, how does Spark handle the following case?
Thousands of CSV files (each around 50 MB) arrive from an external system, and a single RDD is defined over all of them. The RDD parses some of the CSV fields as BigDecimal and similar types. While building the RDD, the job errors out after some time with a bad BigDecimal format message (the error mentions max retries 4).

1) It is very likely that a massive dataset like this will have occasional bad rows, and it is not possible to fix the dataset or pre-process it to eliminate the bad data. How does Spark handle this? Is it possible to, say, ignore the first N bad rows? (One possible defensive-parsing workaround is sketched below.)

2) What is the "max 4 retries" in the error message? Is there any way to control it?

Thanks,
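For concreteness, here is a minimal sketch of the kind of setup described above, with a Try-based wrapper that drops unparseable rows instead of letting the whole task fail and retry. All names, paths, and the column layout are hypothetical; this is just one way to skip bad records, not necessarily the recommended one:

```scala
import scala.util.Try

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type for the parsed rows; field names are made up.
case class Trade(id: String, amount: BigDecimal)

object SkipBadRows {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("skip-bad-rows")
      // spark.task.maxFailures (default 4) is the setting behind the
      // "max retries" count seen in task-failure error messages.
      .set("spark.task.maxFailures", "4")
    val sc = new SparkContext(conf)

    // One RDD over thousands of CSV files; the path is hypothetical.
    val lines = sc.textFile("hdfs:///incoming/*.csv")

    // Wrap the fragile BigDecimal conversion in Try so a malformed row
    // becomes None and is silently dropped, rather than throwing an
    // exception that fails (and re-runs) the whole task.
    val trades = lines.flatMap { line =>
      val fields = line.split(",", -1)
      Try(Trade(fields(0), BigDecimal(fields(1).trim))).toOption
    }

    println(s"Good rows parsed: ${trades.count()}")
    sc.stop()
  }
}
```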
