Hi, how does Spark handle the following case?
Thousands of CSV files (each around 50 MB) arrive from an external system, and a single RDD is defined over all of them. The RDD parses some of the CSV fields as BigDecimal and similar types. While building the RDD, the job errors out after some time with a bad BigDecimal format message (the error mentions max retries 4).

1) It is very likely that a massive dataset like this will have occasional bad rows, and it is not possible to fix the dataset or pre-process it to eliminate the bad data. How does Spark handle this? Is it possible to, say, ignore the first N bad rows? (One possible defensive-parsing workaround is sketched below.)

2) What is the "max 4 retries" in the error message? Is there any way to control it?

Thanks,
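For concreteness, here is a minimal sketch of the kind of setup described above, with a Try-based wrapper that drops unparseable rows instead of letting the whole task fail and retry. All names, paths, and the column layout are hypothetical; this is just one way to skip bad records, not necessarily the recommended one:

```scala
import scala.util.Try

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type for the parsed rows; field names are made up.
case class Trade(id: String, amount: BigDecimal)

object SkipBadRows {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("skip-bad-rows")
      // spark.task.maxFailures (default 4) is the setting behind the
      // "max retries" count seen in task-failure error messages.
      .set("spark.task.maxFailures", "4")
    val sc = new SparkContext(conf)

    // One RDD over thousands of CSV files; the path is hypothetical.
    val lines = sc.textFile("hdfs:///incoming/*.csv")

    // Wrap the fragile BigDecimal conversion in Try so a malformed row
    // becomes None and is silently dropped, rather than throwing an
    // exception that fails (and re-runs) the whole task.
    val trades = lines.flatMap { line =>
      val fields = line.split(",", -1)
      Try(Trade(fields(0), BigDecimal(fields(1).trim))).toOption
    }

    println(s"Good rows parsed: ${trades.count()}")
    sc.stop()
  }
}
```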
