Thanks, Prashant.
On Thu, Jan 23, 2014 at 5:00 AM, Prashant Sharma <[email protected]> wrote:

> spark.task.maxFailures
> http://spark.incubator.apache.org/docs/latest/configuration.html
>
>
> On Thu, Jan 23, 2014 at 10:18 AM, Andrew Ash <[email protected]> wrote:
>
>> Why can't you preprocess to filter out the bad rows? I often do this on
>> CSV files by testing whether the raw line is parseable before splitting on ","
>> or similar. Just validate the line before attempting to apply BigDecimal
>> or anything like that.
>>
>> Cheers,
>> Andrew
>>
>>
>> On Wed, Jan 22, 2014 at 9:04 PM, Manoj Samel <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> How does Spark handle the following case?
>>>
>>> Thousands of CSV files (each about 50 MB in size) come from an external system.
>>> One RDD is defined over all of them, and it parses some of the CSV fields as
>>> BigDecimal etc. When building the RDD, it errors out after some time with a bad
>>> BigDecimal format (the error shows max retries 4).
>>>
>>> 1) It is very likely that a massive dataset will have occasional bad rows.
>>> It is not possible to fix this data set or pre-process it to
>>> eliminate the bad data. How does Spark handle this? Is it possible to say
>>> ignore the first N bad rows, etc.?
>>>
>>> 2) What were the max 4 retries in the error message? Is there any way to
>>> control them?
>>>
>>> Thanks,
>>>
>>>
>>>
>
>
>
> --
> Prashant
>
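For the archives, here is a minimal sketch of the approach Andrew describes: validate each raw line and drop the ones that will not parse, rather than letting BigDecimal throw inside the RDD. The Record case class, the two-column field layout, and the input path are made up for illustration; spark.task.maxFailures is the setting Prashant pointed to (the "4 retries" in the error), shown here only to illustrate where it is set, not as a recommended value.

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.util.Try

    object SkipBadRows {
      // Hypothetical record type for illustration.
      case class Record(id: String, amount: BigDecimal)

      // Parse one raw CSV line; return None for malformed rows instead of throwing.
      def parse(line: String): Option[Record] = {
        val fields = line.split(",", -1)
        if (fields.length != 2) None
        else Try(Record(fields(0).trim, BigDecimal(fields(1).trim))).toOption
      }

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("skip-bad-rows")
          // Raise the per-task retry limit (default 4) if transient failures are expected.
          .set("spark.task.maxFailures", "8")
        val sc = new SparkContext(conf)

        // Placeholder glob; flatMap silently drops lines that fail to parse.
        val records = sc.textFile("/data/csv/*").flatMap(parse)

        println("Parsed " + records.count() + " valid rows")
        sc.stop()
      }
    }

Note that raising spark.task.maxFailures only retries the same task; if the same bad row is hit on every attempt, the job still fails, which is why filtering at parse time is the more robust fix.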
