spark.task.maxFailures controls that retry limit: http://spark.incubator.apache.org/docs/latest/configuration.html
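
For example, something along these lines raises the limit (just a sketch: the
value 8 is arbitrary, and it assumes a release with SparkConf; on older
releases the same key can be set as a Java system property before the
SparkContext is created):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: raise the per-task failure limit (default 4) before the
    // context is created. The app name and the value 8 are placeholders;
    // master and deploy settings are omitted here.
    val conf = new SparkConf()
      .setAppName("csv-load")
      .set("spark.task.maxFailures", "8")
    val sc = new SparkContext(conf)

Keep in mind this only helps with transient failures; a row that is genuinely
malformed will fail the same way on every retry, which is why the
pre-filtering Andrew describes below is usually the more robust fix.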
On Thu, Jan 23, 2014 at 10:18 AM, Andrew Ash <[email protected]> wrote:

> Why can't you preprocess to filter out the bad rows? I often do this on
> CSV files by testing whether the raw line is "parseable" before splitting
> on "," or similar. Just validate the line before attempting to apply
> BigDecimal or anything like that.
>
> Cheers,
> Andrew
>
>
> On Wed, Jan 22, 2014 at 9:04 PM, Manoj Samel <[email protected]> wrote:
>
>> Hi,
>>
>> How does Spark handle the following case?
>>
>> Thousands of CSV files (each about 50 MB) come in from an external
>> system, and one RDD is defined over all of them. The RDD maps some of
>> the CSV fields to BigDecimal etc. While building the RDD, it errors out
>> after some time with a bad BigDecimal format (the error shows max
>> retries 4).
>>
>> 1) A massive dataset like this is very likely to have occasional bad
>> rows, and it is not possible to fix the dataset or pre-process it to
>> eliminate the bad data. How does Spark handle this? Is it possible to,
>> say, ignore the first N bad rows?
>>
>> 2) What were the "max 4 retries" in the error message? Is there any way
>> to control that?
>>
>> Thanks,
>>
--
Prashant
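
P.S. A rough Scala sketch of the parse-then-filter approach Andrew describes
above (the case class, field layout, and input path are made up for
illustration; sc is the SparkContext):

    // Validate each raw line before converting fields; rows that don't
    // parse become None and are dropped instead of failing the task.
    case class Record(id: Long, amount: BigDecimal)

    def parse(line: String): Option[Record] = {
      val fields = line.split(",", -1)
      if (fields.length != 2) None
      else try {
        Some(Record(fields(0).trim.toLong, BigDecimal(fields(1).trim)))
      } catch {
        case _: NumberFormatException => None
      }
    }

    val records = sc.textFile("hdfs:///path/to/csv/*")
      .flatMap(line => parse(line))

If you would rather know how many rows were dropped than silently ignore
them, an accumulator incremented in the None branch is a cheap way to count
them.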
