Re: [Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Hyukjin Kwon
Other options are maybe:
- the "spark.sql.files.ignoreCorruptFiles" option
- DataFrameReader.csv(csvDataset: Dataset[String]) with a custom InputFormat (available from Spark 2.2.0). For example: val rdd = spark.sparkContext.newAPIHadoopFile("/tmp/abcd", classOf[org.apache.hadoop.mapreduce. ... (see the sketch below)
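
The quoted example is cut off after classOf[org.apache.hadoop.mapreduce. A minimal sketch of how both options might look in full, assuming the stock Hadoop TextInputFormat stands in for the custom input format and reusing the illustrative path "/tmp/abcd" from the message:

    // Option 1: have Spark skip files that fail to read entirely.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

    // Option 2 (Spark 2.2.0+): read lines through an explicit Hadoop
    // InputFormat, then hand them to the CSV parser as a Dataset[String].
    import spark.implicits._

    val rdd = spark.sparkContext.newAPIHadoopFile(
      "/tmp/abcd",
      classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
      classOf[org.apache.hadoop.io.LongWritable],
      classOf[org.apache.hadoop.io.Text])

    // Copy each Text value out into a String, then parse as CSV.
    val lines = spark.createDataset(rdd.map(_._2.toString))
    val df = spark.read.csv(lines) // DataFrameReader.csv(csvDataset: Dataset[String])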

Re: [Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Jörn Franke
Hi,
The Spark CSV parser has different parsing modes:
* permissive (default): tries to read everything; missing tokens are interpreted as null and extra tokens are ignored
* dropmalformed: drops lines that have more or fewer tokens than expected
* failfast: throws a RuntimeException if there is a malformed line
Obviously ...
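
These modes are selected through the reader's "mode" option. A minimal illustration (the path and the header option are placeholders, not from the original thread):

    val df = spark.read
      .option("mode", "DROPMALFORMED") // or "PERMISSIVE" (the default), or "FAILFAST"
      .option("header", "true")        // assumption: the files carry a header row
      .csv("/path/to/data.csv")        // hypothetical path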

[Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Nathan Case
Accidentally sent this to the dev mailing list; meant to send it here. I have a Spark Java application that in the past has used the hadoopFile interface to specify a custom TextInputFormat to be used when reading files. This custom class would gracefully handle exceptions like EOF exceptions caused by ...
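
The message is cut off, but the approach it describes, a TextInputFormat whose record reader absorbs read errors instead of failing the task, could look roughly like the sketch below (in Scala rather than Java, to match the earlier example; LenientTextInputFormat is a hypothetical name, and treating every EOFException as end-of-input is an assumption about the desired behavior):

    import java.io.EOFException
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Hypothetical input format: wraps the standard line record reader and
    // treats an EOFException (e.g. from a truncated gzip file) as
    // end-of-input rather than letting it fail the whole task.
    class LenientTextInputFormat extends TextInputFormat {
      override def createRecordReader(
          split: InputSplit,
          context: TaskAttemptContext): RecordReader[LongWritable, Text] = {
        val inner = super.createRecordReader(split, context)
        new RecordReader[LongWritable, Text] {
          override def initialize(s: InputSplit, c: TaskAttemptContext): Unit =
            inner.initialize(s, c)
          // Stop reading, instead of crashing, when the stream ends early.
          override def nextKeyValue(): Boolean =
            try inner.nextKeyValue() catch { case _: EOFException => false }
          override def getCurrentKey: LongWritable = inner.getCurrentKey
          override def getCurrentValue: Text = inner.getCurrentValue
          override def getProgress: Float = inner.getProgress
          override def close(): Unit = inner.close()
        }
      }
    }

    // Usage with the RDD API from the replies above:
    // val rdd = spark.sparkContext.newAPIHadoopFile(
    //   "/path/to/input",                        // hypothetical path
    //   classOf[LenientTextInputFormat],
    //   classOf[org.apache.hadoop.io.LongWritable],
    //   classOf[org.apache.hadoop.io.Text])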