In my Spark job (Spark 2.4.1), I am reading CSV files from S3. These files contain Japanese characters. They can also contain the ^M character (U+000D), so I need to parse them as multiline.
First I used the following code to read the CSV files:

    implicit class DataFrameReadImplicits(dataFrameReader: DataFrameReader) {
      def readTeradataCSV(schema: StructType, s3Path: String): DataFrame = {
        dataFrameReader.option("delimiter", "\u0001")
          .option("header", "false")
          .option("inferSchema", "false")
          .option("multiLine", "true")
          .option("encoding", "UTF-8")
          .option("charset", "UTF-8")
          .schema(schema)
          .csv(s3Path)
      }
    }

But when I read a DataFrame using this method, all the Japanese characters are garbled. After some tests I found that if I read the same S3 file using *spark.sparkContext.textFile(path)*, the Japanese characters are decoded properly. So I tried it this way:

    implicit class SparkSessionImplicits(spark: SparkSession) {
      def readTeradataCSV(schema: StructType, s3Path: String) = {
        import spark.sqlContext.implicits._
        spark.read.option("delimiter", "\u0001")
          .option("header", "false")
          .option("inferSchema", "false")
          .option("multiLine", "true")
          .schema(schema)
          .csv(spark.sparkContext.textFile(s3Path).map(str => str.replaceAll("\u000D", " ")).toDS())
      }
    }

Now the encoding issue is fixed. However, multiline parsing doesn't work properly: lines are still broken at the ^M character, even though I tried to replace ^M using *str.replaceAll("\u000D"," ")*.

Any tips on how to read Japanese characters using the first method, or how to handle multiline records using the second method?
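
One possible direction for the second method, offered as an untested sketch rather than a confirmed fix: Hadoop's default line reader, which `textFile` relies on, treats `\r`, `\n` and `\r\n` all as line terminators, so the record may already be split at ^M before the `map` runs and the replacement never sees the character. Setting the Hadoop property `textinputformat.record.delimiter` to `"\n"` should make only LF end a record. The object name `ReadTeradataCsvSketch` is hypothetical, and the sketch assumes that rows end with a bare `\n` and that `\r` only appears inside field values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

// Hypothetical helper object, not part of the original code.
object ReadTeradataCsvSketch {

  def readTeradataCSV(spark: SparkSession, schema: StructType, s3Path: String): DataFrame = {
    import spark.implicits._

    // Assumption: every row ends with "\n" and a bare "\r" occurs only inside values.
    // With this property set, the line reader splits on "\n" only, so "\r" stays
    // in the string and can be replaced in the map below.
    // Note: this changes the shared Hadoop configuration of the SparkContext,
    // so it also affects later textFile reads in the same application.
    spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "\n")

    val lines = spark.sparkContext
      .textFile(s3Path)
      .map(_.replace("\r", " ")) // literal replacement, no regex needed
      .toDS()

    // multiLine is left out: each element of the Dataset[String] is one row here.
    spark.read
      .option("delimiter", "\u0001")
      .option("header", "false")
      .option("inferSchema", "false")
      .schema(schema)
      .csv(lines)
  }
}
```

If the files actually use `\r\n` as the row terminator, the trailing `\r` of each row would become a trailing space in the last column under this sketch, so the last field may need trimming.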