In my Spark job (Spark 2.4.1), I am reading CSV files on S3. These files
contain Japanese characters. They can also contain the ^M character (\u000D),
so I need to parse them as multiline.

First, I used the following code to read the CSV files:

import org.apache.spark.sql.{DataFrame, DataFrameReader}
import org.apache.spark.sql.types.StructType

implicit class DataFrameReadImplicits(dataFrameReader: DataFrameReader) {
  def readTeradataCSV(schema: StructType, s3Path: String): DataFrame = {
    dataFrameReader
      .option("delimiter", "\u0001")   // field separator used by the Teradata export
      .option("header", "false")
      .option("inferSchema", "false")
      .option("multiLine", "true")     // fields can contain ^M (\u000D)
      .option("encoding", "UTF-8")
      .option("charset", "UTF-8")      // "charset" is an alias of "encoding", so this is redundant
      .schema(schema)
      .csv(s3Path)
  }
}
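
For reference, I call it like this (the schema and path below are just
placeholders):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("col1", StringType),
  StructField("col2", StringType)))   // placeholder schema
val df = spark.read.readTeradataCSV(schema, "s3://my-bucket/teradata-export/")   // placeholder path
df.show(5, truncate = false)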

But when I read the DataFrame using this method, all the Japanese characters are garbled.

After some tests I found that if I read the same S3 file
using *spark.sparkContext.textFile(path)*, the Japanese characters are
decoded properly.
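
For example, something like this prints the Japanese text intact
(placeholder path):

spark.sparkContext.textFile("s3://my-bucket/teradata-export/part-000.csv")   // placeholder path
  .take(5)
  .foreach(println)   // Japanese characters come out correctly here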

So I tried this approach:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

implicit class SparkSessionImplicits(spark: SparkSession) {
  def readTeradataCSV(schema: StructType, s3Path: String): DataFrame = {
    import spark.implicits._   // for .toDS() on the RDD[String]
    spark.read
      .option("delimiter", "\u0001")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("multiLine", "true")
      .schema(schema)
      .csv(spark.sparkContext.textFile(s3Path)
        .map(str => str.replaceAll("\u000D", " "))   // try to strip ^M before parsing
        .toDS())
  }
}

Now the encoding issue is fixed. However, multiline parsing doesn't work:
records are still broken at the ^M character, even though I try to
replace ^M with *str.replaceAll("\u000D", " ")*. My guess is that
textFile has already split the input into records at every \r before the
map runs, so the replacement never sees a \u000D.
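
Here is a quick local reproduction of what I think is happening; my
assumption is that Hadoop's line reader treats \r, \n and \r\n all as
record delimiters, so the ^M is consumed as a line break before map ever
runs:

import java.nio.charset.StandardCharsets
import java.nio.file.Files

val tmp = Files.createTempFile("crtest", ".csv")
Files.write(tmp, "a\u0001b\u000Drest-of-b\u0001c\n".getBytes(StandardCharsets.UTF_8))
spark.sparkContext.textFile(tmp.toUri.toString).collect().foreach(println)
// prints two records ("a\u0001b" and "rest-of-b\u0001c") instead of one,
// so replaceAll("\u000D", " ") never sees a \u000D to replace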

Any tips on how to read the Japanese characters correctly using the first
method, or how to handle the multiline records using the second method?
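
For what it's worth, one fallback I'm considering (untested sketch) is to
read each file whole with wholeTextFiles so the ^M survives, strip it, and
then feed the cleaned lines to the CSV parser; this assumes each file fits
in executor memory:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

def readCleanedCSV(spark: SparkSession, schema: StructType, s3Path: String): DataFrame = {
  import spark.implicits._
  // one (path, content) pair per file, decoded as UTF-8, with CRs still present
  val cleaned = spark.sparkContext.wholeTextFiles(s3Path)
    .flatMap { case (_, content) =>
      content.replaceAll("\u000D", " ").split("\n")   // drop ^M first, then split on real newlines
    }
    .toDS()
  spark.read
    .option("delimiter", "\u0001")
    .option("header", "false")
    .schema(schema)
    .csv(cleaned)   // no multiLine needed, since the CRs are already gone
}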
