Based on your description, this isn't a problem in Spark. It means your JDBC connector isn't interpreting bytes from the database according to the encoding in which they were written. It could be Latin1, sure.
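To make the mismatch concrete, here is a minimal standalone sketch (hypothetical bytes, not the actual connector code) showing what happens when latin1 bytes are decoded with the wrong charset versus an explicitly named one:

```java
import java.nio.charset.StandardCharsets;

public class LatinOneDemo {
    public static void main(String[] args) {
        // 0xA9 is the copyright sign "©" in latin1 / ISO-8859-1.
        byte[] raw = {(byte) 0xA9};

        // Fragile: new String(raw) uses the platform default charset,
        // so it only happens to work where that default is latin1.

        // Misreading the bytes as UTF-8 mangles them: a lone 0xA9 is not
        // valid UTF-8, so it decodes to the replacement character.
        String wrong = new String(raw, StandardCharsets.UTF_8);
        System.out.println(wrong);   // "\uFFFD"

        // Portable: name the source encoding explicitly.
        String right = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(right);   // "©"
    }
}
```

Note the mangled decode cannot be undone from the String alone, which is why naming the charset up front (or fixing the connector's settings) matters.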
But if "new String(ResultSet.getBytes())" works, it's only because your platform's default JVM encoding happens to be Latin1 as well. You really need to specify the encoding explicitly in that constructor, or it will not, in general, work on other platforms. Even then, that's a workaround rather than the solution; ideally you find the connection setting that lets the JDBC connector read the data as intended.

On Tue, Sep 13, 2016 at 8:02 PM, Mark Bittmann <mbittm...@gmail.com> wrote:
> Hello Spark community,
>
> I'm reading from a MySQL database into a Spark dataframe using the JDBC
> connector functionality, and I'm experiencing some character encoding
> issues. The default encoding for MySQL strings is latin1, but the mysql JDBC
> connector implementation of "ResultSet.getString()" will return a mangled
> unicode encoding of the data for certain characters such as the "all rights
> reserved" char. Instead, you can use "new String(ResultSet.getBytes())",
> which will return the correctly encoded string. I've confirmed this behavior
> with the mysql connector classes (i.e., without using the Spark wrapper).
>
> I can see here that the Spark JDBC connector uses getString(), though there
> is a note to move to getBytes() for performance reasons:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L389
>
> For some special chars, I can reverse the behavior with a UDF that applies
> new String(badString.getBytes("Cp1252"), "UTF-8"); however, for some foreign
> characters the underlying byte array is irreversibly changed and the data is
> corrupted.
>
> I can submit an issue/PR to fix it going forward if "new
> String(ResultSet.getBytes())" is the correct approach.
>
> Meanwhile, can anyone offer recommendations on how to correct this
> behavior before it gets into a dataframe? I've tried every permutation of
> the settings in the JDBC connection url (characterSetResults,
> characterEncoding).
>
> I'm on Spark 1.6.
>
> Thanks!

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org