Hi,

I have a very simple script that just reads a file from HDFS and immediately
saves it back:

from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext('spark://master:7077', 'UnicodeTest')
    data = sc.textFile('hdfs://master/path/to/file.txt')
    data.saveAsTextFile('hdfs://master/path/to/copy')

If the contents of the file are ASCII-compatible, it works fine. But if there
are Unicode characters in the file, I get a *UnicodeEncodeError*:

  File "/usr/local/spark/python/pyspark/worker.py", line 82, in main
    for obj in func(split_index, iterator):
  File "/usr/local/spark/python/pyspark/rdd.py", line 555, in <genexpr>
    return (str(x).encode("utf-8") for x in iterator)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in
position 56: ordinal not in range(128)

As far as I understand, PySpark works with *unicode* objects internally, and
to save them to a file it tries to encode each object as UTF-8. But why does
it try to encode with the 'ascii' codec first? How can I fix this so that
non-ASCII characters are processed correctly?
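
For context, here is a minimal plain-Python-2 snippet (outside Spark) that,
as far as I can tell, reproduces the same behaviour as the line from rdd.py
in the traceback: str() implicitly encodes the unicode object with the
default 'ascii' codec before .encode("utf-8") is ever reached. The sample
string is just an illustration, not the actual file contents:

# Minimal Python 2 reproduction of what I think rdd.py does on save:
# str() on a unicode object triggers an implicit ASCII encode first.
line = u'espa\xf1ol'                  # a unicode line with a non-ASCII char
out = str(line).encode("utf-8")       # raises UnicodeEncodeError: 'ascii'
                                      # codec can't encode character u'\xf1'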
