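This is a bug. The str() call is there because I wanted to convert arbitrary objects to strings, like Java's toString(), but I should have used unicode() instead. In Python 2, calling str() on a unicode object implicitly encodes it with the default 'ascii' codec, which is why the error mentions 'ascii' even though the code asks for UTF-8. I'll submit a patch to fix this; I think it should be as simple as replacing str() with unicode().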
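A minimal sketch of what the replacement could look like, not the actual patch. The names to_utf8_lines, text_type and ascii_step_fails are hypothetical and only mimic the generator expression from the traceback. The text_type shim is an assumption added so the snippet also runs on Python 3, where str is already the text type.

```python
# -*- coding: utf-8 -*-
import sys

# Python 2 has a separate `unicode` type; on Python 3 `str` is the text type.
# (Hypothetical shim so the sketch runs on either version.)
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def to_utf8_lines(iterator):
    """Hypothetical stand-in for the generator in rdd.py.

    Converts each object to text first, then encodes it as UTF-8,
    so no implicit ASCII encoding step is involved.
    """
    return (text_type(x).encode("utf-8") for x in iterator)


def ascii_step_fails(value):
    """Reproduce explicitly the ASCII encoding that Python 2's str() does implicitly."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


records = [u"ma\xf1ana", 42]          # one non-ASCII string, one non-string object
encoded = list(to_utf8_lines(records))
```

Here ascii_step_fails(u"ma\xf1ana") is True, matching the error in the report, while to_utf8_lines encodes the same value without complaint. One caveat of this sketch: on Python 2, unicode() applied to a byte string containing non-ASCII bytes raises UnicodeDecodeError, so byte strings would need separate handling.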
On Thu, Nov 28, 2013 at 12:14 AM, Andrei <[email protected]> wrote:
> Hi,
>
> I have a very simple script that just reads file from HDFS and immediately
> saves it back:
>
>     from pyspark import SparkContext
>     if __name__ == '__main__':
>         sc = SparkContext('spark://master:7077', 'UnicodeTest')
>         data = sc.textFile('hdfs://master/path/to/file.txt')
>         data.saveAsTextFile('hdfs://master/path/to/copy')
>
> If contents of a file are ascii-compatible, it works fine. But if there
> are unicode characters in the file, I'm getting the *UnicodeEncodeError*:
>
>   File "/usr/local/spark/python/pyspark/worker.py", line 82, in main
>     for obj in func(split_index, iterator):
>   File "/usr/local/spark/python/pyspark/rdd.py", line 555, in <genexpr>
>     *return (str(x).encode("utf-8") for x in iterator)*
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in
> position 56: ordinal not in range(128)
>
> As far as I understand, PySpark works with *unicode* objects internally,
> and to save it into a file it tries to encode such an object into UTF-8.
> But why does it try to encode to 'ascii' first? How can I fix it to process
> UTF characters?
>
