Hi, I am getting this error (copied from the stderr of the worker that reported the exception) while processing text files encoded in UTF-8:

14/02/11 22:26:15 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main]
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "(base)/spark-0.9.0/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 182, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 117, in dump_stream
    for obj in iterator:
  File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 171, in _batched
    for item in iterator:
  File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 276, in load_stream
    yield self.loads(stream)
  File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 271, in loads
    return stream.read(length).decode('utf8')
  File "(base)/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte

I am using PySpark. "SPARK_MEM" is set to 30g. The system is deployed in standalone mode over 17 computers. Scala is version 2.10.3. Python is version 2.7.3.

I tried this code with the previous release (Spark 0.8.1, with Scala 2.9.3), and it executed successfully. But it fails with the latest release (0.9.0), so I am not sure whether this is a bug introduced in the latest version.

Any help would be appreciated.

Thanks!
Julaiti