This is probably a side effect of a bug introduced when I added custom serialization support to PySpark (https://spark-project.atlassian.net/browse/SPARK-1043). The fix (https://github.com/apache/incubator-spark/pull/523) wasn't included in Spark 0.9, but it will be in 0.9.1. It's a single commit, so you can cherry-pick it on top of 0.9 if you don't want to wait for the next bugfix release.
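
For illustration, here is a minimal sketch of how this kind of error can arise. It assumes a stream of length-prefixed records that the reader decodes as UTF-8, which is the shape of the `loads` frame in the traceback quoted below. The helper names (`write_framed`, `read_utf8_record`) and the 4-byte big-endian length prefix are assumptions made for the example, not PySpark's actual code, and this may not be exactly what happens in the failing job.

```python
import io
import struct


def write_framed(stream, payload):
    # Hypothetical helper: write a 4-byte big-endian length, then the raw bytes.
    stream.write(struct.pack("!i", len(payload)))
    stream.write(payload)


def read_utf8_record(stream):
    # Read the length prefix, then decode exactly that many bytes as UTF-8.
    (length,) = struct.unpack("!i", stream.read(4))
    return stream.read(length).decode("utf8")


# A record that really is UTF-8 text round-trips without trouble.
good = io.BytesIO()
write_framed(good, u"h\u00e9llo".encode("utf8"))
good.seek(0)
decoded = read_utf8_record(good)

# A record holding arbitrary binary data (for example, output written by a
# different serializer) fails as soon as it is decoded as UTF-8.
bad = io.BytesIO()
write_framed(bad, b"\xc0\x01\x02")
bad.seek(0)
try:
    read_utf8_record(bad)
    error = None
except UnicodeDecodeError as exc:
    error = exc
```

The byte 0xc0 can never start a valid UTF-8 sequence, so the second read raises `UnicodeDecodeError` at position 0. When a reader and writer disagree about the serializer, the input files themselves can be perfectly valid UTF-8 and the job can still fail this way.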

On Wed, Feb 12, 2014 at 12:24 AM, Julaiti Alafate <arapat.m...@gmail.com> wrote:

> Hi,
>
> I am getting this error (copied from the stderr of the worker that
> reports exceptions) while processing text files encoded in UTF8:
>
> 14/02/11 22:26:15 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main]
> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "(base)/spark-0.9.0/python/pyspark/worker.py", line 77, in main
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 182, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 117, in dump_stream
>     for obj in iterator:
>   File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 171, in _batched
>     for item in iterator:
>   File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 276, in load_stream
>     yield self.loads(stream)
>   File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 271, in loads
>     return stream.read(length).decode('utf8')
>   File "(base)/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte
>
> I am using PySpark. "SPARK_MEM" is set to 30g. The system is deployed in
> standalone mode over 17 computers. Scala is version 2.10.3. Python is
> version 2.7.3.
>
> I tried this code with previous release (spark 0.8.1, with scala 2.9.3).
> And it executes successfully. But it fails with the latest release (0.9.0).
> Thus I am not sure if this is a bug introduced by the latest version.
>
> Any help would be appreciated. Thanks!
>
> Julaiti