I am loading a simple text file using pyspark. Repartitioning it seems to produce garbage data.
I got these results using Spark 2.1 prebuilt for Hadoop 2.7, in the pyspark shell:

>>> sc.textFile("outc").collect()
[u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
>>> sc.textFile("outc", use_unicode=False).collect()
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']

Repartitioning seems to produce garbage, and only 2 records instead of 12:

>>> sc.textFile("outc", use_unicode=False).repartition(10).collect()
['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
>>> sc.textFile("outc", use_unicode=False).repartition(10).count()
2

(The garbage strings start with \x80\x02, which is the pickle protocol 2 header, so they look like pickled batches of the original records rather than the records themselves.)

Without setting use_unicode=False we can't even collect the repartitioned RDD at all:

>>> sc.textFile("outc").repartition(19).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py", line 810, in collect
    return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py", line 140, in _load_from_socket
    for item in serializer.load_stream(rf):
  File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py", line 529, in load_stream
    yield self.loads(stream)
  File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py", line 524, in loads
    return s.decode("utf-8") if self.use_unicode else s
  File "/home/snuderl/scrappstore/virtualenv/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

Input file contents:

a
b
c
d
e
f
g
h
i
j
k
l
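
For reference, here is a standalone sketch of the same reproduction outside the shell (the app name and the local[*] master are only illustrative, everything else matches what I did above):

# Standalone reproduction sketch for the repartition behaviour above.
from pyspark import SparkContext

# Recreate the "outc" input file: one letter per line, a through l.
with open("outc", "w") as f:
    f.write("\n".join("abcdefghijkl") + "\n")

sc = SparkContext("local[*]", "repartition-repro")

rdd = sc.textFile("outc", use_unicode=False)
print(rdd.collect())                  # ['a', 'b', ..., 'l'] as expected
print(rdd.repartition(10).collect())  # two pickle-looking byte strings
print(rdd.repartition(10).count())    # 2 instead of 12

sc.stop()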
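
Since the broken strings look pickled, my guess (untested) is that after the shuffle the data comes back pickled while the RDD still tries to read it as plain text. The workaround I was thinking of trying is to push the records through a no-op Python map before repartitioning, roughly like this (just a sketch, I have not confirmed it helps):

# Untested workaround sketch: the no-op map() should switch the RDD to
# PySpark's pickle-based serializer before the shuffle, instead of the
# plain-text deserializer that textFile() sets up.
rdd = sc.textFile("outc", use_unicode=False)
repartitioned = rdd.map(lambda x: x).repartition(10)
print(repartitioned.count())    # hopefully 12 rather than 2
print(repartitioned.collect())  # hopefully the original records

Is this expected behaviour, or a bug in repartition?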