pyspark EOFError after calling map

Pete Werner Wed, 13 Apr 2016 16:53:34 -0700

Hi

I am new to spark & pyspark.


I am reading a small csv file (~40k rows) into a dataframe.

from pyspark.sql import functions as F
df =
sqlContext.read.format('com.databricks.spark.csv').options(header='true',
inferschema='true').load('/tmp/sm.csv')
df = df.withColumn('verified', F.when(df['verified'] == 'Y',
1).otherwise(0))
df2 = df.map(lambda x: Row(label=float(x[0]),
features=Vectors.dense(x[1:]))).toDF()

I get some weird error that does not occur every single time, but does
happen pretty regularly

>>> df2.show(1)
+--------------------+---------+
|            features|    label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|0.0|
+--------------------+---------+
only showing top 1 row

>>> df2.count()
41999

>>> df2.show(1)
+--------------------+---------+
|            features|    label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|0.0|
+--------------------+---------+
only showing top 1 row

>>> df2.count()
41999

>>> df2.show(1)
Traceback (most recent call last):
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in
manager
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in
worker
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/worker.py", line 136, in
main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/serializers.py", line
545, in read_int
    raise EOFError
EOFError
+--------------------+---------+
|            features|    label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row

Once that EOFError has been raised, I will not see it again until I do
something that requires interacting with the spark server

When I call df2.count() it shows that [Stage xxx] prompt which is what I
mean by it going to the spark server.

Anything that triggers that seems to eventually end up giving the EOFError
again when I do something with df2.

It does not seem to happen with df (vs. df2) so seems like it must be
something happening with the df.map() line.

-- 

Pete Werner
Data Scientist
Freelancer.com

Level 20
680 George Street
Sydney NSW 2000

e: pwer...@freelancer.com
p:  +61 2 8599 2700
w: http://www.freelancer.com

pyspark EOFError after calling map

Reply via email to