I'm using Spark 1.6.1 (Hadoop 2.6) and I'm trying to load a file from S3. I did this previously with Spark 1.5 with no issue. I'm attempting to load and count a single file as follows:
    dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv')
    dataFrame.count()

But when it attempts to load, it creates 279K tasks. When I look at the stage, the number of tasks is identical to the number of bytes in the file. Has anyone seen anything like this, or does anyone have any ideas why it's splitting at that granularity?
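In case it helps diagnose this, here's a minimal way to check the partitioning directly without triggering the full count, by inspecting the DataFrame's underlying RDD (standard PySpark API; the bucket path is the same placeholder as above, and this assumes the `sqlContext` provided by the pyspark shell):

    # Load the file and report how many input partitions Spark created,
    # without running the count job itself.
    dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv')
    print(dataFrame.rdd.getNumPartitions())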