I'm using Spark 1.6.1 (Hadoop 2.6) and I'm trying to load a file from S3. I did this previously with Spark 1.5 with no issue. I'm attempting to load and count a single file as follows:
    dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv')
    dataFrame.count()

But when it attempts to load, it creates 279K tasks. When I look at the stage, the number of tasks is identical to the number of bytes in the file. Has anyone seen anything like this, or does anyone have any ideas why it's splitting at that granularity?
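In case it helps diagnose this, here's a minimal way to check the partitioning directly without triggering the full count, by inspecting the DataFrame's underlying RDD (standard PySpark API; the bucket path is the same placeholder as above, and this assumes the `sqlContext` provided by the pyspark shell):

    # Load the file and report how many input partitions Spark created,
    # without running the count job itself.
    dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv')
    print(dataFrame.rdd.getNumPartitions())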