Hi all,

I have an app streaming from S3 (textFileStream), and recently I've observed
increasing delays and very long file-listing times:

INFO dstream.FileInputDStream: Finding new files took 394160 ms
...
INFO scheduler.JobScheduler: Total delay: 404.796 s for time 1449100200000
ms (execution: 10.154 s)

At this point I have about 13K files under the key prefix I'm monitoring;
Hadoop takes about 6 minutes to list them all, while the aws cli takes only
seconds.
My understanding is that this is a current limitation of Hadoop's S3 listing,
but I wanted to confirm it in case it's a misconfiguration on my part.
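
For reference, the relevant part of the app is basically just this (bucket,
prefix and batch interval below are placeholders, not the real values):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="s3-stream")
ssc = StreamingContext(sc, 60)  # 60 s batches (placeholder interval)

# textFileStream lists the monitored prefix on every batch; this listing is
# what is now taking ~6 minutes with ~13K objects under the prefix.
lines = ssc.textFileStream("s3n://my-bucket/my-prefix/")
lines.count().pprint()

ssc.start()
ssc.awaitTermination()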

Some alternatives I'm considering:
1. copy old files to a different key prefix
2. use one of the available SQS receivers
(https://github.com/imapi/spark-sqs-receiver ?)
3. implement the S3 listing outside of Spark and feed the data in via
socketTextStream, though I couldn't find out whether that's reliable (rough
sketch after this list)
4. create a custom S3 receiver using the aws sdk (even though it doesn't look
like custom receivers can be used from pyspark)
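
For (3), a rough, untested sketch of what I have in mind, using boto3 (bucket,
prefix and port are placeholders; the Spark side would just be
ssc.socketTextStream("localhost", 9999)):

import socket
import time

import boto3

BUCKET = "my-bucket"   # placeholder
PREFIX = "incoming/"   # placeholder
PORT = 9999            # must match the port passed to socketTextStream

s3 = boto3.client("s3")
seen = set()

def new_keys():
    # List everything under the prefix and yield only keys not served yet.
    paginator = s3.get_paginator("list_objects")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if obj["Key"] not in seen:
                seen.add(obj["Key"])
                yield obj["Key"]

# socketTextStream connects to us as a client, so this script acts as the server.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("localhost", PORT))
server.listen(1)
conn, _ = server.accept()

while True:
    for key in new_keys():
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        conn.sendall(body + b"\n")
    time.sleep(10)  # poll interval

This is exactly where my reliability doubt comes in: nothing here tracks what
has already been delivered if the driver or this script restarts.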

Has anyone experienced the same issue and found a better way to solve it?

Thanks,
Michele


