Hi all,

I have an app streaming from S3 (textFileStream), and recently I've observed increasing delays and very long file-listing times:
INFO dstream.FileInputDStream: Finding new files took 394160 ms
...
INFO scheduler.JobScheduler: Total delay: 404.796 s for time 1449100200000 ms (execution: 10.154 s)

At this point I have about 13K files under the key prefix I'm monitoring: Hadoop takes about 6 minutes to list them all, while the AWS CLI takes only seconds. My understanding is that this is a current limitation of Hadoop's S3 listing, but I wanted to confirm that in case it's a misconfiguration on my part.

Some alternatives I'm considering:

1. copy old files to a different key prefix
2. use one of the available SQS receivers (https://github.com/imapi/spark-sqs-receiver ?)
3. implement the S3 listing outside of Spark and feed the data in through socketTextStream (see the sketch after my signature), though I couldn't find out whether that approach is reliable
4. create a custom S3 receiver using the AWS SDK (even though it doesn't look like custom receivers can be used from PySpark)

Has anyone experienced the same issue and found a better way to solve it?

Thanks,
Michele
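P.S. For option 3, here is a rough sketch of what I have in mind, in case it helps the discussion. It assumes boto3 and uses placeholder values (BUCKET, PREFIX, port 9999); the paginated listing is essentially what the AWS CLI does under the hood, which is why it comes back in seconds. A small standalone script polls the prefix, reads each new object, and serves its lines over a plain TCP socket that the streaming app consumes with ssc.socketTextStream(host, 9999):

import socket
import time

import boto3

BUCKET = "my-bucket"    # placeholder
PREFIX = "incoming/"    # placeholder
POLL_SECONDS = 30       # how often to re-list the prefix

s3 = boto3.client("s3")

def new_keys(seen):
    """Page through the prefix and yield keys we haven't served yet."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if obj["Key"] not in seen:
                yield obj["Key"]

# Plain TCP server; Spark's socketTextStream connects to it as a client.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 9999))
server.listen(1)
conn, _ = server.accept()

seen = set()
while True:
    for key in new_keys(seen):
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]
        for line in body.iter_lines():   # yields bytes, newline stripped
            conn.sendall(line + b"\n")
        seen.add(key)
    time.sleep(POLL_SECONDS)

My worry about reliability still stands, though: as far as I can tell the socket receiver has no acknowledgement or replay, so anything sent while the receiver is down or restarting is simply lost, and the "seen" set here lives only in memory.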