Hi Steve, I'm on Hadoop 2.7.1 using the s3n connector.
From: Steve Loughran <ste...@hortonworks.com>
Date: Thursday, December 3, 2015 at 4:12 AM
Cc: SPARK-USERS <user@spark.apache.org>
Subject: Re: Spark Streaming from S3

> On 3 Dec 2015, at 00:42, Michele Freschi <mfres...@palantir.com> wrote:
>
> Hi all,
>
> I have an app streaming from s3 (textFileStream) and recently I've observed
> increasing delay and a long time to list files:
>
> INFO dstream.FileInputDStream: Finding new files took 394160 ms
> ...
> INFO scheduler.JobScheduler: Total delay: 404.796 s for time 1449100200000 ms
> (execution: 10.154 s)
>
> At this time I have about 13K files under the key prefix that I'm monitoring -
> hadoop takes about 6 minutes to list all the files, while the aws cli takes
> only seconds.
> My understanding is that this is a current limitation of hadoop, but I wanted
> to confirm it in case it's a misconfiguration on my part.

Not a known issue. The usual questions: which Hadoop version, and are you using
the s3n or s3a connector? The latter does use the AWS SDK, but it's only been
stable enough to use in Hadoop 2.7.

> Some alternatives I'm considering:
> 1. copy old files to a different key prefix
> 2. use one of the available SQS receivers
> (https://github.com/imapi/spark-sqs-receiver ?)
> 3. implement the s3 listing outside of spark and use socketTextStream, but I
> couldn't find whether it's reliable or not
> 4. create a custom s3 receiver using the aws sdk (even if it doesn't look like
> it's possible to use them from pyspark)
>
> Has anyone experienced the same issue and found a better way to solve it?
>
> Thanks,
> Michele
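For option 1 above (keeping old files out of the monitored key prefix), one common approach is to write incoming files under a date-partitioned prefix so that textFileStream only ever has to list the current day's keys. A minimal sketch follows; the function name and prefix layout are hypothetical, not part of any Spark or Hadoop API:

```python
from datetime import datetime, timezone

def partitioned_key(base_prefix, filename, when=None):
    """Build a date-partitioned S3 key, e.g. incoming/2015/12/03/app.log.

    If writers place each file under the current day's prefix, the
    streaming job can monitor only that prefix, keeping the listing
    small; older days can be archived or ignored under their own
    prefixes.
    """
    when = when or datetime.now(timezone.utc)
    return "{}/{:04d}/{:02d}/{:02d}/{}".format(
        base_prefix, when.year, when.month, when.day, filename)
```

The streaming job would then be restarted (or reconfigured) to point at the new day's prefix, trading a small amount of operational complexity for a bounded listing time.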