Re: Spark Streaming from S3
On 3 Dec 2015, at 19:31, Michele Freschi <mfres...@palantir.com> wrote:

> Hi Steve, I'm on hadoop 2.7.1 using the s3n

Switch to s3a. It's got better performance on big files (including a better forward seek that doesn't close connections and a faster close() on reads), and it uses Amazon's own libraries. If you still have issues, file them under https://issues.apache.org/jira/browse/HADOOP-11694 .

FWIW, there aren't any explicit tests in the Hadoop codebase for working at that scale. There is one testing directory deletion that can be configured to scale up, but it's not doing the same actions as you, and it doesn't usually get run for more than ~100 blobs, because object store test runs slow down builds, can be a bit unreliable, and, for cost and credential security, aren't run in the ASF Jenkins builds.

As s3a is the only connector being worked on (we're too scared of breaking s3n), it's the one to try, and the one to complain about if it underperforms.

-Steve

From: Steve Loughran <ste...@hortonworks.com>
Date: Thursday, December 3, 2015 at 4:12 AM
Cc: SPARK-USERS <user@spark.apache.org>
Subject: Re: Spark Streaming from S3

> On 3 Dec 2015, at 00:42, Michele Freschi <mfres...@palantir.com> wrote:
>
> Hi all,
>
> I have an app streaming from s3 (textFileStream), and recently I've observed increasing delay and a long time to list files:
>
> INFO dstream.FileInputDStream: Finding new files took 394160 ms
> ...
> INFO scheduler.JobScheduler: Total delay: 404.796 s for time 144910020 ms (execution: 10.154 s)
>
> At this time I have about 13K files under the key prefix that I'm monitoring. Hadoop takes about 6 minutes to list all the files, while the aws cli takes only seconds. My understanding is that this is a current limitation of hadoop, but I wanted to confirm it in case it's a misconfiguration on my part.

Not a known issue. Usual questions: which Hadoop version, and are you using the s3n or s3a connector? The latter does use the AWS SDK, but it's only been stable enough to use in Hadoop 2.7.

> Some alternatives I'm considering:
> 1. copy old files to a different key prefix
> 2. use one of the available SQS receivers (https://github.com/imapi/spark-sqs-receiver ?)
> 3. implement the s3 listing outside of spark and use socketTextStream, but I couldn't find whether it's reliable or not
> 4. create a custom s3 receiver using the aws sdk (even if it doesn't look like it's possible to use them from pyspark)
>
> Has anyone experienced the same issue and found a better way to solve it?
>
> Thanks,
> Michele
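[Editor's note] Alternative 3 above (doing the S3 listing outside of Spark and pushing lines through socketTextStream) needs a small line-oriented TCP server for Spark to connect to. A minimal sketch follows; the serve_lines helper name, host, and payloads are made up for illustration, and in practice the lines would come from an external S3 listing (e.g. via boto) rather than a fixed list:

```python
import socket
import threading

def serve_lines(lines, host="127.0.0.1", port=0):
    """Serve each line, newline-terminated (the framing socketTextStream
    expects), to the first client that connects. Returns the bound port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))   # port 0 -> let the OS pick a free port
    srv.listen(1)
    bound_port = srv.getsockname()[1]

    def handle():
        conn, _ = srv.accept()
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
        srv.close()

    threading.Thread(target=handle, daemon=True).start()
    return bound_port
```

Spark would then consume it with ssc.socketTextStream(host, port). On the reliability question: the socket receiver keeps no acknowledgement state, so lines sent while the receiver is down or restarting are simply lost, which is worth weighing before building on it.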
Re: Spark Streaming from S3
Hi Steve, I'm on hadoop 2.7.1 using the s3n

From: Steve Loughran
Date: Thursday, December 3, 2015 at 4:12 AM
Cc: SPARK-USERS
Subject: Re: Spark Streaming from S3

> On 3 Dec 2015, at 00:42, Michele Freschi wrote:
>
> Hi all,
>
> I have an app streaming from s3 (textFileStream), and recently I've observed increasing delay and a long time to list files:
>
> INFO dstream.FileInputDStream: Finding new files took 394160 ms
> ...
> INFO scheduler.JobScheduler: Total delay: 404.796 s for time 144910020 ms (execution: 10.154 s)
>
> At this time I have about 13K files under the key prefix that I'm monitoring. Hadoop takes about 6 minutes to list all the files, while the aws cli takes only seconds. My understanding is that this is a current limitation of hadoop, but I wanted to confirm it in case it's a misconfiguration on my part.

Not a known issue. Usual questions: which Hadoop version, and are you using the s3n or s3a connector? The latter does use the AWS SDK, but it's only been stable enough to use in Hadoop 2.7.

> Some alternatives I'm considering:
> 1. copy old files to a different key prefix
> 2. use one of the available SQS receivers (https://github.com/imapi/spark-sqs-receiver ?)
> 3. implement the s3 listing outside of spark and use socketTextStream, but I couldn't find whether it's reliable or not
> 4. create a custom s3 receiver using the aws sdk (even if it doesn't look like it's possible to use them from pyspark)
>
> Has anyone experienced the same issue and found a better way to solve it?
>
> Thanks,
> Michele
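[Editor's note] On Hadoop 2.7.x, switching from s3n to s3a is mostly a matter of credential properties and the URL scheme (plus having the hadoop-aws and AWS SDK jars on the classpath). A core-site.xml sketch, with placeholder key values:

```xml
<!-- core-site.xml sketch: s3a credentials (values are placeholders) -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

With that in place, the stream would point at s3a://bucket/prefix instead of s3n://bucket/prefix.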
Re: Spark Streaming from S3
On 3 Dec 2015, at 00:42, Michele Freschi <mfres...@palantir.com> wrote:

> Hi all,
>
> I have an app streaming from s3 (textFileStream), and recently I've observed increasing delay and a long time to list files:
>
> INFO dstream.FileInputDStream: Finding new files took 394160 ms
> ...
> INFO scheduler.JobScheduler: Total delay: 404.796 s for time 144910020 ms (execution: 10.154 s)
>
> At this time I have about 13K files under the key prefix that I'm monitoring. Hadoop takes about 6 minutes to list all the files, while the aws cli takes only seconds. My understanding is that this is a current limitation of hadoop, but I wanted to confirm it in case it's a misconfiguration on my part.

Not a known issue. Usual questions: which Hadoop version, and are you using the s3n or s3a connector? The latter does use the AWS SDK, but it's only been stable enough to use in Hadoop 2.7.

> Some alternatives I'm considering:
> 1. copy old files to a different key prefix
> 2. use one of the available SQS receivers (https://github.com/imapi/spark-sqs-receiver ?)
> 3. implement the s3 listing outside of spark and use socketTextStream, but I couldn't find whether it's reliable or not
> 4. create a custom s3 receiver using the aws sdk (even if it doesn't look like it's possible to use them from pyspark)
>
> Has anyone experienced the same issue and found a better way to solve it?
>
> Thanks,
> Michele
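[Editor's note] The log numbers quoted in this thread are consistent with a per-object metadata round-trip rather than a bulk listing. A rough back-of-the-envelope, assuming the 394160 ms covers all ~13,000 objects:

```python
# From the log line: "Finding new files took 394160 ms" over ~13,000 objects.
files = 13_000
listing_ms = 394_160

per_object_ms = listing_ms / files
print(f"{per_object_ms:.1f} ms per object")  # -> 30.3 ms per object
```

About 30 ms per object is roughly the cost of one HTTPS request to S3, whereas a bulk LIST call returns up to 1000 keys per request, which would explain why the aws cli finishes in seconds.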