Re: Spark Streaming from S3
On 3 Dec 2015, at 19:31, Michele Freschi <mfres...@palantir.com> wrote:

> Hi Steve, I'm on hadoop 2.7.1 using the s3n

Switch to s3a. It's got better performance on big files (including a better forward seek that doesn't close connections and a faster close() on reads), and it uses Amazon's own libraries. If you still have issues, file them under https://issues.apache.org/jira/browse/HADOOP-11694 .

FWIW, there aren't any explicit tests in the Hadoop codebase for working at that scale. There is one testing directory deletion that can be configured to scale up, but it's not doing the same actions as you, and it doesn't usually get run for more than ~100 blobs, because object store test runs slow down builds, can be a bit unreliable, and, for cost and credential security, aren't run in the ASF Jenkins builds.

As s3a is the only connector being worked on (we're too scared of breaking s3n), it's the one to try, and the one to complain about if it underperforms.

-Steve

From: Steve Loughran <ste...@hortonworks.com>
Date: Thursday, December 3, 2015 at 4:12 AM
Cc: SPARK-USERS <user@spark.apache.org>
Subject: Re: Spark Streaming from S3

> On 3 Dec 2015, at 00:42, Michele Freschi <mfres...@palantir.com> wrote:
>
> Hi all,
>
> I have an app streaming from s3 (textFileStream), and recently I've observed increasing delay and a long time to list files:
>
> INFO dstream.FileInputDStream: Finding new files took 394160 ms
> ...
> INFO scheduler.JobScheduler: Total delay: 404.796 s for time 144910020 ms (execution: 10.154 s)
>
> At this time I have about 13K files under the key prefix that I'm monitoring. Hadoop takes about 6 minutes to list all the files, while the aws cli takes only seconds. My understanding is that this is a current limitation of hadoop, but I wanted to confirm it in case it's a misconfiguration on my part.

Not a known issue. Usual questions: which Hadoop version, and are you using the s3n or s3a connector? The latter does use the AWS SDK, but it's only been stable enough to use in Hadoop 2.7.

> Some alternatives I'm considering:
> 1. copy old files to a different key prefix
> 2. use one of the available SQS receivers (https://github.com/imapi/spark-sqs-receiver ?)
> 3. implement the s3 listing outside of spark and use socketTextStream, but I couldn't find whether it's reliable or not
> 4. create a custom s3 receiver using the aws sdk (even if it doesn't look like it's possible to use them from pyspark)
>
> Has anyone experienced the same issue and found a better way to solve it?
>
> Thanks,
> Michele
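[Editor's note] Alternative 3 above (doing the S3 listing outside of Spark and pushing lines through socketTextStream) needs a small line-oriented TCP server for Spark to connect to. A minimal sketch follows; the serve_lines helper name, host, and payloads are made up for illustration, and in practice the lines would come from an external S3 listing (e.g. via boto) rather than a fixed list:

```python
import socket
import threading

def serve_lines(lines, host="127.0.0.1", port=0):
    """Serve each line, newline-terminated (the framing socketTextStream
    expects), to the first client that connects. Returns the bound port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))   # port 0 -> let the OS pick a free port
    srv.listen(1)
    bound_port = srv.getsockname()[1]

    def handle():
        conn, _ = srv.accept()
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
        srv.close()

    threading.Thread(target=handle, daemon=True).start()
    return bound_port
```

Spark would then consume it with ssc.socketTextStream(host, port). On the reliability question: the socket receiver keeps no acknowledgement state, so lines sent while the receiver is down or restarting are simply lost, which is worth weighing before building on it.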
Re: Spark Streaming from S3
Hi Steve, I'm on hadoop 2.7.1 using the s3n

From: Steve Loughran
Date: Thursday, December 3, 2015 at 4:12 AM
Cc: SPARK-USERS
Subject: Re: Spark Streaming from S3

> On 3 Dec 2015, at 00:42, Michele Freschi wrote:
>
> Hi all,
>
> I have an app streaming from s3 (textFileStream), and recently I've observed increasing delay and a long time to list files:
>
> INFO dstream.FileInputDStream: Finding new files took 394160 ms
> ...
> INFO scheduler.JobScheduler: Total delay: 404.796 s for time 144910020 ms (execution: 10.154 s)
>
> At this time I have about 13K files under the key prefix that I'm monitoring. Hadoop takes about 6 minutes to list all the files, while the aws cli takes only seconds. My understanding is that this is a current limitation of hadoop, but I wanted to confirm it in case it's a misconfiguration on my part.

Not a known issue. Usual questions: which Hadoop version, and are you using the s3n or s3a connector? The latter does use the AWS SDK, but it's only been stable enough to use in Hadoop 2.7.

> Some alternatives I'm considering:
> 1. copy old files to a different key prefix
> 2. use one of the available SQS receivers (https://github.com/imapi/spark-sqs-receiver ?)
> 3. implement the s3 listing outside of spark and use socketTextStream, but I couldn't find whether it's reliable or not
> 4. create a custom s3 receiver using the aws sdk (even if it doesn't look like it's possible to use them from pyspark)
>
> Has anyone experienced the same issue and found a better way to solve it?
>
> Thanks,
> Michele
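[Editor's note] On Hadoop 2.7.x, switching from s3n to s3a is mostly a matter of credential properties and the URL scheme (plus having the hadoop-aws and AWS SDK jars on the classpath). A core-site.xml sketch, with placeholder key values:

```xml
<!-- core-site.xml sketch: s3a credentials (values are placeholders) -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

With that in place, the stream would point at s3a://bucket/prefix instead of s3n://bucket/prefix.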
Re: Spark Streaming from S3
On 3 Dec 2015, at 00:42, Michele Freschi <mfres...@palantir.com> wrote:

> Hi all,
>
> I have an app streaming from s3 (textFileStream), and recently I've observed increasing delay and a long time to list files:
>
> INFO dstream.FileInputDStream: Finding new files took 394160 ms
> ...
> INFO scheduler.JobScheduler: Total delay: 404.796 s for time 144910020 ms (execution: 10.154 s)
>
> At this time I have about 13K files under the key prefix that I'm monitoring. Hadoop takes about 6 minutes to list all the files, while the aws cli takes only seconds. My understanding is that this is a current limitation of hadoop, but I wanted to confirm it in case it's a misconfiguration on my part.

Not a known issue. Usual questions: which Hadoop version, and are you using the s3n or s3a connector? The latter does use the AWS SDK, but it's only been stable enough to use in Hadoop 2.7.

> Some alternatives I'm considering:
> 1. copy old files to a different key prefix
> 2. use one of the available SQS receivers (https://github.com/imapi/spark-sqs-receiver ?)
> 3. implement the s3 listing outside of spark and use socketTextStream, but I couldn't find whether it's reliable or not
> 4. create a custom s3 receiver using the aws sdk (even if it doesn't look like it's possible to use them from pyspark)
>
> Has anyone experienced the same issue and found a better way to solve it?
>
> Thanks,
> Michele
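[Editor's note] The log numbers quoted in this thread are consistent with a per-object metadata round-trip rather than a bulk listing. A rough back-of-the-envelope, assuming the 394160 ms covers all ~13,000 objects:

```python
# From the log line: "Finding new files took 394160 ms" over ~13,000 objects.
files = 13_000
listing_ms = 394_160

per_object_ms = listing_ms / files
print(f"{per_object_ms:.1f} ms per object")  # -> 30.3 ms per object
```

About 30 ms per object is roughly the cost of one HTTPS request to S3, whereas a bulk LIST call returns up to 1000 keys per request, which would explain why the aws cli finishes in seconds.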