This is awesome! I have someplace to start from.

Thanks,
Ben
> On Apr 9, 2016, at 9:45 AM, programminggee...@gmail.com wrote:
>
> Someone please correct me if I am wrong, as I am still rather green to
> Spark; however, it appears that through the S3 notification mechanism
> described below, you can publish events to SQS and use SQS as a
> streaming source in Spark. The project at
> https://github.com/imapi/spark-sqs-receiver appears to provide a
> library for doing this.
>
> Hope this helps.
>
> On Apr 9, 2016, at 9:55 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Nezih,
>>
>> This looks like a good alternative to having the Spark Streaming job
>> check for new files on its own. Do you know if there is a way to have
>> the Spark Streaming job get notified with the new file information and
>> act upon it? That would reduce the overhead and cost of polling S3.
>> Plus, I could use the same notifications to kick off Lambda to process
>> new data files and make them ready for Spark Streaming to consume. I
>> would just need to configure notifications on all incoming folders for
>> Lambda and on all outgoing folders for Spark Streaming. This sounds
>> like a better setup than what we have now.
>>
>> Thanks,
>> Ben
>>
>>> On Apr 9, 2016, at 12:25 AM, Nezih Yigitbasi <nyigitb...@netflix.com> wrote:
>>>
>>> While it is doable in Spark, S3 also supports notifications:
>>> http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
>>>
>>> On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande <nlaucha...@gmail.com> wrote:
>>>
>>> Hi Benjamin,
>>>
>>> I have done it. The critical configuration items are the ones below:
>>>
>>>     ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
>>>       "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>>>     ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
>>>       AccessKeyId)
>>>     ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
>>>       AWSSecretAccessKey)
>>>
>>>     val inputS3Stream = ssc.textFileStream("s3n://example_bucket/folder")
>>>
>>> This code will probe for new S3 files created under the folder every
>>> batch interval.
>>>
>>> Thanks,
>>> Natu
>>>
>>> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>> Has anyone monitored an S3 bucket or directory using Spark Streaming
>>> and pulled any new files to process? If so, can you provide basic
>>> Scala coding help on this?
>>>
>>> Thanks,
>>> Ben
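
To make Natu's snippet self-contained, here is a minimal sketch of the
polling approach as a complete app. The object name, batch interval, and
bucket path are placeholders, and credentials are read from environment
variables rather than hard-coded; note that because these settings
configure the s3n filesystem, the stream URI uses the s3n:// scheme:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object S3FolderMonitor {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("S3FolderMonitor")
        val ssc = new StreamingContext(sparkConf, Seconds(60)) // check S3 once a minute

        // Natu's settings, with credentials read from the environment
        // instead of hard-coded constants.
        val hadoopConf = ssc.sparkContext.hadoopConfiguration
        hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
        hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

        // Every batch interval, pick up files that appeared under the
        // prefix since the last batch.
        val lines = ssc.textFileStream("s3n://example_bucket/folder")
        lines.count().print() // replace with real processing

        ssc.start()
        ssc.awaitTermination()
      }
    }

Keep in mind that every batch issues S3 list requests against the prefix,
which is exactly the polling cost the notification approach avoids.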
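
For the notification route, a hand-rolled SQS receiver along the lines of
the spark-sqs-receiver project is fairly small. I haven't vetted that
project's API, so this sketch uses the AWS Java SDK's SQS client directly
with Spark's Receiver interface; the class name and queue URL below are
made up, and error handling is omitted:

    import scala.collection.JavaConverters._

    import com.amazonaws.services.sqs.AmazonSQSClient
    import com.amazonaws.services.sqs.model.{DeleteMessageRequest, ReceiveMessageRequest}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Hypothetical class name; pass in the URL of the queue that
    // receives the S3 event notifications.
    class SqsNotificationReceiver(queueUrl: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      override def onStart(): Unit = {
        // Poll on a background thread so onStart() returns immediately,
        // as the Receiver contract requires.
        new Thread("sqs-notification-receiver") {
          override def run(): Unit = poll()
        }.start()
      }

      override def onStop(): Unit = () // polling loop exits once isStopped() is true

      private def poll(): Unit = {
        val sqs = new AmazonSQSClient() // uses the default credential chain
        while (!isStopped()) {
          val request = new ReceiveMessageRequest(queueUrl)
            .withMaxNumberOfMessages(10)
            .withWaitTimeSeconds(20) // long polling keeps request counts down
          for (msg <- sqs.receiveMessage(request).getMessages.asScala) {
            store(msg.getBody) // hand the raw S3 event JSON to Spark
            sqs.deleteMessage(new DeleteMessageRequest(queueUrl, msg.getReceiptHandle))
          }
        }
      }
    }

Wiring it in (the queue URL is a made-up example):

    val events = ssc.receiverStream(
      new SqsNotificationReceiver("https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"))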
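
Each message body is JSON in the standard S3 event format: a "Records"
array carrying the bucket name and object key. A small helper, sketched
here with json4s (which Spark already ships), turns a body into paths the
job can read; note that S3 URL-encodes object keys in event payloads:

    import java.net.URLDecoder

    import org.json4s._
    import org.json4s.jackson.JsonMethods.parse

    object S3Events {
      // Hypothetical helper: turn one S3 event notification body into
      // "s3n://bucket/key" paths.
      def s3PathsFromEvent(body: String): Seq[String] = {
        implicit val formats = DefaultFormats
        for (record <- (parse(body) \ "Records").children) yield {
          val bucket = (record \ "s3" \ "bucket" \ "name").extract[String]
          // Decode the URL-encoded object key.
          val key = URLDecoder.decode(
            (record \ "s3" \ "object" \ "key").extract[String], "UTF-8")
          s"s3n://$bucket/$key"
        }
      }
    }

Combined with the receiver above, events.flatMap(S3Events.s3PathsFromEvent)
yields a stream of new-file paths that a foreachRDD block can load and
process.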