This is awesome! I have someplace to start from.

Thanks,
Ben
> On Apr 9, 2016, at 9:45 AM, programminggee...@gmail.com wrote:
>
> Someone please correct me if I am wrong, as I am still rather green to
> Spark; however, it appears that through the S3 notification mechanism
> described below, you can publish events to SQS and use SQS as a
> streaming source in Spark. The project at
> https://github.com/imapi/spark-sqs-receiver appears to provide a
> library for doing this.
>
> Hope this helps.
>
> On Apr 9, 2016, at 9:55 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Nezih,
>>
>> This looks like a good alternative to having the Spark Streaming job
>> check for new files on its own. Do you know if there is a way to have
>> the Spark Streaming job get notified with the new file information and
>> act upon it? That would reduce the overhead and cost of polling S3.
>> Plus, I could use the same notifications to kick off Lambda to process
>> new data files and make them ready for Spark Streaming to consume. I
>> would just need to configure notifications on all incoming folders for
>> Lambda and on all outgoing folders for Spark Streaming. This sounds
>> like a better setup than what we have now.
>>
>> Thanks,
>> Ben
>>
>>> On Apr 9, 2016, at 12:25 AM, Nezih Yigitbasi <nyigitb...@netflix.com> wrote:
>>>
>>> While it is doable in Spark, S3 also supports notifications:
>>> http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
>>>
>>> On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande <nlaucha...@gmail.com> wrote:
>>>
>>> Hi Benjamin,
>>>
>>> I have done it. The critical configuration items are the ones below:
>>>
>>>     ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
>>>       "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>>>     ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
>>>       AccessKeyId)
>>>     ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
>>>       AWSSecretAccessKey)
>>>
>>>     val inputS3Stream = ssc.textFileStream("s3n://example_bucket/folder")
>>>
>>> This code will probe for new S3 files created under the folder every
>>> batch interval.
>>>
>>> Thanks,
>>> Natu
>>>
>>> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>> Has anyone monitored an S3 bucket or directory using Spark Streaming
>>> and pulled any new files to process? If so, can you provide basic
>>> Scala coding help on this?
>>>
>>> Thanks,
>>> Ben
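
To make Natu's snippet self-contained, here is a minimal sketch of the
polling approach as a complete app. The object name, batch interval, and
bucket path are placeholders, and credentials are read from environment
variables rather than hard-coded; note that because these settings
configure the s3n filesystem, the stream URI uses the s3n:// scheme:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object S3FolderMonitor {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("S3FolderMonitor")
        val ssc = new StreamingContext(sparkConf, Seconds(60)) // check S3 once a minute

        // Natu's settings, with credentials read from the environment
        // instead of hard-coded constants.
        val hadoopConf = ssc.sparkContext.hadoopConfiguration
        hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
        hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

        // Every batch interval, pick up files that appeared under the
        // prefix since the last batch.
        val lines = ssc.textFileStream("s3n://example_bucket/folder")
        lines.count().print() // replace with real processing

        ssc.start()
        ssc.awaitTermination()
      }
    }

Keep in mind that every batch issues S3 list requests against the prefix,
which is exactly the polling cost the notification approach avoids.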
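
For the notification route, a hand-rolled SQS receiver along the lines of
the spark-sqs-receiver project is fairly small. I haven't vetted that
project's API, so this sketch uses the AWS Java SDK's SQS client directly
with Spark's Receiver interface; the class name and queue URL below are
made up, and error handling is omitted:

    import scala.collection.JavaConverters._

    import com.amazonaws.services.sqs.AmazonSQSClient
    import com.amazonaws.services.sqs.model.{DeleteMessageRequest, ReceiveMessageRequest}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Hypothetical class name; pass in the URL of the queue that
    // receives the S3 event notifications.
    class SqsNotificationReceiver(queueUrl: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      override def onStart(): Unit = {
        // Poll on a background thread so onStart() returns immediately,
        // as the Receiver contract requires.
        new Thread("sqs-notification-receiver") {
          override def run(): Unit = poll()
        }.start()
      }

      override def onStop(): Unit = () // polling loop exits once isStopped() is true

      private def poll(): Unit = {
        val sqs = new AmazonSQSClient() // uses the default credential chain
        while (!isStopped()) {
          val request = new ReceiveMessageRequest(queueUrl)
            .withMaxNumberOfMessages(10)
            .withWaitTimeSeconds(20) // long polling keeps request counts down
          for (msg <- sqs.receiveMessage(request).getMessages.asScala) {
            store(msg.getBody) // hand the raw S3 event JSON to Spark
            sqs.deleteMessage(new DeleteMessageRequest(queueUrl, msg.getReceiptHandle))
          }
        }
      }
    }

Wiring it in (the queue URL is a made-up example):

    val events = ssc.receiverStream(
      new SqsNotificationReceiver("https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"))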
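
Each message body is JSON in the standard S3 event format: a "Records"
array carrying the bucket name and object key. A small helper, sketched
here with json4s (which Spark already ships), turns a body into paths the
job can read; note that S3 URL-encodes object keys in event payloads:

    import java.net.URLDecoder

    import org.json4s._
    import org.json4s.jackson.JsonMethods.parse

    object S3Events {
      // Hypothetical helper: turn one S3 event notification body into
      // "s3n://bucket/key" paths.
      def s3PathsFromEvent(body: String): Seq[String] = {
        implicit val formats = DefaultFormats
        for (record <- (parse(body) \ "Records").children) yield {
          val bucket = (record \ "s3" \ "bucket" \ "name").extract[String]
          // Decode the URL-encoded object key.
          val key = URLDecoder.decode(
            (record \ "s3" \ "object" \ "key").extract[String], "UTF-8")
          s"s3n://$bucket/$key"
        }
      }
    }

Combined with the receiver above, events.flatMap(S3Events.s3PathsFromEvent)
yields a stream of new-file paths that a foreachRDD block can load and
process.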