I'm writing a Spark Streaming application where the input data arrives in
an S3 bucket in small batches (written by AWS Database Migration Service,
DMS). The Spark application is the only consumer. I'm considering two
possible architectures:

1. Have Spark Streaming watch an S3 prefix and pick up new objects as they
   come in (a rough sketch of this follows the list).
2. Stream data from S3 to a Kinesis stream (through a Lambda function
   triggered as new S3 objects are created by DMS) and use the stream as
   input for the Spark application.
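
For concreteness, option 1 would look roughly like the sketch below (the
bucket/prefix name and the batch interval are placeholders I made up):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DmsS3Consumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dms-s3-consumer")
    val ssc  = new StreamingContext(conf, Seconds(60)) // arbitrary interval

    // Monitor the prefix DMS writes to; each batch picks up objects that
    // appeared under this path since the previous batch.
    val lines = ssc.textFileStream("s3a://my-dms-bucket/dms-output/")

    lines.foreachRDD { rdd =>
      // process the micro-batch here
      println(s"new records this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```
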
While the second solution will work, the first solution is simpler. But are
there any pitfalls? Looking at this guide, I'm concerned about two specific
points:

> *The more files under a directory, the longer it will take to scan for
> changes — even if no files have been modified.*

We will be keeping the S3 data indefinitely, so the number of objects under
the prefix being monitored is going to grow very quickly.
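
As far as I can tell, a path filter doesn't help here either: fileStream
lets me pass one, but the filter seems to be applied to the results of the
directory listing, so every batch would still enumerate the whole prefix.
Something like this (hypothetical filter, same placeholder bucket as above):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("dms-s3-consumer"), Seconds(60))

// The filter runs against paths returned by the directory listing, so the
// listing itself -- whose cost grows with the total object count -- still
// happens on every batch.
val records = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "s3a://my-dms-bucket/dms-output/",              // placeholder prefix
  (path: Path) => path.getName.endsWith(".csv"),  // hypothetical: CSV output
  newFilesOnly = true
).map { case (_, text) => text.toString }
```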

> *“Full” Filesystems such as HDFS tend to set the modification time on
> their files as soon as the output stream is created. When a file is
> opened, even before data has been completely written, it may be included
> in the DStream - after which updates to the file within the same window
> will be ignored. That is: changes may be missed, and data omitted from
> the stream.*

I'm not sure this applies to S3, since to my understanding objects are
created atomically and cannot be updated afterwards the way ordinary files
can (unless deleted and recreated, which I don't believe DMS does).

Thanks for any help!
