This is very helpful, Boris.
I will need to re-architect a piece of my code to work with this service,
but I see it as more maintainable and stable long term.
I will be developing it out over the course of a few weeks, so I will let
you know how it goes.


On Tue, Mar 16, 2021, 2:05 AM Boris Litvak <boris.lit...@skf.com> wrote:

> P.S.: 3. If fast updates are required, one way would be to capture S3
> events and put the paths, modification dates, etc. into DynamoDB/your DB
> of choice.
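> A minimal sketch of that idea (the event shape is the standard S3
> notification format; the table name and attribute layout are assumptions,
> not from this thread):

```python
from urllib.parse import unquote_plus

def s3_event_to_items(event):
    """Convert an S3 event notification payload into DynamoDB-style items.

    Attribute names (pk, bucket, size, modified) are hypothetical; pick
    whatever key schema suits your queries.
    """
    items = []
    for rec in event.get("Records", []):
        obj = rec["s3"]["object"]
        items.append({
            "pk": unquote_plus(obj["key"]),  # object key as partition key
            "bucket": rec["s3"]["bucket"]["name"],
            "size": obj.get("size", 0),
            "modified": rec.get("eventTime", ""),
        })
    return items

# In a real Lambda handler you would then write each item, e.g. with
# boto3.resource("dynamodb").Table("s3-paths").put_item(Item=item).
```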
>
>
>
> *From:* Boris Litvak
> *Sent:* Tuesday, 16 March 2021 9:03
> *To:* Ben Kaylor <kaylor...@gmail.com>; Alchemist <
> alchemistsrivast...@gmail.com>
> *Cc:* User <user@spark.apache.org>
> *Subject:* RE: How to make bucket listing faster while using S3 with
> wholeTextFile
>
>
>
> Ben, I’d explore these approaches:
>
>    1. To address your problem, I’d set up an inventory for the S3 bucket:
>    
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html.
>    Then you can list the files from the inventory. I have not tried this
>    myself. Note that the inventory update is done once per day, at most,
>    and it’s eventually consistent.
>    2. If possible, I would try to make bigger files. One can’t do many
>    things, such as streaming from scratch, when you have millions of files.
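> To illustrate idea 1, a sketch of turning inventory rows into paths for
> Spark (assumes the default CSV field order bucket, key, size,
> last_modified_date; adjust to the fields configured on your inventory):

```python
import csv
import io

def inventory_to_paths(inventory_csv, wanted_prefix=""):
    """Turn S3 Inventory CSV rows into s3a:// paths for Spark.

    inventory_csv is the decompressed CSV text of one inventory data file;
    only the first two columns (bucket, key) are used here.
    """
    paths = []
    for bucket, key, *_ in csv.reader(io.StringIO(inventory_csv)):
        if key.startswith(wanted_prefix):
            paths.append(f"s3a://{bucket}/{key}")
    return paths

# Spark can then open the exact files with no LIST calls at all, e.g.:
# sc.wholeTextFiles(",".join(inventory_to_paths(rows, "logs/")))
```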
>
>
>
> Please tell us if it helps & how it goes.
>
>
>
> Boris
>
>
>
> *From:* Ben Kaylor <kaylor...@gmail.com>
> *Sent:* Monday, 15 March 2021 21:10
> *To:* Alchemist <alchemistsrivast...@gmail.com>
> *Cc:* User <user@spark.apache.org>
> *Subject:* Re: How to make bucket listing faster while using S3 with
> wholeTextFile
>
>
>
> Not sure of the answer to this, but I am solving similar issues, so I am
> looking for additional feedback on how to do this.
>
>
>
> My thought: if this cannot be done via Spark and S3 boto commands, then
> have the apps self-report those changes. Instead of having just mappers
> discover the keys, you have services self-report to a metadata service
> that a new key has been created or modified, enabling incremental and
> more realtime updates.
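> A toy sketch of that self-reporting idea (an in-memory stand-in for the
> metadata service; in production this would be backed by DynamoDB or
> another DB, and the method names are hypothetical):

```python
import threading

class KeyMetadataRegistry:
    """In-memory registry of S3 keys and their last-modified timestamps."""

    def __init__(self):
        self._lock = threading.Lock()
        self._keys = {}  # key -> last-modified timestamp

    def report(self, key, modified_at):
        # Called by the producing service right after it writes the object.
        with self._lock:
            self._keys[key] = modified_at

    def changed_since(self, ts):
        # Incremental query: keys modified at or after ts, so consumers
        # never have to LIST the bucket.
        with self._lock:
            return sorted(k for k, m in self._keys.items() if m >= ts)
```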
>
>
>
> Would like to hear more ideas on this, thanks
>
> David
>
>
>
>
>
>
>
> On Mon, Mar 15, 2021, 11:31 AM Alchemist <alchemistsrivast...@gmail.com>
> wrote:
>
> *How to optimize S3 file listing when using wholeTextFile()*: We are using
> wholeTextFile to read data from S3. As per my understanding, wholeTextFile
> first lists the files under the given path. Since we are using S3 as the
> input source, listing the files in a bucket is single-threaded, and the S3
> API for listing keys only returns them in chunks of 1,000 per call. Since
> we have millions of files, we are making thousands of API calls, and this
> listing makes our processing very slow. How can we make the S3 listing
> faster?
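> One common workaround for the 1,000-keys-per-call limit is to fan the
> listing out across key prefixes in parallel. A sketch (the listing
> function is injected so the example runs without AWS access; the bucket
> name and prefixes below are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def list_keys_parallel(prefixes, list_prefix, max_workers=16):
    """List many prefixes concurrently and flatten the results.

    list_prefix(prefix) must return all keys under that prefix; with boto3
    it would wrap get_paginator("list_objects_v2").
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(list_prefix, prefixes))
    return [key for keys in results for key in keys]

# With boto3, list_prefix could look like:
# def list_prefix(prefix):
#     s3 = boto3.client("s3")
#     pages = s3.get_paginator("list_objects_v2").paginate(
#         Bucket="my-bucket", Prefix=prefix)
#     return [o["Key"] for p in pages for o in p.get("Contents", [])]
```

This only helps when the keys are spread across known prefixes (dates,
shards, etc.); a single flat prefix still serializes on one paginator.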
>
>
>
> Thanks,
>
>
>
> Rachana
>
>
