Stephan, thanks for taking care of this. We'll give it a try once 1.4 drops.
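For anyone else planning to try this, a minimal sketch of what enabling the new shaded S3 file system might look like, based on the instructions quoted below. The jar name/version and the `s3.*` key names are assumptions until 1.4 is actually out; per Stephan's note, keys dropped into the Flink configuration are forwarded to the underlying Hadoop/Presto configuration.

```yaml
# flink-conf.yaml (sketch)
# First, copy the shaded jar from opt/ into lib/, e.g.:
#   cp opt/flink-s3-fs-presto-1.4.0.jar lib/
# (exact jar name/version is an assumption here)

# Credentials placed in the Flink configuration are forwarded
# to the underlying Hadoop/Presto configuration:
s3.access-key: YOUR_ACCESS_KEY
s3.secret-key: YOUR_SECRET_KEY

# Checkpoints can then target the "s3://" scheme directly:
state.backend: rocksdb
state.backend.fs.checkpointdir: s3://my-bucket/flink/checkpoints
```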
On Sat, Oct 14, 2017 at 1:25 PM, Stephan Ewen <se...@apache.org> wrote:

> Some updates on this:
>
> Aside from reworking how the S3 directory handling is done, we also looked
> into supporting S3 differently than we currently do. Support currently goes
> strictly through Hadoop's S3 file systems, which we need to change, because
> we want it to be possible to use Flink without Hadoop dependencies.
>
> In the next release, we will have S3 file systems without a Hadoop
> dependency:
>
> - One implementation wraps and shades a newer version of s3a, for
> compatibility with the current behavior.
>
> - The second is interesting for this directory problem: it uses Presto's
> S3 support, which differs a bit from Hadoop's s3n and s3a. It does not
> create empty directory marker files, so it does not try to make S3 look
> as much like a file system as s3a and s3n do. That is actually an
> advantage for checkpointing: with that implementation, the issue
> mentioned here should not exist.
>
> Caveat: the new file systems and their aggressive shading still need to
> be tested at scale, but we are happy to take any feedback on this.
>
> Merged as of https://github.com/apache/flink/commit/
> 991af3652479f85f732cbbade46bed7df1c5d819
>
> You can use them by simply dropping the respective JARs from "/opt" into
> "/lib" and using the file system scheme "s3://".
> The configuration is as in Hadoop/Presto, but you can put the config keys
> into the Flink configuration - they will be forwarded to the Hadoop
> configuration.
>
> Hope that this makes S3 use a lot easier and more fun...
>
>
> On Wed, Sep 20, 2017 at 2:49 PM, Stefan Richter <
> s.rich...@data-artisans.com> wrote:
>
>> Hi,
>>
>> We recently removed some cleanup code because it involved checking store
>> metadata to decide when we can delete a directory. For certain stores
>> (like S3), requesting this metadata on every file delete was so
>> expensive that it could bring down the job, because removing state could
>> not be processed fast enough. We have a temporary fix in place now, so
>> that jobs at large scale can still run reliably on stores like S3.
>> Currently, this comes at the cost of not cleaning up directories, but we
>> are planning to introduce a different mechanism for directory cleanup in
>> the future, one that is not as fine-grained as a metadata query per file
>> delete. In the meantime, unfortunately, the best way is to clean up
>> empty directories with some external tool.
>>
>> Best,
>> Stefan
>>
>> On Sep 20, 2017, at 01:23, Hao Sun <ha...@zendesk.com> wrote:
>>
>> Thanks Elias! It seems there is no better answer than "do not care
>> about them for now", or delete them with a background job.
>>
>> On Tue, Sep 19, 2017 at 4:11 PM Elias Levy <fearsome.lucid...@gmail.com>
>> wrote:
>>
>>> There are a couple of related JIRAs:
>>>
>>> https://issues.apache.org/jira/browse/FLINK-7587
>>> https://issues.apache.org/jira/browse/FLINK-7266
>>>
>>>
>>> On Tue, Sep 19, 2017 at 12:20 PM, Hao Sun <ha...@zendesk.com> wrote:
>>>
>>>> Hi, I am using RocksDB and S3 as the storage backend for my
>>>> checkpoints. Can Flink delete these empty directories automatically,
>>>> or do I need a background job to do the deletion?
>>>>
>>>> I know this has been discussed before, but I could not get a concrete
>>>> answer yet. Thanks.
>>>>
>>>> <image.png>
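As an addendum: the "external tool" workaround mentioned above could be sketched roughly as below. This is illustrative only, not Flink code: it assumes the zero-byte `key/` and `_$folder$` marker conventions that s3a/s3n use for empty directories, and a real tool would feed it from an actual S3 listing and then issue deletes for the returned keys.

```python
def find_empty_dir_markers(objects):
    """Return directory-marker keys whose 'directory' holds no real objects.

    `objects` is a list of (key, size) tuples from an S3 bucket listing.
    A marker is either a zero-byte key ending in "/" (s3a style) or a
    key ending in "_$folder$" (s3n style). A marker is considered empty
    when no non-marker key lives under its prefix.
    """
    # Keys that represent actual data, i.e. everything that is not a marker.
    real_keys = [k for k, size in objects
                 if not (k.endswith("/") and size == 0)
                 and not k.endswith("_$folder$")]

    markers = []
    for key, size in objects:
        if key.endswith("/") and size == 0:
            prefix = key
        elif key.endswith("_$folder$"):
            # "path/dir_$folder$" marks the directory "path/dir/".
            prefix = key[: -len("_$folder$")] + "/"
        else:
            continue  # a real object, not a marker
        # The directory is empty if no real object sits under its prefix.
        if not any(r.startswith(prefix) for r in real_keys):
            markers.append(key)
    return markers
```

A cleanup job would run this over the checkpoint bucket's listing and delete the returned keys; nothing here touches live checkpoint data, since only marker keys with no objects beneath them are reported.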