Hello, I've noticed instances where ListS3 and ListGCSBucket never list certain objects until after stopping the processor, clearing its state, and restarting it. They're usually large files in buckets that have frequent writes.
Based on some testing, I believe S3 and GCS set an object's last modified timestamp to the time the upload started rather than the time it completed. So any smaller object that starts and finishes uploading while a larger object is still in flight ends up with a newer last modified timestamp than the larger object. If the List processor triggers after the smaller object finishes but before the larger one does, it sees the small object, emits a flow file for it, and stores the small object's timestamp in its state. When the larger object finally finishes uploading, its timestamp is older than that stored state, so it is ignored on every subsequent execution of the List processor.

The ListAzureBlobStorage processor allows a listing strategy that tracks entities, but the ListS3 and ListGCSBucket processors do not, so they seem to rely only on last modified timestamps. I tried setting up a second ListS3 processor with a different run schedule and file aging settings, and while that helps, some objects are still getting missed.

Has anyone else run into this? Is there a feasible workaround?

Thank you,
Paul
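To make the race concrete, here is a minimal simulation of the behavior described above. It assumes (as observed) that last modified equals the upload start time, and that the processor only emits objects with a timestamp strictly newer than its stored state. The object names and run schedule are made up for illustration:

```python
# Hypothetical timeline: (name, upload_start, upload_finish) in seconds.
# Assumption per the observed behavior: last-modified == upload_start.
objects = [
    ("large.bin", 0, 100),   # starts first, finishes last
    ("small.txt", 10, 20),   # starts and finishes while large.bin uploads
]

def list_visible(now, state_ts):
    """Objects whose upload has finished by `now` and whose
    last-modified timestamp is newer than the stored state."""
    return [
        (name, start)
        for name, start, finish in objects
        if finish <= now and start > state_ts
    ]

state = -1   # processor state: newest last-modified timestamp emitted so far
emitted = []
for trigger_time in (30, 60, 120, 150):  # simulated run schedule
    for name, last_modified in list_visible(trigger_time, state):
        emitted.append(name)
        state = max(state, last_modified)

print(emitted)  # ['small.txt'] -- large.bin is never listed
```

At the 30s trigger, small.txt is emitted and the state becomes 10. When large.bin finishes at 100s, its timestamp (0) is already older than the state, so every later trigger skips it, matching the behavior I'm seeing.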