Remote File change detection in S3 when spark queries are running and parquet files in S3 changes

Raghvendra Yadav Wed, 22 May 2024 12:37:15 -0700

Hello,

We are hoping someone can help us understand Spark's behavior in the
scenarios listed below.


*Q1. Will running Spark queries fail when an S3 Parquet object changes
underneath them with S3A remote file change detection enabled? Is the
detection 100% reliable?*
Our understanding is that S3A has a feature for remote file change
detection using the ETag, implemented in the S3AInputStream class.
This feature caches the ETag per S3AInputStream instance and uses it to
detect file changes even if the stream is reopened. When a Spark query
reads the file through FSDataInputStream, will it reliably detect changes
to the file on S3?
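
For reference, the settings we have in mind are the S3A change detection
options from the Hadoop S3A documentation, passed to Spark with the
"spark.hadoop." prefix. The snippet below is only a sketch of how we would
express them. The helper names (to_spark_conf, to_submit_args) are
hypothetical, and the values are illustrative, not a statement of what any
particular cluster runs with.

```python
# Sketch only: S3A change-detection options as documented for Hadoop S3A.
# The values chosen here are illustrative assumptions.
S3A_CHANGE_DETECTION = {
    "fs.s3a.change.detection.source": "etag",    # or "versionid"
    "fs.s3a.change.detection.mode": "server",    # server | client | warn | none
    "fs.s3a.change.detection.version.required": "true",
}


def to_spark_conf(hadoop_options):
    """Prefix Hadoop options so Spark copies them into the Hadoop Configuration."""
    return {f"spark.hadoop.{key}": value for key, value in hadoop_options.items()}


def to_submit_args(spark_conf):
    """Render the options as spark-submit --conf arguments (hypothetical helper)."""
    args = []
    for key, value in sorted(spark_conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


spark_conf = to_spark_conf(S3A_CHANGE_DETECTION)
submit_args = to_submit_args(spark_conf)
```

The same key/value pairs could equally be set on a SparkSession builder or
in spark-defaults.conf.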

*Q2. Does Spark use a single S3AInputStream instance per Parquet file, or
can it open multiple S3AInputStream instances for the same file in some
queries?*
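
To make the concern behind Q2 concrete, here is a toy Python model of our
understanding. It is not the S3A implementation: ToyObjectStore,
ToyInputStream and RemoteFileChangedError are hypothetical names, and the
model assumes each stream pins the ETag it saw when it was opened. Under
that assumption, a stream opened before an overwrite detects the change,
while a second stream opened after the overwrite reads the new object
without complaint.

```python
class RemoteFileChangedError(Exception):
    """Raised by the toy stream when the object's ETag no longer matches."""


class ToyObjectStore:
    """In-memory stand-in for S3: key -> (etag, data)."""

    def __init__(self):
        self._objects = {}
        self._counter = 0

    def put(self, key, data):
        # Every write produces a new ETag, as an overwrite would.
        self._counter += 1
        self._objects[key] = (f"etag-{self._counter}", data)

    def get(self, key):
        return self._objects[key]


class ToyInputStream:
    """Pins the ETag seen at open time and re-checks it on each read."""

    def __init__(self, store, key):
        self._store = store
        self._key = key
        self._etag, _ = store.get(key)

    def read(self):
        etag, data = self._store.get(self._key)  # models a reopen / new GET
        if etag != self._etag:
            raise RemoteFileChangedError(f"{self._key}: {self._etag} -> {etag}")
        return data


def try_read(stream):
    """Return the data, or the string 'changed' if the toy check fires."""
    try:
        return stream.read()
    except RemoteFileChangedError:
        return "changed"


store = ToyObjectStore()
store.put("part-0.parquet", b"v1")
first = ToyInputStream(store, "part-0.parquet")    # opened before overwrite
store.put("part-0.parquet", b"v2")                 # overwrite mid-query
second = ToyInputStream(store, "part-0.parquet")   # opened after overwrite

results = (try_read(first), try_read(second))
```

If Spark can open more than one stream for the same file within a query,
this is the mixed-version case we are worried about.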



Thanks
Raghav
