[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable
[ https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843753#comment-16843753 ]

Gabor Bota commented on HADOOP-14468:
-------------------------------------

Resolved this with HADOOP-15999: fix for OOB operations.

> S3Guard: make short-circuit getFileStatus() configurable
> --------------------------------------------------------
>
>                 Key: HADOOP-14468
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14468
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.0.0-beta1
>            Reporter: Aaron Fabbri
>            Assignee: Gabor Bota
>            Priority: Minor
>
> Currently, when S3Guard is enabled, getFileStatus() will skip S3 if it gets a
> result from the MetadataStore (e.g. dynamodb) first.
> I would like to add a new parameter
> {{fs.s3a.metadatastore.getfilestatus.authoritative}} which, when true, keeps
> the current behavior. When false, S3AFileSystem will check both S3 and the
> MetadataStore.
> I'm not sure yet if we want to have this behavior the same for all callers of
> getFileStatus(), or if we only want to check both S3 and MetadataStore for
> some internal callers such as open().

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
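The behavior proposed in the issue description can be sketched roughly as follows. This is a minimal illustration, not the S3AFileSystem code: the class ShortCircuitSketch, its Lookup result type, and the in-memory maps standing in for the MetadataStore and S3 are all hypothetical, and the boolean field only models the proposed fs.s3a.metadatastore.getfilestatus.authoritative setting. What the caller should do when the two sources disagree is left open, as it is in the issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: in-memory maps stand in for the MetadataStore and S3. */
final class ShortCircuitSketch {
    /** Result of one lookup; a null length means "not found in that source". */
    static final class Lookup {
        final Long metadataStoreLength;
        final Long s3Length;
        final boolean s3Checked;

        Lookup(Long metadataStoreLength, Long s3Length, boolean s3Checked) {
            this.metadataStoreLength = metadataStoreLength;
            this.s3Length = s3Length;
            this.s3Checked = s3Checked;
        }
    }

    private final Map<String, Long> metadataStore = new HashMap<>();
    private final Map<String, Long> s3 = new HashMap<>();
    // Models the proposed fs.s3a.metadatastore.getfilestatus.authoritative flag.
    private final boolean authoritative;

    ShortCircuitSketch(boolean authoritative) {
        this.authoritative = authoritative;
    }

    void putInMetadataStore(String path, long length) {
        metadataStore.put(path, length);
    }

    void putInS3(String path, long length) {
        s3.put(path, length);
    }

    Lookup getFileStatus(String path) {
        Long fromStore = metadataStore.get(path);
        if (authoritative && fromStore != null) {
            // Current behavior: a MetadataStore hit skips S3 entirely.
            return new Lookup(fromStore, null, false);
        }
        // Flag is false, or the MetadataStore had no entry: consult S3 as well.
        return new Lookup(fromStore, s3.get(path), true);
    }
}
```

With the flag set to true and an entry in the MetadataStore, the lookup never touches the S3 map; with it set to false, both lengths come back and the caller can compare them.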
[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable
[ https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841077#comment-16841077 ]

Gabor Bota commented on HADOOP-14468:
-------------------------------------

This is fixed with HADOOP-15999. I will resolve this issue soon.
[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable
[ https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353020#comment-16353020 ]

Steve Loughran commented on HADOOP-14468:
-----------------------------------------

One thing related to this is whether we should have a TTL on tombstone markers. Even in non-auth mode, when we reconcile the listings, files recorded as deleted are omitted. If someone can create that file via another client, will it ever be seen in listings?
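To make the tombstone TTL question concrete, here is a small sketch of listing reconciliation in which a tombstone only hides a path until it expires. Everything in it is an assumption for illustration: the class TombstoneTtlSketch, its method names, and the millisecond TTL are hypothetical and are not taken from the S3Guard code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: tombstones are kept as path -> deletion time in millis. */
final class TombstoneTtlSketch {
    private final Map<String, Long> tombstones = new HashMap<>();
    private final long ttlMillis; // hypothetical setting, not a real S3A option

    TombstoneTtlSketch(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Record that this client deleted the path at the given time. */
    void recordDelete(String path, long nowMillis) {
        tombstones.put(path, nowMillis);
    }

    /** Filter an S3 listing, omitting paths hidden by an unexpired tombstone. */
    List<String> reconcile(List<String> s3Listing, long nowMillis) {
        List<String> visible = new ArrayList<>();
        for (String path : s3Listing) {
            Long deletedAt = tombstones.get(path);
            boolean hidden = deletedAt != null && nowMillis - deletedAt < ttlMillis;
            if (!hidden) {
                visible.add(path);
            }
        }
        return visible;
    }
}
```

Without a TTL (an effectively infinite ttlMillis), a file re-created by another client stays hidden from listings for as long as the tombstone exists, which is the case the comment asks about.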
[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable
[ https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243832#comment-16243832 ]

Steve Loughran commented on HADOOP-14468:
-----------------------------------------

FWIW not doing the unshort-circuited check will save $0.004 $0.01.3 c/open() call in the case the file is missing; $0.004 if the file is actually there
[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable
[ https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16077499#comment-16077499 ]

Aaron Fabbri commented on HADOOP-14468:
---------------------------------------

{quote}
That said, looking at all the places we call getFileStatus, it'd be a useful little sanity check all round.
{quote}

Yeah. It would be interesting to collect statistics on long-running clusters on how often inconsistency happens.

Sounds like we're ok with the behavior of failing after open(). Your example of deleted file or inconsistency causing similar behavior is a good point.

I'll leave this as minor priority for now and focus on HADOOP-14467 first.
[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable
[ https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16076845#comment-16076845 ]

Steve Loughran commented on HADOOP-14468:
-----------------------------------------

That said, looking at all the places we call getFileStatus, it'd be a useful little sanity check all round.
[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable
[ https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16076817#comment-16076817 ]

Steve Loughran commented on HADOOP-14468:
-----------------------------------------

The general code path for file IO is one of two things.

Sequential from the start (gzip unzip, CSV, text, avro):
{code}
instream = fs.open(path)
instream.read()
{code}

Or seek and read, explicit or in a readFully() call. This is the codepath for .snappy files and for examining columnar stored data in ORC, Parquet, ...
{code}
instream = fs.open(path)
instream.seek(somewhere)
instream.read(bytes)
instream.seek(somewhere-else)
...
{code}

Either way, there's usually a read() call very shortly after the open, which is when any missing file will surface, so we don't need to overreact. We just need to make sure that the error message which surfaces on a 404 on the first read after opening a file is propagated up in a way which is meaningful and consistent with what people normally expect. HADOOP-14467 looks at that.

Doing it as a fallback for troubleshooting/monitoring is something to consider though.
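The point that a missing file only surfaces at the first read() can be illustrated with a self-contained simulation. This sketch does not use the Hadoop FileSystem API: LazyOpenSketch and its LazyStream are hypothetical stand-ins, with an in-memory map playing the role of the object store, and open() deliberately does no existence check, modeling the short-circuited path discussed in this issue.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: an in-memory map stands in for the object store. */
final class LazyOpenSketch {
    private final Map<String, byte[]> objects = new HashMap<>();

    void put(String path, byte[] data) {
        objects.put(path, data);
    }

    /** Does not touch the store, so it succeeds even for a missing path. */
    LazyStream open(String path) {
        return new LazyStream(path);
    }

    final class LazyStream {
        private final String path;
        private int pos = 0;

        LazyStream(String path) {
            this.path = path;
        }

        void seek(int newPos) {
            if (newPos < 0) {
                throw new IllegalArgumentException("negative seek: " + newPos);
            }
            pos = newPos;
        }

        /** First contact with the store: this is where a missing file surfaces. */
        int read() throws IOException {
            byte[] data = objects.get(path);
            if (data == null) {
                throw new FileNotFoundException("No such file: " + path);
            }
            return pos < data.length ? (data[pos++] & 0xff) : -1;
        }
    }
}
```

Both access patterns above hit read() almost immediately after open(), so in this model the FileNotFoundException arrives one call later than it would with a fail-fast open().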
[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable
[ https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035211#comment-16035211 ]

Aaron Fabbri commented on HADOOP-14468:
---------------------------------------

I created this JIRA to follow up on [your comment|https://issues.apache.org/jira/browse/HADOOP-13345?focusedCommentId=16019741=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16019741] and the discussion about failing fast when a file is not visible in S3 in the read path.

I'm not 100% convinced we want this but it could be useful for:
1. Failing fast on open() instead of when we later read the stream.
2. A "safe mode" or fallback that can be enabled. When this is set to false, we could collect stats on any time MetadataStore differs from S3, which would be interesting. I.e. "s3 / metastore length differs" or "visible in metastore but not s3".

In general we do not support a mixed mode where some clients use S3Guard and others do not: it is not safe. However, if there is a well-known path where only an external process (e.g. ETL) is dropping files for ingest, it may be nice to be able to support that narrower case. I think the existing behavior with list checking S3 + MetadataStore is sufficient without this change though.
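The stats collection mentioned in item 2 could look something like the sketch below. It is only an illustration under stated assumptions: InconsistencyStatsSketch, its Kind enum, and the observe() method are hypothetical names, and a null S3 length is used to mean that S3 returned no object for a path the MetadataStore knows about.

```java
import java.util.EnumMap;
import java.util.Map;

/** Illustrative only: counts how MetadataStore entries compare with S3. */
final class InconsistencyStatsSketch {
    enum Kind { CONSISTENT, LENGTH_DIFFERS, IN_METASTORE_NOT_S3 }

    private final Map<Kind, Integer> counts = new EnumMap<>(Kind.class);

    /** Classify one path; s3Length == null means S3 had no such object. */
    static Kind classify(long metastoreLength, Long s3Length) {
        if (s3Length == null) {
            return Kind.IN_METASTORE_NOT_S3; // "visible in metastore but not s3"
        }
        return s3Length == metastoreLength
            ? Kind.CONSISTENT
            : Kind.LENGTH_DIFFERS;           // "s3 / metastore length differs"
    }

    /** Classify one comparison and add it to the running totals. */
    Kind observe(long metastoreLength, Long s3Length) {
        Kind kind = classify(metastoreLength, s3Length);
        counts.merge(kind, 1, Integer::sum);
        return kind;
    }

    int count(Kind kind) {
        return counts.getOrDefault(kind, 0);
    }
}
```

On a long-running cluster, counters like these would show how often each kind of inconsistency actually occurs.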
[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable
[ https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035064#comment-16035064 ]

Steve Loughran commented on HADOOP-14468:
-----------------------------------------

What's the reason for this? To pick up changes to files which aren't going to s3guard even when auth=true?