[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable

2019-05-20 Thread Gabor Bota (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843753#comment-16843753
 ] 

Gabor Bota commented on HADOOP-14468:
-

Resolved this with HADOOP-15999 : fix for OOB operations.

> S3Guard: make short-circuit getFileStatus() configurable
> 
>
> Key: HADOOP-14468
> URL: https://issues.apache.org/jira/browse/HADOOP-14468
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 3.0.0-beta1
>Reporter: Aaron Fabbri
>Assignee: Gabor Bota
>Priority: Minor
>
> Currently, when S3Guard is enabled, getFileStatus() will skip S3 if it gets a 
> result from the MetadataStore (e.g. dynamodb) first.
> I would like to add a new parameter 
> {{fs.s3a.metadatastore.getfilestatus.authoritative}} which, when true, keeps 
> the current behavior.  When false, S3AFileSystem will check both S3 and the 
> MetadataStore.
> I'm not sure yet if we want to have this behavior the same for all callers of 
> getFileStatus(), or if we only want to check both S3 and MetadataStore for 
> some internal callers such as open().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable

2019-05-16 Thread Gabor Bota (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841077#comment-16841077
 ] 

Gabor Bota commented on HADOOP-14468:
-

This is fixed with HADOOP-15999. I will resolve this issue soon.

> S3Guard: make short-circuit getFileStatus() configurable
> 
>
> Key: HADOOP-14468
> URL: https://issues.apache.org/jira/browse/HADOOP-14468
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 3.0.0-beta1
>Reporter: Aaron Fabbri
>Assignee: Gabor Bota
>Priority: Minor
>
> Currently, when S3Guard is enabled, getFileStatus() will skip S3 if it gets a 
> result from the MetadataStore (e.g. dynamodb) first.
> I would like to add a new parameter 
> {{fs.s3a.metadatastore.getfilestatus.authoritative}} which, when true, keeps 
> the current behavior.  When false, S3AFileSystem will check both S3 and the 
> MetadataStore.
> I'm not sure yet if we want to have this behavior the same for all callers of 
> getFileStatus(), or if we only want to check both S3 and MetadataStore for 
> some internal callers such as open().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable

2018-02-05 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353020#comment-16353020
 ] 

Steve Loughran commented on HADOOP-14468:
-

One thing related to this is whether we should have a TTL on tombstone markers.

Even in non-auth mode, when we reconcile the listings, files recorded as 
deleted are omitted. If someone can create that file via another client, will 
it ever be seen in listings?

> S3Guard: make short-circuit getFileStatus() configurable
> 
>
> Key: HADOOP-14468
> URL: https://issues.apache.org/jira/browse/HADOOP-14468
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 3.0.0-beta1
>Reporter: Aaron Fabbri
>Assignee: Aaron Fabbri
>Priority: Minor
>
> Currently, when S3Guard is enabled, getFileStatus() will skip S3 if it gets a 
> result from the MetadataStore (e.g. dynamodb) first.
> I would like to add a new parameter 
> {{fs.s3a.metadatastore.getfilestatus.authoritative}} which, when true, keeps 
> the current behavior.  When false, S3AFileSystem will check both S3 and the 
> MetadataStore.
> I'm not sure yet if we want to have this behavior the same for all callers of 
> getFileStatus(), or if we only want to check both S3 and MetadataStore for 
> some internal callers such as open().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable

2017-11-08 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243832#comment-16243832
 ] 

Steve Loughran commented on HADOOP-14468:
-

FWIW not doing the unshort-circuited check will save $0.004 $0.01.3 c/open() 
call in the case the file is missing; $0.004 if the file is actually there

> S3Guard: make short-circuit getFileStatus() configurable
> 
>
> Key: HADOOP-14468
> URL: https://issues.apache.org/jira/browse/HADOOP-14468
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 3.0.0-beta1
>Reporter: Aaron Fabbri
>Assignee: Aaron Fabbri
>Priority: Minor
>
> Currently, when S3Guard is enabled, getFileStatus() will skip S3 if it gets a 
> result from the MetadataStore (e.g. dynamodb) first.
> I would like to add a new parameter 
> {{fs.s3a.metadatastore.getfilestatus.authoritative}} which, when true, keeps 
> the current behavior.  When false, S3AFileSystem will check both S3 and the 
> MetadataStore.
> I'm not sure yet if we want to have this behavior the same for all callers of 
> getFileStatus(), or if we only want to check both S3 and MetadataStore for 
> some internal callers such as open().



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable

2017-07-06 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16077499#comment-16077499
 ] 

Aaron Fabbri commented on HADOOP-14468:
---

{quote}
That said, looking at all the places we call getFileStatus, it'd be a useful 
little sanity check all round.
{quote}
Yeah. It would be interesting to collect statistics on long-running clusters on 
how often inconsistency happens.

Sounds like we're ok with the behavior of failing after open().  Your example 
of deleted file or inconsistency causing similar behavior is a good point.  
I'll leave this as minor priority for now and focus on HADOOP-14467 first.

> S3Guard: make short-circuit getFileStatus() configurable
> 
>
> Key: HADOOP-14468
> URL: https://issues.apache.org/jira/browse/HADOOP-14468
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Reporter: Aaron Fabbri
>Assignee: Aaron Fabbri
>Priority: Minor
>
> Currently, when S3Guard is enabled, getFileStatus() will skip S3 if it gets a 
> result from the MetadataStore (e.g. dynamodb) first.
> I would like to add a new parameter 
> {{fs.s3a.metadatastore.getfilestatus.authoritative}} which, when true, keeps 
> the current behavior.  When false, S3AFileSystem will check both S3 and the 
> MetadataStore.
> I'm not sure yet if we want to have this behavior the same for all callers of 
> getFileStatus(), or if we only want to check both S3 and MetadataStore for 
> some internal callers such as open().



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable

2017-07-06 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16076845#comment-16076845
 ] 

Steve Loughran commented on HADOOP-14468:
-

That said, looking at all the places we call getFileStatus, it'd be a useful 
little sanity check all round. 

> S3Guard: make short-circuit getFileStatus() configurable
> 
>
> Key: HADOOP-14468
> URL: https://issues.apache.org/jira/browse/HADOOP-14468
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Reporter: Aaron Fabbri
>Assignee: Aaron Fabbri
>Priority: Minor
>
> Currently, when S3Guard is enabled, getFileStatus() will skip S3 if it gets a 
> result from the MetadataStore (e.g. dynamodb) first.
> I would like to add a new parameter 
> {{fs.s3a.metadatastore.getfilestatus.authoritative}} which, when true, keeps 
> the current behavior.  When false, S3AFileSystem will check both S3 and the 
> MetadataStore.
> I'm not sure yet if we want to have this behavior the same for all callers of 
> getFileStatus(), or if we only want to check both S3 and MetadataStore for 
> some internal callers such as open().



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable

2017-07-06 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16076817#comment-16076817
 ] 

Steve Loughran commented on HADOOP-14468:
-

The general code path for file IO is one of two things

sequential from the start (gzip unzip, CSV, text, avro)
{code}
instream= fs.open(path)
instream.read()
{code}

or: seek and read, explicit or in a readFully() call. This is the codepath in: 
.snappy files, examining columnar stored data in ORC, Parquet, ...
{code}
instream= fs.open(path)
instream.seek(somewhere)
instream.read(bytes)
instream.seek(somewhere-else)
...
{code}

Either way, there's usually a read() call very shortly after the open, which is 
when any missing file will surface, so we don't need to overreact —just make 
sure that the error message which surfaces ona 404 on the first open of a file 
is propagated up in a way which is meaningful and consistent with what people 
normally expect. HADOOP-14467 looks at that.

Doing it as a fallback for troubleshooting/monitoring is something  to consider 
though.

> S3Guard: make short-circuit getFileStatus() configurable
> 
>
> Key: HADOOP-14468
> URL: https://issues.apache.org/jira/browse/HADOOP-14468
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Reporter: Aaron Fabbri
>Assignee: Aaron Fabbri
>Priority: Minor
>
> Currently, when S3Guard is enabled, getFileStatus() will skip S3 if it gets a 
> result from the MetadataStore (e.g. dynamodb) first.
> I would like to add a new parameter 
> {{fs.s3a.metadatastore.getfilestatus.authoritative}} which, when true, keeps 
> the current behavior.  When false, S3AFileSystem will check both S3 and the 
> MetadataStore.
> I'm not sure yet if we want to have this behavior the same for all callers of 
> getFileStatus(), or if we only want to check both S3 and MetadataStore for 
> some internal callers such as open().



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable

2017-06-02 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035211#comment-16035211
 ] 

Aaron Fabbri commented on HADOOP-14468:
---

I created this JIRA to follow up on [your 
comment|https://issues.apache.org/jira/browse/HADOOP-13345?focusedCommentId=16019741=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16019741]
  and the discussion about failing fast when file is not visible in S3 in the 
read path.

I'm not 100% convinced we want this but it could be useful for:

1. Failing fast on open() instead of when we later read the stream.
2. A "safe mode" or fallback that can be enabled.  When this is set to false, 
we could collect stats on any time MetadataStore differs from S3 which would be 
interesting.  I.e. "s3 / metastore length differs" or "visible in metastore but 
not s3"

In general we do not support a mixed mode where some clients use S3Guard and 
others do not: It is not safe.  However, if there is a well-known path where 
only an external process (e.g. ETL) is dropping files for ingest, it may be 
nice to be able to support that more narrow case.  I think the existing 
behavior with list checking S3 + MetadataStore is sufficient without this 
change though.

> S3Guard: make short-circuit getFileStatus() configurable
> 
>
> Key: HADOOP-14468
> URL: https://issues.apache.org/jira/browse/HADOOP-14468
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Reporter: Aaron Fabbri
>Assignee: Aaron Fabbri
>
> Currently, when S3Guard is enabled, getFileStatus() will skip S3 if it gets a 
> result from the MetadataStore (e.g. dynamodb) first.
> I would like to add a new parameter 
> {{fs.s3a.metadatastore.getfilestatus.authoritative}} which, when true, keeps 
> the current behavior.  When false, S3AFileSystem will check both S3 and the 
> MetadataStore.
> I'm not sure yet if we want to have this behavior the same for all callers of 
> getFileStatus(), or if we only want to check both S3 and MetadataStore for 
> some internal callers such as open().



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14468) S3Guard: make short-circuit getFileStatus() configurable

2017-06-02 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035064#comment-16035064
 ] 

Steve Loughran commented on HADOOP-14468:
-

What's the reason for this? To pick up changes to files which aren't going to 
s3guard even when auth=true?

> S3Guard: make short-circuit getFileStatus() configurable
> 
>
> Key: HADOOP-14468
> URL: https://issues.apache.org/jira/browse/HADOOP-14468
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Reporter: Aaron Fabbri
>Assignee: Aaron Fabbri
>
> Currently, when S3Guard is enabled, getFileStatus() will skip S3 if it gets a 
> result from the MetadataStore (e.g. dynamodb) first.
> I would like to add a new parameter 
> {{fs.s3a.metadatastore.getfilestatus.authoritative}} which, when true, keeps 
> the current behavior.  When false, S3AFileSystem will check both S3 and the 
> MetadataStore.
> I'm not sure yet if we want to have this behavior the same for all callers of 
> getFileStatus(), or if we only want to check both S3 and MetadataStore for 
> some internal callers such as open().



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org