[jira] [Commented] (HADOOP-16830) Add public IOStatistics API; S3A to support

2020-07-22 Thread Aaron Fabbri (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163131#comment-17163131
 ] 

Aaron Fabbri commented on HADOOP-16830:
---

Been following along. I should be able to finish a review by the end of the 
week. Wonder if we could get [~mackrorysd] to skim over the S3A stats changes?

> Add public IOStatistics API; S3A to support
> ---
>
> Key: HADOOP-16830
> URL: https://issues.apache.org/jira/browse/HADOOP-16830
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs, fs/s3
>Affects Versions: 3.3.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>
> Applications like to collect the statistics which specific operations take, 
> by collecting exactly those operations done during the execution of FS API 
> calls by their individual worker threads, and returning these to their job 
> driver
> * S3A has a statistics API for some streams, but it's a non-standard one; 
> Impala &c can't use it
> * FileSystem storage statistics are public, but as they aren't cross-thread, 
> they don't aggregate properly
> Proposed
> # A new IOStatistics interface to serve up statistics
> # S3A to implement
> # other stores to follow
> # Pass-through from the usual wrapper classes (FS data input/output streams)
> It's hard to think about how best to offer an API for operation context 
> stats, and how to actually implement.
> ThreadLocal isn't enough because the helper threads need to update on the 
> thread local value of the instigator
> My Initial PoC doesn't address that issue, but it shows what I'm thinking of
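
To illustrate the ThreadLocal limitation described above, here is a minimal sketch. It is not code from the PoC or the PR; the class, field, and method names (ThreadLocalStatsDemo, BYTES_READ, viaThreadLocal, viaExplicitContext) are hypothetical. It shows that an update made by a helper thread lands in the helper's own thread-local value, so the instigating thread only sees it if a context object is handed over explicitly.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical demo class; not part of Hadoop. */
final class ThreadLocalStatsDemo {
  // A naive per-thread statistic, as a ThreadLocal-only design would keep it.
  static final ThreadLocal<AtomicLong> BYTES_READ =
      ThreadLocal.withInitial(AtomicLong::new);

  /** The helper thread updates its own thread-local copy; the caller's stays untouched. */
  static long viaThreadLocal() {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      CompletableFuture.runAsync(() -> BYTES_READ.get().addAndGet(1024), pool).join();
    } finally {
      pool.shutdown();
    }
    // Value seen by the calling (instigating) thread.
    return BYTES_READ.get().get();
  }

  /** The caller hands its own context object to the helper thread. */
  static long viaExplicitContext() {
    AtomicLong context = new AtomicLong();
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      CompletableFuture.runAsync(() -> context.addAndGet(1024), pool).join();
    } finally {
      pool.shutdown();
    }
    return context.get();
  }
}
```

On a thread that has not touched BYTES_READ itself, viaThreadLocal() returns 0 while viaExplicitContext() returns 1024. Passing the context explicitly is only one possible way to wire this up; the issue leaves the actual design open.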



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16830) Add public IOStatistics API; S3A to support

2020-07-21 Thread Luca Canali (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161970#comment-17161970
 ] 

Luca Canali commented on HADOOP-16830:
--

[~ste...@apache.org] I have compiled and also briefly tested the PR with Spark 
reading from S3A, and the first exploration I did looks quite good to me. As 
mentioned previously, one of my goals with this is to add time-based metrics to 
IO Statistics, as in this [proof-of-concept implementation of some read time 
metrics for 
S3A|https://github.com/LucaCanali/hadoop/commit/4ed077061e5826711307941dd397250e2afc47a2].
I was wondering if it would make sense to include in this patch a list of 
statistics names for time-based IO instrumentation, so as to guide the naming 
convention and future implementation efforts?
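
As a rough illustration of what time-based IO instrumentation with named statistics could look like, here is a small sketch. It is not taken from the PR or from the proof-of-concept commit linked above: the class TimedInputStream, its fields, and the two statistic name constants are hypothetical, chosen only to show the idea of pairing an operation count with an accumulated duration.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical wrapper stream that times bulk reads; not part of Hadoop. */
final class TimedInputStream extends FilterInputStream {
  // Hypothetical statistic names, used here only to show a naming pattern.
  static final String READ_OPERATIONS = "stream_read_operations";
  static final String READ_TIME_NANOS = "stream_read_time_nanos";

  private final AtomicLong readOperations = new AtomicLong();
  private final AtomicLong readTimeNanos = new AtomicLong();

  TimedInputStream(InputStream in) {
    super(in);
  }

  /** Times each bulk read; single-byte read() is left untimed in this sketch. */
  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    long start = System.nanoTime();
    try {
      return super.read(b, off, len);
    } finally {
      readOperations.incrementAndGet();
      readTimeNanos.addAndGet(System.nanoTime() - start);
    }
  }

  /** Snapshot of the statistics, keyed by name. */
  Map<String, Long> snapshot() {
    Map<String, Long> stats = new TreeMap<>();
    stats.put(READ_OPERATIONS, readOperations.get());
    stats.put(READ_TIME_NANOS, readTimeNanos.get());
    return stats;
  }

  /** Reads 16 in-memory bytes through an 8-byte buffer and returns the statistics. */
  static Map<String, Long> demo() {
    try (TimedInputStream in =
             new TimedInputStream(new ByteArrayInputStream(new byte[16]))) {
      byte[] buffer = new byte[8];
      while (in.read(buffer) != -1) {
        // Discard the data; only the timing matters here.
      }
      return in.snapshot();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Both values are plain long counters, so they would aggregate across streams by simple addition.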




[jira] [Commented] (HADOOP-16830) Add public IOStatistics API; S3A to support

2020-07-07 Thread Luca Canali (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153056#comment-17153056
 ] 

Luca Canali commented on HADOOP-16830:
--

Thanks, that looks quite useful and promising. I'll test it and hopefully 
provide some more meaningful feedback (although it will take another couple of 
weeks for me to do that).




[jira] [Commented] (HADOOP-16830) Add public IOStatistics API; S3A to support

2020-07-03 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151061#comment-17151061
 ] 

Steve Loughran commented on HADOOP-16830:
-

[~lucacanali] can you look at the latest PR? 
https://github.com/apache/hadoop/pull/2069

I can use it to collect/aggregate stats across workers, marshal them as JSON 
and save them in the _SUCCESS file.

The big limitation is that without thread local stats contexts, we don't get as 
much information as we can about performance, especially 
reading/seeking/network throttling &c. Somehow we are going to need to do that. 
But not yet. At least here we can start, especially if the ORC/Parquet readers 
collect their stats from all the streams they read, *and* something collects 
those.

I promise I will collect stats on IO work performed across multiple threads on 
behalf of a caller, if people commit to writing the wiring needed to retrieve 
and aggregate them.




[jira] [Commented] (HADOOP-16830) Add public IOStatistics API; S3A to support

2020-06-11 Thread Luca Canali (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133160#comment-17133160
 ] 

Luca Canali commented on HADOOP-16830:
--

[~ste...@apache.org] I think that a simple interface like you propose would be 
quite good.
There may be a case for supporting histograms too (instrumentation for I/O 
latency histograms), although it's not a priority for me at this stage.




[jira] [Commented] (HADOOP-16830) Add public IOStatistics API; S3A to support

2020-06-10 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130954#comment-17130954
 ] 

Steve Loughran commented on HADOOP-16830:
-

[~lucacanali] I've done an iteration on a variant designed to support and 
aggregate different types.

Having written the new extensible design, I've decided I don't like it. It is 
too complex, as I'm trying to support arbitrary-arity tuples of any kind of 
statistic. It makes iterating/parsing this stuff way too complex.

Here's a better idea: we only support a limited set:

* counter: long
* min: long
* max: long
* mean: (double, long)
* gauge: long

# All but gauge have simple aggregation; for gauge I'll add the values up too, 
on the assumption that they will be positive values (e.g. 'number of active 
reads').
# Every set will have its own iterator.
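
The limited set above and its aggregation rules can be sketched as follows. This is an assumption-laden illustration rather than the API in the PR: the class SimpleStatsSketch, its field names, and the nested Mean type are hypothetical, and the mean is taken to be a (mean value, sample count) pair.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the five statistic types; not the API from the PR. */
final class SimpleStatsSketch {

  /** A mean kept with its sample count so that two means can be recombined. */
  static final class Mean {
    final double mean;
    final long samples;

    Mean(double mean, long samples) {
      this.mean = mean;
      this.samples = samples;
    }

    /** Weighted recombination of two means by their sample counts. */
    Mean combine(Mean other) {
      long total = samples + other.samples;
      if (total == 0) {
        return new Mean(0.0, 0);
      }
      return new Mean((mean * samples + other.mean * other.samples) / total, total);
    }
  }

  // One map per statistic type, so each set can be iterated on its own.
  final Map<String, Long> counters = new TreeMap<>();
  final Map<String, Long> minimums = new TreeMap<>();
  final Map<String, Long> maximums = new TreeMap<>();
  final Map<String, Mean> means = new TreeMap<>();
  final Map<String, Long> gauges = new TreeMap<>();

  /** Merge another snapshot into this one, using one rule per statistic type. */
  void aggregate(SimpleStatsSketch other) {
    other.counters.forEach((k, v) -> counters.merge(k, v, Long::sum));
    other.minimums.forEach((k, v) -> minimums.merge(k, v, Long::min));
    other.maximums.forEach((k, v) -> maximums.merge(k, v, Long::max));
    other.means.forEach((k, v) -> means.merge(k, v, Mean::combine));
    // Gauges are summed too, on the assumption that they are positive values.
    other.gauges.forEach((k, v) -> gauges.merge(k, v, Long::sum));
  }
}
```

Keeping each type in its own map means a consumer never has to inspect a value to find out how to combine it.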

What do people think? I can do an iteration fairly quickly.




[jira] [Commented] (HADOOP-16830) Add public IOStatistics API; S3A to support

2020-06-01 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121005#comment-17121005
 ] 

Steve Loughran commented on HADOOP-16830:
-

Interesting thought. We don't actually care what metrics get collected; what I 
do want is for it to be extensible and unique to each object instance being 
used, rather than shared across all instances or tied to a single thread.

A key goal is to allow applications to aggregate statistics; for counters this 
is a simple addition. And I've been avoiding quantiles/metrics which can go 
down as well as up because they don't really aggregate.

Performance metrics are interesting though: they can be aggregated. An 
aggregate mean can be recalculated if the size of each set of values is known; 
a max value is simply the largest one.

Maybe the trick here is for each value to be more than just an integer: a type 
(counter, mean-performance, min-perf, max-perf) and at least two values (needed 
for the mean recalculation). Initially we'd just have that integer enum and 
either enough fields to cover all eventualities or an array. We need the 
results to be marshallable (protobuf, json, serializable) and stable enough for 
apps to use.

Yes, this will need changes to the initial design, but it's better to have 
something extensible now rather than realise later we missed an opportunity. 
That points to an array of values, doesn't it? Maybe a unit too, for something 
like (type, unit, long values[]), plus public helper methods to combine two 
fields of a specific type. Would this work? Or are we overengineering it?
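
A minimal sketch of the (type, unit, long values[]) idea with a combining helper might look like this. Everything here is hypothetical: the names StatType and TypedStat, the choice of a String for the unit, and in particular the assumption that a mean is stored as {sum, sample count} so that it fits into a long array and can be recalculated after merging.

```java
/** Hypothetical statistic types; not taken from the PR. */
enum StatType { COUNTER, MIN, MAX, MEAN }

/** Hypothetical (type, unit, long values[]) tuple with a combining helper. */
final class TypedStat {
  final StatType type;
  final String unit;
  // COUNTER, MIN, MAX: {value}. MEAN: {sum, sampleCount} (an assumption).
  final long[] values;

  TypedStat(StatType type, String unit, long... values) {
    this.type = type;
    this.unit = unit;
    this.values = values.clone();
  }

  /** Combine two statistics of the same type and unit. */
  static TypedStat combine(TypedStat a, TypedStat b) {
    if (a.type != b.type || !a.unit.equals(b.unit)) {
      throw new IllegalArgumentException("type or unit mismatch");
    }
    switch (a.type) {
      case COUNTER:
        return new TypedStat(a.type, a.unit, a.values[0] + b.values[0]);
      case MIN:
        return new TypedStat(a.type, a.unit, Math.min(a.values[0], b.values[0]));
      case MAX:
        return new TypedStat(a.type, a.unit, Math.max(a.values[0], b.values[0]));
      case MEAN:
        // Sums and sample counts add; the mean is derived on demand.
        return new TypedStat(a.type, a.unit,
            a.values[0] + b.values[0], a.values[1] + b.values[1]);
      default:
        throw new AssertionError(a.type);
    }
  }

  /** Mean value for a MEAN statistic; 0.0 when there are no samples. */
  double mean() {
    if (type != StatType.MEAN) {
      throw new IllegalStateException("not a mean: " + type);
    }
    return values[1] == 0 ? 0.0 : (double) values[0] / values[1];
  }
}
```

For example, combining a mean of {300, 3} with a mean of {100, 1} gives {400, 4}, that is, a mean of 100.0. The switch in combine() also shows the cost of this design: every consumer has to know how to interpret the array for each type.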

The S3A committers do actually aggregate filesystem StorageStatistics, but as 
that is per-FS, and Spark has many workers sharing it, it's not that useful. 
This API is intended to be something Spark/Tez/Impala can adopt and aggregate 
for meaningful reporting. What would suit best here?




[jira] [Commented] (HADOOP-16830) Add public IOStatistics API; S3A to support

2020-05-22 Thread Luca Canali (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114069#comment-17114069
 ] 

Luca Canali commented on HADOOP-16830:
--

We find that IO time metrics can be quite useful for debugging, and I wanted to 
check if that could make sense in the context of this JIRA.

As an example, for Apache Spark we have tested hooking up I/O timing metrics 
for S3A into Spark's monitoring system (and also for HDFS and other 
Hadoop-compatible filesystems).
From the end-user point of view the result is I/O time instrumentation in a 
dashboard together with Spark's other metrics (such as CPU time and run time), 
[example|https://www.slideshare.net/databricks/performance-troubleshooting-using-apache-spark-metrics/41]

The tested implementation relied on Spark 3.0's new plugin infrastructure 
[SPARK-29397|https://issues.apache.org/jira/browse/SPARK-29397], which allows 
integrating external metrics into Spark instrumentation.  
Example code of [Spark's plugins to capture Hadoop IO 
metrics|https://github.com/cerndb/SparkPlugins/tree/master/src/main/scala/ch/cern/experimental]
Proof of concept [implementation of some read time metrics for 
S3A|https://github.com/LucaCanali/hadoop/commit/4ed077061e5826711307941dd397250e2afc47a2]

