[jira] [Commented] (HADOOP-12949) Add HTrace to the s3a connector
[ https://issues.apache.org/jira/browse/HADOOP-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260508#comment-16260508 ]

Steve Loughran commented on HADOOP-12949:
-----------------------------------------

Revisiting this
* yes, it would be good.
* let's not worry about UA headers initially; that can be a later iteration.
* more important: linking across jobs on long-lived processes, e.g. Spark, Hive LLAP. We want those tools to create a context, have it propagate with their queries, and have the store clients pick it up.

Making this a sub-task of the S3A phase IV work, targeting Hadoop 3.1. Patches welcome!

> Add HTrace to the s3a connector
> -------------------------------
>
> Key: HADOOP-12949
> URL: https://issues.apache.org/jira/browse/HADOOP-12949
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Reporter: Madhawa Gunasekara
> Assignee: Madhawa Gunasekara
>
> Hi All,
> s3, GCS, WASB, and other cloud blob stores are becoming increasingly
> important in Hadoop. But we don't have distributed tracing for these yet. It
> would be interesting to add distributed tracing here. It would enable
> collecting really interesting data like probability distributions of PUT and
> GET requests to s3 and their impact on MR jobs, etc.
> I would like to implement this feature. Please shed some light on this.
> Thanks,
> Madhawa

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
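The "create a context, propagate it, let the store client pick it up" idea can be sketched in plain Java. This is only an illustration under stated assumptions: `QueryContext` and `FakeStoreClient` are hypothetical names, not HTrace or S3A classes, and a thread-local only covers work done on the calling thread. Propagating across threads or processes would need the context to be passed explicitly.

```java
/** Hypothetical per-thread query context; not part of HTrace or S3A. */
class QueryContext implements AutoCloseable {
  private static final ThreadLocal<QueryContext> CURRENT = new ThreadLocal<>();

  private final String id;
  private final QueryContext previous;

  private QueryContext(String id, QueryContext previous) {
    this.id = id;
    this.previous = previous;
  }

  /** Called by the application (e.g. a query engine) before it starts I/O. */
  static QueryContext enter(String id) {
    QueryContext ctx = new QueryContext(id, CURRENT.get());
    CURRENT.set(ctx);
    return ctx;
  }

  /** Returns the active context ID for this thread, or null if none is set. */
  static String currentId() {
    QueryContext ctx = CURRENT.get();
    return ctx == null ? null : ctx.id;
  }

  /** Restores whatever context was active before enter() was called. */
  @Override
  public void close() {
    if (previous == null) {
      CURRENT.remove();
    } else {
      CURRENT.set(previous);
    }
  }
}

/** Hypothetical store client that tags each operation with the active context. */
class FakeStoreClient {
  String describeGet(String key) {
    String id = QueryContext.currentId();
    return "GET " + key + " ctx=" + (id == null ? "none" : id);
  }
}
```

An application would wrap each query in `try (QueryContext q = QueryContext.enter("query-1")) { ... }`, so that every store call made inside the block is attributed to that query and the previous context is restored afterwards.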
[jira] [Commented] (HADOOP-12949) Add HTrace to the s3a connector
[ https://issues.apache.org/jira/browse/HADOOP-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385826#comment-15385826 ]

Steve Loughran commented on HADOOP-12949:
-----------------------------------------

Moved back to a dependency of S3A phase III from phase II; I'm not expecting this for Hadoop 2.8.

Colin, regarding S3 and UA headers: yes, Amazon can use the UA headers when dealing with problems. But that's for support issues, not performance (except in the more general "why am I being throttled" case).
[jira] [Commented] (HADOOP-12949) Add HTrace to the s3a connector
[ https://issues.apache.org/jira/browse/HADOOP-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340829#comment-15340829 ]

Colin Patrick McCabe commented on HADOOP-12949:
-----------------------------------------------

Yeah, we certainly could use the UA header for this. That assumes that Amazon's s3 implementation will start looking for this (which maybe they will?). In the short term, the big win will be just connecting up the job being run with the operations being done at the s3a level.
[jira] [Commented] (HADOOP-12949) Add HTrace to the s3a connector
[ https://issues.apache.org/jira/browse/HADOOP-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338785#comment-15338785 ]

Steve Loughran commented on HADOOP-12949:
-----------------------------------------

+ we'll want to have the htrace context ID go all the way down to s3 by way of the HADOOP-13122 UA header. That lets your storage infra provider know which queries are causing problems and, if this goes via a proxy capable of reading the HTTP requests, lets them sample and correlate with network load.
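As a rough illustration of carrying a trace context ID in the UA header, the sketch below appends an ID to a base User-Agent string. The class name `TraceUserAgent`, the `trace/<id>` token format, and the sanitising rule are all assumptions made for this example; nothing here is taken from HADOOP-13122 or the HTrace API.

```java
/** Hypothetical helper; the "trace/<id>" token format is an assumption. */
class TraceUserAgent {
  /**
   * Returns the base User-Agent with a trace token appended.
   * The ID is reduced to letters, digits, '.', '_' and '-' so that it
   * cannot introduce spaces, separators, or line breaks into the header.
   */
  static String withTraceId(String baseUserAgent, String traceId) {
    String safe = traceId.replaceAll("[^A-Za-z0-9._-]", "");
    if (safe.isEmpty()) {
      // Nothing usable to add; leave the header unchanged.
      return baseUserAgent;
    }
    return baseUserAgent + " trace/" + safe;
  }
}
```

A proxy or storage provider that logs the header could then group requests by the `trace/` token, provided both sides agree on the format.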
[jira] [Commented] (HADOOP-12949) Add HTrace to the s3a connector
[ https://issues.apache.org/jira/browse/HADOOP-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327116#comment-15327116 ]

Steve Loughran commented on HADOOP-12949:
-----------------------------------------

Marking as a dependency of s3a phase III
[jira] [Commented] (HADOOP-12949) Add HTrace to the s3a connector
[ https://issues.apache.org/jira/browse/HADOOP-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206107#comment-15206107 ]

Steve Loughran commented on HADOOP-12949:
-----------------------------------------

There's actually some metrics collection in openstack swift; look under {{org.apache.hadoop.fs.swift.util.DurationStats}}. The stats log primarily to stdout and list min, max, (moving) arithmetic mean, and stddev by HTTP verb.

# It's pretty low cost to do this; even when hbase sampling is inactive, the stats for an FS can be collected.
# The stats showed that rackspace UK throttles delete requests; the more files in a directory I was cleaning up on teardown, the longer it took, and exponentially rather than linearly.
# I didn't hook the code up to the normal hadoop metrics; it's something I'd add as an option now, because it does become something you need to monitor now that we are shifting to longer-lived applications.
# I'd add more on causes of operations, specifically: open(), seek(), duration of close(), delete(). These are things where the fact that object stores are generally O(files*data) means they don't work as expected; finding that mismatch of expectations matters.

More and more object stores are coming in. While s3 is the main one, it'd be good to have the core stuff store neutral. The classes from hadoop-openstack can be moved if that helps; the per-verb stuff is useful at the deep levels, while htrace monitoring can track the cost of specific actions.
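To make the per-verb statistics concrete, here is a minimal sketch of a tracker that keeps count, min, max, mean, and standard deviation per HTTP verb. It is not the hadoop-openstack `DurationStats` class: `VerbDurationStats` and its methods are hypothetical, and it computes a running mean and sample standard deviation with Welford's update rather than whatever moving average the original code uses.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-verb duration tracker; not the hadoop-openstack class. */
class VerbDurationStats {
  /** Running statistics for one HTTP verb, updated with Welford's method. */
  static final class Stats {
    long count;
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double mean;
    double m2; // sum of squared deviations from the running mean

    void add(double millis) {
      count++;
      min = Math.min(min, millis);
      max = Math.max(max, millis);
      double delta = millis - mean;
      mean += delta / count;
      m2 += delta * (millis - mean);
    }

    /** Sample standard deviation; 0 until there are at least two samples. */
    double stddev() {
      return count > 1 ? Math.sqrt(m2 / (count - 1)) : 0.0;
    }
  }

  private final Map<String, Stats> byVerb = new TreeMap<>();

  /** Records one operation's duration under its verb, e.g. "GET" or "DELETE". */
  synchronized void record(String verb, double millis) {
    byVerb.computeIfAbsent(verb, v -> new Stats()).add(millis);
  }

  /** Returns the live stats object for a verb, or null if none was recorded. */
  synchronized Stats get(String verb) {
    return byVerb.get(verb);
  }

  /** One line per verb, suitable for logging to stdout. */
  synchronized String summary() {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Stats> e : byVerb.entrySet()) {
      Stats s = e.getValue();
      sb.append(String.format(Locale.ROOT,
          "%s count=%d min=%.1f max=%.1f mean=%.1f stddev=%.1f%n",
          e.getKey(), s.count, s.min, s.max, s.mean, s.stddev()));
    }
    return sb.toString();
  }
}
```

Because each update is a handful of arithmetic operations, collecting these numbers stays cheap even when tracing is switched off. A rising DELETE mean as directory size grows is the kind of throttling signal described in the comment above.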
[jira] [Commented] (HADOOP-12949) Add HTrace to the s3a connector
[ https://issues.apache.org/jira/browse/HADOOP-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15204750#comment-15204750 ]

Colin Patrick McCabe commented on HADOOP-12949:
-----------------------------------------------

Hi [~madhawa], great idea! I think the first thing to do is to read a bit about how to set up HTrace. See http://blog.cloudera.com/blog/2015/12/new-in-cloudera-labs-apache-htrace-incubating/

If you can get a working setup for HTrace-on-HDFS, it will help for adding tracing to other projects such as the s3a connector.