[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29409&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29409 ]

ASF GitHub Bot logged work on TS-4870:

Author: ASF GitHub Bot
Created on: 20/Sep/16 17:31
Start Date: 20/Sep/16 17:31
Worklog Time Spent: 10m
Work Description: Github user atsci commented on the issue:

https://github.com/apache/trafficserver/pull/1028

FreeBSD build *successful*! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/841/ for details.

Issue Time Tracking
---
Worklog Id: (was: 29409)
Time Spent: 2h 50m (was: 2h 40m)

> Storage can be marked offline multiple times which breaks related metrics
> ---
>
> Key: TS-4870
> URL: https://issues.apache.org/jira/browse/TS-4870
> Project: Traffic Server
> Issue Type: Bug
> Components: Cache, Metrics
> Reporter: Gancho Tenev
> Assignee: Gancho Tenev
> Fix For: 7.1.0
>
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> Let us say traffic server is running with 2 disks
> {code}
> $ cat etc/trafficserver/storage.config
> /dev/sdb
> /dev/sdc
> $ sudo fdisk -l|grep 'Disk /dev/sd[b|c]'
> Disk /dev/sdb: 134 MB, 134217728 bytes
> Disk /dev/sdc: 134 MB, 134217728 bytes
> {code}
> Let us see what happens when we mark the same disk ({{/dev/sdb}}) offline
> 3 times in a row and check the {{proxy.node.cache.bytes_total}}.
> {code}
> # Initial cache size (when using both disks).
> $ ./bin/traffic_ctl metric get proxy.node.cache.bytes_total
> proxy.node.cache.bytes_total 268025856
> # Take 1st disk offline. Cache size changes as expected.
> $ sudo ./bin/traffic_ctl storage offline /dev/sdb
> $ ./bin/traffic_ctl metric get proxy.node.cache.bytes_total
> proxy.node.cache.bytes_total 134012928
> # Take same disk offline again. Not good!
> $ sudo ./bin/traffic_ctl storage offline /dev/sdb
> $ ./bin/traffic_ctl metric get proxy.node.cache.bytes_total
> proxy.node.cache.bytes_total 0
> # Take same disk offline again. Negative value.
> $ sudo ./bin/traffic_ctl storage offline /dev/sdb
> $ ./bin/traffic_ctl metric get proxy.node.cache.bytes_total
> proxy.node.cache.bytes_total -134012928
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
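The numbers in the transcript above are what one would expect if the disk's share of the cache were subtracted from the total on every `storage offline` call, whether or not the disk was already offline. A minimal sketch of that arithmetic follows; `Disk` and `mark_offline_unguarded` are hypothetical names made up for illustration and are not Traffic Server code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a cache disk; NOT the real CacheDisk struct.
struct Disk {
  int64_t bytes; // bytes this disk contributes to the cache total
};

// Unguarded accounting (assumed behaviour before the fix): every call
// subtracts the disk's share again, even if it was already subtracted.
inline int64_t mark_offline_unguarded(int64_t bytes_total, const Disk &d)
{
  return bytes_total - d.bytes;
}
```

Starting from 268025856 and calling this three times with a 134012928-byte disk walks through 134012928, 0 and -134012928, matching the metric values reported in the issue.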
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29408&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29408 ]

ASF GitHub Bot logged work on TS-4870:

Author: ASF GitHub Bot
Created on: 20/Sep/16 17:21
Start Date: 20/Sep/16 17:21
Worklog Time Spent: 10m
Work Description: Github user jpeach closed the pull request at:

https://github.com/apache/trafficserver/pull/1028

Issue Time Tracking
---
Worklog Id: (was: 29408)
Time Spent: 2h 40m (was: 2.5h)
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29402&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29402 ]

ASF GitHub Bot logged work on TS-4870:

Author: ASF GitHub Bot
Created on: 20/Sep/16 15:24
Start Date: 20/Sep/16 15:24
Worklog Time Spent: 10m
Work Description: Github user jpeach commented on the issue:

https://github.com/apache/trafficserver/pull/1028

@gtenev This looks good. Can you please squash the branch?

Issue Time Tracking
---
Worklog Id: (was: 29402)
Time Spent: 2.5h (was: 2h 20m)
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29390&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29390 ]

ASF GitHub Bot logged work on TS-4870:

Author: ASF GitHub Bot
Created on: 20/Sep/16 13:36
Start Date: 20/Sep/16 13:36
Worklog Time Spent: 10m
Work Description: Github user zwoop commented on the issue:

https://github.com/apache/trafficserver/pull/1028

@jpeach we ok to land this now?

Issue Time Tracking
---
Worklog Id: (was: 29390)
Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29353&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29353 ]

ASF GitHub Bot logged work on TS-4870:

Author: ASF GitHub Bot
Created on: 19/Sep/16 22:39
Start Date: 19/Sep/16 22:39
Worklog Time Spent: 10m
Work Description: Github user atsci commented on the issue:

https://github.com/apache/trafficserver/pull/1028

Linux build *successful*! See https://ci.trafficserver.apache.org/job/Github-Linux/731/ for details.

Issue Time Tracking
---
Worklog Id: (was: 29353)
Time Spent: 2h 10m (was: 2h)
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29352&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29352 ]

ASF GitHub Bot logged work on TS-4870:

Author: ASF GitHub Bot
Created on: 19/Sep/16 22:38
Start Date: 19/Sep/16 22:38
Worklog Time Spent: 10m
Work Description: Github user atsci commented on the issue:

https://github.com/apache/trafficserver/pull/1028

FreeBSD build *successful*! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/835/ for details.

Issue Time Tracking
---
Worklog Id: (was: 29352)
Time Spent: 2h (was: 1h 50m)
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29351&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29351 ]

ASF GitHub Bot logged work on TS-4870:

Author: ASF GitHub Bot
Created on: 19/Sep/16 22:29
Start Date: 19/Sep/16 22:29
Worklog Time Spent: 10m
Work Description: Github user gtenev commented on the issue:

https://github.com/apache/trafficserver/pull/1028

@jpeach, renamed "offline" flag to "online", added some reasoning about why the flag was necessary in the last commit description.

Issue Time Tracking
---
Worklog Id: (was: 29351)
Time Spent: 1h 50m (was: 1h 40m)
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29345&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29345 ]

ASF GitHub Bot logged work on TS-4870:

Author: ASF GitHub Bot
Created on: 19/Sep/16 20:59
Start Date: 19/Sep/16 20:59
Worklog Time Spent: 10m
Work Description: Github user gtenev commented on a diff in the pull request:

https://github.com/apache/trafficserver/pull/1028#discussion_r79487888

--- Diff: iocore/cache/Cache.cc ---
@@ -2000,6 +2000,12 @@ CacheProcessor::mark_storage_offline(CacheDisk *d ///< Target disk
   uint64_t total_dir_delete = 0;
   uint64_t used_dir_delete  = 0;

+  /* Don't mark it again, it will invalidate the stats! */
+  if (d->offline) {
+    return this->has_online_storage();
+  }
+  d->offline = true;
--- End diff --

@jpeach, great! sure, I can rename the flag to "online" :)

Issue Time Tracking
---
Worklog Id: (was: 29345)
Time Spent: 1h 40m (was: 1.5h)
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29344&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29344 ]

ASF GitHub Bot logged work on TS-4870:

Author: ASF GitHub Bot
Created on: 19/Sep/16 20:52
Start Date: 19/Sep/16 20:52
Worklog Time Spent: 10m
Work Description: Github user jpeach commented on a diff in the pull request:

https://github.com/apache/trafficserver/pull/1028#discussion_r79486346

--- Diff: iocore/cache/Cache.cc ---
@@ -2000,6 +2000,12 @@ CacheProcessor::mark_storage_offline(CacheDisk *d ///< Target disk
   uint64_t total_dir_delete = 0;
   uint64_t used_dir_delete  = 0;

+  /* Don't mark it again, it will invalidate the stats! */
+  if (d->offline) {
+    return this->has_online_storage();
+  }
+  d->offline = true;
--- End diff --

@gtenev and I discussed this. The problem is that in the common case, the disk is already bad when ``mark_storage_offline`` is called, so we can't depend on the good->bad state transition to know when to update the accounting.

@gtenev This looks fine to me, but I'd suggest calling the flag ``online`` so that we avoid the double negatives.

Issue Time Tracking
---
Worklog Id: (was: 29344)
Time Spent: 1.5h (was: 1h 20m)
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29336&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29336 ]

ASF GitHub Bot logged work on TS-4870:

Author: ASF GitHub Bot
Created on: 19/Sep/16 20:09
Start Date: 19/Sep/16 20:09
Worklog Time Spent: 10m
Work Description: Github user jpeach commented on the issue:

https://github.com/apache/trafficserver/pull/1028

@gtenev If I'm reading your patch correctly, it adds the ``offline`` flag such that disks are marked bad *and* offline. That doesn't sound like what you intended from the description above.

Issue Time Tracking
---
Worklog Id: (was: 29336)
Time Spent: 1h 20m (was: 1h 10m)
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29325&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29325 ]

ASF GitHub Bot logged work on TS-4870:

Author: ASF GitHub Bot
Created on: 19/Sep/16 18:51
Start Date: 19/Sep/16 18:51
Worklog Time Spent: 10m
Work Description: Github user gtenev commented on the issue:

https://github.com/apache/trafficserver/pull/1028

@jpeach, appreciate your feedback!

It felt that "disk being offline" (might be an operator's decision) and "disk being bad" (number of IO errors reached a threshold) are better kept separate in general.

IMHO using `CacheDisk::num_errors` to mark the disk offline could be error prone, and here is an example.

Let us say ``proxy.config.cache.max_disk_errors=5`` and a disk keeps failing, causing ``handle_disk_failure()`` to be called. At some point ``CacheDisk::num_errors`` becomes ``5``, which causes ``mark_storage_offline()`` to be called. At this point, since ``CacheDisk::num_errors=5``, ``true==DISK_BAD(d)``.

It seems that if I did ``if(!DISK_BAD(d)) {...}`` (as suggested above) it would not execute the code in ``mark_storage_offline()`` at all; for instance, ``proxy.process.cache.bytes_total_stat`` would not get updated as it should.

This is one of my first adventures in the "cache" component, so I hope I am not missing something. Please let me know what you think and I will gladly look/test/change as necessary.

Issue Time Tracking
---
Worklog Id: (was: 29325)
Time Spent: 1h 10m (was: 1h)
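The argument in this comment can be modelled in a few lines. The sketch below is only an illustration under stated assumptions: `Disk`, `disk_bad`, `kMaxDiskErrors` and both `mark_offline_*` functions are hypothetical names, and the "bad" test is assumed to mean "error count has reached the configured maximum", as the comment describes. It is not the code from the pull request.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a cache disk; NOT the real CacheDisk struct.
struct Disk {
  int64_t bytes;  // bytes this disk contributes to the cache total
  int num_errors; // I/O errors seen so far
  bool online;    // separate flag, as proposed in the review
};

// Stands in for proxy.config.cache.max_disk_errors (value from the example).
constexpr int kMaxDiskErrors = 5;

// Assumed meaning of the "disk is bad" check: error threshold reached.
inline bool disk_bad(const Disk &d)
{
  return d.num_errors >= kMaxDiskErrors;
}

// Guard on the "bad" state: if the disk already hit the threshold before
// this is called, the accounting below is skipped entirely.
inline void mark_offline_bad_guard(Disk &d, int64_t &bytes_total)
{
  if (disk_bad(d)) {
    return;
  }
  d.num_errors = kMaxDiskErrors;
  bytes_total -= d.bytes;
}

// Guard on a dedicated flag: the accounting runs exactly once per disk,
// regardless of how the disk got into a bad state.
inline void mark_offline_flag_guard(Disk &d, int64_t &bytes_total)
{
  if (!d.online) {
    return;
  }
  d.online = false;
  bytes_total -= d.bytes;
}
```

With a disk that has already accumulated 5 errors, the first variant leaves the total untouched, while the second subtracts the disk's bytes once and ignores repeated calls.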
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29318&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29318 ]

ASF GitHub Bot logged work on TS-4870:

Author: ASF GitHub Bot
Created on: 19/Sep/16 15:59
Start Date: 19/Sep/16 15:59
Worklog Time Spent: 10m
Work Description: Github user jpeach commented on a diff in the pull request:

https://github.com/apache/trafficserver/pull/1028#discussion_r79424573

--- Diff: iocore/cache/Cache.cc ---
@@ -2000,6 +2000,12 @@ CacheProcessor::mark_storage_offline(CacheDisk *d ///< Target disk
   uint64_t total_dir_delete = 0;
   uint64_t used_dir_delete  = 0;

+  /* Don't mark it again, it will invalidate the stats! */
+  if (d->offline) {
+    return this->has_online_storage();
+  }
+  d->offline = true;
--- End diff --

Why do you introduce a new flag rather than making the code conditional on the ``DISK_BAD`` check? e.g.

```C
if (!DISK_BAD(d)) {
  SET_DISK_BAD(d);
  // Do all the other stuff ...
}
```

Issue Time Tracking
---
Worklog Id: (was: 29318)
Time Spent: 1h (was: 50m)
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29222&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29222 ]

ASF GitHub Bot logged work on TS-4870:

Author: ASF GitHub Bot
Created on: 15/Sep/16 22:27
Start Date: 15/Sep/16 22:27
Worklog Time Spent: 10m
Work Description: Github user gtenev commented on a diff in the pull request:

https://github.com/apache/trafficserver/pull/1028#discussion_r79075679

--- Diff: iocore/cache/P_CacheDisk.h ---
@@ -97,6 +97,7 @@ struct CacheDisk : public Continuation {
   int num_errors;
   int cleared;
   bool read_only_p;
+  bool offline; /* flag marking cache disk offline (because of too many failures or by the operator). */
--- End diff --

This is another review test (per jpeach's request). "Start a review"

Issue Time Tracking
---
Worklog Id: (was: 29222)
Time Spent: 50m (was: 40m)
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29221&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29221 ]

ASF GitHub Bot logged work on TS-4870:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 15/Sep/16 22:26
            Start Date: 15/Sep/16 22:26
    Worklog Time Spent: 10m
      Work Description: Github user gtenev commented on a diff in the pull request:

    https://github.com/apache/trafficserver/pull/1028#discussion_r79075616

    --- Diff: iocore/cache/P_CacheDisk.h ---
    @@ -97,6 +97,7 @@ struct CacheDisk : public Continuation {
       int num_errors;
       int cleared;
       bool read_only_p;
    +  bool offline; /* flag marking cache disk offline (because of too many failures or by the operator). */
    --- End diff --

    This is another review test (per jpeach's request). "Add single comment"

Issue Time Tracking
-------------------

    Worklog Id:     (was: 29221)
    Time Spent: 40m  (was: 0.5h)

> Storage can be marked offline multiple times which breaks related metrics
> -------------------------------------------------------------------------
>
>                 Key: TS-4870
>                 URL: https://issues.apache.org/jira/browse/TS-4870
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cache, Metrics
>            Reporter: Gancho Tenev
>            Assignee: Gancho Tenev
>             Fix For: 7.1.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29204&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29204 ]

ASF GitHub Bot logged work on TS-4870:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 15/Sep/16 21:04
            Start Date: 15/Sep/16 21:04
    Worklog Time Spent: 10m
      Work Description: Github user atsci commented on the issue:

    https://github.com/apache/trafficserver/pull/1028

    Linux build *successful*! See https://ci.trafficserver.apache.org/job/Github-Linux/713/ for details.

Issue Time Tracking
-------------------

    Worklog Id:     (was: 29204)
    Time Spent: 0.5h  (was: 20m)

> Storage can be marked offline multiple times which breaks related metrics
> -------------------------------------------------------------------------
>
>                 Key: TS-4870
>                 URL: https://issues.apache.org/jira/browse/TS-4870
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cache, Metrics
>            Reporter: Gancho Tenev
>            Assignee: Gancho Tenev
>             Fix For: 7.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29203&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29203 ]

ASF GitHub Bot logged work on TS-4870:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 15/Sep/16 21:03
            Start Date: 15/Sep/16 21:03
    Worklog Time Spent: 10m
      Work Description: Github user atsci commented on the issue:

    https://github.com/apache/trafficserver/pull/1028

    FreeBSD build *successful*! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/817/ for details.

Issue Time Tracking
-------------------

    Worklog Id:     (was: 29203)
    Time Spent: 20m  (was: 10m)

> Storage can be marked offline multiple times which breaks related metrics
> -------------------------------------------------------------------------
>
>                 Key: TS-4870
>                 URL: https://issues.apache.org/jira/browse/TS-4870
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cache, Metrics
>            Reporter: Gancho Tenev
>            Assignee: Gancho Tenev
>             Fix For: 7.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
[jira] [Work logged] (TS-4870) Storage can be marked offline multiple times which breaks related metrics
[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29201&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29201 ]

ASF GitHub Bot logged work on TS-4870:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 15/Sep/16 20:50
            Start Date: 15/Sep/16 20:50
    Worklog Time Spent: 10m
      Work Description: GitHub user gtenev opened a pull request:

    https://github.com/apache/trafficserver/pull/1028

    TS-4870 Avoid marking storage offline multiple times

    Currently storage can be marked offline multiple times which breaks related metrics.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gtenev/trafficserver TS-4870

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/trafficserver/pull/1028.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1028

commit b1389f36936bbfcee6ee645e9954eeae92d4e7ed
Author: Gancho Tenev
Date:   2016-09-15T13:44:44Z

    TS-4870 Avoid marking storage offline multiple times

    Currently storage can be marked offline multiple times which breaks related metrics.

Issue Time Tracking
-------------------

            Worklog Id:     (was: 29201)
            Time Spent: 10m
    Remaining Estimate: 0h

> Storage can be marked offline multiple times which breaks related metrics
> -------------------------------------------------------------------------
>
>                 Key: TS-4870
>                 URL: https://issues.apache.org/jira/browse/TS-4870
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cache, Metrics
>            Reporter: Gancho Tenev
>            Assignee: Gancho Tenev
>             Fix For: 7.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Let us say Traffic Server is running with 2 disks:
> {code}
> $ cat etc/trafficserver/storage.config
> /dev/sdb
> /dev/sdc
> $ sudo fdisk -l|grep 'Disk /dev/sd[b|c]'
> Disk /dev/sdb: 134 MB, 134217728 bytes
> Disk /dev/sdc: 134 MB, 134217728 bytes
> {code}
> Let us see what happens when we mark the same disk ({{/dev/sdb}}) offline 3 times in a row and check {{proxy.node.cache.bytes_total}}.
> {code}
> # Initial cache size (when using both disks).
> $ ./bin/traffic_ctl metric get proxy.node.cache.bytes_total
> proxy.node.cache.bytes_total 268025856
> # Take 1st disk offline. Cache size changes as expected.
> $ sudo ./bin/traffic_ctl storage offline /dev/sdb
> $ ./bin/traffic_ctl metric get proxy.node.cache.bytes_total
> proxy.node.cache.bytes_total 134012928
> # Take same disk offline again. Not good!
> $ sudo ./bin/traffic_ctl storage offline /dev/sdb
> $ ./bin/traffic_ctl metric get proxy.node.cache.bytes_total
> proxy.node.cache.bytes_total 0
> # Take same disk offline again. Negative value.
> $ sudo ./bin/traffic_ctl storage offline /dev/sdb
> $ ./bin/traffic_ctl metric get proxy.node.cache.bytes_total
> proxy.node.cache.bytes_total -134012928
> {code}
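
The arithmetic in the report, and the way a per-disk `offline` flag (as in the reviewed diff to `P_CacheDisk.h`) could prevent it, can be illustrated with a small standalone sketch. This is not the Traffic Server code: `Disk`, `Cache`, `add_disk`, `mark_offline`, `mark_offline_unguarded`, and `replay_report` are hypothetical names made up for illustration, and the per-disk size of 134012928 bytes is an assumption derived from the reported total (268025856 / 2).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a cache disk. Only the `offline` flag mirrors the
// field added in the reviewed diff; everything else is illustrative.
struct Disk {
  std::string path;
  std::int64_t bytes; // cache bytes this disk contributes to the total
  bool offline;       // set once the disk has been taken out of service
};

// Hypothetical cache with one counter standing in for
// proxy.node.cache.bytes_total.
struct Cache {
  std::vector<Disk> disks;
  std::int64_t bytes_total = 0;

  void add_disk(const std::string &path, std::int64_t bytes) {
    assert(bytes >= 0);
    disks.push_back(Disk{path, bytes, false});
    bytes_total += bytes;
  }

  // Unguarded variant: subtracts on every call, which reproduces the
  // behaviour described in the report.
  void mark_offline_unguarded(const std::string &path) {
    for (Disk &d : disks) {
      if (d.path == path) {
        d.offline = true;
        bytes_total -= d.bytes;
      }
    }
  }

  // Guarded variant: the flag makes the operation idempotent.
  // Returns true only when the disk actually changed state.
  bool mark_offline(const std::string &path) {
    for (Disk &d : disks) {
      if (d.path == path) {
        if (d.offline) {
          return false; // already offline, leave the counter alone
        }
        d.offline = true;
        bytes_total -= d.bytes;
        return true;
      }
    }
    return false; // unknown path
  }
};

// Replays the three "storage offline /dev/sdb" calls from the report and
// returns the final value of the counter.
std::int64_t replay_report(bool guarded) {
  Cache cache;
  cache.add_disk("/dev/sdb", 134012928); // assumed per-disk contribution
  cache.add_disk("/dev/sdc", 134012928);
  for (int i = 0; i < 3; ++i) {
    if (guarded) {
      cache.mark_offline("/dev/sdb");
    } else {
      cache.mark_offline_unguarded("/dev/sdb");
    }
  }
  return cache.bytes_total;
}
```

Under these assumptions, `replay_report(false)` ends at -134012928, the negative value shown in the report, while `replay_report(true)` stays at 134012928 because the second and third calls find the flag already set and do nothing.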