TL;DR: The SRE Observability team asks that no new metrics be deployed to Graphite, as the service will be switched to a read-only state that disables new metric ingestion. The planned date for Graphite read-only has been extended to April 30th, 2025. Details are available in T228380 [1].
Please disable or migrate all existing Graphite metrics to Prometheus [2], and retire the corresponding panels and dashboards as applicable, before the noted date.

Technical Sunsetting of Graphite for Prometheus Reminder

The SRE Observability team has been operating Prometheus [2] in production for several years, and it offers several operational benefits over Graphite. After a long period of observation and usage, the team has determined that migrating MW off Graphite ensures we stay ahead with a supported, scalable metrics platform for more effective dimensional metrics analysis and storage.

Notice and Action Required

The SRE Observability team is extending the Graphite read-only date to April 30th, 2025, and beginning the formal deprecation of Graphite in production [1]. We ask all teams and maintainers to check this dashboard [3] and the related task T350592 [4], and to claim the metrics and dashboards under their care in the associated tasks or components.

First, disable or remove any unused metrics and dashboards; then follow the process outlined in the task to migrate all in-use metrics before April 30th. After this date, Graphite will be read-only, and no new data will be ingested. Graphite will continue to be available for another year to provide historical data in read-only mode while new history is recorded in Prometheus. Please see the tracking task T228380 [1] or the roadmap [5] for additional details.

Why We’re Migrating from Graphite to Prometheus

We have been using Prometheus in production for several years, as it offers several benefits over Graphite <https://prometheus.io/docs/introduction/comparison/>. Migrating MW off Graphite ensures we stay ahead with a supported, scalable metrics platform for more effective, multidimensional metrics analysis and storage. Prometheus provides more robust data labeling, storage, and query capabilities. This initiative is fundamental to unifying our metrics, enhancing monitoring, improving MW observability, and reducing tool fragmentation.
We’re moving from Graphite to Prometheus because of critical limitations in our current setup. Here’s what you need to know:

- Graphite is Dropping Data: Our existing Graphite hosts have recently been saturated by metrics traffic (UDP). The 1G network interfaces on these hosts are overloaded, causing packets (and, therefore, data) to be lost. We don’t know precisely how much data is dropped, but it’s enough to be noticeable. Instead of investing in fixing this old system, we’re focusing on migrating to Prometheus, which is more reliable. The hardware that powers the current system will also reach its end-of-life and be retired by the end of Q4 of FY 2025/2026 (June 2026).

- Prometheus Works Differently: Graphite and Prometheus use different internal methods for data processing, sampling, and calculation, so numbers on the two sides won't necessarily align or match 100%; this is expected. More information available

- More Accurate Metrics: Prometheus handles timing metrics and counters differently from Graphite. You may see higher counts for certain metrics, such as timing metrics.

- Compare Patterns, Not Values: If you’re comparing numbers between Graphite and Prometheus, focus on patterns and trends rather than exact values. Differences in how the two systems process data mean that exact numbers won’t always match. However, the overall trend should be the same.

Frequently Asked Questions

- Will you be migrating historical data from Graphite to Prometheus?
  We do not plan to migrate data from Graphite to Prometheus. While it is technically feasible, we don't have enough documented requests to justify it. Instead, we will run both systems in parallel for a while (1 year) to allow new historical data to cross over before read-only.

- What if I need data for longer than 1 year?
  We can also provide Graphite files for projects interested in longer retention, and work on possible backfilling alternatives for specific cases.
  Details are in T349521 <https://phabricator.wikimedia.org/T349521>.

- What if I need access to Graphite for longer?
  We can provide a subset of the data in a VM, with a read-only Graphite service available for a discretionary period longer than the year after the hardware is sunset. Support will be limited, as we are sunsetting the technology. This workaround will be offered for as long as the current Graphite and OS deployments are supported.

Related Links:
[1] Tech debt: sunsetting of Graphite https://phabricator.wikimedia.org/T228380
[2] Wikitech:Prometheus https://wikitech.wikimedia.org/wiki/Prometheus
[3] List of dashboards w/Graphite queries https://grafana.wikimedia.org/d/K6DEOo5Ik/grafana-graphite-datasource-utilization?orgId=1
[4] EPIC: Migrate in-use metrics and dashboards to statslib https://phabricator.wikimedia.org/T350592
[5] Graphite Deprecation Roadmap https://wikitech.wikimedia.org/wiki/Graphite/Deprecation_Roadmap

Thank you for reading! Be safe and happy.

Best,
Leo
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/