TL;DR:

The SRE Observability team asks that no new metrics be deployed to
Graphite, as the service will be switched to a read-only state, disabling
new metric ingestion. The planned read-only date for Graphite has been
extended to April 30th, 2025. Details are available in T228380 [1].

Please disable or migrate all existing Graphite metrics to Prometheus [2]
and retire the corresponding panels and dashboards as applicable before the
noted date.

Technical Sunsetting of Graphite for Prometheus Reminder

The SRE Observability team has been operating Prometheus [2] in production
for several years, offering several operational benefits over Graphite.
After a long period of observation and usage, the team has determined that
migrating MW off Graphite ensures we stay ahead with a supported, scalable
metrics platform for more effective dimensional metrics analysis and
storage.

Notice and Action Required

The SRE Observability team is extending the Graphite read-only date to
April 30th, 2025, and beginning the formal deprecation of Graphite in
production [1].

We ask all teams and maintainers to check this dashboard [3] and related
task T350592 [4] and claim metrics and dashboards in associated tasks or
components under their care. First, disable/remove any unused metrics and
dashboards, then follow the process outlined in the task to migrate all
“in-use” metrics before April 30th. After this date, Graphite will be
read-only, and no new data will be ingested.
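For maintainers who keep exported dashboard JSON locally, the check described above can be approximated with a small script. This is only a sketch, not the team's tooling: the dashboard linked at [3] remains the authoritative list. The sample dashboard, its panel titles, and its metric names are made up for illustration, and the script assumes the common Grafana layout where a panel's "datasource" is either a plain name or an object with a "type" field.

```python
# Hypothetical exported dashboard; titles and metric names are invented.
SAMPLE_DASHBOARD = {
    "title": "Example dashboard",
    "panels": [
        {"title": "Save timing (old)",
         "datasource": {"type": "graphite", "uid": "abc"},
         "targets": [{"target": "MediaWiki.example.timing.p99"}]},
        {"title": "Save timing (new)",
         "datasource": {"type": "prometheus", "uid": "def"},
         "targets": [{"expr": "example_seconds_count"}]},
        {"title": "Row", "type": "row", "panels": [
            {"title": "Nested counter", "datasource": "graphite",
             "targets": [{"target": "MediaWiki.example.count"}]},
        ]},
    ],
}


def is_graphite(datasource) -> bool:
    """Guess whether a datasource reference points at Graphite."""
    if isinstance(datasource, str):
        # Assumption: a plain datasource name contains the word "graphite".
        return "graphite" in datasource.lower()
    if isinstance(datasource, dict):
        return datasource.get("type") == "graphite"
    return False


def find_graphite_panels(dashboard: dict) -> list[str]:
    """Return titles of panels that still query a Graphite datasource."""
    found: list[str] = []

    def walk(panels: list) -> None:
        for panel in panels:
            targets = panel.get("targets", [])
            if is_graphite(panel.get("datasource")) or any(
                is_graphite(target.get("datasource")) for target in targets
            ):
                found.append(panel.get("title", "<untitled>"))
            # Row panels may nest further panels.
            walk(panel.get("panels", []))

    walk(dashboard.get("panels", []))
    return found


if __name__ == "__main__":
    for title in find_graphite_panels(SAMPLE_DASHBOARD):
        print(title)
```

Panels found this way are candidates to remove if unused, or to migrate by following the process in the task.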

Graphite will continue to be available for another year to provide
historical data in read-only “mode” while new history is recorded in
Prometheus. Please see the tracking task T228380 [1] or roadmap [5] for
additional details.

Why We’re Migrating from Graphite to Prometheus

We have been using Prometheus in production for several years because it
offers several benefits over Graphite
<https://prometheus.io/docs/introduction/comparison/>. Migrating MW off
Graphite ensures we stay ahead with a supported, scalable metrics platform
for more effective, multidimensional metrics analysis and storage.


Prometheus provides more robust data labeling, storage, and query
capabilities. This initiative is fundamental in unifying our metrics,
enhancing monitoring, improving MW observability, and reducing tool
fragmentation. We’re moving from Graphite to Prometheus because of critical
limitations in our current setup.
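To make "dimensional" concrete: a Graphite metric encodes its dimensions as positions in a dotted path, while a Prometheus metric carries them as named labels. The sketch below illustrates the difference only. The path, the label names, and the metric name are all hypothetical, it is not the migration process from the task, and it does not escape special characters in label values.

```python
def split_graphite_path(path: str, label_names: list[str]) -> dict[str, str]:
    """Map each dot-separated path segment to a caller-supplied label name."""
    parts = path.split(".")
    if len(parts) != len(label_names):
        raise ValueError(
            f"expected {len(label_names)} segments, got {len(parts)}: {path!r}"
        )
    return dict(zip(label_names, parts))


def to_prometheus_series(metric_name: str, labels: dict[str, str]) -> str:
    """Render a metric name and labels in Prometheus text-exposition style."""
    rendered = ",".join(
        f'{key}="{value}"' for key, value in sorted(labels.items())
    )
    return f"{metric_name}{{{rendered}}}"


if __name__ == "__main__":
    # Hypothetical Graphite path: <app>.<site>.<method>.<status>
    labels = split_graphite_path(
        "example_app.dc1.GET.200", ["app", "site", "method", "status"]
    )
    labels.pop("app")  # folded into the metric name in this illustration
    print(to_prometheus_series("example_app_requests_total", labels))
```

With labels, a query can filter or aggregate by "site" or "status" directly instead of relying on wildcards at fixed positions in the path.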

Here’s what else you need to know:


   - Graphite is Dropping Data: Our existing Graphite hosts have recently
     been saturated by too much metrics traffic (UDP). The 1G network
     interfaces on these hosts are overloaded, causing packets (and,
     therefore, data) to be lost. We don’t know precisely how much data is
     dropped, but it’s enough to be noticeable. Instead of investing in
     fixing this old system, we’re focusing on migrating to Prometheus,
     which is more reliable. The hardware that powers the current system
     will also reach its end-of-life and be retired by the end of Q4 of
     FY 2025/2026 (June 2026).
   - Prometheus Works Differently: Different internal methods for data
     processing, sampling, and calculation between Graphite and Prometheus
     mean that numbers on both sides won't necessarily align or match 100%;
     this is expected. More information available
   - More Accurate Metrics: Prometheus handles timing metrics and counters
     differently from Graphite. You may see higher counts for certain
     metrics, such as timing metrics.
   - Compare Patterns, Not Values: If you’re comparing numbers between
     Graphite and Prometheus, focus on the pattern and trends rather than
     exact values. Differences in how the two systems process data mean that
     exact numbers won’t always match. However, the overall trend should be
     the same.
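One way to compare patterns rather than values is to check how strongly the two series move together, for example with a Pearson correlation coefficient, instead of testing the numbers for equality. The sketch below assumes you have already exported the same metric from both systems over the same time window. The sample values are invented, and the choice of correlation as the comparison is a suggestion, not part of the migration process.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equally long series (1.0 = same shape)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented samples: the Prometheus counts run roughly 10% higher,
    # but both series rise and fall at the same points.
    graphite_counts = [100.0, 120.0, 180.0, 150.0, 90.0]
    prometheus_counts = [110.0, 133.0, 197.0, 166.0, 99.0]

    print("values identical:", graphite_counts == prometheus_counts)
    print("correlation:", round(pearson(graphite_counts, prometheus_counts), 4))
```

A correlation close to 1 with a steady offset is consistent with the expected differences described above. A low correlation suggests the two queries are not measuring the same thing and is worth a closer look.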


Frequently Asked Questions

   - Will you be migrating historical data from Graphite to Prometheus?

We do not plan to migrate data from Graphite to Prometheus. While it is
technically feasible, we don't have enough documented requests. Instead, we
will run both systems in parallel for a while (1 year) to allow new
historical data to accumulate in Prometheus before Graphite becomes
read-only.


   - What if I need data longer than 1 year?

We can also provide Graphite files for projects interested in longer
retention and work on possible backfilling alternatives for specific
cases. Details are in T349521 <https://phabricator.wikimedia.org/T349521>.


   - What if I need access to Graphite for longer?

We can provide a subset of the data in a VM running a read-only Graphite
service for a discretionary period beyond the year after the hardware is
sunset. Support will be limited, since we are sunsetting the technology.
This workaround will be offered for as long as the current Graphite and OS
deployments are supported.

Related Links:

[1] Tech debt: sunsetting of Graphite
https://phabricator.wikimedia.org/T228380

[2] Wikitech:Prometheus https://wikitech.wikimedia.org/wiki/Prometheus

[3] List of dashboards w/Graphite queries
https://grafana.wikimedia.org/d/K6DEOo5Ik/grafana-graphite-datasource-utilization?orgId=1

[4] EPIC: Migrate in-use metrics and dashboards to statslib
https://phabricator.wikimedia.org/T350592

[5] Graphite Deprecation Roadmap
https://wikitech.wikimedia.org/wiki/Graphite/Deprecation_Roadmap

Thank you for reading! Be safe and happy.

Best,

Leo
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
