GitHub user davift created a discussion: [Watching Logs] How-To Stopped Drowning in Log Avalanche
I guess most of us are familiar with running:

`tail -f /var/log/cloudstack/management/management-server.log`

and being immediately blasted with an unbearable amount of log messages:

<img width="1116" height="690" alt="image00" src="https://github.com/user-attachments/assets/06462299-9c07-4173-9ec9-e9253f5105a9" />

Enabling debug logging is often essential for troubleshooting and identifying clues that lead to a solution. However, it also increases the volume of logs significantly, making it even harder to spot the information that actually matters.

Even worse, you may discover that a particular issue has been occurring for days, weeks, or even months, continuously flooding the logs without anyone noticing.

Wouldn't it be useful to visualize the occurrence of known (classified) events over time, correlate them with infrastructure events, and receive alerts when unknown patterns or abnormal spikes appear?

<img width="2309" height="1103" alt="image" src="https://github.com/user-attachments/assets/76baac0d-9221-4484-8b37-e9151e4462c8" />

To help with this, I built a tool that uses AI to classify log entries of **any kind**. I called it [LogWatcher](https://github.com/davift/LogWatcher).

### What does it have to do with CloudStack?

I trained LogWatcher with millions of CloudStack log lines and spent time reviewing and correcting the classifications to improve accuracy, because AI is just a statistical guessing machine.

The resulting knowledge bases for **ACS Management** and **ACS KVM Agent** are available [here](https://github.com/davift/LogWatcher/tree/main/ACS).

### What does this mean?

Anyone can load the pre-trained knowledge bases and immediately start classifying CloudStack logs.

The tool can run in **offline mode** using the existing knowledge base, or continue learning as it encounters new patterns.

The generated metrics can be scraped by Prometheus and visualized in Grafana, making it easy to create dashboards and alerts.
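
To make the classify-then-export idea above more concrete, here is a minimal Python sketch. It is not LogWatcher's code: the pattern table, the sample log lines, the `log_events_total` metric name, and the `event_class` label are all invented for illustration, and plain regular expressions stand in for a pre-trained knowledge base. It only shows how per-class counts could be rendered in the Prometheus text exposition format.

```python
import re
from collections import Counter

# Hypothetical stand-in for a knowledge base: class label -> pattern.
# These labels and patterns are invented, not taken from LogWatcher.
KNOWLEDGE_BASE = {
    "connection_refused": re.compile(r"connection refused", re.IGNORECASE),
    "timeout": re.compile(r"timed out", re.IGNORECASE),
}

# Made-up log lines used only to exercise the sketch.
SAMPLE_LINES = [
    "2024-01-01 00:00:01 ERROR connection refused by host-a",
    "2024-01-01 00:00:02 WARN request timed out after 30s",
    "2024-01-01 00:00:03 INFO something never seen before",
    "2024-01-01 00:00:04 ERROR connection refused by host-b",
]


def classify(line):
    """Return the first matching class label, or 'unknown' if none match."""
    for label, pattern in KNOWLEDGE_BASE.items():
        if pattern.search(line):
            return label
    return "unknown"


def count_classes(lines):
    """Count how many lines fall into each class."""
    return Counter(classify(line) for line in lines)


def render_prometheus(counts, metric="log_events_total"):
    """Render the counts in the Prometheus text exposition format."""
    out = [
        f"# HELP {metric} Classified log lines by class.",
        f"# TYPE {metric} counter",
    ]
    for label in sorted(counts):
        out.append(f'{metric}{{event_class="{label}"}} {counts[label]}')
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(render_prometheus(count_classes(SAMPLE_LINES)), end="")
```

In a sketch like this, the `unknown` class is the one worth alerting on, since a spike there would indicate patterns that the knowledge base has not seen yet.
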
These metrics provide visibility into trends, help correlate issues with infrastructure events, and can reveal silent problems long before users report them.

### Request for Help

I would love to collaborate with CloudStack operators to expand the knowledge base and cover a wider range of issues that I haven't been able to reproduce and train LogWatcher on.

For those curious about performance: running as a single-threaded application with a pre-trained knowledge base (no AI invoked for classification), LogWatcher typically evaluates between 10,000 and 20,000 log lines per second, so it can process 10 million log lines in roughly 10 minutes.

I also run it in a centralized setup, where logs from multiple hosts are collected and analyzed through a single pane of glass.

If you are interested in contributing log samples, testing the knowledge base, or sharing feedback, I would be happy to collaborate.

GitHub link: https://github.com/apache/cloudstack/discussions/13374

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
