GitHub user davift created a discussion: [Watching Logs] How-To Stopped 
Drowning in Log Avalanche

I guess most of us are familiar with running:

`tail -f /var/log/cloudstack/management/management-server.log`

and being immediately blasted with an unbearable amount of log messages:

<img width="1116" height="690" alt="image00" 
src="https://github.com/user-attachments/assets/06462299-9c07-4173-9ec9-e9253f5105a9"
 />

Enabling debug logging is often essential for troubleshooting and identifying 
clues that lead to a solution. However, it also increases the volume of logs 
significantly, making it even harder to spot the information that actually 
matters.

Even worse, you may discover that a particular issue has been occurring for 
days, weeks, or even months, continuously flooding the logs without anyone 
noticing.

Wouldn't it be useful to visualize the occurrence of known (classified) events 
over time, correlate them with infrastructure events, and receive alerts when 
unknown patterns or abnormal spikes appear?

<img width="2309" height="1103" alt="image" 
src="https://github.com/user-attachments/assets/76baac0d-9221-4484-8b37-e9151e4462c8"
 />

To help with this, I built a tool that uses AI to classify log entries of **any 
kind**. I called it [LogWatcher](https://github.com/davift/LogWatcher).

### What does it have to do with CloudStack?

I trained LogWatcher on millions of CloudStack log lines, then spent time 
reviewing and correcting the classifications to improve accuracy, because AI is 
just a statistical guessing machine and its output needs checking. The 
resulting knowledge bases for **ACS Management** and **ACS KVM Agent** are 
available in the 
[ACS directory of the LogWatcher repository](https://github.com/davift/LogWatcher/tree/main/ACS).

### What does this mean?

Anyone can load the pre-trained knowledge bases and immediately start 
classifying CloudStack logs. The tool can run in **offline mode** using the 
existing knowledge base, or continue learning as it encounters new patterns.
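To make the offline idea concrete, here is a minimal Python sketch of classifying lines against a pre-built knowledge base without calling any AI. It is only an illustration under stated assumptions: the masking rules, the `to_template` and `classify` helpers, the dictionary layout, and the sample lines are all hypothetical, and none of them reflect LogWatcher's actual code or knowledge-base format.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace the variable parts of a line so that
# similar lines collapse into one template. Not LogWatcher's actual rules.
MASKS = [
    (re.compile(r"\b\d+(?:\.\d+){3}\b"), "<ip>"),  # dotted-quad addresses
    (re.compile(r"\b\d+\b"), "<num>"),             # any remaining integers
]

def to_template(line: str) -> str:
    """Mask variable fields and return the resulting template string."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line.strip()

def classify(line: str, knowledge_base: dict) -> str:
    """Offline lookup: known templates get their label, the rest 'unknown'."""
    return knowledge_base.get(to_template(line), "unknown")

# Made-up knowledge base and sample lines, for illustration only.
kb = {"worker <num> lost connection to <ip>": "connection_lost"}
sample = [
    "worker 7 lost connection to 192.168.1.20",
    "worker 12 lost connection to 10.0.0.5",
    "something never seen before",
]
counts = Counter(classify(line, kb) for line in sample)
```

In a sketch like this, the lines that end up as `unknown` are the ones a learning mode would hand over for classification and review.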

The generated metrics can be scraped by Prometheus and visualized in Grafana, 
making it easy to create dashboards and alerts. This provides visibility into 
trends, helps correlate issues with infrastructure events, and can reveal 
silent problems long before users report them.
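For readers who have not exposed metrics to Prometheus before, the sketch below shows roughly what scrapeable output looks like: per-category counters rendered in the Prometheus text exposition format. The metric name `log_events_total`, the `category` label, and the `render_metrics` helper are assumptions made for this example, and LogWatcher's real metric names and labels may differ.

```python
from collections import Counter

def render_metrics(counts, metric="log_events_total"):
    """Render per-category counts in the Prometheus text exposition format.

    The metric and label names are hypothetical, chosen for this example.
    """
    lines = [
        f"# HELP {metric} Classified log lines by category.",
        f"# TYPE {metric} counter",
    ]
    for category, n in sorted(counts.items()):
        # Escape backslash, double quote, and newline inside label values.
        escaped = (
            category.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        lines.append(f'{metric}{{category="{escaped}"}} {n}')
    return "\n".join(lines) + "\n"

# Made-up counts, for illustration only.
text = render_metrics(Counter({"connection_lost": 2, "unknown": 1}))
```

Serving this text over HTTP at a path such as `/metrics` is what allows Prometheus to scrape it. A Grafana panel or alert rule can then be built on the rate of a given category, for example a sudden rise in `unknown`.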

### Request for Help

I would love to collaborate with CloudStack operators to expand the knowledge 
base so that it covers a wider range of issues, especially ones I have not been 
able to reproduce myself and therefore could not train LogWatcher on.

For those curious about performance: with a pre-trained knowledge base (so no 
AI is invoked for classification), LogWatcher typically evaluates between 
10,000 and 20,000 log lines per second while running as a single-threaded 
application. At that rate it processes 10 million log lines in roughly 10 
minutes, which works out to about 16,700 lines per second.

I also run it in a centralized setup, where logs from multiple hosts are 
collected and analyzed through a single pane of glass.

If you are interested in contributing log samples, testing the knowledge base, 
or sharing feedback, I would be happy to collaborate.


GitHub link: https://github.com/apache/cloudstack/discussions/13374
