Monitoring Kubernetes (K8s) with Prometheus involves a series of steps, from metric generation to alert notifications.
Kubernetes Native Monitoring: Detailed Technical Workflow
- Metric Generation:
- Kubernetes components like Kubelet and cAdvisor are responsible for collecting metrics. Kubelet collects metrics about pod and node performance. cAdvisor, integrated into Kubelet, provides container-specific metrics like CPU, memory usage, and network I/O.
- Metrics Server:
- This is a scalable, efficient short-term storage for cluster metrics.
- It collects metrics from the Kubelet’s /metrics/resource endpoint.
- Metrics Server stores these metrics in memory and does not write them to disk, making it lightweight but not suitable for long-term metric storage.
- API Exposure:
- Metrics are exposed via the Kubernetes Metrics API.
- This API is often used by Horizontal Pod Autoscalers and Kubernetes Dashboard.
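To illustrate how this API is consumed in practice, here is a minimal Horizontal Pod Autoscaler sketch that scales on CPU utilization reported through the Metrics API; the target Deployment name and the thresholds are illustrative assumptions, not recommendations.

# Minimal HPA sketch: scales a hypothetical "web" Deployment on CPU utilization.
# The autoscaler reads pod CPU usage from the Metrics API (metrics.k8s.io),
# which is backed by Metrics Server.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web          # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # illustrative threshold

The HPA controller queries the Metrics API on its own; no additional configuration on the Metrics Server side is needed for resource metrics like CPU.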
Prometheus Integration: In-depth Process
Prometheus Exporters
Exporters are a set of tools and libraries that allow you to collect and expose metrics from various systems and services in a Prometheus-compatible format.
These exporters act as intermediaries between Prometheus and the systems you want to monitor, providing valuable data for analysis and visualization.
By using Prometheus Exporters, you can easily monitor and gain insights into the performance, health, and status of your applications, databases, network devices, and more.
- Each exporter exposes a /metrics HTTP endpoint, where metrics are presented in a format understandable by Prometheus.
- Example: node-exporter exposes node-level metrics, and kube-state-metrics exposes Kubernetes object metrics (like deployment and node status).
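As a rough sketch of how such an exporter typically runs in a cluster, the following DaemonSet places one node-exporter pod on every node and exposes its /metrics endpoint on port 9100; the image tag and the monitoring namespace are assumptions, not recommendations.

# Sketch only: run node-exporter on every node so each node's /metrics endpoint
# (port 9100) can be scraped by Prometheus. Image tag and namespace are assumed.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true   # report the node's own network and address space
      hostPID: true
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.7.0   # assumed version
          ports:
            - containerPort: 9100
              name: metrics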
Prometheus Server Scraping
- Prometheus uses a pull model for metric collection.
- It’s configured with a list of HTTP endpoint URLs of the exporters.
- Prometheus periodically sends HTTP GET requests to these endpoints to fetch new metrics.
- The scrape interval is configurable and crucial for balancing between data timeliness and system load.
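To make the pull model concrete, here is a minimal prometheus.yml sketch with two static scrape jobs; the 15-second intervals and the target addresses are placeholders rather than recommendations.

# Minimal scrape configuration sketch. Intervals and target addresses are
# placeholders; in a real cluster, targets are usually discovered dynamically.
global:
  scrape_interval: 15s      # how often Prometheus pulls /metrics from targets
  evaluation_interval: 15s  # how often alerting/recording rules are evaluated

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter.monitoring.svc:9100"]

  - job_name: "kube-state-metrics"
    static_configs:
      - targets: ["kube-state-metrics.kube-system.svc:8080"]

In a Kubernetes cluster the static target lists are usually replaced by service discovery, as discussed under Scalability Considerations below.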
Data Storage:
- Prometheus stores time-series data on local disk in a custom, highly efficient format.
- Each time-series is identified by a metric name and key-value pairs (labels), for example node_cpu_seconds_total{instance="10.0.0.1:9100", mode="idle"}.
- It employs a compression algorithm to optimize storage space and query efficiency.
Alerting and Notifications: Mechanism and Protocols
- AlertManager Configuration:
- Alert rules are defined in Prometheus configuration files using PromQL.
- When an alert condition is met, Prometheus sends that alert to the AlertManager.
- AlertManager:
- It handles alerts, including grouping, deduplication, and routing.
- It has a configuration file that specifies how to group alerts and where to send them.
- Notification Services Integration:
- Slack: AlertManager sends a webhook to Slack. The webhook body is a JSON payload that contains the alert’s details.
- PagerDuty: AlertManager uses the PagerDuty v2 Events API. It sends a detailed event, which includes event type, dedup key, payload with severity, summary, source, and custom details.
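A minimal alertmanager.yml sketch of this grouping and routing, assuming one Slack receiver for everything and a PagerDuty receiver for critical alerts; the webhook URL, routing key, and timing values are placeholders.

# Alertmanager routing sketch. The Slack webhook URL, PagerDuty routing key,
# and timing values below are placeholders, not recommendations.
route:
  group_by: ["alertname", "instance"]   # bundle related alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "slack-default"
  routes:
    - matchers:
        - severity = "critical"
      receiver: "pagerduty-oncall"

receivers:
  - name: "slack-default"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder webhook
        channel: "#alerts"
        send_resolved: true   # also notify when the alert stops firing

  - name: "pagerduty-oncall"
    pagerduty_configs:
      - routing_key: "<pagerduty-events-v2-routing-key>"   # placeholder key
        severity: "critical"

The send_resolved flag corresponds to the Resolved state described later in the alert life-cycle: a resolution notification is sent once the alert conditions are no longer met.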
Scalability Considerations
- Stateless Design:
- Prometheus instances do not share state with one another, meaning you can scale horizontally by adding more Prometheus instances, each scraping its own set of targets.
- Each instance is independent, simplifying the scaling process.
- Service Discovery:
- Prometheus supports dynamic service discovery. As your cluster grows, Prometheus automatically starts scraping metrics from new instances based on predefined rules (see the configuration sketch after this list).
- Federation:
- For very large setups, Prometheus supports federation, allowing a Prometheus server to scrape selected data from another Prometheus server.
- Sharding and Partitioning:
- Data can be sharded across multiple Prometheus instances based on labels, distributing the load.
- High Availability:
- Running multiple replicas of Prometheus and using AlertManager’s high availability setup ensures no single point of failure in the monitoring pipeline.
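As a sketch of the service discovery and federation points above, the following scrape jobs show Kubernetes pod discovery and a /federate job that pulls a subset of series from another Prometheus server; the annotation convention, the match[] selector, and the addresses are illustrative assumptions.

# Sketch only: dynamic discovery of pods plus a federation job.
# The annotation convention, match[] selector, and addresses are assumptions.
scrape_configs:
  # Discover pods dynamically; keep only those annotated for scraping.
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"

  # Pull a subset of series from another Prometheus server (federation).
  - job_name: "federate"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="node"}'
    static_configs:
      - targets: ["prometheus-shard-1.monitoring.svc:9090"]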
PromQL Examples
PromQL (Prometheus Query Language) is a powerful tool for writing queries and setting up alerts in Prometheus. Below are examples of some common alerting rules written in PromQL. These rules are typically defined in Prometheus’ configuration files and are used to trigger alerts based on specific conditions:
1. High CPU Usage
- alert: HighCpuUsage
  expr: (100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "High CPU usage detected on {{ $labels.instance }}"
    description: "CPU usage is above 80% for more than 10 minutes."
This alert triggers if the CPU usage is above 80% for more than 10 minutes.
2. Memory Usage Alert
- alert: HighMemoryUsage
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
    description: "Memory usage is above 80% for more than 5 minutes."
This alert fires when memory usage exceeds 80% for over 5 minutes.
3. Disk Space Alert
- alert: DiskSpaceRunningLow
  expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes) * 100 < 20
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Low disk space on {{ $labels.device }} at {{ $labels.instance }}"
    description: "Less than 20% disk space available for 15 minutes."
This alert fires if the disk space available is less than 20% for 15 minutes.
4. Node Down Alert
- alert: NodeDown
  expr: up{job="node"} == 0
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.instance }} is down"
    description: "Node has been down for more than 3 minutes."
This triggers an alert if a node has been down for more than 3 minutes.
5. High Network Traffic
- alert: HighNetworkTraffic
  expr: sum by(instance) (irate(node_network_receive_bytes_total[5m])) > 10000000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High network traffic on {{ $labels.instance }}"
    description: "Network traffic is above 10MB/s for more than 10 minutes."
This alert indicates high network traffic, triggering if it exceeds 10MB/s for more than 10 minutes.
6. Pod CrashLoopBackOff
- alert: PodCrashLooping
  expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Pod {{ $labels.pod }} is in CrashLoopBackOff state"
    description: "Pod has been in CrashLoopBackOff state for more than 5 minutes."
This alert fires when a pod has been in a CrashLoopBackOff state for more than 5 minutes.
Important Notes
- These rules should be tailored according to the specific needs and thresholds relevant to your environment.
- The for clause in each rule specifies how long the condition must be true before the alert is fired.
- The labels and annotations are used to add additional information to the alert, which can be very helpful for the receiving end to understand the context of the alert.
How Alerts Are Triggered in Prometheus
1. Data Scraping and Evaluation
- Data Collection: Prometheus scrapes metrics from configured targets at regular intervals. These intervals are defined in the Prometheus configuration and are typically in the order of seconds or minutes.
- Data Storage: The scraped metrics are stored in Prometheus’ time-series database.
2. Rule Evaluation
- Regular Evaluation: Alert rules defined in PromQL are evaluated by the Prometheus server at a regular interval, which is usually the same as the scrape interval but can be configured differently.
- Stateful Tracking: When Prometheus evaluates an alert rule, it doesn’t just look at the current value of the queried metrics. It also considers the historical data over the period specified in the alert rule (using the for clause).
- Condition Checking: During each evaluation, Prometheus checks if the condition defined in the alert rule is true. For instance, if an alert rule is set to trigger when CPU usage is over 80% for more than 10 minutes, Prometheus will check, at each evaluation, whether this condition has been continuously true for the past 10 minutes.
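A sketch of how rule evaluation is wired up, assuming a rule file named alert_rules.yml: rules are grouped, and each group can override the global evaluation interval with its own interval setting. The file name and the 30-second interval are illustrative.

# In prometheus.yml: point Prometheus at one or more rule files (path assumed).
rule_files:
  - "alert_rules.yml"

# In alert_rules.yml: rules live in groups; each group may override the
# global evaluation interval with its own `interval`.
groups:
  - name: node-alerts
    interval: 30s            # evaluate this group's rules every 30 seconds
    rules:
      - alert: HighCpuUsage
        expr: (100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"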
3. Alert State Transition
- Pending State: If the condition for an alert becomes true, the alert first moves into a “Pending” state. This state indicates that the condition has been met but not for long enough to trigger the alert (according to the for duration).
- Firing State: If the condition continues to hold true for the duration specified by the for clause, the alert then transitions to the “Firing” state. This is the state where actions like notifications are triggered.
4. AlertManager Integration
- Receiving Alerts: Once an alert reaches the “Firing” state, Prometheus sends this information to AlertManager.
- Alert Processing: AlertManager then handles the alert according to its configuration—grouping, deduplicating, and routing the alert to the appropriate receiver (like email, Slack, PagerDuty, etc.).
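For completeness, this is roughly how Prometheus is pointed at AlertManager so that firing alerts have somewhere to go; the service address is a placeholder.

# In prometheus.yml: where firing alerts are sent. The address is a placeholder.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.monitoring.svc:9093"]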
5. Real-Time Aspect
- While Prometheus operates in near real-time, there is inherently a small delay due to the scrape interval and the evaluation interval. For instance, if Prometheus scrapes metrics every 15 seconds, and an alert’s condition is met exactly after a scrape, the alert will only be detected during the next scrape or rule evaluation.
- The for clause adds additional delay (intentionally) to avoid flapping alerts (alerts that quickly switch between firing and not firing states).
Alert Life-cycle
- Inactive:
- Conditions not met.
- No action taken.
- Pending:
- Conditions met but not for the duration specified in the for clause.
- No notifications sent yet.
- Firing:
- Conditions met continuously for the duration specified.
- Alert sent to AlertManager and notifications are triggered.
- Resolved:
- Conditions for the alert are no longer met.
- Alert is automatically marked as resolved and resolution notifications may be sent.
Summary
In summary, Prometheus uses PromQL to define alert conditions and evaluates these conditions at regular intervals against its time-series database. Alerts transition through states (“Pending” to “Firing”) based on the duration of the condition being met.
The process, while quite efficient, is not instantaneous, with minor delays due to scrape and evaluation intervals. This system ensures that alerts are based on consistent and sustained metric conditions, avoiding false positives due to momentary spikes or anomalies.