Configure Alerts and Availability for Kubernetes

Introduction

The alert definition feature lets you set alerts on a metric using a PromQL query. Alert definitions are currently stored in the K8s ConfigMap named opsramp-alert-user-config, which you can find in the namespace where the OpsRamp agent is installed.

The Kubernetes 2.0 agent computes the alert metric value from the PromQL expression and determines the alert state by comparing that value against the thresholds in the alert definition.

The OpsRamp Agent raises the alert on the K8s resource identified by the labels in the metric streams that the PromQL query returns. If no resource matches those labels, the alert is raised on the cluster.

Alert Definition Template

Sample template to define a single alert:

  - resourceType: k8s_resource_type
    rules:
      - name: alert_definition_name
        interval: alert_polling_time
        expr: promql_expression
        isAvailability: true
        warnOperator: operator_macro
        warnThreshold: str_threshold_value
        criticalOperator: operator_macro
        criticalThreshold: str_threshold_value
        alertSub: alert_subject
        alertBody: alert_description

Explanation of template fields:

resourceType: The type of K8s resource the rules apply to (e.g., k8s_pod). The possible values for resourceType are:
- k8s_pod
- k8s_node
- k8s_namespace
- k8s_service
- k8s_pv
- k8s_pvc
- k8s_deployment
- k8s_replicaset
- k8s_daemonset
- k8s_statefulset
rules: A set of rules for the alert definition.
name: A unique name for the alert.
interval: Polling interval at which the alert definition runs. The interval must be given in time duration format (e.g., 1m, 5m, 1h).
expr: A valid PromQL query expression for calculating the metric.
isAvailability: Boolean indicating whether this rule is also used to compute the availability of the resource.
warnOperator & criticalOperator: Operators used to compare the metric value with the thresholds and compute the alert state. OpsRamp supports the following comparison operators:
- GREATER_THAN_EQUAL
- GREATER_THAN
- EQUAL
- NOT_EQUAL
- LESS_THAN_EQUAL
- LESS_THAN
warnThreshold & criticalThreshold: Values for warning and critical thresholds.
alertSub & alertBody: Content displayed for alerts, which can use macros for dynamic values. The following macros can be used in the alert subject and body:
- ${severity}
- ${metric.name}
- ${component.name}
- ${metric.value}
- ${threshold}
- ${resource.name}
- ${resource.uniqueid}
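
To make the field descriptions above concrete, here is a minimal sketch of a single filled-in definition. It reuses the k8s_node_condition_ready metric from the default node rule; the rule name k8s_node_ready_check, the 5m interval, and the subject and body wording are illustrative assumptions, not OpsRamp defaults.

```yaml
# Sketch only: goes under the alertDefinitions list in alert-definitions.yaml.
- resourceType: k8s_node
  rules:
    - name: k8s_node_ready_check          # hypothetical rule name
      interval: 5m                        # run the query every five minutes
      # "== bool 1" yields 1 when the node reports Ready, otherwise 0
      expr: (k8s_node_condition_ready == bool 1)
      # Only a critical level is defined; the warn fields are left out,
      # as in the default k8s_pod_phase rule.
      criticalOperator: EQUAL
      criticalThreshold: '0'              # thresholds are quoted strings
      alertSub: '${severity}. ${metric.name} on ${resource.name} is ${metric.value}.'
      alertBody: '${severity}. ${metric.name} on ${resource.name} (${resource.uniqueid}) is ${metric.value}; threshold is ${threshold}.'
```

With this rule, a node whose ready condition is not 1 produces a metric value of 0, which equals the critical threshold and should therefore put the alert in the critical state.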

Configure Alert

OpsRamp provides basic alert definitions for resources like pods and nodes by default. Users can configure the alert definitions by using the K8s ConfigMap below in the namespace where the agent is installed.

K8s ConfigMap Name: opsramp-alert-user-config

Access the ConfigMap:
To view the ConfigMap, use the following command:

kubectl get configmap opsramp-alert-user-config -n <agent-installed-namespace> -o yaml

To edit the ConfigMap, use the following command:

kubectl edit configmap opsramp-alert-user-config -n <agent-installed-namespace>

Example ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: "opsramp-alert-user-config"
  namespace: {{ include "common.names.namespace" . | quote }}
  labels:
    app: "opsramp-alert-user-config"
data:
  alert-definitions.yaml: |
    alertDefinitions:
      - resourceType: k8s_pod
        rules:
          - name: k8s_pod_phase
            interval: 2m
            expr: (k8s_pod_phase == bool 2) OR (k8s_pod_phase == bool 3)
            isAvailability: true
            criticalOperator: EQUAL
            criticalThreshold: '0'
            alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
            alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'
          - name: k8s_pod_cpu_usage_percent
            interval: 2m
            expr: ((k8s_pod_cpu_limit_utilization_ratio*100 ) OR (k8s_pod_cpu_limit_utilization))
            isAvailability: true
            warnOperator: GREATER_THAN_EQUAL
            warnThreshold: '90'
            criticalOperator: GREATER_THAN_EQUAL
            criticalThreshold: '95'
            alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
            alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'
          - name: k8s_pod_memory_usage_percent
            interval: 2m
            expr: ((k8s_pod_memory_limit_utilization_ratio*100 ) OR (k8s_pod_memory_limit_utilization))
            isAvailability: true
            warnOperator: GREATER_THAN_EQUAL
            warnThreshold: '90'
            criticalOperator: GREATER_THAN_EQUAL
            criticalThreshold: '95'
            alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
            alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'

          - name: k8s_pod_memory_usage_mb
            interval: 2m
            expr: (k8s_pod_memory_usage_bytes/1000000)
            warnOperator: GREATER_THAN_EQUAL
            warnThreshold: '200'
            criticalOperator: GREATER_THAN_EQUAL
            criticalThreshold: '300'
            alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
            alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'

      - resourceType: k8s_node
        rules:
          - name: k8s_node_condition
            interval: 2m
            expr: ((k8s_node_condition_ready == bool 1) * (k8s_node_condition_disk_pressure == bool 0) * (k8s_node_condition_memory_pressure == bool 0) * (k8s_node_condition_network_unavailable == bool 0) * (k8s_node_condition_pid_pressure == bool 0))
            isAvailability: true
            criticalOperator: EQUAL
            criticalThreshold: '0'
            alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
            alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'

      - resourceType: k8s_namespace
        rules:
          - name: k8s_namespace_memory
            interval: 2m
            expr: >-
              (sum by (k8s_cluster_name, k8s_namespace_name)
              ((k8s_pod_memory_usage_bytes/1000000)))
            isAvailability: true
            warnOperator: GREATER_THAN_EQUAL
            warnThreshold: '800'
            criticalOperator: GREATER_THAN_EQUAL
            criticalThreshold: '1000'
            alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
            alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'

      - resourceType: k8s_deployment
        rules:
          - name: k8s_deployment_status
            interval: 2m
            expr: (k8s_deployment_available/k8s_deployment_desired)
            isAvailability: true
            warnOperator: LESS_THAN_EQUAL
            warnThreshold: '0.9'
            criticalOperator: LESS_THAN_EQUAL
            criticalThreshold: '0.8'
            alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
            alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'

      - resourceType: k8s_replicaset
        rules:
          - name: k8s_replicaset_status
            interval: 2m
            expr: (k8s_replicaset_available/k8s_replicaset_desired)
            isAvailability: true
            warnOperator: LESS_THAN_EQUAL
            warnThreshold: '0.9'
            criticalOperator: LESS_THAN_EQUAL
            criticalThreshold: '0.8'
            alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
            alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'

      - resourceType: k8s_daemonset
        rules:
          - name: k8s_daemonset_status
            interval: 2m
            expr: >-
              (k8s_daemonset_current_scheduled_nodes/k8s_daemonset_desired_scheduled_nodes)
            isAvailability: true
            warnOperator: LESS_THAN_EQUAL
            warnThreshold: '0.9'
            criticalOperator: LESS_THAN_EQUAL
            criticalThreshold: '0.8'
            alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
            alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'
      - resourceType: k8s_statefulset
        rules:
          - name: k8s_statefulset_status
            interval: 2m
            expr: (k8s_statefulset_current_pods/k8s_statefulset_desired_pods)
            isAvailability: true
            warnOperator: LESS_THAN_EQUAL
            warnThreshold: '0.9'
            criticalOperator: LESS_THAN_EQUAL
            criticalThreshold: '0.8'
            alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
            alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'

You can add or remove alerts using standard PromQL expressions.
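
As an illustration of adding a rule, the sketch below defines a per-namespace CPU alert. It reuses the k8s_pod_cpu_limit_utilization_ratio metric and the k8s_cluster_name and k8s_namespace_name labels that appear in the example ConfigMap; the rule name, interval, and thresholds are hypothetical, and the sketch assumes the ratio form of the metric is the one your agent exposes.

```yaml
# Sketch only: append under alertDefinitions in alert-definitions.yaml.
# If a k8s_namespace entry already exists, adding the rule to its
# existing rules list is probably the safer choice (assumption).
- resourceType: k8s_namespace
  rules:
    - name: k8s_namespace_cpu_limit_utilization_avg   # hypothetical rule name
      interval: 5m
      # The "by" clause keeps the cluster and namespace labels in the result.
      # The agent uses result labels to pick the resource; without a match,
      # the alert is raised on the cluster instead.
      expr: >-
        (avg by (k8s_cluster_name, k8s_namespace_name)
        (k8s_pod_cpu_limit_utilization_ratio*100))
      warnOperator: GREATER_THAN_EQUAL
      warnThreshold: '80'
      criticalOperator: GREATER_THAN_EQUAL
      criticalThreshold: '90'
      alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
      alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'
```

To remove an alert, delete its rule entry (from the "- name:" line through alertBody) from the ConfigMap.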

Configure Availability

To configure availability for a resource, define an alert definition and set the isAvailability key to true.

The same alert definition rule is used to compute the availability of the resource. For example, to define Pod availability based on Pod memory usage, use an alert definition like the one below with isAvailability set to true.

  - resourceType: k8s_pod
    rules:
      - name: k8s_pod_memory_usage_percent
        interval: 2m
        expr: ((k8s_pod_memory_limit_utilization_ratio*100 ) OR (k8s_pod_memory_limit_utilization))
        isAvailability: true
        warnOperator: GREATER_THAN_EQUAL
        warnThreshold: '90'
        criticalOperator: GREATER_THAN_EQUAL
        criticalThreshold: '95'
        alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
        alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'

Here, if k8s_pod_memory_usage_percent is in either the warning or the critical state, the resource is considered down; otherwise it is considered up.
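
If high memory usage should raise alerts without marking the Pod as down, one possible arrangement is sketched below. It assumes that a rule without the isAvailability key is treated as alert-only, as the k8s_pod_memory_usage_mb rule in the example ConfigMap suggests; verify this behavior in your environment before relying on it.

```yaml
# Sketch only: availability follows the Pod phase, memory usage only alerts.
- resourceType: k8s_pod
  rules:
    - name: k8s_pod_phase
      interval: 2m
      expr: (k8s_pod_phase == bool 2) OR (k8s_pod_phase == bool 3)
      isAvailability: true                # this rule drives up/down state
      criticalOperator: EQUAL
      criticalThreshold: '0'
      alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
      alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'
    - name: k8s_pod_memory_usage_percent
      interval: 2m
      expr: ((k8s_pod_memory_limit_utilization_ratio*100 ) OR (k8s_pod_memory_limit_utilization))
      # isAvailability is omitted here (assumed to mean alert-only)
      warnOperator: GREATER_THAN_EQUAL
      warnThreshold: '90'
      criticalOperator: GREATER_THAN_EQUAL
      criticalThreshold: '95'
      alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
      alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'
```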

View Alerts in OpsRamp Portal

Users can view alerts in the OpsRamp portal from Command Center > Alerts.

Default Alerts Screen: View your alerts through the Alerts UI.
Alerts Details: Click on any alert entry for detailed information.