How to Send Alerts With Grafana
Grafana is one of the most powerful open-source platforms for monitoring and observability, widely adopted by DevOps teams, SREs, and infrastructure engineers around the world. While its intuitive dashboards provide real-time visualizations of metrics, logs, and traces, its true power lies in its alerting capabilities. Sending alerts with Grafana enables teams to proactively respond to anomalies, performance degradation, system failures, and security incidents before they impact end users or business operations.
Whether you're monitoring a cloud-native Kubernetes cluster, a legacy on-premise database, or a microservices architecture, Grafana's alerting system integrates seamlessly with your data sources, such as Prometheus, InfluxDB, Loki, and more, to trigger notifications via email, Slack, PagerDuty, Microsoft Teams, webhooks, and other channels. This tutorial provides a comprehensive, step-by-step guide to configuring, optimizing, and scaling alerting in Grafana, ensuring you never miss a critical event again.
By the end of this guide, you'll understand how to define meaningful alert rules, avoid alert fatigue, integrate with notification platforms, and implement enterprise-grade alerting strategies that reduce mean time to detection (MTTD) and mean time to resolution (MTTR).
Step-by-Step Guide
Prerequisites
Before configuring alerts in Grafana, ensure you have the following:
- A running Grafana instance (version 8.0 or higher recommended)
- A supported data source connected (e.g., Prometheus, InfluxDB, Loki, MySQL, etc.)
- Administrative or editor permissions in Grafana
- Access to your notification channels (e.g., Slack webhook, SMTP server, PagerDuty API key)
If you're using Grafana Cloud, these components are pre-configured. For self-hosted installations, ensure your data source is properly connected and that a dashboard panel can query it successfully.
Step 1: Create or Open a Dashboard
Alerts in Grafana are tied to individual panels within a dashboard. You cannot create an alert without first having a visualization panel that queries data from a supported data source.
To begin, navigate to the Dashboard menu in the left sidebar, then click New → Add new panel. Alternatively, open an existing dashboard you wish to monitor.
In the panel editor, select your data source from the dropdown (e.g., Prometheus). Write a query that tracks a key metric, such as HTTP error rates, CPU utilization, memory usage, or request latency. For example:
rate(http_requests_total{status_code=~"5.."}[5m]) > 0.1
This query calculates the per-second rate of HTTP 5xx responses over a 5-minute window and fires when that rate exceeds 0.1 requests per second. (To alert on 5xx responses as a percentage of all requests, divide by the total request rate; the Real Examples section later in this guide shows that form.)
Once your query returns data and the visualization looks correct, click the Alert tab at the bottom of the panel editor.
Step 2: Define Alert Conditions
The Alert tab allows you to define the conditions under which Grafana triggers an alert. There are two primary modes: Query and Expression. For most use cases, Query is preferred because it leverages your data source's native querying capabilities.
Under Alert condition, ensure When is set to Query A (or your selected query). Then choose:
- Operator: The comparison to apply (e.g., >, <, =)
- Value: The threshold number (e.g., 0.1 for 10%)
- For: The duration the condition must persist before triggering (e.g., 5m)
For example:
- Operator: >
- Value: 0.1
- For: 5m
This means: trigger an alert if the rate of 5xx errors stays above the threshold for five consecutive minutes. The For clause is critical: it prevents false positives from transient spikes. A 30-second spike in errors is normal during deployments; a sustained 5-minute spike is a real incident.
Optionally, you can set Evaluate every to control how often Grafana re-evaluates the condition (e.g., every 15s or 1m). This should align with your data source's scrape interval. For Prometheus, 15s to 1m is typical.
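The relationship between these two intervals is easy to sanity-check in a few lines. This Python sketch (the helper names are my own, not a Grafana API) parses Grafana-style duration strings and flags an evaluation interval shorter than the scrape interval, which would just re-evaluate unchanged data:

```python
import re

def parse_duration(s: str) -> int:
    """Convert a Grafana-style duration string ("15s", "1m", "2h") to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", s.strip())
    if not match:
        raise ValueError(f"unrecognized duration: {s!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * {"s": 1, "m": 60, "h": 3600}[unit]

def evaluation_interval_ok(evaluate_every: str, scrape_interval: str) -> bool:
    """Evaluating more often than the data source scrapes wastes cycles:
    Grafana would re-check the rule against data that has not changed."""
    return parse_duration(evaluate_every) >= parse_duration(scrape_interval)

print(evaluation_interval_ok("1m", "15s"))   # fine: 60s >= 15s
print(evaluation_interval_ok("10s", "15s"))  # too aggressive
```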
Step 3: Configure Alert Notifications
Alerts are useless if no one receives them. Grafana uses Notification Channels to deliver alerts to external systems.
To set up a notification channel:
- Click the Alerting menu in the left sidebar.
- Select Notification channels.
- Click Add channel.
Choose your notification type:
- Email: Requires SMTP configuration in grafana.ini
- Slack: Requires a Slack webhook URL
- PagerDuty: Requires an integration key
- Microsoft Teams: Requires a webhook URL from Teams channel
- Webhook: Custom HTTP POST endpoint (e.g., for internal ticketing systems)
- Telegram: Bot token and chat ID
- VictorOps, Google Chat, SNS, and more
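For the custom Webhook channel, the receiving side is simply an HTTP endpoint that parses Grafana's JSON body. A minimal Python sketch, assuming the classic alerting payload shape (ruleName, state, evalMatches); verify the exact field names against your Grafana version:

```python
import json

def summarize_grafana_webhook(body: bytes) -> str:
    """Turn a Grafana webhook payload into a one-line summary for a
    ticketing system. Field names follow the classic alerting payload
    (ruleName, state, evalMatches); check them for your Grafana version."""
    payload = json.loads(body)
    matches = ", ".join(
        f"{m.get('metric', '?')}={m.get('value')}"
        for m in payload.get("evalMatches", [])
    )
    return f"[{payload.get('state', 'unknown')}] {payload.get('ruleName', 'unnamed rule')}: {matches}"

# A hand-built sample in the assumed format:
sample = json.dumps({
    "ruleName": "High CPU Usage",
    "state": "alerting",
    "evalMatches": [{"metric": "cpu", "value": 0.92}],
}).encode()
print(summarize_grafana_webhook(sample))  # prints: [alerting] High CPU Usage: cpu=0.92
```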
For Slack:
- Go to your Slack workspace → Apps → search for Incoming Webhooks → Add to workspace
- Create a new webhook, select a channel, and copy the URL
- Paste it into Grafana's Slack channel configuration
- Test the connection by clicking Send test notification
Once saved, the channel appears in the list. Return to your panel's Alert tab and select this channel under Alert notifications. You can select multiple channels; for example, email for on-call engineers and Slack for the entire DevOps team.
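If you want to verify the webhook independently of Grafana, a Slack incoming webhook accepts a simple JSON body with a text field. A Python sketch using only the standard library (the message format here is my own, not Grafana's):

```python
import json
import urllib.request

def slack_alert_payload(rule_name: str, state: str, value: float, rule_url: str) -> dict:
    """Build a minimal Slack incoming-webhook payload; {"text": ...} is the
    simplest shape the webhook accepts."""
    emoji = ":red_circle:" if state == "alerting" else ":large_green_circle:"
    return {"text": f"{emoji} *{rule_name}* is {state} (value: {value})\n<{rule_url}|Open in Grafana>"}

def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook; returns the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```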
Step 4: Customize Alert Message and Labels
Grafana allows you to personalize the alert message using templating variables. This makes alerts more actionable and context-rich.
In the Alert tab, scroll to Message. Use the following variables:
- {{ .Title }}: Alert title
- {{ .State }}: Current state (e.g., Alerting, OK)
- {{ .RuleUrl }}: Direct link to the alert rule
- {{ .Values }}: Current metric value
- {{ .Tags }}: Labels attached to the metric
Example message:
HIGH HTTP ERROR RATE DETECTED
Service: {{ .Tags.instance }}
Error Rate: {{ .Values }} (threshold: 0.1)
Duration: 5 minutes
Dashboard: {{ .RuleUrl }}
Check logs: https://loki.example.com/inspect
You can also add custom Labels to the alert rule (e.g., team=backend, severity=critical). These labels help route alerts to the right teams in external systems and are passed through to notification channels.
Step 5: Test the Alert
Before relying on your alert in production, test it. You can do this in two ways:
- Simulate the condition: Temporarily increase the metric value using a test endpoint or script. For example, if you're monitoring HTTP errors, send a few 500 responses using curl or Postman.
- Use the Test Rule button: In the Alert tab, click Test rule. Grafana evaluates the query against the current data and shows whether the condition would trigger.
If the alert triggers, check your notification channel (Slack/email/etc.) to confirm the message was received correctly.
Step 6: Enable Alerting in Grafana Settings
Ensure alerting is enabled in your Grafana configuration. For self-hosted instances, edit the grafana.ini file:
[alerting]
enabled = true
Also, if using email alerts, configure SMTP:
[smtp]
enabled = true
host = smtp.gmail.com:587
user = your-email@gmail.com
password = your-app-password
from_address = alerts@yourcompany.com
Restart Grafana after making changes to the config file.
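Before restarting Grafana, you can confirm the SMTP credentials work independently. A small Python sketch with the standard library; the host, addresses, and credentials are placeholders matching the grafana.ini example above:

```python
import smtplib
from email.message import EmailMessage

def build_alert_email(subject: str, body: str, from_addr: str, to_addrs: list) -> EmailMessage:
    """Compose the same kind of plain-text message Grafana's SMTP channel
    sends, useful for verifying SMTP settings outside Grafana."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = from_addr
    msg["To"] = ", ".join(to_addrs)
    msg.set_content(body)
    return msg

def send_test_email(host: str, port: int, user: str, password: str, msg: EmailMessage) -> None:
    """Send via STARTTLS on port 587, mirroring the grafana.ini settings above."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(user, password)
        server.send_message(msg)
```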
Step 7: Manage and Organize Alerts
As your alerting setup grows, managing hundreds of rules becomes challenging. Use the following best practices:
- Group alerts by dashboard or service (e.g., API Gateway Alerts, Database Health)
- Use consistent naming: HighLatency-frontend-v1, DiskFull-db-prod
- Tag alerts with metadata: team, environment, severity
- Export alert rules as JSON or YAML using Grafana's API or UI (Alerting → Export)
- Version-control alert rules in Git alongside your infrastructure-as-code (IaC) files
To export all alert rules:
- Go to Alerting → Alert rules
- Click Export
- Download the JSON file
You can later import these rules into another Grafana instance using Import in the same menu.
Best Practices
1. Avoid Alert Fatigue with Smart Thresholds
Alert fatigue occurs when teams receive too many low-priority or false-positive alerts, leading to ignored notifications. To prevent this:
- Use relative thresholds instead of static values. For example, alert when CPU usage exceeds 150% of the 7-day average, not when it's above 80%.
- Apply anomaly detection using tools like Prometheus's predict_linear() function or Grafana's built-in anomaly detection (available in Grafana Cloud).
- Set For durations to at least 5 to 10 minutes for non-critical alerts, and 1 to 2 minutes for critical ones.
- Exclude known maintenance windows or scheduled jobs using alert annotations or external scheduling tools.
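The relative-threshold idea from the first bullet can be expressed in a few lines. A Python sketch (the 1.5x ratio is an arbitrary example; tune it per service):

```python
def breaches_relative_threshold(current: float, baseline_avg: float, ratio: float = 1.5) -> bool:
    """Fire only when the current value exceeds `ratio` times the baseline
    (e.g., the 7-day average), instead of a static cutoff."""
    return current > ratio * baseline_avg

# A host idling at 20% CPU is flagged at 35%, while a busy host
# averaging 60% is left alone at 80%:
assert breaches_relative_threshold(35.0, 20.0)
assert not breaches_relative_threshold(80.0, 60.0)
```

The advantage over a flat 80% threshold is that each host's normal workload becomes its own baseline, so quiet hosts still alert on genuine anomalies.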
2. Prioritize Alert Severity
Not all alerts are created equal. Classify alerts into tiers:
- Critical: System down, data loss, security breach (e.g., 99.9%+ error rate, disk full)
- High: Degraded performance, increased latency, high memory usage
- Medium: Resource utilization nearing limits, non-critical service slowdown
- Low: Informational, e.g., deployment completed, backup started
Use labels like severity=critical and route them to different channels. Critical alerts go to on-call engineers via SMS or PagerDuty; low alerts go to a general Slack channel.
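This routing policy usually lives in the receiving system (Alertmanager, PagerDuty), but the logic itself is simple enough to sketch. The routing table and channel names below are hypothetical:

```python
# Hypothetical routing table: severity label value -> notification channels.
ROUTES = {
    "critical": ["pagerduty", "sms"],
    "high": ["pagerduty", "slack:#devops"],
    "medium": ["slack:#devops"],
    "low": ["slack:#alerts-feed"],
}

def channels_for(labels: dict) -> list:
    """Pick channels from the alert's severity label; unlabeled alerts
    fall back to the low-priority feed."""
    return ROUTES.get(labels.get("severity", "low"), ROUTES["low"])

print(channels_for({"severity": "critical"}))  # ['pagerduty', 'sms']
```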
3. Use Annotations for Context
Annotations provide additional context to alerts without triggering them. Add annotations for:
- Deployment timestamps
- Incident runbooks or troubleshooting guides
- Links to dashboards or logs
- Root cause hypotheses
In the alert rule editor, under Annotations, add key-value pairs:
- runbook: https://wiki.yourcompany.com/runbooks/http-5xx
- dashboard: https://grafana.yourcompany.com/d/123/api-latency
These appear in the alert notification and help responders take action faster.
4. Integrate with Incident Management Tools
For production environments, integrate Grafana alerts with incident management platforms like PagerDuty, Opsgenie, or VictorOps. These tools provide:
- Escalation policies (if no one responds in 15 minutes, notify the next person)
- On-call scheduling
- Alert deduplication
- Incident timelines and post-mortem templates
To integrate with PagerDuty:
- Log into PagerDuty → Services → Add Service → Integration Type: Grafana
- Copy the integration key
- In Grafana, create a new notification channel → PagerDuty → paste the key
- Test and save
Now, every Grafana alert becomes a PagerDuty incident with full lifecycle tracking.
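Under the hood, this integration sends events to PagerDuty's Events API v2. If you ever need to trigger an incident from a script rather than from Grafana, the request body looks roughly like this sketch; consult the Events API reference before relying on it:

```python
import json

def pagerduty_event(routing_key: str, summary: str, source: str, severity: str = "critical") -> str:
    """Build a PagerDuty Events API v2 trigger body, to be POSTed to
    https://events.pagerduty.com/v2/enqueue. Per the Events API spec,
    severity must be one of: critical, error, warning, info."""
    return json.dumps({
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    })

print(pagerduty_event("your-integration-key", "CPU high on web-1", "web-1"))
```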
5. Monitor Alerting Health Itself
Alerts can fail silently. A misconfigured rule, a downed data source, or a broken webhook can render your alerting system useless.
Create a Grafana Alerting Health dashboard with panels that monitor:
- Number of active alert rules
- Number of alerts in Alerting state
- Time since last alert was triggered
- Notification channel status (e.g., webhook HTTP 500 errors)
Use the Prometheus metrics grafana_alerting_rules and grafana_alerting_evaluations_total to track alerting health.
6. Automate Alert Rule Deployment
Manually creating alerts in the UI is error-prone and not scalable. Use Grafanas API or configuration files to automate alert rule deployment.
For example, you can create alert rules programmatically through the Grafana HTTP API. The endpoint and payload schema depend on your Grafana version; with unified alerting (Grafana 9+), rules are provisioned via POST /api/v1/provisioning/alert-rules. Treat the payload below as an illustrative sketch and check the API reference for your release:
POST /api/v1/provisioning/alert-rules
Content-Type: application/json
{
  "title": "High CPU Usage",
  "ruleGroup": "infrastructure",
  "folderUID": "your-folder-uid",
  "condition": "A",
  "data": [
    {
      "refId": "A",
      "relativeTimeRange": { "from": 600, "to": 0 },
      "datasourceUid": "your-prometheus-uid",
      "model": {
        "expr": "avg_over_time(node_cpu_seconds_total{mode!=\"idle\"}[5m]) > 0.8",
        "legendFormat": "CPU Usage",
        "range": true
      }
    }
  ],
  "for": "5m",
  "noDataState": "NoData",
  "execErrState": "Alerting",
  "labels": {
    "team": "infrastructure",
    "severity": "high"
  },
  "annotations": {
    "summary": "CPU usage is above 80% for 5 minutes.",
    "runbook": "https://wiki.example.com/cpu-alert"
  }
}
Integrate this into your CI/CD pipeline using tools like Terraform, Ansible, or Helm charts to ensure alert rules are versioned and deployed alongside application code.
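A CI/CD step for this can be as small as one authenticated POST per rule file. A Python sketch, assuming Grafana 9+ unified alerting and a service-account token; the provisioning path used here is /api/v1/provisioning/alert-rules, so verify it against your version's API reference:

```python
import json
import urllib.request

def deploy_alert_rule(grafana_url: str, api_token: str, rule: dict) -> urllib.request.Request:
    """Build an authenticated request that pushes one alert rule to the
    unified-alerting provisioning API. The caller passes the result to
    urllib.request.urlopen() to actually send it."""
    return urllib.request.Request(
        f"{grafana_url}/api/v1/provisioning/alert-rules",
        data=json.dumps(rule).encode(),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = deploy_alert_rule("https://grafana.example.com", "glsa_example_token", {"title": "High CPU Usage"})
print(req.get_method(), req.full_url)
```

Running the builder separately from the network call makes the step easy to unit-test in the pipeline before anything touches a live Grafana instance.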
7. Review and Retire Alerts Regularly
Alerts decay over time. Services are decommissioned, thresholds become outdated, and teams change.
Establish a quarterly alert review process:
- Identify alerts that haven't triggered in 90+ days
- Check if the underlying metric still matters
- Archive or delete obsolete rules
- Update documentation and runbooks
Use Grafana's alert history to analyze which alerts are most useful and which are noise.
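The "90+ days without firing" check is easy to automate once you have last-fired timestamps pulled from alert history. A Python sketch of the filtering step (the input format here is my own, not Grafana's response shape):

```python
from datetime import datetime, timedelta

def stale_alerts(last_fired: dict, now: datetime, max_age_days: int = 90) -> list:
    """Return rule names that have not fired within max_age_days.
    `last_fired` maps rule name -> datetime of the last firing (None = never)."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        name for name, fired in last_fired.items()
        if fired is None or fired < cutoff
    )

history = {
    "HighLatency-frontend-v1": datetime(2024, 5, 20),
    "DiskFull-db-prod": datetime(2024, 1, 1),
    "OrphanedRule": None,
}
print(stale_alerts(history, now=datetime(2024, 6, 1)))
# prints: ['DiskFull-db-prod', 'OrphanedRule']
```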
Tools and Resources
Official Grafana Documentation
The authoritative source for all things Grafana is the official documentation at grafana.com/docs.
Community Alert Rules and Templates
Start with proven alert rules from the community:
- kube-prometheus Alert Rules: Excellent for Kubernetes environments
- Grafana's built-in alert rule examples
- Grafana Dashboard Library: Many dashboards include pre-configured alert rules
Third-Party Integrations
Enhance alerting with these tools:
- PagerDuty: Incident response and escalation
- Opsgenie: Advanced alert routing and on-call scheduling
- VictorOps: DevOps-focused incident management
- Microsoft Teams: Native integration via webhooks
- Slack: Team communication with threaded alerts
- Webhook: Connect to Jira, ServiceNow, or custom ticketing systems
Monitoring Tools to Pair With Grafana
For full observability, combine Grafana with:
- Prometheus: Time-series metrics collection
- Loki: Log aggregation
- Tempo: Distributed tracing
- Node Exporter: Server-level metrics
- Blackbox Exporter: HTTP/S, TCP, and ICMP probes
Alerting Best Practice Checklists
Download or print these to ensure your alerting setup is robust:
- Google SRE: Monitoring Distributed Systems
- Site Reliability Engineering (O'Reilly)
- Datadog Monitoring Best Practices
Real Examples
Example 1: High HTTP Error Rate Alert
Scenario: Your web application serves 10M requests/day. A sudden spike in 5xx errors indicates a backend service failure.
Query:
rate(http_requests_total{job="web-app", status_code=~"5.."}[5m]) / rate(http_requests_total{job="web-app"}[5m]) > 0.05
This calculates the proportion of 5xx responses out of total requests over 5 minutes. If it exceeds 5%, trigger an alert.
For: 5 minutes
Notification: Slack + PagerDuty
Message:
CRITICAL: HTTP 5xx Error Rate > 5%
Service: web-app
Environment: production
Current Rate: {{ .Values }}
Threshold: 5%
Duration: 5m
Dashboard: {{ .RuleUrl }}
Runbook: https://wiki.example.com/5xx-troubleshooting
Result: The on-call engineer is notified within 5 minutes, investigates the failing microservice, and rolls back a bad deployment within 12 minutes.
Example 2: Disk Space Exhaustion Alert
Scenario: A database server is running out of disk space. This could cause data loss or downtime.
Query (Prometheus + Node Exporter):
100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) > 90
This calculates the percentage of disk space used on the root filesystem. If it exceeds 90%, alert.
For: 10 minutes (to allow for temporary log rotation or cache cleanup)
Notification: Email + Teams
Message:
HIGH DISK USAGE ON {{ .Tags.instance }}
Filesystem: {{ .Tags.mountpoint }}
Used: {{ .Values }}%
Threshold: 90%
Runbook: https://wiki.example.com/disk-space-clear
Commands: df -h | grep /; du -sh /var/log/*; journalctl --vacuum-size=100M
Result: The system administrator clears old logs and increases disk size before the server crashes.
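The arithmetic in that disk-usage query is worth double-checking when choosing a threshold. The same calculation in Python (a sketch for verification, not part of the alert itself):

```python
def disk_used_percent(avail_bytes: float, size_bytes: float) -> float:
    """Mirrors the PromQL above: 100 * (1 - avail / size)."""
    return 100.0 * (1.0 - avail_bytes / size_bytes)

# 5 GiB free on a 100 GiB volume is 95% used, above the 90% threshold:
assert disk_used_percent(5 * 2**30, 100 * 2**30) > 90
```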
Example 3: Latency Spike in API Gateway
Scenario: Your API gateway's p99 latency spikes above 2 seconds, impacting user experience.
Query:
histogram_quantile(0.99, sum(rate(api_request_duration_seconds_bucket{service="gateway"}[5m])) by (le)) > 2
This uses a histogram to calculate the 99th percentile latency. If it exceeds 2 seconds, trigger an alert.
For: 3 minutes
Notification: Slack channel #api-alerts + PagerDuty
Message:
API LATENCY SPIKE: p99 > 2s
Service: gateway
Current p99: {{ .Values }}s
Threshold: 2s
Duration: 3m
Dashboard: {{ .RuleUrl }}
Check: https://grafana.example.com/d/456/api-latency
Trace: https://tempo.example.com/search?service=gateway
Result: The team identifies a misconfigured caching layer and fixes it before users report issues.
FAQs
Can Grafana send alerts without a data source?
No. Grafana requires a connected data source (Prometheus, InfluxDB, Loki, etc.) to evaluate metrics and trigger alerts. You cannot create alerts based on static values or manual inputs.
How often does Grafana evaluate alert rules?
By default, Grafana evaluates alert rules every 15 seconds. You can change this in the alert rule settings under Evaluate every. Ensure this aligns with your data source's scrape interval (e.g., Prometheus scrapes every 15s to 60s).
Can I silence alerts during maintenance windows?
Yes. Use Grafana's Alert Silences feature. Go to Alerting → Silences → Create Silence. Set a time range and match rules by name, label, or tag. This temporarily disables matching alerts without deleting them.
Do alerts work if Grafana is down?
No. Grafana must be running to evaluate alert conditions and send notifications. For high availability, deploy Grafana in a clustered or replicated setup. Alternatively, use Prometheus Alertmanager for alert routing; it can continue to send alerts even if Grafana is temporarily unavailable.
Can I create alerts based on logs?
Yes, if you're using Loki as your log data source. You can create alerts based on log line counts, error patterns, or specific message content using LogQL queries. For example:
count_over_time({job="api"} |= "ERROR" [5m]) > 10
This triggers an alert if more than 10 ERROR lines appear in 5 minutes.
Whats the difference between Grafana alerts and Prometheus Alertmanager?
Grafana alerts are evaluated and triggered within Grafana using its UI and API. Prometheus Alertmanager is a separate component that receives alerts from Prometheus servers and handles routing, silencing, and deduplication. Grafana can send alerts to Alertmanager, but Alertmanager is more robust for large-scale, enterprise alerting. Use Grafana alerts for simplicity and dashboards; use Alertmanager for scale and reliability.
Can I send alerts to mobile apps?
Yes. Use notification channels like PagerDuty, Pushover, or Telegram, which have mobile apps. Configure the channel in Grafana, and alerts will appear as push notifications on your phone.
Is there a limit to the number of alerts I can create?
Grafana doesn't enforce a hard limit, but performance degrades with thousands of alert rules. For large deployments (more than 500 rules), consider using Prometheus Alertmanager or Grafana Cloud's managed alerting, which scales better.
Conclusion
Alerting is not a one-time setup; it's an ongoing discipline that requires careful design, continuous refinement, and active maintenance. Sending alerts with Grafana gives you the power to detect issues before they become incidents, reduce downtime, and improve system reliability across your entire infrastructure.
By following the step-by-step guide in this tutorial, you've learned how to create meaningful alert rules, integrate with modern notification platforms, avoid alert fatigue, and automate alert management at scale. You've seen real-world examples of how alerts prevent outages and how best practices turn reactive monitoring into proactive resilience.
Remember: the goal of alerting isn't to notify you of every minor fluctuation; it's to notify you of the right things, at the right time, with the right context. Invest time in tuning your alerts. Review them quarterly. Automate their deployment. Link them to runbooks. And always ask: if this alert fires, will I know exactly what to do?
With Grafana's powerful alerting system and the strategies outlined here, you're no longer just watching metrics; you're defending your systems. And in today's digital world, that's not just an advantage. It's a necessity.