Introduction: Why this matters
If we run or support SQL Server, we already know the stakes: slow queries frustrate users, poor plans can silently grind performance, and intermittent production issues eat time and credibility.
Microsoft built three complementary, powerful tools that, when used together, let us monitor, triage, and fix problems quickly: Query Store, Dynamic Management Views (DMVs), and Extended Events. This article gives you a practical, playbook-style guide so you can use those tools like a pro. For official reference and deep technical detail, Microsoft’s docs are essential reading.
Why proactive monitoring matters
Proactive monitoring turns surprises into expected events. Reactive firefighting (chasing a slow report at 3 AM) costs time and leads to shortcuts. A proactive approach gives you baselines, quick root-cause data, and less intrusive ways to capture evidence.
Costs of reactive troubleshooting include user downtime, rushed changes that cause regressions, and lost continuity in business metrics. Observability for DBAs means capturing query performance history (Query Store), current engine state (DMVs), and detailed event traces when you need them (Extended Events). These three offer different time-resolution and overhead tradeoffs, so let's learn to use each where it shines.
Overview: Query Store, DMVs, and Extended Events (who does what)
Quick comparison:
- Query Store: Historical query & plan telemetry retained in the database. It is best for plan regressions, comparing pre/post-change performance, and forcing plans. Captures aggregated runtime stats for queries over time.
- DMVs: Live server and database state. They are best for immediate diagnostics (current waiting tasks, plan cache contents, index usage). DMVs are extremely useful for light, fast checks.
- Extended Events: Low-overhead, fine-grained tracing. It is best for capturing sequences, errors, deadlocks, and correlating causality across components. Use XEvents when you need detailed event-level data.
When to pick which tool: use Query Store for history and plan regressions, DMVs for quick health checks, and XEvents for deep root-cause and sequence analysis.
Query Store deep dive
What Query Store captures & how it helps
Query Store stores query texts, their plans, and runtime statistics (execution counts, avg/duration, CPU, logical reads) so we can see plan changes and performance over time without requiring customers to run traces. It’s built into SQL Server and Azure SQL and is indispensable for regression hunting.
Architecture and retention
Query Store stores data in the user database (Query Store internal tables). You can configure capture mode (ALL, AUTO, NONE, or CUSTOM on SQL Server 2019 and later), max size, and data flush/cleanup policies. Be mindful of storage, as Query Store can grow quickly on ad-hoc workloads. Microsoft documents management settings and retention policies you should tune per workload.
Enabling & configuring
Query Store can be enabled and sized with the following SQL:
ALTER DATABASE YourDB
SET QUERY_STORE = ON;
ALTER DATABASE YourDB
SET QUERY_STORE (OPERATION_MODE = READ_WRITE, MAX_STORAGE_SIZE_MB = 2048);
Use AUTO capture mode for most production workloads; switch to CUSTOM for finer control.
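On SQL Server 2019 and later, CUSTOM capture mode lets us skip trivial queries entirely. A possible policy looks like the sketch below; the thresholds are illustrative starting points, not recommendations, and should be tuned per workload:

```sql
-- Capture a query only if, within the stale-capture window, it runs often
-- enough or is expensive to compile/execute. Thresholds are illustrative.
ALTER DATABASE YourDB
SET QUERY_STORE (
    QUERY_CAPTURE_MODE = CUSTOM,
    QUERY_CAPTURE_POLICY = (
        STALE_CAPTURE_POLICY_THRESHOLD = 24 HOURS,
        EXECUTION_COUNT = 30,
        TOTAL_COMPILE_CPU_TIME_MS = 1000,
        TOTAL_EXECUTION_CPU_TIME_MS = 100
    )
);
```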
Best practices & gotchas
- Use AUTO mode unless you have high ad-hoc traffic that will overwhelm storage.
- Monitor sys.database_query_store_options and sys.query_store_runtime_stats for usage and growth.
- If Query Store fills up, it can stop accepting new data or purge old data, so plan retention and alerts accordingly.
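A quick health check along those lines, using columns from sys.database_query_store_options:

```sql
-- Is Query Store in the state we asked for, and how full is it?
-- If actual and desired state differ, readonly_reason explains why.
SELECT actual_state_desc,
       desired_state_desc,
       readonly_reason,
       current_storage_size_mb,
       max_storage_size_mb
FROM sys.database_query_store_options;
```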
Using Query Store for troubleshooting
Find plan regressions
Steps:
- Open the Query Store UI in SSMS (or query the sys.query_store_query / sys.query_store_plan catalog views).
- Identify queries with sudden increases in avg duration or CPU across plan_id changes.
- Inspect plan differences: parameter sniffing, cardinality estimate differences, or different join/scan choices.
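One way to surface candidates in T-SQL, sketched against the Query Store catalog views (durations are in microseconds), is to look for queries with more than one plan and a wide spread between their best and worst average durations:

```sql
-- Queries with multiple plans and a large gap between their best and
-- worst average durations are prime plan-regression suspects.
SELECT TOP 20
    q.query_id,
    COUNT(DISTINCT p.plan_id) AS plan_count,
    MIN(rs.avg_duration)      AS best_avg_duration_us,
    MAX(rs.avg_duration)      AS worst_avg_duration_us
FROM sys.query_store_query q
JOIN sys.query_store_plan p
    ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats rs
    ON rs.plan_id = p.plan_id
GROUP BY q.query_id
HAVING COUNT(DISTINCT p.plan_id) > 1
ORDER BY MAX(rs.avg_duration) - MIN(rs.avg_duration) DESC;
```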
Forcing plans
If a newer plan regresses and you verified an older plan is better, you can force the good plan via Query Store to stabilize performance:
EXEC sp_query_store_force_plan @query_id = <id>, @plan_id = <plan_id>;
Use plan forcing judiciously. Always seek a root fix (statistics, indexes, query rewrite) before forcing long-term.
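Once the root cause is addressed, the counterpart procedure removes the forced plan:

```sql
-- Remove the forced plan after the underlying fix is validated
EXEC sp_query_store_unforce_plan @query_id = <id>, @plan_id = <plan_id>;
```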
Example flow
- Capture evidence (Query Store) → Reproduce in lower environment → Fix (index/statistics/rewrite) → Validate → Remove forced plan once root cause solved.
DMVs (Dynamic Management Views) explained
DMVs provide snapshots of the engine: waits, sessions, requests, buffer pool, IO stats, plan cache, and more. They’re the quickest way to triage live problems. Microsoft documents the full DMV list and purpose—start with the server- and database-scoped DMVs.
Key DMV categories
- Execution statistics: sys.dm_exec_query_stats, sys.dm_exec_sql_text, sys.dm_exec_query_plan
- Waits & sessions: sys.dm_os_waiting_tasks, sys.dm_exec_sessions, sys.dm_exec_requests
- IO & buffer: sys.dm_io_virtual_file_stats, sys.dm_os_buffer_descriptors
- Index & missing index stats: sys.dm_db_index_usage_stats, sys.dm_db_missing_index_details
- Plan cache: sys.dm_exec_cached_plans, sys.dm_exec_plan_attributes
Practical DMV queries (recipes)
Top CPU/IO consuming queries (group by query_hash)
SELECT TOP 50
    qs.query_hash,
    SUM(qs.total_worker_time)    AS total_cpu,
    SUM(qs.total_logical_reads)  AS total_reads,
    SUM(qs.total_logical_writes) AS total_writes,
    SUM(qs.execution_count)      AS executions,
    -- One representative statement text per query_hash
    MIN(SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE WHEN qs.statement_end_offset = -1
               THEN DATALENGTH(st.text)
               ELSE qs.statement_end_offset END
          - qs.statement_start_offset) / 2) + 1)) AS sample_statement
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
GROUP BY qs.query_hash
ORDER BY total_cpu DESC;
Blocking and waits
SELECT
    wt.session_id, wt.wait_type, wt.wait_duration_ms, wt.blocking_session_id, r.command, s.host_name
FROM sys.dm_os_waiting_tasks wt
JOIN sys.dm_exec_requests r ON wt.session_id = r.session_id
JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id
ORDER BY wt.wait_duration_ms DESC;
Index usage
SELECT OBJECT_NAME(i.object_id) AS TableName, i.name AS IndexName,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes i
LEFT JOIN sys.dm_db_index_usage_stats s
    ON i.object_id = s.object_id AND i.index_id = s.index_id
    AND s.database_id = DB_ID()  -- usage stats are server-wide; scope to this DB
WHERE OBJECTPROPERTY(i.object_id, 'IsMsShipped') = 0;
Remember: group by query_hash or plan_handle to avoid tuning duplicate statements and to deal with ad-hoc query noise. Community experience shows grouping is essential.
Extended Events (XEvents) fundamentals
Extended Events is the lightweight tracing mechanism that replaced SQL Trace/Profiler; it provides flexible sessions, low overhead, and many events and actions to capture engine internals. Use XEvents for deadlock graphs (if you prefer more than the default), for long-running query capture with context, and for correlating errors to sessions. Microsoft’s docs and quickstart provide modern guidance.
Core concepts
- Session: container for the capture.
- Events: things that happen (e.g., sql_statement_completed, deadlock_graph, query_post_execution_showplan).
- Targets: where you save event data (ring buffer, event_file).
- Predicates: filters so you capture only what you need (e.g., duration > 1000 ms).
Extended Events for troubleshooting
Capture long-running queries and waits
A simple session to catch statements > 1s:
CREATE EVENT SESSION [LongRunningStatements] ON SERVER
ADD EVENT sqlserver.sql_statement_completed(
ACTION(sqlserver.sql_text, sqlserver.session_id)
WHERE (duration > 1000000)) -- duration in microseconds
ADD TARGET package0.event_file(SET filename=N'LongRunningStatements.xel');
ALTER EVENT SESSION [LongRunningStatements] ON SERVER STATE = START;
Correlating with waits and errors
Add wait and error events (e.g., attention, error_reported, blocked_process_report) to build a causality timeline. Use causality_tracking to connect related events across sessions.
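A sketch of such a session is below; the event and action names are real, but treat the exact set as a starting point. TRACK_CAUSALITY stamps each event with an activity GUID and sequence number so related events can be lined up afterwards:

```sql
CREATE EVENT SESSION [ErrorTimeline] ON SERVER
ADD EVENT sqlserver.error_reported(
    ACTION(sqlserver.sql_text, sqlserver.session_id)
    WHERE (severity >= 16)),        -- only real errors, not info messages
ADD EVENT sqlserver.attention(      -- client cancellations / timeouts
    ACTION(sqlserver.sql_text, sqlserver.session_id))
ADD TARGET package0.event_file(SET filename = N'ErrorTimeline.xel')
WITH (TRACK_CAUSALITY = ON);

ALTER EVENT SESSION [ErrorTimeline] ON SERVER STATE = START;
```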
Lightweight continuous monitoring
Use ring buffers or small rolling event_files with predicates to keep overhead very low. Do not enable overly broad sessions on heavy OLTP systems; filter early.
Putting it all together: a troubleshooting playbook
Step 1 Detect
- Alerts: CPU, long-running queries, sudden plan change counts, increased waits. Query Store alerts (growth) and instance-level counters help detect deviations.
Step 2 Triage (fast)
- Run DMV checks: sys.dm_os_waiting_tasks, sys.dm_exec_requests, and top sys.dm_exec_query_stats grouped by query_hash. This gives an immediate idea: blocking, resource saturation, or runaway queries.
Step 3 Diagnose (evidence)
- Query Store: check historical performance and plan changes around the time of the incident.
- Extended Events: start a targeted XEvent session to capture the exact sequence (errors, waits, query text, plans). Save results to an .xel file and analyze in SSMS.
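The saved file can also be read back with T-SQL rather than SSMS; for example, against the LongRunningStatements session defined earlier:

```sql
-- Each row holds one captured event as XML; shred it further with
-- the xml type's .value()/.query() methods as needed.
SELECT CAST(event_data AS xml) AS event_xml
FROM sys.fn_xe_file_target_read_file(
         N'LongRunningStatements*.xel', NULL, NULL, NULL);
```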
Step 4 Fix & validate
- Apply fixes: update statistics, add/drop/rebuild indexes, rewrite queries, or adjust configuration. Re-run the same queries and verify via Query Store or DMVs that the metrics improved. If necessary, force a plan temporarily while you coordinate a safe permanent fix.
Automation & alerting: from reactive to proactive
Collect baseline metrics: avg duration per top queries (Query Store), CPU/IO by query_hash (DMVs), wait category baselines, and plan change counts. Integrate SQL telemetry with centralized monitoring (Azure Monitor, Grafana/Prometheus, or commercial APMs). Use Query Store and XEvents alerts for growth or unusual plan-change spikes. Microsoft published best practices for Query Store management and monitoring. Please read those when setting thresholds.
Performance tuning workflows and collaboration
Use Query Store artifacts (query_id, plan_id) to create reproducible tickets for developers. Document the plan difference, the metrics (before/after), and the recommended remediation. Avoid forcing plans as a first-line permanent fix; use it as a stabilization step while dev/test pipelines address root causes.
Common pitfalls and how to avoid them
- Query Store storage blowout: don’t enable on a heavily ad-hoc system without sizing caps and monitoring.
- Misreading DMVs: most DMV counters are cumulative since the last instance restart; interpret deltas and average rates, not raw totals.
- Over-capturing XEvents: broad sessions can add overhead and large files. Filter early and store to rolling targets.
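The delta approach from the pitfalls above can be sketched for sys.dm_os_wait_stats; the one-minute interval is arbitrary:

```sql
-- Snapshot wait stats, wait a minute, then diff to see which
-- waits are growing right now rather than since restart.
SELECT wait_type, wait_time_ms
INTO #wait_snapshot
FROM sys.dm_os_wait_stats;

WAITFOR DELAY '00:01:00';

SELECT w2.wait_type,
       w2.wait_time_ms - w1.wait_time_ms AS wait_ms_last_minute
FROM sys.dm_os_wait_stats w2
JOIN #wait_snapshot w1 ON w1.wait_type = w2.wait_type
WHERE w2.wait_time_ms - w1.wait_time_ms > 0
ORDER BY wait_ms_last_minute DESC;

DROP TABLE #wait_snapshot;
```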
Reference cheat sheet: quick commands & sessions
Enable Query Store
ALTER DATABASE YourDB SET QUERY_STORE = ON;
ALTER DATABASE YourDB SET QUERY_STORE (OPERATION_MODE = READ_WRITE, CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30));
Top waits
SELECT wait_type, SUM(wait_time_ms) wait_ms, SUM(signal_wait_time_ms) signal_ms
FROM sys.dm_os_wait_stats
GROUP BY wait_type
ORDER BY wait_ms DESC;
Create a simple XEvent for deadlocks
CREATE EVENT SESSION [CaptureDeadlocks] ON SERVER
ADD EVENT sqlserver.xml_deadlock_report -- carries the full deadlock graph XML
ADD TARGET package0.event_file(SET filename=N'CaptureDeadlocks.xel');
ALTER EVENT SESSION [CaptureDeadlocks] ON SERVER STATE = START;
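To confirm which sessions are actually running and where they write, the XEvents DMVs can be queried (a minimal sketch):

```sql
-- Active Extended Events sessions and their targets
SELECT s.name AS session_name,
       t.target_name
FROM sys.dm_xe_sessions s
JOIN sys.dm_xe_session_targets t
    ON t.event_session_address = s.address;
```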
Frequently Asked Questions
Q1: When should I enable Query Store on a production database?
A1: Enable Query Store when you want historical telemetry to diagnose regressions—preferably in AUTO mode with reasonable MAX_STORAGE_SIZE_MB and retention. For very ad-hoc workloads, test in staging first and monitor growth.
Q2: Can I rely only on DMVs for troubleshooting?
A2: DMVs are great for live state, but they don’t retain historical plan changes. Combine DMVs (live triage) with Query Store (history) and XEvents (detailed traces) for full coverage.
Q3: Does Extended Events replace SQL Trace/Profiler?
A3: Yes. Extended Events is the modern, lightweight tracing framework intended to replace SQL Trace and Profiler. Use XEvents for low-overhead, flexible tracing.
Q4: How do I avoid Query Store filling up the database?
A4: Configure MAX_STORAGE_SIZE_MB, use AUTO capture, set an appropriate STALE_QUERY_THRESHOLD_DAYS, and monitor sys.database_query_store_options. Purge or limit capture for extremely high-cardinality workloads.
Q5: What’s the best way to capture intermittent slowdowns?
A5: Use Query Store to identify plan-time correlations; if finer timing is needed, start a targeted XEvent session with predicates (duration threshold, specific database or application_name) and save to rolling files for analysis.
Q6: Should I force plans or fix root causes?
A6: Prefer root-cause fixes (statistics, indexing, rewrites). Use plan forcing as a controlled, temporary mitigation while you implement a permanent solution and test thoroughly.
Q7: How do I correlate XEvents with Query Store entries?
A7: Capture actions like sql_text, session_id, and plan_handle in XEvents. Use the plan_handle and SQL text to match entries in Query Store (sys.query_store_plan) and DMVs for triangulation.
Conclusion: mastering the trio for reliable SQL Server operations
Monitoring and troubleshooting SQL Server well is about having the right data at the right time. Use Query Store for historical plan and performance telemetry, DMVs for live-state triage, and Extended Events for detailed, low-overhead tracing. Combine these tools into a repeatable playbook: detect, triage, diagnose, fix, and validate. Start by enabling Query Store with sane limits, keep a library of DMV queries for fast checks, and use targeted XEvent sessions for deep dives. Do that, and you’ll spend less time chasing fires and more time improving performance predictably.