Monitoring During a Load Test¶
Load test monitoring is detective work in real time. You're watching hundreds of virtual users stress your application, looking for clues about performance limits, bottlenecks, and failure modes. Response times spike at 300 VUs? That's a clue. Database CPU hits 100% at the same moment? That's the culprit.
Running a load test without monitoring is like driving blindfolded. You'll crash, but you won't know why. Monitoring tells you not just "the server failed at 500 VUs" but "the database connection pool exhausted at 500 VUs because only 100 connections were configured."
This guide explains:
- Which metrics to watch during a load test
- What each metric means and why it matters
- How to correlate metrics to identify bottlenecks
- Warning signs that indicate problems
- Real-time degradation detection with AI assistance
Key Metrics Overview¶
Load testing produces dozens of metrics, but these seven are the ones to watch in real time:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Response Time (avg) | Time from request sent to response received | User experience (slow responses = frustrated users) |
| Hits/sec | HTTP requests per second across all VUs | Server throughput: how many requests/sec can it handle? |
| Bandwidth | Data transferred per second (download + upload) | Network capacity: are you bandwidth-limited? |
| Virtual Users | Number of concurrent VUs executing test case | Load level: more VUs = more stress |
| Errors/sec | Failed transactions per second | Application health: errors indicate broken functionality |
| CPU % (server) | Server CPU utilization | Compute capacity: high CPU = compute-bound |
| Memory % (server) | Server memory utilization | Memory capacity: high memory = potential leak or cache issue |
These metrics tell a story: response times increase (the symptom) because CPU hits 100% (the cause). Monitoring reveals the narrative.
Response Time: The Primary Performance Metric¶
Response time is what users experience. Everything else is diagnostic. If response times are fast, users are happy. If response times are slow, users are frustrated, and it doesn't matter that your server CPU is only 30%.
What Response Time Measures¶
Response time = time from sending HTTP request to receiving complete response:
[VU sends request] → [network latency] → [server processes] →
[network latency] → [VU receives response] = Response Time
Components:
- Network latency: Time for packets to travel (typically 10-100ms)
- Server processing: Time for server to generate response (varies: 10ms for cached page, 1000ms for complex database query)
- Network download time: Time to transfer response body (depends on response size and bandwidth)
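The components above add up in a predictable way. Here's a minimal sketch of that arithmetic, using illustrative numbers rather than measurements from any particular test:

```python
# Sketch: decompose a response time into its components.
# All figures are illustrative assumptions, not measured values.
def total_response_time(network_latency_ms, server_processing_ms,
                        response_bytes, bandwidth_bytes_per_sec):
    """Request latency + server processing + response latency + download time."""
    download_ms = response_bytes / bandwidth_bytes_per_sec * 1000
    # Network latency is paid twice: once for the request, once for the response.
    return 2 * network_latency_ms + server_processing_ms + download_ms

# Example: 20ms latency each way, 50ms server time, 100 KB body on a 10 Mbps link.
rt = total_response_time(20, 50, 100_000, 10_000_000 / 8)
print(round(rt), "ms")  # 170 ms
```

Notice that for a large response on a slow link, download time dominates; for a small API response, server processing usually does.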
What "good" response times look like:
| Page Type | Acceptable | Good | Excellent |
|---|---|---|---|
| Static content (images, CSS) | < 500ms | < 200ms | < 100ms |
| Dynamic pages (database queries) | < 2000ms | < 1000ms | < 500ms |
| API calls (simple) | < 500ms | < 200ms | < 100ms |
| API calls (complex) | < 2000ms | < 1000ms | < 500ms |
These are guidelines. Your application's acceptable response times depend on user expectations and business requirements.
Interpreting Response Time Patterns¶
Response time patterns reveal how the server behaves under load. Learn to read them.
Pattern 1: Flat Line (Ideal)¶
What it looks like:
Response Time (ms)
200 |████████████████████████████████
|
0 +--------------------------------
0 100 200 300 400 500 (VUs)
What it means: Server handling load beautifully. Response times stay constant as VUs increase.
Why this happens: Server has capacity to spare (CPU 40%, memory 50%, database well-optimized).
What to do: Keep ramping VUs to find the capacity limit.
Pattern 2: Gradual Increase (Normal)¶
What it looks like:
Response Time (ms)
400 | ██████████
300 | ███████████
200 | ████████████
100 |██████████
+----------------------------------
0 100 200 300 400 500 (VUs)
What it means: Server handling load well, but performance degrades proportionally with load.
Why this happens: Server resource contention increases as VUs increase (more DB connections, more CPU threads, more memory usage).
What to do: Acceptable if degradation is linear and response times stay under acceptable thresholds (e.g., < 2000ms).
Pattern 3: Sharp Spike (Capacity Limit Reached)¶
What it looks like:
Response Time (ms)
8000| ████
2000| ████
500| ████████
100|████████
+----------------------------
0 100 200 300 400 (VUs)
What it means: Server hit a hard limit at around 300 VUs, with response times jumping from 500ms to 8,000ms in one load level.
Why this happens: Resource exhaustion, plain and simple. Database connection pool full, memory exhausted, CPU maxed, thread pool saturated. Something ran out.
What to do: Note the VU count when the spike occurred (capacity limit = 300 VUs). Check server metrics (CPU, memory, database connections) to identify which bottleneck you hit. Check the Errors View for specific error messages, which often reveal exactly what was exhausted ("connection pool exhausted" being a common one).
This is valuable data. You found the breaking point.
Pattern 4: Erratic Spikes (Intermittent Issues)¶
What it looks like:
Response Time (ms)
5000| ██ ██ ██
2000| ██ ██ ████
500|████████████████████████████
+-------------------------------
0 100 200 300 400 500 (VUs)
What it means: Intermittent performance issues, with occasional slow requests (outliers).
Why this happens: Garbage collection pauses in the JVM or .NET CLR. Database query timeouts where slow queries occasionally take 10x longer. Network hiccups (packet loss, retransmissions). Background jobs like cron tasks or scheduled processes competing for resources.
What to do: Check whether spikes correlate with time. If they happen every 5 minutes, that's a scheduled job. Review server logs during spike periods, paying attention to GC logs and slow query logs. If spikes are random and infrequent (under 5% of requests), they may be acceptable noise. If they're frequent (over 10%), investigate the root cause: GC tuning, query optimization.
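The four patterns above can be roughly told apart programmatically. Here's a sketch of such a classifier; the 4x jump and 1.2x flatness thresholds are illustrative assumptions, not product defaults:

```python
# Sketch: classify a series of per-load-level average response times
# into the patterns above. Thresholds are illustrative assumptions.
def classify_pattern(response_times_ms):
    first, last = response_times_ms[0], response_times_ms[-1]
    # Largest jump between consecutive load levels.
    max_jump = max(b / a for a, b in zip(response_times_ms, response_times_ms[1:]))
    if max_jump >= 4:
        return "sharp spike (capacity limit)"
    if last <= first * 1.2:
        return "flat (ideal)"
    return "gradual increase (normal)"

print(classify_pattern([100, 105, 110, 108]))   # flat (ideal)
print(classify_pattern([100, 150, 250, 400]))   # gradual increase (normal)
print(classify_pattern([100, 120, 500, 8000]))  # sharp spike (capacity limit)
```

A real implementation would also look at variance within each load level to catch the erratic-spikes pattern.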
Ask the AI to Interpret Response Time Patterns
If you see unusual response time patterns:
My response times are flat at 100ms until 250 VUs, then jump to 5000ms at 300 VUs.
CPU is at 60% and memory is at 50%. What's the bottleneck?
The AI can:
- Analyze response time patterns to identify capacity limits
- Correlate response times with server metrics (CPU, memory, database) to pinpoint bottlenecks
- Distinguish between normal degradation vs. hard limits vs. intermittent issues
- Recommend immediate actions (stop test, add resources, investigate specific components)
- Suggest long-term fixes (optimize queries, increase connection pools, add caching)
Hits/Sec: Server Throughput¶
Hits/sec measures how many HTTP requests your server processes per second. Raw throughput capacity.
What Hits/Sec Tells You¶
Hits/sec should increase as VUs increase:
| VUs | Expected Hits/Sec (Typical Web App) | Why |
|---|---|---|
| 100 | ~500-1000 | Each VU averages ~5-10 hits/sec (each page load fetches many embedded resources) |
| 200 | ~1000-2000 | Linear scaling (2x VUs = 2x hits/sec) |
| 500 | ~2500-5000 | Continues scaling |
If hits/sec stops increasing even though VUs keep ramping, the server is maxed out: it can't process requests any faster, no matter how many you send. The VUs are waiting for slow responses, which is also why response times will be spiking.
Example (problem):
| VUs | Hits/Sec | Response Time | What It Means |
|---|---|---|---|
| 100 | 1000 | 100ms | Good |
| 200 | 2000 | 150ms | Good (linear scaling) |
| 300 | 2500 | 500ms | Scaling slows |
| 400 | 2500 | 2000ms | Hits/sec plateaued, server can't handle more |
This tells you the server maxes out at around 2,500 hits/sec, regardless of how many more VUs you throw at it.
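Spotting that plateau can be automated: compare how much throughput grew against how much load grew at each step. A minimal sketch, where the 1.1x tolerance is an illustrative assumption:

```python
# Sketch: find the first load level where hits/sec stops keeping up with VUs.
# The 1.1x tolerance is an illustrative assumption.
def throughput_ceiling(vus, hits_per_sec):
    """Return (vu_level, hits/sec) where throughput first falls behind load."""
    for i in range(1, len(vus)):
        vu_growth = vus[i] / vus[i - 1]
        hit_growth = hits_per_sec[i] / hits_per_sec[i - 1]
        # Throughput grew far slower than load: the server is saturating.
        if hit_growth < vu_growth / 1.1:
            return vus[i], hits_per_sec[i]
    return None  # still scaling linearly

# The example table above: scaling falls behind at 300 VUs.
print(throughput_ceiling([100, 200, 300, 400], [1000, 2000, 2500, 2500]))
```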
Hits/Sec vs. Response Time Correlation¶
The relationship between hits/sec and response time reveals server behavior.
| Hits/Sec | Response Time | What It Means |
|---|---|---|
| Increasing | Flat/Low | Server handling load easily (plenty of capacity) |
| Increasing | Gradually increasing | Server handling load but approaching limits |
| Plateaus | Spiking | Server maxed out, can't process more requests |
| Decreasing | Spiking | Server overloaded, actually processing FEWER requests because it's so slow |
Decreasing hits/sec is the red flag. The server is so overloaded it's actually processing fewer requests than before. It's going backward.
Bandwidth: Network Throughput¶
Bandwidth measures data transferred per second (typically in Mbps or Gbps).
What Bandwidth Tells You¶
Bandwidth should increase as VUs increase (more users = more data transferred):
| VUs | Expected Bandwidth (Image-Heavy Site) | Expected Bandwidth (Text-Heavy Site) |
|---|---|---|
| 100 | ~50 Mbps | ~5 Mbps |
| 500 | ~250 Mbps | ~25 Mbps |
| 1000 | ~500 Mbps | ~50 Mbps |
If bandwidth plateaus (stops increasing even though VUs increase):
- Network bottleneck: server's network interface maxed out (e.g., 1 Gbps NIC at capacity)
- Engine bottleneck: load engines maxed out on bandwidth (e.g., cloud engines at 90 Mbps each)
Example (network bottleneck):
| VUs | Bandwidth | Response Time | What It Means |
|---|---|---|---|
| 100 | 200 Mbps | 100ms | Good |
| 500 | 900 Mbps | 150ms | Approaching 1 Gbps NIC limit |
| 1000 | 1000 Mbps | 5000ms | Network maxed out, server can't send more data |
This tells you the server's 1 Gbps network interface is the bottleneck. Not CPU, not database. The network.
Fix: Upgrade to a 10 Gbps NIC, or add a load balancer with multiple servers.
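You can estimate in advance where a NIC will cap your test, from the per-VU bandwidth figures in the table above (illustrative numbers, not measurements):

```python
# Sketch: estimate the VU count at which a NIC becomes the bottleneck.
# Per-VU bandwidth figures are illustrative, taken from the table above.
def max_vus_for_nic(nic_mbps, per_vu_mbps):
    return int(nic_mbps / per_vu_mbps)

# Image-heavy site: ~0.5 Mbps per VU (50 Mbps at 100 VUs in the table).
print(max_vus_for_nic(1000, 0.5))   # a 1 Gbps NIC caps out near 2000 VUs
print(max_vus_for_nic(10000, 0.5))  # a 10 Gbps NIC raises that tenfold
```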
Engine Bandwidth Monitoring¶
Monitor engine bandwidth in Engines View to ensure engines aren't the bottleneck:
| Engine | Bandwidth | Status | What It Means |
|---|---|---|---|
| Engine 1 | 35 Mbps | OK | Plenty of headroom |
| Engine 2 | 89 Mbps | ⚠️ Warning | Near capacity (cloud engines max ~90 Mbps) |
If engine bandwidth exceeds 80 Mbps: Add more engines to distribute the bandwidth load.
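That check is a one-liner. A sketch, using the 80 Mbps warning level from the text above:

```python
# Sketch: flag engines approaching their bandwidth ceiling.
# The 80 Mbps warning level comes from the guidance above.
def engines_needing_relief(engine_mbps, warn_at=80):
    return [name for name, mbps in engine_mbps.items() if mbps > warn_at]

print(engines_needing_relief({"Engine 1": 35, "Engine 2": 89}))  # ['Engine 2']
```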
See: Cloud Load Testing for engine bandwidth expectations.
Virtual Users: Load Level¶
VU count shows the current load level. More VUs means more concurrent users.
VU Ramp Monitoring¶
VUs should increase according to load profile:
- Stepped profile: VUs increase in discrete steps (e.g., 100 → 150 → 200 every 5 min)
- Exponential profile: VUs increase by percentage (e.g., 100 → 125 → 156 → 195)
- Constant profile: VUs stay constant (e.g., 100 for entire test)
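The first two profiles above can be sketched as simple schedule generators (parameters here are the examples from the list, not defaults):

```python
# Sketch: the stepped and exponential ramp profiles as VU schedules.
def stepped(start, step, levels):
    """Discrete steps: start, start+step, start+2*step, ..."""
    return [start + step * i for i in range(levels)]

def exponential(start, pct, levels):
    """Each level grows by a fixed percentage over the previous one."""
    vus, out = start, []
    for _ in range(levels):
        out.append(round(vus))
        vus *= 1 + pct / 100
    return out

print(stepped(100, 50, 4))      # [100, 150, 200, 250]
print(exponential(100, 25, 4))  # [100, 125, 156, 195]
```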
If VUs don't increase on schedule, one of three things happened. Engines detected overload (CPU > 90%) and self-regulated. Engine capacity was exceeded (you asked for 5,000 VUs but engine max is 3,000). Or the test duration was too short to complete all the ramps. Check the Engines View for warnings or "Overloaded" status.
VUs per Engine Distribution¶
VUs should distribute evenly across engines:
| Engine | VUs | Status | Good/Bad |
|---|---|---|---|
| Engine 1 | 167 | OK | ✅ Balanced |
| Engine 2 | 167 | OK | ✅ Balanced |
| Engine 3 | 166 | OK | ✅ Balanced |
Unbalanced distribution (problem):
| Engine | VUs | Status | Good/Bad |
|---|---|---|---|
| Engine 1 | 450 | Overloaded | ❌ Imbalanced |
| Engine 2 | 25 | OK | ❌ Imbalanced |
| Engine 3 | 25 | OK | ❌ Imbalanced |
This indicates an engine configuration issue or outright engine failure: Engine 1 didn't recognize the other engines and tried to carry the whole load itself.
Errors/Sec: Application Health¶
Errors/sec shows failed transactions: HTTP errors, timeouts, connection failures.
What Error Rate Means¶
| Errors/Sec | Error Rate | What It Means |
|---|---|---|
| 0 | 0% | Perfect, all transactions succeeding |
| < 5 | < 1% | Acceptable, occasional transient errors |
| 5-50 | 1-10% | Concerning, investigate root cause |
| > 50 | > 10% | Critical, application broken under load |
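The severity bands in the table reduce to a small lookup. A sketch, using the percentage thresholds above:

```python
# Sketch of the severity bands in the table above (thresholds from the table).
def error_severity(errors, total):
    rate = errors / total * 100
    if rate == 0:
        return "perfect"
    if rate < 1:
        return "acceptable"
    if rate <= 10:
        return "concerning"
    return "critical"

print(error_severity(0, 1000))    # perfect
print(error_severity(5, 1000))    # acceptable (0.5%)
print(error_severity(50, 1000))   # concerning (5%)
print(error_severity(200, 1000))  # critical (20%)
```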
Common error types:
| HTTP Status | Error Type | Likely Cause |
|---|---|---|
| 401 Unauthorized | Authentication failure | Session expired, auth tokens invalid |
| 403 Forbidden | Permission denied | CSRF token missing, session security check failed |
| 404 Not Found | Resource not found | Dynamic URL correlation failed, resource deleted |
| 500 Internal Server Error | Server-side error | Application bug, database error, exception |
| 502 Bad Gateway | Proxy/load balancer error | Backend server down |
| 503 Service Unavailable | Server overloaded | Connection pool exhausted, server shutdown |
| 504 Gateway Timeout | Timeout | Backend server too slow |
| Connection refused | Network error | Server not listening, firewall blocking |
| Read timeout | Response timeout | Server processing took too long |
Error Rate During Load Ramp¶
When errors appear tells you what caused them:
| VU Level | Error Rate | Response Time | Diagnosis |
|---|---|---|---|
| 0-200 VUs | 0% | 100ms | Good |
| 300 VUs | 5% (503 errors) | 500ms | Connection pool exhaustion starting |
| 400 VUs | 25% (503 errors) | 5000ms | Server overloaded |
| 500 VUs | 50% (503 errors + timeouts) | Timeouts | Server critically overloaded |
This tells you the server's capacity limit is around 250 VUs. Beyond that, the connection pool exhausts and errors start piling up.
What to do: Check error details in the Errors View for specific messages. Increase the connection pool on the server (say, from 100 to 500 database connections). Re-run the test to verify the fix.
Ask the AI to Diagnose Error Patterns
If you see errors during load testing:
I'm getting 503 errors starting at 300 VUs. Response times are 5000ms and
server CPU is only 40%. What's wrong?
The AI can:
- Correlate error types with server metrics to identify root cause
- Distinguish between application errors (bugs) vs. capacity errors (overload)
- Explain why specific HTTP status codes appear under load (503 = service unavailable, likely connection pool)
- Recommend configuration changes (increase connection pools, add caching, optimize queries)
- Suggest whether errors are acceptable (< 1%) or critical (> 10%)
Server Metrics: Identifying Bottlenecks¶
Server-side metrics reveal WHY performance degrades. Response times tell you there's a problem. Server metrics tell you what the problem is.
CPU %: Compute Capacity¶
CPU utilization shows how much compute capacity is used:
| CPU % | What It Means | Action |
|---|---|---|
| < 50% | Plenty of capacity | Keep ramping load |
| 50-70% | Moderate usage | Watch for degradation |
| 70-90% | High usage | Approaching limit |
| > 90% | Critically high | CPU bottleneck: optimize code or add CPU |
Correlating CPU with response times:
| CPU % | Response Time | Diagnosis |
|---|---|---|
| 40% | 100ms | CPU not the bottleneck (plenty of capacity) |
| 70% | 200ms | CPU moderately loaded (normal degradation) |
| 95% | 5000ms | CPU is the bottleneck: server can't process requests fast enough |
If CPU hits 100% and response times spike: you're CPU-bound. Optimize application code, add CPU cores, or scale horizontally by adding servers.
Memory %: Memory Capacity¶
Memory utilization shows RAM usage:
| Memory % | What It Means | Action |
|---|---|---|
| < 70% | Healthy | Normal |
| 70-85% | Moderate | Watch for growth |
| 85-95% | High | Potential memory pressure |
| > 95% | Critical | Memory bottleneck or leak |
Memory leak pattern:
| Time | Memory % | Response Time | Diagnosis |
|---|---|---|---|
| 0 min | 30% | 100ms | Good |
| 30 min | 50% | 150ms | Growing (expected) |
| 60 min | 75% | 500ms | Concerning |
| 90 min | 95% | 5000ms | Memory leak: memory keeps growing |
| 120 min | 100% (OOM) | Crash | Server ran out of memory |
If memory keeps growing throughout the test, even at constant VU load, you have a memory leak. The application isn't releasing memory that it should be.
What to do: Profile the application with a memory profiler, identify the leak, fix the code. No shortcut.
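You can flag the leak pattern during the test itself: at constant VU load, memory should plateau, not climb monotonically. A minimal sketch, where the 5%-per-hour growth threshold is an illustrative assumption:

```python
# Sketch: flag a suspected memory leak from periodic memory samples taken
# at constant VU load. The 5%/hour threshold is an illustrative assumption.
def leak_suspected(samples_pct, interval_min, max_growth_pct_per_hour=5):
    """True if memory rises monotonically and faster than the threshold."""
    duration_hours = (len(samples_pct) - 1) * interval_min / 60
    growth = samples_pct[-1] - samples_pct[0]
    steadily_rising = all(b >= a for a, b in zip(samples_pct, samples_pct[1:]))
    return steadily_rising and growth / duration_hours > max_growth_pct_per_hour

# Memory climbing 30% -> 95% over 90 minutes (the leak table above).
print(leak_suspected([30, 50, 75, 95], 30))  # True
print(leak_suspected([30, 32, 31, 33], 30))  # False (stable with noise)
```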
Database Metrics¶
Database-specific metrics (if monitoring database server):
| Metric | What to Watch | Red Flag |
|---|---|---|
| DB CPU % | < 80% | > 90% = database compute-bound |
| DB Connections | < max pool size | At max pool size = connection pool exhausted |
| Query time (avg) | < 100ms | > 1000ms = slow queries |
| Lock wait time | < 10ms | > 100ms = database locking/deadlocks |
| Disk I/O % | < 70% | > 90% = disk bottleneck (slow storage) |
Example (database bottleneck):
| Metric | Value | Diagnosis |
|---|---|---|
| Web server CPU | 30% | Plenty of capacity |
| Web server memory | 40% | Plenty of capacity |
| Database CPU | 95% | Bottleneck |
| Database connections | 85 / 100 | Not maxed |
| Query time (avg) | 2000ms | Slow queries |
This tells you the database is the bottleneck, not the web server. Optimize queries, add indexes, or add database CPU capacity. The web server is sitting there waiting for the database to finish.
Correlating Metrics to Find Bottlenecks¶
The power of monitoring is correlation. Any single metric in isolation is ambiguous. Combined, they reveal root causes.
Correlation Pattern 1: CPU Bottleneck¶
| Response Time | Hits/Sec | Server CPU | Database CPU | Diagnosis |
|---|---|---|---|---|
| ⬆️ Spiking | ⬇️ Plateaus | ⬆️ 95% | 40% | Web server CPU bottleneck |
Fix: Optimize application code, add CPU cores, or add web servers.
Correlation Pattern 2: Database Bottleneck¶
| Response Time | Hits/Sec | Server CPU | Database CPU | Diagnosis |
|---|---|---|---|---|
| ⬆️ Spiking | ⬇️ Plateaus | 40% | ⬆️ 95% | Database CPU bottleneck |
Fix: Optimize queries, add indexes, add database CPU capacity, or add read replicas.
Correlation Pattern 3: Network Bottleneck¶
| Response Time | Bandwidth | Server CPU | Server Network | Diagnosis |
|---|---|---|---|---|
| ⬆️ Spiking | ⬆️ Maxed (1 Gbps) | 50% | ⬆️ 100% | Network bandwidth bottleneck |
Fix: Upgrade NIC to 10 Gbps, add CDN for static assets, or optimize response sizes.
Correlation Pattern 4: Connection Pool Exhaustion¶
| Response Time | Errors/Sec | Server CPU | DB Connections | Diagnosis |
|---|---|---|---|---|
| ⬆️ Spiking | ⬆️ 503 errors | 40% | ⬆️ 100/100 (maxed) | Connection pool exhausted |
Fix: Increase database connection pool size (e.g., 100 → 500 connections).
Correlation Pattern 5: Memory Leak¶
| Time | Response Time | Memory % | CPU % | Diagnosis |
|---|---|---|---|---|
| 0-30 min | 100ms | 30% → 50% | 60% | Normal |
| 30-60 min | 200ms | 50% → 75% | 60% | Memory growing (CPU constant) |
| 60-90 min | 1000ms | 75% → 95% | 60% | Memory leak |
| 90 min | Crash (OOM) | 100% | N/A | Out of memory |
Fix: Profile application, find leak, fix code.
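The correlation patterns above amount to a small rule table. Here's a sketch; the thresholds and rule ordering are illustrative assumptions, not a complete diagnostic:

```python
# Sketch: the correlation patterns above as ordered rules.
# Thresholds and rule order are illustrative assumptions.
def diagnose(server_cpu, db_cpu, db_conns, db_conns_max, net_pct):
    if db_conns >= db_conns_max:
        return "connection pool exhausted"
    if db_cpu > 90:
        return "database CPU bottleneck"
    if server_cpu > 90:
        return "web server CPU bottleneck"
    if net_pct >= 95:
        return "network bandwidth bottleneck"
    return "no clear bottleneck in these metrics"

# Pattern 2 above: web server idle, database pegged.
print(diagnose(server_cpu=40, db_cpu=95, db_conns=85, db_conns_max=100,
               net_pct=30))  # database CPU bottleneck
```

Checking pool exhaustion first matters: a maxed pool often drives database CPU up too, and the pool is the cheaper fix.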
Ask the AI to Correlate Metrics
If you're struggling to identify the bottleneck:
Response times are 5000ms at 300 VUs. Server CPU is 40%, memory is 50%, but
database CPU is 95%. What's the bottleneck and how do I fix it?
The AI can:
- Analyze combinations of metrics to pinpoint the exact bottleneck
- Distinguish between application bottlenecks (code) vs. infrastructure bottlenecks (CPU/memory/network)
- Recommend immediate fixes (increase connection pools, optimize queries)
- Suggest long-term architectural improvements (caching, read replicas, CDN)
- Validate your diagnosis before you make expensive infrastructure changes
Real-Time Degradation Detection¶
Detecting performance degradation during the test lets you intervene before wasting hours on a broken test.
Automated Warning Signs¶
Load Tester monitors for these conditions automatically:
| Condition | Warning Level | What It Means |
|---|---|---|
| Engine CPU > 90% | ⚠️ Warning | Engine overloaded, may self-regulate |
| Engine bandwidth > 80 Mbps | ⚠️ Warning | Engine near bandwidth limit |
| Error rate > 10% | 🚨 Critical | Application broken under load |
| Response time > 30 seconds | 🚨 Critical | Server severely overloaded or timing out |
| VUs not ramping | ⚠️ Warning | Engine self-regulation or capacity limit |
When warnings appear, investigate immediately. Don't wait for the test to finish.
Manual Degradation Detection¶
Watch for these patterns during the test:
| Pattern | What to Watch | Action |
|---|---|---|
| Response time doubles | 100ms → 200ms | Note VU count, approaching capacity limit |
| Response time increases 10x | 100ms → 1000ms+ | Stop and investigate, something broke |
| Errors appear | 0% → 5%+ | Check Errors View for error types |
| Hits/sec plateaus | Increasing → flat | Server maxed out, note capacity limit |
| Memory keeps growing | 30% → 50% → 70% → ... | Potential memory leak, watch closely |
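The manual checks in the table can be sketched as a streaming detector. This uses the first sample as the baseline and the 2x/10x multipliers from the table; the 5% error-rate trigger is an illustrative assumption:

```python
# Sketch: a streaming check for the warning signs above. Baseline is the
# first sample; 2x/10x multipliers follow the table, 5% errors is assumed.
def degradation_alerts(samples):
    """samples: list of (response_time_ms, error_rate_pct) per interval."""
    baseline = samples[0][0]
    alerts = []
    for i, (rt, err) in enumerate(samples):
        if rt >= baseline * 10:
            alerts.append((i, "response time 10x baseline: stop and investigate"))
        elif rt >= baseline * 2:
            alerts.append((i, "response time doubled: note VU count"))
        if err >= 5:
            alerts.append((i, "error rate >= 5%: check Errors View"))
    return alerts

for alert in degradation_alerts([(100, 0), (150, 0), (250, 0), (1200, 8)]):
    print(alert)
```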
Ask the AI for Real-Time Alerts
Configure the AI to monitor your test in real time:
Monitor my load test and alert me if response times increase 5x or if error
rate exceeds 5%. I'm ramping from 100 to 1000 VUs over 60 minutes.
The AI can:
- Watch metrics in real time and alert you to degradation patterns
- Detect capacity limits as they're reached (response times spike at X VUs)
- Identify correlation breakdowns (hits/sec plateaus while VUs keep increasing)
- Recommend stopping the test early if conditions are critical (50% error rate)
- Suggest immediate actions during live tests (add engines, adjust ramp rates)
Next Steps¶
After monitoring your load test:
- Analyze results: See Analyzing Results
- Interactive dashboard: See Embedded Analytics Dashboard
- Identify bottlenecks: See Performance Analysis Workflow
- Export reports: See Legacy Reports for archival/sharing
If you need to optimize:
- Server monitoring: See Server Monitoring
- Cloud engine optimization: See Cloud Load Testing
- Test case troubleshooting: See Debugging Failed Replays