Understanding Metrics¶
Load testing generates hundreds of metrics. Response times, throughput, error rates, server CPU, memory, database connections. All of it matters, but not equally. The metrics that matter most depend on what broke.
This guide explains WHY each metric exists, what trends reveal about your application, and how to interpret metric patterns to identify bottlenecks.
The Core Principle: Metrics Tell a Story¶
Metrics don't exist in isolation. They interact. Response times spike (symptom) because database CPU hit 100% (cause) because queries aren't indexed (root cause).
Your job: find the narrative. What happened? When? Why?
Example story:
| Metric | Trend | What It Means |
|---|---|---|
| Response Time | 100ms → 5000ms at 300 VUs | SYMPTOM: Performance degraded |
| Hits/sec | Increasing → Plateaued at 2500 | EVIDENCE: Server maxed out |
| Database CPU | 40% → 95% at 300 VUs | CAUSE: Database bottleneck |
| Slow Query Log | 10 queries taking > 2 seconds | ROOT CAUSE: Unindexed queries |
This is how you diagnose bottlenecks: metrics reveal the chain of cause and effect.
Metrics Hierarchy: What to Watch First¶
Not all metrics are equally important. Follow this priority:
Tier 1: User Experience (Watch First)¶
These metrics answer: Is the application fast enough for users?
| Metric | Target | Why It Matters |
|---|---|---|
| Response Time (avg) | < 2000ms | Users perceive anything > 2s as "slow" |
| Response Time (95th %ile) | < 3000ms | Worst-case user experience (5% of users wait this long) |
| Error Rate % | < 1% | Users see broken functionality |
If Tier 1 metrics are good, you're done. Users are happy. If not, move to Tier 2.
Tier 2: Throughput (Diagnose Capacity)¶
These metrics answer: Can the server handle the load?
| Metric | What to Watch | Why It Matters |
|---|---|---|
| Hits/sec | Should increase with VUs | Throughput plateau = server maxed out |
| Bandwidth (Mbps) | Should increase with VUs | Bandwidth plateau = network bottleneck |
| Virtual Users | Increasing as planned | VU plateau = engine overloaded |
If throughput plateaus while VUs keep increasing, the server is maxed out. Check Tier 3 to find WHY.
Tier 3: Server Resources (Find Bottleneck)¶
These metrics answer: Which resource is limiting performance?
| Metric | Red Flag | What It Means |
|---|---|---|
| Server CPU % | > 90% | CPU bottleneck (optimize code or add CPU) |
| Server Memory % | > 95% | Memory bottleneck or leak |
| Database CPU % | > 90% | Database bottleneck (optimize queries) |
| Database Connections | = Max pool size | Connection pool exhausted |
| Disk I/O % | > 90% | Disk bottleneck (slow storage) |
| Network Bandwidth | = NIC capacity | Network bottleneck (upgrade NIC) |
Whichever hits 100% first is the bottleneck. Fix that resource before scaling further.
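The "whichever hits 100% first" rule can be expressed as a small check. A minimal sketch (the metric names and red-flag thresholds below are illustrative, taken from the table above, and not tied to any specific monitoring API):

```python
# Red-flag thresholds from the Tier 3 table (illustrative names, not an API).
RED_FLAGS = {
    "server_cpu_pct": 90,
    "server_memory_pct": 95,
    "database_cpu_pct": 90,
    "disk_io_pct": 90,
}

def find_bottleneck(metrics: dict) -> list:
    """Return the resources exceeding their red-flag threshold, worst first."""
    flagged = [(name, value) for name, value in metrics.items()
               if value >= RED_FLAGS.get(name, 100)]
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in flagged]

# Example: database CPU is pinned while the web server idles.
print(find_bottleneck({
    "server_cpu_pct": 40,
    "server_memory_pct": 50,
    "database_cpu_pct": 95,
    "disk_io_pct": 20,
}))  # ['database_cpu_pct']
```

In practice you would feed this from whatever monitoring agent you run; the point is that the diagnosis is a simple threshold comparison, not guesswork.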
Key Metrics in Detail¶
Response Time: The Primary Performance Metric¶
What it measures: Time from sending HTTP request to receiving complete response (network latency + server processing + data transfer).
Why it matters: This is what users experience. All other metrics are diagnostic tools. Response time is the truth.
What "good" looks like:
| Application Type | Good | Acceptable | Slow |
|---|---|---|---|
| Static content (images, CSS, JS) | < 100ms | < 500ms | > 500ms |
| Dynamic pages (HTML with DB queries) | < 500ms | < 2000ms | > 2000ms |
| API calls (simple) | < 200ms | < 1000ms | > 1000ms |
| Complex operations (reports, batch jobs) | < 2000ms | < 5000ms | > 5000ms |
These are guidelines. Your application's acceptable response times depend on what your users will tolerate.
Interpreting Response Time Trends¶
Pattern 1: Flat Line (Ideal)
Response Time (ms)
200 |████████████████████████████████
|
0 +--------------------------------
0 100 200 300 400 500 (VUs)
What this means: Server has capacity to spare. Response times don't degrade as load increases.
Why this happens: Server resources (CPU, memory, database) are underused (e.g., CPU at 40%).
What to do: Keep ramping VUs to find the capacity limit.
Pattern 2: Gradual Increase (Normal)
Response Time (ms)
800 | ██████████
600 | ████████
400 | ████████
200 |██████
+----------------------------------
0 100 200 300 400 500 (VUs)
What this means: Performance degrades proportionally with load. This is NORMAL for most applications.
Why this happens: Resource contention increases as VUs increase (more database connections, more CPU threads, more memory usage).
What to do: Acceptable if response times stay under business requirements (e.g., < 2000ms). If degradation is too steep, investigate resource bottlenecks.
Pattern 3: Sharp Spike (Capacity Exceeded)
Response Time (ms)
8000| ████
2000| ████
500| ████████
100|████████
+----------------------------
0 100 200 300 400 (VUs)
What this means: Hard limit reached. Response times jumped from 500ms to 8000ms at ~350 VUs.
Why this happens: Resource exhaustion: database connection pool full (most common), memory exhausted, CPU maxed, thread pool saturated.
What to do:
- Note the VU count when spike occurred (capacity limit = 350 VUs)
- Check server metrics (CPU, memory, database connections) to identify which resource maxed out
- Check Errors View for error messages (may reveal "connection pool exhausted" or similar)
Pattern 4: Erratic Spikes (Intermittent Issues)
Response Time (ms)
5000| ██ ██ ██
2000| ██ ██ ████
500|████████████████████████████
+-------------------------------
0 100 200 300 400 500 (VUs)
What this means: Occasional slow requests (outliers). Most requests are fast, but some take 10x longer.
Why this happens:
- Garbage collection pauses (JVM, .NET CLR)
- Database query timeouts (slow queries that occasionally take 10x longer)
- Network hiccups (packet loss, retransmissions)
- Background jobs (cron jobs, scheduled tasks competing for resources)
What to do:
- Check if spikes correlate with time (e.g., every 5 minutes = scheduled job)
- Review server logs during spike periods (GC logs, database slow query logs)
- If spikes are random and infrequent (< 5% of requests): May be acceptable noise
- If spikes are frequent (> 10% of requests): Investigate root cause
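The first check above, whether spikes correlate with time, can be sketched in a few lines. A minimal illustration, assuming you have already extracted the timestamps (in seconds) of the spike events; the 10% tolerance is an arbitrary example value:

```python
from statistics import mean, pstdev

def looks_periodic(spike_times_sec: list, tolerance: float = 0.1) -> bool:
    """True if the intervals between spikes are regular (relative spread
    below tolerance), suggesting a scheduled job or GC cycle rather than
    random noise."""
    if len(spike_times_sec) < 3:
        return False  # need at least two intervals to judge regularity
    intervals = [b - a for a, b in zip(spike_times_sec, spike_times_sec[1:])]
    avg = mean(intervals)
    return avg > 0 and pstdev(intervals) / avg < tolerance

# Spikes at almost exactly 5-minute intervals: likely a scheduled task.
print(looks_periodic([300, 600, 901, 1199, 1500]))  # True
# Irregular gaps: more likely random outliers (GC, network hiccups).
print(looks_periodic([300, 450, 1200, 1300]))       # False
```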
Response Time Percentiles: Why They Matter¶
Average response time hides outliers. If 95% of requests are 100ms but 5% are 10 seconds, the average might show 600ms. That number is technically correct and practically useless.
Percentile metrics reveal the full picture:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Response Time (avg) | Mean of all response times | Quick overview, but hides outliers |
| Response Time (50th %ile / median) | Half of requests are faster, half slower | More realistic "typical" user experience |
| Response Time (95th %ile) | 95% of requests are faster | Worst-case experience for most users |
| Response Time (99th %ile) | 99% of requests are faster | Outliers (slowest 1% of requests) |
| Response Time (max) | Slowest request | Extreme outliers (often anomalies) |
Example (why percentiles matter):
| Metric | Value | Interpretation |
|---|---|---|
| Average | 500ms | Looks acceptable |
| 95th percentile | 8000ms | 🚨 5% of users wait 8 seconds! |
| Diagnosis | Database query timeouts | Outliers reveal real problem |
Rule of thumb: 95th percentile should be < 2x average. If 95th percentile is 10x average, you have outliers.
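The numbers above can be reproduced with the standard library. A minimal sketch of the 95%-at-100ms / 5%-at-10s example, including the 2x rule-of-thumb check:

```python
from statistics import mean, quantiles

# 95% of requests at 100 ms, 5% at 10 000 ms (the example above).
samples_ms = [100] * 95 + [10_000] * 5

avg = mean(samples_ms)                  # 595 ms: looks acceptable
p95 = quantiles(samples_ms, n=100)[94]  # 95th percentile cut point
p99 = quantiles(samples_ms, n=100)[98]  # 99th percentile cut point

print(f"avg={avg:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
print("outliers present:", p95 > 2 * avg)  # rule of thumb from above
```

The average comes out near 600 ms while the 95th percentile lands near 10 seconds, which is exactly the gap the rule of thumb is designed to catch.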
Ask the AI to Interpret Response Time Trends
If you see unusual response time patterns, describe the trend to the AI and ask it to interpret them.
The AI can:
- Analyze percentile distributions to identify outlier patterns
- Distinguish between normal variance vs. systemic issues (GC pauses, slow queries)
- Recommend specific investigations (enable GC logging, check database slow query log)
- Explain whether outliers are acceptable (< 5%) or critical (> 10%)
Hits/Sec: Server Throughput¶
What it measures: Number of HTTP transactions completed per second across all virtual users.
Why it matters: This is your server's capacity: how many requests per second can it handle? If hits/sec plateaus while VUs keep increasing, the server is maxed out.
What "good" looks like: Hits/sec should increase linearly as VUs increase.
| VUs | Expected Hits/Sec (Typical Web App) | Why |
|---|---|---|
| 100 | ~500-1000 | Each VU generates ~5-10 hits/sec (each page load fetches many resources) |
| 200 | ~1000-2000 | Linear scaling (2x VUs = 2x hits/sec) |
| 500 | ~2500-5000 | Continues scaling |
If hits/sec stops increasing even though VUs keep ramping:
- Server is maxed out: can't process more requests even though you're sending them
- VUs are waiting for slow responses (response times spiking)
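The plateau check can be sketched as a comparison of growth rates between measurement intervals: VUs kept growing, but throughput did not keep pace. A minimal illustration (the 0.25 ratio threshold is an arbitrary example; tune it to your data):

```python
def throughput_plateaued(vus: list, hits_per_sec: list,
                         ratio_threshold: float = 0.25) -> bool:
    """Flag a plateau: VUs grew over the last interval, but hits/sec grew
    by less than ratio_threshold of the VU growth rate."""
    vu_growth = (vus[-1] - vus[-2]) / vus[-2]
    hit_growth = (hits_per_sec[-1] - hits_per_sec[-2]) / hits_per_sec[-2]
    return vu_growth > 0 and hit_growth < ratio_threshold * vu_growth

# VUs doubled (250 -> 500) but throughput barely moved: server maxed out.
print(throughput_plateaued([100, 250, 500], [1000, 2400, 2500]))  # True
# Throughput still tracking VU growth: healthy linear scaling.
print(throughput_plateaued([100, 200, 300], [1000, 2000, 2950]))  # False
```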
Interpreting Hits/Sec Trends¶
Good (linear scaling):
Hits/Sec
2000| ██████████
1500| ████████
1000| ████████
500|██████
+----------------------------------
0 100 200 300 400 500 (VUs)
Hits/sec increases with VUs → server can process more requests as load increases.
Problem (throughput plateau):
Hits/Sec
2500| ████████████████████████
2000| ████
1000|████
+----------------------------------
0 100 200 300 400 500 (VUs)
Hits/sec plateaus at ~2500 → server maxed out, can't process more requests even though VUs keep increasing.
What this means: Server's maximum capacity is ~2500 requests/sec (regardless of VU count).
Why this happens: Resource constraint (CPU maxed, database connections exhausted, thread pool saturated).
What to do: Check server metrics (CPU, memory, database) to find which resource is maxed out.
Hits/Sec vs. Response Time Correlation¶
The relationship between hits/sec and response time reveals server behavior:
| Hits/Sec | Response Time | What It Means |
|---|---|---|
| Increasing | Flat/Low | Server handling load easily (plenty of capacity) |
| Increasing | Gradually increasing | Server handling load but approaching limits |
| Plateaus | Spiking | Server maxed out, can't process more requests |
| Decreasing | Spiking | Server overloaded, processing FEWER requests because it's so slow |
Decreasing hits/sec is a red flag. The server is so overloaded it's actually processing fewer requests than before.
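The correlation table above maps directly to a small lookup. An illustrative sketch, with trends simplified to a few string labels of my choosing:

```python
def server_state(hits_trend: str, rt_trend: str) -> str:
    """Map the hits/sec + response-time correlation table to a diagnosis.
    Trends are simplified to 'up', 'flat', 'down', or 'spiking'."""
    if hits_trend == "up" and rt_trend == "flat":
        return "healthy: plenty of capacity"
    if hits_trend == "up" and rt_trend == "up":
        return "approaching limits"
    if hits_trend == "flat" and rt_trend == "spiking":
        return "maxed out: throughput plateau"
    if hits_trend == "down" and rt_trend == "spiking":
        return "overloaded: processing fewer requests"
    return "unclassified"

print(server_state("down", "spiking"))  # overloaded: processing fewer requests
```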
Bandwidth: Network Throughput¶
What it measures: Data transferred per second (download + upload) in Mbps or Gbps.
Why it matters: Network bottlenecks are invisible without this metric. If bandwidth plateaus, you've maxed out the network, not the server.
What "good" looks like: Bandwidth should increase as VUs increase (more users = more data transferred).
| VUs | Expected Bandwidth (Image-Heavy Site) | Expected Bandwidth (Text-Heavy API) |
|---|---|---|
| 100 | ~50 Mbps | ~5 Mbps |
| 500 | ~250 Mbps | ~25 Mbps |
| 1000 | ~500 Mbps | ~50 Mbps |
If bandwidth plateaus (stops increasing even though VUs increase):
- Network bottleneck: server's network interface maxed out (e.g., 1 Gbps NIC at capacity)
- Engine bottleneck: load engines maxed out on bandwidth (cloud engines ~90 Mbps each)
Interpreting Bandwidth Trends¶
Network Bottleneck Example:
| VUs | Bandwidth | Response Time | Diagnosis |
|---|---|---|---|
| 100 | 200 Mbps | 100ms | Good |
| 500 | 900 Mbps | 150ms | Approaching 1 Gbps NIC limit |
| 1000 | 1000 Mbps | 5000ms | Network maxed out, server can't send more data |
This tells you: Server's 1 Gbps network interface is the bottleneck (not CPU, not database).
Fix: Upgrade to 10 Gbps NIC or add load balancer with multiple servers.
Virtual Users (VUs): Load Level¶
What it measures: Number of concurrent virtual users executing the test case.
Why it matters: VU count defines load level: more VUs = more concurrent users. This is your X-axis for all performance analysis.
What "good" looks like: VUs should increase according to your load profile:
- Stepped profile: VUs increase in discrete steps (e.g., 100 → 150 → 200 every 5 min)
- Exponential profile: VUs increase by percentage (e.g., 100 → 125 → 156 → 195)
- Constant profile: VUs stay constant (e.g., 100 for entire test)
If VUs don't increase on schedule:
- Engine self-regulation: Engines detected overload (CPU > 90%) and stopped adding VUs
- Engine capacity exceeded: Requested 5000 VUs but engine max is 3000
- Test duration too short: Not enough time to complete all ramps
Errors/Sec: Application Health¶
What it measures: Number of failed transactions per second (HTTP errors, timeouts, connection failures).
Why it matters: Errors indicate broken functionality. Users see 404 pages, 500 errors, or timeouts. Even a 1% error rate means 1 in 100 users fails.
What "good" looks like: Zero errors.
| Error Rate | Severity | What It Means |
|---|---|---|
| 0% | ✅ Perfect | All transactions succeeding |
| < 1% | ⚠️ Acceptable | Occasional transient errors (network hiccups) |
| 1-10% | 🚨 Concerning | Application issues under load (investigate immediately) |
| > 10% | 💀 Critical | Application broken (stop test, fix issues) |
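The severity bands translate to a trivial classifier, shown here only to make the thresholds from the table explicit:

```python
def error_severity(error_rate_pct: float) -> str:
    """Severity bands from the error-rate table above."""
    if error_rate_pct == 0:
        return "perfect"
    if error_rate_pct < 1:
        return "acceptable"
    if error_rate_pct <= 10:
        return "concerning"
    return "critical"

print(error_severity(0.4))   # acceptable
print(error_severity(25.0))  # critical
```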
Common Error Types and Their Causes¶
| HTTP Status | Error Type | Likely Cause |
|---|---|---|
| 401 Unauthorized | Authentication failure | Session expired, auth tokens invalid |
| 403 Forbidden | Permission denied | CSRF token missing, session security check failed |
| 404 Not Found | Resource not found | Dynamic URL correlation failed, resource deleted |
| 500 Internal Server Error | Server-side error | Application bug, database error, exception |
| 502 Bad Gateway | Proxy/load balancer error | Backend server down |
| 503 Service Unavailable | Server overloaded | Connection pool exhausted, server shutdown |
| 504 Gateway Timeout | Timeout | Backend server too slow |
| Connection refused | Network error | Server not listening, firewall blocking |
| Read timeout | Response timeout | Server processing took too long |
When Errors Appear Reveals the Cause¶
| VU Level | Error Rate | Response Time | Diagnosis |
|---|---|---|---|
| 0-200 VUs | 0% | 100ms | Good |
| 300 VUs | 5% (503 errors) | 500ms | Connection pool exhaustion starting |
| 400 VUs | 25% (503 errors) | 5000ms | Server overloaded |
| 500 VUs | 50% (503 errors + timeouts) | Timeouts | Server critically overloaded |
This tells you: Server's capacity limit is ~250 VUs. Beyond that, the connection pool exhausts and errors start.
Server Metrics: CPU, Memory, Disk, Network¶
What they measure: Server-side resource utilization (CPU %, Memory %, Disk I/O %, Network %).
Why they matter: Response times tell you WHAT broke. Server metrics tell you WHY. Slow response times with 95% CPU means CPU bottleneck. Slow response times with 30% CPU means the problem is elsewhere: database, disk, network.
CPU %: Compute Capacity¶
What to watch:
| CPU % | What It Means | Action |
|---|---|---|
| < 50% | Plenty of capacity | Keep ramping load |
| 50-70% | Moderate usage | Watch for degradation |
| 70-90% | High usage | Approaching limit |
| > 90% | Critically high | CPU bottleneck: optimize code or add CPU |
Correlating CPU with response times:
| CPU % | Response Time | Diagnosis |
|---|---|---|
| 40% | 100ms | CPU not the bottleneck |
| 70% | 200ms | CPU moderately loaded (normal) |
| 95% | 5000ms | CPU is the bottleneck |
Memory %: Memory Capacity¶
What to watch:
| Memory % | What It Means | Action |
|---|---|---|
| < 70% | Healthy | Normal |
| 70-85% | Moderate | Watch for growth |
| 85-95% | High | Potential memory pressure |
| > 95% | Critical | Memory bottleneck or leak |
Memory leak pattern:
| Time | Memory % | Response Time | Diagnosis |
|---|---|---|---|
| 0 min | 30% | 100ms | Good |
| 60 min | 75% | 500ms | Concerning |
| 120 min | 100% (OOM) | Crash | Memory leak |
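A leak shows up as a sustained positive slope in memory usage under steady load. A minimal least-squares sketch over the table's samples:

```python
def leak_slope_pct_per_hour(minutes: list, memory_pct: list) -> float:
    """Least-squares slope of memory usage over time, in % per hour.
    A sustained positive slope during constant load suggests a leak."""
    n = len(minutes)
    mean_t = sum(minutes) / n
    mean_m = sum(memory_pct) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in zip(minutes, memory_pct))
    den = sum((t - mean_t) ** 2 for t in minutes)
    return (num / den) * 60  # per-minute slope -> per-hour

# The leak pattern from the table: 30% -> 75% -> 100% over two hours.
slope = leak_slope_pct_per_hour([0, 60, 120], [30, 75, 100])
print(f"{slope:.1f}% per hour")  # ~35% per hour: exhaustion within hours
```

A healthy server under constant load should show a slope near zero once warm-up caching settles.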
Metric Correlation: Finding Bottlenecks¶
The power of metrics is correlation. No single metric tells the whole story. Combining them reveals root causes.
Example 1: CPU Bottleneck¶
| Metric | Value | Interpretation |
|---|---|---|
| Response Time | 5000ms | SYMPTOM: Slow |
| Hits/sec | Plateaued | EVIDENCE: Maxed out |
| Server CPU | 95% | CAUSE: CPU bottleneck |
| Database CPU | 40% | NOT the database |
Diagnosis: Web server CPU bottleneck.
Fix: Optimize application code, add CPU cores, or add web servers.
Example 2: Database Bottleneck¶
| Metric | Value | Interpretation |
|---|---|---|
| Response Time | 5000ms | SYMPTOM: Slow |
| Server CPU | 40% | NOT the web server |
| Database CPU | 95% | CAUSE: Database bottleneck |
| Query time (avg) | 2000ms | EVIDENCE: Slow queries |
Diagnosis: Database CPU bottleneck.
Fix: Optimize queries, add indexes, add database CPU capacity, or add read replicas.
Example 3: Connection Pool Exhaustion¶
| Metric | Value | Interpretation |
|---|---|---|
| Response Time | 5000ms | SYMPTOM: Slow |
| Errors/sec | 50 (503 errors) | SYMPTOM: Service unavailable |
| Server CPU | 40% | NOT compute-bound |
| DB Connections | 100/100 (maxed) | CAUSE: Pool exhausted |
Diagnosis: Database connection pool exhausted.
Fix: Increase connection pool size (e.g., 100 → 500 connections).
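A rough way to size the new pool is Little's law: concurrent connections needed ≈ query arrival rate × average query duration. An illustrative sketch (the 1.5x headroom factor is an assumption for burst tolerance, not a standard):

```python
import math

def pool_size_needed(queries_per_sec: float, avg_query_sec: float,
                     headroom: float = 1.5) -> int:
    """Little's law: concurrency = arrival rate x time in system.
    headroom (assumed 1.5x here) covers bursts; round up to whole connections."""
    return math.ceil(queries_per_sec * avg_query_sec * headroom)

# 2500 queries/sec at 60 ms average query time needs ~150 connections,
# so a 100-connection pool exhausts under this load.
print(pool_size_needed(2500, 0.06))  # 225
```

Note that a bigger pool only helps if the database can actually serve the extra concurrency; if database CPU is already pinned, enlarging the pool just moves the queue.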
Ask the AI to Correlate Metrics for You
If you're struggling to identify the bottleneck:
Response times are 5000ms at 300 VUs. Server CPU is 40%, memory is 50%,
but database CPU is 95%. What's the bottleneck and how do I fix it?
The AI can:
- Analyze combinations of metrics to pinpoint the exact bottleneck
- Distinguish between application bottlenecks (code) vs. infrastructure bottlenecks (CPU/memory/network)
- Recommend immediate fixes (increase connection pools, optimize queries)
- Suggest long-term architectural improvements (caching, read replicas, CDN)
- Validate your diagnosis before you make expensive infrastructure changes
Advanced Metrics for Specific Scenarios¶
Time to First Byte (TTFB)¶
What it measures: Time from sending the request to receiving the first byte of the response (network latency + server processing; excludes data transfer).
When to use: Isolate server processing time from network transfer time.
Example:
| Metric | Value | Diagnosis |
|---|---|---|
| Response Time | 5000ms | Total time |
| TTFB | 4900ms | Server processing 98% of time |
| Diagnosis | Server bottleneck | NOT network (transfer is only 100ms) |
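The split is computed directly from the two numbers. A trivial sketch reproducing the table's example:

```python
def ttfb_breakdown(total_ms: float, ttfb_ms: float) -> dict:
    """Split total response time into server-side share vs data transfer."""
    return {
        "server_share_pct": round(100 * ttfb_ms / total_ms, 1),
        "transfer_ms": total_ms - ttfb_ms,
    }

# 5000 ms total, 4900 ms TTFB: the server, not the network, is the problem.
print(ttfb_breakdown(5000, 4900))  # {'server_share_pct': 98.0, 'transfer_ms': 100}
```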
Average Speed (Mbps per Transaction)¶
What it measures: Data transfer rate for individual transactions (response size ÷ download time).
When to use: Identify slow downloads (e.g., large images, PDFs, video).
Example:
| Transaction | Response Size | Response Time | Avg Speed | Diagnosis |
|---|---|---|---|---|
| Homepage | 500 KB | 100ms | 40 Mbps | Good |
| Product Image | 2 MB | 5000ms | 3.2 Mbps | Slow download (network or CDN issue) |
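The Avg Speed column is just bits transferred over seconds elapsed. A minimal sketch reproducing both rows of the table (sizes taken as decimal KB/MB):

```python
def avg_speed_mbps(response_bytes: int, response_time_ms: float) -> float:
    """Average transfer rate for one transaction, in megabits per second."""
    return (response_bytes * 8) / 1_000_000 / (response_time_ms / 1000)

print(round(avg_speed_mbps(500_000, 100), 1))     # 40.0 (homepage row)
print(round(avg_speed_mbps(2_000_000, 5000), 1))  # 3.2  (slow product image)
```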
Page Duration vs. Transaction Duration¶
What it measures:
- Transaction Duration: Single HTTP request/response
- Page Duration: All transactions on a page (including think time between transactions)
When to use: Identify whether pages are slow because of single slow transaction or cumulative effect.
Example:
| Metric | Value | Diagnosis |
|---|---|---|
| Page Duration | 10 seconds | SYMPTOM: Slow page |
| Transaction 1 | 100ms | Fast |
| Transaction 2 | 100ms | Fast |
| Transaction 3 | 9800ms | One slow transaction (investigate this) |
Next Steps: Deep Analysis¶
After understanding metrics:
- Embedded Analytics Dashboard - AI-powered interactive metrics exploration (v7.0)
- Performance Analysis Workflow - Step-by-step bottleneck identification process
- Identifying Bottlenecks - Detailed correlation patterns
- Legacy Reports - Static HTML reports for archival and sharing
Or: Return to the overview:
- Load Test Results Overview - Navigate the results interface
Related Topics: