
Understanding Metrics

Load testing generates hundreds of metrics: response times, throughput, error rates, server CPU, memory, database connections. All of them matter, but not equally. The metrics that matter most depend on what broke.

This guide explains WHY each metric exists, what trends reveal about your application, and how to interpret metric patterns to identify bottlenecks.


The Core Principle: Metrics Tell a Story

Metrics don't exist in isolation. They interact. Response times spike (symptom) because database CPU hit 100% (cause) because queries aren't indexed (root cause).

Your job: find the narrative. What happened? When? Why?

Example story:

| Metric | Trend | What It Means |
| --- | --- | --- |
| Response Time | 100ms → 5000ms at 300 VUs | SYMPTOM: Performance degraded |
| Hits/sec | Increasing → Plateaued at 2500 | EVIDENCE: Server maxed out |
| Database CPU | 40% → 95% at 300 VUs | CAUSE: Database bottleneck |
| Slow Query Log | 10 queries taking > 2 seconds | ROOT CAUSE: Unindexed queries |

This is how you diagnose bottlenecks: metrics reveal the chain of cause and effect.


Metrics Hierarchy: What to Watch First

Not all metrics are equally important. Follow this priority:

Tier 1: User Experience (Watch First)

These metrics answer: Is the application fast enough for users?

| Metric | Target | Why It Matters |
| --- | --- | --- |
| Response Time (avg) | < 2000ms | Users perceive anything > 2s as "slow" |
| Response Time (95th %ile) | < 3000ms | Worst-case experience for most users (5% of users wait longer than this) |
| Error Rate % | < 1% | Users see broken functionality |

If Tier 1 metrics are good, you're done. Users are happy. If not, move to Tier 2.


Tier 2: Throughput (Diagnose Capacity)

These metrics answer: Can the server handle the load?

| Metric | What to Watch | Why It Matters |
| --- | --- | --- |
| Hits/sec | Should increase with VUs | Throughput plateau = server maxed out |
| Bandwidth (Mbps) | Should increase with VUs | Bandwidth plateau = network bottleneck |
| Virtual Users | Increasing as planned | VU plateau = engine overloaded |

If throughput plateaus while VUs keep increasing, the server is maxed out. Check Tier 3 to find WHY.


Tier 3: Server Resources (Find Bottleneck)

These metrics answer: Which resource is limiting performance?

| Metric | Red Flag | What It Means |
| --- | --- | --- |
| Server CPU % | > 90% | CPU bottleneck (optimize code or add CPU) |
| Server Memory % | > 95% | Memory bottleneck or leak |
| Database CPU % | > 90% | Database bottleneck (optimize queries) |
| Database Connections | = Max pool size | Connection pool exhausted |
| Disk I/O % | > 90% | Disk bottleneck (slow storage) |
| Network Bandwidth | = NIC capacity | Network bottleneck (upgrade NIC) |

Whichever hits 100% first is the bottleneck. Fix that resource before scaling further.
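The "whichever hits its limit first" rule can be sketched as a simple threshold scan. A minimal sketch in Python; the metric names, thresholds, and snapshot dict format are illustrative, taken from the table above rather than from any particular monitoring API:

```python
# Sketch of the Tier 3 rule: the first resource past its red-flag
# threshold is the bottleneck. Metric names and the snapshot format
# are illustrative, not from any monitoring API.

RED_FLAGS = {
    "server_cpu_pct": 90,
    "server_memory_pct": 95,
    "database_cpu_pct": 90,
    "disk_io_pct": 90,
}

def find_bottleneck(metrics):
    """Return the first resource exceeding its red-flag threshold, or None."""
    for resource, limit in RED_FLAGS.items():
        if metrics.get(resource, 0) > limit:
            return resource
    return None

snapshot = {"server_cpu_pct": 40, "server_memory_pct": 50, "database_cpu_pct": 95}
print(find_bottleneck(snapshot))  # database_cpu_pct
```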


Key Metrics in Detail

Response Time: The Primary Performance Metric

What it measures: Time from sending HTTP request to receiving complete response (network latency + server processing + data transfer).

Why it matters: This is what users experience. All other metrics are diagnostic tools. Response time is the truth.

What "good" looks like:

| Application Type | Good | Acceptable | Slow |
| --- | --- | --- | --- |
| Static content (images, CSS, JS) | < 100ms | < 500ms | > 500ms |
| Dynamic pages (HTML with DB queries) | < 500ms | < 2000ms | > 2000ms |
| API calls (simple) | < 200ms | < 1000ms | > 1000ms |
| Complex operations (reports, batch jobs) | < 2000ms | < 5000ms | > 5000ms |

These are guidelines. Your application's acceptable response times depend on what your users will tolerate.


Pattern 1: Flat Line (Ideal)

Response Time (ms)
200 |████████████████████████████████
    |
  0 +--------------------------------
    0  100  200  300  400  500 (VUs)

What this means: Server has capacity to spare. Response times don't degrade as load increases.

Why this happens: Server resources (CPU, memory, database) are underused (e.g., CPU at 40%).

What to do: Keep ramping VUs to find the capacity limit.


Pattern 2: Gradual Increase (Normal)

Response Time (ms)
800 |                      ██████████
600 |              ████████
400 |      ████████
200 |██████
    +----------------------------------
    0  100  200  300  400  500 (VUs)

What this means: Performance degrades proportionally with load. This is NORMAL for most applications.

Why this happens: Resource contention increases as VUs increase (more database connections, more CPU threads, more memory usage).

What to do: Acceptable if response times stay under business requirements (e.g., < 2000ms). If degradation is too steep, investigate resource bottlenecks.


Pattern 3: Sharp Spike (Capacity Exceeded)

Response Time (ms)
8000|                    ████
2000|                ████
 500|        ████████
 100|████████
    +----------------------------
    0  100  200  300  400 (VUs)

What this means: Hard limit reached. Response times jumped from 500ms to 8000ms at ~350 VUs.

Why this happens: Resource exhaustion: database connection pool full (most common), memory exhausted, CPU maxed, thread pool saturated.

What to do:

  1. Note the VU count when spike occurred (capacity limit = 350 VUs)
  2. Check server metrics (CPU, memory, database connections) to identify which resource maxed out
  3. Check Errors View for error messages (may reveal "connection pool exhausted" or similar)

Pattern 4: Erratic Spikes (Intermittent Issues)

Response Time (ms)
5000|    ██       ██          ██
2000|    ██       ██      ████
 500|████████████████████████████
    +-------------------------------
    0  100  200  300  400  500 (VUs)

What this means: Occasional slow requests (outliers). Most requests are fast, but some take 10x longer.

Why this happens:

  • Garbage collection pauses (JVM, .NET CLR)
  • Database query timeouts (slow queries that occasionally take 10x longer)
  • Network hiccups (packet loss, retransmissions)
  • Background jobs (cron jobs, scheduled tasks competing for resources)

What to do:

  1. Check if spikes correlate with time (e.g., every 5 minutes = scheduled job)
  2. Review server logs during spike periods (GC logs, database slow query logs)
  3. If spikes are random and infrequent (< 5% of requests): May be acceptable noise
  4. If spikes are frequent (> 10% of requests): Investigate root cause

Response Time Percentiles: Why They Matter

Average response time hides outliers. If 95% of requests are 100ms but 5% are 10 seconds, the average might show 600ms. That number is technically correct and practically useless.

Percentile metrics reveal the full picture:

| Metric | What It Measures | Why It Matters |
| --- | --- | --- |
| Response Time (avg) | Mean of all response times | Quick overview, but hides outliers |
| Response Time (50th %ile / median) | Half of requests are faster, half slower | More realistic "typical" user experience |
| Response Time (95th %ile) | 95% of requests are faster | Worst-case experience for most users |
| Response Time (99th %ile) | 99% of requests are faster | Outliers (slowest 1% of requests) |
| Response Time (max) | Slowest request | Extreme outliers (often anomalies) |

Example (why percentiles matter):

| Metric | Value | Interpretation |
| --- | --- | --- |
| Average | 500ms | Looks acceptable |
| 95th percentile | 8000ms | 🚨 5% of users wait 8 seconds! |
| Diagnosis | Database query timeouts | Outliers reveal the real problem |

Rule of thumb: 95th percentile should be < 2x average. If 95th percentile is 10x average, you have outliers.
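The average-versus-percentile gap is easy to reproduce. A minimal sketch using only the standard library; the nearest-rank style percentile here is one of several common definitions, and the sample data is invented to mirror the 95%-fast / 5%-slow scenario above:

```python
# 95 requests at 100 ms plus 5 outliers at 10,000 ms: the average looks
# acceptable while the 95th percentile exposes the problem.

def percentile(samples, pct):
    """Simple percentile: the value below which roughly pct% of samples fall."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, pct * len(ordered) // 100)
    return ordered[index]

times_ms = [100] * 95 + [10_000] * 5

avg = sum(times_ms) / len(times_ms)
print(f"average: {avg:.0f} ms")               # average: 595 ms -- looks acceptable
print(f"p50: {percentile(times_ms, 50)} ms")  # p50: 100 ms
print(f"p95: {percentile(times_ms, 95)} ms")  # p95: 10000 ms -- outliers exposed

# The rule of thumb above: p95 far beyond 2x average means outliers.
print("outliers!" if percentile(times_ms, 95) > 2 * avg else "ok")  # outliers!
```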


Ask the AI to Interpret Response Time Trends

If you see unusual response time patterns:

My average response time is 300ms but 95th percentile is 5000ms.
What's causing these outliers?

The AI can:

  • Analyze percentile distributions to identify outlier patterns
  • Distinguish between normal variance vs. systemic issues (GC pauses, slow queries)
  • Recommend specific investigations (enable GC logging, check database slow query log)
  • Explain whether outliers are acceptable (< 5%) or critical (> 10%)

Hits/Sec: Server Throughput

What it measures: Number of HTTP transactions completed per second across all virtual users.

Why it matters: This is your server's capacity: how many requests per second can it handle? If hits/sec plateaus while VUs keep increasing, the server is maxed out.

What "good" looks like: Hits/sec should increase linearly as VUs increase.

| VUs | Expected Hits/Sec (Typical Web App) | Why |
| --- | --- | --- |
| 100 | ~500-1000 | Each VU makes 5-10 requests/min (60 sec think time) |
| 200 | ~1000-2000 | Linear scaling (2x VUs = 2x hits/sec) |
| 500 | ~2500-5000 | Continues scaling |

If hits/sec stops increasing even though VUs keep ramping:

  • Server is maxed out: can't process more requests even though you're sending them
  • VUs are waiting for slow responses (response times spiking)

Good (linear scaling):

Hits/Sec
2000|                      ██████████
1500|              ████████
1000|      ████████
 500|██████
    +----------------------------------
    0  100  200  300  400  500 (VUs)

Hits/sec increases with VUs → server can process more requests as load increases.


Problem (throughput plateau):

Hits/Sec
2500|        ████████████████████████
2000|    ████
1000|████
    +----------------------------------
    0  100  200  300  400  500 (VUs)

Hits/sec plateaus at ~2500 → server maxed out, can't process more requests even though VUs keep increasing.

What this means: Server's maximum capacity is ~2500 requests/sec (regardless of VU count).

Why this happens: Resource constraint (CPU maxed, database connections exhausted, thread pool saturated).

What to do: Check server metrics (CPU, memory, database) to find which resource is maxed out.
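A plateau like the one charted above can also be detected numerically: VUs keep growing step over step while hits/sec barely moves. A rough sketch; the 10% and 5% growth thresholds are illustrative choices, not tool defaults:

```python
# Flags the VU level where throughput stops scaling: VUs still growing
# by more than 10% step over step while hits/sec grows by less than 5%.
# Both thresholds are illustrative.

def plateau_start(vus, hits, hit_growth_limit=0.05, vu_growth_min=0.10):
    """Return the VU level where hits/sec stopped scaling, or None."""
    for i in range(1, len(vus)):
        vu_growth = (vus[i] - vus[i - 1]) / vus[i - 1]
        hit_growth = (hits[i] - hits[i - 1]) / hits[i - 1]
        if vu_growth > vu_growth_min and hit_growth < hit_growth_limit:
            return vus[i]
    return None

# Series mirroring the plateau chart above: throughput caps near 2500.
print(plateau_start([100, 200, 300, 400, 500], [1000, 2000, 2500, 2520, 2510]))  # 400
```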


Hits/Sec vs. Response Time Correlation

The relationship between hits/sec and response time reveals server behavior:

| Hits/Sec | Response Time | What It Means |
| --- | --- | --- |
| Increasing | Flat/Low | Server handling load easily (plenty of capacity) |
| Increasing | Gradually increasing | Server handling load but approaching limits |
| Plateaus | Spiking | Server maxed out, can't process more requests |
| Decreasing | Spiking | Server overloaded, processing FEWER requests because it's so slow |

Decreasing hits/sec is a red flag. The server is so overloaded it's actually processing fewer requests than before.


Bandwidth: Network Throughput

What it measures: Data transferred per second (download + upload) in Mbps or Gbps.

Why it matters: Network bottlenecks are invisible without this metric. If bandwidth plateaus, you've maxed out the network, not the server.

What "good" looks like: Bandwidth should increase as VUs increase (more users = more data transferred).

| VUs | Expected Bandwidth (Image-Heavy Site) | Expected Bandwidth (Text-Heavy API) |
| --- | --- | --- |
| 100 | ~50 Mbps | ~5 Mbps |
| 500 | ~250 Mbps | ~25 Mbps |
| 1000 | ~500 Mbps | ~50 Mbps |

If bandwidth plateaus (stops increasing even though VUs increase):

  • Network bottleneck: server's network interface maxed out (e.g., 1 Gbps NIC at capacity)
  • Engine bottleneck: load engines maxed out on bandwidth (cloud engines ~90 Mbps each)
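Because bandwidth scales roughly linearly with VUs, a small calibration run can project larger ones. A back-of-envelope sketch matching the image-heavy row of the table above:

```python
# Linear projection: Mbps per VU measured in a small run, scaled up.
def projected_bandwidth_mbps(measured_vus, measured_mbps, target_vus):
    return measured_mbps / measured_vus * target_vus

# 100 VUs at 50 Mbps (image-heavy row above) projects to 500 Mbps at 1000 VUs.
print(projected_bandwidth_mbps(100, 50, 1000))  # 500.0
```

If the projection says you'll exceed the server's NIC (or the engines' per-engine bandwidth), you know before the test that the network will be the ceiling.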

Network Bottleneck Example:

| VUs | Bandwidth | Response Time | Diagnosis |
| --- | --- | --- | --- |
| 100 | 200 Mbps | 100ms | Good |
| 500 | 900 Mbps | 150ms | Approaching 1 Gbps NIC limit |
| 1000 | 1000 Mbps | 5000ms | Network maxed out, server can't send more data |

This tells you: Server's 1 Gbps network interface is the bottleneck (not CPU, not database).

Fix: Upgrade to 10 Gbps NIC or add load balancer with multiple servers.


Virtual Users (VUs): Load Level

What it measures: Number of concurrent virtual users executing the test case.

Why it matters: VU count defines load level: more VUs = more concurrent users. This is your X-axis for all performance analysis.

What "good" looks like: VUs should increase according to your load profile:

  • Stepped profile: VUs increase in discrete steps (e.g., 100 → 150 → 200 every 5 min)
  • Exponential profile: VUs increase by percentage (e.g., 100 → 125 → 156 → 195)
  • Constant profile: VUs stay constant (e.g., 100 for entire test)

If VUs don't increase on schedule:

  1. Engine self-regulation: Engines detected overload (CPU > 90%) and stopped adding VUs
  2. Engine capacity exceeded: Requested 5000 VUs but engine max is 3000
  3. Test duration too short: Not enough time to complete all ramps

Errors/Sec: Application Health

What it measures: Number of failed transactions per second (HTTP errors, timeouts, connection failures).

Why it matters: Errors indicate broken functionality. Users see 404 pages, 500 errors, or timeouts. Even a 1% error rate means 1 in 100 users fails.

What "good" looks like: Zero errors.

| Error Rate | Severity | What It Means |
| --- | --- | --- |
| 0% | ✅ Perfect | All transactions succeeding |
| < 1% | ⚠️ Acceptable | Occasional transient errors (network hiccups) |
| 1-10% | 🚨 Concerning | Application issues under load (investigate immediately) |
| > 10% | 💀 Critical | Application broken (stop test, fix issues) |
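The severity bands above, expressed as a lookup. The band edges come straight from the table; the function name is illustrative:

```python
# Classify an error rate into the document's severity bands.
def error_severity(error_rate_pct):
    if error_rate_pct == 0:
        return "perfect"
    if error_rate_pct < 1:
        return "acceptable"
    if error_rate_pct <= 10:
        return "concerning"
    return "critical"

print(error_severity(0.5))  # acceptable
print(error_severity(25))   # critical
```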

Common Error Types and Their Causes

| Status / Error | Error Type | Likely Cause |
| --- | --- | --- |
| 401 Unauthorized | Authentication failure | Session expired, auth tokens invalid |
| 403 Forbidden | Permission denied | CSRF token missing, session security check failed |
| 404 Not Found | Resource not found | Dynamic URL correlation failed, resource deleted |
| 500 Internal Server Error | Server-side error | Application bug, database error, exception |
| 502 Bad Gateway | Proxy/load balancer error | Backend server down |
| 503 Service Unavailable | Server overloaded | Connection pool exhausted, server shutdown |
| 504 Gateway Timeout | Timeout | Backend server too slow |
| Connection refused | Network error | Server not listening, firewall blocking |
| Read timeout | Response timeout | Server processing took too long |

When Errors Appear Reveals the Cause

| VU Level | Error Rate | Response Time | Diagnosis |
| --- | --- | --- | --- |
| 0-200 VUs | 0% | 100ms | Good |
| 300 VUs | 5% (503 errors) | 500ms | Connection pool exhaustion starting |
| 400 VUs | 25% (503 errors) | 5000ms | Server overloaded |
| 500 VUs | 50% (503 errors + timeouts) | Timeouts | Server critically overloaded |

This tells you: Server's capacity limit is ~250 VUs. Beyond that, the connection pool is exhausted and errors begin.
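Reading a capacity limit off a table like this can be automated: find the last VU level where the error rate was still under the 1% acceptable line. A sketch with hypothetical data mirroring the table:

```python
# levels: list of (VU level, error rate %) rows, ascending by VUs.
def capacity_limit(levels):
    """Last VU level where the error rate stayed under the 1% line."""
    last_clean = None
    for vus, err in levels:
        if err >= 1:
            return last_clean
        last_clean = vus
    return last_clean

run = [(200, 0), (300, 5), (400, 25), (500, 50)]
print(capacity_limit(run))  # 200 -- the true limit lies between 200 and 300
```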


Server Metrics: CPU, Memory, Disk, Network

What they measure: Server-side resource utilization (CPU %, Memory %, Disk I/O %, Network %).

Why they matter: Response times tell you WHAT broke. Server metrics tell you WHY. Slow response times with 95% CPU means CPU bottleneck. Slow response times with 30% CPU means the problem is elsewhere: database, disk, network.


CPU %: Compute Capacity

What to watch:

| CPU % | What It Means | Action |
| --- | --- | --- |
| < 50% | Plenty of capacity | Keep ramping load |
| 50-70% | Moderate usage | Watch for degradation |
| 70-90% | High usage | Approaching limit |
| > 90% | Critically high | CPU bottleneck: optimize code or add CPU |

Correlating CPU with response times:

| CPU % | Response Time | Diagnosis |
| --- | --- | --- |
| 40% | 100ms | CPU not the bottleneck |
| 70% | 200ms | CPU moderately loaded (normal) |
| 95% | 5000ms | CPU is the bottleneck |

Memory %: Memory Capacity

What to watch:

| Memory % | What It Means | Action |
| --- | --- | --- |
| < 70% | Healthy | Normal |
| 70-85% | Moderate | Watch for growth |
| 85-95% | High | Potential memory pressure |
| > 95% | Critical | Memory bottleneck or leak |

Memory leak pattern:

| Time | Memory % | Response Time | Diagnosis |
| --- | --- | --- | --- |
| 0 min | 30% | 100ms | Good |
| 60 min | 75% | 500ms | Concerning |
| 120 min | 100% (OOM) | Crash | Memory leak |
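A leak is distinguishable from normal sawtooth memory usage (allocate, then garbage-collect) by its monotonic climb. A crude check with illustrative thresholds and invented sample data:

```python
# Crude leak heuristic: memory only ever climbs, and the total climb is
# large. The 20-point growth floor is an illustrative threshold.
def looks_like_leak(memory_pct_samples, min_growth_pct=20):
    rising = all(b >= a for a, b in zip(memory_pct_samples, memory_pct_samples[1:]))
    growth = memory_pct_samples[-1] - memory_pct_samples[0]
    return rising and growth >= min_growth_pct

print(looks_like_leak([30, 45, 60, 75, 90]))  # True  -- steady climb, never released
print(looks_like_leak([30, 55, 35, 50, 32]))  # False -- sawtooth: normal GC behavior
```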

Metric Correlation: Finding Bottlenecks

The power of metrics is correlation. No single metric tells the whole story. Combining them reveals root causes.

Example 1: CPU Bottleneck

| Metric | Value | Interpretation |
| --- | --- | --- |
| Response Time | 5000ms | SYMPTOM: Slow |
| Hits/sec | Plateaued | EVIDENCE: Maxed out |
| Server CPU | 95% | CAUSE: CPU bottleneck |
| Database CPU | 40% | NOT the database |

Diagnosis: Web server CPU bottleneck.

Fix: Optimize application code, add CPU cores, or add web servers.


Example 2: Database Bottleneck

| Metric | Value | Interpretation |
| --- | --- | --- |
| Response Time | 5000ms | SYMPTOM: Slow |
| Server CPU | 40% | NOT the web server |
| Database CPU | 95% | CAUSE: Database bottleneck |
| Query time (avg) | 2000ms | EVIDENCE: Slow queries |

Diagnosis: Database CPU bottleneck.

Fix: Optimize queries, add indexes, add database CPU capacity, or add read replicas.


Example 3: Connection Pool Exhaustion

| Metric | Value | Interpretation |
| --- | --- | --- |
| Response Time | 5000ms | SYMPTOM: Slow |
| Errors/sec | 50 (503 errors) | SYMPTOM: Service unavailable |
| Server CPU | 40% | NOT compute-bound |
| DB Connections | 100/100 (maxed) | CAUSE: Pool exhausted |

Diagnosis: Database connection pool exhausted.

Fix: Increase connection pool size (e.g., 100 → 500 connections).


Ask the AI to Correlate Metrics for You

If you're struggling to identify the bottleneck:

Response times are 5000ms at 300 VUs. Server CPU is 40%, memory is 50%,
but database CPU is 95%. What's the bottleneck and how do I fix it?

The AI can:

  • Analyze combinations of metrics to pinpoint the exact bottleneck
  • Distinguish between application bottlenecks (code) vs. infrastructure bottlenecks (CPU/memory/network)
  • Recommend immediate fixes (increase connection pools, optimize queries)
  • Suggest long-term architectural improvements (caching, read replicas, CDN)
  • Validate your diagnosis before you make expensive infrastructure changes

Advanced Metrics for Specific Scenarios

Time to First Byte (TTFB)

What it measures: Time from sending request to receiving first byte of response (server processing time, excludes data transfer).

When to use: Isolate server processing time from network transfer time.

Example:

| Metric | Value | Diagnosis |
| --- | --- | --- |
| Response Time | 5000ms | Total time |
| TTFB | 4900ms | Server processing is 98% of total time |
| Diagnosis | Server bottleneck | NOT network (transfer is only 100ms) |
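The split in the example above is just a ratio of TTFB to total response time. A tiny sketch:

```python
# Fraction of total response time spent before the first byte arrives.
def ttfb_share(response_time_ms, ttfb_ms):
    return ttfb_ms / response_time_ms

share = ttfb_share(5000, 4900)
print(f"{share:.0%} server processing, {1 - share:.0%} transfer")  # 98% server processing, 2% transfer
```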

Average Speed (Mbps per Transaction)

What it measures: Data transfer rate for individual transactions (response size ÷ download time).

When to use: Identify slow downloads (e.g., large images, PDFs, video).

Example:

| Transaction | Response Size | Response Time | Avg Speed | Diagnosis |
| --- | --- | --- | --- | --- |
| Homepage | 500 KB | 100ms | 40 Mbps | Good |
| Product Image | 2 MB | 5000ms | 3.2 Mbps | Slow download (network or CDN issue) |
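The Avg Speed column follows directly from the definition (response size ÷ download time). A sketch using 1 KB = 1000 bytes, matching the table's arithmetic:

```python
# Average speed = response size / download time, in megabits per second.
def avg_speed_mbps(size_kb, response_time_ms):
    bits = size_kb * 1000 * 8
    seconds = response_time_ms / 1000
    return bits / seconds / 1_000_000

print(avg_speed_mbps(500, 100))    # 40.0 (Homepage row)
print(avg_speed_mbps(2000, 5000))  # 3.2  (Product Image row: 2 MB = 2000 KB)
```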

Page Duration vs. Transaction Duration

What it measures:

  • Transaction Duration: Single HTTP request/response
  • Page Duration: All transactions on a page (including think time between transactions)

When to use: Identify whether pages are slow because of a single slow transaction or a cumulative effect.

Example:

| Metric | Value | Diagnosis |
| --- | --- | --- |
| Page Duration | 10 seconds | SYMPTOM: Slow page |
| Transaction 1 | 100ms | Fast |
| Transaction 2 | 100ms | Fast |
| Transaction 3 | 9800ms | One slow transaction (investigate this) |
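Given per-transaction durations for a page, the dominant one can be picked out directly. A minimal sketch mirroring the example above:

```python
# Pick out the transaction dominating a slow page.
def slowest_transaction(durations_ms):
    index = max(range(len(durations_ms)), key=durations_ms.__getitem__)
    return index + 1, durations_ms[index]  # 1-based, matching "Transaction 3"

print(slowest_transaction([100, 100, 9800]))  # (3, 9800)
```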
