
Understanding Metrics

Load testing generates hundreds of metrics: response times, throughput, error rates, server CPU, memory, database connections. All of them matter, but not equally. The metrics that matter most depend on what broke.

This guide explains WHY each metric exists, what trends reveal about your application, and how to interpret metric patterns to identify bottlenecks.


The Core Principle: Metrics Tell a Story

Metrics don't exist in isolation. They interact. Response times spike (symptom) because database CPU hit 100% (cause) because queries aren't indexed (root cause).

Your job: find the narrative. What happened? When? Why?

Example story:

| Metric | Trend | What It Means |
| --- | --- | --- |
| Response Time | 100ms → 5000ms at 300 VUs | SYMPTOM: Performance degraded |
| Hits/sec | Increasing → Plateaued at 2500 | EVIDENCE: Server maxed out |
| Database CPU | 40% → 95% at 300 VUs | CAUSE: Database bottleneck |
| Slow Query Log | 10 queries taking > 2 seconds | ROOT CAUSE: Unindexed queries |

This is how you diagnose bottlenecks: metrics reveal the chain of cause and effect.


Metrics Hierarchy: What to Watch First

Not all metrics are equally important. Follow this priority:

Tier 1: User Experience (Watch First)

These metrics answer: Is the application fast enough for users?

| Metric | Target | Why It Matters |
| --- | --- | --- |
| Response Time (avg) | < 2000ms | Users perceive anything > 2s as "slow" |
| Response Time (95th %ile) | < 3000ms | Worst-case experience for most users (5% of users wait longer than this) |
| Error Rate % | < 1% | Users see broken functionality |

If Tier 1 metrics are good, you're done. Users are happy. If not, move to Tier 2.


Tier 2: Throughput (Diagnose Capacity)

These metrics answer: Can the server handle the load?

| Metric | What to Watch | Why It Matters |
| --- | --- | --- |
| Hits/sec | Should increase with VUs | Throughput plateau = server maxed out |
| Bandwidth (Mbps) | Should increase with VUs | Bandwidth plateau = network bottleneck |
| Virtual Users | Increasing as planned | VU plateau = engine overloaded |

If throughput plateaus while VUs keep increasing, the server is maxed out. Check Tier 3 to find WHY.


Tier 3: Server Resources (Find Bottleneck)

These metrics answer: Which resource is limiting performance?

| Metric | Red Flag | What It Means |
| --- | --- | --- |
| Server CPU % | > 90% | CPU bottleneck (optimize code or add CPU) |
| Server Memory % | > 95% | Memory bottleneck or leak |
| Database CPU % | > 90% | Database bottleneck (optimize queries) |
| Database Connections | = Max pool size | Connection pool exhausted |
| Disk I/O % | > 90% | Disk bottleneck (slow storage) |
| Network Bandwidth | = NIC capacity | Network bottleneck (upgrade NIC) |

Whichever hits 100% first is the bottleneck. Fix that resource before scaling further.
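The "whichever hits its limit first" rule can be sketched as a simple threshold scan. A minimal sketch in Python; the metric names, thresholds, and snapshot dict format are illustrative, taken from the table above rather than from any particular monitoring API:

```python
# Sketch of the Tier 3 rule: the first resource past its red-flag
# threshold is the bottleneck. Metric names and the snapshot format
# are illustrative, not from any monitoring API.

RED_FLAGS = {
    "server_cpu_pct": 90,
    "server_memory_pct": 95,
    "database_cpu_pct": 90,
    "disk_io_pct": 90,
}

def find_bottleneck(metrics):
    """Return the first resource exceeding its red-flag threshold, or None."""
    for resource, limit in RED_FLAGS.items():
        if metrics.get(resource, 0) > limit:
            return resource
    return None

snapshot = {"server_cpu_pct": 40, "server_memory_pct": 50, "database_cpu_pct": 95}
print(find_bottleneck(snapshot))  # database_cpu_pct
```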


Key Metrics in Detail

Response Time: The Primary Performance Metric

What it measures: Time from sending HTTP request to receiving complete response (network latency + server processing + data transfer).

Why it matters: This is what users experience. All other metrics are diagnostic tools. Response time is the truth.

What "good" looks like:

| Application Type | Good | Acceptable | Slow |
| --- | --- | --- | --- |
| Static content (images, CSS, JS) | < 100ms | < 500ms | > 500ms |
| Dynamic pages (HTML with DB queries) | < 500ms | < 2000ms | > 2000ms |
| API calls (simple) | < 200ms | < 1000ms | > 1000ms |
| Complex operations (reports, batch jobs) | < 2000ms | < 5000ms | > 5000ms |

These are guidelines. Your application's acceptable response times depend on what your users will tolerate.


Pattern 1: Flat Line (Ideal)

Response Time (ms)
200 |████████████████████████████████
    |
  0 +--------------------------------
    0  100  200  300  400  500 (VUs)

What this means: Server has capacity to spare. Response times don't degrade as load increases.

Why this happens: Server resources (CPU, memory, database) are underused (e.g., CPU at 40%).

What to do: Keep ramping VUs to find the capacity limit.


Pattern 2: Gradual Increase (Normal)

Response Time (ms)
800 |                      ██████████
600 |              ████████
400 |      ████████
200 |██████
    +----------------------------------
    0  100  200  300  400  500 (VUs)

What this means: Performance degrades proportionally with load. This is NORMAL for most applications.

Why this happens: Resource contention increases as VUs increase (more database connections, more CPU threads, more memory usage).

What to do: Acceptable if response times stay under business requirements (e.g., < 2000ms). If degradation is too steep, investigate resource bottlenecks.


Pattern 3: Sharp Spike (Capacity Exceeded)

Response Time (ms)
8000|                    ████
2000|                ████
 500|        ████████
 100|████████
    +----------------------------
    0  100  200  300  400 (VUs)

What this means: Hard limit reached. Response times jumped from 500ms to 8000ms at ~350 VUs.

Why this happens: Resource exhaustion: database connection pool full (most common), memory exhausted, CPU maxed, thread pool saturated.

What to do:

  1. Note the VU count when spike occurred (capacity limit = 350 VUs)
  2. Check server metrics (CPU, memory, database connections) to identify which resource maxed out
  3. Check Errors View for error messages (may reveal "connection pool exhausted" or similar)

Pattern 4: Erratic Spikes (Intermittent Issues)

Response Time (ms)
5000|    ██       ██          ██
2000|    ██       ██      ████
 500|████████████████████████████
    +-------------------------------
    0  100  200  300  400  500 (VUs)

What this means: Occasional slow requests (outliers). Most requests are fast, but some take 10x longer.

Why this happens:

  • Garbage collection pauses (JVM, .NET CLR)
  • Database query timeouts (slow queries that occasionally take 10x longer)
  • Network hiccups (packet loss, retransmissions)
  • Background jobs (cron jobs, scheduled tasks competing for resources)

What to do:

  1. Check if spikes correlate with time (e.g., every 5 minutes = scheduled job)
  2. Review server logs during spike periods (GC logs, database slow query logs)
  3. If spikes are random and infrequent (< 5% of requests): May be acceptable noise
  4. If spikes are frequent (> 10% of requests): Investigate root cause

Response Time Percentiles: Why They Matter

Average response time hides outliers. If 95% of requests are 100ms but 5% are 10 seconds, the average might show 600ms. That number is technically correct and practically useless.

Percentile metrics reveal the full picture:

| Metric | What It Measures | Why It Matters |
| --- | --- | --- |
| Response Time (avg) | Mean of all response times | Quick overview, but hides outliers |
| Response Time (50th %ile / median) | Half of requests are faster, half slower | More realistic "typical" user experience |
| Response Time (95th %ile) | 95% of requests are faster | Worst-case experience for most users |
| Response Time (99th %ile) | 99% of requests are faster | Outliers (slowest 1% of requests) |
| Response Time (max) | Slowest request | Extreme outliers (often anomalies) |

Example (why percentiles matter):

| Metric | Value | Interpretation |
| --- | --- | --- |
| Average | 500ms | Looks acceptable |
| 95th percentile | 8000ms | 🚨 5% of users wait 8 seconds! |
| Diagnosis | Database query timeouts | Outliers reveal the real problem |

Rule of thumb: 95th percentile should be < 2x average. If 95th percentile is 10x average, you have outliers.
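The average-versus-percentile gap is easy to reproduce. A minimal sketch using only the standard library; the nearest-rank style percentile here is one of several common definitions, and the sample data is invented to mirror the 95%-fast / 5%-slow scenario above:

```python
# 95 requests at 100 ms plus 5 outliers at 10,000 ms: the average looks
# acceptable while the 95th percentile exposes the problem.

def percentile(samples, pct):
    """Simple percentile: the value below which roughly pct% of samples fall."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, pct * len(ordered) // 100)
    return ordered[index]

times_ms = [100] * 95 + [10_000] * 5

avg = sum(times_ms) / len(times_ms)
print(f"average: {avg:.0f} ms")               # average: 595 ms -- looks acceptable
print(f"p50: {percentile(times_ms, 50)} ms")  # p50: 100 ms
print(f"p95: {percentile(times_ms, 95)} ms")  # p95: 10000 ms -- outliers exposed

# The rule of thumb above: p95 far beyond 2x average means outliers.
print("outliers!" if percentile(times_ms, 95) > 2 * avg else "ok")  # outliers!
```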


Ask the AI to Interpret Response Time Trends

If you see unusual response time patterns:

My average response time is 300ms but 95th percentile is 5000ms.
What's causing these outliers?

The AI can:

  • Analyze percentile distributions to identify outlier patterns
  • Distinguish between normal variance vs. systemic issues (GC pauses, slow queries)
  • Recommend specific investigations (enable GC logging, check database slow query log)
  • Explain whether outliers are acceptable (< 5%) or critical (> 10%)

Hits/Sec: Server Throughput

What it measures: Number of HTTP transactions completed per second across all virtual users.

Why it matters: This is your server's capacity: how many requests per second can it handle? If hits/sec plateaus while VUs keep increasing, the server is maxed out.

What "good" looks like: Hits/sec should increase linearly as VUs increase.

| VUs | Expected Hits/Sec (Typical Web App) | Why |
| --- | --- | --- |
| 100 | ~500-1000 | Each VU makes 5-10 requests/min (60 sec think time) |
| 200 | ~1000-2000 | Linear scaling (2x VUs = 2x hits/sec) |
| 500 | ~2500-5000 | Continues scaling |

If hits/sec stops increasing even though VUs keep ramping:

  • Server is maxed out: can't process more requests even though you're sending them
  • VUs are waiting for slow responses (response times spiking)

Good (linear scaling):

Hits/Sec
2000|                      ██████████
1500|              ████████
1000|      ████████
 500|██████
    +----------------------------------
    0  100  200  300  400  500 (VUs)

Hits/sec increases with VUs → server can process more requests as load increases.


Problem (throughput plateau):

Hits/Sec
2500|        ████████████████████████
2000|    ████
1000|████
    +----------------------------------
    0  100  200  300  400  500 (VUs)

Hits/sec plateaus at ~2500 → server maxed out, can't process more requests even though VUs keep increasing.

What this means: Server's maximum capacity is ~2500 requests/sec (regardless of VU count).

Why this happens: Resource constraint (CPU maxed, database connections exhausted, thread pool saturated).

What to do: Check server metrics (CPU, memory, database) to find which resource is maxed out.
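A plateau like the one charted above can also be detected numerically: VUs keep growing step over step while hits/sec barely moves. A rough sketch; the 10% and 5% growth thresholds are illustrative choices, not tool defaults:

```python
# Flags the VU level where throughput stops scaling: VUs still growing
# by more than 10% step over step while hits/sec grows by less than 5%.
# Both thresholds are illustrative.

def plateau_start(vus, hits, hit_growth_limit=0.05, vu_growth_min=0.10):
    """Return the VU level where hits/sec stopped scaling, or None."""
    for i in range(1, len(vus)):
        vu_growth = (vus[i] - vus[i - 1]) / vus[i - 1]
        hit_growth = (hits[i] - hits[i - 1]) / hits[i - 1]
        if vu_growth > vu_growth_min and hit_growth < hit_growth_limit:
            return vus[i]
    return None

# Series mirroring the plateau chart above: throughput caps near 2500.
print(plateau_start([100, 200, 300, 400, 500], [1000, 2000, 2500, 2520, 2510]))  # 400
```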


Hits/Sec vs. Response Time Correlation

The relationship between hits/sec and response time reveals server behavior:

| Hits/Sec | Response Time | What It Means |
| --- | --- | --- |
| Increasing | Flat/Low | Server handling load easily (plenty of capacity) |
| Increasing | Gradually increasing | Server handling load but approaching limits |
| Plateaus | Spiking | Server maxed out, can't process more requests |
| Decreasing | Spiking | Server overloaded, processing FEWER requests because it's so slow |

Decreasing hits/sec is a red flag. The server is so overloaded it's actually processing fewer requests than before.


Bandwidth: Network Throughput

What it measures: Data transferred per second (download + upload) in Mbps or Gbps.

Why it matters: Network bottlenecks are invisible without this metric. If bandwidth plateaus, you've maxed out the network, not the server.

What "good" looks like: Bandwidth should increase as VUs increase (more users = more data transferred).

| VUs | Expected Bandwidth (Image-Heavy Site) | Expected Bandwidth (Text-Heavy API) |
| --- | --- | --- |
| 100 | ~50 Mbps | ~5 Mbps |
| 500 | ~250 Mbps | ~25 Mbps |
| 1000 | ~500 Mbps | ~50 Mbps |

If bandwidth plateaus (stops increasing even though VUs increase):

  • Network bottleneck: server's network interface maxed out (e.g., 1 Gbps NIC at capacity)
  • Engine bottleneck: load engines maxed out on bandwidth (cloud engines ~90 Mbps each)
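Because bandwidth scales roughly linearly with VUs, a small calibration run can project larger ones. A back-of-envelope sketch matching the image-heavy row of the table above:

```python
# Linear projection: Mbps per VU measured in a small run, scaled up.
def projected_bandwidth_mbps(measured_vus, measured_mbps, target_vus):
    return measured_mbps / measured_vus * target_vus

# 100 VUs at 50 Mbps (image-heavy row above) projects to 500 Mbps at 1000 VUs.
print(projected_bandwidth_mbps(100, 50, 1000))  # 500.0
```

If the projection says you'll exceed the server's NIC (or the engines' per-engine bandwidth), you know before the test that the network will be the ceiling.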

Network Bottleneck Example:

| VUs | Bandwidth | Response Time | Diagnosis |
| --- | --- | --- | --- |
| 100 | 200 Mbps | 100ms | Good |
| 500 | 900 Mbps | 150ms | Approaching 1 Gbps NIC limit |
| 1000 | 1000 Mbps | 5000ms | Network maxed out, server can't send more data |

This tells you: Server's 1 Gbps network interface is the bottleneck (not CPU, not database).

Fix: Upgrade to 10 Gbps NIC or add load balancer with multiple servers.


Virtual Users (VUs): Load Level

What it measures: Number of concurrent virtual users executing the test case.

Why it matters: VU count defines load level: more VUs = more concurrent users. This is your X-axis for all performance analysis.

What "good" looks like: VUs should increase according to your load profile:

  • Stepped profile: VUs increase in discrete steps (e.g., 100 → 150 → 200 every 5 min)
  • Exponential profile: VUs increase by percentage (e.g., 100 → 125 → 156 → 195)
  • Constant profile: VUs stay constant (e.g., 100 for entire test)

If VUs don't increase on schedule:

  1. Engine self-regulation: Engines detected overload (CPU > 90%) and stopped adding VUs
  2. Engine capacity exceeded: Requested 5000 VUs but engine max is 3000
  3. Test duration too short: Not enough time to complete all ramps

Errors/Sec: Application Health

What it measures: Number of failed transactions per second (HTTP errors, timeouts, connection failures).

Why it matters: Errors indicate broken functionality. Users see 404 pages, 500 errors, or timeouts. Even a 1% error rate means 1 in 100 users fails.

What "good" looks like: Zero errors.

| Error Rate | Severity | What It Means |
| --- | --- | --- |
| 0% | ✅ Perfect | All transactions succeeding |
| < 1% | ⚠️ Acceptable | Occasional transient errors (network hiccups) |
| 1-10% | 🚨 Concerning | Application issues under load (investigate immediately) |
| > 10% | 💀 Critical | Application broken (stop test, fix issues) |
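The severity bands above, expressed as a lookup. The band edges come straight from the table; the function name is illustrative:

```python
# Classify an error rate into the document's severity bands.
def error_severity(error_rate_pct):
    if error_rate_pct == 0:
        return "perfect"
    if error_rate_pct < 1:
        return "acceptable"
    if error_rate_pct <= 10:
        return "concerning"
    return "critical"

print(error_severity(0.5))  # acceptable
print(error_severity(25))   # critical
```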

Common Error Types and Their Causes

| Status / Error | Error Type | Likely Cause |
| --- | --- | --- |
| 401 Unauthorized | Authentication failure | Session expired, auth tokens invalid |
| 403 Forbidden | Permission denied | CSRF token missing, session security check failed |
| 404 Not Found | Resource not found | Dynamic URL correlation failed, resource deleted |
| 500 Internal Server Error | Server-side error | Application bug, database error, exception |
| 502 Bad Gateway | Proxy/load balancer error | Backend server down |
| 503 Service Unavailable | Server overloaded | Connection pool exhausted, server shutdown |
| 504 Gateway Timeout | Timeout | Backend server too slow |
| Connection refused | Network error | Server not listening, firewall blocking |
| Read timeout | Response timeout | Server processing took too long |

When Errors Appear Reveals the Cause

| VU Level | Error Rate | Response Time | Diagnosis |
| --- | --- | --- | --- |
| 0-200 VUs | 0% | 100ms | Good |
| 300 VUs | 5% (503 errors) | 500ms | Connection pool exhaustion starting |
| 400 VUs | 25% (503 errors) | 5000ms | Server overloaded |
| 500 VUs | 50% (503 errors + timeouts) | Timeouts | Server critically overloaded |

This tells you: Server's capacity limit is ~250 VUs. Beyond that, the connection pool is exhausted and errors begin.
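Reading a capacity limit off a table like this can be automated: find the last VU level where the error rate was still under the 1% acceptable line. A sketch with hypothetical data mirroring the table:

```python
# levels: list of (VU level, error rate %) rows, ascending by VUs.
def capacity_limit(levels):
    """Last VU level where the error rate stayed under the 1% line."""
    last_clean = None
    for vus, err in levels:
        if err >= 1:
            return last_clean
        last_clean = vus
    return last_clean

run = [(200, 0), (300, 5), (400, 25), (500, 50)]
print(capacity_limit(run))  # 200 -- the true limit lies between 200 and 300
```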


Server Metrics: CPU, Memory, Disk, Network

What they measure: Server-side resource utilization (CPU %, Memory %, Disk I/O %, Network %).

Why they matter: Response times tell you WHAT broke. Server metrics tell you WHY. Slow response times with 95% CPU means CPU bottleneck. Slow response times with 30% CPU means the problem is elsewhere: database, disk, network.


CPU %: Compute Capacity

What to watch:

| CPU % | What It Means | Action |
| --- | --- | --- |
| < 50% | Plenty of capacity | Keep ramping load |
| 50-70% | Moderate usage | Watch for degradation |
| 70-90% | High usage | Approaching limit |
| > 90% | Critically high | CPU bottleneck: optimize code or add CPU |

Correlating CPU with response times:

| CPU % | Response Time | Diagnosis |
| --- | --- | --- |
| 40% | 100ms | CPU not the bottleneck |
| 70% | 200ms | CPU moderately loaded (normal) |
| 95% | 5000ms | CPU is the bottleneck |

Memory %: Memory Capacity

What to watch:

| Memory % | What It Means | Action |
| --- | --- | --- |
| < 70% | Healthy | Normal |
| 70-85% | Moderate | Watch for growth |
| 85-95% | High | Potential memory pressure |
| > 95% | Critical | Memory bottleneck or leak |

Memory leak pattern:

| Time | Memory % | Response Time | Diagnosis |
| --- | --- | --- | --- |
| 0 min | 30% | 100ms | Good |
| 60 min | 75% | 500ms | Concerning |
| 120 min | 100% (OOM) | Crash | Memory leak |
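A leak is distinguishable from normal sawtooth memory usage (allocate, then garbage-collect) by its monotonic climb. A crude check with illustrative thresholds and invented sample data:

```python
# Crude leak heuristic: memory only ever climbs, and the total climb is
# large. The 20-point growth floor is an illustrative threshold.
def looks_like_leak(memory_pct_samples, min_growth_pct=20):
    rising = all(b >= a for a, b in zip(memory_pct_samples, memory_pct_samples[1:]))
    growth = memory_pct_samples[-1] - memory_pct_samples[0]
    return rising and growth >= min_growth_pct

print(looks_like_leak([30, 45, 60, 75, 90]))  # True  -- steady climb, never released
print(looks_like_leak([30, 55, 35, 50, 32]))  # False -- sawtooth: normal GC behavior
```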

Metric Correlation: Finding Bottlenecks

The power of metrics is correlation. No single metric tells the whole story. Combining them reveals root causes.

Example 1: CPU Bottleneck

| Metric | Value | Interpretation |
| --- | --- | --- |
| Response Time | 5000ms | SYMPTOM: Slow |
| Hits/sec | Plateaued | EVIDENCE: Maxed out |
| Server CPU | 95% | CAUSE: CPU bottleneck |
| Database CPU | 40% | NOT the database |

Diagnosis: Web server CPU bottleneck.

Fix: Optimize application code, add CPU cores, or add web servers.


Example 2: Database Bottleneck

| Metric | Value | Interpretation |
| --- | --- | --- |
| Response Time | 5000ms | SYMPTOM: Slow |
| Server CPU | 40% | NOT the web server |
| Database CPU | 95% | CAUSE: Database bottleneck |
| Query time (avg) | 2000ms | EVIDENCE: Slow queries |

Diagnosis: Database CPU bottleneck.

Fix: Optimize queries, add indexes, add database CPU capacity, or add read replicas.


Example 3: Connection Pool Exhaustion

| Metric | Value | Interpretation |
| --- | --- | --- |
| Response Time | 5000ms | SYMPTOM: Slow |
| Errors/sec | 50 (503 errors) | SYMPTOM: Service unavailable |
| Server CPU | 40% | NOT compute-bound |
| DB Connections | 100/100 (maxed) | CAUSE: Pool exhausted |

Diagnosis: Database connection pool exhausted.

Fix: Increase connection pool size (e.g., 100 → 500 connections).


Ask the AI to Correlate Metrics for You

If you're struggling to identify the bottleneck:

Response times are 5000ms at 300 VUs. Server CPU is 40%, memory is 50%,
but database CPU is 95%. What's the bottleneck and how do I fix it?

The AI can:

  • Analyze combinations of metrics to pinpoint the exact bottleneck
  • Distinguish between application bottlenecks (code) vs. infrastructure bottlenecks (CPU/memory/network)
  • Recommend immediate fixes (increase connection pools, optimize queries)
  • Suggest long-term architectural improvements (caching, read replicas, CDN)
  • Validate your diagnosis before you make expensive infrastructure changes

Advanced Metrics for Specific Scenarios

Time to First Byte (TTFB)

What it measures: Time from sending request to receiving first byte of response (server processing time, excludes data transfer).

When to use: Isolate server processing time from network transfer time.

Example:

| Metric | Value | Diagnosis |
| --- | --- | --- |
| Response Time | 5000ms | Total time |
| TTFB | 4900ms | Server processing is 98% of total time |
| Diagnosis | Server bottleneck | NOT network (transfer is only 100ms) |
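The split in the example above is just a ratio of TTFB to total response time. A tiny sketch:

```python
# Fraction of total response time spent before the first byte arrives.
def ttfb_share(response_time_ms, ttfb_ms):
    return ttfb_ms / response_time_ms

share = ttfb_share(5000, 4900)
print(f"{share:.0%} server processing, {1 - share:.0%} transfer")  # 98% server processing, 2% transfer
```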

Average Speed (Mbps per Transaction)

What it measures: Data transfer rate for individual transactions (response size ÷ download time).

When to use: Identify slow downloads (e.g., large images, PDFs, video).

Example:

| Transaction | Response Size | Response Time | Avg Speed | Diagnosis |
| --- | --- | --- | --- | --- |
| Homepage | 500 KB | 100ms | 40 Mbps | Good |
| Product Image | 2 MB | 5000ms | 3.2 Mbps | Slow download (network or CDN issue) |
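The Avg Speed column follows directly from the definition (response size ÷ download time). A sketch using 1 KB = 1000 bytes, matching the table's arithmetic:

```python
# Average speed = response size / download time, in megabits per second.
def avg_speed_mbps(size_kb, response_time_ms):
    bits = size_kb * 1000 * 8
    seconds = response_time_ms / 1000
    return bits / seconds / 1_000_000

print(avg_speed_mbps(500, 100))    # 40.0 (Homepage row)
print(avg_speed_mbps(2000, 5000))  # 3.2  (Product Image row: 2 MB = 2000 KB)
```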

Page Duration vs. Transaction Duration

What it measures:

  • Transaction Duration: Single HTTP request/response
  • Page Duration: All transactions on a page (including think time between transactions)

When to use: Identify whether pages are slow because of a single slow transaction or a cumulative effect.

Example:

| Metric | Value | Diagnosis |
| --- | --- | --- |
| Page Duration | 10 seconds | SYMPTOM: Slow page |
| Transaction 1 | 100ms | Fast |
| Transaction 2 | 100ms | Fast |
| Transaction 3 | 9800ms | One slow transaction (investigate this) |
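Given per-transaction durations for a page, the dominant one can be picked out directly. A minimal sketch mirroring the example above:

```python
# Pick out the transaction dominating a slow page.
def slowest_transaction(durations_ms):
    index = max(range(len(durations_ms)), key=durations_ms.__getitem__)
    return index + 1, durations_ms[index]  # 1-based, matching "Transaction 3"

print(slowest_transaction([100, 100, 9800]))  # (3, 9800)
```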
