Server Performance Checklist¶
Work through this checklist after every load test where you collected server metrics. Each category (CPU, Memory, Disk, Network) has recommended thresholds. When a metric exceeds its threshold, you've found something worth investigating.
How to use this checklist:
- Open your completed load test result in the Embedded Analytics Dashboard (double-click the result in the Navigator)
- Find the server metrics view in the Dashboard, where CPU / Memory / Disk / Network are plotted alongside response-time and throughput data from the test
- Work through each category below, checking your metrics against the thresholds
- When a metric exceeds the threshold, follow the guidance to investigate and resolve
The Dashboard is the right tool for post-test review because it correlates server metrics with the same test's response-time and throughput data, so you can see exactly when each server-side spike happened and what the users were experiencing at that moment. If you want to watch the same metrics in real time during a load test, the Servers View shows current values live.
For detailed metric definitions, see Server Metrics & Counters.
CPU Performance¶
Check these CPU metrics to identify processor bottlenecks:
☐ CPU % (Processor Time)¶
What to check:
- Peak CPU % during the load test
- Sustained CPU % at steady state
Thresholds:
| CPU % | Status | Action |
|---|---|---|
| < 70% | Good | No action needed |
| 70-85% | Warning | Monitor closely; capacity is adequate but limited headroom |
| 85-95% | Critical | Processor is bottleneck; optimize code or add capacity |
| > 95% | Severe | Processor is saturated; expect significant response time degradation |
What it means:
- CPU % should scale proportionally with load (double the VUs → roughly double the CPU %)
- If CPU hits 85%+ and response times spike, CPU is the bottleneck
- If CPU is < 70% but response times are still slow, look elsewhere (memory, disk, database)
Next steps if high:
- Identify inefficient code (profiling tools)
- Optimize database queries (check query execution plans)
- Add more CPU cores or scale horizontally
- Consider caching to reduce computation
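If you also want to spot-check CPU directly on the server, independent of the Dashboard, a short script can sample it and map the readings onto the table above. A minimal sketch using the third-party psutil library; the one-second interval and 60-sample window are arbitrary choices, not product settings:

```python
import psutil  # third-party: pip install psutil

def classify_cpu(pct: float) -> str:
    """Map a CPU % sample onto the threshold table above."""
    if pct < 70:
        return "Good"
    if pct < 85:
        return "Warning"
    if pct < 95:
        return "Critical"
    return "Severe"

# Sample CPU once per second for 60 seconds (arbitrary window).
samples = [psutil.cpu_percent(interval=1) for _ in range(60)]
peak, sustained = max(samples), sum(samples) / len(samples)
print(f"peak={peak:.0f}% ({classify_cpu(peak)}), "
      f"sustained={sustained:.0f}% ({classify_cpu(sustained)})")
```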
☐ Context Switches/sec¶
What to check:
- Rate of thread context switches during the test
Thresholds:
- Should scale proportionally with load
- Greater-than-linear increase suggests inefficient threading or lock contention
What it means:
- Context switches occur when the CPU switches between threads
- Excessive context switching wastes CPU cycles
- Often indicates thread pool size mismatch or lock contention
Next steps if high:
- Review thread pool configuration (too many threads → excessive switching)
- Check for lock contention in application code
- Consider thread affinity or worker thread optimization
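Context-switch counters are cumulative since boot, so the rate is the delta between two samples. A sketch with psutil (the five-second window is an arbitrary choice); record the rate at each load level and compare its growth against VU growth:

```python
import time
import psutil  # pip install psutil

def ctx_switch_rate(window_s: float = 5.0) -> float:
    """Context switches per second, averaged over a short window."""
    before = psutil.cpu_stats().ctx_switches  # cumulative since boot
    time.sleep(window_s)
    after = psutil.cpu_stats().ctx_switches
    return (after - before) / window_s

# Record this at each VU level; a greater-than-linear rise between
# levels points at lock contention or an oversized thread pool.
print(f"{ctx_switch_rate():,.0f} context switches/sec")
```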
☐ Process Queue Length¶
What to check:
- Number of threads waiting to be scheduled
Thresholds:
| Queue Length (per processor) | Status | Action |
|---|---|---|
| < 2 | Good | No action needed |
| 2-10 | Acceptable | Monitor; may indicate CPU pressure |
| > 10 | Critical | CPU cannot keep up with demand |
What it means:
- Threads waiting for CPU time are queued
- Sustained queue length > 10 per processor indicates CPU saturation
Next steps if high:
- Same as high CPU %: optimize code, add capacity
- Check if background processes are competing for CPU
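Windows reports Processor Queue Length as a native performance counter; on Unix-like servers the load average is the usual proxy. A rough per-processor check with psutil, noting that the Linux load average also counts threads blocked on disk, so read it alongside the disk section below:

```python
import psutil  # pip install psutil

# 1-minute load average per logical CPU as a rough proxy for
# run-queue pressure. Caveat: on Linux the load average also
# includes threads in uninterruptible (disk) sleep.
load_1m, _, _ = psutil.getloadavg()
per_cpu = load_1m / psutil.cpu_count(logical=True)

if per_cpu < 2:
    status = "Good"
elif per_cpu <= 10:
    status = "Acceptable"
else:
    status = "Critical"
print(f"runnable per CPU ~ {per_cpu:.1f} ({status})")
```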
Memory Performance¶
Check these memory metrics to identify RAM exhaustion or paging issues:
☐ % Memory (Memory Utilization)¶
What to check:
- Peak memory % during the test
- Memory growth over time (memory leak indicator)
Thresholds:
| Memory % | Status | Action |
|---|---|---|
| < 80% | Good | Adequate memory headroom |
| 80-90% | Warning | Monitor; risk of paging if usage grows |
| > 90% | Critical | Memory pressure; risk of swap/paging |
What it means:
- High memory % forces the OS to page memory to disk
- Paging causes severe performance degradation because disk is roughly 1000x slower than RAM
- Gradual memory growth over time indicates a memory leak
Next steps if high:
- Check for memory leaks (heap dumps, profiling)
- Increase physical RAM
- Optimize memory usage (object pooling, caching strategies)
- Review garbage collection settings (Java/.NET apps)
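One way to turn "gradual memory growth" into a number is to fit a straight line through memory samples taken across the test and look at the slope. A sketch using psutil; the sampling window and the 0.1%/minute cutoff are arbitrary assumptions, not product thresholds:

```python
import time
import psutil  # pip install psutil

def memory_trend(duration_s: int = 300, interval_s: int = 5):
    """Sample memory % and return (peak, slope in %/minute)."""
    samples = []
    for _ in range(duration_s // interval_s):
        samples.append(psutil.virtual_memory().percent)
        time.sleep(interval_s)
    # Least-squares slope over the sample index, converted to %/min.
    n = len(samples)
    mean_x, mean_y = (n - 1) / 2, sum(samples) / n
    slope = (sum((x - mean_x) * (y - mean_y)
                 for x, y in enumerate(samples))
             / sum((x - mean_x) ** 2 for x in range(n)))
    return max(samples), slope * (60 / interval_s)

peak, slope = memory_trend()
print(f"peak={peak:.0f}%, growth={slope:+.2f}%/min")
if slope > 0.1:  # arbitrary cutoff
    print("memory rises steadily under constant load -> suspect a leak")
```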
☐ Page Reads/sec and Page Writes/sec¶
What to check:
- Rate of page faults (disk reads to resolve memory access)
- Rate of page writes (memory flushed to disk)
Thresholds:
| Paging Rate | Status | Action |
|---|---|---|
| < 10/sec | Good | Minimal paging |
| 10-100/sec | Warning | Some paging; monitor for growth |
| > 100/sec | Critical | Excessive paging; severe performance impact |
What it means:
- Page reads occur when memory isn't in RAM and must be fetched from disk
- Page writes occur when RAM is full and pages must be evicted to disk
- Both cause massive slowdowns (nanosecond memory access → millisecond disk access)
Next steps if high:
- Increase physical RAM immediately
- Reduce memory consumption (optimize code, reduce cache sizes)
- Check for memory leaks
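On Linux, psutil exposes cumulative swap-in/swap-out byte counters (sin/sout), so an approximate paging rate falls out of a before/after delta. A Linux-only sketch that assumes the common 4 KiB page size:

```python
import time
import psutil  # pip install psutil; sin/sout are Linux counters

def paging_rate(window_s: float = 10.0):
    """Approximate pages swapped in/out per second (4 KiB pages)."""
    page = 4096
    before = psutil.swap_memory()
    time.sleep(window_s)
    after = psutil.swap_memory()
    reads = (after.sin - before.sin) / page / window_s
    writes = (after.sout - before.sout) / page / window_s
    return reads, writes

reads, writes = paging_rate()
print(f"page reads {reads:.0f}/s, page writes {writes:.0f}/s")
if reads + writes > 100:  # critical threshold from the table above
    print("excessive paging -> add RAM or cut memory use")
```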
☐ Cache Memory Allocation Ratio¶
What to check:
- Percentage of RAM reserved for OS cache
- Decreasing ratio indicates memory pressure
What it means:
- OS reduces cache allocation when memory is needed elsewhere
- Decreasing cache means less efficient file system access
Next steps if decreasing:
- Same as high memory %: add RAM or optimize usage
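On Linux, psutil's virtual_memory() reports the page-cache size directly (the cached field does not exist on Windows), so the ratio can be sampled at the start and end of a test:

```python
import psutil  # pip install psutil; .cached exists on Linux only

vm = psutil.virtual_memory()
cache_ratio = vm.cached / vm.total * 100
print(f"OS cache: {cache_ratio:.1f}% of RAM")
# Sample this at the start and end of the test: a falling ratio
# means the OS is reclaiming cache to satisfy application demand.
```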
Disk I/O Performance¶
Check these disk metrics to identify storage bottlenecks:
☐ % I/O Time Utilized¶
What to check:
- Percentage of time disk was busy with I/O
Thresholds:
| I/O Time % | Status | Action |
|---|---|---|
| < 80% | Good | Disk is keeping up |
| 80-95% | Warning | Disk is under heavy load |
| > 95% | Critical | Disk is saturated; I/O bottleneck |
What it means:
- Disk at 95%+ cannot handle more I/O requests
- I/O requests will queue, causing delays
Next steps if high:
- Move to faster storage (SSD vs. spinning disk)
- Reduce disk writes (optimize logging, caching)
- Separate logs/temp files to different physical disks
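Linux publishes a cumulative busy_time (in milliseconds) through psutil's disk_io_counters(), which makes % I/O time a simple delta over a window. A Linux-only sketch; note the system-wide counters are aggregated across all disks:

```python
import time
import psutil  # pip install psutil; busy_time is Linux-only

def disk_busy_percent(window_s: float = 10.0) -> float:
    """Percentage of the window the disk spent servicing I/O.
    Aggregated across disks; pass perdisk=True to isolate one device."""
    before = psutil.disk_io_counters().busy_time  # ms since boot
    time.sleep(window_s)
    after = psutil.disk_io_counters().busy_time
    return (after - before) / (window_s * 1000) * 100

busy = disk_busy_percent()
status = "Good" if busy < 80 else "Warning" if busy <= 95 else "Critical"
print(f"disk busy {busy:.0f}% ({status})")
```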
☐ Queue Length¶
What to check:
- Average number of I/O requests waiting for disk
Thresholds:
| Queue Length | Status | Action |
|---|---|---|
| < 2 | Good | No queuing |
| 2-5 | Acceptable | Disk under load but managing |
| > 5 | Critical | Requests are queueing; disk bottleneck |
What it means:
- Requests waiting in queue are delayed
- High queue length → slow disk response times
Next steps if high:
- Same as high I/O %: faster storage, reduce I/O
☐ Reads/sec and Writes/sec¶
What to check:
- Rate of disk read/write operations
- Plateaus indicate disk capacity limit
What it means:
- Disk I/O should scale with load
- Plateau (rate stops increasing despite more load) indicates disk saturation
Next steps if plateauing:
- Upgrade to faster storage (NVMe SSD)
- Optimize database queries to reduce disk reads
- Increase RAM to cache more data in memory
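A plateau is easiest to confirm by comparing throughput across load steps: if VUs rose by a step but the I/O rate barely moved, the disk has stopped scaling. A small library-free sketch over (VU count, ops/sec) pairs read off the Dashboard; the readings and the 10% tolerance are made up for illustration:

```python
def find_plateau(steps, tolerance=0.10):
    """steps: list of (vu_count, ops_per_sec) in increasing VU order.
    Returns the VU level where throughput stopped scaling, or None."""
    for (vu_a, rate_a), (vu_b, rate_b) in zip(steps, steps[1:]):
        load_growth = vu_b / vu_a - 1
        rate_growth = rate_b / rate_a - 1
        # Load rose, but throughput grew < 10% of that rise.
        if load_growth > 0 and rate_growth < load_growth * tolerance:
            return vu_a
    return None

# Hypothetical readings from four load steps:
steps = [(50, 800), (100, 1550), (200, 1600), (400, 1620)]
print(find_plateau(steps))  # -> 100: disk stopped scaling past 100 VUs
```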
Network Performance¶
Check these network metrics to identify bandwidth or packet loss issues:
☐ Bytes Received/sec and Bytes Sent/sec¶
What to check:
- Network throughput during the test
- Scaling behavior as load increases
Thresholds:
- Should scale proportionally with load
- Less-than-linear increase indicates network capacity limit
What it means:
- Bytes/sec measures actual data transfer rate
- Plateaus indicate network saturation (hit bandwidth limit)
Next steps if saturated:
- Upgrade network interface (1 GbE → 10 GbE)
- Check for network congestion (switch, firewall bottlenecks)
- Optimize data transfer (compression, reduce payload sizes)
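The same delta-over-a-window pattern yields network throughput from psutil's cumulative byte counters; record it at each load level and confirm it still rises in step with the load:

```python
import time
import psutil  # pip install psutil

def net_throughput(window_s: float = 10.0):
    """(bytes received/sec, bytes sent/sec) over a short window."""
    before = psutil.net_io_counters()
    time.sleep(window_s)
    after = psutil.net_io_counters()
    rx = (after.bytes_recv - before.bytes_recv) / window_s
    tx = (after.bytes_sent - before.bytes_sent) / window_s
    return rx, tx

rx, tx = net_throughput()
# ~125 MB/s is the ceiling of a 1 GbE link; a plateau near it
# means the NIC, not the application, is the limit.
print(f"rx {rx/1e6:.1f} MB/s, tx {tx/1e6:.1f} MB/s")
```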
☐ Packets Received Errors and Packets Sent Errors¶
What to check:
- Number of packets with errors
Thresholds:
| Error Rate | Status | Action |
|---|---|---|
| 0 | Good | No packet errors |
| > 0 | Critical | Network degradation; investigate immediately |
What it means:
- Packet errors indicate serious network problems
- Can be caused by bad cables, failing NICs, switch issues
Next steps if > 0:
- Check network cables and connections
- Replace faulty network hardware
- Review switch/router logs for errors
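psutil's interface counters include error and drop totals, so the zero-tolerance check above is a before/after comparison (the counters are cumulative since boot, so the absolute value alone tells you little):

```python
import psutil  # pip install psutil

def packet_errors() -> int:
    """Total errored and dropped packets, in and out, since boot."""
    c = psutil.net_io_counters()
    return c.errin + c.errout + c.dropin + c.dropout

# Snapshot before the test, compare after: any increase is a finding.
baseline = packet_errors()
# ... run the load test ...
delta = packet_errors() - baseline
print("OK" if delta == 0 else f"{delta} packet errors/drops during test")
```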
☐ Collisions/sec (Ethernet)¶
What to check:
- Rate of packet collisions on Ethernet
Thresholds:
| Collision Rate | Status | Action |
|---|---|---|
| < 5% of Packets Sent/sec | Good | Normal collision rate |
| > 5% of Packets Sent/sec | Critical | Network problem or capacity limit |
What it means:
- Collisions occur when two devices transmit simultaneously
- Excessive collisions indicate network congestion or misconfiguration
Next steps if high:
- Check for network congestion
- Upgrade to full-duplex Ethernet (eliminates collisions)
- Review network topology for bottlenecks
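There is no portable collision counter; on Linux the kernel publishes one per interface under /sys/class/net. A Linux-only sketch in which eth0 is a placeholder for your interface name; on a switched full-duplex network this value should simply stay at zero:

```python
from pathlib import Path

def collisions(iface: str = "eth0") -> int:  # iface is a placeholder
    """Cumulative collision count for one interface (Linux sysfs)."""
    return int(Path(f"/sys/class/net/{iface}/statistics/collisions")
               .read_text())

# Compare against packets sent over the same period; > 5% is the
# critical threshold from the table above.
print(collisions())
```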
☐ Connections Established¶
What to check:
- Number of active TCP connections
- Should scale proportionally with VU count
What it means:
- Each virtual user typically requires 1-6 TCP connections (HTTP/1.1 keep-alive)
- Plateaus indicate connection limit reached
Next steps if plateauing:
- Increase TCP connection limits (OS tuning)
- Check application server connection pool settings
- Check for TIME_WAIT socket exhaustion
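To watch connection scaling and spot TIME_WAIT buildup, you can count sockets by state with psutil (listing all sockets may require elevated privileges on some platforms):

```python
from collections import Counter
import psutil  # pip install psutil

states = Counter(c.status for c in psutil.net_connections(kind="tcp"))
established = states.get(psutil.CONN_ESTABLISHED, 0)
time_wait = states.get(psutil.CONN_TIME_WAIT, 0)

print(f"established={established}, time_wait={time_wait}")
# established should track the VU count; a large TIME_WAIT pile-up
# hints at port exhaustion from rapid connection churn.
```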
☐ Connection Failures¶
What to check:
- Number of failed TCP connection attempts
Thresholds:
| Failures | Status | Action |
|---|---|---|
| 0 | Good | No connection failures |
| > 0 | Critical | Investigate cause immediately |
What it means:
- Connection failures indicate the server is refusing connections
- Often caused by listen queue exhaustion or firewall rules
Next steps if > 0:
- Increase the listen queue backlog
- Check firewall rules
- Review server logs for refused connections
Systematic Diagnosis Process¶
If you're seeing performance problems, work through this process:
1. Response Times Slow?¶
- Yes → Continue to #2
- No → No server bottleneck; check client-side (network latency, think time)
2. CPU > 85%?¶
- Yes → CPU bottleneck. Optimize code, add capacity, or scale horizontally.
- No → Continue to #3
3. Memory > 90% or Paging > 100/sec?¶
- Yes → Memory bottleneck. Add RAM, fix memory leaks, optimize usage.
- No → Continue to #4
4. Disk I/O > 95% or Queue Length > 5?¶
- Yes → Disk bottleneck. Upgrade to SSD, optimize queries, cache more in RAM.
- No → Continue to #5
5. Network errors > 0 or Collisions > 5%?¶
- Yes → Network problem. Fix hardware, check cables, upgrade network.
- No → Continue to #6
6. All server metrics look good, but response times are still slow?¶
- Likely causes:
- Database locks (check for blocked queries)
- External service dependency (API calls, third-party services)
- Misconfigured load balancer (uneven distribution)
- Application-level bottleneck (thread pool exhaustion, lock contention)
- Next steps:
- Review application logs for errors
- Check database query performance
- Trace external service calls
- Profile the application under load
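Assuming step 1 has already established that response times are slow, steps 2-5 reduce to a first-match triage over the peak values you read off the Dashboard. A sketch; the dictionary keys are hypothetical names chosen for this example, not product fields:

```python
def triage(m: dict) -> str:
    """First-match bottleneck triage over peak metrics from a test.
    Expected keys (hypothetical): cpu_pct, mem_pct, paging_per_s,
    disk_io_pct, disk_queue, net_errors, collision_pct."""
    if m["cpu_pct"] > 85:
        return "CPU bottleneck: optimize code, add capacity, scale out"
    if m["mem_pct"] > 90 or m["paging_per_s"] > 100:
        return "Memory bottleneck: add RAM, fix leaks, optimize usage"
    if m["disk_io_pct"] > 95 or m["disk_queue"] > 5:
        return "Disk bottleneck: faster storage, query tuning, more cache"
    if m["net_errors"] > 0 or m["collision_pct"] > 5:
        return "Network problem: check hardware, cables, capacity"
    return ("No server-side ceiling hit: look at database locks, "
            "external services, the load balancer, or app-level limits")

print(triage({"cpu_pct": 62, "mem_pct": 93, "paging_per_s": 40,
              "disk_io_pct": 70, "disk_queue": 1,
              "net_errors": 0, "collision_pct": 0}))
# -> Memory bottleneck (93% > 90%), even though paging is still modest
```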
Related Topics¶
- Server Metrics & Counters - Detailed definitions of all metrics
- Server Monitoring Introduction - Why server monitoring matters
- Basic Server Monitoring - How to set up server monitoring
- Server Monitoring Agent - Installing and configuring the agent
Run through this checklist after every load test. The bottleneck is usually in the last place you'd think to look, which is exactly why a systematic approach matters.