
Server Performance Checklist

Work through this checklist after every load test where you collected server metrics. Each category (CPU, Memory, Disk, Network) has recommended thresholds. When a metric exceeds its threshold, you've found something worth investigating.

How to use this checklist:

  1. Open your completed load test result in the Embedded Analytics Dashboard (double-click the result in the Navigator)
  2. Find the server metrics view in the Dashboard, where CPU / Memory / Disk / Network are plotted alongside response-time and throughput data from the test
  3. Work through each category below, checking your metrics against the thresholds
  4. When a metric exceeds the threshold, follow the guidance to investigate and resolve

The Dashboard is the right tool for post-test review because it correlates server metrics with the same test's response-time and throughput data, so you can see exactly when each server-side spike happened and what the users were experiencing at that moment. If you want to watch the same metrics in real time during a load test, the Servers View shows current values live.

For detailed metric definitions, see Server Metrics & Counters.


CPU Performance

Check these CPU metrics to identify processor bottlenecks:

CPU % (Processor Time)

What to check:
- Peak CPU % during the load test
- Sustained CPU % at steady state

Thresholds:

CPU % Status Action
< 70% Good No action needed
70-85% Warning Monitor closely; capacity is adequate but limited headroom
85-95% Critical Processor is bottleneck; optimize code or add capacity
> 95% Severe Processor is saturated; expect significant response time degradation

What it means:
- CPU % should scale proportionally with load (double the VUs → roughly double the CPU %)
- If CPU hits 85%+ and response times spike, CPU is the bottleneck
- If CPU is < 70% but response times are still slow, look elsewhere (memory, disk, database)

Next steps if high:
- Identify inefficient code (profiling tools)
- Optimize database queries (check query execution plans)
- Add more CPU cores or scale horizontally
- Consider caching to reduce computation
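
If you script part of the post-test review, the thresholds in the table above translate directly into code. A minimal sketch in Python; the function name and messages are illustrative, not part of any product API:

```python
def classify_cpu_percent(cpu_percent: float) -> tuple[str, str]:
    """Map a peak or steady-state CPU % reading to the checklist's status and action."""
    if cpu_percent < 70:
        return "Good", "No action needed"
    if cpu_percent <= 85:
        return "Warning", "Monitor closely; limited headroom"
    if cpu_percent <= 95:
        return "Critical", "Processor is the bottleneck; optimize code or add capacity"
    return "Severe", "Processor is saturated; expect response time degradation"

# Example: classify the peak CPU % observed during the test
status, action = classify_cpu_percent(88.5)
print(f"CPU 88.5% -> {status}: {action}")
```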


Context Switches/sec

What to check:
- Rate of thread context switches during the test

Thresholds:
- Should scale proportionally with load
- Greater-than-linear increase suggests inefficient threading or lock contention

What it means:
- Context switches occur when the CPU switches between threads
- Excessive context switching wastes CPU cycles
- Often indicates thread pool size mismatch or lock contention

Next steps if high:
- Review thread pool configuration (too many threads → excessive switching)
- Check for lock contention in application code
- Consider thread affinity or worker thread optimization
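
You can also sample this counter yourself on the server under test. A minimal sketch using the third-party psutil library (assumed to be installed; the 5-second interval is arbitrary):

```python
import time
import psutil  # third-party: pip install psutil

def context_switches_per_sec(interval: float = 5.0) -> float:
    """Sample the cumulative context-switch counter twice and return the rate."""
    before = psutil.cpu_stats().ctx_switches
    time.sleep(interval)
    after = psutil.cpu_stats().ctx_switches
    return (after - before) / interval

# Compare runs at different VU counts: the rate should grow roughly linearly
# with load; a super-linear jump points at lock contention or thread-pool churn.
print(f"{context_switches_per_sec():,.0f} context switches/sec")
```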


Process Queue Length

What to check:
- Number of threads waiting to be scheduled

Thresholds:

Queue Length (per processor) Status Action
< 2 Good No action needed
2-10 Acceptable Monitor; may indicate CPU pressure
> 10 Critical CPU cannot keep up with demand

What it means:
- Threads waiting for CPU time are queued
- Sustained queue length > 10 per processor indicates CPU saturation

Next steps if high:
- Same as high CPU %: optimize code, add capacity
- Check if background processes are competing for CPU
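
There is no single portable counter for the run queue, so the sketch below uses the Unix load average divided by the CPU count as a rough per-processor proxy. Treat it as an approximation (on Linux the load average also counts tasks in uninterruptible I/O wait), not the exact scheduler queue length:

```python
import os

def load_per_processor() -> float:
    """1-minute load average divided by CPU count: a rough 'demand per processor' figure."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)

pressure = load_per_processor()
status = "Good" if pressure < 2 else "Acceptable" if pressure <= 10 else "Critical"
print(f"~{pressure:.1f} runnable tasks per processor ({status})")
```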


Memory Performance

Check these memory metrics to identify RAM exhaustion or paging issues:

% Memory (Memory Utilization)

What to check:
- Peak memory % during the test
- Memory growth over time (memory leak indicator)

Thresholds:

Memory % Status Action
< 80% Good Adequate memory headroom
80-90% Warning Monitor; risk of paging if usage grows
> 90% Critical Memory pressure; risk of swap/paging

What it means:
- High memory % forces the OS to page memory to disk
- Paging causes severe performance degradation because disk is roughly 1000x slower than RAM
- Gradual memory growth over time indicates a memory leak

Next steps if high:
- Check for memory leaks (heap dumps, profiling)
- Increase physical RAM
- Optimize memory usage (object pooling, caching strategies)
- Review garbage collection settings (Java/.NET apps)
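
A leak shows up as a steady upward trend rather than a one-off spike. A hedged sketch that samples memory % with the third-party psutil library and flags a sustained slope; the sampling window and the 0.5 %/minute threshold are illustrative, not a standard:

```python
import time
import statistics
import psutil  # third-party: pip install psutil

def sample_memory_percent(duration_sec: int = 600, every_sec: int = 10):
    """Record (elapsed seconds, memory %) pairs across the test window."""
    samples = []
    for i in range(duration_sec // every_sec):
        samples.append((i * every_sec, psutil.virtual_memory().percent))
        time.sleep(every_sec)
    return samples

def leak_suspected(samples, max_growth_pct_per_min: float = 0.5) -> bool:
    """Fit a straight line to memory % over time; a steady upward slope at
    constant load is the classic leak signature."""
    times, percents = zip(*samples)
    slope, _ = statistics.linear_regression(times, percents)  # % per second (Python 3.10+)
    return slope * 60 > max_growth_pct_per_min

if leak_suspected(sample_memory_percent()):
    print("Memory grew steadily under constant load: investigate for a leak")
```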


Page Reads/sec and Page Writes/sec

What to check:
- Rate of page faults (disk reads to resolve memory access)
- Rate of page writes (memory flushed to disk)

Thresholds:

Paging Rate Status Action
< 10/sec Good Minimal paging
10-100/sec Warning Some paging; monitor for growth
> 100/sec Critical Excessive paging; severe performance impact

What it means:
- Page reads occur when a needed page isn't in RAM and must be fetched from disk
- Page writes occur when RAM is full and pages must be evicted to disk
- Both cause massive slowdowns: a memory access that should take nanoseconds becomes a disk access taking milliseconds

Next steps if high:
- Increase physical RAM immediately
- Reduce memory consumption (optimize code, reduce cache sizes)
- Check for memory leaks
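
If you want to cross-check the Dashboard's figures on a Linux server, the cumulative swap counters in /proc/vmstat (pswpin and pswpout, measured in pages) can be sampled directly. A minimal, Linux-only sketch; note it counts swap activity only, so it approximates rather than reproduces the Page Reads/Writes counters:

```python
import time

def swap_page_rates(interval: float = 10.0) -> tuple[float, float]:
    """Pages swapped in and out per second, from the cumulative /proc/vmstat counters."""
    def read_counters() -> tuple[int, int]:
        values = {}
        with open("/proc/vmstat") as f:
            for line in f:
                key, value = line.split()
                values[key] = int(value)
        return values["pswpin"], values["pswpout"]

    in_1, out_1 = read_counters()
    time.sleep(interval)
    in_2, out_2 = read_counters()
    return (in_2 - in_1) / interval, (out_2 - out_1) / interval

page_reads, page_writes = swap_page_rates()
print(f"page reads: {page_reads:.1f}/sec, page writes: {page_writes:.1f}/sec")
```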


Cache Memory Allocation Ratio

What to check:
- Percentage of RAM reserved for OS cache
- Decreasing ratio indicates memory pressure

What it means:
- OS reduces cache allocation when memory is needed elsewhere
- Decreasing cache means less efficient file system access

Next steps if decreasing:
- Same as high memory %: add RAM or optimize usage
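
On Linux the ratio can be read from /proc/meminfo (Cached and MemTotal are standard field names there); a small sketch for spot checks:

```python
def cache_ratio() -> float:
    """Fraction of physical RAM currently used for the page cache (Linux only)."""
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            values[key] = int(rest.strip().split()[0])  # values are in kB
    return values["Cached"] / values["MemTotal"]

# A ratio that keeps shrinking while the test runs means the OS is reclaiming
# cache to satisfy application allocations, i.e. memory pressure.
print(f"page cache currently holds {cache_ratio():.1%} of RAM")
```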


Disk I/O Performance

Check these disk metrics to identify storage bottlenecks:

% I/O Time Utilized

What to check:
- Percentage of time disk was busy with I/O

Thresholds:

I/O Time % Status Action
< 80% Good Disk is keeping up
80-95% Warning Disk is under heavy load
> 95% Critical Disk is saturated; I/O bottleneck

What it means:
- Disk at 95%+ cannot handle more I/O requests
- I/O requests will queue, causing delays

Next steps if high:
- Move to faster storage (SSD vs. spinning disk)
- Reduce disk writes (optimize logging, caching)
- Separate logs/temp files to different physical disks
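
For a spot check outside the Dashboard, psutil's per-disk counters include a cumulative busy-time value (milliseconds) on some platforms such as Linux. A hedged sketch; "sda" is just an example device name:

```python
import time
import psutil  # third-party: pip install psutil

def disk_busy_percent(disk: str = "sda", interval: float = 10.0) -> float:
    """Approximate % I/O time for one disk by sampling its cumulative busy-time counter."""
    busy_1 = psutil.disk_io_counters(perdisk=True)[disk].busy_time  # milliseconds
    time.sleep(interval)
    busy_2 = psutil.disk_io_counters(perdisk=True)[disk].busy_time
    return (busy_2 - busy_1) / (interval * 1000) * 100

print(f"disk was busy {disk_busy_percent():.1f}% of the sampling interval")
```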


Queue Length

What to check:
- Average number of I/O requests waiting for disk

Thresholds:

Queue Length Status Action
< 2 Good No queuing
2-5 Acceptable Disk under load but managing
> 5 Critical Requests are queueing; disk bottleneck

What it means:
- Requests waiting in queue are delayed
- High queue length → slow disk response times

Next steps if high:
- Same as high I/O %: faster storage, reduce I/O


Reads/sec and Writes/sec

What to check:
- Rate of disk read/write operations
- Plateaus indicate disk capacity limit

What it means:
- Disk I/O should scale with load
- Plateau (rate stops increasing despite more load) indicates disk saturation

Next steps if plateauing:
- Upgrade to faster storage (NVMe SSD)
- Optimize database queries to reduce disk reads
- Increase RAM to cache more data in memory
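
Reads/sec and writes/sec can be derived the same way, from the cumulative operation counters. A minimal psutil sketch (third-party library); capture one sample per load step and compare:

```python
import time
import psutil  # third-party: pip install psutil

def disk_ops_per_sec(interval: float = 10.0) -> tuple[float, float]:
    """Reads/sec and writes/sec across all disks, from cumulative operation counts."""
    c1 = psutil.disk_io_counters()
    time.sleep(interval)
    c2 = psutil.disk_io_counters()
    return ((c2.read_count - c1.read_count) / interval,
            (c2.write_count - c1.write_count) / interval)

# Record one sample per load step (e.g. 100, 200, 400 VUs). If the rates stop
# growing while the VU count keeps doubling, the disk has hit its ceiling.
reads, writes = disk_ops_per_sec()
print(f"{reads:.0f} reads/sec, {writes:.0f} writes/sec")
```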


Network Performance

Check these network metrics to identify bandwidth or packet loss issues:

Bytes Received/sec and Bytes Sent/sec

What to check:
- Network throughput during the test
- Scaling behavior as load increases

Thresholds:
- Should scale proportionally with load
- Less-than-linear increase indicates network capacity limit

What it means:
- Bytes/sec measures actual data transfer rate
- Plateaus indicate network saturation (hit bandwidth limit)

Next steps if saturated:
- Upgrade network interface (1 GbE → 10 GbE)
- Check for network congestion (switch, firewall bottlenecks)
- Optimize data transfer (compression, reduce payload sizes)
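
The same two-sample technique gives throughput in bytes/sec. A psutil sketch (third-party library); the 125 MB/s figure is simply 1 Gbit/s expressed in bytes:

```python
import time
import psutil  # third-party: pip install psutil

def network_throughput(interval: float = 10.0) -> tuple[float, float]:
    """Bytes received and sent per second, summed across all interfaces."""
    c1 = psutil.net_io_counters()
    time.sleep(interval)
    c2 = psutil.net_io_counters()
    return ((c2.bytes_recv - c1.bytes_recv) / interval,
            (c2.bytes_sent - c1.bytes_sent) / interval)

recv_rate, sent_rate = network_throughput()
# Compare against the link's nominal capacity: a 1 GbE interface tops out
# around 125 MB/s, so sustained rates near that mark mean the NIC is the limit.
print(f"recv {recv_rate / 1e6:.1f} MB/s, sent {sent_rate / 1e6:.1f} MB/s")
```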


Packets Received Errors and Packets Sent Errors

What to check:
- Number of packets with errors

Thresholds:

Error Rate Status Action
0 Good No packet errors
> 0 Critical Network degradation; investigate immediately

What it means:
- Packet errors indicate serious network problems
- Can be caused by bad cables, failing NICs, switch issues

Next steps if > 0:
- Check network cables and connections
- Replace faulty network hardware
- Review switch/router logs for errors
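
psutil also exposes per-interface error counters (cumulative since boot, so record them before and after the test and compare). A minimal sketch:

```python
import psutil  # third-party: pip install psutil

def interface_errors() -> dict[str, tuple[int, int]]:
    """Per-interface (receive errors, send errors) counts since boot."""
    return {nic: (counters.errin, counters.errout)
            for nic, counters in psutil.net_io_counters(pernic=True).items()}

# Snapshot before the test, snapshot after, and diff: any increase during the
# test window is worth investigating immediately.
for nic, (errin, errout) in interface_errors().items():
    if errin or errout:
        print(f"{nic}: {errin} receive errors, {errout} send errors")
```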


Collisions/sec (Ethernet)

What to check:
- Rate of packet collisions on Ethernet

Thresholds:

Collision Rate Status Action
< 5% of Packets Sent/sec Good Normal collision rate
> 5% of Packets Sent/sec Critical Network problem or capacity limit

What it means:
- Collisions occur when two devices transmit simultaneously
- Excessive collisions indicate network congestion or misconfiguration

Next steps if high:
- Check for network congestion
- Upgrade to full-duplex Ethernet (eliminates collisions)
- Review network topology for bottlenecks
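
On Linux the collision and packet counters live in sysfs, so the 5% rule can be checked with a few lines. "eth0" is an example interface name, and the counters are cumulative since boot:

```python
def collision_ratio(interface: str = "eth0") -> float:
    """Collisions as a fraction of packets sent, from the Linux sysfs counters."""
    base = f"/sys/class/net/{interface}/statistics"
    with open(f"{base}/collisions") as f:
        collisions = int(f.read())
    with open(f"{base}/tx_packets") as f:
        tx_packets = int(f.read())
    return collisions / tx_packets if tx_packets else 0.0

ratio = collision_ratio()
print(f"collisions are {ratio:.2%} of packets sent"
      f" ({'Critical' if ratio > 0.05 else 'Good'})")
```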


Connections Established

What to check:
- Number of active TCP connections
- Should scale proportionally with VU count

What it means:
- Each virtual user typically requires 1-6 TCP connections (HTTP/1.1 keep-alive)
- A plateau indicates a connection limit has been reached

Next steps if plateauing:
- Increase TCP connection limits (OS tuning)
- Check application server connection pool settings
- Check for TIME_WAIT socket exhaustion
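
To see the connection counts directly on the server, psutil can enumerate TCP sockets by state (this may require elevated privileges on some systems). A minimal sketch:

```python
from collections import Counter
import psutil  # third-party: pip install psutil; may need elevated privileges

def tcp_connection_states() -> Counter:
    """Count TCP connections by state on the server under test."""
    return Counter(conn.status for conn in psutil.net_connections(kind="tcp"))

states = tcp_connection_states()
# Established connections should track the VU count; a large TIME_WAIT pile-up
# hints at socket exhaustion rather than a genuine connection limit.
print(f"{states.get(psutil.CONN_ESTABLISHED, 0)} established, "
      f"{states.get(psutil.CONN_TIME_WAIT, 0)} in TIME_WAIT")
```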


Connection Failures

What to check:
- Number of failed TCP connection attempts

Thresholds:

Failures Status Action
0 Good No connection failures
> 0 Critical Investigate cause immediately

What it means:
- Connection failures indicate the server is refusing connections
- Often caused by listen queue exhaustion or firewall rules

Next steps if > 0:
- Increase the listen queue backlog
- Check firewall rules
- Review server logs for refused connections
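
On Linux, listen queue exhaustion shows up in the TcpExt counters of /proc/net/netstat (ListenOverflows and ListenDrops). A hedged, Linux-only sketch for confirming that cause; as with the other cumulative counters, diff the values before and after the test:

```python
def listen_queue_drops() -> dict[str, int]:
    """ListenOverflows/ListenDrops counters from /proc/net/netstat (Linux only)."""
    with open("/proc/net/netstat") as f:
        lines = f.readlines()
    counters = {}
    # The file alternates "TcpExt: name name ..." / "TcpExt: value value ..." lines.
    for header, values in zip(lines[::2], lines[1::2]):
        if header.startswith("TcpExt:"):
            names = header.split()[1:]
            nums = [int(v) for v in values.split()[1:]]
            counters.update(zip(names, nums))
    return {key: counters.get(key, 0) for key in ("ListenOverflows", "ListenDrops")}

# Non-zero growth during the test usually means the accept backlog was too
# small for the connection arrival rate.
print(listen_queue_drops())
```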


Systematic Diagnosis Process

If you're seeing performance problems, work through this process (a scripted sketch of the same decision order follows the list):

1. Response Times Slow?

  • Yes → Continue to #2
  • No → No server bottleneck; check client-side (network latency, think time)

2. CPU > 85%?

  • Yes → CPU bottleneck. Optimize code, add capacity, or scale horizontally.
  • No → Continue to #3

3. Memory > 90% or Paging > 100/sec?

  • Yes → Memory bottleneck. Add RAM, fix memory leaks, optimize usage.
  • No → Continue to #4

4. Disk I/O > 95% or Queue Length > 5?

  • Yes → Disk bottleneck. Upgrade to SSD, optimize queries, cache more in RAM.
  • No → Continue to #5

5. Network errors > 0 or Collisions > 5%?

  • Yes → Network problem. Fix hardware, check cables, upgrade network.
  • No → Continue to #6

6. All server metrics look good, but response times are still slow?

  • Likely causes:
    • Database locks (check for blocked queries)
    • External service dependency (API calls, third-party services)
    • Misconfigured load balancer (uneven distribution)
    • Application-level bottleneck (thread pool exhaustion, lock contention)
  • Next steps:
    • Review application logs for errors
    • Check database query performance
    • Trace external service calls
    • Profile the application under load

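If you run this triage often, the same decision order is easy to script. A sketch of the checklist logic; the dictionary keys are illustrative placeholders for wherever your metric values come from:

```python
def diagnose(m: dict) -> str:
    """Walk the six-step triage order above and return the first matching verdict."""
    if not m["response_times_slow"]:
        return "No server bottleneck; check client side (network latency, think time)"
    if m["cpu_percent"] > 85:
        return "CPU bottleneck: optimize code, add capacity, or scale horizontally"
    if m["memory_percent"] > 90 or m["paging_per_sec"] > 100:
        return "Memory bottleneck: add RAM, fix leaks, optimize usage"
    if m["disk_io_percent"] > 95 or m["disk_queue_length"] > 5:
        return "Disk bottleneck: upgrade to SSD, optimize queries, cache more in RAM"
    if m["network_errors"] > 0 or m["collision_ratio"] > 0.05:
        return "Network problem: fix hardware, check cables, upgrade the network"
    return ("Server metrics look fine: check database locks, external services, "
            "load balancer distribution, and application-level contention")
```
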
Run through this checklist after every load test. The bottleneck is usually in the last place you'd think to look, which is exactly why a systematic approach matters.