
Performance Tuning Guide

Chapter 3. Monitoring and Analyzing System Performance

This chapter briefly introduces tools that can be used to monitor and analyze system and application performance, and points out the situations in which each tool is most useful. The data collected by each tool can reveal bottlenecks or other system problems that contribute to less-than-optimal performance.

3.1. The proc File System

The proc "file system" is a directory that contains a hierarchy of files that represent the current state of the Linux kernel. It allows applications and users to see the kernel's view of the system.
The proc directory also contains information about the hardware of the system, and any currently running processes. Most of these files are read-only, but some files (primarily those in /proc/sys) can be manipulated by users and applications to communicate configuration changes to the kernel.
For further information about viewing and editing files in the proc directory, refer to the Deployment Guide, available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/.
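For example, tunable parameters under /proc/sys can be read and written like ordinary files (the value shown is the usual default, and may differ on your system):
# cat /proc/sys/vm/swappiness
60
# echo 10 > /proc/sys/vm/swappiness
Note that changes made this way do not persist across reboots; add an entry to /etc/sysctl.conf to make a setting permanent.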

3.2. GNOME and KDE System Monitors

The GNOME and KDE desktop environments both have graphical tools to assist you in monitoring and modifying the behavior of your system.
GNOME System Monitor
The GNOME System Monitor displays basic system information and allows you to monitor system processes, and resource or file system usage. Open it with the gnome-system-monitor command in the Terminal, or click on the Applications menu, and select System Tools > System Monitor.
GNOME System Monitor has four tabs:
System
Displays basic information about the computer's hardware and software.
Processes
Shows active processes, and the relationships between those processes, as well as detailed information about each process. It also lets you filter the processes displayed, and perform certain actions on those processes (start, stop, kill, change priority, etc.).
Resources
Displays the current CPU time usage, memory and swap space usage, and network usage.
File Systems
Lists all mounted file systems alongside some basic information about each, such as the file system type, mount point, and disk space usage.
For further information about the GNOME System Monitor, refer to the Help menu in the application, or to the Deployment Guide, available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/.
KDE System Guard
The KDE System Guard allows you to monitor current system load and processes that are running. It also lets you perform actions on processes. Open it with the ksysguard command in the Terminal, or click on the Kickoff Application Launcher and select Applications > System > System Monitor.
There are two tabs to KDE System Guard:
Process Table
Displays a list of all running processes, alphabetically by default. You can also sort processes by a number of other properties, including total CPU usage, physical or shared memory usage, owner, and priority. You can also filter the visible results, search for specific processes, or perform certain actions on a process.
System Load
Displays historical graphs of CPU usage, memory and swap space usage, and network usage. Hover over the graphs for detailed analysis and graph keys.
For further information about the KDE System Guard, refer to the Help menu in the application.

3.3. Built-in Command-line Monitoring Tools

In addition to graphical monitoring tools, Red Hat Enterprise Linux provides several tools that can be used to monitor a system from the command line. The advantage of these tools is that they can be used outside run level 5. This section discusses each tool briefly, and suggests the purposes to which each tool is best suited.
top
The top tool provides a dynamic, real-time view of the processes in a running system. It can display a variety of information, including a system summary and the tasks currently being managed by the Linux kernel. It also has a limited ability to manipulate processes. Both its operation and the information it displays are highly configurable, and any configuration details can be made to persist across restarts.
By default, the processes shown are ordered by the percentage of CPU usage, giving an easy view into the processes that are consuming the most resources.
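Although top is normally interactive, it can also write a one-shot snapshot to standard output in batch mode, which is convenient for logging or scripting. A minimal example:
# top -b -n 1 | head -15
Here -b enables batch mode and -n 1 limits output to a single iteration.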
For detailed information about using top, refer to its man page: man top.
ps
The ps tool takes a snapshot of a select group of active processes. By default this group is limited to processes owned by the current user and associated with the same terminal.
It can provide more detailed information about processes than top, but is not dynamic.
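As an illustrative example, the following reports all processes with a selected set of columns, sorted by CPU usage (the column list here is arbitrary; see the man page for the full set of format specifiers):
# ps -eo pid,ppid,user,%cpu,%mem,cmd --sort=-%cpu | head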
For detailed information about using ps, refer to its man page: man ps.
vmstat
vmstat (Virtual Memory Statistics) outputs instantaneous reports about your system's processes, memory, paging, block I/O, interrupts and CPU activity.
Although it is not dynamic like top, you can specify a sampling interval, which lets you observe system activity in near-real time.
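For example, the following samples system activity every two seconds, five times (the first report shows averages since boot):
# vmstat 2 5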
For detailed information about using vmstat, refer to its man page: man vmstat.
sar
sar (System Activity Reporter) collects and reports information on system activity for the current day. The default output displays the day's CPU utilization at ten-minute intervals from the beginning of the day:
12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM     all      0.10      0.00      0.15      2.96      0.00     96.79
12:20:01 AM     all      0.09      0.00      0.13      3.16      0.00     96.61
12:30:01 AM     all      0.09      0.00      0.14      2.11      0.00     97.66
...
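sar can also sample in near-real time by specifying an interval and a count. For example, to report CPU utilization once per second for three seconds:
# sar -u 1 3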
This tool is a useful alternative to attempting to create periodic reports on system activity through top or similar tools.
For detailed information about using sar, refer to its man page: man sar.

3.4. Tuned and ktune

Tuned is a daemon that monitors and collects data on the usage of various system components, and uses that information to dynamically tune system settings as required. It can react to changes in CPU and network use, and adjust settings to improve performance in active devices or reduce power consumption in inactive devices.
The accompanying ktune partners with the tuned-adm tool to provide a number of tuning profiles that are pre-configured to enhance performance and reduce power consumption in a number of specific use cases. Edit these profiles or create new ones to produce performance solutions tailored to your environment.
The profiles provided as part of tuned-adm include:
default
The default power-saving profile. This is the most basic power-saving profile. It enables only the disk and CPU plug-ins. Note that this is not the same as turning tuned-adm off, where both tuned and ktune are disabled.
latency-performance
A server profile for typical latency performance tuning. It disables tuned and ktune power-saving mechanisms. The cpuspeed mode changes to performance. The I/O elevator is changed to deadline for each device. For power management quality of service, a cpu_dma_latency requirement of 0 is registered.
throughput-performance
A server profile for typical throughput performance tuning. This profile is recommended if the system does not have enterprise-class storage. It is the same as latency-performance, except:
  • kernel.sched_min_granularity_ns (scheduler minimal preemption granularity) is set to 10 milliseconds,
  • kernel.sched_wakeup_granularity_ns (scheduler wake-up granularity) is set to 15 milliseconds,
  • vm.dirty_ratio (virtual memory dirty ratio) is set to 40%, and
  • transparent huge pages are enabled.
enterprise-storage
This profile is recommended for enterprise-sized server configurations with enterprise-class storage, including battery-backed controller cache protection and management of on-disk cache. It is the same as the throughput-performance profile, with one addition: file systems are re-mounted with barrier=0.
virtual-guest
This profile is optimized for virtual guest machines. It is the same as the throughput-performance profile, except:
  • readahead value is set to 4x, and
  • non root/boot file systems are re-mounted with barrier=0.
virtual-host
Based on the enterprise-storage profile, virtual-host also decreases the swappiness of virtual memory and enables more aggressive writeback of dirty pages. This profile is available in Red Hat Enterprise Linux 6.3 and later, and is the recommended profile for virtualization hosts, including both KVM and Red Hat Enterprise Virtualization hosts.
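As a brief sketch of the workflow, profiles are listed, applied, and queried with the tuned-adm tool:
# tuned-adm list
# tuned-adm profile latency-performance
# tuned-adm active
# tuned-adm off
The last command disables all tuning, turning off both tuned and ktune.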
Refer to the Red Hat Enterprise Linux 6 Power Management Guide, available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/, for further information about tuned and ktune.

3.5. Application Profilers

Profiling is the process of gathering information about a program's behavior as it executes. You profile an application to determine which areas of a program can be optimized to increase the program's overall speed, reduce its memory usage, etc. Application profiling tools help to simplify this process.
There are three supported profiling tools for use with Red Hat Enterprise Linux 6: SystemTap, OProfile and Valgrind. Documenting these profiling tools is outside the scope of this guide; however, this section does provide links to further information and a brief overview of the tasks for which each profiler is suitable.

3.5.1. SystemTap

SystemTap is a tracing and probing tool that lets users monitor and analyze operating system activities (particularly kernel activities) in fine detail. It provides information similar to the output of tools like netstat, top, ps and iostat, but includes additional filtering and analysis options for the information that is collected.
SystemTap provides a deeper, more precise analysis of system activities and application behavior to allow you to pinpoint system and application bottlenecks.
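As a small taste of the tool, the following one-line script (in the style of the examples in the SystemTap Beginners Guide) prints the name and PID of every process that calls the open system call until interrupted with Ctrl+C; note that running stap requires the matching kernel debuginfo packages to be installed:
# stap -e 'probe syscall.open { printf("%s(%d) open (%s)\n", execname(), pid(), argstr) }'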
The Function Callgraph plug-in for Eclipse uses SystemTap as a back-end, allowing it to thoroughly monitor the status of a program, including function calls, returns, times, and user-space variables, and display the information visually for easy optimization.
For further information about SystemTap, refer to the SystemTap Beginners Guide, available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/.

3.5.2. OProfile

OProfile (oprofile) is a system-wide performance monitoring tool. It uses the processor's dedicated performance monitoring hardware to retrieve information about the kernel and system executables, such as when memory is referenced, the number of L2 cache requests, and the number of hardware interrupts received. It can also be used to determine processor usage, and which applications and services are used most.
OProfile can also be used with Eclipse via the Eclipse OProfile plug-in. This plug-in allows users to easily determine the most time-consuming areas of their code, and perform all command-line functions of OProfile with rich visualization of the results.
However, users should be aware of several OProfile limitations:
  • Performance monitoring samples may not be precise. Because the processor may execute instructions out of order, a sample may be recorded from a nearby instruction instead of from the instruction that triggered the interrupt.
  • Because OProfile is system-wide and expects processes to start and stop multiple times, samples from multiple runs are allowed to accumulate. This means you may need to clear sample data from previous runs, as shown in the sketch after this list.
  • It focuses on identifying problems with CPU-limited processes, and therefore does not identify processes that are sleeping while they wait on locks or other events.
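A minimal sketch of a session using the legacy opcontrol interface illustrates these points; myprogram here is a hypothetical workload:
# opcontrol --no-vmlinux          (profile without kernel symbols)
# opcontrol --reset               (clear sample data from previous runs)
# opcontrol --start
# ./myprogram
# opcontrol --stop
# opreport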
For further information about using OProfile, refer to the Deployment Guide, available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/, or to the oprofile documentation on your system, located in /usr/share/doc/oprofile-<version>.

3.5.3. Valgrind

Valgrind provides a number of detection and profiling tools to help improve the performance and correctness of your applications. These tools can detect memory and thread-related errors as well as heap, stack and array overruns, allowing you to easily locate and correct errors in your application code. They can also profile the cache, the heap, and branch-prediction to identify factors that may increase application speed and minimize application memory use.
Valgrind analyzes your application by running it on a synthetic CPU and instrumenting the existing application code as it is executed. It then prints "commentary" clearly identifying each process involved in application execution to a user-specified file descriptor, file, or network socket. The level of instrumentation varies depending on the Valgrind tool in use, and its settings, but it is important to note that executing the instrumented code can take 4-50 times longer than normal execution.
Valgrind can be used on your application as-is, without recompiling. However, because Valgrind uses debugging information to pinpoint issues in your code, if your application and support libraries were not compiled with debugging information enabled, recompiling to include this information is highly recommended.
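For example, to check a hypothetical program for memory errors and leaks, and then profile its cache behavior:
$ gcc -g -O1 myprog.c -o myprog
$ valgrind --tool=memcheck --leak-check=full ./myprog
$ valgrind --tool=cachegrind ./myprog
The -g flag includes the debugging information that Valgrind uses to report source file names and line numbers.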
As of Red Hat Enterprise Linux 6.4, Valgrind integrates with gdb (GNU Project Debugger) to improve debugging efficiency.
More information about Valgrind is available from the Developer Guide, available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/, or by using the man valgrind command when the valgrind package is installed. Accompanying documentation can also be found in:
  • /usr/share/doc/valgrind-<version>/valgrind_manual.pdf
  • /usr/share/doc/valgrind-<version>/html/index.html
For information about how Valgrind can be used to profile system memory, refer to Section 5.3, "Using Valgrind to Profile Memory Usage".

3.5.4. Perf

The perf tool provides a number of useful performance counters that let the user assess the impact of other commands on their system:
perf stat
This command provides overall statistics for common performance events, including instructions executed and clock cycles consumed. You can use the option flags to gather statistics on events other than the default measurement events. As of Red Hat Enterprise Linux 6.4, it is possible to use perf stat to filter monitoring based on one or more specified control groups (cgroups). For further information, read the man page: man perf-stat.
perf record
This command records performance data into a file which can be later analyzed using perf report. For further details, read the man page: man perf-record.
perf report
This command reads the performance data from a file and analyzes the recorded data. For further details, read the man page: man perf-report.
perf list
This command lists the events available on a particular machine. These events will vary based on the performance monitoring hardware and the software configuration of the system. For further information, read the man page: man perf-list.
perf top
This command performs a similar function to the top tool. It generates and displays a performance counter profile in realtime. For further information, read the man page: man perf-top.
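A minimal sketch combining these commands, with myprogram as a hypothetical workload:
# perf stat -e cycles,instructions,cache-misses ./myprogram
# perf record -g ./myprogram
# perf report
The -g option to perf record captures call-graph information, allowing perf report to attribute samples to call chains.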
More information about perf is available in the Red Hat Enterprise Linux Developer Guide, available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/.

3.6. Red Hat Enterprise MRG

Red Hat Enterprise MRG's Realtime component includes Tuna, a tool that allows users both to adjust the tunable values of their system and to view the results of those changes. While it was developed for use with the Realtime component, it can also be used to tune standard Red Hat Enterprise Linux systems.
With Tuna, you can adjust or disable unnecessary system activity, including:
  • BIOS parameters related to power management, error detection, and system management interrupts;
  • network settings, such as interrupt coalescing, and the use of TCP;
  • journaling activity in journaling file systems;
  • system logging;
  • whether interrupts and user processes are handled by a specific CPU or range of CPUs;
  • whether swap space is used; and
  • how to deal with out-of-memory exceptions.
For more detailed conceptual information about tuning Red Hat Enterprise MRG with the Tuna interface, refer to the "General System Tuning" chapter of the Realtime Tuning Guide. For detailed instructions about using the Tuna interface, refer to the Tuna User Guide. Both guides are available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_MRG/.

Chapter 4. CPU

The term CPU, which stands for central processing unit, is a misnomer for most systems, since central implies single, whereas most modern systems have more than one processing unit, or core. Physically, CPUs are contained in a package attached to a motherboard in a socket. Each socket on the motherboard has various connections: to other CPU sockets, memory controllers, interrupt controllers, and other peripheral devices. To the operating system, a socket is a logical grouping of CPUs and associated resources. This concept is central to most of our discussions on CPU tuning.
Red Hat Enterprise Linux keeps a wealth of statistics about system CPU events; these statistics are useful in planning out a tuning strategy to improve CPU performance. Section 4.1.2, "Tuning CPU Performance" discusses some of the more useful statistics, where to find them, and how to analyze them for performance tuning.

Topology

Older computers had relatively few CPUs per system, which allowed an architecture known as Symmetric Multi-Processor (SMP). This meant that each CPU in the system had similar (or symmetric) access to available memory. In recent years, CPU count-per-socket has grown to the point that trying to give symmetric access to all RAM in the system has become very expensive. Most high CPU count systems these days have an architecture known as Non-Uniform Memory Access (NUMA) instead of SMP.
AMD processors have had this type of architecture for some time with their HyperTransport (HT) interconnects, while Intel has begun implementing NUMA in its QuickPath Interconnect (QPI) designs. NUMA and SMP are tuned differently, since you need to account for the topology of the system when allocating resources for an application.

Threads

Inside the Linux operating system, the unit of execution is known as a thread. Threads have a register context, a stack, and a segment of executable code which they run on a CPU. It is the job of the operating system (OS) to schedule these threads on the available CPUs.
The OS maximizes CPU utilization by load-balancing the threads across available cores. Since the OS is primarily concerned with keeping CPUs busy, it may not make optimal decisions with respect to application performance. Moving an application thread to a CPU on another socket may worsen performance more than simply waiting for the current CPU to become available, since memory access operations may slow drastically across sockets. For high-performance applications, it is usually better for the designer to determine where threads should be placed. Section 4.2, "CPU Scheduling" discusses how best to allocate CPUs and memory for the execution of application threads.

Interrupts

One of the less obvious (but nonetheless important) system events that can impact application performance is the interrupt (also known as an IRQ in Linux). These events are handled by the operating system, and are used by peripherals to signal the arrival of data or the completion of an operation, such as a network write or a timer event.
The manner in which an interrupt is handled by the OS or by the CPU that is executing application code does not affect the application's functionality. However, it may impact the application's performance. This chapter also discusses tips on preventing interrupts from adversely impacting application performance.

4.1. CPU Topology

4.1.1. CPU and NUMA Topology

The first computer processors were uniprocessors, meaning that the system had a single CPU. The illusion of executing processes in parallel was created by the operating system rapidly switching the single CPU from one thread of execution (process) to another. In the quest for increasing system performance, designers noted that increasing the clock rate to execute instructions faster only worked up to a point (usually the limitations on creating a stable clock waveform with the current technology). In an effort to get more overall system performance, designers added another CPU to the system, allowing two parallel streams of execution. This trend of adding processors has continued over time.
Most early multiprocessor systems were designed so that each CPU had the same logical path to each memory location (usually a parallel bus). This let each CPU access any memory location in the same amount of time as any other CPU in the system. This type of architecture is known as a Symmetric Multi-Processor (SMP) system. SMP is fine for a small number of CPUs, but once the CPU count gets above a certain point (8 or 16), the number of parallel traces required to allow equal access to memory uses too much of the available board real estate, leaving less room for peripherals.
Two new concepts combined to allow for a higher number of CPUs in a system:
  1. Serial buses
  2. NUMA topologies
A serial bus is a single-wire communication path with a very high clock rate, which transfers data as packetized bursts. Hardware designers began to use serial buses as high-speed interconnects between CPUs, and between CPUs and memory controllers and other peripherals. This means that instead of requiring between 32 and 64 traces on the board from each CPU to the memory subsystem, there was now one trace, substantially reducing the amount of space required on the board.
At the same time, hardware designers were packing more transistors into the same space by reducing die sizes. Instead of putting individual CPUs directly onto the main board, they started packing them into a processor package as multi-core processors. Then, instead of trying to provide equal access to memory from each processor package, designers resorted to a Non-Uniform Memory Access (NUMA) strategy, where each package/socket combination has one or more dedicated memory areas for high speed access. Each socket also has an interconnect to other sockets for slower access to the other sockets' memory.
As a simple NUMA example, suppose we have a two-socket motherboard, where each socket has been populated with a quad-core package. This means the total number of CPUs in the system is eight; four in each socket. Each socket also has an attached memory bank with four gigabytes of RAM, for a total system memory of eight gigabytes. For the purposes of this example, CPUs 0-3 are in socket 0, and CPUs 4-7 are in socket 1. Each socket in this example also corresponds to a NUMA node.
It might take three clock cycles for CPU 0 to access memory from bank 0: a cycle to present the address to the memory controller, a cycle to set up access to the memory location, and a cycle to read or write to the location. However, it might take six clock cycles for CPU 4 to access memory from the same location; because it is on a separate socket, it must go through two memory controllers: the local memory controller on socket 1, and then the remote memory controller on socket 0. If memory is contested on that location (that is, if more than one CPU is attempting to access the same location simultaneously), memory controllers need to arbitrate and serialize access to the memory, so memory access will take longer. Adding cache consistency (ensuring that local CPU caches contain the same data for the same memory location) complicates the process further.
The latest high-end processors from both Intel (Xeon) and AMD (Opteron) have NUMA topologies. The AMD processors use an interconnect known as HyperTransport or HT, while Intel uses one named QuickPath Interconnect or QPI. The interconnects differ in how they physically connect to other interconnects, memory, or peripheral devices, but in effect they are a switch that allows transparent access to one connected device from another connected device. In this case, transparent refers to the fact that there is no special programming API required to use the interconnect, not a "no cost" option.
Because system architectures are so diverse, it is impractical to specifically characterize the performance penalty imposed by accessing non-local memory. We can say that each hop across an interconnect imposes at least some relatively constant performance penalty, so referencing a memory location that is two interconnects away from the current CPU adds at least 2N plus the memory cycle time to the access time, where N is the penalty per hop.
Given this performance penalty, performance-sensitive applications should avoid regularly accessing remote memory in a NUMA topology system. The application should be set up so that it stays on a particular node and allocates memory from that node.
To do this, there are a few things that applications will need to know:
  1. What is the topology of the system?
  2. Where is the application currently executing?
  3. Where is the closest memory bank?
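The tools discussed in Section 4.1.2, "Tuning CPU Performance" help answer these questions. As a quick sketch, the topology itself can be inspected with commands such as the following (output varies by system):
# numactl --hardware
# lscpu
# cat /sys/devices/system/node/node0/distance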

4.1.2. Tuning CPU Performance

Read this section to understand how to tune for better CPU performance, and for an introduction to several tools that aid in the process.
NUMA was originally used to connect a single processor to multiple memory banks. As CPU manufacturers refined their processes and die sizes shrank, multiple CPU cores could be included in one package. These CPU cores were clustered so that each had equal access time to a local memory bank, and cache could be shared between the cores; however, each 'hop' across an interconnect between core, memory, and cache involves a small performance penalty.
The example system in Figure 4.1, "Local and Remote Memory Access in NUMA Topology" contains two NUMA nodes. Each node has four CPUs, a memory bank, and a memory controller. Any CPU on a node has direct access to the memory bank on that node. Following the arrows on Node 1, the steps are as follows:
  1. A CPU (any of 0-3) presents the memory address to the local memory controller.
  2. The memory controller sets up access to the memory address.
  3. The CPU performs read or write operations on that memory address.
The CPU icon used in this image is part of the Nuvola 1.0 (KDE 3.x icon set), and is held under the LGPL-2.1: http://www.gnu.org/licenses/lgpl-2.1.html

Figure 4.1. Local and Remote Memory Access in NUMA Topology


However, if a CPU on one node needs to access code that resides on the memory bank of a different NUMA node, the path it has to take is less direct:
  1. A CPU (any of 0-3) presents the remote memory address to the local memory controller.
  2. The CPU's request for that remote memory address is passed to a remote memory controller, local to the node containing that memory address.
  3. The remote memory controller sets up access to the remote memory address.
  4. The CPU performs read or write operations on that remote memory address.
Every action needs to pass through multiple memory controllers, so access can take more than twice as long when attempting to access remote memory addresses. The primary performance concern in a multi-core system is therefore to ensure that information travels as efficiently as possible, via the shortest, or fastest, path.
To configure an application for optimal CPU performance, you need to know:
  • the topology of the system (how its components are connected),
  • the core on which the application executes, and
  • the location of the closest memory bank.
Red Hat Enterprise Linux 6 ships with a number of tools to help you find this information and tune your system according to your findings. The following sections give an overview of useful tools for CPU performance tuning.

4.1.2.1. Setting CPU Affinity with taskset

taskset retrieves and sets the CPU affinity of a running process (by process ID). It can also be used to launch a process with a given CPU affinity, which binds the specified process to a specified CPU or set of CPUs. However, taskset will not guarantee local memory allocation. If you require the additional performance benefits of local memory allocation, we recommend numactl over taskset; see Section 4.1.2.2, "Controlling NUMA Policy with numactl" for further details.
CPU affinity is represented as a bitmask. The lowest-order bit corresponds to the first logical CPU, and the highest-order bit corresponds to the last logical CPU. These masks are typically given in hexadecimal, so that 0x00000001 represents processor 0, and 0x00000003 represents processors 0 and 1.
To set the CPU affinity of a running process, execute the following command, replacing mask with the mask of the processor or processors you want the process bound to, and pid with the process ID of the process whose affinity you wish to change.
# taskset -p mask pid
To launch a process with a given affinity, run the following command, replacing mask with the mask of the processor or processors you want the process bound to, and program with the program, options, and arguments of the program you want to run.
# taskset mask -- program
Instead of specifying the processors as a bitmask, you can also use the -c option to provide a comma-delimited list of separate processors, or a range of processors, like so:
# taskset -c 0,5,7-9 -- myprogram
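taskset can also retrieve the current affinity of a process without changing it. Using a hypothetical process ID of 7013, the output looks along these lines:
# taskset -p 7013
pid 7013's current affinity mask: f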
Further information about taskset is available from the man page: man taskset.

4.1.2.2. Controlling NUMA Policy with numactl

numactl runs processes with a specified scheduling or memory placement policy. The selected policy is set for that process and all of its children. numactl can also set a persistent policy for shared memory segments or files, and set the CPU affinity and memory affinity of a process. It uses the /sys file system to determine system topology.
The /sys file system contains information about how CPUs, memory, and peripheral devices are connected via NUMA interconnects. Specifically, the /sys/devices/system/cpu directory contains information about how a system's CPUs are connected to one another. The /sys/devices/system/node directory contains information about the NUMA nodes in the system, and the relative distances between those nodes.
In a NUMA system, the greater the distance between a processor and a memory bank, the slower the processor's access to that memory bank. Performance-sensitive applications should therefore be configured so that they allocate memory from the closest possible memory bank.
Performance-sensitive applications should also be configured to execute on a set number of cores, particularly in the case of multi-threaded applications. Because first-level caches are usually small, if multiple threads execute on one core, each thread will potentially evict cached data accessed by a previous thread. When the operating system attempts to multitask between these threads, and the threads continue to evict each other's cached data, a large percentage of their execution time is spent on cache line replacement. This issue is referred to as cache thrashing. It is therefore recommended to bind a multi-threaded application to a node rather than a single core, since this allows the threads to share cache lines on multiple levels (first-, second-, and last-level cache) and minimizes the need for cache fill operations. However, binding an application to a single core may perform well if all threads are accessing the same cached data.
numactl allows you to bind an application to a particular core or NUMA node, and to allocate the memory associated with a core or set of cores to that application. Some useful options provided by numactl are:
--show
Displays the NUMA policy settings of the current process. This parameter does not require further parameters, and can be used like so: numactl --show.
--hardware
Displays an inventory of the available nodes on the system.
--membind
Only allocate memory from the specified nodes. When this is in use, allocation will fail if memory on these nodes is insufficient. Usage for this parameter is numactl --membind=nodes program, where nodes is the list of nodes you want to allocate memory from, and program is the program whose memory requirements should be allocated from that node. Node numbers can be given as a comma-delimited list, a range, or a combination of the two. Further details are available on the numactl man page: man numactl.
--cpunodebind
Only execute a command (and its child processes) on CPUs belonging to the specified node(s). Usage for this parameter is numactl --cpunodebind=nodes program, where nodes is the list of nodes to whose CPUs the specified program (program) should be bound. Node numbers can be given as a comma-delimited list, a range, or a combination of the two. Further details are available on the numactl man page: man numactl.
--physcpubind
Only execute a command (and its child processes) on the specified CPUs. Usage for this parameter is numactl --physcpubind=cpu program, where cpu is a comma-delimited list of physical CPU numbers as displayed in the processor fields of /proc/cpuinfo, and program is the program that should execute only on those CPUs. CPUs can also be specified relative to the current cpuset. Refer to the numactl man page for further information: man numactl.
--localalloc
Specifies that memory should always be allocated on the current node.
--preferred
Where possible, memory is allocated on the specified node. If memory cannot be allocated on the node specified, fall back to other nodes. This option takes only a single node number, like so: numactl --preferred=node. Refer to the numactl man page for further information: man numactl.
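As a combined sketch, the following runs a hypothetical program on the CPUs of node 0 and restricts its memory allocations to that same node:
# numactl --cpunodebind=0 --membind=0 myprogram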
The libnuma library included in the numactl package offers a simple programming interface to the NUMA policy supported by the kernel. It is useful for more fine-grained tuning than the numactl utility. Further information is available on the man page: man 3 numa.

4.1.3. numastat

Important

Previously, the numastat tool was a Perl script written by Andi Kleen. It has been significantly rewritten for Red Hat Enterprise Linux 6.4.
While the default command (numastat, with no options or parameters) maintains strict compatibility with the previous version of the tool, note that supplying options or parameters to this command significantly changes both the output content and its format.
numastat displays memory statistics (such as allocation hits and misses) for processes and the operating system on a per-NUMA-node basis. By default, running numastat displays how many pages of memory are occupied by the following event categories for each node.
Optimal CPU performance is indicated by low numa_miss and numa_foreign values.
This updated version of numastat also shows whether process memory is spread across a system or centralized on specific nodes using numactl.
Cross-reference numastat output with per-CPU top output to verify that process threads are running on the same nodes to which memory is allocated.

Default Tracking Categories

numa_hit
The number of attempted allocations to this node that were successful.
numa_miss
The number of attempted allocations to another node that were allocated on this node because of low memory on the intended node. Each numa_miss event has a corresponding numa_foreign event on another node.
numa_foreign
The number of allocations initially intended for this node that were allocated to another node instead. Each numa_foreign event has a corresponding numa_miss event on another node.
interleave_hit
The number of attempted interleave policy allocations to this node that were successful.
local_node
The number of times a process on this node successfully allocated memory on this node.
other_node
The number of times a process on another node allocated memory on this node.
Supplying any of the following options changes the displayed units to megabytes of memory (rounded to two decimal places), and changes other specific numastat behaviors as described below.
-c
Horizontally condenses the displayed table of information. This is useful on systems with a large number of NUMA nodes, but column width and inter-column spacing are somewhat unpredictable. When this option is used, the amount of memory is rounded to the nearest megabyte.
-m
Displays system-wide memory usage information on a per-node basis, similar to the information found in /proc/meminfo.
-n
Displays the same information as the original numastat command (numa_hit, numa_miss, numa_foreign, interleave_hit, local_node, and other_node), with an updated format, using megabytes as the unit of measurement.
-p pattern
Displays per-node memory information for the specified pattern. If the value for pattern consists of digits, numastat assumes that it is a numerical process identifier. Otherwise, numastat searches process command lines for the specified pattern.
Command line arguments entered after the value of the -p option are assumed to be additional patterns for which to filter. Additional patterns expand, rather than narrow, the filter.
-s
Sorts the displayed data in descending order so that the biggest memory consumers (according to the total column) are listed first.
Optionally, you can specify a node, and the table will be sorted according to the node column. When using this option, the node value must follow the -s option immediately, as shown here:
numastat -s2
Do not include white space between the option and its value.
-v
Displays more verbose information. Namely, process information for multiple processes will display detailed information for each process.
-V
Displays numastat version information.
-z
Omits table rows and columns with only zero values from the displayed information. Note that some near-zero values that are rounded to zero for display purposes will not be omitted from the displayed output.
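As a brief sketch of typical invocations (qemu-kvm here is just an example pattern), the following display a condensed table with zero-only rows and columns omitted, system-wide per-node memory usage, and per-node information for matching processes:
# numastat -c -z
# numastat -m
# numastat -p qemu-kvm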

4.1.4. NUMA Affinity Management Daemon (numad)

numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within a system in order to dynamically improve NUMA resource allocation and management (and therefore system performance).
Depending on system workload, numad can provide benchmark performance improvements of up to 50%. To achieve these performance gains, numad periodically accesses information from the /proc file system to monitor available system resources on a per-node basis. The daemon then attempts to place significant processes on NUMA nodes that have sufficient aligned memory and CPU resources for optimum NUMA performance. Current thresholds for process management are at least 50% of one CPU and at least 300 MB of memory. numad attempts to maintain a resource utilization level, and rebalances allocations when necessary by moving processes between NUMA nodes.
numad also provides a pre-placement advice service that can be queried by various job management systems to provide assistance with the initial binding of CPU and memory resources for their processes. This pre-placement advice service is available regardless of whether numad is running as a daemon on the system. Refer to the man page for further details about using the -w option for pre-placement advice: man numad.

4.1.4.1. Benefits of numad

numad primarily benefits systems with long-running processes that consume significant amounts of resources, particularly when these processes are contained in a subset of the total system resources.
numad may also benefit applications that consume multiple NUMA nodes' worth of resources. However, the benefits that numad provides decrease as the percentage of consumed resources on a system increases.
numad is unlikely to improve performance when processes run for only a few minutes, or do not consume many resources. Systems with continuous unpredictable memory access patterns, such as large in-memory databases, are also unlikely to benefit from numad use.

4.1.4.2. Modes of operation

Note

Kernel memory accounting statistics can contradict each other after large amounts of merging. As such, numad can be confused when the KSM daemon merges large amounts of memory. The KSM daemon will be more NUMA-aware in future releases. However, currently, if your system has a large amount of free memory, you may achieve higher performance by stopping and disabling the KSM daemon.
numad can be used in two ways:
  • as a service
  • as an executable
4.1.4.2.1. Using numad as a service
While the numad service runs, it will attempt to dynamically tune the system based on its workload.
To start the service, run:
# service numad start
To make the service persist across reboots, run:
# chkconfig numad on
4.1.4.2.2. Using numad as an executable
To use numad as an executable, just run:
# numad
numad will run until it is stopped. While it runs, its activities are logged in /var/log/numad.log.
To restrict numad management to a specific process, start it with the following options.
# numad -S 0 -p pid
-p pid
Adds the specified pid to an explicit inclusion list. The process specified will not be managed until it meets the numad process significance threshold.
-S mode
The -S parameter specifies the type of process scanning. Setting it to 0 as shown limits numad management to explicitly included processes.
To stop numad, run:
# numad -i 0
Stopping numad does not remove the changes it has made to improve NUMA affinity. If system use changes significantly, running numad again will adjust affinity to improve performance under the new conditions.
For further information about available numad options, refer to the numad man page: man numad.

4.2. CPU Scheduling

The scheduler is responsible for keeping the CPUs in the system busy. The Linux scheduler implements a number of scheduling policies, which determine when and for how long a thread runs on a particular CPU core.
Scheduling policies are divided into two major categories:
  1. Realtime policies
    • SCHED_FIFO
    • SCHED_RR
  2. Normal policies
    • SCHED_OTHER
    • SCHED_BATCH
    • SCHED_IDLE

4.2.1. Realtime scheduling policies

Realtime threads are scheduled first, and normal threads are scheduled after all realtime threads have been scheduled.
The realtime policies are used for time-critical tasks that must complete without interruptions.
SCHED_FIFO
This policy is also referred to as static priority scheduling, because it defines a fixed priority (between 1 and 99) for each thread. The scheduler scans a list of SCHED_FIFO threads in priority order and schedules the highest priority thread that is ready to run. This thread runs until it blocks, exits, or is preempted by a higher priority thread that is ready to run.
Even the lowest priority realtime thread will be scheduled ahead of any thread with a non-realtime policy; if only one realtime thread exists, the SCHED_FIFO priority value does not matter.
SCHED_RR
A round-robin variant of the SCHED_FIFO policy. SCHED_RR threads are also given a fixed priority between 1 and 99. However, threads with the same priority are scheduled round-robin style within a certain quantum, or time slice. The sched_rr_get_interval(2) system call returns the value of the time slice, but the duration of the time slice cannot be set by a user. This policy is useful if you need multiple threads to run at the same priority.
For more detailed information about the defined semantics of the realtime scheduling policies, refer to the IEEE 1003.1 POSIX standard under System Interfaces - Realtime, which is available from http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_08.html.
Best practice in defining thread priority is to start low and increase priority only when a legitimate latency is identified. Realtime threads are not time-sliced like normal threads; SCHED_FIFO threads run until they block, exit, or are pre-empted by a thread with a higher priority. Setting a priority of 99 is therefore not recommended, as this places your process at the same priority level as migration and watchdog threads. If these threads are blocked because your thread goes into a computational loop, they will not be able to run. Uniprocessor systems will eventually lock up in this situation.
In the Linux kernel, the SCHED_FIFO policy includes a bandwidth cap mechanism. This protects realtime application programmers from realtime tasks that might monopolize the CPU. This mechanism can be adjusted through the following /proc file system parameters:
/proc/sys/kernel/sched_rt_period_us
Defines the time period to be considered one hundred percent of CPU bandwidth, in microseconds ('us' being the closest equivalent to 'µs' in plain text). The default value is 1000000µs, or 1 second.
/proc/sys/kernel/sched_rt_runtime_us
Defines the time period to be devoted to running realtime threads, in microseconds ('us' being the closest equivalent to 'µs' in plain text). The default value is 950000µs, or 0.95 seconds.
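These parameters can be inspected and adjusted like any other /proc or sysctl tunables. As a sketch, the following verifies the defaults and then lowers the realtime runtime, reserving 10% of CPU bandwidth for non-realtime tasks:
# cat /proc/sys/kernel/sched_rt_period_us
1000000
# cat /proc/sys/kernel/sched_rt_runtime_us
950000
# sysctl -w kernel.sched_rt_runtime_us=900000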

4.2.2. Normal scheduling policies

There are three normal scheduling policies: SCHED_OTHER, SCHED_BATCH and SCHED_IDLE. However, the SCHED_BATCH and SCHED_IDLE policies are intended for very low priority jobs, and as such are of limited interest in a performance tuning guide.
SCHED_OTHER, or SCHED_NORMAL
The default scheduling policy. This policy uses the Completely Fair Scheduler (CFS) to provide fair access periods for all threads using this policy. CFS establishes a dynamic priority list partly based on the niceness value of each process thread. (Refer to the Deployment Guide for more details about this parameter and the /proc file system.) This gives users some indirect level of control over process priority, but the dynamic priority list can only be directly changed by the CFS.

4.2.3. Policy selection

Selecting the correct scheduler policy for an application's threads is not always a straightforward task. In general, realtime policies should be used for time critical or important tasks that need to be scheduled quickly and do not run for extended periods of time. Normal policies will generally yield better data throughput results than realtime policies because they let the scheduler run threads more efficiently (that is, they do not need to reschedule for pre-emption as often).
If you are managing large numbers of threads and are concerned mainly with data throughput (network packets per second, writes to disk, etc.) then use SCHED_OTHER and let the system manage CPU utilization for you.
If you are concerned with event response time (latency) then use SCHED_FIFO. If you have a small number of threads, consider isolating a CPU socket and moving your threads onto that socket's cores so that there are no other threads competing for time on the cores.
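Scheduling policies are typically assigned with the chrt utility. As a sketch, using a hypothetical program and process ID:
# chrt -f 50 ./myapp          (launch with SCHED_FIFO, priority 50)
# chrt -r 10 ./myapp          (launch with SCHED_RR, priority 10)
# chrt -p 7013                (display the policy and priority of PID 7013)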

4.3. Interrupts and IRQ Tuning

An interrupt request (IRQ) is a request for service, sent at the hardware level. Interrupts can be sent by either a dedicated hardware line, or across a hardware bus as an information packet (a Message Signaled Interrupt, or MSI).
When interrupts are enabled, receipt of an IRQ prompts a switch to interrupt context. Kernel interrupt dispatch code retrieves the IRQ number and its associated list of registered Interrupt Service Routines (ISRs), and calls each ISR in turn. The ISR acknowledges the interrupt and ignores redundant interrupts from the same IRQ, then queues a deferred handler to finish processing the interrupt and stop the ISR from ignoring future interrupts.
The /proc/interrupts file lists the number of interrupts per CPU per I/O device. It displays the IRQ number, the number of that interrupt handled by each CPU core, the interrupt type, and a comma-delimited list of drivers that are registered to receive that interrupt. (Refer to the proc(5) man page for further details: man 5 proc)
IRQs have an associated "affinity" property, smp_affinity, which defines the CPU cores that are allowed to execute the ISR for that IRQ. This property can be used to improve application performance by assigning both interrupt affinity and the application's thread affinity to one or more specific CPU cores. This allows cache line sharing between the specified interrupt and application threads.
The interrupt affinity value for a particular IRQ number is stored in the associated /proc/irq/IRQ_NUMBER/smp_affinity file, which can be viewed and modified by the root user. The value stored in this file is a hexadecimal bit-mask representing all CPU cores in the system.
As an example, to set the interrupt affinity for the Ethernet driver on a server with four CPU cores, first determine the IRQ number associated with the Ethernet driver:
# grep eth0 /proc/interrupts
32:   0     140    45   850264   PCI-MSI-edge   eth0
Use the IRQ number to locate the appropriate smp_affinity file:
# cat /proc/irq/32/smp_affinity
f
The default value for smp_affinity is f, meaning that the IRQ can be serviced on any of the CPUs in the system. Setting this value to 1, as follows, means that only CPU 0 can service this interrupt:
# echo 1 > /proc/irq/32/smp_affinity
# cat /proc/irq/32/smp_affinity
1
Commas can be used to delimit smp_affinity values for discrete 32-bit groups. This is required on systems with more than 32 cores. For example, the following shows that IRQ 40 is serviced on all cores of a 64-core system:
# cat /proc/irq/40/smp_affinity
ffffffff,ffffffff
To service IRQ 40 on only the upper 32 cores of a 64-core system, you would do the following:
# echo 0xffffffff,00000000 > /proc/irq/40/smp_affinity
# cat /proc/irq/40/smp_affinity
ffffffff,00000000

Note

On systems that support interrupt steering, modifying the smp_affinity of an IRQ sets up the hardware so that the decision to service an interrupt with a particular CPU is made at the hardware level, with no intervention from the kernel.

4.4. Enhancements to NUMA in Red Hat Enterprise Linux 6

Red Hat Enterprise Linux 6 includes a number of enhancements to capitalize on the full potential of today's highly scalable hardware. This section gives a high-level overview of the most important NUMA-related performance enhancements provided by Red Hat Enterprise Linux 6.

4.4.1. Bare-metal and Scalability Optimizations

4.4.1.1. Enhancements in topology-awareness

The following enhancements allow Red Hat Enterprise Linux to detect low-level hardware and architecture details, improving its ability to automatically optimize processing on your system.
enhanced topology detection
This allows the operating system to detect low-level hardware details (such as logical CPUs, hyper-threads, cores, sockets, NUMA nodes and access times between nodes) at boot time, and optimize processing on your system.
completely fair scheduler
This new scheduling mode ensures that runtime is shared evenly between eligible processes. Combining this with topology detection allows processes to be scheduled onto CPUs within the same socket to avoid the need for expensive remote memory access, and ensure that cache content is preserved wherever possible.
malloc
malloc is now optimized to ensure that the regions of memory that are allocated to a process are as physically close as possible to the core on which the process is executing. This increases memory access speeds.
skbuff I/O buffer allocation
Similarly to malloc, this is now optimized to use memory that is physically close to the CPU handling I/O operations such as device interrupts.
device interrupt affinity
Information recorded by device drivers about which CPU handles which interrupts can be used to restrict interrupt handling to CPUs within the same physical socket, preserving cache affinity and limiting high-volume cross-socket communication.

4.4.1.2. Enhancements in Multi-processor Synchronization

Coordinating tasks between multiple processors requires frequent, time-consuming operations to ensure that processes executing in parallel do not compromise data integrity. Red Hat Enterprise Linux includes the following enhancements to improve performance in this area:
Read-Copy-Update (RCU) locks
Typically, 90% of locks are acquired for read-only purposes. RCU locking removes the need to obtain an exclusive-access lock when the data being accessed is not being modified. This locking mode is now used in page cache memory allocation: locking is now used only for allocation or deallocation operations.
per-CPU and per-socket algorithms
Many algorithms have been updated to perform lock coordination among cooperating CPUs on the same socket to allow for more fine-grained locking. Numerous global spinlocks have been replaced with per-socket locking methods, and updated memory allocator zones and related memory page lists allow memory allocation logic to traverse a more efficient subset of the memory mapping data structures when performing allocation or deallocation operations.

4.4.2. Virtualization Optimizations

Because KVM utilizes kernel functionality, KVM-based virtualized guests immediately benefit from all bare-metal optimizations. Red Hat Enterprise Linux also includes a number of enhancements to allow virtualized guests to approach the performance level of a bare-metal system. These enhancements focus on the I/O path in storage and network access, allowing even intensive workloads such as database and file-serving to make use of virtualized deployment. NUMA-specific enhancements that improve the performance of virtualized systems include:
CPU pinning
Virtual guests can be bound to run on a specific socket in order to optimize local cache use and remove the need for expensive inter-socket communications and remote memory access.
transparent hugepages (THP)
With THP enabled, the system automatically performs NUMA-aware memory allocation requests for large contiguous amounts of memory, reducing both lock contention and the number of translation lookaside buffer (TLB) memory management operations required and yielding a performance increase of up to 20% in virtual guests.
kernel-based I/O implementation
The virtual guest I/O subsystem is now implemented in the kernel, greatly reducing the expense of inter-node communication and memory access by avoiding a significant amount of context switching, and synchronization and communication overhead.