Performance Tuning Guide

Chapter 5. Memory

Read this chapter for an overview of the memory management features available in Red Hat Enterprise Linux, and how to use these management features to optimize memory utilization in your system.

5.1. Huge Translation Lookaside Buffer (HugeTLB)

Virtual memory addresses are translated to physical memory addresses as part of memory management. The mapping of virtual to physical addresses is stored in a data structure known as the page table. Since reading the page table for every address translation would be time consuming and resource-expensive, recently used translations are kept in a cache. This cache is called the Translation Lookaside Buffer (TLB).
However, the TLB can only cache so many address mappings. If a requested address mapping is not in the TLB, the page table must still be read to determine the virtual to physical address mapping. This is known as a "TLB miss". Because of the relationship between application memory requirements and the size of the pages used to cache address mappings in the TLB, applications with large memory requirements are more likely to suffer TLB misses than applications with minimal memory requirements. Since each miss involves reading the page table, it is important to avoid these misses wherever possible.
The Huge Translation Lookaside Buffer (HugeTLB) allows memory to be managed in very large segments so that more address mappings can be cached at one time. This reduces the probability of TLB misses, which in turn improves performance in applications with large memory requirements.
Information about configuring the HugeTLB can be found in the kernel documentation: /usr/share/doc/kernel-doc-version/Documentation/vm/hugetlbpage.txt

5.2. Huge Pages and Transparent Huge Pages

Memory is managed in blocks known as pages. A page is 4096 bytes. 1MB of memory is equal to 256 pages; 1GB of memory is equal to 262,144 pages, and so on. CPUs have a built-in memory management unit that contains a list of these pages, with each page referenced through a page table entry.
There are two ways to enable the system to manage large amounts of memory:
  • Increase the number of page table entries in the hardware memory management unit
  • Increase the page size
The first method is expensive, since the hardware memory management unit in a modern processor only supports hundreds or thousands of page table entries. Additionally, hardware and memory management algorithms that work well with thousands of pages (megabytes of memory) may have difficulty performing well with millions (or even billions) of pages. This results in performance issues: when an application needs to use more memory pages than the memory management unit supports, the system falls back to slower, software-based memory management, which causes the entire system to run more slowly.
Red Hat Enterprise Linux 6 implements the second method via the use of huge pages.
Simply put, huge pages are blocks of memory that come in 2MB and 1GB sizes. The page tables used by the 2MB pages are suitable for managing multiple gigabytes of memory, whereas the page tables of 1GB pages are best for scaling to terabytes of memory.
Huge pages must be assigned at boot time. They are also difficult to manage manually, and often require significant changes to code in order to be used effectively. As such, Red Hat Enterprise Linux 6 also implemented the use of transparent huge pages (THP). THP is an abstraction layer that automates most aspects of creating, managing, and using huge pages.
THP hides much of the complexity in using huge pages from system administrators and developers. As the goal of THP is improving performance, its developers (both from the community and Red Hat) have tested and optimized THP across a wide range of systems, configurations, applications, and workloads. This allows the default settings of THP to improve the performance of most system configurations.
Note that THP can currently only map anonymous memory regions such as heap and stack space.

5.3. Using Valgrind to Profile Memory Usage

Valgrind is a framework that provides instrumentation to user-space binaries. It ships with a number of tools that can be used to profile and analyze program performance. The tools outlined in this section provide analysis that can aid in the detection of memory errors such as the use of uninitialized memory and improper allocation or deallocation of memory. All are included in the valgrind package, and can be run with the following command:
valgrind --tool=toolname program
Replace toolname with the name of the tool you wish to use (for memory profiling, memcheck, massif, or cachegrind), and program with the program you wish to profile with Valgrind. Be aware that Valgrind's instrumentation will cause your program to run more slowly than it would normally.
An overview of Valgrind's capabilities is provided in Section 3.5.3, "Valgrind". Further details, including information about available plugins for Eclipse, are included in the Developer Guide, available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/. Accompanying documentation can be viewed with the man valgrind command when the valgrind package is installed, or found in the following locations:
  • /usr/share/doc/valgrind-version/valgrind_manual.pdf, and
  • /usr/share/doc/valgrind-version/html/index.html.

5.3.1. Profiling Memory Usage with Memcheck

Memcheck is the default Valgrind tool, and can be run with valgrind program, without specifying --tool=memcheck. It detects and reports on a number of memory errors that can be difficult to detect and diagnose, such as memory access that should not occur, the use of undefined or uninitialized values, incorrectly freed heap memory, overlapping pointers, and memory leaks. Programs run ten to thirty times more slowly with Memcheck than when run normally.
Memcheck returns specific errors depending on the type of issue it detects. These errors are outlined in detail in the Valgrind documentation included at /usr/share/doc/valgrind-version/valgrind_manual.pdf.
Note that Memcheck can only report these errors - it cannot prevent them from occurring. If your program accesses memory in a way that would normally result in a segmentation fault, the segmentation fault still occurs. However, Memcheck will log an error message immediately prior to the fault.
Memcheck provides command line options that can be used to focus the checking process. Some of the options available are:
--leak-check
When enabled, Memcheck searches for memory leaks when the client program finishes. The default value is summary, which outputs the number of leaks found. Other possible values are yes and full, both of which give details of each individual leak, and no, which disables memory leak checking.
--undef-value-errors
When enabled (set to yes), Memcheck reports errors when undefined values are used. When disabled (set to no), undefined value errors are not reported. This is enabled by default. Disabling it speeds up Memcheck slightly.
--ignore-ranges
Allows the user to specify one or more ranges that Memcheck should ignore when checking for addressability. Multiple ranges are delimited by commas, for example, --ignore-ranges=0xPP-0xQQ,0xRR-0xSS.
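For example, a typical invocation that reports full details of each leak might look like the following (myprog is a hypothetical binary used only for illustration):
valgrind --tool=memcheck --leak-check=full ./myprog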
For a full list of options, refer to the documentation included at /usr/share/doc/valgrind-version/valgrind_manual.pdf.

5.3.2. Profiling Cache Usage with Cachegrind

Cachegrind simulates your program's interaction with a machine's cache hierarchy and (optionally) branch predictor. It tracks usage of the simulated first-level instruction and data caches to detect poor code interaction with this level of cache; and the last-level cache, whether that is a second- or third-level cache, in order to track access to main memory. As such, programs run with Cachegrind run twenty to one hundred times slower than when run normally.
To run Cachegrind, execute the following command, replacing program with the program you wish to profile with Cachegrind:
# valgrind --tool=cachegrind program
Cachegrind can gather the following statistics for the entire program, and for each function in the program:
  • first-level instruction cache reads (or instructions executed) and read misses, and last-level cache instruction read misses;
  • data cache reads (or memory reads), read misses, and last-level cache data read misses;
  • data cache writes (or memory writes), write misses, and last-level cache write misses;
  • conditional branches executed and mispredicted; and
  • indirect branches executed and mispredicted.
Cachegrind prints summary information about these statistics to the console, and writes more detailed profiling information to a file (cachegrind.out.pid by default, where pid is the process ID of the program on which you ran Cachegrind). This file can be further processed by the accompanying cg_annotate tool, like so:
# cg_annotate cachegrind.out.pid

Note

cg_annotate can output lines longer than 120 characters, depending on the length of the path. To make the output clearer and easier to read, we recommend making your terminal window at least this wide before executing the aforementioned command.
You can also compare the profile files created by Cachegrind to make it simpler to chart program performance before and after a change. To do so, use the cg_diff command, replacing first with the initial profile output file, and second with the subsequent profile output file:
# cg_diff first second
This command produces a combined output file, which can be viewed in more detail with cg_annotate.
Cachegrind supports a number of options to focus its output. Some of the options available are:
--I1
Specifies the size, associativity, and line size of the first-level instruction cache, separated by commas: --I1=size,associativity,line size.
--D1
Specifies the size, associativity, and line size of the first-level data cache, separated by commas: --D1=size,associativity,line size.
--LL
Specifies the size, associativity, and line size of the last-level cache, separated by commas: --LL=size,associativity,line size.
--cache-sim
Enables or disables the collection of cache access and miss counts. The default value is yes (enabled).
Note that disabling both this and --branch-sim leaves Cachegrind with no information to collect.
--branch-sim
Enables or disables the collection of branch instruction and misprediction counts. This is set to no (disabled) by default, since it slows Cachegrind by approximately 25 percent.
Note that disabling both this and --cache-sim leaves Cachegrind with no information to collect.
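As an illustrative example (myprog is a hypothetical binary, and the cache geometry shown is only an assumption), the following run enables branch simulation alongside the default cache simulation and overrides the simulated last-level cache with an 8MB, 16-way cache using 64-byte lines:
# valgrind --tool=cachegrind --branch-sim=yes --LL=8388608,16,64 ./myprog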
For a full list of options, refer to the documentation included at /usr/share/doc/valgrind-version/valgrind_manual.pdf.

5.3.3. Profiling Heap and Stack Space with Massif

Massif measures the heap space used by a specified program; both the useful space, and any additional space allocated for book-keeping and alignment purposes. It can help you reduce the amount of memory used by your program, which can increase your program's speed, and reduce the likelihood that your program will exhaust the swap space of the machine on which it executes. Massif can also provide details about which parts of your program are responsible for allocating heap memory. Programs run with Massif run about twenty times more slowly than their normal execution speed.
To profile the heap usage of a program, specify massif as the Valgrind tool you wish to use:
# valgrind --tool=massif program
Profiling data gathered by Massif is written to a file, which by default is called massif.out.pid, where pid is the process ID of the specified program.
This profiling data can also be graphed with the ms_print command, like so:
# ms_print massif.out.pid
This produces a graph showing memory consumption over the program's execution, and detailed information about the sites responsible for allocation at various points in the program, including at the point of peak memory allocation.
Massif provides a number of command line options that can be used to direct the output of the tool. Some of the available options are:
--heap
Specifies whether to perform heap profiling. The default value is yes. Heap profiling can be disabled by setting this option to no.
--heap-admin
Specifies the number of bytes per block to use for administration when heap profiling is enabled. The default value is 8 bytes per block.
--stacks
Specifies whether to perform stack profiling. The default value is no (disabled). To enable stack profiling, set this option to yes, but be aware that doing so will greatly slow Massif. Also note that Massif assumes that the main stack has size zero at start-up in order to better indicate the size of the stack portion over which the program being profiled has control.
--time-unit
Specifies the unit of time used for the profiling. There are three valid values for this option: instructions executed (i), the default value, which is useful in most cases; real time (ms, in milliseconds), which can be useful in certain instances; and bytes allocated/deallocated on the heap and/or stack (B), which is useful for very short-run programs, and for testing purposes, because it is the most reproducible across different machines. This option is useful when graphing Massif output with ms_print.
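As a sketch (myprog is a hypothetical binary), the following profiles both heap and stack space using bytes as the time unit, then renders the resulting graph:
# valgrind --tool=massif --stacks=yes --time-unit=B ./myprog
# ms_print massif.out.pid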
For a full list of options, refer to the documentation included at /usr/share/doc/valgrind-version/valgrind_manual.pdf.

5.4. Capacity Tuning

Read this section for an outline of memory, kernel and file system capacity, the parameters related to each, and the trade-offs involved in adjusting these parameters.
To set these values temporarily during tuning, echo the desired value to the appropriate file in the proc file system. For example, to set overcommit_memory temporarily to 1, run:
# echo 1 > /proc/sys/vm/overcommit_memory
Note that the path to the parameter in the proc file system varies depending on the subsystem affected by the change.
To set these values persistently, you will need to use the sysctl command. For further details, refer to the Deployment Guide, available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/.
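As a minimal sketch of the persistent approach (continuing the overcommit_memory example above), append the setting to /etc/sysctl.conf and reload it:
# echo "vm.overcommit_memory = 1" >> /etc/sysctl.conf
# sysctl -p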

Capacity-related Memory Tunables

Each of the following parameters is located under /proc/sys/vm/ in the proc file system.
overcommit_memory
Defines the conditions that determine whether a large memory request is accepted or denied. There are three possible values for this parameter:
  • 0 - The default setting. The kernel performs heuristic memory overcommit handling by estimating the amount of memory available and failing requests that are blatantly invalid. Unfortunately, since memory is allocated using a heuristic rather than a precise algorithm, this setting can sometimes allow available memory on the system to be overloaded.
  • 1 - The kernel performs no memory overcommit handling. Under this setting, the potential for memory overload is increased, but so is performance for memory-intensive tasks.
  • 2 - The kernel denies requests for memory equal to or larger than the sum of total available swap and the percentage of physical RAM specified in overcommit_ratio. This setting is best if you want a lesser risk of memory overcommitment.

    Note

    This setting is only recommended for systems with swap areas larger than their physical memory.
overcommit_ratio
Specifies the percentage of physical RAM considered when overcommit_memory is set to 2. The default value is 50.
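As a worked example, with overcommit_memory set to 2 and overcommit_ratio left at 50, a machine with 8GB of RAM and 2GB of swap can commit roughly 2GB + (8GB x 50%) = 6GB. The kernel exposes the resulting limit, which you can verify with:
# grep -i commitlimit /proc/meminfo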
max_map_count
Defines the maximum number of memory map areas that a process may use. In most cases, the default value of 65530 is appropriate. Increase this value if your application needs to map more than this number of files.
nr_hugepages
Defines the number of hugepages configured in the kernel. The default value is 0. It is only possible to allocate (or deallocate) hugepages if there are sufficient physically contiguous free pages in the system. Pages reserved by this parameter cannot be used for other purposes. Further information is available from the installed documentation: /usr/share/doc/kernel-doc-kernel_version/Documentation/vm/hugetlbpage.txt
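For example, to reserve 128 huge pages at runtime and confirm how many were actually allocated (the allocation can fall short if sufficient contiguous memory is unavailable):
# echo 128 > /proc/sys/vm/nr_hugepages
# grep HugePages /proc/meminfo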

Capacity-related Kernel Tunables

Each of the following parameters is located under /proc/sys/kernel/ in the proc file system.
msgmax
Defines the maximum allowable size in bytes of any single message in a message queue. This value must not exceed the size of the queue (msgmnb). The default value is 65536.
msgmnb
Defines the maximum size in bytes of a single message queue. The default value is 65536 bytes.
msgmni
Defines the maximum number of message queue identifiers (and therefore the maximum number of queues). The default value on machines with 64-bit architecture is 1985; for 32-bit architecture, the default value is 1736.
shmall
Defines the total amount of shared memory, in pages, that can be used on the system at one time. The default value for machines with 64-bit architecture is 4294967296; for 32-bit architecture the default value is 268435456.
shmmax
Defines the maximum shared memory segment allowed by the kernel, in bytes. The default value on machines with 64-bit architecture is 68719476736; for 32-bit architecture, the default value is 4294967295. Note, however, that the kernel supports values much larger than this.
shmmni
Defines the system-wide maximum number of shared memory segments. The default value is 4096 on both 64-bit and 32-bit architectures.
threads-max
Defines the system-wide maximum number of threads (tasks) to be used by the kernel at one time. The default value is equal to the kernel max_threads value. The formula in use is:
max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE )
The minimum value of threads-max is 20.
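You can inspect the current values of these parameters with sysctl before changing them, for example:
# sysctl kernel.msgmax kernel.shmmax kernel.shmall kernel.threads-max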

Capacity-related File System Tunables

Each of the following parameters is located under /proc/sys/fs/ in the proc file system.
aio-max-nr
Defines the maximum allowed number of events in all active asynchronous I/O contexts. The default value is 65536. Note that changing this value does not pre-allocate or resize any kernel data structures.
file-max
Lists the maximum number of file handles that the kernel allocates. The default value matches the value of files_stat.max_files in the kernel, which is set to the largest value out of either (mempages * (PAGE_SIZE / 1024)) / 10, or NR_FILE (8192 in Red Hat Enterprise Linux). Raising this value can resolve errors caused by a lack of available file handles.
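To see how close the system is to the file-max limit, compare it against the allocation counts reported in file-nr (allocated handles, unused handles, and the maximum):
# cat /proc/sys/fs/file-nr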

Out-of-Memory Kill Tunables

Out of Memory (OOM) refers to a computing state where all available memory, including swap space, has been allocated. When the /proc/sys/vm/panic_on_oom parameter is set to 1, the kernel panics when OOM occurs and the system stops functioning as expected. Setting the parameter to 0 instead instructs the kernel to call the oom_killer function when OOM occurs. Usually, oom_killer can kill rogue processes and the system survives.
The following parameter can be set on a per-process basis, giving you increased control over which processes are killed by the oom_killer function. It is located under /proc/pid/ in the proc file system, where pid is the process ID number.
oom_adj
Defines a value from -16 to 15 that helps determine the oom_score of a process. The higher the oom_score value, the more likely the process is to be killed by the oom_killer. Setting an oom_adj value of -17 disables the oom_killer for that process.

Important

Any processes spawned by an adjusted process will inherit that process's oom_score. For example, if an sshd process is protected from the oom_killer function, all processes initiated by that SSH session will also be protected. This can affect the oom_killer function's ability to salvage the system if OOM occurs.
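For example, to exempt a running process from the oom_killer (as in the sshd scenario above), write -17 to its oom_adj file, replacing pid with the process ID:
# echo -17 > /proc/pid/oom_adj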

5.5. Tuning Virtual Memory

Virtual memory is typically consumed by processes, file system caches, and the kernel. Virtual memory utilization depends on a number of factors, which can be affected by the following parameters:
swappiness
A value from 0 to 100 which controls the degree to which the system swaps. A high value prioritizes system performance, aggressively swapping processes out of physical memory when they are not active. A low value prioritizes interactivity and avoids swapping processes out of physical memory for as long as possible, which decreases response latency. The default value is 60.
min_free_kbytes
The minimum number of kilobytes to keep free across the system. This value is used to compute a watermark value for each low memory zone; each zone is then assigned a number of reserved free pages in proportion to its size.

Extreme values can break your system

Be cautious when setting this parameter, as both too-low and too-high values can be damaging.
Setting min_free_kbytes too low prevents the system from reclaiming memory. This can result in system hangs and OOM-killing multiple processes.
However, setting this parameter to a value that is too high (5-10% of total system memory) will cause your system to become out-of-memory immediately. Linux is designed to use all available RAM to cache file system data. Setting a high min_free_kbytes value results in the system spending too much time reclaiming memory.
dirty_ratio
Defines a percentage value. Writeout of dirty data begins (via pdflush) when dirty data comprises this percentage of total system memory. The default value is 20.
dirty_background_ratio
Defines a percentage value. Writeout of dirty data begins in the background (via pdflush) when dirty data comprises this percentage of total memory. The default value is 10.
drop_caches
Setting this value to 1, 2, or 3 causes the kernel to drop various combinations of page cache and slab cache.
  • 1 - The system invalidates and frees all page cache memory.
  • 2 - The system frees all unused slab cache memory.
  • 3 - The system frees all page cache and slab cache memory.
This is a non-destructive operation. Since dirty objects cannot be freed, running sync before setting this parameter's value is recommended.

Important

Using drop_caches to free memory is not recommended in a production environment.
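On a test system, a typical sequence is to flush dirty data with sync first, then drop both the page cache and slab caches:
# sync
# echo 3 > /proc/sys/vm/drop_caches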
To set these values temporarily during tuning, echo the desired value to the appropriate file in the proc file system. For example, to set swappiness temporarily to 50, run:
# echo 50 > /proc/sys/vm/swappiness
To set this value persistently, you will need to use the sysctl command. For further information, refer to the Deployment Guide, available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/.

Chapter 6. Input/Output

6.1. Features

Red Hat Enterprise Linux 6 introduces a number of performance enhancements in the I/O stack:
  • Solid state disks (SSDs) are now recognized automatically, and the performance of the I/O scheduler is tuned to take advantage of the high I/Os per second (IOPS) that these devices can perform.
  • Discard support has been added to the kernel to report unused block ranges to the underlying storage. This helps SSDs with their wear-leveling algorithms. It also helps storage that supports logical block provisioning (a sort of virtual address space for storage) by keeping closer tabs on the actual amount of storage in-use.
  • The file system barrier implementation was overhauled in Red Hat Enterprise Linux 6.1 to make it more performant.
  • pdflush has been replaced by per-backing-device flusher threads, which greatly improves system scalability on configurations with large LUN counts.

6.2. Analysis

Successfully tuning storage stack performance requires an understanding of how data flows through the system, as well as intimate knowledge of the underlying storage and how it performs under varying workloads. It also requires an understanding of the actual workload being tuned.
Whenever you deploy a new system, it is a good idea to profile the storage from the bottom up. Start with the raw LUNs or disks, and evaluate their performance using direct I/O (I/O which bypasses the kernel's page cache). This is the most basic test you can perform, and will be the standard by which you measure I/O performance in the stack. Start with a basic workload generator (such as aio-stress) that produces sequential and random reads and writes across a variety of I/O sizes and queue depths.
Following is a graph from a series of aio-stress runs, each of which performs four stages: sequential write, sequential read, random write and random read. In this example, the tool is configured to run across a range of record sizes (the x axis) and queue depths (one per graph). The queue depth represents the total number of I/O operations in progress at a given time.
Figure 6.1. aio-stress output for 1 thread, 1 file. The y-axis shows bandwidth in megabytes per second; the x-axis shows the I/O size in kilobytes.


Notice how the throughput line trends from the lower left corner to the upper right. Also note that, for a given record size, you can get more throughput from the storage by increasing the number of I/Os in progress.
By running these simple workloads against your storage, you will gain an understanding of how your storage performs under load. Retain the data generated by these tests for comparison when analyzing more complex workloads.
If you will be using device mapper or md, add that layer in next and repeat your tests. If there is a large loss in performance, ensure that it is expected, or can be explained. For example, a performance drop may be expected if a checksumming raid layer has been added to the stack. Unexpected performance drops can be caused by misaligned I/O operations. By default, Red Hat Enterprise Linux aligns partitions and device mapper metadata optimally. However, not all types of storage report their optimal alignment, and so may require manual tuning.
After adding the device mapper or md layer, add a file system on top of the block device and test against that, still using direct I/O. Again, compare results to the prior tests and ensure that you understand any discrepancies. Direct-write I/O typically performs better on pre-allocated files, so ensure that you pre-allocate files before testing for performance.
Synthetic workload generators that you may find useful include:
  • aio-stress
  • iozone
  • fio
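For reference, a sketch of a direct I/O baseline test with fio is shown below. The device name, queue depth, and runtime are illustrative assumptions only, and the target should be a disk or LUN that holds no data you need:
# fio --name=randread --filename=/dev/sdX --direct=1 --rw=randread --bs=4k --iodepth=32 --ioengine=libaio --runtime=60 --time_based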

6.3. Tools

There are a number of tools available to help diagnose performance problems in the I/O subsystem. vmstat provides a coarse overview of system performance. The following columns are most relevant to I/O: si (swap in), so (swap out), bi (block in), bo (block out), and wa (I/O wait time). si and so are useful when your swap space is on the same device as your data partition, and as an indicator of overall memory pressure. si and bi are read operations, while so and bo are write operations. Each of these categories is reported in kilobytes. wa is idle time; it indicates what portion of the run queue is blocked waiting for I/O to complete.
Analyzing your system with vmstat will give you an idea of whether or not the I/O subsystem may be responsible for any performance issues. The free, buff, and cache columns are also worth noting. The cache value increasing alongside the bo value, followed by a drop in cache and an increase in free indicates that the system is performing write-back and invalidation of the page cache.
Note that the I/O numbers reported by vmstat are aggregations of all I/O to all devices. Once you have determined that there may be a performance gap in the I/O subsystem, you can examine the problem more closely with iostat, which will break down the I/O reporting by device. You can also retrieve more detailed information, such as the average request size, the number of reads and writes per second, and the amount of I/O merging going on.
Using the average request size and the average queue size (avgqu-sz), you can make some estimations about how the storage should perform using the graphs you generated when characterizing the performance of your storage. Some generalizations apply: for example, if the average request size is 4KB and the average queue size is 1, throughput is unlikely to be high.
If the performance numbers do not map to the performance you expect, you can perform more fine-grained analysis with blktrace. The blktrace suite of utilities gives fine-grained information on how much time is spent in the I/O subsystem. The output from blktrace is a set of binary trace files that can be post-processed by other utilities such as blkparse.
blkparse is the companion utility to blktrace. It reads the raw output from the trace and produces a short-hand textual version.
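A minimal sketch of gathering a trace (the device name and the 30-second capture window are placeholders): capture events with blktrace, then render them as text with blkparse:
# blktrace -d /dev/sdX -o trace -w 30
# blkparse -i trace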
The following is an example of blktrace output:
8,64   3   1   0.000000000  4162  Q  RM 73992 + 8 [fs_mark]
8,64   3   0   0.000012707     0  m   N cfq4162S / alloced
8,64   3   2   0.000013433  4162  G  RM 73992 + 8 [fs_mark]
8,64   3   3   0.000015813  4162  P   N [fs_mark]
8,64   3   4   0.000017347  4162  I   R 73992 + 8 [fs_mark]
8,64   3   0   0.000018632     0  m   N cfq4162S / insert_request
8,64   3   0   0.000019655     0  m   N cfq4162S / add_to_rr
8,64   3   0   0.000021945     0  m   N cfq4162S / idle=0
8,64   3   5   0.000023460  4162  U   N [fs_mark] 1
8,64   3   0   0.000025761     0  m   N cfq workload slice:300
8,64   3   0   0.000027137     0  m   N cfq4162S / set_active wl_prio:0 wl_type:2
8,64   3   0   0.000028588     0  m   N cfq4162S / fifo=(null)
8,64   3   0   0.000029468     0  m   N cfq4162S / dispatch_insert
8,64   3   0   0.000031359     0  m   N cfq4162S / dispatched a request
8,64   3   0   0.000032306     0  m   N cfq4162S / activate rq, drv=1
8,64   3   6   0.000032735  4162  D   R 73992 + 8 [fs_mark]
8,64   1   1   0.004276637     0  C   R 73992 + 8 [0]
As you can see, the output is dense and difficult to read. You can tell which processes are responsible for issuing I/O to your device, which is useful, but blkparse can give you additional information in an easy-to-digest format in its summary. blkparse summary information is printed at the very end of its output:
Total (sde):
 Reads Queued:          19,       76KiB   Writes Queued:     142,183,  568,732KiB
 Read Dispatches:       19,       76KiB   Write Dispatches:   25,440,  568,732KiB
 Reads Requeued:         0                Writes Requeued:       125
 Reads Completed:       19,       76KiB   Writes Completed:   25,315,  568,732KiB
 Read Merges:            0,        0KiB   Write Merges:      116,868,  467,472KiB
 IO unplugs:        20,087                Timer unplugs:           0
The summary shows average I/O rates, merging activity, and compares the read workload with the write workload. For the most part, however, blkparse output is too voluminous to be useful on its own. Fortunately, there are several tools to assist in visualizing the data.
btt provides an analysis of the amount of time the I/O spent in the different areas of the I/O stack. These areas are:
  • Q - A block I/O is Queued
  • G - Get Request
    A newly queued block I/O was not a candidate for merging with any existing request, so a new block layer request is allocated.
  • M - A block I/O is Merged with an existing request.
  • I - A request is Inserted into the device's queue.
  • D - A request is issued to the Device.
  • C - A request is Completed by the driver.
  • P - The block device queue is Plugged, to allow the aggregation of requests.
  • U - The device queue is Unplugged, allowing the aggregated requests to be issued to the device.
btt breaks down the time spent in each of these areas, as well as the time spent transitioning between them, like so:
  • Q2Q - time between requests sent to the block layer
  • Q2G - how long it takes from the time a block I/O is queued to the time it gets a request allocated for it
  • G2I - how long it takes from the time a request is allocated to the time it is Inserted into the device's queue
  • Q2M - how long it takes from the time a block I/O is queued to the time it gets merged with an existing request
  • I2D - how long it takes from the time a request is inserted into the device's queue to the time it is actually issued to the device
  • M2D - how long it takes from the time a block I/O is merged with an existing request until the request is issued to the device
  • D2C - service time of the request by the device
  • Q2C - total time spent in the block layer for a request
You can deduce a lot about a workload from the above table. For example, if Q2Q is much larger than Q2C, that means the application is not issuing I/O in rapid succession. Thus, any performance problems you have may not be at all related to the I/O subsystem. If D2C is very high, then the device is taking a long time to service requests. This can indicate that the device is simply overloaded (which may be due to the fact that it is a shared resource), or it could be because the workload sent down to the device is sub-optimal. If Q2G is very high, it means that there are a lot of requests queued concurrently. This could indicate that the storage is unable to keep up with the I/O load.
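To feed blktrace data into btt, first have blkparse write a combined binary dump, then run btt on that dump; the file names here are placeholders:
# blkparse -i trace -d events.bin
# btt -i events.bin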
Finally, seekwatcher consumes blktrace binary data and generates a set of plots, including Logical Block Address (LBA), throughput, seeks per second, and I/Os Per Second (IOPS).
Figure 6.2. Example seekwatcher output


All plots use time as the X axis. The LBA plot shows reads and writes in different colors. It is interesting to note the relationship between the throughput and seeks/sec graphs. For storage that is seek-sensitive, there is an inverse relation between the two plots. The IOPS graph is useful if, for example, you are not getting the throughput you expect from a device, but you are hitting its IOPS limitations.

6.4. Configuration

One of the first decisions you will need to make is which I/O scheduler to use. This section provides an overview of each of the main schedulers to help you decide which is best for your workload.
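You can check which scheduler is currently active for a device, and switch it at runtime, by reading and writing its scheduler file (sda is only an example device):
# cat /sys/block/sda/queue/scheduler
# echo deadline > /sys/block/sda/queue/scheduler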

6.4.1. Completely Fair Queuing (CFQ)

CFQ attempts to provide some fairness in I/O scheduling decisions based on the process which initiated the I/O. Three different scheduling classes are provided: real-time (RT), best-effort (BE), and idle. A scheduling class can be manually assigned to a process with the ionice command, or programmatically assigned via the ioprio_set system call. By default, processes are placed in the best-effort scheduling class. The real-time and best-effort scheduling classes are further subdivided into eight I/O priorities within each class, priority 0 being the highest and 7 the lowest.
Processes in the real-time scheduling class are scheduled much more aggressively than processes in either best-effort or idle, so any scheduled real-time I/O is always performed before best-effort or idle I/O. This means that real-time priority I/O can starve out both the best-effort and idle classes. Best effort scheduling is the default scheduling class, and 4 is the default priority within this class.
Processes in the idle scheduling class are only serviced when there is no other I/O pending in the system. Thus, it is very important to only set the I/O scheduling class of a process to idle if I/O from the process is not at all required for making forward progress.
CFQ provides fairness by assigning a time slice to each of the processes performing I/O. During its time slice, a process may have (by default) up to 8 requests in flight at a time. The scheduler tries to anticipate whether an application will issue more I/O in the near future based on historical data. If it is expected that a process will issue more I/O, then CFQ will idle, waiting for that I/O, even if there is I/O from other processes waiting to be issued.
Because of the idling performed by CFQ, it is often not a good fit for hardware that does not suffer from a large seek penalty, such as fast external storage arrays or solid state disks. If using CFQ on such storage is a requirement (for example, if you would also like to use the cgroup proportional weight I/O scheduler), you will need to tune some settings to improve CFQ performance. Set the following parameters in the files of the same name located in /sys/block/device/queue/iosched/:
slice_idle = 0
quantum = 64
group_idle = 1
When group_idle is set to 1, there is still the potential for I/O stalls (whereby the back-end storage is not busy due to idling). However, these stalls will be less frequent than idling on every queue in the system.
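As a sketch, applying these values to a hypothetical device sdb (assuming CFQ is the active scheduler for that device) looks like this; the change does not persist across reboots:
# echo 0 > /sys/block/sdb/queue/iosched/slice_idle
# echo 64 > /sys/block/sdb/queue/iosched/quantum
# echo 1 > /sys/block/sdb/queue/iosched/group_idle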
CFQ is a non-work-conserving I/O scheduler, which means it can be idle even when there are requests pending (as we discussed above). The stacking of non-work-conserving schedulers can introduce large latencies in the I/O path. An example of such stacking is using CFQ on top of a host-based hardware RAID controller. The RAID controller may implement its own non-work-conserving scheduler, thus causing delays at two levels in the stack. Non-work-conserving schedulers operate best when they have as much data as possible to base their decisions on. In the case of stacking such scheduling algorithms, the bottom-most scheduler will only see what the upper scheduler sends down. Thus, the lower layer will see an I/O pattern that is not at all representative of the actual workload.

Tunables

back_seek_max
Backward seeks are typically bad for performance, as they can incur greater delays in repositioning the heads than forward seeks do. However, CFQ will still perform them if they are small enough. This tunable controls the maximum distance (in KB) over which the I/O scheduler will allow backward seeks. The default is 16 KB.
back_seek_penalty
Because of the inefficiency of backward seeks, a penalty is associated with each one. The penalty is a multiplier; for example, consider a disk head position at 1024KB. Assume there are two requests in the queue, one at 1008KB and another at 1040KB. The two requests are equidistant from the current head position. However, after applying the back seek penalty (default: 2), the request at the later position on disk is now twice as close as the earlier request. Thus, the head will move forward.
fifo_expire_async
This tunable controls how long an async (buffered write) request can go unserviced. After the expiration time (in milliseconds), a single starved async request will be moved to the dispatch list. The default is 250 ms.
fifo_expire_sync
This is the same as the fifo_expire_async tunable, but for synchronous (read and O_DIRECT write) requests. The default is 125 ms.
group_idle
When set, CFQ will idle on the last process issuing I/O in a cgroup. This should be set to 1 when using proportional weight I/O cgroups and setting slice_idle to 0 (typically done on fast storage).
group_isolation
If group isolation is enabled (set to 1), it provides a stronger isolation between groups at the expense of throughput. Generally speaking, if group isolation is disabled, fairness is provided for sequential workloads only. Enabling group isolation provides fairness for both sequential and random workloads. The default value is 0 (disabled). Refer to Documentation/cgroups/blkio-controller.txt for further information.
low_latency
When low latency is enabled (set to 1), CFQ attempts to provide a maximum wait time of 300 ms for each process issuing I/O on a device. This favors fairness over throughput. Disabling low latency (setting it to 0) ignores target latency, allowing each process in the system to get a full time slice. Low latency is enabled by default.
quantum
The quantum controls the number of I/Os that CFQ will send to the storage at a time, essentially limiting the device queue depth. By default, this is set to 8. The storage may support much deeper queue depths, but increasing quantum will also have a negative impact on latency, especially in the presence of large sequential write workloads.
slice_async
This tunable controls the time slice allotted to each process issuing asynchronous (buffered write) I/O. By default it is set to 40 ms.
slice_idle
This specifies how long CFQ should idle while waiting for further requests. The default value in Red Hat Enterprise Linux 6.1 and earlier is 8 ms. In Red Hat Enterprise Linux 6.2 and later, the default value is 0. The zero value improves the throughput of external RAID storage by removing all idling at the queue and service tree level. However, a zero value can degrade throughput on internal non-RAID storage, because it increases the overall number of seeks. For non-RAID storage, we recommend a slice_idle value that is greater than 0.
slice_sync
This tunable dictates the time slice allotted to a process issuing synchronous (read or direct write) I/O. The default is 100 ms.

6.4.2. Deadline I/O Scheduler

The deadline I/O scheduler attempts to provide a guaranteed latency for requests. It is important to note that the latency measurement only starts when the request gets down to the I/O scheduler (this is an important distinction, as an application may be put to sleep waiting for request descriptors to be freed). By default, reads are given priority over writes, since applications are more likely to block on read I/O.
Deadline dispatches I/Os in batches. A batch is a sequence of either read or write I/Os which are in increasing LBA order (the one-way elevator). After processing each batch, the I/O scheduler checks to see whether write requests have been starved for too long, and then decides whether to start a new batch of reads or writes. The FIFO list of requests is only checked for expired requests at the start of each batch, and then only for the data direction of that batch. So, if a write batch is selected, and there is an expired read request, that read request will not get serviced until the write batch completes.

Tunables

fifo_batch
This determines the number of reads or writes to issue in a single batch. The default is 16. Setting this to a higher value may result in better throughput, but will also increase latency.
front_merges
You can set this tunable to 0 if you know your workload will never generate front merges. Unless you have measured the overhead of this check, it is advisable to leave it at its default setting (1).
read_expire
This tunable allows you to set the number of milliseconds in which a read request should be serviced. By default, this is set to 500 ms (half a second).
write_expire
This tunable allows you to set the number of milliseconds in which a write request should be serviced. By default, this is set to 5000 ms (five seconds).
writes_starved
This tunable controls how many read batches can be processed before processing a single write batch. The higher this is set, the more preference is given to reads.

6.4.3. Noop

The Noop I/O scheduler implements a simple first-in first-out (FIFO) scheduling algorithm. Merging of requests happens at the generic block layer, but via a simple last-hit cache. If a system is CPU-bound and the storage is fast, this can be the best I/O scheduler to use.
Following are the tunables available for the block layer.

/sys/block/sdX/queue tunables

add_random
In some cases, the overhead of I/O events contributing to the entropy pool for /dev/random is measurable. In such cases, it may be desirable to set this value to 0.
max_sectors_kb
By default, the maximum request size sent to disk is 512 KB. This tunable can be used to either raise or lower that value. The minimum value is limited by the logical block size; the maximum value is limited by max_hw_sectors_kb. There are some SSDs which perform worse when I/O sizes exceed the internal erase block size. In such cases, it is recommended to tune max_sectors_kb down to the erase block size. You can test for this using an I/O generator such as iozone or aio-stress, varying the record size from, for example, 512 bytes to 1 MB.
nomerges
This tunable is primarily a debugging aid. Most workloads benefit from request merging (even on faster storage such as SSDs). In some cases, however, it is desirable to disable merging, such as when you want to see how many IOPS a storage back-end can process without disabling read-ahead or performing random I/O.
nr_requests
Each request queue has a limit on the total number of request descriptors that can be allocated for each of read and write I/Os. By default, the number is 128, meaning 128 reads and 128 writes can be queued at a time before putting a process to sleep. The process put to sleep is the next to try to allocate a request, not necessarily the process that has allocated all of the available requests.
If you have a latency-sensitive application, then you should consider lowering the value of nr_requests in your request queue and limiting the command queue depth on the storage to a low number (even as low as 1), so that writeback I/O cannot allocate all of the available request descriptors and fill up the device queue with write I/O. Once nr_requests have been allocated, all other processes attempting to perform I/O will be put to sleep to wait for requests to become available. This makes things more fair, as the requests are then distributed in a round-robin fashion (instead of letting one process consume them all in rapid succession). Note that this is only a problem when using the deadline or noop schedulers, as the default CFQ configuration protects against this situation.
optimal_io_size
In some circumstances, the underlying storage will report an optimal I/O size. This is most common in hardware and software RAID, where the optimal I/O size is the stripe size. If this value is reported, applications should issue I/O aligned to and in multiples of the optimal I/O size whenever possible.
read_ahead_kb
The operating system can detect when an application is reading data sequentially from a file or from disk. In such cases, it performs an intelligent read-ahead algorithm, whereby more data than is requested by the user is read from disk. Thus, when the user next attempts to read a block of data, it will already be in the operating system's page cache. The potential downside to this is that the operating system can read more data from disk than necessary, which occupies space in the page cache until it is evicted because of high memory pressure. Having multiple processes doing false read-ahead would increase memory pressure in this circumstance.
For device mapper devices, it is often a good idea to increase the value of read_ahead_kb to a large number, such as 8192. The reason is that a device mapper device is often made up of multiple underlying devices. Setting this value to the default (128 KB) multiplied by the number of devices you are mapping is a good starting point for tuning.
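For example, to apply the larger read-ahead suggested above to a hypothetical device mapper device dm-0 (128 KB multiplied by an assumed 64 underlying devices gives 8192 KB):
# echo 8192 > /sys/block/dm-0/queue/read_ahead_kb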
rotational
Traditional hard disks have been rotational (made up of spinning platters). SSDs, however, are not. Most SSDs will advertise this properly. If, however, you come across a device that does not advertise this flag properly, it may be necessary to set rotational to 0 manually; when rotational is disabled, the I/O elevator does not use logic that is meant to reduce seeks, since there is little penalty for seek operations on non-rotational media.
rq_affinity
I/O completions can be processed on a different CPU from the one that issued the I/O. Setting rq_affinity to 1 causes the kernel to deliver completions to the CPU on which the I/O was issued. This can improve CPU data caching effectiveness.