Red Hat Enterprise Linux Manual

Performance Tuning Guide

Chapter 7. File Systems

Read this chapter for an overview of the file systems supported for use with Red Hat Enterprise Linux, and how to optimize their performance.

7.1. Tuning Considerations for File Systems

There are several tuning considerations common to all file systems: formatting and mount options selected on your system, and actions available to applications that may improve their performance on a given system.

7.1.1. Formatting Options

File system block size
Block size can be selected at mkfs time. The range of valid sizes depends on the system: the upper limit is the maximum page size of the host system, while the lower limit depends on the file system used. The default block size is appropriate for most use cases.
If you expect to create many files smaller than the default block size, you can set a smaller block size to minimize the amount of space wasted on disk. Note, however, that setting a smaller block size may limit the maximum size of the file system, and can cause additional runtime overhead, particularly for files greater than the selected block size.
File system geometry
If your system uses striped storage such as RAID5, you can improve performance by aligning data and metadata with the underlying storage geometry at mkfs time. For software RAID (LVM or MD) and some enterprise hardware storage, this information is queried and set automatically, but in many cases the administrator must specify this geometry manually with mkfs at the command line.
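As an illustration of manual geometry alignment, the sketch below computes ext4 stride and stripe-width values for an assumed RAID5 array; the 4-disk layout, 64 KiB chunk size, and device name are hypothetical examples, not values from this guide:

```shell
# Assumed geometry: RAID5 over 4 disks (3 data + 1 parity),
# 64 KiB chunk size, 4 KiB file system block size.
CHUNK_KB=64
BLOCK_KB=4
DATA_DISKS=3

STRIDE=$(( CHUNK_KB / BLOCK_KB ))           # file system blocks per chunk
WIDTH=$(( STRIDE * DATA_DISKS ))            # blocks per full data stripe
echo "stride=$STRIDE stripe-width=$WIDTH"   # prints: stride=16 stripe-width=48

# Pass the results to mkfs (requires a real device; /dev/md0 is a placeholder):
#   mkfs.ext4 -E stride=$STRIDE,stripe-width=$WIDTH /dev/md0
#   mkfs.xfs -d su=${CHUNK_KB}k,sw=$DATA_DISKS /dev/md0
```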
Refer to the Storage Administration Guide for further information about creating and maintaining these file systems.
External journals
Metadata-intensive workloads mean that the log section of a journaling file system (such as ext4 and XFS) is updated extremely frequently. To minimize seek time from file system to journal, you can place the journal on dedicated storage. Note, however, that placing the journal on external storage that is slower than the primary file system can nullify any potential advantage associated with using external storage.

Warning

Ensure that your external journal is reliable. The loss of an external journal device will cause file system corruption.
External journals are created at mkfs time, with journal devices being specified at mount time. Refer to the mke2fs(8), mkfs.xfs(8), and mount(8) man pages for further information.
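For example, an external journal might be set up as in the following sketch; /dev/sdb1 (the fast journal device), /dev/sdc1 (the data device), and the mount point are placeholders:

```shell
# ext4: create a dedicated journal device, then a file system that uses it.
mke2fs -O journal_dev /dev/sdb1
mkfs.ext4 -J device=/dev/sdb1 /dev/sdc1

# XFS: specify the external log at mkfs time, and again at mount time.
mkfs.xfs -l logdev=/dev/sdb1 /dev/sdc1
mount -o logdev=/dev/sdb1 /dev/sdc1 /mnt/data
```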

7.1.2. Mount Options

Barriers
A write barrier is a kernel mechanism used to ensure that file system metadata is correctly written and ordered on persistent storage, even when storage devices with volatile write caches lose power. File systems with write barriers enabled also ensure that any data transmitted via fsync() persists across a power outage. Red Hat Enterprise Linux enables barriers by default on all hardware that supports them.
However, enabling write barriers slows some applications significantly; specifically, applications that use fsync() heavily, or create and delete many small files. For storage with no volatile write cache, or in the rare case where file system inconsistencies and data loss after a power loss is acceptable, barriers can be disabled by using the nobarrier mount option. For further information, refer to the Storage Administration Guide.
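As a sketch (the device and mount point are placeholders), barriers can be disabled at mount time or persistently in /etc/fstab:

```shell
mount -o nobarrier /dev/sdb1 /mnt/data

# Persistently, via /etc/fstab:
#   /dev/sdb1  /mnt/data  ext4  defaults,nobarrier  0 0
```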
Access Time (noatime)
Historically, when a file is read, the access time (atime) for that file must be updated in the inode metadata, which involves additional write I/O. If accurate atime metadata is not required, mount the file system with the noatime option to eliminate these metadata updates. In most cases, however, atime is not a large overhead due to the default relative atime (or relatime) behavior in the Red Hat Enterprise Linux 6 kernel. The relatime behavior only updates atime if the previous atime is older than the modification time (mtime) or status change time (ctime).

Note

Enabling the noatime option also enables nodiratime behavior; there is no need to set both noatime and nodiratime.
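For example (the device and mount points are illustrative):

```shell
# Remount an existing file system without access-time updates:
mount -o remount,noatime /home

# Or persistently, via /etc/fstab:
#   /dev/sdb2  /home  ext4  defaults,noatime  0 0
```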
Increased read-ahead support
Read-ahead speeds up file access by pre-fetching data and loading it into the page cache so that it can be available earlier in memory instead of from disk. Some workloads, such as those involving heavy streaming of sequential I/O, benefit from high read-ahead values.
The tuned tool and LVM striping both elevate the read-ahead value, but this is not always sufficient for some workloads. Additionally, Red Hat Enterprise Linux is not always able to set an appropriate read-ahead value based on what it can detect of your storage. For example, if a powerful storage array presents itself to Red Hat Enterprise Linux as a single LUN, the operating system sees only one device, and therefore will not by default make full use of the read-ahead capacity potentially available from the array.
Use the blockdev command to view and edit the read-ahead value. To view the current read-ahead value for a particular block device, run:
# blockdev --getra device
To modify the read-ahead value for that block device, run the following command. N represents the number of 512-byte sectors.
# blockdev --setra N device
Note that the value selected with the blockdev command does not persist between boots. We recommend creating an init script that sets this value during boot.
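Putting these commands together, the sketch below converts a desired read-ahead window into 512-byte sectors; the 4 MiB figure and the device name are only examples:

```shell
# Read-ahead is expressed in 512-byte sectors; compute the sector count
# for an example 4 MiB read-ahead window:
N=$(( 4 * 1024 * 1024 / 512 ))
echo "$N"    # prints 8192

# View and set the value (requires root; /dev/sdb is a placeholder):
#   blockdev --getra /dev/sdb
#   blockdev --setra "$N" /dev/sdb
```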

7.1.3. File system maintenance

Discard unused blocks
Batch discard and online discard operations are features of mounted file systems that discard blocks which are not in use by the file system. These operations are useful for both solid-state drives and thinly-provisioned storage.
Batch discard operations are run explicitly by the user with the fstrim command. This command discards all unused blocks in a file system that match the user's criteria. Both operation types are supported for use with the XFS and ext4 file systems in Red Hat Enterprise Linux 6.2 and later as long as the block device underlying the file system supports physical discard operations. Physical discard operations are supported if the value of /sys/block/device/queue/discard_max_bytes is not zero.
Online discard operations are specified at mount time with the -o discard option (either in /etc/fstab or as part of the mount command), and run in realtime without user intervention. Online discard operations only discard blocks that are transitioning from used to free. Online discard operations are supported on ext4 file systems in Red Hat Enterprise Linux 6.2 and later, and on XFS file systems in Red Hat Enterprise Linux 6.4 and later.
Red Hat recommends batch discard operations unless the system's workload is such that batch discard is not feasible, or online discard operations are necessary to maintain performance.
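The sketch below illustrates both approaches; the device name and mount point are placeholders:

```shell
# Check that the device supports physical discard (a non-zero value):
cat /sys/block/sdb/queue/discard_max_bytes

# Batch discard: trim all unused blocks under a mount point.
fstrim -v /mnt/data

# Online discard: mount with the discard option, or add it to /etc/fstab:
#   /dev/sdb1  /mnt/data  ext4  defaults,discard  0 0
```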

7.1.4. Application Considerations

Pre-allocation
The ext4, XFS, and GFS2 file systems support efficient space pre-allocation via the fallocate(2) glibc call. In cases where files may otherwise become badly fragmented due to write patterns, leading to poor read performance, space preallocation can be a useful technique. Pre-allocation marks disk space as if it has been allocated to a file, without writing any data into that space. Until real data is written to a pre-allocated block, read operations will return zeroes.
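From the shell, the same behavior can be observed with the fallocate utility (a wrapper around the fallocate call); the file used here is a temporary file:

```shell
# Reserve 1 MiB for a file without writing any data; reads of the
# unwritten range return zeroes.
f=$(mktemp)
fallocate -l 1M "$f"
stat -c %s "$f"    # prints 1048576
rm -f "$f"
```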

7.2. Profiles for file system performance

The tuned-adm tool allows users to easily swap between a number of profiles that have been designed to enhance performance for specific use cases. The profiles that are particularly useful in improving storage performance are:
latency-performance
A server profile for typical latency performance tuning. It disables tuned and ktune power-saving mechanisms. The cpuspeed mode changes to performance. The I/O elevator is changed to deadline for each device. The cpu_dma_latency parameter is registered with a value of 0 (the lowest possible latency) for power management quality-of-service to limit latency where possible.
throughput-performance
A server profile for typical throughput performance tuning. This profile is recommended if the system does not have enterprise-class storage. It is the same as latency-performance, except:
  • kernel.sched_min_granularity_ns (scheduler minimal preemption granularity) is set to 10 milliseconds,
  • kernel.sched_wakeup_granularity_ns (scheduler wake-up granularity) is set to 15 milliseconds,
  • vm.dirty_ratio (virtual memory dirty ratio) is set to 40%, and
  • transparent huge pages are enabled.
enterprise-storage
This profile is recommended for enterprise-sized server configurations with enterprise-class storage, including battery-backed controller cache protection and management of on-disk cache. It is the same as the throughput-performance profile, except:
  • readahead value is set to 4x, and
  • non root/boot file systems are re-mounted with barrier=0.
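For example, switching profiles is a single command:

```shell
# List the available profiles, activate one, and confirm it:
tuned-adm list
tuned-adm profile enterprise-storage
tuned-adm active
```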
More information about tuned-adm is available on the man page (man tuned-adm), or in the Power Management Guide available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/.

7.3. File Systems

7.3.1. The Ext4 File System

The ext4 file system is a scalable extension of the default ext3 file system available in Red Hat Enterprise Linux 5. Ext4 is now the default file system for Red Hat Enterprise Linux 6, and is supported for a maximum file system size of 16 TB and a maximum single file size of 16 TB. It also removes the 32000 sub-directory limit present in ext3.

Note

For file systems larger than 16 TB, we recommend using a scalable high capacity file system such as XFS. For further information, see Section 7.3.2, "The XFS File System".
The ext4 file system defaults are optimal for most workloads, but if performance analysis shows that file system behavior is impacting performance, several tuning options are available:
Inode table initialization
For very large file systems, the mkfs.ext4 process can take a very long time to initialize all inode tables in the file system. This process can be deferred with the -E lazy_itable_init=1 option. If this is used, kernel processes will continue to initialize the file system after it is mounted. The rate at which this initialization occurs can be controlled with the -o init_itable=n option for the mount command, where the amount of time spent performing this background initialization is roughly 1/n. The default value for n is 10.
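As a sketch (the device and mount point are placeholders):

```shell
# Defer inode table initialization at format time:
mkfs.ext4 -E lazy_itable_init=1 /dev/sdb1

# After mounting, throttle the background initialization; with n=10
# (the default), roughly 1/10 of the time is spent initializing:
mount -o init_itable=10 /dev/sdb1 /mnt/data
```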
Auto-fsync behavior
Because some applications do not always properly fsync() after renaming an existing file, or truncating and rewriting, ext4 defaults to automatic syncing of files after replace-via-rename and replace-via-truncate operations. This behavior is largely consistent with older ext3 filesystem behavior. However, fsync() operations can be time consuming, so if this automatic behavior is not required, use the -o noauto_da_alloc option with the mount command to disable it. This will mean that the application must explicitly use fsync() to ensure data persistence.
Journal I/O priority
By default, journal commit I/O is given a slightly higher priority than normal I/O. This priority can be controlled with the journal_ioprio=n option of the mount command. The default value is 3. Valid values range from 0 to 7, with 0 being the highest priority I/O.
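The auto-fsync and journal priority options described above are both applied at mount time; a hedged example (placeholder device and mount point):

```shell
# Disable automatic sync-on-replace and lower the journal commit
# I/O priority to the minimum (7; the default is 3):
mount -o noauto_da_alloc,journal_ioprio=7 /dev/sdb1 /mnt/data
```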
For other mkfs and tuning options, please see the mkfs.ext4(8) and mount(8) man pages, as well as the Documentation/filesystems/ext4.txt file in the kernel-doc package.

7.3.2. The XFS File System

XFS is a robust and highly-scalable single host 64-bit journaling file system. It is entirely extent-based, so it supports very large file and file system sizes. The number of files an XFS system can hold is limited only by the space available in the file system.
XFS supports metadata journaling, which facilitates quicker crash recovery. The XFS file system can also be defragmented and enlarged while mounted and active. In addition, Red Hat Enterprise Linux 6 supports backup and restore utilities specific to XFS.
XFS uses extent-based allocation, and features a number of allocation schemes such as delayed allocation and explicit pre-allocation. Extent-based allocation provides a more compact and efficient method of tracking used space in a file system, and improves large file performance by reducing fragmentation and the space consumed by metadata. Delayed allocation improves the chance that a file will be written in a contiguous group of blocks, reducing fragmentation and improving performance. Pre-allocation can be used to prevent fragmentation entirely in cases where the application knows the amount of data it needs to write ahead of time.
XFS provides excellent I/O scalability by using b-trees to index all user data and metadata. As object counts grow, all operations on indexes inherit the logarithmic scalability characteristics of the underlying b-trees. Some of the tuning options XFS provides at mkfs time vary the width of the b-trees, which changes the scalability characteristics of different subsystems.

7.3.2.1. Basic tuning for XFS

In general, the default XFS format and mount options are optimal for most workloads; Red Hat recommends that the default values are used unless specific configuration changes are expected to benefit the workload of the file system. If software RAID is in use, the mkfs.xfs command automatically configures itself with the correct stripe unit and width to align with the hardware. This may need to be manually configured if hardware RAID is in use.
The inode64 mount option is highly recommended for multi-terabyte file systems, except where the file system is exported via NFS and legacy 32-bit NFS clients require access to the file system.
The logbsize mount option is recommended for file systems that are modified frequently, or in bursts. The default value is MAX (32 KB, log stripe unit), and the maximum size is 256 KB. A value of 256 KB is recommended for file systems that undergo heavy modifications.
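A sketch combining both recommendations (placeholder device and mount point):

```shell
# 64-bit inode numbers plus 256 KB log buffers for a large,
# heavily-modified file system:
mount -o inode64,logbsize=256k /dev/sdb1 /mnt/data
```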

7.3.2.2. Advanced tuning for XFS

Before changing XFS parameters, you need to understand why the default XFS parameters are causing performance problems. This involves understanding what your application is doing, and how the file system is reacting to those operations.
Observable performance problems that can be corrected or reduced by tuning are generally caused by file fragmentation or resource contention in the file system. There are different ways to address these problems, and in some cases fixing the problem will require that the application, rather than the file system configuration, be modified.
If you have not been through this process previously, it is recommended that you engage your local Red Hat support engineer for advice.
Optimizing for a large number of files
XFS imposes an arbitrary limit on the number of files that a file system can hold. In general, this limit is high enough that it will never be hit. If you know that the default limit will be insufficient ahead of time, you can increase the percentage of file system space allowed for inodes with the mkfs.xfs command. If you encounter the file limit after file system creation (usually indicated by ENOSPC errors when attempting to create a file or directory even though free space is available), you can adjust the limit with the xfs_growfs command.
Optimizing for a large number of files in a single directory
Directory block size is fixed for the life of a file system, and cannot be changed except upon initial formatting with mkfs. The minimum directory block size is the file system block size, which defaults to MAX (4 KB, file system block size). In general, there is no reason to reduce the directory block size.
Because the directory structure is b-tree based, changing the block size affects the amount of directory information that can be retrieved or modified per physical I/O. The larger the directory becomes, the more I/O each operation requires at a given block size.
However, when larger directory block sizes are in use, more CPU is consumed by each modification operation compared to the same operation on a file system with a smaller directory block size. This means that for small directory sizes, large directory block sizes will result in lower modification performance. When the directory reaches a size where I/O is the performance-limiting factor, large block size directories perform better.
The default configuration of a 4 KB file system block size and a 4 KB directory block size is best for directories with up to 1-2 million entries with a name length of 20-40 bytes per entry. If your file system requires more entries, larger directory block sizes tend to perform better - a 16 KB block size is best for file systems with 1-10 million directory entries, and a 64 KB block size is best for file systems with over 10 million directory entries.
If the workload uses random directory lookups more than modifications (that is, directory reads are much more common or important than directory writes), then the above thresholds for increasing the block size are approximately one order of magnitude lower.
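The directory block size is selected at format time with the -n option to mkfs.xfs; for example (placeholder device):

```shell
# 16 KiB directory blocks, suited to directories in the 1-10 million
# entry range; this cannot be changed after mkfs:
mkfs.xfs -n size=16k /dev/sdb1
```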
Optimizing for concurrency
Unlike other file systems, XFS can perform many types of allocation and deallocation operations concurrently provided that the operations are occurring on non-shared objects. Allocation or deallocation of extents can occur concurrently provided that the concurrent operations occur in different allocation groups. Similarly, allocation or deallocation of inodes can occur concurrently provided that the concurrent operations affect different allocation groups.
The number of allocation groups becomes important when using machines with a high CPU count and multi-threaded applications that attempt to perform operations concurrently. If only four allocation groups exist, then sustained, parallel metadata operations will only scale as far as those four CPUs (the concurrency limit provided by the system). For small file systems, ensure that the number of allocation groups is supported by the concurrency provided by the system. For large file systems (tens of terabytes and larger) the default formatting options generally create sufficient allocation groups to avoid limiting concurrency.
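The allocation group count is also set at format time; a sketch (placeholder device and mount point, illustrative count):

```shell
# Create 16 allocation groups so that up to 16 threads can perform
# allocation or deallocation concurrently:
mkfs.xfs -d agcount=16 /dev/sdb1

# Verify the resulting geometry on the mounted file system:
xfs_info /mnt/data
```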
Applications must be aware of single points of contention in order to use the parallelism inherent in the structure of the XFS file system. It is not possible to modify a directory concurrently, so applications that create and remove large numbers of files should avoid storing all files in a single directory. Each directory created is placed in a different allocation group, so techniques such as hashing files over multiple sub-directories provide a more scalable storage pattern compared to using a single large directory.
Optimizing for applications that use extended attributes
XFS can store small attributes directly in the inode if space is available there. If an attribute fits into the inode, it can be retrieved and modified without the extra I/O needed to read separate attribute blocks; retrieving out-of-line attributes can easily be an order of magnitude slower than retrieving in-line attributes.
For the default inode size of 256 bytes, roughly 100 bytes of attribute space is available depending on the number of data extent pointers also stored in the inode. The default inode size is really only useful for storing a small number of small attributes.
Increasing the inode size at mkfs time can increase the amount of space available for storing attributes in-line. A 512 byte inode size increases the space available for attributes to roughly 350 bytes; a 2 KB inode has roughly 1900 bytes of space available.
There is, however, a limit on the size of the individual attributes that can be stored in-line - there is a maximum size limit of 254 bytes for both the attribute name and the value (that is, an attribute with a name length of 254 bytes and a value length of 254 bytes will stay in-line). Exceeding these size limits forces the attributes out of line, even if there would have been enough space to store all the attributes in the inode.
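Inode size is fixed at format time with the -i option; for example (placeholder device):

```shell
# 512-byte inodes leave roughly 350 bytes for in-line attributes:
mkfs.xfs -i size=512 /dev/sdb1
```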
Optimizing for sustained metadata modifications
The size of the log is the main factor in determining the achievable level of sustained metadata modification. The log device is circular, so before the tail can be overwritten all the modifications in the log must be written to the real locations on disk. This can involve a significant amount of seeking to write back all dirty metadata. The default configuration scales the log size in relation to the overall file system size, so in most cases log size will not require tuning.
A small log device will result in very frequent metadata writeback - the log will constantly be pushing on its tail to free up space and so frequently modified metadata will be frequently written to disk, causing operations to be slow.
Increasing the log size increases the time period between tail pushing events. This allows better aggregation of dirty metadata, resulting in better metadata writeback patterns, and less writeback of frequently modified metadata. The trade-off is that larger logs require more memory to track all outstanding changes in memory.
If you have a machine with limited memory, then large logs are not beneficial because memory constraints will cause metadata writeback long before the benefits of a large log can be realised. In these cases, smaller rather than larger logs will often provide better performance because metadata writeback from the log running out of space is more efficient than writeback driven by memory reclamation.
You should always try to align the log to the underlying stripe unit of the device that contains the file system. mkfs does this by default for MD and DM devices, but for hardware RAID it may need to be specified. Setting this correctly avoids all possibility of log I/O causing unaligned I/O and subsequent read-modify-write operations when writing modifications to disk.
Log operation can be further improved by editing mount options. Increasing the size of the in-memory log buffers (logbsize) increases the speed at which changes can be written to the log. The default log buffer size is MAX (32 KB, log stripe unit), and the maximum size is 256 KB. In general, a larger value results in faster performance. However, under fsync-heavy workloads, small log buffers can be noticeably faster than large buffers with a large stripe unit alignment.
The delaylog mount option also improves sustained metadata modification performance by reducing the number of changes to the log. It achieves this by aggregating individual changes in memory before writing them to the log: frequently modified metadata is written to the log periodically instead of on every modification. This option increases the memory usage of tracking dirty metadata and increases the potential lost operations when a crash occurs, but can improve metadata modification speed and scalability by an order of magnitude or more. Use of this option does not reduce data or metadata integrity when fsync, fdatasync or sync are used to ensure data and metadata is written to disk.

7.4. Clustering

Clustered storage provides a consistent file system image across all servers in a cluster, allowing servers to read and write to a single, shared file system. This simplifies storage administration by limiting tasks like installing and patching applications to one file system. A cluster-wide file system also eliminates the need for redundant copies of application data, simplifying backup and disaster recovery.
Red Hat's High Availability Add-On provides clustered storage in conjunction with Red Hat Global File System 2 (part of the Resilient Storage Add-On).

7.4.1. Global File System 2

Global File System 2 (GFS2) is a native file system that interfaces directly with the Linux kernel file system. It allows multiple computers (nodes) to simultaneously share the same storage device in a cluster. The GFS2 file system is largely self-tuning, but manual tuning is possible. This section outlines performance considerations when attempting to tune performance manually.
Red Hat Enterprise Linux 6.4 introduces improvements to file fragmentation management in GFS2. Files created by Red Hat Enterprise Linux 6.3 or earlier were prone to file fragmentation if multiple files were written at the same time by more than one process. This fragmentation made things run slowly, especially in workloads involving large files. With Red Hat Enterprise Linux 6.4, simultaneous writes result in less file fragmentation and therefore better performance for these workloads.
While there is no defragmentation tool for GFS2 on Red Hat Enterprise Linux, you can defragment individual files by identifying them with the filefrag tool, copying them to temporary files, and renaming the temporary files to replace the originals. (This procedure can also be done in versions prior to 6.4 as long as the writing is done sequentially.)
Since GFS2 uses a global locking mechanism that potentially requires communication between nodes of a cluster, the best performance will be achieved when your system is designed to avoid file and directory contention between these nodes. Some methods of avoiding contention are to:
  • Pre-allocate files and directories with fallocate where possible, to optimize the allocation process and avoid the need to lock source pages.
  • Minimize the areas of the file system that are shared between multiple nodes to minimize cross-node cache invalidation and improve performance. For example, if multiple nodes mount the same file system, but access different sub-directories, you will likely achieve better performance by moving one subdirectory to a separate file system.
  • Select an optimal resource group size and number. This depends on typical file sizes and available free space on the system, and affects the likelihood that multiple nodes will attempt to use a resource group simultaneously. Too many resource groups can slow block allocation while allocation space is located, while too few resource groups can cause lock contention during deallocation. It is generally best to test multiple configurations to determine which is best for your workload.
However, contention is not the only issue that can affect GFS2 file system performance. Other best practices to improve overall performance are to:
  • Select your storage hardware according to the expected I/O patterns from cluster nodes and the performance requirements of the file system.
  • Use solid-state storage where possible to lower seek time.
  • Create an appropriately-sized file system for your workload, and ensure that the file system is never at more than 80% capacity. Smaller file systems will have proportionally shorter backup times, and require less time and memory for file system checks, but are subject to high fragmentation if they are too small for their workload.
  • Set larger journal sizes for metadata-intensive workloads, or when journaled data is in use. Although this uses more memory, it improves performance because more journaling space is available to store data before a write is necessary.
  • Ensure that clocks on GFS2 nodes are synchronized to avoid issues with networked applications. We recommend using NTP (Network Time Protocol).
  • Unless file or directory access times are critical to the operation of your application, mount the file system with the noatime and nodiratime mount options.

    Note

    Red Hat strongly recommends the use of the noatime option with GFS2.
  • If you need to use quotas, try to reduce the frequency of quota synchronization transactions or use fuzzy quota synchronization to prevent performance issues arising from constant quota file updates.

    Note

    Fuzzy quota accounting can allow users and groups to slightly exceed their quota limit. To minimize this issue, GFS2 dynamically reduces the synchronization period as a user or group approaches its quota limit.
For more detailed information about each aspect of GFS2 performance tuning, refer to the Global File System 2 guide, available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/.

Chapter 8. Networking

Over time, Red Hat Enterprise Linux's network stack has been upgraded with numerous automated optimization features. For most workloads, the auto-configured network settings provide optimized performance.
In most cases, networking performance problems are actually caused by a malfunction in hardware or faulty infrastructure. Such causes are beyond the scope of this document; the performance issues and solutions discussed in this chapter are useful in optimizing perfectly functional systems.
Networking is a delicate subsystem, containing many different parts with sensitive connections. This is why the open source community and Red Hat invest a great deal of work in implementing ways to automatically optimize network performance. As such, for most workloads you may never even need to reconfigure networking for performance.

8.1. Network Performance Enhancements

Red Hat Enterprise Linux 6.1 provided the following network performance enhancements:

Receive Packet Steering (RPS)

RPS enables a single NIC rx queue to have its receive softirq workload distributed among several CPUs. This helps prevent network traffic from being bottlenecked on a single NIC hardware queue.
To enable RPS, specify the target CPUs in /sys/class/net/ethX/queues/rx-N/rps_cpus, replacing ethX with the NIC's corresponding device name (for example, eth1, eth2) and rx-N with the receive queue number. The CPUs specified in the file will then process data from queue rx-N on ethX. When specifying CPUs, consider the queue's cache affinity.
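The rps_cpus file takes a hexadecimal CPU bitmask. The sketch below builds a mask selecting CPUs 0-3; the NIC name, queue number, and CPU selection are illustrative assumptions:

```shell
# Build a hexadecimal CPU mask with bits 0-3 set (CPUs 0 through 3):
MASK=$(printf '%x' $(( (1 << 4) - 1 )))
echo "$MASK"    # prints f

# Apply it to receive queue rx-0 of eth1 (requires root):
#   echo "$MASK" > /sys/class/net/eth1/queues/rx-0/rps_cpus
```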

Receive Flow Steering

RFS is an extension of RPS, allowing the administrator to configure a hash table that is populated automatically when applications receive data and are interrogated by the network stack. This determines which applications are receiving each piece of network data (based on source:destination network information).
Using this information, the network stack can schedule the most optimal CPU to receive each packet. To configure RFS, use the following tunables:
/proc/sys/net/core/rps_sock_flow_entries
This controls the maximum number of sockets/flows that the kernel can steer towards any specified CPU. This is a system-wide, shared limit.
/sys/class/net/ethX/queues/rx-N/rps_flow_cnt
This controls the maximum number of sockets/flows that the kernel can steer for a specified receive queue (rx-N) on a NIC (ethX). Note that the sum of all per-queue values for this tunable across all NICs should be less than or equal to the value of /proc/sys/net/core/rps_sock_flow_entries.
Unlike RPS, RFS allows both the receive queue and the application to share the same CPU when processing packet flows. This can result in improved performance in some cases. However, such improvements are dependent on factors such as cache hierarchy, application load, and the like.
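As an illustrative sizing sketch, the system-wide flow table can be divided evenly among the receive queues; the 32768-entry table, the 8-queue count, and the NIC name are assumptions, not values from this guide:

```shell
# Divide a system-wide flow table evenly across the receive queues:
ENTRIES=32768
QUEUES=8
PER_QUEUE=$(( ENTRIES / QUEUES ))
echo "$PER_QUEUE"    # prints 4096

# Apply (requires root; eth1 is a placeholder, one line per queue):
#   sysctl -w net.core.rps_sock_flow_entries=$ENTRIES
#   echo "$PER_QUEUE" > /sys/class/net/eth1/queues/rx-0/rps_flow_cnt
```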

getsockopt support for TCP thin-streams

Thin-stream is a term used to characterize transport protocols wherein applications send data at such a low rate that the protocol's retransmission mechanisms are not fully saturated. Applications that use thin-stream protocols typically transport via reliable protocols like TCP; in most cases, such applications provide very time-sensitive services (for example, stock trading, online gaming, control systems).
For time-sensitive services, packet loss can be devastating to service quality. To help prevent this, the getsockopt call has been enhanced to support two extra options:
TCP_THIN_DUPACK
This Boolean enables dynamic triggering of retransmissions after one dupACK for thin streams.
TCP_THIN_LINEAR_TIMEOUTS
This Boolean enables dynamic triggering of linear timeouts for thin streams.
Both options are specifically activated by the application. For more information about these options, refer to file:///usr/share/doc/kernel-doc-version/Documentation/networking/ip-sysctl.txt. For more information about thin-streams, refer to file:///usr/share/doc/kernel-doc-version/Documentation/networking/tcp-thin.txt.
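Applications enable these options per socket with setsockopt(); kernels of this era also expose system-wide equivalents as sysctls, documented in ip-sysctl.txt. A sketch, assuming those sysctls are present on your kernel:

```shell
# Enable thin-stream retransmission behavior system-wide:
sysctl -w net.ipv4.tcp_thin_dupack=1
sysctl -w net.ipv4.tcp_thin_linear_timeouts=1
```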

Transparent Proxy (TProxy) support

The kernel can now handle non-locally bound IPv4 TCP and UDP sockets to support transparent proxies. To enable this, you will need to configure iptables accordingly. You will also need to enable and configure policy routing properly.
For more information about transparent proxies, refer to file:///usr/share/doc/kernel-doc-version/Documentation/networking/tproxy.txt.

8.2. Optimized Network Settings

Performance tuning is usually done pre-emptively: we adjust known variables before running an application or deploying a system, and if an adjustment proves ineffective, we try adjusting other variables. The logic behind this approach is that by default, the system is not operating at an optimal level of performance, so we assume we need to adjust the system accordingly, in some cases via calculated guesses.
As mentioned earlier, the network stack is mostly self-optimizing. In addition, effectively tuning the network requires a thorough understanding not just of how the network stack works, but also of the specific system's network resource requirements. Incorrect network performance configuration can actually lead to degraded performance.
For example, consider the bufferbloat problem. Increasing buffer queue depths results in TCP connections with congestion windows larger than the link would otherwise allow (due to deep buffering). However, those connections also have very large RTT values, since the frames spend so much time in-queue. This, in turn, results in sub-optimal throughput, as it becomes impossible to detect congestion.
When it comes to network performance, it is advisable to keep the default settings unless a particular performance issue becomes apparent. Such issues include frame loss, significantly reduced throughput, and the like. Even then, the best solution is often one that results from meticulous study of the problem, rather than simply tuning settings upward (increasing buffer/queue lengths, reducing interrupt latency, etc).
To properly diagnose a network performance problem, use the following tools:
netstat
A command-line utility that prints network connections, routing tables, interface statistics, masquerade connections and multicast memberships. It retrieves information about the networking subsystem from the /proc/net/ file system. These files include:
  • /proc/net/dev (device information)
  • /proc/net/tcp (TCP socket information)
  • /proc/net/unix (Unix domain socket information)
For more information about netstat and its referenced files from /proc/net/, refer to the netstat man page: man netstat.
dropwatch
A monitoring utility that monitors packets dropped by the kernel. For more information, refer to the dropwatch man page: man dropwatch.
ip
A utility for managing and monitoring routes, devices, policy routing, and tunnels. For more information, refer to the ip man page: man ip.
ethtool
A utility for displaying and changing NIC settings. For more information, refer to the ethtool man page: man ethtool.
/proc/net/snmp
A file that displays ASCII data needed for the IP, ICMP, TCP, and UDP management information bases for an SNMP agent. It also displays real-time UDP-lite statistics.
The SystemTap Beginners Guide contains several sample scripts you can use to profile and monitor network performance. This guide is available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/.
After collecting relevant data on a network performance problem, you should be able to formulate a theory - and, hopefully, a solution. [5] For example, an increase in UDP input errors in /proc/net/snmp indicates that one or more socket receive queues are full when the network stack attempts to queue new frames into an application's socket.
This indicates that packets are bottlenecked in at least one socket queue, which means either the socket queue drains packets too slowly, or the packet volume is too large for that queue. If it is the latter, check the logs of any network-intensive application for lost data; to resolve this, you would need to optimize or reconfigure the offending application.
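For instance, the Udp line pair in /proc/net/snmp carries the InErrors counter mentioned above. A minimal parsing sketch (the sample text below is illustrative, not captured from a real system):

```python
# Sketch: extract the UDP InErrors counter from /proc/net/snmp content.
# /proc/net/snmp stores each protocol as a header line naming the fields,
# followed by a line of corresponding values.

SAMPLE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 104216 0 42 104373 42 0
"""

def udp_in_errors(snmp_text):
    lines = [l for l in snmp_text.splitlines() if l.startswith("Udp:")]
    header, values = lines[0].split()[1:], lines[1].split()[1:]
    stats = dict(zip(header, (int(v) for v in values)))
    return stats["InErrors"]

assert udp_in_errors(SAMPLE) == 42
```

A rising InErrors value between two reads of the live file points at full socket receive queues.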

Socket receive buffer size

Socket send and receive buffer sizes are dynamically adjusted, so they rarely need to be manually edited. If further analysis, such as the analysis presented in the SystemTap network example, sk_stream_wait_memory.stp, suggests that the socket queue's drain rate is too slow, then you can increase the depth of the application's socket queue. To do so, increase the size of receive buffers used by sockets by configuring either of the following values:
rmem_default
A kernel parameter that controls the default size of receive buffers used by sockets. To configure this, run the following command:
sysctl -w net.core.rmem_default=N
Replace N with the desired buffer size, in bytes. To determine the value for this kernel parameter, view /proc/sys/net/core/rmem_default. Bear in mind that the value of rmem_default should be no greater than rmem_max (/proc/sys/net/core/rmem_max); if need be, increase the value of rmem_max.
SO_RCVBUF
A socket option that controls the maximum size of a socket's receive buffer, in bytes. For more information on SO_RCVBUF, refer to the man page: man 7 socket.
To configure SO_RCVBUF, use the setsockopt system call. You can retrieve the current SO_RCVBUF value with the getsockopt call. For more information about both calls, refer to the setsockopt man page: man setsockopt.
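As an illustrative sketch covering both values: read the current rmem defaults from /proc (where available), then set and read back SO_RCVBUF on a socket. Note that the kernel clamps the request to rmem_max and, on Linux, doubles it to account for bookkeeping overhead:

```python
import os
import socket

def read_sysctl(name):
    """Read an integer sysctl value via /proc/sys, or None if absent."""
    path = "/proc/sys/" + name.replace(".", "/")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return int(f.read().strip())

# rmem_default should stay at or below rmem_max; raise rmem_max first.
rmem_default = read_sysctl("net.core.rmem_default")
rmem_max = read_sysctl("net.core.rmem_max")

# Per-socket: request a 256 KB receive buffer and read back the result.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 262144)
actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
s.close()
```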

8.3. Overview of Packet Reception

To better analyze network bottlenecks and performance issues, you need to understand how packet reception works. Packet reception is important in network performance tuning because the receive path is where frames are often lost. Lost frames in the receive path can cause a significant penalty to network performance.
Network receive path diagram

Figure 8.1. Network receive path diagram


The Linux kernel receives each frame and subjects it to a four-step process:
  1. Hardware Reception: the network interface card (NIC) receives the frame on the wire. Depending on its driver configuration, the NIC transfers the frame either to an internal hardware buffer memory or to a specified ring buffer.
  2. Hard IRQ: the NIC asserts the presence of a new frame by interrupting the CPU. This causes the NIC driver to acknowledge the interrupt and schedule the soft IRQ operation.
  3. Soft IRQ: this stage implements the actual frame-receiving process, and is run in softirq context. This means that the stage pre-empts all applications running on the specified CPU, but still allows hard IRQs to be asserted.
    In this context (running on the same CPU as hard IRQ, thereby minimizing locking overhead), the kernel actually removes the frame from the NIC hardware buffers and processes it through the network stack. From there, the frame is either forwarded, discarded, or passed to a target listening socket.
    When passed to a socket, the frame is appended to the receive queue of the socket that owns it. This process repeats until the NIC hardware buffer runs out of frames, or until the device weight (dev_weight) is reached. For more information about device weight, refer to Section 8.4.1, "NIC Hardware Buffer".
  4. Application receive: the application receives the frame and dequeues it from any owned sockets via the standard POSIX calls (read, recv, recvfrom). At this point, data received over the network no longer exists on the network stack.
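Step 4 can be sketched with a connected socket pair; the reader drains its socket's receive queue with a standard recv call (a self-contained illustration, not tied to any particular application):

```python
import socket

# A connected pair stands in for a real network socket; the kernel queues
# the sent bytes on the receiving socket until the application calls recv.
reader, writer = socket.socketpair()
writer.sendall(b"frame-payload")
data = reader.recv(4096)   # dequeue from the socket's receive queue
writer.close()
reader.close()
```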

CPU/cache affinity

To maintain high throughput on the receive path, it is recommended that you keep the L2 cache hot. As described earlier, network buffers are received on the same CPU as the IRQ that signaled their presence. This means that buffer data will be on the L2 cache of that receiving CPU.
To take advantage of this, bind the applications expected to receive the most data to the core that shares an L2 cache with the NIC's receive interrupt. This maximizes the chances of a cache hit, and thereby improves performance.
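On Linux, this pinning can also be done from inside the application with sched_setaffinity. In the sketch below, CPU 0 is a purely illustrative choice; in practice you would pick the core servicing the NIC's receive interrupt:

```python
import os

def pin_to_cpus(cpus):
    """Pin the calling process to the given CPUs (Linux only).
    Returns the resulting affinity set, or None where unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        os.sched_setaffinity(0, cpus)
    except OSError:
        return None        # e.g. CPU not available in this cpuset
    return os.sched_getaffinity(0)

# CPU 0 is an illustrative choice; in practice pick the core that shares
# an L2 cache with the NIC's receive IRQ (see /proc/interrupts).
affinity = pin_to_cpus({0})
```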

8.4. Resolving Common Queuing/Frame Loss Issues

By far, the most common reason for frame loss is a queue overrun. The kernel sets a limit to the length of a queue, and in some cases the queue fills faster than it drains. When this occurs for too long, frames start to get dropped.
As illustrated in Figure 8.1, "Network receive path diagram", there are two major queues in the receive path: the NIC hardware buffer and the socket queue. Both queues need to be configured accordingly to protect against queue overruns.

8.4.1. NIC Hardware Buffer

The NIC fills its hardware buffer with frames; the buffer is then drained by the softirq, which the NIC asserts via an interrupt. To interrogate the status of this queue, use the following command:
ethtool -S ethX
Replace ethX with the NIC's corresponding device name. This will display how many frames have been dropped within ethX. Often, a drop occurs because the queue runs out of buffer space in which to store frames.
There are different ways to address this problem, namely:
Input traffic
You can help prevent queue overruns by slowing down input traffic. This can be achieved by filtering, reducing the number of joined multicast groups, lowering broadcast traffic, and the like.
Queue length
Alternatively, you can also increase the queue length. This involves increasing the number of buffers in a specified queue to whatever maximum the driver will allow. To do so, edit the rx/tx ring parameters of ethX using:
ethtool --set-ring ethX
Append the appropriate rx or tx values to the aforementioned command. For more information, refer to man ethtool.
Device weight
You can also increase the rate at which a queue is drained. To do this, adjust the NIC's device weight accordingly. This attribute refers to the maximum number of frames that the NIC can receive before the softirq context has to yield the CPU and reschedule itself. It is controlled by the /proc/sys/net/core/dev_weight variable.
Most administrators have a tendency to choose the third option. However, keep in mind that there are consequences for doing so. Increasing the number of frames that can be received from a NIC in one iteration implies extra CPU cycles, during which no applications can be scheduled on that CPU.
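To watch for hardware-buffer overruns over time, you can extract and diff the drop counters from ethtool -S output. The parsing sketch below runs against an illustrative sample; real counter names vary by NIC driver:

```python
# Sketch: pull drop-related counters out of `ethtool -S ethX` output.
# The sample text is illustrative; real counter names vary by NIC driver.

SAMPLE = """\
NIC statistics:
     rx_packets: 5188341
     rx_dropped: 127
     rx_fifo_errors: 12
     tx_packets: 3174062
"""

def drop_counters(ethtool_output):
    stats = {}
    for line in ethtool_output.splitlines():
        if ":" in line and not line.strip().endswith(":"):
            name, _, value = line.strip().partition(":")
            stats[name.strip()] = int(value)
    return {k: v for k, v in stats.items() if "drop" in k or "fifo" in k}

assert drop_counters(SAMPLE) == {"rx_dropped": 127, "rx_fifo_errors": 12}
```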

8.4.2. Socket Queue

Like the NIC hardware queue, the socket queue is filled by the network stack from the softirq context. Applications then drain the queues of their corresponding sockets via calls to read, recvfrom, and the like.
To monitor the status of this queue, use the netstat utility; the Recv-Q column displays the queue size. Generally speaking, overruns in the socket queue are managed in the same way as NIC hardware buffer overruns (i.e. Section 8.4.1, "NIC Hardware Buffer"):
Input traffic
The first option is to slow down input traffic by configuring the rate at which the queue fills. To do so, either filter frames or pre-emptively drop them. You can also slow down input traffic by lowering the NIC's device weight[6].
Queue depth
You can also avoid socket queue overruns by increasing the queue depth. To do so, increase the value of either the rmem_default kernel parameter or the SO_RCVBUF socket option. For more information on both, refer to Section 8.2, "Optimized Network Settings".
Application call frequency
Whenever possible, optimize the application to perform calls more frequently. This involves modifying or reconfiguring the network application to perform more frequent POSIX calls (such as recv, read). In turn, this allows an application to drain the queue faster.
For many administrators, increasing the queue depth is the preferable solution. This is the easiest solution, but it may not always work long-term. As networking technologies get faster, socket queues will continue to fill more quickly. Over time, this means having to re-adjust the queue depth accordingly.
The best solution is to enhance or configure the application to drain data from the kernel more quickly, even if it means queuing the data in application space. This lets the data be stored more flexibly, since it can be swapped out and paged back in as needed.
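One way to sketch this "drain fast, buffer in application space" pattern is a dedicated reader thread that moves bytes from the socket into an in-process queue as quickly as possible, leaving slower consumers to work from that queue (a self-contained illustration using a socket pair):

```python
import queue
import socket
import threading

# Sketch: a dedicated thread drains the kernel socket queue immediately,
# parking data in an application-space queue for slower consumers.
reader, writer = socket.socketpair()
backlog = queue.Queue()

def drain(sock, q):
    while True:
        data = sock.recv(4096)
        if not data:
            break           # peer closed the connection
        q.put(data)

t = threading.Thread(target=drain, args=(reader, backlog))
t.start()
writer.sendall(b"burst-1")
writer.close()              # signals EOF to the drain thread
t.join()
reader.close()
```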

8.5. Multicast Considerations

When multiple applications listen to a multicast group, the kernel code that handles multicast frames is required by design to duplicate network data for each individual socket. This duplication is time-consuming and occurs in the softirq context.
Adding multiple listeners on a single multicast group therefore has a direct impact on the softirq context's execution time. Adding a listener to a multicast group implies that the kernel must create an additional copy for each frame received for that group.
The effect of this is minimal at low traffic volume and small listener numbers. However, when multiple sockets listen to a high-traffic multicast group, the increased execution time of the softirq context can lead to frame drops at both the network card and the socket queue. Increased softirq runtimes translate to reduced opportunity for applications to run on heavily-loaded systems, so the rate at which multicast frames are lost increases as the number of applications listening to a high-volume multicast group increases.
Resolve this frame loss by optimizing your socket queues and NIC hardware buffers, as described in Section 8.4.2, "Socket Queue" or Section 8.4.1, "NIC Hardware Buffer". Alternatively, you can optimize an application's socket use; to do so, configure the application to control a single socket and disseminate the received network data quickly to other user-space processes.
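The single-socket approach above can be sketched as a user-space fan-out: one process owns the socket, and the per-listener copies are made in application space rather than in the softirq context (the socket pair below stands in for a real multicast socket):

```python
import queue
import socket

# Sketch: one process owns the (multicast) socket and fans received data
# out to several consumers in application space.
reader, writer = socket.socketpair()   # stands in for a multicast socket
consumers = [queue.Queue() for _ in range(3)]

writer.sendall(b"multicast-frame")
data = reader.recv(4096)
for q in consumers:                    # one copy per consumer, made in
    q.put(data)                        # user space instead of softirq
writer.close()
reader.close()
```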


[4] Ensuring cache affinity between a CPU and a NIC means configuring them to share the same L2 cache. For more information, refer to Section 8.3, "Overview of Packet Reception".
[5] Section 8.3, "Overview of Packet Reception" contains an overview of packet travel, which should help you locate and map bottleneck-prone areas in the network stack.
[6] Device weight is controlled via /proc/sys/net/core/dev_weight. For more information about device weight and the implications of adjusting it, refer to Section 8.4.1, "NIC Hardware Buffer".

Revision History

Revision 4.0-22, Fri Feb 15 2013, Laura Bailey
Publishing for Red Hat Enterprise Linux 6.4.
Revision 4.0-19, Wed Jan 16 2013, Laura Bailey
Minor corrections for consistency (BZ#868404).
Revision 4.0-18, Tue Nov 27 2012, Laura Bailey
Publishing for Red Hat Enterprise Linux 6.4 Beta.
Revision 4.0-17, Mon Nov 19 2012, Laura Bailey
Added SME feedback re. numad section (BZ#868404).
Revision 4.0-16, Thu Nov 08 2012, Laura Bailey
Added draft section on numad (BZ#868404).
Revision 4.0-15, Wed Oct 17 2012, Laura Bailey
Applying SME feedback to block discard discussion and moved section to under Mount Options (BZ#852990).
Updated performance profile descriptions (BZ#858220).
Revision 4.0-13, Wed Oct 17 2012, Laura Bailey
Updated performance profile descriptions (BZ#858220).
Revision 4.0-12, Tue Oct 16 2012, Laura Bailey
Improved book navigation (BZ#854082).
Corrected the definition of file-max (BZ#854094).
Corrected the definition of threads-max (BZ#856861).
Revision 4.0-9, Tue Oct 9 2012, Laura Bailey
Added FSTRIM recommendation to the File Systems chapter (BZ#852990).
Updated description of the threads-max parameter according to customer feedback (BZ#856861).
Updated note about GFS2 fragmentation management improvements (BZ#857782).
Revision 4.0-6, Thu Oct 4 2012, Laura Bailey
Added new section on numastat utility (BZ#853274).
Revision 4.0-3, Tue Sep 18 2012, Laura Bailey
Added note re. new perf capabilities (BZ#854082).
Corrected the description of the file-max parameter (BZ#854094).
Revision 4.0-2, Mon Sep 10 2012, Laura Bailey
Added BTRFS section and basic introduction to the file system (BZ#852978).
Noted Valgrind integration with GDB (BZ#853279).
Revision 3.0-15, Thursday March 22 2012, Laura Bailey
Added and updated descriptions of tuned-adm profiles (BZ#803552).
Revision 3.0-10, Friday March 02 2012, Laura Bailey
Updated the threads-max and file-max parameter descriptions (BZ#752825).
Updated slice_idle parameter default value (BZ#785054).
Revision 3.0-8, Thursday February 02 2012, Laura Bailey
Restructured and added details about taskset and binding CPU and memory allocation with numactl to Section 4.1.2, "Tuning CPU Performance" (BZ#639784).
Corrected use of internal links (BZ#786099).
Revision 3.0-5, Tuesday January 17 2012, Laura Bailey
Minor corrections to Section 5.3, "Using Valgrind to Profile Memory Usage" (BZ#639793).
Revision 3.0-3, Wednesday January 11 2012, Laura Bailey
Ensured consistency among internal and external hyperlinks (BZ#752796).
Added Section 5.3, "Using Valgrind to Profile Memory Usage" (BZ#639793).
Added Section 4.1.2, "Tuning CPU Performance" and restructured Chapter 4, CPU (BZ#639784).
Revision 1.0-0, Friday December 02 2011, Laura Bailey
Release for GA of Red Hat Enterprise Linux 6.2.