Performance Tuning Guide

Read this chapter for an overview of the file systems supported for use with Red Hat Enterprise Linux, and how to optimize their performance.

7.1. Tuning Considerations for File Systems

There are several tuning considerations common to all file systems: the formatting and mount options selected for your system, and the actions available to applications that may improve their performance on a given system.

7.1.1. Formatting Options

If you expect to create many files smaller than the default block size, you can set a smaller block size to minimize the amount of space wasted on disk. Note, however, that setting a smaller block size may limit the maximum size of the file system, and can cause additional runtime overhead, particularly for files larger than the selected block size. Refer to the Storage Administration Guide for further information about creating and maintaining these file systems.

Ensure that your external journal is reliable: the loss of an external journal device will cause file system corruption. External journals are created at mkfs time, with journal devices being specified at mount time. Refer to the mke2fs(8), mkfs.xfs(8), and mount(8) man pages for further information.

Write barriers ensure that file system metadata is correctly written and ordered on persistent storage, even when a storage device with a volatile write cache loses power. However, enabling write barriers slows some applications significantly; specifically, applications that use fsync() heavily, or that create and delete many small files. For storage with no volatile write cache, or in the rare case where file system inconsistencies and data loss after a power loss are acceptable, barriers can be disabled with the nobarrier mount option. For further information, refer to the Storage Administration Guide.

If file access times do not need to be accurate, mounting with the noatime option reduces metadata writes by disabling access time updates. Enabling the noatime option also enables nodiratime behavior; there is no need to set both noatime and nodiratime.

The tuned tool and the use of LVM striping elevate the read-ahead value, but this is not always sufficient for some workloads.
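As a rough illustration of the block-size trade-off described above, the shell arithmetic below computes the space wasted when a small file occupies whole blocks at two different block sizes. The mkfs.ext4 invocation is shown commented out as a sketch only; /dev/sdb1 is a placeholder device name, not taken from this guide.

```shell
#!/bin/sh
# Space wasted when a 200-byte file must occupy whole blocks:
FILE_SIZE=200
for BLOCK in 4096 1024; do
    # a file always consumes whole blocks, so round up
    BLOCKS=$(( (FILE_SIZE + BLOCK - 1) / BLOCK ))
    WASTE=$(( BLOCKS * BLOCK - FILE_SIZE ))
    echo "block size ${BLOCK}: ${WASTE} bytes wasted"
done

# Hypothetical format command selecting a 1 KB block size (placeholder device):
# mkfs.ext4 -b 1024 /dev/sdb1
```

With many millions of small files, the difference between 3896 and 824 wasted bytes per file adds up quickly, which is the motivation for choosing a smaller block size in that workload.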
Additionally, Red Hat Enterprise Linux is not always able to set an appropriate read-ahead value based on what it can detect of your file system. For example, if a powerful storage array presents itself to Red Hat Enterprise Linux as a single LUN, the operating system treats it as a single LUN, and therefore does not by default make full use of the read-ahead advantages potentially available to the storage.

Use the blockdev command to view and edit the read-ahead value. To view the current read-ahead value for a particular block device, run:

# blockdev --getra device

To modify the read-ahead value for that block device, run the following command, where N is the number of 512-byte sectors:

# blockdev --setra N device

Note that the value selected with the blockdev command does not persist between boots. We recommend creating an init script that sets this value during boot.

7.1.3. File system maintenance

Batch discard operations are run explicitly by the user with the fstrim command. This command discards all unused blocks in a file system that match the user's criteria. Both operation types are supported for use with the XFS and ext4 file systems in Red Hat Enterprise Linux 6.2 and later, as long as the block device underlying the file system supports physical discard operations. Physical discard operations are supported if the value of /sys/block/device/queue/discard_max_bytes is not zero. Online discard operations are specified at mount time with the -o discard option (either in /etc/fstab or as part of the mount command), and run in real time without user intervention. Online discard operations only discard blocks that are transitioning from used to free. Online discard operations are supported on ext4 file systems in Red Hat Enterprise Linux 6.2 and later, and on XFS file systems in Red Hat Enterprise Linux 6.4 and later.
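The read-ahead value is reported and set in 512-byte sectors, so converting it to kibibytes is a common sanity check; the sketch below does that conversion in shell arithmetic. The blockdev calls themselves are commented out because they require a real block device; /dev/sda is a placeholder.

```shell
#!/bin/sh
# Query and set read-ahead on a real device (placeholder shown; not persistent):
# blockdev --getra /dev/sda        # prints the current value in 512-byte sectors
# blockdev --setra 4096 /dev/sda   # set read-ahead to 4096 sectors

# Convert a read-ahead value from sectors to KiB:
RA_SECTORS=4096
RA_KIB=$(( RA_SECTORS * 512 / 1024 ))
echo "read-ahead: ${RA_SECTORS} sectors = ${RA_KIB} KiB"
```

A setting of 4096 sectors therefore corresponds to 2 MiB of read-ahead, which is the kind of value that can benefit sequential workloads on large striped volumes.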
Red Hat recommends batch discard operations unless the system's workload is such that batch discard is not feasible, or online discard operations are necessary to maintain performance.

7.1.4. Application Considerations

7.2. Profiles for file system performance

The tuned-adm tool allows users to easily swap between a number of profiles that have been designed to enhance performance for specific use cases. The profiles that are particularly useful in improving storage performance are:

latency-performance
A server profile for typical latency performance tuning. It disables tuned and ktune power-saving mechanisms. The cpuspeed mode changes to performance. The I/O elevator is changed to deadline for each device. The cpu_dma_latency parameter is registered with a value of 0 (the lowest possible latency) for power management quality-of-service to limit latency where possible.

throughput-performance
A server profile for typical throughput performance tuning. This profile is recommended if the system does not have enterprise-class storage. It is the same as latency-performance, except:

kernel.sched_min_granularity_ns (scheduler minimal preemption granularity) is set to 10 milliseconds,
kernel.sched_wakeup_granularity_ns (scheduler wake-up granularity) is set to 15 milliseconds,
vm.dirty_ratio (virtual memory dirty ratio) is set to 40%, and
transparent huge pages are enabled.
enterprise-storage This profile is recommended for enterprise-sized server configurations with enterprise-class storage, including battery-backed controller cache protection and management of on-disk cache. It is the same as the throughput-performance profile, except:
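Switching between the profiles described above is a single command. The fragment below is a sketch only; it assumes the tuned and tuned-adm packages are installed and the tuned service is running, and it requires root to change the active profile.

```shell
# List available profiles and show the currently active one:
tuned-adm list
tuned-adm active

# Activate the profile suited to enterprise-class storage:
tuned-adm profile enterprise-storage

# Disable all tuning and revert to stock settings:
tuned-adm off
```
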
7.3.1. The Ext4 File System

The ext4 file system is a scalable extension of the ext3 file system, the default file system in Red Hat Enterprise Linux 5. Ext4 is now the default file system for Red Hat Enterprise Linux 6, and is supported for a maximum file system size of 16 TB and a maximum single file size of 16 TB. It also removes the 32000 sub-directory limit present in ext3. The ext4 file system defaults are optimal for most workloads, but if performance analysis shows that file system behavior is impacting performance, several tuning options are available. For other mkfs and tuning options, please see the mkfs.ext4(8) and mount(8) man pages, as well as the Documentation/filesystems/ext4.txt file in the kernel-doc package.

7.3.2. The XFS File System

XFS is a robust and highly scalable single-host 64-bit journaling file system. It is entirely extent-based, so it supports very large file and file system sizes. The number of files an XFS system can hold is limited only by the space available in the file system. XFS supports metadata journaling, which facilitates quicker crash recovery. The XFS file system can also be defragmented and enlarged while mounted and active. In addition, Red Hat Enterprise Linux 6 supports backup and restore utilities specific to XFS.

XFS uses extent-based allocation, and features a number of allocation schemes such as delayed allocation and explicit pre-allocation. Extent-based allocation provides a more compact and efficient method of tracking used space in a file system, and improves large file performance by reducing fragmentation and the space consumed by metadata. Delayed allocation improves the chance that a file will be written in a contiguous group of blocks, reducing fragmentation and improving performance. Pre-allocation can be used to prevent fragmentation entirely in cases where the application knows the amount of data it needs to write ahead of time.
XFS provides excellent I/O scalability by using b-trees to index all user data and metadata. All operations on indexes inherit the logarithmic scalability characteristics of the underlying b-trees as object counts grow. Some of the tuning options XFS provides at mkfs time vary the width of the b-trees, which changes the scalability characteristics of different subsystems.

7.3.2.1. Basic tuning for XFS

In general, the default XFS format and mount options are optimal for most workloads; Red Hat recommends that the default values be used unless specific configuration changes are expected to benefit the workload of the file system. If software RAID is in use, the mkfs.xfs command automatically configures itself with the correct stripe unit and width to align with the hardware. This may need to be configured manually if hardware RAID is in use.

The inode64 mount option is highly recommended for multi-terabyte file systems, except where the file system is exported via NFS and legacy 32-bit NFS clients require access to the file system.

The logbsize mount option is recommended for file systems that are modified frequently, or in bursts. The default value is MAX(32 KB, log stripe unit), and the maximum size is 256 KB. A value of 256 KB is recommended for file systems that undergo heavy modifications.

7.3.2.2. Advanced tuning for XFS

Before changing XFS parameters, you need to understand why the default XFS parameters are causing performance problems. This involves understanding what your application is doing, and how the file system is reacting to those operations. Observable performance problems that can be corrected or reduced by tuning are generally caused by file fragmentation or resource contention in the file system. There are different ways to address these problems, and in some cases fixing the problem will require that the application, rather than the file system configuration, be modified.
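For hardware RAID, the stripe unit (su) and stripe width (sw) must be supplied by hand. The geometry below is hypothetical (a 64 KB chunk across 8 data disks), and the mkfs.xfs and mount commands are commented out because they require a real device; /dev/sdc and /mnt/data are placeholders.

```shell
#!/bin/sh
# Hypothetical hardware RAID geometry: 64 KB chunk size, 8 data disks.
SU_KB=64
SW=8
# The full stripe is the stripe unit multiplied by the number of data disks.
FULL_STRIPE_KB=$(( SU_KB * SW ))
echo "full stripe width: ${FULL_STRIPE_KB} KB"

# Format and mount sketch aligned to that geometry (placeholder device):
# mkfs.xfs -d su=${SU_KB}k,sw=${SW} /dev/sdc
# mount -o inode64,logbsize=256k /dev/sdc /mnt/data
```

Aligning the file system to the 512 KB full stripe keeps allocation boundaries on RAID stripe boundaries, avoiding read-modify-write cycles on large writes.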
If you have not been through this process previously, it is recommended that you engage your local Red Hat support engineer for advice.

Because the directory structure is b-tree based, changing the block size affects the amount of directory information that can be retrieved or modified per physical I/O. The larger the directory becomes, the more I/O each operation requires at a given block size. However, when larger directory block sizes are in use, more CPU is consumed by each modification operation compared to the same operation on a file system with a smaller directory block size. This means that for small directory sizes, large directory block sizes will result in lower modification performance. When the directory reaches a size where I/O is the performance-limiting factor, large block size directories perform better.

The default configuration of a 4 KB file system block size and a 4 KB directory block size is best for directories with up to 1-2 million entries with a name length of 20-40 bytes per entry. If your file system requires more entries, larger directory block sizes tend to perform better: a 16 KB block size is best for file systems with 1-10 million directory entries, and a 64 KB block size is best for file systems with over 10 million directory entries. If the workload uses random directory lookups more than modifications (that is, directory reads are much more common or important than directory writes), then the above thresholds for increasing the block size are approximately one order of magnitude lower.

The number of allocation groups becomes important when using machines with a high CPU count and multi-threaded applications that attempt to perform operations concurrently. If only four allocation groups exist, then sustained, parallel metadata operations will only scale as far as those four CPUs (the concurrency limit provided by the system).
For small file systems, ensure that the number of allocation groups is supported by the concurrency provided by the system. For large file systems (tens of terabytes and larger) the default formatting options generally create sufficient allocation groups to avoid limiting concurrency. Applications must be aware of single points of contention in order to use the parallelism inherent in the structure of the XFS file system. It is not possible to modify a directory concurrently, so applications that create and remove large numbers of files should avoid storing all files in a single directory. Each directory created is placed in a different allocation group, so techniques such as hashing files over multiple sub-directories provide a more scalable storage pattern compared to using a single large directory. For the default inode size of 256 bytes, roughly 100 bytes of attribute space is available depending on the number of data extent pointers also stored in the inode. The default inode size is really only useful for storing a small number of small attributes. Increasing the inode size at mkfs time can increase the amount of space available for storing attributes in-line. A 512 byte inode size increases the space available for attributes to roughly 350 bytes; a 2 KB inode has roughly 1900 bytes of space available. There is, however, a limit on the size of the individual attributes that can be stored in-line - there is a maximum size limit of 254 bytes for both the attribute name and the value (that is, an attribute with a name length of 254 bytes and a value length of 254 bytes will stay in-line). Exceeding these size limits forces the attributes out of line, even if there would have been enough space to store all the attributes in the inode. 
A small log device will result in very frequent metadata writeback - the log will constantly be pushing on its tail to free up space, and so frequently modified metadata will be frequently written to disk, causing operations to be slow. Increasing the log size increases the time period between tail pushing events. This allows better aggregation of dirty metadata, resulting in better metadata writeback patterns, and less writeback of frequently modified metadata. The trade-off is that larger logs require more memory to track all outstanding changes in memory.

If you have a machine with limited memory, then large logs are not beneficial because memory constraints will cause metadata writeback long before the benefits of a large log can be realised. In these cases, smaller rather than larger logs will often provide better performance because metadata writeback from the log running out of space is more efficient than writeback driven by memory reclamation.

You should always try to align the log to the underlying stripe unit of the device that contains the file system. mkfs does this by default for MD and DM devices, but for hardware RAID it may need to be specified. Setting this correctly avoids all possibility of log I/O causing unaligned I/O and subsequent read-modify-write operations when writing modifications to disk.

Log operation can be further improved by editing mount options. Increasing the size of the in-memory log buffers (logbsize) increases the speed at which changes can be written to the log. The default log buffer size is MAX(32 KB, log stripe unit), and the maximum size is 256 KB. In general, a larger value results in faster performance. However, under fsync-heavy workloads, small log buffers can be noticeably faster than large buffers with a large stripe unit alignment.

The delaylog mount option also improves sustained metadata modification performance by reducing the number of changes to the log.
It achieves this by aggregating individual changes in memory before writing them to the log: frequently modified metadata is written to the log periodically instead of on every modification. This option increases the memory used to track dirty metadata and increases the number of operations potentially lost when a crash occurs, but can improve metadata modification speed and scalability by an order of magnitude or more. Use of this option does not reduce data or metadata integrity when fsync, fdatasync or sync are used to ensure data and metadata is written to disk.

7.4. Clustering

Clustered storage provides a consistent file system image across all servers in a cluster, allowing servers to read and write to a single, shared file system. This simplifies storage administration by limiting tasks like installing and patching applications to one file system. A cluster-wide file system also eliminates the need for redundant copies of application data, simplifying backup and disaster recovery. Red Hat's High Availability Add-On provides clustered storage in conjunction with Red Hat Global File System 2 (part of the Resilient Storage Add-On).

7.4.1. Global File System 2

Global File System 2 (GFS2) is a native file system that interfaces directly with the Linux kernel file system interface. It allows multiple computers (nodes) to simultaneously share the same storage device in a cluster. The GFS2 file system is largely self-tuning, but manual tuning is possible. This section outlines performance considerations when attempting to tune performance manually.

Red Hat Enterprise Linux 6.4 introduces improvements to file fragmentation management in GFS2. Files created by Red Hat Enterprise Linux 6.3 or earlier were prone to file fragmentation if multiple files were written at the same time by more than one process. This fragmentation made things run slowly, especially in workloads involving large files.
With Red Hat Enterprise Linux 6.4, simultaneous writes result in less file fragmentation and therefore better performance for these workloads. While there is no defragmentation tool for GFS2 on Red Hat Enterprise Linux, you can defragment individual files by identifying them with the filefrag tool, copying them to temporary files, and renaming the temporary files to replace the originals. (This procedure can also be done in versions prior to 6.4 as long as the writing is done sequentially.) Since GFS2 uses a global locking mechanism that potentially requires communication between nodes of a cluster, the best performance will be achieved when your system is designed to avoid file and directory contention between these nodes. Some methods of avoiding contention are to: Pre-allocate files and directories with fallocate where possible, to optimize the allocation process and avoid the need to lock source pages. Minimize the areas of the file system that are shared between multiple nodes to minimize cross-node cache invalidation and improve performance. For example, if multiple nodes mount the same file system, but access different sub-directories, you will likely achieve better performance by moving one subdirectory to a separate file system. Select an optimal resource group size and number. This depends on typical file sizes and available free space on the system, and affects the likelihood that multiple nodes will attempt to use a resource group simultaneously. Too many resource groups can slow block allocation while allocation space is located, while too few resource groups can cause lock contention during deallocation. It is generally best to test multiple configurations to determine which is best for your workload.
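The copy-and-rename defragmentation procedure described above can be sketched as follows. The file name is a placeholder, and the filefrag step is shown commented out since it only inspects the extent count (it requires the filefrag tool from e2fsprogs).

```shell
#!/bin/sh
# Defragment a single file by rewriting it sequentially.
FILE=./example.dat                       # placeholder path
printf 'example payload\n' > "$FILE"     # stand-in for an existing fragmented file

# Optionally inspect fragmentation first:
# filefrag -v "$FILE"                    # shows the file's extent layout

# Copy to a temporary file (written out sequentially), then rename it
# over the original so the contiguous copy replaces the fragmented one.
cp "$FILE" "$FILE.tmp"
mv "$FILE.tmp" "$FILE"
CONTENT=$(cat "$FILE")
echo "rewrote $FILE"
```

Note that the rename step replaces the file atomically, but any hard links or open file descriptors still refer to the old, fragmented copy, so this is only safe for files that are not in use.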
However, contention is not the only issue that can affect GFS2 file system performance. Other best practices to improve overall performance are to: Select your storage hardware according to the expected I/O patterns from cluster nodes and the performance requirements of the file system. Use solid-state storage where possible to lower seek time. Create an appropriately-sized file system for your workload, and ensure that the file system is never at more than 80% capacity. Smaller file systems will have proportionally shorter backup times, and require less time and memory for file system checks, but are subject to high fragmentation if they are too small for their workload. Set larger journal sizes for metadata-intensive workloads, or when journaled data is in use. Although this uses more memory, it improves performance because more journaling space is available to store data before a write is necessary. Ensure that clocks on GFS2 nodes are synchronized to avoid issues with networked applications. We recommend using NTP (Network Time Protocol). Unless file or directory access times are critical to the operation of your application, mount the file system with the noatime and nodiratime mount options. Red Hat strongly recommends the use of the noatime option with GFS2. If you need to use quotas, try to reduce the frequency of quota synchronization transactions or use fuzzy quota synchronization to prevent performance issues arising from constant quota file updates. Fuzzy quota accounting can allow users and groups to slightly exceed their quota limit. To minimize this issue, GFS2 dynamically reduces the synchronization period as a user or group approaches its quota limit.
Over time, Red Hat Enterprise Linux's network stack has been upgraded with numerous automated optimization features. For most workloads, the auto-configured network settings provide optimized performance. In most cases, networking performance problems are actually caused by a malfunction in hardware or faulty infrastructure. Such causes are beyond the scope of this document; the performance issues and solutions discussed in this chapter are useful in optimizing perfectly functional systems.

Networking is a delicate subsystem, containing different parts with sensitive connections. This is why the open source community and Red Hat invest much work in implementing ways to automatically optimize network performance. As such, given most workloads, you may never even need to reconfigure networking for performance.

8.1. Network Performance Enhancements

Red Hat Enterprise Linux 6.1 provided the following network performance enhancements:

Receive Packet Steering (RPS)
RPS enables a single NIC rx queue to have its receive softirq workload distributed among several CPUs. This helps prevent network traffic from being bottlenecked on a single NIC hardware queue. To enable RPS, specify the target CPUs (as a hexadecimal bitmask) in /sys/class/net/ethX/queues/rx-N/rps_cpus, replacing ethX with the NIC's corresponding device name (for example, eth1, eth2) and rx-N with the specified NIC receive queue. This will allow the specified CPUs in the file to process data from queue rx-N on ethX. When specifying CPUs, consider the queue's cache affinity.

Receive Flow Steering (RFS)
RFS is an extension of RPS, allowing the administrator to configure a hash table that is populated automatically when applications receive data and are interrogated by the network stack. This determines which applications are receiving each piece of network data (based on source:destination network information). Using this information, the network stack can schedule the most optimal CPU to receive each packet. To configure RFS, use the following tunables:

/proc/sys/net/core/rps_sock_flow_entries
This controls the maximum number of sockets/flows that the kernel can steer towards any specified CPU. This is a system-wide, shared limit.

/sys/class/net/ethX/queues/rx-N/rps_flow_cnt
This controls the maximum number of sockets/flows that the kernel can steer for a specified receive queue (rx-N) on a NIC (ethX). Note that the sum of all per-queue values for this tunable on all NICs should be equal to or less than the value of /proc/sys/net/core/rps_sock_flow_entries.
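The rps_cpus file takes a hexadecimal CPU bitmask. The sketch below computes the mask for a hypothetical pair of CPUs (0 and 2); the write into sysfs is commented out because it requires root and a real interface, and eth0 is a placeholder name.

```shell
#!/bin/sh
# Build a hex bitmask selecting CPUs 0 and 2 for RPS.
MASK=0
for CPU in 0 2; do
    MASK=$(( MASK | (1 << CPU) ))   # set the bit for each chosen CPU
done
HEX_MASK=$(printf '%x' "$MASK")
echo "rps_cpus mask: ${HEX_MASK}"

# Apply to receive queue 0 of eth0 (placeholder interface; requires root):
# echo "$HEX_MASK" > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

Bit 0 plus bit 2 gives a mask of 5; choosing CPUs that share a cache with the queue's interrupt handler preserves the cache affinity mentioned above.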
Unlike RPS, RFS allows both the receive queue and the application to share the same CPU when processing packet flows. This can result in improved performance in some cases. However, such improvements are dependent on factors such as cache hierarchy, application load, and the like.

setsockopt support for TCP thin-streams
Thin-stream is a term used to characterize transport protocols wherein applications send data at such a low rate that the protocol's retransmission mechanisms are not fully saturated. Applications that use thin-stream protocols typically transport via reliable protocols like TCP; in most cases, such applications provide very time-sensitive services (for example, stock trading, online gaming, control systems). For time-sensitive services, packet loss can be devastating to service quality. To help prevent this, the setsockopt call has been enhanced to support two extra options: - TCP_THIN_DUPACK
This Boolean enables dynamic triggering of retransmissions after one dupACK for thin streams. - TCP_THIN_LINEAR_TIMEOUTS
This Boolean enables dynamic triggering of linear timeouts for thin streams.
Both options are specifically activated by the application. For more information about these options, refer to file:///usr/share/doc/kernel-doc-version/Documentation/networking/ip-sysctl.txt. For more information about thin-streams, refer to file:///usr/share/doc/kernel-doc-version/Documentation/networking/tcp-thin.txt.

Transparent Proxy (TProxy) support
The kernel can now handle non-locally bound IPv4 TCP and UDP sockets to support transparent proxies. To enable this, you will need to configure iptables accordingly. You will also need to enable and configure policy routing properly. For more information about transparent proxies, refer to file:///usr/share/doc/kernel-doc-version/Documentation/networking/tproxy.txt.

8.2. Optimized Network Settings

Performance tuning is usually done in a pre-emptive fashion. Often, we adjust known variables before running an application or deploying a system. If the adjustment proves to be ineffective, we try adjusting other variables. The logic behind such thinking is that by default, the system is not operating at an optimal level of performance; as such, we think we need to adjust the system accordingly. In some cases, we do so via calculated guesses.

As mentioned earlier, the network stack is mostly self-optimizing. In addition, effectively tuning the network requires a thorough understanding not just of how the network stack works, but also of the specific system's network resource requirements. Incorrect network performance configuration can actually lead to degraded performance.

For example, consider the bufferbloat problem. Increasing buffer queue depths results in TCP connections that have congestion windows larger than the link would otherwise allow (due to deep buffering). However, those connections also have huge RTT values since the frames spend so much time in-queue. This, in turn, actually results in sub-optimal output, as it would become impossible to detect congestion.
When it comes to network performance, it is advisable to keep the default settings unless a particular performance issue becomes apparent. Such issues include frame loss, significantly reduced throughput, and the like. Even then, the best solution is often one that results from meticulous study of the problem, rather than simply tuning settings upward (increasing buffer/queue lengths, reducing interrupt latency, etc). To properly diagnose a network performance problem, use the following tools: - netstat
A command-line utility that prints network connections, routing tables, interface statistics, masquerade connections and multicast memberships. It retrieves information about the networking subsystem from the /proc/net/ file system. These files include: /proc/net/dev (device information)
/proc/net/tcp (TCP socket information)
/proc/net/unix (Unix domain socket information)
For more information about netstat and its referenced files from /proc/net/ , refer to the netstat man page: man netstat . - dropwatch
A monitoring utility that monitors packets dropped by the kernel. For more information, refer to the dropwatch man page: man dropwatch . - ip
A utility for managing and monitoring routes, devices, policy routing, and tunnels. For more information, refer to the ip man page: man ip . - ethtool
A utility for displaying and changing NIC settings. For more information, refer to the ethtool man page: man ethtool . - /proc/net/snmp
A file that displays ASCII data needed for the IP, ICMP, TCP, and UDP management information bases for an SNMP agent. It also displays real-time UDP-lite statistics.
After collecting relevant data on a network performance problem, you should be able to formulate a theory - and, hopefully, a solution. For example, an increase in UDP input errors in /proc/net/snmp indicates that one or more socket receive queues are full when the network stack attempts to queue new frames into an application's socket. This indicates that packets are bottlenecked at at least one socket queue, which means either the socket queue drains packets too slowly, or packet volume is too large for that socket queue. If it is the latter, then check the logs of any network-intensive application for lost data; to resolve this, you would need to optimize or reconfigure the offending application.

Socket receive buffer size

Socket send and receive buffer sizes are dynamically adjusted, so they rarely need to be manually edited. If further analysis, such as the analysis presented in the SystemTap network example, sk_stream_wait_memory.stp, suggests that the socket queue's drain rate is too slow, then you can increase the depth of the application's socket queue. To do so, increase the size of receive buffers used by sockets by configuring either of the following values: - rmem_default
A kernel parameter that controls the default size of receive buffers used by sockets. To configure this, run the following command: sysctl -w net.core.rmem_default=N Replace N with the desired buffer size, in bytes. To determine the value for this kernel parameter, view /proc/sys/net/core/rmem_default . Bear in mind that the value of rmem_default should be no greater than rmem_max (/proc/sys/net/core/rmem_max ); if need be, increase the value of rmem_max . - SO_RCVBUF
A socket option that controls the maximum size of a socket's receive buffer, in bytes. For more information on SO_RCVBUF, refer to the socket(7) man page: man 7 socket. To configure SO_RCVBUF, use the setsockopt() call; you can retrieve the current SO_RCVBUF value with getsockopt(). For more information on using both calls, refer to the setsockopt man page: man setsockopt.
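As a sketch of the diagnostic step described above, the awk one-liner below extracts the UDP InErrors counter from /proc/net/snmp output. A captured sample is embedded so the example is self-contained; on a live system you would read the file directly. The sysctl write is commented out because it requires root, and 262144 bytes is an arbitrary illustrative buffer size, not a recommendation from this guide.

```shell
#!/bin/sh
# Sample of the two UDP lines from /proc/net/snmp (header row, then values).
SAMPLE='Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 105678 12 37 98765'

# InErrors is the third counter after the "Udp:" tag; skip the header row,
# whose fields contain letters rather than numbers.
IN_ERRORS=$(printf '%s\n' "$SAMPLE" | awk '/^Udp:/ && $2 !~ /[A-Za-z]/ { print $4 }')
echo "UDP InErrors: ${IN_ERRORS}"

# On a live system (placeholder commands; the write requires root):
# awk '/^Udp:/ { print }' /proc/net/snmp
# sysctl -w net.core.rmem_default=262144   # illustrative size only
```

Watching this counter before and after a load test shows whether raising rmem_default actually reduced socket queue overruns.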
8.3. Overview of Packet Reception

To better analyze network bottlenecks and performance issues, you need to understand how packet reception works. Packet reception is important in network performance tuning because the receive path is where frames are often lost. Lost frames in the receive path can cause a significant penalty to network performance.

The Linux kernel receives each frame and subjects it to a four-step process:

Hardware Reception: the network interface card (NIC) receives the frame on the wire. Depending on its driver configuration, the NIC transfers the frame either to an internal hardware buffer memory or to a specified ring buffer.

Hard IRQ: the NIC asserts the presence of a new frame by interrupting the CPU. This causes the NIC driver to acknowledge the interrupt and schedule the soft IRQ operation.

Soft IRQ: this stage implements the actual frame-receiving process, and is run in softirq context. This means that the stage pre-empts all applications running on the specified CPU, but still allows hard IRQs to be asserted. In this context (running on the same CPU as the hard IRQ, thereby minimizing locking overhead), the kernel actually removes the frame from the NIC hardware buffers and processes it through the network stack. From there, the frame is either forwarded, discarded, or passed to a target listening socket. When passed to a socket, the frame is appended to the socket's receive queue. This process is done iteratively until the NIC hardware buffer runs out of frames, or until the device weight (dev_weight) is reached. For more information about device weight, refer to Section 8.4.1, "NIC Hardware Buffer".

Application receive: the application receives the frame and dequeues it from any owned sockets via the standard POSIX calls (read, recv, recvfrom). At this point, data received over the network no longer exists on the network stack.
To maintain high throughput on the receive path, it is recommended that you keep the L2 cache hot. As described earlier, network buffers are received on the same CPU as the IRQ that signaled their presence. This means that buffer data will be in the L2 cache of that receiving CPU. To take advantage of this, set the processor affinity of applications expected to receive the most data so that they run on the core that services the NIC's interrupts, and therefore share its L2 cache. This will maximize the chances of a cache hit, and thereby improve performance.

8.4. Resolving Common Queuing/Frame Loss Issues

By far, the most common reason for frame loss is a queue overrun. The kernel sets a limit to the length of a queue, and in some cases the queue fills faster than it drains. When this occurs for too long, frames start to get dropped. As illustrated in Figure 8.1, "Network receive path diagram", there are two major queues in the receive path: the NIC hardware buffer and the socket queue. Both queues need to be configured accordingly to protect against queue overruns.

8.4.1. NIC Hardware Buffer

The NIC fills its hardware buffer with frames; the buffer is then drained by the softirq, which the NIC asserts via an interrupt. To interrogate the status of this queue, use the following command:

ethtool -S ethX

Replace ethX with the NIC's corresponding device name. This will display how many frames have been dropped within ethX. Often, a drop occurs because the queue runs out of buffer space in which to store frames. There are different ways to address this problem, namely: - Input traffic
You can help prevent queue overruns by slowing down input traffic. This can be achieved by filtering, reducing the number of joined multicast groups, lowering broadcast traffic, and the like. - Queue length
Alternatively, you can increase the queue length. This involves increasing the number of buffers in a specified queue to whatever maximum the driver will allow. To do so, edit the rx/tx ring parameters of ethX using: ethtool --set-ring ethX Append the appropriate rx or tx values to the aforementioned command. For more information, refer to man ethtool . - Device weight
You can also increase the rate at which a queue is drained. To do this, adjust the NIC's device weight accordingly. This attribute refers to the maximum number of frames that the NIC can receive before the softirq context has to yield the CPU and reschedule itself. It is controlled by the /proc/sys/net/core/dev_weight variable.
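The ethtool -S statistics mentioned above are plain "name: value" lines, so watching drop counters can be scripted. This is a hedged sketch: the exact statistic names (rx_dropped, rx_fifo_errors, and so on) vary by driver, the sample output is illustrative rather than from a real NIC, and find_drop_counters is our own helper name:

```python
def find_drop_counters(ethtool_output):
    """Extract nonzero statistics whose names suggest receive-path
    drops from `ethtool -S ethX`-style output ("  name: value")."""
    drops = {}
    for line in ethtool_output.splitlines():
        if ":" not in line:
            continue
        name, _, value = line.partition(":")
        name = name.strip()
        try:
            value = int(value.strip())
        except ValueError:
            continue  # header lines such as "NIC statistics:"
        if value and ("drop" in name or "fifo" in name or "missed" in name):
            drops[name] = value
    return drops

# Illustrative sample; in practice capture `ethtool -S ethX` output.
sample = """NIC statistics:
     rx_packets: 8113
     rx_dropped: 12
     rx_fifo_errors: 0
     tx_packets: 4027
"""
print(find_drop_counters(sample))
```

Running such a check periodically and comparing successive values shows whether drops are ongoing or historical.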
Most administrators tend to choose the third option. However, keep in mind that doing so has consequences: increasing the number of frames that can be received from a NIC in one iteration costs extra CPU cycles, during which no applications can be scheduled on that CPU.

8.4.2. Socket Queue

Like the NIC hardware queue, the socket queue is filled by the network stack from the softirq context. Applications then drain the queues of their corresponding sockets via calls to read, recvfrom, and the like. To monitor the status of this queue, use the netstat utility; the Recv-Q column displays the queue size. Generally speaking, overruns in the socket queue are managed in the same way as NIC hardware buffer overruns (see Section 8.4.1, "NIC Hardware Buffer"):
- Input traffic
The first option is to slow down input traffic by configuring the rate at which the queue fills. To do so, either filter frames or pre-emptively drop them. You can also slow down input traffic by lowering the NIC's device weight . - Queue depth
You can also avoid socket queue overruns by increasing the queue depth. To do so, increase the value of either the rmem_default kernel parameter or the SO_RCVBUF socket option. For more information on both, refer to Section 8.2, "Optimized Network Settings". - Application call frequency
Whenever possible, optimize the application to perform calls more frequently. This involves modifying or reconfiguring the network application to perform more frequent POSIX calls (such as recv , read ). In turn, this allows an application to drain the queue faster.
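The queue-depth option above maps directly onto the SO_RCVBUF socket option. The following is a minimal sketch, assuming a Linux host; note that the kernel may adjust the requested value (Linux doubles it to account for bookkeeping overhead and clamps the request to net.core.rmem_max), so the effective size is read back with getsockopt:

```python
import socket

# Create a UDP socket and request a larger receive buffer.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
requested = 262144  # 256 KiB; assumes this is within net.core.rmem_max
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)

# Read back what the kernel actually granted; on Linux this is
# typically double the (possibly clamped) requested value.
effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print("requested:", requested, "effective:", effective)
sock.close()
```

Raising rmem_default changes the starting buffer size for all sockets, while SO_RCVBUF lets a single application opt in to a deeper queue without a system-wide change.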
For many administrators, increasing the queue depth is the preferred solution. It is the easiest solution, but it may not work long-term: as networking technologies get faster, socket queues will continue to fill more quickly, and over time this means re-adjusting the queue depth accordingly. The best solution is to enhance or configure the application to drain data from the kernel more quickly, even if that means queuing the data in application space. This lets the data be stored more flexibly, since it can be swapped out and paged back in as needed.

8.5. Multicast Considerations

When multiple applications listen to a multicast group, the kernel code that handles multicast frames is required by design to duplicate network data for each individual socket. This duplication is time-consuming and occurs in the softirq context. Adding multiple listeners on a single multicast group therefore has a direct impact on the softirq context's execution time. Adding a listener to a multicast group means that the kernel must create an additional copy of each frame received for that group. The effect of this is minimal at low traffic volumes and small listener counts. However, when multiple sockets listen to a high-traffic multicast group, the increased execution time of the softirq context can lead to frame drops at both the network card and the socket queue. Increased softirq runtime translates to reduced opportunity for applications to run on heavily loaded systems, so the rate at which multicast frames are lost increases as the number of applications listening to a high-volume multicast group increases. Resolve this frame loss by optimizing your socket queues and NIC hardware buffers, as described in Section 8.4.2, "Socket Queue" and Section 8.4.1, "NIC Hardware Buffer".
Alternatively, you can optimize an application's socket use; to do so, configure the application to control a single socket and disseminate the received network data quickly to other user-space processes.

Revision History

| Revision | Date | Author | Description |
|---|---|---|---|
| 4.0-22 | Fri Feb 15 2013 | Laura Bailey | Publishing for Red Hat Enterprise Linux 6.4. |
| 4.0-19 | Wed Jan 16 2013 | Laura Bailey | Minor corrections for consistency (BZ#868404). |
| 4.0-18 | Tue Nov 27 2012 | Laura Bailey | Publishing for Red Hat Enterprise Linux 6.4 Beta. |
| 4.0-17 | Mon Nov 19 2012 | Laura Bailey | Added SME feedback re. numad section (BZ#868404). |
| 4.0-16 | Thu Nov 08 2012 | Laura Bailey | |
| 4.0-15 | Wed Oct 17 2012 | Laura Bailey | Applying SME feedback to block discard discussion and moved section to under Mount Options (BZ#852990). Updated performance profile descriptions (BZ#858220). |
| 4.0-13 | Wed Oct 17 2012 | Laura Bailey | Updated performance profile descriptions (BZ#858220). |
| 4.0-12 | Tue Oct 16 2012 | Laura Bailey | Improved book navigation (BZ#854082). Corrected the definition of file-max (BZ#854094). Corrected the definition of threads-max (BZ#856861). |
| 4.0-9 | Tue Oct 9 2012 | Laura Bailey | Added FSTRIM recommendation to the File Systems chapter (BZ#852990). Updated description of the threads-max parameter according to customer feedback (BZ#856861). Updated note about GFS2 fragmentation management improvements (BZ#857782). |
| 4.0-6 | Thu Oct 4 2012 | Laura Bailey | Added new section on numastat utility (BZ#853274). |
| 4.0-3 | Tue Sep 18 2012 | Laura Bailey | Added note re. new perf capabilities (BZ#854082). Corrected the description of the file-max parameter (BZ#854094). |
| 4.0-2 | Mon Sep 10 2012 | Laura Bailey | Added BTRFS section and basic introduction to the file system (BZ#852978). Noted Valgrind integration with GDB (BZ#853279). |
| 3.0-15 | Thursday March 22 2012 | Laura Bailey | Added and updated descriptions of tuned-adm profiles (BZ#803552). |
| 3.0-10 | Friday March 02 2012 | Laura Bailey | Updated the threads-max and file-max parameter descriptions (BZ#752825). Updated slice_idle parameter default value (BZ#785054). |
| 3.0-8 | Thursday February 02 2012 | Laura Bailey | |
| 3.0-5 | Tuesday January 17 2012 | Laura Bailey | |
| 3.0-3 | Wednesday January 11 2012 | Laura Bailey | |
| 1.0-0 | Friday December 02 2011 | Laura Bailey | Release for GA of Red Hat Enterprise Linux 6.2. |