
Resource Management Guide

Chapter 3. Subsystems and Tunable Parameters

Subsystems are kernel modules that are aware of cgroups. Typically, they are resource controllers that allocate varying levels of system resources to different cgroups. However, subsystems could be programmed for any other interaction with the kernel where the need exists to treat different groups of processes differently. The application programming interface (API) to develop new subsystems is documented in cgroups.txt in the kernel documentation, installed on your system at /usr/share/doc/kernel-doc-<kernel_version>/Documentation/cgroups/ (provided by the kernel-doc package). The latest version of the cgroups documentation is also available online at http://www.kernel.org/doc/Documentation/cgroups/cgroups.txt. Note, however, that the features in the latest documentation might not match those available in the kernel installed on your system.
State objects that contain the subsystem parameters for a cgroup are represented as pseudofiles within the cgroup virtual file system. These pseudo-files can be manipulated by shell commands or their equivalent system calls. For example, cpuset.cpus is a pseudo-file that specifies which CPUs a cgroup is permitted to access. If /cgroup/cpuset/webserver is a cgroup for the web server that runs on a system, and the following command is executed:
~]# echo 0,2 > /cgroup/cpuset/webserver/cpuset.cpus
then the value 0,2 is written to the cpuset.cpus pseudofile and therefore limits any tasks whose PIDs are listed in /cgroup/cpuset/webserver/tasks to using only CPU 0 and CPU 2 on the system.

3.1. blkio

The Block I/O (blkio) subsystem controls and monitors access to I/O on block devices by tasks in cgroups. Writing values to some of these pseudofiles limits access or bandwidth, and reading values from some of these pseudofiles provides information on I/O operations.
The blkio subsystem offers two policies for controlling access to I/O:
  • Proportional weight division - implemented in the Completely Fair Queuing I/O scheduler, this policy allows you to set weights to specific cgroups. This means that each cgroup has a set percentage (depending on the weight of the cgroup) of all I/O operations reserved. For more information, refer to Section 3.1.1, "Proportional Weight Division Tunable Parameters"
  • I/O throttling (Upper limit) - this policy is used to set an upper limit for the number of I/O operations performed by a specific device. This means that a device can have a limited rate of read or write operations. For more information, refer to Section 3.1.2, "I/O Throttling Tunable Parameters"

Buffered write operations

Currently, the Block I/O subsystem does not work for buffered write operations. It is primarily targeted at direct I/O, although it works for buffered read operations.

3.1.1. Proportional Weight Division Tunable Parameters

blkio.weight
specifies the relative proportion (weight) of block I/O access available by default to a cgroup, in the range 100 to 1000. This value is overridden for specific devices by the blkio.weight_device parameter. For example, to assign a default weight of 500 to a cgroup for access to block devices, run:
echo 500 > blkio.weight
blkio.weight_device
specifies the relative proportion (weight) of I/O access on specific devices available to a cgroup, in the range 100 to 1000. The value of this parameter overrides the value of the blkio.weight parameter for the devices specified. Values take the format major:minor weight, where major and minor are device types and node numbers specified in Linux Allocated Devices, otherwise known as the Linux Devices List and available from http://www.kernel.org/doc/Documentation/devices.txt. For example, to assign a weight of 500 to a cgroup for access to /dev/sda, run:
echo 8:0 500 > blkio.weight_device
In the Linux Allocated Devices notation, 8:0 represents /dev/sda.

3.1.2. I/O Throttling Tunable Parameters

blkio.throttle.read_bps_device
specifies the upper limit on the number of read operations a device can perform. The rate of the read operations is specified in bytes per second. Entries have three fields: major, minor, and bytes_per_second. Major and minor are device types and node numbers specified in Linux Allocated Devices, and bytes_per_second is the upper limit rate at which read operations can be performed. For example, to allow the /dev/sda device to perform read operations at a maximum of 10 MBps, run:
~]# echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.read_bps_device
blkio.throttle.read_iops_device
specifies the upper limit on the number of read operations a device can perform. The rate of the read operations is specified in operations per second. Entries have three fields: major, minor, and operations_per_second. Major and minor are device types and node numbers specified in Linux Allocated Devices, and operations_per_second is the upper limit rate at which read operations can be performed. For example, to allow the /dev/sda device to perform a maximum of 10 read operations per second, run:
~]# echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.read_iops_device
blkio.throttle.write_bps_device
specifies the upper limit on the number of write operations a device can perform. The rate of the write operations is specified in bytes per second. Entries have three fields: major, minor, and bytes_per_second. Major and minor are device types and node numbers specified in Linux Allocated Devices, and bytes_per_second is the upper limit rate at which write operations can be performed. For example, to allow the /dev/sda device to perform write operations at a maximum of 10 MBps, run:
~]# echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.write_bps_device
blkio.throttle.write_iops_device
specifies the upper limit on the number of write operations a device can perform. The rate of the write operations is specified in operations per second. Entries have three fields: major, minor, and operations_per_second. Major and minor are device types and node numbers specified in Linux Allocated Devices, and operations_per_second is the upper limit rate at which write operations can be performed. For example, to allow the /dev/sda device to perform a maximum of 10 write operations per second, run:
~]# echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.write_iops_device
blkio.throttle.io_serviced
reports the number of I/O operations performed on specific devices by a cgroup as seen by the throttling policy. Entries have four fields: major, minor, operation, and number. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and number represents the number of operations.
blkio.throttle.io_service_bytes
reports the number of bytes transferred to or from specific devices by a cgroup. The only difference between blkio.io_service_bytes and blkio.throttle.io_service_bytes is that the former is not updated when the CFQ scheduler is operating on a request queue. Entries have four fields: major, minor, operation, and bytes. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and bytes is the number of bytes transferred.

3.1.3. blkio Common Tunable Parameters

The following parameters may be used for either of the policies listed in Section 3.1, "blkio".
blkio.reset_stats
resets the statistics recorded in the other pseudofiles. Write an integer to this file to reset the statistics for this cgroup.
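For example, assuming the blkio hierarchy is mounted at /cgroup/blkio and contains a cgroup named test1 (as in Example 3.1, "blkio proportional weight division"), the accumulated statistics for that cgroup could be cleared with:
~]# echo 1 > /cgroup/blkio/test1/blkio.reset_stats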
blkio.time
reports the time that a cgroup had I/O access to specific devices. Entries have three fields: major, minor, and time. Major and minor are device types and node numbers specified in Linux Allocated Devices, and time is the length of time in milliseconds (ms).
blkio.sectors
reports the number of sectors transferred to or from specific devices by a cgroup. Entries have three fields: major, minor, and sectors. Major and minor are device types and node numbers specified in Linux Allocated Devices, and sectors is the number of disk sectors.
blkio.avg_queue_size
reports the average queue size for I/O operations by a cgroup, over the entire length of time of the group's existence. The queue size is sampled every time a queue for this cgroup gets a timeslice. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
blkio.group_wait_time
reports the total time (in nanoseconds - ns) a cgroup spent waiting for a timeslice for one of its queues. The report is updated every time a queue for this cgroup gets a timeslice, so if you read this pseudofile while the cgroup is waiting for a timeslice, the report will not contain time spent waiting for the operation currently queued. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
blkio.empty_time
reports the total time (in nanoseconds - ns) a cgroup spent without any pending requests. The report is updated every time a queue for this cgroup has a pending request, so if you read this pseudofile while the cgroup has no pending requests, the report will not contain time spent in the current empty state. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
blkio.idle_time
reports the total time (in nanoseconds - ns) the scheduler spent idling for a cgroup in anticipation of a better request than those requests already in other queues or from other groups. The report is updated every time the group is no longer idling, so if you read this pseudofile while the cgroup is idling, the report will not contain time spent in the current idling state. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
blkio.dequeue
reports the number of times requests for I/O operations by a cgroup were dequeued by specific devices. Entries have three fields: major, minor, and number. Major and minor are device types and node numbers specified in Linux Allocated Devices, and number is the number of requests the group was dequeued. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
blkio.io_serviced
reports the number of I/O operations performed on specific devices by a cgroup as seen by the CFQ scheduler. Entries have four fields: major, minor, operation, and number. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and number represents the number of operations.
blkio.io_service_bytes
reports the number of bytes transferred to or from specific devices by a cgroup as seen by the CFQ scheduler. Entries have four fields: major, minor, operation, and bytes. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and bytes is the number of bytes transferred.
blkio.io_service_time
reports the total time between request dispatch and request completion for I/O operations on specific devices by a cgroup as seen by the CFQ scheduler. Entries have four fields: major, minor, operation, and time. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and time is the length of time in nanoseconds (ns). The time is reported in nanoseconds rather than a larger unit so that this report is meaningful even for solid-state devices.
blkio.io_wait_time
reports the total time I/O operations on specific devices by a cgroup spent waiting for service in the scheduler queues. When you interpret this report, note:
  • the time reported can be greater than the total time elapsed, because the time reported is the cumulative total of all I/O operations for the cgroup rather than the time that the cgroup itself spent waiting for I/O operations. To find the time that the group as a whole has spent waiting, use the blkio.group_wait_time parameter.
  • if the device has a queue_depth > 1, the time reported only includes the time until the request is dispatched to the device, not any time spent waiting for service while the device re-orders requests.
Entries have four fields: major, minor, operation, and time. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and time is the length of time in nanoseconds (ns). The time is reported in nanoseconds rather than a larger unit so that this report is meaningful even for solid-state devices.
blkio.io_merged
reports the number of BIO (block I/O) requests merged into requests for I/O operations by a cgroup. Entries have two fields: number and operation. Number is the number of requests, and operation represents the type of operation (read, write, sync, or async).
blkio.io_queued
reports the number of requests queued for I/O operations by a cgroup. Entries have two fields: number and operation. Number is the number of requests, and operation represents the type of operation (read, write, sync, or async).

3.1.4. Example Usage

Refer to Example 3.1, "blkio proportional weight division" for a simple test of running two dd threads in two different cgroups with various blkio.weight values.

Example 3.1. blkio proportional weight division

  1. Mount the blkio subsystem:
    ~]# mount -t cgroup -o blkio blkio /cgroup/blkio/
  2. Create two cgroups for the blkio subsystem:
    ~]# mkdir /cgroup/blkio/test1/
    ~]# mkdir /cgroup/blkio/test2/
  3. Set various blkio weights in the previously-created cgroups:
    ~]# echo 1000 > /cgroup/blkio/test1/blkio.weight
    ~]# echo 500 > /cgroup/blkio/test2/blkio.weight
  4. Create two large files:
    ~]# dd if=/dev/zero of=file_1 bs=1M count=4000
    ~]# dd if=/dev/zero of=file_2 bs=1M count=4000
    The above commands create two files (file_1 and file_2) of size 4 GB.
  5. For each of the test cgroups, execute a dd command (which reads the contents of a file and outputs it to the null device) on one of the large files:
    ~]# cgexec -g blkio:test1 time dd if=file_1 of=/dev/null
    ~]# cgexec -g blkio:test2 time dd if=file_2 of=/dev/null
    Both commands will output their completion time once they have finished.
  6. Simultaneously with the two running dd threads, you can monitor the performance in real time by using the iotop utility. To install the iotop utility, execute, as root, the yum install iotop command. The following is an example of the output as seen in the iotop utility while running the previously-started dd threads:
    Total DISK READ: 83.16 M/s | Total DISK WRITE: 0.00 B/s
        TIME   TID  PRIO  USER  DISK READ  DISK WRITE  SWAPIN      IO  COMMAND
    15:18:04 15071  be/4  root  27.64 M/s    0.00 B/s  0.00 %  92.30 %  dd if=file_2 of=/dev/null
    15:18:04 15069  be/4  root  55.52 M/s    0.00 B/s  0.00 %  88.48 %  dd if=file_1 of=/dev/null

In order to get the most accurate result in Example 3.1, "blkio proportional weight division", prior to the execution of the dd commands, flush all file system buffers and free pagecache, dentries and inodes using the following commands:
~]# sync
~]# echo 3 > /proc/sys/vm/drop_caches
Additionally, you can enable group isolation which provides stronger isolation between groups at the expense of throughput. When group isolation is disabled, fairness can be expected only for a sequential workload. By default, group isolation is enabled and fairness can be expected for random I/O workloads as well. To enable group isolation, use the following command:
~]# echo 1 > /sys/block/<disk_device>/queue/iosched/group_isolation
where <disk_device> stands for the name of the desired device, for example sda.

3.2. cpu

The cpu subsystem schedules CPU access to cgroups. Access to CPU resources can be scheduled using two schedulers:
  • Completely Fair Scheduler (CFS) - a proportional share scheduler which divides the CPU time (CPU bandwidth) proportionately between groups of tasks (cgroups) depending on the priority/weight of the task or shares assigned to cgroups. For more information about resource limiting using CFS, refer to Section 3.2.1, "CFS Tunable Parameters".
  • Real-Time scheduler (RT) - a task scheduler that provides a way to specify the amount of CPU time that real-time tasks can use. For more information about resource limiting of real-time tasks, refer to Section 3.2.2, "RT Tunable Parameters".

3.2.1. CFS Tunable Parameters

In CFS, a cgroup can get more than its share of CPU if there are enough idle CPU cycles available in the system, due to the work conserving nature of the scheduler. This is usually the case for cgroups that consume CPU time based on relative shares. Ceiling enforcement can be used for cases when a hard limit on the amount of CPU that a cgroup can utilize is required (that is, tasks cannot use more than a set amount of CPU time).
The following options can be used to configure ceiling enforcement or relative sharing of CPU:

Ceiling Enforcement Tunable Parameters

cpu.cfs_period_us
specifies a period of time in microseconds (µs, represented here as "us") for how regularly a cgroup's access to CPU resources should be reallocated. If tasks in a cgroup should be able to access a single CPU for 0.2 seconds out of every 1 second, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 1000000. The upper limit of the cpu.cfs_period_us parameter is 1 second and the lower limit is 1000 microseconds.
cpu.cfs_quota_us
specifies the total amount of time in microseconds (µs, represented here as "us") for which all tasks in a cgroup can run during one period (as defined by cpu.cfs_period_us). As soon as tasks in a cgroup use up all the time specified by the quota, they are throttled for the remainder of the time specified by the period and not allowed to run until the next period. If tasks in a cgroup should be able to access a single CPU for 0.2 seconds out of every 1 second, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 1000000. Note that the quota and period parameters operate on a CPU basis. To allow a process to fully utilize two CPUs, for example, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 100000.
Setting the value in cpu.cfs_quota_us to -1 indicates that the cgroup does not adhere to any CPU time restrictions. This is also the default value for every cgroup (except the root cgroup).
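For example, assuming the cpu hierarchy is mounted at /cgroup/cpu and contains a hypothetical cgroup named red, any previously set quota could be removed with:
~]# echo -1 > /cgroup/cpu/red/cpu.cfs_quota_us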
cpu.stat
reports CPU time statistics using the following values:
  • nr_periods - number of period intervals (as specified in cpu.cfs_period_us) that have elapsed.
  • nr_throttled - number of times tasks in a cgroup have been throttled (that is, not allowed to run because they have exhausted all of the available time as specified by their quota).
  • throttled_time - the total time duration (in nanoseconds) for which tasks in a cgroup have been throttled.

Relative Shares Tunable Parameters

cpu.shares
contains an integer value that specifies a relative share of CPU time available to the tasks in a cgroup. For example, tasks in two cgroups that have cpu.shares set to 100 will receive equal CPU time, but tasks in a cgroup that has cpu.shares set to 200 receive twice the CPU time of tasks in a cgroup where cpu.shares is set to 100. The value specified in the cpu.shares file must be 2 or higher.
Note that shares of CPU time are distributed per all CPU cores on multi-core systems. Even if a cgroup is limited to less than 100% of CPU on a multi-core system, it may use 100% of each individual CPU core. Consider the following example: if cgroup A is configured to use 25% and cgroup B 75% of the CPU, starting four CPU-intensive processes (one in A and three in B) on a system with four cores results in the following division of CPU shares:

Table 3.1. CPU share division

PID   cgroup   CPU   CPU share
100   A        0     100% of CPU0
101   B        1     100% of CPU1
102   B        2     100% of CPU2
103   B        3     100% of CPU3

Using relative shares to specify CPU access has two implications on resource management that should be considered:
  • Because the CFS does not demand equal usage of CPU, it is hard to predict how much CPU time a cgroup will be allowed to utilize. When tasks in one cgroup are idle and are not using any CPU time, this left-over time is collected in a global pool of unused CPU cycles. Other cgroups are allowed to borrow CPU cycles from this pool.
  • The actual amount of CPU time that is available to a cgroup can vary depending on the number of cgroups that exist on the system. If a cgroup has a relative share of 1000 and two other cgroups have a relative share of 500, the first cgroup receives 50% of all CPU time in cases when processes in all cgroups attempt to use 100% of the CPU. However, if another cgroup is added with a relative share of 1000, the first cgroup is only allowed 33% of the CPU (the rest of the cgroups receive 16.5%, 16.5%, and 33% of CPU).

3.2.2. RT Tunable Parameters

The RT scheduler works similar to the ceiling enforcement control of the CFS (for more information, refer to Section 3.2.1, "CFS Tunable Parameters") but limits CPU access to real-time tasks only. The amount of time for which a real-time task can access the CPU is decided by allocating a run time and a period for each cgroup. All tasks in a cgroup are then allowed to access the CPU for the defined period of time for one run time (for example, tasks in a cgroup may be allowed to run 0.1 seconds in every 1 second).
cpu.rt_period_us
applicable to real-time scheduling tasks only, this parameter specifies a period of time in microseconds (µs, represented here as "us") for how regularly a cgroup's access to CPU resources should be reallocated. If tasks in a cgroup should be able to access a single CPU for 0.2 seconds out of every 1 second, set cpu.rt_runtime_us to 200000 and cpu.rt_period_us to 1000000.
cpu.rt_runtime_us
applicable to real-time scheduling tasks only, this parameter specifies a period of time in microseconds (µs, represented here as "us") for the longest continuous period in which the tasks in a cgroup have access to CPU resources. Establishing this limit prevents tasks in one cgroup from monopolizing CPU time. If tasks in a cgroup should be able to access a single CPU for 0.2 seconds out of every 1 second, set cpu.rt_runtime_us to 200000 and cpu.rt_period_us to 1000000. Note that the run time and period parameters operate on a CPU basis. To allow a real-time task to fully utilize two CPUs, for example, set cpu.rt_runtime_us to 200000 and cpu.rt_period_us to 100000.
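For example, assuming the cpu hierarchy is mounted at /cgroup/cpu and contains a hypothetical cgroup named red, the following commands would allow real-time tasks in that cgroup to run for 0.2 seconds out of every 1 second:
~]# echo 1000000 > /cgroup/cpu/red/cpu.rt_period_us
~]# echo 200000 > /cgroup/cpu/red/cpu.rt_runtime_us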

3.2.3. Example Usage

Example 3.2. Limiting CPU access

The following examples assume you have an existing hierarchy of cgroups configured and the cpu subsystem mounted on your system:
  • To allow one cgroup to use 25% of a single CPU and a different cgroup to use 75% of that same CPU, use the following commands:
    ~]# echo 250 > /cgroup/cpu/blue/cpu.shares
    ~]# echo 750 > /cgroup/cpu/red/cpu.shares
  • To limit a cgroup to fully utilize a single CPU, use the following commands:
    ~]# echo 10000 > /cgroup/cpu/red/cpu.cfs_quota_us
    ~]# echo 10000 > /cgroup/cpu/red/cpu.cfs_period_us
  • To limit a cgroup to utilize 10% of a single CPU, use the following commands:
    ~]# echo 10000 > /cgroup/cpu/red/cpu.cfs_quota_us
    ~]# echo 100000 > /cgroup/cpu/red/cpu.cfs_period_us
  • On a multi-core system, to allow a cgroup to fully utilize two CPU cores, use the following commands:
    ~]# echo 200000 > /cgroup/cpu/red/cpu.cfs_quota_us
    ~]# echo 100000 > /cgroup/cpu/red/cpu.cfs_period_us

3.3. cpuacct

The CPU Accounting (cpuacct) subsystem generates automatic reports on CPU resources used by the tasks in a cgroup, including tasks in child groups. Three reports are available:
cpuacct.usage
reports the total CPU time (in nanoseconds) consumed by all tasks in this cgroup (including tasks lower in the hierarchy).

Resetting cpuacct.usage

To reset the value in cpuacct.usage, execute the following command:
~]# echo 0 > /cgroup/cpuacct/cpuacct.usage
The above command also resets values in cpuacct.usage_percpu.
cpuacct.stat
reports the user and system CPU time consumed by all tasks in this cgroup (including tasks lower in the hierarchy) in the following way:
  • user - CPU time consumed by tasks in user mode.
  • system - CPU time consumed by tasks in system (kernel) mode.
CPU time is reported in the units defined by the USER_HZ variable.
cpuacct.usage_percpu
reports the CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup (including tasks lower in the hierarchy).
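For example, assuming the cpuacct hierarchy is mounted at /cgroup/cpuacct, the three reports for the root cgroup could be inspected with:
~]$ cat /cgroup/cpuacct/cpuacct.usage
~]$ cat /cgroup/cpuacct/cpuacct.stat
~]$ cat /cgroup/cpuacct/cpuacct.usage_percpu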

3.4. cpuset

The cpuset subsystem assigns individual CPUs and memory nodes to cgroups. Each cpuset can be specified according to the following parameters, each one in a separate pseudofile within the cgroup virtual file system:

Mandatory parameters

Some subsystems have mandatory parameters that must be set before you can move a task into a cgroup which uses any of those subsystems. For example, before you move a task into a cgroup which uses the cpuset subsystem, the cpuset.cpus and cpuset.mems parameters must be defined for that cgroup.
cpuset.cpus (mandatory)
specifies the CPUs that tasks in this cgroup are permitted to access. This is a comma-separated list in ASCII format, with dashes ("-") to represent ranges. For example,
0-2,16
represents CPUs 0, 1, 2, and 16.
cpuset.mems (mandatory)
specifies the memory nodes that tasks in this cgroup are permitted to access. This is a comma-separated list in ASCII format, with dashes ("-") to represent ranges. For example,
0-2,16
represents memory nodes 0, 1, 2, and 16.
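For example, assuming the cpuset hierarchy is mounted at /cgroup/cpuset on a system with at least two CPUs and one memory node, the following commands create a hypothetical cgroup, set both mandatory parameters, and then move a task (here with the example PID 1234) into it:
~]# mkdir /cgroup/cpuset/webserver
~]# echo 0-1 > /cgroup/cpuset/webserver/cpuset.cpus
~]# echo 0 > /cgroup/cpuset/webserver/cpuset.mems
~]# echo 1234 > /cgroup/cpuset/webserver/tasks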
cpuset.memory_migrate
contains a flag (0 or 1) that specifies whether a page in memory should migrate to a new node if the values in cpuset.mems change. By default, memory migration is disabled (0) and pages stay on the node to which they were originally allocated, even if this node is no longer one of the nodes now specified in cpuset.mems. If enabled (1), the system will migrate pages to memory nodes within the new parameters specified by cpuset.mems, maintaining their relative placement if possible - for example, pages on the second node on the list originally specified by cpuset.mems will be allocated to the second node on the list now specified by cpuset.mems, if this place is available.
cpuset.cpu_exclusive
contains a flag (0 or 1) that specifies whether cpusets other than this one and its parents and children can share the CPUs specified for this cpuset. By default (0), CPUs are not allocated exclusively to one cpuset.
cpuset.mem_exclusive
contains a flag (0 or 1) that specifies whether other cpusets can share the memory nodes specified for this cpuset. By default (0), memory nodes are not allocated exclusively to one cpuset. Reserving memory nodes for the exclusive use of a cpuset (1) is functionally the same as enabling a memory hardwall with the cpuset.mem_hardwall parameter.
cpuset.mem_hardwall
contains a flag (0 or 1) that specifies whether kernel allocations of memory page and buffer data should be restricted to the memory nodes specified for this cpuset. By default (0), page and buffer data is shared across processes belonging to multiple users. With a hardwall enabled (1), each task's user allocations can be kept separate.
cpuset.memory_pressure
a read-only file that contains a running average of the memory pressure created by the processes in this cpuset. The value in this pseudofile is automatically updated when cpuset.memory_pressure_enabled is enabled, otherwise, the pseudofile contains the value 0.
cpuset.memory_pressure_enabled
contains a flag (0 or 1) that specifies whether the system should compute the memory pressure created by the processes in this cgroup. Computed values are output to cpuset.memory_pressure and represent the rate at which processes attempt to free in-use memory, reported as an integer value of attempts to reclaim memory per second, multiplied by 1000.
cpuset.memory_spread_page
contains a flag (0 or 1) that specifies whether file system buffers should be spread evenly across the memory nodes allocated to this cpuset. By default (0), no attempt is made to spread memory pages for these buffers evenly, and buffers are placed on the same node on which the process that created them is running.
cpuset.memory_spread_slab
contains a flag (0 or 1) that specifies whether kernel slab caches for file input/output operations should be spread evenly across the cpuset. By default (0), no attempt is made to spread kernel slab caches evenly, and slab caches are placed on the same node on which the process that created them is running.
cpuset.sched_load_balance
contains a flag (0 or 1) that specifies whether the kernel will balance loads across the CPUs in this cpuset. By default (1), the kernel balances loads by moving processes from overloaded CPUs to less heavily used CPUs.
Note, however, that setting this flag in a cgroup has no effect if load balancing is enabled in any parent cgroup, as load balancing is already being carried out at a higher level. Therefore, to disable load balancing in a cgroup, disable load balancing also in each of its parents in the hierarchy. In this case, you should also consider whether load balancing should be enabled for any siblings of the cgroup in question.
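For example, assuming the cpuset hierarchy is mounted at /cgroup/cpuset and contains a hypothetical cgroup named group1 directly under the root, load balancing for that cgroup could be disabled by clearing the flag in the root cgroup as well:
~]# echo 0 > /cgroup/cpuset/cpuset.sched_load_balance
~]# echo 0 > /cgroup/cpuset/group1/cpuset.sched_load_balance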
cpuset.sched_relax_domain_level
contains an integer between -1 and a small positive value, which represents the width of the range of CPUs across which the kernel should attempt to balance loads. This value is meaningless if cpuset.sched_load_balance is disabled.
The precise effect of this value varies according to system architecture, but the following values are typical:
Values of cpuset.sched_relax_domain_level
Value   Effect
 -1     Use the system default value for load balancing
  0     Do not perform immediate load balancing; balance loads only periodically
  1     Immediately balance loads across threads on the same core
  2     Immediately balance loads across cores in the same package
  3     Immediately balance loads across CPUs on the same node or blade
  4     Immediately balance loads across several CPUs on architectures with non-uniform memory access (NUMA)
  5     Immediately balance loads across all CPUs on architectures with NUMA

3.5. devices

The devices subsystem allows or denies access to devices by tasks in a cgroup.

Technology preview

The Device Whitelist (devices) subsystem is considered to be a Technology Preview in Red Hat Enterprise Linux 6.
Technology preview features are currently not supported under Red Hat Enterprise Linux 6 subscription services, might not be functionally complete, and are generally not suitable for production use. However, Red Hat includes these features in the operating system as a customer convenience and to provide the feature with wider exposure. You might find these features useful in a non-production environment and are also free to provide feedback and functionality suggestions for a technology preview feature before it becomes fully supported.
devices.allow
specifies devices to which tasks in a cgroup have access. Each entry has four fields: type, major, minor, and access. The values used in the type, major, and minor fields correspond to device types and node numbers specified in Linux Allocated Devices, otherwise known as the Linux Devices List and available from http://www.kernel.org/doc/Documentation/devices.txt.
type
type can have one of the following three values:
  • a - applies to all devices, both character devices and block devices
  • b - specifies a block device
  • c - specifies a character device
major, minor
major and minor are device node numbers specified by Linux Allocated Devices. The major and minor numbers are separated by a colon. For example, 8 is the major number that specifies SCSI disk drives, and the minor number 1 specifies the first partition on the first SCSI disk drive; therefore 8:1 fully specifies this partition, corresponding to a file system location of /dev/sda1.
* can stand for all major or all minor device nodes, for example 9:* (all RAID devices) or *:* (all devices).
access
access is a sequence of one or more of the following letters:
  • r - allows tasks to read from the specified device
  • w - allows tasks to write to the specified device
  • m - allows tasks to create device files that do not yet exist
For example, when access is specified as r, tasks can only read from the specified device, but when access is specified as rw, tasks can read from and write to the device.
devices.deny
specifies devices that tasks in a cgroup cannot access. The syntax of entries is identical with devices.allow.
devices.list
reports the devices for which access controls have been set for tasks in this cgroup.
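For example, assuming the devices hierarchy is mounted at /cgroup/devices and contains a hypothetical cgroup named group1, the default allow-all entry could be removed and read/write access to /dev/null (character device 1:3) granted as follows:
~]# echo a > /cgroup/devices/group1/devices.deny
~]# echo "c 1:3 rw" > /cgroup/devices/group1/devices.allow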

3.6. freezer

The freezer subsystem suspends or resumes tasks in a cgroup.
freezer.state
freezer.state is only available in non-root cgroups, and has three possible values:
  • FROZEN - tasks in the cgroup are suspended.
  • FREEZING - the system is in the process of suspending tasks in the cgroup.
  • THAWED - tasks in the cgroup have resumed.
To suspend a specific process:
  1. Move that process to a cgroup in a hierarchy which has the freezer subsystem attached to it.
  2. Freeze that particular cgroup to suspend the process contained in it.
It is not possible to move a process into a suspended (frozen) cgroup.
Note that while the FROZEN and THAWED values can be written to freezer.state, FREEZING cannot be written, only read.
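For example, assuming the freezer hierarchy is mounted at /cgroup/freezer and the process to be suspended has already been moved into a hypothetical cgroup named group1, its tasks could be suspended and later resumed as follows:
~]# echo FROZEN > /cgroup/freezer/group1/freezer.state
~]# echo THAWED > /cgroup/freezer/group1/freezer.state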

3.7. memory

The memory subsystem generates automatic reports on memory resources used by the tasks in a cgroup, and sets limits on memory use by those tasks:
memory.stat
reports a wide range of memory statistics, as described in the following table:

Table 3.2. Values reported by memory.stat

Statistic                   Description
cache                       page cache, including tmpfs (shmem), in bytes
rss                         anonymous and swap cache, not including tmpfs (shmem), in bytes
mapped_file                 size of memory-mapped files, including tmpfs (shmem), in bytes
pgpgin                      number of pages paged into memory
pgpgout                     number of pages paged out of memory
swap                        swap usage, in bytes
active_anon                 anonymous and swap cache on active least-recently-used (LRU) list, including tmpfs (shmem), in bytes
inactive_anon               anonymous and swap cache on inactive LRU list, including tmpfs (shmem), in bytes
active_file                 file-backed memory on active LRU list, in bytes
inactive_file               file-backed memory on inactive LRU list, in bytes
unevictable                 memory that cannot be reclaimed, in bytes
hierarchical_memory_limit   memory limit for the hierarchy that contains the memory cgroup, in bytes
hierarchical_memsw_limit    memory plus swap limit for the hierarchy that contains the memory cgroup, in bytes

Additionally, each of these files other than hierarchical_memory_limit and hierarchical_memsw_limit has a counterpart prefixed total_ that reports not only on the cgroup, but on all its children as well. For example, swap reports the swap usage by a cgroup and total_swap reports the total swap usage by the cgroup and all its child groups.
When you interpret the values reported by memory.stat, note how the various statistics inter-relate:
  • active_anon + inactive_anon = anonymous memory + file cache for tmpfs + swap cache
    Therefore, active_anon + inactive_anon ≠ rss, because rss does not include tmpfs.
  • active_file + inactive_file = cache - size of tmpfs
memory.usage_in_bytes
reports the total current memory usage by processes in the cgroup (in bytes).
memory.memsw.usage_in_bytes
reports the sum of current memory usage plus swap space used by processes in the cgroup (in bytes).
memory.max_usage_in_bytes
reports the maximum memory used by processes in the cgroup (in bytes).
memory.memsw.max_usage_in_bytes
reports the maximum amount of memory and swap space used by processes in the cgroup (in bytes).
memory.limit_in_bytes
sets the maximum amount of user memory (including file cache). If no units are specified, the value is interpreted as bytes. However, it is possible to use suffixes to represent larger units - k or K for kilobytes, m or M for Megabytes, and g or G for Gigabytes.
You cannot use memory.limit_in_bytes to limit the root cgroup; you can only apply values to groups lower in the hierarchy.
Write -1 to memory.limit_in_bytes to remove any existing limits.
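For example, assuming the memory hierarchy is mounted at /cgroup/memory and contains a hypothetical cgroup named lab1, the following commands set a 1 GB limit on user memory and then remove it again:
~]# echo 1G > /cgroup/memory/lab1/memory.limit_in_bytes
~]# echo -1 > /cgroup/memory/lab1/memory.limit_in_bytes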
memory.memsw.limit_in_bytes
sets the maximum amount for the sum of memory and swap usage. If no units are specified, the value is interpreted as bytes. However, it is possible to use suffixes to represent larger units - k or K for kilobytes, m or M for Megabytes, and g or G for Gigabytes.
You cannot use memory.memsw.limit_in_bytes to limit the root cgroup; you can only apply values to groups lower in the hierarchy.
Write -1 to memory.memsw.limit_in_bytes to remove any existing limits.

Setting the memory.memsw.limit_in_bytes and memory.limit_in_bytes parameters

It is important to set the memory.limit_in_bytes parameter before setting the memory.memsw.limit_in_bytes parameter: attempting to do so in the reverse order results in an error. This is because memory.memsw.limit_in_bytes becomes available only after all memory limitations (previously set in memory.limit_in_bytes) are exhausted.
Consider the following example: setting memory.limit_in_bytes = 2G and memory.memsw.limit_in_bytes = 4G for a certain cgroup will allow processes in that cgroup to allocate 2 GB of memory and, once exhausted, allocate another 2 GB of swap only. The memory.memsw.limit_in_bytes parameter represents the sum of memory and swap. Processes in a cgroup that does not have the memory.memsw.limit_in_bytes parameter set can potentially use up all the available swap (after exhausting the set memory limitation) and trigger an Out Of Memory situation caused by the lack of available swap.
The order in which the memory.limit_in_bytes and memory.memsw.limit_in_bytes parameters are set in the /etc/cgconfig.conf file is important as well. The following is a correct example of such a configuration:
memory {
    memory.limit_in_bytes = 1G;
    memory.memsw.limit_in_bytes = 1G;
}
memory.failcnt
reports the number of times that the memory limit has reached the value set in memory.limit_in_bytes.
memory.memsw.failcnt
reports the number of times that the memory plus swap space limit has reached the value set in memory.memsw.limit_in_bytes.
memory.force_empty
when set to 0, empties memory of all pages used by tasks in this cgroup. This interface can only be used when the cgroup has no tasks. If memory cannot be freed, it is moved to a parent cgroup if possible. Use the memory.force_empty parameter before removing a cgroup to avoid moving out-of-use page caches to its parent cgroup.
memory.swappiness
sets the tendency of the kernel to swap out process memory used by tasks in this cgroup instead of reclaiming pages from the page cache. This is the same tendency, calculated the same way, as set in /proc/sys/vm/swappiness for the system as a whole. The default value is 60. Values lower than 60 decrease the kernel's tendency to swap out process memory, values greater than 60 increase the kernel's tendency to swap out process memory, and values greater than 100 permit the kernel to swap out pages that are part of the address space of the processes in this cgroup.
Note that a value of 0 does not prevent process memory being swapped out; swap out might still happen when there is a shortage of system memory because the global virtual memory management logic does not read the cgroup value. To lock pages completely, use mlock() instead of cgroups.
You cannot change the swappiness of the following groups:
  • the root cgroup, which uses the swappiness set in /proc/sys/vm/swappiness.
  • a cgroup that has child groups below it.
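For example, assuming the memory hierarchy is mounted at /cgroup/memory and contains a hypothetical cgroup named lab1 with no child groups, the kernel's tendency to swap out memory used by its tasks could be reduced with:
~]# echo 10 > /cgroup/memory/lab1/memory.swappiness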
memory.use_hierarchy
contains a flag (0 or 1) that specifies whether memory usage should be accounted for throughout a hierarchy of cgroups. If enabled (1), the memory subsystem reclaims memory from the children of a process that exceeds its memory limit. By default (0), the subsystem does not reclaim memory from a task's children.
memory.oom_control
contains a flag (0 or 1) that enables or disables the Out of Memory killer for a cgroup. If enabled (0), tasks that attempt to consume more memory than they are allowed are immediately killed by the OOM killer. The OOM killer is enabled by default in every cgroup using the memory subsystem; to disable it, write 1 to the memory.oom_control file:
~]# echo 1 > /cgroup/memory/lab1/memory.oom_control
When the OOM killer is disabled, tasks that attempt to use more memory than they are allowed are paused until additional memory is freed.
The memory.oom_control file also reports the OOM status of the current cgroup under the under_oom entry. If the cgroup is out of memory and tasks in it are paused, the under_oom entry reports the value 1.
The memory.oom_control file is capable of reporting an occurrence of an OOM situation using the notification API. For more information, refer to Section 2.13, "Using the Notification API" and Example 3.3, "OOM Control and Notifications".

3.7.1. Example Usage

Example 3.3. OOM Control and Notifications

The following example demonstrates how the OOM killer takes action when a task in a cgroup attempts to use more memory than allowed, and how a notification handler can report OOM situations:
  1. Attach the memory subsystem to a hierarchy and create a cgroup:
    ~]# mount -t cgroup -o memory memory /cgroup/memory
    ~]# mkdir /cgroup/memory/blue
  2. Set the amount of memory which tasks in the blue cgroup can use to 100MB:
    ~]# echo 104857600 > /cgroup/memory/blue/memory.limit_in_bytes
  3. Change into the blue directory and make sure the OOM killer is enabled:
    ~]# cd /cgroup/memory/blue
    blue]# cat memory.oom_control
    oom_kill_disable 0
    under_oom 0
  4. Move the current shell process into the tasks file of the blue cgroup so that all other processes started in this shell are automatically moved to the blue cgroup:
    blue]# echo $$ > tasks
  5. Start a test program that attempts to allocate a large amount of memory exceeding the limit you set in Step 2. As soon as the blue cgroup runs out of free memory, the OOM killer kills the test program and reports Killed to the standard output:
    blue]# ~/mem-hog
    Killed
    The following is an example of such a test program[5]:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define KB (1024)
    #define MB (1024 * KB)
    #define GB (1024 * MB)

    int main(int argc, char *argv[])
    {
        char *p;

    again:
        while ((p = (char *)malloc(GB)))
            memset(p, 0, GB);

        while ((p = (char *)malloc(MB)))
            memset(p, 0, MB);

        while ((p = (char *)malloc(KB)))
            memset(p, 0, KB);

        sleep(1);

        goto again;

        return 0;
    }
  6. Disable the OOM killer and re-run the test program. This time, the test program remains paused waiting for additional memory to be freed:
    blue]# echo 1 > memory.oom_control
    blue]# ~/mem-hog
  7. While the test program is paused, note that the under_oom state of the cgroup has changed to indicate that the cgroup is out of available memory:
    ~]# cat /cgroup/memory/blue/memory.oom_control
    oom_kill_disable 1
    under_oom 1
    Re-enabling the OOM killer immediately kills the test program.
  8. To receive notifications about every OOM situation, create a program as specified in Section 2.13, "Using the Notification API". For example[6]:
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <sys/eventfd.h>
    #include <errno.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>   /* uint64_t */
    #include <unistd.h>   /* read(), write(), close() */

    static inline void die(const char *msg)
    {
        fprintf(stderr, "error: %s: %s(%d)\n", msg, strerror(errno), errno);
        exit(EXIT_FAILURE);
    }

    static inline void usage(void)
    {
        fprintf(stderr, "usage: oom_eventfd_test <cgroup.event_control> <memory.oom_control>\n");
        exit(EXIT_FAILURE);
    }

    #define BUFSIZE 256

    int main(int argc, char *argv[])
    {
        char buf[BUFSIZE];
        int efd, cfd, ofd, wb;
        uint64_t u;

        if (argc != 3)
            usage();

        if ((efd = eventfd(0, 0)) == -1)
            die("eventfd");

        if ((cfd = open(argv[1], O_WRONLY)) == -1)
            die("cgroup.event_control");

        if ((ofd = open(argv[2], O_RDONLY)) == -1)
            die("memory.oom_control");

        if ((wb = snprintf(buf, BUFSIZE, "%d %d", efd, ofd)) >= BUFSIZE)
            die("buffer too small");

        if (write(cfd, buf, wb) == -1)
            die("write cgroup.event_control");

        if (close(cfd) == -1)
            die("close cgroup.event_control");

        for (;;) {
            if (read(efd, &u, sizeof(uint64_t)) != sizeof(uint64_t))
                die("read eventfd");
            printf("mem_cgroup oom event received\n");
        }

        return 0;
    }
    The above program detects OOM situations in a cgroup specified as an argument on the command line and reports them using the mem_cgroup oom event received string to the standard output.
  9. Run the above notification handler program in a separate console, specifying the blue cgroup's control files as arguments:
    ~]$ ./oom_notification /cgroup/memory/blue/cgroup.event_control /cgroup/memory/blue/memory.oom_control
  10. In a different console, run the mem-hog test program to create an OOM situation and watch the oom_notification program report it on the standard output:
    blue]# ~/mem-hog

3.8. net_cls

The net_cls subsystem tags network packets with a class identifier (classid) that allows the Linux traffic controller (tc) to identify packets originating from a particular cgroup. The traffic controller can be configured to assign different priorities to packets from different cgroups.
net_cls.classid
net_cls.classid contains a single value that indicates a traffic control handle. The value of classid read from the net_cls.classid file is presented in the decimal format while the value to be written to the file is expected in the hexadecimal format. For example, 0x100001 represents the handle conventionally written as 10:1 in the format used by iproute2. In the net_cls.classid file, it would be represented by the number 1048577.
The format for these handles is: 0xAAAABBBB, where AAAA is the major number in hexadecimal and BBBB is the minor number in hexadecimal. You can omit any leading zeroes; 0x10001 is the same as 0x00010001, and represents 1:1. The following is an example of setting a 10:1 handle in the net_cls.classid file:
~]# echo 0x100001 > /cgroup/net_cls/red/net_cls.classid
~]# cat /cgroup/net_cls/red/net_cls.classid
1048577
Refer to the man page for tc to learn how to configure the traffic controller to use the handles that the net_cls adds to network packets.
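As an illustrative sketch only (assuming the eth0 interface and an arbitrary 40mbit rate), traffic carrying the 10:1 classid set above could be shaped by creating an HTB qdisc with handle 10:, adding a matching 10:1 class, and attaching the tc cgroup filter, which classifies packets by the classid of the cgroup that sent them:
~]# tc qdisc add dev eth0 root handle 10: htb
~]# tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit
~]# tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup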

3.9. net_prio

The Network Priority (net_prio) subsystem provides a way to dynamically set the priority of network traffic per network interface for applications within various cgroups. A network priority is a number assigned to network traffic and used internally by the system and network devices. Network priority is used to differentiate packets that are sent, queued, or dropped. The tc command may be used to set a network priority (setting the network priority via the tc command is outside the scope of this guide; for more information, refer to the tc man page).
Typically, an application sets the priority of its traffic via the SO_PRIORITY socket option. However, applications are often not coded to set the priority value, or the application's traffic is site-specific and does not provide a defined priority.
Using the net_prio subsystem in a cgroup allows an administrator to assign a process to a specific cgroup which defines the priority of outgoing traffic on a given network interface.
net_prio.prioidx
a read-only file which contains a unique integer value that the kernel uses as an internal representation of this cgroup.
net_prio.ifpriomap
contains a map of priorities assigned to traffic originating from processes in this group and leaving the system on various interfaces. This map is represented by a list of pairs in the form <network_interface> <priority>:
~]# cat /cgroup/net_prio/iscsi/net_prio.ifpriomap
eth0 5
eth1 4
eth2 6
Contents of the net_prio.ifpriomap file can be modified by echoing a string into the file using the above format, for example:
~]# echo "eth0 5" > /cgroup/net_prio/iscsi/net_prio.ifpriomap
The above command forces any traffic originating from processes belonging to the iscsi net_prio cgroup and leaving the system on the eth0 network interface to have its priority set to 5. The parent cgroup also has a writable net_prio.ifpriomap file that can be used to set a system default priority.
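For example, a hypothetical system default priority of 1 for traffic leaving on eth0 could be set by writing to the root cgroup's map:
~]# echo "eth0 1" > /cgroup/net_prio/net_prio.ifpriomap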

3.10. ns

The ns subsystem provides a way to group processes into separate namespaces. Within a particular namespace, processes can interact with each other but are isolated from processes running in other namespaces. These separate namespaces are sometimes referred to as containers when used for operating-system-level virtualization.

3.11. perf_event

When the perf_event subsystem is attached to a hierarchy, all cgroups in that hierarchy can be used to group processes and threads which can then be monitored with the perf tool, as opposed to monitoring each process or thread separately or per-CPU. Cgroups which use the perf_event subsystem do not contain any special tunable parameters other than the common parameters listed in Section 3.12, "Common Tunable Parameters".
For additional information on how tasks in a cgroup can be monitored using the perf tool, refer to the Red Hat Enterprise Linux Developer Guide, available from http://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/.

3.12. Common Tunable Parameters

The following parameters are present in every created cgroup, regardless of the subsystem that the cgroup is using:
tasks
contains a list of processes, represented by their PIDs, that are running in a cgroup. The list of PIDs is not guaranteed to be ordered or unique (that is, it may contain duplicate entries). Writing a PID into the tasks file of a cgroup moves that process into that cgroup.
cgroup.procs
contains a list of thread groups, represented by their TGIDs, that are running in a cgroup. The list of TGIDs is not guaranteed to be ordered or unique (that is, it may contain duplicate entries). Writing a TGID into the cgroup.procs file of a cgroup moves that thread group into that cgroup.
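For example, assuming the cpu hierarchy is mounted at /cgroup/cpu and contains a hypothetical cgroup named red, the current shell and all of its threads could be moved into that cgroup with:
~]# echo $$ > /cgroup/cpu/red/cgroup.procs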
cgroup.event_control
along with the cgroup notification API, allows notifications to be sent about a changing status of a cgroup.
notify_on_release
contains a Boolean value, 1 or 0, that either enables or disables the execution of the release agent. If notify_on_release is enabled, the kernel executes the contents of the release_agent file when a cgroup no longer contains any tasks (that is, the cgroup's tasks file contained some PIDs and those PIDs were removed, leaving the file empty). A path to the empty cgroup is provided as an argument to the release agent.
The default value of the notify_on_release parameter in the root cgroup is 0. All non-root cgroups inherit the value in notify_on_release from their parent cgroup.
release_agent (present in the root cgroup only)
contains a command to be executed when a "notify on release" is triggered. Once a cgroup is emptied of all processes, and the notify_on_release flag is enabled, the kernel runs the command in the release_agent file and supplies it with a relative path (relative to the root cgroup) to the emptied cgroup as an argument. The release agent can be used, for example, to automatically remove empty cgroups; for more information, see Example 3.4, "Automatically removing empty cgroups".

Example 3.4. Automatically removing empty cgroups

Follow these steps to configure automatic removal of any emptied cgroup from the cpu cgroup:
  1. Create a shell script that removes empty cpu cgroups, place it in, for example, /usr/local/bin, and make it executable.
    ~]# cat /usr/local/bin/remove-empty-cpu-cgroup.sh
    #!/bin/sh
    rmdir /cgroup/cpu/$1
    ~]# chmod +x /usr/local/bin/remove-empty-cpu-cgroup.sh
    The $1 variable contains a relative path to the emptied cgroup.
  2. In the cpu cgroup, enable the notify_on_release flag:
    ~]# echo 1 > /cgroup/cpu/notify_on_release
  3. In the cpu cgroup, specify a release agent to be used:
    ~]# echo "/usr/local/bin/remove-empty-cpu-cgroup.sh" > /cgroup/cpu/release_agent
  4. Test your configuration to make sure emptied cgroups are properly removed:
    cpu]# pwd; ls
    /cgroup/cpu
    cgroup.event_control  cgroup.procs  cpu.cfs_period_us  cpu.cfs_quota_us  cpu.rt_period_us  cpu.rt_runtime_us  cpu.shares  cpu.stat  libvirt  notify_on_release  release_agent  tasks
    cpu]# cat notify_on_release
    1
    cpu]# cat release_agent
    /usr/local/bin/remove-empty-cpu-cgroup.sh
    cpu]# mkdir blue; ls
    blue  cgroup.event_control  cgroup.procs  cpu.cfs_period_us  cpu.cfs_quota_us  cpu.rt_period_us  cpu.rt_runtime_us  cpu.shares  cpu.stat  libvirt  notify_on_release  release_agent  tasks
    cpu]# cat blue/notify_on_release
    1
    cpu]# cgexec -g cpu:blue dd if=/dev/zero of=/dev/null bs=1024k &
    [1] 8623
    cpu]# cat blue/tasks
    8623
    cpu]# kill -9 8623
    cpu]# ls
    cgroup.event_control  cgroup.procs  cpu.cfs_period_us  cpu.cfs_quota_us  cpu.rt_period_us  cpu.rt_runtime_us  cpu.shares  cpu.stat  libvirt  notify_on_release  release_agent  tasks

3.13. Additional Resources

Subsystem-Specific Kernel Documentation

All of the following files are located under the /usr/share/doc/kernel-doc-<kernel_version>/Documentation/cgroups/ directory (provided by the kernel-doc package).
  • blkio subsystem - blkio-controller.txt
  • cpuacct subsystem - cpuacct.txt
  • cpuset subsystem - cpusets.txt
  • devices subsystem - devices.txt
  • freezer subsystem - freezer-subsystem.txt
  • memory subsystem - memory.txt
  • net_prio subsystem - net_prio.txt
Additionally, refer to the following files for further information about the cpu subsystem:
  • Real-Time scheduling - /usr/share/doc/kernel-doc-<kernel_version>/Documentation/scheduler/sched-rt-group.txt
  • CFS scheduling - /usr/share/doc/kernel-doc-<kernel_version>/Documentation/scheduler/sched-bwc.txt


[5] Source code provided by Red Hat Engineer František Hrbata.
[6] Source code provided by Red Hat Engineer František Hrbata.