
Storage Administration Guide

Chapter 18. The volume_key function

The volume_key function provides two tools, libvolume_key and volume_key. libvolume_key is a library for manipulating storage volume encryption keys and storing them separately from volumes. volume_key is an associated command line tool used to extract keys and passphrases in order to restore access to an encrypted hard drive.
This is useful for when the primary user forgets their keys and passwords, after an employee leaves abruptly, or in order to extract data after a hardware or software failure corrupts the header of the encrypted volume. In a corporate setting, the IT help desk can use volume_key to back up the encryption keys before handing over the computer to the end user.
Currently, volume_key only supports the LUKS volume encryption format.

Note

volume_key is not included in a standard install of Red Hat Enterprise Linux 6 server. For information on installing it, refer to http://fedoraproject.org/wiki/Disk_encryption_key_escrow_use_cases.

18.1. Commands

The format for volume_key is:
volume_key [OPTION]... OPERAND
The operands and mode of operation for volume_key are determined by specifying one of the following options:
--save
This command expects the operand volume [packet]. If a packet is provided then volume_key will extract the keys and passphrases from it. If packet is not provided, then volume_key will extract the keys and passphrases from the volume, prompting the user where necessary. These keys and passphrases will then be stored in one or more output packets.
--restore
This command expects the operands volume packet. It then opens the volume and uses the keys and passphrases in the packet to make the volume accessible again, prompting the user where necessary (for example, to enter a new passphrase).
--setup-volume
This command expects the operands volume packet name. It then opens the volume and uses the keys and passphrases in the packet to set up the volume for use of the decrypted data as name.
Name is the name of a dm-crypt volume. This operation makes the decrypted volume available as /dev/mapper/name.
This operation does not permanently alter the volume (by adding a new passphrase, for example). The user can access and modify the decrypted volume, modifying the volume in the process.
--reencrypt, --secrets, and --dump
These three commands perform similar functions with varying output methods. They each require the operand packet, and each opens the packet, decrypting it where necessary. --reencrypt then stores the information in one or more new output packets. --secrets outputs the keys and passphrases contained in the packet. --dump outputs the content of the packet, though the keys and passphrases are not output by default; this can be changed by appending --with-secrets to the command. It is also possible to dump only the unencrypted parts of the packet, if any, by using the --unencrypted option. This does not require any passphrase or private key access.
Each of these can be appended with the following options:
-o, --output packet
This command writes the default key or passphrase to the packet. The default key or passphrase depends on the volume format. Ensure it is one that is unlikely to expire, and will allow --restore to restore access to the volume.
--output-format format
This command uses the specified format for all output packets. Currently, format can be one of the following:
  • asymmetric: uses CMS to encrypt the whole packet, and requires a certificate
  • asymmetric_wrap_secret_only: wraps only the secret, or keys and passphrases, and requires a certificate
  • passphrase: uses GPG to encrypt the whole packet, and requires a passphrase
--create-random-passphrase packet
This command generates a random alphanumeric passphrase, adds it to the volume (without affecting other passphrases), and then stores this random passphrase into the packet.
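For example, the following sketch shows how these options combine on the command line (the volume path and packet name are placeholders, not values from this guide):
volume_key --save /path/to/volume --output-format passphrase -o escrow-packet
volume_key --dump escrow-packet
volume_key --dump --with-secrets escrow-packet
The first command stores the volume's default key in a passphrase-encrypted output packet, the second prints the packet contents without the secrets, and the third also prints the keys and passphrases.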

18.2. Using volume_key as an individual user

An individual user can use volume_key to save encryption keys by following this procedure.

Note

For all examples in this file, /path/to/volume is a LUKS device, not the plaintext device contained within. blkid -s type /path/to/volume should report type="crypto_LUKS".

Procedure 18.1. Using volume_key stand-alone

  1. Run:
    volume_key --save /path/to/volume -o escrow-packet
    A prompt will then appear requiring an escrow packet passphrase to protect the key.
  2. Save the generated escrow-packet file, ensuring that the passphrase is not forgotten.
If the volume passphrase is forgotten, use the saved escrow packet to restore access to the data.

Procedure 18.2. Restore access to data with escrow packet

  1. Boot the system in an environment where volume_key can be run and the escrow packet is available (a rescue mode, for example).
  2. Run:
    volume_key --restore /path/to/volume escrow-packet
    A prompt will appear for the escrow packet passphrase that was used when creating the escrow packet, and for the new passphrase for the volume.
  3. Mount the volume using the chosen passphrase.
To free up the passphrase slot in the LUKS header of the encrypted volume, remove the old, forgotten passphrase by using the command cryptsetup luksKillSlot.
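For example, a sketch of removing a forgotten passphrase (the key slot number 0 is a placeholder; take care to identify the correct slot, and note that cryptsetup asks for a remaining valid passphrase before removing a slot):
cryptsetup luksDump /path/to/volume
cryptsetup luksKillSlot /path/to/volume 0
luksDump lists which key slots are currently in use on the volume.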

18.3. Using volume_key in a larger organization

In a larger organization, using a single password known by every system administrator is a security risk, and keeping track of a separate password for each system is impractical. To counter this, volume_key can use asymmetric cryptography to minimize the number of people who know the password required to access encrypted data on any computer.
This section will cover the procedures required for preparation before saving encryption keys, how to save encryption keys, restoring access to a volume, and setting up emergency passphrases.

18.3.1. Preparation for saving encryption keys

In order to begin saving encryption keys, some preparation is required.

Procedure 18.3. Preparation

  1. Create an X509 certificate/private key pair (a sample openssl invocation is sketched after this procedure).
  2. Designate trusted users who are trusted not to compromise the private key. These users will be able to decrypt the escrow packets.
  3. Choose which systems will be used to decrypt the escrow packets. On these systems, set up an NSS database that contains the private key.
    If the private key was not created in an NSS database, follow these steps:
    • Store the certificate and private key in a PKCS#12 file.
    • Run:
      certutil -d /the/nss/directory -N
      At this point it is possible to choose an NSS database password. Each NSS database can have a different password so the designated users do not need to share a single password if a separate NSS database is used by each user.
    • Run:
      pk12util -d /the/nss/directory -i the-pkcs12-file
  4. Distribute the certificate to anyone installing systems or saving keys on existing systems.
  5. For saved private keys, prepare storage that allows them to be looked up by machine and volume. For example, this can be a simple directory with one subdirectory per machine, or a database used for other system management tasks as well.
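The certificate/private key pair from step 1 can be generated with openssl, for example (a sketch only; the file names, subject, and validity period are arbitrary placeholders):
openssl req -x509 -newkey rsa:2048 -nodes -keyout escrow-key.pem -out escrow-cert.pem -days 3650 -subj '/CN=volume_key escrow'
openssl pkcs12 -export -inkey escrow-key.pem -in escrow-cert.pem -out escrow.p12
The resulting escrow.p12 file is the PKCS#12 file imported into the NSS database with pk12util in step 3, and escrow-cert.pem is the certificate distributed in step 4.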

18.3.2. Saving encryption keys

After completing the required preparation (see Section 18.3.1, "Preparation for saving encryption keys") it is now possible to save the encryption keys using the following procedure.

Note

For all examples in this file, /path/to/volume is a LUKS device, not the plaintext device contained within; blkid -s type /path/to/volume should report type="crypto_LUKS".

Procedure 18.4. Saving encryption keys

  1. Run:
    volume_key --save /path/to/volume -c /path/to/cert escrow-packet
  2. Save the generated escrow-packet file in the prepared storage, associating it with the system and the volume.
These steps can be performed manually, or scripted as part of system installation.
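A minimal sketch of how the save step might be scripted during installation (the certificate path and storage location are assumptions, not paths defined in this guide):
#!/bin/bash
# Sketch only: save an escrow packet for every LUKS volume found on this system.
CERT=/path/to/cert
STORAGE=/mnt/escrow-storage/$(hostname)
mkdir -p "$STORAGE"
# blkid lists every device whose type is crypto_LUKS
for dev in $(blkid -t TYPE=crypto_LUKS -o device); do
    volume_key --save "$dev" -c "$CERT" "$STORAGE/$(basename "$dev")-escrow-packet"
done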

18.3.3. Restoring access to a volume

After the encryption keys have been saved (see Section 18.3.1, "Preparation for saving encryption keys" and Section 18.3.2, "Saving encryption keys"), access to a volume can be restored where needed.

Procedure 18.5. Restoring access to a volume

  1. Get the escrow packet for the volume from the packet storage and send it to one of the designated users for decryption.
  2. The designated user runs:
    volume_key --reencrypt -d /the/nss/directory escrow-packet-in -o escrow-packet-out
    After providing the NSS database password, the designated user chooses a passphrase for encrypting escrow-packet-out. This passphrase can be different every time and only protects the encryption keys while they are moved from the designated user to the target system.
  3. Obtain the escrow-packet-out file and the passphrase from the designated user.
  4. Boot the target system in an environment that can run volume_key and have the escrow-packet-out file available, such as in a rescue mode.
  5. Run:
    volume_key --restore /path/to/volume escrow-packet-out
    A prompt will appear for the packet passphrase chosen by the designated user, and for a new passphrase for the volume.
  6. Mount the volume using the chosen volume passphrase.
To free up the passphrase slot in the LUKS header of the encrypted volume, remove the old, forgotten passphrase with the command cryptsetup luksKillSlot device key-slot. For more information and examples, see cryptsetup --help.

18.3.4. Setting up emergency passphrases

In some circumstances (such as traveling for business) it is impractical for system administrators to work directly with the affected systems, but users still need access to their data. In this case, volume_key can work with passphrases as well as encryption keys.
During the system installation, run:
volume_key --save /path/to/volume -c /path/to/cert --create-random-passphrase passphrase-packet
This generates a random passphrase, adds it to the specified volume, and stores it to passphrase-packet. It is also possible to combine the --create-random-passphrase and -o options to generate both packets at the same time.
If a user forgets the password, the designated user runs:
volume_key --secrets -d /your/nss/directory passphrase-packet
This shows the random passphrase. Give this passphrase to the end user.

18.4. Documentation

More information on volume_key can be found in the volume_key man page (man volume_key).

Chapter 19. Access Control Lists

Files and directories have permission sets for the owner of the file, the group associated with the file, and all other users of the system. However, these permission sets have limitations. For example, different permissions cannot be configured for different users. Thus, Access Control Lists (ACLs) were implemented.
The Red Hat Enterprise Linux kernel provides ACL support for the ext3 file system and NFS-exported file systems. ACLs are also recognized on ext3 file systems accessed via Samba.
Along with support in the kernel, the acl package is required to implement ACLs. It contains the utilities used to add, modify, remove, and retrieve ACL information.
The cp and mv commands copy or move any ACLs associated with files and directories.

19.1. Mounting File Systems

Before using ACLs for a file or directory, the partition for the file or directory must be mounted with ACL support. If it is a local ext3 file system, it can be mounted with the following command:
mount -t ext3 -o acl device-name partition
For example:
mount -t ext3 -o acl /dev/VolGroup00/LogVol02 /work
Alternatively, if the partition is listed in the /etc/fstab file, the entry for the partition can include the acl option:
LABEL=/work  /work   ext3 acl 1 2
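If the file system is already mounted, the new option can be applied without a reboot by remounting it, for example:
mount -o remount,acl /work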
If an ext3 file system is accessed via Samba and ACLs have been enabled for it, the ACLs are recognized because Samba has been compiled with the --with-acl-support option. No special flags are required when accessing or mounting a Samba share.

19.1.1. NFS

By default, if the file system being exported by an NFS server supports ACLs and the NFS client can read ACLs, ACLs are utilized by the client system.
To disable ACLs on NFS shares when configuring the server, include the no_acl option in the /etc/exports file. To disable ACLs on an NFS share when mounting it on a client, mount it with the no_acl option via the command line or the /etc/fstab file.

19.2. Setting Access ACLs

There are two types of ACLs: access ACLs and default ACLs. An access ACL is the access control list for a specific file or directory. A default ACL can only be associated with a directory; if a file within the directory does not have an access ACL, it uses the rules of the default ACL for the directory. Default ACLs are optional.
ACLs can be configured:
  • Per user
  • Per group
  • Via the effective rights mask
  • For users not in the user group for the file
The setfacl utility sets ACLs for files and directories. Use the -m option to add or modify the ACL of a file or directory:
# setfacl -m rules files
Rules (rules) must be specified in the following formats. Multiple rules can be specified in the same command if they are separated by commas.
u:uid:perms
Sets the access ACL for a user. The user name or UID may be specified. The user may be any valid user on the system.
g:gid:perms
Sets the access ACL for a group. The group name or GID may be specified. The group may be any valid group on the system.
m:perms
Sets the effective rights mask. The mask is the union of all permissions of the owning group and all of the user and group entries.
o:perms
Sets the access ACL for users other than the ones in the group for the file.
Permissions (perms) must be a combination of the characters r, w, and x for read, write, and execute.
If a file or directory already has an ACL, and the setfacl command is used, the additional rules are added to the existing ACL or the existing rule is modified.

Example 19.1. Give read and write permissions

For example, to give read and write permissions to user andrius:
# setfacl -m u:andrius:rw /project/somefile
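Multiple rules can be combined in one command by separating them with commas. For example, a sketch that also grants read access to a group named devel (the group name is a placeholder):
# setfacl -m u:andrius:rw,g:devel:r /project/somefile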

To remove all the permissions for a user, group, or others, use the -x option and do not specify any permissions:
# setfacl -x rules files

Example 19.2. Remove all permissions

For example, to remove all permissions from the user with UID 500:
# setfacl -x u:500 /project/somefile

19.3. Setting Default ACLs

To set a default ACL, add d: before the rule and specify a directory instead of a file name.

Example 19.3. Setting default ACLs

For example, to set the default ACL for the /share/ directory to read and execute for users not in the user group (an access ACL for an individual file can override it):
# setfacl -m d:o:rx /share

19.4. Retrieving ACLs

To determine the existing ACLs for a file or directory, use the getfacl command. In the example below, getfacl is used to determine the existing ACLs for a file.

Example 19.4. Retrieving ACLs

# getfacl home/john/picture.png
The above command returns the following output:
# file: home/john/picture.png
# owner: john
# group: john
user::rw-
group::r--
other::r--
If a directory with a default ACL is specified, the default ACL is also displayed as illustrated below. For example, getfacl home/sales/ will display similar output:
# file: home/sales/
# owner: john
# group: john
user::rw-
user:barryg:r--
group::r--
mask::r--
other::r--
default:user::rwx
default:user:john:rwx
default:group::r-x
default:mask::rwx
default:other::r-x

19.5. Archiving File Systems With ACLs

By default, the dump command now preserves ACLs during a backup operation. When archiving a file or file system with tar, use the --acls option to preserve ACLs. Similarly, when using cp to copy files with ACLs, include the --preserve=mode option to ensure that ACLs are copied across too. In addition, the -a option (equivalent to -dR --preserve=all) of cp also preserves ACLs during a backup along with other information such as timestamps, SELinux contexts, and the like. For more information about dump, tar, or cp, refer to their respective man pages.
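For example (a sketch; the archive and path names are placeholders):
# tar --acls -cf backup.tar /project
# cp --preserve=mode /project/somefile /backup/somefile
# cp -a /project /backup/project
The first command creates a tar archive that retains ACLs, the second copies a single file while preserving its mode and ACLs, and the third recursively copies a directory while also preserving timestamps and SELinux contexts.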
The star utility is similar to the tar utility in that it can be used to generate archives of files; however, some of its options are different. Refer to Table 19.1, "Command Line Options for star" for a listing of more commonly used options. For all available options, refer to man star. The star package is required to use this utility.

Table 19.1. Command Line Options for star

Option   Description
-c       Creates an archive file.
-n       Do not extract the files; use in conjunction with -x to show what extracting the files does.
-r       Replaces files in the archive. The files are written to the end of the archive file, replacing any files with the same path and file name.
-t       Displays the contents of the archive file.
-u       Updates the archive file. The files are written to the end of the archive if they do not exist in the archive, or if the files are newer than the files of the same name in the archive. This option only works if the archive is a file or an unblocked tape that may backspace.
-x       Extracts the files from the archive. If used with -U and a file in the archive is older than the corresponding file on the file system, the file is not extracted.
-help    Displays the most important options.
-xhelp   Displays the least important options.
-/       Do not strip leading slashes from file names when extracting the files from an archive. By default, they are stripped when files are extracted.
-acl     When creating or extracting, archives or restores any ACLs associated with the files and directories.

19.6. Compatibility with Older Systems

If an ACL has been set on any file on a given file system, that file system has the ext_attr attribute. This attribute can be seen using the following command:
# tune2fs -l filesystem-device
A file system that has acquired the ext_attr attribute can be mounted with older kernels, but those kernels do not enforce any ACLs which have been set.
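For example, to check a specific device (a sketch reusing the logical volume from Section 19.1 as a placeholder):
# tune2fs -l /dev/VolGroup00/LogVol02 | grep 'Filesystem features'
If ACLs have been used on the file system, ext_attr appears in the list of features.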
Versions of the e2fsck utility included in version 1.22 and higher of the e2fsprogs package (including the versions in Red Hat Enterprise Linux 2.1 and 4) can check a file system with the ext_attr attribute. Older versions refuse to check it.

19.7. References

Refer to the following man pages for more information.
  • man acl - Description of ACLs
  • man getfacl - Discusses how to get file access control lists
  • man setfacl - Explains how to set file access control lists
  • man star - Explains more about the star utility and its many options

Chapter 20.  Solid-State Disk Deployment Guidelines

Solid-state disks (SSD) are storage devices that use NAND flash chips to persistently store data. This sets them apart from previous generations of disks, which store data in rotating, magnetic platters. In an SSD, the access time for data across the full Logical Block Address (LBA) range is constant; whereas with older disks that use rotating media, access patterns that span large address ranges incur seek costs. As such, SSD devices have better latency and throughput.
Not all SSDs show the same performance profiles, however. In fact, many of the first generation devices show little or no advantage over spinning media. Thus, it is important to define classes of solid state storage to frame further discussion in this section.
SSDs can be divided into three classes, based on throughput:
  • The first class of SSDs use a PCI-Express connection, which offers the fastest I/O throughput compared to other classes. This class also has a very low latency for random access.
  • The second class uses the traditional SATA connection, and features fast random access for read and write operations (though not as fast as SSDs that use PCI-Express connection).
  • The third class also uses SATA, but the performance of SSDs in this class does not differ substantially from devices that use 7200rpm rotational disks.
For all three classes, performance degrades as the number of used blocks approaches the disk capacity. The degree of performance impact varies greatly by vendor. However, all devices experience some degradation.
To address the degradation issue, the host system (for example, the Linux kernel) may use discard requests to inform the storage that a given range of blocks is no longer in use. An SSD can use this information to free up space internally, using the free blocks for wear-leveling. Discards will only be issued if the storage advertises support in terms of its storage protocol (be it ATA or SCSI). Discard requests are issued to the storage using the negotiated discard command specific to the storage protocol (TRIM command for ATA, and WRITE SAME with UNMAP set, or UNMAP command for SCSI).
Enabling discard support is most useful when there is available free space on the file system, but the file system has already written to most logical blocks on the underlying storage device. For more information about TRIM, refer to its Data Set Management T13 Specifications from the following link:
For more information about UNMAP, refer to section 4.7.3.4 of the SCSI Block Commands 3 T10 Specification from the following link:

Note

Not all solid-state devices in the market have discard support.

20.1.  Deployment Considerations

Because of the internal layout and operation of SSDs, it is best to partition devices on an internal erase block boundary. Partitioning utilities in Red Hat Enterprise Linux 6 choose sane defaults if the SSD exports topology information.
However, if the device does not export topology information, Red Hat recommends that the first partition be created at a 1MB boundary.
In addition, keep in mind that MD (software RAID) does not support discards. In contrast, the logical volume manager (LVM) and the device-mapper (DM) targets that LVM uses do support discards. The only DM targets that do not support discards are dm-snapshot, dm-crypt, and dm-raid45. Discard support for dm-mirror was added in Red Hat Enterprise Linux 6.1.
Red Hat also warns that software RAID levels 1, 4, 5, and 6 are not recommended for use on SSDs. During the initialization stage of these RAID levels, some RAID management utilities (such as mdadm) write to all of the blocks on the storage device to ensure that checksums operate properly. This will cause the performance of the SSD to degrade quickly.
At present, ext4 is the only fully-supported file system that supports discard. To enable discard commands on a device, use the mount option discard. For example, to mount /dev/sda2 to /mnt with discard enabled, run:
# mount -t ext4 -o discard /dev/sda2 /mnt
By default, ext4 does not issue the discard command. This is mostly to avoid problems on devices which may not properly implement the discard command. The Linux swap code will issue discard commands to discard-enabled devices, and there is no option to control this behavior.
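To make the option persistent, a hedged example of an /etc/fstab entry (the device and mount point are placeholders):
/dev/sda2   /mnt   ext4   defaults,discard   1 2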

20.2. Tuning Considerations

This section describes several factors to consider when configuring settings that may affect SSD performance.

I/O Scheduler

Any I/O scheduler should perform well with most SSDs. However, as with any other storage type, Red Hat recommends benchmarking to determine the optimal configuration for a given workload.
When using SSDs, Red Hat advises changing the I/O scheduler only for benchmarking particular workloads. For more information about the different types of I/O schedulers, refer to the I/O Tuning Guide (also provided by Red Hat). The following kernel document also contains instructions on how to switch between I/O schedulers:
/usr/share/doc/kernel-version/Documentation/block/switching-sched.txt
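For example, to inspect and change the scheduler of a single device at runtime (the device name is a placeholder; the first command lists the available schedulers with the active one in brackets, and the change does not persist across reboots):
# cat /sys/block/sda/queue/scheduler
# echo deadline > /sys/block/sda/queue/scheduler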

Virtual Memory

Like the I/O scheduler, the virtual memory (VM) subsystem requires no special tuning. Given the fast nature of I/O on SSDs, it should be possible to turn down the vm.dirty_background_ratio and vm.dirty_ratio settings, as increased write-out activity should not negatively impact the latency of other operations on the disk. However, this can generate more overall I/O and so is not generally recommended without workload-specific testing.
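For example, a hedged sketch of lowering the dirty ratios at runtime (the values are illustrative only and should be validated against the workload):
# sysctl -w vm.dirty_background_ratio=5
# sysctl -w vm.dirty_ratio=10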

Swap

An SSD can also be used as a swap device, and is likely to produce good page-out/page-in performance.

Chapter 21. Write Barriers

A write barrier is a kernel mechanism used to ensure that file system metadata is correctly written and ordered on persistent storage, even when storage devices with volatile write caches lose power. File systems with write barriers enabled also ensure that data transmitted via fsync() is persistent throughout a power loss.
Enabling write barriers incurs a substantial performance penalty for some applications. Specifically, applications that use fsync() heavily or create and delete many small files will likely run much slower.

21.1. Importance of Write Barriers

File systems take great care to safely update metadata, ensuring consistency. Journalled file systems bundle metadata updates into transactions and send them to persistent storage in the following manner:
  1. First, the file system sends the body of the transaction to the storage device.
  2. Then, the file system sends a commit block.
  3. If the transaction and its corresponding commit block are written to disk, the file system assumes that the transaction will survive any power failure.
However, file system integrity during power failure becomes more complex for storage devices with extra caches. Storage target devices like modern local S-ATA or SAS drives may have write caches ranging from 32MB to 64MB in size. Hardware RAID controllers often contain internal write caches. Further, high-end arrays, like those from NetApp, IBM, Hitachi, and EMC (among others), also have large caches.
Storage devices with write caches report I/O as "complete" when the data is in cache; if the cache loses power, it loses its data as well. Worse, as the cache de-stages to persistent storage, it may change the original metadata ordering. When this occurs, the commit block may be present on disk without having the complete, associated transaction in place. As a result, the journal may replay these uninitialized transaction blocks into the file system during post-power-loss recovery; this will cause data inconsistency and corruption.

How Write Barriers Work

Write barriers are implemented in the Linux kernel via storage write cache flushes before and after order-critical I/O. After the transaction is written, the storage cache is flushed, the commit block is written, and the cache is flushed again. This ensures that:
  • The disk contains all the data.
  • No re-ordering has occurred.
With barriers enabled, an fsync() call will also issue a storage cache flush. This guarantees that file data is persistent on disk even if power loss occurs shortly after fsync() returns.

21.2. Enabling/Disabling Write Barriers

To mitigate the risk of data corruption during power loss, some storage devices use battery-backed write caches. Generally, high-end arrays and some hardware controllers use battery-backed write caches. However, because the cache's volatility is not visible to the kernel, Red Hat Enterprise Linux 6 enables write barriers by default on all supported journaling file systems.

Note

Write caches are designed to increase I/O performance. However, enabling write barriers means constantly flushing these caches, which can significantly reduce performance.
For devices with non-volatile, battery-backed write caches and those with write-caching disabled, you can safely disable write barriers at mount time using the -o nobarrier option for mount. However, some devices do not support write barriers; such devices will log an error message to /var/log/messages (refer to Table 21.1, "Write barrier error messages per file system").
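For example, to disable barriers on a file system backed by a non-volatile write cache (a sketch; the device and mount point are placeholders):
# mount -o nobarrier /dev/sda1 /data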

Table 21.1. Write barrier error messages per file system

File SystemError Message
ext3/ext4JBD: barrier-based sync failed on device - disabling barriers
XFSFilesystem device - Disabling barriers, trial barrier write failed
btrfsbtrfs: disabling barriers on dev device

21.3. Write Barrier Considerations

Some system configurations do not need write barriers to protect data. In most cases, other methods are preferable to write barriers, since enabling write barriers causes a significant performance penalty.

Disabling Write Caches

An alternative way to avoid data integrity issues is to ensure that no write caches lose data on power failures. When possible, the best way to configure this is to simply disable the write cache. On a simple server or desktop with one or more SATA drives (attached to a local SATA controller, such as an Intel AHCI part), you can disable the write cache on the target SATA drives with the hdparm command, as in:
# hdparm -W0 /device/
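To verify the current setting, the write-caching state of a drive can be queried with the same tool (a sketch; the device name is a placeholder):
# hdparm -W /dev/sda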

Battery-Backed Write Caches

Write barriers are also unnecessary whenever the system uses hardware RAID controllers with battery-backed write cache. If the system is equipped with such controllers and if its component drives have write caches disabled, the controller will advertise itself as a write-through cache; this will inform the kernel that the write cache data will survive a power loss.
Most controllers use vendor-specific tools to query and manipulate target drives. For example, the LSI Megaraid SAS controller uses a battery-backed write cache; this type of controller requires the MegaCli64 tool to manage target drives. To show the state of all back-end drives for LSI Megaraid SAS, use:
# MegaCli64 -LDGetProp  -DskCache  -LAll -aALL
To disable the write cache of all back-end drives for LSI Megaraid SAS, use:
# MegaCli64 -LDSetProp -DisDskCache -Lall -aALL

Note

Hardware RAID cards recharge their batteries while the system is operational. If a system is powered off for an extended period of time, the batteries will lose their charge, leaving stored data vulnerable during a power failure.

High-End Arrays

High-end arrays have various ways of protecting data in the event of a power failure. As such, there is no need to verify the state of the internal drives in external RAID storage.

NFS

NFS clients do not need to enable write barriers, since data integrity is handled by the NFS server side. As such, NFS servers should be configured to ensure data persistence throughout a power loss (whether through write barriers or other means).

Chapter 22.  Storage I/O Alignment and Size

Recent enhancements to the SCSI and ATA standards allow storage devices to indicate their preferred (and in some cases, required) I/O alignment and I/O size. This information is particularly useful with newer disk drives that increase the physical sector size from 512 bytes to 4k bytes. This information may also be beneficial for RAID devices, where the chunk size and stripe size may impact performance.
The Linux I/O stack has been enhanced to process vendor-provided I/O alignment and I/O size information, allowing storage management tools (parted, lvm, mkfs.*, and the like) to optimize data placement and access. If a legacy device does not export I/O alignment and size data, then storage management tools in Red Hat Enterprise Linux 6 will conservatively align I/O on a 4k (or larger power of 2) boundary. This will ensure that 4k-sector devices operate correctly even if they do not indicate any required/preferred I/O alignment and size.
Refer to Section 22.2, "Userspace Access" to learn how to determine the information that the operating system obtained from the device. This data is subsequently used by the storage management tools to determine data placement.
The I/O scheduler has changed in Red Hat Enterprise Linux 7: the default scheduler is now Deadline, except for SATA drives, for which CFQ remains the default. For faster storage, Deadline outperforms CFQ, and using it yields a performance increase without the need for special tuning.
If the default is not right for some disks (for example, SAS rotational disks), change the I/O scheduler to CFQ. The right choice depends on the workload.

22.1. Parameters for Storage Access

The operating system uses the following information to determine I/O alignment and size:
physical_block_size
Smallest internal unit on which the device can operate
logical_block_size
Used externally to address a location on the device
alignment_offset
The number of bytes that the beginning of the Linux block device (partition/MD/LVM device) is offset from the underlying physical alignment
minimum_io_size
The device's preferred minimum unit for random I/O
optimal_io_size
The device's preferred unit for streaming I/O
For example, certain 4K sector devices may use a 4K physical_block_size internally but expose a more granular 512-byte logical_block_size to Linux. This discrepancy introduces potential for misaligned I/O. To address this, the Red Hat Enterprise Linux 6 I/O stack will attempt to start all data areas on a naturally-aligned boundary (physical_block_size) by making sure it accounts for any alignment_offset if the beginning of the block device is offset from the underlying physical alignment.
Storage vendors can also supply I/O hints about the preferred minimum unit for random I/O (minimum_io_size) and streaming I/O (optimal_io_size) of a device. For example, minimum_io_size and optimal_io_size may correspond to a RAID device's chunk size and stripe size respectively.
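For example, for a hypothetical RAID 0 device with a 64KB chunk size striped across four disks, minimum_io_size would be 65536 bytes and optimal_io_size would be 262144 bytes; both values can be read from sysfs (the device name md0 is a placeholder):
# cat /sys/block/md0/queue/minimum_io_size
65536
# cat /sys/block/md0/queue/optimal_io_size
262144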

22.2. Userspace Access

Always take care to use properly aligned and sized I/O. This is especially important for Direct I/O access. Direct I/O should be aligned on a logical_block_size boundary, and in multiples of the logical_block_size.
With native 4K devices (i.e. logical_block_size is 4K) it is now critical that applications perform direct I/O in multiples of the device's logical_block_size. This means that applications that perform 512-byte aligned I/O, rather than 4k-aligned I/O, will fail with native 4k devices.
To avoid this, an application should consult the I/O parameters of a device to ensure it is using the proper I/O alignment and size. As mentioned earlier, I/O parameters are exposed through both the sysfs and block device ioctl interfaces.
For more details, refer to man libblkid. This man page is provided by the libblkid-devel package.

sysfs Interface

  • /sys/block/disk/alignment_offset
  • /sys/block/disk/partition/alignment_offset
  • /sys/block/disk/queue/physical_block_size
  • /sys/block/disk/queue/logical_block_size
  • /sys/block/disk/queue/minimum_io_size
  • /sys/block/disk/queue/optimal_io_size
The kernel will still export these sysfs attributes for "legacy" devices that do not provide I/O parameter information, for example:

Example 22.1. sysfs interface

alignment_offset:    0
physical_block_size: 512
logical_block_size:  512
minimum_io_size:     512
optimal_io_size:     0

Block Device ioctls

  • BLKALIGNOFF: alignment_offset
  • BLKPBSZGET: physical_block_size
  • BLKSSZGET: logical_block_size
  • BLKIOMIN: minimum_io_size
  • BLKIOOPT: optimal_io_size
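These ioctls can also be exercised from the shell with the blockdev utility from util-linux, which wraps them (a sketch; the device name is a placeholder):
# blockdev --getalignoff --getpbsz --getss --getiomin --getioopt /dev/sda
The values are printed one per line, in the order the options are given.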

22.3. Standards

This section describes I/O standards used by ATA and SCSI devices.

ATA

ATA devices must report appropriate information via the IDENTIFY DEVICE command. ATA devices only report I/O parameters for physical_block_size, logical_block_size, and alignment_offset. The additional I/O hints are outside the scope of the ATA Command Set.

SCSI

I/O parameters support in Red Hat Enterprise Linux 6 requires at least version 3 of the SCSI Primary Commands (SPC-3) protocol. The kernel will only send an extended inquiry (which gains access to the BLOCK LIMITS VPD page) and READ CAPACITY(16) command to devices which claim compliance with SPC-3.
The READ CAPACITY(16) command provides the block sizes and alignment offset:
  • LOGICAL BLOCK LENGTH IN BYTES is used to derive /sys/block/disk/queue/logical_block_size
  • LOGICAL BLOCKS PER PHYSICAL BLOCK EXPONENT is used to derive /sys/block/disk/queue/physical_block_size
  • LOWEST ALIGNED LOGICAL BLOCK ADDRESS is used to derive:
    • /sys/block/disk/alignment_offset
    • /sys/block/disk/partition/alignment_offset
The BLOCK LIMITS VPD page (0xb0) provides the I/O hints. The OPTIMAL TRANSFER LENGTH GRANULARITY and OPTIMAL TRANSFER LENGTH fields are used to derive:
  • /sys/block/disk/queue/minimum_io_size
  • /sys/block/disk/queue/optimal_io_size
The sg3_utils package provides the sg_inq utility, which can be used to access the BLOCK LIMITS VPD page. To do so, run:
# sg_inq -p 0xb0 disk

22.4. Stacking I/O Parameters

All layers of the Linux I/O stack have been engineered to propagate the various I/O parameters up the stack. When a layer consumes an attribute or aggregates many devices, the layer must expose appropriate I/O parameters so that upper-layer devices or tools will have an accurate view of the storage as transformed. Some practical examples are:
  • Only one layer in the I/O stack should adjust for a non-zero alignment_offset; once a layer adjusts accordingly, it will export a device with an alignment_offset of zero.
  • A striped Device Mapper (DM) device created with LVM must export a minimum_io_size and optimal_io_size relative to the stripe count (number of disks) and user-provided chunk size.
In Red Hat Enterprise Linux 6, Device Mapper and Software RAID (MD) device drivers can be used to arbitrarily combine devices with different I/O parameters. The kernel's block layer will attempt to reasonably combine the I/O parameters of the individual devices. The kernel will not prevent combining heterogeneous devices; however, be aware of the risks associated with doing so.
For instance, a 512-byte device and a 4K device may be combined into a single logical DM device, which would have a logical_block_size of 4K. File systems layered on such a hybrid device assume that 4K will be written atomically, but in reality it will span 8 logical block addresses when issued to the 512-byte device. Using a 4K logical_block_size for the higher-level DM device increases potential for a partial write to the 512-byte device if there is a system crash.
If combining the I/O parameters of multiple devices results in a conflict, the block layer may issue a warning that the device is susceptible to partial writes and/or is misaligned.

22.5. Logical Volume Manager

LVM provides userspace tools that are used to manage the kernel's DM devices. LVM will shift the start of the data area (that a given DM device will use) to account for a non-zero alignment_offset associated with any device managed by LVM. This means logical volumes will be properly aligned (alignment_offset=0).
By default, LVM will adjust for any alignment_offset, but this behavior can be disabled by setting data_alignment_offset_detection to 0 in /etc/lvm/lvm.conf. Disabling this is not recommended.
LVM will also detect the I/O hints for a device. The start of a device's data area will be a multiple of the minimum_io_size or optimal_io_size exposed in sysfs. LVM will use the minimum_io_size if optimal_io_size is undefined (i.e. 0).
By default, LVM will automatically determine these I/O hints, but this behavior can be disabled by setting data_alignment_detection to 0 in /etc/lvm/lvm.conf. Disabling this is not recommended.
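A hedged sketch of the corresponding /etc/lvm/lvm.conf settings, shown here at their default value of 1 (detection enabled); both are assumed to live in the devices section:
devices {
    data_alignment_offset_detection = 1
    data_alignment_detection = 1
}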

22.6. Partition and File System Tools

This section describes how different partition and file system management tools interact with a device's I/O parameters.

util-linux-ng's libblkid and fdisk

The libblkid library provided with the util-linux-ng package includes a programmatic API to access a device's I/O parameters. libblkid allows applications, especially those that use Direct I/O, to properly size their I/O requests. The fdisk utility from util-linux-ng uses libblkid to determine the I/O parameters of a device for optimal placement of all partitions. The fdisk utility will align all partitions on a 1MB boundary.

parted and libparted

The libparted library from parted also uses the I/O parameters API of libblkid. The Red Hat Enterprise Linux 6 installer (Anaconda) uses libparted, which means that all partitions created by either the installer or parted will be properly aligned. For all partitions created on a device that does not appear to provide I/O parameters, the default alignment will be 1MB.
The heuristics parted uses are as follows:
  • Always use the reported alignment_offset as the offset for the start of the first primary partition.
  • If optimal_io_size is defined (i.e. not 0), align all partitions on an optimal_io_size boundary.
  • If optimal_io_size is undefined (i.e. 0), alignment_offset is 0, and minimum_io_size is a power of 2, use a 1MB default alignment.
    This is the catch-all for "legacy" devices which don't appear to provide I/O hints. As such, by default all partitions will be aligned on a 1MB boundary.

    Note

    Red Hat Enterprise Linux 6 cannot distinguish between devices that don't provide I/O hints and those that do so with alignment_offset=0 and optimal_io_size=0. Such a device might be a single SAS 4K device; as such, at worst 1MB of space is lost at the start of the disk.
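After creating partitions, parted can be asked to verify the alignment it chose, for example (a sketch; the device and partition number are placeholders):
# parted /dev/sda align-check optimal 1
parted reports whether the given partition meets the optimal alignment.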

File System tools

The different mkfs.filesystem utilities have also been enhanced to consume a device's I/O parameters. These utilities will not allow a file system to be formatted to use a block size smaller than the logical_block_size of the underlying storage device.
Except for mkfs.gfs2, all other mkfs.filesystem utilities also use the I/O hints to lay out on-disk data structures and data areas relative to the minimum_io_size and optimal_io_size of the underlying storage device. This allows file systems to be optimally formatted for various RAID (striped) layouts.

Chapter 23. Setting Up A Remote Diskless System

The Network Booting Service (provided by system-config-netboot) is no longer available in Red Hat Enterprise Linux 6. Deploying diskless systems is now possible in this release without the use of system-config-netboot.
To set up a basic remote diskless system booted over PXE, you need the following packages:
  • tftp-server
  • xinetd
  • dhcp
  • syslinux
  • dracut-network
Remote diskless system booting requires both a tftp service (provided by tftp-server) and a DHCP service (provided by dhcp). The tftp service is used to retrieve the kernel image and initrd over the network via the PXE loader.
The following sections outline the necessary procedures for deploying remote diskless systems in a network environment.

23.1. Configuring a tftp Service for Diskless Clients

The tftp service is disabled by default. To enable it and allow PXE booting via the network, set the disable option in /etc/xinetd.d/tftp to no. To configure tftp, perform the following steps:

Procedure 23.1. To configure tftp

  1. The tftp root directory (chroot) is located in /var/lib/tftpboot. Copy /usr/share/syslinux/pxelinux.0 to /var/lib/tftpboot/, as in:
    cp /usr/share/syslinux/pxelinux.0 /var/lib/tftpboot/
  2. Create a pxelinux.cfg directory inside the tftp root directory:
    mkdir -p /var/lib/tftpboot/pxelinux.cfg/
You will also need to configure firewall rules properly to allow tftp traffic; as tftp supports TCP wrappers, you can configure host access to tftp via /etc/hosts.allow. For more information on configuring TCP wrappers and the /etc/hosts.allow configuration file, refer to the Red Hat Enterprise Linux 6 Security Guide; man hosts_access also provides information about /etc/hosts.allow.
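For example, a hedged /etc/hosts.allow entry that restricts tftp access to a single subnet (the daemon name in.tftpd and the subnet are assumptions for a typical tftp-server setup):
in.tftpd : 192.168.0.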
After configuring tftp for diskless clients, configure DHCP, NFS, and the exported file system accordingly. Refer to Section 23.2, "Configuring DHCP for Diskless Clients" and Section 23.3, "Configuring an Exported File System for Diskless Clients" for instructions on how to do so.

23.2. Configuring DHCP for Diskless Clients

After configuring a tftp server, you need to set up a DHCP service on the same host machine. Refer to the Red Hat Enterprise Linux 6 Deployment Guide for instructions on how to set up a DHCP server. In addition, you should enable PXE booting on the DHCP server; to do this, add the following configuration to /etc/dhcp/dhcpd.conf:
allow booting;
allow bootp;
class "pxeclients" {
   match if substring(option vendor-class-identifier, 0, 9) = "PXEClient";
   next-server server-ip;
   filename "linux-install/pxelinux.0";
}
Replace server-ip with the IP address of the host machine on which the tftp and DHCP services reside. Now that tftp and DHCP are configured, all that remains is to configure NFS and the exported file system; refer to Section 23.3, "Configuring an Exported File System for Diskless Clients" for instructions.

23.3. Configuring an Exported File System for Diskless Clients

The root directory of the exported file system (used by diskless clients in the network) is shared via NFS. Configure the NFS service to export the root directory by adding it to /etc/exports. For instructions on how to do so, refer to Section 9.7.1, " The /etc/exports Configuration File".
To accommodate completely diskless clients, the root directory should contain a complete Red Hat Enterprise Linux installation. You can synchronize this with a running system via rsync, as in:
# rsync -a -e ssh --exclude='/proc/*' --exclude='/sys/*' hostname.com:/ /exported/root/directory
Replace hostname.com with the hostname of the running system with which to synchronize via rsync. The /exported/root/directory is the path to the exported file system.
Alternatively, you can also use yum with the --installroot option to install Red Hat Enterprise Linux to a specific location. For example:
yum groupinstall Base --installroot=/exported/root/directory
The file system to be exported still needs to be configured further before it can be used by diskless clients. To do this, perform the following procedure:

Procedure 23.2. Configure file system

  1. Configure the exported file system's /etc/fstab to contain (at least) the following configuration:
    none      /tmp      tmpfs   defaults   0 0
    tmpfs     /dev/shm  tmpfs   defaults   0 0
    sysfs     /sys      sysfs   defaults   0 0
    proc      /proc     proc    defaults   0 0
  2. Select the kernel that diskless clients should use (vmlinuz-kernel-version) and copy it to the tftp boot directory:
    # cp /boot/vmlinuz-kernel-version /var/lib/tftpboot/
  3. Create the initrd (i.e. initramfs-kernel-version.img) with network support:
    # dracut initramfs-kernel-version.img vmlinuz-kernel-version
    Copy the resulting initramfs-kernel-version.img into the tftp boot directory as well.
  4. Edit the default boot configuration to use the initrd and kernel inside /var/lib/tftpboot. This configuration should instruct the diskless client's root to mount the exported file system (/exported/root/directory) as read-write. To do this, configure /var/lib/tftpboot/pxelinux.cfg/default with the following:
    default rhel6

    label rhel6
      kernel vmlinuz-kernel-version
      append initrd=initramfs-kernel-version.img root=nfs:server-ip:/exported/root/directory rw
    Replace server-ip with the IP address of the host machine on which the tftp and DHCP services reside.
The NFS share is now ready for exporting to diskless clients. These clients can boot over the network via PXE.