
Cluster Administration

Chapter 9. Diagnosing and Correcting Problems in a Cluster

Clusters problems, by nature, can be difficult to troubleshoot. This is due to the increased complexity that a cluster of systems introduces as opposed to diagnosing issues on a single system. However, there are common issues that system administrators are more likely to encounter when deploying or administering a cluster. Understanding how to tackle those common issues can help make deploying and administering a cluster much easier.
This chapter provides information about some common cluster issues and how to troubleshoot them. Additional help can be found in our knowledge base and by contacting an authorized Red Hat support representative. If your issue is related to the GFS2 file system specifically, you can find information about troubleshooting common GFS2 issues in the Global File System 2 document.

9.1. Configuration Changes Do Not Take Effect

When you make changes to a cluster configuration, you must propagate those changes to every node in the cluster.
If you make any of the following configuration changes to your cluster, it is not necessary to restart the cluster after propagating those changes for them to take effect.
  • Deleting a node from the cluster configuration, except where the node count changes from greater than two nodes to two nodes.
  • Adding a node to the cluster configuration, except where the node count changes from two nodes to greater than two nodes.
  • Changing the logging settings.
  • Adding, editing, or deleting HA services or VM components.
  • Adding, editing, or deleting cluster resources.
  • Adding, editing, or deleting failover domains.
If you make any other configuration changes to your cluster, however, you must restart the cluster to implement those changes. The following cluster configuration changes require a cluster restart to take effect:
  • Adding or removing the two_node option from the cluster configuration file.
  • Renaming the cluster.
  • Changing any corosync or openais timers.
  • Adding, changing, or deleting heuristics for quorum disk, changing any quorum disk timers, or changing the quorum disk device. For these changes to take effect, a global restart of the qdiskd daemon is required.
  • Changing the central_processing mode for rgmanager. For this change to take effect, a global restart of rgmanager is required.
  • Changing the multicast address.
  • Switching the transport mode from UDP multicast to UDP unicast, or switching from UDP unicast to UDP multicast.
You can restart the cluster using Conga, the ccs command, or command line tools.
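For example, with the ccs command you can stop and then restart the cluster on all nodes from a single host (here host is a placeholder, as in the other ccs examples in this document):

ccs -h host --stopall
ccs -h host --startall

Alternatively, as a sketch of the command line approach, you can run the init scripts in order on each node:

service rgmanager stop
service cman stop
service cman start
service rgmanager start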

9.2. Cluster Does Not Form

If you find you are having trouble getting a new cluster to form, check for the following things (a combined diagnostic sketch follows this list):
  • Make sure you have name resolution set up correctly. The cluster node name in the cluster.conf file should correspond to the name used to resolve that node's address over the network the cluster will be using to communicate. For example, if your cluster's node names are nodea and nodeb, make sure both nodes have entries in the /etc/cluster/cluster.conf and /etc/hosts files that match those names.
  • If the cluster uses multicast for communication between nodes, make sure that multicast traffic is not being blocked, delayed, or otherwise interfered with on the network that the cluster is using to communicate. Note that some Cisco switches have features that may cause delays in multicast traffic.
  • Use telnet or SSH to verify whether you can reach remote nodes.
  • Execute the ethtool eth1 | grep link command to check whether the ethernet link is up.
  • Use the tcpdump command at each node to check the network traffic.
  • Ensure that you do not have firewall rules blocking communication between your nodes.
  • Ensure that the interfaces the cluster uses for inter-node communication are not using any bonding mode other than 0, 1, or 2. (Bonding modes 0 and 2 are supported as of Red Hat Enterprise Linux 6.4.)
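As a sketch combining several of the checks above (nodeb and eth1 are placeholders for your peer node and cluster interface, and 5405 is the default corosync port):

getent hosts nodeb
ethtool eth1 | grep "Link detected"
tcpdump -n -i eth1 udp port 5405
iptables -L -n

The tcpdump output should show cluster traffic arriving from the other nodes, and the iptables listing should show no rules dropping that traffic.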

9.3. Nodes Unable to Rejoin Cluster after Fence or Reboot

If your nodes do not rejoin the cluster after a fence or reboot, check for the following things:
  • Clusters that are passing their traffic through a Cisco Catalyst switch may experience this problem.
  • Ensure that all cluster nodes have the same version of the cluster.conf file. If the cluster.conf file is different on any of the nodes, then nodes may be unable to join the cluster post fence.
    As of Red Hat Enterprise Linux 6.1, you can use the following command to verify that all of the nodes specified in the host's cluster configuration file have the identical cluster configuration file:
    ccs -h host --checkconf
  • Make sure that you have set chkconfig on for the cluster services on the node that is attempting to join the cluster (see the sketch after this list).
  • Ensure that no firewall rules are blocking the node from communicating with other nodes in the cluster.
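For example, on a typical rgmanager-based cluster the services to enable are cman and rgmanager; a sketch (enable clvmd only if the cluster uses clustered LVM):

chkconfig cman on
chkconfig rgmanager on
chkconfig clvmd on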

9.4. Cluster Daemon Crashes

RGManager has a watchdog process that reboots the host if the main rgmanager process fails unexpectedly. This causes the cluster node to get fenced and rgmanager to recover the service on another host. When the watchdog daemon detects that the main rgmanager process has crashed, it reboots the cluster node, and the active cluster nodes detect that the cluster node has left and evict it from the cluster.
The lower number process ID (PID) is the watchdog process that takes action if its child (the process with the higher PID number) crashes. Capturing the core of the process with the higher PID number using gcore can aid in troubleshooting a crashed daemon.
Install the packages that are required to capture and view the core, and ensure that the rgmanager and rgmanager-debuginfo packages are the same version or the captured application core might be unusable.
$ yum -y --enablerepo=rhel-debuginfo install gdb rgmanager-debuginfo

9.4.1. Capturing the rgmanager Core at Runtime

Two rgmanager processes are running once rgmanager has started. You must capture the core for the rgmanager process with the higher PID.
The following is an example output from the ps command showing two processes for rgmanager.
$ ps aux | grep rgmanager | grep -v grep

root     22482  0.0  0.5  23544  5136 ?        S<Ls Dec01   0:00 rgmanager
root     22483  0.0  0.2  78372  2060 ?        S<l  Dec01   0:47 rgmanager
In the following example, the pidof program is used to automatically determine the higher-numbered PID, which is the appropriate PID for the core. The full command captures the application core for process 22483, which has the higher PID number.
$ gcore -o /tmp/rgmanager-$(date '+%F_%s').core $(pidof -s rgmanager)

9.4.2. Capturing the Core When the Daemon Crashes

By default, the /etc/init.d/functions script blocks core files from daemons called by /etc/init.d/rgmanager. For the daemon to create application cores, you must enable that option. This procedure must be done on all cluster nodes on which an application core needs to be captured.
To create a core file when the rgmanager daemon crashes, edit the /etc/sysconfig/cluster file. The DAEMONCOREFILELIMIT parameter allows the daemon to create core files if the process crashes. There is a -w option that prevents the watchdog process from running. The watchdog daemon is responsible for rebooting the cluster node if rgmanager crashes and, in some cases, if the watchdog daemon is running then the core file will not be generated, so it must be disabled to capture core files.
DAEMONCOREFILELIMIT="unlimited"
RGMGR_OPTS="-w"
Restart rgmanager to activate the new configuration options:
service rgmanager restart

Note

If cluster services are running on this cluster node, restarting rgmanager could leave the running services in a bad state.
A core file is written when the rgmanager process crashes. To check for it, run:
ls /core*
The output should appear similar to the following:
/core.11926
Move or delete any old core files under the / directory before restarting rgmanager to capture the application core. The cluster node that experienced the rgmanager crash should be rebooted or fenced after the core is captured to ensure that the watchdog process was not running.

9.4.3. Recording a gdb Backtrace Session

Once you have captured the core file, you can view its contents by using gdb, the GNU Debugger. To record a script session of gdb on the core file from the affected system, run the following:
$ script /tmp/gdb-rgmanager.txt
$ gdb /usr/sbin/rgmanager /tmp/rgmanager-.core
This will start a gdb session, while script records it to the appropriate text file. While in gdb, run the following commands:
(gdb) thread apply all bt full
(gdb) quit
Press Ctrl-D to stop the script session and save it to the text file.

9.5. Cluster Services Hang

When the cluster services attempt to fence a node, the cluster services stop until the fence operation has successfully completed. Therefore, if your cluster-controlled storage or services hang and the cluster nodes show different views of cluster membership, or if your cluster hangs when you try to fence a node and you need to reboot nodes to recover, check for the following conditions:
  • The cluster may have attempted to fence a node and the fence operation may have failed.
  • Look through the /var/log/messages file on all nodes and see if there are any failed fence messages. If so, then reboot the nodes in the cluster and configure fencing correctly.
  • Verify that a network partition did not occur, as described in Section 9.8, "Each Node in a Two-Node Cluster Reports Second Node Down", and verify that communication between nodes is still possible and that the network is up.
  • If nodes leave the cluster, the remaining nodes may be inquorate. The cluster needs to be quorate to operate. If nodes are removed such that the cluster is no longer quorate, then services and storage will hang. Either adjust the expected votes or return the required number of nodes to the cluster (see the example below).
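For example, to lower the expected votes at runtime so that the remaining nodes regain quorum (the value 3 is illustrative only; choose a value appropriate for your cluster):

cman_tool expected -e 3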

Note

You can fence a node manually with the fence_node command or with Conga. For information, see the fence_node man page and Section 4.3.2, "Causing a Node to Leave or Join a Cluster".
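A minimal invocation, where node-01 is a placeholder for the name of the node to fence as listed in cluster.conf:

fence_node node-01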

9.6. Cluster Service Will Not Start

If a cluster-controlled service will not start, check for the following conditions.
  • There may be a syntax error in the service configuration in the cluster.conf file. You can use the rg_test command to validate the syntax in your configuration. If there are any configuration or syntax faults, rg_test will inform you what the problem is.
    $ rg_test test /etc/cluster/cluster.conf start service servicename 
    For more information on the rg_test command, see Section C.5, "Debugging and Testing Services and Resource Ordering".
    If the configuration is valid, increase the resource group manager's logging and then read the messages logs to determine what is causing the service start to fail. You can increase the log level by adding the loglevel="7" parameter to the rm tag in the cluster.conf file. You will then get increased verbosity in your messages logs with regard to starting, stopping, and migrating clustered services.
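After adding the loglevel parameter and incrementing config_version in the cluster.conf file, a sketch of propagating the change and watching the increased rgmanager verbosity:

cman_tool version -r
tail -f /var/log/messages | grep rgmanager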

9.7. Cluster-Controlled Services Fail to Migrate

If a cluster-controlled service fails to migrate to another node but the service will start on some specific node, check for the following conditions.
  • Ensure that the resources required to run a given service are present on all nodes in the cluster that may be required to run that service. For example, if your clustered service assumes a script file in a specific location or a file system mounted at a specific mount point then you must ensure that those resources are available in the expected places on all nodes in the cluster.
  • Ensure that failover domains, service dependency, and service exclusivity are not configured in such a way that you are unable to migrate services to nodes as you'd expect.
  • If the service in question is a virtual machine resource, check the documentation to ensure that all of the correct configuration work has been completed.
  • Increase the resource group manager's logging, as described in Section 9.6, "Cluster Service Will Not Start", and then read the messages logs to determine what is causing the service to fail to migrate.

9.8. Each Node in a Two-Node Cluster Reports Second Node Down

If your cluster is a two-node cluster and each node reports that it is up but that the other node is down, this indicates that your cluster nodes are unable to communicate with each other via multicast over the cluster heartbeat network. This is known as "split brain" or a "network partition." To address this, check the conditions outlined in Section 9.2, "Cluster Does Not Form".

9.9. Nodes are Fenced on LUN Path Failure

If a node or nodes in your cluster get fenced whenever you have a LUN path failure, this may be a result of the use of a quorum disk over multipathed storage. If you are using a quorum disk, and your quorum disk is over multipathed storage, ensure that you have all of the correct timings set up to tolerate a path failure.

9.10. Quorum Disk Does Not Appear as Cluster Member

If you have configured your system to use a quorum disk but the quorum disk does not appear as a member of your cluster, check for the following conditions.
  • Ensure that you have set chkconfig on for the qdisk service.
  • Ensure that you have started the qdisk service.
  • Note that it may take multiple minutes for the quorum disk to register with the cluster. This is normal and expected behavior.

9.11. Unusual Failover Behavior

A common problem with cluster servers is unusual failover behavior: services stop when other services start, or services refuse to start on failover. This can be due to complex failover configurations consisting of failover domains, service dependency, and service exclusivity. Try scaling back to a simpler service or failover domain configuration and see if the issue persists. Avoid features like service exclusivity and dependency unless you fully understand how those features may affect failover under all conditions.

9.12. Fencing Occurs at Random

If you find that a node is being fenced at random, check for the following conditions.
  • The root cause of a fence is always a node losing its token, meaning that it lost communication with the rest of the cluster and stopped returning heartbeat.
  • Any situation that results in a system not returning heartbeat within the specified token interval could lead to a fence. By default the token interval is 10 seconds. It can be specified by adding the desired value (in milliseconds) to the token parameter of the totem tag in the cluster.conf file (for example, setting totem token="30000" for 30 seconds).
  • Ensure that the network is sound and working as expected.
  • Ensure that the interfaces the cluster uses for inter-node communication are not using any bonding mode other than 0, 1, or 2. (Bonding modes 0 and 2 are supported as of Red Hat Enterprise Linux 6.4.)
  • Determine whether the system is "freezing" or kernel panicking. Set up the kdump utility (see the sketch after this list) and see if you get a core during one of these fences.
  • Make sure that you are not wrongly attributing some other event to a fence, for example the quorum disk ejecting a node due to a storage failure, or a third-party product like Oracle RAC rebooting a node due to some outside condition. The messages logs are often very helpful in determining such problems. Whenever fences or node reboots occur, it should be standard practice to inspect the messages logs of all nodes in the cluster from the time the reboot/fence occurred.
  • Thoroughly inspect the system for hardware faults that may lead to the system not responding to heartbeat when expected.
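As a minimal sketch of the kdump suggestion above (the kexec-tools package provides kdump; reserving crash kernel memory with the crashkernel= boot parameter is also required and is not shown here):

yum install kexec-tools
chkconfig kdump on
service kdump start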

9.13. Debug Logging for Distributed Lock Manager (DLM) Needs to be Enabled

There are two debug options for the Distributed Lock Manager (DLM) that you can enable, if necessary: DLM kernel debugging, and POSIX lock debugging.
To enable DLM debugging, edit the /etc/cluster/cluster.conf file to add configuration options to the dlm tag. The log_debug option enables DLM kernel debugging messages, and the plock_debug option enables POSIX lock debugging messages.
The following example section of a /etc/cluster/cluster.conf file shows the dlm tag that enables both DLM debug options:
<cluster config_version="42" name="cluster1">
  ...
  <dlm log_debug="1" plock_debug="1"/>
  ...
</cluster>
After editing the /etc/cluster/cluster.conf file, run the cman_tool version -r command to propagate the configuration to the rest of the cluster nodes.

Chapter 10. SNMP Configuration with the Red Hat High Availability Add-On

As of the Red Hat Enterprise Linux 6.1 release, the Red Hat High Availability Add-On provides support for SNMP traps. This chapter describes how to configure your system for SNMP, followed by a summary of the traps that the Red Hat High Availability Add-On emits for specific cluster events.

10.1. SNMP and the Red Hat High Availability Add-On

The Red Hat High Availability Add-On SNMP subagent is foghorn, which emits the SNMP traps. The foghorn subagent talks to the snmpd daemon by means of the AgentX Protocol. The foghorn subagent only creates SNMP traps; it does not support other SNMP operations such as get or set.
There are currently no configuration options for the foghorn subagent. It cannot be configured to use a specific socket; only the default AgentX socket is currently supported.

10.2. Configuring SNMP with the Red Hat High Availability Add-On

To configure SNMP with the Red Hat High Availability Add-On, perform the following steps on each node in the cluster to ensure that the necessary services are enabled and running.
  1. To use SNMP traps with the Red Hat High Availability Add-On, the snmpd service is required and acts as the master agent. Since the foghorn service is the subagent and uses the AgentX protocol, you must add the following line to the /etc/snmp/snmpd.conf file to enable AgentX support:
    master agentx
  2. To specify the host where the SNMP trap notifications should be sent, add the following line to the /etc/snmp/snmpd.conf file:
    trap2sink host
    For more information on notification handling, see the snmpd.conf man page.
  3. Make sure that the snmpd daemon is enabled and running by executing the following commands:
    # chkconfig snmpd on
    # service snmpd start
  4. If the messagebus daemon is not already enabled and running, execute the following commands:
    # chkconfig messagebus on
    # service messagebus start
  5. Make sure that the foghorn daemon is enabled and running by executing the following commands:
    # chkconfig foghorn on
    # service foghorn start
  6. Execute the following command to configure your system so that the COROSYNC-MIB generates SNMP traps and to ensure that the corosync-notifyd daemon is enabled and running:
    # echo "OPTIONS=\"-d\" " > /etc/sysconfig/corosync-notifyd# chkconfig corosync-notifyd on# service corosync-notifyd start
After you have configured each node in the cluster for SNMP and ensured that the necessary services are running, D-Bus signals will be received by the foghorn service and translated into SNMPv2 traps. These traps are then passed to the host that you defined with the trap2sink entry to receive SNMPv2 traps.
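To confirm that traps are actually being sent, you can watch for outbound trap traffic on a cluster node; a sketch, assuming eth0 is the interface that reaches the trap sink (162 is the standard SNMP trap port):

# tcpdump -n -i eth0 udp port 162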

10.3. Forwarding SNMP traps

It is possible to forward SNMP traps to a machine that is not part of the cluster. You can then use the snmptrapd daemon on that external machine to customize how you respond to the notifications.
Perform the following steps to forward SNMP traps in a cluster to a machine that is not one of the cluster nodes:
  1. For each node in the cluster, follow the procedure described in Section 10.2, "Configuring SNMP with the Red Hat High Availability Add-On", setting the trap2sink host entry in the /etc/snmp/snmpd.conf file to specify the external host that will be running the snmptrapd daemon.
  2. On the external host that will receive the traps, edit the /etc/snmp/snmptrapd.conf configuration file to specify your community strings. For example, you can use the following entry to allow the snmptrapd daemon to process notifications using the public community string.
    authCommunity log,execute,net public
  3. On the external host that will receive the traps, make sure that the snmptrapd daemon is enabled and running by executing the following commands:
    # chkconfig snmptrapd on
    # service snmptrapd start
For further information on processing SNMP notifications, see the snmptrapd.conf man page.

10.4. SNMP Traps Produced by Red Hat High Availability Add-On

The foghorn daemon generates the following traps:
  • fenceNotifyFenceNode
    This trap occurs whenever a node attempts to fence another node. Note that this trap is only generated on one node, the node that attempted to perform the fence operation. The notification includes the following fields:
    • fenceNodeName - name of the fenced node
    • fenceNodeID - node id of the fenced node
    • fenceResult - the result of the fence operation (0 for success, -1 for something went wrong, -2 for no fencing methods defined)
  • rgmanagerServiceStateChange
    This trap occurs when the state of a cluster service changes. The notification includes the following fields:
    • rgmanagerServiceName - the name of the service, which includes the service type (for example, service:foo or vm:foo).
    • rgmanagerServiceState - the state of the service. This excludes transitional states such as starting and stopping to reduce clutter in the traps.
    • rgmanagerServiceFlags - the service flags. There are currently two supported flags: frozen, indicating a service which has been frozen using clusvcadm -Z, and partial, indicating a service in which a failed resource has been flagged as non-critical so that the resource may fail and its components manually restarted without the entire service being affected.
    • rgmanagerServiceCurrentOwner - the service owner. If the service is not running, this will be (none).
    • rgmanagerServicePreviousOwner - the last service owner, if known. If the last owner is not known, this may indicate (none).
The corosync-notifyd daemon generates the following traps:
  • corosyncNoticesNodeStatus
    This trap occurs when a node joins or leaves the cluster. The notification includes the following fields:
    • corosyncObjectsNodeName - node name
    • corosyncObjectsNodeID - node id
    • corosyncObjectsNodeAddress - node IP address
    • corosyncObjectsNodeStatus - node status (joined or left)
  • corosyncNoticesQuorumStatus
    This trap occurs when the quorum state changes. The notification includes the following fields:
    • corosyncObjectsNodeName - node name
    • corosyncObjectsNodeID - node id
    • corosyncObjectsQuorumStatus - new state of the quorum (quorate or NOT quorate)
  • corosyncNoticesAppStatus
    This trap occurs when a client application connects to or disconnects from Corosync. The notification includes the following fields:
    • corosyncObjectsNodeName - node name
    • corosyncObjectsNodeID - node id
    • corosyncObjectsAppName - application name
    • corosyncObjectsAppStatus - new state of the application (connected or disconnected)

Chapter 11. Clustered Samba Configuration

As of the Red Hat Enterprise Linux 6.2 release, the Red Hat High Availability Add-On provides support for running Clustered Samba in an active/active configuration. This requires that you install and configure CTDB on all nodes in a cluster, which you use in conjunction with GFS2 clustered file systems.

Note

Red Hat Enterprise Linux 6 supports a maximum of four nodes running clustered Samba.
This chapter describes the procedure for configuring CTDB by configuring an example system. For information on configuring GFS2 file systems, refer to Global File System 2. For information on configuring logical volumes, refer to Logical Volume Manager Administration.

11.1. CTDB Overview

CTDB is a cluster implementation of the TDB database used by Samba. To use CTDB, a clustered file system must be available and shared on all nodes in the cluster. CTDB provides clustered features on top of this clustered file system. As of the Red Hat Enterprise Linux 6.2 release, CTDB also runs a cluster stack in parallel to the one provided by Red Hat Enterprise Linux clustering. CTDB manages node membership, recovery/failover, IP relocation and Samba services.

11.2. Required Packages

In addition to the standard packages required to run the Red Hat High Availability Add-On and the Red Hat Resilient Storage Add-On, running Samba with Red Hat Enterprise Linux clustering requires the following packages (a sample installation command follows the list):
  • ctdb
  • samba
  • samba-common
  • samba-winbind-clients
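A sample installation command, assuming the channels or repositories that provide these packages are already available to yum:

[root@clusmb-01 ~]# yum install ctdb samba samba-common samba-winbind-clients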

11.3. GFS2 Configuration

Configuring Samba with Red Hat Enterprise Linux clustering requires two GFS2 file systems: one small file system for CTDB, and a second file system for the Samba share. This example shows how to create the two GFS2 file systems.
Before creating the GFS2 file systems, first create an LVM logical volume for each of the file systems. For information on creating LVM logical volumes, refer to Logical Volume Manager Administration. This example uses the following logical volumes:
  • /dev/csmb_vg/csmb_lv, which will hold the user data that will be exported via a Samba share and should be sized accordingly. This example creates a logical volume that is 100GB in size.
  • /dev/csmb_vg/ctdb_lv, which will store the shared CTDB state information and needs to be 1GB in size.
You create clustered volume groups and logical volumes on one node of the cluster only.
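As a hedged sketch of that step, assuming /dev/sdb is a placeholder for your shared storage device and that clvmd is running, the volumes used in this example could be created as follows:

[root@clusmb-01 ~]# pvcreate /dev/sdb
[root@clusmb-01 ~]# vgcreate -cy csmb_vg /dev/sdb
[root@clusmb-01 ~]# lvcreate -L 100G -n csmb_lv csmb_vg
[root@clusmb-01 ~]# lvcreate -L 1G -n ctdb_lv csmb_vg

The -cy option marks the volume group as clustered so that it is visible to all nodes.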
To create a GFS2 file system on a logical volume, run the mkfs.gfs2 command. You run this command on one cluster node only.
To create the file system to host the Samba share on the logical volume /dev/csmb_vg/csmb_lv, execute the following command:
[root@clusmb-01 ~]# mkfs.gfs2 -j3 -p lock_dlm -t csmb:gfs2 /dev/csmb_vg/csmb_lv
The meaning of the parameters is as follows:
-j
Specifies the number of journals to create in the filesystem. This example uses a cluster with three nodes, so we create one journal per node.
-p
Specifies the locking protocol. lock_dlm is the locking protocol GFS2 uses for inter-node communication.
-t
Specifies the lock table name and is of the format cluster_name:fs_name. In this example, the cluster name as specified in the cluster.conf file is csmb, and we use gfs2 as the name for the file system.
The output of this command appears as follows:
This will destroy any data on /dev/csmb_vg/csmb_lv.
  It appears to contain a gfs2 filesystem.

Are you sure you want to proceed? [y/n] y

Device:                    /dev/csmb_vg/csmb_lv
Blocksize:                 4096
Device Size                100.00 GB (26214400 blocks)
Filesystem Size:           100.00 GB (26214398 blocks)
Journals:                  3
Resource Groups:           400
Locking Protocol:          "lock_dlm"
Lock Table:                "csmb:gfs2"
UUID:                      94297529-ABG3-7285-4B19-182F4F2DF2D7
In this example, the /dev/csmb_vg/csmb_lv file system will be mounted at /mnt/gfs2 on all nodes. This mount point must match the value that you specify as the location of the share directory with the path = option in the /etc/samba/smb.conf file, as described in Section 11.5, "Samba Configuration".
To create the file system to host the CTDB state information on the logical volume /dev/csmb_vg/ctdb_lv, execute the following command:
[root@clusmb-01 ~]# mkfs.gfs2 -j3 -p lock_dlm -t csmb:ctdb_state /dev/csmb_vg/ctdb_lv
Note that this command specifies a different lock table name than the lock table in the example that created the filesystem on /dev/csmb_vg/csmb_lv. This distinguishes the lock table names for the different devices used for the file systems.
The output of the mkfs.gfs2 command appears as follows:
This will destroy any data on /dev/csmb_vg/ctdb_lv.
  It appears to contain a gfs2 filesystem.

Are you sure you want to proceed? [y/n] y

Device:                    /dev/csmb_vg/ctdb_lv
Blocksize:                 4096
Device Size                1.00 GB (262144 blocks)
Filesystem Size:           1.00 GB (262142 blocks)
Journals:                  3
Resource Groups:           4
Locking Protocol:          "lock_dlm"
Lock Table:                "csmb:ctdb_state"
UUID:                      BCDA8025-CAF3-85BB-B062-CC0AB8849A03
In this example, the /dev/csmb_vg/ctdb_lv file system will be mounted at /mnt/ctdb on all nodes. This mount point must match the value that you specify as the location of the .ctdb.lock file with the CTDB_RECOVERY_LOCK option in the /etc/sysconfig/ctdb file, as described in Section 11.4, "CTDB Configuration".
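A sketch of creating the mount points and mounting both file systems, to be run on every node (whether you add the mounts to /etc/fstab or manage them through the cluster configuration is a site decision):

[root@clusmb-01 ~]# mkdir -p /mnt/gfs2 /mnt/ctdb
[root@clusmb-01 ~]# mount -t gfs2 /dev/csmb_vg/csmb_lv /mnt/gfs2
[root@clusmb-01 ~]# mount -t gfs2 /dev/csmb_vg/ctdb_lv /mnt/ctdb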

11.4. CTDB Configuration

The CTDB configuration file is located at /etc/sysconfig/ctdb. The mandatory fields that must be configured for CTDB operation are as follows:
  • CTDB_NODES
  • CTDB_PUBLIC_ADDRESSES
  • CTDB_RECOVERY_LOCK
  • CTDB_MANAGES_SAMBA (must be enabled)
  • CTDB_MANAGES_WINBIND (must be enabled if running on a member server)
The following example shows a configuration file with the mandatory fields for CTDB operation set with example parameters:
CTDB_NODES=/etc/ctdb/nodes
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_RECOVERY_LOCK="/mnt/ctdb/.ctdb.lock"
CTDB_MANAGES_SAMBA=yes
CTDB_MANAGES_WINBIND=yes
The meaning of these parameters is as follows.
CTDB_NODES
Specifies the location of the file which contains the cluster node list.
The /etc/ctdb/nodes file that CTDB_NODES references simply lists the IP addresses of the cluster nodes, as in the following example:
192.168.1.151
192.168.1.152
192.168.1.153
In this example, there is only one interface/IP on each node that is used for both cluster/CTDB communication and serving clients. However, it is highly recommended that each cluster node have two network interfaces so that one set of interfaces can be dedicated to cluster/CTDB communication and another set of interfaces can be dedicated to public client access. Use the appropriate IP addresses of the cluster network here and make sure the hostnames/IP addresses used in the cluster.conf file are the same. Similarly, use the appropriate interfaces of the public network for client access in the public_addresses file.
It is critical that the /etc/ctdb/nodes file is identical on all nodes because the ordering is important and CTDB will fail if it finds different information on different nodes.
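One simple way to check this, sketched here with placeholder node names, is to compare checksums of the file from a single node:

[root@clusmb-01 ~]# for n in clusmb-01 clusmb-02 clusmb-03; do ssh $n md5sum /etc/ctdb/nodes; done

All three lines of output should show the same checksum.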
CTDB_PUBLIC_ADDRESSES
Specifies the location of the file that lists the IP addresses that can be used to access the Samba shares exported by this cluster. These are the IP addresses that you should configure in DNS for the name of the clustered Samba server and are the addresses that CIFS clients will connect to. Configure the name of the clustered Samba server as one DNS type A record with multiple IP addresses and let round-robin DNS distribute the clients across the nodes of the cluster.
For this example, we have configured a round-robin DNS entry csmb-server with all the addresses listed in the /etc/ctdb/public_addresses file. DNS will distribute the clients that use this entry across the cluster in a round-robin fashion.
The contents of the /etc/ctdb/public_addresses file on each node are as follows:
192.168.1.201/0 eth0
192.168.1.202/0 eth0
192.168.1.203/0 eth0
This example uses three addresses that are currently unused on the network. In your own configuration, choose addresses that can be accessed by the intended clients.
Alternatively, the following example shows the contents of the /etc/ctdb/public_addresses files in a cluster in which there are three nodes but a total of four public addresses. In this example, IP address 198.162.2.1 can be hosted by either node 0 or node 1 and will be available to clients as long as at least one of these nodes is available. Only if both nodes 0 and 1 fail does this public address become unavailable to clients. All other public addresses can only be served by one single node respectively and will therefore only be available if the respective node is also available.
The /etc/ctdb/public_addresses file on node 0 includes the following contents:
198.162.1.1/24 eth0
198.162.2.1/24 eth1
The /etc/ctdb/public_addresses file on node 1 includes the following contents:
198.162.2.1/24 eth1
198.162.3.1/24 eth2
The /etc/ctdb/public_addresses file on node 2 includes the following contents:
198.162.3.2/24 eth2
CTDB_RECOVERY_LOCK
Specifies a lock file that CTDB uses internally for recovery. This file must reside on shared storage such that all the cluster nodes have access to it. The example in this section uses the GFS2 file system that will be mounted at /mnt/ctdb on all nodes. This is different from the GFS2 file system that will host the Samba share that will be exported. This recovery lock file is used to prevent split-brain scenarios. With newer versions of CTDB (1.0.112 and later), specifying this file is optional as long as it is substituted with another split-brain prevention mechanism.
CTDB_MANAGES_SAMBA
When enabled by setting it to yes, this parameter specifies that CTDB is allowed to start and stop the Samba service as it deems necessary to provide service migration/failover.
When CTDB_MANAGES_SAMBA is enabled, you should disable automatic init startup of the smb and nmb daemons by executing the following commands:
[root@clusmb-01 ~]# chkconfig smb off
[root@clusmb-01 ~]# chkconfig nmb off
CTDB_MANAGES_WINBIND
When enabled by setting it to yes, this parameter specifies that CTDB is allowed to start and stop the winbind daemon as required. This should be enabled when you are using CTDB in a Windows domain or in Active Directory security mode.
When CTDB_MANAGES_WINBIND is enabled, you should disable automatic init startup of the winbind daemon by executing the following command:
[root@clusmb-01 ~]# chkconfig winbind off

11.5. Samba Configuration

The Samba configuration file smb.conf is located at /etc/samba/smb.conf in this example. It contains the following parameters:
[global]
	guest ok = yes
	clustering = yes
	netbios name = csmb-server

[csmb]
	comment = Clustered Samba
	public = yes
	path = /mnt/gfs2/share
	writeable = yes
	ea support = yes
This example exports a share named csmb located at /mnt/gfs2/share. This is different from the GFS2 file system that holds the .ctdb.lock file at /mnt/ctdb/.ctdb.lock, which we specified with the CTDB_RECOVERY_LOCK parameter in the CTDB configuration file at /etc/sysconfig/ctdb.
In this example, we will create the share directory in /mnt/gfs2 when we mount it for the first time. The clustering = yes entry instructs Samba to use CTDB. The netbios name = csmb-server entry explicitly sets all the nodes to have a common NetBIOS name. The ea support parameter is required if you plan to use extended attributes.
The smb.conf configuration file must be identical on all of the cluster nodes.
Samba also offers registry-based configuration using the net conf command to automatically keep configuration in sync between cluster members without having to manually copy configuration files among the cluster nodes. For information on the net conf command, refer to the net(8) man page.
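For example, the following command prints the current registry-based configuration (shown only to illustrate the command):

[root@clusmb-01 ~]# net conf list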

11.6. Starting CTDB and Samba Services

After starting up the cluster, you must mount the GFS2 file systems that you created, as described in Section 11.3, "GFS2 Configuration". The permissions on the Samba share directory and user accounts on the cluster nodes should be set up for client access.
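A sketch of that preparation, run on one node after mounting /mnt/gfs2. The permissions shown are deliberately permissive for illustration, and the testmonkey account simply matches the mount example in Section 11.7; substitute your own accounts, which should exist on every cluster node:

[root@clusmb-01 ~]# mkdir /mnt/gfs2/share
[root@clusmb-01 ~]# chmod 0777 /mnt/gfs2/share
[root@clusmb-01 ~]# useradd testmonkey
[root@clusmb-01 ~]# smbpasswd -a testmonkey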
Execute the following command on all of the nodes to start up the ctdbd daemon. Since this example configured CTDB with CTDB_MANAGES_SAMBA=yes, CTDB will also start up the Samba service on all nodes and export all configured Samba shares.
[root@clusmb-01 ~]# service ctdb start
It can take a couple of minutes for CTDB to start Samba, export the shares, and stabilize. Executing ctdb status shows the status of CTDB, as in the following example:
[root@clusmb-01 ~]# ctdb status
Number of nodes:3
pnn:0 192.168.1.151     OK (THIS NODE)
pnn:1 192.168.1.152     OK
pnn:2 192.168.1.153     OK
Generation:1410259202
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
When you see that all nodes are "OK", it is safe to move on to use the clustered Samba server, as described in Section 11.7, "Using the Clustered Samba Server".

11.7. Using the Clustered Samba Server

Clients can connect to the exported Samba share by connecting to one of the IP addresses specified in the /etc/ctdb/public_addresses file, or by using the csmb-server DNS entry we configured earlier, as shown below:
[root@clusmb-01 ~]# mount -t cifs //csmb-server/csmb /mnt/sambashare -o user=testmonkey
or
[user@clusmb-01 ~]$ smbclient //csmb-server/csmb