4 node GFS+IPVSADM cluster with Ubuntu Linux
The Linux and open source world has a vast array of technologies to fit our needs. In these pages, I will describe the steps to build a 4-node GFS cluster with LVS. Unfortunately, there are not many howtos or guides for setting up similar environments.
GFS is developed by Red Hat, and GFS configuration is most easily done with the GUI tools in Red Hat Enterprise Linux. However, there are many reasons to deploy Debian/Ubuntu instead of Red Hat. (I will not discuss them here.)
This setup has been tested on the following hardware:
- 4x HP Compaq DL380 G5
- Intel Xeon quad core
- 16 GB RAM
- 2 x 146 GB SCSI internal disk
- 2 x Qlogic HBA
- 2 x Gigabit Ethernet
- MSA 2000 with 2x controllers
- 2x Fibre Switch
Our environment will end up having the following features:
- All servers running Ubuntu Hardy Heron 8.04 LTS 64-bit edition (OpenVZ-enabled kernel)
- MSA 2000 configured as a single RAID6 drive
- GFS cluster set up with a simple config, including a quorum device and simple fencing
- RAID6 drive set up as an LVM volume and shared between the 4 nodes using GFS
- LVS+Heartbeat to set up IP failover for web and MySQL services
- Network interfaces bonded to provide redundancy
- Two nodes will run the web server and the other two nodes will run MySQL
- OpenVZ for virtualization
(db1, db2, web1, web2)
Fun, right? OK, I will start by summarizing the MSA 2000 setup.
The MSA 2000 can be configured using the serial console or the web management GUI. I recommend using the serial console only to set the IP address and then continuing with its nice web GUI. I also recommend reading some basic information about volumes (physical and logical) and RAID systems. I preferred RAID6 because I need to maximize disk space while maintaining safety. RAID5 is not the best option because it cannot survive 2 broken disks and its performance is awful when a single disk fails. With RAID6 we can withstand 2 disk failures and still have acceptable performance. If performance is more important than storage size, RAID 0+1 is more suitable.
HP delivers servers with hardware RAID capability and you will most probably have two mirrored disks. Leave those as they are; mirroring will not have a negative effect on performance and you gain important redundancy against our old, crappy hard disk technology.
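To put rough numbers on this trade-off, here is a small Python sketch. The disk count and disk size (12 disks of 146 GB) are made-up assumptions for illustration, not necessarily the layout of the array described here, and the figures ignore controller overhead and hot spares.

```python
# Usable capacity and guaranteed fault tolerance for the RAID levels discussed above.
# All input numbers used with these helpers are illustrative assumptions.
RAID_LEVELS = {
    # level: (minimum disks, disks consumed by redundancy, failures always survived)
    "raid5": (3, lambda n: 1, 1),
    "raid6": (4, lambda n: 2, 2),
    "raid0+1": (4, lambda n: n // 2, 1),
}


def usable_capacity_gb(level, disks, disk_size_gb):
    """Return the capacity left for data after redundancy is subtracted."""
    minimum, overhead, _ = RAID_LEVELS[level]
    if disks < minimum:
        raise ValueError(f"{level} needs at least {minimum} disks")
    if level == "raid0+1" and disks % 2:
        raise ValueError("raid0+1 needs an even number of disks")
    return (disks - overhead(disks)) * disk_size_gb


def guaranteed_failures(level):
    """Return how many disk failures the level survives in the worst case."""
    return RAID_LEVELS[level][2]


if __name__ == "__main__":
    for level in RAID_LEVELS:
        print(level, usable_capacity_gb(level, 12, 146), "GB usable,",
              "survives", guaranteed_failures(level), "failure(s)")
```

With these assumed numbers, RAID6 gives up one more disk of capacity than RAID5 in exchange for surviving a second failure, while RAID 0+1 halves the capacity.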
- Download Ubuntu Hardy Heron 8.04 64-bit server edition and burn the image. Boot the system from the CD, verify the content of the CD-ROM and start the installation. Make a base installation with minimal packages. We can continue with updated packages from the repositories.
If you plan to add new local disks to the hosts, you can choose an LVM-managed disk during the partitioning step. If you do not, you can just select ext3 for the local partition. Here, some might say that you should make separate partitions for /boot, /var, /home or /usr. Here are some basic rules:
- Do not make /boot a separate partition unless you have a real good reason
- Do not make /var a separate partition. Some applications (like mysql) put database files under /var/lib. (Or you can have mail queue under /var/spool/) In that case, put that specific directory on another partition, not whole /var.
- You should make /home a different partition if you intend to use home directories a lot.
The point is to be able to restore all the necessary files from one single backup file in case of a disaster. After more than 10 years, I can say that I see no advantage in having a separate /var or /usr, but I have faced many problems because of them.
- Set up network parameters during the installation. Provide an HTTP proxy if necessary.
After the installation, open /etc/apt/sources.list:
vi /etc/apt/sources.list
Change the contents of the file like this:
deb http://us.archive.ubuntu.com/ubuntu/ hardy main restricted universe multiverse
deb-src http://us.archive.ubuntu.com/ubuntu/ hardy main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu/ hardy-security restricted main multiverse universe
deb-src http://security.ubuntu.com/ubuntu/ hardy-security restricted main multiverse universe
deb http://us.archive.ubuntu.com/ubuntu/ hardy-updates restricted main multiverse universe
deb-src http://us.archive.ubuntu.com/ubuntu/ hardy-updates restricted main multiverse universe
This makes sure that the system fetches updated and security packages. You can also add hardy-proposed and hardy-backports to get more recent packages. However, once you have the system up and running, do not install updates other than security updates unless necessary.
Update the package cache and make an upgrade:
apt-get update
apt-get upgrade
- Now run tasksel and select OpenSSH server. If you want a desktop environment, also select Ubuntu Desktop. Tasksel will install all the required packages. I will not focus on the installation of web server or database packages, as that depends on your needs and taste, so I am moving on to other things.
- Set up a serial console for remote access. iLO 2 requires an additional license to provide remote console functionality via the browser. However, you can still have a serial console using the virtual serial port.
First we need to create a serial port listener. Since Ubuntu uses upstart as the init replacement, we will not use the old inittab. Open /etc/event.d/ttyS0 and write:
# ttyS0 - getty
start on runlevel 2
start on runlevel 3
start on runlevel 4
start on runlevel 5

stop on runlevel 0
stop on runlevel 1
stop on runlevel 6

respawn
exec /sbin/getty -L 115200 ttyS0 vt102
Let root log in from the serial console: open /etc/securetty and add ttyS0.
Now edit /boot/grub/menu.lst to tell GRUB to use ttyS0 during the boot phase:
serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1
terminal --timeout=10 serial console
And lastly, append the following to the end of each kernel line in /boot/grub/menu.lst:
console=ttyS0 console=tty0
- Install multipath-tools. Ubuntu kernels already include support for multipath I/O. This allows us to connect to the storage array over two different physical paths. We can have redundancy for failover or we can distribute the I/O load between the paths.
My /etc/multipath.conf looks like:
defaults {
    udev_dir                /dev
    user_friendly_names     yes
}
blacklist {
    devnode "cciss"
    devnode "fd"
    devnode "hd"
    devnode "md"
    devnode "sr"
    devnode "scd"
    devnode "st"
    devnode "ram"
    devnode "raw"
    devnode "loop"
}
multipaths {
    multipath {
        wwid    3600c0ff000d56f978446734801000000
        alias   san1
    }
}
devices {
    device {
        vendor                  "HP"
        model                   "MSA2*"
        path_grouping_policy    multibus
        getuid_callout          "/lib/udev/scsi_id -g -u -s /block/%n"
        selector                "round-robin 0"
        rr_weight               uniform
        prio_callout            "/bin/true"
        path_checker            tur
        hardware_handler        "0"
        failback                immediate
        no_path_retry           12
        rr_min_io               100
    }
}
The blacklist section excludes devices that multipath should not handle; more examples for this section are included in the default config file. The path_grouping_policy decides how the two paths are used. With failover, one path is active and the other stands by; with multibus, as in the listing above, the I/O load is distributed across the paths.
Start the multipathing daemon by running /etc/init.d/multipath-tools start. Then run multipath -ll (double L). You should see something like this:
root@node1:~# multipath -ll
san1 (3600c0ff000d56f978446734801000000) dm-0 HP ,MSA2212fc
[size=820G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:0 sda 8:0  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:0 sdb 8:16 [active][ready]
If you see only one path, your fibre switches may need configuration or you may have cabling problems. All nodes should be able to see the device. Now we can mount the disk on one of the nodes, but for now we must not use it on the other nodes concurrently.
- You should be able to see the disk on the storage array by now. You should see as many disks as you created in the management GUI of the MSA 2000.
Note: For kernel versions < 2.6.19, the qla2xxx drivers include failover capability. However, that driver does not work with newer kernels due to serious changes in kernel structs. Newer versions of the driver omit this capability and recommend using multipath-tools. I recommend sticking with the newest kernels.
- Run cfdisk against the shared disk
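Checking the path count by eye gets tedious on four nodes, so here is a minimal Python sketch that counts the healthy paths in multipath -ll output. It assumes the old output format shown in this howto (newer multipath-tools releases print a different layout), and the function names are my own.

```python
import re

# Sample copied from the multipath -ll output shown in this howto.
SAMPLE = r"""san1 (3600c0ff000d56f978446734801000000) dm-0 HP ,MSA2212fc
[size=820G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:0 sda 8:0  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:0 sdb 8:16 [active][ready]
"""

# A path line looks like: " \_ 0:0:0:0 sda 8:0  [active][ready]"
PATH_RE = re.compile(
    r"^\s*\\_\s+(\d+:\d+:\d+:\d+)\s+(\w+)\s+(\d+:\d+)\s+\[(\w+)\]\[(\w+)\]"
)


def parse_paths(output):
    """Return one dict per path line; path group lines are skipped."""
    paths = []
    for line in output.splitlines():
        match = PATH_RE.match(line)
        if match:
            hctl, dev, _major_minor, dm_state, checker = match.groups()
            paths.append({"hctl": hctl, "dev": dev,
                          "dm_state": dm_state, "checker": checker})
    return paths


def healthy_paths(output):
    """Return device names of paths reported as [active][ready]."""
    return [p["dev"] for p in parse_paths(output)
            if p["dm_state"] == "active" and p["checker"] == "ready"]


if __name__ == "__main__":
    devs = healthy_paths(SAMPLE)
    print(f"{len(devs)} healthy path(s): {', '.join(devs)}")
```

On a real node you would feed it the captured output of multipath -ll instead of SAMPLE and complain if fewer than two paths come back.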
- Create an LVM partition. The size is up to you, I will continue with one gigantic partition
- Create another partition for cluster quorum disk. ~10MB is enough and the type can be Linux
- Save and exit. (You may need to reboot or run partprobe to let the kernel see the changes directly)
- First create a physical volume. We will put our volume groups in it.
pvcreate /dev/mapper/san1-part1
- Create and activate a volume group. We will put our logical volumes in it
vgcreate vgname /dev/mapper/san1-part1
vgchange -a y vgname
- Create logical volume using all available space
vgdisplay vgname | grep "Total PE"
lvcreate -l usetotalpehere vgname -n lvname
- Create your favourite file system on the new logical volume. The device path will look like /dev/mapper/vgname-lvname:
mkfs /dev/mapper/vgname-lvname
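The "Total PE" step above is easy to script. The following Python sketch pulls the extent count out of vgdisplay output and builds the matching lvcreate command line. The sample text is illustrative (the numbers are invented, not taken from my volume group), and the helper names are my own.

```python
import re

# Illustrative vgdisplay excerpt; the values are made up for this example.
SAMPLE_VGDISPLAY = """\
  --- Volume group ---
  VG Name               vgname
  PE Size               4.00 MB
  Total PE              209919
  Alloc PE / Size       0 / 0
"""


def total_pe(vgdisplay_output):
    """Return the number of physical extents from the 'Total PE' line."""
    match = re.search(r"^\s*Total PE\s+(\d+)\s*$", vgdisplay_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Total PE' line found")
    return int(match.group(1))


def lvcreate_command(vg, lv, extents):
    """Build the lvcreate argument list used in the step above."""
    return ["lvcreate", "-l", str(extents), vg, "-n", lv]


if __name__ == "__main__":
    extents = total_pe(SAMPLE_VGDISPLAY)
    print(" ".join(lvcreate_command("vgname", "lvname", extents)))
```

The argument list can be handed to subprocess.run on a real system. LVM2 also accepts a percentage such as -l 100%FREE, which avoids the lookup entirely if your version supports it.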
- Now we will configure GFS2 cluster. First, we have to install required packages on all the nodes.
apt-get install gfs2-tools gfs-tools cman clvm
Add following modules to /etc/modules
loop
lp
rtc
fuse
gfs
lock_dlm
ip_vs_ftp
ip_vs_lblc
ip_vs_lc
ip_vs_wlc
ip_vs_lblcr
ip_vs_nq
ip_vs_wrr
ip_vs_sh
ip_vs_dh
ip_vs_sed
ip_vs_rr
bonding
Run modprobe against all of these modules to load them into the kernel.
Create the GFS2 volume (or GFS, whichever you want; GFS2 is still experimental). The -j parameter defines the number of journals; normally, every node must have one. We have to use the Distributed Lock Manager (lock_dlm) for lock management, and with lock_dlm the lock table must be given with -t as clustername:fsname, where clustername matches the cluster name in /etc/cluster/cluster.conf (both names here are placeholders):
mkfs.gfs2 -p lock_dlm -t clustername:fsname -j 4 /dev/mapper/vgname-lvname
Set up /etc/lvm/lvm.conf by defining the following values:
locking_type = 2
fallback_to_clustered_locking = 0
fallback_to_local_locking = 0
locking_library = "liblvm2clusterlock.so"
We should never fall back to another locking mechanism, otherwise we can harm data integrity.
Now we have to create our quorum disk. Basically, the quorum disk is accessed frequently by all nodes to write timestamped "I am here" messages. If an alive message from a node is not seen for a specified time, that node is considered dead and is fenced from the storage. The quorum disk is extremely important for data consistency. It must reside on a physical partition and all nodes must have access to it. The quorum disk cannot be on an LVM volume because accessing cluster LVM requires cluster membership. Initialize the quorum disk (-c names the device and -l sets a label of your choice):
mkqdisk -c /dev/mapper/san1-part2 -l qdisklabel
Open /etc/cluster/cluster.conf and define the GFS cluster. Ideally, all nodes should have at least one working fencing method. Otherwise, it might not be possible to prevent data corruption. The best fencing methods are the power-related ones. Since I only have an iLO 2 port, I send the power-off signal through it. It is also possible to run scripts which log in to the fibre switch and shut down a specific port.
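My actual cluster.conf is not reproduced here, so the following Python sketch only shows the general shape such a file can take: four nodes, a quorum disk and one fence_ilo device per node. Every value in it is an assumption: the cluster name, the iLO hostnames, the CHANGE_ME credentials, and the quorumd settings (interval, tko, 3 votes). Check the cman, qdisk and fence_ilo manuals for your version before using anything like it.

```python
import xml.etree.ElementTree as ET

# Hypothetical node names mapped to hypothetical iLO hostnames.
NODES = [("web1", "web1-ilo"), ("web2", "web2-ilo"),
         ("db1", "db1-ilo"), ("db2", "db2-ilo")]


def quorum_threshold(expected_votes):
    """Votes needed for quorum: more than half of the expected votes."""
    return expected_votes // 2 + 1


def build_cluster_conf(cluster_name, nodes, qdisk_label, qdisk_votes):
    """Return a cluster.conf skeleton as an XML string (a sketch, not a reference)."""
    expected = len(nodes) + qdisk_votes  # one vote per node plus the quorum disk
    cluster = ET.Element("cluster", name=cluster_name, config_version="1")
    ET.SubElement(cluster, "cman", expected_votes=str(expected))
    # interval and tko are assumed example values, tune them for your storage
    ET.SubElement(cluster, "quorumd", interval="1", tko="10",
                  votes=str(qdisk_votes), label=qdisk_label)
    clusternodes = ET.SubElement(cluster, "clusternodes")
    fencedevices = ET.SubElement(cluster, "fencedevices")
    for nodeid, (name, ilo_host) in enumerate(nodes, start=1):
        node = ET.SubElement(clusternodes, "clusternode",
                             name=name, nodeid=str(nodeid), votes="1")
        method = ET.SubElement(ET.SubElement(node, "fence"), "method", name="1")
        ET.SubElement(method, "device", name=f"ilo-{name}")
        ET.SubElement(fencedevices, "fencedevice", name=f"ilo-{name}",
                      agent="fence_ilo", hostname=ilo_host,
                      login="CHANGE_ME", passwd="CHANGE_ME")
    return ET.tostring(cluster, encoding="unicode")


if __name__ == "__main__":
    print(build_cluster_conf("clustername", NODES, "qdisklabel", 3))
    print("votes needed for quorum:", quorum_threshold(len(NODES) + 3))
```

With four node votes and three quorum disk votes, the expected total is 7 and quorum needs 4, so a single surviving node that still reaches the quorum disk stays quorate. The cluster name used here must match the one given to the GFS lock table.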
- Set up bonding for the Ethernet interfaces.
Normally, I recommend having 4 Ethernet interfaces for maximum reliability. Since we have 4 nodes, we cannot just use cross cables or serial cables for the cluster interconnect. We have to connect to a switch, and switches can fail at any time. Therefore, two of the four interfaces must be bonded in failover mode to make sure that cluster traffic survives even if one link or switch fails. This is extremely vital. The cluster interconnect not only carries heartbeat messages, but also provides transport for the distributed lock manager. If the distributed lock manager loses connectivity, the node will lose its access to the global storage.
Bonding setup is extremely simple. Just configure the /etc/network/interfaces file like this:
auto bond0
iface bond0 inet static
    address x.y.z.t
    netmask a.b.c.d
    gateway q.w.e.r
    slaves eth0 eth1
Of course, you will need another similar section for bond1.
Later, it is possible to check the status from /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.3 (December 6, 2007)

Bonding Mode: fault-tolerance (active-backup) (fail_over_mac)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1e:0b:d3:f2:34

Slave Interface: eth1
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:1e:0b:d3:f2:32
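In the status above, eth1 is down, which is easy to overlook. Here is a minimal Python sketch that reads this kind of status text and reports the active slave and any slaves whose link is down. It assumes the "Key: value" layout shown above; the function names are my own.

```python
# Sample taken from the /proc/net/bonding/bond0 output shown in this howto.
SAMPLE = """\
Ethernet Channel Bonding Driver: v3.2.3 (December 6, 2007)

Bonding Mode: fault-tolerance (active-backup) (fail_over_mac)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1e:0b:d3:f2:34

Slave Interface: eth1
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:1e:0b:d3:f2:32
"""


def parse_bonding(text):
    """Return the bonding mode, the active slave and the MII status per slave."""
    info = {"mode": None, "active": None, "slaves": {}}
    current = None  # slave section we are currently inside, if any
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            info["mode"] = value
        elif key == "Currently Active Slave":
            info["active"] = value
        elif key == "Slave Interface":
            current = value
            info["slaves"][current] = None
        elif key == "MII Status" and current is not None:
            info["slaves"][current] = value
    return info


def down_slaves(text):
    """Return the slave interfaces whose MII status is not 'up'."""
    return [name for name, status in parse_bonding(text)["slaves"].items()
            if status != "up"]


if __name__ == "__main__":
    status = parse_bonding(SAMPLE)
    print("active slave:", status["active"])
    print("slaves with link down:", down_slaves(SAMPLE))
```

On a live node you would read the text with open("/proc/net/bonding/bond0").read() instead of using SAMPLE.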
- Reboot all nodes. If everything is correct and there is no network issue, all nodes should join the cluster. Cluster membership can be checked by running cman_tool nodes.
- Set up the Linux IP cluster
- Install the required packages for the IP cluster:
apt-get install heartbeat2 ipvsadm ldirectord
- Tune the kernel parameters for cluster operation (in /etc/sysctl.conf) and reboot all nodes. Note that if any of the interfaces is missing, you will probably have issues in the IPVS cluster.
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.ip_forward = 1
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.eth0.arp_ignore = 1
net.ipv4.conf.eth1.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.eth0.arp_announce = 2
net.ipv4.conf.eth1.arp_announce = 2
net.ipv4.conf.bond0.proxy_arp = 1
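Since a missing interface is an easy mistake here, a small Python sketch can generate the same list from the interface names. The function name and its arguments are my own invention; the keys and values are simply the ones listed above.

```python
def lvs_sysctl_lines(interfaces, proxy_arp_interface):
    """Return the sysctl lines from this howto for the given interface names."""
    settings = [
        ("net.ipv4.conf.default.rp_filter", 1),
        ("net.ipv4.conf.all.rp_filter", 1),
        ("net.ipv4.ip_forward", 1),
    ]
    # arp_ignore and arp_announce are set for "all" and for every listed interface
    for scope in ["all", *interfaces]:
        settings.append((f"net.ipv4.conf.{scope}.arp_ignore", 1))
    for scope in ["all", *interfaces]:
        settings.append((f"net.ipv4.conf.{scope}.arp_announce", 2))
    settings.append((f"net.ipv4.conf.{proxy_arp_interface}.proxy_arp", 1))
    return [f"{key} = {value}" for key, value in settings]


if __name__ == "__main__":
    print("\n".join(lvs_sysctl_lines(["eth0", "eth1"], "bond0")))
```

Called with eth0, eth1 and bond0 it reproduces the ten lines above; with four bonded interfaces you would pass all four names.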
- Edit /etc/ha.d/ha.cf
Three lines are most important in this file:
ucast bond0 x.y.z.t
node web1
node web2
The ucast line defines the interface used for heartbeat checks and the destination IP address. The node lines define the hostnames of the heartbeat nodes.
- Edit /etc/ha.d/haresources. Define the ldirectord module for the cluster IP. The line must be the same on both nodes of the cluster. For example, the following line must be used on both web1 and web2.
web1 ldirectord::ldirectord.cf LVSSyncDaemonSwap::master IPaddr2::clusteripaddress/netmaskslashnotation/bond0/broadcastaddress
- Edit /etc/ha.d/ldirectord.cf. Here we define two real servers with different weights. The ipvsadm manual explains all load balancing algorithms in detail. In this setup, we create a pure active/passive balancing using the sed scheduler. The checktype is only connect, but it is also possible to send HTTP requests and check the return values.
checktimeout=10
checkinterval=1
autoreload=yes
logfile="local0"
virtual = tomcat:8080
        service = http
        real = web1:8080 gate 65535
        real = web2:8080 gate 1
        checktype = connect
        scheduler = sed
        protocol = tcp
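Why do weights of 65535 and 1 give active/passive behaviour? The ipvsadm manual describes sed (shortest expected delay) as picking the server with the smallest (Ci + 1) / Ui, where Ci is the number of connections on server i and Ui is its weight. The Python sketch below is only a toy model of that formula, not the kernel code, and it assumes that no connection ever finishes.

```python
from fractions import Fraction

# Weights taken from the ldirectord configuration above.
WEIGHTS = {"web1": 65535, "web2": 1}


def sed_pick(servers):
    """Pick the server with the shortest expected delay (Ci + 1) / Ui.

    servers maps a name to a (connections, weight) tuple.
    """
    return min(servers,
               key=lambda name: Fraction(servers[name][0] + 1, servers[name][1]))


def simulate(new_connections, weights):
    """Hand out connections one by one, assuming none of them ever finishes."""
    active = {name: 0 for name in weights}
    for _ in range(new_connections):
        target = sed_pick({name: (active[name], weights[name]) for name in weights})
        active[target] += 1
    return active


if __name__ == "__main__":
    print(simulate(1000, WEIGHTS))
    # If the health check removes web1, only web2 is left to choose from.
    print(sed_pick({"web2": (0, 1)}))
```

In this model web1 receives every connection until it holds tens of thousands of them at once, because an idle web2 already has an expected delay of 1. web2 only sees traffic when ldirectord takes web1 out of the pool.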
- Create a new OpenVZ-based virtual server. If you do not already have a template, follow a template creation howto first.
Now we can continue with setting up heartbeat for the web and SQL services. Before this, all IP addresses should be planned and entered into the /etc/hosts files on all nodes.






