LVS Cluster Management

Cluster Management

Cluster management is the monitoring and administration of all the computers in a computer cluster. It covers a wide range of functionality, such as resource monitoring, cluster membership management, reliable group communication, and full-featured administration interfaces.

One advantage of a cluster system is its hardware and software redundancy: the cluster consists of a number of independent nodes, and each node runs its own copy of the operating system and application software. Cluster management can help achieve high availability by detecting node or daemon failures and reconfiguring the system appropriately, so that the workload can be taken over by the remaining nodes in the cluster.

LVS Cluster Management

Since an LVS cluster is a load-balancing cluster, the requirements for LVS cluster management are simple: cluster monitoring and an administration interface are the two major parts.

Cluster Monitoring

The major work of cluster monitoring in LVS is to monitor the availability of real servers and load balancers, and to reconfigure the system if any partial failure happens, so that the cluster as a whole can still serve requests. Note that monitoring the availability of databases, network file systems, or distributed file systems is not addressed here.

There are two approaches to monitoring the availability of real servers: one is to run service monitoring daemons at the load balancer to check server health periodically; the other is to run monitoring agents at the real servers to collect information and report it to the load balancer. The service monitor usually sends service requests and/or ICMP ECHO_REQUEST packets to the real servers periodically, and removes or disables a real server in the server list at the load balancer if it gets no response within a specified time or gets an error response, so that no new requests are sent to the dead server. When the service monitor detects that the dead server has recovered, it adds the server back to the list of available servers at the load balancer. In this way, the load balancer can mask the failure of service daemons or servers automatically.
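The service monitor loop described above can be sketched in Python as follows. This is only a minimal illustration, not the code of any particular LVS monitoring tool: the class name `ServiceMonitor`, the `tcp_check` helper, and the 2-second timeout are assumptions made for the example, and a plain TCP connect stands in for a real service request.

```python
import socket


def tcp_check(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class ServiceMonitor:
    """Hypothetical monitor that keeps a set of available real servers."""

    def __init__(self, servers, check=tcp_check):
        self.servers = list(servers)        # configured (host, port) pairs
        self.check = check                  # health check, replaceable for testing
        self.available = set(self.servers)  # servers that may receive new requests

    def poll_once(self):
        """Check every server once; disable dead ones, re-enable recovered ones."""
        for server in self.servers:
            if self.check(*server):
                self.available.add(server)
            else:
                self.available.discard(server)
        return self.available
```

A real daemon would call `poll_once()` at a fixed interval and push each change to the load balancer's server list; many monitors also require several consecutive failures before removing a server, which this sketch leaves out.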

In the monitoring agent approach, there is also a monitoring master running at the load balancer to receive information from the agents. The monitoring master adds or removes servers at the load balancer based on the availability of the agents, and can also adjust server weights based on server load information. However, it takes more effort to make the monitoring agents run on all kinds of server operating systems, such as Linux, FreeBSD, and Windows.
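As a rough sketch of the monitoring master, the Python class below keeps the latest report from each agent, drops servers whose agent has gone silent, and derives a weight from the reported load. The class name `MonitoringMaster`, the 10-second agent timeout, the load scale of 0.0 to 1.0, and the linear weight formula are all assumptions chosen for illustration; the text above does not prescribe any of them.

```python
import time


class MonitoringMaster:
    """Hypothetical master that turns agent reports into server weights."""

    def __init__(self, agent_timeout=10.0, max_weight=100):
        self.agent_timeout = agent_timeout  # seconds of silence before removal
        self.max_weight = max_weight        # weight given to an idle server
        self.reports = {}                   # server -> (timestamp, load)

    def report(self, server, load, now=None):
        """Record a load report (0.0 = idle, 1.0 = saturated) from an agent."""
        now = time.monotonic() if now is None else now
        self.reports[server] = (now, load)

    def weights(self, now=None):
        """Return {server: weight} for servers whose agent reported recently."""
        now = time.monotonic() if now is None else now
        result = {}
        for server, (seen, load) in self.reports.items():
            if now - seen > self.agent_timeout:
                continue  # agent is silent: treat the server as unavailable
            clamped = min(max(load, 0.0), 1.0)
            # Lower load gives a higher weight; never drop below 1.
            result[server] = max(1, round(self.max_weight * (1.0 - clamped)))
        return result
```

The `now` parameter exists only so that the logic can be exercised without waiting on a real clock.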

The load balancer is the core of a server cluster system, so it must not be a single point of failure for the whole system. To prevent a load balancer failure from taking the whole system out of service, we need to set up a backup (or several backups) of the load balancer, connected to the primary by heartbeat or VRRP. A heartbeat daemon runs on both the primary and the backup, and the two daemons periodically exchange messages such as "I'm alive" over serial lines and/or network interfaces. When the heartbeat daemon on the backup does not hear a heartbeat message from the primary within the specified time, the backup takes over the virtual IP (VIP) address to provide the load-balancing service. When the failed load balancer comes back to work, there are two options: either it automatically becomes the backup load balancer, or the active load balancer releases the VIP address and the recovered one takes it over and becomes the primary load balancer again.
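The takeover and recovery logic on the backup can be sketched as a small state machine. The Python below is an illustration only and is not how the heartbeat daemon or VRRP is implemented: the class name `BackupBalancer`, the `dead_time` and `failback` parameters, and the explicit `now` timestamps are assumptions made for the example, and actually configuring the VIP on a network interface is left out.

```python
class BackupBalancer:
    """Hypothetical view of the backup load balancer's heartbeat logic."""

    def __init__(self, dead_time=3.0, failback=False, start=0.0):
        self.dead_time = dead_time    # seconds without a heartbeat before takeover
        self.failback = failback      # True: give the VIP back to a recovered primary
        self.last_heartbeat = start   # time the last "I'm alive" message arrived
        self.holds_vip = False

    def on_heartbeat(self, now):
        """Called when an "I'm alive" message from the primary arrives."""
        self.last_heartbeat = now
        if self.holds_vip and self.failback:
            # Second option: release the VIP so the recovered primary takes it back.
            self.holds_vip = False
        # First option (failback=False): keep the VIP; the recovered
        # machine simply becomes the new backup.

    def tick(self, now):
        """Periodic timer: take over the VIP if the primary has been silent."""
        if not self.holds_vip and now - self.last_heartbeat > self.dead_time:
            self.holds_vip = True     # a real daemon would bring up the VIP here
        return self.holds_vip
```

The `failback` flag selects between the two recovery options described above.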

The primary load balancer holds the state of connections, i.e. which server each connection is forwarded to. If the backup load balancer takes over without this connection information, clients have to send their requests again to access the service. To make load balancer failover transparent to client applications, we implement connection synchronization in IPVS: the primary IPVS load balancer synchronizes connection information to the backup load balancers through UDP multicast. When the backup load balancer takes over after the primary one fails, it has the state of most connections, so almost all connections can continue to access the service through the backup load balancer.
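To make the idea of connection synchronization concrete, here is a user-space sketch in Python. It is not the IPVS implementation, which runs in the kernel and uses its own binary message format; the JSON encoding, the function names, and the multicast group `239.1.1.1` with port `5000` are all assumptions made purely for illustration.

```python
import json
import socket


def encode_connections(conns):
    """Serialize {(client_ip, client_port, vip, vport): real_server} to bytes."""
    rows = [[*key, server] for key, server in sorted(conns.items())]
    return json.dumps(rows).encode("utf-8")


def apply_sync(table, payload):
    """Merge a received sync message into the backup's connection table."""
    for client_ip, client_port, vip, vport, server in json.loads(payload.decode("utf-8")):
        table[(client_ip, client_port, vip, vport)] = server
    return table


def send_multicast(payload, group="239.1.1.1", port=5000):
    """Send one sync message to a (hypothetical) multicast group over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.sendto(payload, (group, port))
```

In this sketch the primary would call `send_multicast(encode_connections(...))` periodically and each backup would pass received datagrams to `apply_sync`. Because UDP delivery is not guaranteed and the messages are periodic, a backup may miss the most recent entries, which matches the statement above that it has the state of most, not all, connections.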

Administration Interface

The administration interface of LVS cluster management should enable administrators to do the following things:

add new servers to increase the system throughput or remove servers for system maintenance, without bringing down the whole system service
monitor the traffic of LVS cluster and provide statistics
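On Linux, the `ipvsadm` command-line tool is one way to carry out these tasks against the IPVS server table. The Python helpers below only build the argument lists for the corresponding `ipvsadm` invocations; the helper names and the example addresses are assumptions for illustration, and running the commands requires root privileges on a machine with IPVS available.

```python
import subprocess


def add_server_cmd(vip, rip, weight=1):
    """Command to add real server rip to the TCP virtual service vip."""
    return ["ipvsadm", "-a", "-t", vip, "-r", rip, "-w", str(weight)]


def remove_server_cmd(vip, rip):
    """Command to remove real server rip from the TCP virtual service vip."""
    return ["ipvsadm", "-d", "-t", vip, "-r", rip]


def stats_cmd():
    """Command to list virtual services with numeric addresses and traffic statistics."""
    return ["ipvsadm", "-L", "-n", "--stats"]


def run(cmd):
    """Execute one of the commands above; raises CalledProcessError on failure."""
    subprocess.run(cmd, check=True)
```

For example, `run(add_server_cmd("203.0.113.10:80", "10.0.0.3:80", weight=2))` would add a server to a running virtual service without interrupting the other servers.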

Cluster Management Software

Many cluster management software packages can be used in conjunction with LVS to provide high availability and management of the whole system.

For computing cluster management software, see the page Computing Cluster Management.

External Links

Computer Cluster
