http://kb.linuxvirtualserver.org/api.php?action=feedcontributions&user=Svenx&feedformat=atomLVSKB - User contributions [en]2024-03-29T00:44:18ZUser contributionsMediaWiki 1.26.2http://kb.linuxvirtualserver.org/wiki?title=Two-node_setup_with_overlapping_client_subnets&diff=5961Two-node setup with overlapping client subnets2011-10-19T07:53:36Z<p>Svenx: /* Theory of operation */ Note on remapped subnets in the main application</p>
<hr />
<div>This guide describes how to set up a two-node cluster that handles<br />
multiple overlapping client subnets, and keeps clients uniquely<br />
identifiable, even as they reach the main application.<br />
<br />
==Preamble==<br />
This setup requires kernel version 2.6.37 or newer. In particular, it<br />
depends on the following recent features:<br />
*Accept incoming packets with local source (v2.6.33, [https://github.com/torvalds/linux/commit/8153a10c08f1312af563bb92532002e46d3f504a commit])<br />
*Connection tracking zones (v2.6.34, [https://github.com/torvalds/linux/commit/5d0aa2ccd4699a01cfdf14886191c249d7b45a01 commit])<br />
*Netfilter nat INPUT chain, NETMAP changes (v2.6.36, [https://github.com/torvalds/linux/commit/c68cd6cc21eb329c47ff020ff7412bf58176984e commit])<br />
*Netfilter connection tracking for lvs/ipvs (v2.6.37, [https://github.com/torvalds/linux/commit/f4bc17cdd205ebaa3807c2aa973719bb5ce6a5b2 commit])<br />
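<br />
A quick way to check a node against this minimum version is to compare<br />
version strings. The following is only a sketch: the helper name<br />
version_ge is made up for this illustration, and it assumes a sort<br />
that supports -V (version sort), as GNU coreutils does.<br />

```shell
#!/bin/sh
# Hypothetical helper (not part of the original guide): succeeds when
# version $1 is greater than or equal to version $2.
# Assumes sort supports -V (version sort).
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | sed -n 1p)
    [ "$lowest" = "$2" ]
}

# Strip any distribution suffix such as "-amd64" before comparing.
running=$(uname -r | cut -d- -f1)
if version_ge "$running" 2.6.37; then
    echo "kernel $running: ok"
else
    echo "kernel $running: too old, need 2.6.37 or newer"
fi
```

Note that a new enough version number does not guarantee that the<br />
required netfilter and ipvs options are compiled in.<br />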
<br />
==Introduction==<br />
This setup lets a single pair of nodes serve remote users from<br />
several sites that all use the same, overlapping subnet. It also maps<br />
each remote site to a unique IP range, so that users can be told apart<br />
in application logs and the like. In short, it combines a two-node,<br />
multi-interface lvs setup with network remapping. It's ultimately a<br />
fairly complicated puzzle.<br />
<br />
==Network infrastructure==<br />
Network diagram:<br />
          Main virtual IP: 1.2.3.4 port 80<br />
                                                  /-----------\<br />
          +------------------------------+    /==( 10.0.0.0/24 )<br />
          |            Router            |  //    \-----------/<br />
          | (VPN)  ' ' ' ' ' ' ' ' ' ' '===/        Remote A<br />
          |            '  (crypto map)   |<br />
          |            '         ' ' ' '===\<br />
          |            '       '         |  \\    /-----------\<br />
          |         if0.10 if0.20        |    \==( 10.0.0.0/24 )<br />
          |  172.16.10.1    172.16.20.1  |        \-----------/<br />
          +------------#-----$-----------+          Remote B<br />
                       #     $<br />
              VLAN 10  #     $  VLAN 20<br />
                       #     $<br />
  VIP 172.16.10.2/24   #     $   VIP 172.16.20.2/24<br />
  VIP 1.2.3.4/32       #     $   VIP 1.2.3.4/32<br />
 +------------------+  #     $  +------------------+<br />
 |      LVS A       |  #     $  |       LVS B      |<br />
 |         eth0.10 ######### $ ### eth0.10         |<br />
 | RIP 172.16.10.3  |        $  |  RIP 172.16.10.4 |<br />
 |                  |        $  |                  |<br />
 |         eth0.20 $$$$$$$$$$$$$$$ eth0.20         |<br />
 | RIP 172.16.20.3  |           |  RIP 172.16.20.4 |<br />
 |                  |           |                  |<br />
 +------------------+           +------------------+<br />
<br />
===Notes===<br />
While you can use individual network interfaces, using VLANs saves<br />
valuable resources. Combine VLANs with interface bonding to achieve<br />
an even higher degree of resilience against failures.<br />
<br />
The router maps each remote site to its own VLAN. How this is done<br />
isn't really important; the remote sites can be directly connected on<br />
separate egress interfaces, or connected through IPSec VPNs as in the<br />
diagram. The VPN approach is common on Cisco routers, using VRF-aware<br />
IPSec. In short, a crypto map is defined so that tunnel A is mapped to<br />
VRF (virtual routing and forwarding) instance 10, and tunnel B to<br />
VRF 20. These VRF instances in turn have separate routing tables, each<br />
pointing the virtual IP towards the LVS VIP on its VLAN.<br />
<br />
An obvious and easy solution to the overlapping subnets would be to<br />
have the router do SNAT/masquerading of the incoming packets. In my<br />
case, I spent a lot of time trying to get that to work on the Cisco<br />
router, but without luck.<br />
<br />
==Theory of operation==<br />
Assuming that traffic reaches the LVS pair on both VLAN 10 and 20, the<br />
idea is that packets are handled in the following way on each VLAN:<br />
<br />
# Incoming packets coming from the client subnet 10.0.0.0/24, destined for the virtual IP 1.2.3.4 on port 80, are marked with an fwmark using iptables with the MARK target.<br />
# Keepalived/LVS/ipvs is configured to schedule packets based on fwmarks. Mark 110 is load balanced to the VLAN 10 backends 172.16.10.3 and .4. Mark 120 is load balanced to the VLAN 20 backends 172.16.20.3 and .4.<br />
# Either on the way ''in'' on the same node, or on the way ''out'' to the other node, the packet's source IP is mapped to a unique subnet using the iptables NETMAP target.<br />
# The packet is handled by the main application, and a response packet is sent back to the source.<br />
# RPDB entries and custom routing tables are set up using iproute2, to ensure that the response packet makes it back the same way it came, through the NETMAP translation and then out the same interface it came in.<br />
<br />
In the end, the main application (running on both nodes) would see<br />
clients from site A coming from source 10.0.10.0/24, and clients<br />
from site B coming from 10.0.20.0/24. Any application/user logic,<br />
log analysis or accounting, would have to take this into account<br />
and do a reverse mapping.<br />
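<br />
Since NETMAP leaves the host part of the address untouched, the<br />
reverse mapping only has to swap the third octet back. A minimal<br />
sketch, assuming the two mappings used in this guide; the function<br />
name unmap_client is made up for illustration:<br />

```shell
#!/bin/sh
# Hypothetical helper: translate a remapped client address, as seen by
# the application, back to the remote site and the original address.
# NETMAP keeps the host part, so only the third octet needs changing.
unmap_client() {
    host=${1##*.}                 # last octet (host part of the /24)
    case "$1" in
        10.0.10.*) site=A ;;
        10.0.20.*) site=B ;;
        *) echo "unknown $1"; return 1 ;;
    esac
    echo "$site 10.0.0.$host"
}

unmap_client 10.0.10.57    # prints "A 10.0.0.57"
unmap_client 10.0.20.57    # prints "B 10.0.0.57"
```

A log analysis script could run each logged client address through<br />
such a function before matching it against per-site records.<br />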
<br />
==Proof of concept==<br />
The following script uses iptables and iproute2 to bring the network<br />
into the required state. It assumes that the network interfaces have<br />
already been set up with the basic IP addresses:<br />
<br />
*LVS A, eth0.10: 172.16.10.3/24, default gateway 172.16.10.1<br />
*LVS A, eth0.20: 172.16.20.3/24, no default gateway<br />
*LVS B, eth0.10: 172.16.10.4/24, default gateway 172.16.10.1<br />
*LVS B, eth0.20: 172.16.20.4/24, no default gateway<br />
<br />
===Network configuration script===<br />
#!/bin/sh<br />
<br />
# This proof of concept script is intended to be straightforward to<br />
# read and understand, rather than being cleverly written with<br />
# variables, loops, etc. It is intended to work on both nodes, so some<br />
# conditional variables must be set initially.<br />
<br />
# Unique variables per node, derived from hostname.<br />
case `hostname` in<br />
lvsa)<br />
other_mac=00:10:10:10:10:20 # lvsb eth0 mac address<br />
v10my_ip=172.16.10.3 # lvsa eth0.10 ip addr<br />
v20my_ip=172.16.20.3 # lvsa eth0.20 ip addr<br />
v10other_ip=172.16.10.4 # lvsb eth0.10 ip addr<br />
v20other_ip=172.16.20.4 # lvsb eth0.20 ip addr<br />
;;<br />
lvsb)<br />
other_mac=00:10:10:10:10:10 # lvsa eth0 mac address<br />
v10my_ip=172.16.10.4 # lvsb eth0.10 ip addr<br />
v20my_ip=172.16.20.4 # lvsb eth0.20 ip addr<br />
v10other_ip=172.16.10.3 # lvsa eth0.10 ip addr<br />
v20other_ip=172.16.20.3 # lvsa eth0.20 ip addr<br />
;;<br />
*)<br />
echo >&2 "Unknown host: `hostname`"<br />
exit 1<br />
;;<br />
esac<br />
<br />
start() {<br />
### RPDB (Routing Policy Database)<br />
# Remote response packets (those that will have to be sent to the<br />
# other node) will have source IP equal to the outgoing interface's<br />
# primary address, due to the iptables REDIRECT that rewrites the<br />
# destination address to the incoming interface's primary address on<br />
# incoming packets. The routing tables pointed to have the other<br />
# node as gateway.<br />
ip rule add pref 110 from $v10my_ip to 10.0.10.0/24 lookup 210<br />
ip rule add pref 120 from $v20my_ip to 10.0.20.0/24 lookup 220<br />
<br />
# Local response packets will have been NETMAP detranslated already,<br />
# so the destination will be the untranslated source net. The<br />
# routing tables pointed to have the upstream router as gateway,<br />
# since the packets should be sent straight back to the source.<br />
ip rule add pref 210 to 10.0.10.0/24 lookup 110<br />
ip rule add pref 220 to 10.0.20.0/24 lookup 120<br />
<br />
# Remote response packets are marked so that they are routed out the<br />
# correct interface when sent out.<br />
ip rule add pref 310 to 10.0.0.0/24 fwmark 210 lookup 110<br />
ip rule add pref 320 to 10.0.0.0/24 fwmark 220 lookup 120<br />
<br />
<br />
### Routing<br />
# Routing tables pointing the default gateway to the upstream<br />
# router.<br />
ip route add default via 172.16.10.1 dev eth0.10 table 110<br />
ip route add default via 172.16.20.1 dev eth0.20 table 120<br />
<br />
# Routing tables pointing the default gateway to the other node.<br />
ip route add default via $v10other_ip dev eth0.10 table 210<br />
ip route add default via $v20other_ip dev eth0.20 table 220<br />
<br />
<br />
### Netfilter rules<br />
# Put request packets on each interface into separate connection<br />
# tracking zones. Connections in one zone are tracked independently<br />
# of those in other zones, so identical client addresses arriving on<br />
# different interfaces don't collide in conntrack or NAT. See<br />
# http://lwn.net/Articles/371028/<br />
# https://github.com/torvalds/linux/commit/5d0aa2ccd4699a01cfdf14886191c249d7b45a01<br />
iptables -t raw -A PREROUTING -i eth0.10 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j CT --zone 1<br />
iptables -t raw -A PREROUTING -i eth0.20 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j CT --zone 2<br />
<br />
# Put remote response packets (from the other node) into the<br />
# corresponding connection tracking zones.<br />
iptables -t raw -A PREROUTING -d 10.0.10.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 1<br />
iptables -t raw -A PREROUTING -d 10.0.20.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 2<br />
<br />
# Put local response packets (from this node) into the corresponding<br />
# connection tracking zones.<br />
iptables -t raw -A OUTPUT -d 10.0.10.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 1<br />
iptables -t raw -A OUTPUT -d 10.0.20.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 2<br />
<br />
# Mark incoming packets for lvs/ipvs scheduling. These marks match<br />
# the ones in keepalived.conf.<br />
iptables -t raw -A PREROUTING -i eth0.10 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j MARK --set-mark 110<br />
iptables -t raw -A PREROUTING -i eth0.20 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j MARK --set-mark 120<br />
<br />
# Mark remote response packets (from the other node) so that they<br />
# are routed out the correct interface after NETMAP detranslation.<br />
iptables -t raw -A PREROUTING -i eth0.10 -s 1.2.3.4/32 -d 10.0.10.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 210<br />
iptables -t raw -A PREROUTING -i eth0.20 -s 1.2.3.4/32 -d 10.0.20.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 220<br />
<br />
# Mark local response packets (from this node) so that they are<br />
# routed out the correct interface after NETMAP detranslation.<br />
iptables -t raw -A OUTPUT -o eth0.10 -s 1.2.3.4/32 -d 10.0.10.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 210<br />
iptables -t raw -A OUTPUT -o eth0.20 -s 1.2.3.4/32 -d 10.0.20.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 220<br />
<br />
# Remap remote request packets to unique subnets.<br />
# https://github.com/torvalds/linux/commit/c68cd6cc21eb329c47ff020ff7412bf58176984e<br />
iptables -t nat -A POSTROUTING -m mark --mark 110 -j NETMAP --to 10.0.10.0/24<br />
iptables -t nat -A POSTROUTING -m mark --mark 120 -j NETMAP --to 10.0.20.0/24<br />
<br />
# Remap local request packets to unique subnets.<br />
iptables -t nat -A INPUT -m mark --mark 110 -j NETMAP --to 10.0.10.0/24<br />
iptables -t nat -A INPUT -m mark --mark 120 -j NETMAP --to 10.0.20.0/24<br />
<br />
# Redirect incoming remote request packets so that the destination IP<br />
# is rewritten to the primary address of the incoming interface. This<br />
# is essential, since the response packet will then use that address<br />
# as its source, which lets the RPDB rules route it out the same<br />
# interface. Without it, the routing would select the default route.<br />
# A potential solution would involve connmark, but the mark is applied<br />
# after the ip rule is evaluated.<br />
# FIXME: The mac-source matching should be unnecessary, as the<br />
# source subnet has been translated already? Needs verification.<br />
iptables -t nat -A PREROUTING -s 10.0.10.0/24 -m mac --mac-source $other_mac -d 1.2.3.4 -j REDIRECT<br />
iptables -t nat -A PREROUTING -s 10.0.20.0/24 -m mac --mac-source $other_mac -d 1.2.3.4 -j REDIRECT<br />
<br />
<br />
### Sysctl settings<br />
# Use ARP settings that work with our setup.<br />
# http://lxr.linux.no/#linux+v3.0/Documentation/networking/ip-sysctl.txt#L895<br />
# http://lxr.linux.no/#linux+v3.0/Documentation/networking/ip-sysctl.txt#L926<br />
# http://kb.linuxvirtualserver.org/wiki/ARP_Issues_in_LVS/DR_and_LVS/TUN_Clusters<br />
sysctl net.ipv4.conf.eth0/10.arp_announce=2<br />
sysctl net.ipv4.conf.eth0/10.arp_ignore=1<br />
<br />
# Forwarding must be enabled. It doesn't apply to packets scheduled<br />
# by lvs/ipvs, but it's needed for remote response packets.<br />
# (Forwarding applies to the inbound interface, not the outbound.)<br />
sysctl net.ipv4.conf.eth0/10.forwarding=1<br />
<br />
# Accept incoming packets with a local source address. This is<br />
# required, as remote response packets will have the virtual IP as<br />
# their source: That virtual IP is also present as a secondary<br />
# address on the local incoming interface. Normally it would be<br />
# dropped, but this sysctl allows it to be accepted.<br />
# https://github.com/torvalds/linux/commit/8153a10c08f1312af563bb92532002e46d3f504a<br />
# http://lxr.linux.no/#linux+v3.0/Documentation/networking/ip-sysctl.txt#L849<br />
sysctl net.ipv4.conf.eth0/10.accept_local=1<br />
<br />
# Enable connection tracking for lvs/ipvs connections. This lets us<br />
# apply the NETMAP rule in POSTROUTING for remote request packets.<br />
# Without this setting, the netfilter nat table would not be<br />
# traversed by the ipvs'ed packets.<br />
# https://github.com/torvalds/linux/commit/f4bc17cdd205ebaa3807c2aa973719bb5ce6a5b2<br />
# http://lxr.linux.no/#linux+v3.0/net/netfilter/ipvs/Kconfig#L252<br />
sysctl net.ipv4.vs.conntrack=1<br />
<br />
# Disable accepting and sending ICMP redirects. This is essential to<br />
# avoid redirecting remote response packets directly to the router:<br />
# These packets must go through the lvs/ipvs master for correct<br />
# NETMAP detranslation. Setting this for 'all' is enough as long as<br />
# other interfaces have forwarding enabled (which they do).<br />
# http://lxr.linux.no/#linux+v3.0/Documentation/networking/ip-sysctl.txt#L753<br />
sysctl net.ipv4.conf.all.accept_redirects=0<br />
sysctl net.ipv4.conf.all.send_redirects=0<br />
}<br />
<br />
stop() {<br />
# Revert most of the settings from start().<br />
sysctl net.ipv4.conf.eth0/10.accept_local=0<br />
sysctl net.ipv4.conf.eth0/10.forwarding=0<br />
<br />
iptables -t nat -D PREROUTING -s 10.0.20.0/24 -m mac --mac-source $other_mac -d 1.2.3.4 -j REDIRECT<br />
iptables -t nat -D PREROUTING -s 10.0.10.0/24 -m mac --mac-source $other_mac -d 1.2.3.4 -j REDIRECT<br />
iptables -t nat -D INPUT -m mark --mark 120 -j NETMAP --to 10.0.20.0/24<br />
iptables -t nat -D INPUT -m mark --mark 110 -j NETMAP --to 10.0.10.0/24<br />
iptables -t nat -D POSTROUTING -m mark --mark 120 -j NETMAP --to 10.0.20.0/24<br />
iptables -t nat -D POSTROUTING -m mark --mark 110 -j NETMAP --to 10.0.10.0/24<br />
iptables -t raw -D OUTPUT -o eth0.20 -s 1.2.3.4/32 -d 10.0.20.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 220<br />
iptables -t raw -D OUTPUT -o eth0.10 -s 1.2.3.4/32 -d 10.0.10.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 210<br />
iptables -t raw -D PREROUTING -i eth0.20 -s 1.2.3.4/32 -d 10.0.20.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 220<br />
iptables -t raw -D PREROUTING -i eth0.10 -s 1.2.3.4/32 -d 10.0.10.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 210<br />
iptables -t raw -D PREROUTING -i eth0.20 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j MARK --set-mark 120<br />
iptables -t raw -D PREROUTING -i eth0.10 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j MARK --set-mark 110<br />
iptables -t raw -D OUTPUT -d 10.0.20.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 2<br />
iptables -t raw -D OUTPUT -d 10.0.10.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 1<br />
iptables -t raw -D PREROUTING -d 10.0.20.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 2<br />
iptables -t raw -D PREROUTING -d 10.0.10.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 1<br />
iptables -t raw -D PREROUTING -i eth0.20 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j CT --zone 2<br />
iptables -t raw -D PREROUTING -i eth0.10 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j CT --zone 1<br />
<br />
ip route flush table 220<br />
ip route flush table 210<br />
ip route flush table 120<br />
ip route flush table 110<br />
ip rule del pref 320<br />
ip rule del pref 310<br />
ip rule del pref 220<br />
ip rule del pref 210<br />
ip rule del pref 120<br />
ip rule del pref 110<br />
}<br />
<br />
case "$1" in<br />
start|stop)<br />
$1<br />
;;<br />
restart)<br />
stop<br />
start<br />
;;<br />
*)<br />
echo "Usage: $0 start|stop|restart"<br />
;;<br />
esac<br />
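<br />
Since the script changes routing and netfilter state, it can help to<br />
preview the commands before running them for real. The following is a<br />
minimal sketch of a dry-run wrapper and is not part of the script<br />
above; the names run and DRY_RUN are made up for this illustration.<br />
Prefixing each ip, iptables and sysctl command with run would let the<br />
same script either print or execute them.<br />

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN is set to a
# non-empty value, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Preview two of the commands from start() without touching the system.
DRY_RUN=1
run ip rule add pref 210 to 10.0.10.0/24 lookup 110
run ip route add default via 172.16.10.1 dev eth0.10 table 110
```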
<br />
===keepalived.conf===<br />
FIXME: This is fairly stripped down, so might not work out of the box.<br />
<br />
vrrp_sync_group mylvs {<br />
group {<br />
VI_1<br />
VI_2<br />
}<br />
}<br />
vrrp_instance VI_1 {<br />
state BACKUP<br />
interface eth0.10<br />
virtual_router_id 10<br />
priority 100<br />
virtual_ipaddress {<br />
172.16.10.2 # VLAN 10 VIP<br />
1.2.3.4 # Main virtual IP<br />
}<br />
}<br />
vrrp_instance VI_2 {<br />
state BACKUP<br />
interface eth0.20<br />
virtual_router_id 20<br />
priority 150<br />
virtual_ipaddress {<br />
172.16.20.2 # VLAN 20 VIP<br />
1.2.3.4 # Main virtual IP<br />
}<br />
}<br />
virtual_server fwmark 110 {<br />
lb_algo lc<br />
lb_kind DR<br />
persistence_timeout 0<br />
delay_loop 20<br />
protocol TCP<br />
real_server 172.16.10.3 80 {<br />
weight 1<br />
}<br />
real_server 172.16.10.4 80 {<br />
weight 1<br />
}<br />
}<br />
virtual_server fwmark 120 {<br />
lb_algo lc<br />
lb_kind DR<br />
persistence_timeout 0<br />
delay_loop 20<br />
protocol TCP<br />
real_server 172.16.20.3 80 {<br />
weight 1<br />
}<br />
real_server 172.16.20.4 80 {<br />
weight 1<br />
}<br />
}<br />
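<br />
The fwmarks that keepalived schedules on have to be the same ones that<br />
the MARK rules in the network script set on request packets (110 and<br />
120 in start()). A small consistency check can catch a mismatch. This<br />
is only a sketch; the function name keepalived_fwmarks is made up, and<br />
the path mentioned in the comment is an assumption about where the<br />
configuration lives.<br />

```shell
#!/bin/sh
# Hypothetical helper: list the fwmark of every
# "virtual_server fwmark N {" line in a keepalived configuration file.
keepalived_fwmarks() {
    awk '$1 == "virtual_server" && $2 == "fwmark" { print $3 }' "$1"
}

# Example against a throwaway file; on a real node the path would
# typically be /etc/keepalived/keepalived.conf (an assumption).
conf=$(mktemp)
printf 'virtual_server fwmark 110 {\n}\nvirtual_server fwmark 120 {\n}\n' > "$conf"
keepalived_fwmarks "$conf"     # prints 110 and 120, one per line
rm -f "$conf"
```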
<br />
==Tips==<br />
You can use symbolic routing table names instead of numbers (both for<br />
ip rule and ip route) by adding the number-name mapping to<br />
/etc/iproute2/rt_tables. You can then use syntax like "ip rule add ... lookup<br />
v10vlan", and "ip route add ... table v10vlan". For example:<br />
<br />
110 v10vlan<br />
120 v20vlan<br />
210 v10to_peer<br />
220 v20to_peer<br />
<br />
Keep track of fwmark numbers, ip rule preference numbers and routing table<br />
numbers. They live in separate namespaces, so the same number can be reused<br />
across them, but that also makes them easy to mix up. Use a logical scheme.<br />
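<br />
One possible scheme, and the one the scripts in this guide follow, is<br />
to derive every number from the VLAN id by putting a digit in front of<br />
it. The sketch below just prints the resulting numbers; the function<br />
name scheme and the labels it prints are made up for illustration.<br />

```shell
#!/bin/sh
# Hypothetical helper: print the numbers used for one VLAN, following
# the convention of the scripts in this guide (prefix digit + VLAN id).
scheme() {
    vlan=$1
    echo "fwmark_ipvs=1$vlan fwmark_return=2$vlan"
    echo "table_vlan=1$vlan table_to_peer=2$vlan"
    echo "pref_to_peer=1$vlan pref_return=2$vlan pref_fwmark=3$vlan"
}

scheme 10
scheme 20
```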
<br />
==Debian implementation==<br />
The following script is the one I use on my own setup. It is designed to work<br />
on systems that use ifupdown (Debian and Ubuntu, for example). Ifupdown can<br />
call hook scripts on events (like before or after the interface is brought up<br />
or down). In this case it is called after bringing the interface up, and before<br />
bringing it down.<br />
<br />
===/etc/network/interfaces===<br />
====LVS A====<br />
auto lo<br />
iface lo inet loopback<br />
<br />
auto eth0<br />
iface eth0 inet manual<br />
# Load ip_vs here to avoid segfaults in keepalived (it tries to run<br />
# 'modprobe -k'). See https://bugzilla.redhat.com/show_bug.cgi?id=528465<br />
pre-up /sbin/modprobe ip_vs<br />
# Add main virtual IP to lo interface<br />
up /sbin/ip addr add 1.2.3.4/32 dev lo<br />
<br />
# VLAN 10<br />
auto eth0.10<br />
iface eth0.10 inet static<br />
address 172.16.10.3<br />
netmask 255.255.255.0<br />
broadcast 172.16.10.255<br />
gateway 172.16.10.1<br />
<br />
# VLAN 20<br />
auto eth0.20<br />
iface eth0.20 inet static<br />
address 172.16.20.3<br />
netmask 255.255.255.0<br />
broadcast 172.16.20.255<br />
<br />
====LVS B====<br />
auto lo<br />
iface lo inet loopback<br />
<br />
auto eth0<br />
iface eth0 inet manual<br />
# Load ip_vs here to avoid segfaults in keepalived (it tries to run<br />
# 'modprobe -k'). See https://bugzilla.redhat.com/show_bug.cgi?id=528465<br />
pre-up /sbin/modprobe ip_vs<br />
# Add main virtual IP to lo interface<br />
up /sbin/ip addr add 1.2.3.4/32 dev lo<br />
<br />
# VLAN 10<br />
auto eth0.10<br />
iface eth0.10 inet static<br />
address 172.16.10.4<br />
netmask 255.255.255.0<br />
broadcast 172.16.10.255<br />
gateway 172.16.10.1<br />
<br />
# VLAN 20<br />
auto eth0.20<br />
iface eth0.20 inet static<br />
address 172.16.20.4<br />
netmask 255.255.255.0<br />
broadcast 172.16.20.255<br />
<br />
===ifupdown script===<br />
Put this script in /etc/network/if-up.d/, and put a symlink to it in<br />
/etc/network/if-down.d/.<br />
<br />
#!/bin/sh<br />
<br />
# This script is most likely not plug-and-play. Take note of the<br />
# FIXMEs and how the script is intended to work.<br />
<br />
# This script depends on MAC and IP addresses, so we only support the<br />
# following two hostnames.<br />
case `hostname` in<br />
lvsa)<br />
other_mac=00:10:10:10:10:20 # lvsb eth0 mac address<br />
;;<br />
lvsb)<br />
other_mac=00:10:10:10:10:10 # lvsa eth0 mac address<br />
;;<br />
*)<br />
echo >&2 "Skipping unknown hostname: `hostname`"<br />
exit 0<br />
;;<br />
esac<br />
<br />
client_net=10.0.0.0/24<br />
virtual_ips="1.2.3.4" # Multiple allowed, space separated.<br />
virtual_ports="80" # Multiple allowed, space separated.<br />
iface=$IFACE<br />
<br />
case "$iface" in<br />
eth0.10)<br />
index=0<br />
ct_zone=1<br />
my_gw=172.16.10.1<br />
my_ip=172.16.10.3<br />
other_ip=172.16.10.4<br />
test `hostname` = lvsb && {<br />
my_ip=172.16.10.4<br />
other_ip=172.16.10.3<br />
}<br />
;;<br />
eth0.20)<br />
index=1<br />
ct_zone=2<br />
my_gw=172.16.20.1<br />
my_ip=172.16.20.3<br />
other_ip=172.16.20.4<br />
test `hostname` = lvsb && {<br />
my_ip=172.16.20.4<br />
other_ip=172.16.20.3<br />
}<br />
;;<br />
"")<br />
echo >&2 "Skipping empty interface"<br />
exit 0<br />
;;<br />
*)<br />
echo >&2 "Skipping unknown interface: $iface"<br />
exit 0<br />
;;<br />
esac<br />
<br />
# These variables use lots of shortcuts based on the $ct_zone number.<br />
netmap=10.0.${ct_zone}0.0/24 # 10.0.10.0/24, 10.0.20.0/24<br />
rt_vlan_num=1${ct_zone}0 # 110, 120<br />
rt_vlan=v${ct_zone}0vlan # v10vlan, v20vlan<br />
rt_topeer_num=2${ct_zone}0 # 210, 220<br />
rt_topeer=v${ct_zone}0to_peer # v10to_peer, v20to_peer<br />
rp_topeer=1${ct_zone}0 # 110, 120<br />
rp_return=2${ct_zone}0 # 210, 220<br />
rp_fwmark=3${ct_zone}0 # 310, 320<br />
fwm_ipvs=1${ct_zone}0 # 110, 120<br />
fwm_return=2${ct_zone}0 # 210, 220<br />
<br />
start() {<br />
# Add symbolic routing table names<br />
grep -q "^$rt_vlan_num\>" /etc/iproute2/rt_tables ||<br />
printf '%s\t%s\n' "$rt_vlan_num" "$rt_vlan" >> /etc/iproute2/rt_tables<br />
grep -q "^$rt_topeer_num\>" /etc/iproute2/rt_tables ||<br />
printf '%s\t%s\n' "$rt_topeer_num" "$rt_topeer" >> /etc/iproute2/rt_tables<br />
<br />
### Routing policy database (RPDB) entries<br />
# Delete stale entries<br />
ip rule del pref $rp_topeer 2>/dev/null<br />
ip rule del pref $rp_return 2>/dev/null<br />
ip rule del pref $rp_fwmark 2>/dev/null<br />
<br />
# From own IP (due to iptables REDIRECT) to mapped subnet,<br />
# use routing table pointing to peer node.<br />
ip rule add pref $rp_topeer from $my_ip to $netmap lookup $rt_topeer<br />
<br />
# To mapped subnet (if this node has handled the packet),<br />
# use routing table pointing to vlan's gateway.<br />
ip rule add pref $rp_return to $netmap lookup $rt_vlan<br />
<br />
# To original subnet with fwmark (packet has been through ipvs<br />
# and translation, and is on its way back), use routing table<br />
# pointing to vlan's gateway.<br />
ip rule add pref $rp_fwmark to $client_net fwmark $fwm_return lookup $rt_vlan<br />
<br />
### Routing entries<br />
# Flush stale tables<br />
ip route flush table $rt_vlan<br />
ip route flush table $rt_topeer<br />
<br />
# VLAN gateway<br />
ip route add default via $my_gw dev $iface table $rt_vlan<br />
<br />
# Other peer is gateway<br />
ip route add default via $other_ip dev $iface table $rt_topeer<br />
<br />
# Accept local source on interface. This is for packets returning<br />
# from the second node after being ipvs'ed by this node.<br />
sysctl net.ipv4.conf.`echo $iface|tr . /`.accept_local=1<br />
sysctl net.ipv4.conf.`echo $iface|tr . /`.forwarding=1<br />
sysctl net.ipv4.conf.`echo $iface|tr . /`.arp_announce=2<br />
sysctl net.ipv4.conf.`echo $iface|tr . /`.arp_ignore=1<br />
<br />
# Ensure conntrack of ipvs'ed packets. Sourcing /etc/sysctl.conf<br />
# from /etc/init.d/procps happens too early.<br />
sysctl net.ipv4.vs.conntrack=1<br />
<br />
# https://github.com/torvalds/linux/commit/5d0aa2ccd4699a01cfdf14886191c249d7b45a01<br />
# Use separate conntrack zone for each interface. Request packets.<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
iptables -t raw -A PREROUTING -i $iface -s $client_net -d $vip -p tcp --dport $port -j CT --zone $ct_zone<br />
done<br />
done<br />
<br />
# Use separate conntrack zone for each interface. Return packets.<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
for chain in PREROUTING OUTPUT; do<br />
iptables -t raw -A $chain -d $netmap -s $vip -p tcp --sport $port -j CT --zone $ct_zone<br />
done<br />
done<br />
done<br />
<br />
# Mark packets for ipvs scheduling.<br />
for vip in $virtual_ips; do<br />
# FIXME: Use $virtual_ports somehow, and map them to fwmarks?<br />
iptables -t raw -A PREROUTING -i $iface -s $client_net -d $vip -p tcp --dport 80 -j MARK --set-mark $fwm_ipvs<br />
done<br />
<br />
# Mark packets for rpdb return routes. Forwarded from peer.<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
iptables -t raw -A PREROUTING -i $iface -s $vip -d $netmap -p tcp -m tcp --sport $port -j MARK --set-mark $fwm_return<br />
done<br />
done<br />
<br />
# Mark packets for rpdb return routes. From self.<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
iptables -t raw -A OUTPUT -o $iface -s $vip -d $netmap -p tcp -m tcp --sport $port -j MARK --set-mark $fwm_return<br />
done<br />
done<br />
<br />
# Map source network to unique subnet<br />
iptables -t nat -A INPUT -m mark --mark $fwm_ipvs -j NETMAP --to $netmap<br />
iptables -t nat -A POSTROUTING -m mark --mark $fwm_ipvs -j NETMAP --to $netmap<br />
<br />
# DNAT to local iface address if packet is coming from other node<br />
# (after ipvs scheduling). This lets us do correct rpdb+routing for<br />
# return packets.<br />
# FIXME: The mac-source matching should be unnecessary, as the<br />
# source subnet has been translated already? Needs verification.<br />
for vip in $virtual_ips; do<br />
iptables -t nat -A PREROUTING -s $netmap -m mac --mac-source $other_mac -d $vip -j REDIRECT<br />
done<br />
}<br />
<br />
stop() {<br />
# Return packets<br />
ip rule del pref $rp_topeer 2>/dev/null<br />
ip rule del pref $rp_return 2>/dev/null<br />
ip rule del pref $rp_fwmark 2>/dev/null<br />
<br />
# Return routes<br />
ip route flush table $rt_vlan 2>/dev/null<br />
ip route flush table $rt_topeer 2>/dev/null<br />
<br />
# Accept local source on interface. This is for packets returning<br />
# from the second node after being ipvs'ed by this node.<br />
sysctl net.ipv4.conf.`echo $iface|tr . /`.accept_local=0 2>/dev/null<br />
sysctl net.ipv4.conf.`echo $iface|tr . /`.forwarding=0 2>/dev/null<br />
<br />
# https://github.com/torvalds/linux/commit/5d0aa2ccd4699a01cfdf14886191c249d7b45a01<br />
# Use separate conntrack zone for each interface.<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
iptables -t raw -D PREROUTING -i $iface -s $client_net -d $vip -p tcp --dport $port -j CT --zone $ct_zone 2>/dev/null<br />
done<br />
done<br />
<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
for chain in PREROUTING OUTPUT; do<br />
iptables -t raw -D $chain -d $netmap -s $vip -p tcp --sport $port -j CT --zone $ct_zone 2>/dev/null<br />
done<br />
done<br />
done<br />
<br />
# Marks for ipvs<br />
for vip in $virtual_ips; do<br />
# FIXME: Use $virtual_ports somehow, and map them to fwmarks?<br />
iptables -t raw -D PREROUTING -i $iface -s $client_net -d $vip -p tcp --dport 80 -j MARK --set-mark $fwm_ipvs 2>/dev/null<br />
done<br />
<br />
# Marks for rpdb return routes<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
iptables -t raw -D PREROUTING -i $iface -s $vip -d $netmap -p tcp -m tcp --sport $port -j MARK --set-mark $fwm_return 2>/dev/null<br />
done<br />
done<br />
<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
iptables -t raw -D OUTPUT -o $iface -s $vip -d $netmap -p tcp -m tcp --sport $port -j MARK --set-mark $fwm_return 2>/dev/null<br />
done<br />
done<br />
<br />
# Map source network to unique subnet<br />
iptables -t nat -D INPUT -m mark --mark $fwm_ipvs -j NETMAP --to $netmap 2>/dev/null<br />
iptables -t nat -D POSTROUTING -m mark --mark $fwm_ipvs -j NETMAP --to $netmap 2>/dev/null<br />
<br />
# DNAT to local iface address if packet is coming from other node<br />
# (after ipvs scheduling).<br />
for vip in $virtual_ips; do<br />
iptables -t nat -D PREROUTING -s $netmap -m mac --mac-source $other_mac -d $vip -j REDIRECT 2>/dev/null<br />
done<br />
<br />
return 0<br />
}<br />
<br />
case "$MODE" in<br />
start)<br />
start<br />
;;<br />
stop)<br />
stop<br />
;;<br />
esac<br />
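<br />
The `echo $iface|tr . /` construct in the script is there because<br />
sysctl uses the dot as the separator in key names, so a VLAN interface<br />
such as eth0.10 has to be written as eth0/10. A minimal sketch of the<br />
same translation as a function; the name iface_sysctl_key is made up<br />
for illustration:<br />

```shell
#!/bin/sh
# Hypothetical helper: build the sysctl key for a per-interface IPv4
# setting, replacing dots in the interface name with slashes.
iface_sysctl_key() {
    printf 'net.ipv4.conf.%s.%s\n' "$(echo "$1" | tr . /)" "$2"
}

iface_sysctl_key eth0.10 accept_local   # net.ipv4.conf.eth0/10.accept_local
iface_sysctl_key eth0 forwarding        # net.ipv4.conf.eth0.forwarding
```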
<br />
==Troubleshooting==<br />
While developing this setup, I ran into tons of problems. The<br />
following debugging tricks are invaluable when working with complex<br />
network setups.<br />
<br />
===Tcpdump===<br />
The mother of all network debugging. It is very useful here,<br />
especially with some good filters. Always use the -e option so you can<br />
inspect the MAC addresses. They are very important in this sort of<br />
setup.<br />
<br />
Here's a useful example that dumps packets on eth0.10, filtering on<br />
packets to/from port 80, and involving the MAC addresses for either of<br />
the LVS nodes. It also shows ARP and ICMP packets, which is very<br />
useful.<br />
<br />
tcpdump -envi eth0.10 -n port 80 and '( ether host 00:10:10:10:10:10 or ether host 00:10:10:10:10:20 )' or arp or icmp<br />
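<br />
When switching between ports or nodes, it can be handy to generate the<br />
filter instead of editing it by hand. A small sketch, assuming the<br />
same filter layout as above; the function name lvs_filter is made up<br />
for illustration. It only prints the command, it does not run tcpdump.<br />

```shell
#!/bin/sh
# Hypothetical helper: build the capture filter used above for a given
# port and the MAC addresses of the two LVS nodes.
lvs_filter() {
    port=$1 mac_a=$2 mac_b=$3
    echo "port $port and ( ether host $mac_a or ether host $mac_b ) or arp or icmp"
}

filter=$(lvs_filter 80 00:10:10:10:10:10 00:10:10:10:10:20)
echo "tcpdump -envi eth0.10 '$filter'"
```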
<br />
===Iptables logging===<br />
Logging in all netfilter tables and chains is a great way to inspect<br />
how a packet traverses the stack. This script will set up four rules<br />
per chain:<br />
*Request packets towards port 80 in the start of the chain<br />
*Response packets from port 80 in the start of the chain<br />
*Request packets towards port 80 in the end of the chain<br />
*Response packets from port 80 in the end of the chain<br />
<br />
for t in raw mangle nat filter; do<br />
for c in PREROUTING INPUT FORWARD OUTPUT POSTROUTING; do<br />
iptables -t $t -I $c -p tcp --dport 80 -j LOG --log-prefix "REQ-A-`echo $t|cut -b-3`-`echo $c|cut -b-3` " 2>/dev/null<br />
iptables -t $t -I $c -p tcp --sport 80 -j LOG --log-prefix "RES-A-`echo $t|cut -b-3`-`echo $c|cut -b-3` " 2>/dev/null<br />
iptables -t $t -A $c -p tcp --dport 80 -j LOG --log-prefix "REQ-Z-`echo $t|cut -b-3`-`echo $c|cut -b-3` " 2>/dev/null<br />
iptables -t $t -A $c -p tcp --sport 80 -j LOG --log-prefix "RES-Z-`echo $t|cut -b-3`-`echo $c|cut -b-3` " 2>/dev/null<br />
done<br />
done<br />
<br />
Then use something like "tail -f /var/log/kern.log" to track what's<br />
going on.<br />
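<br />
The logging rules should be removed again once the debugging is done.<br />
The sketch below prints matching delete commands (same rule<br />
specification, -D instead of -I or -A) rather than running them, so<br />
they can be reviewed first. The function name log_rule_cleanup is made<br />
up for illustration.<br />

```shell
#!/bin/sh
# Hypothetical helper: print the iptables commands that delete the LOG
# rules created by the loop above. Pipe the output to sh (as root) to
# actually run them; errors for table/chain combinations that do not
# exist can be ignored.
log_rule_cleanup() {
    for t in raw mangle nat filter; do
        for c in PREROUTING INPUT FORWARD OUTPUT POSTROUTING; do
            tag="$(echo $t | cut -b-3)-$(echo $c | cut -b-3)"
            for pos in A Z; do
                echo "iptables -t $t -D $c -p tcp --dport 80 -j LOG --log-prefix \"REQ-$pos-$tag \""
                echo "iptables -t $t -D $c -p tcp --sport 80 -j LOG --log-prefix \"RES-$pos-$tag \""
            done
        done
    done
}

# Show the first two generated commands.
log_rule_cleanup | sed -n 1,2p
```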
<br />
===LVS/ipvs debugging===<br />
For detailed lvs/ipvs debugging, you can check if your kernel is<br />
compiled with CONFIG_IP_VS_DEBUG enabled. If not, you can recompile<br />
the kernel after enabling it. Set the<br />
[http://lxr.linux.no/#linux+v3.0/Documentation/networking/ipvs-sysctl.txt#L27 debug level]<br />
to a suitable number, and you can tail the kernel log to see what's<br />
going on.<br />
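<br />
A minimal sketch of the check, assuming the kernel configuration is<br />
installed as /boot/config-&lt;release&gt; (some distributions only provide<br />
/proc/config.gz instead). The function name is made up for this example.<br />

```shell
#!/bin/sh
# Sketch: report whether a kernel config file enables IPVS debugging.
ipvs_debug_enabled() {
    grep -q '^CONFIG_IP_VS_DEBUG=y' "$1" 2>/dev/null
}

# Assumed location of the running kernel's configuration.
config=/boot/config-$(uname -r)
if ipvs_debug_enabled "$config"; then
    echo "IPVS debug support found in $config"
    echo "raise the level with: sysctl net.ipv4.vs.debug_level=<level>"
else
    echo "CONFIG_IP_VS_DEBUG=y not found in $config"
fi
```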
<br />
===ICMP redirects===<br />
If you leave ICMP redirects enabled, the LVS nodes will teach each<br />
other to send remote response packets directly back to the router,<br />
instead of through the required NETMAP detranslation. To avoid this,<br />
set both the net.ipv4.conf.all.accept_redirects and<br />
net.ipv4.conf.all.send_redirects sysctls to 0.<br />
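<br />
It can help to verify the redirect settings on every interface, not<br />
just 'all'. As far as the kernel's ip-sysctl documentation describes it,<br />
sending redirects stays enabled on an interface if either the 'all' or<br />
the per-interface send_redirects value is set. The sketch below lists<br />
every redirect sysctl that is still non-zero; the directory argument<br />
exists only so the function can be pointed at a copy of the tree.<br />

```shell
#!/bin/sh
# Sketch: list redirect sysctls that are still enabled (non-zero).
redirects_enabled() {
    conf_dir=${1:-/proc/sys/net/ipv4/conf}
    for f in "$conf_dir"/*/accept_redirects "$conf_dir"/*/send_redirects; do
        [ -r "$f" ] || continue          # skip unmatched globs
        [ "$(cat "$f")" = 0 ] || echo "$f"
    done
}

# No output means redirects are off everywhere.
redirects_enabled
```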
<br />
In a particularly long debug session, I couldn't figure out why<br />
packets were being sent directly back to the router, even though redirects<br />
were disabled. Listing the route cache with "ip route show cache"<br />
indicated that the route was flagged with 'redirected'. This turned<br />
out to be due to<br />
[https://github.com/torvalds/linux/commit/f39925dbde7788cfb96419c0f092b086aa325c0f a modification]<br />
where the inet peer cache kept information about a previously learned<br />
redirect (before I disabled them), and propagated that to the route<br />
cache. Instead of "ip route flush cache", I had to reboot the node to<br />
clear the inet peer cache (or wait for it to time out, which could<br />
take a while).<br />
<br />
==Other==<br />
Before finding the 2.6.36 NAT and NETMAP modifications, I experimented<br />
with [http://vde.sourceforge.net/ Virtual Distributed Ethernet]<br />
and the feature that allows you to<br />
[https://github.com/torvalds/linux/commit/5adef1809147a9c39119ffd5a13a1ca4fe23a411 delete/move the local routing table preference]<br />
(2.6.33), looping packets out through a virtual switch and back in<br />
again, but that got even messier.<br />
<br />
==Thanks==<br />
This approach would not be possible without all the recent patches by<br />
Patrick McHardy, and of course the years of ground work in netfilter<br />
and ipvs that it builds upon.</div>Svenxhttp://kb.linuxvirtualserver.org/wiki?title=Two-node_setup_with_overlapping_client_subnets&diff=5960Two-node setup with overlapping client subnets2011-10-19T07:41:06Z<p>Svenx: New article</p>
<hr />
<div>This guide describes how to set up a two-node cluster that handles<br />
multiple overlapping client subnets, and keeps clients uniquely<br />
identifiable, even as they reach the main application.<br />
<br />
==Preamble==<br />
This setup requires kernel version 2.6.37 or newer. In particular, it<br />
depends on the following recent features:<br />
*Accept incoming packets with local source (v2.6.33, [https://github.com/torvalds/linux/commit/8153a10c08f1312af563bb92532002e46d3f504a commit])<br />
*Connection tracking zones (v2.6.34, [https://github.com/torvalds/linux/commit/5d0aa2ccd4699a01cfdf14886191c249d7b45a01 commit])<br />
*Netfilter nat INPUT chain, NETMAP changes (v2.6.36, [https://github.com/torvalds/linux/commit/c68cd6cc21eb329c47ff020ff7412bf58176984e commit])<br />
*Netfilter connection tracking for lvs/ipvs (v2.6.37, [https://github.com/torvalds/linux/commit/f4bc17cdd205ebaa3807c2aa973719bb5ce6a5b2 commit])<br />
<br />
==Introduction==<br />
This setup solves the challenge of serving remote users from several<br />
sites that all use the same subnet, and therefore overlap, on a single<br />
pair of nodes. It also maps each remote site<br />
to a unique IP range so that users can be identified in application<br />
logs, etc. So in short, it combines a two-node, multi-interface lvs<br />
setup with network remapping. It's ultimately a fairly complicated<br />
puzzle.<br />
<br />
==Network infrastructure==<br />
Network diagram:<br />
Main virtual IP: 1.2.3.4 port 80<br />
/-----------\<br />
+------------------------------+ /==( 10.0.0.0/24 )<br />
| Router | // \-----------/<br />
| (VPN) ' ' ' ' ' ' ' ' ' ' '===/ Remote A<br />
| ' (crypto map) |<br />
| ' ' ' ' '===\<br />
| ' ' | \\ /-----------\<br />
| if0.10 if0.20 | \==( 10.0.0.0/24 )<br />
| 172.16.10.1 172.16.20.1 | \-----------/<br />
+------------#-----$-----------+ Remote B<br />
# $<br />
VLAN 10 # $ VLAN 20<br />
# $<br />
VIP 172.16.10.2/24 # $ VIP 172.16.20.2/24<br />
VIP 1.2.3.4/32 # $ VIP 1.2.3.4/32<br />
+------------------+ # $ +------------------+<br />
| LVS A | # $ | LVS B |<br />
| eth0.10 ######### $ ### eth0.10 |<br />
| RIP 172.16.10.3 | $ | RIP 172.16.10.4 |<br />
| | $ | |<br />
| eth0.20 $$$$$$$$$$$$$$$ eth0.20 |<br />
| RIP 172.16.20.3 | | RIP 172.16.20.4 |<br />
| | | |<br />
+------------------+ +------------------+<br />
<br />
===Notes===<br />
While you can use individual network interfaces, using VLANs saves<br />
valuable resources. Combine VLANs with interface bonding to achieve<br />
an even higher degree of resilience against failures.<br />
<br />
The router maps each remote site to its own VLAN. How this is done<br />
isn't really important; the remote sites can be directly connected on<br />
separate egress interfaces, or connected through IPSec VPNs, as in the diagram.<br />
This approach is common in Cisco routers by using VRF-aware IPSec. In<br />
short, a crypto map is defined so that tunnel A is mapped to VRF<br />
(virtual routing and forwarding) instance 10, and tunnel B to VRF 20.<br />
These VRF instances will in turn have separate routing tables,<br />
pointing the virtual IP towards the LVS VIP on each VLAN.<br />
<br />
An obvious and easy solution to the overlapping subnets would be to<br />
have the router do SNAT/masquerading of the incoming packets. In my<br />
case, I spent lots of time trying to get that to work on the Cisco<br />
router, but without luck.<br />
<br />
==Theory of operation==<br />
Assuming that traffic reaches the LVS pair on both VLAN 10 and 20, the<br />
idea is that packets are handled in the following way on each VLAN:<br />
<br />
# Incoming packets coming from the client subnet 10.0.0.0/24, destined for the virtual IP 1.2.3.4 on port 80, are marked with an fwmark using iptables with the MARK target.<br />
# Keepalived/LVS/ipvs is configured to schedule packets based on fwmarks. Mark 110 is loadbalanced to VLAN 10 backends 172.16.10.3 and .4. Mark 120 is loadbalanced to VLAN 20 backends 172.16.20.3 and .4.<br />
# Either on the way ''in'' on the same node, or on the way ''out'' to the other node, the packet's source IP is mapped to a unique subnet using the iptables NETMAP target.<br />
# The packet is handled by the main application, and a response packet is sent back to the source.<br />
# RPDB entries and custom routing tables are set up using iproute2, to ensure that the response packet makes it back the same way it came, through the NETMAP translation and then out the same interface it came in.<br />
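<br />
The remapping step can be illustrated with a little arithmetic. NETMAP<br />
keeps the host part of the address and replaces the network part, so<br />
the same client address on two sites ends up as two distinct addresses.<br />
The sketch below mimics this for /24 networks only; the function name<br />
and the client address 10.0.0.57 are made up for the example.<br />

```shell
#!/bin/sh
# Sketch of what NETMAP does for a /24: keep the host octet, swap the
# network part. Other prefix lengths are not handled here.
netmap24() {
    addr=$1    # original client address, e.g. 10.0.0.57
    net=$2     # target network address, e.g. 10.0.10.0
    echo "${net%.*}.${addr##*.}"
}

netmap24 10.0.0.57 10.0.10.0   # client arriving on VLAN 10
netmap24 10.0.0.57 10.0.20.0   # same address arriving on VLAN 20
```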
<br />
==Proof of concept==<br />
The following script will use iptables and iproute to set the network<br />
up to the required state. It assumes that the network interfaces have<br />
been set up with the basic IP addresses:<br />
<br />
*LVS A, eth0.10: 172.16.10.3/24, default gateway 172.16.10.1<br />
*LVS A, eth0.20: 172.16.20.3/24, no default gateway<br />
*LVS B, eth0.10: 172.16.10.4/24, default gateway 172.16.10.1<br />
*LVS B, eth0.20: 172.16.20.4/24, no default gateway<br />
<br />
===Network configuration script===<br />
#!/bin/sh<br />
<br />
# This proof of concept script is intended to be straightforward to<br />
# read and understand, rather than being cleverly written with<br />
# variables, loops, etc. It is intended to work on both nodes, so some<br />
# conditional variables must be set initially.<br />
<br />
# Unique variables per node, derived from hostname.<br />
case `hostname` in<br />
lvsa)<br />
other_mac=00:10:10:10:10:20 # lvsb eth0 mac address<br />
v10my_ip=172.16.10.3 # lvsa eth0.10 ip addr<br />
v20my_ip=172.16.20.3 # lvsa eth0.20 ip addr<br />
v10other_ip=172.16.10.4 # lvsb eth0.10 ip addr<br />
v20other_ip=172.16.20.4 # lvsb eth0.20 ip addr<br />
;;<br />
lvsb)<br />
other_mac=00:10:10:10:10:10 # lvsa eth0 mac address<br />
v10my_ip=172.16.10.4 # lvsb eth0.10 ip addr<br />
v20my_ip=172.16.20.4 # lvsb eth0.20 ip addr<br />
v10other_ip=172.16.10.3 # lvsa eth0.10 ip addr<br />
v20other_ip=172.16.20.3 # lvsa eth0.20 ip addr<br />
;;<br />
*)<br />
echo >&2 "Unknown host: `hostname`"<br />
exit 1<br />
;;<br />
esac<br />
<br />
start() {<br />
### RPDB (Routing Policy Database)<br />
# Remote response packets (those that will have to be sent to the<br />
# other node) will have source IP equal to the outgoing interface's<br />
# primary address, due to the iptables REDIRECT that rewrites the<br />
# destination address to the incoming interface's primary address on<br />
# incoming packets. The routing tables pointed to have the other<br />
# node as gateway.<br />
ip rule add pref 110 from $v10my_ip to 10.0.10.0/24 lookup 210<br />
ip rule add pref 120 from $v20my_ip to 10.0.20.0/24 lookup 220<br />
<br />
# Local response packets will have been NETMAP detranslated already,<br />
# so the destination will be the untranslated source net. The<br />
# routing tables pointed to have the upstream router as gateway,<br />
# since the packets should be sent straight back to the source.<br />
ip rule add pref 210 to 10.0.10.0/24 lookup 110<br />
ip rule add pref 220 to 10.0.20.0/24 lookup 120<br />
<br />
# Remote response packets are marked so that they are routed out the<br />
# correct interface when sent out.<br />
ip rule add pref 310 to 10.0.0.0/24 fwmark 210 lookup 110<br />
ip rule add pref 320 to 10.0.0.0/24 fwmark 220 lookup 120<br />
<br />
<br />
### Routing<br />
# Routing tables pointing the default gateway to the upstream<br />
# router.<br />
ip route add default via 172.16.10.1 dev eth0.10 table 110<br />
ip route add default via 172.16.20.1 dev eth0.20 table 120<br />
<br />
# Routing tables pointing the default gateway to the other node.<br />
ip route add default via $v10other_ip dev eth0.10 table 210<br />
ip route add default via $v20other_ip dev eth0.20 table 220<br />
<br />
<br />
### Netfilter rules<br />
# Put request packets on each interface into separate connection<br />
# tracking zones. NAT rules applied to packets in one zone, will not<br />
# be touched by other NAT rules that don't apply to that zone. See<br />
# http://lwn.net/Articles/371028/<br />
# https://github.com/torvalds/linux/commit/5d0aa2ccd4699a01cfdf14886191c249d7b45a01<br />
iptables -t raw -A PREROUTING -i eth0.10 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j CT --zone 1<br />
iptables -t raw -A PREROUTING -i eth0.20 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j CT --zone 2<br />
<br />
# Put remote response packets (from the other node) into the<br />
# corresponding connection tracking zones.<br />
iptables -t raw -A PREROUTING -d 10.0.10.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 1<br />
iptables -t raw -A PREROUTING -d 10.0.20.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 2<br />
<br />
# Put local response packets (from this node) into the corresponding<br />
# connection tracking zones.<br />
iptables -t raw -A OUTPUT -d 10.0.10.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 1<br />
iptables -t raw -A OUTPUT -d 10.0.20.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 2<br />
<br />
# Mark incoming packets for lvs/ipvs scheduling. These marks match<br />
# the ones in keepalived.conf.<br />
iptables -t raw -A PREROUTING -i eth0.10 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j MARK --set-mark 110<br />
iptables -t raw -A PREROUTING -i eth0.20 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j MARK --set-mark 120<br />
<br />
# Mark remote response packets (from the other node) so that they<br />
# are routed out the correct interface after NETMAP detranslation.<br />
iptables -t raw -A PREROUTING -i eth0.10 -s 1.2.3.4/32 -d 10.0.10.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 210<br />
iptables -t raw -A PREROUTING -i eth0.20 -s 1.2.3.4/32 -d 10.0.20.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 220<br />
<br />
# Mark local response packets (from this node) so that they are<br />
# routed out the correct interface after NETMAP detranslation.<br />
iptables -t raw -A OUTPUT -o eth0.10 -s 1.2.3.4/32 -d 10.0.10.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 210<br />
iptables -t raw -A OUTPUT -o eth0.20 -s 1.2.3.4/32 -d 10.0.20.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 220<br />
<br />
# Remap remote request packets to unique subnets.<br />
# https://github.com/torvalds/linux/commit/c68cd6cc21eb329c47ff020ff7412bf58176984e<br />
iptables -t nat -A POSTROUTING -m mark --mark 110 -j NETMAP --to 10.0.10.0/24<br />
iptables -t nat -A POSTROUTING -m mark --mark 120 -j NETMAP --to 10.0.20.0/24<br />
<br />
# Remap local request packets to unique subnets.<br />
iptables -t nat -A INPUT -m mark --mark 110 -j NETMAP --to 10.0.10.0/24<br />
iptables -t nat -A INPUT -m mark --mark 120 -j NETMAP --to 10.0.20.0/24<br />
<br />
# Redirect incoming remote request packets so that the destination IP<br />
# is set to the primary address of the incoming interface. This is<br />
# essential since it ensures that the response packet uses that<br />
# address as its source and is routed out the same interface. Without<br />
# it, the routing would select the default route. A potential solution<br />
# would involve connmark, but the mark is applied after the ip rule is<br />
# evaluated.<br />
# FIXME: The mac-source matching should be unnecessary, as the<br />
# source subnet has been translated already? Needs verification.<br />
iptables -t nat -A PREROUTING -s 10.0.10.0/24 -m mac --mac-source $other_mac -d 1.2.3.4 -j REDIRECT<br />
iptables -t nat -A PREROUTING -s 10.0.20.0/24 -m mac --mac-source $other_mac -d 1.2.3.4 -j REDIRECT<br />
<br />
<br />
### Sysctl settings<br />
# Use ARP settings that work with our setup.<br />
# http://lxr.linux.no/#linux+v3.0/Documentation/networking/ip-sysctl.txt#L895<br />
# http://lxr.linux.no/#linux+v3.0/Documentation/networking/ip-sysctl.txt#L926<br />
# http://kb.linuxvirtualserver.org/wiki/ARP_Issues_in_LVS/DR_and_LVS/TUN_Clusters<br />
sysctl net.ipv4.conf.eth0/10.arp_announce=2<br />
sysctl net.ipv4.conf.eth0/10.arp_ignore=1<br />
<br />
# Forwarding must be enabled, although forwarding doesn't apply to<br />
# packets scheduled by lvs/ipvs, it's needed for remote response<br />
# packets. (Forwarding applies to the inbound interface, not the<br />
# outbound.)<br />
sysctl net.ipv4.conf.eth0/10.forwarding=1<br />
<br />
# Accept incoming packets with a local source address. This is<br />
# required, as remote response packets will have the virtual IP as<br />
# their source: That virtual IP is also present as a secondary<br />
# address on the local incoming interface. Normally it would be<br />
# dropped, but this sysctl allows it to be accepted.<br />
# https://github.com/torvalds/linux/commit/8153a10c08f1312af563bb92532002e46d3f504a<br />
# http://lxr.linux.no/#linux+v3.0/Documentation/networking/ip-sysctl.txt#L849<br />
sysctl net.ipv4.conf.eth0/10.accept_local=1<br />
<br />
# Enable connection tracking for lvs/ipvs connections. This lets us<br />
# apply the NETMAP rule in POSTROUTING for remote request packets.<br />
# Without this setting, the netfilter nat table would not be<br />
# traversed by the ipvs'ed packets.<br />
# https://github.com/torvalds/linux/commit/f4bc17cdd205ebaa3807c2aa973719bb5ce6a5b2<br />
# http://lxr.linux.no/#linux+v3.0/net/netfilter/ipvs/Kconfig#L252<br />
sysctl net.ipv4.vs.conntrack=1<br />
<br />
# Disable accepting and sending ICMP redirects. This is essential to<br />
# avoid redirecting remote response packets directly to the router:<br />
# These packets must go through the lvs/ipvs master for correct<br />
# NETMAP detranslation. Setting this for 'all' is enough as long as<br />
# other interfaces have forwarding enabled (which they do).<br />
# http://lxr.linux.no/#linux+v3.0/Documentation/networking/ip-sysctl.txt#L753<br />
sysctl net.ipv4.conf.all.accept_redirects=0<br />
sysctl net.ipv4.conf.all.send_redirects=0<br />
}<br />
<br />
stop() {<br />
# Revert most of the settings from start().<br />
sysctl net.ipv4.conf.eth0/10.accept_local=0<br />
sysctl net.ipv4.conf.eth0/10.forwarding=0<br />
<br />
iptables -t nat -D PREROUTING -s 10.0.20.0/24 -m mac --mac-source $other_mac -d 1.2.3.4 -j REDIRECT<br />
iptables -t nat -D PREROUTING -s 10.0.10.0/24 -m mac --mac-source $other_mac -d 1.2.3.4 -j REDIRECT<br />
iptables -t nat -D INPUT -m mark --mark 120 -j NETMAP --to 10.0.20.0/24<br />
iptables -t nat -D INPUT -m mark --mark 110 -j NETMAP --to 10.0.10.0/24<br />
iptables -t nat -D POSTROUTING -m mark --mark 120 -j NETMAP --to 10.0.20.0/24<br />
iptables -t nat -D POSTROUTING -m mark --mark 110 -j NETMAP --to 10.0.10.0/24<br />
iptables -t raw -D OUTPUT -o eth0.20 -s 1.2.3.4/32 -d 10.0.20.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 220<br />
iptables -t raw -D OUTPUT -o eth0.10 -s 1.2.3.4/32 -d 10.0.10.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 210<br />
iptables -t raw -D PREROUTING -i eth0.20 -s 1.2.3.4/32 -d 10.0.20.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 220<br />
iptables -t raw -D PREROUTING -i eth0.10 -s 1.2.3.4/32 -d 10.0.10.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 210<br />
iptables -t raw -D PREROUTING -i eth0.20 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j MARK --set-mark 120<br />
iptables -t raw -D PREROUTING -i eth0.10 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j MARK --set-mark 110<br />
iptables -t raw -D OUTPUT -d 10.0.20.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 2<br />
iptables -t raw -D OUTPUT -d 10.0.10.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 1<br />
iptables -t raw -D PREROUTING -d 10.0.20.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 2<br />
iptables -t raw -D PREROUTING -d 10.0.10.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 1<br />
iptables -t raw -D PREROUTING -i eth0.20 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j CT --zone 2<br />
iptables -t raw -D PREROUTING -i eth0.10 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j CT --zone 1<br />
<br />
ip route flush table 220<br />
ip route flush table 210<br />
ip route flush table 120<br />
ip route flush table 110<br />
ip rule del pref 320<br />
ip rule del pref 310<br />
ip rule del pref 220<br />
ip rule del pref 210<br />
ip rule del pref 120<br />
ip rule del pref 110<br />
}<br />
<br />
case "$1" in<br />
start|stop)<br />
$1<br />
;;<br />
restart)<br />
stop<br />
start<br />
;;<br />
*)<br />
echo "Usage: $0 start|stop|restart"<br />
;;<br />
esac<br />
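<br />
After running the script with "start", it is worth checking that all<br />
six policy rules are present. The following sketch reads the output of<br />
"ip rule show" on standard input and prints the preference numbers from<br />
the script that are missing; the function name is made up for this<br />
example.<br />

```shell
#!/bin/sh
# Sketch: print every rule preference that start() should have created
# but that does not appear in "ip rule show" output read from stdin.
missing_prefs() {
    rules=$(cat)
    for pref in 110 120 210 220 310 320; do
        # Each rule line starts with "<pref>:".
        printf '%s\n' "$rules" | grep -q "^$pref:" || echo "$pref"
    done
}

# Typical use on a node (needs the iproute2 "ip" command):
#   ip rule show | missing_prefs
```

No output means that every expected rule was found.<br />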
<br />
===keepalived.conf===<br />
FIXME: This is fairly stripped down, so might not work out of the box.<br />
<br />
vrrp_sync_group mylvs {<br />
group {<br />
VI_1<br />
VI_2<br />
}<br />
}<br />
vrrp_instance VI_1 {<br />
state BACKUP<br />
interface eth0.10<br />
virtual_router_id 10<br />
priority 100<br />
virtual_ipaddress {<br />
172.16.10.2 # VLAN 10 VIP<br />
1.2.3.4 # Main virtual IP<br />
}<br />
}<br />
vrrp_instance VI_2 {<br />
state BACKUP<br />
interface eth0.20<br />
virtual_router_id 20<br />
priority 150<br />
virtual_ipaddress {<br />
172.16.20.2 # VLAN 20 VIP<br />
1.2.3.4 # Main virtual IP<br />
}<br />
}<br />
virtual_server fwmark 110 {<br />
lb_algo lc<br />
lb_kind DR<br />
persistence_timeout 0<br />
delay_loop 20<br />
protocol TCP<br />
real_server 172.16.10.3 80 {<br />
weight 1<br />
}<br />
real_server 172.16.10.4 80 {<br />
weight 1<br />
}<br />
}<br />
virtual_server fwmark 120 {<br />
lb_algo lc<br />
lb_kind DR<br />
persistence_timeout 0<br />
delay_loop 20<br />
protocol TCP<br />
real_server 172.16.20.3 80 {<br />
weight 1<br />
}<br />
real_server 172.16.20.4 80 {<br />
weight 1<br />
}<br />
}<br />
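<br />
The fwmarks in keepalived.conf must be the same numbers that iptables<br />
sets on incoming request packets, otherwise ipvs never schedules them.<br />
A small sketch for listing the marks that keepalived will use; the<br />
function name is made up, and the file path in the usage comment is only<br />
the common default.<br />

```shell
#!/bin/sh
# Sketch: print the fwmark of every "virtual_server fwmark N" block in a
# keepalived configuration file, one per line.
conf_fwmarks() {
    awk '$1 == "virtual_server" && $2 == "fwmark" { print $3 }' "$1"
}

# Typical use, then compare by eye with the MARK rules in the raw table
# (newer iptables versions print the marks in hexadecimal):
#   conf_fwmarks /etc/keepalived/keepalived.conf
#   iptables -t raw -S PREROUTING | grep MARK
```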
<br />
==Tips==<br />
You can use symbolic routing table names instead of numbers (both for<br />
ip rule and ip route) by adding the number-name mapping to<br />
/etc/iproute2/rt_tables. You can then use syntax like "ip rule add ... lookup<br />
v0vlan", and "ip route add ... table v0vlan". For example:<br />
<br />
110 v0vlan<br />
120 v1vlan<br />
210 v0to_peer<br />
220 v1to_peer<br />
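<br />
To add such entries from a script without creating duplicates, something<br />
like the following sketch can be used. The function name is made up for<br />
this example; the optional third argument lets you try it on a scratch<br />
file first, since the real file needs root to modify.<br />

```shell
#!/bin/sh
# Sketch: append "number<TAB>name" to an rt_tables file unless a line for
# that number already exists.
add_rt_table() {
    num=$1
    name=$2
    file=${3:-/etc/iproute2/rt_tables}
    grep -q "^$num[[:space:]]" "$file" 2>/dev/null ||
        printf '%s\t%s\n' "$num" "$name" >> "$file"
}

# Typical use:
#   add_rt_table 110 v0vlan
#   add_rt_table 210 v0to_peer
```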
<br />
Keep track of fwmark numbers and ip rule preference numbers. The two<br />
number spaces are independent, so they can overlap, but use a logical<br />
scheme so that you don't mix them up.<br />
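<br />
One possible scheme, sketched below, derives every number from a single<br />
digit per VLAN (1 for VLAN 10, 2 for VLAN 20). It mirrors the numbers<br />
used in the scripts in this article; the function and label names are<br />
made up for the example.<br />

```shell
#!/bin/sh
# Sketch: derive all table, preference and fwmark numbers for one VLAN
# from a single zone digit.
scheme() {
    z=$1
    echo "table_vlan=1${z}0 table_peer=2${z}0"
    echo "pref_peer=1${z}0 pref_return=2${z}0 pref_fwmark=3${z}0"
    echo "fwmark_ipvs=1${z}0 fwmark_return=2${z}0"
}

scheme 1   # VLAN 10
scheme 2   # VLAN 20
```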
<br />
==Debian implementation==<br />
The following script is the one I use on my own setup. It is designed to work<br />
on systems that use ifupdown (Debian and Ubuntu, for example). Ifupdown can<br />
call hook scripts on events (like before or after the interface is brought up<br />
or down). In this case it is called after bringing the interface up, and before<br />
bringing it down.<br />
<br />
===/etc/network/interfaces===<br />
====LVS A====<br />
auto lo<br />
iface lo inet loopback<br />
<br />
auto eth0<br />
iface eth0 inet manual<br />
# Load ip_vs here to avoid segfaults in keepalived (it tries to do<br />
# 'modprobe -k'). See https://bugzilla.redhat.com/show_bug.cgi?id=528465<br />
pre-up /sbin/modprobe ip_vs<br />
# Add main virtual IP to lo interface<br />
up /sbin/ip addr add 1.2.3.4/32 dev lo<br />
<br />
# VLAN 10<br />
auto eth0.10<br />
iface eth0.10 inet static<br />
address 172.16.10.3<br />
netmask 255.255.255.0<br />
broadcast 172.16.10.255<br />
gateway 172.16.10.1<br />
<br />
# VLAN 20<br />
auto eth0.20<br />
iface eth0.20 inet static<br />
address 172.16.20.3<br />
netmask 255.255.255.0<br />
broadcast 172.16.20.255<br />
<br />
====LVS B====<br />
auto lo<br />
iface lo inet loopback<br />
<br />
auto eth0<br />
iface eth0 inet manual<br />
# Load ip_vs here to avoid segfaults in keepalived (it tries to do<br />
# 'modprobe -k'). See https://bugzilla.redhat.com/show_bug.cgi?id=528465<br />
pre-up /sbin/modprobe ip_vs<br />
# Add main virtual IP to lo interface<br />
up /sbin/ip addr add 1.2.3.4/32 dev lo<br />
<br />
# VLAN 10<br />
auto eth0.10<br />
iface eth0.10 inet static<br />
address 172.16.10.4<br />
netmask 255.255.255.0<br />
broadcast 172.16.10.255<br />
gateway 172.16.10.1<br />
<br />
# VLAN 20<br />
auto eth0.20<br />
iface eth0.20 inet static<br />
address 172.16.20.4<br />
netmask 255.255.255.0<br />
broadcast 172.16.20.255<br />
<br />
===ifupdown script===<br />
Put this script in /etc/network/if-up.d/, and put a symlink to it in<br />
/etc/network/if-down.d/.<br />
<br />
#!/bin/sh<br />
<br />
# This script is most likely not plug-and-play. Take note of the<br />
# FIXMEs and how the script is intended to work.<br />
<br />
# This script depends on MAC and IP addresses, so we only support the<br />
# following two hostnames.<br />
case `hostname` in<br />
lvsa)<br />
other_mac=00:10:10:10:10:20 # lvsb eth0 mac address<br />
;;<br />
lvsb)<br />
other_mac=00:10:10:10:10:10 # lvsa eth0 mac address<br />
;;<br />
*)<br />
echo >&2 "Skipping unknown hostname: `hostname`"<br />
exit 0<br />
;;<br />
esac<br />
<br />
client_net=10.0.0.0/24<br />
virtual_ips="1.2.3.4" # Multiple allowed, space separated.<br />
virtual_ports="80" # Multiple allowed, space separated.<br />
iface=$IFACE<br />
<br />
case "$iface" in<br />
eth0.10)<br />
index=0<br />
ct_zone=1<br />
my_gw=172.16.10.1<br />
my_ip=172.16.10.3<br />
other_ip=172.16.10.4<br />
test `hostname` = lvsb && {<br />
my_ip=172.16.10.4<br />
other_ip=172.16.10.3<br />
}<br />
;;<br />
eth0.20)<br />
index=1<br />
ct_zone=2<br />
my_gw=172.16.20.1<br />
my_ip=172.16.20.3<br />
other_ip=172.16.20.4<br />
test `hostname` = lvsb && {<br />
my_ip=172.16.20.4<br />
other_ip=172.16.20.3<br />
}<br />
;;<br />
"")<br />
echo >&2 "Skipping empty interface"<br />
exit 0<br />
;;<br />
*)<br />
echo >&2 "Skipping unknown interface: $iface"<br />
exit 0<br />
;;<br />
esac<br />
<br />
# These variables use lots of shortcuts based on the $ct_zone number.<br />
netmap=10.0.${ct_zone}0.0/24 # 10.0.10.0/24, 10.0.20.0/24<br />
rt_vlan_num=1${ct_zone}0 # 110, 120<br />
rt_vlan=v${ct_zone}0vlan # v10vlan, v20vlan<br />
rt_topeer_num=2${ct_zone}0 # 210, 220<br />
rt_topeer=v${ct_zone}0to_peer # v10to_peer, v20to_peer<br />
rp_topeer=1${ct_zone}0 # 110, 120<br />
rp_return=2${ct_zone}0 # 210, 220<br />
rp_fwmark=3${ct_zone}0 # 310, 320<br />
fwm_ipvs=1${ct_zone}0 # 110, 120<br />
fwm_return=2${ct_zone}0 # 210, 220<br />
<br />
start() {<br />
# Add symbolic routing table names<br />
grep -q "^$rt_vlan_num\>" /etc/iproute2/rt_tables ||<br />
printf '%s\t%s\n' "$rt_vlan_num" "$rt_vlan" >> /etc/iproute2/rt_tables<br />
grep -q "^$rt_topeer_num\>" /etc/iproute2/rt_tables ||<br />
printf '%s\t%s\n' "$rt_topeer_num" "$rt_topeer" >> /etc/iproute2/rt_tables<br />
<br />
### Routing policy database (RPDB) entries<br />
# Delete stale entries<br />
ip rule del pref $rp_topeer 2>/dev/null<br />
ip rule del pref $rp_return 2>/dev/null<br />
ip rule del pref $rp_fwmark 2>/dev/null<br />
<br />
# From own IP (due to iptables REDIRECT) to mapped subnet,<br />
# use routing table pointing to peer node.<br />
ip rule add pref $rp_topeer from $my_ip to $netmap lookup $rt_topeer<br />
<br />
# To mapped subnet (if this node has handled the packet),<br />
# use routing table pointing to vlan's gateway.<br />
ip rule add pref $rp_return to $netmap lookup $rt_vlan<br />
<br />
# To original subnet with fwmark (packet has been through ipvs<br />
# and translation, and is on its way back), use routing table<br />
# pointing to vlan's gateway.<br />
ip rule add pref $rp_fwmark to $client_net fwmark $fwm_return lookup $rt_vlan<br />
<br />
### Routing entries<br />
# Flush stale tables<br />
ip route flush table $rt_vlan<br />
ip route flush table $rt_topeer<br />
<br />
# VLAN gateway<br />
ip route add default via $my_gw dev $iface table $rt_vlan<br />
<br />
# Other peer is gateway<br />
ip route add default via $other_ip dev $iface table $rt_topeer<br />
<br />
# Accept local source on interface. This is for packets returning<br />
# from the second node after being ipvs'ed by this node.<br />
sysctl net.ipv4.conf.`echo $iface|tr . /`.accept_local=1<br />
sysctl net.ipv4.conf.`echo $iface|tr . /`.forwarding=1<br />
sysctl net.ipv4.conf.`echo $iface|tr . /`.arp_announce=2<br />
sysctl net.ipv4.conf.`echo $iface|tr . /`.arp_ignore=1<br />
<br />
# Ensure conntrack of ipvs'ed packets. Sourcing /etc/sysctl.conf<br />
# from /etc/init.d/procps happens too early.<br />
sysctl net.ipv4.vs.conntrack=1<br />
<br />
# https://github.com/torvalds/linux/commit/5d0aa2ccd4699a01cfdf14886191c249d7b45a01<br />
# Use separate conntrack zone for each interface. Request packets.<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
iptables -t raw -A PREROUTING -i $iface -s $client_net -d $vip -p tcp --dport $port -j CT --zone $ct_zone<br />
done<br />
done<br />
<br />
# Use separate conntrack zone for each interface. Return packets.<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
for chain in PREROUTING OUTPUT; do<br />
iptables -t raw -A $chain -d $netmap -s $vip -p tcp --sport $port -j CT --zone $ct_zone<br />
done<br />
done<br />
done<br />
<br />
# Mark packets for ipvs scheduling.<br />
for vip in $virtual_ips; do<br />
# FIXME: Use $virtual_ports somehow, and map them to fwmarks?<br />
iptables -t raw -A PREROUTING -i $iface -s $client_net -d $vip -p tcp --dport 80 -j MARK --set-mark $fwm_ipvs<br />
done<br />
<br />
# Mark packets for rpdb return routes. Forwarded from peer.<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
iptables -t raw -A PREROUTING -i $iface -s $vip -d $netmap -p tcp -m tcp --sport $port -j MARK --set-mark $fwm_return<br />
done<br />
done<br />
<br />
# Mark packets for rpdb return routes. From self.<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
iptables -t raw -A OUTPUT -o $iface -s $vip -d $netmap -p tcp -m tcp --sport $port -j MARK --set-mark $fwm_return<br />
done<br />
done<br />
<br />
# Map source network to unique subnet<br />
iptables -t nat -A INPUT -m mark --mark $fwm_ipvs -j NETMAP --to $netmap<br />
iptables -t nat -A POSTROUTING -m mark --mark $fwm_ipvs -j NETMAP --to $netmap<br />
<br />
# DNAT to local iface address if packet is coming from other node<br />
# (after ipvs scheduling). This lets us do correct rpdb+routing for<br />
# return packets.<br />
# FIXME: The mac-source matching should be unnecessary, as the<br />
# source subnet has been translated already? Needs verification.<br />
for vip in $virtual_ips; do<br />
iptables -t nat -A PREROUTING -s $netmap -m mac --mac-source $other_mac -d $vip -j REDIRECT<br />
done<br />
}<br />
<br />
stop() {<br />
# Return packets<br />
ip rule del pref $rp_topeer 2>/dev/null<br />
ip rule del pref $rp_return 2>/dev/null<br />
ip rule del pref $rp_fwmark 2>/dev/null<br />
<br />
# Return routes<br />
ip route flush table $rt_vlan 2>/dev/null<br />
ip route flush table $rt_topeer 2>/dev/null<br />
<br />
# Accept local source on interface. This is for packets returning<br />
# from the second node after being ipvs'ed by this node.<br />
sysctl net.ipv4.conf.`echo $iface|tr . /`.accept_local=0 2>/dev/null<br />
sysctl net.ipv4.conf.`echo $iface|tr . /`.forwarding=0 2>/dev/null<br />
<br />
# https://github.com/torvalds/linux/commit/5d0aa2ccd4699a01cfdf14886191c249d7b45a01<br />
# Use separate conntrack zone for each interface.<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
iptables -t raw -D PREROUTING -i $iface -s $client_net -d $vip -p tcp --dport $port -j CT --zone $ct_zone 2>/dev/null<br />
done<br />
done<br />
<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
for chain in PREROUTING OUTPUT; do<br />
iptables -t raw -D $chain -d $netmap -s $vip -p tcp --sport $port -j CT --zone $ct_zone 2>/dev/null<br />
done<br />
done<br />
done<br />
<br />
# Marks for ipvs<br />
for vip in $virtual_ips; do<br />
# FIXME: Use $virtual_ports somehow, and map them to fwmarks?<br />
iptables -t raw -D PREROUTING -i $iface -s $client_net -d $vip -p tcp --dport 80 -j MARK --set-mark $fwm_ipvs 2>/dev/null<br />
done<br />
<br />
# Marks for rpdb return routes<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
iptables -t raw -D PREROUTING -i $iface -s $vip -d $netmap -p tcp -m tcp --sport $port -j MARK --set-mark $fwm_return 2>/dev/null<br />
done<br />
done<br />
<br />
for vip in $virtual_ips; do<br />
for port in $virtual_ports; do<br />
iptables -t raw -D OUTPUT -o $iface -s $vip -d $netmap -p tcp -m tcp --sport $port -j MARK --set-mark $fwm_return 2>/dev/null<br />
done<br />
done<br />
<br />
# Map source network to unique subnet<br />
iptables -t nat -D INPUT -m mark --mark $fwm_ipvs -j NETMAP --to $netmap 2>/dev/null<br />
iptables -t nat -D POSTROUTING -m mark --mark $fwm_ipvs -j NETMAP --to $netmap 2>/dev/null<br />
<br />
# DNAT to local iface address if packet is coming from other node<br />
# (after ipvs scheduling).<br />
for vip in $virtual_ips; do<br />
iptables -t nat -D PREROUTING -s $netmap -m mac --mac-source $other_mac -d $vip -j REDIRECT 2>/dev/null<br />
done<br />
<br />
return 0<br />
}<br />
<br />
case "$MODE" in<br />
start)<br />
start<br />
;;<br />
stop)<br />
stop<br />
;;<br />
esac<br />
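<br />
One detail in the script that is easy to miss is the "tr . /" in the<br />
sysctl calls. Sysctl uses the dot as its own path separator, so the dot<br />
in a VLAN interface name such as eth0.10 has to be written as a slash.<br />
A minimal sketch of that conversion, with a made-up function name:<br />

```shell
#!/bin/sh
# Sketch: build the sysctl key for a per-interface IPv4 setting. Dots in
# the interface name become slashes.
sysctl_key() {
    echo "net.ipv4.conf.$(echo "$1" | tr . /).$2"
}

sysctl_key eth0.10 accept_local
sysctl_key eth0.20 forwarding
```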
<br />
==Troubleshooting==<br />
While developing this setup, I ran into tons of problems. The<br />
following debugging tricks are invaluable when working with complex<br />
network setups.<br />
<br />
===Tcpdump===<br />
The mother of all network debugging. It is very useful here,<br />
especially with some good filters. Always use the -e option so you can<br />
inspect the MAC addresses. They are very important in this sort of<br />
setup.<br />
<br />
Here's a useful example that dumps packets on eth0.10, filtering on<br />
packets to/from port 80, and involving the MAC addresses for either of<br />
the LVS nodes. It also shows ARP and ICMP packets, which is very<br />
useful.<br />
<br />
tcpdump -envi eth0.10 port 80 and '( ether host 00:10:10:10:10:10 or ether host 00:10:10:10:10:20 )' or arp or icmp<br />
<br />
===Iptables logging===<br />
Logging in all netfilter tables and chains is a great way to inspect<br />
how a packet traverses the stack. This script sets up four rules<br />
per chain:<br />
*Request packets towards port 80 at the start of the chain<br />
*Response packets from port 80 at the start of the chain<br />
*Request packets towards port 80 at the end of the chain<br />
*Response packets from port 80 at the end of the chain<br />
<br />
for t in raw mangle nat filter; do<br />
  for c in PREROUTING INPUT FORWARD OUTPUT POSTROUTING; do<br />
    # Three-letter table/chain tag, e.g. "man-PRE"<br />
    tag="$(echo $t|cut -b-3)-$(echo $c|cut -b-3)"<br />
    # -I puts the "A" rules at the start of the chain, -A puts the "Z" rules at the end.<br />
    # Errors from table/chain pairs that do not exist are discarded.<br />
    iptables -t $t -I $c -p tcp --dport 80 -j LOG --log-prefix "REQ-A-$tag " 2>/dev/null<br />
    iptables -t $t -I $c -p tcp --sport 80 -j LOG --log-prefix "RES-A-$tag " 2>/dev/null<br />
    iptables -t $t -A $c -p tcp --dport 80 -j LOG --log-prefix "REQ-Z-$tag " 2>/dev/null<br />
    iptables -t $t -A $c -p tcp --sport 80 -j LOG --log-prefix "RES-Z-$tag " 2>/dev/null<br />
  done<br />
done<br />
<br />
Then use something like "tail -f /var/log/kern.log" to track what's<br />
going on.<br />
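<br />
Once debugging is done, the LOG rules should be removed again. The following is a minimal sketch, assuming the rules were added exactly as in the loop above (same matches and log prefixes). The function name remove_log_rules and the IPTABLES override variable are illustrative and not part of the original script.<br />
<br />
```shell
# Command used to talk to netfilter; override with IPTABLES=echo for a dry run.
IPTABLES="${IPTABLES:-iptables}"

# Delete the four LOG rules per table/chain that the logging loop created.
# Each -D removes one matching rule, so run this once per time the loop was run.
remove_log_rules() {
  for t in raw mangle nat filter; do
    for c in PREROUTING INPUT FORWARD OUTPUT POSTROUTING; do
      tag="$(echo $t | cut -b-3)-$(echo $c | cut -b-3)"
      for pos in A Z; do
        $IPTABLES -t $t -D $c -p tcp --dport 80 -j LOG --log-prefix "REQ-$pos-$tag " 2>/dev/null
        $IPTABLES -t $t -D $c -p tcp --sport 80 -j LOG --log-prefix "RES-$pos-$tag " 2>/dev/null
      done
    done
  done
  return 0
}
```
<br />
Call remove_log_rules as root. Setting IPTABLES=echo beforehand prints the commands instead of executing them.<br />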
<br />
===LVS/ipvs debugging===<br />
For detailed lvs/ipvs debugging, check whether your kernel was<br />
compiled with CONFIG_IP_VS_DEBUG enabled. If not, enable it and<br />
recompile the kernel. Then set the<br />
[http://lxr.linux.no/#linux+v3.0/Documentation/networking/ipvs-sysctl.txt#L27 debug level]<br />
(the net.ipv4.vs.debug_level sysctl) to a suitable number, and tail<br />
the kernel log to see what's going on.<br />
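<br />
A quick way to perform this check is sketched below. It assumes the kernel config is available as /boot/config-$(uname -r) or /proc/config.gz, which depends on the distribution. The helper name ipvs_debug_enabled and the debug level 3 in the usage comment are arbitrary choices for illustration.<br />
<br />
```shell
# Succeed if the given kernel config file has CONFIG_IP_VS_DEBUG=y.
# Accepts plain files (/boot/config-*) and gzipped ones (/proc/config.gz).
ipvs_debug_enabled() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac | grep -q '^CONFIG_IP_VS_DEBUG=y'
}

# Example usage (as root):
#   ipvs_debug_enabled "/boot/config-$(uname -r)" && sysctl -w net.ipv4.vs.debug_level=3
#   tail -f /var/log/kern.log
```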
<br />
===ICMP redirects===<br />
If you leave ICMP redirects enabled, the LVS nodes will teach each<br />
other to send remote response packets directly back to the router,<br />
instead of through the required NETMAP detranslation. To avoid this,<br />
set the net.ipv4.conf.all.accept_redirects sysctl to 0.<br />
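<br />
Per-interface settings can also influence whether redirects are accepted, so it may help to list every entry that still has a non-zero accept_redirects value. This is a minimal sketch. The function name list_accept_redirects and the CONF_DIR override (only meant for dry runs against a test directory) are illustrative.<br />
<br />
```shell
# Print the name of every conf entry (all, default, eth0, ...) whose
# accept_redirects value is not 0. CONF_DIR defaults to the real /proc path.
list_accept_redirects() {
  dir="${CONF_DIR:-/proc/sys/net/ipv4/conf}"
  for f in "$dir"/*/accept_redirects; do
    [ -r "$f" ] || continue
    if [ "$(cat "$f")" != "0" ]; then
      basename "$(dirname "$f")"
    fi
  done
}

# Example usage (as root):
#   sysctl -w net.ipv4.conf.all.accept_redirects=0
#   list_accept_redirects
```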
<br />
In a particularly long debug session, I couldn't figure out why<br />
packets were still being sent directly back to the router, even though<br />
redirects were disabled. Listing the route cache with "ip route show cache"<br />
showed that the route was flagged as 'redirected'. This turned<br />
out to be due to<br />
[https://github.com/torvalds/linux/commit/f39925dbde7788cfb96419c0f092b086aa325c0f a modification]<br />
where the inet peer cache kept information about a previously learned<br />
redirect (from before I disabled them), and propagated that to the route<br />
cache. Running "ip route flush cache" was not enough; I had to reboot the node to<br />
clear the inet peer cache (or wait for it to time out, which could<br />
take a while).<br />
<br />
==Other==<br />
Before finding the 2.6.36 NAT and NETMAP modifications, I experimented<br />
with [http://vde.sourceforge.net/ Virtual Distributed Ethernet]<br />
and the feature that allows you to<br />
[https://github.com/torvalds/linux/commit/5adef1809147a9c39119ffd5a13a1ca4fe23a411 delete/move the local routing table preference]<br />
(2.6.33), looping packets out through a virtual switch and back in<br />
again, but that got even messier.<br />
<br />
==Thanks==<br />
This approach would not be possible without all the recent patches by<br />
Patrick McHardy, and of course the years of groundwork in netfilter<br />
and ipvs that it builds upon.</div>Svenxhttp://kb.linuxvirtualserver.org/wiki?title=Examples&diff=5959Examples2011-10-19T07:37:14Z<p>Svenx: Link to two-node setup with overlapping client subnets</p>
<hr />
<div>This page collects example designs that use LVS, so please feel free to describe your LVS systems here and share them with other LVS users.<br />
<br />
== Web Cluster ==<br />
<br />
* [[Building Scalable Web Cluster using LVS]]<br />
* [[Building Tomcat Cluster using LVS]]<br />
* [[Building Ruby on Rails Cluster using LVS]]<br />
* [[Building Web Cache Cluster using LVS]]<br />
* [[Building clusterized proxy farms using LVS]]<br />
<br />
== Linux/Unix Cluster ==<br />
<br />
* [[Building Scalable Mail Cluster using LVS]]<br />
* [[Building Scalable FTP Cluster using LVS]]<br />
* [[Building Scalable TFTP Cluster using LVS]]<br />
* [[Building MySQL Cluster using LVS]]<br />
* [[Building Scalable DNS Cluster using LVS]]<br />
* [[Building Two-Node Directors/Real Servers using LVS and Keepalived]]<br />
* [[Building an LDAP cluster using LVS and NetWare real servers]]<br />
* [[Building Scalable DHCP Cluster using LVS]]<br />
* [[LVS/TUN mode with FreeBSD and Solaris realserver]]<br />
* [[Two-node setup with overlapping client subnets]]<br />
<br />
== Media Service Cluster ==<br />
<br />
* [[Building Scalable Media Cluster using LVS]]<br />
* [[Building Windows Media Service Cluster using LVS]]<br />
* [[Building Darwin Streaming Service Cluster using LVS]]<br />
* [[Building Helix Server Cluster using LVS]]<br />
* [http://www.freebsdcluster.org/~lasse/icecast-lvs-cluster-howto/ Building a streaming cluster with Icecast, LVS and other cool apps]<br />
<br />
== Terminal Service Cluster ==<br />
<br />
* [[Building Linux Terminal Service Cluster using LVS]]<br />
* [[Building Windows Terminal Service Cluster using LVS]]<br />
<br />
[[Category:LVS Handbook]]</div>Svenx