Loadbalancer-less clusters on Linux

When you think about implementing virtual servers, either to cope with a higher load or to provide enhanced reliability, you face a new problem: how do you avoid making the load balancers (or directors) the bottleneck and the single point of failure for the whole architecture? Traditional load balancing also carries an extra cost, since you need to add one more server (or better, two for redundancy) just for the load balancer function itself.

The solution is simple: don’t use load balancers at all.

Clusterip is a relatively new iptables extension, written by Harald Welte, that allows you to configure server farms (or clusters) without load balancers (directors, in Linux Virtual Server jargon). The clusterip module is included with the latest 2.6 kernels, so it is present right out of the box in most modern Linux distributions.

The idea behind clusterip is simple: all servers in the farm present a common Ethernet MAC address (which is a multicast MAC address) for the virtual IP address (VIP), so ARP requests for this VIP will be answered by any node in the cluster using this common MAC address. The node handling any given IP packet is determined by a hashing algorithm that we'll review in a moment.

Clusterip is actually an iptables target extension. It supports a few parameters:

  • --new
    Create a new ClusterIP. You always have to set this on the first rule for a given ClusterIP.

  • --hashmode mode
    Specify the hashing mode. Has to be one of sourceip, sourceip-sourceport, sourceip-sourceport-destport.

  • --clustermac mac
    Specify the ClusterIP MAC address. Has to be a link-layer multicast address.

  • --total-nodes num
    Number of total nodes within this cluster.

  • --local-node num
    Local node number within this cluster.

While some of the parameters are self explanatory, others may require some discussion.

Hashmode specifies how requests will be distributed among the different nodes: sourceip assigns all traffic coming from a single IP to a single server in the farm. This means that if thousands of requests come from a single IP (e.g. a proxy server), they will all be assigned to the same server, so the traffic distribution will be less than optimal. Sourceip-sourceport and sourceip-sourceport-destport provide a more even distribution of traffic, but require more memory to hold larger hash tables.

Clustermac determines the virtual MAC address that will be used to answer ARP requests. The only requirements are that it must be the same on all nodes of the cluster, and that it must belong to the range of multicast Ethernet MAC addresses. A multicast MAC address is indicated by the low-order bit of the first byte (which, by the way, is the first one on the wire). If the servers are connected to an Ethernet switch, using a multicast MAC address forces the switch to send these frames to all ports.
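
If you want to check whether a given MAC address has that multicast bit set before using it, a quick shell test is enough (a minimal sketch, using the example address from the configuration below):

mac=01:23:45:67:89:AB
first_octet=$(( 0x${mac%%:*} ))              # take the first byte, e.g. 0x01 -> 1
if (( first_octet & 1 )); then
    echo "$mac has the multicast bit set"    # low-order bit of the first byte is 1
else
    echo "$mac is a unicast address"
fi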

An example configuration for a two-node cluster would be:


Node 1
iptables -A INPUT -d 192.168.1.1 -i eth0 -p tcp --dport 80 -j CLUSTERIP --new --hashmode sourceip --clustermac 01:23:45:67:89:AB --total-nodes 2 --local-node 1


Node 2
iptables -A INPUT -d 192.168.1.1 -i eth0 -p tcp --dport 80 -j CLUSTERIP --new --hashmode sourceip --clustermac 01:23:45:67:89:AB --total-nodes 2 --local-node 2

In our example, 192.168.1.1 is the virtual IP address (VIP), we load balance HTTP traffic to port 80 (web servers), our hashing algorithm is based solely on the source IP address (not as even in traffic distribution, but frugal in memory requirements), the multicast MAC address is 01:23:45:67:89:AB (the low-order bit of the first byte must be set, so the first byte must be an odd number), and we have two nodes. Each node receives its own node number under --local-node.

If you execute cat /proc/net/ipt_CLUSTERIP/192.168.1.1 on one of the nodes, it will return that node's number. This is more than just an identifier: it is what actually makes the node handle the requests assigned to that node number.

If one of the nodes dies (let's say node 1), you will want another node (in this case node 2) to answer those queries, so on node 2 you execute: echo "+1" > /proc/net/ipt_CLUSTERIP/192.168.1.1.

Now if you look at /proc/net/ipt_CLUSTERIP/192.168.1.1 on node 2, you'll see that this node now responds to queries for both node numbers (2,1).
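
When node 1 comes back, the same /proc interface lets you hand the node number back (a sketch based on the "+n"/"-n" syntax of the CLUSTERIP proc file): on node 2 run echo "-1" > /proc/net/ipt_CLUSTERIP/192.168.1.1 so it stops answering for node 1, and on node 1 run echo "+1" > /proc/net/ipt_CLUSTERIP/192.168.1.1 so it claims its own number again.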

So, clusterip gives you the means to build load-balancer-less clusters, avoiding the bottleneck and single point of failure that a load balancer may represent, and it does so with high availability in mind. But are there any caveats?

The first caveat is that clusterip is still marked experimental, which means it may work wonders for you, or it may not work at all (nasty bugs may, and will, be present). In addition, due to a recent patch to clusterip, the version in the latest kernels got out of sync with the userland tools, so some combinations of kernel and iptables won't work (you need either an older userland iptables or a very recent kernel).

Regardless of the caveats and warnings, rest assured that the need for something like clusterip on Linux will bring enough testers to squash the current bugs very soon.

39 Responses to “Loadbalancer-less clusters on Linux”

  1. gonad Says:

    Thanks for the write up, interesting stuff :)

  2. albert Says:

    In the past we tried something like this using the DNS server. We created a general name (e.g. ‘cluster’) that alternately returned the IP address of one of the machines (i.e. with two machines, the first DNS request gets ‘mach1’, the second ‘mach2’, the third ‘mach1’ again, etc.).
    While this worked, we ran into trouble when using ‘cluster’ as the name to log in with ssh. It appears that ssh stores the combination of the name (‘cluster’) and the host key, so as soon as you get a different machine than the first time, ssh refuses to log in due to an incorrect host key.

    Would clusterip also suffer from this problem?

    Albert

  3. raul saura Says:

    This sounds like the stonebeat way of balancing connections.

    Caution must be taken to avoid loops when multiple layer-3 switches are involved.

  4. flavio Says:

    Albert, you’re correct: the SSH client authenticates the server, so it will refuse the connection the second time if the server is a different box.

    That’s why each node must have its own IP address, in addition to the VIP. SSH connections must use the node’s own IP address to avoid this problem.

    Flavio

  5. albert Says:

    I completely agree that a better solution may exist; however, load distribution is not really the goal for us.
    With user ssh connections the load is not very high, so how even the distribution actually is doesn’t matter much; what matters is that there is a single name everybody can use (so we can tell students: use ‘cluster’ for your assignment). This would be a (very cheap) step forward compared to the current hard-coded assignment of students to machines.

    I don’t entirely understand the danger of loops; the DNS server returns a different machine’s IP address each time, so how does that create a loop? (Maybe I should have stated this more clearly.)

    Do ssh implementations exist that can deal with changing real IP addresses behind a (virtual) name, or would such an implementation defeat ssh itself?
    (With the reasoning that all information comes from external sources, which implies that nothing can be trusted any more, which makes ssh useless as a mechanism for trusted connections.)

    Albert

  6. Jan Says:

    Ok, so if I get this right, the node number is used by the node to figure out for itself whether it’s supposed to reply to the incoming, broadcast request?

    And, if a node goes down and “nothing” is done about it, wouldn’t you lose the requests coming in to that node, since none of the other nodes would respond to them? I.e. you need an outside mechanism for sensing that a node is dead and telling another node that it is now supposed to handle that node’s requests?

    Jan

  7. flavio Says:

    Jan, yes to both of your questions:

    The node number is used to determine whether the node needs to respond to a particular request.

    If a node dies, you need some external system to add that node number to one of the remaining nodes, otherwise requests going to the dead node won’t be answered.
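
    A very rough sketch of such an external check, run periodically on node 2 (the peer address 192.168.1.11 is hypothetical; the VIP is the one from the article):

    #!/bin/sh
    # naive watchdog: if node 1 stops answering pings, take over its node number
    PEER=192.168.1.11      # node 1's real (non-virtual) address
    VIP=192.168.1.1
    if ! ping -c 3 -W 1 $PEER > /dev/null 2>&1; then
        echo "+1" > /proc/net/ipt_CLUSTERIP/$VIP
    fi

    In real life you would use something more robust (heartbeat or a service-level check), but the takeover action itself is just that one echo.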

    Flavio

  8. mike Says:

    I am a bit of a newbie to the whole load-balancing concept. I am assuming that the firewall/NAT is a separate machine that uses clusterip and determines which server receives the request. Or do I just need two servers (node one and node two) with one IP address, with each server passing on the request to the next machine?

    Thanks,
    Mike

  9. flavio Says:

    Mike, you don’t need any additional machines if you use clusterip.
    You just need to configure iptables and clusterip on each of the servers and they will distribute the traffic among themselves. In a nutshell: if you have three servers and configure clusterip on each of them, each server will respond to 1/3 of the total traffic auto-magically. No director or load balancer machine is required.
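
    For example, extending the rules from the article to three servers (same VIP and cluster MAC), every box gets the same rule and only --local-node changes:

    iptables -A INPUT -d 192.168.1.1 -i eth0 -p tcp --dport 80 -j CLUSTERIP --new --hashmode sourceip --clustermac 01:23:45:67:89:AB --total-nodes 3 --local-node 1

    with --local-node 2 and --local-node 3 on the second and third server, respectively.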

    Flavio

  10. SoupIsGood Food Says:

    Well, that sucks. Pretty flaky for an HA cluster setup. A possible solution would be for the other nodes to poll the called node number to verify the requests had been received, and for the next node in the queue to fail over and respond…

    Not a coder, so don’t look at me for an implementation. Dunno what sort of traffic and performance overhead this would need, either. Might be a lot. Such would be the price for eliminating the load balancers.

  11. Rogan Says:

    You would use something like “heartbeat” to verify that all the nodes in the cluster are available, and respond appropriately (echo +1) when a node goes away.

  12. Tim Says:

    Wouldn’t a better solution be making the cluster members the load balancers themselves?

    Maybe an add-in could share the state tables and elect one of the nodes as a master, like an HA module or something.

    Then all cluster nodes are aware of connections, and if a member drops, they can handle the connection without dropping it.

    If the master drops, one of the other nodes is elected as the master.

    You could even then use different algorithms for connection distribution, such as latency, round robin, or maybe processor/memory utilization.

  13. flavio Says:

    Tim, you can certainly do that, and use keepalived and LVS to make each node a director and a real server at the same time. In this case, VRRP will take care of the failover of the director subsystem.
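
    The LVS part of that setup boils down to rules like these on whichever node currently holds the director role (a sketch with hypothetical real-server addresses; in practice keepalived generates them from its virtual_server configuration):

    ipvsadm -A -t 192.168.1.1:80 -s rr
    ipvsadm -a -t 192.168.1.1:80 -r 192.168.1.10:80 -g
    ipvsadm -a -t 192.168.1.1:80 -r 192.168.1.11:80 -g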

    Flavio

  14. Markus Says:

    Sounds very nice :)

    I tried to test it with two nodes, without success. I suppose that I have to configure the network interface manually with the parameters given to iptables.

    After I insert the iptables rule, a file appears in the /proc/net/ipt_CLUSTERIP directory, but it doesn’t return anything when I use the cat command.

    If I look at the iptables rules, the clustermac address has changed to something else. The MAC address in the ARP reply from the cluster is changed once more.
    Is this the normal behavior of the CLUSTERIP module?

    Markus

  15. flavio Says:

    Markus, you’re probably suffering from the divergence between kernel and userspace versions. I’d recommend using the latest 2.6.11.12 kernel with the latest iptables (1.3.1).

    Flavio

  16. sunspark Says:

    Albert,
    In response to your question about SSH host key mismatch warnings when using round-robin DNS clustering (and other cluster methods, I suppose): we solved this problem by generating one set of keys and installing that public/private key pair on all the machines in the cluster. Thus, all machines hand out the same public key to “first-time” clients, and all machines use the same private key for incoming connections.
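
    (For the record, a minimal sketch of that approach, assuming OpenSSH with its host keys under /etc/ssh and a second node reachable as node2; key file and init script names vary per distribution:

    scp /etc/ssh/ssh_host_*key* root@node2:/etc/ssh/
    ssh root@node2 /etc/init.d/sshd restart
    )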

  17. Markus Says:

    I’m using the latest version of the kernel and iptables, so I’ll make another try. I had a look at the netfilter mailing list but I can’t find a similar problem. It seems to be a very new topic and documentation is rare.

    Can you please tell me whether I’m right that the interface IP and MAC address have to be configured manually (that is not explicitly mentioned in your very good howto)? Thanks

    Markus

  18. flavio Says:

    Markus, you are supposed to configure the MAC address in the iptables rule (not on the interface itself), and the same goes for the IP address.

    After you run the iptables commands, try executing:

    echo "+1" > /proc/net/ipt_CLUSTERIP/192.168.1.1

    (where 1 is the node number and 192.168.1.1 is the VIP)

    Then try a:

    cat /proc/net/ipt_CLUSTERIP/192.168.1.1

    and see if it returns the node number.

    Flavio

  19. Mitry Says:

    I tried CLUSTERIP and it does not work for me.
    My kernel is 2.6.11.4, iptables 1.3.1, on SuSE Linux ES 9.
    Both nodes are connected to a Catalyst switch, to ports with multicast enabled.

    What I do, on the first node:
    iptables -A INPUT -d 10.2.10.100 -i eth0 -p tcp --dport 80 -j CLUSTERIP --new --hashmode sourceip-sourceport-destport --clustermac 01:00:5e:02:0a:64 --total-nodes 2 --local-node 1
    echo "+1" > /proc/net/ipt_CLUSTERIP/10.2.10.100

    On the second node:
    iptables -A INPUT -d 10.2.10.100 -i eth0 -p tcp --dport 80 -j CLUSTERIP --new --hashmode sourceip-sourceport-destport --clustermac 01:00:5e:02:0a:64 --total-nodes 2 --local-node 2
    echo "+2" > /proc/net/ipt_CLUSTERIP/10.2.10.100

    After that, "cat /proc/net/ipt_CLUSTERIP/10.2.10.100" shows "1" and "2".
    Next I enter 10.2.10.100 in the browser on my local machine (10.2.2.29) and see this on both nodes:

    tcpdump -i eth0 host 10.2.10.100
    listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
    13:53:53.217480 arp who-has 10.2.10.100 tell 10.2.2.29
    13:54:02.114233 arp who-has 10.2.10.100 tell 10.2.2.29
    13:54:19.162201 arp who-has 10.2.10.100 tell 10.2.2.29

    Why is there no reply from the nodes to the ARP requests?
    Another strange thing: when I run "iptables -L", the virtual MAC address of the farm has changed, and it is different on every node!
    Is this normal?

  20. flavio Says:

    Mitry, you may have an incompatibility between kernel and userland.

    Try upgrading your kernel to 2.6.12 and your userland iptables to 1.3.1.

    Flavio

  21. Mitry Says:

    UPD: I also tried the latest 2.6.12 kernel, with no change.

  22. flavio Says:

    Mitry, can you update your userland iptables to 1.3.1 to see if that fixes your problem?

    Flavio

  23. Mitry Says:

    flavio, when I upgraded the kernel to 2.6.12 I recompiled iptables-1.3.1 against the new kernel sources and installed it.

  24. flavio Says:

    Mitry, are you sure that you are using the new iptables binary and libraries?

    Usually Linux distributions have the iptables libraries installed under /usr/. However, when you install iptables from source, it installs itself under /usr/local/ unless you specify a different directory.
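
    A quick way to check what you are actually running (the plugin file name below is the usual one for the CLUSTERIP userspace extension, but the paths are only typical, not guaranteed):

    type -a iptables
    iptables -V
    ls /lib/iptables/libipt_CLUSTERIP.so /usr/local/lib/iptables/libipt_CLUSTERIP.so 2> /dev/null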

    Flavio

  25. Mitry Says:

    flavio, first I used iptables 1.3.1 and kernel 2.6.11.4, which come with my SuSE distribution.
    Second, I built kernel 2.6.12 from kernel.org and iptables-1.3.1-3.src.rpm from the distribution.
    Third, I took iptables from iptables.org, and…
    ...CLUSTERIP is still not working.
    The situation is the same: no ARP replies from the virtual IP.

  26. flavio Says:

    Mitry, it’s hard to say what’s going on without some more research. The fact that you see a different MAC address when you list the rules on the two nodes is a hint that something is wrong (maybe you hit a bug in clusterip).

    Flavio

  27. Sven Says:

    I also have the same problem with the MAC address changing…
    Kernel 2.6.12, iptables 1.3.1.

  28. Dennis Says:

    Hi!

    I’ve been playing with this and can’t get it to work either. No ARP replies. However, the combination of iptables 1.2.11 and the 2.6.11 kernel seems to work better than the newest versions.

    The MAC address is set to the correct value and the ‘1’ is set in /proc/net/ipt_CLUSTERIP/x.x.x.x. I cannot change this value with echo "+1", though. Also, I still do not get any ARP replies.

    Flavio, is there anything special that needs to be set up that wasn’t covered in your article? Also, which kernel and iptables versions did you use?

    Dennis

  29. Andrew Says:

    Doesn’t CARP (http://www.ucarp.org) do this better?

  30. Sven Says:

    Andrew: CARP is for failover, not load balancing.

  31. John Says:

    Once the router has the MAC address of one of the servers for the virtual IP, won’t it send all the requests to that node?

  32. flavio Says:

    John, the trick is that the MAC address belongs to the multicast range and all the nodes in the cluster listen for ethernet frames directed to that MAC. All the nodes will get the frames but only one will respond.
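
    If in doubt, you can verify both sides of that on a node (a troubleshooting sketch using the example MAC from the article; the exact output depends on your driver and iproute2 version):

    ip maddr show dev eth0                            # the cluster MAC should appear in this list
    tcpdump -n -i eth0 ether dst 01:23:45:67:89:AB    # frames sent to the cluster MAC should show up here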

    Flavio

  33. garfield Says:

    Hi, very nice article. Just a word to say that I wrote a French translation of it; it’s available here: http://www.hardz.net/~garfield/index.php?2005/08/27/45—traduction-loadbalancer-less-clusters-on-linux.

    If you have any objection to it, please mail me and I will remove it immediately.

    Wishing you the best,

     garfield
    
  34. Anoop Says:

    If I have cluster nodes located across routers (i.e. in multiple LAN segments), will this solution work? It may make sense to install cluster services in different LAN segments within a geographic area, or even across areas, to achieve better redundancy.

    Also, which will be the outbound packets’ source IP: the CLUSTERIP or the actual IP assigned to the NIC?

  35. Marco Says:

    I tried CLUSTERIP, and it does not work for me:
    I use Mandriva 2006, kernel 2.6.14-12mdk, iptables 1.3.3.
    Both nodes are connected to a Catalyst, with one entry manually entered:
    arp 10.46.24.10 0100.5e7f.180a ARPA

    On the first node, I do:
    iptables -A INPUT -d 10.46.24.10 -i eth0 -p tcp --dport 80 -j CLUSTERIP --new --hashmode sourceip --clustermac 01:00:5E:7F:18:0A --total-nodes 2 --local-node 1

    On the second node, I do:
    iptables -A INPUT -d 10.46.24.10 -i eth0 -p tcp --dport 80 -j CLUSTERIP --new --hashmode sourceip --clustermac 01:00:5E:7F:18:0A --total-nodes 2 --local-node 2

    Until now, everything is good:
    cat /proc/net/ipt_CLUSTERIP/10.46.24.10 gives me the right number. No need to enter it manually with echo "+1" > /proc/net/ipt_CLUSTERIP/10.46.24.10.

    Next, iptables -L gives me this:
    node1:
    CLUSTERIP tcp -- anywhere 10.46.24.10 tcp dpt:http CLUSTERIP hashmode=sourceip clustermac=01:00:5E:7F:18:0A totalnodes=2 localnode=1

    node2:
    CLUSTERIP tcp -- anywhere 10.46.24.10 tcp dpt:http CLUSTERIP hashmode=sourceip clustermac=01:00:5E:7F:18:0A totalnodes=2 localnode=2

    Everything seems OK, but like “Mitry” (post #19), I get no reply from the nodes to the ARP request:

    18:44:07.123088 IP (tos 0x0, ttl 125, id 24583, offset 0, flags [none], proto: ICMP (1), length: 60) 10.16.92.200 > 10.46.24.10: ICMP echo request, id 512, seq 27913, length 40
    18:44:07.125206 arp who-has 10.46.24.10 tell 10.46.24.252

    Any ideas? I’m ready to run other tests.
    My colleague tried Microsoft NLB with exactly the same config and everything works.

  36. Francois Says:

    Good idea to suppress the director on clusters. It could maybe also be used to load-balance firewalls that do connection tracking, but for that we would need one interface set up with a hashmode of sourceip, and another interface with a hashmode of destip, so that returning packets cross the same firewall.

             |---FW2---|
    Network1-|         |-Network2
             |---FW1---|

    What do you think about it? Is anyone good enough to write a patch adding destip as a hash mode (if the idea is worth it)?

  37. Zeko Says:

    flavio you said: “Sourceip-sourceport and sourceip-sourceport-destport will provide a more even distribution of traffic, but will require more memory to hold larger hash tables.”

    Can you imagine hashing and storing hash values in memory for N IPs? That would be a bad idea.

    Well, I looked into the source code of CLUSTERIP, and there are no hash tables. So feel free to use any of the mentioned hashing modes.

  38. flavio Says:

    Zeko, you’re correct. All modes just utilize a hash function to determine the node receiving the request, so no tables are held.
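
    To illustrate the idea with a toy stand-in (this is not the kernel's actual hash, just the same stateless principle):

    # map a source IP to a node number 1..nodes without keeping any per-connection state
    ip=10.2.2.29; nodes=2
    sum=0; for octet in ${ip//./ }; do sum=$(( (sum * 31 + octet) % 65536 )); done
    echo "source $ip -> node $(( sum % nodes + 1 ))"

    Every node computes the same value for the same source, so they all agree on who should answer without talking to each other.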

  39. Czytom Says:

    What happens when one of the nodes goes down? Does it automatically change the configuration of the other nodes to exclude the node that went down?
