Marek Goldmann | Connecting Docker containers on multiple hosts

Warning

This guide is based on an old version of Docker. The instructions you find below may be already removed or changed in the Docker codebase.

In my previous blog post I talked about running the Fedora Cloud images on a local KVM with libvirt. This was not a standalone task, but rather the preparation for this blog post: running Docker containers on multiple hosts attached to the same network.

I was asked in the comments on my WildFly cluster based on Docker blog post if it would be possible to run a cluster on multiple hosts. I found a very nice tutorial written by Franck Besnard. I’ve decided to set up a similar environment on my own to see how/if it works.

I made a few changes to Franck’s set up:

I’m not using the pipework script to minimize the dependencies.
I wanted to make the launching of the containers as simple as possible, so I dropped the use of the ovswork.sh script Franck crafted and I only use the docker run command.
I’m not creating a virtual ethernet device for each container — instead I’m attaching all containers to a bridge.
I’m not using VLAN’s (yet)

Set up

I used two VM’s (host1 and host2) each with Fedora 20 as the operating system. You can use the script I described in my previous post to create them. Later you can use the virsh start host1 command to run them.

Docker

On both hosts I’ve installed Docker:

yum -y install docker-io

The Docker configuration requires some changes.

By default Docker chooses a (more or less) random network to run the containers. After this it creates a bridge and assigns an address to it. This is not really what we want because we need to have static address assignment, so we need to prepare our own bridge and disable the one managed by Docker.

Copy the /usr/lib/systemd/system/docker.service file to /etc/systemd/system/docker.service and add following content to disable the default docker0 bridge creation on Docker startup.

.include /usr/lib/systemd/system/docker.service

[Service]
ExecStart=
ExecStart=/usr/bin/docker -d -b=none

You can start Docker with systemctl start docker.

Note	Every time you modify a systemd service file do not forget to run `systemctl daemon-reload` to apply your changes.

Networking

This is the interesting part :)

Open vSwitch

To make networking easy I used the Open vSwitch software. I’m very new to it, but its flexibility and ease of use is just impressive. I haven’t done any performance testing, though. Maybe some day.

You can install Open vSwitch on Fedora by running this command:

yum -y install openvswitch

Network configuration

The script below prepares the networking for you. You can execute it on both hosts by adjusting the REMOTE_IP and BRIDGE_ADDRESS variables. The BRIDGE_NAME can be the same on both hosts.

# The 'other' host
REMOTE_IP=192.168.122.189
# Name of the bridge
BRIDGE_NAME=docker0
# Bridge address
BRIDGE_ADDRESS=172.16.42.2/24

# Deactivate the docker0 bridge
ip link set $BRIDGE_NAME down
# Remove the docker0 bridge
brctl delbr $BRIDGE_NAME
# Delete the Open vSwitch bridge
ovs-vsctl del-br br0
# Add the docker0 bridge
brctl addbr $BRIDGE_NAME
# Set up the IP for the docker0 bridge
ip a add $BRIDGE_ADDRESS dev $BRIDGE_NAME
# Activate the bridge
ip link set $BRIDGE_NAME up
# Add the br0 Open vSwitch bridge
ovs-vsctl add-br br0
# Create the tunnel to the other host and attach it to the
# br0 bridge
ovs-vsctl add-port br0 gre0 -- set interface gre0 type=gre options:remote_ip=$REMOTE_IP
# Add the br0 bridge to docker0 bridge
brctl addif $BRIDGE_NAME br0

# Some useful commands to confirm the settings:
# ip a s
# ip r s
# ovs-vsctl show
# brctl show

After executing these commands on both hosts you should be able to ping the docker0 bridge addresses from both hosts.

Here is an example from host2 (ip 192.168.122.189):

$ ping 172.16.42.1
PING 172.16.42.1 (172.16.42.1) 56(84) bytes of data.
64 bytes from 172.16.42.1: icmp_seq=1 ttl=64 time=2.16 ms
64 bytes from 172.16.42.1: icmp_seq=2 ttl=64 time=0.628 ms
^C
--- 172.16.42.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.628/1.396/2.165/0.769 ms

Networking explained

The above script has some useful comments that help to understand what it’s doing, but here’s a high level view on the networking part.

Every container run with Docker is attached to docker0 bridge. This is a regular bridge you can create on every Linux system, without the need for Open vSwitch.
The docker0 bridge is attached to another bridge: br0. This time it’s an Open vSwitch bridge. This means that all traffic between containers is routed through br0 too. You can think about two switches connected to each other.
Additionally we need to connect together the networks from both hosts in which the containers are running. A GRE tunnel is used for this purpose. This tunnel is attached to the br0 Open vSwitch bridge and as a result to docker0 too.

The issue: IP assignment

While creating this environment I found a problem.

Docker assumes that it’s managing the network where the containers are run. It does not expect any other hosts to be run on the network besides the ones it starts. This works well in a typical environment (and definitely makes the code easier). But if we’re going to spread across multiple hosts — this can cause some headaches.

Docker address assignement method

The way Docker assignes IP addresses to the containers is very simple: it tries to assign the first unused address. It sounds valid, right? But it depends how do you define not used. When Docker starts a container — the assigned IP is added to a list of used IPs maintained by the Docker daemon. Not used IP in Docker’s case means that the IP wasn’t found in that list.

This can be problematic, though. If you run something manually on that network and you assign an IP to it — Docker will not be able to detect it and instead it can happen that Docker assigns this IP blindly again causing a conflict.

Solution

Over the weekend I was thinking about some solutions, and I ended up with two:

Obvious one: change the Docker code to find out if the address is really free.
Manually assign IP’s to the containers when running them.

Both have pros and cons. There may be other solutions too. Feel free to drop a comment if you find one.

Option 1: Modifying Docker

The first idea involves patching Docker. We need to make it aware of the hosts running on the network. From the beginning I was focused on using the ARP protocol.

I was trying to use the host ARP cache table for the interface bound to Docker (by default it’s docker0), but I found that:

Containers do not advertise themselves on startup, and
Even if we advertise manually (using gratuitous ARP message) — the ARP table is not reliable enough since entries will be removed after some time if there is no communication between these two hosts.

Note	Fedora does drop the broadcast ARP messages by default. You can change this by setting: `echo 1 > /proc/sys/net/ipv4/conf/<device>/arp_accept`. Read more in the Linux kernel documentation (search for `arp_accept`).

But the good news is that we still can find if the selected IP is used by using the arping utility and this is what I used.

I prepared a very ugly patch for Docker 0.7.6 which adds an additional check if the IP we’re trying to use is actually free.

In my testing I found that using arping is pretty reliable — the hosts were discovered properly and it didn’t take too long to find a free IP.

I built an RPM with this patch for Fedora 20, you can download it from here, if you want to give it a try.

After installing the patched Docker you should be able to run containers just like you’re used to:

docker run -i -t centos:latest /bin/bash

Option 2: Manual address assignment

Sometimes patching Docker is not an option.

This is where assigning IP addresses manually makes sense. Since Docker does not expose the ability to assign a selected IP directly to the docker run command — we need to do this in two steps:

Disable the automatic network configuration in Docker by specifying -n=false,
Configure networking using the LXC configuration using -lxc-conf

Example

This is how it could be done:

docker run \
-n=false \
-lxc-conf="lxc.network.type = veth" \
-lxc-conf="lxc.network.ipv4 = 172.16.42.20/24" \
-lxc-conf="lxc.network.ipv4.gateway = 172.16.42.1" \
-lxc-conf="lxc.network.link = docker0" \
-lxc-conf="lxc.network.name = eth0" \
-lxc-conf="lxc.network.flags = up" \
-i -t centos:latest /bin/bash

This will run a CentOS container with networking set up as follows:

Create a virtual ethernet interface
Attach this interface to the docker0 bridge
Expose it in the container as eth0
Assign the 172.16.42.20 IP to the interface
Set up the default gateway as 172.16.42.1

If you want to run multiple containers on one host, the only thing you’ll change is the IP address — everything else can be left as-is.

Expected result

If you followed the tutorial (no matter which option you choose) — you should be able to run containers on both hosts. Containers should be attached to the same network and be able to ping each other. Additionaly no IP address conflicts should happen.

Win!

Troubleshooting

If you encounter some problems — you need to check the configuration.

Make sure the brctl show command outputs similar content:

bridge name bridge id   STP enabled interfaces
docker0   8000.7a7c5f332842 no    br0

Make sure the ovs-vsctl show command outputs similar content:

73f7bcaa-7141-4b20-8fa8-3a0c1ec34f39
    Bridge "br0"
        Port "br0"
            Interface "br0"
                type: internal
        Port "gre0"
            Interface "gre0"
                type: gre
                options: {remote_ip="192.168.122.43"}
    ovs_version: "2.0.0"

Make sure you can ping host1 from host2 and vice-versa.
Make sure you can ping the docker0 interface running on host1 from host2 and vice-versa.

Conclusion

It’s possible to run Docker containers on different hosts that share the same network.

It’s even pretty simple. But like always — it could be better: Docker should make it possible without any workarounds.

One idea would be to implement the ARP requests directly in Go and drop the use of arping.

The other idea is to expose the network settings for the containers to the docker run call. I’m thinking here about the -i (IP with network prefix) and -g (gateway) options forwarded to dockerinit when launching a container.

Whoah, you’re still reading this? Not bad.

Thanks!