Warning
This guide is based on an old version of Docker. The instructions below may have since been removed or changed in the Docker codebase.
In my previous blog post I talked about running the Fedora Cloud images on a local KVM with libvirt. This was not a standalone task, but rather the preparation for this blog post: running Docker containers on multiple hosts attached to the same network.
I was asked in the comments on my WildFly cluster based on Docker blog post if it would be possible to run a cluster on multiple hosts. I found a very nice tutorial written by Franck Besnard. I’ve decided to set up a similar environment on my own to see how/if it works.
I made a few changes to Franck’s setup:
- I’m not using the pipework script, to minimize the dependencies.
- I wanted to make launching the containers as simple as possible, so I dropped the ovswork.sh script Franck crafted and I only use the docker run command.
- I’m not creating a virtual ethernet device for each container. Instead I’m attaching all containers to a bridge.
- I’m not using VLANs (yet).
Set up
I used two VMs (host1 and host2), each with Fedora 20 as the operating system.
You can use the script I described in my previous post to create them. Later you can use the virsh start host1 command to run them.
Docker
On both hosts I’ve installed Docker:
yum -y install docker-io
The Docker configuration requires some changes.
By default Docker chooses a (more or less) random network to run the containers. After this it creates a bridge and assigns an address to it. This is not really what we want because we need to have static address assignment, so we need to prepare our own bridge and disable the one managed by Docker.
Copy the /usr/lib/systemd/system/docker.service file to /etc/systemd/system/docker.service and add the following content to disable the default docker0 bridge creation on Docker startup.
.include /usr/lib/systemd/system/docker.service
[Service]
ExecStart=
ExecStart=/usr/bin/docker -d -b=none
You can start Docker with systemctl start docker.
Note
Every time you modify a systemd service file, do not forget to run systemctl daemon-reload to apply your changes.
Networking
This is the interesting part :)
Open vSwitch
To make networking easy I used the Open vSwitch software. I’m very new to it, but its flexibility and ease of use are just impressive. I haven’t done any performance testing, though. Maybe some day.
You can install Open vSwitch on Fedora by running this command:
yum -y install openvswitch
Network configuration
The script below prepares the networking for you. Run it on both hosts, adjusting the REMOTE_IP and BRIDGE_ADDRESS variables for each one. The BRIDGE_NAME can be the same on both hosts.
# The 'other' host
REMOTE_IP=192.168.122.189
# Name of the bridge
BRIDGE_NAME=docker0
# Bridge address
BRIDGE_ADDRESS=172.16.42.2/24

# Deactivate the docker0 bridge
ip link set $BRIDGE_NAME down
# Remove the docker0 bridge
brctl delbr $BRIDGE_NAME
# Delete the Open vSwitch bridge
ovs-vsctl del-br br0

# Add the docker0 bridge
brctl addbr $BRIDGE_NAME
# Set up the IP for the docker0 bridge
ip a add $BRIDGE_ADDRESS dev $BRIDGE_NAME
# Activate the bridge
ip link set $BRIDGE_NAME up

# Add the br0 Open vSwitch bridge
ovs-vsctl add-br br0
# Create the tunnel to the other host and attach it to the
# br0 bridge
ovs-vsctl add-port br0 gre0 -- set interface gre0 type=gre options:remote_ip=$REMOTE_IP
# Add the br0 bridge to docker0 bridge
brctl addif $BRIDGE_NAME br0

# Some useful commands to confirm the settings:
# ip a s
# ip r s
# ovs-vsctl show
# brctl show
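Because the script is run on both hosts with different values, a typo in REMOTE_IP or BRIDGE_ADDRESS is easy to make and only shows up after the existing bridges have been torn down. The following is a minimal sketch of a sanity check that could run first. It is not part of the original script, and the function names valid_ipv4 and valid_cidr are invented for this illustration.

```shell
# Hypothetical helpers (not part of the original script): validate the
# per-host settings before any bridge is created or deleted.

# Succeeds if $1 is a dotted-quad IPv4 address with each octet in 0..255.
valid_ipv4() (
    case "$1" in
        *[!0-9.]*|.*|*.|*..*|"") exit 1 ;;   # stray characters or empty octets
    esac
    old_ifs=$IFS
    IFS=.
    set -- $1                                # split the address on dots
    IFS=$old_ifs
    [ "$#" -eq 4 ] || exit 1
    for octet in "$@"; do
        [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || exit 1
    done
)

# Succeeds if $1 looks like ADDRESS/PREFIX with a prefix length of 0..32.
valid_cidr() (
    case "$1" in
        */*) ;;
        *) exit 1 ;;
    esac
    addr=${1%/*}
    prefix=${1##*/}
    case "$prefix" in
        ""|*[!0-9]*) exit 1 ;;
    esac
    [ "${#prefix}" -le 2 ] && [ "$prefix" -le 32 ] && valid_ipv4 "$addr"
)

# Usage, right after the variables are set:
#   valid_ipv4 "$REMOTE_IP"      || { echo "bad REMOTE_IP" >&2; exit 1; }
#   valid_cidr "$BRIDGE_ADDRESS" || { echo "bad BRIDGE_ADDRESS" >&2; exit 1; }
```

The check only looks at the shape of the values. It cannot tell whether the remote host is reachable or whether the two bridge addresses are actually different.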
After executing these commands on both hosts you should be able to ping the docker0 bridge addresses from both hosts.
Here is an example from host2 (ip 192.168.122.189):
$ ping 172.16.42.1
PING 172.16.42.1 (172.16.42.1) 56(84) bytes of data.
64 bytes from 172.16.42.1: icmp_seq=1 ttl=64 time=2.16 ms
64 bytes from 172.16.42.1: icmp_seq=2 ttl=64 time=0.628 ms
^C
--- 172.16.42.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.628/1.396/2.165/0.769 ms
Networking explained
The above script has some useful comments that help explain what it’s doing, but here’s a high-level view of the networking part.
- Every container run with Docker is attached to the docker0 bridge. This is a regular bridge you can create on every Linux system, without the need for Open vSwitch.
- The docker0 bridge is attached to another bridge: br0. This time it’s an Open vSwitch bridge. This means that traffic between containers on different hosts passes through br0 too. You can think of it as two switches connected to each other.
- Additionally we need to connect the networks on both hosts in which the containers are running. A GRE tunnel is used for this purpose. This tunnel is attached to the br0 Open vSwitch bridge and, as a result, to docker0 too.
The issue: IP assignment
While creating this environment I found a problem.
Docker assumes that it’s managing the network where the containers are run. It does not expect any other hosts to be run on the network besides the ones it starts. This works well in a typical environment (and definitely makes the code easier). But if we’re going to spread across multiple hosts — this can cause some headaches.
Docker address assignment method
The way Docker assigns IP addresses to the containers is very simple: it tries to assign the first unused address. It sounds valid, right? But it depends on how you define unused. When Docker starts a container, the assigned IP is added to a list of used IPs maintained by the Docker daemon. An unused IP in Docker’s case means that the IP wasn’t found in that list.
This can be problematic, though. If you run something manually on that network and assign an IP to it, Docker will not be able to detect it and may blindly assign the same IP again, causing a conflict.
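To make the failure mode concrete, here is a toy model of the “first address not in my list” strategy. It is only an illustration written in shell, not Docker’s code. The function name first_unused_ip and the sample addresses are invented.

```shell
# Toy model of the allocation strategy described above, NOT Docker's code.
# first_unused_ip PREFIX FIRST LAST USED_LIST
#   prints the first address PREFIX.N (FIRST <= N <= LAST) that does not
#   appear in the space-separated USED_LIST, or fails if none is left.
first_unused_ip() (
    prefix=$1 n=$2 last=$3 used=" $4 "
    while [ "$n" -le "$last" ]; do
        case "$used" in
            *" $prefix.$n "*) ;;                  # already in the list
            *) echo "$prefix.$n"; exit 0 ;;       # first gap wins
        esac
        n=$((n + 1))
    done
    exit 1
)

# The daemon's own bookkeeping: only the containers it started itself.
docker_list="172.16.42.10 172.16.42.11"
# What is really on the network: someone configured .12 by hand.
on_the_wire="$docker_list 172.16.42.12"

first_unused_ip 172.16.42 10 254 "$docker_list"   # prints 172.16.42.12 (conflict)
first_unused_ip 172.16.42 10 254 "$on_the_wire"   # prints 172.16.42.13 (safe)
```

The two calls differ only in the list they consult, which is the whole problem: the daemon on each host sees only its own list.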
Solution
Over the weekend I was thinking about some solutions, and I ended up with two:
- The obvious one: change the Docker code to find out if the address is really free.
- Manually assign IPs to the containers when running them.
Both have pros and cons. There may be other solutions too. Feel free to drop a comment if you find one.
Option 1: Modifying Docker
The first idea involves patching Docker. We need to make it aware of the hosts running on the network. From the beginning I was focused on using the ARP protocol.
I was trying to use the host ARP cache table for the interface bound to Docker (by default it’s docker0), but I found that:
- Containers do not advertise themselves on startup, and
- Even if we advertise manually (using a gratuitous ARP message), the ARP table is not reliable enough, since entries are removed after some time if there is no communication between the two hosts.
Note
By default Fedora ignores gratuitous ARP messages for addresses that are not already in the ARP table. You can change this by setting: echo 1 > /proc/sys/net/ipv4/conf/<device>/arp_accept. Read more in the Linux kernel documentation (search for arp_accept).
But the good news is that we can still find out whether the selected IP is in use with the arping utility, and this is what I used.
I prepared a very ugly patch for Docker 0.7.6 which adds an additional check that the IP we’re trying to use is actually free.
In my testing I found that using arping is pretty reliable: the hosts were discovered properly and it didn’t take too long to find a free IP.
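The idea behind the check can be sketched in a few lines of shell. This is not the patch itself, which lives in Docker’s Go code. It assumes the iputils version of arping, whose -D (duplicate address detection) mode exits with status 0 when nobody answers. The names ip_is_free and next_free_ip, and the PROBE override, are invented for this example.

```shell
# Sketch only; assumes iputils arping, where -D (duplicate address
# detection) exits 0 when no reply is received, i.e. the address is free.
# ip_is_free IP [DEVICE]
ip_is_free() {
    arping -q -D -c 2 -w 3 -I "${2:-docker0}" "$1"
}

# next_free_ip PREFIX FIRST LAST
#   prints the first PREFIX.N that the probe reports as free.
#   PROBE is a hypothetical hook that lets you swap in another check
#   (or a stub when no network is available).
next_free_ip() (
    prefix=$1 n=$2 last=$3
    while [ "$n" -le "$last" ]; do
        if "${PROBE:-ip_is_free}" "$prefix.$n"; then
            echo "$prefix.$n"
            exit 0
        fi
        n=$((n + 1))
    done
    exit 1
)

# Usage (as root, on a host with the docker0 bridge up):
#   next_free_ip 172.16.42 2 254
```

Probing addresses one by one is slow when many are taken, since every occupied address costs at least one ARP round trip. There is also a window between the probe and the moment the address is configured in which another host could claim it.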
I built an RPM with this patch for Fedora 20; you can download it from here if you want to give it a try.
After installing the patched Docker you should be able to run containers just like you’re used to:
docker run -i -t centos:latest /bin/bash
Option 2: Manual address assignment
Sometimes patching Docker is not an option.
This is where assigning IP addresses manually makes sense. Since Docker does not expose the ability to assign a selected IP directly in the docker run command, we need to do this in two steps:
- Disable the automatic network configuration in Docker by specifying -n=false,
- Configure networking through the LXC configuration using -lxc-conf.
Example
This is how it could be done:
docker run \
    -n=false \
    -lxc-conf="lxc.network.type = veth" \
    -lxc-conf="lxc.network.ipv4 = 172.16.42.20/24" \
    -lxc-conf="lxc.network.ipv4.gateway = 172.16.42.1" \
    -lxc-conf="lxc.network.link = docker0" \
    -lxc-conf="lxc.network.name = eth0" \
    -lxc-conf="lxc.network.flags = up" \
    -i -t centos:latest /bin/bash
This will run a CentOS container with networking set up as follows:
- Create a virtual ethernet interface
- Attach this interface to the docker0 bridge
- Expose it in the container as eth0
- Assign the 172.16.42.20 IP to the interface
- Set up the default gateway as 172.16.42.1
If you want to run multiple containers on one host, the only thing you’ll change is the IP address — everything else can be left as-is.
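Since only the address changes between containers, the long command is easy to wrap. The sketch below prints the command for a given address so it can be reviewed before running. static_ip_run_cmd is a made-up name, and the flags are exactly the ones from the example above, so it applies only to Docker versions that still accept -n and -lxc-conf.

```shell
# Hypothetical helper: print the docker run command for a given address.
# static_ip_run_cmd IP/PREFIX [GATEWAY] [BRIDGE]
static_ip_run_cmd() {
    # Each argument becomes one line ending in a continuation backslash.
    printf '%s \\\n' \
        "docker run" \
        "    -n=false" \
        "    -lxc-conf=\"lxc.network.type = veth\"" \
        "    -lxc-conf=\"lxc.network.ipv4 = $1\"" \
        "    -lxc-conf=\"lxc.network.ipv4.gateway = ${2:-172.16.42.1}\"" \
        "    -lxc-conf=\"lxc.network.link = ${3:-docker0}\"" \
        "    -lxc-conf=\"lxc.network.name = eth0\"" \
        "    -lxc-conf=\"lxc.network.flags = up\""
    printf '    -i -t centos:latest /bin/bash\n'
}

# Usage:
#   static_ip_run_cmd 172.16.42.21/24            # review the command
#   eval "$(static_ip_run_cmd 172.16.42.21/24)"  # then run it
```

Keeping track of which addresses are already handed out on each host is still up to you. The helper does nothing to prevent two containers from getting the same IP.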
Expected result
If you followed the tutorial (no matter which option you chose), you should be able to run containers on both hosts. Containers should be attached to the same network and be able to ping each other. Additionally, no IP address conflicts should happen.
Win!
Troubleshooting
If you encounter some problems — you need to check the configuration.
- Make sure the brctl show command outputs similar content:
bridge name     bridge id               STP enabled     interfaces
docker0         8000.7a7c5f332842       no              br0
- Make sure the ovs-vsctl show command outputs similar content:
73f7bcaa-7141-4b20-8fa8-3a0c1ec34f39
    Bridge "br0"
        Port "br0"
            Interface "br0"
                type: internal
        Port "gre0"
            Interface "gre0"
                type: gre
                options: {remote_ip="192.168.122.43"}
    ovs_version: "2.0.0"
- Make sure you can ping host1 from host2 and vice-versa.
- Make sure you can ping the docker0 interface running on host1 from host2 and vice-versa.
Conclusion
It’s possible to run Docker containers on different hosts that share the same network.
It’s even pretty simple. But like always — it could be better: Docker should make it possible without any workarounds.
One idea would be to implement the ARP requests directly in Go and drop the use of arping.
The other idea is to expose the network settings for the containers to the docker run call. I’m thinking here about the -i (IP with network prefix) and -g (gateway) options forwarded to dockerinit when launching a container.
Whoah, you’re still reading this? Not bad.
Thanks!