Marek Goldmann | Resource management in Docker

In this blog post I would like to touch on the topic of resource management for Docker containers. It is often unclear how it works and what we can and cannot do. I hope that after reading this blog post resource management will be a bit easier for you to understand.

Note

I assume that you are running Docker on a systemd enabled operating system. If you are on RHEL/CentOS 7+ or Fedora 19+ this is certainly true. But please note that there can be some changes in the available configuration options between different systemd versions. When in doubt, use the systemd man pages for the system you work with.

The basics

Docker uses cgroups to group processes running in the container. This allows you to manage the resources of a group of processes, which is very valuable, as you can imagine.

If we run an operating system which uses systemd as the service manager, every process (not only the ones inside of the container) will be placed in a cgroups tree. You can see it for yourself if you run the systemd-cgls command:

$ systemd-cgls
├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
├─machine.slice
│ └─machine-qemu\x2drhel7.scope
│   └─29898 /usr/bin/qemu-system-x86_64 -machine accel=kvm -name rhel7 -S -machine pc-i440fx-1.6,accel=kvm,usb=off -cpu SandyBridge -m 2048
├─system.slice
│ ├─avahi-daemon.service
│ │ ├─ 905 avahi-daemon: running [mistress.local
│ │ └─1055 avahi-daemon: chroot helpe
│ ├─dbus.service
│ │ └─890 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
│ ├─firewalld.service
│ │ └─887 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
│ ├─lvm2-lvmetad.service
│ │ └─512 /usr/sbin/lvmetad -f
│ ├─abrtd.service
│ │ └─909 /usr/sbin/abrtd -d -s
│ ├─wpa_supplicant.service
│ │ └─1289 /usr/sbin/wpa_supplicant -u -f /var/log/wpa_supplicant.log -c /etc/wpa_supplicant/wpa_supplicant.conf -u -f /var/log/wpa_supplica
│ ├─systemd-machined.service
│ │ └─29899 /usr/lib/systemd/systemd-machined

[SNIP]

This approach gives a lot of flexibility when we want to manage resources, since we can manage every group individually. Although this blog post focuses on containers, the same principle applies to other processes as well.

Note	If you want to read more about resource management with systemd I highly recommend the Resource Management and Linux Containers Guide for RHEL 7.

A note on testing

In my examples I’ll use the stress tool that helps me to generate some load in the containers so I can actually see the resource limits being applied. I created a custom Docker images called (surprisingly) stress using this Dockerfile:

FROM fedora:latest
RUN yum -y install stress && yum clean all
ENTRYPOINT ["stress"]

A note on resource reporting tools

The tools you are used to using to report usage like top, /proc/meminfo and so on are not cgroups aware. This means that they’ll report the information about the host even if we run them inside of a container. I found a nice blog post from Fabio Kung on this topic. Give it a read.

So, what can we do?

If you want to quickly find which container (or any systemd service, really) uses the most resources on the host I recommend the systemd-cgtop command:

$ systemd-cgtop
Path                                    Tasks   %CPU   Memory  Input/s Output/s

/                                         226   13.0     6.7G        -        -
/system.slice                              47    2.2    16.0M        -        -
/system.slice/gdm.service                   2    2.1        -        -        -
/system.slice/rngd.service                  1    0.0        -        -        -
/system.slice/NetworkManager.service        2      -        -        -        -

[SNIP]

This tool can give you a quick overview of what’s going on on the system right now. But if you want to get some detailed information about the usage (for example you need to create nice graphs) you will want to parse the /sys/fs/cgroup/… directories. I’ll show you where to find useful files for each resource I will talk about (look at the CGroups fs paragraphs below).

CPU

Docker makes it possible (via the -c switch of the run command) to specify a value of shares of the CPU available to the container. This is a relative weight and has nothing to do with the actual processor speed. In fact, there is no way to say that a container should have access only to 1Ghz of the CPU. Keep that in mind.

Every new container will have 1024 shares of CPU by default. This value does not mean anything, when speaking of it alone. But if we start two containers and both will use 100% CPU, the CPU time will be divided equally between the two containers because they both have the same CPU shares (for the sake of simplicity I assume that there are no other processes running).

If we set one container’s CPU shares to 512 it will receive half of the CPU time compared to the other container. But this does not mean that it can use only half of the CPU. If the other container (with 1024 shares) is idle — our container will be allowed to use 100% of the CPU. That’s another thing to note.

Limits are enforced only when they should be. CGroups does not limit the processes upfront (for example by not allowing them to run fast, even if there are free resources). Instead it gives as much as it can and limits only when necessary (for example when many processes start to use the CPU heavily at the same time).

Of course it’s not easy (and I would say impossible) to say how many resources will be assigned to your process. It really depends on how other processes will behave and how many shares are assigned to them.

Example: managing the CPU shares of a container

As I mentioned before you can use the -c switch to manage the value of shares assigned to all processes running inside of a Docker container.

Since I have 4 cores on my machine available, I’ll tell stress to use all 4:

$ docker run -it --rm stress --cpu 4
stress: info: [1] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd

If we start two containers the same way, both will use around 50% of the CPU. But what happens if we modify the CPU shares for one container?

$ docker run -it --rm -c 512 stress --cpu 4
stress: info: [1] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd

Containers using CPU

As you can see, the CPU is divided between the two containers in such a way that the first container uses ~60% of the CPU and the other ~30%. This seems to be the expected result.

Note	The missing ~10% of the CPU was taken by GNOME, Chrome and my music player, in case you were wondering.

Attaching containers to cores

Besides limiting shares of the CPU, we can do one more thing: we can pin the container’s processes to a particular processor (core). To do this, we use the --cpuset switch of the docker run command.

To allow execution only on the first core:

docker run -it --rm --cpuset=0 stress --cpu 1

To allow execution only on the first two cores:

docker run -it --rm --cpuset=0,1 stress --cpu 2

You can of course mix the option --cpuset with -c.

Note

Share enforcement will only take place when the processes are run on the same core. This means that if you pin one container to the first core and the other container to the second core, both will use 100% of each core, even if they have different a CPU share value set (once again, I assume that only these two containers are running on the host).

Changing the shares value for a running container

It is possible to change the value of shares for a running container (or any other process, of course). You can directly interact with the cgroups filesystem, but since we have systemd we can leverage it to manage this for us (since it manages the processes anyhow).

For this purpose we’ll use the systemctl command with the set-property argument. Every new container created using the docker run command will have a systemd scope automatically assigned under which all of its processes will be executed. To change the CPU share for all processes in the container we just need to change it for the scope, like so:

$ sudo systemctl set-property docker-4be96b853089bc6044b29cb873cac460b429cfcbdd0e877c0868eb2a901dbf80.scope CPUShares=512

Note	Add `--runtime` to change the setting temporarily. Otherwise, this setting will be remembered when the host is restarted.

This changes the default value from 1024 to 512. You can see the result below. The change happens somewhere in the middle of the recording. Please note the CPU usage. In systemd-cgtop 100% means full use of 1 core and this is correct since I bound both containers to the same core.

Note	To show all properties you can use the `systemctl show docker-4be96b853089bc6044b29cb873cac460b429cfcbdd0e877c0868eb2a901dbf80.scope` command. To list all available properties take a look at `man systemd.resource-control`.

CGroups fs

You can find all the information about the CPU for a specific container under /sys/fs/cgroup/cpu/system.slice/docker-$FULL_CONTAINER_ID.scope/, for example:

$ ls /sys/fs/cgroup/cpu/system.slice/docker-6935854d444d78abe52d629cb9d680334751a0cda82e11d2610e041d77a62b3f.scope/
cgroup.clone_children  cpuacct.usage_percpu  cpu.rt_runtime_us  tasks
cgroup.procs           cpu.cfs_period_us     cpu.shares
cpuacct.stat           cpu.cfs_quota_us      cpu.stat
cpuacct.usage          cpu.rt_period_us      notify_on_release

Note	More information about these files can be found in the RHEL Resource Management Guide. This information is spread across the cpu, cpuacct and cpuset sections.

Recap

A few things to remember:

a CPU share is just a number — it’s not related to the CPU speed
By default new containers have 1024 shares
On an idle host a container with low shares will still be able to use 100% of the CPU
You can pin a container to specific core, if you want

Memory

Now let’s take a look at limiting memory.

The first thing to note is that a container can use all of the memory on the host with the default settings.

If you want to limit memory for all of the processes inside of the container just use the -m docker run switch. You can define the value in bytes or by adding a suffix (k, m or g).

Example: managing the memory shares of a container

You can use the -m switch like so:

$ docker run -it --rm -m 128m fedora bash

To show that the limitation actually works I’ll use my stress image again. Consider the following run:

$ docker run -it --rm -m 128m stress --vm 1 --vm-bytes 128M --vm-hang 0
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

The stress tool will create one process and try to allocate 128MB of memory to it. It works fine, good. But what happens if we try to use more than we have actually allocated for the container?

$ docker run -it --rm -m 128m stress --vm 1 --vm-bytes 200M --vm-hang 0
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

It works too. Surprising? Yes I agree.

We can find the explanation for this in the libcontainer source code (Docker’s interface to cgroups). We can see there that by default the memory.memsw.limit_in_bytes value is set to twice as much as the memory parameter we specify while starting a container. What does the memory.memsw.limit_in_bytes parameter say? It is a sum of memory and swap. This means that Docker will assign to the container -m amount of memory as well as -m amount of swap.

The current Docker interface does not allow us to specify how much (or disable it entirely) swap should be allowed, so we need live with it for now.

With the above information we can run our example again. This time we will try to allocate over twice the amount of memory we assign. This should use all of the memory and all of the swap, then die.

$ docker run -it --rm -m 128m stress --vm 1 --vm-bytes 260M --vm-hang 0
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [1] (415) <-- worker 6 got signal 9
stress: WARN: [1] (417) now reaping child worker processes
stress: FAIL: [1] (421) kill error: No such process
stress: FAIL: [1] (451) failed run completed in 5s

If you try once again to allocate for example 250MB (--vm-bytes 250M) it will work just fine.

Warning

If we don’t limit the memory by using -m switch the swap size will be unlimited too. ^[1]

Having no limit on memory can lead to issues where one container can easily make the whole system unstable and as a result unusable. So please remember: always use the -m parameter ^[2].

CGroups fs

You can find all the information about the memory under /sys/fs/cgroup/memory/system.slice/docker-$FULL_CONTAINER_ID.scope/, for example:

$ ls /sys/fs/cgroup/memory/system.slice/docker-48db72d492307799d8b3e37a48627af464d19895601f18a82702116b097e8396.scope/
cgroup.clone_children               memory.memsw.failcnt
cgroup.event_control                memory.memsw.limit_in_bytes
cgroup.procs                        memory.memsw.max_usage_in_bytes
memory.failcnt                      memory.memsw.usage_in_bytes
memory.force_empty                  memory.move_charge_at_immigrate
memory.kmem.failcnt                 memory.numa_stat
memory.kmem.limit_in_bytes          memory.oom_control
memory.kmem.max_usage_in_bytes      memory.pressure_level
memory.kmem.slabinfo                memory.soft_limit_in_bytes
memory.kmem.tcp.failcnt             memory.stat
memory.kmem.tcp.limit_in_bytes      memory.swappiness
memory.kmem.tcp.max_usage_in_bytes  memory.usage_in_bytes
memory.kmem.tcp.usage_in_bytes      memory.use_hierarchy
memory.kmem.usage_in_bytes          notify_on_release
memory.limit_in_bytes               tasks
memory.max_usage_in_bytes

Note	More information about these files can be found in the RHEL Resource Management Guide, memory section.

Block devices (disk)

With block devices we can think about two different types of limits:

Read/write speed
Amount of space available to write (quota)

The first one is pretty easy to enforce, whereas the second is still unsolved.

Note	I assume you are using the devicemapper storage backed for Docker. Everything below may be untrue for other backends.

Limiting read/write speed

Docker does not provide any switch that can be used to define how fast we can read or write data to a block device. But CGroups does have it built-in. And it’s even exposed in systemd via the BlockIO* properties.

To limit read and write speed we can use the BlockIOReadBandwidth and BlockIOWriteBandwidth properties, respectively.

By default the bandwith is not limited. This means that one container can make the disk hot, especially if it starts to swap…

Example: limiting write speed

Let’s measure the speed with no limits enforced:

$ docker run -it --rm --name block-device-test fedora bash
bash-4.2# time $(dd if=/dev/zero of=testfile0 bs=1000 count=100000 && sync)
100000+0 records in
100000+0 records out
100000000 bytes (100 MB) copied, 0.202718 s, 493 MB/s

real  0m3.838s
user  0m0.018s
sys   0m0.213s

It took 3.8 sec to write 100MB of data which gives us about 26MB/s. Let’s try to limit the disk speed a bit.

To be able to adjust the bandwitch available for the container we need to know exactly where the container filesystem is mounted. You can find it when you execute the mount command from inside of the container and find the device that is mounted on the root filesystem:

$ mount
/dev/mapper/docker-253:0-3408580-d2115072c442b0453b3df3b16e8366ac9fd3defd4cecd182317a6f195dab3b88 on / type ext4 (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c447,c990",discard,stripe=16,data=ordered)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,context="system_u:object_r:svirt_sandbox_file_t:s0:c447,c990",mode=755)

[SNIP]

In our case this is /dev/mapper/docker-253:0-3408580-d2115072c442b0453b3df3b16e8366ac9fd3defd4cecd182317a6f195dab3b88.

You can also use the nsenter command to get this value, like so:

$ sudo /usr/bin/nsenter --target $(docker inspect -f '{{ .State.Pid }}' $CONTAINER_ID) --mount --uts --ipc --net --pid mount | head -1 | awk '{ print $1 }'
/dev/mapper/docker-253:0-3408580-d2115072c442b0453b3df3b16e8366ac9fd3defd4cecd182317a6f195dab3b88

Now we can change the value of the BlockIOWriteBandwidth property, like so:

$ sudo systemctl set-property --runtime docker-d2115072c442b0453b3df3b16e8366ac9fd3defd4cecd182317a6f195dab3b88.scope "BlockIOWriteBandwidth=/dev/mapper/docker-253:0-3408580-d2115072c442b0453b3df3b16e8366ac9fd3defd4cecd182317a6f195dab3b88 10M"

This should limit the disk write speed to 10MB/s, so let’s run dd again:

bash-4.2# time $(dd if=/dev/zero of=testfile0 bs=1000 count=100000 && sync)
100000+0 records in
100000+0 records out
100000000 bytes (100 MB) copied, 0.229776 s, 435 MB/s

real  0m10.428s
user  0m0.012s
sys   0m0.276s

It seems to work, it took 10s to write 100MB to the disk, so the speed was about 10MB/s.

Note	The same applies to limiting the read bandwith with the difference being you use the `BlockIOReadBandwidth` property.

Limiting disk space

As I mentioned before this is tough topic. By default you get 10GB of space for each container. Sometimes this is too much, sometimes we cannot fit all of our data there. Unfortunately there is not much we can do about it now.

The only thing we can do is to change the default value for new containers. If you think that some other value (for example 5GB) is a beter fit in your case, you can do it by specifying the --storage-opt for the Docker daemon, like so:

docker -d --storage-opt dm.basesize=5G

You can tweak some other things, but please keep in mind that it requires restarting the Docker daemon afterwards. More info can be found in the readme.

CGroups fs

You can find all the information about the block devices under /sys/fs/cgroup/blkio/system.slice/docker-$FULL_CONTAINER_ID.scope/, for example:

$ ls /sys/fs/cgroup/blkio/system.slice/docker-48db72d492307799d8b3e37a48627af464d19895601f18a82702116b097e8396.scope/
blkio.io_merged                   blkio.sectors_recursive
blkio.io_merged_recursive         blkio.throttle.io_service_bytes
blkio.io_queued                   blkio.throttle.io_serviced
blkio.io_queued_recursive         blkio.throttle.read_bps_device
blkio.io_service_bytes            blkio.throttle.read_iops_device
blkio.io_service_bytes_recursive  blkio.throttle.write_bps_device
blkio.io_serviced                 blkio.throttle.write_iops_device
blkio.io_serviced_recursive       blkio.time
blkio.io_service_time             blkio.time_recursive
blkio.io_service_time_recursive   blkio.weight
blkio.io_wait_time                blkio.weight_device
blkio.io_wait_time_recursive      cgroup.clone_children
blkio.leaf_weight                 cgroup.procs
blkio.leaf_weight_device          notify_on_release
blkio.reset_stats                 tasks
blkio.sectors

Note	More information about these files can be found in the RHEL Resource Management Guide, blkio section.

Summary

As you can see resource management for Docker containers is possible. It’s even pretty easy. The only thing that bothers me (and others too) is that we cannot set a quota for disk usage. There is an issue filled upstream — watch it and comment.

Hope you found my post useful. Happy dockerizing!

1. This is technically not true; there is a limit, but it’s set to a value that is not reachable in the systems we currently run. For example on my laptop with 16GB of ram the value is 18446744073709551615 which is ~18.5 exabytes…

2. Or just use the MemoryLimit property.