I have a bridge device, `br0`, set up with systemd that replaces my primary ethernet `eth0`. With the `br0` bridge device, Incus is able to create containers/VMs that have unique MAC addresses and are then assigned IP addresses by my DHCP server (`sudo incus profile device add <profileName> eth0 nic nictype=bridged parent=br0`). Additionally, the containers/VMs can directly contact the host, unlike with MACVLAN.
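For context, the bridge is defined with systemd-networkd along these lines (a simplified sketch of my setup; the names, paths, and DHCP choice may differ for others):

```sh
# br0 itself
cat >/etc/systemd/network/10-br0.netdev <<'EOF'
[NetDev]
Name=br0
Kind=bridge
EOF

# enslave eth0 to br0
cat >/etc/systemd/network/20-eth0.network <<'EOF'
[Match]
Name=eth0

[Network]
Bridge=br0
EOF

# the host's IP now lives on br0
cat >/etc/systemd/network/30-br0.network <<'EOF'
[Match]
Name=br0

[Network]
DHCP=yes
EOF
```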
With Docker, I can’t see a way to get the same feature set with its options. I have MACVLAN working, but it’s even shoddier than the Incus implementation, as it can’t do DHCP without a poorly maintained plugin, and the host cannot contact the container because of how MACVLAN works (which precludes running a container, like a DNS server, that the host itself would want to rely on).
Is there a way I’ve missed with the bridge driver to specify a specific parent device? Can I make another bridge device off of `br0` and bind to that one, the way the host does? My searching really fell apart when I got to this point.
Also, if someone knows how to match Incus’ networking capability with Podman, I would love to hear that. I’m eyeing a move to Podman Quadlets (with Debian 13) once I’ve got myself well-versed with Docker (and its vast support infrastructure to learn from).
Hoping someone has solved this and wants to share their powers. I can always put Docker/Podman inside an Incus container, but I’d like to avoid onioning if possible.
I want to make sure I’ve understood your initial configuration correctly, as well as what you’ve tried.
In the original setup, you have eth0 as the interface to the rest of your network, and eth0 obtains a DHCP-assigned address from your DHCP server. On top of eth0, you created a bridge interface br0, and your host now obtains its DHCP-assigned address on br0. Then in Incus, you created a Macvlan network against br0, such that each container on this network is assigned a random MAC, and all of the container Ethernet frames are bridged to br0, which in turn bridges to eth0. In this way, every container can receive a DHCP-assigned address. Also, each container can send traffic to the br0 IP address to access services running on the host. Do I have that right?
For your Docker attempt, it looks like you created a Docker network using the Macvlan driver, but it wasn’t clear to me if the parent interface here was eth0 or br0, if you still have br0. When you say “I have MACVLAN working”, can you describe which aspect is working? Unique MAC assignment? Bridged traffic to/from the containers or the network?
I’m not very familiar with Incus, and I’m entirely in the dark about this shoddy plugin you mentioned for getting DHCP and Macvlan to work together. So far as I’m aware, modern Docker Engine uses the CNI plugins when creating networks, so the “-d macvlan” parameter specifies which CNI plugin will load. Since this would all be at Layer 2, I don’t see why a plugin would be needed to support DHCP (v4 or v6?) traffic.
Correct, but this is remedied by what’s to follow…
Yes, this post seems to do exactly that: https://kcore.org/2020/08/18/macvlan-host-access/
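The gist of that trick is a small Macvlan “shim” interface on the host plus a route, roughly like this (the addresses and names here are examples, not taken from your network):

```sh
# create a macvlan shim on the same parent the containers use, so traffic from
# the host to them leaves via a macvlan sibling instead of the parent itself
ip link add macvlan-shim link br0 type macvlan mode bridge
ip addr add 192.168.69.250/32 dev macvlan-shim
ip link set macvlan-shim up

# send traffic for the containers' address range through the shim
ip route add 192.168.69.128/25 dev macvlan-shim
```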
I think you’re right to avoid multiple container management tools, if only because it’s generally unnecessary. Although it kinda looks like Incus is more akin to Proxmox, in that it supports managing VMs and containers, whereas Podman and Docker only manage containers, which is further still distinct from the container runtime (eg CRI-O, containerd, Docker Engine (which uses containerd under the hood)).
Thanks for taking the time to reply!
The host setup has `eth0` as the physical interface to the rest of the network, with `br0` replacing it completely. `br0` has the same MAC as the `eth0` interface, and `eth0` just forwards to `br0`, which then does the bridging internally. `br0` being a bridge means that Incus is able to split it off without MACVLAN, instead using its nic device in bridge mode, which “uses an existing bridge on the host (`br0`) and creates a virtual device pair to connect the host bridge to the instance.” That results in a network interface that has its own MAC and is assigned a local IP by the DHCP server on the network, while also being able to talk to the host.

Incus accomplishes the same goal as Proxmox (Proxmox has similar bridge network devices for its containers/VMs), just without Incus needing to be your OS/distro like Proxmox does; it’s just a package.
As for Docker, the parent interface is `br0`, which has supplanted `eth0`. MACVLAN is working as it is intended to in Docker, as far as I can tell. The container has a networking device with its own MAC address, and after supplying the MACVLAN network device with my network’s subnet, gateway, and a static IP address in the Docker compose file, it works as expected. If I don’t supply a static IP in the compose file, Docker just assigns it the first IP in the given subnet - no DHCP interaction. This docker-net-dhcp plugin (I linked to the issue about it not working on the latest version of Docker anymore) was made to give Docker network devices the ability to use DHCP to get an IP address, but it’s clearly not something to rely on.

If I’m missing something about MACVLAN that makes DHCP work for Docker, let me know! Hardcoding an IP into a docker-compose file adds an extra step to remember, compared to everything else being configured on the centralized DHCP server - hence the shoddy-implementation claim for Docker.
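For reference, what I’m doing on the Docker side boils down to something like this on the CLI (subnet, gateway, names, and image are placeholders standing in for my compose file):

```sh
# macvlan network with br0 as the parent (values are examples)
docker network create -d macvlan \
  --subnet=192.168.69.0/24 --gateway=192.168.69.1 \
  -o parent=br0 lan

# the static --ip is the part I'd rather the DHCP server handle;
# without it, Docker just hands out the first free address itself
docker run -d --name testbox --network lan --ip 192.168.69.200 alpine sleep infinity
```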
Thanks for the link about using another MACVLAN interface to route around the host<->container connection issue inherent to MACVLAN. I’ll keep it in mind as an alternative to wrapping the container in an Incus container! I do wish there were something like Incus’ hassle-free solution for Docker or Podman.
Ah, now I understand your setup. To answer the title question, I’ll have to be a bit verbose with how I think Incus behaves, so that the Docker behavior can be put into context. Bear with me.
This behavior (br0 reusing eth0’s MAC) stood out to me, since it’s not a fundamental part of Linux bridging. It turns out this might be a systemd-specific thing: a bridge is functionally equivalent to a software switch, where every port of the switch has its own MAC, and all “clients” of that switch also have their own MACs. If I had to guess, systemd does this so that traffic from the physical interface (eth0) that passes directly through to br0 will keep the MAC from the physical network, thus making it easier to understand traffic flows in Wireshark, for example. I personally can’t agree with this design choice, since it obfuscates what Linux is really doing vis-a-vis a software switch. But reusing the MAC here is merely a weird side effect and doesn’t influence what Incus is doing.
Instead, the reason Incus needs the bridge interface is precisely because a physical interface like eth0 will not automatically forward frames to subordinate interfaces. Whereas for a virtual switch, that’s the default. To that end, the bridge interface is combined with virtual ethernet (veth) interfaces – another networking primitive in Linux – to each container that Incus manages. The behavior of a veth is akin to a point-to-point network cable, plus the NICs on both ends. That means a veth always consists of a pair of interfaces, where traffic into one end comes out the other, and each interface has its own MAC address. Functionally, this is the networking equivalent of a bidirectional pipe.
By combining a bridge (ie a virtual switch) with veth (ie virtual cables), we have a full Layer 2 network topology that behaves identically to a physical bridge with physical cables. Thus, your DHCP server is none the wiser when it sends and receives BOOTP traffic for assigning an IP address. This is the most flexible way of constructing a virtual network within Linux, since it has feature parity with physical networks: there is no Macvlan or Ipvlan or tunneling or whatever needed to make this work. Linux is just operating as a switch, with all the attendant flexibility. This architecture is what Calico – a network framework for Kubernetes – uses in order to achieve scalable, Layer 3 connectivity to containers; by default, Kubernetes does not depend on Layer 2 to function.
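If it helps to make that concrete, here is roughly how the same topology is built by hand with iproute2 (interface and namespace names are made up for illustration; Incus does the equivalent internally):

```sh
# attach the physical NIC to a bridge
ip link add br0 type bridge
ip link set br0 up
ip link set eth0 master br0

# a veth pair: one end joins the bridge, the other end moves into a container's netns
ip link add veth-host type veth peer name veth-ctr
ip link set veth-host master br0
ip link set veth-host up

ip netns add ctr1                           # stand-in for the container's network namespace
ip link set veth-ctr netns ctr1
ip netns exec ctr1 ip link set veth-ctr up
# a DHCP client run inside ctr1 now sees the physical LAN directly
```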
OK, so we now understand why Incus does things the way it does. For Docker, when using the Macvlan driver, the benefits of the bridge+veth model are not achieved, because Macvlan – although being a feature of Linux networking – is something which is implemented against an individual interface on the host. Compare this to a bridge, which is a standalone concept and thus can exist with or without any interfaces to the host: when Linux is actually used as a switch – like on many home routers – the host itself can choose to have zero interfaces attached to the switch, meaning that traffic flows through the box, rather than to the box as a destination.
So when creating subordinate interfaces using Macvlan, we get most of the bridging behavior of bridge+veth, but the Macvlan implementation in the kernel means that outbound traffic from a subordinate interface always gets put onto the outbound queue of the parent interface. This makes it impossible for a subordinate interface to exchange traffic with the host itself, by design. Had they chosen to go the extra mile, they would have just reinvented a version of bridge+veth that is excessively niche.
We also need to discuss the behavior of Docker networks. Similar to Kubernetes, containers managed by Docker mandate having IP connectivity (Layer 3). But whereas Kubernetes will not start a container unless an IPAM (IP Address Management) plugin explicitly provides an IP address, Docker’s legacy behavior is to always assign an IP address from a default range, unless given an IP explicitly. So even though bridge+veth or Macvlan provide the Layer 2 connectivity needed to reach a DHCP server and obtain an IP address, Docker is eager to provide an IP itself, just so the container has one from the very start. The distinction between Docker and Kubernetes+Calico is thus one of actual utility: by getting an address from Calico’s IPAM, Kubernetes knows that the address will actually work for networking, because Calico also creates/manages the network. Whereas Docker has no problem assigning an IP without actually checking whether that IP can be used on the network; it’s almost a pro-forma exercise.
I will say this about early Docker: although they led the charge for making containers useful, how they implemented networking was very strange and led to a whole class of engineers who now have a deep misunderstanding of how real networks operate, and that only causes confusion when scaling up to orchestrated container frameworks like Kubernetes that depend on rigorous understanding of networking and Linux implementations. But all the same, Docker was more interested in getting things working without external dependencies like DHCP servers, so there’s some sense in mandating an IP locally, perhaps because they didn’t yet envision that containers would talk to the physical network.
The plugin that you mentioned operates by requesting a DHCP-assigned address for each container, but within the Docker runtime. And once it obtains that address, it then statically assigns it to the container. So from the container’s perspective, it’s just getting an IP assigned to it, not aware that DHCP has happened at all. The plugin is thus responsible for renewing that IP periodically. It’s a kludge to satisfy Docker’s networking requirements while still using DHCP-assigned addresses. But Docker just doesn’t play well with Layer 2 physical networks, because otherwise the responsibility for running the DHCP client would fall to the containers; some containers might not even have a DHCP client to run.
Sadly, there just isn’t a really good way to do this within Docker, and it’s not the kernel’s fault. Other container runtimes like containerd – which relies wholly on the standard CNI plugins and thus doesn’t have Docker’s networking footguns – have no problem with containers running their own DHCP client on a bridged network. But for any container manager to handle DHCP assignment without the container’s cooperation always leads to the same kludge as what Docker did. And that’s probably why no major container manager does that natively; it’s hard to solve.
Since your containers were able to get their own DHCP addresses from a bridged network in Incus, can you still run the DHCP client on those containers to override Docker’s randomly-assigned local IP address? You’d have to use the bridge network driver in Docker, since you also want host-container traffic to work and we know Macvlan won’t do that. But even this is a delicate solution, since if DHCP fails to assign an address, then your container still has the Docker-assigned address but it won’t be usable on the bridged network.
The best solution I’ve seen for containers on DHCP-assigned networks is to not use DHCP assignment at all. Instead, part of the IP subnet is carved out, a region which is dedicated only for containers. So in a home IPv4 network like 192.168.69.0/24, the DHCP server would be restricted to only assigning 192.168.69.2 through 192.168.69.127, and then Docker would be allowed to allocate the addresses from 192.168.69.128 to 192.168.69.254 however it wants, with a subnet mask of 255.255.255.0. This mask allows containers to speak directly to addresses in the entire 192.168.69.0/24 range, which includes the rest of the network. The other physical hosts do the same, allowing them to connect to containers.
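A sketch of that carve-out on the Docker side, reusing the example range above (adjust to your own network):

```sh
# the DHCP server is configured to lease only .2-.127;
# Docker draws container addresses only from .128 upward
docker network create -d macvlan \
  --subnet=192.168.69.0/24 --gateway=192.168.69.1 \
  --ip-range=192.168.69.128/25 \
  -o parent=br0 lan-carveout
```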
This neatly avoids interacting with the DHCP server, but at the cost of central management, and it splits the allocatable addresses into smaller pools, potentially causing exhaustion on one side while the other has spare addresses. Yet another reason to adopt IPv6 as the standard for containers, but I digress. For Kubernetes and similar orchestration frameworks, DHCP isn’t even considered, since the orchestrator must have full internal authority to assign addresses with its chosen IPAM plugin.
TL;DR: if your containers are like mini VMs, DHCP assignment is doable. But if they’re pre-packaged appliances, then only sadness results when trying to use DHCP.
This was very insightful and I’d like to say I grokked 90% of it meaningfully!
For an Incus container with its unique-MAC interface: yes, if I run a Docker container inside that Incus container and leave the Docker container in its default bridge mode, then I get the desired feature set (with the power of onions).
And thanks for explaining CNI, I’ve seen it referenced but didn’t fully get how it’s involved. I see that podman uses it to make a MACVLAN interface that can do DHCP (until 5.0, but the replacement seems to be feature-compatible for MACVLAN), so podman will sidestep the pain point of having to assign a no-go-zone on the DHCP server for a Docker swath of IPv4s, as you mentioned. Close enough for containers that the host doesn’t need to talk to.
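For anyone finding this later, the Podman side appears to boil down to something like this (I haven’t battle-tested it; it assumes a Podman version with Netavark and its DHCP proxy service enabled, and my interface names):

```sh
# macvlan network whose container addresses come from the LAN's DHCP server
podman network create -d macvlan -o parent=br0 --ipam-driver dhcp lan-dhcp
podman run -d --network lan-dhcp alpine sleep infinity
```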
So in summary:

- I’ve got Docker doing all it can manage with MACVLAN, and there are no extra magicks to be done on it.
- Podman will still use MACVLAN (still no host-to-container comms), but it’s able to use DHCP to get an address for the MACVLAN container.
- If the host must talk to the container with MACVLAN, I can either use the MACVLAN bypass you linked to above or put the Docker/Podman container inside an Incus container with its bridge mode.
- Kubernetes continues to sound very powerful and flexible, but it is definitely beyond my reach yet. (Womp womp)
Thanks again for taking the time to type and explain all of that!
Kubernetes does indeed have a learning curve, but it’s also strangely accommodating for single-node setups which can then be expanded only by adding components, rather than tearing the whole thing down and starting again. In that sense, it’s a great learning platform towards managing larger or commercial clusters, if simply to get experience with the unique challenges inherent to scaling up.
But that might be more of a !homelab@lemmy.ml point of view haha
What about using the default docker bridge networking instead of macvlan? You can access docker containers from the host, they can talk to each other if on the same bridge network, and there’s nothing hardcoded into the docker compose files.
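For example, something along these lines (names and image are arbitrary):

```sh
# user-defined bridge: containers reach each other by name,
# and the host reaches a container through its published port
docker network create internal
docker run -d --network internal --name web -p 8080:80 nginx
docker run --rm --network internal alpine wget -qO- http://web   # container -> container by name
curl http://localhost:8080                                       # host -> container via the published port
```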
With the default Docker bridge networking, the container won’t have a unique IP/MAC address on the local network, as far as I am aware. External clients have to contact the host server’s IP at the port the container is mapped to in order to interact with it. If there’s a way to specify a specific parent interface, let me know!
That’s correct, but that’s fine for the majority of setups.