author	Lennart Poettering <>	2018-02-16 15:35:49 +0100
committer	Sven Eden <>	2018-05-30 07:59:03 +0200
commit	d67058ae3d1ae68ed0a507109ac584df161c3c2b (patch)
tree	70110539f5570d5fdfc81f92b708803055fe084c /src/core
parent	e614bab976056dd3b6e402a71a0a6de5b0b7572b (diff)
bpf: use BPF_F_ALLOW_MULTI flag if it is available
This new kernel 4.15 flag permits multiple BPF programs to be executed for each packet processed: multiple per cgroup, plus all programs defined further up the tree on all parent cgroups. We can use this for two features:

1. Finally provide per-slice IP accounting (which was previously unavailable).
2. Permit delegation of BPF programs to services (i.e. leaf nodes).

This patch beefs up PID1's handling of BPF to enable both. Note two special items to keep in mind:

a. Our inner-node BPF programs (i.e. the ones we attach to slices) do not enforce IP access lists; that's done exclusively in the leaf-node BPF programs. That's a good thing, since that way rules in leaf nodes can cancel out rules further up (for example, to implement a logic of "disallow everything except httpd.service"). Inner-node BPF programs do accounting however, if that's requested. This is beneficial for performance reasons: it means that in order to provide per-slice IP accounting we don't have to add up all child units' data.

b. When this code is run on a pre-4.15 kernel (i.e. where BPF_F_ALLOW_MULTI is not available) we'll make IP accounting on slice units unavailable (i.e. revert to the behaviour from before this commit). For leaf nodes we'll fall back to non-ALLOW_MULTI mode however, which means that BPF delegation is not available there at all if IP fw/acct is turned on for the unit. This is a change from earlier behaviour, where we used the BPF_F_ALLOW_OVERRIDE flag, so that our fw/acct would lose its effect as soon as delegation was turned on and some client made use of it. I think the new behaviour is the safer choice here, as silent bypassing of our fw rules is no longer possible. And if people want proper delegation, the way out is a more modern kernel or turning off IP firewalling/accounting for the unit altogether.
Diffstat (limited to 'src/core')
1 files changed, 2 insertions, 20 deletions
diff --git a/src/core/cgroup.c b/src/core/cgroup.c
index c2637a783..dc0556695 100644
--- a/src/core/cgroup.c
+++ b/src/core/cgroup.c
@@ -694,20 +694,14 @@ static void cgroup_apply_unified_memory_limit(Unit *u, const char *file, uint64_
 static void cgroup_apply_firewall(Unit *u) {
-        int r;
-
-        if (u->type == UNIT_SLICE) /* Skip this for slice units, they are inner cgroup nodes, and since bpf/cgroup is
-                                    * not recursive we don't ever touch the bpf on them */
-                return;
+        /* Best-effort: let's apply IP firewalling and/or accounting if that's enabled */
 
-        r = bpf_firewall_compile(u);
-        if (r < 0)
+        if (bpf_firewall_compile(u) < 0)
                 return;
 
         (void) bpf_firewall_install(u);
-
-        return;
 }
static void cgroup_context_apply(
@@ -1228,11 +1222,6 @@ bool unit_get_needs_bpf(Unit *u) {
         Unit *p;
 
-        /* We never attach BPF to slice units, as they are inner cgroup nodes and cgroup/BPF is not recursive at the
-         * moment. */
-        if (u->type == UNIT_SLICE)
-                return false;
-
         c = unit_get_cgroup_context(u);
         if (!c)
                 return false;
@@ -2623,13 +2612,6 @@ int unit_get_ip_accounting(
-        /* IP accounting is currently not recursive, and hence we refuse to return any data for slice nodes. Slices are
-         * inner cgroup nodes and hence have no processes directly attached, hence their counters would be zero
-         * anyway. And if we block this now we can later open this up, if the kernel learns recursive BPF cgroup
-         * filters. */
-        if (u->type == UNIT_SLICE)
-                return -ENODATA;
-
         if (!UNIT_CGROUP_BOOL(u, ip_accounting))
                 return -ENODATA;