summaryrefslogtreecommitdiff
path: root/src/core
Commit message (Collapse)AuthorAge
* mount-setup: change bpf mount mode to 0700 (#8334)Lennart Poettering2018-05-30
| | | | | After discussing with the kernel folks, we agreed to default to 0700 for this. Better safe than sorry.
* core: turn on memory/cpu/tasks accounting by default for the root sliceLennart Poettering2018-05-30
| | | | | | | | The kernel exposes the necessary data in /proc anyway, let's expose it hence by default. With this in place "systemctl status -- -.slice" will show accounting data out-of-the-box now.
* core: hook up /proc queries for the root slice, tooLennart Poettering2018-05-30
| | | | | Do what we already prepped in cgtop for the root slice in PID 1 too: consult /proc for the data we need.
* cgroup-util: rework cg_get_keyed_attribute() a bitLennart Poettering2018-05-30
| | | | | | | | | | | | | Let's make sure we don't clobber the return parameter on failure, to follow our coding style. Also, break the loop early if we have all attributes we need. This also changes the keys parameter to a simple char**, so that we can use STRV_MAKE() for passing the list of attributes to read. This also makes it possible to distuingish the case when the whole attribute file doesn't exist from one key in it missing. In the former case we return -ENOENT, in the latter we now return -ENXIO.
* mount-setup: always use the same source as fstype for the API VFS we mountLennart Poettering2018-05-30
| | | | | | | So far, for all our API VFS mounts we used the fstype also as mount source, let's do that for the cgroupsv2 mounts too. The kernel doesn't really care about the source for API VFS, but it's visible to the user, hence let's clean this up and follow the rule we otherwise follow.
* bpf: use BPF_F_ALLOW_MULTI flag if it is availableLennart Poettering2018-05-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This new kernel 4.15 flag permits that multiple BPF programs can be executed for each packet processed: multiple per cgroup plus all programs defined up the tree on all parent cgroups. We can use this for two features: 1. Finally provide per-slice IP accounting (which was previously unavailable) 2. Permit delegation of BPF programs to services (i.e. leaf nodes). This patch beefs up PID1's handling of BPF to enable both. Note two special items to keep in mind: a. Our inner-node BPF programs (i.e. the ones we attach to slices) do not enforce IP access lists, that's done exclsuively in the leaf-node BPF programs. That's a good thing, since that way rules in leaf nodes can cancel out rules further up (i.e. for example to implement a logic of "disallow everything except httpd.service"). Inner node BPF programs to accounting however if that's requested. This is beneficial for performance reasons: it means in order to provide per-slice IP accounting we don't have to add up all child unit's data. b. When this code is run on pre-4.15 kernel (i.e. where BPF_F_ALLOW_MULTI is not available) we'll make IP acocunting on slice units unavailable (i.e. revert to behaviour from before this commit). For leaf nodes we'll fallback to non-ALLOW_MULTI mode however, which means that BPF delegation is not available there at all, if IP fw/acct is turned on for the unit. This is a change from earlier behaviour, where we use the BPF_F_ALLOW_OVERRIDE flag, so that our fw/acct would lose its effect as soon as delegation was turned on and some client made use of that. I think the new behaviour is the safer choice in this case, as silent bypassing of our fw rules is not possible anymore. And if people want proper delegation then the way out is a more modern kernel or turning off IP firewalling/acct for the unit algother.
* bpf: mount bpffs by default on bootLennart Poettering2018-05-30
| | | | | | We make heavy use of BPF functionality these days, hence expose the BPF file system too by default now. (Note however, that we don't actually make use bpf file systems object yet, but we might later on too.)
* pid1: do not initialize join_controllers by defaultZbigniew Jędrzejewski-Szmek2018-05-30
| | | | | We're moving towards unified cgroup hierarchy where this is not necessary. This makes main.c a bit simpler.
* meson: drop double .in suffix for o.fd.systemd1.policy fileZbigniew Jędrzejewski-Szmek2018-05-30
| | | | | This file is now undergoing just one transformation, so drop the unnecessary suffix.
* Gettextize policy filesGunnar Hjalmarsson2018-05-30
| | | | | | | * Don't merge translations into the files * Add gettext-domain="systemd" to description and message Closes #8162, replaces #8118.
* meson: add -Dmemory-accounting-default=true|falseZbigniew Jędrzejewski-Szmek2018-05-30
| | | | | This makes it easy to set the default for distributions and users which want to default to off because they primarily use older kernels.
* core: add new new bus call for migrating foreign processes to scope/service ↵Lennart Poettering2018-05-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | units This adds a new bus call to service and scope units called AttachProcesses() that moves arbitrary processes into the cgroup of the unit. The primary user for this new API is systemd itself: the systemd --user instance uses this call of the systemd --system instance to migrate processes if itself gets the request to migrate processes and the kernel refuses this due to access restrictions. The primary use-case of this is to make "systemd-run --scope --user …" invoked from user session scopes work correctly on pure cgroupsv2 environments. There, the kernel refuses to migrate processes between two unprivileged-owned cgroups unless the requestor as well as the ownership of the closest parent cgroup all match. This however is not the case between the session-XYZ.scope unit of a login session and the user@ABC.service of the systemd --user instance. The new logic always tries to move the processes on its own, but if that doesn't work when being the user manager, then the system manager is asked to do it instead. The new operation is relatively restrictive: it will only allow to move the processes like this if the caller is root, or the UID of the target unit, caller and process all match. Note that this means that unprivileged users cannot attach processes to scope units, as those do not have "owning" users (i.e. they have now User= field). Fixes: #3388
* cgroup: add a new "can_delegate" flag to the unit vtable, and set it for ↵Lennart Poettering2018-05-30
| | | | | | | | | | | | | | | | scope and service units only Currently we allowed delegation for alluntis with cgroup backing except for slices. Let's make this a bit more strict for now, and only allow this in service and scope units. Let's also add a generic accessor unit_cgroup_delegate() for checking whether a unit has delegation turned on that checks the new bool first. Also, when doing transient units, let's explcitly refuse turning on delegation for unit types that don#t support it. This is mostly cosmetical as we wouldn't act on the delegation request anyway, but certainly helpful for debugging.
* core: propagate TasksMax= on the root slice to sysctlsLennart Poettering2018-05-30
| | | | | | | | | | The cgroup "pids" controller is not supported on the root cgroup. However we expose TasksMax= on it, but currently don't actually apply it to anything. Let's correct this: if set, let's propagate things to the right sysctls. This way we can expose TasksMax= on all units in a somewhat sensible way.
* cgroup: when querying the number of tasks in the root slice use the pid_max ↵Lennart Poettering2018-05-30
| | | | | | | sysctl The root cgroup doesn't expose and properties in the "pids" cgroup controller, hence we need to get the data from somewhere else.
* cgroup: add proper API to determine whether our unit manags to root cgroupLennart Poettering2018-05-30
|
* cgroup: use CGROUP_LIMIT_MAX where appropriateLennart Poettering2018-05-30
|
* core: rework how we track which PIDs to watch for a unitLennart Poettering2018-05-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, we'd maintain two hashmaps keyed by PIDs, pointing to Unit interested in SIGCHLD events for them. This scheme allowed a specific PID to be watched by exactly 0, 1 or 2 units. With this rework this is replaced by a single hashmap which is primarily keyed by the PID and points to a Unit interested in it. However, it optionally also keyed by the negated PID, in which case it points to a NULL terminated array of additional Unit objects also interested. This scheme means arbitrary numbers of Units may now watch the same PID. Runtime and memory behaviour should not be impact by this change, as for the common case (i.e. each PID only watched by a single unit) behaviour stays the same, but for the uncommon case (a PID watched by more than one unit) we only pay with a single additional memory allocation for the array. Why this all? Primarily, because allowing exactly two units to watch a specific PID is not sufficient for some niche cases, as processes can belong to more than one unit these days: 1. sd_notify() with MAINPID= can be used to attach a process from a different cgroup to multiple units. 2. Similar, the PIDFile= setting in unit files can be used for similar setups, 3. By creating a scope unit a main process of a service may join a different unit, too. 4. On cgroupsv1 we frequently end up watching all processes remaining in a scope, and if a process opens lots of scopes one after the other it might thus end up being watch by many of them. This patch hence removes the 2-unit-per-PID limit. It also makes a couple of other changes, some of them quite relevant: - manager_get_unit_by_pid() (and the bus call wrapping it) when there's ambiguity will prefer returning the Unit the process belongs to based on cgroup membership, and only check the watch-pids hashmap if that fails. This change in logic is probably more in line with what people expect and makes things more stable as each process can belong to exactly one cgroup only. - Every SIGCHLD event is now dispatched to all units interested in its PID. Previously, there was some magic conditionalization: the SIGCHLD would only be dispatched to the unit if it was only interested in a single PID only, or the PID belonged to the control or main PID or we didn't dispatch a signle SIGCHLD to the unit in the current event loop iteration yet. These rules were quite arbitrary and also redundant as the the per-unit handlers would filter the PIDs anyway a second time. With this change we'll hence relax the rules: all we do now is dispatch every SIGCHLD event exactly once to each unit interested in it, and it's up to the unit to then use or ignore this. We use a generation counter in the unit to ensure that we only invoke the unit handler once for each event, protecting us from confusion if a unit is both associated with a specific PID through cgroup membership and through the "watch_pids" logic. It also protects us from being confused if the "watch_pids" hashmap is altered while we are dispatching to it (which is a very likely case). - sd_notify() message dispatching has been reworked to be very similar to SIGCHLD handling now. A generation counter is used for dispatching as well. This also adds a new test that validates that "watch_pid" registration and unregstration works correctly.
* core: unify call we use to synthesize cgroup empty events when we stopped ↵Lennart Poettering2018-05-30
| | | | | | | | | watching any unit PIDs This code is very similar in scope and service units, let's unify it in one function. This changes little for service units, but for scope units makes sure we go through the cgroup queue, which is something we should do anyway.
* core: fix manager_get_unit_by_pid() special casing of manager PIDLennart Poettering2018-05-30
| | | | | | | Previously, we'd hard map PID 1 to the manager scope unit. That's wrong however when we are run in --user mode, as the PID 1 is outside of the subtree we manage and the manager PID might be very differently. Correct that by checking for getpid() rather than hardcoding 1.
* core: un-break PrivateDevices= by allowing it to mknod /dev/ptmxAlan Jenkins2018-05-30
| | | | | | | | | | | | #7886 caused PrivateDevices= to silently fail-open. https://github.com/systemd/systemd/pull/7886#issuecomment-358542849 Allow PrivateDevices= to succeed, in creating /dev/ptmx, even though DeviceControl=closed applies. No specific justification was given for blocking mknod of /dev/ptmx. Only that we didn't seem to need it, because we weren't creating it correctly as a device node.
* core: add dbus-util.[ch] to simplify creating transient unitsYu Watanabe2018-05-30
| | | | The functions and macros introduced by them will be used in later commits.
* basic: split out blockdev-util.[ch] from util.hLennart Poettering2018-05-30
| | | | With three functions it makes sense to split this out now.
* mount-setup: fix MNT_CHECK_WRITABLE error handling, and log about the issueLennart Poettering2018-05-30
| | | | | Let's correct the error handling (the error is in errno, not r), and let's add logging like the rest of the function has it.
* Reverted accidential renaming of /run/systemd to /run/elogind. Applications ↵Sven Eden2018-04-23
| | | | using elogind as a drop-in replacement expect the first.
* Prep v236 : Add missing SPDX-License-Identifier (3/9) src/coreSven Eden2018-03-26
|
* Prep v236: Apply missing upstream updates to the build systemSven Eden2018-03-13
|
* Fix SELinux labels in cgroup filesystem root directory (#7496)Krzysztof Nowicki2017-11-30
| | | | | | | | | | | | | | | When using SELinux with legacy cgroups the tmpfs on /sys/fs/cgroup is by default labelled as tmpfs_t. This label is also inherited by the "cpu" and "cpuacct" symbolic links. Unfortunately the policy expects them to be labelled as cgroup_t, which is used for all the actual cgroup filesystems. Failure to do so results in a stream of denials. This state cannot be fixed reliably when the cgroup filesystem structure is set-up as the SELinux policy is not yet loaded at this moment. It also cannot be fixed later as the root of the cgroup filesystem is remounted read-only. In order to fix it the root of the cgroup filesystem needs to be temporary remounted read-write, relabelled and remounted back read-only.
* core: warn about left-over processes in cgroup on unit startLennart Poettering2017-11-24
| | | | | | Now that we don't kill control processes anymore, let's at least warn about any processes left-over in the unit cgroup at the moment of starting the unit.
* unit: initialize bpf cgroup realization state properlyLennart Poettering2017-11-24
| | | | | | | | | | | | | | | | | | Before this patch, the bpf cgroup realization state was implicitly set to "NO", meaning that the bpf configuration was realized but was turned off. That means invalidation requests for the bpf stuff (which we issue in blanket fashion when doing a daemon reload) would actually later result in a us re-realizing the unit, under the assumption it was already realized once, even though in reality it never was realized before. This had the effect that after each daemon-reload we'd end up realizing *all* defined units, even the unloaded ones, populating cgroupfs with lots of unneeded empty cgroups. With this fix we properly set the realiazation state to "INVALIDATED", i.e. indicating the bpf stuff was never set up for the unit, and hence when we try to invalidate it later we won't do anything.
* cgroup: when dispatching the cgroup realization queue, check again if we ↵Lennart Poettering2017-11-24
| | | | | | | | | | | | | | | shall actually realize We add units to the cgroup realization queue when propagating realizing requests to sibling units, and when invalidating cgroup settings because some cgroup setting changed. In the time between where we add the unit to the queue until the cgroup is actually dispatched the unit's state might have changed however, so that the unit doesn't actually need to be realized anymore, for example because the unit went down. To handle that, check the unit state again, if realization makes sense. Redundant realization is usually not a problem, except when the unit is not actually running, hence check exactly for that.
* cgroup: drop unused parameter from functionLennart Poettering2017-11-24
|
* cgroup: downgrade the log level of "invocation id" messages to debug (#7422)Evgeny Vereshchagin2017-11-23
| | | | | Now that d3070fbdf6077d7d has been merged, these errors are not as critical as they used to be.
* cgroup: fix delegation on the unified hierarchyLennart Poettering2017-11-17
| | | | | | | | | | | | | | | | | | | | | Make sure to add the delegation mask to the mask of controllers we have to enable on our own unit. Do not claim it was a members mask, as such a logic would mean we'd collide with cgroupv2's "no processes on inner nodes policy". This change does the right thing: it means any controller enabled through Controllers= will be made available to subcrgoups of our unit, but the unit itself has to still enable it through cgroup.subtree_control (which it can since that file is delegated too) to be inherited further down. Or to say this differently: we only should manipulate cgroup.subtree_control ourselves for inner nodes (i.e. slices), and for leaves we need to provide a way to enable controllers in the slices above, but stay away from the cgroup's own cgroup.subtree_control — which is what this patch ensures. Fixes: #7355
* core: fix message about detected memory hierarchyZbigniew Jędrzejewski-Szmek2017-11-15
| | | | Just the error check and message were wrong, otherwise the logic was OK.
* Use plural DelegateControllers= consistentlyZbigniew Jędrzejewski-Szmek2017-11-14
|
* core: rework the Delegate= unit file setting to take a list of controller namesLennart Poettering2017-11-09
| | | | | | | | Previously it was not possible to select which controllers to enable for a unit where Delegate=yes was set, as all controllers were enabled. With this change, this is made configurable, and thus delegation units can pick specifically what they want to manage themselves, and what they don't care about.
* cgroup: make use of unit_get_subtree_mask() where appropriateLennart Poettering2017-11-08
| | | | | subtree_mask is own_mask | members_mask, let's make use of that to shorten a few things
* core: track why unit dependencies came to beLennart Poettering2017-10-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces the dependencies Set* objects by Hashmap* objects, where the key is the depending Unit, and the value is a bitmask encoding why the specific dependency was created. The bitmask contains a number of different, defined bits, that indicate why dependencies exist, for example whether they are created due to explicitly configured deps in files, by udev rules or implicitly. Note that memory usage is not increased by this change, even though we store more information, as we manage to encode the bit mask inside the value pointer each Hashmap entry contains. Why this all? When we know how a dependency came to be, we can update dependencies correctly when a configuration source changes but others are left unaltered. Specifically: 1. We can fix UDEV_WANTS dependency generation: so far we kept adding dependencies configured that way, but if a device lost such a dependency we couldn't them again as there was no scheme for removing of dependencies in place. 2. We can implement "pin-pointed" reload of unit files. If we know what dependencies were created as result of configuration in a unit file, then we know what to flush out when we want to reload it. 3. It's useful for debugging: "elogind-analyze dump" now shows this information, helping substantially with understanding how elogind's dependency tree came to be the way it came to be.
* Some minor cleanupsSven Eden2018-03-07
|
* Meson build system: Add missing '#' in masked blocksSven Eden2018-03-07
|
* Prep 235: Make cgroups2 available, hybrid mode already works.Sven Eden2018-01-10
|
* Fix various build failures with the latest systemd updates.Sven Eden2017-12-08
|
* tree-wide: use IN_SET macro (#6977)Yu Watanabe2017-12-08
|
* build-sys: s/HAVE_SMACK/ENABLE_SMACK/Zbigniew Jędrzejewski-Szmek2017-12-08
| | | | Same justification as for HAVE_UTMP.
* Apply updates from upstreamSven Eden2017-12-07
|
* build-sys: use #if Y instead of #ifdef Y everywhereZbigniew Jędrzejewski-Szmek2017-11-23
| | | | | | | | | | | | | | | The advantage is that is the name is mispellt, cpp will warn us. $ git grep -Ee "conf.set\('(HAVE|ENABLE)_" -l|xargs sed -r -i "s/conf.set\('(HAVE|ENABLE)_/conf.set10('\1_/" $ git grep -Ee '#ifn?def (HAVE|ENABLE)' -l|xargs sed -r -i 's/#ifdef (HAVE|ENABLE)/#if \1/; s/#ifndef (HAVE|ENABLE)/#if ! \1/;' $ git grep -Ee 'if.*defined\(HAVE' -l|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_]*)\)/\1/g' $ git grep -Ee 'if.*defined\(ENABLE' -l|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_]*)\)/\1/g' + manual changes to meson.build squash! build-sys: use #if Y instead of #ifdef Y everywhere v2: - fix incorrect setting of HAVE_LIBIDN2
* core: chown() StateDirectory= and friends recursively when starting a serviceLennart Poettering2017-11-22
| | | | | | | This is particularly useful when used in conjunction with DynamicUser=1, where the UID might change for every invocation, but is useful in other cases too, for example, when these directories are shared between systems where the UID assignments differ slightly.
* cgroup: IN_SET() FTW!Lennart Poettering2017-09-26
|
* cgroup: after determining that a cgroup is empty, asynchronously dispatch thisLennart Poettering2017-11-22
| | | | | | | | | | | | | This makes sure that if we learn via inotify or another event source that a cgroup is empty, and we checked that this is indeed the case (as we might get spurious notifications through inotify, as the inotify logic through the "cgroups.event" is pretty unspecific and might be trigger for a variety of reasons), then we'll enqueue a defer event for it, at a priority lower than SIGCHLD handling, so that we know for sure that if there's waitid() data for a process we used it before considering the cgroup empty notification. Fixes: #6608