path: root/src/core/cgroup.c
Commit message (Author, Age)
...
* core: rename cgroup_queue → cgroup_realize_queue (Lennart Poettering, 2017-11-22)
| We are about to add a second cgroup-related queue, called "cgroup_empty_queue", hence let's rename "cgroup_queue" to "cgroup_realize_queue" (as that is its purpose) to minimize confusion about the two queues. Just a rename, no functional changes.
* core/cgroup: add a helper macro for a common pattern (#6926) (Zbigniew Jędrzejewski-Szmek, 2017-11-22)
|
* cgroup: refuse to return accounting data if accounting isn't turned on (Lennart Poettering, 2017-11-21)
| We used to be a bit sloppy on this, and handed out accounting data even for units where accounting wasn't explicitly enabled. Let's be stricter here, so that we know the accounting data is actually fully valid. This is necessary, as the accounting data is no longer stored exclusively in cgroupfs, but is partly maintained outside of it, and flushed during unit starts. We should hence only expose accounting data we really know is fully current.
* core: when coming back from reload/reexec, reapply all cgroup properties (Lennart Poettering, 2017-09-07)
| | | | | | | | | | | | With this change we'll invalidate all cgroup settings after coming back from a daemon reload/reexec, so that the new settings are instantly applied. This is useful for the BPF case, because we don't serialize/deserialize the BPF program fd, and hence have to install a new, updated BPF program when coming back from the reload/reexec. However, this is also useful for the rest of the cgroup settings, as it ensures that user configuration really takes effect wherever we can.
* core: serialize/deserialize IP accounting across daemon reload/reexec (Lennart Poettering, 2017-09-07)
| | | | | | | | | | | | | | | Make sure the current IP accounting counters aren't lost during reload/reexec. Note that we destroy all BPF file objects during a reload: the BPF programs, the access and the accounting maps. The former two need to be regenerated anyway with the newly loaded configuration data, but the latter one needs to survive reloads/reexec. In this implementation I opted to only save/restore the accounting map content instead of the map itself. While this opens a (theoretic) window where IP traffic is still accounted to the old map after we read it out, and we thus miss a few bytes this has the benefit that we can alter the map layout between versions should the need arise.
* cgroup: dump the newly added IP settings in the cgroup context (Lennart Poettering, 2017-09-01)
|
* cgroup, unit, fragment parser: make use of new firewall functions (Daniel Mack, 2017-11-21)
|
* cgroup: add fields to accommodate eBPF related details (Daniel Mack, 2017-11-21)
| | | | | Add pointers for compiled eBPF programs as well as list heads for allowed and denied hosts for both directions.
* manager: watching the cgroup2 inotify fd is safe in test runs too (Lennart Poettering, 2017-11-20)
| | | | | Less deviation between test runs and normal runs is always a good idea, hence enable more stuff that is safe in test runs
* cgroup: always invalidate "cpu" and "cpuacct" together (Lennart Poettering, 2017-09-05)
| This doesn't really matter, as we never invalidate cpuacct explicitly, and there's no real reason to care for it explicitly; however, it's prettier if we always treat cpu and cpuacct as belonging together, the same way we consider "io" and "blkio" to belong together.
* Make test_run into a flags field and disable generators again (Zbigniew Jędrzejewski-Szmek, 2017-09-25)
| | | | | | | | | | Now generators are only run in elogind --test mode, where this makes most sense (how are you going to test what would happen otherwise?). Fixes #6842. v2: - rename test_run to test_run_flags
* Prep v235: Apply pending upstream updates in src/core [2/4] (Sven Eden, 2017-08-30)
|
* Prep v235: Apply upstream fixes (4/10) [src/core] (Sven Eden, 2017-08-14)
|
* Prep v234: Eventually fix the cgroup stuff. elogind is not init. (Sven Eden, 2017-07-27)
|
* tree-wide: when %m is used in log_*, always specify errno explicitly (Zbigniew Jędrzejewski-Szmek, 2017-07-25)
| | | | | | | | All those uses were correct, but I think it's better to be explicit. Using implicit errno is too error prone, and with this change we can require (in the sense of a style guideline) that the code is always specified. Helpful query: git grep -n -P 'log_[^s][a-z]+\(.*%m'
* cgroup: rename cg_unified() → cg_unified_controller() (Lennart Poettering, 2017-07-17)
| | | | | cg_unified() is a bit generic a name, let's make clear that it checks whether a specified controller is in unified mode.
* cgroup: change cg_unified() to possibly return errors again (Lennart Poettering, 2017-07-17)
| We use our cgroup APIs in various contexts, including from our libraries sd-login, sd-bus. As we don't control those environments we can't rely on the unified cgroup setup logic succeeding, and hence really shouldn't assert on it. This more or less reverts 415fc41ceaeada2e32639f24f134b1c248b9e43f.
* cgroup: properly check for ignore-notfound paths (#4803) (Dave Reisner, 2017-07-17)
| | | | Follow-up to #4687 and e7330dfe14b1965f.
* cgroup: support prefix "-" in cgroups whitelisting entries (#4687) (Dongsu Park, 2017-07-17)
| So far the elogind-nspawn container has been creating files under /run/elogind/inaccessible, no matter whether it's running in a user namespace or not. That's fine for regular files, dirs, socks, fifos. However, it's not fine for block and character devices, because the kernel doesn't allow them to be created under a user namespace. This results in warnings at boot like these:
|
| ====
| Couldn't stat device /run/elogind/inaccessible/chr
| Couldn't stat device /run/elogind/inaccessible/blk
| ====
|
| Thus we need the cgroups whitelisting handler to silently ignore a file when the device path is prefixed with "-". That's exactly the same convention used in directives like ReadOnlyPaths=. Also insert the prefix "-" into the inaccessible entries.
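The "-" prefix convention described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name whitelist_stat() and its return convention are hypothetical and not taken from elogind's sources.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical helper (not elogind's real function): stat the device node
 * named by a whitelist entry. A leading "-" marks the entry as optional.
 * Returns 1 if the node exists, 0 if it is missing but the entry was
 * optional, and a negative errno value otherwise. */
static int whitelist_stat(const char *entry, struct stat *st) {
        bool ignore_notfound = false;

        assert(entry);
        assert(st);

        if (entry[0] == '-') {
                ignore_notfound = true;
                entry++;        /* the actual path starts after the prefix */
        }

        if (stat(entry, st) < 0) {
                if (errno == ENOENT && ignore_notfound)
                        return 0;       /* silently skip, no warning */

                return -errno;
        }

        return 1;
}
```

A caller would log a warning only for negative return values, so entries such as "-/run/elogind/inaccessible/chr" stay quiet when the node could not be created.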
* core: make SYSTEMD_CGROUP_CONTROLLER a special string (Tejun Heo, 2017-07-17)
| SYSTEMD_CGROUP_CONTROLLER is currently defined as "name=elogind" which cgroup utility functions interpret as a named cgroup hierarchy with the specified name. With the planned cgroup hybrid mode changes, SYSTEMD_CGROUP_CONTROLLER would map to different hierarchy names. This patch makes SYSTEMD_CGROUP_CONTROLLER a special string "_elogind" which is substituted to "name=elogind" by the cgroup utility functions. This allows the callers to address the elogind hierarchy without actually specifying the hierarchy name, allowing the cgroup utility functions to map it to whatever is appropriate. Note that SYSTEMD_CGROUP_CONTROLLER was already special on full unified cgroup hierarchy even before this patch.
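The substitution described above might look like the following sketch. The two string values come from the commit message; the macro name SYSTEMD_CGROUP_CONTROLLER_LEGACY and the helper controller_to_hierarchy() are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define SYSTEMD_CGROUP_CONTROLLER "_elogind"             /* special string, per the commit message */
#define SYSTEMD_CGROUP_CONTROLLER_LEGACY "name=elogind"  /* hypothetical macro name */

/* Hypothetical helper: map the controller string a caller passed in to the
 * hierarchy name that is actually used. Only the special string is
 * rewritten; every other controller name passes through unchanged. */
static const char *controller_to_hierarchy(const char *controller) {
        assert(controller);

        if (strcmp(controller, SYSTEMD_CGROUP_CONTROLLER) == 0)
                return SYSTEMD_CGROUP_CONTROLLER_LEGACY;

        return controller;
}
```

Keeping the mapping in one place is what lets a later hybrid mode return a different hierarchy name without touching the callers.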
* core: simplify cg_[all_]unified() (Tejun Heo, 2017-07-17)
| cg_[all_]unified() test whether a specific controller or all controllers are on the unified hierarchy. While what's being asked is a simple binary question, the callers must assume that the functions may fail any time, which unnecessarily complicates their usage. This complication is unnecessary. Internally, the test result is cached anyway and there are only a few places where the test actually needs to be performed.
|
| This patch simplifies cg_[all_]unified().
|
| * cg_[all_]unified() are updated to return bool. If the result can't be decided, an assertion failure is triggered. Error handling in their callers is dropped.
| * cg_unified_flush() is updated to calculate the new result synchronously and return whether it succeeded or not. Places which need to flush the test result are updated to test for failure. This ensures that all the following cg_[all_]unified() tests succeed.
| * Places which expected possible cg_[all_]unified() failures are updated to call and test cg_unified_flush() before calling cg_[all_]unified(). This includes functions used while setting up mounts during boot and manager_setup_cgroup().
* tree-wide: drop NULL sentinel from strjoin (Zbigniew Jędrzejewski-Szmek, 2017-07-17)
| This makes strjoin and strjoina more similar and avoids the useless final argument.
|
| spatch -I . -I ./src -I ./src/basic -I ./src/basic -I ./src/shared -I ./src/shared -I ./src/network -I ./src/locale -I ./src/login -I ./src/journal -I ./src/journal -I ./src/timedate -I ./src/timesync -I ./src/nspawn -I ./src/resolve -I ./src/resolve -I ./src/elogind -I ./src/core -I ./src/core -I ./src/libudev -I ./src/udev -I ./src/udev/net -I ./src/udev -I ./src/libelogind/sd-bus -I ./src/libelogind/sd-event -I ./src/libelogind/sd-login -I ./src/libelogind/sd-netlink -I ./src/libelogind/sd-network -I ./src/libelogind/sd-hwdb -I ./src/libelogind/sd-device -I ./src/libelogind/sd-id128 -I ./src/libelogind-network --sp-file coccinelle/strjoin.cocci --in-place $(git ls-files src/*.c)
|
| git grep -e '\bstrjoin\b.*NULL' -l|xargs sed -i -r 's/strjoin\((.*), NULL\)/strjoin(\1)/'
|
| This might have missed a few cases (spatch has a really hard time dealing with _cleanup_ macros), but that's no big issue, they can always be fixed later.
* tree-wide: use startswith return value to avoid hardcoded offset (Zbigniew Jędrzejewski-Szmek, 2017-07-17)
| I think it's an antipattern to have to count the number of bytes in the prefix by hand. We should do this automatically to avoid wasting programmer time, and possible errors. I didn't find any offsets that were wrong, so this change is mostly to make future development easier.
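As a rough illustration of the pattern, here is a self-contained sketch. The startswith() below is a stand-in written for this example, and named_hierarchy() is a hypothetical caller; neither is copied from the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the startswith() helper: returns a pointer to the text that
 * follows the prefix, or NULL if s does not start with the prefix. */
static const char *startswith(const char *s, const char *prefix) {
        size_t l;

        assert(s);
        assert(prefix);

        l = strlen(prefix);
        if (strncmp(s, prefix, l) != 0)
                return NULL;

        return s + l;
}

/* Hypothetical caller: extract the name from a "name=..." specification.
 * There is no hand-counted "controller + 5" here; the offset comes from
 * the return value of startswith(). */
static const char *named_hierarchy(const char *controller) {
        return startswith(controller, "name=");
}
```

If the prefix literal is ever changed, the offset follows automatically, which is the maintenance benefit the commit message is after.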
* core: make settings for unified cgroup hierarchy supersede the ones for legacy hierarchy (#4269) (Tejun Heo, 2017-07-05)
| There are overlapping control group resource settings for the unified and legacy hierarchies. To help transition, the settings are translated back and forth. When both versions of a given setting are present, the one matching the cgroup hierarchy type in use is used. Unfortunately, this is more confusing to use and document than necessary because there is no clear static precedence. Update the translation logic so that the settings for the unified hierarchy are always preferred. The elogind.resource-control man page is updated to reflect the change and reorganized so that the deprecated settings are at the end in their own section.
* core: add "invocation ID" concept to service manager (Lennart Poettering, 2017-07-05)
| This adds a new invocation ID concept to the service manager. The invocation ID identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is generated each time a unit moves from an inactive to an activating or active state.
|
| The primary usecase for this concept is to connect the runtime data PID 1 maintains about a service with the offline data the journal stores about it. Previously we'd use the unit name plus start/stop times, which however is highly racy since the journal will generally process log data after the service already ended.
|
| The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel, except that it applies to an individual unit instead of the whole system.
|
| The invocation ID is passed to the activated processes as environment variable. It is additionally stored as extended attribute on the cgroup of the unit. The latter is used by journald to automatically retrieve it for each logged message and attach it to the log entry. The environment variable is very easily accessible, even for unprivileged services. OTOH the extended attribute is only accessible to privileged processes (this is because cgroupfs only supports the "trusted." xattr namespace, not "user."). The environment variable may be altered by services, the extended attribute may not be, hence is the better choice for the journal.
|
| Note that reading the invocation ID off the extended attribute from journald is racy, similar to the way reading the unit name for a logging process is.
|
| This patch adds APIs to read the invocation ID to sd-id128: sd_id128_get_invocation() may be used in a similar fashion to sd_id128_get_boot().
|
| PID1's own logging is updated to always include the invocation ID when it logs information about a unit.
|
| A new bus call GetUnitByInvocationID() is added that allows retrieving a bus path to a unit by its invocation ID. The bus path is built using the invocation ID, thus providing a path for referring to a unit that is valid only for the current runtime cycle of it.
|
| Outlook for the future: should the kernel eventually allow passing of cgroup information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we can alter the invocation ID to be generated as hash from that rather than entirely randomly. This way we can derive the invocation race-freely from the messages.
* logind: update empty and "infinity" handling for [User]TasksMax (#3835) (Tejun Heo, 2017-07-05)
| The parsing functions for [User]TasksMax were inconsistent. Empty string and "infinity" were interpreted as no limit for TasksMax but not accepted for UserTasksMax. Update them so that they're consistent with other knobs.
|
| * Empty string indicates the default value.
| * "infinity" indicates no limit.
|
| While at it, replace open-coded (uint64_t) -1 with CGROUP_LIMIT_MAX in TasksMax handling.
|
| v2: Update empty string to indicate the default value as suggested by Zbigniew Jędrzejewski-Szmek.
| v3: Fixed empty UserTasksMax handling.
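The two rules above (empty string means the default, "infinity" means no limit) can be sketched as below. CGROUP_LIMIT_MAX as (uint64_t) -1 follows the commit message; the function name parse_tasks_max() and the default_value parameter are assumptions for this example, not the actual config parser.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CGROUP_LIMIT_MAX ((uint64_t) -1)        /* "no limit" */

/* Hypothetical parser: default_value stands in for whatever default the
 * caller wants an empty assignment to select. Returns 0 on success and a
 * negative errno value on failure. */
static int parse_tasks_max(const char *s, uint64_t default_value, uint64_t *ret) {
        unsigned long long v;
        char *end = NULL;

        assert(s);
        assert(ret);

        if (s[0] == '\0') {                     /* empty string: default value */
                *ret = default_value;
                return 0;
        }

        if (strcmp(s, "infinity") == 0) {       /* "infinity": no limit */
                *ret = CGROUP_LIMIT_MAX;
                return 0;
        }

        /* Only plain decimal numbers are accepted in this sketch; this also
         * rejects signs and leading whitespace that strtoull() would take. */
        if (s[0] < '0' || s[0] > '9')
                return -EINVAL;

        errno = 0;
        v = strtoull(s, &end, 10);
        if (errno != 0)
                return -errno;
        if (*end != '\0')
                return -EINVAL;

        *ret = (uint64_t) v;
        return 0;
}
```

Using one parser for both TasksMax and UserTasksMax is what removes the inconsistency the commit describes.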
* core: cache last CPU usage counter, before destroying a cgroup (Lennart Poettering, 2017-07-05)
| It is useful for clients to be able to read the last CPU usage counter value of a unit even if the unit is already terminated. Hence, before destroying a unit's cgroup, cache the last CPU usage counter and return it if the cgroup is gone.
* core: rename cg_unified() to cg_all_unified() (Tejun Heo, 2017-07-05)
| | | | | | | | | | | A following patch will update cgroup handling so that the elogind controller (/sys/fs/cgroup/elogind) can use the unified hierarchy even if the kernel resource controllers are on the legacy hierarchies. This would require distinguishing whether all controllers are on cgroup v2 or only the elogind controller is. In preparation, this patch renames cg_unified() to cg_all_unified(). This patch doesn't cause any functional changes.
* core: use the unified hierarchy for the elogind cgroup controller hierarchy (Tejun Heo, 2017-07-05)
| Currently, elogind uses either the legacy hierarchies or the unified hierarchy. When the legacy hierarchies are used, elogind uses a named legacy hierarchy mounted on /sys/fs/cgroup/elogind without any kernel controllers for process management. Due to the shortcomings in the legacy hierarchy, this involves a lot of workarounds and complexities.
|
| Because the unified hierarchy can be mounted and used in parallel to legacy hierarchies, there's no reason for elogind to use a legacy hierarchy for management even if the kernel resource controllers need to be mounted on legacy hierarchies. It can simply mount the unified hierarchy under /sys/fs/cgroup/elogind and use it without affecting other legacy hierarchies. This disables a significant amount of fragile workaround logic and would allow using features which depend on the unified hierarchy membership such as the bpf cgroup v2 membership test. In time, this would also allow deleting the said complexities.
|
| This patch updates elogind so that it prefers the unified hierarchy for the elogind cgroup controller hierarchy when legacy hierarchies are used for kernel resource controllers.
|
| * cg_unified(@controller) is introduced which tests whether the specific controller is on the unified hierarchy and is used to choose the unified hierarchy code path for process and service management when available. Kernel controller specific operations remain gated by cg_all_unified().
| * "elogind.legacy_elogind_cgroup_controller" kernel argument can be used to force the use of legacy hierarchy for elogind cgroup controller.
| * nspawn: By default nspawn uses the same hierarchies as the host. If UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all. If 0, legacy for all.
| * nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of three options - legacy, only elogind controller on unified, and unified. The value is passed into mount setup functions and controls cgroup configuration.
| * nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount option is moved to mount_legacy_cgroup_hierarchy() so that it can take an appropriate action depending on the configuration of the host.
|
| v2: - CGroupUnified enum replaces open coded integer values to indicate the cgroup operation mode.
|     - Various style updates.
| v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.
| v4: Restored legacy container on unified host support and fixed another bug in detect_unified_cgroup_hierarchy().
* core: add cgroup CPU controller support on the unified hierarchy (Tejun Heo, 2017-07-05)
| | | | | | | | | | | | | | | | | | | | | | | | | Unfortunately, due to the disagreements in the kernel development community, CPU controller cgroup v2 support has not been merged and enabling it requires applying two small out-of-tree kernel patches. The situation is explained in the following documentation. https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu While it isn't clear what will happen with CPU controller cgroup v2 support, there are critical features which are possible only on cgroup v2 such as buffered write control making cgroup v2 essential for a lot of workloads. This commit implements elogind CPU controller support on the unified hierarchy so that users who choose to deploy CPU controller cgroup v2 support can easily take advantage of it. On the unified hierarchy, "cpu.weight" knob replaces "cpu.shares" and "cpu.max" replaces "cpu.cfs_period_us" and "cpu.cfs_quota_us". [Startup]CPUWeight config options are added with the usual compat translation. CPU quota settings remain unchanged and apply to both legacy and unified hierarchies. v2: - Error in man page corrected. - CPU config application in cgroup_context_apply() refactored. - CPU accounting now works on unified hierarchy.
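The commit message mentions a compat translation between "cpu.shares" and "cpu.weight" without spelling it out. One plausible form, shown here purely as an assumption, is linear scaling between the kernel's documented defaults (cpu.shares defaults to 1024, cpu.weight defaults to 100 with a valid range of 1 to 10000), clamped to the valid range. The macro and function names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Kernel-documented defaults and range; the macro names are made up here. */
#define CPU_SHARES_DEFAULT UINT64_C(1024)
#define CPU_WEIGHT_DEFAULT UINT64_C(100)
#define CPU_WEIGHT_MIN     UINT64_C(1)
#define CPU_WEIGHT_MAX     UINT64_C(10000)

/* Assumed translation: scale so that the default shares value maps to the
 * default weight, then clamp into the range cpu.weight accepts. */
static uint64_t cpu_shares_to_weight(uint64_t shares) {
        uint64_t w;

        /* Guard the multiplication below against overflow. */
        assert(shares <= UINT64_MAX / CPU_WEIGHT_DEFAULT);

        w = shares * CPU_WEIGHT_DEFAULT / CPU_SHARES_DEFAULT;

        if (w < CPU_WEIGHT_MIN)
                return CPU_WEIGHT_MIN;
        if (w > CPU_WEIGHT_MAX)
                return CPU_WEIGHT_MAX;

        return w;
}
```

With this scaling a legacy setting of 2048 shares would become a weight of 200, keeping the same ratio relative to the default.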
* core: introduce MemorySwapMax= (WaLyong Cho, 2017-07-05)
| Similar to MemoryMax=, MemorySwapMax= limits swap usage. This controls the "memory.swap.max" attribute in the unified cgroup hierarchy.
* Prep v231: Apply missing fixes from upstream (2/6) src/core (Sven Eden, 2017-06-16)
|
* cgroup: whitelist inaccessible devices for "auto" and "closed" DevicePolicy. (Alessandro Puccetti, 2017-06-16)
| https://github.com/elogind/elogind/pull/3685 introduced /run/elogind/inaccessible/{chr,blk} to map inaccessible devices; this patch allows elogind running inside an nspawn container to create /run/elogind/inaccessible/{chr,blk}.
* core: when forcibly killing/aborting left-over unit processes log about it (Lennart Poettering, 2017-06-16)
| Let's log at LOG_NOTICE about any processes that we are going to SIGKILL/SIGABRT because clean termination of them didn't work. This turns the various boolean flag parameters to cg_kill(), cg_migrate() and related calls into a single binary flags parameter, simply because the function now gained even more parameters and the parameter list shouldn't get too long. Logging for killing processes is done either when the kill signal is SIGABRT or SIGKILL, or on explicit request if KILL_TERMINATE_AND_LOG instead of LOG_TERMINATE is passed. This isn't used yet in this patch, but is made use of in a later patch.
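To illustrate the two ideas in this message (collapsing several boolean parameters into one flags argument, and the rule for when a kill gets logged), here is a small sketch. The KillFlags enum, its member names, and kill_should_log() are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical flag set replacing a row of separate bool parameters. */
typedef enum KillFlags {
        KILL_FLAG_SIGCONT     = 1 << 0,  /* follow up with SIGCONT */
        KILL_FLAG_IGNORE_SELF = 1 << 1,  /* never signal the calling process */
        KILL_FLAG_LOG         = 1 << 2,  /* log even for ordinary signals */
} KillFlags;

/* Logging happens when the signal is SIGKILL or SIGABRT, or when the
 * caller asked for it explicitly through the flags. */
static bool kill_should_log(int sig, unsigned flags) {
        assert(sig > 0);

        return sig == SIGKILL || sig == SIGABRT || (flags & KILL_FLAG_LOG) != 0;
}
```

A call site then reads as kill_should_log(SIGTERM, KILL_FLAG_SIGCONT | KILL_FLAG_LOG) rather than a sequence of unlabeled true/false arguments.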
* Various fixes for typos found by lintian (#3705) (Michael Biebl, 2017-06-16)
|
* core: log the right set of the supported controllers (#3558) (Evgeny Vereshchagin, 2017-06-16)
| Jun 16 05:12:08 elogind[1]: Controller 'io' supported: yes
| Jun 16 05:12:08 elogind[1]: Controller 'memory' supported: yes
| Jun 16 05:12:08 elogind[1]: Controller 'pids' supported: yes
|
| instead of
|
| Jun 16 04:06:50 elogind[1]: Controller 'memory' supported: yes
| Jun 16 04:06:50 elogind[1]: Controller 'devices' supported: yes
| Jun 16 04:06:50 elogind[1]: Controller 'pids' supported: yes
* core: pass Unit into cgroup_context_apply() and use log_unit*() (Tejun Heo, 2017-06-16)
| cgroup_context_apply() and friends take CGroupContext and cgroup path as input, have no way of getting back to the associated Unit, and thus use the raw cgroup path for logging. This makes the log messages difficult to track down. There's no reason to avoid passing in Unit into these functions. Pass in Unit and use log_unit*() instead. While at it, make cgroup_context_apply(), which has no outside users, static. Also, drop cgroup path from log messages where the path itself isn't too interesting and can be easily obtained from the unit.
* core: add cgroup memory controller support on the unified hierarchy (#3315) (Tejun Heo, 2017-06-16)
| On the unified hierarchy, the memory controller implements three control knobs - low, high and max - which enable more usable and versatile control over memory usage. This patch implements support for the three control knobs.
|
| * MemoryLow, MemoryHigh and MemoryMax are added for memory.low, memory.high and memory.max, respectively.
| * As all absolute limits on the unified hierarchy use "max" for no limit, make memory limit parse functions accept "max" in addition to "infinity" and document "max" for the new knobs.
| * Implement compatibility translation between MemoryMax and MemoryLimit.
|
| v2: - Fixed missing else's in config_parse_memory_limit().
|     - Fixed missing newline when writing out drop-ins.
|     - Coding style updates to use "val > 0" instead of "val".
|     - Minor updates to documentation.
* Prep v230: Apply missing upstream fixes and updates (4/8) src/core. (Sven Eden, 2017-06-16)
|
* core: use an AF_UNIX/SOCK_DGRAM socket for cgroup agent notification (Lennart Poettering, 2017-06-16)
| dbus-daemon currently uses a backlog of 30 on its D-bus system bus socket. On overloaded systems this means that only 30 connections may be queued without dbus-daemon processing them before further connection attempts fail. Our cgroups-agent binary so far used D-Bus for its messaging, and hitting this limit hence may result in us losing cgroup empty messages.
|
| This patch adds a separate cgroup agent socket of type AF_UNIX/SOCK_DGRAM. Since sockets of these types need no connection set up, no listen() backlog applies. Our cgroup-agent binary will hence simply block as long as it can't enqueue its datagram message, so that we are less likely to lose cgroup empty messages.
|
| This also rearranges the ordering of the processing of SIGCHLD signals, service notification messages (sd_notify()...) and the two types of cgroup notifications (inotify for the unified hierarchy support, and agent for the classic hierarchy support). We now always process events for these in the following order:
|
| 1. service notification messages (SD_EVENT_PRIORITY_NORMAL-7)
| 2. SIGCHLD signals (SD_EVENT_PRIORITY_NORMAL-6)
| 3. cgroup inotify and cgroup agent (SD_EVENT_PRIORITY_NORMAL-5)
|
| This is because when receiving SIGCHLD we invalidate PID information, which we need to process the service notification messages which are bound to PIDs. Hence the order between the first two items. And we want to process SIGCHLD metadata to detect whether a service is gone, before using cgroup notifications, to decide when a service is gone, since the former carries more useful metadata.
|
| Related to this:
| https://bugs.freedesktop.org/show_bug.cgi?id=95264
| https://github.com/elogind/elogind/issues/1961
* core: make unit_has_mask_realized() consider controller enable state (Tejun Heo, 2017-06-16)
| | | | | | | | | | | | | | unit_has_mask_realized() determines whether the specified unit has its cgroups set up properly given the desired target_mask; however, on the unified hierarchy, controllers need to be enabled explicitly for children and the mask of enabled controllers can deviate from target_mask. Only considering target_mask in unit_has_mask_realized() can lead to false positives and skipping enabling the requested controllers. This patch adds unit->cgroup_enabled_mask to track which controllers are enabled and updates unit_has_mask_realized() to also consider enable_mask. Signed-off-by: Tejun Heo <htejun@fb.com>
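The check described above can be sketched as a comparison of both masks. The field cgroup_enabled_mask and the function name come from the commit message; the trimmed-down Unit struct, the CGroupMask typedef, and the other two field names are assumptions made so the example is self-contained.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned CGroupMask;    /* one bit per controller (assumed layout) */

/* Reduced stand-in for the real Unit structure. */
typedef struct Unit {
        bool cgroup_realized;             /* assumed field name */
        CGroupMask cgroup_realized_mask;  /* assumed field name */
        CGroupMask cgroup_enabled_mask;   /* named in the commit message */
} Unit;

/* A unit counts as realized only if both the controllers it is attached to
 * and the controllers enabled for its children match what is wanted. */
static bool unit_has_mask_realized(const Unit *u, CGroupMask target_mask, CGroupMask enable_mask) {
        assert(u);

        return u->cgroup_realized &&
                u->cgroup_realized_mask == target_mask &&
                u->cgroup_enabled_mask == enable_mask;
}
```

Comparing only target_mask would report such a unit as done even when a requested controller was never enabled for its children, which is the false positive the patch removes.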
* core: update populated event handling in unified hierarchy (Tejun Heo, 2017-06-16)
| Earlier during the development of the unified hierarchy, the populated event was reported through the dedicated "cgroup.populated" file; however, the interface was updated so that it's reported through the "populated" field of the "cgroup.events" file. Update populated event handling logic accordingly.
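Reading the "populated" field out of the contents of a "cgroup.events" file could look like the sketch below. It assumes the file consists of "key value" lines, as the kernel's cgroup v2 documentation describes; the function name and return convention are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parser working on the already-read text of "cgroup.events".
 * Returns 1 or 0 for the value of the "populated" field, -ENODATA if the
 * field is absent, and -EINVAL if its value is malformed. */
static int cgroup_events_populated(const char *contents) {
        const char *p = contents;

        assert(contents);

        while (*p != '\0') {
                size_t n = strcspn(p, "\n");    /* length of the current line */

                /* "populated " is 10 characters, followed by one digit. */
                if (strncmp(p, "populated ", 10) == 0) {
                        if (n == 11 && p[10] == '1')
                                return 1;
                        if (n == 11 && p[10] == '0')
                                return 0;

                        return -EINVAL;
                }

                p += n;
                if (*p == '\n')
                        p++;
        }

        return -ENODATA;
}
```

Looking the field up by name rather than by position keeps the parser working if further keys appear in the file.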
* Prep v229: Remove remaining emacs settings [2/6] src/core (Sven Eden, 2017-05-17)
|
* cgroup: remove support for NetClass= directive (Daniel Mack, 2017-05-17)
| Support for net_cls.class_id through the NetClass= configuration directive was added in v227 in preparation for a per-unit packet filter mechanism. However, it turns out the kernel people have decided to deprecate the net_cls and net_prio controllers in v2. Tejun provides a comprehensive justification for this in his commit, which landed during the merge window for kernel v4.5: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bd1060a1d671 As we're aiming for full support for the v2 cgroup hierarchy, we can no longer support this feature. Userspace tools such as nftables are moving over to setting rules that are specific to the full cgroup path of a task, which obsoletes these controllers anyway. This commit removes support for tweaking details in the net_cls controller, but keeps the NetClass= directive around for legacy compatibility reasons.
* Prep v228: Condense elogind source masks (5/5) (Sven Eden, 2017-04-26)
|
* Prep v228: Condense elogind source masks (3/5) (Sven Eden, 2017-04-26)
|
* Prep v228: Substitute declaration masks (3/4) (Sven Eden, 2017-04-26)
|
* Prep v228: Add remaining updates from upstream (3/3) (Sven Eden, 2017-04-26)
| | | | | Apply remaining fixes and the performed move of utility functions into their own foo-util.[hc] files on the rest of elogind.
* core: don't generate warnings when write access to the cgroup fs fails in --user due to EACCES (Lennart Poettering, 2017-04-26)
| After all, in the classic hierarchy that's pretty much the default case.
* [3/5] Apply missing fixes from upstream (Sven Eden, 2017-03-29)
|