| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.
The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.
The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.
The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.
Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.
This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().
PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.
A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.
Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.
|
| |
|
|
|
|
|
| |
Most important is a fix to negate the error number if necessary, before we
first access it.
|
|
|
|
|
|
|
|
|
|
| |
fileio makes use of O_TMPFILE when it is available.
We now always have O_TMPFILE, defined in missing.h if missing
from the toolchain headers.
Have fileio include missing.h and drop the guards around the
use of O_TMPFILE.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, a missing __O_TMPFILE was only defined for i386 and x86_64,
leaving any other architectures with an "old" toolchain fail miserably
at build time:
src/import/export-raw.c: In function 'reflink_snapshot':
src/import/export-raw.c:271:26: error: 'O_TMPFILE' undeclared (first use in this function)
new_fd = open(d, O_TMPFILE|O_CLOEXEC|O_NOCTTY|O_RDWR, 0600);
^
__O_TMPFILE (and O_TMPFILE) are available since glibc 2.19. However, a
lot of existing toolchains are still using glibc-2.18, and some even
before that, and it is not really possible to update those toolchains.
Instead of defining it only for i386 and x86_64, define __O_TMPFILE
with the specific values for those archs where it is different from the
generic value. Use the values as found in the Linux kernel (v4.8-rc3,
current as of time of commit).
|
|
|
|
| |
This way, we can make use of this in other code, too.
|
|
|
|
|
|
| |
Let's make sure people invoking STRV_FOREACH_BACKWARDS() as a single statement
of an if statement don't fall into a trap, and find the tail for the list via
strv_length().
|
|
|
|
|
|
|
| |
This adds a new call get_user_creds_clean(), which is just like
get_user_creds() but returns NULL in the home/shell parameters if they contain
no useful information. This code previously lived in execute.c, but by
generalizing this we can reuse it in run.c.
|
|
|
|
|
|
| |
bus_connect_transport() is exclusively used from our command line tools, hence
let's set exit-on-disconnect for all of them, making behaviour a bit nicer in
case dbus-daemon goes down.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Old libdbus has a feature that the process is terminated whenever the the bus
connection receives a disconnect. This is pretty useful on desktop apps (where
a disconnect indicates session termination), as well as on command line apps
(where we really shouldn't stay hanging in most cases if dbus daemon goes
down).
Add a similar feature to sd-bus, but make it opt-in rather than opt-out, like
it is on libdbus. Also, if the bus is attached to an event loop just exit the
event loop rather than the the whole process.
|
|
|
|
|
|
|
| |
objects immediately
If the server side kicks us from the bus, from our view no names are on the bus
anymore, hence let's make sure to dispatch all tracking objects immediately.
|
|
|
|
|
|
|
|
|
|
|
| |
In order to add a name to a bus tracking object we need to do some bus
operations: we need to check if the name already exists and add match for it.
Both are synchronous bus calls. While processing those we need to make sure
that the tracking object is not dispatched yet, as it might still be empty, but
is not going to be empty for very long.
hence, block dispatching by removing the object from the dispatch queue while
adding it, and readding it on error.
|
|
|
|
|
|
|
| |
When a bus connection is closed we dispatch all reply callbacks. Do so in a new
function if its own.
No behaviour changes.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After the call of the isatty() we check its result twice in the
open_terminal(). There are no sense to check result of isatty() that
it is less than zero and return -errno, because as described in
documentation:
isatty() returns 1 if fd is an open file descriptor referring to a
terminal; otherwise 0 is returned, and errno is set to indicate the
error.
So it can't be less than zero.
|
|
|
|
|
|
|
| |
This changes the semantics a bit: before, SYSTEMD_COLORS= would be treated as
"yes", same as SYSTEMD_COLORS=xxx and SYSTEMD_COLORS=1, and only
SYSTEMD_COLORS=0 would be treated as "no". Now, only valid booleans are treated
as "yes". This actually matches how $SYSTEMD_COLORS was announced in NEWS.
|
|
|
|
|
| |
Dynamic users should be treated like system users, and their logs
should end up in the main system journal.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The parsing functions for [User]TasksMax were inconsistent. Empty string and
"infinity" were interpreted as no limit for TasksMax but not accepted for
UserTasksMax. Update them so that they're consistent with other knobs.
* Empty string indicates the default value.
* "infinity" indicates no limit.
While at it, replace opencoded (uint64_t) -1 with CGROUP_LIMIT_MAX in TasksMax
handling.
v2: Update empty string to indicate the default value as suggested by Zbigniew
Jędrzejewski-Szmek.
v3: Fixed empty UserTasksMax handling.
|
|
|
|
|
|
|
|
|
|
|
| |
When started by the kernel, we are connected to the console, and we'll set TERM
properly to some value in fixup_environment(). We'll then enable or disable
colors based on the value of $SYSTEMD_COLORS and $TERM.
When reexecuting, TERM should be already set, so we can use this value.
Effectively, behaviour is the same as before affd7ed1a was reverted, but instead
of reopening the console before configuring color output, we just ignore what
stdout is connected to and decide based on the variables only.
|
|
|
|
|
|
|
|
|
|
|
| |
Rework 17eb9a9ddba3f03fcba33445c1c1eedeb948da04 a bit.
Let's make sure we don't clobber the input parameter args[1], following our
coding style to not clobber parameters unless explicitly indicated. (in
particular, as we don't want to have our changes appear in the command line
shown in "ps"...)
No functional change.
|
|
|
|
|
|
|
| |
It is useful for clients to be able to read the last CPU usage counter value of
a unit even if the unit is already terminated. Hence, before destroying a
cgroup's cgroup cache the last CPU usage counter and return it if the cgroup is
gone.
|
|
|
|
| |
Make sure we return proper errors for types not understood yet.
|
|
|
|
|
| |
Instead of ignoring empty strings retrieved via the bus, treat them as NULL, as
it's customary in elogind.
|
|
|
|
|
|
| |
Let's make sure we can read the exit code/status properties exposed by PID 1
properly. Let's reuse the existing code for unsigned fields, as we just use it
to copy words around, and don't calculate it.
|
|
|
|
|
|
|
|
|
|
|
| |
A following patch will update cgroup handling so that the elogind controller
(/sys/fs/cgroup/elogind) can use the unified hierarchy even if the kernel
resource controllers are on the legacy hierarchies. This would require
distinguishing whether all controllers are on cgroup v2 or only the elogind
controller is. In preparation, this patch renames cg_unified() to
cg_all_unified().
This patch doesn't cause any functional changes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, elogind uses either the legacy hierarchies or the unified hierarchy.
When the legacy hierarchies are used, elogind uses a named legacy hierarchy
mounted on /sys/fs/cgroup/elogind without any kernel controllers for process
management. Due to the shortcomings in the legacy hierarchy, this involves a
lot of workarounds and complexities.
Because the unified hierarchy can be mounted and used in parallel to legacy
hierarchies, there's no reason for elogind to use a legacy hierarchy for
management even if the kernel resource controllers need to be mounted on legacy
hierarchies. It can simply mount the unified hierarchy under
/sys/fs/cgroup/elogind and use it without affecting other legacy hierarchies.
This disables a significant amount of fragile workaround logics and would allow
using features which depend on the unified hierarchy membership such bpf cgroup
v2 membership test. In time, this would also allow deleting the said
complexities.
This patch updates elogind so that it prefers the unified hierarchy for the
elogind cgroup controller hierarchy when legacy hierarchies are used for kernel
resource controllers.
* cg_unified(@controller) is introduced which tests whether the specific
controller in on unified hierarchy and used to choose the unified hierarchy
code path for process and service management when available. Kernel
controller specific operations remain gated by cg_all_unified().
* "elogind.legacy_elogind_cgroup_controller" kernel argument can be used to
force the use of legacy hierarchy for elogind cgroup controller.
* nspawn: By default nspawn uses the same hierarchies as the host. If
UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all. If
0, legacy for all.
* nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of
three options - legacy, only elogind controller on unified, and unified. The
value is passed into mount setup functions and controls cgroup configuration.
* nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount
option is moved to mount_legacy_cgroup_hierarchy() so that it can take an
appropriate action depending on the configuration of the host.
v2: - CGroupUnified enum replaces open coded integer values to indicate the
cgroup operation mode.
- Various style updates.
v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.
v4: Restored legacy container on unified host support and fixed another bug in
detect_unified_cgroup_hierarchy().
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds two (privileged) bus calls Ref() and Unref() to the Unit interface.
The two calls may be used by clients to pin a unit into memory, so that various
runtime properties aren't flushed out by the automatic GC. This is necessary
to permit clients to race-freely acquire runtime results (such as process exit
status/code or accumulated CPU time) on successful service termination.
Ref() and Unref() are fully recursive, hence act like the usual reference
counting concept in C. Taking a reference is a privileged operation, as this
allows pinning units into memory which consumes resources.
Transient units may also gain a reference at the time of creation, via the new
AddRef property (that is only defined for transient units at the time of
creation).
|
|
|
|
|
|
|
|
|
|
| |
This adds an optional "recursive" counting mode to sd_bus_track. If enabled
adding the same name multiple times to an sd_bus_track object is counted
individually, so that it also has to be removed the same number of times before
it is gone again from the tracking object.
This functionality is useful for implementing local ref counted objects that
peers make take references on.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unfortunately, due to the disagreements in the kernel development community,
CPU controller cgroup v2 support has not been merged and enabling it requires
applying two small out-of-tree kernel patches. The situation is explained in
the following documentation.
https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu
While it isn't clear what will happen with CPU controller cgroup v2 support,
there are critical features which are possible only on cgroup v2 such as
buffered write control making cgroup v2 essential for a lot of workloads. This
commit implements elogind CPU controller support on the unified hierarchy so
that users who choose to deploy CPU controller cgroup v2 support can easily
take advantage of it.
On the unified hierarchy, "cpu.weight" knob replaces "cpu.shares" and "cpu.max"
replaces "cpu.cfs_period_us" and "cpu.cfs_quota_us". [Startup]CPUWeight config
options are added with the usual compat translation. CPU quota settings remain
unchanged and apply to both legacy and unified hierarchies.
v2: - Error in man page corrected.
- CPU config application in cgroup_context_apply() refactored.
- CPU accounting now works on unified hierarchy.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds "elogind-mount" which is for transient mount and automount units what
"elogind-run" is for transient service, scope and timer units.
The tool allows establishing mounts and automounts during runtime. It is very
similar to the usual /bin/mount commands, but can pull in additional
dependenices on access (for example, it pulls in fsck automatically), an take
benefit of the automount logic.
This tool is particularly useful for mount removable file systems (such as USB
sticks), as the automount logic (together with automatic unmount-on-idle), as
well as automatic fsck on first access ensure that the removable file system
has a high chance to remain in a fully clean state even when it is unplugged
abruptly, and returns to a clean state on the next re-plug.
This is a follow-up for #2471, as it adds a simple client-side for the
transient automount logic added in that PR.
In later work it might make sense to invoke this tool automatically from udev
rules in order to implement a simpler and safer version of removable media
management á la udisks.
|
|
|
|
|
|
|
|
| |
This adds parse_nice() that parses a nice level and ensures it is in the right
range, via a new nice_is_valid() helper. It then ports over a number of users
to this.
No functional changes.
|
|
|
|
|
| |
The intention is to clamp the value to READ_FULL_BYTES_MAX, which
would be the minimum of the two.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
read_full_stream() _always_ allocated twice the memory needed, due to
only breaking the realloc() && fread() loop when fread() returned 0,
requiring another iteration and exponentially enlarged buffer just to
discover the EOF condition.
This also caused file sizes >2MiB && <= 4MiB to erroneously be treated
as E2BIG, due to the inappropriately doubled buffer size exceeding
4*1024*1024.
Also made the 4*1024*1024 magic number a READ_FULL_BYTES_MAX constant.
|
|
|
|
| |
This permits CPUQuota to accept greater values as documented.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch improves parsing and generation of timestamps and calendar
specifications in two ways:
- The week day is now always printed in the abbreviated English form, instead
of the locale's setting. This makes sure we can always parse the week day
again, even if the locale is changed. Given that we don't follow locale
settings for printing timestamps in any other way either (for example, we
always use 24h syntax in order to make uniform parsing possible), it only
makes sense to also stick to a generic, non-localized form for the timestamp,
too.
- When parsing a timestamp, the local timezone (in its DST or non-DST name)
may be specified, in addition to "UTC". Other timezones are still not
supported however (not because we wouldn't want to, but mostly because libc
offers no nice API for that). In itself this brings no new features, however
it ensures that any locally formatted timestamp's timezone is also parsable
again.
These two changes ensure that the output of format_timestamp() may always be
passed to parse_timestamp() and results in the original input. The related
flavours for usec/UTC also work accordingly. Calendar specifications are
extended in a similar way.
The man page is updated accordingly, in particular this removes the claim that
timestamps elogind prints wouldn't be parsable by elogind. They are now.
The man page previously showed invalid timestamps as examples. This has been
removed, as the man page shouldn't be a unit test, where such negative examples
would be useful. The man page also no longer mentions the names of internal
functions, such as format_timestamp_us() or UNIX error codes such as EINVAL.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds the boolean RemoveIPC= setting to service, socket, mount and swap
units (i.e. all unit types that may invoke processes). if turned on, and the
unit's user/group is not root, all IPC objects of the user/group are removed
when the service is shut down. The life-cycle of the IPC objects is hence bound
to the unit life-cycle.
This is particularly relevant for units with dynamic users, as it is essential
that no objects owned by the dynamic users survive the service exiting. In
fact, this patch adds code to imply RemoveIPC= if DynamicUser= is set.
In order to communicate the UID/GID of an executed process back to PID 1 this
adds a new "user lookup" socket pair, that is inherited into the forked
processes, and closed before the exec(). This is needed since we cannot do NSS
from PID 1 due to deadlock risks, However need to know the used UID/GID in
order to clean up IPC owned by it if the unit shuts down.
|
|
|
|
|
| |
The CPUID and DMI vendor strings do not seem to be documented.
Values were found experimentally and by inspecting the source code.
|
| |
|
| |
|
|
|
|
|
|
|
| |
This way, invoking nspawn from a shell in the best case inherits the TERM
setting all the way down into the login shell spawned in the container.
Fixes: #3697
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
"#pragma GCC optimize" is merely a convenience to decorate multiple
functions with attribute optimize. And the manual has this to say about
this attribute:
This attribute should be used for debugging purposes only. It
is not suitable in production code.
Some versions of GCC also seem to have a problem with this pragma in
combination with LTO, resulting in ICEs.
So use a different approach (indirect the memset call via a volatile
function pointer) as implemented in openssl's crypto/mem_clr.c.
Closes: #3811
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Beef up the existing var_tmp() call, rename it to var_tmp_dir() and add a
matching tmp_dir() call (the former looks for the place for /var/tmp, the
latter for /tmp).
Both calls check $TMPDIR, $TEMP, $TMP, following the algorithm Python3 uses.
All dirs are validated before use. secure_getenv() is used in order to limite
exposure in suid binaries.
This also ports a couple of users over to these new APIs.
The var_tmp() return parameter is changed from an allocated buffer the caller
will own to a const string either pointing into environ[], or into a static
const buffer. Given that environ[] is mostly considered constant (and this is
exposed in the very well-known getenv() call), this should be OK behaviour and
allows us to avoid memory allocations in most cases.
Note that $TMPDIR and friends override both /var/tmp and /tmp usage if set.
|
|
|
|
|
| |
In this patch "enabled" and "disabled" is used exclusively, but "enable" and
"disable" forms are need for the following patch.
|
|
|
|
|
|
| |
We already have tolower() calls there, hence let's unify this at one place.
Also, update the code to only use ASCII operations, so that we don't end up
being locale dependant.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
service is running
This adds a new boolean setting DynamicUser= to service files. If set, a new
user will be allocated dynamically when the unit is started, and released when
it is stopped. The user ID is allocated from the range 61184..65519. The user
will not be added to /etc/passwd (but an NSS module to be added later should
make it show up in getent passwd).
For now, care should be taken that the service writes no files to disk, since
this might result in files owned by UIDs that might get assigned dynamically to
a different service later on. Later patches will tighten sandboxing in order to
ensure that this cannot happen, except for a few selected directories.
A simple way to test this is:
elogind-run -p DynamicUser=1 /bin/sleep 99999
|
|
|
|
|
|
|
| |
This way we can reuse them for validating User=/Group= settings in unit files
(to be added in a later commit).
Also, add some tests for them.
|
|
|
|
|
| |
Similar to MemoryMax=, MemorySwapMax= limits swap usage. This controls
controls "memory.swap.max" attribute in unified cgroup.
|
|
|
|
|
| |
- define CLONE_NEWCGROUP
- add fun to detect whether cgroup namespaces are supported
|
|
|
|
|
|
|
| |
This reverts commit aac7c5ed8bc6ffaba417b9c0b87bcf342865431b. This
visibility bug originated in ld.gold and has been fixed upstream:
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=5417c94d1a944d1a27f99240e5d62a6d7cd324f1
|