diff options
Diffstat (limited to 'doc/sed-in.texi')
-rw-r--r-- | doc/sed-in.texi | 4187 |
1 files changed, 0 insertions, 4187 deletions
diff --git a/doc/sed-in.texi b/doc/sed-in.texi deleted file mode 100644 index bf5158c..0000000 --- a/doc/sed-in.texi +++ /dev/null @@ -1,4187 +0,0 @@ -\input texinfo @c -*-texinfo-*- -@c -@c -- Stuff that needs adding: ---------------------------------------------- -@c (nothing!) -@c -------------------------------------------------------------------------- -@c Check for consistency: regexps in @code, text that they match in @samp. -@c -@c Tips: -@c @command for command -@c @samp for command fragments: @samp{cat -s} -@c @code for sed commands and flags -@c Use ``quote'' not `quote' or "quote". -@c -@c %**start of header -@setfilename sed.info -@settitle sed, a stream editor -@c %**end of header - -@c @smallbook - -@include version.texi - -@c Combine indices. -@syncodeindex ky cp -@syncodeindex pg cp -@syncodeindex tp cp - -@defcodeindex op -@syncodeindex op fn - -@include config.texi - -@copying -This file documents version @value{VERSION} of -@value{SSED}, a stream editor. - -Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free -Software Foundation, Inc. - -This document is released under the terms of the @acronym{GNU} Free -Documentation License as published by the Free Software Foundation; -either version 1.1, or (at your option) any later version. - -You should have received a copy of the @acronym{GNU} Free Documentation -License along with @value{SSED}; see the file @file{COPYING.DOC}. -If not, write to the Free Software Foundation, 59 Temple Place - Suite -330, Boston, MA 02110-1301, USA. - -There are no Cover Texts and no Invariant Sections; this text, along -with its equivalent in the printed manual, constitutes the Title Page. -@end copying - -@setchapternewpage off - -@titlepage -@title @command{sed}, a stream editor -@subtitle version @value{VERSION}, @value{UPDATED} -@author by Ken Pizzini, Paolo Bonzini - -@page -@vskip 0pt plus 1filll -Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc. - -@insertcopying - -Published by the Free Software Foundation, @* -51 Franklin Street, Fifth Floor @* -Boston, MA 02110-1301, USA -@end titlepage - - -@node Top -@top - -@ifnottex -@insertcopying -@end ifnottex - -@menu -* Introduction:: Introduction -* Invoking sed:: Invocation -* sed Programs:: @command{sed} programs -* Examples:: Some sample scripts -* Limitations:: Limitations and (non-)limitations of @value{SSED} -* Other Resources:: Other resources for learning about @command{sed} -* Reporting Bugs:: Reporting bugs - -* Extended regexps:: @command{egrep}-style regular expressions -@ifset PERL -* Perl regexps:: Perl-style regular expressions -@end ifset - -* Concept Index:: A menu with all the topics in this manual. -* Command and Option Index:: A menu with all @command{sed} commands and - command-line options. - -@detailmenu ---- The detailed node listing --- - -sed Programs: -* Execution Cycle:: How @command{sed} works -* Addresses:: Selecting lines with @command{sed} -* Regular Expressions:: Overview of regular expression syntax -* Common Commands:: Often used commands -* The "s" Command:: @command{sed}'s Swiss Army Knife -* Other Commands:: Less frequently used commands -* Programming Commands:: Commands for @command{sed} gurus -* Extended Commands:: Commands specific of @value{SSED} -* Escapes:: Specifying special characters - -Examples: -* Centering lines:: -* Increment a number:: -* Rename files to lower case:: -* Print bash environment:: -* Reverse chars of lines:: -* tac:: Reverse lines of files -* cat -n:: Numbering lines -* cat -b:: Numbering non-blank lines -* wc -c:: Counting chars -* wc -w:: Counting words -* wc -l:: Counting lines -* head:: Printing the first lines -* tail:: Printing the last lines -* uniq:: Make duplicate lines unique -* uniq -d:: Print duplicated lines of input -* uniq -u:: Remove all duplicated lines -* cat -s:: Squeezing blank lines - -@ifset PERL -Perl regexps:: Perl-style regular expressions -* Backslash:: Introduces special sequences -* Circumflex/dollar sign/period:: Behave specially with regard to new lines -* Square brackets:: Are a bit different in strange cases -* Options setting:: Toggle modifiers in the middle of a regexp -* Non-capturing subpatterns:: Are not counted when backreferencing -* Repetition:: Allows for non-greedy matching -* Backreferences:: Allows for more than 10 back references -* Assertions:: Allows for complex look ahead matches -* Non-backtracking subpatterns:: Often gives more performance -* Conditional subpatterns:: Allows if/then/else branches -* Recursive patterns:: For example to match parentheses -* Comments:: Because things can get complex... -@end ifset - -@end detailmenu -@end menu - - -@node Introduction -@chapter Introduction - -@cindex Stream editor -@command{sed} is a stream editor. -A stream editor is used to perform basic text -transformations on an input stream -(a file or input from a pipeline). -While in some ways similar to an editor which -permits scripted edits (such as @command{ed}), -@command{sed} works by making only one pass over the -input(s), and is consequently more efficient. -But it is @command{sed}'s ability to filter text in a pipeline -which particularly distinguishes it from other types of -editors. - - -@node Invoking sed -@chapter Invocation - -Normally @command{sed} is invoked like this: - -@example -sed SCRIPT INPUTFILE... -@end example - -The full format for invoking @command{sed} is: - -@example -sed OPTIONS... [SCRIPT] [INPUTFILE...] -@end example - -If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-}, -@command{sed} filters the contents of the standard input. The @var{script} -is actually the first non-option parameter, which @command{sed} specially -considers a script and not an input file if (and only if) none of the -other @var{options} specifies a script to be executed, that is if neither -of the @option{-e} and @option{-f} options is specified. - -@command{sed} may be invoked with the following command-line options: - -@table @code -@item --version -@opindex --version -@cindex Version, printing -Print out the version of @command{sed} that is being run and a copyright notice, -then exit. - -@item --help -@opindex --help -@cindex Usage summary, printing -Print a usage message briefly summarizing these command-line options -and the bug-reporting address, -then exit. - -@item -n -@itemx --quiet -@itemx --silent -@opindex -n -@opindex --quiet -@opindex --silent -@cindex Disabling autoprint, from command line -By default, @command{sed} prints out the pattern space -at the end of each cycle through the script (@pxref{Execution Cycle, , -How @code{sed} works}). -These options disable this automatic printing, -and @command{sed} only produces output when explicitly told to -via the @code{p} command. - -@item -e @var{script} -@itemx --expression=@var{script} -@opindex -e -@opindex --expression -@cindex Script, from command line -Add the commands in @var{script} to the set of commands to be -run while processing the input. - -@item -f @var{script-file} -@itemx --file=@var{script-file} -@opindex -f -@opindex --file -@cindex Script, from a file -Add the commands contained in the file @var{script-file} -to the set of commands to be run while processing the input. - -@item -i[@var{SUFFIX}] -@itemx --in-place[=@var{SUFFIX}] -@opindex -i -@opindex --in-place -@cindex In-place editing, activating -@cindex @value{SSEDEXT}, in-place editing -This option specifies that files are to be edited in-place. -@value{SSED} does this by creating a temporary file and -sending output to this file rather than to the standard -output.@footnote{This applies to commands such as @code{=}, -@code{a}, @code{c}, @code{i}, @code{l}, @code{p}. You can -still write to the standard output by using the @code{w} -@cindex @value{SSEDEXT}, @file{/dev/stdout} file -or @code{W} commands together with the @file{/dev/stdout} -special file}. - -This option implies @option{-s}. - -When the end of the file is reached, the temporary file is -renamed to the output file's original name. The extension, -if supplied, is used to modify the name of the old file -before renaming the temporary file, thereby making a backup -copy@footnote{Note that @value{SSED} creates the backup -file whether or not any output is actually changed.}). - -@cindex In-place editing, Perl-style backup file names -This rule is followed: if the extension doesn't contain a @code{*}, -then it is appended to the end of the current filename as a -suffix; if the extension does contain one or more @code{*} -characters, then @emph{each} asterisk is replaced with the -current filename. This allows you to add a prefix to the -backup file, instead of (or in addition to) a suffix, or -even to place backup copies of the original files into another -directory (provided the directory already exists). - -If no extension is supplied, the original file is -overwritten without making a backup. - -@item -l @var{N} -@itemx --line-length=@var{N} -@opindex -l -@opindex --line-length -@cindex Line length, setting -Specify the default line-wrap length for the @code{l} command. -A length of 0 (zero) means to never wrap long lines. If -not specified, it is taken to be 70. - -@item --posix -@opindex --posix -@cindex @value{SSEDEXT}, disabling -@value{SSED} includes several extensions to @acronym{POSIX} -sed. In order to simplify writing portable scripts, this -option disables all the extensions that this manual documents, -including additional commands. -@cindex @code{POSIXLY_CORRECT} behavior, enabling -Most of the extensions accept @command{sed} programs that -are outside the syntax mandated by @acronym{POSIX}, but some -of them (such as the behavior of the @command{N} command -described in @pxref{Reporting Bugs}) actually violate the -standard. If you want to disable only the latter kind of -extension, you can set the @code{POSIXLY_CORRECT} variable -to a non-empty value. - -@item -b -@itemx --binary -@opindex -b -@opindex --binary -This option is available on every platform, but is only effective where the -operating system makes a distinction between text files and binary files. -When such a distinction is made---as is the case for MS-DOS, Windows, -Cygwin---text files are composed of lines separated by a carriage return -@emph{and} a line feed character, and @command{sed} does not see the -ending CR. When this option is specified, @command{sed} will open -input files in binary mode, thus not requesting this special processing -and considering lines to end at a line feed. - -@item --follow-symlinks -@opindex --follow-symlinks -This option is available only on platforms that support -symbolic links and has an effect only if option @option{-i} -is specified. In this case, if the file that is specified -on the command line is a symbolic link, @command{sed} will -follow the link and edit the ultimate destination of the -link. The default behavior is to break the symbolic link, -so that the link destination will not be modified. - -@item -r -@itemx --regexp-extended -@opindex -r -@opindex --regexp-extended -@cindex Extended regular expressions, choosing -@cindex @acronym{GNU} extensions, extended regular expressions -Use extended regular expressions rather than basic -regular expressions. Extended regexps are those that -@command{egrep} accepts; they can be clearer because they -usually have less backslashes, but are a @acronym{GNU} extension -and hence scripts that use them are not portable. -@xref{Extended regexps, , Extended regular expressions}. - -@ifset PERL -@item -R -@itemx --regexp-perl -@opindex -R -@opindex --regexp-perl -@cindex Perl-style regular expressions, choosing -@cindex @value{SSEDEXT}, Perl-style regular expressions -Use Perl-style regular expressions rather than basic -regular expressions. Perl-style regexps are extremely -powerful but are a @value{SSED} extension and hence scripts that -use it are not portable. @xref{Perl regexps, , -Perl-style regular expressions}. -@end ifset - -@item -s -@itemx --separate -@opindex -s -@opindex --separate -@cindex Working on separate files -By default, @command{sed} will consider the files specified on the -command line as a single continuous long stream. This @value{SSED} -extension allows the user to consider them as separate files: -range addresses (such as @samp{/abc/,/def/}) are not allowed -to span several files, line numbers are relative to the start -of each file, @code{$} refers to the last line of each file, -and files invoked from the @code{R} commands are rewound at the -start of each file. - -@item -u -@itemx --unbuffered -@opindex -u -@opindex --unbuffered -@cindex Unbuffered I/O, choosing -Buffer both input and output as minimally as practical. -(This is particularly useful if the input is coming from -the likes of @samp{tail -f}, and you wish to see the transformed -output as soon as possible.) - -@item -z -@itemx --null-data -@itemx --zero-terminated -@opindex -z -@opindex --null-data -@opindex --zero-terminated -Treat the input as a set of lines, each terminated by a zero byte -(the ASCII @samp{NUL} character) instead of a newline. This option can -be used with commands like @samp{sort -z} and @samp{find -print0} -to process arbitrary file names. -@end table - -If no @option{-e}, @option{-f}, @option{--expression}, or @option{--file} -options are given on the command-line, -then the first non-option argument on the command line is -taken to be the @var{script} to be executed. - -@cindex Files to be processed as input -If any command-line parameters remain after processing the above, -these parameters are interpreted as the names of input files to -be processed. -@cindex Standard input, processing as input -A file name of @samp{-} refers to the standard input stream. -The standard input will be processed if no file names are specified. - - -@node sed Programs -@chapter @command{sed} Programs - -@cindex @command{sed} program structure -@cindex Script structure -A @command{sed} program consists of one or more @command{sed} commands, -passed in by one or more of the -@option{-e}, @option{-f}, @option{--expression}, and @option{--file} -options, or the first non-option argument if zero of these -options are used. -This document will refer to ``the'' @command{sed} script; -this is understood to mean the in-order catenation -of all of the @var{script}s and @var{script-file}s passed in. - -Commands within a @var{script} or @var{script-file} can be -separated by semicolons (@code{;}) or newlines (ASCII 10). -Some commands, due to their syntax, cannot be followed by semicolons -working as command separators and thus should be terminated -with newlines or be placed at the end of a @var{script} or @var{script-file}. -Commands can also be preceded with optional non-significant -whitespace characters. - -Each @code{sed} command consists of an optional address or -address range, followed by a one-character command name -and any additional command-specific code. - -@menu -* Execution Cycle:: How @command{sed} works -* Addresses:: Selecting lines with @command{sed} -* Regular Expressions:: Overview of regular expression syntax -* Common Commands:: Often used commands -* The "s" Command:: @command{sed}'s Swiss Army Knife -* Other Commands:: Less frequently used commands -* Programming Commands:: Commands for @command{sed} gurus -* Extended Commands:: Commands specific of @value{SSED} -* Escapes:: Specifying special characters -@end menu - - -@node Execution Cycle -@section How @command{sed} Works - -@cindex Buffer spaces, pattern and hold -@cindex Spaces, pattern and hold -@cindex Pattern space, definition -@cindex Hold space, definition -@command{sed} maintains two data buffers: the active @emph{pattern} space, -and the auxiliary @emph{hold} space. Both are initially empty. - -@command{sed} operates by performing the following cycle on each -line of input: first, @command{sed} reads one line from the input -stream, removes any trailing newline, and places it in the pattern space. -Then commands are executed; each command can have an address associated -to it: addresses are a kind of condition code, and a command is only -executed if the condition is verified before the command is to be -executed. - -When the end of the script is reached, unless the @option{-n} option -is in use, the contents of pattern space are printed out to the output -stream, adding back the trailing newline if it was removed.@footnote{Actually, -if @command{sed} prints a line without the terminating newline, it will -nevertheless print the missing newline as soon as more text is sent to -the same output stream, which gives the ``least expected surprise'' -even though it does not make commands like @samp{sed -n p} exactly -identical to @command{cat}.} Then the next cycle starts for the next -input line. - -Unless special commands (like @samp{D}) are used, the pattern space is -deleted between two cycles. The hold space, on the other hand, keeps -its data between cycles (see commands @samp{h}, @samp{H}, @samp{x}, -@samp{g}, @samp{G} to move data between both buffers). - - -@node Addresses -@section Selecting lines with @command{sed} -@cindex Addresses, in @command{sed} scripts -@cindex Line selection -@cindex Selecting lines to process - -Addresses in a @command{sed} script can be in any of the following forms: -@table @code -@item @var{number} -@cindex Address, numeric -@cindex Line, selecting by number -Specifying a line number will match only that line in the input. -(Note that @command{sed} counts lines continuously across all input files -unless @option{-i} or @option{-s} options are specified.) - -@item @var{first}~@var{step} -@cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses -This @acronym{GNU} extension matches every @var{step}th line -starting with line @var{first}. -In particular, lines will be selected when there exists -a non-negative @var{n} such that the current line-number equals -@var{first} + (@var{n} * @var{step}). -Thus, to select the odd-numbered lines, -one would use @code{1~2}; -to pick every third line starting with the second, @samp{2~3} would be used; -to pick every fifth line starting with the tenth, use @samp{10~5}; -and @samp{50~0} is just an obscure way of saying @code{50}. - -@item $ -@cindex Address, last line -@cindex Last line, selecting -@cindex Line, selecting last -This address matches the last line of the last file of input, or -the last line of each file when the @option{-i} or @option{-s} options -are specified. - -@item /@var{regexp}/ -@cindex Address, as a regular expression -@cindex Line, selecting by regular expression match -This will select any line which matches the regular expression @var{regexp}. -If @var{regexp} itself includes any @code{/} characters, -each must be escaped by a backslash (@code{\}). - -@cindex empty regular expression -@cindex @value{SSEDEXT}, modifiers and the empty regular expression -The empty regular expression @samp{//} repeats the last regular -expression match (the same holds if the empty regular expression is -passed to the @code{s} command). Note that modifiers to regular expressions -are evaluated when the regular expression is compiled, thus it is invalid to -specify them together with the empty regular expression. - -@item \%@var{regexp}% -(The @code{%} may be replaced by any other single character.) - -@cindex Slash character, in regular expressions -This also matches the regular expression @var{regexp}, -but allows one to use a different delimiter than @code{/}. -This is particularly useful if the @var{regexp} itself contains -a lot of slashes, since it avoids the tedious escaping of every @code{/}. -If @var{regexp} itself includes any delimiter characters, -each must be escaped by a backslash (@code{\}). - -@item /@var{regexp}/I -@itemx \%@var{regexp}%I -@cindex @acronym{GNU} extensions, @code{I} modifier -@ifset PERL -@cindex Perl-style regular expressions, case-insensitive -@end ifset -The @code{I} modifier to regular-expression matching is a @acronym{GNU} -extension which causes the @var{regexp} to be matched in -a case-insensitive manner. - -@item /@var{regexp}/M -@itemx \%@var{regexp}%M -@cindex @value{SSEDEXT}, @code{M} modifier -@ifset PERL -@cindex Perl-style regular expressions, multiline -@end ifset -The @code{M} modifier to regular-expression matching is a @value{SSED} -extension which directs @value{SSED} to match the regular expression -in @cite{multi-line} mode. The modifier causes @code{^} and @code{$} to -match respectively (in addition to the normal behavior) the empty string -after a newline, and the empty string before a newline. There are -special character sequences -@ifset PERL -(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'} -in basic or extended regular expression modes) -@end ifset -@ifclear PERL -(@code{\`} and @code{\'}) -@end ifclear -which always match the beginning or the end of the buffer. -In addition, -@ifset PERL -just like in Perl mode without the @code{S} modifier, -@end ifset -the period character does not match a new-line character in -multi-line mode. - -@ifset PERL -@item /@var{regexp}/S -@itemx \%@var{regexp}%S -@cindex @value{SSEDEXT}, @code{S} modifier -@cindex Perl-style regular expressions, single line -The @code{S} modifier to regular-expression matching is only valid -in Perl mode and specifies that the dot character (@code{.}) will -match the newline character too. @code{S} stands for @cite{single-line}. -@end ifset - -@ifset PERL -@item /@var{regexp}/X -@itemx \%@var{regexp}%X -@cindex @value{SSEDEXT}, @code{X} modifier -@cindex Perl-style regular expressions, extended -The @code{X} modifier to regular-expression matching is also -valid in Perl mode only. If it is used, whitespace in the -pattern (other than in a character class) and -characters between a @kbd{#} outside a character class and the -next newline character are ignored. An escaping backslash -can be used to include a whitespace or @kbd{#} character as part -of the pattern. -@end ifset -@end table - -If no addresses are given, then all lines are matched; -if one address is given, then only lines matching that -address are matched. - -@cindex Range of lines -@cindex Several lines, selecting -An address range can be specified by specifying two addresses -separated by a comma (@code{,}). An address range matches lines -starting from where the first address matches, and continues -until the second address matches (inclusively). - -If the second address is a @var{regexp}, then checking for the -ending match will start with the line @emph{following} the -line which matched the first address: a range will always -span at least two lines (except of course if the input stream -ends). - -If the second address is a @var{number} less than (or equal to) -the line matching the first address, then only the one line is -matched. - -@cindex Special addressing forms -@cindex Range with start address of zero -@cindex Zero, as range start address -@cindex @var{addr1},+N -@cindex @var{addr1},~N -@cindex @acronym{GNU} extensions, special two-address forms -@cindex @acronym{GNU} extensions, @code{0} address -@cindex @acronym{GNU} extensions, 0,@var{addr2} addressing -@cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing -@cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing -@value{SSED} also supports some special two-address forms; all these -are @acronym{GNU} extensions: -@table @code -@item 0,/@var{regexp}/ -A line number of @code{0} can be used in an address specification like -@code{0,/@var{regexp}/} so that @command{sed} will try to match -@var{regexp} in the first input line too. In other words, -@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/}, -except that if @var{addr2} matches the very first line of input the -@code{0,/@var{regexp}/} form will consider it to end the range, whereas -the @code{1,/@var{regexp}/} form will match the beginning of its range and -hence make the range span up to the @emph{second} occurrence of the -regular expression. - -Note that this is the only place where the @code{0} address makes -sense; there is no 0-th line and commands which are given the @code{0} -address in any other way will give an error. - -@item @var{addr1},+@var{N} -Matches @var{addr1} and the @var{N} lines following @var{addr1}. - -@item @var{addr1},~@var{N} -Matches @var{addr1} and the lines following @var{addr1} -until the next line whose input line number is a multiple of @var{N}. -@end table - -@cindex Excluding lines -@cindex Selecting non-matching lines -Appending the @code{!} character to the end of an address -specification negates the sense of the match. -That is, if the @code{!} character follows an address range, -then only lines which do @emph{not} match the address range -will be selected. -This also works for singleton addresses, -and, perhaps perversely, for the null address. - - -@node Regular Expressions -@section Overview of Regular Expression Syntax - -To know how to use @command{sed}, people should understand regular -expressions (@dfn{regexp} for short). A regular expression -is a pattern that is matched against a -subject string from left to right. Most characters are -@dfn{ordinary}: they stand for -themselves in a pattern, and match the corresponding characters -in the subject. As a trivial example, the pattern - -@example -The quick brown fox -@end example - -@noindent -matches a portion of a subject string that is identical to -itself. The power of regular expressions comes from the -ability to include alternatives and repetitions in the pattern. -These are encoded in the pattern by the use of @dfn{special characters}, -which do not stand for themselves but instead -are interpreted in some special way. Here is a brief description -of regular expression syntax as used in @command{sed}. - -@table @code -@item @var{char} -A single ordinary character matches itself. - -@item * -@cindex @acronym{GNU} extensions, to basic regular expressions -Matches a sequence of zero or more instances of matches for the -preceding regular expression, which must be an ordinary character, a -special character preceded by @code{\}, a @code{.}, a grouped regexp -(see below), or a bracket expression. As a @acronym{GNU} extension, a -postfixed regular expression can also be followed by @code{*}; for -example, @code{a**} is equivalent to @code{a*}. @acronym{POSIX} -1003.1-2001 says that @code{*} stands for itself when it appears at -the start of a regular expression or subexpression, but many -non@acronym{GNU} implementations do not support this and portable -scripts should instead use @code{\*} in these contexts. - -@item \+ -@cindex @acronym{GNU} extensions, to basic regular expressions -As @code{*}, but matches one or more. It is a @acronym{GNU} extension. - -@item \? -@cindex @acronym{GNU} extensions, to basic regular expressions -As @code{*}, but only matches zero or one. It is a @acronym{GNU} extension. - -@item \@{@var{i}\@} -As @code{*}, but matches exactly @var{i} sequences (@var{i} is a -decimal integer; for portability, keep it between 0 and 255 -inclusive). - -@item \@{@var{i},@var{j}\@} -Matches between @var{i} and @var{j}, inclusive, sequences. - -@item \@{@var{i},\@} -Matches more than or equal to @var{i} sequences. - -@item \(@var{regexp}\) -Groups the inner @var{regexp} as a whole, this is used to: - -@itemize @bullet -@item -@cindex @acronym{GNU} extensions, to basic regular expressions -Apply postfix operators, like @code{\(abcd\)*}: -this will search for zero or more whole sequences -of @samp{abcd}, while @code{abcd*} would search -for @samp{abc} followed by zero or more occurrences -of @samp{d}. Note that support for @code{\(abcd\)*} is -required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU} -implementations do not support it and hence it is not universally -portable. - -@item -Use back references (see below). -@end itemize - -@item . -Matches any character, including newline. - -@item ^ -Matches the null string at beginning of the pattern space, i.e. what -appears after the circumflex must appear at the beginning of the -pattern space. - -In most scripts, pattern space is initialized to the content of each -line (@pxref{Execution Cycle, , How @code{sed} works}). So, it is a -useful simplification to think of @code{^#include} as matching only -lines where @samp{#include} is the first thing on line---if there are -spaces before, for example, the match fails. This simplification is -valid as long as the original content of pattern space is not modified, -for example with an @code{s} command. - -@code{^} acts as a special character only at the beginning of the -regular expression or subexpression (that is, after @code{\(} or -@code{\|}). Portable scripts should avoid @code{^} at the beginning of -a subexpression, though, as @acronym{POSIX} allows implementations that -treat @code{^} as an ordinary character in that context. - -@item $ -It is the same as @code{^}, but refers to end of pattern space. -@code{$} also acts as a special character only at the end -of the regular expression or subexpression (that is, before @code{\)} -or @code{\|}), and its use at the end of a subexpression is not -portable. - - -@item [@var{list}] -@itemx [^@var{list}] -Matches any single character in @var{list}: for example, -@code{[aeiou]} matches all vowels. A list may include -sequences like @code{@var{char1}-@var{char2}}, which -matches any character between (inclusive) @var{char1} -and @var{char2}. - -A leading @code{^} reverses the meaning of @var{list}, so that -it matches any single character @emph{not} in @var{list}. To include -@code{]} in the list, make it the first character (after -the @code{^} if needed), to include @code{-} in the list, -make it the first or last; to include @code{^} put -it after the first character. - -@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions -The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\} -are normally not special within @var{list}. For example, @code{[\*]} -matches either @samp{\} or @samp{*}, because the @code{\} is not -special here. However, strings like @code{[.ch.]}, @code{[=a=]}, and -@code{[:space:]} are special within @var{list} and represent collating -symbols, equivalence classes, and character classes, respectively, and -@code{[} is therefore special within @var{list} when it is followed by -@code{.}, @code{=}, or @code{:}. Also, when not in -@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and -@code{\t} are recognized within @var{list}. @xref{Escapes}. - -@item @var{regexp1}\|@var{regexp2} -@cindex @acronym{GNU} extensions, to basic regular expressions -Matches either @var{regexp1} or @var{regexp2}. Use -parentheses to use complex alternative regular expressions. -The matching process tries each alternative in turn, from -left to right, and the first one that succeeds is used. -It is a @acronym{GNU} extension. - -@item @var{regexp1}@var{regexp2} -Matches the concatenation of @var{regexp1} and @var{regexp2}. -Concatenation binds more tightly than @code{\|}, @code{^}, and -@code{$}, but less tightly than the other regular expression -operators. - -@item \@var{digit} -Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized -subexpression in the regular expression. This is called a @dfn{back -reference}. Subexpressions are implicity numbered by counting -occurrences of @code{\(} left-to-right. - -@item \n -Matches the newline character. - -@item \@var{char} -Matches @var{char}, where @var{char} is one of @code{$}, -@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}. -Note that the only C-like -backslash sequences that you can portably assume to be -interpreted are @code{\n} and @code{\\}; in particular -@code{\t} is not portable, and matches a @samp{t} under most -implementations of @command{sed}, rather than a tab character. - -@end table - -@cindex Greedy regular expression matching -Note that the regular expression matcher is greedy, i.e., matches -are attempted from left to right and, if two or more matches are -possible starting at the same character, it selects the longest. - -@noindent -Examples: -@table @samp -@item abcdef -Matches @samp{abcdef}. - -@item a*b -Matches zero or more @samp{a}s followed by a single -@samp{b}. For example, @samp{b} or @samp{aaaaab}. - -@item a\?b -Matches @samp{b} or @samp{ab}. - -@item a\+b\+ -Matches one or more @samp{a}s followed by one or more -@samp{b}s: @samp{ab} is the shortest possible match, but -other examples are @samp{aaaab} or @samp{abbbbb} or -@samp{aaaaaabbbbbbb}. - -@item .* -@itemx .\+ -These two both match all the characters in a string; -however, the first matches every string (including the empty -string), while the second matches only strings containing -at least one character. - -@item ^main.*(.*) -This matches a string starting with @samp{main}, -followed by an opening and closing -parenthesis. The @samp{n}, @samp{(} and @samp{)} need not -be adjacent. - -@item ^# -This matches a string beginning with @samp{#}. - -@item \\$ -This matches a string ending with a single backslash. The -regexp contains two backslashes for escaping. - -@item \$ -Instead, this matches a string consisting of a single dollar sign, -because it is escaped. - -@item [a-zA-Z0-9] -In the C locale, this matches any @acronym{ASCII} letters or digits. - -@item [^ @kbd{tab}]\+ -(Here @kbd{tab} stands for a single tab character.) -This matches a string of one or more -characters, none of which is a space or a tab. -Usually this means a word. - -@item ^\(.*\)\n\1$ -This matches a string consisting of two equal substrings separated by -a newline. - -@item .\@{9\@}A$ -This matches nine characters followed by an @samp{A}. - -@item ^.\@{15\@}A -This matches the start of a string that contains 16 characters, -the last of which is an @samp{A}. - -@end table - - - -@node Common Commands -@section Often-Used Commands - -If you use @command{sed} at all, you will quite likely want to know -these commands. - -@table @code -@item # -[No addresses allowed.] - -@findex # (comments) -@cindex Comments, in scripts -The @code{#} character begins a comment; -the comment continues until the next newline. - -@cindex Portability, comments -If you are concerned about portability, be aware that -some implementations of @command{sed} (which are not @sc{posix} -conformant) may only support a single one-line comment, -and then only when the very first character of the script is a @code{#}. - -@findex -n, forcing from within a script -@cindex Caveat --- #n on first line -Warning: if the first two characters of the @command{sed} script -are @code{#n}, then the @option{-n} (no-autoprint) option is forced. -If you want to put a comment in the first line of your script -and that comment begins with the letter @samp{n} -and you do not want this behavior, -then be sure to either use a capital @samp{N}, -or place at least one space before the @samp{n}. - -@item q [@var{exit-code}] -This command only accepts a single address. - -@findex q (quit) command -@cindex @value{SSEDEXT}, returning an exit code -@cindex Quitting -Exit @command{sed} without processing any more commands or input. -Note that the current pattern space is printed if auto-print is -not disabled with the @option{-n} options. The ability to return -an exit code from the @command{sed} script is a @value{SSED} extension. - -@item d -@findex d (delete) command -@cindex Text, deleting -Delete the pattern space; -immediately start next cycle. - -@item p -@findex p (print) command -@cindex Text, printing -Print out the pattern space (to the standard output). -This command is usually only used in conjunction with the @option{-n} -command-line option. - -@item n -@findex n (next-line) command -@cindex Next input line, replace pattern space with -@cindex Read next input line -If auto-print is not disabled, print the pattern space, -then, regardless, replace the pattern space with the next line of input. -If there is no more input then @command{sed} exits without processing -any more commands. - -@item @{ @var{commands} @} -@findex @{@} command grouping -@cindex Grouping commands -@cindex Command groups -A group of commands may be enclosed between -@code{@{} and @code{@}} characters. -This is particularly useful when you want a group of commands -to be triggered by a single address (or address-range) match. - -@end table - -@node The "s" Command -@section The @code{s} Command - -The syntax of the @code{s} (as in substitute) command is -@samp{s/@var{regexp}/@var{replacement}/@var{flags}}. The @code{/} -characters may be uniformly replaced by any other single -character within any given @code{s} command. The @code{/} -character (or whatever other character is used in its stead) -can appear in the @var{regexp} or @var{replacement} -only if it is preceded by a @code{\} character. - -The @code{s} command is probably the most important in @command{sed} -and has a lot of different options. Its basic concept is simple: -the @code{s} command attempts to match the pattern -space against the supplied @var{regexp}; if the match is -successful, then that portion of the pattern -space which was matched is replaced with @var{replacement}. - -@cindex Backreferences, in regular expressions -@cindex Parenthesized substrings -The @var{replacement} can contain @code{\@var{n}} (@var{n} being -a number from 1 to 9, inclusive) references, which refer to -the portion of the match which is contained between the @var{n}th -@code{\(} and its matching @code{\)}. -Also, the @var{replacement} can contain unescaped @code{&} -characters which reference the whole matched portion -of the pattern space. -@cindex @value{SSEDEXT}, case modifiers in @code{s} commands -Finally, as a @value{SSED} extension, you can include a -special sequence made of a backslash and one of the letters -@code{L}, @code{l}, @code{U}, @code{u}, or @code{E}. -The meaning is as follows: - -@table @code -@item \L -Turn the replacement -to lowercase until a @code{\U} or @code{\E} is found, - -@item \l -Turn the -next character to lowercase, - -@item \U -Turn the replacement to uppercase -until a @code{\L} or @code{\E} is found, - -@item \u -Turn the next character -to uppercase, - -@item \E -Stop case conversion started by @code{\L} or @code{\U}. -@end table - -When the @code{g} flag is being used, case conversion does not -propagate from one occurrence of the regular expression to -another. For example, when the following command is executed -with @samp{a-b-} in pattern space: -@example -s/\(b\?\)-/x\u\1/g -@end example - -@noindent -the output is @samp{axxB}. When replacing the first @samp{-}, -the @samp{\u} sequence only affects the empty replacement of -@samp{\1}. It does not affect the @code{x} character that is -added to pattern space when replacing @code{b-} with @code{xB}. - -On the other hand, @code{\l} and @code{\u} do affect the remainder -of the replacement text if they are followed by an empty substitution. -With @samp{a-b-} in pattern space, the following command: -@example -s/\(b\?\)-/\u\1x/g -@end example - -@noindent -will replace @samp{-} with @samp{X} (uppercase) and @samp{b-} with -@samp{Bx}. If this behavior is undesirable, you can prevent it by -adding a @samp{\E} sequence---after @samp{\1} in this case. - -To include a literal @code{\}, @code{&}, or newline in the final -replacement, be sure to precede the desired @code{\}, @code{&}, -or newline in the @var{replacement} with a @code{\}. - -@findex s command, option flags -@cindex Substitution of text, options -The @code{s} command can be followed by zero or more of the -following @var{flags}: - -@table @code -@item g -@cindex Global substitution -@cindex Replacing all text matching regexp in a line -Apply the replacement to @emph{all} matches to the @var{regexp}, -not just the first. - -@item @var{number} -@cindex Replacing only @var{n}th match of regexp in a line -Only replace the @var{number}th match of the @var{regexp}. - -@cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command -@cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command -Note: the @sc{posix} standard does not specify what should happen -when you mix the @code{g} and @var{number} modifiers, -and currently there is no widely agreed upon meaning -across @command{sed} implementations. -For @value{SSED}, the interaction is defined to be: -ignore matches before the @var{number}th, -and then match and replace all matches from -the @var{number}th on. - -@item p -@cindex Text, printing after substitution -If the substitution was made, then print the new pattern space. - -Note: when both the @code{p} and @code{e} options are specified, -the relative ordering of the two produces very different results. -In general, @code{ep} (evaluate then print) is what you want, -but operating the other way round can be useful for debugging. -For this reason, the current version of @value{SSED} interprets -specially the presence of @code{p} options both before and after -@code{e}, printing the pattern space before and after evaluation, -while in general flags for the @code{s} command show their -effect just once. This behavior, although documented, might -change in future versions. - -@item w @var{file-name} -@cindex Text, writing to a file after substitution -@cindex @value{SSEDEXT}, @file{/dev/stdout} file -@cindex @value{SSEDEXT}, @file{/dev/stderr} file -If the substitution was made, then write out the result to the named file. -As a @value{SSED} extension, two special values of @var{file-name} are -supported: @file{/dev/stderr}, which writes the result to the standard -error, and @file{/dev/stdout}, which writes to the standard -output.@footnote{This is equivalent to @code{p} unless the @option{-i} -option is being used.} - -@item e -@cindex Evaluate Bourne-shell commands, after substitution -@cindex Subprocesses -@cindex @value{SSEDEXT}, evaluating Bourne-shell commands -@cindex @value{SSEDEXT}, subprocesses -This command allows one to pipe input from a shell command -into pattern space. If a substitution was made, the command -that is found in pattern space is executed and pattern space -is replaced with its output. A trailing newline is suppressed; -results are undefined if the command to be executed contains -a @sc{nul} character. This is a @value{SSED} extension. - -@item I -@itemx i -@cindex @acronym{GNU} extensions, @code{I} modifier -@cindex Case-insensitive matching -@ifset PERL -@cindex Perl-style regular expressions, case-insensitive -@end ifset -The @code{I} modifier to regular-expression matching is a @acronym{GNU} -extension which makes @command{sed} match @var{regexp} in a -case-insensitive manner. - -@item M -@itemx m -@cindex @value{SSEDEXT}, @code{M} modifier -@ifset PERL -@cindex Perl-style regular expressions, multiline -@end ifset -The @code{M} modifier to regular-expression matching is a @value{SSED} -extension which directs @value{SSED} to match the regular expression -in @cite{multi-line} mode. The modifier causes @code{^} and @code{$} to -match respectively (in addition to the normal behavior) the empty string -after a newline, and the empty string before a newline. There are -special character sequences -@ifset PERL -(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'} -in basic or extended regular expression modes) -@end ifset -@ifclear PERL -(@code{\`} and @code{\'}) -@end ifclear -which always match the beginning or the end of the buffer. -In addition, -@ifset PERL -just like in Perl mode without the @code{S} modifier, -@end ifset -the period character does not match a new-line character in -multi-line mode. - -@ifset PERL -@item S -@itemx s -@cindex @value{SSEDEXT}, @code{S} modifier -@cindex Perl-style regular expressions, single line -The @code{S} modifier to regular-expression matching is only valid -in Perl mode and specifies that the dot character (@code{.}) will -match the newline character too. @code{S} stands for @cite{single-line}. -@end ifset - -@ifset PERL -@item X -@itemx x -@cindex @value{SSEDEXT}, @code{X} modifier -@cindex Perl-style regular expressions, extended -The @code{X} modifier to regular-expression matching is also -valid in Perl mode only. If it is used, whitespace in the -pattern (other than in a character class) and -characters between a @kbd{#} outside a character class and the -next newline character are ignored. An escaping backslash -can be used to include a whitespace or @kbd{#} character as part -of the pattern. -@end ifset -@end table - - -@node Other Commands -@section Less Frequently-Used Commands - -Though perhaps less frequently used than those in the previous -section, some very small yet useful @command{sed} scripts can be built with -these commands. - -@table @code -@item y/@var{source-chars}/@var{dest-chars}/ -(The @code{/} characters may be uniformly replaced by -any other single character within any given @code{y} command.) - -@findex y (transliterate) command -@cindex Transliteration -Transliterate any characters in the pattern space which match -any of the @var{source-chars} with the corresponding character -in @var{dest-chars}. - -Instances of the @code{/} (or whatever other character is used in its stead), -@code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars} -lists, provide that each instance is escaped by a @code{\}. -The @var{source-chars} and @var{dest-chars} lists @emph{must} -contain the same number of characters (after de-escaping). - -@item a\ -@itemx @var{text} -@cindex @value{SSEDEXT}, two addresses supported by most commands -As a @acronym{GNU} extension, this command accepts two addresses. - -@findex a (append text lines) command -@cindex Appending text after a line -@cindex Text, appending -Queue the lines of text which follow this command -(each but the last ending with a @code{\}, -which are removed from the output) -to be output at the end of the current cycle, -or when the next input line is read. - -Escape sequences in @var{text} are processed, so you should -use @code{\\} in @var{text} to print a single backslash. - -As a @acronym{GNU} extension, if between the @code{a} and the newline there is -other than a whitespace-@code{\} sequence, then the text of this line, -starting at the first non-whitespace character after the @code{a}, -is taken as the first line of the @var{text} block. -(This enables a simplification in scripting a one-line add.) -This extension also works with the @code{i} and @code{c} commands. - -@item i\ -@itemx @var{text} -@cindex @value{SSEDEXT}, two addresses supported by most commands -As a @acronym{GNU} extension, this command accepts two addresses. - -@findex i (insert text lines) command -@cindex Inserting text before a line -@cindex Text, insertion -Immediately output the lines of text which follow this command -(each but the last ending with a @code{\}, -which are removed from the output). - -@item c\ -@itemx @var{text} -@findex c (change to text lines) command -@cindex Replacing selected lines with other text -Delete the lines matching the address or address-range, -and output the lines of text which follow this command -(each but the last ending with a @code{\}, -which are removed from the output) -in place of the last line -(or in place of each line, if no addresses were specified). -A new cycle is started after this command is done, -since the pattern space will have been deleted. - -@item = -@cindex @value{SSEDEXT}, two addresses supported by most commands -As a @acronym{GNU} extension, this command accepts two addresses. - -@findex = (print line number) command -@cindex Printing line number -@cindex Line number, printing -Print out the current input line number (with a trailing newline). - -@item l @var{n} -@findex l (list unambiguously) command -@cindex List pattern space -@cindex Printing text unambiguously -@cindex Line length, setting -@cindex @value{SSEDEXT}, setting line length -Print the pattern space in an unambiguous form: -non-printable characters (and the @code{\} character) -are printed in C-style escaped form; long lines are split, -with a trailing @code{\} character to indicate the split; -the end of each line is marked with a @code{$}. - -@var{n} specifies the desired line-wrap length; -a length of 0 (zero) means to never wrap long lines. If omitted, -the default as specified on the command line is used. The @var{n} -parameter is a @value{SSED} extension. - -@item r @var{filename} -@cindex @value{SSEDEXT}, two addresses supported by most commands -As a @acronym{GNU} extension, this command accepts two addresses. - -@findex r (read file) command -@cindex Read text from a file -@cindex @value{SSEDEXT}, @file{/dev/stdin} file -Queue the contents of @var{filename} to be read and -inserted into the output stream at the end of the current cycle, -or when the next input line is read. -Note that if @var{filename} cannot be read, it is treated as -if it were an empty file, without any error indication. - -As a @value{SSED} extension, the special value @file{/dev/stdin} -is supported for the file name, which reads the contents of the -standard input. - -@item w @var{filename} -@findex w (write file) command -@cindex Write to a file -@cindex @value{SSEDEXT}, @file{/dev/stdout} file -@cindex @value{SSEDEXT}, @file{/dev/stderr} file -Write the pattern space to @var{filename}. -As a @value{SSED} extension, two special values of @var{file-name} are -supported: @file{/dev/stderr}, which writes the result to the standard -error, and @file{/dev/stdout}, which writes to the standard -output.@footnote{This is equivalent to @code{p} unless the @option{-i} -option is being used.} - -The file will be created (or truncated) before the first input line is -read; all @code{w} commands (including instances of the @code{w} flag -on successful @code{s} commands) which refer to the same @var{filename} -are output without closing and reopening the file. - -@item D -@findex D (delete first line) command -@cindex Delete first line from pattern space -If pattern space contains no newline, start a normal new cycle as if -the @code{d} command was issued. Otherwise, delete text in the pattern -space up to the first newline, and restart cycle with the resultant -pattern space, without reading a new line of input. - -@item N -@findex N (append Next line) command -@cindex Next input line, append to pattern space -@cindex Append next input line to pattern space -Add a newline to the pattern space, -then append the next line of input to the pattern space. -If there is no more input then @command{sed} exits without processing -any more commands. - -@item P -@findex P (print first line) command -@cindex Print first line from pattern space -Print out the portion of the pattern space up to the first newline. - -@item h -@findex h (hold) command -@cindex Copy pattern space into hold space -@cindex Replace hold space with copy of pattern space -@cindex Hold space, copying pattern space into -Replace the contents of the hold space with the contents of the pattern space. - -@item H -@findex H (append Hold) command -@cindex Append pattern space to hold space -@cindex Hold space, appending from pattern space -Append a newline to the contents of the hold space, -and then append the contents of the pattern space to that of the hold space. - -@item g -@findex g (get) command -@cindex Copy hold space into pattern space -@cindex Replace pattern space with copy of hold space -@cindex Hold space, copy into pattern space -Replace the contents of the pattern space with the contents of the hold space. - -@item G -@findex G (appending Get) command -@cindex Append hold space to pattern space -@cindex Hold space, appending to pattern space -Append a newline to the contents of the pattern space, -and then append the contents of the hold space to that of the pattern space. - -@item x -@findex x (eXchange) command -@cindex Exchange hold space with pattern space -@cindex Hold space, exchange with pattern space -Exchange the contents of the hold and pattern spaces. - -@end table - - -@node Programming Commands -@section Commands for @command{sed} gurus - -In most cases, use of these commands indicates that you are -probably better off programming in something like @command{awk} -or Perl. But occasionally one is committed to sticking -with @command{sed}, and these commands can enable one to write -quite convoluted scripts. - -@cindex Flow of control in scripts -@table @code -@item : @var{label} -[No addresses allowed.] - -@findex : (label) command -@cindex Labels, in scripts -Specify the location of @var{label} for branch commands. -In all other respects, a no-op. - -@item b @var{label} -@findex b (branch) command -@cindex Branch to a label, unconditionally -@cindex Goto, in scripts -Unconditionally branch to @var{label}. -The @var{label} may be omitted, in which case the next cycle is started. - -@item t @var{label} -@findex t (test and branch if successful) command -@cindex Branch to a label, if @code{s///} succeeded -@cindex Conditional branch -Branch to @var{label} only if there has been a successful @code{s}ubstitution -since the last input line was read or conditional branch was taken. -The @var{label} may be omitted, in which case the next cycle is started. - -@end table - -@node Extended Commands -@section Commands Specific to @value{SSED} - -These commands are specific to @value{SSED}, so you -must use them with care and only when you are sure that -hindering portability is not evil. They allow you to check -for @value{SSED} extensions or to do tasks that are required -quite often, yet are unsupported by standard @command{sed}s. - -@table @code -@item e [@var{command}] -@findex e (evaluate) command -@cindex Evaluate Bourne-shell commands -@cindex Subprocesses -@cindex @value{SSEDEXT}, evaluating Bourne-shell commands -@cindex @value{SSEDEXT}, subprocesses -This command allows one to pipe input from a shell command -into pattern space. Without parameters, the @code{e} command -executes the command that is found in pattern space and -replaces the pattern space with the output; a trailing newline -is suppressed. - -If a parameter is specified, instead, the @code{e} command -interprets it as a command and sends its output to the output stream. -The command can run across multiple lines, all but the last ending with -a back-slash. - -In both cases, the results are undefined if the command to be -executed contains a @sc{nul} character. - -Note that, unlike the @code{r} command, the output of the command will -be printed immediately; the @code{r} command instead delays the output -to the end of the current cycle. - -@item F -@findex F (File name) command -@cindex Printing file name -@cindex File name, printing -Print out the file name of the current input file (with a trailing -newline). - -@item L @var{n} -@findex L (fLow paragraphs) command -@cindex Reformat pattern space -@cindex Reformatting paragraphs -@cindex @value{SSEDEXT}, reformatting paragraphs -@cindex @value{SSEDEXT}, @code{L} command -This @value{SSED} extension fills and joins lines in pattern space -to produce output lines of (at most) @var{n} characters, like -@code{fmt} does; if @var{n} is omitted, the default as specified -on the command line is used. This command is considered a failed -experiment and unless there is enough request (which seems unlikely) -will be removed in future versions. - -@ignore -Blank lines, spaces between words, and indentation are -preserved in the output; successive input lines with different -indentation are not joined; tabs are expanded to 8 columns. - -If the pattern space contains multiple lines, they are joined, but -since the pattern space usually contains a single line, the behavior -of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e., -it does not join short lines to form longer ones). - -@var{n} specifies the desired line-wrap length; if omitted, -the default as specified on the command line is used. -@end ignore - -@item Q [@var{exit-code}] -This command only accepts a single address. - -@findex Q (silent Quit) command -@cindex @value{SSEDEXT}, quitting silently -@cindex @value{SSEDEXT}, returning an exit code -@cindex Quitting -This command is the same as @code{q}, but will not print the -contents of pattern space. Like @code{q}, it provides the -ability to return an exit code to the caller. - -This command can be useful because the only alternative ways -to accomplish this apparently trivial function are to use -the @option{-n} option (which can unnecessarily complicate -your script) or resorting to the following snippet, which -wastes time by reading the whole file without any visible effect: - -@example -:eat -$d @i{@r{Quit silently on the last line}} -N @i{@r{Read another line, silently}} -g @i{@r{Overwrite pattern space each time to save memory}} -b eat -@end example - -@item R @var{filename} -@findex R (read line) command -@cindex Read text from a file -@cindex @value{SSEDEXT}, reading a file a line at a time -@cindex @value{SSEDEXT}, @code{R} command -@cindex @value{SSEDEXT}, @file{/dev/stdin} file -Queue a line of @var{filename} to be read and -inserted into the output stream at the end of the current cycle, -or when the next input line is read. -Note that if @var{filename} cannot be read, or if its end is -reached, no line is appended, without any error indication. - -As with the @code{r} command, the special value @file{/dev/stdin} -is supported for the file name, which reads a line from the -standard input. - -@item T @var{label} -@findex T (test and branch if failed) command -@cindex @value{SSEDEXT}, branch if @code{s///} failed -@cindex Branch to a label, if @code{s///} failed -@cindex Conditional branch -Branch to @var{label} only if there have been no successful -@code{s}ubstitutions since the last input line was read or -conditional branch was taken. The @var{label} may be omitted, -in which case the next cycle is started. - -@item v @var{version} -@findex v (version) command -@cindex @value{SSEDEXT}, checking for their presence -@cindex Requiring @value{SSED} -This command does nothing, but makes @command{sed} fail if -@value{SSED} extensions are not supported, simply because other -versions of @command{sed} do not implement it. In addition, you -can specify the version of @command{sed} that your script -requires, such as @code{4.0.5}. The default is @code{4.0} -because that is the first version that implemented this command. - -This command enables all @value{SSEDEXT} even if -@env{POSIXLY_CORRECT} is set in the environment. - -@item W @var{filename} -@findex W (write first line) command -@cindex Write first line to a file -@cindex @value{SSEDEXT}, writing first line to a file -Write to the given filename the portion of the pattern space up to -the first newline. Everything said under the @code{w} command about -file handling holds here too. - -@item z -@findex z (Zap) command -@cindex @value{SSEDEXT}, emptying pattern space -@cindex Emptying pattern space -This command empties the content of pattern space. It is -usually the same as @samp{s/.*//}, but is more efficient -and works in the presence of invalid multibyte sequences -in the input stream. @sc{posix} mandates that such sequences -are @emph{not} matched by @samp{.}, so that there is no portable -way to clear @command{sed}'s buffers in the middle of the -script in most multibyte locales (including UTF-8 locales). -@end table - -@node Escapes -@section @acronym{GNU} Extensions for Escapes in Regular Expressions - -@cindex @acronym{GNU} extensions, special escapes -Until this chapter, we have only encountered escapes of the form -@samp{\^}, which tell @command{sed} not to interpret the circumflex -as a special character, but rather to take it literally. For -example, @samp{\*} matches a single asterisk rather than zero -or more backslashes. - -@cindex @code{POSIXLY_CORRECT} behavior, escapes -This chapter introduces another kind of escape@footnote{All -the escapes introduced here are @acronym{GNU} -extensions, with the exception of @code{\n}. In basic regular -expression mode, setting @code{POSIXLY_CORRECT} disables them inside -bracket expressions.}---that -is, escapes that are applied to a character or sequence of characters -that ordinarily are taken literally, and that @command{sed} replaces -with a special character. This provides a way -of encoding non-printable characters in patterns in a visible manner. -There is no restriction on the appearance of non-printing characters -in a @command{sed} script but when a script is being prepared in the -shell or by text editing, it is usually easier to use one of -the following escape sequences than the binary character it -represents: - -The list of these escapes is: - -@table @code -@item \a -Produces or matches a @sc{bel} character, that is an ``alert'' (@sc{ascii} 7). - -@item \f -Produces or matches a form feed (@sc{ascii} 12). - -@item \n -Produces or matches a newline (@sc{ascii} 10). - -@item \r -Produces or matches a carriage return (@sc{ascii} 13). - -@item \t -Produces or matches a horizontal tab (@sc{ascii} 9). - -@item \v -Produces or matches a so called ``vertical tab'' (@sc{ascii} 11). - -@item \c@var{x} -Produces or matches @kbd{@sc{Control}-@var{x}}, where @var{x} is -any character. The precise effect of @samp{\c@var{x}} is as follows: -if @var{x} is a lower case letter, it is converted to upper case. -Then bit 6 of the character (hex 40) is inverted. Thus @samp{\cz} becomes -hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B. - -@item \d@var{xxx} -Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}. - -@item \o@var{xxx} -@ifset PERL -@item \@var{xxx} -@end ifset -Produces or matches a character whose octal @sc{ascii} value is @var{xxx}. -@ifset PERL -The syntax without the @code{o} is active in Perl mode, while the one -with the @code{o} is active in the normal or extended @sc{posix} regular -expression modes. -@end ifset - -@item \x@var{xx} -Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}. -@end table - -@samp{\b} (backspace) was omitted because of the conflict with -the existing ``word boundary'' meaning. - -Other escapes match a particular character class and are valid only in -regular expressions: - -@table @code -@item \w -Matches any ``word'' character. A ``word'' character is any -letter or digit or the underscore character. - -@item \W -Matches any ``non-word'' character. - -@item \b -Matches a word boundary; that is it matches if the character -to the left is a ``word'' character and the character to the -right is a ``non-word'' character, or vice-versa. - -@item \B -Matches everywhere but on a word boundary; that is it matches -if the character to the left and the character to the right -are either both ``word'' characters or both ``non-word'' -characters. - -@item \` -Matches only at the start of pattern space. This is different -from @code{^} in multi-line mode. - -@item \' -Matches only at the end of pattern space. This is different -from @code{$} in multi-line mode. - -@ifset PERL -@item \G -Match only at the start of pattern space or, when doing a global -substitution using the @code{s///g} command and option, at -the end-of-match position of the prior match. For example, -@samp{s/\Ga/Z/g} will change an initial run of @code{a}s to -a run of @code{Z}s -@end ifset -@end table - -@node Examples -@chapter Some Sample Scripts - -Here are some @command{sed} scripts to guide you in the art of mastering -@command{sed}. - -@menu -Some exotic examples: -* Centering lines:: -* Increment a number:: -* Rename files to lower case:: -* Print bash environment:: -* Reverse chars of lines:: - -Emulating standard utilities: -* tac:: Reverse lines of files -* cat -n:: Numbering lines -* cat -b:: Numbering non-blank lines -* wc -c:: Counting chars -* wc -w:: Counting words -* wc -l:: Counting lines -* head:: Printing the first lines -* tail:: Printing the last lines -* uniq:: Make duplicate lines unique -* uniq -d:: Print duplicated lines of input -* uniq -u:: Remove all duplicated lines -* cat -s:: Squeezing blank lines -@end menu - -@node Centering lines -@section Centering Lines - -This script centers all lines of a file on a 80 columns width. -To change that width, the number in @code{\@{@dots{}\@}} must be -replaced, and the number of added spaces also must be changed. - -Note how the buffer commands are used to separate parts in -the regular expressions to be matched---this is a common -technique. - -@c start------------------------------------------- -@example -#!/usr/bin/sed -f - -# Put 80 spaces in the buffer -1 @{ - x - s/^$/ / - s/^.*$/&&&&&&&&/ - x -@} - -# del leading and trailing spaces -y/@kbd{tab}/ / -s/^ *// -s/ *$// - -# add a newline and 80 spaces to end of line -G - -# keep first 81 chars (80 + a newline) -s/^\(.\@{81\@}\).*$/\1/ - -# \2 matches half of the spaces, which are moved to the beginning -s/^\(.*\)\n\(.*\)\2/\2\1/ -@end example -@c end--------------------------------------------- - -@node Increment a number -@section Increment a Number - -This script is one of a few that demonstrate how to do arithmetic -in @command{sed}. This is indeed possible,@footnote{@command{sed} guru Greg -Ubben wrote an implementation of the @command{dc} @sc{rpn} calculator! -It is distributed together with sed.} but must be done manually. - -To increment one number you just add 1 to last digit, replacing -it by the following digit. There is one exception: when the digit -is a nine the previous digits must be also incremented until you -don't have a nine. - -This solution by Bruno Haible is very clever and smart because -it uses a single buffer; if you don't have this limitation, the -algorithm used in @ref{cat -n, Numbering lines}, is faster. -It works by replacing trailing nines with an underscore, then -using multiple @code{s} commands to increment the last digit, -and then again substituting underscores with zeros. - -@c start------------------------------------------- -@example -#!/usr/bin/sed -f - -/[^0-9]/ d - -# replace all trailing 9s by _ (any other character except digits, could -# be used) -:d -s/9\(_*\)$/_\1/ -td - -# incr last digit only. The first line adds a most-significant -# digit of 1 if we have to add a digit. - -s/^\(_*\)$/1\1/; tn -s/8\(_*\)$/9\1/; tn -s/7\(_*\)$/8\1/; tn -s/6\(_*\)$/7\1/; tn -s/5\(_*\)$/6\1/; tn -s/4\(_*\)$/5\1/; tn -s/3\(_*\)$/4\1/; tn -s/2\(_*\)$/3\1/; tn -s/1\(_*\)$/2\1/; tn -s/0\(_*\)$/1\1/; tn - -:n -y/_/0/ -@end example -@c end--------------------------------------------- - -@node Rename files to lower case -@section Rename Files to Lower Case - -This is a pretty strange use of @command{sed}. We transform text, and -transform it to be shell commands, then just feed them to shell. -Don't worry, even worse hacks are done when using @command{sed}; I have -seen a script converting the output of @command{date} into a @command{bc} -program! - -The main body of this is the @command{sed} script, which remaps the name -from lower to upper (or vice-versa) and even checks out -if the remapped name is the same as the original name. -Note how the script is parameterized using shell -variables and proper quoting. - -@c start------------------------------------------- -@example -#! /bin/sh -# rename files to lower/upper case... -# -# usage: -# move-to-lower * -# move-to-upper * -# or -# move-to-lower -R . -# move-to-upper -R . -# - -help() -@{ - cat << eof -Usage: $0 [-n] [-r] [-h] files... - --n do nothing, only see what would be done --R recursive (use find) --h this message -files files to remap to lower case - -Examples: - $0 -n * (see if everything is ok, then...) - $0 * - - $0 -R . - -eof -@} - -apply_cmd='sh' -finder='echo "$@@" | tr " " "\n"' -files_only= - -while : -do - case "$1" in - -n) apply_cmd='cat' ;; - -R) finder='find "$@@" -type f';; - -h) help ; exit 1 ;; - *) break ;; - esac - shift -done - -if [ -z "$1" ]; then - echo Usage: $0 [-h] [-n] [-r] files... - exit 1 -fi - -LOWER='abcdefghijklmnopqrstuvwxyz' -UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ' - -case `basename $0` in - *upper*) TO=$UPPER; FROM=$LOWER ;; - *) FROM=$UPPER; TO=$LOWER ;; -esac - -eval $finder | sed -n ' - -# remove all trailing slashes -s/\/*$// - -# add ./ if there is no path, only a filename -/\//! s/^/.\// - -# save path+filename -h - -# remove path -s/.*\/// - -# do conversion only on filename -y/'$FROM'/'$TO'/ - -# now line contains original path+file, while -# hold space contains the new filename -x - -# add converted file name to line, which now contains -# path/file-name\nconverted-file-name -G - -# check if converted file name is equal to original file name, -# if it is, do not print anything -/^.*\/\(.*\)\n\1/b - -# escape special characters for the shell -s/["$`\\]/\\&/g - -# now, transform path/fromfile\n, into -# mv path/fromfile path/tofile and print it -s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p - -' | $apply_cmd -@end example -@c end--------------------------------------------- - -@node Print bash environment -@section Print @command{bash} Environment - -This script strips the definition of the shell functions -from the output of the @command{set} Bourne-shell command. - -@c start------------------------------------------- -@example -#!/bin/sh - -set | sed -n ' -:x - -@ifinfo -# if no occurrence of "=()" print and load next line -@end ifinfo -@ifnotinfo -# if no occurrence of @samp{=()} print and load next line -@end ifnotinfo -/=()/! @{ p; b; @} -/ () $/! @{ p; b; @} - -# possible start of functions section -# save the line in case this is a var like FOO="() " -h - -# if the next line has a brace, we quit because -# nothing comes after functions -n -/^@{/ q - -# print the old line -x; p - -# work on the new line now -x; bx -' -@end example -@c end--------------------------------------------- - -@node Reverse chars of lines -@section Reverse Characters of Lines - -This script can be used to reverse the position of characters -in lines. The technique moves two characters at a time, hence -it is faster than more intuitive implementations. - -Note the @code{tx} command before the definition of the label. -This is often needed to reset the flag that is tested by -the @code{t} command. - -Imaginative readers will find uses for this script. An example -is reversing the output of @command{banner}.@footnote{This requires -another script to pad the output of banner; for example - -@example -#! /bin/sh - -banner -w $1 $2 $3 $4 | - sed -e :a -e '/^.\@{0,'$1'\@}$/ @{ s/$/ /; ba; @}' | - ~/sedscripts/reverseline.sed -@end example -} - -@c start------------------------------------------- -@example -#!/usr/bin/sed -f - -/../! b - -# Reverse a line. Begin embedding the line between two newlines -s/^.*$/\ -&\ -/ - -# Move first character at the end. The regexp matches until -# there are zero or one characters between the markers -tx -:x -s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/ -tx - -# Remove the newline markers -s/\n//g -@end example -@c end--------------------------------------------- - -@node tac -@section Reverse Lines of Files - -This one begins a series of totally useless (yet interesting) -scripts emulating various Unix commands. This, in particular, -is a @command{tac} workalike. - -Note that on implementations other than @acronym{GNU} @command{sed} -@ifset PERL -and @value{SSED} -@end ifset -this script might easily overflow internal buffers. - -@c start------------------------------------------- -@example -#!/usr/bin/sed -nf - -# reverse all lines of input, i.e. first line became last, ... - -# from the second line, the buffer (which contains all previous lines) -# is *appended* to current line, so, the order will be reversed -1! G - -# on the last line we're done -- print everything -$ p - -# store everything on the buffer again -h -@end example -@c end--------------------------------------------- - -@node cat -n -@section Numbering Lines - -This script replaces @samp{cat -n}; in fact it formats its output -exactly like @acronym{GNU} @command{cat} does. - -Of course this is completely useless and for two reasons: first, -because somebody else did it in C, second, because the following -Bourne-shell script could be used for the same purpose and would -be much faster: - -@c start------------------------------------------- -@example -#! /bin/sh -sed -e "=" $@@ | sed -e ' - s/^/ / - N - s/^ *\(......\)\n/\1 / -' -@end example -@c end--------------------------------------------- - -It uses @command{sed} to print the line number, then groups lines two -by two using @code{N}. Of course, this script does not teach as much as -the one presented below. - -The algorithm used for incrementing uses both buffers, so the line -is printed as soon as possible and then discarded. The number -is split so that changing digits go in a buffer and unchanged ones go -in the other; the changed digits are modified in a single step -(using a @code{y} command). The line number for the next line -is then composed and stored in the hold space, to be used in the -next iteration. - -@c start------------------------------------------- -@example -#!/usr/bin/sed -nf - -# Prime the pump on the first line -x -/^$/ s/^.*$/1/ - -# Add the correct line number before the pattern -G -h - -# Format it and print it -s/^/ / -s/^ *\(......\)\n/\1 /p - -# Get the line number from hold space; add a zero -# if we're going to add a digit on the next line -g -s/\n.*$// -/^9*$/ s/^/0/ - -# separate changing/unchanged digits with an x -s/.9*$/x&/ - -# keep changing digits in hold space -h -s/^.*x// -y/0123456789/1234567890/ -x - -# keep unchanged digits in pattern space -s/x.*$// - -# compose the new number, remove the newline implicitly added by G -G -s/\n// -h -@end example -@c end--------------------------------------------- - -@node cat -b -@section Numbering Non-blank Lines - -Emulating @samp{cat -b} is almost the same as @samp{cat -n}---we only -have to select which lines are to be numbered and which are not. - -The part that is common to this script and the previous one is -not commented to show how important it is to comment @command{sed} -scripts properly... - -@c start------------------------------------------- -@example -#!/usr/bin/sed -nf - -/^$/ @{ - p - b -@} - -# Same as cat -n from now -x -/^$/ s/^.*$/1/ -G -h -s/^/ / -s/^ *\(......\)\n/\1 /p -x -s/\n.*$// -/^9*$/ s/^/0/ -s/.9*$/x&/ -h -s/^.*x// -y/0123456789/1234567890/ -x -s/x.*$// -G -s/\n// -h -@end example -@c end--------------------------------------------- - -@node wc -c -@section Counting Characters - -This script shows another way to do arithmetic with @command{sed}. -In this case we have to add possibly large numbers, so implementing -this by successive increments would not be feasible (and possibly -even more complicated to contrive than this script). - -The approach is to map numbers to letters, kind of an abacus -implemented with @command{sed}. @samp{a}s are units, @samp{b}s are -tens and so on: we simply add the number of characters -on the current line as units, and then propagate the carry -to tens, hundreds, and so on. - -As usual, running totals are kept in hold space. - -On the last line, we convert the abacus form back to decimal. -For the sake of variety, this is done with a loop rather than -with some 80 @code{s} commands@footnote{Some implementations -have a limit of 199 commands per script}: first we -convert units, removing @samp{a}s from the number; then we -rotate letters so that tens become @samp{a}s, and so on -until no more letters remain. - -@c start------------------------------------------- -@example -#!/usr/bin/sed -nf - -# Add n+1 a's to hold space (+1 is for the newline) -s/./a/g -H -x -s/\n/a/ - -# Do the carry. The t's and b's are not necessary, -# but they do speed up the thing -t a -: a; s/aaaaaaaaaa/b/g; t b; b done -: b; s/bbbbbbbbbb/c/g; t c; b done -: c; s/cccccccccc/d/g; t d; b done -: d; s/dddddddddd/e/g; t e; b done -: e; s/eeeeeeeeee/f/g; t f; b done -: f; s/ffffffffff/g/g; t g; b done -: g; s/gggggggggg/h/g; t h; b done -: h; s/hhhhhhhhhh//g - -: done -$! @{ - h - b -@} - -# On the last line, convert back to decimal - -: loop -/a/! s/[b-h]*/&0/ -s/aaaaaaaaa/9/ -s/aaaaaaaa/8/ -s/aaaaaaa/7/ -s/aaaaaa/6/ -s/aaaaa/5/ -s/aaaa/4/ -s/aaa/3/ -s/aa/2/ -s/a/1/ - -: next -y/bcdefgh/abcdefg/ -/[a-h]/ b loop -p -@end example -@c end--------------------------------------------- - -@node wc -w -@section Counting Words - -This script is almost the same as the previous one, once each -of the words on the line is converted to a single @samp{a} -(in the previous script each letter was changed to an @samp{a}). - -It is interesting that real @command{wc} programs have optimized -loops for @samp{wc -c}, so they are much slower at counting -words rather than characters. This script's bottleneck, -instead, is arithmetic, and hence the word-counting one -is faster (it has to manage smaller numbers). - -Again, the common parts are not commented to show the importance -of commenting @command{sed} scripts. - -@c start------------------------------------------- -@example -#!/usr/bin/sed -nf - -# Convert words to a's -s/[ @kbd{tab}][ @kbd{tab}]*/ /g -s/^/ / -s/ [^ ][^ ]*/a /g -s/ //g - -# Append them to hold space -H -x -s/\n// - -# From here on it is the same as in wc -c. -/aaaaaaaaaa/! bx; s/aaaaaaaaaa/b/g -/bbbbbbbbbb/! bx; s/bbbbbbbbbb/c/g -/cccccccccc/! bx; s/cccccccccc/d/g -/dddddddddd/! bx; s/dddddddddd/e/g -/eeeeeeeeee/! bx; s/eeeeeeeeee/f/g -/ffffffffff/! bx; s/ffffffffff/g/g -/gggggggggg/! bx; s/gggggggggg/h/g -s/hhhhhhhhhh//g -:x -$! @{ h; b; @} -:y -/a/! s/[b-h]*/&0/ -s/aaaaaaaaa/9/ -s/aaaaaaaa/8/ -s/aaaaaaa/7/ -s/aaaaaa/6/ -s/aaaaa/5/ -s/aaaa/4/ -s/aaa/3/ -s/aa/2/ -s/a/1/ -y/bcdefgh/abcdefg/ -/[a-h]/ by -p -@end example -@c end--------------------------------------------- - -@node wc -l -@section Counting Lines - -No strange things are done now, because @command{sed} gives us -@samp{wc -l} functionality for free!!! Look: - -@c start------------------------------------------- -@example -#!/usr/bin/sed -nf -$= -@end example -@c end--------------------------------------------- - -@node head -@section Printing the First Lines - -This script is probably the simplest useful @command{sed} script. -It displays the first 10 lines of input; the number of displayed -lines is right before the @code{q} command. - -@c start------------------------------------------- -@example -#!/usr/bin/sed -f -10q -@end example -@c end--------------------------------------------- - -@node tail -@section Printing the Last Lines - -Printing the last @var{n} lines rather than the first is more complex -but indeed possible. @var{n} is encoded in the second line, before -the bang character. - -This script is similar to the @command{tac} script in that it keeps the -final output in the hold space and prints it at the end: - -@c start------------------------------------------- -@example -#!/usr/bin/sed -nf - -1! @{; H; g; @} -1,10 !s/[^\n]*\n// -$p -h -@end example -@c end--------------------------------------------- - -Mainly, the scripts keeps a window of 10 lines and slides it -by adding a line and deleting the oldest (the substitution command -on the second line works like a @code{D} command but does not -restart the loop). - -The ``sliding window'' technique is a very powerful way to write -efficient and complex @command{sed} scripts, because commands like -@code{P} would require a lot of work if implemented manually. - -To introduce the technique, which is fully demonstrated in the -rest of this chapter and is based on the @code{N}, @code{P} -and @code{D} commands, here is an implementation of @command{tail} -using a simple ``sliding window.'' - -This looks complicated but in fact the working is the same as -the last script: after we have kicked in the appropriate number -of lines, however, we stop using the hold space to keep inter-line -state, and instead use @code{N} and @code{D} to slide pattern -space by one line: - -@c start------------------------------------------- -@example -#!/usr/bin/sed -f - -1h -2,10 @{; H; g; @} -$q -1,9d -N -D -@end example -@c end--------------------------------------------- - -Note how the first, second and fourth line are inactive after -the first ten lines of input. After that, all the script does -is: exiting on the last line of input, appending the next input -line to pattern space, and removing the first line. - -@node uniq -@section Make Duplicate Lines Unique - -This is an example of the art of using the @code{N}, @code{P} -and @code{D} commands, probably the most difficult to master. - -@c start------------------------------------------- -@example -#!/usr/bin/sed -f -h - -:b -# On the last line, print and exit -$b -N -/^\(.*\)\n\1$/ @{ - # The two lines are identical. Undo the effect of - # the n command. - g - bb -@} - -# If the @code{N} command had added the last line, print and exit -$b - -# The lines are different; print the first and go -# back working on the second. -P -D -@end example -@c end--------------------------------------------- - -As you can see, we mantain a 2-line window using @code{P} and @code{D}. -This technique is often used in advanced @command{sed} scripts. - -@node uniq -d -@section Print Duplicated Lines of Input - -This script prints only duplicated lines, like @samp{uniq -d}. - -@c start------------------------------------------- -@example -#!/usr/bin/sed -nf - -$b -N -/^\(.*\)\n\1$/ @{ - # Print the first of the duplicated lines - s/.*\n// - p - - # Loop until we get a different line - :b - $b - N - /^\(.*\)\n\1$/ @{ - s/.*\n// - bb - @} -@} - -# The last line cannot be followed by duplicates -$b - -# Found a different one. Leave it alone in the pattern space -# and go back to the top, hunting its duplicates -D -@end example -@c end--------------------------------------------- - -@node uniq -u -@section Remove All Duplicated Lines - -This script prints only unique lines, like @samp{uniq -u}. - -@c start------------------------------------------- -@example -#!/usr/bin/sed -f - -# Search for a duplicate line --- until that, print what you find. -$b -N -/^\(.*\)\n\1$/ ! @{ - P - D -@} - -:c -# Got two equal lines in pattern space. At the -# end of the file we simply exit -$d - -# Else, we keep reading lines with @code{N} until we -# find a different one -s/.*\n// -N -/^\(.*\)\n\1$/ @{ - bc -@} - -# Remove the last instance of the duplicate line -# and go back to the top -D -@end example -@c end--------------------------------------------- - -@node cat -s -@section Squeezing Blank Lines - -As a final example, here are three scripts, of increasing complexity -and speed, that implement the same function as @samp{cat -s}, that is -squeezing blank lines. - -The first leaves a blank line at the beginning and end if there are -some already. - -@c start------------------------------------------- -@example -#!/usr/bin/sed -f - -# on empty lines, join with next -# Note there is a star in the regexp -:x -/^\n*$/ @{ -N -bx -@} - -# now, squeeze all '\n', this can be also done by: -# s/^\(\n\)*/\1/ -s/\n*/\ -/ -@end example -@c end--------------------------------------------- - -This one is a bit more complex and removes all empty lines -at the beginning. It does leave a single blank line at end -if one was there. - -@c start------------------------------------------- -@example -#!/usr/bin/sed -f - -# delete all leading empty lines -1,/^./@{ -/./!d -@} - -# on an empty line we remove it and all the following -# empty lines, but one -:x -/./!@{ -N -s/^\n$// -tx -@} -@end example -@c end--------------------------------------------- - -This removes leading and trailing blank lines. It is also the -fastest. Note that loops are completely done with @code{n} and -@code{b}, without relying on @command{sed} to restart the -the script automatically at the end of a line. - -@c start------------------------------------------- -@example -#!/usr/bin/sed -nf - -# delete all (leading) blanks -/./!d - -# get here: so there is a non empty -:x -# print it -p -# get next -n -# got chars? print it again, etc... -/./bx - -# no, don't have chars: got an empty line -:z -# get next, if last line we finish here so no trailing -# empty lines are written -n -# also empty? then ignore it, and get next... this will -# remove ALL empty lines -/./!bz - -# all empty lines were deleted/ignored, but we have a non empty. As -# what we want to do is to squeeze, insert a blank line artificially -i\ - -bx -@end example -@c end--------------------------------------------- - -@node Limitations -@chapter @value{SSED}'s Limitations and Non-limitations - -@cindex @acronym{GNU} extensions, unlimited line length -@cindex Portability, line length limitations -For those who want to write portable @command{sed} scripts, -be aware that some implementations have been known to -limit line lengths (for the pattern and hold spaces) -to be no more than 4000 bytes. -The @sc{posix} standard specifies that conforming @command{sed} -implementations shall support at least 8192 byte line lengths. -@value{SSED} has no built-in limit on line length; -as long as it can @code{malloc()} more (virtual) memory, -you can feed or construct lines as long as you like. - -However, recursion is used to handle subpatterns and indefinite -repetition. This means that the available stack space may limit -the size of the buffer that can be processed by certain patterns. - -@ifset PERL -There are some size limitations in the regular expression -matcher but it is hoped that they will never in practice -be relevant. The maximum length of a compiled pattern -is 65539 (sic) bytes. All values in repeating quantifiers -must be less than 65536. The maximum nesting depth of -all parenthesized subpatterns, including capturing and -non-capturing subpatterns@footnote{The -distinction is meaningful when referring to Perl-style -regular expressions.}, assertions, and other types of -subpattern, is 200. - -Also, @value{SSED} recognizes the @sc{posix} syntax -@code{[.@var{ch}.]} and @code{[=@var{ch}=]} -where @var{ch} is a ``collating element'', but these -are not supported, and an error is given if they are -encountered. - -Here are a few distinctions between the real Perl-style -regular expressions and those that @option{-R} recognizes. - -@enumerate -@item -Lookahead assertions do not allow repeat quantifiers after them -Perl permits them, but they do not mean what you -might think. For example, @samp{(?!a)@{3@}} does not assert that the -next three characters are not @samp{a}. It just asserts three times that the -next character is not @samp{a} --- a waste of time and nothing else. - -@item -Capturing subpatterns that occur inside negative lookahead -head assertions are counted, but their entries are counted -as empty in the second half of an @code{s} command. -Perl sets its numerical variables from any such patterns -that are matched before the assertion fails to match -something (thereby succeeding), but only if the negative -lookahead assertion contains just one branch. - -@item -The following Perl escape sequences are not supported: -@samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E}, -@samp{\Q}. In fact these are implemented by Perl's general -string-handling and are not part of its pattern matching engine. - -@item -The Perl @samp{\G} assertion is not supported as it is not -relevant to single pattern matches. - -@item -Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})} -and @samp{(?p@{code@})} constructions. However, there is some experimental -support for recursive patterns using the non-Perl item @samp{(?R)}. - -@item -There are at the time of writing some oddities in Perl -5.005_02 concerned with the settings of captured strings -when part of a pattern is repeated. For example, matching -@samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets -@samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.} -to the value @samp{b}, but matching @samp{aabbaa} -against @samp{/^(aa(bb)?)+$/} leaves @samp{$2} -unset. However, if the pattern is changed to -@samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set. -In Perl 5.004 @samp{$2} is set in both cases, and that is also -true of @value{SSED}. - -@item -Another as yet unresolved discrepancy is that in Perl -5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches -the string @samp{a}, whereas in @value{SSED} it does not. -However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched -against @samp{a} leaves $1 unset. -@end enumerate -@end ifset - -@node Other Resources -@chapter Other Resources for Learning About @command{sed} - -@cindex Additional reading about @command{sed} -In addition to several books that have been written about @command{sed} -(either specifically or as chapters in books which discuss -shell programming), one can find out more about @command{sed} -(including suggestions of a few books) from the FAQ -for the @code{sed-users} mailing list, available from: -@display -@uref{http://sed.sourceforge.net/sedfaq.html} -@end display - -Also of interest are -@uref{http://www.student.northpark.edu/pemente/sed/index.htm} -and @uref{http://sed.sf.net/grabbag}, -which include @command{sed} tutorials and other @command{sed}-related goodies. - -The @code{sed-users} mailing list itself maintained by Sven Guckes. -To subscribe, visit @uref{http://groups.yahoo.com} and search -for the @code{sed-users} mailing list. - -@node Reporting Bugs -@chapter Reporting Bugs - -@cindex Bugs, reporting -Email bug reports to @email{bug-sed@@gnu.org}. -Also, please include the output of @samp{sed --version} in the body -of your report if at all possible. - -Please do not send a bug report like this: - -@example -@i{@i{@r{while building frobme-1.3.4}}} -$ configure -@error{} sed: file sedscr line 1: Unknown option to 's' -@end example - -If @value{SSED} doesn't configure your favorite package, take a -few extra minutes to identify the specific problem and make a stand-alone -test case. Unlike other programs such as C compilers, making such test -cases for @command{sed} is quite simple. - -A stand-alone test case includes all the data necessary to perform the -test, and the specific invocation of @command{sed} that causes the problem. -The smaller a stand-alone test case is, the better. A test case should -not involve something as far removed from @command{sed} as ``try to configure -frobme-1.3.4''. Yes, that is in principle enough information to look -for the bug, but that is not a very practical prospect. - -Here are a few commonly reported bugs that are not bugs. - -@table @asis -@item @code{N} command on the last line -@cindex Portability, @code{N} command on the last line -@cindex Non-bugs, @code{N} command on the last line - -Most versions of @command{sed} exit without printing anything when -the @command{N} command is issued on the last line of a file. -@value{SSED} prints pattern space before exiting unless of course -the @command{-n} command switch has been specified. This choice is -by design. - -For example, the behavior of -@example -sed N foo bar -@end example -@noindent -would depend on whether foo has an even or an odd number of -lines@footnote{which is the actual ``bug'' that prompted the -change in behavior}. Or, when writing a script to read the -next few lines following a pattern match, traditional -implementations of @code{sed} would force you to write -something like -@example -/foo/@{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N @} -@end example -@noindent -instead of just -@example -/foo/@{ N;N;N;N;N;N;N;N;N; @} -@end example - -@cindex @code{POSIXLY_CORRECT} behavior, @code{N} command -In any case, the simplest workaround is to use @code{$d;N} in -scripts that rely on the traditional behavior, or to set -the @code{POSIXLY_CORRECT} variable to a non-empty value. - -@item Regex syntax clashes (problems with backslashes) -@cindex @acronym{GNU} extensions, to basic regular expressions -@cindex Non-bugs, regex syntax clashes -@command{sed} uses the @sc{posix} basic regular expression syntax. According to -the standard, the meaning of some escape sequences is undefined in -this syntax; notable in the case of @command{sed} are @code{\|}, -@code{\+}, @code{\?}, @code{\`}, @code{\'}, @code{\<}, -@code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}. - -As in all @acronym{GNU} programs that use @sc{posix} basic regular -expressions, @command{sed} interprets these escape sequences as special -characters. So, @code{x\+} matches one or more occurrences of @samp{x}. -@code{abc\|def} matches either @samp{abc} or @samp{def}. - -This syntax may cause problems when running scripts written for other -@command{sed}s. Some @command{sed} programs have been written with the -assumption that @code{\|} and @code{\+} match the literal characters -@code{|} and @code{+}. Such scripts must be modified by removing the -spurious backslashes if they are to be used with modern implementations -of @command{sed}, like -@ifset PERL -@value{SSED} or -@end ifset -@acronym{GNU} @command{sed}. - -On the other hand, some scripts use s|abc\|def||g to remove occurrences -of @emph{either} @code{abc} or @code{def}. While this worked until -@command{sed} 4.0.x, newer versions interpret this as removing the -string @code{abc|def}. This is again undefined behavior according to -@acronym{POSIX}, and this interpretation is arguably more robust: older -@command{sed}s, for example, required that the regex matcher parsed -@code{\/} as @code{/} in the common case of escaping a slash, which is -again undefined behavior; the new behavior avoids this, and this is good -because the regex matcher is only partially under our control. - -@cindex @acronym{GNU} extensions, special escapes -In addition, this version of @command{sed} supports several escape characters -(some of which are multi-character) to insert non-printable characters -in scripts (@code{\a}, @code{\c}, @code{\d}, @code{\o}, @code{\r}, -@code{\t}, @code{\v}, @code{\x}). These can cause similar problems -with scripts written for other @command{sed}s. - -@item @option{-i} clobbers read-only files -@cindex In-place editing -@cindex @value{SSEDEXT}, in-place editing -@cindex Non-bugs, in-place editing - -In short, @samp{sed -i} will let you delete the contents of -a read-only file, and in general the @option{-i} option -(@pxref{Invoking sed, , Invocation}) lets you clobber -protected files. This is not a bug, but rather a consequence -of how the Unix filesystem works. - -The permissions on a file say what can happen to the data -in that file, while the permissions on a directory say what can -happen to the list of files in that directory. @samp{sed -i} -will not ever open for writing a file that is already on disk. -Rather, it will work on a temporary file that is finally renamed -to the original name: if you rename or delete files, you're actually -modifying the contents of the directory, so the operation depends on -the permissions of the directory, not of the file. For this same -reason, @command{sed} does not let you use @option{-i} on a writeable file -in a read-only directory, and will break hard or symbolic links when -@option{-i} is used on such a file. - -@item @code{0a} does not work (gives an error) -@cindex @code{0} address -@cindex @acronym{GNU} extensions, @code{0} address -@cindex Non-bugs, @code{0} address - -There is no line 0. 0 is a special address that is only used to treat -addresses like @code{0,/@var{RE}/} as active when the script starts: if -you write @code{1,/abc/d} and the first line includes the word @samp{abc}, -then that match would be ignored because address ranges must span at least -two lines (barring the end of the file); but what you probably wanted is -to delete every line up to the first one including @samp{abc}, and this -is obtained with @code{0,/abc/d}. - -@ifclear PERL -@item @code{[a-z]} is case insensitive -@cindex Non-bugs, localization-related - -You are encountering problems with locales. POSIX mandates that @code{[a-z]} -uses the current locale's collation order -- in C parlance, that means using -@code{strcoll(3)} instead of @code{strcmp(3)}. Some locales have a -case-insensitive collation order, others don't. - -Another problem is that @code{[a-z]} tries to use collation symbols. -This only happens if you are on the @acronym{GNU} system, using -@acronym{GNU} libc's regular expression matcher instead of compiling the -one supplied with @acronym{GNU} sed. In a Danish locale, for example, -the regular expression @code{^[a-z]$} matches the string @samp{aa}, -because this is a single collating symbol that comes after @samp{a} -and before @samp{b}; @samp{ll} behaves similarly in Spanish -locales, or @samp{ij} in Dutch locales. - -To work around these problems, which may cause bugs in shell scripts, set -the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}. - -@item @code{s/.*//} does not clear pattern space -@cindex Non-bugs, localization-related -@cindex @value{SSEDEXT}, emptying pattern space -@cindex Emptying pattern space - -This happens if your input stream includes invalid multibyte -sequences. @sc{posix} mandates that such sequences -are @emph{not} matched by @samp{.}, so that @samp{s/.*//} will not clear -pattern space as you would expect. In fact, there is no way to clear -sed's buffers in the middle of the script in most multibyte locales -(including UTF-8 locales). For this reason, @value{SSED} provides a `z' -command (for `zap') as an extension. - -To work around these problems, which may cause bugs in shell scripts, set -the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}. -@end ifclear -@end table - - -@node Extended regexps -@appendix Extended regular expressions -@cindex Extended regular expressions, syntax - -The only difference between basic and extended regular expressions is in -the behavior of a few characters: @samp{?}, @samp{+}, parentheses, -braces (@samp{@{@}}), and @samp{|}. While basic regular expressions -require these to be escaped if you want them to behave as special -characters, when using extended regular expressions you must escape -them if you want them @emph{to match a literal character}. @samp{|} -is special here because @samp{\|} is a GNU extension -- standard -basic regular expressions do not provide its functionality. - -@noindent -Examples: -@table @code -@item abc? -becomes @samp{abc\?} when using extended regular expressions. It matches -the literal string @samp{abc?}. - -@item c\+ -becomes @samp{c+} when using extended regular expressions. It matches -one or more @samp{c}s. - -@item a\@{3,\@} -becomes @samp{a@{3,@}} when using extended regular expressions. It matches -three or more @samp{a}s. - -@item \(abc\)\@{2,3\@} -becomes @samp{(abc)@{2,3@}} when using extended regular expressions. It -matches either @samp{abcabc} or @samp{abcabcabc}. - -@item \(abc*\)\1 -becomes @samp{(abc*)\1} when using extended regular expressions. -Backreferences must still be escaped when using extended regular -expressions. -@end table - -@ifset PERL -@node Perl regexps -@appendix Perl-style regular expressions -@cindex Perl-style regular expressions, syntax - -@emph{This part is taken from the @file{pcre.txt} file distributed together -with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.} - -Perl introduced several extensions to regular expressions, some -of them incompatible with the syntax of regular expressions -accepted by Emacs and other @acronym{GNU} tools (whose matcher was -based on the Emacs matcher). @value{SSED} implements -both kinds of extensions. - -@iftex -Summarizing, we have: - -@itemize @bullet -@item -A backslash can introduce several special sequences - -@item -The circumflex, dollar sign, and period characters behave specially -with regard to new lines - -@item -Strange uses of square brackets are parsed differently - -@item -You can toggle modifiers in the middle of a regular expression - -@item -You can specify that a subpattern does not count when numbering backreferences - -@item -@cindex Greedy regular expression matching -You can specify greedy or non-greedy matching - -@item -You can have more than ten back references - -@item -You can do complex look aheads and look behinds (in the spirit of -@code{\b}, but with subpatterns). - -@item -You can often improve performance by avoiding that @command{sed} wastes -time with backtracking - -@item -You can have if/then/else branches - -@item -You can do recursive matches, for example to look for unbalanced parentheses - -@item -You can have comments and non-significant whitespace, because things can -get complex... -@end itemize - -Most of these extensions are introduced by the special @code{(?} -sequence, which gives special meanings to parenthesized groups. -@end iftex -@menu -Other extensions can be roughly subdivided in two categories -On one hand Perl introduces several more escaped sequences -(that is, sequences introduced by a backslash). On the other -hand, it specifies that if a question mark follows an open -parentheses it should give a special meaning to the parenthesized -group. - -* Backslash:: Introduces special sequences -* Circumflex/dollar sign/period:: Behave specially with regard to new lines -* Square brackets:: Are a bit different in strange cases -* Options setting:: Toggle modifiers in the middle of a regexp -* Non-capturing subpatterns:: Are not counted when backreferencing -* Repetition:: Allows for non-greedy matching -* Backreferences:: Allows for more than 10 back references -* Assertions:: Allows for complex look ahead matches -* Non-backtracking subpatterns:: Often gives more performance -* Conditional subpatterns:: Allows if/then/else branches -* Recursive patterns:: For example to match parentheses -* Comments:: Because things can get complex... -@end menu - -@node Backslash -@appendixsec Backslash -@cindex Perl-style regular expressions, escaped sequences - -There are a few difference in the handling of backslashed -sequences in Perl mode. - -First of all, there are no @code{\o} and @code{\d} sequences. -@sc{ascii} values for characters can be specified in octal -with a @code{\@var{xxx}} sequence, where @var{xxx} is a -sequence of up to three octal digits. If the first digit -is a zero, the treatment of the sequence is straightforward; -just note that if the character that follows the escaped digit -is itself an octal digit, you have to supply three octal digits -for @var{xxx}. For example @code{\07} is a @sc{bel} character -rather than a @sc{nul} and a literal @code{7} (this sequence is -instead represented by @code{\0007}). - -@cindex Perl-style regular expressions, backreferences -The handling of a backslash followed by a digit other than 0 -is complicated. Outside a character class, @command{sed} reads it -and any following digits as a decimal number. If the number -is less than 10, or if there have been at least that many -previous capturing left parentheses in the expression, the -entire sequence is taken as a back reference. A description -of how this works is given later, following the discussion -of parenthesized subpatterns. - -Inside a character class, or if the decimal number is -greater than 9 and there have not been that many capturing -subpatterns, @command{sed} re-reads up to three octal digits following -the backslash, and generates a single byte from the -least significant 8 bits of the value. Any subsequent digits -stand for themselves. For example: - -@example -\040 @i{@r{is another way of writing a space}} -\40 @i{@r{is the same, provided there are fewer than 40}} - @i{@r{previous capturing subpatterns}} -\7 @i{@r{is always a back reference}} -\011 @i{@r{is always a tab}} -\11 @i{@r{might be a back reference, or another way of writing a tab}} -\0113 @i{@r{is a tab followed by the character @samp{3}}} -\113 @i{@r{is the character with octal code 113 (since there}} - @i{@r{can be no more than 99 back references)}} -\377 @i{@r{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}} -\81 @i{@r{is either a back reference, or a binary zero}} - @i{@r{followed by the two characters @samp{81}}} -@end example - -Note that octal values of 100 or greater must not be introduced -by a leading zero, because no more than three octal -digits are ever read. Note that this applies only to the LHS -pattern; it is not possible yet to specify more than 9 backreferences -on the RHS of the `s' command. - -All the sequences that define a single byte value can be -used both inside and outside character classes. In addition, -inside a character class, the sequence @code{\b} is interpreted -as the backspace character (hex 08). Outside a character -class it has a different meaning (see below). - -In addition, there are four additional escapes specifying -generic character classes (like @code{\w} and @code{\W} do): - -@cindex Perl-style regular expressions, character classes -@table @samp -@item \d -Matches any decimal digit - -@item \D -Matches any character that is not a decimal digit -@end table - -In Perl mode, these character type sequences can appear both inside and -outside character classes. Instead, in @sc{posix} mode these sequences -(as well as @code{\w} and @code{\W}) are treated as two literal characters -(a backslash and a letter) inside square brackets. - -Escaped sequences specifying assertions are also different in -Perl mode. An assertion specifies a condition that has to be met -at a particular point in a match, without consuming any -characters from the subject string. The use of subpatterns -for more complicated assertions is described below. The -backslashed assertions are - -@cindex Perl-style regular expressions, assertions -@table @samp -@item \b -Asserts that the point is at a word boundary. -A word boundary is a position in the subject string where -the current character and the previous character do not both -match @code{\w} or @code{\W} (i.e. one matches @code{\w} and -the other matches @code{\W}), or the start or end of the string -if the first or last character matches @code{\w}, respectively. - -@item \B -Asserts that the point is not at a word boundary. - -@item \A -Asserts the matcher is at the start of pattern space (independent -of multiline mode). - -@item \Z -Asserts the matcher is at the end of pattern space, -or at a newline before the end of pattern space (independent of -multiline mode) - -@item \z -Asserts the matcher is at the end of pattern space (independent -of multiline mode) -@end table - -These assertions may not appear in character classes (but -note that @code{\b} has a different meaning, namely the -backspace character, inside a character class). -Note that Perl mode does not support directly assertions -for the beginning and the end of word; the @acronym{GNU} extensions -@code{\<} and @code{\>} achieve this purpose in @sc{posix} mode -instead. - -The @code{\A}, @code{\Z}, and @code{\z} assertions differ -from the traditional circumflex and dollar sign (described below) -in that they only ever match at the very start and end of the -subject string, whatever options are set; in particular @code{\A} -and @code{\z} are the same as the @acronym{GNU} extensions -@code{\`} and @code{\'} that are active in @sc{posix} mode. - -@node Circumflex/dollar sign/period -@appendixsec Circumflex, dollar sign, period -@cindex Perl-style regular expressions, newlines - -Outside a character class, in the default matching mode, the -circumflex character is an assertion which is true only if -the current matching point is at the start of the subject -string. Inside a character class, the circumflex has an entirely -different meaning (see below). - -The circumflex need not be the first character of the pattern if -a number of alternatives are involved, but it should be the -first thing in each alternative in which it appears if the -pattern is ever to match that branch. If all possible alternatives, -start with a circumflex, that is, if the pattern is -constrained to match only at the start of the subject, it is -said to be an @dfn{anchored} pattern. (There are also other constructs -structs that can cause a pattern to be anchored.) - -A dollar sign is an assertion which is true only if the -current matching point is at the end of the subject string, -or immediately before a newline character that is the last -character in the string (by default). A dollar sign need not be the -last character of the pattern if a number of alternatives -are involved, but it should be the last item in any branch -in which it appears. A dollar sign has no special meaning in a -character class. - -@cindex Perl-style regular expressions, multiline -The meanings of the circumflex and dollar sign characters are -changed if the @code{M} modifier option is used. When this is -the case, they match immediately after and immediately -before an internal @code{\n} character, respectively, in addition -to matching at the start and end of the subject string. For -example, the pattern @code{/^abc$/} matches the subject string -@samp{def\nabc} in multiline mode, but not otherwise. Consequently, -patterns that are anchored in single line mode -because all branches start with @code{^} are not anchored in -multiline mode. - -@cindex Perl-style regular expressions, multiline -Note that the sequences @code{\A}, @code{\Z}, and @code{\z} -can be used to match the start and end of the subject in both -modes, and if all branches of a pattern start with @code{\A} -is it always anchored, whether the @code{M} modifier is set or not. - -@cindex Perl-style regular expressions, single line -Outside a character class, a dot in the pattern matches any -one character in the subject, including a non-printing character, -but not (by default) newline. If the @code{S} modifier is used, -dots match newlines as well. Actually, the handling of -dot is entirely independent of the handling of circumflex -and dollar sign, the only relationship being that they both -involve newline characters. Dot has no special meaning in a -character class. - -@node Square brackets -@appendixsec Square brackets -@cindex Perl-style regular expressions, character classes - -An opening square bracket introduces a character class, terminated -by a closing square bracket. A closing square bracket on its own -is not special. If a closing square bracket is required as a -member of the class, it should be the first data character in -the class (after an initial circumflex, if present) or escaped with a backslash. - -A character class matches a single character in the subject; -the character must be in the set of characters defined by -the class, unless the first character in the class is a circumflex, -in which case the subject character must not be in -the set defined by the class. If a circumflex is actually -required as a member of the class, ensure it is not the -first character, or escape it with a backslash. - -For example, the character class [aeiou] matches any lower -case vowel, while [^aeiou] matches any character that is not -a lower case vowel. Note that a circumflex is just a convenient -venient notation for specifying the characters which are in -the class by enumerating those that are not. It is not an -assertion: it still consumes a character from the subject -string, and fails if the current pointer is at the end of -the string. - -@cindex Perl-style regular expressions, case-insensitive -When caseless matching is set, any letters in a class -represent both their upper case and lower case versions, so -for example, a caseless @code{[aeiou]} matches uppercase -and lowercase @samp{A}s, and a caseless @code{[^aeiou]} -does not match @samp{A}, whereas a case-sensitive version would. - -@cindex Perl-style regular expressions, single line -@cindex Perl-style regular expressions, multiline -The newline character is never treated in any special way in -character classes, whatever the setting of the @code{S} and -@code{M} options (modifiers) is. A class such as @code{[^a]} will -always match a newline. - -The minus (hyphen) character can be used to specify a range -of characters in a character class. For example, @code{[d-m]} -matches any letter between d and m, inclusive. If a minus -character is required in a class, it must be escaped with a -backslash or appear in a position where it cannot be interpreted -as indicating a range, typically as the first or last -character in the class. - -It is not possible to have the literal character @code{]} as the -end character of a range. A pattern such as @code{[W-]46]} is -interpreted as a class of two characters (@code{W} and @code{-}) -followed by a literal string @code{46]}, so it would match -@samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped -with a backslash it is interpreted as the end of range, so -@code{[W-\]46]} is interpreted as a single class containing a -range followed by two separate characters. The octal or -hexadecimal representation of @code{]} can also be used to end a range. - -Ranges operate in @sc{ascii} collating sequence. They can also be -used for characters specified numerically, for example -@code{[\000-\037]}. If a range that includes letters is used when -caseless matching is set, it matches the letters in either -case. For example, a caseless @code{[W-c]} is equivalent to -@code{[][\^_`wxyzabc]}, matched caselessly, and if character -tables for the French locale are in use, @code{[\xc8-\xcb]} -matches accented E characters in both cases. - -Unlike in @sc{posix} mode, the character types @code{\d}, -@code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W} -may also appear in a character class, and add the characters -that they match to the class. For example, @code{[\dABCDEF]} matches any -hexadecimal digit. A circumflex can conveniently be used -with the upper case character types to specify a more restricted -set of characters than the matching lower case type. -For example, the class @code{[^\W_]} matches any letter or digit, -but not underscore. - -All non-alphameric characters other than @code{\}, @code{-}, -@code{^} (at the start) and the terminating @code{]} -are non-special in character classes, but it does no harm -if they are escaped. - -Perl 5.6 supports the @sc{posix} notation for character classes, which -uses names enclosed by @code{[:} and @code{:]} within the enclosing -square brackets, and @value{SSED} supports this notation as well. -For example, - -@example -[01[:alpha:]%] -@end example - -@noindent -matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}. -The supported class names are - -@table @code -@item alnum -Matches letters and digits - -@item alpha -Matches letters - -@item ascii -Matches character codes 0 - 127 - -@item cntrl -Matches control characters - -@item digit -Matches decimal digits (same as \d) - -@item graph -Matches printing characters, excluding space - -@item lower -Matches lower case letters - -@item print -Matches printing characters, including space - -@item punct -Matches printing characters, excluding letters and digits - -@item space -Matches white space (same as \s) - -@item upper -Matches upper case letters - -@item word -Matches ``word'' characters (same as \w) - -@item xdigit -Matches hexadecimal digits -@end table - -The names @code{ascii} and @code{word} are extensions valid only in -Perl mode. Another Perl extension is negation, which is -indicated by a circumflex character after the colon. For example, - -@example -[12[:^digit:]] -@end example - -@noindent -matches @samp{1}, @samp{2}, or any non-digit. - -@node Options setting -@appendixsec Options setting -@cindex Perl-style regular expressions, toggling options -@cindex Perl-style regular expressions, case-insensitive -@cindex Perl-style regular expressions, multiline -@cindex Perl-style regular expressions, single line -@cindex Perl-style regular expressions, extended - -The settings of the @code{I}, @code{M}, @code{S}, @code{X} -modifiers can be changed from within the pattern by -a sequence of Perl option letters enclosed between @code{(?} -and @code{)}. The option letters must be lowercase. - -For example, @code{(?im)} sets caseless, multiline matching. It is -also possible to unset these options by preceding the letter -with a hyphen; you can also have combined settings and unsettings: -@code{(?im-sx)} sets caseless and multiline matching, -while unsets single line matching (for dots) and extended -whitespace interpretation. If a letter appears both before -and after the hyphen, the option is unset. - -The scope of these option changes depends on where in the -pattern the setting occurs. For settings that are outside -any subpattern (defined below), the effect is the same as if -the options were set or unset at the start of matching. The -following patterns all behave in exactly the same way: - -@example -(?i)abc -a(?i)bc -ab(?i)c -abc(?i) -@end example - -which in turn is the same as specifying the pattern abc with -the @code{I} modifier. In other words, ``top level'' settings -apply to the whole pattern (unless there are other -changes inside subpatterns). If there is more than one setting -of the same option at top level, the rightmost setting -is used. - -If an option change occurs inside a subpattern, the effect -is different. This is a change of behaviour in Perl 5.005. -An option change inside a subpattern affects only that part -of the subpattern @emph{that follows} it, so - -@example -(a(?i)b)c -@end example - -@noindent -matches abc and aBc and no other strings (assuming -case-sensitive matching is used). By this means, options can -be made to have different settings in different parts of the -pattern. Any changes made in one alternative do carry on -into subsequent branches within the same subpattern. For -example, - -@example -(a(?i)b|c) -@end example - -@noindent -matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C}, -even though when matching @samp{C} the first branch is -abandoned before the option setting. -This is because the effects of option settings happen at -compile time. There would be some very weird behaviour otherwise. - -@ignore -There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA -that can be changed in the same way as the Perl-compatible options by -using the characters U and X respectively. The (?X) flag -setting is special in that it must always occur earlier in -the pattern than any of the additional features it turns on, -even when it is at top level. It is best put at the start. -@end ignore - - -@node Non-capturing subpatterns -@appendixsec Non-capturing subpatterns -@cindex Perl-style regular expressions, non-capturing subpatterns - -Marking part of a pattern as a subpattern does two things. -On one hand, it localizes a set of alternatives; on the other -hand, it sets up the subpattern as a capturing subpattern (as -defined above). The subpattern can be backreferenced and -referenced in the right side of @code{s} commands. - -For example, if the string @samp{the red king} is matched against -the pattern - -@example -the ((red|white) (king|queen)) -@end example - -@noindent -the captured substrings are @samp{red king}, @samp{red}, -and @samp{king}, and are numbered 1, 2, and 3. - -The fact that plain parentheses fulfil two functions is not -always helpful. There are often times when a grouping -subpattern is required without a capturing requirement. If an -opening parenthesis is followed by @code{?:}, the subpattern does -not do any capturing, and is not counted when computing the -number of any subsequent capturing subpatterns. For example, -if the string @samp{the white queen} is matched against the pattern - -@example -the ((?:red|white) (king|queen)) -@end example - -@noindent -the captured substrings are @samp{white queen} and @samp{queen}, -and are numbered 1 and 2. The maximum number of captured -substrings is 99, while the maximum number of all subpatterns, -both capturing and non-capturing, is 200. - -As a convenient shorthand, if any option settings are -equired at the start of a non-capturing subpattern, the -option letters may appear between the @code{?} and the -@code{:}. Thus the two patterns - -@example -(?i:saturday|sunday) -(?:(?i)saturday|sunday) -@end example - -@noindent -match exactly the same set of strings. Because alternative -branches are tried from left to right, and options are not -reset until the end of the subpattern is reached, an option -setting in one branch does affect subsequent branches, so -the above patterns match @samp{SUNDAY} as well as @samp{Saturday}. - - -@node Repetition -@appendixsec Repetition -@cindex Perl-style regular expressions, repetitions - -Repetition is specified by quantifiers, which can follow any -of the following items: - -@itemize @bullet -@item -a single character, possibly escaped - -@item -the @code{.} special character - -@item -a character class - -@item -a back reference (see next section) - -@item -a parenthesized subpattern (unless it is an assertion; @pxref{Assertions}) -@end itemize - -The general repetition quantifier specifies a minimum and -maximum number of permitted matches, by giving the two -numbers in curly brackets (braces), separated by a comma. -The numbers must be less than 65536, and the first must be -less than or equal to the second. For example: - -@example -z@{2,4@} -@end example - -@noindent -matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own -is not a special character. If the second number is omitted, -but the comma is present, there is no upper limit; if the -second number and the comma are both omitted, the quantifier -specifies an exact number of required matches. Thus - -@example -[aeiou]@{3,@} -@end example - -@noindent -matches at least 3 successive vowels, but may match many -more, while - -@example -\d@{8@} -@end example - -@noindent -matches exactly 8 digits. An opening curly bracket that -appears in a position where a quantifier is not allowed, or -one that does not match the syntax of a quantifier, is taken -as a literal character. For example, @{,6@} is not a quantifier, -but a literal string of four characters.@footnote{It -raises an error if @option{-R} is not used.} - -The quantifier @samp{@{0@}} is permitted, causing the expression to -behave as if the previous item and the quantifier were not -present. - -For convenience (and historical compatibility) the three -most common quantifiers have single-character abbreviations: - -@table @code -@item * -is equivalent to @{0,@} - -@item + -is equivalent to @{1,@} - -@item ? -is equivalent to @{0,1@} -@end table - -It is possible to construct infinite loops by following a -subpattern that can match no characters with a quantifier -that has no upper limit, for example: - -@example -(a?)* -@end example - -Earlier versions of Perl used to give an error at -compile time for such patterns. However, because there are -cases where this can be useful, such patterns are now -accepted, but if any repetition of the subpattern does in -fact match no characters, the loop is forcibly broken. - -@cindex Greedy regular expression matching -@cindex Perl-style regular expressions, stingy repetitions -By default, the quantifiers are @dfn{greedy} like in @sc{posix} -mode, that is, they match as much as possible (up to the maximum -number of permitted times), without causing the rest of the -pattern to fail. The classic example of where this gives problems -is in trying to match comments in C programs. These appear between -the sequences @code{/*} and @code{*/} and within the sequence, individual -@code{*} and @code{/} characters may appear. An attempt to match C -comments by applying the pattern - -@example -/\*.*\*/ -@end example - -@noindent -to the string - -@example -/* first command */ not comment /* second comment */ -@end example - -@noindent - -fails, because it matches the entire string owing to the -greediness of the @code{.*} item. - -However, if a quantifier is followed by a question mark, it -ceases to be greedy, and instead matches the minimum number -of times possible, so the pattern @code{/\*.*?\*/} -does the right thing with the C comments. The meaning of the -various quantifiers is not otherwise changed, just the preferred -number of matches. Do not confuse this use of question -mark with its use as a quantifier in its own right. -Because it has two uses, it can sometimes appear doubled, as in - -@example -\d??\d -@end example - -which matches one digit by preference, but can match two if -that is the only way the rest of the pattern matches. - -Note that greediness does not matter when specifying addresses, -but can be nevertheless used to improve performance. - -@ignore -If the PCRE_UNGREEDY option is set (an option which is not -available in Perl), the quantifiers are not greedy by -default, but individual ones can be made greedy by following -them with a question mark. In other words, it inverts the -default behaviour. -@end ignore - -When a parenthesized subpattern is quantified with a minimum -repeat count that is greater than 1 or with a limited maximum, -more store is required for the compiled pattern, in -proportion to the size of the minimum or maximum. - -@cindex Perl-style regular expressions, single line -If a pattern starts with @code{.*} or @code{.@{0,@}} and the -@code{S} modifier is used, the pattern is implicitly anchored, -because whatever follows will be tried against every character -position in the subject string, so there is no point in -retrying the overall match at any position after the first. -PCRE treats such a pattern as though it were preceded by \A. - -When a capturing subpattern is repeated, the value captured -is the substring that matched the final iteration. For example, -after - -@example -(tweedle[dume]@{3@}\s*)+ -@end example - -@noindent -has matched @samp{tweedledum tweedledee} the value of the -captured substring is @samp{tweedledee}. However, if there are -nested capturing subpatterns, the corresponding captured -values may have been set in previous iterations. For example, -after - -@example -/(a|(b))+/ -@end example - -matches @samp{aba}, the value of the second captured substring is -@samp{b}. - -@node Backreferences -@appendixsec Backreferences -@cindex Perl-style regular expressions, backreferences - -Outside a character class, a backslash followed by a digit -greater than 0 (and possibly further digits) is a back -reference to a capturing subpattern earlier (i.e. to its -left) in the pattern, provided there have been that many -previous capturing left parentheses. - -However, if the decimal number following the backslash is -less than 10, it is always taken as a back reference, and -causes an error only if there are not that many capturing -left parentheses in the entire pattern. In other words, the -parentheses that are referenced need not be to the left of -the reference for numbers less than 10. @ref{Backslash} -for further details of the handling of digits following a backslash. - -A back reference matches whatever actually matched the capturing -subpattern in the current subject string, rather than -anything matching the subpattern itself. So the pattern - -@example -(sens|respons)e and \1ibility -@end example - -@noindent -matches @samp{sense and sensibility} and @samp{response and responsibility}, -but not @samp{sense and responsibility}. If caseful -matching is in force at the time of the back reference, the -case of letters is relevant. For example, - -@example -((?i)blah)\s+\1 -@end example - -@noindent -matches @samp{blah blah} and @samp{Blah Blah}, but not -@samp{BLAH blah}, even though the original capturing -subpattern is matched caselessly. - -There may be more than one back reference to the same subpattern. -Also, if a subpattern has not actually been used in a -particular match, any back references to it always fail. For -example, the pattern - -@example -(a|(bc))\2 -@end example - -@noindent -always fails if it starts to match @samp{a} rather than -@samp{bc}. Because there may be up to 99 back references, all -digits following the backslash are taken as part of a potential -back reference number; this is different from what happens -in @sc{posix} mode. If the pattern continues with a digit -character, some delimiter must be used to terminate the back -reference. If the @code{X} modifier option is set, this can be -whitespace. Otherwise an empty comment can be used, or the -following character can be expressed in hexadecimal or octal. -Note that this applies only to the LHS pattern; it is -not possible yet to specify more than 9 backreferences on the -RHS of the `s' command. - -A back reference that occurs inside the parentheses to which -it refers fails when the subpattern is first used, so, for -example, @code{(a\1)} never matches. However, such references -can be useful inside repeated subpatterns. For example, the -pattern - -@example -(a|b\1)+ -@end example - -@noindent -matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa}, -etc. At each iteration of the subpattern, the back reference matches -the character string corresponding to the previous iteration. In -order for this to work, the pattern must be such that the first -iteration does not need to match the back reference. This can be -done using alternation, as in the example above, or by a -quantifier with a minimum of zero. - -@node Assertions -@appendixsec Assertions -@cindex Perl-style regular expressions, assertions -@cindex Perl-style regular expressions, asserting subpatterns - -An assertion is a test on the characters following or -preceding the current matching point that does not actually -consume any characters. The simple assertions coded as @code{\b}, -@code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$} -are described above. More complicated assertions are coded as -subpatterns. There are two kinds: those that look ahead of the -current position in the subject string, and those that look behind it. - -@cindex Perl-style regular expressions, lookahead subpatterns -An assertion subpattern is matched in the normal way, except -that it does not cause the current matching position to be -changed. Lookahead assertions start with @code{(?=} for positive -assertions and @code{(?!} for negative assertions. For example, - -@example -\w+(?=;) -@end example - -@noindent -matches a word followed by a semicolon, but does not include -the semicolon in the match, and - -@example -foo(?!bar) -@end example - -@noindent -matches any occurrence of @samp{foo} that is not followed by -@samp{bar}. - -Note that the apparently similar pattern - -@example -(?!foo)bar -@end example - -@noindent -@cindex Perl-style regular expressions, lookbehind subpatterns -finds any occurrence of @samp{bar} even if it is preceded by -@samp{foo}, because the assertion @code{(?!foo)} is always true -when the next three characters are @samp{bar}. A lookbehind -assertion is needed to achieve this effect. -Lookbehind assertions start with @code{(?<=} for positive -assertions and @code{(?<!} for negative assertions. So, - -@example -(?<!foo)bar -@end example - -achieves the required effect of finding an occurrence of -@samp{bar} that is not preceded by @samp{foo}. The contents of a -lookbehind assertion are restricted -such that all the strings it matches must have a fixed -length. However, if there are several alternatives, they do -not all have to have the same fixed length. This is an extension -compared with Perl 5.005, which requires all branches to match -the same length of string. Thus - -@example -(?<=dogs|cats|) -@end example - -@noindent -is permitted, but the apparently equivalent regular expression - -@example -(?<!dogs?|cats?) -@end example - -@noindent -causes an error at compile time. Branches that match different -length strings are permitted only at the top level of -a lookbehind assertion: an assertion such as - -@example -(?<=ab(c|de)) -@end example - -@noindent -is not permitted, because its single top-level branch can -match two different lengths, but it is acceptable if rewritten -to use two top-level branches: - -@example -(?<=abc|abde) -@end example - -All this is required because lookbehind assertions simply -move the current position back by the alternative's fixed -width and then try to match. If there are -insufficient characters before the current position, the -match is deemed to fail. Lookbehinds, in conjunction with -non-backtracking subpatterns can be particularly useful for -matching at the ends of strings; an example is given at the end -of the section on non-backtracking subpatterns. - -Several assertions (of any sort) may occur in succession. -For example, - -@example -(?<=\d@{3@})(?<!999)foo -@end example - -@noindent -matches @samp{foo} preceded by three digits that are not @samp{999}. -Notice that each of the assertions is applied independently -at the same point in the subject string. First there is a -check that the previous three characters are all digits, and -then there is a check that the same three characters are not -@samp{999}. This pattern does not match @samp{foo} preceded by six -characters, the first of which are digits and the last three -of which are not @samp{999}. For example, it doesn't match -@samp{123abcfoo}. A pattern to do that is - -@example -(?<=\d@{3@}...)(?<!999)foo -@end example - -@noindent -This time the first assertion looks at the preceding six -characters, checking that the first three are digits, and -then the second assertion checks that the preceding three -characters are not @samp{999}. Actually, assertions can be -nested in any combination, so one can write this as - -@example -(?<=\d@{3@}(?!999)...)foo -@end example - -or - -@example -(?<=\d@{3@}...(?<!999))foo -@end example - -@noindent -both of which might be considered more readable. - -Assertion subpatterns are not capturing subpatterns, and may -not be repeated, because it makes no sense to assert the -same thing several times. If any kind of assertion contains -capturing subpatterns within it, these are counted for the -purposes of numbering the capturing subpatterns in the whole -pattern. However, substring capturing is carried out only -for positive assertions, because it does not make sense for -negative assertions. - -Assertions count towards the maximum of 200 parenthesized -subpatterns. - -@node Non-backtracking subpatterns -@appendixsec Non-backtracking subpatterns -@cindex Perl-style regular expressions, non-backtracking subpatterns - -With both maximizing and minimizing repetition, failure of -what follows normally causes the repeated item to be evaluated -again to see if a different number of repeats allows the -rest of the pattern to match. Sometimes it is useful to -prevent this, either to change the nature of the match, or -to cause it fail earlier than it otherwise might, when the -author of the pattern knows there is no point in carrying -on. - -Consider, for example, the pattern @code{\d+foo} when applied to -the subject line - -@example -123456bar -@end example - -After matching all 6 digits and then failing to match @samp{foo}, -the normal action of the matcher is to try again with only 5 -digits matching the @code{\d+} item, and then with 4, and so on, -before ultimately failing. Non-backtracking subpatterns -provide the means for specifying that once a portion of the -pattern has matched, it is not to be re-evaluated in this way, -so the matcher would give up immediately on failing to match -@samp{foo} the first time. The notation is another kind of special -parenthesis, starting with @code{(?>} as in this example: - -@example -(?>\d+)bar -@end example - -This kind of parenthesis ``locks up'' the part of the pattern -it contains once it has matched, and a failure further into -the pattern is prevented from backtracking into it. -Backtracking past it to previous items, however, works as -normal. - -Non-backtracking subpatterns are not capturing subpatterns. Simple -cases such as the above example can be thought of as a maximizing -repeat that must swallow everything it can. So, -while both @code{\d+} and @code{\d+?} are prepared to adjust the number of -digits they match in order to make the rest of the pattern -match, @code{(?>\d+)} can only match an entire sequence of digits. - -This construction can of course contain arbitrarily complicated -subpatterns, and it can be nested. - -@cindex Perl-style regular expressions, lookbehind subpatterns -Non-backtracking subpatterns can be used in conjunction with look-behind -assertions to specify efficient matching at the end -of the subject string. Consider a simple pattern such as - -@example -abcd$ -@end example - -@noindent -when applied to a long string which does not match. Because -matching proceeds from left to right, @command{sed} will look for -each @samp{a} in the subject and then see if what follows matches -the rest of the pattern. If the pattern is specified as - -@example -^.*abcd$ -@end example - -@noindent -the initial @code{.*} matches the entire string at first, but when -this fails (because there is no following @samp{a}), it backtracks -to match all but the last character, then all but the -last two characters, and so on. Once again the search for -@samp{a} covers the entire string, from right to left, so we are -no better off. However, if the pattern is written as - -@example -^(?>.*)(?<=abcd) -@end example - -there can be no backtracking for the .* item; it can match -only the entire string. The subsequent lookbehind assertion -does a single test on the last four characters. If it fails, -the match fails immediately. For long strings, this approach -makes a significant difference to the processing time. - -When a pattern contains an unlimited repeat inside a subpattern -that can itself be repeated an unlimited number of -times, the use of a once-only subpattern is the only way to -avoid some failing matches taking a very long time -indeed.@footnote{Actually, the matcher embedded in @value{SSED} -tries to do something for this in the simplest cases, -like @code{([^b]*b)*}. These cases are actually quite -common: they happen for example in a regular expression -like @code{\/\*([^*]*\*)*\/} which matches C comments.} - -The pattern - -@example -(\D+|<\d+>)*[!?] -@end example - -([^0-9<]+<(\d+>)?)*[!?] - -@noindent -matches an unlimited number of substrings that either consist -of non-digits, or digits enclosed in angular brackets, followed by -an exclamation or question mark. When it matches, it runs quickly. -However, if it is applied to - -@example -aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa -@end example - -@noindent -it takes a long time before reporting failure. This is -because the string can be divided between the two repeats in -a large number of ways, and all have to be tried.@footnote{The -example used @code{[!?]} rather than a single character at the end, -because both @value{SSED} and Perl have an optimization that allows -for fast failure when a single character is used. They -remember the last single character that is required for a -match, and fail early if it is not present in the string.} - -If the pattern is changed to - -@example -((?>\D+)|<\d+>)*[!?] -@end example - -sequences of non-digits cannot be broken, and failure happens -quickly. - -@node Conditional subpatterns -@appendixsec Conditional subpatterns -@cindex Perl-style regular expressions, conditional subpatterns - -It is possible to cause the matching process to obey a subpattern -conditionally or to choose between two alternative -subpatterns, depending on the result of an assertion, or -whether a previous capturing subpattern matched or not. The -two possible forms of conditional subpattern are - -@example -(?(@var{condition})@var{yes-pattern}) -(?(@var{condition})@var{yes-pattern}|@var{no-pattern}) -@end example - -If the condition is satisfied, the yes-pattern is used; otherwise -the no-pattern (if present) is used. If there are more than two -alternatives in the subpattern, a compile-time error occurs. - -There are two kinds of condition. If the text between the -parentheses consists of a sequence of digits, the condition -is satisfied if the capturing subpattern of that number has -previously matched. The number must be greater than zero. -Consider the following pattern, which contains non-significant -white space to make it more readable (assume the @code{X} modifier) -and to divide it into three parts for ease of discussion: - -@example -( \( )? [^()]+ (?(1) \) ) -@end example - -The first part matches an optional opening parenthesis, and -if that character is present, sets it as the first captured -substring. The second part matches one or more characters -that are not parentheses. The third part is a conditional -subpattern that tests whether the first set of parentheses -matched or not. If they did, that is, if subject started -with an opening parenthesis, the condition is true, and so -the yes-pattern is executed and a closing parenthesis is -required. Otherwise, since no-pattern is not present, the -subpattern matches nothing. In other words, this pattern -matches a sequence of non-parentheses, optionally enclosed -in parentheses. - -@cindex Perl-style regular expressions, lookahead subpatterns -If the condition is not a sequence of digits, it must be an -assertion. This may be a positive or negative lookahead or -lookbehind assertion. Consider this pattern, again containing -non-significant white space, and with the two alternatives -on the second line: - -@example -(?(?=...[a-z]) - \d\d-[a-z]@{3@}-\d\d | - \d\d-\d\d-\d\d ) -@end example - -The condition is a positive lookahead assertion that matches -a letter that is three characters away from the current point. -If a letter is found, the subject is matched against the first -alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are -letters and @var{dd} are digits); otherwise it is matched against -the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}. - - -@node Recursive patterns -@appendixsec Recursive patterns -@cindex Perl-style regular expressions, recursive patterns -@cindex Perl-style regular expressions, recursion - -Consider the problem of matching a string in parentheses, -allowing for unlimited nested parentheses. Without the use -of recursion, the best that can be done is to use a pattern -that matches up to some fixed depth of nesting. It is not -possible to handle an arbitrary nesting depth. Perl 5.6 has -provided an experimental facility that allows regular -expressions to recurse (amongst other things). It does this -by interpolating Perl code in the expression at run time, -and the code can refer to the expression itself. A Perl pattern -tern to solve the parentheses problem can be created like -this: - -@example -$re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x; -@end example - -The @code{(?p@{...@})} item interpolates Perl code at run time, -and in this case refers recursively to the pattern in which it -appears. Obviously, @command{sed} cannot support the interpolation of -Perl code. Instead, the special item @code{(?R)} is provided for -the specific case of recursion. This pattern solves the -parentheses problem (assume the @code{X} modifier option is used -so that white space is ignored): - -@example -\( ( (?>[^()]+) | (?R) )* \) -@end example - -First it matches an opening parenthesis. Then it matches any -number of substrings which can either be a sequence of -non-parentheses, or a recursive match of the pattern itself -(i.e. a correctly parenthesized substring). Finally there is -a closing parenthesis. - -This particular example pattern contains nested unlimited -repeats, and so the use of a non-backtracking subpattern for -matching strings of non-parentheses is important when applying -the pattern to strings that do not match. For example, when -it is applied to - -@example -(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() -@end example - -it yields a ``no match'' response quickly. However, if a -standard backtracking subpattern is not used, the match runs -for a very long time indeed because there are so many different -ways the @code{+} and @code{*} repeats can carve up the subject, -and all have to be tested before failure can be reported. - -The values set for any capturing subpatterns are those from -the outermost level of the recursion at which the subpattern -value is set. If the pattern above is matched against - -@example -(ab(cd)ef) -@end example - -@noindent -the value for the capturing parentheses is @samp{ef}, which is -the last value taken on at the top level. - -@node Comments -@appendixsec Comments -@cindex Perl-style regular expressions, comments - -The sequence (?# marks the start of a comment which continues -ues up to the next closing parenthesis. Nested parentheses -are not permitted. The characters that make up a comment -play no part in the pattern matching at all. - -@cindex Perl-style regular expressions, extended -If the @code{X} modifier option is used, an unescaped @code{#} character -outside a character class introduces a comment that continues -up to the next newline character in the pattern. -@end ifset - - -@page -@node Concept Index -@unnumbered Concept Index - -This is a general index of all issues discussed in this manual, with the -exception of the @command{sed} commands and command-line options. - -@printindex cp - -@page -@node Command and Option Index -@unnumbered Command and Option Index - -This is an alphabetical list of all @command{sed} commands and command-line -options. - -@printindex fn - -@contents -@bye - -@c XXX FIXME: the term "cycle" is never defined... |