Diffstat (limited to 'doc/sed.texi')
-rw-r--r-- doc/sed.texi 4670
1 files changed, 2534 insertions, 2136 deletions
diff --git a/doc/sed.texi b/doc/sed.texi
index 6efc48c..121e405 100644
--- a/doc/sed.texi
+++ b/doc/sed.texi
@@ -1,11 +1,10 @@
\input texinfo @c -*-texinfo-*-
-@c Do not edit this file!! It is automatically generated from sed-in.texi.
@c
@c -- Stuff that needs adding: ----------------------------------------------
@c (nothing!)
@c --------------------------------------------------------------------------
@c Check for consistency: regexps in @code, text that they match in @samp.
-@c
+@c
@c Tips:
@c @command for command
@c @samp for command fragments: @samp{cat -s}
@@ -35,116 +34,54 @@
This file documents version @value{VERSION} of
@value{SSED}, a stream editor.
-Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free
-Software Foundation, Inc.
-
-This document is released under the terms of the @acronym{GNU} Free
-Documentation License as published by the Free Software Foundation;
-either version 1.1, or (at your option) any later version.
-
-You should have received a copy of the @acronym{GNU} Free Documentation
-License along with @value{SSED}; see the file @file{COPYING.DOC}.
-If not, write to the Free Software Foundation, 59 Temple Place - Suite
-330, Boston, MA 02110-1301, USA.
+Copyright @copyright{} 1998-2016 Free Software Foundation, Inc.
-There are no Cover Texts and no Invariant Sections; this text, along
-with its equivalent in the printed manual, constitutes the Title Page.
+@quotation
+Permission is granted to copy, distribute and/or modify this document
+under the terms of the GNU Free Documentation License, Version 1.3
+or any later version published by the Free Software Foundation;
+with no Invariant Sections, no Front-Cover Texts, and no
+Back-Cover Texts. A copy of the license is included in the
+section entitled ``GNU Free Documentation License''.
+@end quotation
@end copying
@setchapternewpage off
@titlepage
-@title @command{sed}, a stream editor
+@title @value{SSED}, a stream editor
@subtitle version @value{VERSION}, @value{UPDATED}
@author by Ken Pizzini, Paolo Bonzini
@page
@vskip 0pt plus 1filll
-Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.
-
@insertcopying
-
-Published by the Free Software Foundation, @*
-51 Franklin Street, Fifth Floor @*
-Boston, MA 02110-1301, USA
@end titlepage
+@contents
+@ifnottex
@node Top
-@top
+@top @value{SSED}
-@ifnottex
@insertcopying
@end ifnottex
@menu
* Introduction:: Introduction
* Invoking sed:: Invocation
-* sed Programs:: @command{sed} programs
+* sed scripts:: @command{sed} scripts
+* sed addresses:: Addresses: selecting lines
+* sed regular expressions:: Regular expressions: selecting text
+* advanced sed:: Advanced @command{sed}: cycles and buffers
* Examples:: Some sample scripts
* Limitations:: Limitations and (non-)limitations of @value{SSED}
* Other Resources:: Other resources for learning about @command{sed}
* Reporting Bugs:: Reporting bugs
-
-* Extended regexps:: @command{egrep}-style regular expressions
-@ifset PERL
-* Perl regexps:: Perl-style regular expressions
-@end ifset
-
+* GNU Free Documentation License:: Copying and sharing this manual
* Concept Index:: A menu with all the topics in this manual.
* Command and Option Index:: A menu with all @command{sed} commands and
command-line options.
-
-@detailmenu
---- The detailed node listing ---
-
-sed Programs:
-* Execution Cycle:: How @command{sed} works
-* Addresses:: Selecting lines with @command{sed}
-* Regular Expressions:: Overview of regular expression syntax
-* Common Commands:: Often used commands
-* The "s" Command:: @command{sed}'s Swiss Army Knife
-* Other Commands:: Less frequently used commands
-* Programming Commands:: Commands for @command{sed} gurus
-* Extended Commands:: Commands specific of @value{SSED}
-* Escapes:: Specifying special characters
-
-Examples:
-* Centering lines::
-* Increment a number::
-* Rename files to lower case::
-* Print bash environment::
-* Reverse chars of lines::
-* tac:: Reverse lines of files
-* cat -n:: Numbering lines
-* cat -b:: Numbering non-blank lines
-* wc -c:: Counting chars
-* wc -w:: Counting words
-* wc -l:: Counting lines
-* head:: Printing the first lines
-* tail:: Printing the last lines
-* uniq:: Make duplicate lines unique
-* uniq -d:: Print duplicated lines of input
-* uniq -u:: Remove all duplicated lines
-* cat -s:: Squeezing blank lines
-
-@ifset PERL
-Perl regexps:: Perl-style regular expressions
-* Backslash:: Introduces special sequences
-* Circumflex/dollar sign/period:: Behave specially with regard to new lines
-* Square brackets:: Are a bit different in strange cases
-* Options setting:: Toggle modifiers in the middle of a regexp
-* Non-capturing subpatterns:: Are not counted when backreferencing
-* Repetition:: Allows for non-greedy matching
-* Backreferences:: Allows for more than 10 back references
-* Assertions:: Allows for complex look ahead matches
-* Non-backtracking subpatterns:: Often gives more performance
-* Conditional subpatterns:: Allows if/then/else branches
-* Recursive patterns:: For example to match parentheses
-* Comments:: Because things can get complex...
-@end ifset
-
-@end detailmenu
@end menu
@@ -166,26 +103,125 @@ editors.
@node Invoking sed
-@chapter Invocation
+@chapter Running sed
+
+This chapter covers how to run @command{sed}. Details of @command{sed}
+scripts and individual @command{sed} commands are discussed in the
+next chapter.
+@menu
+* Overview::
+* Command-Line Options::
+* Exit status::
+@end menu
+
+
+@node Overview
+@section Overview
Normally @command{sed} is invoked like this:
@example
sed SCRIPT INPUTFILE...
@end example
-The full format for invoking @command{sed} is:
+For example, to replace the first occurrence of @samp{hello} on each
+line with @samp{world} in the file @file{input.txt}:
@example
-sed OPTIONS... [SCRIPT] [INPUTFILE...]
+sed 's/hello/world/' input.txt > output.txt
@end example
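A minimal sketch of how the @code{s} command behaves with and without the @code{g} flag, assuming a POSIX shell; the sample text and the variable names are illustrative only. Without @code{g}, only the first match on each line is replaced:

```shell
# Sample input with two matches on one line (the text is illustrative).
line='hello hello'

# Without the g flag, s replaces only the first match on each line.
first_only=$(printf '%s\n' "$line" | sed 's/hello/world/')

# With the g flag, s replaces every match on the line.
all_matches=$(printf '%s\n' "$line" | sed 's/hello/world/g')

printf '%s\n' "$first_only"    # world hello
printf '%s\n' "$all_matches"   # world world
```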
+@cindex stdin
+@cindex standard input
If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
-@command{sed} filters the contents of the standard input. The @var{script}
-is actually the first non-option parameter, which @command{sed} specially
-considers a script and not an input file if (and only if) none of the
-other @var{options} specifies a script to be executed, that is if neither
-of the @option{-e} and @option{-f} options is specified.
+@command{sed} filters the contents of the standard input. The following
+commands are equivalent:
+
+@example
+sed 's/hello/world/' input.txt > output.txt
+sed 's/hello/world/' < input.txt > output.txt
+cat input.txt | sed 's/hello/world/' - > output.txt
+@end example
+
+@cindex stdout
+@cindex output
+@cindex standard output
+@cindex -i, example
+@command{sed} writes output to standard output. Use @option{-i} to edit
+files in-place instead of printing to standard output.
+See also the @code{W} and @code{s///w} commands for writing output to
+other files. The following command modifies @file{file.txt} and
+does not produce any output:
+
+@example
+sed -i 's/hello/world/' file.txt
+@end example
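The in-place behavior can be checked with a throwaway file. This is a sketch that assumes GNU sed (other implementations may require an argument after @option{-i}); the temporary file and variable names are hypothetical:

```shell
# Create a scratch file so no real data is touched.
tmpfile=$(mktemp)
printf 'hello there\n' > "$tmpfile"

# With -i, the edit is written back to the file and nothing is printed.
stdout=$(sed -i 's/hello/world/' "$tmpfile")
content=$(cat "$tmpfile")

printf 'stdout=[%s]\n' "$stdout"
printf 'file=[%s]\n' "$content"
rm -f "$tmpfile"
```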
+
+@cindex -n, example
+@cindex p, example
+@cindex suppressing output
+@cindex output, suppressing
+By default @command{sed} prints all processed input (except input
+that has been modified/deleted by commands such as @command{d}).
+Use @option{-n} to suppress output, and the @code{p} command
+to print specific lines. The following command prints only line 45
+of the input file:
+
+@example
+sed -n '45p' file.txt
+@end example
+
+
+
+@cindex multiple files
+@cindex -s, example
+@command{sed} treats multiple input files as one long stream.
+The following example prints the first line of the first file
+(@file{one.txt}) and the last line of the last file (@file{three.txt}).
+Use @option{-s} to treat each input file as a separate stream instead.
+
+@example
+sed -n '1p ; $p' one.txt two.txt three.txt
+@end example
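The difference between the default behavior and @option{-s} can be sketched as follows, assuming GNU sed and two small scratch files whose names and contents are made up for illustration:

```shell
# Work in a throwaway directory (file names are illustrative).
tmpdir=$(mktemp -d)
printf 'a1\na2\n' > "$tmpdir/one.txt"
printf 'b1\nb2\n' > "$tmpdir/two.txt"

# Default: the files form one stream, so 1 is the first line of one.txt
# and $ is the last line of two.txt.
joined=$(sed -n '1p ; $p' "$tmpdir/one.txt" "$tmpdir/two.txt")

# With -s, the addresses 1 and $ are evaluated separately for each file.
separate=$(sed -s -n '1p ; $p' "$tmpdir/one.txt" "$tmpdir/two.txt")

printf '%s\n' "$joined"
printf '%s\n' "$separate"
rm -rf "$tmpdir"
```

In this sketch the first command selects two lines in total, while the second selects the first and last line of each file.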
+
+
+@cindex -e, example
+@cindex --expression, example
+@cindex -f, example
+@cindex --file, example
+@cindex script parameter
+@cindex parameters, script
+Without @option{-e} or @option{-f} options, @command{sed} uses
+the first non-option parameter as the @var{script}, and the following
+non-option parameters as input files.
+If @option{-e} or @option{-f} options are used to specify a @var{script},
+all non-option parameters are taken as input files.
+Options @option{-e} and @option{-f} can be combined, and can appear
+multiple times (in which case the final effective @var{script} will be
+the concatenation of all the individual @var{script}s).
+
+The following examples are equivalent:
+
+@example
+sed 's/hello/world/' input.txt > output.txt
+
+sed -e 's/hello/world/' input.txt > output.txt
+sed --expression='s/hello/world/' input.txt > output.txt
+
+echo 's/hello/world/' > myscript.sed
+sed -f myscript.sed input.txt > output.txt
+sed --file=myscript.sed input.txt > output.txt
+@end example
+
+
+@node Command-Line Options
+@section Command-Line Options
+
+The full format for invoking @command{sed} is:
+
+@example
+sed OPTIONS... [SCRIPT] [INPUTFILE...]
+@end example
@command{sed} may be invoked with the following command-line options:
@@ -291,7 +327,7 @@ including additional commands.
Most of the extensions accept @command{sed} programs that
are outside the syntax mandated by @acronym{POSIX}, but some
of them (such as the behavior of the @command{N} command
-described in @pxref{Reporting Bugs}) actually violate the
+described in @ref{Reporting Bugs}) actually violate the
standard. If you want to disable only the latter kind of
extension, you can set the @code{POSIXLY_CORRECT} variable
to a non-empty value.
@@ -319,8 +355,10 @@ follow the link and edit the ultimate destination of the
link. The default behavior is to break the symbolic link,
so that the link destination will not be modified.
-@item -r
+@item -E
+@itemx -r
@itemx --regexp-extended
+@opindex -E
@opindex -r
@opindex --regexp-extended
@cindex Extended regular expressions, choosing
@@ -328,23 +366,17 @@ so that the link destination will not be modified.
Use extended regular expressions rather than basic
regular expressions. Extended regexps are those that
@command{egrep} accepts; they can be clearer because they
-usually have less backslashes, but are a @acronym{GNU} extension
-and hence scripts that use them are not portable.
-@xref{Extended regexps, , Extended regular expressions}.
-
-@ifset PERL
-@item -R
-@itemx --regexp-perl
-@opindex -R
-@opindex --regexp-perl
-@cindex Perl-style regular expressions, choosing
-@cindex @value{SSEDEXT}, Perl-style regular expressions
-Use Perl-style regular expressions rather than basic
-regular expressions. Perl-style regexps are extremely
-powerful but are a @value{SSED} extension and hence scripts that
-use it are not portable. @xref{Perl regexps, ,
-Perl-style regular expressions}.
-@end ifset
+usually have fewer backslashes.
+Historically this was a @acronym{GNU} extension,
+but the @option{-E}
+extension has since been added to the POSIX standard
+(http://austingroupbugs.net/view.php?id=528),
+so use @option{-E} for portability.
+GNU sed has accepted @option{-E} as an undocumented option for years,
+and *BSD seds have accepted @option{-E} for years as well,
+but scripts that use @option{-E} might not port to other older systems.
+@xref{ERE syntax, , Extended regular expressions}.
+
@item -s
@itemx --separate
@@ -360,6 +392,15 @@ of each file, @code{$} refers to the last line of each file,
and files invoked from the @code{R} commands are rewound at the
start of each file.
+@item --sandbox
+@opindex --sandbox
+@cindex Sandbox mode
+In sandbox mode, @code{e/w/r} commands are rejected - programs containing
+them will be aborted without being run. Sandbox mode ensures @command{sed}
+operates only on the input files designated on the command line, and
+cannot run external programs.
+
+
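A rough check of sandbox mode, assuming a GNU sed build recent enough to accept @option{--sandbox}; the output path is hypothetical. The script uses @code{w}, so it should be rejected before any input is processed and before the output file is created:

```shell
tmpdir=$(mktemp -d)

# The w command would normally write each line to the named file.
# In sandbox mode the script is rejected, so sed exits with a nonzero
# status and the file is never created.
sandbox_status=0
printf 'hello\n' | sed --sandbox "w $tmpdir/out.txt" >/dev/null 2>&1 \
  || sandbox_status=$?

if [ -e "$tmpdir/out.txt" ]; then created=yes; else created=no; fi
echo "status=$sandbox_status created=$created"
rm -rf "$tmpdir"
```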
@item -u
@itemx --unbuffered
@opindex -u
@@ -395,606 +436,351 @@ be processed.
A file name of @samp{-} refers to the standard input stream.
The standard input will be processed if no file names are specified.
+@node Exit status
+@section Exit status
+@cindex exit status
+An exit status of zero indicates success, and a nonzero value
+indicates failure. @value{SSED} returns the following exit status
+values:
-@node sed Programs
-@chapter @command{sed} Programs
+@table @asis
+@item 0
+Successful completion.
-@cindex @command{sed} program structure
-@cindex Script structure
-A @command{sed} program consists of one or more @command{sed} commands,
-passed in by one or more of the
-@option{-e}, @option{-f}, @option{--expression}, and @option{--file}
-options, or the first non-option argument if zero of these
-options are used.
-This document will refer to ``the'' @command{sed} script;
-this is understood to mean the in-order catenation
-of all of the @var{script}s and @var{script-file}s passed in.
+@item 1
+Invalid command, invalid syntax, invalid regular expression or a
+@value{SSED} extension command used with @option{--posix}.
-Commands within a @var{script} or @var{script-file} can be
-separated by semicolons (@code{;}) or newlines (ASCII 10).
-Some commands, due to their syntax, cannot be followed by semicolons
-working as command separators and thus should be terminated
-with newlines or be placed at the end of a @var{script} or @var{script-file}.
-Commands can also be preceded with optional non-significant
-whitespace characters.
+@item 2
+One or more of the input files specified on the command line could not be
+opened (e.g. if a file is not found, or read permission is denied).
+Processing continued with other files.
+
+@item 4
+An I/O error or a serious processing error occurred during runtime;
+@value{SSED} aborted immediately.
+@end table
+
+@cindex Q, example
+@cindex exit status, example
+Additionally, the commands @code{q} and @code{Q} can be used to terminate
+@command{sed} with a custom exit code value (this is a @value{SSED} extension):
+
+@example
+$ echo | sed 'Q42' ; echo $?
+42
+@end example
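The exit codes can be observed from a shell. This sketch assumes GNU sed (exit codes on @code{q} and @code{Q} are extensions) and uses @code{k} as a stand-in for an invalid command; the variable names are illustrative:

```shell
# q prints the current line and then exits; Q exits without printing.
q_status=0
q_out=$(printf 'one\ntwo\n' | sed 'q5') || q_status=$?
Q_status=0
Q_out=$(printf 'one\ntwo\n' | sed 'Q7') || Q_status=$?

# k is not a sed command, so the script is rejected with status 1.
bad_status=0
printf 'one\n' | sed 'k' >/dev/null 2>&1 || bad_status=$?

echo "q: out=[$q_out] status=$q_status"
echo "Q: out=[$Q_out] status=$Q_status"
echo "invalid command: status=$bad_status"
```

Capturing the status with @code{|| var=$?} keeps the sketch usable in scripts that run under @code{set -e}.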
+
+
+@node sed scripts
+@chapter @command{sed} scripts
-Each @code{sed} command consists of an optional address or
-address range, followed by a one-character command name
-and any additional command-specific code.
@menu
-* Execution Cycle:: How @command{sed} works
-* Addresses:: Selecting lines with @command{sed}
-* Regular Expressions:: Overview of regular expression syntax
-* Common Commands:: Often used commands
+* sed script overview:: @command{sed} script overview
+* sed commands list:: @command{sed} commands summary
* The "s" Command:: @command{sed}'s Swiss Army Knife
+* Common Commands:: Often used commands
* Other Commands:: Less frequently used commands
* Programming Commands:: Commands for @command{sed} gurus
* Extended Commands:: Commands specific of @value{SSED}
-* Escapes:: Specifying special characters
@end menu
+@node sed script overview
+@section @command{sed} script overview
-@node Execution Cycle
-@section How @command{sed} Works
-
-@cindex Buffer spaces, pattern and hold
-@cindex Spaces, pattern and hold
-@cindex Pattern space, definition
-@cindex Hold space, definition
-@command{sed} maintains two data buffers: the active @emph{pattern} space,
-and the auxiliary @emph{hold} space. Both are initially empty.
-
-@command{sed} operates by performing the following cycle on each
-line of input: first, @command{sed} reads one line from the input
-stream, removes any trailing newline, and places it in the pattern space.
-Then commands are executed; each command can have an address associated
-to it: addresses are a kind of condition code, and a command is only
-executed if the condition is verified before the command is to be
-executed.
-
-When the end of the script is reached, unless the @option{-n} option
-is in use, the contents of pattern space are printed out to the output
-stream, adding back the trailing newline if it was removed.@footnote{Actually,
-if @command{sed} prints a line without the terminating newline, it will
-nevertheless print the missing newline as soon as more text is sent to
-the same output stream, which gives the ``least expected surprise''
-even though it does not make commands like @samp{sed -n p} exactly
-identical to @command{cat}.} Then the next cycle starts for the next
-input line.
-
-Unless special commands (like @samp{D}) are used, the pattern space is
-deleted between two cycles. The hold space, on the other hand, keeps
-its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
-@samp{g}, @samp{G} to move data between both buffers).
-
-
-@node Addresses
-@section Selecting lines with @command{sed}
-@cindex Addresses, in @command{sed} scripts
-@cindex Line selection
-@cindex Selecting lines to process
-
-Addresses in a @command{sed} script can be in any of the following forms:
-@table @code
-@item @var{number}
-@cindex Address, numeric
-@cindex Line, selecting by number
-Specifying a line number will match only that line in the input.
-(Note that @command{sed} counts lines continuously across all input files
-unless @option{-i} or @option{-s} options are specified.)
-
-@item @var{first}~@var{step}
-@cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
-This @acronym{GNU} extension matches every @var{step}th line
-starting with line @var{first}.
-In particular, lines will be selected when there exists
-a non-negative @var{n} such that the current line-number equals
-@var{first} + (@var{n} * @var{step}).
-Thus, to select the odd-numbered lines,
-one would use @code{1~2};
-to pick every third line starting with the second, @samp{2~3} would be used;
-to pick every fifth line starting with the tenth, use @samp{10~5};
-and @samp{50~0} is just an obscure way of saying @code{50}.
-
-@item $
-@cindex Address, last line
-@cindex Last line, selecting
-@cindex Line, selecting last
-This address matches the last line of the last file of input, or
-the last line of each file when the @option{-i} or @option{-s} options
-are specified.
-
-@item /@var{regexp}/
-@cindex Address, as a regular expression
-@cindex Line, selecting by regular expression match
-This will select any line which matches the regular expression @var{regexp}.
-If @var{regexp} itself includes any @code{/} characters,
-each must be escaped by a backslash (@code{\}).
-
-@cindex empty regular expression
-@cindex @value{SSEDEXT}, modifiers and the empty regular expression
-The empty regular expression @samp{//} repeats the last regular
-expression match (the same holds if the empty regular expression is
-passed to the @code{s} command). Note that modifiers to regular expressions
-are evaluated when the regular expression is compiled, thus it is invalid to
-specify them together with the empty regular expression.
-
-@item \%@var{regexp}%
-(The @code{%} may be replaced by any other single character.)
-
-@cindex Slash character, in regular expressions
-This also matches the regular expression @var{regexp},
-but allows one to use a different delimiter than @code{/}.
-This is particularly useful if the @var{regexp} itself contains
-a lot of slashes, since it avoids the tedious escaping of every @code{/}.
-If @var{regexp} itself includes any delimiter characters,
-each must be escaped by a backslash (@code{\}).
-
-@item /@var{regexp}/I
-@itemx \%@var{regexp}%I
-@cindex @acronym{GNU} extensions, @code{I} modifier
-@ifset PERL
-@cindex Perl-style regular expressions, case-insensitive
-@end ifset
-The @code{I} modifier to regular-expression matching is a @acronym{GNU}
-extension which causes the @var{regexp} to be matched in
-a case-insensitive manner.
-
-@item /@var{regexp}/M
-@itemx \%@var{regexp}%M
-@cindex @value{SSEDEXT}, @code{M} modifier
-@ifset PERL
-@cindex Perl-style regular expressions, multiline
-@end ifset
-The @code{M} modifier to regular-expression matching is a @value{SSED}
-extension which directs @value{SSED} to match the regular expression
-in @cite{multi-line} mode. The modifier causes @code{^} and @code{$} to
-match respectively (in addition to the normal behavior) the empty string
-after a newline, and the empty string before a newline. There are
-special character sequences
-@ifset PERL
-(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
-in basic or extended regular expression modes)
-@end ifset
-@ifclear PERL
-(@code{\`} and @code{\'})
-@end ifclear
-which always match the beginning or the end of the buffer.
-In addition,
-@ifset PERL
-just like in Perl mode without the @code{S} modifier,
-@end ifset
-the period character does not match a new-line character in
-multi-line mode.
-
-@ifset PERL
-@item /@var{regexp}/S
-@itemx \%@var{regexp}%S
-@cindex @value{SSEDEXT}, @code{S} modifier
-@cindex Perl-style regular expressions, single line
-The @code{S} modifier to regular-expression matching is only valid
-in Perl mode and specifies that the dot character (@code{.}) will
-match the newline character too. @code{S} stands for @cite{single-line}.
-@end ifset
-
-@ifset PERL
-@item /@var{regexp}/X
-@itemx \%@var{regexp}%X
-@cindex @value{SSEDEXT}, @code{X} modifier
-@cindex Perl-style regular expressions, extended
-The @code{X} modifier to regular-expression matching is also
-valid in Perl mode only. If it is used, whitespace in the
-pattern (other than in a character class) and
-characters between a @kbd{#} outside a character class and the
-next newline character are ignored. An escaping backslash
-can be used to include a whitespace or @kbd{#} character as part
-of the pattern.
-@end ifset
-@end table
-
-If no addresses are given, then all lines are matched;
-if one address is given, then only lines matching that
-address are matched.
-
-@cindex Range of lines
-@cindex Several lines, selecting
-An address range can be specified by specifying two addresses
-separated by a comma (@code{,}). An address range matches lines
-starting from where the first address matches, and continues
-until the second address matches (inclusively).
-
-If the second address is a @var{regexp}, then checking for the
-ending match will start with the line @emph{following} the
-line which matched the first address: a range will always
-span at least two lines (except of course if the input stream
-ends).
-
-If the second address is a @var{number} less than (or equal to)
-the line matching the first address, then only the one line is
-matched.
+@cindex @command{sed} script structure
+@cindex Script structure
-@cindex Special addressing forms
-@cindex Range with start address of zero
-@cindex Zero, as range start address
-@cindex @var{addr1},+N
-@cindex @var{addr1},~N
-@cindex @acronym{GNU} extensions, special two-address forms
-@cindex @acronym{GNU} extensions, @code{0} address
-@cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
-@cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
-@cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
-@value{SSED} also supports some special two-address forms; all these
-are @acronym{GNU} extensions:
-@table @code
-@item 0,/@var{regexp}/
-A line number of @code{0} can be used in an address specification like
-@code{0,/@var{regexp}/} so that @command{sed} will try to match
-@var{regexp} in the first input line too. In other words,
-@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
-except that if @var{addr2} matches the very first line of input the
-@code{0,/@var{regexp}/} form will consider it to end the range, whereas
-the @code{1,/@var{regexp}/} form will match the beginning of its range and
-hence make the range span up to the @emph{second} occurrence of the
-regular expression.
+A @command{sed} program consists of one or more @command{sed} commands,
+passed in by one or more of the
+@option{-e}, @option{-f}, @option{--expression}, and @option{--file}
+options, or the first non-option argument if zero of these
+options are used.
+This document will refer to ``the'' @command{sed} script;
+this is understood to mean the in-order concatenation
+of all of the @var{script}s and @var{script-file}s passed in.
+@xref{Overview}.
-Note that this is the only place where the @code{0} address makes
-sense; there is no 0-th line and commands which are given the @code{0}
-address in any other way will give an error.
-@item @var{addr1},+@var{N}
-Matches @var{addr1} and the @var{N} lines following @var{addr1}.
+@cindex @command{sed} commands syntax
+@cindex syntax, @command{sed} commands
+@cindex addresses, syntax
+@cindex syntax, addresses
+@command{sed} commands follow this syntax:
-@item @var{addr1},~@var{N}
-Matches @var{addr1} and the lines following @var{addr1}
-until the next line whose input line number is a multiple of @var{N}.
-@end table
+@example
+[addr]@var{X}[options]
+@end example
-@cindex Excluding lines
-@cindex Selecting non-matching lines
-Appending the @code{!} character to the end of an address
-specification negates the sense of the match.
-That is, if the @code{!} character follows an address range,
-then only lines which do @emph{not} match the address range
-will be selected.
-This also works for singleton addresses,
-and, perhaps perversely, for the null address.
+@var{X} is a single-letter @command{sed} command.
+@c TODO: add @pxref{commands} when there is a command-list section.
+@code{[addr]} is an optional line address. If @code{[addr]} is specified,
+the command @var{X} will be executed only on the matched lines.
+@code{[addr]} can be a single line number, a regular expression,
+or a range of lines (@pxref{sed addresses}).
+Additional @code{[options]} are used for some @command{sed} commands.
+@cindex @command{d}, example
+@cindex address range, example
+@cindex example, address range
+The following example deletes lines 30 to 35 in the input.
+@code{30,35} is an address range. @command{d} is the delete command:
-@node Regular Expressions
-@section Overview of Regular Expression Syntax
+@example
+sed '30,35d' input.txt > output.txt
+@end example
-To know how to use @command{sed}, people should understand regular
-expressions (@dfn{regexp} for short). A regular expression
-is a pattern that is matched against a
-subject string from left to right. Most characters are
-@dfn{ordinary}: they stand for
-themselves in a pattern, and match the corresponding characters
-in the subject. As a trivial example, the pattern
+@cindex @command{q}, example
+@cindex regular expression, example
+@cindex example, regular expression
+The following example prints all input until a line
+starting with the word @samp{foo} is found. If such a line is found,
+@command{sed} will terminate with exit status 42.
+If no such line is found (and no other error occurs), @command{sed}
+will exit with status 0.
+@code{/^foo/} is a regular-expression address.
+@command{q} is the quit command. @code{42} is the command option.
@example
-The quick brown fox
+sed '/^foo/q42' input.txt > output.txt
@end example
-@noindent
-matches a portion of a subject string that is identical to
-itself. The power of regular expressions comes from the
-ability to include alternatives and repetitions in the pattern.
-These are encoded in the pattern by the use of @dfn{special characters},
-which do not stand for themselves but instead
-are interpreted in some special way. Here is a brief description
-of regular expression syntax as used in @command{sed}.
-@table @code
-@item @var{char}
-A single ordinary character matches itself.
+@cindex multiple @command{sed} commands
+@cindex @command{sed} commands, multiple
+@cindex newline, command separator
+@cindex semicolons, command separator
+@cindex ;, command separator
+@cindex -e, example
+@cindex -f, example
+Commands within a @var{script} or @var{script-file} can be
+separated by semicolons (@code{;}) or newlines (ASCII 10).
+Multiple scripts can be specified with @option{-e} or @option{-f}
+options.
-@item *
-@cindex @acronym{GNU} extensions, to basic regular expressions
-Matches a sequence of zero or more instances of matches for the
-preceding regular expression, which must be an ordinary character, a
-special character preceded by @code{\}, a @code{.}, a grouped regexp
-(see below), or a bracket expression. As a @acronym{GNU} extension, a
-postfixed regular expression can also be followed by @code{*}; for
-example, @code{a**} is equivalent to @code{a*}. @acronym{POSIX}
-1003.1-2001 says that @code{*} stands for itself when it appears at
-the start of a regular expression or subexpression, but many
-non@acronym{GNU} implementations do not support this and portable
-scripts should instead use @code{\*} in these contexts.
+The following examples are all equivalent. They perform two @command{sed}
+operations: deleting any lines matching the regular expression @code{/^foo/},
+and replacing the first occurrence of the string @samp{hello} on each
+remaining line with @samp{world}:
-@item \+
-@cindex @acronym{GNU} extensions, to basic regular expressions
-As @code{*}, but matches one or more. It is a @acronym{GNU} extension.
+@example
+sed '/^foo/d ; s/hello/world/' input.txt > output.txt
-@item \?
-@cindex @acronym{GNU} extensions, to basic regular expressions
-As @code{*}, but only matches zero or one. It is a @acronym{GNU} extension.
+sed -e '/^foo/d' -e 's/hello/world/' input.txt > output.txt
-@item \@{@var{i}\@}
-As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
-decimal integer; for portability, keep it between 0 and 255
-inclusive).
+echo '/^foo/d' > script.sed
+echo 's/hello/world/' >> script.sed
+sed -f script.sed input.txt > output.txt
-@item \@{@var{i},@var{j}\@}
-Matches between @var{i} and @var{j}, inclusive, sequences.
+echo 's/hello/world/' > script2.sed
+sed -e '/^foo/d' -f script2.sed input.txt > output.txt
+@end example
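One way to confirm that these forms are equivalent is to run them on the same input and compare the results. This is a sketch with made-up sample data and hypothetical file names in a scratch directory:

```shell
tmpdir=$(mktemp -d)
printf 'foo line\nhello there\nsay hello\n' > "$tmpdir/input.txt"

# One command per line in the script file.
printf '%s\n' '/^foo/d' 's/hello/world/' > "$tmpdir/script.sed"

# The same two commands given three different ways.
by_semicolon=$(sed '/^foo/d ; s/hello/world/' "$tmpdir/input.txt")
by_e=$(sed -e '/^foo/d' -e 's/hello/world/' "$tmpdir/input.txt")
by_f=$(sed -f "$tmpdir/script.sed" "$tmpdir/input.txt")

printf '%s\n' "$by_semicolon"
rm -rf "$tmpdir"
```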
-@item \@{@var{i},\@}
-Matches more than or equal to @var{i} sequences.
-@item \(@var{regexp}\)
-Groups the inner @var{regexp} as a whole, this is used to:
+@cindex @command{a}, and semicolons
+@cindex @command{c}, and semicolons
+@cindex @command{i}, and semicolons
+Commands @command{a}, @command{c}, @command{i}, due to their syntax,
+cannot be followed by semicolons working as command separators and
+thus should be terminated
+with newlines or be placed at the end of a @var{script} or @var{script-file}.
+Commands can also be preceded with optional non-significant
+whitespace characters.
-@itemize @bullet
-@item
-@cindex @acronym{GNU} extensions, to basic regular expressions
-Apply postfix operators, like @code{\(abcd\)*}:
-this will search for zero or more whole sequences
-of @samp{abcd}, while @code{abcd*} would search
-for @samp{abc} followed by zero or more occurrences
-of @samp{d}. Note that support for @code{\(abcd\)*} is
-required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
-implementations do not support it and hence it is not universally
-portable.
-@item
-Use back references (see below).
-@end itemize
-@item .
-Matches any character, including newline.
+@node sed commands list
+@section @command{sed} commands summary
-@item ^
-Matches the null string at beginning of the pattern space, i.e. what
-appears after the circumflex must appear at the beginning of the
-pattern space.
+The following commands are supported in @value{SSED}.
+Some are standard POSIX commands, while others are @value{SSEDEXT}.
+Details and examples for each command are in the following sections.
+(Mnemonics) are shown in parentheses.
-In most scripts, pattern space is initialized to the content of each
-line (@pxref{Execution Cycle, , How @code{sed} works}). So, it is a
-useful simplification to think of @code{^#include} as matching only
-lines where @samp{#include} is the first thing on line---if there are
-spaces before, for example, the match fails. This simplification is
-valid as long as the original content of pattern space is not modified,
-for example with an @code{s} command.
+@table @code
-@code{^} acts as a special character only at the beginning of the
-regular expression or subexpression (that is, after @code{\(} or
-@code{\|}). Portable scripts should avoid @code{^} at the beginning of
-a subexpression, though, as @acronym{POSIX} allows implementations that
-treat @code{^} as an ordinary character in that context.
+@item a\
+@itemx @var{text}
+Append @var{text} after a line.
-@item $
-It is the same as @code{^}, but refers to end of pattern space.
-@code{$} also acts as a special character only at the end
-of the regular expression or subexpression (that is, before @code{\)}
-or @code{\|}), and its use at the end of a subexpression is not
-portable.
+@item a @var{text}
+Append @var{text} after a line (alternative syntax).
+@item b @var{label}
+Branch unconditionally to @var{label}.
+The @var{label} may be omitted, in which case the next cycle is started.
-@item [@var{list}]
-@itemx [^@var{list}]
-Matches any single character in @var{list}: for example,
-@code{[aeiou]} matches all vowels. A list may include
-sequences like @code{@var{char1}-@var{char2}}, which
-matches any character between (inclusive) @var{char1}
-and @var{char2}.
+@item c\
+@itemx @var{text}
+Replace (change) lines with @var{text}.
-A leading @code{^} reverses the meaning of @var{list}, so that
-it matches any single character @emph{not} in @var{list}. To include
-@code{]} in the list, make it the first character (after
-the @code{^} if needed), to include @code{-} in the list,
-make it the first or last; to include @code{^} put
-it after the first character.
+@item c @var{text}
+Replace (change) lines with @var{text} (alternative syntax).
-@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
-The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
-are normally not special within @var{list}. For example, @code{[\*]}
-matches either @samp{\} or @samp{*}, because the @code{\} is not
-special here. However, strings like @code{[.ch.]}, @code{[=a=]}, and
-@code{[:space:]} are special within @var{list} and represent collating
-symbols, equivalence classes, and character classes, respectively, and
-@code{[} is therefore special within @var{list} when it is followed by
-@code{.}, @code{=}, or @code{:}. Also, when not in
-@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
-@code{\t} are recognized within @var{list}. @xref{Escapes}.
+@item d
+Delete the pattern space;
+immediately start next cycle.
-@item @var{regexp1}\|@var{regexp2}
-@cindex @acronym{GNU} extensions, to basic regular expressions
-Matches either @var{regexp1} or @var{regexp2}. Use
-parentheses to use complex alternative regular expressions.
-The matching process tries each alternative in turn, from
-left to right, and the first one that succeeds is used.
-It is a @acronym{GNU} extension.
+@item D
+If pattern space contains newlines, delete text in the pattern
+space up to the first newline, and restart cycle with the resultant
+pattern space, without reading a new line of input.
-@item @var{regexp1}@var{regexp2}
-Matches the concatenation of @var{regexp1} and @var{regexp2}.
-Concatenation binds more tightly than @code{\|}, @code{^}, and
-@code{$}, but less tightly than the other regular expression
-operators.
+If pattern space contains no newline, start a normal new cycle as if
+the @code{d} command was issued.
+@c TODO: add a section about D+N and D+n commands
-@item \@var{digit}
-Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
-subexpression in the regular expression. This is called a @dfn{back
-reference}. Subexpressions are implicity numbered by counting
-occurrences of @code{\(} left-to-right.
+@item e
+Executes the command that is found in pattern space and
+replaces the pattern space with the output; a trailing newline
+is suppressed.
-@item \n
-Matches the newline character.
+@item e @var{command}
+Executes @var{command} and sends its output to the output stream.
+The command can run across multiple lines, all but the last ending with
+a backslash.
-@item \@var{char}
-Matches @var{char}, where @var{char} is one of @code{$},
-@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
-Note that the only C-like
-backslash sequences that you can portably assume to be
-interpreted are @code{\n} and @code{\\}; in particular
-@code{\t} is not portable, and matches a @samp{t} under most
-implementations of @command{sed}, rather than a tab character.
+@item F
+(filename) Print the file name of the current input file (with a trailing
+newline).
-@end table
+@item g
+Replace the contents of the pattern space with the contents of the hold space.
-@cindex Greedy regular expression matching
-Note that the regular expression matcher is greedy, i.e., matches
-are attempted from left to right and, if two or more matches are
-possible starting at the same character, it selects the longest.
+@item G
+Append a newline to the contents of the pattern space,
+and then append the contents of the hold space to that of the pattern space.
-@noindent
-Examples:
-@table @samp
-@item abcdef
-Matches @samp{abcdef}.
+@item h
+(hold) Replace the contents of the hold space with the contents of the
+pattern space.
-@item a*b
-Matches zero or more @samp{a}s followed by a single
-@samp{b}. For example, @samp{b} or @samp{aaaaab}.
+@item H
+Append a newline to the contents of the hold space,
+and then append the contents of the pattern space to that of the hold space.
-@item a\?b
-Matches @samp{b} or @samp{ab}.
+@item i\
+@itemx @var{text}
+Insert @var{text} before a line.
-@item a\+b\+
-Matches one or more @samp{a}s followed by one or more
-@samp{b}s: @samp{ab} is the shortest possible match, but
-other examples are @samp{aaaab} or @samp{abbbbb} or
-@samp{aaaaaabbbbbbb}.
+@item i @var{text}
+Insert @var{text} before a line (alternative syntax).
-@item .*
-@itemx .\+
-These two both match all the characters in a string;
-however, the first matches every string (including the empty
-string), while the second matches only strings containing
-at least one character.
+@item l
+Print the pattern space in an unambiguous form.
-@item ^main.*(.*)
-This matches a string starting with @samp{main},
-followed by an opening and closing
-parenthesis. The @samp{n}, @samp{(} and @samp{)} need not
-be adjacent.
+@item n
+(next) If auto-print is not disabled, print the pattern space,
+then, regardless, replace the pattern space with the next line of input.
+If there is no more input then @command{sed} exits without processing
+any more commands.
-@item ^#
-This matches a string beginning with @samp{#}.
+@item N
+Add a newline to the pattern space,
+then append the next line of input to the pattern space.
+If there is no more input then @command{sed} exits without processing
+any more commands.
-@item \\$
-This matches a string ending with a single backslash. The
-regexp contains two backslashes for escaping.
+@item p
+Print the pattern space.
+@c useful with @option{-n}
-@item \$
-Instead, this matches a string consisting of a single dollar sign,
-because it is escaped.
+@item P
+Print the pattern space, up to the first newline.
-@item [a-zA-Z0-9]
-In the C locale, this matches any @acronym{ASCII} letters or digits.
+@item q@var{[exit-code]}
+(quit) Exit @command{sed} without processing any more commands or input.
-@item [^ @kbd{tab}]\+
-(Here @kbd{tab} stands for a single tab character.)
-This matches a string of one or more
-characters, none of which is a space or a tab.
-Usually this means a word.
+@item Q@var{[exit-code]}
+(quit) This command is the same as @code{q}, but will not print the
+contents of pattern space. Like @code{q}, it provides the
+ability to return an exit code to the caller.
+@c useful to quit on a conditional without printing
-@item ^\(.*\)\n\1$
-This matches a string consisting of two equal substrings separated by
-a newline.
+@item r @var{filename}
+Read text from @var{filename}.
-@item .\@{9\@}A$
-This matches nine characters followed by an @samp{A}.
+@item R @var{filename}
+Queue a line of @var{filename} to be read and
+inserted into the output stream at the end of the current cycle,
+or when the next input line is read.
+@c useful to interleave files
-@item ^.\@{15\@}A
-This matches the start of a string that contains 16 characters,
-the last of which is an @samp{A}.
+@item s@var{/regexp/replacement/[flags]}
+(substitute) Match the regular-expression against the content of the
+pattern space. If found, replace matched string with
+@var{replacement}.
-@end table
+@item t @var{label}
+(test) Branch to @var{label} only if there has been a successful
+@code{s}ubstitution since the last input line was read or conditional
+branch was taken. The @var{label} may be omitted, in which case the
+next cycle is started.
+@item T @var{label}
+(test) Branch to @var{label} only if there have been no successful
+@code{s}ubstitutions since the last input line was read or
+conditional branch was taken. The @var{label} may be omitted,
+in which case the next cycle is started.
+@item v @var{[version]}
+(version) This command does nothing, but makes @command{sed} fail if
+@value{SSED} extensions are not supported, or if the requested version
+is not available.
-@node Common Commands
-@section Often-Used Commands
+@item w @var{filename}
+Write the pattern space to @var{filename}.
-If you use @command{sed} at all, you will quite likely want to know
-these commands.
+@item W @var{filename}
+Write to @var{filename} the portion of the pattern space up to
+the first newline.
-@table @code
-@item #
-[No addresses allowed.]
+@item x
+Exchange the contents of the hold and pattern spaces.
-@findex # (comments)
-@cindex Comments, in scripts
-The @code{#} character begins a comment;
-the comment continues until the next newline.
-@cindex Portability, comments
-If you are concerned about portability, be aware that
-some implementations of @command{sed} (which are not @sc{posix}
-conformant) may only support a single one-line comment,
-and then only when the very first character of the script is a @code{#}.
+@item y/@var{source-chars}/@var{dest-chars}/
+Transliterate any characters in the pattern space which match
+any of the @var{source-chars} with the corresponding character
+in @var{dest-chars}.
-@findex -n, forcing from within a script
-@cindex Caveat --- #n on first line
-Warning: if the first two characters of the @command{sed} script
-are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
-If you want to put a comment in the first line of your script
-and that comment begins with the letter @samp{n}
-and you do not want this behavior,
-then be sure to either use a capital @samp{N},
-or place at least one space before the @samp{n}.
-@item q [@var{exit-code}]
-This command only accepts a single address.
+@item z
+(zap) This command empties the content of pattern space.
-@findex q (quit) command
-@cindex @value{SSEDEXT}, returning an exit code
-@cindex Quitting
-Exit @command{sed} without processing any more commands or input.
-Note that the current pattern space is printed if auto-print is
-not disabled with the @option{-n} options. The ability to return
-an exit code from the @command{sed} script is a @value{SSED} extension.
+@item #
+A comment, until the next newline.
-@item d
-@findex d (delete) command
-@cindex Text, deleting
-Delete the pattern space;
-immediately start next cycle.
-@item p
-@findex p (print) command
-@cindex Text, printing
-Print out the pattern space (to the standard output).
-This command is usually only used in conjunction with the @option{-n}
-command-line option.
+@item @{ @var{cmd ; cmd ...} @}
+Group several commands together.
+@c useful for multiple commands on same address
-@item n
-@findex n (next-line) command
-@cindex Next input line, replace pattern space with
-@cindex Read next input line
-If auto-print is not disabled, print the pattern space,
-then, regardless, replace the pattern space with the next line of input.
-If there is no more input then @command{sed} exits without processing
-any more commands.
+@item =
+Print the current input line number (with a trailing newline).
-@item @{ @var{commands} @}
-@findex @{@} command grouping
-@cindex Grouping commands
-@cindex Command groups
-A group of commands may be enclosed between
-@code{@{} and @code{@}} characters.
-This is particularly useful when you want a group of commands
-to be triggered by a single address (or address-range) match.
+@item : @var{label}
+Specify the location of @var{label} for branch commands (@code{b},
+@code{t}, @code{T}).
@end table
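The multi-line commands summarized above (@code{N}, @code{P}, @code{D}) and the branch commands (@code{:}, @code{b}) are easier to follow with small inputs. The sketch below assumes GNU @command{sed}, @command{seq} and a POSIX shell; the variable names are only illustrative.

```shell
# N appends the next input line, so each cycle sees a pair of lines.
pairs=$(seq 4 | sed 'N;s/\n/,/')
printf '%s\n' "$pairs"

# Loop with a label: keep appending lines until the last one ($),
# then replace every embedded newline.
all=$(seq 3 | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/,/g')
printf '%s\n' "$all"

# N, P and D together: print the first line of the pair only when the
# two lines differ, then drop it and restart the cycle with the rest.
dedup=$(printf '%s\n' a a b b b c | sed '$!N; /^\(.*\)\n\1$/!P; D')
printf '%s\n' "$dedup"
```

The last script behaves like a simple @command{uniq}: adjacent duplicate lines are collapsed into one.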
+
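The hold space commands (@code{g}, @code{G}, @code{h}, @code{H}, @code{x}) are the hardest to picture from a one-line summary. As a hedged sketch (assuming GNU @command{sed} and @command{seq}), the two scripts below use @code{G} and @code{h} only.

```shell
# The hold space starts out empty, so G appends a newline and nothing
# else: the output is double-spaced.
spaced=$(seq 2 | sed G)
printf '%s\n' "$spaced"

# Reverse the input: on every line but the first, append the lines
# accumulated so far (G), save the result (h), and print it once the
# last line has been reached ($p).
reversed=$(seq 3 | sed -n '1!G;h;$p')
printf '%s\n' "$reversed"
```

In the second script the hold space grows by one line per cycle, so it is only suitable for inputs that fit in memory.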
@node The "s" Command
@section The @code{s} Command
-The syntax of the @code{s} (as in substitute) command is
-@samp{s/@var{regexp}/@var{replacement}/@var{flags}}. The @code{/}
-characters may be uniformly replaced by any other single
-character within any given @code{s} command. The @code{/}
-character (or whatever other character is used in its stead)
-can appear in the @var{regexp} or @var{replacement}
-only if it is preceded by a @code{\} character.
+The @code{s} command (as in substitute) is probably the most important
+in @command{sed} and has a lot of different options. The syntax of
+the @code{s} command is
+@samp{s/@var{regexp}/@var{replacement}/@var{flags}}.
+
+Its basic concept is simple: the @code{s} command attempts to match
+the pattern space against the supplied regular expression @var{regexp};
+if the match is successful, then that portion of the
+pattern space which was matched is replaced with @var{replacement}.
-The @code{s} command is probably the most important in @command{sed}
-and has a lot of different options. Its basic concept is simple:
-the @code{s} command attempts to match the pattern
-space against the supplied @var{regexp}; if the match is
-successful, then that portion of the pattern
-space which was matched is replaced with @var{replacement}.
+@xref{Regexp Addresses,,Regular Expression Addresses}, for details about
+@var{regexp} syntax.
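A minimal sketch of the basic substitution, assuming a POSIX shell with @command{sed} in the path (the sample text is made up for illustration). Only the matched portion of the pattern space changes; the rest of the line is left alone.

```shell
# Replace the matched portion; without flags only the first match changes.
first=$(echo 'hello hello' | sed 's/hello/world/')
printf '%s\n' "$first"

# A line that does not match is passed through unchanged.
nomatch=$(echo 'goodbye' | sed 's/hello/world/')
printf '%s\n' "$nomatch"
```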
@cindex Backreferences, in regular expressions
@cindex Parenthesized substrings
@@ -1005,6 +791,18 @@ the portion of the match which is contained between the @var{n}th
Also, the @var{replacement} can contain unescaped @code{&}
characters which reference the whole matched portion
of the pattern space.
+
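The references usable in @var{replacement} can be sketched as follows, assuming a POSIX shell and basic regular expression syntax (the sample text is illustrative only). @code{&} stands for the whole match, @code{\1} and @code{\2} for parenthesized groups, and @code{\&} for a literal ampersand.

```shell
# & is replaced by the whole matched text.
whole=$(echo 'hello world' | sed 's/[a-z]*/(&)/')
printf '%s\n' "$whole"

# \1 and \2 refer to the first and second \( ... \) groups.
swapped=$(echo 'hello world' | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')
printf '%s\n' "$swapped"

# An escaped ampersand is inserted literally.
literal=$(echo 'a' | sed 's/a/\&/')
printf '%s\n' "$literal"
```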
+@c TODO: xref to backreference section mention @var{\'}.
+
+The @code{/}
+characters may be uniformly replaced by any other single
+character within any given @code{s} command. The @code{/}
+character (or whatever other character is used in its stead)
+can appear in the @var{regexp} or @var{replacement}
+only if it is preceded by a @code{\} character.
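Choosing a different delimiter is mostly useful when the text itself contains slashes, as in file names. The sketch below (the path is an illustrative assumption) shows the same substitution written both ways.

```shell
# With the default delimiter every literal slash must be escaped.
escaped=$(echo '/usr/local/bin' | sed 's/\/usr\/local/\/opt/')
printf '%s\n' "$escaped"

# With | as the delimiter the slashes need no escaping.
piped=$(echo '/usr/local/bin' | sed 's|/usr/local|/opt|')
printf '%s\n' "$piped"
```

Both commands produce the same result; the second is simply easier to read.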
+
+
+
@cindex @value{SSEDEXT}, case modifiers in @code{s} commands
Finally, as a @value{SSED} extension, you can include a
special sequence made of a backslash and one of the letters
@@ -1078,7 +876,8 @@ not just the first.
@cindex Replacing only @var{n}th match of regexp in a line
Only replace the @var{number}th match of the @var{regexp}.
-@cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command
+@cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command
@cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command
Note: the @sc{posix} standard does not specify what should happen
when you mix the @code{g} and @var{number} modifiers,
@@ -1131,9 +930,6 @@ a @sc{nul} character. This is a @value{SSED} extension.
@itemx i
@cindex @acronym{GNU} extensions, @code{I} modifier
@cindex Case-insensitive matching
-@ifset PERL
-@cindex Perl-style regular expressions, case-insensitive
-@end ifset
The @code{I} modifier to regular-expression matching is a @acronym{GNU}
extension which makes @command{sed} match @var{regexp} in a
case-insensitive manner.
@@ -1141,53 +937,162 @@ case-insensitive manner.
@item M
@itemx m
@cindex @value{SSEDEXT}, @code{M} modifier
-@ifset PERL
-@cindex Perl-style regular expressions, multiline
-@end ifset
The @code{M} modifier to regular-expression matching is a @value{SSED}
extension which directs @value{SSED} to match the regular expression
in @cite{multi-line} mode. The modifier causes @code{^} and @code{$} to
match respectively (in addition to the normal behavior) the empty string
after a newline, and the empty string before a newline. There are
special character sequences
-@ifset PERL
-(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
-in basic or extended regular expression modes)
-@end ifset
@ifclear PERL
(@code{\`} and @code{\'})
@end ifclear
which always match the beginning or the end of the buffer.
In addition,
-@ifset PERL
-just like in Perl mode without the @code{S} modifier,
-@end ifset
the period character does not match a new-line character in
multi-line mode.
-@ifset PERL
-@item S
-@itemx s
-@cindex @value{SSEDEXT}, @code{S} modifier
-@cindex Perl-style regular expressions, single line
-The @code{S} modifier to regular-expression matching is only valid
-in Perl mode and specifies that the dot character (@code{.}) will
-match the newline character too. @code{S} stands for @cite{single-line}.
-@end ifset
-
-@ifset PERL
-@item X
-@itemx x
-@cindex @value{SSEDEXT}, @code{X} modifier
-@cindex Perl-style regular expressions, extended
-The @code{X} modifier to regular-expression matching is also
-valid in Perl mode only. If it is used, whitespace in the
-pattern (other than in a character class) and
-characters between a @kbd{#} outside a character class and the
-next newline character are ignored. An escaping backslash
-can be used to include a whitespace or @kbd{#} character as part
-of the pattern.
-@end ifset
+
+@end table
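The @code{I} and @code{M} modifiers described above are GNU extensions, so the following sketch assumes GNU @command{sed}. The @code{N} command is used only to get two lines into the pattern space, so that the effect of multi-line mode on @code{^} becomes visible.

```shell
# I: match without regard to case.
ci=$(echo 'Hello World' | sed 's/hello/Goodbye/I')
printf '%s\n' "$ci"

# Without M, ^ matches only at the start of the pattern space.
plain=$(printf 'a\nb\n' | sed 'N;s/^/> /g')
printf '%s\n' "$plain"

# With M, ^ also matches the empty string after each embedded newline.
multi=$(printf 'a\nb\n' | sed 'N;s/^/> /Mg')
printf '%s\n' "$multi"
```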
+
+@node Common Commands
+@section Often-Used Commands
+
+If you use @command{sed} at all, you will quite likely want to know
+these commands.
+
+@table @code
+@item #
+[No addresses allowed.]
+
+@findex # (comments)
+@cindex Comments, in scripts
+The @code{#} character begins a comment;
+the comment continues until the next newline.
+
+@cindex Portability, comments
+If you are concerned about portability, be aware that
+some implementations of @command{sed} (which are not @sc{posix}
+conforming) may only support a single one-line comment,
+and then only when the very first character of the script is a @code{#}.
+
+@findex -n, forcing from within a script
+@cindex Caveat --- #n on first line
+Warning: if the first two characters of the @command{sed} script
+are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
+If you want to put a comment in the first line of your script
+and that comment begins with the letter @samp{n}
+and you do not want this behavior,
+then be sure to either use a capital @samp{N},
+or place at least one space before the @samp{n}.
+
+@item q [@var{exit-code}]
+@findex q (quit) command
+@cindex @value{SSEDEXT}, returning an exit code
+@cindex Quitting
+Exit @command{sed} without processing any more commands or input.
+
+Example: stop after printing the second line:
+@example
+$ seq 3 | sed 2q
+1
+2
+@end example
+
+This command only accepts a single address.
+Note that the current pattern space is printed if auto-print is
+not disabled with the @option{-n} option. The ability to return
+an exit code from the @command{sed} script is a @value{SSED} extension.
+
+See also the @value{SSED} extension @code{Q} command which quits silently
+without printing the current pattern space.
+
+@item d
+@findex d (delete) command
+@cindex Text, deleting
+Delete the pattern space;
+immediately start next cycle.
+
+Example: delete the second input line:
+@example
+$ seq 3 | sed 2d
+1
+3
+@end example
+
+@item p
+@findex p (print) command
+@cindex Text, printing
+Print out the pattern space (to the standard output).
+This command is usually only used in conjunction with the @option{-n}
+command-line option.
+
+Example: print only the second input line:
+@example
+$ seq 3 | sed -n 2p
+2
+@end example
+
+@item n
+@findex n (next-line) command
+@cindex Next input line, replace pattern space with
+@cindex Read next input line
+If auto-print is not disabled, print the pattern space,
+then, regardless, replace the pattern space with the next line of input.
+If there is no more input then @command{sed} exits without processing
+any more commands.
+
+This command is useful to skip lines (e.g. process every Nth line).
+
+Example: perform substitution on every 3rd line (i.e. two @code{n} commands
+skip two lines):
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 6 | sed 'n;n;s/./x/'
+1
+2
+x
+4
+5
+x
+@end example
+
+@value{SSED} provides an extension address syntax of @var{first}~@var{step}
+to achieve the same result:
+
+@example
+$ seq 6 | sed '0~3s/./x/'
+1
+2
+x
+4
+5
+x
+@end example
+
+@codequotebacktick off
+@codequoteundirected off
+
+
+@item @{ @var{commands} @}
+@findex @{@} command grouping
+@cindex Grouping commands
+@cindex Command groups
+A group of commands may be enclosed between
+@code{@{} and @code{@}} characters.
+This is particularly useful when you want a group of commands
+to be triggered by a single address (or address-range) match.
+
+Example: perform substitution then print the second input line:
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 3 | sed -n '2@{s/2/X/ ; p@}'
+X
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
@end table
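The difference between @code{q} and the @code{Q} extension, and the optional exit code, can be sketched as follows. This assumes GNU @command{sed} and @command{seq}; the exit code 5 is an arbitrary value chosen for illustration.

```shell
# q prints the current pattern space before quitting.
with_q=$(seq 3 | sed 2q)
printf '%s\n' "$with_q"

# Q quits silently, so line 2 is not printed.
with_Q=$(seq 3 | sed 2Q)
printf '%s\n' "$with_Q"

# An exit code given to q becomes the exit status of sed.
status=0
seq 3 | sed -n '2q5' || status=$?
printf 'exit status: %s\n' "$status"
```

The exit status makes it possible for a calling script to tell whether a particular line or pattern was reached.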
@@ -1200,79 +1105,309 @@ these commands.
@table @code
@item y/@var{source-chars}/@var{dest-chars}/
-(The @code{/} characters may be uniformly replaced by
-any other single character within any given @code{y} command.)
-
@findex y (transliterate) command
@cindex Transliteration
Transliterate any characters in the pattern space which match
any of the @var{source-chars} with the corresponding character
in @var{dest-chars}.
+Example: transliterate @samp{a-j} into @samp{0-9}:
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ echo hello world | sed 'y/abcdefghij/0123456789/'
+74llo worl3
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+(The @code{/} characters may be uniformly replaced by
+any other single character within any given @code{y} command.)
+
Instances of the @code{/} (or whatever other character is used in its stead),
@code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars}
lists, provided that each instance is escaped by a @code{\}.
The @var{source-chars} and @var{dest-chars} lists @emph{must}
contain the same number of characters (after de-escaping).
+See the @command{tr} command from GNU coreutils for similar functionality.
+
+@item a @var{text}
+Append @var{text} after a line. This is a @acronym{GNU} extension
+to the standard @code{a} command; see below for details.
+
+Example: Add the word @samp{hello} after the second line:
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 3 | sed '2a hello'
+1
+2
+hello
+3
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+Leading whitespace after the @code{a} command is ignored.
+The text to add is read until the end of the line.
+
+
@item a\
@itemx @var{text}
-@cindex @value{SSEDEXT}, two addresses supported by most commands
-As a @acronym{GNU} extension, this command accepts two addresses.
-
@findex a (append text lines) command
@cindex Appending text after a line
@cindex Text, appending
-Queue the lines of text which follow this command
+Append @var{text} after a line.
+
+Example: Add @samp{hello} after the second line
+(@print{} indicates printed output lines):
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 3 | sed '2a\
+hello'
+@print{}1
+@print{}2
+@print{}hello
+@print{}3
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+The @code{a} command queues the lines of text which follow this command
(each but the last ending with a @code{\},
which are removed from the output)
to be output at the end of the current cycle,
or when the next input line is read.
+@cindex @value{SSEDEXT}, two addresses supported by most commands
+As a @acronym{GNU} extension, this command accepts two addresses.
+
Escape sequences in @var{text} are processed, so you should
use @code{\\} in @var{text} to print a single backslash.
-As a @acronym{GNU} extension, if between the @code{a} and the newline there is
-other than a whitespace-@code{\} sequence, then the text of this line,
-starting at the first non-whitespace character after the @code{a},
-is taken as the first line of the @var{text} block.
-(This enables a simplification in scripting a one-line add.)
-This extension also works with the @code{i} and @code{c} commands.
+Command parsing resumes after the first line of @var{text} that does not end
+with a backslash (@code{\}), which is @samp{world} in the following example:
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 3 | sed '2a\
+hello\
+world
+3s/./X/'
+@print{}1
+@print{}2
+@print{}hello
+@print{}world
+@print{}X
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+As a @acronym{GNU} extension, the @code{a} command and @var{text} can be
+separated into two @code{-e} parameters, enabling easier scripting:
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 3 | sed -e '2a\' -e hello
+1
+2
+hello
+3
+
+$ sed -e '2a\' -e "$VAR"
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+@item i @var{text}
+Insert @var{text} before a line. This is a @acronym{GNU} extension
+to the standard @code{i} command; see below for details.
+
+Example: Insert the word @samp{hello} before the second line:
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 3 | sed '2i hello'
+1
+hello
+2
+3
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+Leading whitespace after the @code{i} command is ignored.
+The text to add is read until the end of the line.
@item i\
@itemx @var{text}
-@cindex @value{SSEDEXT}, two addresses supported by most commands
-As a @acronym{GNU} extension, this command accepts two addresses.
-
@findex i (insert text lines) command
@cindex Inserting text before a line
@cindex Text, insertion
-Immediately output the lines of text which follow this command
-(each but the last ending with a @code{\},
-which are removed from the output).
+Immediately output the lines of text which follow this command.
+
+Example: Insert @samp{hello} before the second line
+(@print{} indicates printed output lines):
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 3 | sed '2i\
+hello'
+@print{}1
+@print{}hello
+@print{}2
+@print{}3
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+@cindex @value{SSEDEXT}, two addresses supported by most commands
+As a @acronym{GNU} extension, this command accepts two addresses.
+
+Escape sequences in @var{text} are processed, so you should
+use @code{\\} in @var{text} to print a single backslash.
+
+Command parsing resumes after the first line of @var{text} that does not end
+with a backslash (@code{\}), which is @samp{world} in the following example:
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 3 | sed '2i\
+hello\
+world
+s/./X/'
+@print{}X
+@print{}hello
+@print{}world
+@print{}X
+@print{}X
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+As a @acronym{GNU} extension, the @code{i} command and @var{text} can be
+separated into two @code{-e} parameters, enabling easier scripting:
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 3 | sed -e '2i\' -e hello
+1
+hello
+2
+3
+
+$ sed -e '2i\' -e "$VAR"
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+@item c @var{text}
+Replace the line(s) with @var{text}. This is a @acronym{GNU} extension
+to the standard @code{c} command; see below for details.
+
+Example: Replace the 2nd to 9th lines with the word @samp{hello}:
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 10 | sed '2,9c hello'
+1
+hello
+10
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+Leading whitespace after the @code{c} command is ignored.
+The replacement text is read until the end of the line.
@item c\
@itemx @var{text}
@findex c (change to text lines) command
@cindex Replacing selected lines with other text
Delete the lines matching the address or address-range,
-and output the lines of text which follow this command
-(each but the last ending with a @code{\},
-which are removed from the output)
-in place of the last line
-(or in place of each line, if no addresses were specified).
+and output the lines of text which follow this command.
+
+Example: Replace 2nd to 4th lines with the words @samp{hello} and
+@samp{world} (@print{} indicates printed output lines):
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 5 | sed '2,4c\
+hello\
+world'
+@print{}1
+@print{}hello
+@print{}world
+@print{}5
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+If no addresses are given, each line is replaced.
+
A new cycle is started after this command is done,
since the pattern space will have been deleted.
+In the following example, the @code{c} starts a
+new cycle and the substitution command is not performed
+on the replaced text:
+
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 3 | sed '2c\
+hello
+s/./X/'
+@print{}X
+@print{}hello
+@print{}X
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+As a @acronym{GNU} extension, the @code{c} command and @var{text} can be
+separated into two @code{-e} parameters, enabling easier scripting:
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 3 | sed -e '2c\' -e hello
+1
+hello
+3
+
+$ sed -e '2c\' -e "$VAR"
+@end example
+@codequoteundirected off
+@codequotebacktick off
-@item =
-@cindex @value{SSEDEXT}, two addresses supported by most commands
-As a @acronym{GNU} extension, this command accepts two addresses.
+@item =
@findex = (print line number) command
@cindex Printing line number
@cindex Line number, printing
Print out the current input line number (with a trailing newline).
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ printf '%s\n' aaa bbb ccc | sed =
+1
+aaa
+2
+bbb
+3
+ccc
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+@cindex @value{SSEDEXT}, two addresses supported by most commands
+As a @acronym{GNU} extension, this command accepts two addresses.
+
+
+
+
@item l @var{n}
@findex l (list unambiguously) command
@cindex List pattern space
@@ -1291,11 +1426,23 @@ the default as specified on the command line is used. The @var{n}
parameter is a @value{SSED} extension.
@item r @var{filename}
-@cindex @value{SSEDEXT}, two addresses supported by most commands
-As a @acronym{GNU} extension, this command accepts two addresses.
@findex r (read file) command
@cindex Read text from a file
+Read text from a file. Example:
+
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ seq 3 | sed '2r/etc/hostname'
+1
+2
+fencepost.gnu.org
+3
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
@cindex @value{SSEDEXT}, @file{/dev/stdin} file
Queue the contents of @var{filename} to be read and
inserted into the output stream at the end of the current cycle,
@@ -1307,6 +1454,10 @@ As a @value{SSED} extension, the special value @file{/dev/stdin}
is supported for the file name, which reads the contents of the
standard input.
+@cindex @value{SSEDEXT}, two addresses supported by most commands
+As a @acronym{GNU} extension, this command accepts two addresses. The
+file will then be reread and inserted on each of the addressed lines.
+
@item w @var{filename}
@findex w (write file) command
@cindex Write to a file
@@ -1341,6 +1492,14 @@ then append the next line of input to the pattern space.
If there is no more input then @command{sed} exits without processing
any more commands.
+When @option{-z} is used, a zero byte (the ASCII @samp{NUL} character) is
+added between the lines (instead of a newline).
+
+By default @command{sed} does not terminate if there is no 'next' input line.
+This is a GNU extension which can be disabled with @option{--posix}.
+@xref{N_command_last_line,,N command on the last line}.
+
+
@item P
@findex P (print first line) command
@cindex Print first line from pattern space
@@ -1460,33 +1619,6 @@ to the end of the current cycle.
Print out the file name of the current input file (with a trailing
newline).
-@item L @var{n}
-@findex L (fLow paragraphs) command
-@cindex Reformat pattern space
-@cindex Reformatting paragraphs
-@cindex @value{SSEDEXT}, reformatting paragraphs
-@cindex @value{SSEDEXT}, @code{L} command
-This @value{SSED} extension fills and joins lines in pattern space
-to produce output lines of (at most) @var{n} characters, like
-@code{fmt} does; if @var{n} is omitted, the default as specified
-on the command line is used. This command is considered a failed
-experiment and unless there is enough request (which seems unlikely)
-will be removed in future versions.
-
-@ignore
-Blank lines, spaces between words, and indentation are
-preserved in the output; successive input lines with different
-indentation are not joined; tabs are expanded to 8 columns.
-
-If the pattern space contains multiple lines, they are joined, but
-since the pattern space usually contains a single line, the behavior
-of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e.,
-it does not join short lines to form longer ones).
-
-@var{n} specifies the desired line-wrap length; if omitted,
-the default as specified on the command line is used.
-@end ignore
-
@item Q [@var{exit-code}]
This command only accepts a single address.
@@ -1573,8 +1705,1171 @@ way to clear @command{sed}'s buffers in the middle of the
script in most multibyte locales (including UTF-8 locales).
@end table
+
+
+
+
+@node sed addresses
+@chapter Addresses: selecting lines
+
+@menu
+* Addresses overview:: Addresses overview
+* Numeric Addresses:: selecting lines by numbers
+* Regexp Addresses:: selecting lines by text matching
+* Range Addresses:: selecting a range of lines
+@end menu
+
+@node Addresses overview
+@section Addresses overview
+
+@cindex addresses, numeric
+@cindex numeric addresses
+Addresses determine on which line(s) the @command{sed} command will be
+executed. The following command replaces the word @samp{hello}
+with @samp{world} only on line 144:
+
+@codequoteundirected on
+@codequotebacktick on
+@example
+sed '144s/hello/world/' input.txt > output.txt
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+
+
+If no addresses are given, the command is performed on all lines.
+The following command replaces the word @samp{hello} with @samp{world}
+on all lines in the input file:
+
+@codequoteundirected on
+@codequotebacktick on
+@example
+sed 's/hello/world/' input.txt > output.txt
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+
+
+@cindex addresses, regular expression
+@cindex regular expression addresses
+Addresses can contain regular expressions to match lines based
+on content instead of line numbers. The following command replaces
+the word @samp{hello} with @samp{world} only in lines
+containing the word @samp{apple}:
+
+@codequoteundirected on
+@codequotebacktick on
+@example
+sed '/apple/s/hello/world/' input.txt > output.txt
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+
+
+@cindex addresses, range
+@cindex range addresses
+An address range is specified with two addresses separated by a comma
+(@code{,}). Addresses can be numeric, regular expressions, or a mix of
+both.
+The following command replaces the word @samp{hello} with @samp{world}
+only in lines 4 to 17 (inclusive):
+
+@codequoteundirected on
+@codequotebacktick on
+@example
+sed '4,17s/hello/world/' input.txt > output.txt
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+
+
+@cindex Excluding lines
+@cindex Selecting non-matching lines
+@cindex addresses, negating
+@cindex addresses, excluding
+Appending the @code{!} character to the end of an address
+specification (before the command letter) negates the sense of the
+match. That is, if the @code{!} character follows an address or an
+address range, then only lines which do @emph{not} match the addresses
+will be selected. The following command replaces the word @samp{hello}
+with @samp{world} only in lines @emph{not} containing the word
+@samp{apple}:
+
+@example
+sed '/apple/!s/hello/world/' input.txt > output.txt
+@end example
+
+The following command replaces the word @samp{hello} with
+@samp{world} only in lines 1 to 3 and from line 18 to the last line of the input file
+(i.e. excluding lines 4 to 17):
+
+@example
+sed '4,17!s/hello/world/' input.txt > output.txt
+@end example
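+
+As a minimal sketch of the same negation, the following command uses
+@command{seq} to generate numbered input (the input and the line
+numbers are illustrative only) and prints only the lines
+@emph{outside} the range 2 to 4:
+
+@example
+$ seq 6 | sed -n '2,4!p'
+1
+5
+6
+@end example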
+
+
+
+
+
+@node Numeric Addresses
+@section Selecting lines by numbers
+@cindex Addresses, in @command{sed} scripts
+@cindex Line selection
+@cindex Selecting lines to process
+
+Addresses in a @command{sed} script can be in any of the following forms:
+@table @code
+@item @var{number}
+@cindex Address, numeric
+@cindex Line, selecting by number
+Specifying a line number will match only that line in the input.
+(Note that @command{sed} counts lines continuously across all input files
+unless @option{-i} or @option{-s} options are specified.)
+
+@item $
+@cindex Address, last line
+@cindex Last line, selecting
+@cindex Line, selecting last
+This address matches the last line of the last file of input, or
+the last line of each file when the @option{-i} or @option{-s} options
+are specified.
+
+
+@item @var{first}~@var{step}
+@cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
+This @acronym{GNU} extension matches every @var{step}th line
+starting with line @var{first}.
+In particular, lines will be selected when there exists
+a non-negative @var{n} such that the current line-number equals
+@var{first} + (@var{n} * @var{step}).
+Thus, one would use @code{1~2} to select the odd-numbered lines and
+@code{0~2} for even-numbered lines;
+to pick every third line starting with the second, @samp{2~3} would be used;
+to pick every fifth line starting with the tenth, use @samp{10~5};
+and @samp{50~0} is just an obscure way of saying @code{50}.
+
+The following commands demonstrate the step address usage:
+
+@example
+$ seq 10 | sed -n '0~4p'
+4
+8
+
+$ seq 10 | sed -n '1~3p'
+1
+4
+7
+10
+@end example
+
+
+@end table
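+
+The plain @var{number} and @code{$} forms described above can be
+tried in the same way.  This is only a sketch with generated input
+(three lines produced by @command{seq}); the first command prints
+line 2 and the second prints the last line:
+
+@example
+$ seq 3 | sed -n '2p'
+2
+
+$ seq 3 | sed -n '$p'
+3
+@end example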
+
+
+
+@node Regexp Addresses
+@section Selecting lines by text matching
+
+@value{SSED} supports the following regular expression addresses.
+The default regular expression is
+@ref{BRE syntax, , Basic Regular Expression (BRE)}.
+If @option{-E} or @option{-r} options are used, the regular expression should be
+in @ref{ERE syntax, , Extended Regular Expression (ERE)} syntax.
+@xref{BRE vs ERE}.
+
+@table @code
+@item /@var{regexp}/
+@cindex Address, as a regular expression
+@cindex Line, selecting by regular expression match
+This will select any line which matches the regular expression @var{regexp}.
+If @var{regexp} itself includes any @code{/} characters,
+each must be escaped by a backslash (@code{\}).
+
+The following command prints lines in @file{/etc/passwd}
+which end with @samp{bash}@footnote{
+There are of course many other ways to do the same,
+e.g.
+@example
+grep 'bash$' /etc/passwd
+awk -F: '$7 == "/bin/bash"' /etc/passwd
+@end example
+}:
+
+@example
+sed -n '/bash$/p' /etc/passwd
+@end example
+
+@cindex empty regular expression
+@cindex @value{SSEDEXT}, modifiers and the empty regular expression
+The empty regular expression @samp{//} repeats the last regular
+expression match (the same holds if the empty regular expression is
+passed to the @code{s} command). Note that modifiers to regular expressions
+are evaluated when the regular expression is compiled, thus it is invalid to
+specify them together with the empty regular expression.
+
+@item \%@var{regexp}%
+(The @code{%} may be replaced by any other single character.)
+
+@cindex Slash character, in regular expressions
+This also matches the regular expression @var{regexp},
+but allows one to use a different delimiter than @code{/}.
+This is particularly useful if the @var{regexp} itself contains
+a lot of slashes, since it avoids the tedious escaping of every @code{/}.
+If @var{regexp} itself includes any delimiter characters,
+each must be escaped by a backslash (@code{\}).
+
+The following three commands are equivalent. They print lines
+which start with @samp{/home/alice/documents/}:
+
+@example
+sed -n '/^\/home\/alice\/documents\//p'
+sed -n '\%^/home/alice/documents/%p'
+sed -n '\;^/home/alice/documents/;p'
+@end example
+
+
+@item /@var{regexp}/I
+@itemx \%@var{regexp}%I
+@cindex @acronym{GNU} extensions, @code{I} modifier
+@cindex case insensitive, regular expression
+The @code{I} modifier to regular-expression matching is a @acronym{GNU}
+extension which causes the @var{regexp} to be matched in
+a case-insensitive manner.
+
+In many other programming languages, a lower case @code{i} is used
+for case-insensitive regular expression matching. However, in @command{sed}
+the @code{i} is used for the insert command (TODO: add @code{pxref}).
+
+Observe the difference between the following examples.
+
+In this example, @code{/b/I} is the address: a regular expression with
+the @code{I} modifier. @code{d} is the delete command:
+
+@example
+$ printf "%s\n" a b c | sed '/b/Id'
+a
+c
+@end example
+
+Here, @code{/b/} is the address: a regular expression.
+@code{i} is the insert command.
+@code{d} is the value to insert.
+A line with @samp{d} is then inserted above the matched line:
+
+@example
+$ printf "%s\n" a b c | sed '/b/id'
+a
+d
+b
+c
+@end example
+
+@item /@var{regexp}/M
+@itemx \%@var{regexp}%M
+@cindex @value{SSEDEXT}, @code{M} modifier
+The @code{M} modifier to regular-expression matching is a @value{SSED}
+extension which directs @value{SSED} to match the regular expression
+in @cite{multi-line} mode. The modifier causes @code{^} and @code{$} to
+match respectively (in addition to the normal behavior) the empty string
+after a newline, and the empty string before a newline. There are
+special character sequences
+@ifclear PERL
+(@code{\`} and @code{\'})
+@end ifclear
+which always match the beginning or the end of the buffer.
+In addition,
+the period character does not match a new-line character in
+multi-line mode.
+@end table
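+
+The empty regular expression and the @code{M} modifier described above
+can be illustrated with a short sketch, assuming @value{SSED} and
+made-up input (the words @samp{apple} and @samp{pear} are only
+placeholders).  In the first command the empty @samp{//} in the
+@code{s} command reuses the @samp{apple} regexp from the address.
+In the last two commands @code{N} joins both input lines in the
+pattern space, so @code{^b} matches only when @code{M} is given
+(the second command prints nothing):
+
+@example
+$ echo apple | sed -n '/apple/s//pear/p'
+pear
+
+$ printf "%s\n" a b | sed -n 'N;/^b/p'
+
+$ printf "%s\n" a b | sed -n 'N;/^b/Mp'
+a
+b
+@end example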
+
+@node Range Addresses
+@section Range Addresses
+
+@cindex Range of lines
+@cindex Several lines, selecting
+An address range is specified with two addresses
+separated by a comma (@code{,}). An address range matches lines
+starting from where the first address matches, and continues
+until the second address matches (inclusively):
+
+@example
+$ seq 10 | sed -n '4,6p'
+4
+5
+6
+@end example
+
+If the second address is a @var{regexp}, then checking for the
+ending match will start with the line @emph{following} the
+line which matched the first address: a range will always
+span at least two lines (except of course if the input stream
+ends).
+
+@example
+$ seq 10 | sed -n '4,/[0-9]/p'
+4
+5
+@end example
+
+If the second address is a @var{number} less than (or equal to)
+the line matching the first address, then only the one line is
+matched:
+
+@example
+$ seq 10 | sed -n '4,1p'
+4
+@end example
+
+@cindex Special addressing forms
+@cindex Range with start address of zero
+@cindex Zero, as range start address
+@cindex @var{addr1},+N
+@cindex @var{addr1},~N
+@cindex @acronym{GNU} extensions, special two-address forms
+@cindex @acronym{GNU} extensions, @code{0} address
+@cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
+@cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
+@cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
+@value{SSED} also supports some special two-address forms; all these
+are @acronym{GNU} extensions:
+@table @code
+@item 0,/@var{regexp}/
+A line number of @code{0} can be used in an address specification like
+@code{0,/@var{regexp}/} so that @command{sed} will try to match
+@var{regexp} in the first input line too. In other words,
+@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
+except that if @var{addr2} matches the very first line of input the
+@code{0,/@var{regexp}/} form will consider it to end the range, whereas
+the @code{1,/@var{regexp}/} form will match the beginning of its range and
+hence make the range span up to the @emph{second} occurrence of the
+regular expression.
+
+Note that this is the only place where the @code{0} address makes
+sense; there is no 0-th line and commands which are given the @code{0}
+address in any other way will give an error.
+
+The following examples demonstrate the difference between starting
+with address 1 and 0:
+
+@example
+$ seq 10 | sed -n '1,/[0-9]/p'
+1
+2
+
+$ seq 10 | sed -n '0,/[0-9]/p'
+1
+@end example
+
+
+@item @var{addr1},+@var{N}
+Matches @var{addr1} and the @var{N} lines following @var{addr1}.
+
+@example
+$ seq 10 | sed -n '6,+2p'
+6
+7
+8
+@end example
+
+@var{addr1} can be a line number or a regular expression.
+
+@item @var{addr1},~@var{N}
+Matches @var{addr1} and the lines following @var{addr1}
+until the next line whose input line number is a multiple of @var{N}.
+The following command prints starting at line 6, until the next line which
+is a multiple of 4 (i.e. line 8):
+
+@example
+$ seq 10 | sed -n '6,~4p'
+6
+7
+8
+@end example
+
+@var{addr1} can be a line number or a regular expression.
+
+@end table
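+
+Both special forms also accept a regular expression as @var{addr1}.
+The following sketch assumes @value{SSED} and uses generated input;
+the first command prints the line matching @samp{3} and the one line
+after it, the second prints from the line matching @samp{2} up to the
+next line number that is a multiple of 4:
+
+@example
+$ seq 10 | sed -n '/3/,+1p'
+3
+4
+
+$ seq 10 | sed -n '/2/,~4p'
+2
+3
+4
+@end example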
+
+
+
+
+@node sed regular expressions
+@chapter Regular Expressions: selecting text
+
+@menu
+* Regular Expressions Overview:: Overview of Regular expression in @command{sed}
+* BRE vs ERE:: Basic (BRE) and extended (ERE) regular expression
+ syntax
+* BRE syntax:: Overview of basic regular expression syntax
+* ERE syntax:: Overview of extended regular expression syntax
+* Character Classes and Bracket Expressions::
+* regexp extensions:: Additional regular expression commands
+* Back-references and Subexpressions:: Back-references and Subexpressions
+* Escapes:: Specifying special characters
+* Locale Considerations::
+@end menu
+
+@node Regular Expressions Overview
+@section Overview of regular expression in @command{sed}
+
+@c NOTE: Keep examples in the 'overview' section
+@c neutral in regards to BRE/ERE - to ease understanding.
+
+
+To know how to use @command{sed}, people should understand regular
+expressions (@dfn{regexp} for short). A regular expression
+is a pattern that is matched against a
+subject string from left to right. Most characters are
+@dfn{ordinary}: they stand for
+themselves in a pattern, and match the corresponding characters.
+Regular expressions in @command{sed} are specified between two
+slashes.
+
+The following command prints lines containing the word
+@samp{hello}:
+
+@example
+sed -n '/hello/p'
+@end example
+
+The above example is equivalent to this @command{grep} command:
+
+@example
+grep 'hello'
+@end example
+
+The power of regular expressions comes from the ability to include
+alternatives and repetitions in the pattern. These are encoded in the
+pattern by the use of @dfn{special characters}, which do not stand for
+themselves but instead are interpreted in some special way.
+
+The character @code{^} (caret) in a regular expression matches the
+beginning of the line. The character @code{.} (dot) matches any single
+character. The following @command{sed} command matches and prints
+lines which start with the letter @samp{b}, followed by any single character,
+followed by the letter @samp{d}:
+
+@example
+$ printf "%s\n" abode bad bed bit bid byte body | sed -n '/^b.d/p'
+bad
+bed
+bid
+body
+@end example
+
+The following sections explain the meaning and usage of special
+characters in regular expressions.
+
+@node BRE vs ERE
+@section Basic (BRE) and extended (ERE) regular expression
+
+Basic and extended regular expressions are two variations on the
+syntax of the specified pattern. Basic Regular Expression (BRE) is the
+default in @command{sed} (and similarly in @command{grep}). Extended
+Regular Expression syntax (ERE) is activated by using the @option{-r}
+or @option{-E} options (and similarly, @command{grep -E}).
+
+In @value{SSED} the only difference between basic and extended regular
+expressions is in the behavior of a few special characters: @samp{?},
+@samp{+}, parentheses, braces (@samp{@{@}}), and @samp{|}.
+
+With basic (BRE) syntax, these characters do not have special meaning
+unless prefixed with a backslash (@samp{\}), while with extended (ERE)
+syntax it is reversed: these characters are special unless they are
+prefixed with a backslash (@samp{\}).
+
+@multitable @columnfractions .33 .33 .33
+
+@headitem Desired pattern
+@tab Basic (BRE) Syntax
+@tab Extended (ERE) Syntax
+
+@item literal @samp{+} (plus sign)
+
+@tab
+@example
+$ echo "a+b=c" | sed -n '/a+b/p'
+a+b=c
+@end example
+
+@tab
+@example
+$ echo "a+b=c" | sed -E -n '/a\+b/p'
+a+b=c
+@end example
+
+
+@item One or more @samp{a} characters followed by @samp{b}
+(plus sign as special meta-character)
+
+@tab
+@example
+$ echo "aab" | sed -n '/a\+b/p'
+aab
+@end example
+
+@tab
+@example
+$ echo "aab" | sed -E -n '/a+b/p'
+aab
+@end example
+
+@end multitable
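+
+The same reversal applies to the other characters listed above.  As a
+sketch for @samp{|}, assuming @value{SSED} (the words @samp{cat} and
+@samp{dog} are arbitrary sample input): the first two commands use it
+as the alternation operator in BRE and ERE syntax respectively, and
+the third shows that an unescaped @samp{|} in BRE syntax matches a
+literal vertical bar:
+
+@example
+$ echo "cat" | sed -n '/cat\|dog/p'
+cat
+
+$ echo "cat" | sed -E -n '/cat|dog/p'
+cat
+
+$ echo "cat|dog" | sed -n '/cat|dog/p'
+cat|dog
+@end example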
+
+
+
+
+@node BRE syntax
+@section Overview of basic regular expression syntax
+
+Here is a brief description
+of regular expression syntax as used in @command{sed}.
+
+@table @code
+@item @var{char}
+A single ordinary character matches itself.
+
+@item *
+@cindex @acronym{GNU} extensions, to basic regular expressions
+Matches a sequence of zero or more instances of matches for the
+preceding regular expression, which must be an ordinary character, a
+special character preceded by @code{\}, a @code{.}, a grouped regexp
+(see below), or a bracket expression. As a @acronym{GNU} extension, a
+postfixed regular expression can also be followed by @code{*}; for
+example, @code{a**} is equivalent to @code{a*}. @acronym{POSIX}
+1003.1-2001 says that @code{*} stands for itself when it appears at
+the start of a regular expression or subexpression, but many
+non@acronym{GNU} implementations do not support this and portable
+scripts should instead use @code{\*} in these contexts.
+@item .
+Matches any character, including newline.
+
+@item ^
+Matches the null string at beginning of the pattern space, i.e. what
+appears after the circumflex must appear at the beginning of the
+pattern space.
+
+In most scripts, pattern space is initialized to the content of each
+line (@pxref{Execution Cycle, , How @code{sed} works}). So, it is a
+useful simplification to think of @code{^#include} as matching only
+lines where @samp{#include} is the first thing on line---if there are
+spaces before, for example, the match fails. This simplification is
+valid as long as the original content of pattern space is not modified,
+for example with an @code{s} command.
+
+@code{^} acts as a special character only at the beginning of the
+regular expression or subexpression (that is, after @code{\(} or
+@code{\|}). Portable scripts should avoid @code{^} at the beginning of
+a subexpression, though, as @acronym{POSIX} allows implementations that
+treat @code{^} as an ordinary character in that context.
+
+@item $
+It is the same as @code{^}, but refers to end of pattern space.
+@code{$} also acts as a special character only at the end
+of the regular expression or subexpression (that is, before @code{\)}
+or @code{\|}), and its use at the end of a subexpression is not
+portable.
+
+
+@item [@var{list}]
+@itemx [^@var{list}]
+Matches any single character in @var{list}: for example,
+@code{[aeiou]} matches all vowels. A list may include
+sequences like @code{@var{char1}-@var{char2}}, which
+matches any character between (inclusive) @var{char1}
+and @var{char2}.
+@xref{Character Classes and Bracket Expressions}.
+
+@item \+
+@cindex @acronym{GNU} extensions, to basic regular expressions
+As @code{*}, but matches one or more. It is a @acronym{GNU} extension.
+
+@item \?
+@cindex @acronym{GNU} extensions, to basic regular expressions
+As @code{*}, but only matches zero or one. It is a @acronym{GNU} extension.
+
+@item \@{@var{i}\@}
+As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
+decimal integer; for portability, keep it between 0 and 255
+inclusive).
+
+@item \@{@var{i},@var{j}\@}
+Matches between @var{i} and @var{j}, inclusive, sequences.
+
+@item \@{@var{i},\@}
+Matches more than or equal to @var{i} sequences.
+
+@item \(@var{regexp}\)
+Groups the inner @var{regexp} as a whole; this is used to:
+
+@itemize @bullet
+@item
+@cindex @acronym{GNU} extensions, to basic regular expressions
+Apply postfix operators, like @code{\(abcd\)*}:
+this will search for zero or more whole sequences
+of @samp{abcd}, while @code{abcd*} would search
+for @samp{abc} followed by zero or more occurrences
+of @samp{d}. Note that support for @code{\(abcd\)*} is
+required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
+implementations do not support it and hence it is not universally
+portable.
+
+@item
+Use back references (see below).
+@end itemize
+
+
+@item @var{regexp1}\|@var{regexp2}
+@cindex @acronym{GNU} extensions, to basic regular expressions
+Matches either @var{regexp1} or @var{regexp2}. Use
+parentheses to use complex alternative regular expressions.
+The matching process tries each alternative in turn, from
+left to right, and the first one that succeeds is used.
+It is a @acronym{GNU} extension.
+
+@item @var{regexp1}@var{regexp2}
+Matches the concatenation of @var{regexp1} and @var{regexp2}.
+Concatenation binds more tightly than @code{\|}, @code{^}, and
+@code{$}, but less tightly than the other regular expression
+operators.
+
+@item \@var{digit}
+Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
+subexpression in the regular expression. This is called a @dfn{back
+reference}. Subexpressions are implicitly numbered by counting
+occurrences of @code{\(} left-to-right.
+
+@item \n
+Matches the newline character.
+
+@item \@var{char}
+Matches @var{char}, where @var{char} is one of @code{$},
+@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
+Note that the only C-like
+backslash sequences that you can portably assume to be
+interpreted are @code{\n} and @code{\\}; in particular
+@code{\t} is not portable, and matches a @samp{t} under most
+implementations of @command{sed}, rather than a tab character.
+
+@end table
+
+@cindex Greedy regular expression matching
+Note that the regular expression matcher is greedy, i.e., matches
+are attempted from left to right and, if two or more matches are
+possible starting at the same character, it selects the longest.
+
+@noindent
+Examples:
+@table @samp
+@item abcdef
+Matches @samp{abcdef}.
+
+@item a*b
+Matches zero or more @samp{a}s followed by a single
+@samp{b}. For example, @samp{b} or @samp{aaaaab}.
+
+@item a\?b
+Matches @samp{b} or @samp{ab}.
+
+@item a\+b\+
+Matches one or more @samp{a}s followed by one or more
+@samp{b}s: @samp{ab} is the shortest possible match, but
+other examples are @samp{aaaab} or @samp{abbbbb} or
+@samp{aaaaaabbbbbbb}.
+
+@item .*
+@itemx .\+
+These two both match all the characters in a string;
+however, the first matches every string (including the empty
+string), while the second matches only strings containing
+at least one character.
+
+@item ^main.*(.*)
+This matches a string starting with @samp{main},
+followed by an opening and closing
+parenthesis. The @samp{n}, @samp{(} and @samp{)} need not
+be adjacent.
+
+@item ^#
+This matches a string beginning with @samp{#}.
+
+@item \\$
+This matches a string ending with a single backslash. The
+regexp contains two backslashes for escaping.
+
+@item \$
+Instead, this matches a string consisting of a single dollar sign,
+because it is escaped.
+
+@item [a-zA-Z0-9]
+In the C locale, this matches any @acronym{ASCII} letters or digits.
+
+@item [^ @kbd{tab}]\+
+(Here @kbd{tab} stands for a single tab character.)
+This matches a string of one or more
+characters, none of which is a space or a tab.
+Usually this means a word.
+
+@item ^\(.*\)\n\1$
+This matches a string consisting of two equal substrings separated by
+a newline.
+
+@item .\@{9\@}A$
+This matches nine characters followed by an @samp{A} at the end of a line.
+
+@item ^.\@{15\@}A
+This matches the start of a string that contains 16 characters,
+the last of which is an @samp{A}.
+
+@end table
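+
+Some of the patterns above can be checked from the shell.  The
+following is only a sketch, assuming @value{SSED} (for @code{\?}) and
+made-up sample words; the regexps are anchored with @code{^} and
+@code{$} so that each line must match as a whole:
+
+@example
+$ printf "%s\n" b ab aab | sed -n '/^a\?b$/p'
+b
+ab
+
+$ printf "%s\n" xA 123456789A | sed -n '/.\@{9\@}A$/p'
+123456789A
+@end example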
+
+
+@node ERE syntax
+@section Overview of extended regular expression syntax
+@cindex Extended regular expressions, syntax
+
+The only difference between basic and extended regular expressions is in
+the behavior of a few characters: @samp{?}, @samp{+}, parentheses,
+braces (@samp{@{@}}), and @samp{|}. While basic regular expressions
+require these to be escaped if you want them to behave as special
+characters, when using extended regular expressions you must escape
+them if you want them @emph{to match a literal character}. @samp{|}
+is special here because @samp{\|} is a GNU extension -- standard
+basic regular expressions do not provide its functionality.
+
+@noindent
+Examples:
+@table @code
+@item abc?
+becomes @samp{abc\?} when using extended regular expressions. It matches
+the literal string @samp{abc?}.
+
+@item c\+
+becomes @samp{c+} when using extended regular expressions. It matches
+one or more @samp{c}s.
+
+@item a\@{3,\@}
+becomes @samp{a@{3,@}} when using extended regular expressions. It matches
+three or more @samp{a}s.
+
+@item \(abc\)\@{2,3\@}
+becomes @samp{(abc)@{2,3@}} when using extended regular expressions. It
+matches either @samp{abcabc} or @samp{abcabcabc}.
+
+@item \(abc*\)\1
+becomes @samp{(abc*)\1} when using extended regular expressions.
+Backreferences must still be escaped when using extended regular
+expressions.
+
+@item a\|b
+becomes @samp{a|b} when using extended regular expressions. It matches
+@samp{a} or @samp{b}.
+@end table
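+
+As a sketch of one of the conversions above (the replacement text
+@samp{X} is arbitrary), the following two commands are equivalent;
+the first uses basic syntax and the second uses extended syntax
+with @option{-E}:
+
+@example
+$ echo abcabc | sed 's/\(abc\)\@{2,3\@}/X/'
+X
+
+$ echo abcabc | sed -E 's/(abc)@{2,3@}/X/'
+X
+@end example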
+
+@node Character Classes and Bracket Expressions
+@section Character Classes and Bracket Expressions
+
+@c The 'character class' section is shamelessly copied from grep's manual.
+
+@cindex bracket expression
+@cindex character class
+A @dfn{bracket expression} is a list of characters enclosed by @samp{[} and
+@samp{]}.
+It matches any single character in that list;
+if the first character of the list is the caret @samp{^},
+then it matches any character @strong{not} in the list.
+For example, the following command replaces the words
+@samp{gray} or @samp{grey} with @samp{blue}:
+
+@example
+sed 's/gr[ae]y/blue/'
+@end example
+
+@c TODO: fix 'ref' to look good in both HTML and PDF
+Bracket expressions can be used in both
+@ref{BRE syntax,,basic} and @ref{ERE syntax,,extended}
+regular expressions (that is, with or without the @option{-E}/@option{-r}
+options).
+
+@cindex range expression
+Within a bracket expression, a @dfn{range expression} consists of two
+characters separated by a hyphen.
+It matches any single character that
+sorts between the two characters, inclusive.
+In the default C locale, the sorting sequence is the native character
+order; for example, @samp{[a-d]} is equivalent to @samp{[abcd]}.
+
+
+Finally, certain named classes of characters are predefined within
+bracket expressions, as follows.
+
+These named classes must be used @emph{inside} brackets
+themselves. Correct usage:
+@example
+$ echo 1 | sed 's/[[:digit:]]/X/'
+X
+@end example
+
+Incorrect usage is rejected by newer @command{sed} versions.
+Older versions accepted it but treated it as a single bracket expression
+(which is equivalent to @samp{[dgit:]},
+that is, only the characters @var{d/g/i/t/:}):
+@example
+# current GNU sed versions - incorrect usage rejected
+$ echo 1 | sed 's/[:digit:]/X/'
+sed: character class syntax is [[:space:]], not [:space:]
+
+# older GNU sed versions
+$ echo 1 | sed 's/[:digit:]/X/'
+1
+@end example
+
+
+@cindex classes of characters
+@cindex character classes
+@cindex named character classes
+@table @samp
+
+@item [:alnum:]
+@opindex alnum @r{character class}
+@cindex alphanumeric characters
+Alphanumeric characters:
+@samp{[:alpha:]} and @samp{[:digit:]}; in the @samp{C} locale and ASCII
+character encoding, this is the same as @samp{[0-9A-Za-z]}.
+
+@item [:alpha:]
+@opindex alpha @r{character class}
+@cindex alphabetic characters
+Alphabetic characters:
+@samp{[:lower:]} and @samp{[:upper:]}; in the @samp{C} locale and ASCII
+character encoding, this is the same as @samp{[A-Za-z]}.
+
+@item [:blank:]
+@opindex blank @r{character class}
+@cindex blank characters
+Blank characters:
+space and tab.
+
+@item [:cntrl:]
+@opindex cntrl @r{character class}
+@cindex control characters
+Control characters.
+In ASCII, these characters have octal codes 000
+through 037, and 177 (DEL).
+In other character sets, these are
+the equivalent characters, if any.
+
+@item [:digit:]
+@opindex digit @r{character class}
+@cindex digit characters
+@cindex numeric characters
+Digits: @code{0 1 2 3 4 5 6 7 8 9}.
+
+@item [:graph:]
+@opindex graph @r{character class}
+@cindex graphic characters
+Graphical characters:
+@samp{[:alnum:]} and @samp{[:punct:]}.
+
+@item [:lower:]
+@opindex lower @r{character class}
+@cindex lower-case letters
+Lower-case letters; in the @samp{C} locale and ASCII character
+encoding, this is
+@code{a b c d e f g h i j k l m n o p q r s t u v w x y z}.
+
+@item [:print:]
+@opindex print @r{character class}
+@cindex printable characters
+Printable characters:
+@samp{[:alnum:]}, @samp{[:punct:]}, and space.
+
+@item [:punct:]
+@opindex punct @r{character class}
+@cindex punctuation characters
+Punctuation characters; in the @samp{C} locale and ASCII character
+encoding, this is
+@code{!@: " # $ % & ' ( ) * + , - .@: / : ; < = > ?@: @@ [ \ ] ^ _ ` @{ | @} ~}.
+
+@item [:space:]
+@opindex space @r{character class}
+@cindex space characters
+@cindex whitespace characters
+Space characters: in the @samp{C} locale, this is
+tab, newline, vertical tab, form feed, carriage return, and space.
+
+
+@item [:upper:]
+@opindex upper @r{character class}
+@cindex upper-case letters
+Upper-case letters: in the @samp{C} locale and ASCII character
+encoding, this is
+@code{A B C D E F G H I J K L M N O P Q R S T U V W X Y Z}.
+
+@item [:xdigit:]
+@opindex xdigit @r{character class}
+@cindex xdigit class
+@cindex hexadecimal digits
+Hexadecimal digits:
+@code{0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f}.
+
+@end table
+Note that the brackets in these class names are
+part of the symbolic names, and must be included in addition to
+the brackets delimiting the bracket expression.
+
+Most meta-characters lose their special meaning inside bracket expressions:
+
+@table @samp
+@item ]
+ends the bracket expression if it's not the first list item.
+So, if you want to make the @samp{]} character a list item,
+you must put it first.
+
+@item -
+represents the range if it's not first or last in a list or the ending point
+of a range.
+
+@item ^
+represents the characters not in the list.
+If you want to make the @samp{^}
+character a list item, place it anywhere but first.
+@end table
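+
+The placement rules above can be combined in one bracket expression.
+In this sketch (the input string is made up for illustration),
+@samp{]} is placed first, @samp{^} is not first and @samp{-} is last,
+so all three are treated as ordinary list items:
+
+@example
+$ echo 'a]b-c^d' | sed 's/[]^-]/X/g'
+aXbXcXd
+@end example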
+
+TODO: incorporate this paragraph (copied verbatim from BRE section).
+
+@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
+The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
+are normally not special within @var{list}. For example, @code{[\*]}
+matches either @samp{\} or @samp{*}, because the @code{\} is not
+special here. However, strings like @code{[.ch.]}, @code{[=a=]}, and
+@code{[:space:]} are special within @var{list} and represent collating
+symbols, equivalence classes, and character classes, respectively, and
+@code{[} is therefore special within @var{list} when it is followed by
+@code{.}, @code{=}, or @code{:}. Also, when not in
+@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
+@code{\t} are recognized within @var{list}. @xref{Escapes}.
+@c ********
+
+
+@c TODO: improve explanation about collation classes and equivalence classes
+@c perhaps dedicate a section to Locales ??
+
+@table @samp
+@item [.
+represents the open collating symbol.
+
+@item .]
+represents the close collating symbol.
+
+@item [=
+represents the open equivalence class.
+
+@item =]
+represents the close equivalence class.
+
+@item [:
+represents the open character class symbol, and should be followed by a
+valid character class name.
+
+@item :]
+represents the close character class symbol.
+@end table
+
+
+@node regexp extensions
+@section Regular expression extensions
+
+The following sequences have special meaning inside regular expressions
+(used in @ref{Regexp Addresses,,addresses} and the @code{s} command).
+
+These can be used in both
+@ref{BRE syntax,,basic} and @ref{ERE syntax,,extended}
+regular expressions (that is, with or without the @option{-E}/@option{-r}
+options).
+
+@table @code
+@item \w
+Matches any ``word'' character. A ``word'' character is any
+letter or digit or the underscore character.
+
+@example
+$ echo "abc %-= def." | sed 's/\w/X/g'
+XXX %-= XXX.
+@end example
+
+
+@item \W
+Matches any ``non-word'' character.
+
+@example
+$ echo "abc %-= def." | sed 's/\W/X/g'
+abcXXXXXdefX
+@end example
+
+
+@item \b
+Matches a word boundary; that is, it matches if the character
+to the left is a ``word'' character and the character to the
+right is a ``non-word'' character, or vice-versa.
+
+@example
+$ echo "abc %-= def." | sed 's/\b/X/g'
+XabcX %-= XdefX.
+@end example
+
+
+@item \B
+Matches everywhere but on a word boundary; that is, it matches
+if the character to the left and the character to the right
+are either both ``word'' characters or both ``non-word''
+characters.
+
+@example
+$ echo "abc %-= def." | sed 's/\B/X/g'
+aXbXc X%X-X=X dXeXf.X
+@end example
+
+
+@item \s
+Matches whitespace characters (spaces and tabs).
+Newlines embedded in the pattern/hold spaces will also match:
+
+@example
+$ echo "abc %-= def." | sed 's/\s/X/g'
+abcX%-=Xdef.
+@end example
+
+
+@item \S
+Matches non-whitespace characters.
+
+@example
+$ echo "abc %-= def." | sed 's/\S/X/g'
+XXX XXX XXXX
+@end example
+
+
+@item \<
+Matches the beginning of a word.
+
+@example
+$ echo "abc %-= def." | sed 's/\</X/g'
+Xabc %-= Xdef.
+@end example
+
+
+@item \>
+Matches the end of a word.
+
+@example
+$ echo "abc %-= def." | sed 's/\>/X/g'
+abcX %-= defX.
+@end example
+
+
+@item \`
+Matches only at the start of pattern space. This is different
+from @code{^} in multi-line mode.
+
+Compare the following two examples:
+
+@example
+$ printf "a\nb\nc\n" | sed 'N;N;s/^/X/gm'
+Xa
+Xb
+Xc
+
+$ printf "a\nb\nc\n" | sed 'N;N;s/\`/X/gm'
+Xa
+b
+c
+@end example
+
+@item \'
+Matches only at the end of pattern space. This is different
+from @code{$} in multi-line mode.
+
+
+
+@end table
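+
+The @code{\'} item has no example of its own, so here is a sketch that
+mirrors the @code{\`} comparison, assuming GNU @command{sed} and a POSIX
+shell. The variable names @code{each_line} and @code{buffer_end} are
+illustrative only.

```shell
# With the M flag, '$' matches at the end of every line in pattern space.
each_line=$(printf 'a\nb\nc\n' | sed 'N;N;s/$/X/gm')

# \' still matches only at the very end of pattern space.
# Double quotes are used because \' cannot appear inside single quotes;
# the shell turns \\' into the two characters \' before sed sees them.
buffer_end=$(printf 'a\nb\nc\n' | sed "N;N;s/\\'/X/gm")

printf '%s\n' "$each_line"
printf '%s\n' "$buffer_end"
```

+The first command marks the end of all three lines, while the second
+marks only the end of the last one.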
+
+
+@node Back-references and Subexpressions
+@section Back-references and Subexpressions
+@cindex subexpression
+@cindex back-reference
+
+@dfn{Back-references} are regular expression commands which refer to a
+previous part of the matched regular expression. Back-references are
+specified with backslash and a single digit (e.g. @samp{\1}). The
+part of the regular expression they refer to is called a
+@dfn{subexpression}, and is designated with parentheses.
+
+Back-references and subexpressions are used in two cases: in the
+regular expression search pattern, and in the @var{replacement} part
+of the @command{s} command (@pxref{Regexp Addresses,,Regular
+Expression Addresses} and @ref{The "s" Command}).
+
+In a regular expression pattern, back-references are used to match
+the same content as a previously matched subexpression. In the
+following example, the subexpression is @samp{.} - any single
+character (being surrounded by parentheses makes it a
+subexpression). The back-reference @samp{\1} asks to match the same
+content (same character) as the subexpression.
+
+The command below matches words starting with any character,
+followed by the letter @samp{o}, followed by the same character as the
+first.
+
+@example
+$ sed -E -n '/^(.)o\1$/p' /usr/share/dict/words
+bob
+mom
+non
+pop
+sos
+tot
+wow
+@end example
+
+Multiple subexpressions are automatically numbered from
+left-to-right. This command searches for 6-letter
+palindromes (the first three letters are 3 subexpressions,
+followed by 3 back-references in reverse order):
+
+@example
+$ sed -E -n '/^(.)(.)(.)\3\2\1$/p' /usr/share/dict/words
+redder
+@end example
+
+In the @command{s} command, back-references can be
+used in the @var{replacement} part to refer back to subexpressions in
+the @var{regexp} part.
+
+The following example uses two subexpressions in the regular
+expression to match two space-separated words. The back-references in
+the @var{replacement} part print the words in a different order:
+
+@example
+$ echo "James Bond" | sed -E 's/(.*) (.*)/The name is \2, \1 \2./'
+The name is Bond, James Bond.
+@end example
+
+
+When used with alternation, if the group does not participate in the
+match then the back-reference makes the whole match fail. For
+example, @samp{a(.)|b\1} will not match @samp{ba}. When multiple
+regular expressions are given with @option{-e} or from a file
+(@samp{-f @var{file}}), back-references are local to each expression.
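+
+All examples in this section use @option{-E}. As a sketch of how the same
+substitution looks in basic syntax, where the grouping parentheses need
+backslashes, compare the two commands below. The variable names
+@code{swap_bre} and @code{swap_ere} are illustrative only.

```shell
# Basic syntax (no -E): groups are written \( ... \), references stay \1, \2.
swap_bre=$(echo "James Bond" | sed 's/\(.*\) \(.*\)/\2, \1/')

# Extended syntax (-E): groups are written ( ... ), references are unchanged.
swap_ere=$(echo "James Bond" | sed -E 's/(.*) (.*)/\2, \1/')

echo "$swap_bre"
echo "$swap_ere"
```

+Both commands print @samp{Bond, James}.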
+
+
@node Escapes
-@section @acronym{GNU} Extensions for Escapes in Regular Expressions
+@section Escape Sequences - specifying special characters
@cindex @acronym{GNU} extensions, special escapes
Until this chapter, we have only encountered escapes of the form
@@ -1631,15 +2926,7 @@ hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B.
Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}.
@item \o@var{xxx}
-@ifset PERL
-@item \@var{xxx}
-@end ifset
Produces or matches a character whose octal @sc{ascii} value is @var{xxx}.
-@ifset PERL
-The syntax without the @code{o} is active in Perl mode, while the one
-with the @code{o} is active in the normal or extended @sc{posix} regular
-expression modes.
-@end ifset
@item \x@var{xx}
Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}.
@@ -1648,46 +2935,246 @@ Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}.
@samp{\b} (backspace) was omitted because of the conflict with
the existing ``word boundary'' meaning.
-Other escapes match a particular character class and are valid only in
-regular expressions:
-@table @code
-@item \w
-Matches any ``word'' character. A ``word'' character is any
-letter or digit or the underscore character.
+@node Locale Considerations
+@section Locale Considerations
-@item \W
-Matches any ``non-word'' character.
+TODO: fix following paragraphs (copied verbatim from 'bracket
+expression' section).
-@item \b
-Matches a word boundary; that is it matches if the character
-to the left is a ``word'' character and the character to the
-right is a ``non-word'' character, or vice-versa.
+TODO: mention locale support is heavily dependent on the OS/libc, not on sed.
-@item \B
-Matches everywhere but on a word boundary; that is it matches
-if the character to the left and the character to the right
-are either both ``word'' characters or both ``non-word''
-characters.
+The current locale affects the characters matched by @command{sed}'s
+regular expressions.
-@item \`
-Matches only at the start of pattern space. This is different
-from @code{^} in multi-line mode.
-@item \'
-Matches only at the end of pattern space. This is different
-from @code{$} in multi-line mode.
+In locales other than the default @samp{C} locale, the sorting
+sequence is not specified, and @samp{[a-d]} might be equivalent to
+@samp{[abcd]} or to @samp{[aBbCcDd]}, or it might fail to match any
+character, or the set of characters that it matches might even be erratic.
+To obtain the traditional interpretation
+of bracket expressions, you can use the @samp{C} locale by setting the
+@env{LC_ALL} environment variable to the value @samp{C}.
+
+@example
+# TODO: is there any real-world system/locale where 'A'
+# is replaced by '-' ?
+$ echo A | sed 's/[a-z]/-/'
+A
+@end example
+
+The interpretation of named character classes depends on the
+@env{LC_CTYPE} locale; for example, @samp{[[:alnum:]]} means the
+character class of numbers and letters in the current locale.
+
+TODO: show example of collation
+
+@example
+# TODO: this works on glibc systems, not on musl-libc/freebsd/macosx.
+$ printf 'cliché\n' | LC_ALL=fr_FR.utf8 sed 's/[[=e=]]/X/g'
+clichX
+@end example
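+
+The @samp{C} locale can also be selected for a single invocation rather
+than for the whole session. The following is a minimal sketch, assuming a
+POSIX shell; the variable names @code{r_lower} and @code{r_upper} are
+illustrative only.

```shell
# Pin the locale for this one sed invocation only.
# In the C locale, [a-z] covers exactly the lowercase ASCII letters.
r_lower=$(echo A | LC_ALL=C sed 's/[a-z]/-/')

# ...and [A-Z] covers exactly the uppercase ASCII letters.
r_upper=$(echo A | LC_ALL=C sed 's/[A-Z]/-/')

echo "$r_lower"
echo "$r_upper"
```

+The first command leaves @samp{A} unchanged and the second replaces it
+with @samp{-}.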
+
+
+@node advanced sed
+@chapter Advanced @command{sed}: cycles and buffers
+
+@menu
+* Execution Cycle:: How @command{sed} works
+* Hold and Pattern Buffers::
+* Multiline techniques:: Using D,G,H,N,P to process multiple lines
+* Branching and flow control::
+@end menu
+
+@node Execution Cycle
+@section How @command{sed} Works
+
+@cindex Buffer spaces, pattern and hold
+@cindex Spaces, pattern and hold
+@cindex Pattern space, definition
+@cindex Hold space, definition
+@command{sed} maintains two data buffers: the active @emph{pattern} space,
+and the auxiliary @emph{hold} space. Both are initially empty.
+
+@command{sed} operates by performing the following cycle on each
+line of input: first, @command{sed} reads one line from the input
+stream, removes any trailing newline, and places it in the pattern space.
+Then commands are executed; each command can have an address associated
+to it: addresses are a kind of condition code, and a command is only
+executed if the condition is verified before the command is to be
+executed.
+
+When the end of the script is reached, unless the @option{-n} option
+is in use, the contents of pattern space are printed out to the output
+stream, adding back the trailing newline if it was removed.@footnote{Actually,
+if @command{sed} prints a line without the terminating newline, it will
+nevertheless print the missing newline as soon as more text is sent to
+the same output stream, which gives the ``least expected surprise''
+even though it does not make commands like @samp{sed -n p} exactly
+identical to @command{cat}.} Then the next cycle starts for the next
+input line.
+
+Unless special commands (like @samp{D}) are used, the pattern space is
+deleted between two cycles. The hold space, on the other hand, keeps
+its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
+@samp{g}, @samp{G} to move data between both buffers).
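+
+Because the hold space survives from one cycle to the next, it can
+accumulate input. The sketch below reverses the order of three lines;
+the variable name @code{reversed} is illustrative only.

```shell
# 1!G  on every line but the first, append the hold space to the pattern space
# h    copy the pattern space to the hold space, which survives into the next cycle
# $p   on the last line, print everything that has accumulated
reversed=$(seq 3 | sed -n '1!G;h;$p')

printf '%s\n' "$reversed"
```

+The output is @samp{3}, @samp{2}, @samp{1}, one number per line.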
+
+@node Hold and Pattern Buffers
+@section Hold and Pattern Buffers
+
+TODO
+
+@node Multiline techniques
+@section Multiline techniques - using D,G,H,N,P to process multiple lines
+
+Multiple lines can be processed as one buffer using the
+@code{D}, @code{G}, @code{H}, @code{N} and @code{P} commands. They are
+similar to their lowercase counterparts (@code{d}, @code{g},
+@code{h}, @code{n}, @code{p}), except that these commands append or
+subtract data while respecting embedded newlines - allowing adding and
+removing lines from the pattern and hold spaces.
+
+They operate as follows:
+@table @code
+@item D
+@emph{deletes} line from the pattern space until the first newline,
+and restarts the cycle.
+
+@item G
+@emph{appends} line from the hold space to the pattern space, with a
+newline before it.
+
+@item H
+@emph{appends} line from the pattern space to the hold space, with a
+newline before it.
+
+@item N
+@emph{appends} line from the input file to the pattern space.
+
+@item P
+@emph{prints} line from the pattern space until the first newline.
-@ifset PERL
-@item \G
-Match only at the start of pattern space or, when doing a global
-substitution using the @code{s///g} command and option, at
-the end-of-match position of the prior match. For example,
-@samp{s/\Ga/Z/g} will change an initial run of @code{a}s to
-a run of @code{Z}s
-@end ifset
@end table
+
+The following example illustrates the operation of @code{N} and
+@code{D} commands:
+
+@codequoteundirected on
+@codequotebacktick on
+@example
+@group
+$ seq 6 | sed -n 'N;l;D'
+1\n2$
+2\n3$
+3\n4$
+4\n5$
+5\n6$
+@end group
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+@enumerate
+@item
+@command{sed} starts by reading the first line into the pattern space
+(i.e. @samp{1}).
+@item
+At the beginning of every cycle, the @code{N}
+command appends a newline and the next line to the pattern space
+(i.e. @samp{1}, @samp{\n}, @samp{2} in the first cycle).
+@item
+The @code{l} command prints the content of the pattern space
+unambiguously.
+@item
+The @code{D} command then removes the content of pattern
+space up to the first newline (leaving @samp{2} at the end of
+the first cycle).
+@item
+At the next cycle the @code{N} command appends a
+newline and the next input line to the pattern space
+(e.g. @samp{2}, @samp{\n}, @samp{3}).
+@end enumerate
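+
+The difference between an uppercase command and its lowercase
+counterpart can be seen by comparing @code{P} with @code{p} on a
+two-line pattern space. This is a small sketch; the variable names
+@code{whole} and @code{first} are illustrative only.

```shell
# N joins both input lines into one pattern space: a\nb
whole=$(printf 'a\nb\n' | sed -n 'N;p')   # p prints the whole pattern space
first=$(printf 'a\nb\n' | sed -n 'N;P')   # P prints only up to the first newline

printf '%s\n' "$whole"
printf '%s\n' "$first"
```

+The first command prints both lines, the second prints only @samp{a}.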
+
+
+@cindex processing paragraphs
+@cindex paragraphs, processing
+A common technique to process blocks of text such as paragraphs
+(instead of line-by-line) is using the following construct:
+
+@codequoteundirected on
+@codequotebacktick on
+@example
+sed '/./@{H;$!d@} ; x ; s/REGEXP/REPLACEMENT/'
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+@enumerate
+@item
+The first expression, @code{/./@{H;$!d@}} operates on all non-empty lines,
+and adds the current line (in the pattern space) to the hold space.
+On all lines except the last, the pattern space is deleted and the cycle is
+restarted.
+
+@item
+The other expressions @code{x} and @code{s} are executed only on empty
+lines (i.e. paragraph separators) and on the last line of input, where
+@code{$!d} does not apply. The @code{x} command fetches the
+accumulated lines from the hold space back to the pattern space. The
+@code{s///} command then operates on all the text in the paragraph
+(including the embedded newlines).
+@end enumerate
+
+The following example demonstrates this technique:
+@codequoteundirected on
+@codequotebacktick on
+@example
+@group
+$ cat input.txt
+a a a aa aaa
+aaaa aaaa aa
+aaaa aaa aaa
+
+bbbb bbb bbb
+bb bb bbb bb
+bbbbbbbb bbb
+
+ccc ccc cccc
+cccc ccccc c
+cc cc cc cc
+
+$ sed '/./@{H;$!d@} ; x ; s/^/\nSTART-->/ ; s/$/\n<--END/' input.txt
+
+START-->
+a a a aa aaa
+aaaa aaaa aa
+aaaa aaa aaa
+<--END
+
+START-->
+bbbb bbb bbb
+bb bb bbb bb
+bbbbbbbb bbb
+<--END
+
+START-->
+ccc ccc cccc
+cccc ccccc c
+cc cc cc cc
+<--END
+@end group
+@end example
+@codequoteundirected off
+@codequotebacktick off
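+
+The same construct can select whole paragraphs instead of editing them.
+The following is a sketch only: the function name @code{para_grep} is
+hypothetical, and its argument must be a regular expression that
+contains no @samp{/} character.

```shell
# Hypothetical helper: print only the paragraphs that match the regexp in $1.
# As with the construct above, each printed paragraph starts with an empty line.
para_grep() {
    sed -n '/./{H;$!d} ; x ; /'"$1"'/p'
}

# Two paragraphs; only the first one contains "foo".
printf 'a\nfoo\n\nb\nc\n' | para_grep foo
```

+Here @option{-n} suppresses the default output and @code{p} prints a
+paragraph only when the regular expression matches somewhere inside it.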
+
+For more annotated examples, see @ref{Text search across multiple lines}
+and @ref{Line length adjustment}.
+
+@node Branching and flow control
+@section Branching and Flow Control
+
+TODO
+
@node Examples
@chapter Some Sample Scripts
@@ -1695,12 +3182,18 @@ Here are some @command{sed} scripts to guide you in the art of mastering
@command{sed}.
@menu
+
+Useful one-liners:
+* Joining lines::
+
Some exotic examples:
* Centering lines::
* Increment a number::
* Rename files to lower case::
* Print bash environment::
* Reverse chars of lines::
+* Text search across multiple lines::
+* Line length adjustment::
Emulating standard utilities:
* tac:: Reverse lines of files
@@ -1717,6 +3210,53 @@ Emulating standard utilities:
* cat -s:: Squeezing blank lines
@end menu
+@node Joining lines
+@section Joining lines
+
+Join specific lines (e.g. if lines 2 and 3 need to be joined):
+
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ cat lines.txt
+hello
+hel
+lo
+hello
+
+$ sed '2@{N;s/\n//;@}' lines.txt
+hello
+hello
+hello
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+Join lines ending with backslashes:
+
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ cat 1.txt
+this \
+is \
+a \
+long \
+line
+and another \
+line
+
+$ sed -e ':x /\\$/ @{ N; s/\\\n//g ; bx @}' 1.txt
+this is a long line
+and another line
+
+
+#TODO: The above requires gnu sed.
+# non-gnu seds need newlines after ':' and 'b'
+@end example
+@codequoteundirected off
+@codequotebacktick off
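+
+A related one-liner joins every line of the input into a single line.
+This sketch assumes GNU @command{sed}, which accepts a semicolon after a
+label; the variable name @code{joined} is illustrative only.

```shell
# :a        define label a
# N         append the next input line to the pattern space
# $!ba      unless this is the last line, branch back to a
# s/\n/ /g  finally, replace every embedded newline with a space
joined=$(seq 3 | sed ':a;N;$!ba;s/\n/ /g')

echo "$joined"
```

+The whole input is held in the pattern space before the substitution
+runs, so this approach is best kept for small inputs.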
+
@node Centering lines
@section Centering Lines
@@ -1743,7 +3283,7 @@ technique.
@end group
@group
-# del leading and trailing spaces
+# delete leading and trailing spaces
y/@kbd{tab}/ /
s/^ *//
s/ *$//
@@ -1835,7 +3375,7 @@ seen a script converting the output of @command{date} into a @command{bc}
program!
The main body of this is the @command{sed} script, which remaps the name
-from lower to upper (or vice-versa) and even checks out
+from lower to upper (or vice-versa) and even checks out
if the remapped name is the same as the original name.
Note how the script is parameterized using shell
variables and proper quoting.
@@ -1844,11 +3384,11 @@ variables and proper quoting.
@example
@group
#! /bin/sh
-# rename files to lower/upper case...
+# rename files to lower/upper case...
#
-# usage:
-# move-to-lower *
-# move-to-upper *
+# usage:
+# move-to-lower *
+# move-to-upper *
# or
# move-to-lower -R .
# move-to-upper -R .
@@ -1891,7 +3431,7 @@ files_only=
@group
while :
do
- case "$1" in
+ case "$1" in
-n) apply_cmd='cat' ;;
-R) finder='find "$@@" -type f';;
-h) help ; exit 1 ;;
@@ -2085,6 +3625,212 @@ s/\n//g
@end example
@c end---------------------------------------------
+
+@node Text search across multiple lines
+@section Text search across multiple lines
+
+This section uses @code{N} and @code{D} commands to search for
+consecutive words spanning multiple lines. @xref{Multiline techniques}.
+
+These examples deal with finding doubled occurrences of words in a document.
+
+Finding doubled words in a single line is easy using GNU @command{grep}
+and similarly with @value{SSED}:
+
+@c NOTE: in all examples, 'the@ the' is used to prevent
+@c 'make syntax-check' from complaining about double words.
+@codequoteundirected on
+@codequotebacktick on
+@example
+@group
+$ cat two-cities-dup1.txt
+It was the best of times,
+it was the worst of times,
+it was the@ the age of wisdom,
+it was the age of foolishness,
+
+$ grep -E '\b(\w+)\s+\1\b' two-cities-dup1.txt
+it was the@ the age of wisdom,
+
+$ grep -n -E '\b(\w+)\s+\1\b' two-cities-dup1.txt
+3:it was the@ the age of wisdom,
+
+$ sed -En '/\b(\w+)\s+\1\b/p' two-cities-dup1.txt
+it was the@ the age of wisdom,
+
+$ sed -En '/\b(\w+)\s+\1\b/@{=;p@}' two-cities-dup1.txt
+3
+it was the@ the age of wisdom,
+@end group
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+@itemize @bullet
+@item
+The regular expression @samp{\b\w+\s+} searches for word-boundary (@samp{\b}),
+followed by one-or-more word-characters (@samp{\w+}), followed by whitespace
+(@samp{\s+}). @xref{regexp extensions}.
+
+@item
+Adding parentheses around the @samp{(\w+)} expression creates a subexpression.
+The regular expression pattern @samp{(PATTERN)\s+\1} defines a subexpression
+(in the parentheses) followed by a back-reference, separated by whitespace.
+A successful match means the @var{PATTERN} was repeated twice in succession.
+@xref{Back-references and Subexpressions}.
+
+@item
+The word-boundary expression (@samp{\b}) at both ends ensures partial
+words are not matched (e.g. @samp{the then} is not a desired match).
+@c Thanks to Jim for pointing this out in
+@c http://lists.gnu.org/archive/html/sed-devel/2016-12/msg00041.html
+
+@item
+The @option{-E} option enables extended regular expression syntax, alleviating
+the need to add backslashes before the parentheses. @xref{ERE syntax}.
+
+@end itemize
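+
+The same regular expression can also repair a doubled word on a single
+line instead of only reporting it, by replacing the match with the
+subexpression. This is a sketch assuming GNU @command{sed}; the input
+sentence and the variable name @code{fixed} are illustrative only.

```shell
# Replace "WORD<whitespace>WORD" with a single WORD.
fixed=$(echo "it was the the age of wisdom," | sed -E 's/\b(\w+)\s+\1\b/\1/g')

echo "$fixed"
```

+The command prints @samp{it was the age of wisdom,}.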
+
+When the doubled word spans two lines, the above regular expression
+will not find it, as @command{grep} and @command{sed} operate line-by-line.
+
+By using @command{N} and @command{D} commands, @command{sed} can apply
+regular expressions on multiple lines (that is, multiple lines are stored
+in the pattern space, and the regular expression works on it):
+
+@c NOTE: use 'the@*the' instead of a real new line to prevent
+@c 'make syntax-check' to complain about doubled-words.
+@codequoteundirected on
+@codequotebacktick on
+@example
+$ cat two-cities-dup2.txt
+It was the best of times, it was the
+worst of times, it was the@*the age of wisdom,
+it was the age of foolishness,
+
+$ sed -En '@{N; /\b(\w+)\s+\1\b/@{=;p@} ; D@}' two-cities-dup2.txt
+3
+worst of times, it was the@*the age of wisdom,
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+@itemize @bullet
+@item
+The @command{N} command appends the next line to the pattern space
+(thus ensuring it contains two consecutive lines in every cycle).
+
+@item
+The regular expression uses @samp{\s+} for word separator which matches
+both spaces and newlines.
+
+@item
+If the regular expression matches, the entire pattern space is printed
+with @command{p}. No lines are printed by default due to the @option{-n} option.
+
+@item
+The @command{D} command removes the first line from the pattern space
+(up until the first newline), readying it for the next cycle.
+@end itemize
+
+See the GNU @command{coreutils} manual for an alternative solution using
+@command{tr -s} and @command{uniq} at
+@c NOTE: cheating and keeping the URL line shorter than 80 characters
+@c by using 'gnu.org' and '/s/'.
+@url{https://gnu.org/s/coreutils/manual/html_node/Squeezing-and-deleting.html}.
+
+@node Line length adjustment
+@section Line length adjustment
+
+This section uses @code{N}, @code{P} and @code{D} commands to re-wrap
+text spanning multiple lines, and the @code{b} command for
+branching.
+@xref{Multiline techniques} and @ref{Branching and flow control}.
+
+These (somewhat contrived) examples deal with formatting and wrapping
+lines of text of the following input file:
+
+@example
+@group
+$ cat two-cities-mix.txt
+It was the best of times, it was
+the worst of times, it
+was the age of
+wisdom,
+it
+was
+the age
+of foolishness,
+@end group
+@end example
+
+The following command will wrap lines at 40 characters:
+@codequoteundirected on
+@codequotebacktick on
+@example
+@group
+$ sed -E ':x @{N ; s/\n/ /g ; s/(.@{40,40@})/\1\n/ ; /\n/!bx ; P ; D@}' \
+ two-cities-mix.txt
+It was the best of times, it was the wor
+st of times, it was the age of wisdom, i
+t was the age of foolishness,
+@end group
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+The following command will split lines by comma character:
+@codequoteundirected on
+@codequotebacktick on
+@example
+@group
+$ sed -E ':x @{N ; s/\n/ /g ; s/,/,\n/ ; /\n/!bx ; s/^ *// ; P ; D@}' \
+ two-cities-mix.txt
+It was the best of times,
+it was the worst of times,
+it was the age of wisdom,
+it was the age of foolishness,
+@end group
+@end example
+@codequoteundirected off
+@codequotebacktick off
+
+Both examples use a similar construct:
+
+@itemize @bullet
+
+@item
+The @samp{:x} is a label. It will be used later by the @command{b} command
+to jump to the beginning of the @command{sed} program without starting
+a new cycle.
+
+@item
+The @samp{N} command reads the next line from the input file, and appends
+it to the existing content of the pattern space (with a newline preceding it).
+
+@item
+The first @samp{s/\n/ /g} command replaces all newlines with spaces, discarding
+the line structure of the input file.
+
+@item
+The second @samp{s///} command adds newlines based on the desired pattern
+(after 40 characters in the first example, after comma character in the second
+example).
+
+@item
+The @samp{/\n/!bx} command searches for a newline in the pattern space
+(@samp{/\n/}), and if it is @emph{not} found (@samp{!}), branches (=jumps)
+to the previously defined label @samp{x}. This will cause @command{sed}
+to read the next line without processing any further commands in this cycle.
+
+@item
+If a newline is found in the pattern space, @command{P} is used to print
+up to the newline (that is - the newly structured line) then @command{D}
+deletes the pattern space up to the newline, and starts a new cycle.
+@end itemize
+
+
+
@node tac
@section Reverse Lines of Files
@@ -2093,9 +3839,6 @@ scripts emulating various Unix commands. This, in particular,
is a @command{tac} workalike.
Note that on implementations other than @acronym{GNU} @command{sed}
-@ifset PERL
-and @value{SSED}
-@end ifset
this script might easily overflow internal buffers.
@c start-------------------------------------------
@@ -2542,7 +4285,7 @@ D
@end example
@c end---------------------------------------------
-As you can see, we mantain a 2-line window using @code{P} and @code{D}.
+As you can see, we maintain a 2-line window using @code{P} and @code{D}.
This technique is often used in advanced @command{sed} scripts.
@node uniq -d
@@ -2696,7 +4439,7 @@ tx
This removes leading and trailing blank lines. It is also the
fastest. Note that loops are completely done with @code{n} and
@code{b}, without relying on @command{sed} to restart the
-the script automatically at the end of a line.
+script automatically at the end of a line.
@c start-------------------------------------------
@example
@@ -2714,7 +4457,7 @@ the script automatically at the end of a line.
p
# get next
n
-# got chars? print it again, etc...
+# got chars? print it again, etc...
/./bx
@end group
@@ -2758,80 +4501,6 @@ However, recursion is used to handle subpatterns and indefinite
repetition. This means that the available stack space may limit
the size of the buffer that can be processed by certain patterns.
-@ifset PERL
-There are some size limitations in the regular expression
-matcher but it is hoped that they will never in practice
-be relevant. The maximum length of a compiled pattern
-is 65539 (sic) bytes. All values in repeating quantifiers
-must be less than 65536. The maximum nesting depth of
-all parenthesized subpatterns, including capturing and
-non-capturing subpatterns@footnote{The
-distinction is meaningful when referring to Perl-style
-regular expressions.}, assertions, and other types of
-subpattern, is 200.
-
-Also, @value{SSED} recognizes the @sc{posix} syntax
-@code{[.@var{ch}.]} and @code{[=@var{ch}=]}
-where @var{ch} is a ``collating element'', but these
-are not supported, and an error is given if they are
-encountered.
-
-Here are a few distinctions between the real Perl-style
-regular expressions and those that @option{-R} recognizes.
-
-@enumerate
-@item
-Lookahead assertions do not allow repeat quantifiers after them
-Perl permits them, but they do not mean what you
-might think. For example, @samp{(?!a)@{3@}} does not assert that the
-next three characters are not @samp{a}. It just asserts three times that the
-next character is not @samp{a} --- a waste of time and nothing else.
-
-@item
-Capturing subpatterns that occur inside negative lookahead
-head assertions are counted, but their entries are counted
-as empty in the second half of an @code{s} command.
-Perl sets its numerical variables from any such patterns
-that are matched before the assertion fails to match
-something (thereby succeeding), but only if the negative
-lookahead assertion contains just one branch.
-
-@item
-The following Perl escape sequences are not supported:
-@samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},
-@samp{\Q}. In fact these are implemented by Perl's general
-string-handling and are not part of its pattern matching engine.
-
-@item
-The Perl @samp{\G} assertion is not supported as it is not
-relevant to single pattern matches.
-
-@item
-Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}
-and @samp{(?p@{code@})} constructions. However, there is some experimental
-support for recursive patterns using the non-Perl item @samp{(?R)}.
-
-@item
-There are at the time of writing some oddities in Perl
-5.005_02 concerned with the settings of captured strings
-when part of a pattern is repeated. For example, matching
-@samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets
-@samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}
-to the value @samp{b}, but matching @samp{aabbaa}
-against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}
-unset. However, if the pattern is changed to
-@samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.
-In Perl 5.004 @samp{$2} is set in both cases, and that is also
-true of @value{SSED}.
-
-@item
-Another as yet unresolved discrepancy is that in Perl
-5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches
-the string @samp{a}, whereas in @value{SSED} it does not.
-However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched
-against @samp{a} leaves $1 unset.
-@end enumerate
-@end ifset
@node Other Resources
@chapter Other Resources for Learning About @command{sed}
@@ -2867,7 +4536,7 @@ Please do not send a bug report like this:
@example
@i{@i{@r{while building frobme-1.3.4}}}
-$ configure
+$ configure
@error{} sed: file sedscr line 1: Unknown option to 's'
@end example
@@ -2886,6 +4555,7 @@ for the bug, but that is not a very practical prospect.
Here are a few commonly reported bugs that are not bugs.
@table @asis
+@anchor{N_command_last_line}
@item @code{N} command on the last line
@cindex Portability, @code{N} command on the last line
@cindex Non-bugs, @code{N} command on the last line
@@ -2896,6 +4566,21 @@ the @command{N} command is issued on the last line of a file.
the @command{-n} command switch has been specified. This choice is
by design.
+Default behavior (GNU extension, non-POSIX conforming):
+@example
+$ seq 3 | sed N
+1
+2
+3
+@end example
+@noindent
+To force POSIX-conforming behavior:
+@example
+$ seq 3 | sed --posix N
+1
+2
+@end example
+
For example, the behavior of
@example
sed N foo bar
@@ -2941,9 +4626,6 @@ assumption that @code{\|} and @code{\+} match the literal characters
@code{|} and @code{+}. Such scripts must be modified by removing the
spurious backslashes if they are to be used with modern implementations
of @command{sed}, like
-@ifset PERL
-@value{SSED} or
-@end ifset
@acronym{GNU} @command{sed}.
On the other hand, some scripts use s|abc\|def||g to remove occurrences
@@ -2972,7 +4654,7 @@ In short, @samp{sed -i} will let you delete the contents of
a read-only file, and in general the @option{-i} option
(@pxref{Invoking sed, , Invocation}) lets you clobber
protected files. This is not a bug, but rather a consequence
-of how the Unix filesystem works.
+of how the Unix file system works.
The permissions on a file say what can happen to the data
in that file, while the permissions on a directory say what can
@@ -2982,7 +4664,7 @@ Rather, it will work on a temporary file that is finally renamed
to the original name: if you rename or delete files, you're actually
modifying the contents of the directory, so the operation depends on
the permissions of the directory, not of the file. For this same
-reason, @command{sed} does not let you use @option{-i} on a writeable file
+reason, @command{sed} does not let you use @option{-i} on a writable file
in a read-only directory, and will break hard or symbolic links when
@option{-i} is used on such a file.
@@ -3039,1297 +4721,13 @@ the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
@end table
-@node Extended regexps
-@appendix Extended regular expressions
-@cindex Extended regular expressions, syntax
-
-The only difference between basic and extended regular expressions is in
-the behavior of a few characters: @samp{?}, @samp{+}, parentheses,
-braces (@samp{@{@}}), and @samp{|}. While basic regular expressions
-require these to be escaped if you want them to behave as special
-characters, when using extended regular expressions you must escape
-them if you want them @emph{to match a literal character}. @samp{|}
-is special here because @samp{\|} is a GNU extension -- standard
-basic regular expressions do not provide its functionality.
-
-@noindent
-Examples:
-@table @code
-@item abc?
-becomes @samp{abc\?} when using extended regular expressions. It matches
-the literal string @samp{abc?}.
-
-@item c\+
-becomes @samp{c+} when using extended regular expressions. It matches
-one or more @samp{c}s.
-
-@item a\@{3,\@}
-becomes @samp{a@{3,@}} when using extended regular expressions. It matches
-three or more @samp{a}s.
-
-@item \(abc\)\@{2,3\@}
-becomes @samp{(abc)@{2,3@}} when using extended regular expressions. It
-matches either @samp{abcabc} or @samp{abcabcabc}.
-
-@item \(abc*\)\1
-becomes @samp{(abc*)\1} when using extended regular expressions.
-Backreferences must still be escaped when using extended regular
-expressions.
-@end table
-
-@ifset PERL
-@node Perl regexps
-@appendix Perl-style regular expressions
-@cindex Perl-style regular expressions, syntax
-
-@emph{This part is taken from the @file{pcre.txt} file distributed together
-with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.}
-
-Perl introduced several extensions to regular expressions, some
-of them incompatible with the syntax of regular expressions
-accepted by Emacs and other @acronym{GNU} tools (whose matcher was
-based on the Emacs matcher). @value{SSED} implements
-both kinds of extensions.
-
-@iftex
-Summarizing, we have:
-
-@itemize @bullet
-@item
-A backslash can introduce several special sequences
-
-@item
-The circumflex, dollar sign, and period characters behave specially
-with regard to new lines
-
-@item
-Strange uses of square brackets are parsed differently
-
-@item
-You can toggle modifiers in the middle of a regular expression
-
-@item
-You can specify that a subpattern does not count when numbering backreferences
-
-@item
-@cindex Greedy regular expression matching
-You can specify greedy or non-greedy matching
-
-@item
-You can have more than ten back references
-
-@item
-You can do complex look aheads and look behinds (in the spirit of
-@code{\b}, but with subpatterns).
-
-@item
-You can often improve performance by keeping @command{sed} from wasting
-time on backtracking
-
-@item
-You can have if/then/else branches
-
-@item
-You can do recursive matches, for example to look for unbalanced parentheses
-
-@item
-You can have comments and non-significant whitespace, because things can
-get complex...
-@end itemize
-
-Most of these extensions are introduced by the special @code{(?}
-sequence, which gives special meanings to parenthesized groups.
-@end iftex
-@menu
-Other extensions can be roughly subdivided into two categories.
-On the one hand, Perl introduces several more escaped sequences
-(that is, sequences introduced by a backslash). On the other
-hand, it specifies that a question mark following an opening
-parenthesis gives a special meaning to the parenthesized
-group.
-
-* Backslash:: Introduces special sequences
-* Circumflex/dollar sign/period:: Behave specially with regard to new lines
-* Square brackets:: Are a bit different in strange cases
-* Options setting:: Toggle modifiers in the middle of a regexp
-* Non-capturing subpatterns:: Are not counted when backreferencing
-* Repetition:: Allows for non-greedy matching
-* Backreferences:: Allows for more than 10 back references
-* Assertions:: Allows for complex look ahead matches
-* Non-backtracking subpatterns:: Often gives more performance
-* Conditional subpatterns:: Allows if/then/else branches
-* Recursive patterns:: For example to match parentheses
-* Comments:: Because things can get complex...
-@end menu
-
-@node Backslash
-@appendixsec Backslash
-@cindex Perl-style regular expressions, escaped sequences
-
-There are a few differences in the handling of backslashed
-sequences in Perl mode.
-
-First of all, there are no @code{\o} and @code{\d} sequences.
-@sc{ascii} values for characters can be specified in octal
-with a @code{\@var{xxx}} sequence, where @var{xxx} is a
-sequence of up to three octal digits. If the first digit
-is a zero, the treatment of the sequence is straightforward;
-just note that if the character that follows the escaped digit
-is itself an octal digit, you have to supply three octal digits
-for @var{xxx}. For example @code{\07} is a @sc{bel} character
-rather than a @sc{nul} and a literal @code{7} (this sequence is
-instead represented by @code{\0007}).
-
-@cindex Perl-style regular expressions, backreferences
-The handling of a backslash followed by a digit other than 0
-is complicated. Outside a character class, @command{sed} reads it
-and any following digits as a decimal number. If the number
-is less than 10, or if there have been at least that many
-previous capturing left parentheses in the expression, the
-entire sequence is taken as a back reference. A description
-of how this works is given later, following the discussion
-of parenthesized subpatterns.
-
-Inside a character class, or if the decimal number is
-greater than 9 and there have not been that many capturing
-subpatterns, @command{sed} re-reads up to three octal digits following
-the backslash, and generates a single byte from the
-least significant 8 bits of the value. Any subsequent digits
-stand for themselves. For example:
-
-@example
-\040 @i{@r{is another way of writing a space}}
-\40 @i{@r{is the same, provided there are fewer than 40}}
- @i{@r{previous capturing subpatterns}}
-\7 @i{@r{is always a back reference}}
-\011 @i{@r{is always a tab}}
-\11 @i{@r{might be a back reference, or another way of writing a tab}}
-\0113 @i{@r{is a tab followed by the character @samp{3}}}
-\113 @i{@r{is the character with octal code 113 (since there}}
- @i{@r{can be no more than 99 back references)}}
-\377 @i{@r{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}}
-\81 @i{@r{is either a back reference, or a binary zero}}
- @i{@r{followed by the two characters @samp{81}}}
-@end example
-
-Note that octal values of 100 or greater must not be introduced
-by a leading zero, because no more than three octal
-digits are ever read. Also note that back references numbered above 9
-are available only in the LHS pattern; it is not yet possible to specify
-more than 9 backreferences on the RHS of the @code{s} command.
-
-All the sequences that define a single byte value can be
-used both inside and outside character classes. In addition,
-inside a character class, the sequence @code{\b} is interpreted
-as the backspace character (hex 08). Outside a character
-class it has a different meaning (see below).
-
-There are also two additional escapes specifying
-generic character classes (as @code{\w} and @code{\W} do):
-
-@cindex Perl-style regular expressions, character classes
-@table @samp
-@item \d
-Matches any decimal digit
-
-@item \D
-Matches any character that is not a decimal digit
-@end table
-
-In Perl mode, these character type sequences can appear both inside and
-outside character classes. In @sc{posix} mode, by contrast, these sequences
-(as well as @code{\w} and @code{\W}) are treated as two literal characters
-(a backslash and a letter) inside square brackets.
-
-Escaped sequences specifying assertions are also different in
-Perl mode. An assertion specifies a condition that has to be met
-at a particular point in a match, without consuming any
-characters from the subject string. The use of subpatterns
-for more complicated assertions is described below. The
-backslashed assertions are
-
-@cindex Perl-style regular expressions, assertions
-@table @samp
-@item \b
-Asserts that the point is at a word boundary.
-A word boundary is a position in the subject string where
-the current character and the previous character do not both
-match @code{\w} or @code{\W} (i.e. one matches @code{\w} and
-the other matches @code{\W}), or the start or end of the string
-if the first or last character matches @code{\w}, respectively.
-
-@item \B
-Asserts that the point is not at a word boundary.
-
-@item \A
-Asserts the matcher is at the start of pattern space (independent
-of multiline mode).
-
-@item \Z
-Asserts the matcher is at the end of pattern space,
-or at a newline before the end of pattern space (independent of
-multiline mode).
-
-@item \z
-Asserts the matcher is at the end of pattern space (independent
-of multiline mode).
-@end table
-
-These assertions may not appear in character classes (but
-note that @code{\b} has a different meaning, namely the
-backspace character, inside a character class).
-Note that Perl mode does not directly support assertions
-for the beginning and the end of word; the @acronym{GNU} extensions
-@code{\<} and @code{\>} achieve this purpose in @sc{posix} mode
-instead.
-
-The @code{\A}, @code{\Z}, and @code{\z} assertions differ
-from the traditional circumflex and dollar sign (described below)
-in that they only ever match at the very start and end of the
-subject string, whatever options are set; in particular @code{\A}
-and @code{\z} are the same as the @acronym{GNU} extensions
-@code{\`} and @code{\'} that are active in @sc{posix} mode.
-
-@node Circumflex/dollar sign/period
-@appendixsec Circumflex, dollar sign, period
-@cindex Perl-style regular expressions, newlines
-
-Outside a character class, in the default matching mode, the
-circumflex character is an assertion which is true only if
-the current matching point is at the start of the subject
-string. Inside a character class, the circumflex has an entirely
-different meaning (see below).
-
-The circumflex need not be the first character of the pattern if
-a number of alternatives are involved, but it should be the
-first thing in each alternative in which it appears if the
-pattern is ever to match that branch. If all possible alternatives
-start with a circumflex, that is, if the pattern is
-constrained to match only at the start of the subject, it is
-said to be an @dfn{anchored} pattern. (There are also other
-constructs that can cause a pattern to be anchored.)
-
-A dollar sign is an assertion which is true only if the
-current matching point is at the end of the subject string,
-or immediately before a newline character that is the last
-character in the string (by default). A dollar sign need not be the
-last character of the pattern if a number of alternatives
-are involved, but it should be the last item in any branch
-in which it appears. A dollar sign has no special meaning in a
-character class.
-
-@cindex Perl-style regular expressions, multiline
-The meanings of the circumflex and dollar sign characters are
-changed if the @code{M} modifier option is used. When this is
-the case, they match immediately after and immediately
-before an internal @code{\n} character, respectively, in addition
-to matching at the start and end of the subject string. For
-example, the pattern @code{/^abc$/} matches the subject string
-@samp{def\nabc} in multiline mode, but not otherwise. Consequently,
-patterns that are anchored in single line mode
-because all branches start with @code{^} are not anchored in
-multiline mode.
-
-@cindex Perl-style regular expressions, multiline
-Note that the sequences @code{\A}, @code{\Z}, and @code{\z}
-can be used to match the start and end of the subject in both
-modes, and if all branches of a pattern start with @code{\A}
-it is always anchored, whether the @code{M} modifier is set or not.
-
-@cindex Perl-style regular expressions, single line
-Outside a character class, a dot in the pattern matches any
-one character in the subject, including a non-printing character,
-but not (by default) newline. If the @code{S} modifier is used,
-dots match newlines as well. Actually, the handling of
-dot is entirely independent of the handling of circumflex
-and dollar sign, the only relationship being that they both
-involve newline characters. Dot has no special meaning in a
-character class.
-
-@node Square brackets
-@appendixsec Square brackets
-@cindex Perl-style regular expressions, character classes
-
-An opening square bracket introduces a character class, terminated
-by a closing square bracket. A closing square bracket on its own
-is not special. If a closing square bracket is required as a
-member of the class, it should be the first data character in
-the class (after an initial circumflex, if present) or escaped with a backslash.
-
-A character class matches a single character in the subject;
-the character must be in the set of characters defined by
-the class, unless the first character in the class is a circumflex,
-in which case the subject character must not be in
-the set defined by the class. If a circumflex is actually
-required as a member of the class, ensure it is not the
-first character, or escape it with a backslash.
-
-For example, the character class @code{[aeiou]} matches any lower
-case vowel, while @code{[^aeiou]} matches any character that is not
-a lower case vowel. Note that a circumflex is just a convenient
-notation for specifying the characters which are in
-the class by enumerating those that are not. It is not an
-assertion: it still consumes a character from the subject
-string, and fails if the current pointer is at the end of
-the string.
-
-@cindex Perl-style regular expressions, case-insensitive
-When caseless matching is set, any letters in a class
-represent both their upper case and lower case versions, so
-for example, a caseless @code{[aeiou]} matches @samp{A} as well
-as @samp{a}, and a caseless @code{[^aeiou]}
-does not match @samp{A}, whereas a case-sensitive version would.
-
-@cindex Perl-style regular expressions, single line
-@cindex Perl-style regular expressions, multiline
-The newline character is never treated in any special way in
-character classes, whatever the setting of the @code{S} and
-@code{M} options (modifiers) is. A class such as @code{[^a]} will
-always match a newline.
-
-The minus (hyphen) character can be used to specify a range
-of characters in a character class. For example, @code{[d-m]}
-matches any letter between d and m, inclusive. If a minus
-character is required in a class, it must be escaped with a
-backslash or appear in a position where it cannot be interpreted
-as indicating a range, typically as the first or last
-character in the class.
-
-It is not possible to have the literal character @code{]} as the
-end character of a range. A pattern such as @code{[W-]46]} is
-interpreted as a class of two characters (@code{W} and @code{-})
-followed by a literal string @code{46]}, so it would match
-@samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped
-with a backslash it is interpreted as the end of range, so
-@code{[W-\]46]} is interpreted as a single class containing a
-range followed by two separate characters. The octal or
-hexadecimal representation of @code{]} can also be used to end a range.
-
-Ranges operate in @sc{ascii} collating sequence. They can also be
-used for characters specified numerically, for example
-@code{[\000-\037]}. If a range that includes letters is used when
-caseless matching is set, it matches the letters in either
-case. For example, a caseless @code{[W-c]} is equivalent to
-@code{[][\^_`wxyzabc]}, matched caselessly, and if character
-tables for the French locale are in use, @code{[\xc8-\xcb]}
-matches accented E characters in both cases.
-
-Unlike in @sc{posix} mode, the character types @code{\d},
-@code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W}
-may also appear in a character class, and add the characters
-that they match to the class. For example, @code{[\dABCDEF]} matches any
-hexadecimal digit. A circumflex can conveniently be used
-with the upper case character types to specify a more restricted
-set of characters than the matching lower case type.
-For example, the class @code{[^\W_]} matches any letter or digit,
-but not underscore.
-
-All non-alphanumeric characters other than @code{\}, @code{-},
-@code{^} (at the start) and the terminating @code{]}
-are non-special in character classes, but it does no harm
-if they are escaped.
-
-Perl 5.6 supports the @sc{posix} notation for character classes, which
-uses names enclosed by @code{[:} and @code{:]} within the enclosing
-square brackets, and @value{SSED} supports this notation as well.
-For example,
-
-@example
-[01[:alpha:]%]
-@end example
-
-@noindent
-matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}.
-The supported class names are
-
-@table @code
-@item alnum
-Matches letters and digits
-
-@item alpha
-Matches letters
-
-@item ascii
-Matches character codes 0 - 127
-
-@item cntrl
-Matches control characters
-
-@item digit
-Matches decimal digits (same as @code{\d})
-
-@item graph
-Matches printing characters, excluding space
-
-@item lower
-Matches lower case letters
-
-@item print
-Matches printing characters, including space
-
-@item punct
-Matches printing characters, excluding letters and digits
-
-@item space
-Matches white space (same as @code{\s})
-
-@item upper
-Matches upper case letters
-
-@item word
-Matches ``word'' characters (same as @code{\w})
-
-@item xdigit
-Matches hexadecimal digits
-@end table
-
-The names @code{ascii} and @code{word} are extensions valid only in
-Perl mode. Another Perl extension is negation, which is
-indicated by a circumflex character after the colon. For example,
-
-@example
-[12[:^digit:]]
-@end example
-
-@noindent
-matches @samp{1}, @samp{2}, or any non-digit.
-
-@node Options setting
-@appendixsec Options setting
-@cindex Perl-style regular expressions, toggling options
-@cindex Perl-style regular expressions, case-insensitive
-@cindex Perl-style regular expressions, multiline
-@cindex Perl-style regular expressions, single line
-@cindex Perl-style regular expressions, extended
-
-The settings of the @code{I}, @code{M}, @code{S}, and @code{X}
-modifiers can be changed from within the pattern by
-a sequence of Perl option letters enclosed between @code{(?}
-and @code{)}. The option letters must be lowercase.
-
-For example, @code{(?im)} sets caseless, multiline matching. It is
-also possible to unset these options by preceding the letter
-with a hyphen; you can also have combined settings and unsettings:
-@code{(?im-sx)} sets caseless and multiline matching,
-and unsets single line matching (for dots) and extended
-whitespace interpretation. If a letter appears both before
-and after the hyphen, the option is unset.
-
-The scope of these option changes depends on where in the
-pattern the setting occurs. For settings that are outside
-any subpattern (defined below), the effect is the same as if
-the options were set or unset at the start of matching. The
-following patterns all behave in exactly the same way:
-
-@example
-(?i)abc
-a(?i)bc
-ab(?i)c
-abc(?i)
-@end example
-
-which in turn is the same as specifying the pattern @code{abc} with
-the @code{I} modifier. In other words, ``top level'' settings
-apply to the whole pattern (unless there are other
-changes inside subpatterns). If there is more than one setting
-of the same option at top level, the rightmost setting
-is used.
-
-If an option change occurs inside a subpattern, the effect
-is different. This is a change of behaviour in Perl 5.005.
-An option change inside a subpattern affects only that part
-of the subpattern @emph{that follows} it, so
-
-@example
-(a(?i)b)c
-@end example
-
-@noindent
-matches @samp{abc} and @samp{aBc} and no other strings (assuming
-case-sensitive matching is used). By this means, options can
-be made to have different settings in different parts of the
-pattern. Any changes made in one alternative do carry on
-into subsequent branches within the same subpattern. For
-example,
-
-@example
-(a(?i)b|c)
-@end example
-
-@noindent
-matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C},
-even though when matching @samp{C} the first branch is
-abandoned before the option setting.
-This is because the effects of option settings happen at
-compile time. There would be some very weird behaviour otherwise.
-
-@ignore
-There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA
-that can be changed in the same way as the Perl-compatible options by
-using the characters U and X respectively. The (?X) flag
-setting is special in that it must always occur earlier in
-the pattern than any of the additional features it turns on,
-even when it is at top level. It is best put at the start.
-@end ignore
-
-
-@node Non-capturing subpatterns
-@appendixsec Non-capturing subpatterns
-@cindex Perl-style regular expressions, non-capturing subpatterns
-
-Marking part of a pattern as a subpattern does two things.
-On one hand, it localizes a set of alternatives; on the other
-hand, it sets up the subpattern as a capturing subpattern (as
-defined above). The subpattern can be backreferenced and
-referenced in the right side of @code{s} commands.
-
-For example, if the string @samp{the red king} is matched against
-the pattern
-
-@example
-the ((red|white) (king|queen))
-@end example
-
-@noindent
-the captured substrings are @samp{red king}, @samp{red},
-and @samp{king}, and are numbered 1, 2, and 3.
-
-The fact that plain parentheses fulfil two functions is not
-always helpful. There are often times when a grouping
-subpattern is required without a capturing requirement. If an
-opening parenthesis is followed by @code{?:}, the subpattern does
-not do any capturing, and is not counted when computing the
-number of any subsequent capturing subpatterns. For example,
-if the string @samp{the white queen} is matched against the pattern
-
-@example
-the ((?:red|white) (king|queen))
-@end example
-
-@noindent
-the captured substrings are @samp{white queen} and @samp{queen},
-and are numbered 1 and 2. The maximum number of captured
-substrings is 99, while the maximum number of all subpatterns,
-both capturing and non-capturing, is 200.
-
-As a convenient shorthand, if any option settings are
-required at the start of a non-capturing subpattern, the
-option letters may appear between the @code{?} and the
-@code{:}. Thus the two patterns
-
-@example
-(?i:saturday|sunday)
-(?:(?i)saturday|sunday)
-@end example
-
-@noindent
-match exactly the same set of strings. Because alternative
-branches are tried from left to right, and options are not
-reset until the end of the subpattern is reached, an option
-setting in one branch does affect subsequent branches, so
-the above patterns match @samp{SUNDAY} as well as @samp{Saturday}.
-
-
-@node Repetition
-@appendixsec Repetition
-@cindex Perl-style regular expressions, repetitions
-
-Repetition is specified by quantifiers, which can follow any
-of the following items:
-
-@itemize @bullet
-@item
-a single character, possibly escaped
-
-@item
-the @code{.} special character
-
-@item
-a character class
-
-@item
-a back reference (see next section)
-
-@item
-a parenthesized subpattern (unless it is an assertion; @pxref{Assertions})
-@end itemize
-
-The general repetition quantifier specifies a minimum and
-maximum number of permitted matches, by giving the two
-numbers in curly brackets (braces), separated by a comma.
-The numbers must be less than 65536, and the first must be
-less than or equal to the second. For example:
-
-@example
-z@{2,4@}
-@end example
-
-@noindent
-matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own
-is not a special character. If the second number is omitted,
-but the comma is present, there is no upper limit; if the
-second number and the comma are both omitted, the quantifier
-specifies an exact number of required matches. Thus
-
-@example
-[aeiou]@{3,@}
-@end example
-
-@noindent
-matches at least 3 successive vowels, but may match many
-more, while
-
-@example
-\d@{8@}
-@end example
-
-@noindent
-matches exactly 8 digits. An opening curly bracket that
-appears in a position where a quantifier is not allowed, or
-one that does not match the syntax of a quantifier, is taken
-as a literal character. For example, @code{@{,6@}} is not a quantifier,
-but a literal string of four characters.@footnote{It
-raises an error if @option{-R} is not used.}
-
-The quantifier @samp{@{0@}} is permitted, causing the expression to
-behave as if the previous item and the quantifier were not
-present.
-
-For convenience (and historical compatibility) the three
-most common quantifiers have single-character abbreviations:
-
-@table @code
-@item *
-is equivalent to @code{@{0,@}}
-
-@item +
-is equivalent to @code{@{1,@}}
-
-@item ?
-is equivalent to @code{@{0,1@}}
-@end table
-
-It is possible to construct infinite loops by following a
-subpattern that can match no characters with a quantifier
-that has no upper limit, for example:
-
-@example
-(a?)*
-@end example
-
-Earlier versions of Perl used to give an error at
-compile time for such patterns. However, because there are
-cases where this can be useful, such patterns are now
-accepted, but if any repetition of the subpattern does in
-fact match no characters, the loop is forcibly broken.
-
-@cindex Greedy regular expression matching
-@cindex Perl-style regular expressions, stingy repetitions
-By default, the quantifiers are @dfn{greedy}, as in @sc{posix}
-mode, that is, they match as much as possible (up to the maximum
-number of permitted times), without causing the rest of the
-pattern to fail. The classic example of where this gives problems
-is in trying to match comments in C programs. These appear between
-the sequences @code{/*} and @code{*/} and within the sequence, individual
-@code{*} and @code{/} characters may appear. An attempt to match C
-comments by applying the pattern
-
-@example
-/\*.*\*/
-@end example
-
-@noindent
-to the string
-
-@example
-/* first comment */ not comment /* second comment */
-@end example
-
-@noindent
-fails, because it matches the entire string owing to the
-greediness of the @code{.*} item.
-
-However, if a quantifier is followed by a question mark, it
-ceases to be greedy, and instead matches the minimum number
-of times possible, so the pattern @code{/\*.*?\*/}
-does the right thing with the C comments. The meaning of the
-various quantifiers is not otherwise changed, just the preferred
-number of matches. Do not confuse this use of question
-mark with its use as a quantifier in its own right.
-Because it has two uses, it can sometimes appear doubled, as in
-
-@example
-\d??\d
-@end example
-
-which matches one digit by preference, but can match two if
-that is the only way the rest of the pattern matches.
-
-Note that greediness does not matter when specifying addresses,
-but can nevertheless be used to improve performance.
-
-@ignore
-If the PCRE_UNGREEDY option is set (an option which is not
-available in Perl), the quantifiers are not greedy by
-default, but individual ones can be made greedy by following
-them with a question mark. In other words, it inverts the
-default behaviour.
-@end ignore
-
-When a parenthesized subpattern is quantified with a minimum
-repeat count that is greater than 1 or with a limited maximum,
-more memory is required for the compiled pattern, in
-proportion to the size of the minimum or maximum.
-
-@cindex Perl-style regular expressions, single line
-If a pattern starts with @code{.*} or @code{.@{0,@}} and the
-@code{S} modifier is used, the pattern is implicitly anchored,
-because whatever follows will be tried against every character
-position in the subject string, so there is no point in
-retrying the overall match at any position after the first.
-PCRE treats such a pattern as though it were preceded by @code{\A}.
-
-When a capturing subpattern is repeated, the value captured
-is the substring that matched the final iteration. For example,
-after
-
-@example
-(tweedle[dume]@{3@}\s*)+
-@end example
-
-@noindent
-has matched @samp{tweedledum tweedledee} the value of the
-captured substring is @samp{tweedledee}. However, if there are
-nested capturing subpatterns, the corresponding captured
-values may have been set in previous iterations. For example,
-after
-
-@example
-/(a|(b))+/
-@end example
-
-matches @samp{aba}, the value of the second captured substring is
-@samp{b}.
-
-@node Backreferences
-@appendixsec Backreferences
-@cindex Perl-style regular expressions, backreferences
-
-Outside a character class, a backslash followed by a digit
-greater than 0 (and possibly further digits) is a back
-reference to a capturing subpattern earlier (i.e. to its
-left) in the pattern, provided there have been that many
-previous capturing left parentheses.
-
-However, if the decimal number following the backslash is
-less than 10, it is always taken as a back reference, and
-causes an error only if there are not that many capturing
-left parentheses in the entire pattern. In other words, the
-parentheses that are referenced need not be to the left of
-the reference for numbers less than 10. @xref{Backslash},
-for further details of the handling of digits following a backslash.
-
-A back reference matches whatever actually matched the capturing
-subpattern in the current subject string, rather than
-anything matching the subpattern itself. So the pattern
-
-@example
-(sens|respons)e and \1ibility
-@end example
-
-@noindent
-matches @samp{sense and sensibility} and @samp{response and responsibility},
-but not @samp{sense and responsibility}. If case-sensitive
-matching is in force at the time of the back reference, the
-case of letters is relevant. For example,
-
-@example
-((?i)blah)\s+\1
-@end example
-
-@noindent
-matches @samp{blah blah} and @samp{Blah Blah}, but not
-@samp{BLAH blah}, even though the original capturing
-subpattern is matched caselessly.
-
-There may be more than one back reference to the same subpattern.
-Also, if a subpattern has not actually been used in a
-particular match, any back references to it always fail. For
-example, the pattern
-
-@example
-(a|(bc))\2
-@end example
-
-@noindent
-always fails if it starts to match @samp{a} rather than
-@samp{bc}. Because there may be up to 99 back references, all
-digits following the backslash are taken as part of a potential
-back reference number; this is different from what happens
-in @sc{posix} mode. If the pattern continues with a digit
-character, some delimiter must be used to terminate the back
-reference. If the @code{X} modifier option is set, this can be
-whitespace. Otherwise an empty comment can be used, or the
-following character can be expressed in hexadecimal or octal.
-Note that back references above 9 are available only in the LHS pattern;
-it is not yet possible to specify more than 9 backreferences on the
-RHS of the @code{s} command.
-
-A back reference that occurs inside the parentheses to which
-it refers fails when the subpattern is first used, so, for
-example, @code{(a\1)} never matches. However, such references
-can be useful inside repeated subpatterns. For example, the
-pattern
-
-@example
-(a|b\1)+
-@end example
-
-@noindent
-matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa},
-etc. At each iteration of the subpattern, the back reference matches
-the character string corresponding to the previous iteration. In
-order for this to work, the pattern must be such that the first
-iteration does not need to match the back reference. This can be
-done using alternation, as in the example above, or by a
-quantifier with a minimum of zero.
-
-@node Assertions
-@appendixsec Assertions
-@cindex Perl-style regular expressions, assertions
-@cindex Perl-style regular expressions, asserting subpatterns
-
-An assertion is a test on the characters following or
-preceding the current matching point that does not actually
-consume any characters. The simple assertions coded as @code{\b},
-@code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$}
-are described above. More complicated assertions are coded as
-subpatterns. There are two kinds: those that look ahead of the
-current position in the subject string, and those that look behind it.
-
-@cindex Perl-style regular expressions, lookahead subpatterns
-An assertion subpattern is matched in the normal way, except
-that it does not cause the current matching position to be
-changed. Lookahead assertions start with @code{(?=} for positive
-assertions and @code{(?!} for negative assertions. For example,
-
-@example
-\w+(?=;)
-@end example
-
-@noindent
-matches a word followed by a semicolon, but does not include
-the semicolon in the match, and
-@example
-foo(?!bar)
-@end example
-
-@noindent
-matches any occurrence of @samp{foo} that is not followed by
-@samp{bar}.
-
-Note that the apparently similar pattern
-
-@example
-(?!foo)bar
-@end example
-
-@noindent
-@cindex Perl-style regular expressions, lookbehind subpatterns
-finds any occurrence of @samp{bar} even if it is preceded by
-@samp{foo}, because the assertion @code{(?!foo)} is always true
-when the next three characters are @samp{bar}. A lookbehind
-assertion is needed to exclude occurrences preceded by @samp{foo}.
-Lookbehind assertions start with @code{(?<=} for positive
-assertions and @code{(?<!} for negative assertions. So,
-
-@example
-(?<!foo)bar
-@end example
-
-achieves the required effect of finding an occurrence of
-@samp{bar} that is not preceded by @samp{foo}. The contents of a
-lookbehind assertion are restricted
-such that all the strings it matches must have a fixed
-length. However, if there are several alternatives, they do
-not all have to have the same fixed length. This is an extension
-compared with Perl 5.005, which requires all branches to match
-the same length of string. Thus
-
-@example
-(?<=dogs|cats|)
-@end example
-
-@noindent
-is permitted, but the apparently equivalent regular expression
-
-@example
-(?<!dogs?|cats?)
-@end example
-
-@noindent
-causes an error at compile time. Branches that match different
-length strings are permitted only at the top level of
-a lookbehind assertion: an assertion such as
-
-@example
-(?<=ab(c|de))
-@end example
-
-@noindent
-is not permitted, because its single top-level branch can
-match two different lengths, but it is acceptable if rewritten
-to use two top-level branches:
-
-@example
-(?<=abc|abde)
-@end example
-
-All this is required because lookbehind assertions simply
-move the current position back by the alternative's fixed
-width and then try to match. If there are
-insufficient characters before the current position, the
-match is deemed to fail. Lookbehinds, in conjunction with
-non-backtracking subpatterns, can be particularly useful for
-matching at the ends of strings; an example is given at the end
-of the section on non-backtracking subpatterns.
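The difference between `(?!foo)bar` and `(?<!foo)bar` discussed above can be checked with a short sketch. Python's `re` module is assumed here as a stand-in engine. It is stricter than the matcher described in this appendix: it rejects a lookbehind whose alternatives have different widths, so only the fixed-width case is exercised.

```python
import re

subject = "foobar bazbar"

# (?!foo)bar: the lookahead inspects the text starting at "bar" itself,
# so it never rejects anything and both occurrences are found.
ahead = [m.start() for m in re.finditer(r"(?!foo)bar", subject)]

# (?<!foo)bar: the lookbehind inspects the three characters before "bar".
behind = [m.start() for m in re.finditer(r"(?<!foo)bar", subject)]

# A lookbehind that can match strings of different lengths is refused
# when the pattern is compiled.
try:
    re.compile(r"(?<=dogs?|cats?)")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(ahead, behind, variable_width_ok)
```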
-
-Several assertions (of any sort) may occur in succession.
-For example,
-
-@example
-(?<=\d@{3@})(?<!999)foo
-@end example
-
-@noindent
-matches @samp{foo} preceded by three digits that are not @samp{999}.
-Notice that each of the assertions is applied independently
-at the same point in the subject string. First there is a
-check that the previous three characters are all digits, and
-then there is a check that the same three characters are not
-@samp{999}. This pattern does not match @samp{foo} preceded by six
-characters, the first three of which are digits and the last three
-of which are not @samp{999}. For example, it doesn't match
-@samp{123abcfoo}. A pattern to do that is
-
-@example
-(?<=\d@{3@}...)(?<!999)foo
-@end example
-
-@noindent
-This time the first assertion looks at the preceding six
-characters, checking that the first three are digits, and
-then the second assertion checks that the preceding three
-characters are not @samp{999}. Actually, assertions can be
-nested in any combination, so one can write this as
-
-@example
-(?<=\d@{3@}(?!999)...)foo
-@end example
-
-or
-
-@example
-(?<=\d@{3@}...(?<!999))foo
-@end example
-
-@noindent
-both of which might be considered more readable.
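A small sketch can confirm how consecutive assertions are applied at the same position. It uses Python's `re` module as a stand-in engine and covers only the two non-nested patterns; the variable names and the `999foo` and `123999foo` subjects are illustrative additions.

```python
import re

# Three digits immediately before "foo", and those three are not 999.
three_before = re.compile(r"(?<=\d{3})(?<!999)foo")

# Six characters before "foo": the first three are digits, and the
# last three are not 999.
six_before = re.compile(r"(?<=\d{3}...)(?<!999)foo")

results = {
    s: (bool(three_before.search(s)), bool(six_before.search(s)))
    for s in ("123foo", "999foo", "123abcfoo", "123999foo")
}
for subject, outcome in results.items():
    print(subject, outcome)
```

Each tuple shows whether the first and the second pattern matched the subject.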
-
-Assertion subpatterns are not capturing subpatterns, and may
-not be repeated, because it makes no sense to assert the
-same thing several times. If any kind of assertion contains
-capturing subpatterns within it, these are counted for the
-purposes of numbering the capturing subpatterns in the whole
-pattern. However, substring capturing is carried out only
-for positive assertions, because it does not make sense for
-negative assertions.
-
-Assertions count towards the maximum of 200 parenthesized
-subpatterns.
-
-@node Non-backtracking subpatterns
-@appendixsec Non-backtracking subpatterns
-@cindex Perl-style regular expressions, non-backtracking subpatterns
-
-With both maximizing and minimizing repetition, failure of
-what follows normally causes the repeated item to be evaluated
-again to see if a different number of repeats allows the
-rest of the pattern to match. Sometimes it is useful to
-prevent this, either to change the nature of the match, or
-to cause it to fail earlier than it otherwise might, when the
-author of the pattern knows there is no point in carrying
-on.
-
-Consider, for example, the pattern @code{\d+foo} when applied to
-the subject line
-
-@example
-123456bar
-@end example
-
-After matching all 6 digits and then failing to match @samp{foo},
-the normal action of the matcher is to try again with only 5
-digits matching the @code{\d+} item, and then with 4, and so on,
-before ultimately failing. Non-backtracking subpatterns
-provide the means for specifying that once a portion of the
-pattern has matched, it is not to be re-evaluated in this way,
-so the matcher would give up immediately on failing to match
-@samp{foo} the first time. The notation is another kind of special
-parenthesis, starting with @code{(?>} as in this example:
-
-@example
-(?>\d+)foo
-@end example
-
-This kind of parenthesis ``locks up'' the part of the pattern
-it contains once it has matched, and a failure further into
-the pattern is prevented from backtracking into it.
-Backtracking past it to previous items, however, works as
-normal.
-
-Non-backtracking subpatterns are not capturing subpatterns. Simple
-cases such as the above example can be thought of as a maximizing
-repeat that must swallow everything it can. So,
-while both @code{\d+} and @code{\d+?} are prepared to adjust the number of
-digits they match in order to make the rest of the pattern
-match, @code{(?>\d+)} can only match an entire sequence of digits.
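The "locking" behaviour can be illustrated with a sketch in Python's `re` module. Python accepts `(?>...)` directly only from version 3.11, so the sketch instead uses a common emulation: a lookahead that captures the run of digits, followed by a backreference that consumes it. This is a swapped-in technique rather than the construct itself, and the subject string and the group name `digits` are made up for the example.

```python
import re

subject = "1234"

# Ordinary greedy repeat: \d+ first takes "1234", then gives back the
# final "4" so that the literal 4 can match.
backtracking = re.match(r"\d+4", subject)

# Emulated non-backtracking group: the lookahead captures the longest run
# of digits and the backreference consumes exactly that text. Nothing can
# be given back, so the literal 4 has nothing left to match.
locked = re.match(r"(?=(?P<digits>\d+))(?P=digits)4", subject)

print(backtracking.group() if backtracking else None)
print(locked)
```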
-
-This construction can of course contain arbitrarily complicated
-subpatterns, and it can be nested.
-
-@cindex Perl-style regular expressions, lookbehind subpatterns
-Non-backtracking subpatterns can be used in conjunction with lookbehind
-assertions to specify efficient matching at the end
-of the subject string. Consider a simple pattern such as
-
-@example
-abcd$
-@end example
-
-@noindent
-when applied to a long string which does not match. Because
-matching proceeds from left to right, @command{sed} will look for
-each @samp{a} in the subject and then see if what follows matches
-the rest of the pattern. If the pattern is specified as
-
-@example
-^.*abcd$
-@end example
-
-@noindent
-the initial @code{.*} matches the entire string at first, but when
-this fails (because there is no following @samp{a}), it backtracks
-to match all but the last character, then all but the
-last two characters, and so on. Once again the search for
-@samp{a} covers the entire string, from right to left, so we are
-no better off. However, if the pattern is written as
-
-@example
-^(?>.*)(?<=abcd)
-@end example
-
-there can be no backtracking for the @code{.*} item; it can match
-only the entire string. The subsequent lookbehind assertion
-does a single test on the last four characters. If it fails,
-the match fails immediately. For long strings, this approach
-makes a significant difference to the processing time.
-
-When a pattern contains an unlimited repeat inside a subpattern
-that can itself be repeated an unlimited number of
-times, the use of a non-backtracking subpattern is the only way to
-avoid some failing matches taking a very long time
-indeed.@footnote{Actually, the matcher embedded in @value{SSED}
-tries to do something for this in the simplest cases,
-like @code{([^b]*b)*}. These cases are actually quite
-common: they happen for example in a regular expression
-like @code{\/\*([^*]*\*)*\/} which matches C comments.}
-
-The pattern
-
-@example
-(\D+|<\d+>)*[!?]
-@end example
-
-@noindent
-matches an unlimited number of substrings that either consist
-of non-digits, or digits enclosed in angle brackets, followed by
-an exclamation or question mark. When it matches, it runs quickly.
-However, if it is applied to
-
-@example
-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-@end example
-
-@noindent
-it takes a long time before reporting failure. This is
-because the string can be divided between the two repeats in
-a large number of ways, and all have to be tried.@footnote{The
-example used @code{[!?]} rather than a single character at the end,
-because both @value{SSED} and Perl have an optimization that allows
-for fast failure when a single character is used. They
-remember the last single character that is required for a
-match, and fail early if it is not present in the string.}
-
-If the pattern is changed to
-
-@example
-((?>\D+)|<\d+>)*[!?]
-@end example
-
-sequences of non-digits cannot be broken, and failure happens
-quickly.
-
-@node Conditional subpatterns
-@appendixsec Conditional subpatterns
-@cindex Perl-style regular expressions, conditional subpatterns
-
-It is possible to cause the matching process to obey a subpattern
-conditionally or to choose between two alternative
-subpatterns, depending on the result of an assertion, or
-whether a previous capturing subpattern matched or not. The
-two possible forms of conditional subpattern are
-
-@example
-(?(@var{condition})@var{yes-pattern})
-(?(@var{condition})@var{yes-pattern}|@var{no-pattern})
-@end example
-
-If the condition is satisfied, the yes-pattern is used; otherwise
-the no-pattern (if present) is used. If there are more than two
-alternatives in the subpattern, a compile-time error occurs.
-
-There are two kinds of condition. If the text between the
-parentheses consists of a sequence of digits, the condition
-is satisfied if the capturing subpattern of that number has
-previously matched. The number must be greater than zero.
-Consider the following pattern, which contains non-significant
-white space to make it more readable (assume the @code{X} modifier)
-and to divide it into three parts for ease of discussion:
-
-@example
-( \( )? [^()]+ (?(1) \) )
-@end example
-
-The first part matches an optional opening parenthesis, and
-if that character is present, sets it as the first captured
-substring. The second part matches one or more characters
-that are not parentheses. The third part is a conditional
-subpattern that tests whether the first set of parentheses
-matched or not. If they did, that is, if the subject started
-with an opening parenthesis, the condition is true, and so
-the yes-pattern is executed and a closing parenthesis is
-required. Otherwise, since no-pattern is not present, the
-subpattern matches nothing. In other words, this pattern
-matches a sequence of non-parentheses, optionally enclosed
-in parentheses.
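The pattern above can be tried as written in any engine that supports group-number conditions. The following is a minimal sketch in Python's `re` module, where `re.VERBOSE` plays the role of the `X` modifier; the test subjects are illustrative.

```python
import re

# The three parts: optional "(", a run of non-parentheses, and a ")" that
# is required only if group 1 took part in the match.
pattern = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

results = {s: bool(pattern.fullmatch(s)) for s in ("(abc)", "abc", "(abc", "abc)")}
for subject, matched in results.items():
    print(subject, matched)
```

The sketch uses `fullmatch` so that an unbalanced parenthesis cannot simply be left outside the match.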
-
-@cindex Perl-style regular expressions, lookahead subpatterns
-If the condition is not a sequence of digits, it must be an
-assertion. This may be a positive or negative lookahead or
-lookbehind assertion. Consider this pattern, again containing
-non-significant white space, and with the two alternatives
-on the second and third lines:
-
-@example
-(?(?=...[a-z])
- \d\d-[a-z]@{3@}-\d\d |
- \d\d-\d\d-\d\d )
-@end example
-
-The condition is a positive lookahead assertion that matches
-a letter that is three characters away from the current point.
-If a letter is found, the subject is matched against the first
-alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
-letters and @var{dd} are digits); otherwise it is matched against
-the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.
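Not every engine accepts an assertion as a condition; Python's `re` module, for instance, allows only a group number or name there. Under that assumption, the same choice can be sketched with plain alternation: each branch is guarded by the assertion or by its negation. This is an emulation, not the conditional construct itself.

```python
import re

pattern = re.compile(r"""
    (?= ...[a-z] ) \d\d-[a-z]{3}-\d\d    # condition holds: dd-aaa-dd
  | (?! ...[a-z] ) \d\d-\d\d-\d\d        # condition fails: dd-dd-dd
""", re.VERBOSE)

results = {s: bool(pattern.fullmatch(s)) for s in ("12-abc-34", "12-34-56", "12-ab-34")}
for subject, matched in results.items():
    print(subject, matched)
```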
-
-
-@node Recursive patterns
-@appendixsec Recursive patterns
-@cindex Perl-style regular expressions, recursive patterns
-@cindex Perl-style regular expressions, recursion
-
-Consider the problem of matching a string in parentheses,
-allowing for unlimited nested parentheses. Without the use
-of recursion, the best that can be done is to use a pattern
-that matches up to some fixed depth of nesting. It is not
-possible to handle an arbitrary nesting depth. Perl 5.6 has
-provided an experimental facility that allows regular
-expressions to recurse (amongst other things). It does this
-by interpolating Perl code in the expression at run time,
-and the code can refer to the expression itself. A Perl pattern
-to solve the parentheses problem can be created like
-this:
-
-@example
-$re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
-@end example
-
-The @code{(?p@{...@})} item interpolates Perl code at run time,
-and in this case refers recursively to the pattern in which it
-appears. Obviously, @command{sed} cannot support the interpolation of
-Perl code. Instead, the special item @code{(?R)} is provided for
-the specific case of recursion. This pattern solves the
-parentheses problem (assume the @code{X} modifier option is used
-so that white space is ignored):
-
-@example
-\( ( (?>[^()]+) | (?R) )* \)
-@end example
-
-First it matches an opening parenthesis. Then it matches any
-number of substrings which can either be a sequence of
-non-parentheses, or a recursive match of the pattern itself
-(i.e. a correctly parenthesized substring). Finally there is
-a closing parenthesis.
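To make the three steps concrete, here is a sketch of the same recognition written as an ordinary recursive function in Python. It is a hand-written substitute for the regular expression, since Python's `re` module has no `(?R)` item; the function name `match_parens` is hypothetical.

```python
def match_parens(subject: str, start: int = 0) -> int:
    """Return the index just past a balanced group at `start`, or -1."""
    # Step 1: an opening parenthesis is required.
    if start >= len(subject) or subject[start] != "(":
        return -1
    i = start + 1
    # Step 2: any number of non-parentheses or nested balanced groups.
    while i < len(subject):
        ch = subject[i]
        if ch == ")":
            # Step 3: the closing parenthesis ends this level.
            return i + 1
        if ch == "(":
            i = match_parens(subject, i)  # the counterpart of (?R)
            if i == -1:
                return -1
        else:
            i += 1
    return -1  # ran out of input before the closing parenthesis


print(match_parens("(ab(cd)ef)"))
print(match_parens("(a(b)"))
```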
-
-This particular example pattern contains nested unlimited
-repeats, and so the use of a non-backtracking subpattern for
-matching strings of non-parentheses is important when applying
-the pattern to strings that do not match. For example, when
-it is applied to
-
-@example
-(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
-@end example
-
-it yields a ``no match'' response quickly. However, if a
-non-backtracking subpattern is not used, the match runs
-for a very long time indeed because there are so many different
-ways the @code{+} and @code{*} repeats can carve up the subject,
-and all have to be tested before failure can be reported.
-
-The values set for any capturing subpatterns are those from
-the outermost level of the recursion at which the subpattern
-value is set. If the pattern above is matched against
-
-@example
-(ab(cd)ef)
-@end example
+@page
+@node GNU Free Documentation License
+@appendix GNU Free Documentation License
-@noindent
-the value for the capturing parentheses is @samp{ef}, which is
-the last value taken on at the top level.
-
-@node Comments
-@appendixsec Comments
-@cindex Perl-style regular expressions, comments
-
-The sequence @code{(?#} marks the start of a comment which continues
-up to the next closing parenthesis. Nested parentheses
-are not permitted. The characters that make up a comment
-play no part in the pattern matching at all.
-
-@cindex Perl-style regular expressions, extended
-If the @code{X} modifier option is used, an unescaped @code{#} character
-outside a character class introduces a comment that continues
-up to the next newline character in the pattern.
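Both comment forms can be sketched in Python's `re` module, which is assumed here as a stand-in engine; `re.VERBOSE` corresponds to the `X` modifier, and the patterns are illustrative.

```python
import re

# (?#...) is skipped entirely, wherever it appears in the pattern.
inline = re.compile(r"ab(?#this text is ignored)c")

# In extended mode an unescaped # outside a character class starts a
# comment that runs to the end of the line.
extended = re.compile(r"""
    \d+   # one or more digits
    [#]   # inside a character class, # is an ordinary character
    \w+   # one or more word characters
""", re.VERBOSE)

print(bool(inline.fullmatch("abc")))
print(bool(extended.fullmatch("12#ab")))
```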
-@end ifset
+@include fdl.texi
@page