Copyright © 2002 Paul Sheer.
Next: 9. Processes, Environment Variables Up: rute Previous: 7. Shell Scripting   Contents
Subsections
- 8.1 Introduction
- 8.2 Tutorial
- 8.3 Piping Using | Notation
- 8.4 A Complex Piping Example
- 8.5 Redirecting Streams with >&
- 8.6 Using sed to Edit Streams
- 8.7 Regular Expression Subexpressions
- 8.8 Inserting and Deleting Lines

8. Streams and sed -- The Stream Editor
The ability to use pipes is one of the powers of UNIX. This is one of the principal deficiencies of some non-UNIX systems. Pipes used on the command-line as explained in this chapter are a neat trick, but pipes used inside C programs enormously simplify program interaction. Without pipes, huge amounts of complex and buggy code usually need to be written to perform simple tasks. It is hoped that this chapter will give the reader an idea of why UNIX is such a ubiquitous and enduring standard.
8.1 Introduction
The commands grep, echo, df and so on print some output to the screen. In fact, what is happening on a lower level is that they are printing characters one by one into a theoretical data stream (also called a pipe) called the stdout pipe. The shell itself performs the action of reading those characters one by one and displaying them on the screen. The word pipe means exactly that: a program places data in one end of a funnel while another program reads that data from the other end. Pipes allow two separate programs to perform simple communications with each other. In this case, the program is merely communicating with the shell in order to display some output.
The same is true of the cat command explained previously. This command, when run with no arguments, reads from the stdin pipe. By default, this pipe is the keyboard. One further pipe is the stderr pipe, to which a program writes error messages. It is not possible to see whether a program message comes from the stderr or the stdout pipe, because usually both are directed to the screen. Good programs, however, always write to the appropriate pipe so that output can be specially separated for diagnostic purposes if need be.
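A quick way to watch stdout and stderr behave differently is the following sketch (the file name out.txt is arbitrary; >&2, which makes echo write to stderr, belongs to the redirection operators covered in Section 8.5):

```shell
# echo normally writes to stdout; with >&2 it writes to stderr instead
{ echo "normal output"; echo "error message" >&2; } > out.txt
# only the stdout line was captured in out.txt;
# the stderr line still appeared on the screen
cat out.txt
```

Because the two messages travel through different pipes, redirecting stdout alone leaves the error message on the terminal.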
8.2 Tutorial

Create a text file with lots of lines that contain the word GNU and one line that contains the word GNU as well as the word Linux. Then run grep GNU myfile.txt. The result is printed to stdout as usual.
Now try grep GNU myfile.txt > gnu_lines.txt. What is happening here is that the output of the grep command is being redirected into a file. The > gnu_lines.txt tells the shell to create a new file gnu_lines.txt and to fill it with any output from stdout instead of displaying the output as it usually does. If the file already exists, it will be truncated [shortened to zero length].
Now suppose you want to append further output to this file. Using >> instead of > does not truncate the file, but appends output to it. Try

    echo "morestuff" >> gnu_lines.txt

then view the contents of gnu_lines.txt.
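The difference between > and >> can be seen in a few lines (the file name demo.txt is arbitrary):

```shell
echo "first line"  > demo.txt   # > creates the file, or truncates it if it exists
echo "second line" >> demo.txt  # >> appends without truncating
cat demo.txt
# → first line
#   second line
echo "third line" > demo.txt    # > truncates again: only this line remains
cat demo.txt
# → third line
```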
8.3 Piping Using | Notation

The real power of pipes is realized when one program can read from the output of another program. Consider the grep command, which reads from stdin when given no arguments; run grep with one argument on the command-line:

    [root@cericon]# grep GNU
    A line without that word in it
    Another line without that word in it
    A line with the word GNU in it
    A line with the word GNU in it
    I have the idea now
    ^C
    #

grep's default behavior is to read from stdin when no files are given. As you can see, it is doing its usual work of printing lines that have the word GNU in them. Hence, lines containing GNU will be printed twice--as you type them in and again when grep reads them and decides that they contain GNU.
Now try grep GNU myfile.txt | grep Linux. The first grep outputs all lines with the word GNU in them to stdout. The | specifies that all stdout is to be fed as stdin (as we just did above) into the next command, which is also a grep command. The second grep command scans that data for lines with the word Linux in them. grep is often used this way as a filter [something that screens data] and can be used multiple times, for example,

    grep L myfile.txt | grep i | grep n | grep u | grep x

The < character redirects the contents of a file in place of stdin. In other words, the contents of a file replace what would normally come from a keyboard. Try

    grep GNU < gnu_lines.txt
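The two-stage filter and the < redirection can be tried on a small file created with printf (the file name pipedemo.txt is arbitrary):

```shell
printf 'GNU alone\nLinux alone\nGNU and Linux together\n' > pipedemo.txt

# only the line containing both words survives both filters
grep GNU pipedemo.txt | grep Linux
# → GNU and Linux together

# redirecting the file into stdin gives the same matches as naming the file
grep GNU < pipedemo.txt
# → GNU alone
#   GNU and Linux together
```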
8.4 A Complex Piping Example

In Chapter 5 we used grep on a dictionary to demonstrate regular expressions. This is how a dictionary of words can be created (your dictionary might be under /var/share/ or under /usr/lib/aspell instead):

    cat /usr/lib/ispell/english.hash | strings | tr 'A-Z' 'a-z' \
        | grep '^[a-z]' | sort -u > mydict

[A backslash \ as the last character on a line indicates that the line is to be continued. You can leave out the \ but then you must leave out the newline as well -- this is known as line continuation.]
The file english.hash contains the UNIX dictionary normally used for spell checking. With a bit of filtering, you can create a dictionary that will make solving crossword puzzles a breeze. First, we use the command strings, explained previously, to extract readable bits of text. Here we are using its alternate mode of operation, where it reads from stdin when no files are specified on its command-line. The command tr (abbreviated from translate -- see tr(1)) then converts upper to lower case. The grep command then filters out lines that do not start with a letter. Finally, the sort command sorts the words in alphabetical order. The -u option stands for unique, and specifies that duplicate lines of text should be stripped. Now try less mydict.
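The same filtering stages can be tried on a tiny fixed input instead of english.hash, so the result is predictable (the sample words are arbitrary):

```shell
printf 'Apple\nbanana\nApple\n42\ncherry\n' \
  | tr 'A-Z' 'a-z' \
  | grep '^[a-z]' \
  | sort -u
# → apple
#   banana
#   cherry
```

tr lowercases everything, grep drops the line starting with a digit, and sort -u removes the duplicate "apple".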
8.5 Redirecting Streams with >&

Try the command ls nofile.txt > A. We expect that ls will give an error message if the file doesn't exist. The error message is, however, displayed and not written into the file A. The reason is that ls has written its error message to stderr while > has only redirected stdout. The way to get both stdout and stderr to go to the same file is to use a redirection operator. As far as the shell is concerned, stdout is called 1 and stderr is called 2, and commands can be appended with a redirection like 2>&1 to dictate that stderr is to be mixed into the output of stdout. The actual words stderr and stdout are only used in C programming, where the numbers 1 and 2 are known as file numbers or file descriptors. Try the following:
    touch existing_file
    rm -f non-existing_file
    ls existing_file non-existing_file

ls will output two lines: a line containing a listing for the file existing_file and a line containing an error message to explain that the file non-existing_file does not exist. The error message would have been written to stderr or file descriptor number 2, and the remaining line would have been written to stdout or file descriptor number 1.
Next we try

    ls existing_file non-existing_file 2>A
    cat A

Now A contains the error message, while the remaining output came to the screen. Now try

    ls existing_file non-existing_file 1>A
    cat A

The notation 1>A is the same as >A because the shell assumes that you are referring to file descriptor 1 when you don't specify a file descriptor. Now A contains the stdout output, while the error message has been redirected to the screen.
Now try

    ls existing_file non-existing_file 1>A 2>&1
    cat A

Now A contains both the error message and the normal output. The >& is called a redirection operator. x>&y tells the shell to write pipe x into pipe y. Redirection is specified from right to left on the command-line. Hence, the above command means to mix stderr into stdout and then to redirect stdout to the file A.

Finally,

    ls existing_file non-existing_file 2>A 1>&2
    cat A

We notice that this has the same effect, except that here we are doing the reverse: redirecting stdout into stderr and then redirecting stderr into a file A.

To see what happens if we redirect in reverse order, we can try

    ls existing_file non-existing_file 2>&1 1>A
    cat A

which means to redirect stdout into a file A, and then to redirect stderr into stdout. This command will therefore not mix stderr and stdout because the redirection to A came first.
8.6 Using sed to Edit Streams

ed used to be the standard text editor for UNIX. It is cryptic to use but is compact and programmable. sed stands for stream editor and is the only incarnation of ed that is commonly used today. sed allows editing of files non-interactively. In the way that grep can search for words and filter lines of text, sed can do search-replace operations and insert and delete lines in text files. sed is one of those programs with no man page to speak of. Do info sed to see sed's comprehensive info pages with examples.

The most common use of sed is to replace words in a stream with alternative words. sed reads from stdin and writes to stdout. Like grep, it is line buffered, which means that it reads one line in at a time and then writes that line out again after performing whatever editing operations are required. Replacements are typically done with

    cat <file> | sed -e 's/<search-regexp>/<replace-text>/<option>' \
        > <resultfile>

where <search-regexp> is a regular expression, <replace-text> is the text you would like to replace each occurrence with, and <option> is nothing or g, which means to replace every occurrence on the same line (usually sed just replaces the first occurrence of the regular expression on each line). (There are other <option>s; see the sed info page.) For demonstration, type

    sed -e 's/e/E/g'

and type out a few lines of English text.
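The effect of the g option can be seen on a single line piped in with echo:

```shell
echo 'see the trees' | sed -e 's/e/E/'    # first occurrence only: sEe the trees
echo 'see the trees' | sed -e 's/e/E/g'   # every occurrence:      sEE thE trEEs
```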
8.7 Regular Expression Subexpressions

This section explains how to do the apparently complex task of moving text around within lines. Consider, for example, the output of ls: say you want to automatically strip out only the size column. sed can do this sort of editing if you use the special \( \) notation to group parts of the regular expression together. Consider the following example:

    sed -e 's/\(\<[^ ]*\>\)\([ ]*\)\(\<[^ ]*\>\)/\3\2\1/g'

Here sed is searching for the expression \<.*\>[ ]*\<.*\>. From the chapter on regular expressions, we can see that it matches a whole word, an arbitrary amount of whitespace, and then another whole word. The \( \) groups these three so that they can be referred to in <replace-text>. Each part of the regular expression inside \( \) is called a subexpression of the regular expression. Each subexpression is numbered--namely, \1, \2, etc. Hence, \1 in <replace-text> is the first \<[^ ]*\>, \2 is [ ]*, and \3 is the second \<[^ ]*\>.

Now test to see what happens when you run this:

    sed -e 's/\(\<[^ ]*\>\)\([ ]*\)\(\<[^ ]*\>\)/\3\2\1/g'
    GNU Linux is cool
    Linux GNU cool is
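The same command can be run non-interactively by piping a line in with echo (the \< and \> word-boundary notation is a GNU sed feature, so this assumes GNU sed):

```shell
# swap each pair of adjacent words: \1 and \3 change places, \2 keeps the spacing
echo 'GNU Linux is cool' \
  | sed -e 's/\(\<[^ ]*\>\)\([ ]*\)\(\<[^ ]*\>\)/\3\2\1/g'
# → Linux GNU cool is
```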
To return to our ls example (note that this is just an example; to count file sizes you should instead use the du command), think about how we could sum the byte sizes of all the files in a directory:

    expr 0 `ls -l | grep '^-' | \
        sed 's/^\([^ ]*[ ]*\)\{4,4\}\([0-9]*\).*$/ + \2/'`

We know that ls -l output lines start with - for ordinary files. So we use grep to strip lines not starting with -.
If we do an ls -l, we see that the output is divided into four columns of stuff we are not interested in, and then a number indicating the size of the file. A column (or field) can be described by the regular expression [^ ]*[ ]*, that is, a length of text with no whitespace, followed by a length of whitespace. There are four of these, so we bracket it with \( \) and then use the \{ \} notation to specify that we want exactly 4. After that comes our number, [0-9]*, and then any trailing characters, which we are not interested in, .*$. Notice here that we have neglected to use \< \> notation to indicate whole words. The reason is that sed tries to match the maximum number of characters legally allowed and, in the situation we have here, that has exactly the same effect. If you haven't yet figured it out, we are trying to get that column of byte sizes into a format like

    + 438
    + 1525
    + 76
    + 92146
so that expr can understand it. Hence, we replace each line with subexpression \2 and a leading + sign. Backquotes give the output of this to expr, which studiously sums them, ignoring any newline characters as though the summation were typed in on a single line. There is one minor problem here: the first line contains a + with nothing before it, which will cause expr to complain. To get around this, we can just add a 0 to the expression, so that it becomes 0 + ....
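The field-skipping sed expression can be tried on a single fixed line standing in for ls -l output, so the result is predictable (the sample line and file name notes.txt are made up for this sketch):

```shell
# skip exactly four fields, keep the fifth (the size), drop the rest
line='-rw-r--r-- 1 root root 438 Jan 1 12:00 notes.txt'
echo "$line" | sed 's/^\([^ ]*[ ]*\)\{4,4\}\([0-9]*\).*$/ + \2/'
# prints " + 438"
```

Each repetition of \([^ ]*[ ]*\) consumes one field plus its trailing whitespace, so after four repetitions the match sits exactly at the size column.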
8.8 Inserting and Deleting Lines

sed can perform a few operations that make it easy to write scripts that edit configuration files for you. For instance,

    sed -e '7a\
    an extra line.\
    another one.\
    one more.'

appends three lines after line 7, whereas

    sed -e '7i\
    an extra line.\
    another one.\
    one more.'

inserts three lines before line 7. Then

    sed -e '3,5D'

deletes lines 3 through 5.

In sed terminology, the numbers here are called addresses, which can also be regular expression matches. To demonstrate:

    sed -e '/Dear Henry/,/Love Jane/D'

deletes all the lines starting from a line matching the regular expression Dear Henry up to a line matching Love Jane (or the end of the file if one does not exist).
This behavior applies just as well to insertions:

    sed -e '/Love Jane/i\
    Love Carol\
    Love Beth'

Note that the $ symbol indicates the last line:

    sed -e '$i\
    The new second last line\
    The new last line.'

And finally, the negation symbol, !, is used to match all lines not specified; for instance,

    sed -e '7,11!D'

deletes all lines except lines 7 through 11.
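These address operations can be tried on a small generated file (the file name lines.txt and its contents are arbitrary; lowercase d, the more common form of the delete command, is used here):

```shell
printf 'one\ntwo\nthree\nfour\nfive\n' > lines.txt

# delete lines 3 through 5
sed -e '3,5d' lines.txt
# → one
#   two

# $i inserts a new line just before the last line ("five")
sed -e '$i\
nearly the last line' lines.txt
```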