UNIX Filters

JAVAERO – UNIX was designed with simplicity in mind. Each command is supposed to do only one thing and do it well. However, this does not mean that
UNIX can only do simple things. On the contrary, highly complex tasks can be done by stringing these commands together: the output of one command can be
the input of another.

$ cat poem.txt | less

In this example, the command “cat” writes the contents of the file poem.txt to its standard output, which is then fed as input to the “less” command. This is quite a simplistic example; in fact, we can just do “less poem.txt” to have the same effect. However, it serves to illustrate the concept of a unix pipe. The symbol “|” is called the pipe symbol and separates commands. Notice that the word immediately to the right of the “|” symbol will be interpreted by the shell as a command. An error will occur if the command does not exist.
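As a slightly longer illustration of the same idea, here is a minimal sketch that chains several commands. The sample words are made up for this example, and sort, uniq and head are standard tools that are not otherwise covered here.

```shell
# Find the most frequent word in a small list.
# printf writes one sample word per line; each "|" feeds the output of
# the command on its left to the command on its right.
word_counts=$(printf '%s\n' fox cat fox dog fox cat |
    sort |          # put identical lines next to each other
    uniq -c |       # collapse each group and prefix it with its count
    sort -rn |      # order by count, largest first
    head -n 1)      # keep only the first line
echo "$word_counts"
```

The last line prints the count 3 followed by the word fox (the amount of leading whitespace depends on the uniq implementation).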

Now that we have some familiarity with unix, let us add more to our arsenal of commands. One very common task is finding files whose names match a certain pattern. The command to do this is “find”.

$ find . -name test.c

This command will search for files in the current directory named test.c. It will also search for test.c in all subdirectories contained in the current directory. The first argument to find is the directory to search. In this case, the . tells find to search in the current directory. The option “-name test.c” tells find that the filename must match test.c. In this command, we specified an exact match. We can also tell find to search for all files ending in .c:

$ find . -name \*.c

The “\” backslash is necessary in order for the command to work. The “\” is an escape character. We will have more to say about it in the coming sections.

More advanced shell.

Some characters are treated specially by the shell. When the shell encounters them in a command, it does some preprocessing before executing the command. The most common of these characters have something to do with filename expansion.

Filename expansion allows us to specify a group of files using a pattern. For example, the pattern *.c will expand into all files whose names end in .c.

$ ls

$ ls *.c

is the same as

$ ls blah.c foo.c …

In other words, the shell first looks for all files ending in .c and constructs a list of all these files and makes them the argument to the “ls” command.

The following characters can be used to construct a pattern for filename expansion.

* – matches zero or more characters
? – matches a single character
[] – matches any single character contained within the “[]”. If “^” (or “!”) is the first character of the list, it matches any single character not in the list.

$ ls
README
license.txt
main.c
pr.h

To match main.c and pr.h you can either do

$ ls *.?

or

$ ls *.[hc]

The first command will match all files with a single-character extension. The second will match all .c and .h files and lists the same files as the command ls *.h *.c.

To match the file README, you can use

$ ls R*
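The patterns above can be tried safely in a scratch directory. The sketch below recreates the four files from the listing in a temporary directory (created with mktemp, an assumption of this example) and uses echo to show what the shell passes to a command after expansion.

```shell
# Work in a throwaway directory so no real files are touched.
start_dir=$PWD
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch README license.txt main.c pr.h

# The shell expands each pattern before echo runs.
single_char_ext=$(echo *.?)     # names ending in a dot plus one character
c_or_h=$(echo *.[hc])           # names ending in .h or .c
starts_with_r=$(echo R*)        # names beginning with R

echo "$single_char_ext"
echo "$c_or_h"
echo "$starts_with_r"

# Clean up.
cd "$start_dir" && rm -r "$demo_dir"
```

The first two lines printed are both main.c pr.h, and the third is README. Note that license.txt is not matched by *.? because its extension has three characters.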

Exercise: Suppose we have the following files in the directory

$ ls
README
INSTALL
main.c
init.c
Makefile

How do you match README and INSTALL?

Since the characters *, ?, and [] are interpreted by the shell, it is best to avoid filenames that contain them. For example, if we have a file named *, we can’t just do this:

$ rm *

for this will match all files in the directory and will delete all of them. To match the file named * we must quote the * using the “\” backslash character.

$ rm \*

will delete the file named *.
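To see the difference without risking real files, the following sketch works in a temporary directory (created with mktemp, an assumption of this example) that holds a file literally named * alongside two ordinary files with made-up names.

```shell
start_dir=$PWD
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Single quotes also stop expansion, so this creates a file named "*".
touch '*' notes.txt main.c

rm \*                     # escaped: removes only the file named *
remaining=$(echo *)       # unescaped: expands to every file that is left
echo "$remaining"

cd "$start_dir" && rm -r "$demo_dir"
```

After the escaped rm, the two ordinary files are still there, so the script prints main.c notes.txt. Writing rm '*' with single quotes has the same effect as the backslash.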

However, this expansion only applies to files in the current directory and not to files in its subdirectories. To find all files ending in .c in all subdirectories of the current folder, we use the find command.

$ find . -name \*.c

Notice the backslash character. This is necessary in order for the shell not to expand the *.c.
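A small sketch of this, using a temporary directory tree with hypothetical file names (the directory is created with mktemp, and sort is added only to make the order of the results predictable):

```shell
start_dir=$PWD
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
mkdir -p src/lib
touch top.c src/main.c src/lib/util.c src/notes.txt

# The backslash keeps the shell from expanding *.c, so find receives the
# pattern itself and applies it in every directory it visits.
found=$(find . -name \*.c | sort)
echo "$found"

cd "$start_dir" && rm -r "$demo_dir"
```

This lists the three .c files, including the two in subdirectories, and skips notes.txt. Without the backslash, the shell would first expand *.c to top.c, and find would then look only for files with exactly that name.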

REGULAR EXPRESSIONS

A very important skill that any unix professional should acquire is the ability to create and use regular expressions. A regular expression is a way to specify a pattern that can be used in matching text. Regular expressions are similar to shell filename expansions, but they are more powerful. Their power is hard to convey in writing; it is best appreciated through practice.

Three important filters in unix make extensive use of regular expressions, namely: sed, grep and awk.

The following characters are used to construct regular expressions.

^ – matches the beginning of a string
$ – matches the end of a string
. – matches any single character
[..] – matches a single character in the list
[^..] – matches a single character NOT in the list
(..) – used to group a regular expression
| – used to indicate alternative regular expression
* – an operator, it specifies that the preceding regular expression be matched zero or more times
+ – similar to the * operator but requires the preceding regular expression to be matched at least once.
? – similar to the * operator but requires the preceding regular expression to be matched at most once.
{N} – also known as a braced regular expression, is an operator that requires the preceding regex to be matched exactly N times
{N,} – at least N times
{N,M} – N to M times.

The above list will already produce a lot of possible regular expressions. In daily work, we usually only use a few of these constructs.
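As a quick sketch of the repetition operators, the script below runs a few patterns against a made-up word list. It assumes grep -E, which selects extended regular expressions so that +, ? and {N} act as operators without a backslash. The helper match_words is hypothetical and exists only for this example.

```shell
# Print, on one line, the sample words that match the pattern in $1.
match_words() {
    printf '%s\n' ct cat caat caaat | grep -E "$1" | paste -sd ' ' -
}

zero_or_more=$(match_words '^ca*t$')      # ct cat caat caaat
one_or_more=$(match_words '^ca+t$')       # cat caat caaat
at_most_one=$(match_words '^ca?t$')       # ct cat
exactly_two=$(match_words '^ca{2}t$')     # caat
two_or_more=$(match_words '^ca{2,}t$')    # caat caaat

echo "$zero_or_more"
echo "$one_or_more"
echo "$at_most_one"
echo "$exactly_two"
echo "$two_or_more"
```

The ^ and $ anchors force each pattern to cover the whole word, which makes the effect of each operator easier to see.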

Let us give examples of the above constructs.

$ touch Readme theReadme

To match the file “Readme”,

$ /bin/ls -1 | grep "^Readme"

To match file names that end in “Readme” (here both Readme and theReadme),

$ /bin/ls -1 | grep "Readme$"

$ touch main.c main.x test.C test.cxx
$ /bin/ls -1 | grep "\.[ch]$"

will match all files ending in .c or .h (here, only main.c). Notice that the “.” is matched literally, so it must be escaped by a backslash. The $ requires that the character “c” or “h” be the last character on the line. Without the $, the pattern will also match test.cxx.

$ /bin/ls -1 | grep "\.[^ch]"

will match all files in which the dot is followed by a character other than c or h (here, main.x and test.C).

Suppose we have the following files:

$ /bin/ls *Readme*
theReadme
Readme
dontReadme

If we want to match on theReadme and Readme, we can write

$ /bin/ls -1 | grep "^\(the\)*Readme"

Notice that we grouped “the” as a regular expression. In grep’s basic syntax, the grouping parentheses are written “\(” and “\)”. The * operator after the group will match zero or more occurrences of “the”. The ^ at the beginning of the pattern acts as an anchor. It requires that the pattern occur at the beginning of the line. This prevents the file dontReadme from being matched.

$ touch peterson johnson benson

To match peterson and johnson we use the alternation operator

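$ /bin/ls -1 | grep "\(peter\|john\)son"

In grep’s basic regular expression syntax, the parentheses and the “|” must be preceded by a backslash in order to act as operators (the “\|” form is a GNU extension). The quotes are what protect the pattern from the shell.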
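The same match can also be written in extended syntax. The sketch below assumes grep -E, where grouping and alternation are written without backslashes, and feeds the three names through printf instead of creating files.

```shell
# One name per line, as ls -1 would print them.
matched=$(printf '%s\n' peterson johnson benson |
    grep -E '(peter|john)son')    # either "peter" or "john", then "son"
echo "$matched"
```

This prints peterson and johnson, one per line, and leaves out benson. The single quotes keep the shell from treating the “|” as a pipe.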

Sed is another tool that makes extensive use of regular expressions. Sed is short for stream editor. It has a lot of options, but most of the time sed is used in
find-replace operations.

The syntax of sed find-replace is

sed 's/pattern to find/replacement/g'

$ echo "the quick brown fox jumped over another fox" | sed 's/fox/cat/g'
the quick brown cat jumped over another cat

In the above example, sed substituted the word fox with cat. Since the word “fox” occurred twice in this string, the substitution occurred
twice. To replace only the first occurrence, we omit the “g” in the above sed argument.

$ echo "the quick brown fox jumped over another fox" | sed 's/fox/cat/'
the quick brown cat jumped over another fox

The pattern “fox” is an exact pattern. We can also specify a regular expression in place of fox, for example,

$ echo "the quick brown fox jumped over another fox" | sed 's/f.*x/cat/'
the quick brown cat

This tells sed to substitute the word cat for the string that matches the pattern f.*x. Let us analyze this pattern. It matches strings that start with “f”, have zero or more characters in between (as specified by .*), and end in “x”. You might think that sed will match the word “fox”. However, in this case, sed scans the input string and first meets the letter “f”. Then it examines the next character, which is “o”. Since this satisfies the condition “zero or more characters in between”, sed accepts it and continues to examine the next character. Upon seeing “x”, sed could stop. However, it does not. It continues to scan the whole string until it reaches the last “x”. Therefore, the text that matched is “fox jumped over another fox”. In this way, sed is said to find the longest match.
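One common way to keep the match from running on to the last “x” is to replace .* with a bracket expression that excludes “x”. The sketch below compares the two patterns. The pattern f[^x]*x is an illustration added here, not part of the example above.

```shell
sentence='the quick brown fox jumped over another fox'

# .* may include "x" itself, so the match extends to the last x.
greedy=$(echo "$sentence" | sed 's/f.*x/cat/')

# [^x]* cannot cross an "x", so the match ends at the first x after f.
first_only=$(echo "$sentence" | sed 's/f[^x]*x/cat/')

echo "$greedy"        # the quick brown cat
echo "$first_only"    # the quick brown cat jumped over another fox
```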

Published by

Bobby Corpus

Loves anything related to Mathematics, Physics, Computing and Economics.