Advantages of Pipelines

- A pipeline can eliminate the need for many temporary files (see the sketch below).
- Scripts written as pipelines are more readable.
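A brief sketch of the contrast (the filename input.txt is made up for the example): the same word-counting task written first with temporary files, then as a single pipeline.

# Without a pipeline: each stage writes a temporary file
# that must be cleaned up afterwards.
tr -cs A-Za-z '\n' < input.txt > /tmp/words.$$
sort /tmp/words.$$ > /tmp/sorted.$$
uniq -c /tmp/sorted.$$
rm -f /tmp/words.$$ /tmp/sorted.$$

# With a pipeline: no temporary files, and the data flow
# reads left to right in one command.
tr -cs A-Za-z '\n' < input.txt | sort | uniq -c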
Spelling Tools

Word List
The Story (Sec. 5.4, Classic Shell Scripting, Robbins & Beebe)

From 1983 to 1987, Bell Labs researcher Jon Bentley wrote an interesting column in Communications of the ACM titled Programming Pearls. Some of the columns were later collected, with substantial changes, into two books. In one of the columns, Bentley posed this challenge: write a program to process a text file, and output a list of the n most-frequent words, with counts of their frequency of occurrence, sorted by descending count. Noted computer scientists Donald Knuth and David Hanson responded separately with interesting and clever literate programs, each of which took several hours to write. Bentley's original specification was imprecise, so Hanson rephrased it this way: Given a text file and an integer n, you are to print the words (and their frequencies of occurrence) whose frequencies of occurrence are among the n largest in order of decreasing frequency.
References:

Programming Pearls: A Literate Program: A WEB program for common words, Comm. ACM 29(6), 471-483, June (1986).

Programming Pearls: Literate Programming: Printing Common Words, Comm. ACM 30(7), 594-599, July (1987).

Literate Programming, by Donald E. Knuth, Stanford University Center for the Study of Language and Information, 1992, ISBN 0-937073-80-6 (paper) and 0-937073-81-4 (cloth).
In the first of Bentley's articles, fellow Bell Labs researcher Doug McIlroy reviewed Knuth's program, and offered a six-step Unix solution that took only a couple of minutes to develop and worked correctly the first time. Moreover, unlike the two other programs, McIlroy's is devoid of explicit magic constants that limit the word lengths, the number of unique words, and the input file size. Also, its notion of what constitutes a word is defined entirely by simple patterns given in its first two executable statements, making changes to the word-recognition algorithm easy.
McIlroy's program illustrates the power of the Unix tools approach: break a complex problem into simpler parts that you already know how to handle. To solve the word-frequency problem, McIlroy converted the text file to a list of words, one per line (tr does the job), mapped words to a single lettercase (tr again), sorted the list (sort), reduced it to a list of unique words with counts (uniq), sorted that list by descending counts (sort), and finally, printed the first several entries in the list (sed, though head would work too).
Script wf

#! /bin/sh
# Read a text stream on standard input, and output a list of
# the n (default: 25) most frequently occurring words and
# their frequency counts, in order of descending counts, on
# standard output.
#
# Usage:
#       wf [n]

tr -cs A-Za-z\' '\n' |      # Replace nonletters with newlines
  tr A-Z a-z |              # Map uppercase to lowercase
  sort |                    # Sort the words in ascending order
  uniq -c |                 # Eliminate duplicates, showing their counts
  sort -k1,1nr -k2 |        # Sort by descending count, and then by ascending word
  sed ${1:-25}q             # Print only the first n (default: 25) lines
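A minimal usage sketch follows; the filenames are illustrative and not part of the original script.

# Make the script executable, then feed it text on standard input.
chmod +x wf
./wf 10 < report.txt                 # ten most frequent words, with counts

# With no argument, the default of 25 lines is printed:
cat chapter1.txt chapter2.txt | ./wf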
Tag List

The taglist script finds all begin/end tag pairs written on the same line and outputs a sorted list that associates tag use with input files.
1. Feed the input file into the pipeline with cat:

   cat "$1" | ...

2. Apply sed to simplify the otherwise-complex markup needed for web URLs. This converts tags such as <systemitem role="URL"> and </systemitem> into simpler <URL> and </URL> tags, respectively:

   sed -e 's#systemitem *role="URL"#URL#g' \
       -e 's#/systemitem#/URL#' | ...

3. Use tr to replace spaces and paired delimiters by newlines. At this point, the input consists of one "word" per line (or empty lines); words are either actual text or SGML/XML tags:

   tr ' ( ){ }[ ]' '\n\n\n\n\n\n\n' | ...

4. Use egrep to select tag-enclosed words. The regular expression matches tag-enclosed words: a right angle bracket, followed by at least one nonangle bracket, followed by a left angle bracket, followed by a slash (for the closing tag). At this point, the input consists of lines with tags:

   egrep '>[^<>]+</' | ...

5. The first awk stage uses angle brackets as field separators, so the input <literal>tr</literal> is split into four fields: an empty field, followed by literal, tr, and /literal. The filename is passed to awk on the command line, where the -v option sets the awk variable FILE to the filename. That variable is then used in the print statement, which outputs the word, the tag, and the filename:

   awk -F'[<>]' -v FILE="$1" \
       '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' | ...

6. Sort the lines into word order:

   sort | ...

7. The uniq command supplies the initial count field. The output is a list of records, where the fields are count, word, tag, file:

   uniq -c | ...

8. Sort the output by word and tag (the second and third fields):

   sort -k2,2 -k3,3 | ...

9. The final stage uses a small awk program to filter successive lines, adding a trailing arrow when it sees the same word as on the previous line:

   awk '{
          print ($2 == Last) ? ($0 " <----") : $0
          Last = $2
        }'
Script taglist

#! /bin/sh -
# Read an HTML/SGML/XML file given on the command
# line containing markup like <tag>word</tag> and output on
# standard output a tab-separated list of
#
#   count  word  tag  filename
#
# sorted by ascending word and tag.
#
# Usage:
#       taglist xml-file

cat "$1" |
  sed -e 's#systemitem *role="url"#URL#g' \
      -e 's#/systemitem#/URL#' |
  tr ' ( ){ }[ ]' '\n\n\n\n\n\n\n' |
  egrep '>[^<>]+</' |
  awk -F'[<>]' -v FILE="$1" \
      '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' |
  sort |
  uniq -c |
  sort -k2,2 -k3,3 |
  awk '{
         print ($2 == Last) ? ($0 " <----") : $0
         Last = $2
       }'
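A minimal usage sketch follows; the XML filenames are illustrative. Each output line holds the tab-separated count, word, tag, and filename fields, with a trailing arrow marking a word that also appeared on the previous line (that is, the same word used with more than one tag or file).

chmod +x taglist
./taglist chapter01.xml              # chapter01.xml is an illustrative filename

# The script takes a single filename argument, so cover a set
# of files by running it once per file:
for f in ch*.xml
do
  ./taglist "$f"
done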