Pipeline

Advantages of Pipelines

Spelling Tools

Word List

Tag List


Advantages of Pipelines
 
eliminate the need for many temporary files

make scripts more readable
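
As a small illustration (not from the original notes; input.txt is a hypothetical file), the same word-counting job can be written with temporary files or as a single pipeline:

# Without a pipeline: every intermediate result needs a temporary file
tr -cs A-Za-z\' '\n' < input.txt > /tmp/words.$$    # split text into one word per line
sort /tmp/words.$$ > /tmp/sorted.$$                 # sort the words
uniq -c /tmp/sorted.$$                              # count duplicates
rm -f /tmp/words.$$ /tmp/sorted.$$                  # clean up

# With a pipeline: no temporary files, and the data flow reads left to right
tr -cs A-Za-z\' '\n' < input.txt | sort | uniq -c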

Spelling Tools
Word List
The Story (Sec. 5.4, Classic Shell Scripting, Robbins & Beebe)
   
From 1983 to 1987, Bell Labs researcher Jon Bentley wrote an interesting column in Communications of the ACM titled Programming Pearls. Some of the columns were later collected, with substantial changes, into two books. In one of the columns, Bentley posed this challenge: write a program to process a text file, and output a list of the n most-frequent words, with counts of their frequency of occurrence, sorted by descending count. Noted computer scientists Donald Knuth and David Hanson responded separately with interesting and clever literate programs, each of which took several hours to write. Bentley's original specification was imprecise, so Hanson rephrased it this way: Given a text file and an integer n, you are to print the words (and their frequencies of occurrence) whose frequencies of occurrence are among the n largest in order of decreasing frequency.
 
References:
  
Programming Pearls: A Literate Program: A WEB Program for Common Words, Comm. ACM 29(6), 471-483, June 1986.
  
Programming Pearls: Literate Programming: Printing Common Words, Comm. ACM 30(7), 594-599, July 1987.
  
Donald E. Knuth, Literate Programming, Stanford University Center for the Study of Language and Information, 1992, ISBN 0-937073-80-6 (paper) and 0-937073-81-4 (cloth).
   
In the first of Bentley's articles, fellow Bell Labs researcher Doug McIlroy reviewed Knuth's program, and offered a six-step Unix solution that took only a couple of minutes to develop and worked correctly the first time. Moreover, unlike the two other programs, McIlroy's is devoid of explicit magic constants that limit the word lengths, the number of unique words, and the input file size. Also, its notion of what constitutes a word is defined entirely by simple patterns given in its first two executable statements, making changes to the word-recognition algorithm easy.
 
McIlroy's program illustrates the power of the Unix tools approach: break a complex problem into simpler parts that you already know how to handle. To solve the word-frequency problem, McIlroy converted the text file to a list of words, one per line (tr does the job), mapped words to a single lettercase (tr again), sorted the list (sort), reduced it to a list of unique words with counts (uniq), sorted that list by descending counts (sort), and finally, printed the first several entries in the list (sed, though head would work too).
Script wf
#! /bin/sh
# Read a text stream on standard input, and output a list of
# the n (default: 25) most frequently occurring words and
# their frequency counts, in order of descending counts, on
# standard output.
#
# Usage:
# wf [n]
tr -cs A-Za-z\' '\n' |  # Replace nonletters with newlines
tr A-Z a-z |  # Map uppercase to lowercase
sort |  # Sort the words in ascending order
uniq -c |  # Eliminate duplicates, showing their counts
sort -k1,1nr -k2 |  # Sort by descending count, and then by ascending word
sed ${1:-25}q  # Print only the first n (default: 25) lines
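
A hypothetical invocation (chapter1.txt is a placeholder filename), assuming the script has been saved as wf and made executable:

chmod +x wf
./wf 10 < chapter1.txt    # the 10 most frequent words in chapter1.txt
./wf < chapter1.txt       # no argument: the default of 25 is used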
Tag List
 
The taglist script finds all begin/end tag pairs written on the same line and outputs a sorted list that associates tag use with input files. The pipeline is built in nine steps:
1. Feed the input file into the pipeline with cat:
cat "$1" | ...
2. Apply sed to simplify the otherwise-complex markup needed for web URLs:
This converts tags such as <systemitem role="URL"> and </systemitem> into simpler <URL> and </URL> tags, respectively.
    | sed -e 's#systemitem *role="URL"#URL#g' \
-e 's#/systemitem#/URL#' | ...
3. Use tr to replace spaces and paired delimiters with newlines:
At this point, the input consists of one "word" per line (or empty lines). Words are either actual text or SGML/XML tags.
    | tr ' (){}[]' '\n\n\n\n\n\n\n' | ...
4. Use egrep to select tag-enclosed words:
At this point, the input consists of lines with tags.
    | egrep '>[^<>]+</' | ...

This regular expression matches tag-enclosed words: a right angle bracket, followed by at least one nonangle bracket, followed by a left angle bracket, followed by a slash (for the closing tag).
5. The first awk stage uses angle brackets as field separators, so the input <literal>tr</literal> is split into four fields: an empty field, followed by literal, tr, and /literal. The filename is passed to awk on the command line, where the -v option sets the awk variable FILE to the filename. That variable is then used in the print statement, which outputs the word, the tag, and the filename (a quick check of this field splitting is shown after this list):
    | awk -F'[<>]' -v FILE="$1" \
'{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' | ...
6. Sort the lines into word order:
    | sort | ...
7. The uniq command supplies the initial count field. The output is a list of records, where the fields are count, word, tag, file:
    | uniq -c | ...
8. Sort the output by word and tag (the second and third fields):
    | sort -k2,2 -k3,3 | ...
9. The final stage uses a small awk program to filter successive lines, adding a trailing arrow when it sees the same word as on the previous line.
    | awk '{
print ($2 == Last) ? ($0 " <----") : $0
Last = $2
}'
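
As a quick check of the field splitting described in step 5 (not part of the original script), run the awk field separator against a single sample tag:

echo '<literal>tr</literal>' |
  awk -F'[<>]' '{ printf("tag=%s word=%s\n", $2, $3) }'
# prints: tag=literal word=tr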
Script taglist
#! /bin/sh -
# Read an HTML/SGML/XML file given on the command
# line containing markup like <tag>word</tag> and output on
# standard output a tab-separated list of
#
# count word tag filename
#
# sorted by ascending word and tag.
#
# Usage:
# taglist xml-file
cat "$1" |
  sed -e 's#systemitem *role="url"#URL#g' -e 's#/systemitem#/URL#' |
    tr ' (){}[]' '\n\n\n\n\n\n\n' |
      egrep '>[^<>]+</' |
        awk -F'[<>]' -v FILE="$1" \
        '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' |
          sort |
            uniq -c |
              sort -k2,2 -k3,3 |
                awk '{
                      print ($2 == Last) ? ($0 " <----") : $0
                      Last = $2
                    }'
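
A hypothetical invocation (ch05.xml is a placeholder filename), assuming the script has been saved as taglist and made executable:

chmod +x taglist
./taglist ch05.xml    # one line per word/tag pair: count, word, tag, filename

Words that appear with more than one tag show up on adjacent output lines, and the final awk stage flags the repeats with a trailing <---- marker.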