Regular Expression

Introduction

History of Regular Expressions

Regular Expressions for File Names

Regular Expressions for Strings

Simple Regular Expression Examples

Advanced Regular Expressions

POSIX Regular Expression

Bracket Regular Expression

Bracket Regular Expression

Bracket Regular Expression

Bracket Regular Expression

Regular Expression Examples

Examples: {}

Examples: Anchoring text matches

Back References

BRE operator precedence

Extended Regular Expression

GNU Extensions

Which Programs Use Which Regular Expressions?

Email Format Check

Lessons from Experience


Introduction
   
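We often need to apply one command to many files at once, or to search for or replace a whole class of strings. For such tasks, regular expressions are a very powerful notation that lets the user describe a class of strings conveniently and precisely.
   
For example, suppose we want to use ex editing commands to give every table in an HTML file a background color. The change is to add a background-color attribute to the table tag <table>, turning it into <table bgcolor=white>. Rather than finding and editing each table tag by hand, we want a single editing command to do the job. Suppose the HTML file contains two spellings of the tag, <table> and <Table>. Without a regular expression, the ex editor needs two commands: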

1,$s/<table/<table bgcolor=white/g
1,$s/<Table/<Table bgcolor=white/g
With a regular expression, a single command is enough:

1,$s/<[tT]able/& bgcolor=white/g
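In this substitution, the string being replaced is <Table or <table, and the symbol "&" stands for a copy of the matched string, reused as part of the new string.
The command therefore means: from the first line to the last, replace each "<Table" or "<table" with the original string followed by " bgcolor=white".
In addition, the command preserves any attributes the table tag already has.
If the file spells the table tag with several different capitalizations, a regexp can still handle them with ease.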
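
The same substitution can be tried outside the editor. The following is a minimal sketch, assuming a POSIX sed; the function name add_bgcolor and the two sample lines are made up for illustration.

```shell
# Hypothetical helper: add a bgcolor attribute to every <table or <Table tag
# read from standard input. "&" in the replacement is the matched text.
add_bgcolor() {
  sed 's/<[tT]able/& bgcolor=white/g'
}

# Invented sample input: one bare tag and one tag that already has an attribute.
printf '%s\n' '<table>' '<Table border=1>' | add_bgcolor
```

The two lines come out as <table bgcolor=white> and <Table bgcolor=white border=1>, so the existing border attribute survives.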
   
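In fact, the wildcards we commonly use in file names can also be regarded as a kind of regular expression.
   
Regular expression (regexp for short) goes by many names in Chinese: 正規表示式, 規律表達式, 正規表達式, 正規表示法, 規則運算式, 常規表示法, and others. Many Unix tools support regexps. There are roughly thirty to forty special symbols, but learning just the few most common ones is enough to be very effective at searching for and replacing strings.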
Sun Sep 8 10:52:46 CST 2024 Untitled Document
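History of Regular Expressions
   
Regular expressions are a basic concept in automata theory: if a language can be described by a regular expression (that is, it is a regular language), every string in it can be recognized by a finite-state automaton or by a more powerful class of automaton.

Ken Thompson defined a notation for string matching and used it for the search and substitute features of the QED editor. That notation is one concrete regular expression language. It never received a formal name, so here we call it KRE (Ken's Regular Expression). The Unix editors ed, ex, and sed, as well as grep, all adopted KRE. KRE made Unix much more powerful and helped it spread, and it went on to be used widely in tools on Unix and Unix-like systems.

POSIX later added a few symbols to KRE to define BRE (Basic Regular Expression), and then extended that into ERE (Extended Regular Expression). Perl subsequently extended BRE/ERE considerably, and its syntax is the model for what is known as PCRE (Perl Compatible Regular Expressions).

Today, hardly any programming language with string-processing features can do without regular expressions. Fortunately, KRE's symbols carry over to all of them, so anyone familiar with KRE does not need to learn a new notation in a new language, only its extensions. Familiarity with regexps has become an important measure of a software engineer.
   
The website https://regex101.com/ offers a place to practice: you can test regular expressions against strings you type in, and the site explains each part of the pattern and shows what it matched. It is a good platform for practice and testing.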
Regular Expressions for File Names
 
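The wildcard symbols used in file names can also be classed as regular expression syntax:
*     matches a string of any length
?     matches any single character
[ ]   matches any one of the characters inside the brackets
   
Examples:
a*       all file names beginning with a
a?       all two-character file names beginning with a
a[1-9]   all two-character file names beginning with a whose second character is a digit from 1 to 9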
   
All of this should be familiar, right? Not necessarily. Consider these two commands:
  ls a[1-9]
  echo a[1-9]
 
Do the two commands produce the same result?
   
Most people know what the first command does, but expect the second to print the following:
  a1 a2 a3 a4 a5 a6 a7 a8 a9
   
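In fact, the two commands report exactly the same file names.
  
When the shell runs the second command, it first sees the word a[1-9], treats it as a file name pattern, and finds every file name in the current directory that matches it. Only then does it invoke echo, passing it the matching names, and echo simply prints those names as strings to STDOUT.
  
So remember: whenever the shell treats a word as a file name pattern, it looks in the directory for all matching file names and expands the word into those names.
   
To print the string a1 a2 a3 a4 a5 a6 a7 a8 a9 itself, use a range generator (brace expansion, supported by shells such as bash):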
echo a{1..9}
If the shell has no range generator, fall back on a loop:
for i in 1 2 3 4 5 6 7 8 9
do
  echo a$i
done
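
The expansion behavior described above can be observed in a scratch directory. This is a minimal sketch, assuming mktemp -d is available (it is common but not required by POSIX); the function name demo_glob and the file names a1, a3, and b1 are invented for the demonstration.

```shell
# Sketch: show that the shell, not echo, expands the pattern a[1-9].
demo_glob() {
  dir=$(mktemp -d) || return 1
  (
    cd "$dir" || exit 1
    touch a1 a3 b1   # only a1 and a3 match the pattern a[1-9]
    echo a[1-9]      # the shell replaces the pattern before echo runs
  )
  rm -rf "$dir"
}

demo_glob
```

Here echo prints a1 a3, the names that exist, not the nine names a1 through a9. With default shell settings, a pattern that matches no file is passed through unchanged, so in an empty directory the same command would print a[1-9] literally.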
Regular Expressions for Strings
 
Regular expressions are even more useful for searching strings; the author can hardly get through a day without them:
Regular Expressions Used in the ed Line Editor
.             any single character
*             zero or more occurrences of the preceding character
^             the beginning of a line
$             the end of a line
[...]         any one of the characters inside the brackets
[abcd]        a, b, c, or d
[a-d]         a, b, c, or d
[0-9]         any one of 0123456789
[0-9a-fA-F]   any one of 0123456789abcdefABCDEF
[-abcd]       '-', 'a', 'b', 'c', or 'd'
[]abcd]       ']', 'a', 'b', 'c', or 'd'
[]abcd-]      ']', '-', 'a', 'b', 'c', or 'd'
[^a]          any character other than a
[^a-d]        any character other than a, b, c, or d
\{n,m\}       at least n and at most m repetitions of the preceding character
\{n\}         exactly n repetitions of the preceding character
\{n,\}        at least n repetitions of the preceding character
\             escape: removes the special meaning of the special character that follows. Exceptions, where the backslash gives the sequence a special meaning instead: '\{', '\}', '\(', '\)', '\<', '\>', '\b', '\B', '\w', '\W', '\`', '\'', '\+', and '\?'.
\( \)         saves the text matched between \( and \) for later reuse (back reference)
\+            one or more occurrences of the preceding character
\?            zero or one occurrence of the preceding character
\w            a character within a word
\W            a character which is not within a word
\<            the beginning of a word
\>            the end of a word
\b            a word boundary
\B            a position which is not a word boundary
\`            the beginning of the whole input
\'            the end of the whole input
[:class:]     any one character in the named class, written inside a bracket expression
The classes are [:alpha:], [:upper:], [:lower:], [:alnum:], [:blank:], [:space:], [:digit:], [:xdigit:], [:cntrl:], [:print:], [:graph:], and [:punct:].
Class         Matching characters
[:digit:]     Numeric characters
[:xdigit:]    Hexadecimal digits
[:alnum:]     Alphanumeric characters
[:alpha:]     Alphabetic characters
[:lower:]     Lowercase characters
[:upper:]     Uppercase characters
[:cntrl:]     Control characters
[:print:]     Printable characters
[:punct:]     Punctuation characters
[:space:]     Whitespace characters: space, \t (tab), \r (return), \f (form feed), \n (newline), and similar
[:blank:]     Space and tab characters
[:graph:]     Nonspace characters: all visible characters, excluding [:space:] and [:cntrl:]
POSIX Basic and Extended Regular Expression (BRE and ERE)
Character   BRE/ERE   Meaning in a pattern
\           Both      Usually, turn off the special meaning of the following character. Occasionally, enable a special meaning for the following character, such as for \(...\) and \{...\}.
.           Both      Match any single character except NUL. Individual programs may also disallow matching newline.
*           Both      Match any number (or none) of the single character that immediately precedes it. For EREs, the preceding character can instead be a regular expression.
                      For example, since . (dot) means any character, .* means "match any number of any character."
                      For BREs, * is not special if it's the first character of a regular expression.
^           Both      Match the following regular expression at the beginning of the line or string.
                      BRE: special only at the beginning of a regular expression.
                      ERE: special everywhere.
$           Both      Match the preceding regular expression at the end of the line or string.
                      BRE: special only at the end of a regular expression.
                      ERE: special everywhere.
[...]       Both      Termed a bracket expression, this matches any one of the enclosed characters.
                      A hyphen (-) indicates a range of consecutive characters. (Caution: ranges are locale-sensitive, and thus not portable.)
                      A circumflex (^) as the first character in the brackets reverses the sense: it matches any one character not in the list.
                      A hyphen or close bracket (]) as the first character is treated as a member of the list.
                      All other metacharacters are treated as members of the list (i.e., literally).
                      Bracket expressions may contain collating symbols, equivalence classes, and character classes.
\{n,m\}     BRE       Termed an interval expression, this matches a range of occurrences of the single character that immediately precedes it.
                      \{n\} matches exactly n occurrences,
                      \{n,\} matches at least n occurrences, and
                      \{n,m\} matches any number of occurrences between n and m. n and m must be between 0 and RE_DUP_MAX (minimum value: 255), inclusive.
                      "exactly five occurrences of a" and "between 10 and 42 instances of q" are written a\{5\} and q\{10,42\}, respectively.
{n,m}       ERE       Just like the BRE \{n,m\} earlier, but without the backslashes in front of the braces: {n}, {n,}, {n,m}. The two examples become a{5} and q{10,42}.
\( \)       BRE       Save the pattern enclosed between \( and \) in a special holding space. Up to nine subpatterns can be saved on a single pattern. The text matched by the subpatterns can be reused later in the same pattern, by the escape sequences \1 to \9. For example, \(ab\).*\1 matches two occurrences of ab, with any number of characters in between.
( )         ERE       Apply a match to the enclosed group of regular expressions.
\n          BRE       Replay the nth subpattern enclosed in \( and \) into the pattern at this point. n is a number from 1 to 9, with 1 starting on the left.
+           ERE       Match one or more instances of the preceding regular expression (not supported by KRE).
?           ERE       Match zero or one instances of the preceding regular expression.
|           ERE       Match the regular expression specified before or after.
Perl Extended Regular Expression
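\r   Carriage return
\t   Horizontal tab
\f   Form feed
\n   Newline
\N   Any character except \n
\s   Whitespace: space, \t, \r, \f, \n
\S   Any character not matched by \s
\w   a-z, A-Z, 0-9, and '_' (underscore)
\W   Any character not matched by \w
\d   0-9
\D   Any character not matched by \d
\b   Word boundary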
Simple Regular Expression Examples
Script ID   Script and description
filter-3
# remove "XXX" from the front of each line
#-------------------------------------------------------------

sed "s/^XXX//"
filter-5
# remove "XXX" from the end of each line
#-------------------------------------------------------------

sed 's/XXX$//'
filter-6
# insert "XXX" at the front of each line
#-------------------------------------------------------------

sed "s/^/XXX/"
filter-8
# append "XXX" to the end of each line
#-------------------------------------------------------------

sed 's/$/XXX/g'
filter-9
# truncate each line from "XXX" to the end (delete XXX and everything after it)
#-------------------------------------------------------------

sed "s/XXX.*//"
filter-15
# collapse each run of tabs and spaces into a single space
# (<tab> and <space> stand for a literal tab and a literal space character)
#-----------------------------------------------------------------------------------------------------

sed "s/[<tab><space>][<tab><space>]*/<space>/g"
filter-21
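# delete empty lines; a line containing only whitespace counts as empty
# (<tab> and <space> stand for a literal tab and a literal space character)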
#-------------------------------------------------------------

sed '/^[<tab><space>]*$/d'
filter-29
# delete comment lines (lines whose first character is "#")
#-------------------------------------------------------------

sed '/^#/d'
filter-30
# blank out comment lines (lines whose first character is "#"); an empty line remains
#-------------------------------------------------------------

sed 's/^#.*//'
filter-34
# print odd-numbered lines
#-----------------------------------------------------------------------

cat -n file | sed -e '/^.....[02468]/d' -e 's/^.......//'
filter-35
# print even-numbered lines
#-----------------------------------------------------------------------

cat -n file | sed -e '/^.....[13579]/d' -e 's/^.......//'
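
filter-15 and filter-21 above are written with <tab> and <space> placeholders, so they cannot be pasted as they stand. One possible runnable form, assuming a sed that supports POSIX character classes, uses [[:blank:]] for "tab or space". The function names squeeze_blanks and drop_blank_lines are ours.

```shell
# filter-15 with a character class: squeeze each run of blanks into one space.
squeeze_blanks() {
  sed 's/[[:blank:]][[:blank:]]*/ /g'
}

# filter-21 with a character class: delete empty and blank-only lines.
drop_blank_lines() {
  sed '/^[[:blank:]]*$/d'
}

# Invented sample: a line with mixed tabs and spaces, an empty line,
# a blank-only line, and a plain line.
printf 'a \t  b\n\n \t\nc\n' | squeeze_blanks | drop_blank_lines
```

The pipeline prints two lines, a b and c.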
Advanced Regular Expressions
   
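When a shell script uses regexps to process strings, it usually needs only the simple symbols and their combinations, and KRE is enough in most cases. Moreover, a regexp used to modify strings (to replace or delete text, for example) must be very precise, or it will change text that should have been left alone. For instance, the sed substitution 's/[Tt]he/THE/' can wrongly alter many other English words: there becomes THEre. We strongly recommend against using overly complex regexps for replacement or deletion.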
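
The over-matching problem can be reproduced, and one way to narrow the pattern is to anchor it to word boundaries. This sketch assumes GNU sed, where \< and \> match the start and end of a word; other sed implementations may not accept them. The function names and the sample sentence are made up, and the g flag is added so every occurrence on the line is shown.

```shell
# Loose pattern: also rewrites "the" inside longer words.
naive() {
  sed 's/[Tt]he/THE/g'
}

# Narrower pattern (GNU sed assumed): \< and \> restrict the match
# to a whole word, so "there" is left alone.
whole_word() {
  sed 's/\<[Tt]he\>/THE/g'
}

echo 'the cat is there' | naive        # THE cat is THEre
echo 'the cat is there' | whole_word   # THE cat is there
```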
 
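Regular Expressions in Big Data Applications
   
Today's information systems rely more and more on big data, and pulling the data a user cares about out of a huge volume of information requires precise searching. If the search condition is too strict, needed information is missed. If it is too loose, so much irrelevant information comes back that the needed information is buried. Precise searching is therefore very important in big data applications. Take telephone numbers: extracting every telephone number from a pile of files is a real headache, because telephone numbers are written in so many ways. These are just some of the common formats: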
04-23456789
(04)2345-6789
(04)23456789
23456789
+886423456789
0932000001
0932-000-001
+886932000001
+886-932-000-001
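Adding extension numbers multiplies the number of forms. Regexps help a great deal in this kind of search, but the task usually calls for more complex and refined regular expressions, which to a beginner often look like a jumble of strange symbols.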
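
As a small illustration of such a pattern, the sketch below covers only the four mobile-number formats in the list (0932000001, 0932-000-001, +886932000001, +886-932-000-001). It is an assumption about what "a mobile number" should mean here, not a complete telephone-number matcher, and the function name is_mobile is ours.

```shell
# Sketch of an ERE for the four mobile formats listed above only.
#   (0|\+886-?)   a leading 0, or the country code +886 with an optional hyphen
#   9[0-9]{2}     the three digits 9xx
#   -?[0-9]{3}    two more groups of three digits, each with an optional hyphen
is_mobile() {
  printf '%s\n' "$1" | grep -E -q '^(0|\+886-?)9[0-9]{2}-?[0-9]{3}-?[0-9]{3}$'
}

for n in 0932000001 0932-000-001 +886932000001 +886-932-000-001 04-23456789; do
  if is_mobile "$n"; then echo "$n: mobile"; else echo "$n: no match"; fi
done
```

The first four numbers are reported as mobile and 04-23456789 is not. Because each hyphen is optional on its own, the pattern also accepts mixed forms such as 0932-000001; tightening that is exactly the kind of precision trade-off discussed above.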
 
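Using Regular Expressions to Validate Input Formats
   
Another common use of regexps is validating the format of input data in web applications: email addresses, dates, national ID numbers, passwords, and so on. The values differ from person to person but must follow a fixed format, and describing the valid format precisely with a regexp is what makes the check easy. Take the email format as an example: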
 
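Email Format Validation Conditions
(1) An @ must appear in the middle.
(2) The address must begin with one or more letters or digits.
(3) Before the @, there may be one or more groups of letters and digits combined with "-", for example -abc-.
(4) Before the @, there may be one or more groups of letters and digits combined with ".", for example .abc.
(5) Before the @, the two forms above are alternatives (or) and may occur zero or more times.
(6) After the @, there must be one or more uppercase or lowercase letters or digits.
(7) After the @, only "." or "-" may appear as separators, and these two characters may not appear consecutively.
(8) After the @, there may be zero or more groups of "." or "-" followed by uppercase or lowercase letters and digits.
(9) After the @, there must be one or more groups of "." followed by uppercase or lowercase letters and digits, and the address must end with a letter.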
 
PCRE for email Format Validation
^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$
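
For tools that only understand POSIX ERE, the PCRE above can be approximated as follows. This is a sketch that assumes \w is equivalent to [[:alnum:]_]; the function name is_email and the sample addresses are invented. Note that the pattern is looser than the listed conditions in some respects and stricter in others, for instance it rejects a final label longer than four characters.

```shell
# ERE rendering of the PCRE above, with \w written as [[:alnum:]_].
# A hyphen placed last inside the brackets is an ordinary character.
is_email() {
  printf '%s\n' "$1" |
    grep -E -q '^[[:alnum:]_.-]+@([[:alnum:]_-]+\.)+[[:alnum:]_-]{2,4}$'
}

for addr in user.name@example.com user@example user@@example.com; do
  if is_email "$addr"; then echo "$addr: ok"; else echo "$addr: rejected"; fi
done
```

Only user.name@example.com is accepted: user@example has no dot after the @, and user@@example.com has a second @ where a domain label is expected.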
 
SSN (Social Security Number) Format Validation Condition
It should have exactly 9 digits.
It should be divided into 3 parts by hyphens (-).
The first part should have 3 digits and should not be 000, 666, or between 900 and 999.
The second part should have 2 digits, from 01 to 99.
The third part should have 4 digits, from 0001 to 9999.
 
PCRE for SSN Format Validation
^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$
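
The (?!...) lookahead used here is a Perl-style feature that POSIX ERE does not provide. One possible way to express the same conditions with plain grep -E is to check the overall shape first and then reject the forbidden values in a second pass. This is a sketch under that assumption, with a hypothetical function name is_ssn and made-up sample values.

```shell
# Two-pass check without lookahead:
#   pass 1 keeps only strings shaped like ddd-dd-dddd
#   pass 2 (-v) drops those whose first part is 000, 666, or 9xx,
#          whose second part is 00, or whose third part is 0000
is_ssn() {
  printf '%s\n' "$1" |
    grep -E '^[0-9]{3}-[0-9]{2}-[0-9]{4}$' |
    grep -E -v -q '^(000|666|9[0-9]{2})-|-00-|-0000$'
}

for s in 123-45-6789 666-45-6789 123-00-6789 123-45-0000 123456789; do
  if is_ssn "$s"; then echo "$s: valid"; else echo "$s: invalid"; fi
done
```

Only 123-45-6789 is reported as valid.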
POSIX Regular Expression
POSIX BRE (Basic RE) and ERE (Extended RE) metacharacters
Bracket Regular Expression
Character classes
   
Character classes represent classes of characters, such as digits, lower- and uppercase letters, punctuation, whitespace, and so on.
   
They are written by enclosing the name of the class in [: and :].
   
The pre-POSIX range expressions for decimal and hexadecimal digits can (and should) be expressed portably, by using character classes: [[:digit:]] and [[:xdigit:]].

POSIX character classes

Class         Matching characters
[:digit:]     Numeric characters
[:xdigit:]    Hexadecimal digits
[:alnum:]     Alphanumeric characters
[:alpha:]     Alphabetic characters
[:lower:]     Lowercase characters
[:upper:]     Uppercase characters
[:cntrl:]     Control characters
[:print:]     Printable characters
[:punct:]     Punctuation characters
[:space:]     Whitespace characters: space, tab, and control characters such as \r and \f
[:blank:]     Space and tab characters
[:graph:]     Nonspace characters: all visible characters, excluding [:space:] and [:cntrl:]
Bracket Regular Expression
Collating
 
Collating is the act of giving an ordering to some group or set of items.
   
A POSIX collating element consists of the name of the element in the current locale, enclosed by [. and .].
   
For example, in Czech and Spanish, the two characters ch are kept together and are treated as a single unit for comparison purposes.
   
Thus, [ab[.ch.]de] matches any of the characters a, b, d, or e, or the pair ch.
It does not match a standalone c or h character.
Bracket Regular Expression
Equivalence class
 
An equivalence class is used to represent different characters that should be treated the same when matching.
   
Equivalence classes enclose the name of the class between [= and =].
   
For example, in a French locale, there might be an [=e=] equivalence class. If it exists, then the regular expression [a[=e=]iouy] would match all the lowercase English vowels, as well as the letters è, é, and so on.
Bracket Regular Expression
   
Collating elements, equivalence classes, and character classes are only recognized inside the square brackets of a bracket expression. Writing a standalone regular expression such as [:alpha:] matches the characters a, l, p, h, and :. The correct way to write it is [[:alpha:]].
   
Within bracket expressions, all other metacharacters lose their special meanings. Thus, [*\.] matches a literal asterisk, a literal backslash, or a literal period. To get a ] into the set, place it first in the list: []*\.] adds the ] to the list. To get a minus character into the set, place it first in the list: [-*\.]. If you need both a right bracket and a minus, make the right bracket the first character, and make the minus the last one in the list: []*\.-].
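
The placement rules for ] and - can be checked with a one-line substitution. This is a small sketch, assuming a sed that follows the POSIX rule that a backslash is an ordinary character inside a bracket expression; the function name strip_punct and the sample string are made up.

```shell
# Delete every ], *, backslash, period, and hyphen from the input.
# Inside the brackets, ] comes first, - comes last, and \ is literal.
strip_punct() {
  sed 's/[]*\.-]//g'
}

# %s keeps the backslash in the sample string as a literal character.
printf '%s\n' 'a]b*c\d.e-f' | strip_punct
```

The command prints abcdef: all five punctuation characters are removed and the letters are untouched.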
   
Finally, POSIX explicitly states that the NUL character (numeric value zero) need not be matchable. This character is used in the C language to indicate the end of a string, and the POSIX standard wanted to make it straightforward to implement its features using regular C strings. In addition, individual utilities may disallow matching of the newline character by the . (dot) metacharacter or by bracket expressions.
Regular Expression Examples
clinton      The seven letters clinton, anywhere on a line
^clinton     The seven letters clinton, at the beginning of a line
clinton$     The seven letters clinton, at the end of a line
^clinton$    A line containing exactly the seven letters clinton, and nothing else
[Cc]linton   Either the seven letters Clinton, or the seven letters clinton, anywhere on a line
cli.ton      The three letters cli, any character, and the three letters ton, anywhere on a line
cli.*ton     The three letters cli, any sequence of zero or more characters, and the three letters ton, anywhere on a line (e.g., cliton, clinton, cliBILLton, and so on)
Examples: {}
Pattern Matches
\{n\} Exactly n occurrences of the preceding regular expression
\{n,\} At least n occurrences of the preceding regular expression
\{n,m\} Between n and m occurrences of the preceding regular expression
Examples: Anchoring text matches
 
Text to be matched: abcABCdefDEF
Pattern             Matches?   Text matched / Reason match fails
ABC                 Yes        Characters 4, 5, and 6, in the middle: abcABCdefDEF
^ABC                No         Match is restricted to beginning of string
def                 Yes        Characters 7, 8, and 9, in the middle: abcABCdefDEF
def$                No         Match is restricted to end of string
[[:upper:]]\{3\}    Yes        Characters 4, 5, and 6, in the middle: abcABCdefDEF
[[:upper:]]\{3\}$   Yes        Characters 10, 11, and 12, at the end: abcABCdefDEF
^[[:alpha:]]\{3\}   Yes        Characters 1, 2, and 3, at the beginning: abcABCdefDEF
^$ matches empty lines or strings. For example, to filter out all empty lines:

grep -v '^$' myfile
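
The rows of the anchoring table can be checked directly with grep. This is a minimal sketch; the helper name matches is ours, and it simply reports whether a BRE matches anywhere in a string.

```shell
# Return success if the BRE in $1 matches somewhere in the string $2.
matches() {
  printf '%s\n' "$2" | grep -q -e "$1"
}

text=abcABCdefDEF
for re in 'ABC' '^ABC' 'def$' '[[:upper:]]\{3\}$' '^[[:alpha:]]\{3\}'; do
  if matches "$re" "$text"; then echo "$re: yes"; else echo "$re: no"; fi
done
```

The loop reports yes for ABC, [[:upper:]]\{3\}$, and ^[[:alpha:]]\{3\}, and no for ^ABC and def$, in agreement with the table.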
Back References
Back references match whatever an earlier part of the regular expression matched.
 
In POSIX, back references exist in BRE only, not in ERE.

Step 1: Enclose a subexpression in \( and \).
There may be up to nine enclosed subexpressions within a single pattern, and they may be nested.
Step 2: Use \digit, where digit is a number between 1 and 9, in a later part of the same pattern.
Its meaning there is "match whatever was matched by the nth earlier parenthesized subexpression."
 
Examples
Pattern                                 Matches
\(ab\)\(cd\)[def]*\2\1                  abcdcdab, abcdeeecdab, abcdddeeffcdab, ...
\(why\).*\1                             A line with two occurrences of why
\([[:alpha:]_][[:alnum:]_]*\) = \1;     Simple C/C++ assignment statement
\(["']\).*\1                            Match single- or double-quoted words, like 'foo' or "bar"
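
Two of the patterns in the table can be exercised with grep, which uses BRE by default. This is a sketch; the helper name matches and the test strings that do not appear in the table are our own.

```shell
# Return success if the BRE in $1 matches somewhere in the string $2.
matches() {
  printf '%s\n' "$2" | grep -q -e "$1"
}

# \2 must repeat "cd" and \1 must repeat "ab".
if matches '\(ab\)\(cd\)[def]*\2\1' 'abcdeeecdab'; then echo 'abcdeeecdab: yes'; fi
if ! matches '\(ab\)\(cd\)[def]*\2\1' 'abcdcd'; then echo 'abcdcd: no'; fi

# The shell quoting '\'' inserts one single quote into the pattern,
# so grep sees \(["']\).*\1 : the closing quote must equal the opening one.
quoted='\(["'\'']\).*\1'
if matches "$quoted" "'foo'"; then echo 'single-quoted: yes'; fi
if ! matches "$quoted" "'foo\""; then echo 'mismatched quotes: no'; fi
```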
BRE operator precedence
BRE operator precedence from highest to lowest
Operator Meaning
[. .] [= =] [: :] Bracket symbols for character collation
\metacharacter Escaped metacharacters
[ ] Bracket expressions
\( \) \digit subexpressions and backreferences
* \{ \} Repetition of the preceding single-character regular expression
no symbol Concatenation
^ $ Anchors
Extended Regular Expression
   
EREs, as the name implies, have more capabilities than do basic regular expressions. Many of the metacharacters and capabilities are identical. However, some of the metacharacters that look similar to their BRE counterparts have different meanings.
Matching single characters
   
EREs are essentially the same as BREs.
 
Exceptions
   
In awk, \ is special inside bracket expressions.
Thus, to match a left bracket, dash, right bracket, or backslash, you could use [\[\-\]\\].
Backreferences don't exist
   
Parentheses are special in EREs, but serve a different purpose than they do in BREs. In an ERE, \( and \) match literal left and right parentheses.
Matching multiple regular expressions with one expression
   
EREs have the most notable differences from BREs in the area of matching multiple characters.
 
The * does work the same as in BREs.
   
An exception is that the meaning of a * as the first character of an ERE is "undefined," whereas in a BRE it means "match a literal *."
 
Interval expressions are also available in EREs;
   
however, they are written using plain braces, not braces preceded by backslashes.
   
"exactly five occurrences of a" and "between 10 and 42 instances of q" are written a{5} and q{10,42}, respectively.
   
Use \{ and \} to match literal brace characters.
? and +
?   Match zero or one of the preceding regular expression; ? means "optional."
    For example, ab?c matches both ac and abc, but nothing else.
+   Match one or more of the preceding regular expression.
    It is similar to the * metacharacter, except that at least one occurrence of text matching the preceding regular expression must be present.
    Thus, ab+c matches abc, abbc, abbbc, and so on, but does not match ac.
    ab+c is the same as abb*c.
Alternation and Grouping
(why)+                        matches one or more occurrences of the word why
[Tt]he (CPU|computer) is      matches sentences using either CPU or computer between The (or the) and is
(read|write)+                 matches one or more occurrences of either of the words read or write
((read|write)[[:space:]]*)+   same as above, but allows zero or more whitespace characters between words: it matches multiple successive occurrences of either read or write, possibly separated by whitespace characters
((read|write)[[:space:]]+)+   same as above, but requires one or more whitespace characters after each word
^abcd|efgh$                   matches abcd at the beginning of the string, or efgh at the end of the string
^(abcd|efgh)$                 matches a string containing exactly abcd or exactly efgh
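
The difference between the last two rows, and the behavior of ? and +, can be confirmed with grep -E. This is a sketch; the helper name ematches and the test strings are ours.

```shell
# Return success if the ERE in $1 matches somewhere in the string $2.
ematches() {
  printf '%s\n' "$2" | grep -E -q -e "$1"
}

check() {
  if ematches "$1" "$2"; then echo "$1 vs $2: yes"; else echo "$1 vs $2: no"; fi
}

check '^abcd|efgh$'   'abcdXYZ'   # yes: the ^ applies only to abcd
check '^(abcd|efgh)$' 'abcdXYZ'   # no: grouping makes both anchors apply
check '^(abcd|efgh)$' 'efgh'      # yes
check 'ab+c'          'ac'        # no: + needs at least one b
check 'abb*c'         'abbc'      # yes: same language as ab+c
check 'ab?c'          'abbc'      # no: ? allows at most one b
```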
Anchoring text matches
   
The ^ and $ have the same meaning as in BREs
   
In EREs, ^ and $ are always metacharacters. Thus, regular expressions such as ab^cd and ef$gh are valid, but cannot match anything.
ERE operator precedence
   
Operator precedence applies to EREs as it does to BREs.
 
ERE operator precedence from highest to lowest
Operator Meaning
[. .] [= =] [: :] Bracket symbols for character collation
\metacharacter Escaped metacharacters
[ ] Bracket expressions
( ) Grouping
* + ? { } Repetition of the preceding regular expression
no symbol Concatenation
^ $ Anchors
| Alternation
GNU Extensions
Operator Meaning
\w Matches any word-constituent character. Equivalent to [[:alnum:]_].
\W Matches any nonword-constituent character. Equivalent to [^[:alnum:]_].
\< \> Matches the beginning and end of a word, as described previously.
\b Matches the null string found at either the beginning or the end of a word. This is a generalization of the \< and \> operators. Note: Because awk uses \b to represent the backspace character, GNU awk (gawk) uses \y.
\B Matches the null string between two word-constituent characters.
\` \' Matches the beginning and end of an emacs buffer, respectively. GNU programs (besides emacs) generally treat these as being equivalent to ^ and $.
Which Programs Use Which Regular Expressions?
Unix programs and their regular expression type
Type    grep  sed  ed  ex/vi  more  egrep  awk  lex
BRE     Y     Y    Y   Y      Y
ERE                                 Y      Y    Y
\< \>   Y     Y    Y   Y      Y
Email Format Check
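Rules
(1) An @ must appear in the middle.
(2) The address must begin with one or more letters or digits.
(3) Before the @, there may be one or more groups of letters and digits combined with "-", for example -abc-.
(4) Before the @, there may be one or more groups of letters and digits combined with ".", for example .abc.
(5) Before the @, the two forms above are alternatives (or) and may occur zero or more times.
(6) After the @, there must be one or more uppercase or lowercase letters or digits.
(7) After the @, only "." or "-" may appear as separators, and these two characters may not appear consecutively.
(8) After the @, there may be zero or more groups of "." or "-" followed by uppercase or lowercase letters and digits.
(9) After the @, there must be one or more groups of "." followed by uppercase or lowercase letters and digits, and the address must end with a letter.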
Lessons from Experience
 
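Note that although the file name patterns described earlier and the string regular expressions described here share some symbols, the same symbol means different things in the two contexts.
   
Many people mix the two up.
 
Another common source of confusion:
   
Different programs may use slightly different regular expressions, for example BRE or ERE, or even the regular expressions that were in use before the POSIX standard appeared.
  
Although only a few regular expression symbols are covered above, they are a great help in practice. They appear throughout the examples, so we do not explain them further here and instead explain them in each example.
   
The reason is that many people, having read about them without actually using them, do not easily appreciate their benefits.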