minoli
Jun 4, 2021


Regular Expressions — part I

Introduction to Regex with Metacharacters and Metasequences

A regular expression is a formula in a special language that specifies a simple class of strings: a pattern that can match one or more character strings. Regular expressions were introduced in 1951 by the American mathematician Stephen Kleene to describe McCulloch-Pitts neural networks. They are used to find patterns, especially in text, and can be considered a formal language for specifying text strings.

Today, regular expressions are supported in most programming languages and many scripting languages, as well as in editors, databases, applications, and command-line interfaces. You can try regular expressions online at https://regexr.com/

Two rules help when predicting what most regular expressions will match.

  1. The earliest (leftmost) match wins: the engine scans the text from left to right, tries the pattern at each position in turn without skipping any, and reports the first position where a match succeeds.
  2. Standard quantifiers are greedy: a quantifier such as *, +, ?, or {n,m} repeats its element as many times as it can while still allowing the rest of the expression to match.
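
The two rules can be seen in a short sketch. Python's built-in `re` module is used here only as an assumption for illustration (the article itself is not tied to a language), and the sample strings are made up.

```python
import re

# Rule 1: the leftmost match wins.
# "dog" is listed first in the pattern, but "cat" occurs earlier in the text,
# so the engine reports "cat" at position 0.
m = re.search(r"dog|cat", "cat dog cat")
leftmost = (m.group(), m.start())

# Rule 2: standard quantifiers are greedy.
# ".+" grabs as many characters as it can and only gives some back
# so that the final ">" can still match.
greedy = re.search(r"<.+>", "<b>bold</b>").group()

print(leftmost)  # ('cat', 0)
print(greedy)    # <b>bold</b>
```

Note that the greedy match covers the whole string and not just the first tag, which is a common surprise when a pattern contains `.+` or `.*`.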

A regular expression is a string that combines normal characters with special metacharacters or metasequences. Normal characters match themselves. A metacharacter is a symbol that has a special meaning within a regular expression. A metasequence also has a special meaning, but it is made up of more than one character. Together, metacharacters and metasequences represent locations, types, or quantities of characters.
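
As a rough illustration of the difference between normal characters and metacharacters, here is a minimal sketch. It assumes Python's `re` module, and the sample text is hypothetical.

```python
import re

text = "Total: 3.50 (approx.)"

# Normal characters match themselves.
literal = re.search(r"Total", text).group()

# Unescaped, "." is a metacharacter, so the pattern 3.50 also accepts "3x50".
loose = re.fullmatch(r"3.50", "3x50") is not None

# Escaped with a backslash, it only accepts a real dot.
strict = re.fullmatch(r"3\.50", "3x50") is not None

# re.escape adds the backslashes needed to treat a whole string literally.
escaped = re.escape("(approx.)")
found = re.search(escaped, text).group()

print(literal, loose, strict, found)
```

The last step is handy when the text to search for comes from user input and may contain metacharacters by accident.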

Metacharacters

There are 11 metacharacters to consider when writing regular expressions.

  1. \ (Backslash): Escapes a metacharacter so that it loses its special meaning and matches literally. Also use the backslash if you want a forward slash inside a regular expression literal, because an unescaped slash would end the literal.
    Ex: /a\/b/ (to match the character a, followed by the forward-slash character, followed by the character b).
  2. ^ (Caret symbol): The caret has three uses: as the negation operator inside a character set, as a literal caret, and as an anchor.
    As the negation operator:- /[^A-Z]/ (As the first character inside square brackets it negates the set, so this matches any character that is not an uppercase letter: lowercase letters, but also digits, spaces, and punctuation)
    As a literal caret:- /[A^a]/ (Anywhere else inside square brackets the caret is an ordinary character, so this matches any one of ‘A’, ‘^’, or ‘a’)
    As an anchor, it represents the start of the string or line:- /^The/ (Matches the word ‘The’ only at the start of a line)
  3. $ (Dollar sign): Matches at the end of a string
  4. . (Dot, period): The dot has two uses: as a literal dot and as the wildcard.
    To match a literal dot, escape it with a backslash: ‘\.’
    Unescaped, it is the wildcard and matches any single character other than a newline.
  5. | (Pipe symbol): Used for alternation, to match either the part on the left side or the part on the right side:
    Ex: /abc|xyz/ (matches either abc or xyz)
  6. ? (Question mark): Exactly zero or one occurrence of the previous char or expression
    Ex: /colou?r/ (Matches colour or color)
  7. * (Asterisk): Zero or more occurrences of the previous char or expression
    Ex: /hii*/ (Matches hi, hii, hiii, and so on: an ‘h’, one ‘i’, then any number of further ‘i’s)
  8. + (Plus sign): One or more occurrences of the previous char or expression
    Ex: /hi+/ (Matches hi, hii, hiii, and so on: an ‘h’ followed by at least one ‘i’, so it is equivalent to /hii*/)
  9. ( and ) (Opening and closing parentheses): Define groups within the regular expression. Use groups for the following:
    To confine the scope of the | alternation: Ex: /(a|b)c/ (Matches ac or bc)
    To define the scope of a quantifier: Ex: /(dog.){1,2}/ (Matches dog. or dog.dog.; because the dot is unescaped here it acts as the wildcard, so dogs matches as well)
  10. [ and ] (Opening and closing square brackets): Match any one character in the set
  11. { and } (Opening and closing curly braces): Match the specified number of occurrences of the previous char or expression.
    Ex: /hel{2}o/ (Matches with hello only)
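
The examples in the list above can be checked with a short script. This is only a sketch: it assumes Python's `re` module, where patterns are written as raw strings and not as /.../ literals (so the forward slash needs no escaping), and `all_matches` is a hypothetical helper, not a library function.

```python
import re

def all_matches(pattern, text):
    """Return every non-overlapping match of pattern in text, left to right."""
    return [m.group() for m in re.finditer(pattern, text)]

# ^ as negation (first inside brackets) and as a literal (elsewhere inside brackets)
not_upper = all_matches(r"[^A-Z]", "aB-c")        # ['a', '-', 'c']
caret_literal = all_matches(r"[A^a]", "A^ab")     # ['A', '^', 'a']

# ^ and $ as anchors
at_start = re.search(r"^The", "The end") is not None      # True
not_at_start = re.search(r"^The", "At The end") is None   # True
at_end = all_matches(r"end$", "end to end")       # ['end'] (only the last one)

# . as wildcard versus \. as a literal dot
wildcard = all_matches(r"h.t", "hat hot h.t h\nt")  # ['hat', 'hot', 'h.t']
literal_dot = all_matches(r"h\.t", "hat hot h.t")   # ['h.t']

# | alternation
either = all_matches(r"abc|xyz", "xyz abc")       # ['xyz', 'abc']

# ? * + quantifiers
optional = all_matches(r"colou?r", "color colour colouur")  # ['color', 'colour']
star = all_matches(r"hii*", "h hi hii hiii")      # ['hi', 'hii', 'hiii']
plus = all_matches(r"hi+", "h hi hii hiii")       # ['hi', 'hii', 'hiii']

# ( ) groups: scope of | and scope of a quantifier
# (the dot is escaped here so that it matches only a literal period)
grouped_alt = all_matches(r"(a|b)c", "ac bc cc")            # ['ac', 'bc']
grouped_rep = all_matches(r"(dog\.){1,2}", "dog. dog.dog.")  # ['dog.', 'dog.dog.']

# { } exact repetition
braces = all_matches(r"hel{2}o", "helo hello helllo")       # ['hello']
```

The helper uses `re.finditer` and `m.group()` on purpose: `re.findall` returns only the captured group when a pattern contains parentheses, which would hide the full match in the grouping examples.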

Metasequences

The 10 most popular metasequences used when writing regular expressions are listed below.

  1. . (Dot): Any character except newlines(\n)
  2. \d: Any decimal digit. Same as /[0-9]/
  3. \D: Any character that is not a decimal digit. Same as /[^0-9]/
  4. \s: Any whitespace character. Same as /[ \t\n\r\f\v]/
  5. \S: Any non-whitespace character. Same as /[^ \t\n\r\f\v]/
  6. \w: Any alphanumeric character or underscore. Same as /[0-9A-Za-z_]/
  7. \W: Any non-alphanumeric character. Same as /[^0-9A-Za-z_]/. Matches whitespace and symbols other than the underscore.
  8. {n}: The previous character or group repeated exactly n times.
  9. {n,}: The previous character or group repeated at least n times, with no upper limit
  10. {n,m}: The previous character or group repeated between n and m times (n and m both inclusive)
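
The character-class equivalences above can be checked with a small sketch. It assumes Python's `re` module and passes the `re.ASCII` flag, because by default Python lets \d, \s, and \w match Unicode characters too, which would make them broader than the bracket forms. The sample strings are made up.

```python
import re

sample = "Room 42,\tfloor_7!"

# Each shorthand gives the same result as its bracket form in ASCII mode.
digits = re.findall(r"\d", sample, flags=re.ASCII)       # ['4', '2', '7']
same_digits = re.findall(r"[0-9]", sample)

spaces = re.findall(r"\s", sample, flags=re.ASCII)       # [' ', '\t']
same_spaces = re.findall(r"[ \t\n\r\f\v]", sample)

words = re.findall(r"\w+", sample, flags=re.ASCII)       # ['Room', '42', 'floor_7']
same_words = re.findall(r"[0-9A-Za-z_]+", sample)

non_words = re.findall(r"\W", sample, flags=re.ASCII)    # [' ', ',', '\t', '!']
same_non_words = re.findall(r"[^0-9A-Za-z_]", sample)

# Counted repetition: {n}, {n,} and {n,m}
numbers = "7 42 2021"
exactly_four = re.fullmatch(r"\d{4}", "2021") is not None  # True
two_or_more = re.findall(r"\d{2,}", numbers)               # ['42', '2021']
two_to_three = re.findall(r"\d{2,3}", numbers)             # ['42', '202']
```

The last line shows the upper bound at work: the quantifier is greedy, so it takes three digits of 2021, and the single digit left over is too short to match again.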
