Regular Expressions

Hello, my Name is Christopher. I'm learning Frontend Web Devlopment on my own. I've found several resourses available to anyone wanting to learn web development.

I use a few different web sites for free courses and reference material, like "Free code Camp", "W3 Schools" and "MDN Web Docs". I use these sites for theory, testing and documentation. I also found lots of videos on YouTube that were great for beginers right up to more advanced programming. I did find the theory and having lectures from videos beneficial but they just didn't complete the learning process, I had to take notes too. I try to reference the sites I used for source material when possible in my notes.

I've started to build several reference document web pages to practice building web pages while taking notes on different subjects.


Welcome

Regular expressions, often shortened to "regex" or "regexp". Regular expressions are very powerful, but can be hard to read because they use special characters to make more complex, flexible matches.

Regular expressions are patterns used to match character combinations in strings. In JavaScript, regular expressions are also objects. These patterns are used with the exec() and test() methods of RegExp, and with the match(), matchAll(), replace(), replaceAll(), search(), and split() methods of String.


Free Code Camp

Softcorp TV

Learn Regular Expressions

Creating a regular expression

You construct a regular expression in one of two ways:

Using a expression literal, which consists of a pattern enclosed between slashes, as follows:

            
  const re = /ab+c/;
            
          

Regular expression literals provide compilation of the regular expression when the script is loaded. If the regular expression remains constant, using this can improve performance.


Using the constructor function of the new RegExp() object, as follows:

            
  const re = new RegExp('ab+c');
            
          

Using the constructor function provides runtime compilation of the regular expression. Use the constructor function when you know the regular expression pattern will be changing, or you don't know the pattern and are getting it from another source, such as user input.



Patterns

Writing a regular expression pattern:

A regular expression pattern is composed of simple characters, such as /abc/, or a combination of simple and special characters, such as /ab*c/ or /Chapter (\d+)\.\d*/. The last example includes parentheses, which are used as a memory device. The match made with this part of the pattern is remembered for later use, as described in Using groups.

Simple patterns are constructed of characters for which you want to find a direct match. For example, the pattern /abc/ matches character combinations in strings only when the exact sequence "abc" occurs (all characters together and in that order).


Using special characters:

When the search for a match requires something more than a direct match, such as finding one or more b's, or finding white space, you can include special characters in the pattern. For example, to match a single "a" followed by zero or more "b"s followed by "c", you'd use the pattern /ab*c/: the * after "b" means "0 or more occurrences of the preceding item." In the string "cbbabbbbcdebc", this pattern will match the substring "abbbbc".


Using Parentheses:

Parentheses around any part of the regular expression pattern causes that part of the matched substring to be remembered. Once remembered, the substring can be recalled for other use. See Groups and ranges for more details.



Character Classes

\

Indicates that the following character should be treated specially, or "escaped". It behaves one of two ways.

  • For characters that are usually treated literally, indicates that the next character is special and not to be interpreted literally. For example, /b/ matches the character "b". By placing a backslash in front of "b", that is by using /\b/, the character becomes special to mean match a word boundary.
  • For characters that are usually treated specially, indicates that the next character is not special and should be interpreted literally. For example, "*" is a special character that means 0 or more occurrences of the preceding character should be matched; for example, /a*/ means match 0 or more "a"s. To match * literally, precede it with a backslash; for example, /a\*/ matches "a*".

.

Has one of the following meanings:

  • Matches any single character except line terminators: \n, \r, \u2028 or \u2029. For example, /.y/ matches "my" and "ay", but not "yes", in "yes make my day".
  • Inside a character class, the dot loses its special meaning and matches a literal dot.

Note that the m multiline flag doesn't change the dot behavior. So to match a pattern across multiple lines, the character class [^] can be used — it will match any character including newlines.

ES2018 added the s "dotAll" flag, which allows the dot to also match line terminators.


\d

Matches any digit (Arabic numeral). Equivalent to [0-9]. For example, /\d/ or /[0-9]/ matches "2" in "B2 is the suite number".


\D

Matches any character that is not a digit (Arabic numeral). Equivalent to [^0-9]. For example, /\D/ or /[^0-9]/ matches "B" in "B2 is the suite number".


\w

Matches any alphanumeric character from the basic Latin alphabet, including the underscore. Equivalent to [A-Za-z0-9_]. For example, /\w/ matches "a" in "apple", "5" in "$5.28", "3" in "3D" and "m" in "Émanuel".


\W

Matches any character that is not a word character from the basic Latin alphabet. Equivalent to [^A-Za-z0-9_]. For example, /\W/ or /[^A-Za-z0-9_]/ matches "%" in "50%" and "É" in "Émanuel".


\s

Matches a single white space character, including space, tab, form feed, line feed, and other Unicode spaces. Equivalent to [ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]. For example, /\s\w*/ matches " bar" in "foo bar".


\S

Matches a single character other than white space. Equivalent to [^ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]. For example, /\S\w*/ matches "foo" in "foo bar".


\t

Matches a horizontal tab.


\n

Matches a linefeed.


\r

Matches a carriage return.


\v

Matches a vertical tab.


\f

Matches a form-feed.


\0

Matches a NUL character. Do not follow this with another digit.


\cX

Matches a control character using caret notation, where "X" is a letter from A-Z (corresponding to codepoints U+0001-U+001A). For example, /\cM\cJ/ matches "\r\n".


\xhh

Matches the character with the code hh (two hexadecimal digits).


\uhhhh

Matches a UTF-16 code-unit with the value hhhh (four hexadecimal digits).


\u{hhhh} or \u{hhhhh}

(Only when the u flag is set.) Matches the character with the Unicode value U+hhhh or U+hhhhh (hexadecimal digits).


[\b]

Matches a backspace. If you're looking for the word-boundary character (\b), see Assertions.



Assertions

Assertions include boundaries, which indicate the beginnings and endings of lines and words, and other patterns indicating in some way that a match is possible (including look-ahead, look-behind, and conditional expressions).

^

Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character. For example, /^A/ does not match the "A" in "an A", but does match the first "A" in "An A".


$

Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character. For example, /t$/ does not match the "t" in "eater", but does match it in "eat".


\b

Matches a word boundary. This is the position where a word character is not followed or preceded by another word-character, such as between a letter and a space. Note that a matched word boundary is not included in the match. In other words, the length of a matched word boundary is zero.

Examples:

  • /\bm/ matches the "m" in "moon".
  • /oo\b/ does not match the "oo" in "moon", because "oo" is followed by "n" which is a word character.
  • /oon\b/ matches the "oon" in "moon", because "oon" is the end of the string, thus not followed by a word character.
  • /\w\b\w/ will never match anything, because a word character can never be followed by both a non-word and a word character.

\B

Matches a non-word boundary. This is a position where the previous and next character are of the same type: Either both must be words, or both must be non-words, for example between two letters or between two spaces. The beginning and end of a string are considered non-words. Same as the matched word boundary, the matched non-word boundary is also not included in the match. For example, /\Bon/ matches "on" in "at noon", and /ye\B/ matches "ye" in "possibly yesterday".


x(?=y)

Lookahead assertion: Matches "x" only if "x" is followed by "y". For example, /Jack(?=Sprat)/ matches "Jack" only if it is followed by "Sprat". /Jack(?=Sprat|Frost)/ matches "Jack" only if it is followed by "Sprat" or "Frost". However, neither "Sprat" nor "Frost" is part of the match results.


x(?!y)

Negative lookahead assertion: Matches "x" only if "x" is not followed by "y". For example, /\d+(?!\.)/ matches a number only if it is not followed by a decimal point. /\d+(?!\.)/.exec('3.141') matches "141" but not "3".


(?<=y)x

Lookbehind assertion: Matches "x" only if "x" is preceded by "y". For example, /(?<=Jack)Sprat/ matches "Sprat" only if it is preceded by "Jack". /(?<=Jack|Tom)Sprat/ matches "Sprat" only if it is preceded by "Jack" or "Tom". However, neither "Jack" nor "Tom" is part of the match results.


(?<!y)x

Negative lookbehind assertion: Matches "x" only if "x" is not preceded by "y". For example, /(?<!-)\d+/ matches a number only if it is not preceded by a minus sign. /(?<!-)\d+/.exec('3') matches "3". /(?<!-)\d+/.exec('-3') match is not found because the number is preceded by the minus sign.



Groups & Ranges

Groups and ranges indicate groups and ranges of expression characters.

x|y

Matches either "x" or "y". For example, /green|red/ matches "green" in "green apple" and "red" in "red apple".


[xyz][a-c]

A character class. Matches any one of the enclosed characters. You can specify a range of characters by using a hyphen, but if the hyphen appears as the first or last character enclosed in the square brackets it is taken as a literal hyphen to be included in the character class as a normal character.

For example, [abcd] is the same as [a-d]. They match the "b" in "brisket", and the "c" in "chop".

For example, [abcd-] and [-abcd] match the "b" in "brisket", the "c" in "chop", and the "-" (hyphen) in "non-profit".

For example, [\w-] is the same as [A-Za-z0-9_-]. They both match the "b" in "brisket", the "c" in "chop", and the "n" in "non-profit".


[^xyz][^a-c]

A negated or complemented character class. That is, it matches anything that is not enclosed in the brackets. You can specify a range of characters by using a hyphen, but if the hyphen appears as the first or last character enclosed in the square brackets it is taken as a literal hyphen to be included in the character class as a normal character. For example, [^abc] is the same as [^a-c]. They initially match "o" in "bacon" and "h" in "chop".


(x)

Capturing group: Matches x and remembers the match. For example, /(foo)/ matches and remembers "foo" in "foo bar".

A regular expression may have multiple capturing groups. In results, matches to capturing groups typically in an array whose members are in the same order as the left parentheses in the capturing group. This is usually just the order of the capturing groups themselves. This becomes important when capturing groups are nested. Matches are accessed using the index of the result's elements ([1], ..., [n]) or from the predefined RegExp object's properties ($1, ..., $9).

Capturing groups have a performance penalty. If you don't need the matched substring to be recalled, prefer non-capturing parentheses (see below).

String.match() won't return groups if the /.../g flag is set. However, you can still use String.matchAll() to get all matches.


\n

Where "n" is a positive integer. A back reference to the last substring matching the n parenthetical in the regular expression (counting left parentheses). For example, /apple(,)\sorange\1/ matches "apple, orange," in "apple, orange, cherry, peach".


\k<Name>

A back reference to the last substring matching the Named capture group specified by <Name>.

For example, /(?<title>\w+), yes \k<title>/ matches "Sir, yes Sir" in "Do you copy? Sir, yes Sir!".


\?<Name>x

Named capturing group: Matches "x" and stores it on the groups property of the returned matches under the name specified by <Name>. The angle brackets (< and >) are required for group name.

For example, to extract the United States area code from a phone number, we could use /\((?<area>\d\d\d)\)/. The resulting number would appear under matches.groups.area.


(?:x)

Non-capturing group: Matches "x" but does not remember the match. The matched substring cannot be recalled from the resulting array's elements ([1], ..., [n]) or from the predefined RegExp object's properties ($1, ..., $9).



Quantifiers

Quantifiers indicate numbers of characters or expressions to match.

x*

Matches the preceding item "x" 0 or more times. For example, /bo*/ matches "boooo" in "A ghost booooed" and "b" in "A bird warbled", but nothing in "A goat grunted".


x+

Matches the preceding item "x" 1 or more times. Equivalent to {1,}. For example, /a+/ matches the "a" in "candy" and all the "a"'s in "caaaaaaandy".


x?

Matches the preceding item "x" 0 or 1 times. For example, /e?le?/ matches the "el" in "angel" and the "le" in "angle."

If used immediately after any of the quantifiers *, +, ?, or {}, makes the quantifier non-greedy (matching the minimum number of times), as opposed to the default, which is greedy (matching the maximum number of times).


x{n}

Where "n" is a positive integer, matches exactly "n" occurrences of the preceding item "x". For example, /a{2}/ doesn't match the "a" in "candy", but it matches all of the "a"'s in "caandy", and the first two "a"'s in "caaandy".


x{n,}

Where "n" is a positive integer, matches at least "n" occurrences of the preceding item "x". For example, /a{2,}/ doesn't match the "a" in "candy", but matches all of the a's in "caandy" and in "caaaaaaandy".


x{n,m}

Where "n" is 0 or a positive integer, "m" is a positive integer, and m > n, matches at least "n" and at most "m" occurrences of the preceding item "x". For example, /a{1,3}/ matches nothing in "cndy", the "a" in "candy", the two "a"'s in "caandy", and the first three "a"'s in "caaaaaaandy". Notice that when matching "caaaaaaandy", the match is "aaa", even though the original string had more "a"s in it.


x*?, x+?, x??, x{n}?, x{n,}?, x{n,m}?

By default quantifiers like * and + are "greedy", meaning that they try to match as much of the string as possible. The ? character after the quantifier makes the quantifier "non-greedy": meaning that it will stop as soon as it finds a match. For example, given a string like "some <foo> <bar> new </bar> </foo> thing":

  • /<.*>/ will match "<foo> <bar> new </bar> </foo>"
  • /<.*?>/ will match "<foo>"


Unicode Escapes

Before ES2018 there was no performance-efficient way to match characters from different sets based on scripts (like Macedonian, Greek, Georgian etc.) or propertyName (like Emoji etc) in JavaScript. Check out tc39 Proposal on Unicode Property Escapes for more info.

Unicode property escapes Regular Expressions allows for matching characters based on their Unicode properties. A character is described by several properties which are either binary ("boolean-like") or non-binary. For instance, unicode property escapes can be used to match emojis, punctuations, letters (even letters from specific languages or scripts), etc.

Note:

For Unicode property escapes to work, a regular expression must use the u flag which indicates a string must be considered as a series of Unicode code points. See also RegExp.prototype.unicode.


Note:

Some Unicode properties encompasses many more characters than some character classes (such as \w which matches only latin letters, a to z) but the latter is better supported among browsers (as of January 2020).


Syntax

            
              // Non-binary values
              \p{UnicodePropertyValue}
              \p{UnicodePropertyName=UnicodePropertyValue}
              
              // Binary and non-binary values
              \p{UnicodeBinaryPropertyName}
              
              // Negation: \P is negated \p
              \P{UnicodePropertyValue}
              \P{UnicodeBinaryPropertyName}
            
          

General categories

General categories are used to classify Unicode characters and subcategories are available to define a more precise categorization. It is possible to use both short or long forms in Unicode property escapes. They can be used to match letters, numbers, symbols, punctuations, spaces, etc. For a more exhaustive list of general categories, please refer to the Unicode specification.

          
  // finding all the letters of a text
  let story = "It's the Cheshire Cat: now I shall have somebody to talk to.";

  // Most explicit form
  story.match(/\p{General_Category=Letter}/gu);

  // It is not mandatory to use the property name for General categories
  story.match(/\p{Letter}/gu);

  // This is equivalent (short alias):
  story.match(/\p{L}/gu);

  // This is also equivalent (conjunction of all the subcategories using short aliases)
  story.match(/\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}/gu);
          
        

Unicode property escapes vs. character classes

With JavaScript regular expressions, it is also possible to use character classes and especially \w or \d to match letters or digits. However, such forms only match characters from the Latin script (in other words, a to z and A to Z for \w and 0 to 9 for \d). As shown in this example, it might be a bit clumsy to work with non Latin texts.

Unicode property escapes categories encompass much more characters and \p{Letter} or \p{Number} will work for any script.

            
              // Trying to use ranges to avoid \w limitations:
              
              const nonEnglishText = "Приключения Алисы в Стране чудес";
              const regexpBMPWord = /([\u0000-\u0019\u0021-\uFFFF])+/gu;
              // BMP goes through U+0000 to U+FFFF but space is U+0020
              
              console.table(nonEnglishText.match(regexpBMPWord));
              
              // Using Unicode property escapes instead
              const regexpUPE = /\p{L}+/gu;
              console.table(nonEnglishText.match(regexpUPE));
            
          

Scripts and script extensions

Some languages use different scripts for their writing system. For instance, English and Spanish are written using the Latin script while Arabic and Russian are written with other scripts (respectively Arabic and Cyrillic). The Script and Script_Extensions Unicode properties allow regular expression to match characters according to the script they are mainly used with (Script) or according to the set of scripts they belong to (Script_Extensions).

For example, A belongs to the Latin script and ε to the Greek script.

            
              let mixedCharacters = "aεЛ";
              
              // Using the canonical "long" name of the script
              mixedCharacters.match(/\p{Script=Latin}/u); // a
              
              // Using a short alias for the script
              mixedCharacters.match(/\p{Script=Greek}/u); // ε
              
              // Using the short name Sc for the Script property
              mixedCharacters.match(/\p{Sc=Cyrillic}/u); // Л
            
          



Flags

Advanced searching with flags

Regular expressions have optional flags that allow for functionality like global searching and case-insensitive searching. These flags can be used separately or together in any order, and are included as part of the regular expression.

d

Generate indices for substring matches.

RegExp.prototype.hasIndices


g

Global search.

RegExp.prototype.global


i

Case-insensitive search

RegExp.prototype.ignoreCase


m

Multi-line search

RegExp.prototype.multiline


s

Allows . to match newline characters.

RegExp.prototype.dotAll


u

"unicode"; treat a pattern as a sequence of unicode code points.

RegExp.prototype.unicode


uy

Perform a "sticky" search that matches starting at the current position in the target string. See sticky.

RegExp.prototype.sticky



Methods

RegExp Methods

Regular expressions are used with the RegExp methods test() and exec()

exec()

Executes a search for a match in a string. It returns an array of information or null on a mismatch.


test()

Tests for a match in a string. It returns true or false.


Syntax

            

            
          

String Methods

Regular expressions are also used with the String methods match(), replace(), search(), and split().

match()

Returns an array containing all of the matches, including capturing groups, or null if no match is found.


matchAll()

Returns an iterator containing all of the matches, including capturing groups.


search()

Tests for a match in a string. It returns the index of the match, or -1 if the search fails.


replace()

Executes a search for a match in a string, and replaces the matched substring with a replacement substring.


replaceAll()

Executes a search for all matches in a string, and replaces the matched substrings with a replacement substring.


split()

Uses a regular expression or a fixed string to break a string into an array of substrings.


Syntax

            

            
          


Escaping

Escaping

If you need to use any of the special characters literally (actually searching for a "*", for instance), you must escape it by putting a backslash in front of it. For instance, to search for "a" followed by "*" followed by "b", you'd use /a\*b/ — the backslash "escapes" the "*", making it literal instead of special.

Similarly, if you're writing a regular expression literal and need to match a slash ("/"), you need to escape that (otherwise, it terminates the pattern). For instance, to search for the string "/example/" followed by one or more alphabetic characters, you'd use /\/example\/[a-z]+/i—the backslashes before each slash make them literal.

To match a literal backslash, you need to escape the backslash. For instance, to match the string "C:\" where "C" can be any letter, you'd use /[A-Z]:\\/ — the first backslash escapes the one after it, so the expression searches for a single literal backslash.

If using the RegExp constructor with a string literal, remember that the backslash is an escape in string literals, so to use it in the regular expression, you need to escape it at the string literal level. /a\*b/ and new RegExp("a\\*b") create the same expression, which searches for "a" followed by a literal "*" followed by "b".

If escape strings are not already part of your pattern you can add them using String.replace:

            
  function escapeRegExp(string) {
    return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); 
    // $& means the whole matched string
  }