ES2020: String.prototype.matchAll

[2018-02-06] dev, javascript, es feature, es2020

The proposal “String.prototype.matchAll” by Jordan Harband is part of ES2020. This blog post explains how it works.

Before we look at the proposal, let’s review the status quo.

Getting all matches for a regular expression  

At the moment, there are several ways in which you can get all matches for a given regular expression.

RegExp.prototype.exec() with /g  

If a regular expression has the /g flag, you can call .exec() multiple times to get all matches: it returns a match object for each match and, after the last match, null. A match object contains captured substrings and more.

In the following example, we collect all captures of group 1 in the Array matches:

function collectGroup1(regExp, str) {
  const matches = [];
  while (true) {
    const match = regExp.exec(str);
    if (match === null) break;
    // Add capture of group 1 to `matches`
    matches.push(match[1]);
  }
  return matches;
}

collectGroup1(/"([^"]*)"/ug,
  `"foo" and "bar" and "baz"`);
  // [ 'foo', 'bar', 'baz' ]

Without the flag /g, .exec() only ever returns the first match:

> let re = /[abc]/;
> re.exec('abc')
[ 'a', index: 0, input: 'abc' ]
> re.exec('abc')
[ 'a', index: 0, input: 'abc' ]

This is bad news for collectGroup1(), because it will never finish if regExp doesn’t have the flag /g.
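One way to avoid the endless loop is to check the flag up front. The following is a minimal sketch, not part of the proposal; the name collectGroup1Safely is made up for this illustration. Like the original loop, it assumes a pattern that never matches the empty string.

```javascript
// Hypothetical helper: like collectGroup1(), but it rejects regular
// expressions without /g instead of looping forever.
function collectGroup1Safely(regExp, str) {
  if (!regExp.global) {
    throw new TypeError('Flag /g must be set!');
  }
  const matches = [];
  let match;
  // .exec() returns null after the last match, which ends the loop
  while ((match = regExp.exec(str)) !== null) {
    matches.push(match[1]);
  }
  return matches;
}

const withG = collectGroup1Safely(/"([^"]*)"/ug, `"foo" and "bar"`);
console.log(withG); // [ 'foo', 'bar' ]
```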

String.prototype.match() with /g  

If you use .match() with a regular expression whose flag /g is set, you get all full matches for it in an Array (in other words, capture groups are ignored):

> "abab".match(/a/ug)
[ 'a', 'a' ]

If /g is not set, .match() works like RegExp.prototype.exec():

> "abab".match(/a/u)
[ 'a', index: 0, input: 'abab' ]
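To see what “capture groups are ignored” means in practice, here is a small sketch that reuses the quoted-string pattern from the .exec() example. The variable names are only for illustration.

```javascript
const quoted = /"([^"]*)"/ug;

// With /g, .match() returns the full matches (quotes included),
// not the captures of group 1.
const fullMatches = `"foo" and "bar"`.match(quoted);
console.log(fullMatches); // [ '"foo"', '"bar"' ]

// If there is no match at all, the result is null, not an empty Array.
const none = 'xyz'.match(quoted);
console.log(none); // null
```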

String.prototype.replace() with /g  

You can use a trick to collect captures via .replace(): We use a function to compute the replacement values. That function receives all capture information. However, instead of computing replacement values, it collects the data it is interested in, in the Array matches:

function collectGroup1(regExp, str) {
  const matches = [];
  function replacementFunc(all, first) {
    matches.push(first);
  }
  str.replace(regExp, replacementFunc);
  return matches;
}

collectGroup1(/"([^"]*)"/ug,
  `"foo" and "bar" and "baz"`);
  // [ 'foo', 'bar', 'baz' ]

For regular expressions without the flag /g, .replace() only visits the first match.
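The difference is easy to check with a sketch of the same trick. The helper name collectViaReplace is hypothetical; it returns the full match from the callback so that the (discarded) result string stays unchanged.

```javascript
// Hypothetical variant of the .replace() trick shown above
function collectViaReplace(regExp, str) {
  const matches = [];
  str.replace(regExp, (all, first) => {
    matches.push(first);
    return all; // keep the text as is; we only want the side effect
  });
  return matches;
}

const input = `"foo" and "bar"`;
const withFlag = collectViaReplace(/"([^"]*)"/ug, input);
const withoutFlag = collectViaReplace(/"([^"]*)"/u, input);
console.log(withFlag);    // [ 'foo', 'bar' ]
console.log(withoutFlag); // [ 'foo' ]
```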

RegExp.prototype.test()  

If flag /g is set, .test() returns true as long as the regular expression still matches; each call continues where the previous match ended:

const regExp = /a/ug;
const str = 'aa';
regExp.test(str); // true
regExp.test(str); // true
regExp.test(str); // false
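What drives this behavior is the property .lastIndex of the regular expression. The following sketch records it after each call; the Array log is only there for illustration.

```javascript
const regExp = /a/ug;
const str = 'aa';

// Record the result of each .test() call plus the resulting .lastIndex
const log = [];
let found;
do {
  found = regExp.test(str);
  log.push([found, regExp.lastIndex]);
} while (found);

console.log(log);
// [ [ true, 1 ], [ true, 2 ], [ false, 0 ] ]
```

Counting the true entries tells you how many matches there are, but you don’t get the matched text.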

String.prototype.split()  

You can split a string and use a regular expression to specify the separator. If that regular expression contains at least one capture group, then .split() returns an Array in which the substrings are interleaved with whatever the groups capture:

const regExp = /<(-+)>/ug;
const str = 'a<--->b<->c';
str.split(regExp);
  // [ 'a', '---', 'b', '-', 'c' ]
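With more than one group, the captures of every group show up in the result, not only those of the first one. A small sketch with a made-up separator syntax:

```javascript
// Hypothetical separator with two capture groups: dashes, then pluses
const twoGroups = /<(-+):(\++)>/u;
const pieces = 'a<-:+>b<--:++>c'.split(twoGroups);
console.log(pieces);
// [ 'a', '-', '+', 'b', '--', '++', 'c' ]
```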

Problems with current approaches  

Current approaches have several disadvantages:

  • They are verbose and unintuitive.

  • They only work if /g is set. Sometimes we receive a regular expression from somewhere else, e.g. via a parameter. Then we have to check that this flag is set if we want to be sure that all matches are found.

  • In order to keep track of progress, all approaches (except .match()) change the regular expression: property .lastIndex records where the previous match ended. This makes using the same regular expression at multiple locations risky. And while it’s generally not recommended, it’s a shame that you can’t inline the regular expression when using .exec() multiple times (because a new regular expression object, whose .lastIndex is zero, is created for each invocation):

    ···
    // Doesn’t work:
    const match = /abc/ug.exec(str);
    ···
    
  • Due to property .lastIndex determining where matching continues, it must always be zero when we start collecting matches. But at least .exec() and friends reset it to zero after the last match. This is what happens if it isn’t zero:

    const regExp = /a/ug;
    regExp.lastIndex = 2;
    regExp.exec('aabb'); // null
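The risk of sharing a regular expression can be made concrete with a short sketch. The scenario (two call sites using the same object) is hypothetical.

```javascript
// Hypothetical scenario: two call sites share one /g regular expression.
const shared = /a/ug;

// Call site 1 only wants the first match and stops early.
shared.exec('aaa');
const leftover = shared.lastIndex; // 1, not 0

// Call site 2 starts matching at index 1 and misses the 'a' at index 0.
const missed = shared.exec('ab'); // null

// Defensive fix: reset .lastIndex before starting a new search.
shared.exec('aaa');   // .lastIndex is 1 again
shared.lastIndex = 0;
const found = shared.exec('ab');

console.log(leftover, missed, found.index); // 1 null 0
```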
    

Proposal: String.prototype.matchAll()  

This is how you invoke .matchAll():

const matchIterable = str.matchAll(regExp);

Given a string and a regular expression, .matchAll() returns an iterable over the match objects of all matches.

You can also use the spread operator (...) to convert the iterable to an Array:

> [...'-a-a-a'.matchAll(/-(a)/ug)]
[ [ '-a', 'a' ], [ '-a', 'a' ], [ '-a', 'a' ] ]
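The output above is abbreviated: each element is a full match object, as returned by .exec(), with properties such as .index and (for named groups) .groups. The following sketch assumes an engine that supports named capture groups (ES2018); the key=value syntax is made up for the example.

```javascript
const re = /(?<key>[a-z]+)=(?<value>\d+)/ug;

const pairs = [];
for (const match of 'a=1, bc=22'.matchAll(re)) {
  pairs.push({
    key: match.groups.key,
    value: Number(match.groups.value),
    index: match.index, // where the match starts in the input
  });
}

console.log(pairs);
// [ { key: 'a', value: 1, index: 0 }, { key: 'bc', value: 22, index: 5 } ]

// The regular expression passed to .matchAll() is left untouched
console.log(re.lastIndex); // 0
```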

Flag /g must be set:

> [...'-a-a-a'.matchAll(/-(a)/u)]
TypeError: String.prototype.matchAll called with a non-global RegExp argument

With .matchAll(), function collectGroup1() becomes shorter and easier to understand:

function collectGroup1(regExp, str) {
  const results = [];
  for (const match of str.matchAll(regExp)) {
    results.push(match[1]);
  }
  return results;
}

Let’s use spread and .map() to make this function more concise:

function collectGroup1(regExp, str) {
  const arr = [...str.matchAll(regExp)];
  return arr.map(x => x[1]);
}

Another option is to use Array.from(), which does the conversion to an Array and the mapping at the same time. Therefore, you don’t need the intermediate value arr:

function collectGroup1(regExp, str) {
  return Array.from(str.matchAll(regExp), x => x[1]);
}

.matchAll() returns an iterator, not a restartable iterable  

.matchAll() returns an iterator, not a true restartable iterable. That is, once the result is exhausted, you need to call the method again and create a new iterator.

In contrast, .match() plus /g returns an iterable (an Array) over which you can iterate as often as you want.
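A short sketch of the difference; the variable names are only for illustration.

```javascript
const iterator = 'aXbX'.matchAll(/X/ug);

// The first pass consumes the iterator...
const firstPass = [...iterator].length;
// ...so a second pass finds nothing left.
const secondPass = [...iterator].length;
console.log(firstPass, secondPass); // 2 0

// An Array can be iterated over as often as you want.
const array = 'aXbX'.match(/X/ug);
const firstArrayPass = [...array].length;
const secondArrayPass = [...array].length;
console.log(firstArrayPass, secondArrayPass); // 2 2
```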

Implementing .matchAll()  

.matchAll() could be implemented via .exec() as follows:

function* matchAll(str, regExp) {
  if (!regExp.global) {
    throw new TypeError('Flag /g must be set!');
  }
  const localCopy = new RegExp(regExp, regExp.flags);
  let match;
  while (match = localCopy.exec(str)) {
    yield match;
  }
}

Making a local copy ensures two things:

  • regExp.lastIndex isn’t changed.
  • localCopy.lastIndex is zero.
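Both guarantees can be checked with a small sketch. It repeats the generator from above so that it runs on its own; setting .lastIndex to 3 is an arbitrary choice that simulates a regular expression that was used elsewhere before.

```javascript
// Same generator as above, repeated so that this sketch is self-contained
function* matchAll(str, regExp) {
  if (!regExp.global) {
    throw new TypeError('Flag /g must be set!');
  }
  const localCopy = new RegExp(regExp, regExp.flags);
  let match;
  while (match = localCopy.exec(str)) {
    yield match;
  }
}

const usedBefore = /"([^"]*)"/g;
usedBefore.lastIndex = 3; // simulate leftover state from an earlier search

const captures = Array.from(matchAll('"fee" "fi"', usedBefore), m => m[1]);
console.log(captures);             // [ 'fee', 'fi' ]
console.log(usedBefore.lastIndex); // 3
```

The copy starts at index 0, so the first match is found, and the original still has the .lastIndex it had before.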

Using matchAll():

const str = '"fee" "fi" "fo" "fum"';
const regex = /"([^"]*)"/g;

for (const match of matchAll(str, regex)) {
  console.log(match[1]);
}
// Output:
// fee
// fi
// fo
// fum

FAQ  

Why not RegExp.prototype.execAll()?  

On one hand, .matchAll() does work like a batch version of .exec(), so the name .execAll() would make sense.

On the other hand, .exec() changes regular expressions and .match() doesn’t. That explains why the name .matchAll() was chosen.
