Neil, in profile, wearing welding goggles

Neil Kandalgaonkar

Software Engineering Leader & Open Source Innovator

A regular expression that can determine if a number is prime

Yes, it's possible for a regular expression to determine if a number is prime! Here it is:

/^1?$|^(11+?)\1+$/

If you are going cross-eyed, that was intentional. Long ago, the original author posted it as an enigmatic one-liner, as a challenge to other hackers. We'll learn more about that history later, but let's get straight into how it works. I promise that, once you understand this, you'll never look at regular expressions the same way again.

To use it in an actual program, it would look something like this:

const regex = /^1?$|^(11+?)\1+$/;

function isPrime(num) {
    const s = '1'.repeat(num);
    return !s.match(regex);
}

for (i = 0; i <= 10; i++) {
    console.log(
        i,
        isPrime(i)
            ? "is prime"
            : "is not prime"
    );
}
0 is not prime
1 is not prime
2 is prime
3 is prime
4 is not prime
5 is prime
6 is not prime
7 is prime
8 is not prime
9 is not prime
10 is not prime

How to use it

This regex doesn't work on a number's string representation. We're not checking if the string 53 is prime. We'd have to list literally all of them:

/^2|3|5|7|11|13 ..... (an infinite series of numbers) $/

When we want to test a number n, we start by constructing a string of length n.

0 -> ''
1 -> 'x'
2 -> 'xx'
3 -> 'xxx'
7 -> 'xxxxxxx'
...
53 -> 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

In this case, the string we are going to construct will be n 1s. The only reason the author used 1 is to confuse you. There are other syntactic uses of 1 in the regular expression, as an actual number, and you just have to disambiguate it by context.

0 -> ''
1 -> '1'
2 -> '11'
3 -> '111'
7 -> '1111111'
...
53 -> '11111111111111111111111111111111111111111111111111111'

Explanation of the regular expression

So here’s what the full regular expression is saying:

/
    ^          # match the beginning of the string, 
     1?        # an optional "1", 
    $          # and then the end of the string

  |            # or...

    ^          # match the beginning of the string
      (        #   begin storing the following in the first group:
        1      #     match a "1"
        1+?    #     then match one or more "1"s, minimally.
      )        #   end storing the first group
      \1+      #   match the first group repeated 1 or more times.
    $          # match the end of string
/

Let’s follow along with the regular expression matching engine. Let’s consider the case of N = 9. This generates a string like this (here padded with spaces for clarity):

1 1 1 1 1 1 1 1 1

The regex is divided into two major parts by the | character. This is an alternation; the regex matches if (everything before matches) or (everything after).

First alternation; handle 0 or 1

First, the ^ in the regex will anchor the match to the start of the string. It can’t match this anywhere in the middle.

/^1?$|^(11+?)\1+$/

1 1 1 1 1 1 1 1 1

The next part is 1?, an optional one, so it actually can match nothing.

/^1?$|^(11+?)\1+$/

1 1 1 1 1 1 1 1 1

The next part of the regex is another anchor, $, which wants to match the end of the string. But our string doesn’t end there, so it fails, here represented by red.

/^1?$|^(11+?)\1+$/

11 1 1 1 1 1 1 1

Now what? Well, our regular expression has failed, so it looks back to see if there were any other possibilities that it hasn’t tried yet. This is called backtracking. It remembers that there was that optional one. What happens if it tries matching a real one?

/^1?$|^(11+?)\1+$/

111 1 1 1 1 1 1

Nope, that didn’t work out.

So, we see with the trivial first part, if the string had been just "" or "1" it would have worked. And that corresponds to the case of n = 0 or n = 1.

We’ve also seen how a regular expression engine works. If you give it a specification like /match.*me/, humans intuitively think about it as "match, then some other stuff, then me". But regular expressions don't scan text like your eyes. They are state machines. It's a miniature programming language, which just happens to kind of look like the strings they match. They describe a program for consuming a string, character by character, looping over some bits, and skipping over others.

Most of the regular expression engines you'll encounter have backtracking; if the state machine can't proceed further, it will look back and see if there was another way to consume the string that might work. And that's how this regular expression exploits to do its magic.

Second alternation; handle all the other numbers

We start again by anchoring to the beginning of the string.

/^1?$|^(11+?)\1+$/

1 1 1 1 1 1 1 1 1

(11+?) will match two ones at the start. Because these are in parentheses, this is a grouping; the value of what they match is captured into the backreference \1. Note that in this case \1 just means "the first thing in parentheses". It doesn’t have to do with the literal "1"s in the string.

/^1?$|^(11+?)\1+$/

1 11 1 1 1 1 1 1

Next, \1+ matches one or more strings identical to the first matched grouping, which in this case is 1 1.

/^1?$|^(11+?)\1+$/

1 11 11 11

Oops, but the string doesn’t end there!

/^1?$|^(11+?)\1+$/

1 11 11 11

We’ll have to backtrack. This doesn’t work...

/^1?$|^(11+?)\1+$/

1 11 11 1 1

Okay, we have nothing left to backtrack to. How about that 11+?, let’s try matching three ones, and calling that \1:

/^1?$|^(11+?)\1+$/

1 1 11 1 1 1 1 1

And let’s match as many of these new \1 as we can:

/^1?$|^(11+?)\1+$/

1 1 11 1 11 1 1

The end of string is next -- this matched!

/^1?$|^(11+?)\1+$/

1 1 11 1 11 1 1

So, you can see how this process is analogous to trying to divide a number by successively larger divisors, leaving no remainder. For n = 9 it tried 2, and that failed, and then 3, which succeeded. But in the case of a prime number, this is never going to succeed.

Matching strategies

This regex is made much more efficient due to the inclusion of a single character: ?.

If that were not included, the regex would immediately consume the entire string at (11+). The question mark makes the match minimal, so it will start by trying to "divide" the number by 2, then 3, etc. It will reject 32000 instantly, and take somewhat longer to determine that 32003 is prime. A good reminder that backtracking doesn’t always reduce the size of a matched portion; with minimal matches it will backtrack and try matching more.

There is still some inefficiency here - there are useless attempts to match fewer multiples of \1. Some regular expression engines have a special grouping that disallows backtracking: (?>), but it doesn't make much difference.

And even the dumbest prime-finding function shouldn’t bother to check numbers greater than the square root of n. There’s no point in checking if 32003 can be divided by 25123. Nor should it check if the string is divisible by 4 when if we already know it isn’t divisible by 2. So this is hardly a practical technique.

State machines

I promised you would learn more about regular expressions.

The average programmer thinks of a regex as just a string with wildcards, like, match quux or quuuuux against the pattern /qu+x/.

This little prime-finder gem illuminates the true nature of regular expressions. Programmers who are new to regular expressions often flounder when they have to debug them. Or they expect the computer to have insight into the problem, and thus don’t understand why /(<(.*)>.*<\1>)*/ not only doesn’t grab all the HTML in a string, but might even take centuries to run. That’s why regular expressions have a reputation for being scary and dangerous. But now you understand why they're worth the trouble. Once you understand that you're programming a little state machine; what to do at each step, what to do when things don't work - with just a few keystrokes you can specify an incredibly complex task!

The state machines in regexes are special-purposed for matching strings. But I find programmers don't use this paradigm nearly enough. You often see a terrifyingly long set of conditionals, frantically trying to capture everything that could have gone in a user's journey through an app, when it could be so much easier to directly model it as a state machine.

Origins

As far as I know, the prolific hacker Abigail invented it in 1998 or some time before. Abigail was famous for posting enigmatic but ingenious one-liner scripts in her signatures on forums. At the time, web forums were rare, and a distributed discussion protocol called Netnews was popular.

Everything was text, so, for flavor, people would add signatures to their posts. Some people added ascii art. In some programming communities, you would add a tiny one-liner program to show off your skills.

Abigail had a large number of these, and were apparently automatically rotated for each new post. Here's the first post I could find that used the prime-finding regex. I'm including the full post just so you get the flavor of how it would look.

Date: 16 Sep 1998 20:19:40 GMT
From: abigail@fnx.com (Abigail)
Subject: Re: how safe is xor encryption ?
Message-Id: <6tp6gs$k0k$1@client3.news.psi.net>

Elaine -HappyFunBall- Ashton (eashton@bbnplanet.com) wrote on MDCCCXLII
September MCMXCIII in :
++ Mark-Jason Dominus wrote:
++
++ > If the key is intercepted, it is not very useful to the snooper. They
++ > can use it to authorize purchases from you, but not from anyone else.
++
++ How so? If I can get the key, I can get the CC# and then I can make
++ purchases anywhere I want.

No. You need the key *and* the encrypted number.

++ > If the entire database is stolen, it is not useful to the thief
++ > without a database of decryption keys, and no such database exists.
++
++ You presume that the key length is sufficient to render the decryption
++ impractical or impossible. Also, too, that the customer has chosen a
++ totally random key which, in my experience at least, is all too rare an
++ occurance.

Impossible is the right answer. The beauty of XOR encryption is that even
if you brute force try all keys, and you hit the right one, you have
absolutely no idea it is the right key.

For instance, I might give you the encrypted text "t^R][S". You might
try keys, and find out that XORring with "123456" gives "Elaine". However,
XORring with "5-:)4=" will give you "Ashton". And a key of "
!" will give you "Hacker". For any 6 character string, there will be a unique,
6 character key that, when applied to the encrypted text, gives you the
target text.

Abigail
-- 
perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/'
A netnews discussion in comp.lang.perl.misc

The Perl one-liner deserves a little explication of its own. You would run it like this:

$ perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' 18
$ perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' 19
Prime
$ perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' 20
$ perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' 21
$ perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' 22
$ perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' 23

Let's unwrap this.

STATEMENT if CONDITION

A quirk of Perl syntax; a statement may have ’modifiers’ at the end like this. This is identical to if CONDITION { STATEMENT }.

shift

shift normally removes the first element of an array and returns that element, like popping from the beginning. Since we are in file scope, shift without array is magically interpreted to mean shift @ARGV, which means the number we fed in as an argument.

1 x shift

x is the string repetition operator. "1" x 9 yields "111111111". (see man perlop, grep for ’repetition operator’.)

!~

The logical negation of =~, so it will succeed if the regex does not match. Since we are looking for primes, the regex will match all non-primes.

Me

And that's where I came into the picture. I used to post on these forums too, with much less flair or knowledge. Once I saw the prime-finding regex, I had to figure out how it worked. I posted an exegesis to my local programming meetup around the year 2000, and the content has been finding its way into viral posts every few months ever since.