Regular expressions in Haskell

If you are writing a piece of software which needs to process text, regular expressions are a useful tool. There are many mature regex libraries in the wild, but I thought it would be a fun exercise to implement some basic regex functionality in Haskell.

Setting up the workspace

Before we write any code, we need to set up a workspace. We shall use the stack build system. First open a shell and run

stack new hs-regex protolude
cd ./hs-regex

Protolude is a minimal Haskell prelude which I like using. We are going to be using the vector and text libraries, so in hs-regex.cabal, edit the library section so it looks like

library
  hs-source-dirs:      src
  ghc-options:         -Wall
  exposed-modules:     Lib
                     , Regex
  other-modules:       Lib.Prelude
  build-depends:       base >= 4.7 && < 5
                     , protolude >= 0.1.6 && < 0.2
                     , vector >= 0.1
                     , text >= 1.2
  default-language:    Haskell2010
  default-extensions:  OverloadedStrings, NoImplicitPrelude, OverloadedLists

Now we are ready to start writing ./src/Regex.hs.

the Code

First we need to declare the module and import some libraries.

module Regex where

import Lib.Prelude
import qualified Data.Vector as V
import qualified Data.Text as T

Importing Lib.Prelude brings protolude into the module namespace. Now lets define the syntax tree for our regular expressions:

data Regex = Empty                -- matches empty string
           | Atom Char            -- matches a single character
           | Not Char             -- matches any other character
           | Union Regex Regex    -- matches first regex OR second regex
           | Concat Regex Regex   -- matches first regex THEN second regex
           | Kleene Regex         -- matches zero or more occurrences of regex

In order to keep things short, we are sticking to a pretty small set of regular expression constructors. Now that we have specified our regular expressions, we need to decide what we want to do with them. Often, regular expressions are used for searching, but that is a lot to tackle at once. Instead, let's use our regular expressions to split of a matching sequence of characters from the start of the character stream. If there is no match, we need to return an error:

data RegexError = UnexpectedCharacter Char
                | StreamEmpty
                deriving Show

We want to define a function

useRegex :: Regex -> Text -> (Either RegexError (V.Vector Char),Text)

This is exactly what the state monad is designed for. We can fill in the definition for useRegex as follows:

type RegexMachine m = StateT Text m

runRegexMachine :: RegexMachine m a -> Text -> m (a, Text)
runRegexMachine = runStateT

runRegexMachineIdentity :: RegexMachine Identity a -> Text -> (a, Text)
runRegexMachineIdentity machine text = runIdentity $ runRegexMachine machine text

useRegex :: Regex -> Text -> (Either RegexError (V.Vector Char),Text)
useRegex regex = runRegexMachineIdentity $ buildRegexMachine regex

useRegexIO :: Regex -> Text -> IO (Either RegexError (V.Vector Char),Text)
useRegexIO regex = runRegexMachine $ buildRegexMachine regex

buildRegexMachine :: (Monad m) => Regex -> RegexMachine m (Either RegexError (V.Vector Char))

We define RegexMachine using the StateT monad transformer because then we can also construct useRegexIO which lets us use our regular expressions on values with type IO Text. Now we need to fill in the definition for buildRegexMachine. First we have the Empty constructor:

buildRegexMachine Empty = return $ Right []

The Empty regex matches the empty string, so we return nothing and consume no characters.

buildRegexMachine (Atom x) = do s <- get
                                let split = T.uncons s
                                case split of
                                  Just (c,rest) -> if c == x then (put rest) >> (return $ Right [c])
                                                             else return $ Left $ UnexpectedCharacter c
                                  Nothing       -> return $ Left StreamEmpty

The Atom x regex matches a single character, so we get the character stream and pull of the head. If it matches, we consume the character from the stream and return the head. Otherwise we return an UnexpectedCharacter error. The other case occurs when the stream is empty. In that case, we return a StreamEmpty error.

buildRegexMachine (Not x) = do s <- get
                               let split = T.uncons s
                               case split of
                                 Just (c,rest) -> if c == x then return $ Left $ UnexpectedCharacter c
                                                            else (put rest) >> (return $ Right [c])
                                 Nothing -> return $ Left StreamEmpty

The Not x regex is very similar to the Atom x regex.

buildRegexMachine (Union regex1 regex2) = do backup  <- get
                                             output1 <- machine1
                                             case output1 of
                                               Left _  -> do put backup
                                                             output2 <- machine2
                                                             case output2 of
                                                               Left e  -> (put backup) >> (return $ Left e)
                                                               Right v -> return $ Right v
                                               Right w -> return $ Right w
  where machine1 = buildRegexMachine regex1
        machine2 = buildRegexMachine regex2

The code the Union regex1 regex2 regex is a little more complicated. We try the first regex. If it fails, we need to roll back the stream and try the second.

buildRegexMachine ( Concat regex1 regex2) = do backup <- get
                                               output1 <- machine1
                                               case output1 of
                                                 Left e  -> (put backup) >> (return $ Left e)
                                                 Right v -> do output2 <- machine2
                                                               case output2 of
                                                                 Left f -> (put backup) >> (return $ Left f)
                                                                 Right w -> return $ Right (v V.++ w)
  where machine1 = buildRegexMachine regex1
        machine2 = buildRegexMachine regex2

For Concat regex1 regex2, we try and match both regexes in order. If at any point we fail, we need to revert the stream and return the error.

buildRegexMachine (Kleene regex) = Right <$> (buildKleeneMachine machine)
  where machine = buildRegexMachine regex

-- a kleene machine can't fail
buildKleeneMachine :: (Monad m) => RegexMachine m (Either RegexError (V.Vector Char)) -> RegexMachine m (V.Vector Char)
buildKleeneMachine machine = do output <- machine
                                case output of
                                  Left _  -> return []
                                  Right v -> do vs <- buildKleeneMachine machine
                                                return (v V.++ vs)

Since Kleene regex can't fail, we deviate slightly from the above pattern. That completes the definition of buildRegexMachine. We can also make Regex an instance of some type classes so we can use some generic functions:

instance Monoid Regex where
  mempty = Empty
  mappend = Concat

instance Semiring Regex where
  one = undefined -- not sure what should go here.
  (<.>) = Union

Now lets look at some examples:

wordRegex :: [Char] -> Regex
wordRegex w = mconcat (Atom <$> w :: [Regex])

red = wordRegex "red"
blue = wordRegex "blue"
redblue = red `mappend` blue

In the REPL we have

> useRegex redblue "redbluered"
(Right "redblue","red")
> useRegex redblue "0redbluered"
(Left (UnexpectedCharacter '0'),"0redbluered")
> useRegex (Kleene redblue) "redbluered"
(Right "redblue","red")
> useRegex (Kleene redblue) "0redbluered"
(Right "","0redbluered")