Overview
As software developers, we deal with language parsers every day. Our tools, like Typescript, ESLint, and language servers have become so common that parsing is happening constantly as part of our routine. In this article, we’re going to explore a particular type of parser that is conceptually simple, super flexible, and crazy easy to test. We’ll use these parsers to help us solve a problem that many of us have been faced with and frustrated by: parsing dates.
There are loads of different kinds of parsers. You can buy entire books on parsing theory and language construction (and you should!). The type of parser that we’ll be exploring today is called a parser combinator.
The idea behind parser combinators is that a complex parser can be built from simple pieces, from the ground up. Much like how your rad Lego masterpieces are made up of hundreds of smaller parts. First, we will create a parser that only recognizes a single character, but by combining multiples of those, we’ll be able to recognize strings. Add a few boolean operators like “or” and “and” and now we can recognize numbers and other patterns.
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
At this point you might be wondering, why not just use regular expressions? Well, you could! In some cases, regular expressions are definitely the right tool for the job. Regular expressions are, however, hard to write, require a special syntax, and quickly grow into an unmaintainable mess when you start generating them dynamically. Have you seen the regular expression to validate email addresses?
Most importantly, learning new things is fun! Even if you don’t end up using parser combinators in production code, they are still a fantastic tool to add to your belt. For the remainder of this article, we’re going to build up a small library of parser combinators and then combine them to create something powerful. Hopefully, by the end, you will see the usefulness of these tools and your head will be swimming with creative uses!
If you’d like to skip to the end result or just follow along, you can find the code here.
The simplest case
A parser combinator is just a fancy term for a function that returns a function that does or matches something very specific. This is best learned through example.
type Combinator = (str: string) => boolean;
export function char(c: string): Combinator {
return (str: string) => {
if (str[0] === c) {
return true;
} else {
return false;
}
};
}
Here we’ve got a function, char
, that takes a string (a single character) and returns a function that returns true if the passed-in string matches, and false if it does not. Pretty simple! Let’s see how easy this is to test.
import { char } from "../src/combinators";
describe("char", () => {
it("matches a single character", () => {
expect(char("a")("abc")).toBeTruthy();
});
it("fails matching if the character is not expected", () => {
expect(char("b")("abc")).toBeFalsy();
});
});
Combing combinators
You might be questioning our sanity right now. How could a small function like that be of any use to anybody? Well, to be honest, it’s not much use…yet. Let’s add another tool to our arsenal.
export function either(...combinators: Combinator[]): Combinator {
return (str: string) => {
for (let i = 0; i < combinators.length; i++) {
if (combinators[i](str)) {
return true;
}
}
return false;
};
}
OK, now we’ve stepped it up a notch. We’ve created a combinator that takes other combinators as input. This function will return true if any of its combinators return true. This is kind of like a boolean OR operation. Again, here are a few tests.
import { char, either } from "../src/combinators";
describe("either", () => {
const aOrB = either(char("a"), char("b"));
it("returns true if any of the combinators are true", () => {
expect(aOrB("a")).toBeTruthy();
expect(aOrB("b")).toBeTruthy();
});
it("returns false none of the combinators are true", () => {
expect(aOrB("c")).toBeFalsy();
});
});
Big deal! We can match an “a” or a “b”. Ok, I see your skepticism, and I raise you another example.
const digit = either(..."0123456789".split('').map(char));
const hexDigit = either(digit, ..."abcdefABCDEF".split('').map(char));
digit('1'); // true
digit('a'); // false
hexDigit('C'); // true
hexDigit('G'); // false
By combining our two simple combinators, we’ve created new combinators that can match a single decimal or hexadecimal digit. Take a look at hexDigit
and you’ll notice that we even reuse the digit combinator.
More combinators
Before we go any further, we need to address something I was hoping you wouldn’t notice. In order to keep things simple, we’ve only ever matched a single character. In fact, the way this is designed now, we can only ever match a single character. Our combinators don’t know anything about the string they are parsing and only ever return a boolean match result. Let’s make our combinators a little bit smarter and immeasurably more practical.
interface SuccessResult {
success: true;
value: string;
rest: string;
}
interface FailureResult {
success: false;
}
type CombinatorResult = SuccessResult | FailureResult;
type Combinator = (str: string) => CombinatorResult;
export function char(c: string): Combinator {
return (str: string) => {
if (str[0] === c) {
return {
success: true,
value: c,
rest: str.substring(1)
};
} else {
return {
success: false
};
}
};
}
export function either(...combinators: Combinator[]): Combinator {
return (str: string) => {
for (let i = 0; i < combinators.length; i++) {
const result = combinators[i](str);
if (result.success) {
return result;
}
}
return { success: false };
};
}
That wasn’t so bad! Now, instead of just returning true/false, we are returning the true/false status, the value that was matched, and the rest of the string that was not matched. Let’s see what we can do now and why this is important.
export function sequence(...combinators: Combinator[]): Combinator {
return (str: string) => {
let rest = str;
let value = "";
for (let i = 0; i < combinators.length; i++) {
const result = combinators[i](rest);
if (result.success) {
rest = result.rest;
value += result.value;
} else {
return { success: false };
}
}
return {
success: true,
value,
rest
};
};
}
And of course a few tests:
import { char, sequence } from "../src/combinators";
describe("sequence", () => {
const aAndB = sequence(char("a"), char("b"));
it("returns true if all of the combinators are true", () => {
expect(aAndB("abcdef")).toEqual({
success: true,
value: "ab",
rest: "cdef"
});
});
it("returns false any of the combinators are false", () => {
expect(aAndB("bcdef")).toEqual({
success: false
});
});
});
We pass a series of combinators into sequence
, and as they match, we pass the unmatched portion of the string to the next combinator, with the returning result being the sum of all the matches from the sequence. This acts a lot like the boolean AND operation.
Now we can match entire strings, instead of single characters.
const string = (str: string) => sequence(...str.split('').map(char));
string("abc")("abcdef"); // "abc"
string("abd")("abcdef"); // false
Let’s build up our toolkit with a few more simple combinators.
export function nOrMore(n: number, combinator: Combinator): Combinator {
return (str: string) => {
let matches = 0;
let rest = str;
let value = "";
while (1) {
const result = combinator(rest);
if (result.success) {
matches++;
value += result.value;
rest = result.rest;
continue;
}
break;
}
if (matches >= n) {
return {
success: true,
value,
rest
};
}
return {
success: false
};
};
}
This is easy enough. We’ll pass in a minimum number of times we want a combinator to match. If it doesn’t match at least that many times, we’ll return a failure.
export function optional(c: Combinator): Combinator {
return (str: string) => {
const result = c(str);
if (result.success) {
return result;
}
return {
success: true,
value: "",
rest: str
};
};
}
Take a close look at this one as it is a bit of an oddity. If a match is found, we’ll return the match. If no match was found, instead of failing, we will return an empty match! By returning a successful result, and not consuming any of the input string, we’ve made a function that matches an optional combinator but moves on if it is not found.
Let’s put some of these together and do something fun.
const digit = either(..."0123456789".split("").map(char));
const hexDigit = either(digit, ..."abcdefABCDEF".split("").map(char));
const hexNumber = sequence(string("0x"), nOrMore(1, hexDigit));
hexNumber("0x1afb"); // "0x1afb"
hexNumber("hello"); // false
This will match any string that begins with 0x
followed by one or more hex characters. Pretty neat. Let’s try another.
const digit = either(..."0123456789".split("").map(char));
const integer = nOrMore(1, digit);
const floatingPoint = sequence(integer, char("."), integer);
const number = either(floatingPoint, integer);
number("123"); // "123"
number("3.14"); // 3.14
- We’ve created a combinator to match any single digit.
- We create a new combinator that matches one or more of those digits (an integer)
- We create a new combinator that matches an integer, followed by a period, followed by another integer (a floating point value).
- We create a new combinator that matches either an integer or a floating point.
By combining a few of our simple tools, we can match floating point and integer numbers. I hope you are starting to imagine situations you could use combinators in.
Capture groups
We are missing one thing before we can start using combinators for more than trivial things like just knowing if a string matches some pattern. Borrowing a name from regular expressions, we’ll call these “capture groups”. We’ll add the ability for our combinators to save their values with a designated name. That way, we can pull them out and use them from other combinators.
interface SuccessResult {
success: true;
value: string;
rest: string;
captures: Record<string, string>;
}
interface FailureResult {
success: false;
}
type CombinatorResult = SuccessResult | FailureResult;
type Combinator = (str: string) => CombinatorResult;
For brevity, I won’t post all the code here. Here is an example of merging capture groups together in a combinator.
export function sequence(...combinators: Combinator[]): Combinator {
return (str: string) => {
let rest = str;
let value = "";
let captures = {};
for (let i = 0; i < combinators.length; i++) {
const result = combinators[i](rest);
if (result.success) {
rest = result.rest;
value += result.value;
captures = {
...captures,
...result.captures
};
} else {
return { success: false };
}
}
return {
success: true,
value,
rest,
captures
};
};
}
Of course, capture groups are not that useful unless we create them. Let’s create a simple combinator that doesn’t consume any of the input string but instead just creates a capture group with the results of a different combinator. While we are at it, we’ll also create a combinator that transforms the result of one combinator into something else. You’ll see why this is useful later.
export function capture(
name: string,
combinator: Combinator,
map?: (input: string) => string
): Combinator {
return (str: string) => {
const result = combinator(str);
if (result.success) {
return {
...result,
captures: {
...result.captures,
[name]: map ? map(result.value) : result.value
}
};
}
return result;
};
}
export function map(
combinator: Combinator,
m: (r: CombinatorResult) => CombinatorResult
): Combinator {
return (str) => {
const result = combinator(str);
if (result.success) {
return m(result);
}
return result;
};
}
Let’s demonstrate capture groups by building on our trusty number parser.
const digit = either(..."0123456789".split("").map(char));
const integer = nOrMore(1, digit);
const floatingPoint = sequence(integer, char("."), integer);
const number = either(integer, floatingPoint);
const plus = sequence(
capture("left", number),
char("+"),
capture("right", number)
);
plus("1+1");
As a result, we find our captures
object filled in with left
and right
properties.
{
success: true,
value: "1+1",
rest: "",
captures: {
left: "1",
right: "1"
}
}
Date Parsing
Remember way back at the beginning, I promised we’d parse some dates? Well, I didn’t forget! We just didn’t have a big enough toolbox until now. Our goal here is to take a string that could be in a variety of formats and turn it into a Javascript object.
We’ll add support for the following date formats:
- Jan 1st, 2023
- January 1st, 2023
- 1/1/2023
- 2023-01-01
As well as a couple of time formats:
- 23:30
- 4am
- 12:30 am/pm
Let’s start with parsing the time.
const ws = nOrMore(0, char(" "));
const digit = either(..."0123456789".split("").map(char));
const numberFrom = (min: number, max: number) => either(
...Array.from(Array(max - min + 1).keys()).map((i) => char(`${min + i}`))
);
const twentyFourHourTime = sequence(
capture("hour", sequence(numberFrom(0, 2), digit)),
char(":"),
capture("minute", sequence(numberFrom(0, 6), digit))
);
const twelveHourTime = map(
sequence(
capture("hour", either(sequence(optional(char("1")), digit), digit)),
optional(
sequence(char(":"), capture("minute", sequence(numberFrom(0, 6), digit)))
),
ws,
capture("suffix", sequence(either(string("am"), string("pm"))))
),
(r) => {
const { hour, minute = "0", suffix } = r.captures;
let h = parseInt(hour, 10);
if (suffix === "pm" && h !== 12) {
h += 12;
} else if (suffix === "am" && h === 12) {
h = 0;
}
return {
...r,
captures: {
hour: h.toString(),
minute
}
};
}
);
export const time = either(twelveHourTime, twentyFourHourTime);
OK let’s unpack that. We created two new combinators using the previous combinators we’ve written, one for each of our time formats. For 12-hour times,
- The hour is either going to start with a 1, in which case it might be two digits, or it will be a single digit. This should account for both “10” and “1”.
- We then optionally match a “:” followed by two digits. If we have that, it means we’ve got some minutes.
- Then, after some optional whitespace, we look for an am/pm suffix.
- Finally, in order to only have to deal with 24 hour time in the future, we map the twelve hour time into 24 hour time.
Don’t miss this! We’re taking a combinator result, performing some operations, and returning a new result.
time("12:00"); // { hour: "12", minute: "00" }
time("12am"); // { hour: "0", minute: "0"}
time("4:15pm"); // { hour: "16", minute: "15" }
OK, let’s tackle our date formats. These are the same concepts, just combined in different ways.
const dash = char("-");
const slash = char("/");
const validateDate = (result: SuccessResult): CombinatorResult => {
const { month, date } = result.captures;
const m = parseInt(month, 10);
const d = parseInt(date, 10);
if (m < 1 || m > 12 || d < 1 || d > 31) {
return {
success: false
};
}
return result;
};
// 2023-01-01
export const YYYYMMDD = map(
sequence(
capture("year", sequence(digit, digit, digit, digit)),
dash,
capture("month", sequence(digit, digit)),
dash,
capture("date", sequence(digit, digit))
),
validateDate
);
// m/d/YYYY
export const MDYYYY = map(
sequence(
capture("month", nOrMore(1, digit)),
slash,
capture("date", nOrMore(1, digit)),
slash,
capture("year", sequence(digit, digit, digit, digit))
),
validateDate
);
const months = [
"January",
"February",
"March",
"April",
"May",
"June",
"July",
"August",
"September",
"October",
"November",
"December"
];
export const monthDDYYYY = sequence(
either(
map(capture("month", either(...months.map((m) => string(m)))), (m) => ({
...m,
captures: {
month: String(months.indexOf(m.captures.month) + 1)
}
})),
map(
capture("month", either(...months.map((m) => string(m.substring(0, 3))))),
(m) => ({
...m,
captures: {
month: String(
months.findIndex((month) => month.indexOf(m.captures.month) === 0) + 1
)
}
})
)
),
ws,
capture("date", nOrMore(1, digit)),
optional(either(string("st"), string("th"))),
ws,
optional(char(",")),
ws,
capture("year", sequence(digit, digit, digit, digit))
);
export const date = either(YYYYMMDD, MDYYYY, monthDDYYYY);
The beginning of that monthDDYYYY function is a little confusing. We’re adding a combinator to match either long or short month names (“January” vs “Jan”) and then mapping their names back to their index, so “January” would become “1”, “February” would become “2”, and so on. Like when we adjusted 12-hour time to match 24-hour time, it will help us if we keep all our data in the same format. We can parse all of the dates we set out to parse now:
date("Jan 1st, 2023"); // { month: "1", date: "1", year: "2023" }
date("February 24th, 2023"); // { month: "2", date: "24", year: "2023" }
date("2023-05-12"); // { month: "05", date: "12", year: "2023" }
date("5/12/2023"); // { month: "5", date: "12", year: "2023" }
I hope you are still with me. It’s been a journey but we are reaching the end. Let’s bundle up our date and time parsers into a single convenient function.
export const dateAndTime = sequence(date, ws, optional(time));
export function parseDateTime(str: string): Date | null {
const result = dateAndTime(str);
if (result.success === false) {
return null;
}
const { month, date, year, hour = "0", minute = "0" } = result.captures;
return new Date(
parseInt(year, 10),
parseInt(month, 10) - 1,
parseInt(date, 10),
parseInt(hour, 10),
parseInt(minute, 10),
0,
0
);
}
parseDateTime
will either return a javascript Date
object, or a null
value. If the result is null
, we know that we were unable to parse the input. If the result is a Date
object, things must have gone OK. Go ahead and give this a try and parse a few dates. It’s pretty amazing what we have accomplished just by stacking a few tools on top of each other!
Where to go from here
Congratulations on making it this far! Hopefully, you are starting to see the power of parser combinators and what else you might be able to do with them. Let’s take a quick recap of what we’ve done.
- We discovered the concept of parser combinators and built up a few tools from the ground up
- We wrote some parser combinators to match various kinds of numbers
- We used our parser combinators to create new combinators to parse several different date and time formats
- We wrote a function that takes a string and returns a simple
Date
object. The external API for our parser combinator couldn’t be more simple
I encourage you to jump in and play around with the code and make it your own. I’ve got a few ideas for you to try if you need some inspiration.
- Make the date optional in some of the formats, so “Jan 2023” would be parsed as
{ month: "1", date: "1", year: "2023" }
- Add additional date/time formats
- Add additional combinators, like an end combinator that will only match at the end of the string. You can use this to make sure there are no “trailing” characters after a match
- Add error handling! Right now, if a parser doesn’t match, we simply bail. Try adding an error message to the
FailureResult
type that will let the user know exactly what failed
You might still be wondering if this was really better than just writing a few regular expressions. In short, maybe, maybe not. It depends on your needs. We did some things here that we can’t easily do with regular expressions. Some advantages:
- Our parser combinators are made of small, reusable, little pieces. Once you write a parser combinator you can use it on its own, or, as part of another even bigger combinator.
- Regular expressions can’t really “map” values as we did. Since combinator results are sort of like passing results down a stream, we were able to update our results “upstream” and have the results trickle down. We did this when we mapped 12 hour time into a 24 hour time format.
- Regular expressions are “black boxes”. A string goes in, and either it matches, or it does not. With a little bit of error handling, we can use combinators to specify exactly what went wrong and give the user a little guidance.
- Style preference. Some people just don’t like regular expressions. They find them syntactically difficult and confusing to follow. Parser combinators don’t need any special syntax and just rely on the language features you’re already familiar with.
This has been a small introduction to parser combinators. By composing these little functions you can parse anything from user input to entire languages. I’m hoping that by the end of this, you’ve at least learned something new and maybe thought about where you can go from here. Remember, the code we wrote today is just for learning purposes to explore one implementation of how parser combinators might work. If you are interested in really using parser combinators in your next project, or just want to play with combinations without worrying about how they work, I encourage you to check out the Parsimmon project.