A lesser known characterisation of the Regular Languages

My thesis is slowly getting written, and at some point I should do a post about the paper got accepted to LICS, and maybe some stuff about fibrations, but today I'm talking about regular languages.

Regular languages: you gotta love 'em. There are so many good ways to characterise them and here are a few you are likely to hear about as a computing undergraduate:

Deterministic Finite Automata
Nondeterministic Finite Automata
Regular Expressions
Languages closed under concatenation, union, kleene star, and containing some singletons and the empty set
Read-only Turing machines
Regular grammars
…

The wiki page has a bunch more, like definability in monadic second-order logic, but today I want to talk about one not on its list. Specifically, about the following theorem of Hillebrand and Kanellakis.

Let $L \subseteq \str{\Sigma}$. The following are equivalent:

$L$ can be defined by a simply typed $\lambda$-term of type $\Str_{\Sigma}[\tau] \to \Bool$ for some simple type $\tau$
$L$ is a regular language

To understand this statement, I'll need to fill in a few definitions for you. $\Bool$ is the set of ($\beta\eta$ equivalence classes) of Church-encoded booleans. By strong normalisation, all such strings are equivalent to projections.

$\Bool \coloneq \basety \to \basety \to \basety$
$\Church(\true) \coloneq \lambda t. \, \lambda f. \, t$
$\Church(\false) \coloneq \lambda f. \, \lambda f. \, f$

Similarly, $\Str_\Sigma$ is the set of ($\beta\eta$ equivalence classes) of Church-encoded strings over the finite alphabet $\Sigma = \{a_1, \dots, a_n\}$. You should think of the first $n$ arguments as “cons $a_i$” and the $n+1$-th as the empty string.

Fix alphabet $\Sigma = \{ a_1, \dots, a_n \}$ and let $w=\text{“}w_1\cdots{}w_m\text{”}$ be a string over $\Sigma$.

$\Str_{\Sigma} \coloneq \underbrace{(\basety \to \basety) \to \cdots \to (\basety \to \basety)}_{\text{$n$ times}} \to \basety \to \basety$
$\Church(w) \coloneq \lambda a_1. \, \cdots \, \lambda a_n. \, \lambda \emptyString. \, w_1 (\cdots (w_m \; \emptyString))$

Finally, we have type substitution, i.e., given a particular type $\tau$, we write $\Str_\Sigma[\tau]$ for the type obtained by substituting $\tau$ for the base type $\basety$ in the type $\Str_\Sigma$. In the theorem, we have a kind of “poor man's polymorphism” by allowing all possible valid substitutions. A given $\tau$ is, essentially, a bound on the complexity of functions we can express.

Before we go on to prove this theorem, I will point out that this is just one of a handful of characterisations of compexity classes they give, and I invite you to check the paper for more (including a definition of the class of polynomial time functions). This work is also the starting point of the ``Implicit Automata'' programme carried out by my advisor Cecilia Pradic, Tito Nguyen, and also myself, in trying to characterise automata classes by $\Str \to \{\Bool, \Str\}$ functions in various typed lambda-calculi.

So, how does one go about proving such a theorem? When going from STLC to Automata, you do it by giving a semantic argument, i.e., a suitable interpretation of lambda terms to (something equivalent to) your class of automata. In the opposite direction, you code up your automaton as a lambda term.

Interpreting STLC terms as DFAs

We will interpret STLC terms in the category of finite sets $\catFinSet$. On types, this is given by:

$\interp{\basety} \coloneq \setBinary$
$\interp{\tau \to \sigma} \coloneq \interp{\tau} \to \interp{\sigma}$

And for each term $t : \Str_{\Sigma}[\tau] \to \Bool$, we define a DFA $(Q, \Sigma, \delta, q_0, F)$ by:

$Q \coloneq \interp{\Str_{\Sigma}[\tau]}$
$\delta(a) : Q \to Q$
$\delta(a)(f) \coloneq \lambda a_1.\, \dots\, \lambda a_n.\, \lambda e.\, a\, (f\, a_1\, \dots\, a_n\, e)$
$q_0 \coloneq \lambda a_1.\, \dots\, \lambda a_n.\, \lambda e.\, e$
$F \coloneq \{ q \in Q : \interp{t}(q) = \interp{\Church(\true)} \}$

Basically, we take the space of all possible functions as our state space, and on an input character $a$ we transition between functions by performing the operation of “append $a$”. A simple proof by induction establishes that $\delta(w)(q_0) =_{\beta\eta} \interp{\Church(w)}$. Thus our automata will accept a particular input $w$ if and only if the term would return $\Church(\true)$ for the input $\Church(w)$.

Coding DFAs as STLC terms

This direction is easier; the idea is to directly implement the transition function for the DFA in the simply typed lambda calculus. Suppose we have the DFA $(Q, q_0, \delta, F)$ over the alphabet $\Sigma$. We set $\tau \coloneq \underbrace{\basety \to \cdots \to \basety}_{\text{\(\abs{Q}$ times}} \to \basety\). Then our term in $\Str_\Sigma[\tau]$ is given by $\lambda x.\, \lambda t.\, \lambda f.\, o\, (s\, d_1\, \dots\, d_{\abs{\Sigma}}\, d_\epsilon)$ where:

$d_\epsilon \coloneq \lambda q_1.\, \dots\, \lambda q_{\abs{\Sigma}}.\, q_0$,
$d_a \coloneq \lambda s.\, \lambda q_1.\, \dots\, \lambda q_{\abs{\Sigma}}.\, s\, q'_1, \dots q'_{\abs{Q}}$ where $q'_i = q_j$ if $q_i \xrightarrow{a} q_j$ in DFA
$o \coloneq \lambda s.\, \lambda t.\, \lambda f.\, s\, b_1\, \dots b_{\abs{Q}}$ where $b_i = t$ if $q_i \in F$ and $b_i = f$ otherwise.

Essentially, our type is the Church-encoding of the finite set of states, $d_\epsilon$ encodes our initial state, and each $d_a$ tells us where each state $q_i$ gets sent to on input $a$. Finally, $o$ maps each state to whether it is an accepting state or not. It is easy to see that this is an implementation of the automaton in STLC.

That's all folks! I'm going to get back to my thesis now, but if you hear about any postdocs please drop me a line.

Update: Over on Mastodon, Tito pointed me to a stack exchange post on obscure characterisations of the regular languages which may be of interest to readers.