Writing a Search DSL, Part 1

Hi, readers!

This will be a series of posts covering the development of a Domain Specific Language for search. I’ll dive into the design, testing, and implementation of parsing and evaluation. This will include both pure evaluation and compilation to express queries for an external data store (*SQL, Elastic Search, etc.).

In part 1, I’ll cover why someone might want to use a DSL, as well as the design and implementation of the DSL core.

Motivation

Why would you want to implement a DSL?

Focusing on search, you might want to expose search functionality to users. This might be a public-facing, HTTP API that allows users to look up other user details and posts, like Twitter’s Search API. You might be part of a team at a large Cloud Services provider, putting together a search engine for users to review logs regarding infrastrcuture. This might an internal feature, an enhancement to existing developer tools to facilitate handling customer support requests.

Search is a wide space!

The main reasons to pursue a DSL as a solution are:

Essentially, DSL approaches to solving problems bring the capabilities of language design theory to your problem space.

It’s notable that command line tools like grep, sed, awk, and find are themselves DSLs. A surprising number of problems can be approached with language design techniques in mind.

While there are many benefits to such an approach, it is not without costs. The primary costs are knowledge and time - knowing enough about language design, parsing, and the domain to bring together the pieces, and having enough time to make it all work.

This series of posts aims to address the knowledge part of the cost, by demonstrating what this process looks like from start to finish.

Let’s get into it!

The Language

We’re going to start with a simple search language and build on it in later posts. Let’s first specify a few parts of the language, enough to perform some interesting searches:

This allows for expressions like:

confidential == false
count > 10 and ip == "10.0.8.0"
date >= "2010-10-12" and date <= "2010-11-12"

Future extensions to the language could happen along any of the above axes: more operations, more types, more type constraints, more logical combinators.

For a convenient implementation, the limiting factor is how much the host programming language’s type system can accommodate. I’ll be using Haskell in this series, but any language that allows for algebraic data types should be enough to reproduce this work. It’s entirely possible to develop a DSL solution in other languages, but there’ll be more testing to do.

Let’s implement the language core.

We’d like to support all of <, >, ==, >=, <=. Let’s add those in:

data Op
  = Lt   -- <
  | Gt   -- >
  | Eq   -- ==
  | Gte  -- >=
  | Lte  -- <=
    deriving Eq

Next, we’ll add support for types and values:

data Type
  = IntType
  | StringType
  | BoolType
    deriving Eq

data Literal
  = IntLit Int
  | StringLit Text
  | BoolLit Bool
    deriving Eq

Notice that we’re maintaining both type-level and value-level information, and maintain a distinction between the two levels.

Next up, support for variables:

type Var = Text
type Field = (Var, Type)
type Env = [Field]

This is a fairly straightforward approach. We don’t have to worry about bindings or scope. We leave the declaration of allowed variables to the search engine provider, to be exposed via documentation.

Next up: type constraints!

data Supports
  = Equality
  | Ordering
    deriving (Eq, Show)

supports :: Type -> [Supports]
supports IntType = [Equality, Ordering]
supports StringType = [Equality, Ordering]
supports BoolType = [Equality]

needs :: Op -> [Supports]
needs Lt = [Ordering]
needs Gt = [Ordering]
needs Eq = [Equality]
needs Gte = [Ordering, Equality]
needs Lte = [Ordering, Equality]

-- > IntType `can` Lt
-- True
can :: Type -> Op -> Bool
can t op = null (needs op \\ supports t)

Next, the logical combinator and. I’ll also introduce how binary operations are represented at this stage:

data Expr
  = And Expr Expr
  | BinOpL Op Var Literal
  | BinOpR Op Literal Var
    deriving (Show, Eq)

This little recursive data type captures the essence of our language. Let’s break it down:

With this construction, we’ve ruled out some cases:

Here’s the language as a whole, for convenient reading:

data Op
  = Lt   -- <
  | Gt   -- >
  | Eq   -- ==
  | Gte  -- >=
  | Lte  -- <=
    deriving Eq

data Type
  = IntType
  | StringType
  | BoolType
    deriving Eq

data Literal
  = IntLit Int
  | StringLit Text
  | BoolLit Bool
    deriving Eq

data Supports
  = Equality
  | Ordering
    deriving (Eq, Show)

supports :: Type -> [Supports]
supports IntType = [Equality, Ordering]
supports StringType = [Equality, Ordering]
supports BoolType = [Equality]

needs :: Op -> [Supports]
needs Lt = [Ordering]
needs Gt = [Ordering]
needs Eq = [Equality]
needs Gte = [Ordering, Equality]
needs Lte = [Ordering, Equality]

-- > IntType `can` Lt
-- True
can :: Type -> Op -> Bool
can t op = null (needs op \\ supports t)

data Expr
  = And Expr Expr
  | BinOpL Op Var Literal
  | BinOpR Op Literal Var
    deriving (Show, Eq)

Closing

That’s all I’m going to cover for now - the core of the language. In the next post, I’ll build a parser for this language and write some tests for that parser.

If you’d like to get a head start, the source for the search DSL is availaible in this GitLab repo.

Thanks for reading!

Contributing

If you enjoy my works: