Title: Generic programming in Haskell
Author: Wolfgang Jeltsch
Date: 19 February 2016
Tags: functional programming, generic programming, GHC,
Haskell, Institute of Cybernetics, literate
programming, parametric polymorphism, talk, Theory
Lunch, type class, type family, void (Haskell package)
Generic programming is a powerful way to define a function that
works in an analogous way for a class of types. In this
article, I describe [the latest approach to generic programming
that is implemented in GHC][generic-programming-in-ghc]. This
approach goes back to the paper [*A Generic Deriving Mechanism
for Haskell*][generic-programming-paper] by José Pedro
Magalhães, Atze Dijkstra, Johan Jeuring, and Andres
Löh.
[generic-programming-in-ghc]:
https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/generic-programming.html
"Generic programming"
[generic-programming-paper]:
http://dreixel.net/research/pdf/gdmh.pdf
"A Generic Deriving Mechanism for Haskell"
This article is a writeup of a [Theory Lunch][theory-lunch]
talk I gave on 4 February 2016. As usual, the source of this
article is a literate Haskell file, which you can
[download][article-source], load into GHCi, and play with.
[theory-lunch]:
http://theorylunch.wordpress.com/
"Theory Lunch"
[article-source]:
http://hub.darcs.net/jeltsch/blog/raw-file/2016/02/19/generic-programming-in-haskell.lhs
"Generic programming in Haskell"
Motivation
==========
Parametric polymorphism allows you to write functions that deal
with values of any type. An example of such a function is the
`reverse` function, whose type is `[a] -> [a]`. You can apply
`reverse` to any list, no matter what types the elements have.
However, parametric polymorphism does not allow your functions
to depend on the structure of the concrete types that are used
in place of type variables. So values of these types are always
treated as black boxes. For example, the `reverse` function
only reorders the elements of the given list. A function of
type `[a] -> [a]` could also drop elements (like the `tail`
function does) or duplicate elements (like the `cycle` function
does), but it could never invent new elements (except for ⊥) or
analyze elements.
Now there are situation where a function is suitable for a
class of types that share certain properties. For example, the
`sum` function works for all types that have a notion of binary
addition. Haskell uses type classes to support such functions.
For example, the `Num` class provides the method `(+)`, which
is used in the definition of `sum`, whose type `Num a => [a] ->
a` contains a respective class constraint.
The methods of a class have to be implemented separately for
every type that is an instance of the class. This is reasonable
for methods like `(+)`, where the implementations for the
different instances differ fundamentally. However, it is
unfortunate for methods that are implemented in an analogous
way for most of the class instances. An example of such a
method is `(==)`, since there is a canonical way of checking
values of algebraic data types for equality. It works by first
comparing the outermost data constructors of the two given
values and if they match, the individual fields. Only when the
data constructors and all the fields match, the two values are
considered equal.
For several standard classes, including `Eq`, Haskell provides
the deriving mechanism to generate instances with default
method implementations whose precise functionality depends on
the structure of the type. Unfortunately, there is no
possibility in standard Haskell to extend this deriving
mechanism to user-defined classes. Generic programming is a way
out of this problem.
Prerequisites
=============
For generic programming, we need several language extensions.
The good thing is that only one of them, `DeriveGeneric`, is
specific to generic programming. The other ones have uses in
other areas as well. Furthermore, `DeriveGeneric` is a very
small extension. So the generic programming approach we
describe here can be considered very lightweight.
We state the full set of necessary extensions with the
following pragma:
> {-# LANGUAGE DefaultSignatures,
> DeriveGeneric,
> FlexibleContexts,
> TypeFamilies,
> TypeOperators #-}
Apart from these language extensions, we need the module
`GHC.Generics`:
> import GHC.Generics
Our running example
===================
As our running example, we pick serialization and
deserialization of values. Serialization means converting a
value into a bit string, and deserialization means parsing a
bit string in order to get back a value.
We introduce a type `Bit` for representing bits:
> data Bit = O | I deriving (Eq, Show)
Furthermore, we define the class of all types that support
serialization and deserialization as follows:
< class Serializable a where
<
< put :: a -> [Bit]
<
< get :: [Bit] -> (a, [Bit])
There is a canonical way of serializing values of algebraic
data types. It works by first encoding the data constructor of
the given value as a sequence of bits and then serializing the
individual fields. To show this approach in action, we define
an algebraic data type `Tree`, which is a type of labeled
binary trees:
> data Tree a = Leaf | Branch (Tree a) a (Tree a) deriving Show
An instantiation of `Serializable` for `Tree` that follows the
canonical serialization approach can be carried out as follows:
< instance Serializable a => Serializable (Tree a) where
<
< put Leaf = [O]
< put (Branch left root right) = [I] ++
< put left ++
< put root ++
< put right
<
< get (O : bits) = (Leaf, bits)
< get (I : bits) = (Branch left root right, bits''') where
<
< (left, bits') = get bits
< (root, bits'') = get bits'
< (right, bits''') = get bits''
Of course, it quickly becomes cumbersome to provide such an
instance declaration for every algebraic data type that should
use the canonical serialization approach. So we want to
implement the canonical approach once and for all and make it
easily usable for arbitrary types that are amenable to it.
Generic programming makes this possible.
Representations
===============
An algebraic data type is essentially a sum of products where
the terms “sum” and “product” are understood as follows:
* A sum is a variant type. In Haskell, `Either` is the
canonical type constructor for binary sums, and the empty
type `Void` from [the `void` package][void] is the nullary
sum.
* A product is a tuple type. In Haskell, `(,)` is the
canonical type constructor for binary products, and `()` is
the nullary product.
[void]:
http://hackage.haskell.org/package/void
"void: A Haskell 2010 logically uninhabited data type"
The key idea of generic programming is to map types to
representations that make the sum-of-products structure
explicit and to implement canonical behavior based on these
representations instead of the actual types.
The `GHC.Generics` module defines a number of type constructors
for constructing representations:
< data V1 p
<
< infixr 5 :+:
< data (:+:) f g p = L1 (f p) | R1 (g p)
<
< data U1 p = U1
<
< infixr 6 :*:
< data (:*:) f g p = f p :*: g p
<
< newtype K1 i a p = K1 { unK1 :: a }
<
< newtype M1 i a f p = M1 { unM1 :: f p }
All of these type constructors take a final parameter `p`. This
parameter is relevant only when dealing with higher-order
classes. In this article, however, we only discuss generic
programming with first-order classes. In this case, the
parameter `p` is ignored. The different type constructors play
the following roles:
* `V1` is for the nullary sum.
* `(:+:)` is for binary sums.
* `U1` is for the nullary product.
* `(:*:)` is for binary products.
* `K1` is a wrapper for fields of algebraic data types. Its
parameter `i` used to provide some information about the
field at the type level, but is now obsolete.
* `M1` is a wrapper for attaching meta information at the
type level. Its parameter `i` denotes the kind of the
language construct the meta information refers to, and its
parameter `c` provides access to the meta information.
[ghc-generics]:
http://hackage.haskell.org/package/base-4.8.2.0/docs/GHC-Generics.html
"GHC.Generics"
The `GHC.Generics` module furthermore introduces a class
`Generic`, whose instances are the types for which a
representation exists. Its definition is as follows:
< class Generic a where
<
< type Rep a :: * -> *
<
< from :: a -> (Rep a) p
<
< to :: (Rep a) p -> a
A type `Rep a` is the representation of the type `a`. The
methods `from` and `to` convert from values of the actual type
to values of the representation type and vice versa.
To see all this in action, we make `Tree a` an instance of
`Generic`:
> instance Generic (Tree a) where
>
> type Rep (Tree a) =
> M1 D D1_Tree (
> M1 C C1_Tree_Leaf U1
> :+:
> M1 C C1_Tree_Branch (
> M1 S NoSelector (K1 R (Tree a))
> :*:
> M1 S NoSelector (K1 R a)
> :*:
> M1 S NoSelector (K1 R (Tree a))
> )
> )
>
> from Leaf = M1 (L1 (M1 U1))
> from (Branch left root right) = M1 (
> R1 (
> M1 (
> M1 (K1 left)
> :*:
> M1 (K1 root)
> :*:
> M1 (K1 right)
> ))
> )
>
> to (M1 (L1 (M1 U1))) = Leaf
> to (M1 (
> R1 (
> M1 (
> M1 (K1 left)
> :*:
> M1 (K1 root)
> :*:
> M1 (K1 right)
> ))
> )) = Branch left root right
The types `D1_Tree`, `C1_Tree_Leaf`, and `C1_Tree_Branch` are
type-level representations of the type constructor `Tree`, the
data constructor `Leaf`, and the data constructor `Branch`,
respectively. We declare them as empty types:
> data D1_Tree
> data C1_Tree_Leaf
> data C1_Tree_Branch
We need to make these types instances of the classes `Datatype`
and `Constructor`, which are part of `GHC.Generics` as well.
These classes provide a link between the type-level
representations of type and data constructors and the meta
information related to them. This meta information particularly
covers the identifiers of the type and data constructors, which
are needed when implementing canonical implementations for
methods like `show` and `read`. The instance declarations for
the `Tree`-related types are as follows:
> instance Datatype D1_Tree where
>
> datatypeName _ = "Tree"
>
> moduleName _ = "Main"
>
> instance Constructor C1_Tree_Leaf where
>
> conName _ = "Leaf"
>
> instance Constructor C1_Tree_Branch where
>
> conName _ = "Branch"
Instantiating the `Generic` class as shown above is obviously
an extremely tedious task. However, it is possible to
instantiate `Generic` completely automatically for any given
algebraic data type, using the `deriving`{.haskell} syntax.
This is what the `DeriveGeneric` language extension makes
possible.
So instead of making `Tree a` an instance of `Generic` by hand,
as we have done above, we could have declared the `Tree` type
as follows in the first place:
< data Tree a = Leaf | Branch (Tree a) a (Tree a)
< deriving (Show, Generic)
Implementing canonical behavior
===============================
As mentioned above, we implement canonical behavior based on
representations. Let us see how this works in the case of the
`Serializable` class.
We introduce a new class `Serializable'` whose methods provide
serialization and deserialization for representation types:
> class Serializable' f where
>
> put' :: f p -> [Bit]
>
> get' :: [Bit] -> (f p, [Bit])
We instantiate this class for all the representation types:
> instance Serializable' U1 where
>
> put' U1 = []
>
> get' bits = (U1, bits)
>
> instance (Serializable' r, Serializable' s) =>
> Serializable' (r :*: s) where
>
> put' (rep1 :*: rep2) = put' rep1 ++ put' rep2
>
> get' bits = (rep1 :*: rep2, bits'') where
>
> (rep1, bits') = get' bits
> (rep2, bits'') = get' bits'
>
> instance Serializable' V1 where
>
> put' _ = error "attempt to put a void value"
>
> get' _ = error "attempt to get a void value"
>
> instance (Serializable' r, Serializable' s) =>
> Serializable' (r :+: s) where
>
> put' (L1 rep) = O : put' rep
> put' (R1 rep) = I : put' rep
>
> get' (O : bits) = let (rep, bits') = get' bits in
> (L1 rep, bits')
> get' (I : bits) = let (rep, bits') = get' bits in
> (R1 rep, bits')
>
> instance Serializable' r => Serializable' (M1 i a r) where
>
> put' (M1 rep) = put' rep
>
> get' bits = (M1 rep, bits') where
>
> (rep, bits') = get' bits
>
> instance Serializable a => Serializable' (K1 i a) where
>
> put' (K1 val) = put val
>
> get' bits = (K1 val, bits') where
>
> (val, bits') = get bits
Note that in the case of `K1`, the context mentions
`Serializable`, not `Serializable'`, and the methods `put'` and
`get` call `put` and `get`, not `put'` and `get'`. The reason
is that the value wrapped in `K1` has an ordinary type, not a
representation type.
Accessing canonical behavior
============================
We can now apply canonical behavior to ordinary types using the
methods `from` and `to` from the `Generic` class. For example,
we can implement functions `defaultPut` and `defaultGet` that
provide canonical serialization and deserialization for all
instances of `Generic`:
> defaultPut :: (Generic a, Serializable' (Rep a)) =>
> a -> [Bit]
> defaultPut = put' . from
>
> defaultGet :: (Generic a, Serializable' (Rep a)) =>
> [Bit] -> (a, [Bit])
> defaultGet bits = (to rep, bits') where
>
> (rep, bits') = get' bits
We can use these functions in instance declarations for
`Serializable`. For example, we can make `Tree a` an instance
of `Serializable` in the following way:
< instance Serializable a => Serializable (Tree a) where
<
< put = defaultPut
<
< get = defaultGet
Compared to the instance declaration we had initially, this one
is a real improvement, since we do not have to implement the
desired behavior of `put` and `get` by hand anymore. However,
it still contains boilerplate code in the form of the trivial
method declarations. It would be better to establish
`defaultPut` and `defaultGet` as defaults in the class
declaration:
< class Serializable a where
<
< put :: a -> [Bit]
< put = defaultPut
<
< get :: [Bit] -> (a, [Bit])
< get = defaultGet
However, this is not possible, since the types of `defaultPut`
and `defaultGet` are less general than the types of `put` and
`get`, as they put additional constraints on the type `a`.
Luckily, GHC supports the language extension
`DefaultSignatures`, which allows us to give default
implementations that have less general types than the actual
methods (and consequently work only for those instances that
are compatible with these less general types). Using
`DefaultSignatures`, we can declare the `Serializable` class as
follows:
> class Serializable a where
>
> put :: a -> [Bit]
> default put :: (Generic a, Serializable' (Rep a)) =>
> a -> [Bit]
> put = defaultPut
>
> get :: [Bit] -> (a, [Bit])
> default get :: (Generic a, Serializable' (Rep a)) =>
> [Bit] -> (a, [Bit])
> get = defaultGet
With this class declaration in place, we can make `Tree a` an
instance of `Serializable` as follows:
> instance Serializable a => Serializable (Tree a)
With the minor extension `DeriveAnyClass`, which is provided by
GHC starting from Version 7.10, we can even use the
`deriving`{.haskell} keyword to instantiate `Serializable` for
`Tree a`. As a result, we only have to write the following in
order to define the `Tree` type and make it an instance of
`Serializable`:
< data Tree a = Leaf | Branch (Tree a) a (Tree a)
< deriving (Show, Generic, Serializable)
So finally, we can use our own classes like the Haskell
standard classes regarding the use of deriving clauses, except
that we have to additionally derive an instance declaration for
`Generic`.
Specialized implementations
===========================
Usually, not all instances of a class should or even can be
generated by means of generic programming, but some instances
have to be crafted by hand. For example, making `Int` an
instance of `Serializable` requires manual work, since `Int` is
not an algebraic data type.
However, there is no problem with this, since we still have the
opportunity to write explicit instance declarations, despite
the presence of a generic solution. This is in line with the
standard deriving mechanism: you can make use of it, but you
are not forced to do so. So we can have the following instance
declaration, for example:
> instance Serializable Int where
>
> put val = replicate val I ++ [O]
>
> get bits = (length is, bits') where
>
> (is, O : bits') = span (== I) bits
Of course, the serialization approach we use here is not very
efficient, but the instance declaration illustrates the point
we want to make.