## Applicative Bidirectional Programming and Automatic Differentiation

I got referred to an interesting paper by a comment of /u/syrak.

http://www2.sf.ecei.tohoku.ac.jp/~kztk/papers/kztk_jfp_am_2018.pdf

Applicative bidirectional programming (PDF), by Kazutaka Matsuda and Meng Wang

In it, they use a couple interesting tricks to make Lens programming more palatable. Lens often need to be be programmed in a point free style, which is rough, but by using some combinators, they are able to program lenses in a pointful style (with some pain points still left over). It is a really interesting, well-written paper. Lots ‘o’ Yoneda and laws. I’m not doing it justice. Check it out!

A while back I noted that reverse mode auto-differentiation has a type very similar to a lens and in fact you can build a working reverse mode automatic differentiation DSL out of lenses and lens-combinators. Many things about lenses, but not all, transfer over to automatic differentiation. The techniques of Matsuda and Wang do appear to transfer fairly well.

This is interesting to me for another reason. Their lift2 and unlift2 functions remind me very much of my recent approach to compiling to categories. The lift2 function is fanning a lens pair. This is basically what my FanOutput typeclass automated. unlift2 is building the input for a function function by supplying a tuple of projection lenses. This is what my BuildInput typeclass did. I think their style may extend many monoidal cartesian categories, not just lenses.

One can use the function b -> a in many of the situations one can use a in. You can do elegant things by making a Num typeclass of b -> a for example. This little fact seems to extend to other categories as well. By making a Num typeclass for Lens s a when a is a Num, we can use reasonable looking notation for arithmetic.

They spend some time discussing the necessity of a Poset typeclass. For actual lawful lenses, the dup implementation needs a way to recombine multiple adjustments to the same object. In the AD-lens case, dup takes care of this by adding together the differentials. This means that everywhere they needed an Eq typeclass, we can use a Num typeclass. There may be usefulness to building a wrapped type data NZ a = Zero | NonZero a like their Tag type to accelerate the many 0 values that may be propagating through the system.

Unfortunately, as is, the performance of this is abysmal. Maybe there is a way to fix it? Unlifting and lifting destroys a lot of sharing and often necessitates adding many redundant zeros. Why are you doing reverse mode differentiation unless you care about performance? Forward mode is simpler to implement. In the intended use case of Matsuda and Wang, they are working with actual lawful lenses, which have far less computational content than AD-lenses. Good lawful lenses should just be shuffling stuff around a little. Maybe one can hope GHC is insanely intelligent and can optimize these zeros away. One point in favor of that is that our differentiation is completely pure (no mutation). Nevertheless, I suspect it will not without help. Being careful and unlifting and lifting manually may also help. In principle, I think the Lensy approach could be pretty fast (since all it is is just packing together exactly what you need to differentiate into a data type), but how to make it fast while still being easily programmable? It is also nice that it is pretty simple to implement. It is the simplest method that I know of if you needed to port operable reverse mode differentiation to a new library (Massiv?) or another functional language (Futhark?). And a smart compiler really does have a shot at finding optimizations/fusions.

While I was at it, unrelated to the paper above, I think I made a working generic auto differentiable fold lens combinator. Pretty cool. mapAccumL is a hot tip.

For practical Haskell purposes, all of this is a little silly with the good Haskell AD packages around, the most prominent being

It is somewhat interesting to note the similarity of type forall s. Lens s appearing in the Matsuda and Wang approach to those those of the forall s. BVar s monad appearing in the backprop package. In this case I believe that the s type variable plays the same role it does in the ST monad, protecting a mutating Wengert tape state held in the monad, but I haven’t dug much into it. I don’t know enough about backprop to know what to make of this similarity.

The github repo with my playing around and stream of consciousness commentary is here

## A Touch of Topological Quantum Computation in Haskell Pt. II: Automating Drudgery

Last time we built the basic pieces we need to describe anyons in Haskell. Anyon models describe interesting physical systems where a set of particles (Tau and Id in our case) have certain splitting rules and peculiar quantum properties. The existence of anyons in a system are the core physics necessary to support topological quantum computation. In topological quantum computing, quantum gates are applied by braiding the anyons and measurements performed by fusing anyons together and seeing what particle comes out. Applying gates in this way has inherent error correcting properties.

The tree of particle production with particle labelled leaves picks a basis (think the collection $\{\hat{x}, \hat{y}, \hat{z}\}$ ) for the anyon quantum vector space. An individual basis vector (think $\hat{x}$ ) from this basis is specified by labelling the internal edges of the tree. We built a Haskell data type for a basic free vector space and functions for the basic R-moves for braiding two anyons and reassociating the tree into a new basis with F-moves. In addition, you can move around your focus within the tree by using the function lmap and rmap. The github repo with that and what follows below is here.

#### Pain Points

We’ve built the atomic operations we need, but they work very locally and are quite manual. You can apply many lmap and rmap to zoom in to the leaves you actually wish to braid, and you can manually perform all the F-moves necessary to bring nodes under the same parent, but it will be rather painful.

The standard paper-and-pencil graphical notation for anyons is really awesome. You get to draw little knotty squiggles to calculate. It does not feel as laborious. The human eye and hand are great at applying a sequence of reasonably optimal moves to untangle the diagram efficiently. Our eye can take the whole thing in and our hand can zip around anywhere.

To try and bridge this gap, we need to build functions that work in some reasonable way on the global anyon tree and that automate simple tasks.

#### A Couple Useful Functions

Our first useful operation is pullLeftLeaf. This operation will rearrange the tree using F-moves to get the leftmost leaf associated all the way to the root. The leftmost leaf will then have the root as a parent.

Because the tree structure is in the FibTree a b data type, we need the tuple tree type of the pulled tree. This is a slightly non-trivial type computation.

In order to do this, we’ll use a bit of typelevel programming. If this is strange and alarming stuff for you, don’t sweat it too much. I am not the most elegant user of these techniques, but I hope that alongside my prose description you can get the gist of what we’re going for.

(Sandy Maguire has a new book on typelevel programming in Haskell out. Good stuff. Support your fellow Haskeller and toss him some buckos.)

The resulting tree type b is an easily computable function of the starting tree type a. That is what the “functional dependency” notation | a -> b in the typeclass definition tells the compiler.

The first 4 instances are base cases. If you’re all the way at the leaf, you basically want to do nothing. pure is the function that injects the classical tree description into a quantum state vector with coefficient 1.

The meat is in the last instance. In the case that the tree type matches ((a,b),c), we recursively call PullLeftLeaf on (a,b) which returns a new result (a',b'). Because of the recursion, this a' is the leftmost leaf. We can then construct the return type by doing a single reassociation step. The notation ~ forces two types to unify. We can use this conceptually as an assignment statement at the type level. This is very useful for building intermediate names for large expressions, as assert statements to ensure the types are as expected, and also occasionally to force unification of previously unknown types. It’s an interesting operator for sure.

The recursion at the type level is completely reflected in the actual function definition. We focus on the piece (a,b) inside t by using lmap. We do a recursive call to pullLeftLeaf, and finally fmove' performs the final reassociation move. It is all rather verbose, but straightforward I hope.

You can also build a completely similar PullRightLeaf.

#### A Canonical Right Associated Basis

One common way of dealing with larger trees is to pick a canonical basis of fully right associated trees. The fully right associated tree is a list-like structure. Its uniformity makes it easier to work with.

By recursively applying pullLeftLeaf, we can fully right associate any tree.

This looks quite similar to the implementation of pullLeftLeaf. It doesn’t actually have much logic to it. We apply pullLeftLeaf, then we recursively apply rightAssoc in the right branch of the tree.

#### B-Moves: Braiding in the Right Associated Basis

Now we have the means to convert any structure to it’s right associated canonical basis. In this basis, one can apply braiding to neighboring anyons using B-moves, which can be derived from the braiding R-moves and F-moves.

The B-move applies one F-move so that the two neighboring leaves share a parent, uses the regular braiding R-move, then applies the inverse F-move to return back to the canonical basis. Similarly, bmove'  is the same thing except applies the under braiding braid' rather that the over braiding braid.

(Image Source : Preskill’s notes)

#### Indexing to Leaves

We also may desire just specifying the integer index of where we wish to perform a braid. This can be achieved with another typeclass for iterated rmaping. When the tree is in canonical form, this will enable us to braid two neighboring leaves by an integer index. This index has to be a typelevel number because the output type depends on it.

In fact there is quite a bit of type computation. Given a total tree type s and an index n this function will zoom into the subpart a of the tree at which we want to apply our function. The subpart a is replaced by b, and then the tree is reconstructed into t. t is s with the subpart a mapped into b.  I have intentionally made this reminiscent of the type variables of the lens type Lens s t a b .

This looks much noisier that it has to because we need to work around some of the unfortunate realities of using the typeclass system to compute types. We can’t just match on the number n in order to pick which instance to use because the patterns 0 and n are overlapping. The pattern n can match the number 0 if n ~ 0. The pattern matching in the type instance is not quite the same as the regular Haskell pattern matching we use to define functions. The order of the definitions does not matter, so you can’t have default cases. The patterns you use cannot be unifiable. In order to fix this, we make the condition if n is greater than 0 an explicit type variable gte. Now the different cases cannot unify. It is a very common trick to need a variable representing some branching condition.

For later convenience, we define rmapN which let’s us not need to supply the necessary comparison type gte.

#### Parentifying Leaves Lazily

While it is convenient to describe anyon computations in a canonical basis, it can be quite inefficient. Converting an arbitrary  anyon tree into the standard basis will often result in a dense vector. A natural thing to do for the sake of economy is only do reassociation on demand.

The algorithm for braiding two neighboring leaves is pretty straightforward. We need to reassociate these leaves so that they have the same parent. First we need the ability to map into the least common ancestor of the two leaves. To reassociate these two leaves to have a common parent we pullrightLeaf the left subtree and then pullLeftLeaf the left subtree. Finally, there is a bit extra bit of shuffling to actually get them to be neighbors.

As a first piece, we need a type level function to count the number of leaves in a tree. In this case, I am inclined to use type families rather than multi parameter typeclasses as before, since I don’t need value level stuff coming along for the ride.

Next, we make a typeclass for mapping into the least common ancestor position.

We find the least common ancestor position by doing a binary search on the size of the left subtrees at each node. Once the size of the left subtree equals n, we’ve found the common ancestor of leaf n and leaf n+1.

Again, this LCAMap typeclass has a typelevel argument gte that directs it which direction to go down the tree.

The Twiddle typeclass will perform some final cleanup after we’ve done all the leaf pulling. At that point, the leaves still do not have the same parent. They are somewhere between 0 and 2 F-moves off depending on whether the left or right subtrees may be just a leaf or larger trees. twiddle is not a recursive function.

Putting this all together we get the nmap function that can apply a function after parentifying two leaves. By far the hardest part is writing out that type signature.

#### Usage Example

Here’s some simple usage:

Note that rmapN is 0-indexed but nmap is 1-indexed. This is somewhat horrifying, but that is what was natural in the implementation.

Here is a more extended example showing how to fuse some particles.

I started with the tree at the top and traversed downward implementing each braid and fusion. Implicitly all the particles shown in the diagram are Tau particles. The indices refer to particle position, not to the particles “identity” as you would trace it by eye on the page. Since these are identical quantum  particles, the particles don’t have identity as we classically think of it anyhow.

The particle pairs are indexed by the number on the left particle. First braid 1 over 2, then 2 over 3, fuse 1 and 2, braid 2 under 3, fuse 2 and 3, and then fuse 1 and 2. I got an amplitude for the process of -0.618, corresponding to a probability of 0.382. I would give myself 70% confidence that I implemented all my signs and conventions correctly. The hexagon and pentagon equations from last time being correct gives me some peace of mind.

Syntax could use a little spit polish, but it is usable. With some readjustment, one could use the Haskell do notation removing the need for explicit >>=.

#### Next Time

Anyons are often described in categorical terminology. Haskell has a category culture as well. Let’s explore how those mix!

## Compiling to Categories 3: A Bit Cuter

Ordinary Haskell functions form a cartesian closed category. Category means you can compose functions. Cartesian basically means you can construct and deconstruct tuples and Closed means that you have first class functions you can pass around.

Conal Elliott’s Compiling to Categories is a paradigm for reinterpreting ordinary functions as the equivalent in other categories. At an abstract level, I think you could describe it as a mechanism to build certain natural law abiding Functors from Hask to other categories. It’s another way to write things once and have them run many ways, like polymorphism or generic programming. The ordinary function syntax is human friendly compared to writing raw categorical definitions, which look roughly like point-free programming (no named variables). In addition, by embedding it as a compiler pass within GHC, he gets to leverage existing GHC optimizations as optimizations for other categories. Interesting alternative categories include the category of graphs, circuits, and automatically differentiable functions. You can find his implementation here

I’ve felt hesitance at using a GHC plugin plus I kind of want to do it in a way I understand, so I’ve done different versions of this using relatively normal Haskell (no template haskell, no core passes, but a butt ton of hackery).

The first used explicit tags. Give them to the function and see where they come out. That is one way to probe a simple tuple rearrangement function.

The second version worked almost entirely at the typelevel. It worked on the observation that a completely polymorphic tuple type signature completely specifies the implementation. You don’t have to look at the term level at all. It unified the polymorphic values in the input with a typelevel number indexed Tag type. Then it searched through the input type tree to find the piece it needed. I did end up passing stuff in to the term level because I could use this mechanism to embed categorical literals. The typeclass hackery I used to achieve this all was heinous.

I realized today another way related to both that is much simpler and fairly direct. It has some pleasing aesthetic properties and some bad ones. The typeclass hackery is much reduced and the whole thing fits on a screen, so I’m pleased.

Here are the basic categorical definitions. FreeCat is useful for inspection in GHCi of what comes out of toCCC.

And here is the basic toCCC implementation

Here is some example usage

What we do is generate a tuple to give to your function. The function is assumed to be polymorphic again but now not necessarily totally polymorphic (this is important because Num typeclass usage will unify variables). Once we hit a leaf of the input tuple, we place the categorical morphism that would extract that piece from the input. For example for the input type (a,(b,c)) we pass it the value (fstC ,(fstC . sndC, sndC . sndC )). Detecting when we are at a leaf again requires somehow detecting a polymorphic location, which is a weird thing to do. We use the Incoherent IsTup instance from last time to do this. It is close to being safe, because we immediately unify the polymorphic variable with a categorial type, so if something goes awry, a type error should result. We could make it more ironclad by unifying immediately to something that contains the extractor and a user inaccessible type.

We apply the function to this input. Now the output is a tuple tree of morphisms. We recursively traverse down this tree with a fanC for every tuple. This all barely requires any typelevel hackery. The typelevel stuff that is there is just so that I can traverse down tuple trees basically.

One benefit is that we can now use some ordinary typeclasses. We can make a simple implementation of Num for (k z a) like how we would make it for (z -> a). This let’s us use the regular (+) and (*)  operators for example.

What is not good is the performance. As it is, the implementation takes many global duplication of the input to create all the fanCs. In many categories, this is very wasteful.This may be a fixable problem, either via passing in more sophisticated objects that just the bare extraction morphisms to to input (CPS-ified? Path Lists?) or via the GHC rewrite rules mechanism. I have started to attempt that, but have not been successful in getting any of my rewrite rules to fire yet, because I have no idea what I’m doing. If anyone could give me some advice, I’d be much obliged. You can check that out here. For more on rewrite rules, check out the GHC user manual and this excellent tutorial by Mark Karpov here.

Another possibility is to convert to FreeCat, write regular Haskell function optimization passes over the FreeCat AST and then interpret it. This adds interpretation overhead, which may or may not be acceptable depending on your use case. It would probably not be appropriate for automatically differentiable functions, but may be for outputting circuits or graphs.

Another problem is dealing with boolean operations. The booleans operators and comparison operators are not sufficiently polymorphic in the Prelude. We could define new operators that work as drop in replacements in the original context, but I don’t really have the ability to overload the originals. It is tough because if we do things like this, it feels like we’re really kind of building a DSL more than we are compiling to categories. We need to write our functions with the DSL in mind and can’t just import and use some function that had no idea about the compiling to categories stuff.

I should probably just be using Conal’s concat project. This is all a little silly.

## A Touch of Topological Quantum Computation in Haskell Pt. I

Quantum computing exploits the massive vector spaces nature uses to describe quantum phenomenon.

The evolution of a quantum system is described by the application of matrices on a vector describing the quantum state of the system. The vector has one entry for every possible state of the system, so the number of entries can get very, very large. Every time you add a new degree of freedom to a system, the size of the total state space gets multiplied by the size of the new DOF, so you have a vector exponentially sized in the  number  of degrees of freedom.

Now, a couple caveats. We could have described probabilistic dynamics similarly, with a probability associated with each state. The subtle difference is that quantum amplitudes are complex numbers whereas probabilities are positive real numbers. This allows for interference. Another caveat is that when you perform a measurement, you only get a single state, so you are hamstrung by the tiny amount of information you can actually extract out of this huge vector. Nevertheless, there are a handful of situations where, to everyone’s best guess, you get a genuine quantum advantage over classical or probabilistic computation.

Topological quantum computing is based around the braiding of particles called anyons. These particles have a peculiar vector space associated with them and the braiding applies a matrix to this space. In fact, the space associated with the particles can basically only be manipulated by braiding and other states require more energy or very large scale perturbations to access. Computing using anyons has a robustness compared to a traditional quantum computing systems. It can be made extremely unlikely that unwanted states are accessed or unwanted gates applied. The physical nature of the topological quantum system has an intrinsic error correcting power. This situation is schematically similar in some ways to classical error correction on a magnetic hard disk. Suppose some cosmic ray comes down and flips a spin in your hard disk. The physics of magnets makes the spin tend to realign with it’s neighbors, so the physics supplies an intrinsic classical error correction in this case.

The typical descriptions of how the vector spaces associated with anyons work I have found rather confusing. What we’re going to do is implement these vector spaces in the functional programming language Haskell for concreteness and play around with them a bit.

###### Anyons

In many systems, the splitting and joining of particles obey rules. Charge has to be conserved. In chemistry, the total number of each individual atom on each side of a reaction must be the same. Or in particle accelerators, lepton number and other junk has to be conserved.

Anyonic particles have their own system of combination rules. Particle A can combine with B to make C or D. Particle B combined with C always make A. That kind of stuff. These rules are called fusion rules and there are many choices, although they are not arbitrary. They can be described by a table $N_{ab}^{c}$ that holds counts of the ways to combine a and b into c. This table has to be consistent with some algebraic conditions, the hexagon and pentagon equations, which we’ll get to later.

We need to describe particle production trees following these rules in order to describe the anyon vector space.

Fibonacci anyons are one of the simplest anyon systems, and yet sufficiently complicated to support universal quantum computation. There are only two particles types in the Fibonacci system, the $I$ particle and the $\tau$  particle. The $I$ particle is an identity particle (kind of like an electrically neutral particle). It combines with $\tau$ to make a $\tau$. However, two $\tau$ particle can combine in two different ways, to make another $\tau$ particle or to make an $I$ particle.

So we make a datatype for the tree structure that has one constructor for each possible particle split and one constructor (TLeaf, ILeaf) for each final particle type. We can use GADTs (Generalize Algebraic Data Types) to make only good particle production history trees constructible. The type has two type parameters carried along with it, the particle at the root of the tree and the leaf-labelled tree structure, represented with nested tuples.

###### Free Vector Spaces

We need to describe quantum superpositions of these anyon trees. We’ll consider the particles at the leaves of the tree to be the set of particles that you have at the current moment in a time. This is a classical quantity. You will not have a superposition of these leaf particles. However, there are some quantum remnants of the history of how these particles were made. The exact history can never be determined, kind of like how the exact history of a particle going through a double slit cannot be determined. However, there is still a quantum interference effect left over. When you bring particles together to combine them, depending on the quantum connections, you can have different possible resulting particles left over with different probabilities. Recombining anyons and seeing what results is a measurement of the system .

Vectors can be described in different basis sets. The bases for anyon trees are labelled by both a tree structure and what particles are at the root and leaves. Different tree associations are the analog of using some basis x vs some other rotated basis x’. The way we’ve built the type level tags in the FibTree reflects this perspective.

The labelling of inner edges of the tree with particles varies depending on which basis vector we’re talking about. A different inner particle is the analog of $\hat{x}$ vs $\hat{y}$.

To work with these bases we need to break out of the mindset that a vector put on a computer is the same as an array. While for big iron purposes this is close to true, there are more flexible options. The array style forces you to use integers to index your space, but what if your basis does not very naturally map to integers?

A free vector space over some objects is the linear combination of those objects. This doesn’t have the make any sense. We can form the formal sum (3.7💋+2.3i👩‍🎨) over the emoji basis for example. Until we attach more meaning to it, all it really means is a mapping between emojis and numerical coefficients. We’re also implying by the word vector that we can add two of the combinations coefficient wise and multiply scalars onto them.

We are going to import free vectors as described by the legendary Dan Piponi as described here: http://blog.sigfpe.com/2007/03/monads-vector-spaces-and-quantum.html

What he does is implement the Free vector space pretty directly. We represent a Vector space using a list of tuples [(a,b)]. The a are basis objects and b are the coefficients attached to them.

The Vector monad factors out the linear piece of a computation. Because of this factoring, the type constrains the mapping to be linear, in a similar way that monads in other contexts might guarantee no leaking of impure computations. This is pretty handy. The function you give to bind correspond to a selector columns of the matrix.

We need some way to zoom into a subtrees and then apply operations. We define the operations lmap and rmap.

You reference a node by the path it takes to get there from the root. For example,  (rmap . lmap . rmap) f applies f at the node that is at the right-left-right position down from the root.

###### Braiding

For Fibonacci anyons, the only two non trivial braidings happen when you braid two $\tau$ particles.

We only have defined how to braid two particles that were split directly from the same particle. How do we describe the braiding for the other cases? Well we need to give the linear transformation for how to change basis into other tree structures. Then we have defined braiding for particles without the same immediate parent also.

###### F-Moves

We can transform to a new basis. where the histories differs by association. We can braid two particles by associating the tree until they are together. An association move does not change any of the outgoing leaf positions. It can, however, change a particle in an interior position. We can apply an F-move anywhere inside the tree, not only at the final leaves.

###### Fusion / Dot product

Two particles that split can only fuse back into themselves. So the definition is pretty trivial. This is like $\hat{e}_i \cdot \hat{e}_j = \delta_{ij}$.

###### Hexagon and Pentagon equations

The F and R matrices and the fusion rules need to obey consistency conditions called the hexagon and pentagon equations. Certain simple rearrangements have alternate ways of being achieved. The alternative paths need to agree.

###### Next Time:

With this, we have the rudiments of what we need to describe manipulation of anyon spaces. However, applying F-moves manually is rather laborious. Next time we’ll look into automating this using arcane type-level programming. You can take a peek at my trash WIP repo here

###### RefErences:
A big ole review on topological quantum computation: https://arxiv.org/abs/0707.1889
Ady Stern on The fractional quantum hall effect and anyons: https://www.sciencedirect.com/science/article/pii/S0003491607001674

Another good anyon tutorial: https://arxiv.org/abs/0902.3275

Mathematica program that I still don’t get, but is very interesting: http://www.cs.ox.ac.uk/people/jamie.vicary/twovect/guide.pdf

Kitaev huge Paper: https://arxiv.org/abs/cond-mat/0506438

Bonderson thesis: https://thesis.library.caltech.edu/2447/2/thesis.pdf

Bernevig review: https://arxiv.org/abs/1506.05805

More food for thought:

The Rosetta Stone: http://math.ucr.edu/home/baez/rosetta.pdf

## Reverse Mode Differentiation is Kind of Like a Lens II

For those looking for more on automatic differentiation in Haskell:

Conal Elliott is making the rounds with a new take on AD (GOOD STUFF).

Justin Le has been making excellent posts and has another library he’s working on.

https://blog.jle.im/entry/introducing-the-backprop-library.html

And here we go:

Reverse mode automatic differentiation is kind of like a lens. Here is the type for a non-fancy lens

When you compose two lenses, you compose the getters (s -> a) and you compose the partially applied setter (b -> t) in the reverse direction.

We can define a type for a reverse mode differentiable function

When you compose two differentiable functions you compose the functions and you flip compose the Jacobian transpose (dy -> dx). It is this flip composition which gives reverse mode it’s name. The dependence of the Jacobian on the base point x corresponds to the dependence of the setter on the original object

The implementation of composition for Lens and AD are identical.

Both of these things are described by the same box diagram (cribbed from the profunctor optics paper www.cs.ox.ac.uk/people/jeremy.gibbons/publications/poptics.pdf ).

This is a very simple way of implementing a reserve mode automatic differentiation using only non exotic features of a functional programming language. Since it is so bare bones and functional, is this a good way to achieve the vision gorgeous post by Christoper Olah?  http://colah.github.io/posts/2015-09-NN-Types-FP/  I do not know.

Now, to be clear, these ARE NOT lenses. Please, I don’t want to cloud the water, do not call these lenses. They’re pseudolenses or something. A very important part of what makes a lens a lens is that it obeys the lens laws, in which the getter and setter behave as one would expect. Our “setter” is a functional representation of the Jacobian transpose and our getter is the function itself. These do not obey lens laws in general.

###### Chain Rule AND Jacobian

What is reverse mode differentiation? One’s thinking is muddled by defaulting to the Calc I perspective of one dimensional functions. Thinking is also muddled by  the general conception that the gradient is a vector. This is slightly sloppy talk and can lead to confusion. It definitely has confused me.

The right setting for intuition is $R^n \rightarrow R^m$ functions

If one looks at a multidimensional to multidimensional function like this, you can form a matrix of partial derivatives known as the Jacobian. In the scalar to scalar case this is a $1\times 1$ matrix, which we can think of as just a number. In the multi to scalar case this is a $1\times n$ matrix which we somewhat fuzzily can think of as a vector.

The chain rule is a beautiful thing. It is what makes differentiation so elegant and tractable.

For many-to-many functions, if you compose them you matrix multiply their Jacobians.

Just to throw in some category theory spice (who can resist), the chain rule is a functor between the category of differentiable functions and the category of vector spaces where composition is given by Jacobian multiplication. This is probably wholly unhelpful.

The cost of multiplication for an $a \times b$ matrix A and an $b \times c$ matrix B is $O(abc)$. If we have 3 matrices ABC, we can associate to the left or right. (AB)C vs A(BC) choosing which product to form first. These two associations have different cost, abc * acd for left association or abd * bcd for right association. We want to use the smallest dimension over and over. For functions that are ultimately many to scalar functions, that means we want to multiply starting at the right.

For a clearer explanation of the importance of the association, maybe this will help https://en.wikipedia.org/wiki/Matrix_chain_multiplication

###### Functional representations of matrices

A Matrix data type typically gives you full inspection of the elements. If you partially apply the matrix vector product function (!* :: Matrix -> Vector -> Vector) to a matrix m, you get a vector to vector function (!* m) :: Vector -> Vector. In the sense that a matrix is data representing a linear map, this type looks gorgeous. It is so evocative of purpose.

If all you want to do is multiply matrices or perform matrix vector products this is not a bad way to go. A function in Haskell is a thing that exposes only a single interface, the ability to be applied. Very often, the loss of Gaussian elimination or eigenvalue decompositions is quite painfully felt. For simple automatic differentiation, it isn’t so bad though.

You can inefficiently reconstitute a matrix from it’s functional form by applying it to a basis of vectors.

One weakness of the functional form is that the type does not constrain the function to actually act linearly on the vectors.

One big advantage of the functional form is that you can intermix different matrix types (sparse, low-rank, dense) with no friction, just so long as they all have some way of being applied to the same kind of vector. You can also use functions like (id :: a -> a) as the identity matrix, which are not built from any underlying matrix type at all.

To match the lens, we need to represent the Jacobian transpose as the function (dy -> dx) mapping differentials in the output space to differentials in the input space.

###### The Lens Trick

A lens is the combination of a getter (a function that grabs a piece out of a larger object) and a setter (a function that takes the object and a new piece and returns the object with that piece replaced).

The common form of lens used in Haskell doesn’t look like the above. It looks like this.

This form has exactly the same content as the previous form (A non obvious fact. See the Profunctor Optics paper above. Magic neato polymorphism stuff), with the added functionality of being able to compose using the regular Haskell (.) operator.

I think a good case can be made to NOT use the lens trick (do as I say, not as I do). It obfuscates sharing and obfuscates your code to the compiler (I assume the compiler optimizations have less understanding of polymorphic functor types than it does of tuples and functions), meaning the compiler has less opportunity to help you out. But it is also pretty cool. So… I dunno. Edit:

/u/mstksg points out that compilers actually LOVE the van Laarhoven representation (the lens trick) because when f is finally specialized it is a newtype wrappers which have no runtime cost. Then the compiler can just chew the thing apart.

One thing that is extra scary about the fancy form is that it makes it less clear how much data is likely to be shared between the forward and backward pass. Another alternative to the lens that shows this is the following.

This form is again the same in end result. However it cannot share computation and therefore isn’t the same performance wise. One nontrivial function that took me some head scratching is how to convert from the fancy lens directly to the regular lens without destroying sharing. I think this does it

###### Some code

I have some small exploration of the concept in this git https://github.com/philzook58/ad-lens

Again, really check out Conal Elliott’s AD paper and enjoy the many, many apostrophes to follow.

Some basic definitions and transformations between fancy and non fancy lenses. Extracting the gradient is similar to the set function. Gradient assumes a many to one function and it applies it to 1.

Basic 1D functions and arrow/categorical combinators

Some List based stuff.

And some functionality from HMatrix

In practice, I don’t think this is a very ergonomic approach without something like Conal Elliott’s Compiling to Categories plugin. You have to program in a point-free arrow style (inspired very directly by Conal’s above AD paper) which is pretty nasty IMO. The neural network code here is inscrutable. It is only a three layer neural network.

## Division of Polynomials in Haskell

I’ve been been on a kick learning about some cool math topics. In particular, I busted out my copy of Concrete Mathematics (awesome book) and was reading up on the number theory section. Bezout’s identity is that if you have ma+nb=d for some m,n and d divides a and b, then d is the greatest divisor of a and b. Bezout’s identity is a certificate theorem like the dual solution to a linear program. It doesn’t matter how you found m and n or d, Euclid’s algorithm or brute search, but once you’ve found that certificate, you know you have the biggest divisor. Pretty cool. Suppose you have some other common divisor d’. That means xd’=a and yd’=b. Subsitute this into the certificate. m(xd’)+n(yd’)=(mx+ny)d’ =d. This means d’ is a divisor of d. But the divisor relation implies the inequality relation (a divisor is always less than or equal to the thing it divides). Therefore d'<=d.

But another thing I’ve been checking out is algebraic geometry and Groebner bases. Groebner bases are a method for solving multivariate polynomial equations. The method is a relative of euclidean division and also in a sense Gaussian elimination. Groebner basis are a special recombination of a set of polynomial equations such that polynomial division works uniquely on them (in many variables polynomial division doesn’t work like you’d expect from the one variable case anymore). Somewhat surprisingly other good properties come along for the ride. More on this in posts to come

Some interesting stuff in the Haskell arena:

Gröbner bases in Haskell: Part I

https://konn.github.io/computational-algebra/

I love Doug McIlroy’s powser. It is my favorite functional pearl. https://www.cs.dartmouth.edu/~doug/powser.html

One interesting aspect not much explored is that you can compose multiple layers of [] like [[[Double]]] to get multivariate power series. You can then also tuple up m of these series to make power series from R^n -> R^m. An interesting example of a category and a nonlinear relative of the matrix category going from V^n -> V^m. I’m having a hell of a time getting composition to work automatically though. I’m a little stumped.

I made a version of division that uses the Integral typeclass rather than the fractional typeclass with an eye towards these applications. It was kind of annoying but I think it is right now.

I thought that in order to make things more Groebner basis like, I should make polynomials that start with the highest power first. It also makes sense for a division algorithm, but now I’m not so sure. I should also have probably used Vector and not List.

## Approximating Compiling to Categories using Type-level Haskell: Take 2

The previous episode is here .

Summary: I’m trying to use typelevel programming in Haskell to achieve some of the aims of Conal Elliott’s compiling to categories GHC plugin. The types of highly polymorphic tuple functions are enough to specify the implementation. We aren’t going to be able to piggy-back off of GHC optimizations (a huge downside), but we can reify lambdas into other categories and avoid the scariness of plugins.

The current implementation github source is here

JESUS CHRIST.

Of course, Oleg already did it. This is a file where he builds the implementation of a polymorphic function from the type signature. Instead of tuples, he is focusing on higher order functions with deep nesting of (->).

The trick that I was missing is in the IsFunction typeclass at the very end, which is only achievable as a Incoherent instances.

I would never have had the courage to use an Incoherent instance if I hadn’t seen a higher authority do it first. It has turned out in my typelevel journey that many instances that I’ve been tempted to make overlapping or incoherent don’t actually have to be, especially with the availability of closed type families. I think you really do need Incoherent in this application because type variables stay polymorphic forever.

To the best of my knowledge, if you need to differentiate between a tuple type (a,b) and an uninstantiated polymorphic value a’ like we do when deconstructing the input type of our lambda, you need to use an Incoherent instance. Since a’ could hypothetically eventually be unified to some (a,b) we should not be able to do different things for the two cases without stepping outside the usual laws of typeclass resolution.

New features of the implementation:

• The new implementation does not require the V annotation of the input like previous version by using the previously mentioned. This is super cool because now I can throw the stock Prelude.fst into toCcc.
• I changed how the categorical implementation is built, such that it short circuits with an ‘id’ if a large structure is needed from the input, rather than deconstructing all the way to every piece of the input. Lots of other optimizations would be nice (vital?), but it is a start.
• I also implemented a FreeCat GADT so that we can see the implementation in ghci.
• I switched from using Data.Proxy to the type annotations extensions, which is a huge readability win.
• I added a binary application operator binApp, which is useful for encapsulating categorical literals as infix operators into your lambda expressions.
• General cleanup, renaming, and refactoring into library files.

A couple typelevel tricks and comments:

You often have to make helper typeclasses, so don’t be afraid to. If you want something like an if-then-else in your recursion, it is very likely you need a form of the typeclass that has slots for ‘True or ‘False to help you pick the instance.

If possible, you often want the form

rather than

The type inference tends to be better.

Here are some usage examples of toCcc.

My partially implemented version of some of Conal’s category typeclasses. Should I switch over to using the constrained-categories package?

The actual implementation of toCcc

## Shit Compiling to Categories using Type level Programming in Haskell

So I’ve  been trying to try and approximate Conal Elliott’s compiling to categories in bog standard Haskell and I think I’ve got an interesting chunk.

His approach in using a GHC plugin is better for a couple reasons. One really important thing is that he gets to piggy back on GHC optimizations for the lambdas. I have only implemented a very bad evaluation strategy. Perhaps we could recognize shared subexpressions and stuff, but it is more work. I seem somewhat restricted in what I can do and sometimes type inference needs some help. Not great. However, GHC plugins definitely bring their own difficulties.

What I’ve got I think still has roots in my thinking from this previous post

There are a couple insights that power this implementation

1. A fully polymorphic tuple to tuple converter function is uniquely defined by it’s type. For example, swap :: (a,b) -> (b,a), id:: a -> a, fst :: (a,b) ->a, snd::(a,b)->b are all unique functions due to the restrictions of polymorphism. Thus typelevel programming can build the implementation from the type.
2. Getting a typeclass definition to notice polymorphism is hard though. I haven’t figured out how to do it, if it is even possible. We get around it by explicitly newtyping every pattern match on a polymorphic variable like so \(V x, V y) -> (y,x). Two extra characters per variable. Not too shabby.
3. You can abuse the type unification system as a substitution mechanism. Similar to HOAS, you can make a lambda interpreter at the type level that uses polymorphism as way of generating labels for variables. This is probably related to Oleg Kiselyov’s type level lambda calculus, but his kind of confuses me http://okmij.org/ftp/Computation/lambda-calc.html#haskell-type-level
4. You can inject a categorical literal morphism using a wrapping type to be extracted later using an apply operator and type App f a. An infix () makes it feel better.

So here is the rough structure of what the typelevel programming is doing

You can do a depth first traversal of the input tuple structure, when you hit V, unify the interior type with a Nat labelled Leaf. At the same time, you can build up the actual value of the structure so that you can apply the lambda to it and get the output which will hold a tree that has morphism literals you want.

Then inspect the output type of the lambda which now has Nat labels, and traverse the labelled tree again to find the appropriate sequence of fst and snd to extract what you need. If you run into an application App, you can pull the literal out now and continue traversing down.

At the moment I’ve only tested with the (->) Category, which is a bit like demonstrating a teleporter by deconstructing a guy and putting back in the same place, but I think it will work with others. We’ll see. I see no reason why not.

At the moment, I’m having trouble getting GHC to not freakout unless you explicitly state what category you’re working in, or explicitly state that you’re polymorphic in the Category k.

Some future work thoughts: My typelevel code is complete awful spaghetti written like I’m a spastic trapped animal. It needs some pruning. I think all those uses of Proxy can be cleaned up by using TypeApplications. I need to implement some more categories. Should I conform to the Constrained-Categories package? Could I use some kind of hash consing to find shared structures? Could Generic or Generic1 help us autoplace V or locate polymorphism? Maybe a little Template Haskell spice to inject V?

Here’s the most relevant bits, with my WIP git repository here

## Reverse Mode Auto Differentiation is Kind of Like a Lens

Edit: More cogent version here http://www.philipzucker.com/reverse-mode-differentiation-is-kind-of-like-a-lens-ii/

Warning: I’m using sketchy uncompiled Haskell pseudocode.

Auto-differentiation is writing a function that also computes the derivative alongside calculating its value. Function composition is done alongside applying the chain rule to the derivative part.

One way to do this is to use a “dual number”. Functions now take a tuple of values and derivatives.

The Jacobean of a function from $R^n \rightarrow R^m$ is a m by n matrix. The chain rule basically says that you need to compose the matrices via multiplication when you compose the value functions.  This is the composition of the linear maps.

Conceptually, you initialize the process with a NxN identity matrix corresponding to the fact that \$latex \partial x_i/\partial x_j=\delta_{ij}

Vectorized versions of scalar functions (maps) will often use diag

A couple points:

1.  Since the Jacobean j is always going to be multiplied in composition, it makes sense to factor this out into a Monad structure (Applicative maybe? Not sure we need full Monad power).
2. There is an alternative to using explicit Matrix data types for linear maps. We could instead represent the jacobeans using (Vector Double) -> Vector Double. The downside of this is that you can’t inspect elements. You need explicit matrices as far as I know to do Gaussian elimination and QR decomposition. You can sample the function to reconstitute the matrix if need be, but this is somewhat roundabout. On the other hand, if your only objective is to multiply matrices, one can use very efficient versions. Instead of an explicit dense NxN identity matrix, you can use the function id :: a -> a, which only does some minimal pointer manipulation or is optimized away. I think that since we are largely multiplying Jacobeans, this is fine.

What we’ve shown so far is Forward Mode.

When you multiply matrices you are free to associate them in any direction you like. (D(C(BA))) is the association we’re using right now. But you are free to left associate them. ((DC)B)A). You can write this is right associated form using the transpose $((DC)B)A)^T = (A^T(B^T(C^TD^T)))$

This form is reverse mode auto differentiation. Its advantage is the number of computations you have to do and the intermediate values you have to hold. If one is going from many variables to a small result, this is preferable.

It is actually exactly the same in implementation except you reverse the order of composition of the derivatives. We forward compose value functions and reverse compose derivative functions (matrices).