## Neural Networks with Weighty Lenses (DiOptics?)

I wrote a while back how you can make a pretty nice DSL for reverse mode differentiation based on the same type as Lens. I’d heard some interesting rumblings on the internet around these ideas and so was revisiting them.

Composition is defined identically for reverse mode just as it is for lens.

After chewing on it a while, I realized this really isn’t that exotic. How it works is that you store the reverse mode computation graph, and all necessary saved data from the forward pass in the closure of the (dy -> dx). I also have a suspicion that if you defunctionalized this construction, you’d get the Wengert tape formulation of reverse mode ad.

Second, Lens is just a nice structure for bidirectional computation, with one forward pass and one backward pass which may or may not be getting/setting. There are other examples for using it like this.

It is also pretty similar to the standard “dual number” form type FAD x dx y dy = (x,dx)->(y,dy) for forward mode AD. We can bring the two closer by a CPS/Yoneda transformation and then some rearrangement.

and meet it in the middle with

I ended the previous post somewhat unsatisfied by how ungainly writing that neural network example was, and I called for Conal Elliot’s compiling to categories plugin as a possible solution. The trouble is piping the weights all over the place. This piping is very frustrating in point-free form, especially when you know it’d be so trivial pointful. While the inputs and outputs of layers of the network compose nicely (you no longer need to know about the internal computations), the weights do not. As we get more and more layers, we get more and more weights. The weights are in some sense not as compositional as the inputs and outputs of the layers, or compose in a different way that you need to maintain access to.

I thought of a very slight conceptual twist that may help.

The idea is we keep the weights out to the side in their own little type parameter slots. Then we define composition such that it composes input/outputs while tupling the weights. Basically we throw the repetitive complexity appearing in piping the weights around into the definition of composition itself.

These operations are easily seen as 2 dimensional diagrams.

Here’s the core reverse lens ad combinators

And here are the two dimensional combinators. I tried to write them point-free in terms of the combinators above to demonstrate that there is no monkey business going on. We

I wonder if this is actually nice?

I asked around and it seems like this idea may be what davidad is talking about when he refers to dioptics

http://events.cs.bham.ac.uk/syco/strings3-syco5/slides/dalrymple.pdf

Perhaps this will initiate a convo.

Edit: He confirms that what I’m doing appears to be a dioptic. Also he gave a better link http://events.cs.bham.ac.uk/syco/strings3-syco5/papers/dalrymple.pdf

He is up to some interesting diagrams

### Bits and Bobbles

• Does this actually work or help make things any better?
• Recurrent neural nets flip my intended role of weights and inputs.
• Do conv-nets naturally require higher dimensional diagrams?
• This weighty style seems like a good fit for my gauss seidel and iterative LQR solvers. A big problem I hit there was getting all the information to the outside, which is a similar issue to getting the weights around in a neural net.

## Concolic Weakest Precondition is Kind of Like a Lens

That’s a mouthful.

Lens are described as functional getters and setters. The simple lens type is

. The setter is

and the getter is

This type does not constrain lenses to obey the usual laws of getters and setters. So we can use/abuse lens structures for nontrivial computations that have forward and backwards passes that share information. Jules Hedges is particular seems to be a proponent for this idea.

I’ve described before how to encode reverse mode automatic differentiation in this style. I have suspicions that you can make iterative LQR and guass-seidel iteration have this flavor too, but I’m not super sure. My attempts ended somewhat unsatisfactorily a whiles back but I think it’s not hopeless. The trouble was that you usually want the whole vector back, not just its ends.

I’ve got another example in imperative program analysis that kind of makes sense and might be useful though. Toy repo here: https://github.com/philzook58/wp-lens

In program analysis it sometimes helps to run a program both concretely and symbolically. Concolic = CONCrete / symbOLIC. Symbolic stuff can slowly find hard things and concrete execution just sprays super fast and can find the dumb things really quick.

We can use a lens structure to organize a DSL for describing a simple imperative language

The forward pass is for the concrete execution. The backward pass is for transforming the post condition to a pre condition in a weakest precondition analysis. Weakest precondition semantics is a way of specifying what is occurring in an imperative language. It tells how each statement transforms post conditions (predicates about the state after the execution) into pre conditions (predicates about before the execution).  The concrete execution helps unroll loops and avoid branching if-then-else behavior that would make the symbolic stuff harder to process. I’ve been flipping through Djikstra’s book on this. Interesting stuff, interesting man.

I often think of a state machine as a function taking s -> s. However, this is kind of restrictive. It is possible to have heterogenous transformations s -> s’. Why not? I think I am often thinking about finite state machines, which we really don’t intend to have a changing state size. Perhaps we allocated new memory or something or brought something into or out of scope. We could model this by assuming the memory was always there, but it seems wasteful and perhaps confusing. We need to a priori know everything we will need, which seems like it might break compositionally.

We could model our language making some data type like
data Imp = Skip | Print String | Assign String Expr | Seq Imp Imp | ...
and then build an interpreter

But we can also cut out the middle man and directly define our language using combinators.

To me this has some flavor of a finally tagless style.

Likewise for expressions. Expressions evaluate to something in the context of the state (they can lookup variables), so let’s just use

And, confusingly (sorry), I think it makes sense to use Lens in their original getter/setter intent for variables. So Lens structure is playing double duty.

type Var s a = Lens' s a

With that said, here we go.

Weakest precondition can be done similarly, instead we start from the end and work backwards

Predicates are roughly sets. A simple type for sets is

Now, this doesn’t have much deductive power, but I think it demonstrates the principles simply. We could replace Pred with perhaps an SMT solver expression, or some data type for predicates, for which we’ll need to implement things like substitution. Let’s not today.

A function

is equivalent to

. This is some kind of CPS / Yoneda transformation thing. A state transformer

to predicate transformer

is somewhat evocative of that. I’m not being very precise here at all.

Without further ado, here’s how I think a weakest precondition looks roughly.

Finally here is a combination of the two above that uses the branching structure of the concrete execution to aid construction of the precondition. Although I haven’t expanded it out, we are using the full s t a b parametrization of lens in the sense that states go forward and predicates come back.

Neat. Useful? Me dunno.

## Applicative Bidirectional Programming and Automatic Differentiation

I got referred to an interesting paper by a comment of /u/syrak.

http://www2.sf.ecei.tohoku.ac.jp/~kztk/papers/kztk_jfp_am_2018.pdf

Applicative bidirectional programming (PDF), by Kazutaka Matsuda and Meng Wang

In it, they use a couple interesting tricks to make Lens programming more palatable. Lens often need to be be programmed in a point free style, which is rough, but by using some combinators, they are able to program lenses in a pointful style (with some pain points still left over). It is a really interesting, well-written paper. Lots ‘o’ Yoneda and laws. I’m not doing it justice. Check it out!

A while back I noted that reverse mode auto-differentiation has a type very similar to a lens and in fact you can build a working reverse mode automatic differentiation DSL out of lenses and lens-combinators. Many things about lenses, but not all, transfer over to automatic differentiation. The techniques of Matsuda and Wang do appear to transfer fairly well.

This is interesting to me for another reason. Their lift2 and unlift2 functions remind me very much of my recent approach to compiling to categories. The lift2 function is fanning a lens pair. This is basically what my FanOutput typeclass automated. unlift2 is building the input for a function function by supplying a tuple of projection lenses. This is what my BuildInput typeclass did. I think their style may extend many monoidal cartesian categories, not just lenses.

One can use the function b -> a in many of the situations one can use a in. You can do elegant things by making a Num typeclass of b -> a for example. This little fact seems to extend to other categories as well. By making a Num typeclass for Lens s a when a is a Num, we can use reasonable looking notation for arithmetic.

They spend some time discussing the necessity of a Poset typeclass. For actual lawful lenses, the dup implementation needs a way to recombine multiple adjustments to the same object. In the AD-lens case, dup takes care of this by adding together the differentials. This means that everywhere they needed an Eq typeclass, we can use a Num typeclass. There may be usefulness to building a wrapped type data NZ a = Zero | NonZero a like their Tag type to accelerate the many 0 values that may be propagating through the system.

Unfortunately, as is, the performance of this is abysmal. Maybe there is a way to fix it? Unlifting and lifting destroys a lot of sharing and often necessitates adding many redundant zeros. Why are you doing reverse mode differentiation unless you care about performance? Forward mode is simpler to implement. In the intended use case of Matsuda and Wang, they are working with actual lawful lenses, which have far less computational content than AD-lenses. Good lawful lenses should just be shuffling stuff around a little. Maybe one can hope GHC is insanely intelligent and can optimize these zeros away. One point in favor of that is that our differentiation is completely pure (no mutation). Nevertheless, I suspect it will not without help. Being careful and unlifting and lifting manually may also help. In principle, I think the Lensy approach could be pretty fast (since all it is is just packing together exactly what you need to differentiate into a data type), but how to make it fast while still being easily programmable? It is also nice that it is pretty simple to implement. It is the simplest method that I know of if you needed to port operable reverse mode differentiation to a new library (Massiv?) or another functional language (Futhark?). And a smart compiler really does have a shot at finding optimizations/fusions.

While I was at it, unrelated to the paper above, I think I made a working generic auto differentiable fold lens combinator. Pretty cool. mapAccumL is a hot tip.

For practical Haskell purposes, all of this is a little silly with the good Haskell AD packages around, the most prominent being

It is somewhat interesting to note the similarity of type forall s. Lens s appearing in the Matsuda and Wang approach to those those of the forall s. BVar s monad appearing in the backprop package. In this case I believe that the s type variable plays the same role it does in the ST monad, protecting a mutating Wengert tape state held in the monad, but I haven’t dug much into it. I don’t know enough about backprop to know what to make of this similarity.

The github repo with my playing around and stream of consciousness commentary is here

## Reverse Mode Differentiation is Kind of Like a Lens II

For those looking for more on automatic differentiation in Haskell:

Conal Elliott is making the rounds with a new take on AD (GOOD STUFF).

Justin Le has been making excellent posts and has another library he’s working on.

https://blog.jle.im/entry/introducing-the-backprop-library.html

And here we go:

Reverse mode automatic differentiation is kind of like a lens. Here is the type for a non-fancy lens

When you compose two lenses, you compose the getters (s -> a) and you compose the partially applied setter (b -> t) in the reverse direction.

We can define a type for a reverse mode differentiable function

When you compose two differentiable functions you compose the functions and you flip compose the Jacobian transpose (dy -> dx). It is this flip composition which gives reverse mode it’s name. The dependence of the Jacobian on the base point x corresponds to the dependence of the setter on the original object

The implementation of composition for Lens and AD are identical.

Both of these things are described by the same box diagram (cribbed from the profunctor optics paper www.cs.ox.ac.uk/people/jeremy.gibbons/publications/poptics.pdf ).

This is a very simple way of implementing a reserve mode automatic differentiation using only non exotic features of a functional programming language. Since it is so bare bones and functional, is this a good way to achieve the vision gorgeous post by Christoper Olah?  http://colah.github.io/posts/2015-09-NN-Types-FP/  I do not know.

Now, to be clear, these ARE NOT lenses. Please, I don’t want to cloud the water, do not call these lenses. They’re pseudolenses or something. A very important part of what makes a lens a lens is that it obeys the lens laws, in which the getter and setter behave as one would expect. Our “setter” is a functional representation of the Jacobian transpose and our getter is the function itself. These do not obey lens laws in general.

###### Chain Rule AND Jacobian

What is reverse mode differentiation? One’s thinking is muddled by defaulting to the Calc I perspective of one dimensional functions. Thinking is also muddled by  the general conception that the gradient is a vector. This is slightly sloppy talk and can lead to confusion. It definitely has confused me.

The right setting for intuition is $R^n \rightarrow R^m$ functions

If one looks at a multidimensional to multidimensional function like this, you can form a matrix of partial derivatives known as the Jacobian. In the scalar to scalar case this is a $1\times 1$ matrix, which we can think of as just a number. In the multi to scalar case this is a $1\times n$ matrix which we somewhat fuzzily can think of as a vector.

The chain rule is a beautiful thing. It is what makes differentiation so elegant and tractable.

For many-to-many functions, if you compose them you matrix multiply their Jacobians.

Just to throw in some category theory spice (who can resist), the chain rule is a functor between the category of differentiable functions and the category of vector spaces where composition is given by Jacobian multiplication. This is probably wholly unhelpful.

The cost of multiplication for an $a \times b$ matrix A and an $b \times c$ matrix B is $O(abc)$. If we have 3 matrices ABC, we can associate to the left or right. (AB)C vs A(BC) choosing which product to form first. These two associations have different cost, abc * acd for left association or abd * bcd for right association. We want to use the smallest dimension over and over. For functions that are ultimately many to scalar functions, that means we want to multiply starting at the right.

For a clearer explanation of the importance of the association, maybe this will help https://en.wikipedia.org/wiki/Matrix_chain_multiplication

###### Functional representations of matrices

A Matrix data type typically gives you full inspection of the elements. If you partially apply the matrix vector product function (!* :: Matrix -> Vector -> Vector) to a matrix m, you get a vector to vector function (!* m) :: Vector -> Vector. In the sense that a matrix is data representing a linear map, this type looks gorgeous. It is so evocative of purpose.

If all you want to do is multiply matrices or perform matrix vector products this is not a bad way to go. A function in Haskell is a thing that exposes only a single interface, the ability to be applied. Very often, the loss of Gaussian elimination or eigenvalue decompositions is quite painfully felt. For simple automatic differentiation, it isn’t so bad though.

You can inefficiently reconstitute a matrix from it’s functional form by applying it to a basis of vectors.

One weakness of the functional form is that the type does not constrain the function to actually act linearly on the vectors.

One big advantage of the functional form is that you can intermix different matrix types (sparse, low-rank, dense) with no friction, just so long as they all have some way of being applied to the same kind of vector. You can also use functions like (id :: a -> a) as the identity matrix, which are not built from any underlying matrix type at all.

To match the lens, we need to represent the Jacobian transpose as the function (dy -> dx) mapping differentials in the output space to differentials in the input space.

###### The Lens Trick

A lens is the combination of a getter (a function that grabs a piece out of a larger object) and a setter (a function that takes the object and a new piece and returns the object with that piece replaced).

The common form of lens used in Haskell doesn’t look like the above. It looks like this.

This form has exactly the same content as the previous form (A non obvious fact. See the Profunctor Optics paper above. Magic neato polymorphism stuff), with the added functionality of being able to compose using the regular Haskell (.) operator.

I think a good case can be made to NOT use the lens trick (do as I say, not as I do). It obfuscates sharing and obfuscates your code to the compiler (I assume the compiler optimizations have less understanding of polymorphic functor types than it does of tuples and functions), meaning the compiler has less opportunity to help you out. But it is also pretty cool. So… I dunno. Edit:

/u/mstksg points out that compilers actually LOVE the van Laarhoven representation (the lens trick) because when f is finally specialized it is a newtype wrappers which have no runtime cost. Then the compiler can just chew the thing apart.

One thing that is extra scary about the fancy form is that it makes it less clear how much data is likely to be shared between the forward and backward pass. Another alternative to the lens that shows this is the following.

This form is again the same in end result. However it cannot share computation and therefore isn’t the same performance wise. One nontrivial function that took me some head scratching is how to convert from the fancy lens directly to the regular lens without destroying sharing. I think this does it

###### Some code

I have some small exploration of the concept in this git https://github.com/philzook58/ad-lens

Again, really check out Conal Elliott’s AD paper and enjoy the many, many apostrophes to follow.

Some basic definitions and transformations between fancy and non fancy lenses. Extracting the gradient is similar to the set function. Gradient assumes a many to one function and it applies it to 1.

Basic 1D functions and arrow/categorical combinators

Some List based stuff.

And some functionality from HMatrix

In practice, I don’t think this is a very ergonomic approach without something like Conal Elliott’s Compiling to Categories plugin. You have to program in a point-free arrow style (inspired very directly by Conal’s above AD paper) which is pretty nasty IMO. The neural network code here is inscrutable. It is only a three layer neural network.