Reverse Mode Differentiation is Kind of Like a Lens II

For those looking for more on automatic differentiation in Haskell:

Ed Kmett’s ad package

http://hackage.haskell.org/package/ad

Conal Elliott is making the rounds with a new take on AD (GOOD STUFF).

http://conal.net/papers/essence-of-ad/

Justin Le has been making excellent posts and has another library he’s working on.

https://blog.jle.im/entry/introducing-the-backprop-library.html

 

And here we go:

Reverse mode automatic differentiation is kind of like a lens. Here is the type for a non-fancy lens

When you compose two lenses, you compose the getters (s -> a) and you compose the partially applied setter (b -> t) in the reverse direction.

We can define a type for a reverse mode differentiable function

When you compose two differentiable functions you compose the functions and you flip compose the Jacobian transpose (dy -> dx). It is this flip composition which gives reverse mode it’s name. The dependence of the Jacobian on the base point x corresponds to the dependence of the setter on the original object

The implementation of composition for Lens and AD are identical.

Both of these things are described by the same box diagram (cribbed from the profunctor optics paper www.cs.ox.ac.uk/people/jeremy.gibbons/publications/poptics.pdf ).

 

 

This is a very simple way of implementing a reserve mode automatic differentiation using only non exotic features of a functional programming language. Since it is so bare bones and functional, is this a good way to achieve the vision gorgeous post by Christoper Olah?  http://colah.github.io/posts/2015-09-NN-Types-FP/  I do not know.

Now, to be clear, these ARE NOT lenses. Please, I don’t want to cloud the water, do not call these lenses. They’re pseudolenses or something. A very important part of what makes a lens a lens is that it obeys the lens laws, in which the getter and setter behave as one would expect. Our “setter” is a functional representation of the Jacobian transpose and our getter is the function itself. These do not obey lens laws in general.

Chain Rule AND Jacobian

What is reverse mode differentiation? One’s thinking is muddled by defaulting to the Calc I perspective of one dimensional functions. Thinking is also muddled by  the general conception that the gradient is a vector. This is slightly sloppy talk and can lead to confusion. It definitely has confused me.

The right setting for intuition is R^n \rightarrow R^m functions

If one looks at a multidimensional to multidimensional function like this, you can form a matrix of partial derivatives known as the Jacobian. In the scalar to scalar case this is a 1\times 1 matrix, which we can think of as just a number. In the multi to scalar case this is a 1\times n matrix which we somewhat fuzzily can think of as a vector.

The chain rule is a beautiful thing. It is what makes differentiation so elegant and tractable.

For many-to-many functions, if you compose them you matrix multiply their Jacobians.

Just to throw in some category theory spice (who can resist), the chain rule is a functor between the category of differentiable functions and the category of vector spaces where composition is given by Jacobian multiplication. This is probably wholly unhelpful.

The cost of multiplication for an a \times b matrix A and an b \times c matrix B is O(abc) . If we have 3 matrices ABC, we can associate to the left or right. (AB)C vs A(BC) choosing which product to form first. These two associations have different cost, abc * acd for left association or abd * bcd for right association. We want to use the smallest dimension over and over. For functions that are ultimately many to scalar functions, that means we want to multiply starting at the right.

For a clearer explanation of the importance of the association, maybe this will help https://en.wikipedia.org/wiki/Matrix_chain_multiplication

 

Functional representations of matrices

A Matrix data type typically gives you full inspection of the elements. If you partially apply the matrix vector product function (!* :: Matrix -> Vector -> Vector) to a matrix m, you get a vector to vector function (!* m) :: Vector -> Vector. In the sense that a matrix is data representing a linear map, this type looks gorgeous. It is so evocative of purpose.

If all you want to do is multiply matrices or perform matrix vector products this is not a bad way to go. A function in Haskell is a thing that exposes only a single interface, the ability to be applied. Very often, the loss of Gaussian elimination or eigenvalue decompositions is quite painfully felt. For simple automatic differentiation, it isn’t so bad though.

You can inefficiently reconstitute a matrix from it’s functional form by applying it to a basis of vectors.

One weakness of the functional form is that the type does not constrain the function to actually act linearly on the vectors.

One big advantage of the functional form is that you can intermix different matrix types (sparse, low-rank, dense) with no friction, just so long as they all have some way of being applied to the same kind of vector. You can also use functions like (id :: a -> a) as the identity matrix, which are not built from any underlying matrix type at all.

To match the lens, we need to represent the Jacobian transpose as the function (dy -> dx) mapping differentials in the output space to differentials in the input space.

The Lens Trick

A lens is the combination of a getter (a function that grabs a piece out of a larger object) and a setter (a function that takes the object and a new piece and returns the object with that piece replaced).

The common form of lens used in Haskell doesn’t look like the above. It looks like this.

This form has exactly the same content as the previous form (A non obvious fact. See the Profunctor Optics paper above. Magic neato polymorphism stuff), with the added functionality of being able to compose using the regular Haskell (.) operator.

I think a good case can be made to NOT use the lens trick (do as I say, not as I do). It obfuscates sharing and obfuscates your code to the compiler (I assume the compiler optimizations have less understanding of polymorphic functor types than it does of tuples and functions), meaning the compiler has less opportunity to help you out. But it is also pretty cool. So… I dunno. Edit:

/u/mstksg points out that compilers actually LOVE the van Laarhoven representation (the lens trick) because when f is finally specialized it is a newtype wrappers which have no runtime cost. Then the compiler can just chew the thing apart.

https://www.reddit.com/r/haskell/comments/9oc9dq/reverse_mode_differentiation_is_kind_of_like_a/

One thing that is extra scary about the fancy form is that it makes it less clear how much data is likely to be shared between the forward and backward pass. Another alternative to the lens that shows this is the following.

This form is again the same in end result. However it cannot share computation and therefore isn’t the same performance wise. One nontrivial function that took me some head scratching is how to convert from the fancy lens directly to the regular lens without destroying sharing. I think this does it

 

Some code

I have some small exploration of the concept in this git https://github.com/philzook58/ad-lens

Again, really check out Conal Elliott’s AD paper and enjoy the many, many apostrophes to follow.

Some basic definitions and transformations between fancy and non fancy lenses. Extracting the gradient is similar to the set function. Gradient assumes a many to one function and it applies it to 1.

Basic 1D functions and arrow/categorical combinators

Some List based stuff.

And some functionality from HMatrix

In practice, I don’t think this is a very ergonomic approach without something like Conal Elliott’s Compiling to Categories plugin. You have to program in a point-free arrow style (inspired very directly by Conal’s above AD paper) which is pretty nasty IMO. The neural network code here is inscrutable. It is only a three layer neural network.

 

 

Model Predictive Control of CartPole in OpenAI Gym using OSQP

A continuation of this post http://www.philipzucker.com/osqp-sparsegrad-fast-model-predictive-control-python-inverted-pendulum/

I was having difficulty getting the model predictive control from a previous post working on an actual cartpole. There are a lot more unknown variables in that case and other issues (the thing has a tendency to destroy itself). I was kind of hoping it would just work. So I realized that I should get it working in simulation.

I did not copy the simulation code of the openai cartpole https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py  which gives some small amount of credence that the MPC might generalize to a real system.

For the sake of honesty, I’m still at the point where my controller freaks out about 1/3 of the time. I do not really understand why.

 

Looks damn good here though huh.

A problem I had for a while was the Length of my pole was off by a factor of 2. It still kind of worked, especially if nearly balanced (although with a lot of oscillation, which in hindsight was a clue something wasn’t tuned in right).

There are some useful techniques for manipulating gym. You can set some parameters from the outside, like starting positions and thresholds. Also you can mangle your way into continuous force control rather than just left right commands (wouldn’t it be cool to use Integer programming for that? Silly, but cool).

There is still a bunch of trash in here from me playing around with parameters. I like to keep it real (and lazy).

One problem was that originally I had the pole just want to go to pi. But if it swings the other direction or many swings, that is bad and it will freak out. So I changed it to try to go the the current nearest multiple of pi, which helps.

Fiddling with the size of the regulation does have a significant effect and the relative size of regulation for x, v, f, omega. I am doing a lot of that search dumbly. I should probably do some kind of automatic.

Loosening the constraints on v and x seems to help stability.

I think weighting the angle at the end of the episode slightly more helps. That’s why I used linspace for the weight on the angle.

I’ve had a lot of problem with the answer coming back as infeasible from OSQP. I feel like it probably shouldn’t be and that is the solver’s problem?

Two things help: sometimes the cart does go out of the allowable range. The optimization probably will try to go all the way to the boundaries since it is useful. And since there is some mismatch between the actual dynamics and my model, it will go outside. So I heavily reduce the constraints for the first couple time steps. It takes a couple. 4 seems to work ok. It should want to apply forces during those four steps to get it back in range anyhow.

Even then it still goes infeasible sometimes and I don’t know why. So in that case, I reduce the required accuracy to hopefully at least get something that makes sense. That is what the eps_rel stuff is about. Maybe it helps. Not super clear. I could try a more iterative increasing of the eps?

https://groups.google.com/forum/#!topic/osqp/BzEqWQR2dYY

 

 

 

Division of Polynomials in Haskell

I’ve been been on a kick learning about some cool math topics. In particular, I busted out my copy of Concrete Mathematics (awesome book) and was reading up on the number theory section. Bezout’s identity is that if you have ma+nb=d for some m,n and d divides a and b, then d is the greatest divisor of a and b. Bezout’s identity is a certificate theorem like the dual solution to a linear program. It doesn’t matter how you found m and n or d, Euclid’s algorithm or brute search, but once you’ve found that certificate, you know you have the biggest divisor. Pretty cool. Suppose you have some other common divisor d’. That means xd’=a and yd’=b. Subsitute this into the certificate. m(xd’)+n(yd’)=(mx+ny)d’ =d. This means d’ is a divisor of d. But the divisor relation implies the inequality relation (a divisor is always less than or equal to the thing it divides). Therefore d'<=d.

But another thing I’ve been checking out is algebraic geometry and Groebner bases. Groebner bases are a method for solving multivariate polynomial equations. The method is a relative of euclidean division and also in a sense Gaussian elimination. Groebner basis are a special recombination of a set of polynomial equations such that polynomial division works uniquely on them (in many variables polynomial division doesn’t work like you’d expect from the one variable case anymore). Somewhat surprisingly other good properties come along for the ride. More on this in posts to come

Some interesting stuff in the Haskell arena:

Gröbner bases in Haskell: Part I

https://konn.github.io/computational-algebra/

I love Doug McIlroy’s powser. It is my favorite functional pearl. https://www.cs.dartmouth.edu/~doug/powser.html

One interesting aspect not much explored is that you can compose multiple layers of [] like [[[Double]]] to get multivariate power series. You can then also tuple up m of these series to make power series from R^n -> R^m. An interesting example of a category and a nonlinear relative of the matrix category going from V^n -> V^m. I’m having a hell of a time getting composition to work automatically though. I’m a little stumped.

I made a version of division that uses the Integral typeclass rather than the fractional typeclass with an eye towards these applications. It was kind of annoying but I think it is right now.

I thought that in order to make things more Groebner basis like, I should make polynomials that start with the highest power first. It also makes sense for a division algorithm, but now I’m not so sure. I should also have probably used Vector and not List.

 

 

 

 

 

RC Airplane

We decided to give making an RC airplane a shot. We have lots of parts from our drone projects. It’s interesting that we haven’t tried before. Drones appear to be more popular than rc planes if one goes by the internet.

We are going for the simplest thing we can think of. We bought some foamboard and cut it into the wing tail and body. We half cut through it to make the control surfaces.

We ripped a crappy small brushless motor and esc from our eldest drone project. The esc back powers the radio receiver through the control line

We had a bag of mini servos and a Turnigy 9x. The servos are directly connectable to the receiver unit. I don’t see why you’d need more than 3 servos to get starts. One for the ailerons connected on opposite sides of the servo horn so that they flap oppositely. We used paper clips as push rods. One for the rudder and one for the elevator flap. Putting the controller into an airplane mode is a bit confusing, but Ben figured it out eventually.

We sized things to look roughly plane like. I’ve heard a longer wingspan is better for trainers and gliders. Also we dug up a calculator for wing loading. All our stuff comes in to about 10oz.

 

 

 

 

We didn’t follow this but maybe we should have

Solving the Ising Model using a Mixed Integer Linear Program Solver (Gurobi)

I came across an interesting thing, that finding the minimizer of the Ising model is encodable as a mixed integer linear program.

The Ising model is a simple model of a magnet. A lattice of spins that can either be up or down. They want to align with an external magnetic field, but also with their neighbors (or anti align, depending on the sign of the interaction). At low temperatures they can spontaneously align into a permanent magnet. At high temperatures, they are randomized. It is a really great model that contains the essence of many physics topics.

Linear Programs minimize linear functions subject to linear equality and inequality constraints. It just so happens this is a very solvable problem (polynomial time).

MILP also allow you to add the constraint that variables take on integer values. This takes you into NP territory. Through fiendish tricks, you can encode very difficult problems. MILP solvers use LP solvers as subroutines, giving them clues where to search, letting them step early if the LP solver returns integer solutions, or for bounding branches of the search tree.

How this all works is very interesting (and very, very roughly explained), but barely matters practically since other people have made fiendishly impressive implementations of this that I can’t compete with. So far as I can tell, Gurobi is one of the best available implementations (Hans Mittelman has some VERY useful benchmarks here http://plato.asu.edu/bench.html), and they have a gimped trial license available (2000 variable limit. Bummer.). Shout out to CLP and CBC, the Coin-Or Open Source versions of this that still work pretty damn well.

Interesting Connection: Quantum Annealing (like the D-Wave machine) is largely based around mapping discrete optimization problems to an Ising model. We are traveling that road in the opposite direction.

So how do we encode the Ising model?

Each spin is a binary variable s_i \in {0,1}

We also introduce a variable for every edge. which we will constrain to actually be the product of the spins. e_{ij} \in {0,1}. This is the big trick.

We can compute the And/Multiplication (they coincide for 0/1 binary variables) of the spins using a couple linear constraints. I think this does work for the 4 cases of the two spins.

e_{ij} \ge s_i +s_j - 1

e_{ij} \le s_j

e_{ij} \le s_i

The xor is usually what we care about for the Ising model, we want aligned vs unaligned spins to have different energy. It will have value 1 if they are aligned and 0 if they are anti-aligned. This is a linear function of the spins and the And.

s_i \oplus s_j = s_i + s_j - 2 e_{ij}

Then the standard Hamiltonian is

H=\sum B_i s_i + \sum J_{ij} (s_i + s_j - 2 e_{ij})

Well, modulo some constant offset. You may prefer making spins \pm 1, but that leads to basically the same Hamiltonian.

The Gurobi python package actually let’s us directly ask for AND constraints, which means I don’t actually have to code much of this.

We are allowed to use spatially varying external field B and coupling parameter J. The Hamiltonian is indeed linear in the variables as promised.

After already figuring this out, I found this chapter where they basically do what I’ve done here (and more probably). There is nothing new under the sun. The spatially varying fields B and J are very natural in the field of spin glasses.

https://onlinelibrary.wiley.com/doi/10.1002/3527603794.ch4

For a while I thought this is all we could do, find the lowest energy solution, but there’s more! Gurobi is one of the few solvers that support iteration over the lowest optimal solutions, which means we can start to talk about a low energy expansion. https://www.gurobi.com/documentation/8.0/refman/poolsearchmode.html#parameter:PoolSearchMode

Here we’ve got the basic functionality. Getting 10,000 takes about a minute. This is somewhat discouraging when I can see that we haven’t even got to very interesting ones yet, just single spin and double spin excitations. But I’ve got some ideas on how to fix that. Next time baby-cakes.

(A hint: recursion with memoization leads to some brother of a cluster expansion.)

 

 

 

Here’s the ground antiferromagnetic state. Cute.

 

 

 

 

OSQP and Sparsegrad: Fast Model Predictive Control in Python for an inverted pendulum

OSQP is a quadratic programming solver that seems to be designed for control problems. It has the ability to warm start, which should make it faster.

I heard about it in this Julia talk

They have some really cool technology over there in Julia land.

The problem is setup as a sequential quadratic program. I have a quadratic cost to try to bring the pendulum to an upright position.

The equations of motion are roughly

\ddot{\theta}I=-mgL\sin(\theta)+mfL\cos(\theta)

\ddot{x}=f

I=\frac{1}{3}mL^2

We don’t include the back reaction of the pole on the cart. It is basically irrelevant for our systems and overly complicated for no reason. The interesting bit is the nonlinear dynamics and influence of the cart acceleration.

I write down obeying the equations of motion as a linear constraint between timesteps. I use sparsegrad to get a sparse Jacobian of the equations of motion. The upper and lower (l and u) bounds are set equal to make an equality constraint.

Our setup has a finite track about a meter long. Our motors basically are velocity control and they can go about +-1m/s. Both of these can be encapsulated as linear constraints on the position and velocity variables. l \le Ax \le u

\phi(x)=0

\phi(x) \approx \phi(x_0) + \partial \phi(x_0) (x - x_0)

A= \partial \phi(x_0)

l=u=\partial \phi(x_0) x_0 - \phi_0(x_0)

Similarly for finding the quadratic cost function in terms of the setpoint x_s. \frac{1}{2}x^T P x + q^Tx= \frac{1}{2}(x-x_s)P(x-x_s) Expanding this out we get

q = - Px_s

I run for multiple iterations to hopefully converge to a solution (it does). The initial linearization might be quite poor, but it gets better.

The OSQP program seems to be able to run in under 1ms. Nice! Initial tests of our system seem to be running at about 100Hz.

Could probably find a way to make those A matrices faster than constructing them entirely anew every time. We’ll see if we need that.

I very inelegantly smash all the variables into x. OSQP and scipy sparse don’t support multiIndex objects well, best as I can figure. Vectors should be single dimensioned and matrices 2 dimensioned.

Still to be seen if it works on hardware. I was already having infeasibility problems. Adjusting the regularization on P seemed to help.

 

osqp_cart

Approximating Compiling to Categories using Type-level Haskell: Take 2

The previous episode is here .

Summary: I’m trying to use typelevel programming in Haskell to achieve some of the aims of Conal Elliott’s compiling to categories GHC plugin. The types of highly polymorphic tuple functions are enough to specify the implementation. We aren’t going to be able to piggy-back off of GHC optimizations (a huge downside), but we can reify lambdas into other categories and avoid the scariness of plugins.

The current implementation github source is here 


JESUS CHRIST.

http://okmij.org/ftp/Haskell/de-typechecker.lhs

Of course, Oleg already did it. This is a file where he builds the implementation of a polymorphic function from the type signature. Instead of tuples, he is focusing on higher order functions with deep nesting of (->).

The trick that I was missing is in the IsFunction typeclass at the very end, which is only achievable as a Incoherent instances.

I would never have had the courage to use an Incoherent instance if I hadn’t seen a higher authority do it first. It has turned out in my typelevel journey that many instances that I’ve been tempted to make overlapping or incoherent don’t actually have to be, especially with the availability of closed type families. I think you really do need Incoherent in this application because type variables stay polymorphic forever.

To the best of my knowledge, if you need to differentiate between a tuple type (a,b) and an uninstantiated polymorphic value a’ like we do when deconstructing the input type of our lambda, you need to use an Incoherent instance. Since a’ could hypothetically eventually be unified to some (a,b) we should not be able to do different things for the two cases without stepping outside the usual laws of typeclass resolution.

New features of the implementation:

  • The new implementation does not require the V annotation of the input like previous version by using the previously mentioned. This is super cool because now I can throw the stock Prelude.fst into toCcc.
  • I changed how the categorical implementation is built, such that it short circuits with an ‘id’ if a large structure is needed from the input, rather than deconstructing all the way to every piece of the input. Lots of other optimizations would be nice (vital?), but it is a start.
  • I also implemented a FreeCat GADT so that we can see the implementation in ghci.
  • I switched from using Data.Proxy to the type annotations extensions, which is a huge readability win.
  • I added a binary application operator binApp, which is useful for encapsulating categorical literals as infix operators into your lambda expressions.
  • General cleanup, renaming, and refactoring into library files.

A couple typelevel tricks and comments:

You often have to make helper typeclasses, so don’t be afraid to. If you want something like an if-then-else in your recursion, it is very likely you need a form of the typeclass that has slots for ‘True or ‘False to help you pick the instance.

If possible, you often want the form

rather than

The type inference tends to be better.

Here are some usage examples of toCcc.

My partially implemented version of some of Conal’s category typeclasses. Should I switch over to using the constrained-categories package?

 

The actual implementation of toCcc

 

drawing-1

 

 

Variational Method of the Quantum Simple Harmonic Oscillator using PyTorch

A fun method (and useful!) for solving the ground state of the Schrodinger equation is to minimize the energy integral dx \psi^\dagger H \psi while keeping the total probability 1. Pytorch is a big ole optimization library, so let’s give it a go.

I’ve tried two versions, using a stock neural network with relus and making it a bit easier by giving a gaussian with variable width and shift.

We can mimic the probability constraint by dividing by to total normalization \int dx \psi^\dagger \psi. A Lagrange multiplier or penalty method may allows us to access higher wavefunctions.

SGD seems to do a better job getting a rounder gaussian, while Adam is less finicky but makes a harsh triangular wavefunction.

The ground state solution of -\frac{d^2\psi}{dx^2} + x^2\psi=E\psi is e^{-x^2/2}, with an energy of 1/2 (unless I botched up my head math). We may not get it, because we’re not sampling a very good total domain. Whatever, for further investigation.

Very intriguing is that pytorch has a determinant in it, I believe. That opens up the possibility of doing a Hartree-Fock style variational solution.

Here is my garbage

Edit: Hmm I didn’t compensate for the fact I was using randn sampling. That was a goof. I started using unifrom sampling, which doesn’t need compensation

 

sho

Shit Compiling to Categories using Type level Programming in Haskell

So I’ve  been trying to try and approximate Conal Elliott’s compiling to categories in bog standard Haskell and I think I’ve got an interesting chunk.

His approach in using a GHC plugin is better for a couple reasons. One really important thing is that he gets to piggy back on GHC optimizations for the lambdas. I have only implemented a very bad evaluation strategy. Perhaps we could recognize shared subexpressions and stuff, but it is more work. I seem somewhat restricted in what I can do and sometimes type inference needs some help. Not great. However, GHC plugins definitely bring their own difficulties.

What I’ve got I think still has roots in my thinking from this previous post

There are a couple insights that power this implementation

  1. A fully polymorphic tuple to tuple converter function is uniquely defined by it’s type. For example, swap :: (a,b) -> (b,a), id:: a -> a, fst :: (a,b) ->a, snd::(a,b)->b are all unique functions due to the restrictions of polymorphism. Thus typelevel programming can build the implementation from the type.
  2. Getting a typeclass definition to notice polymorphism is hard though. I haven’t figured out how to do it, if it is even possible. We get around it by explicitly newtyping every pattern match on a polymorphic variable like so \(V x, V y) -> (y,x). Two extra characters per variable. Not too shabby.
  3. You can abuse the type unification system as a substitution mechanism. Similar to HOAS, you can make a lambda interpreter at the type level that uses polymorphism as way of generating labels for variables. This is probably related to Oleg Kiselyov’s type level lambda calculus, but his kind of confuses me http://okmij.org/ftp/Computation/lambda-calc.html#haskell-type-level
  4. You can inject a categorical literal morphism using a wrapping type to be extracted later using an apply operator and type App f a. An infix ($$) makes it feel better.

 

So here is the rough structure of what the typelevel programming is doing

You can do a depth first traversal of the input tuple structure, when you hit V, unify the interior type with a Nat labelled Leaf. At the same time, you can build up the actual value of the structure so that you can apply the lambda to it and get the output which will hold a tree that has morphism literals you want.

Then inspect the output type of the lambda which now has Nat labels, and traverse the labelled tree again to find the appropriate sequence of fst and snd to extract what you need. If you run into an application App, you can pull the literal out now and continue traversing down.

At the moment I’ve only tested with the (->) Category, which is a bit like demonstrating a teleporter by deconstructing a guy and putting back in the same place, but I think it will work with others. We’ll see. I see no reason why not.

At the moment, I’m having trouble getting GHC to not freakout unless you explicitly state what category you’re working in, or explicitly state that you’re polymorphic in the Category k.

Some future work thoughts: My typelevel code is complete awful spaghetti written like I’m a spastic trapped animal. It needs some pruning. I think all those uses of Proxy can be cleaned up by using TypeApplications. I need to implement some more categories. Should I conform to the Constrained-Categories package? Could I use some kind of hash consing to find shared structures? Could Generic or Generic1 help us autoplace V or locate polymorphism? Maybe a little Template Haskell spice to inject V?

Here’s the most relevant bits, with my WIP git repository here

 

 

 

Deducing Row Sparse Matrices from Matrix Vector Product Samples

So as part of betting the banded hessian out of pytorch, I’ve been trying to think about what it is I am doing and I’ve come up with an alternative way of talking about it.

If every row of a matrix has <N non zero entries, you can back out that matrix from N matrix vector samples of it. You have many choices for the possible sampling vectors. Random works well.

 

14c0bdc6-d4b3-453d-8d98-8bef87c41a02

The unknown in this case is the row of the matrix, represented in green. We put a known set of inputs into it and get outputs. Each row of the output, represented in red, can tell use the matrix row. We have to invert the matrix that consists of all the elements that the nonzero elements of that row touches represented in blue. That black T is a transpose. To turn those row vectors into column vectors, we have to transpose everything.

For a banded matrix, we have a shifting sample matrix that we have to invert, one for each row.

A cute trick we can use to simplify the edge cases where we run off the ends is to extend the sample matrix with some random trash. That will actually put the entries in the appropriate place and will keep the don’t cares close to zero also.

In my previous post I used a stack of identity matrices. These are nice because a shifted identity matrix is a circular permutation of the entries, which is very easy to undo. That was what the loop that used numpy.roll was doing. Even better, it is easy to at least somewhat vectorize the operation and you can produce those sampling vectors using some clever use of broadcasting of an identity matrix.

An alternative formulation that is a little tighter. I want the previous version because the samples isn’t actually always random. Often they won’t really be under our control.

 

This all still doesn’t address the hermitian problem. The constraint that A is hermitian hypothetically allows you to take about half the samples. But the constraints make it tougher to solve. I haven’t come up with anything all that much better than vectorizing the matrix and building a matrix out of the samples in the appropriate spots.

I think such a matrix will be banded if A is banded, so that’s something at least.