Deep Learning Basic Shapes

Our Github Source:

We are trying to get some basic deep learning to happen. Don some MNIST tutorials before, but there in lies the problem. These tutorials have a lot already done for you.

So we’re going to try an even simpler task. Classifying computer generated shapes.

We generate images of shapes using scikit-image.

We started off using Caffe (because it has support for non cuda gpu) and were displeased. It’s heavily configuration file based. The python api mostly is about generating configuration files for you. You need to dip into C to get some stuff done. You need to put your files into an intermediate database file. The docs mostly consist of tutorial python notebooks. It was just all too much barrier for getting started. We may return for various reasons though at some point. It has a database of pretrained networks. Also supposedly Caffe is fast.

Once we switched over to Keras, things started going a lot smoother.

We copied the VGG convolutional example network and chopped out some of it’s layers and lessened the size of the hidden layers. The more complicated the network, the more time and data its going to need. This is a simple task so it shouldn’t need so much.

We had an initial problem with the loss diverging. Our learning rate was too high. We had to cut it down. I read somewhere that a rule of thumb is to decrease the learning rate by factors of 2 until the loss stops diverging.

We also weren’t sure what cross-entropy was. This helped.

The cross entropy is the number of bits you’d need to determine the actual answer after the neural network gave you its guess for the probabilities of the answer given the input. If the neural network gives you a 1 for one answer then it is absolutely certain and you need no more bits. If it gives you a 1 for the wrong answer, you’d need infinitely many bits if you were encoding answers using the networks guesses. In efficient coding, impossible answers take infinite bits. Kind of funny thing.

If I understand it right, the cross entropy of a randomly guessing neural network would be \log _2 (3 ), since you’d expect it to take that many bits to encode 3 options. The starting point tends to be around 6 or so though?

The tensorflow vs theano backend is a constant problem. I suppose we’re going to use the tensorflow backend. Tensorflow has momentum behind it.

You can change the backend by changing keras file.

But you also need to constantly be switching the channel ordering.

The decay parameter is a thing that decreases the learning rate. Basically I think it is 1/iteration at which you want the learning to slow down?

I think we’re going to use AWS for now rather than buy a nice graphics card. Unfortunately, it is important to have CUDA capability and our card is a Radeon.

AWS isn’t so bad. We experienced maybe a 10x speedup in a very unofficial test compared to local cpu on a g2.xlarge.

This is useful. Although the software is constantly going out of date. It has enough to get running.

The smallest g2 instances are currently 0.60$/hour for a regular instance.

Spot instances are often suggested. They cannot be stopped, only terminated which sucks. You can mount an external EBS. Spot instance are somewhere between 0.20$ and 0.6$ per hour.

You can check here

Or on your AWS console.



Chebyshev Polynomials

Chebyshev polynomials are really cool.

No please. Don’t leave. I’m so lonely.

They really are. I remember them coming up a lot during my Gilbert Strang phase (his numerical computing course online is really awesome). When in doubt use Chebyshev polynomials.

They have a lot of great properties. A lot of special functions are just a bunch of random garbage.

The key to Chebyshev polynomials is that they are really \cos in disguise. and they inherit a ton of properties just from that. \cos is a very special function.


What this says is that if you transformed  the domain [-1,1] using x=\cos(\theta) , then the Chebyshev polynomials are just cosines. A couple of properties fall out of this.

First, the equioscillation property. One thing that defines chebyshev polynomials is that they are polynomials with the minimum maximum values on the domain, i.e. they oscillate between -1 and 1. This is why they are good for function approximation. Using them tends to make the approximation error have a minimum maximum value over the interval. Other polynomials of the same degree overshoot the bounds.

This is not surprising from the cosine perspective.

An equally spaced grid is the best one in a periodic domain. This domain transformation transforms the equally spaced grid \theta_n = n/2\pi N to the chebyshev points x_n = \cos(n/ 2\piN). It is somewhat surprising that these are actually much better points to sample functions at rather than the equally spaced points.

A fast n \ln n transform between the chebyshev coefficients and the sampled point values can be achieved using the discrete cosine transform. The discrete cosine transform takes an expansion in cosines to samples on the equally spaced grid and again the chebyshev functions are just cosines in disguise.

They inherit orthonormality from the orthonormality of cosines.

An advantage that chebyshev functions have over cosines is that they are polynomials. This makes them easy to manipulate and compute. The correct way to evaluate a Chebyshev series is using the recurrence.


Once you’ve sampled a function into Chebyshev form, it is simple to differentiate, integrate, multiply, and add functions. Composition of functions may be a bit harder. There is the very interesting relation that T_n(T_m(x))=T_{nm}(x), but it is not clear to me that there is an easy way to find a relation $latexT_n(a+b)$ in order to evaluate the composition of full Chebyshev series efficiently.

For differentiation and integration it is important to remember the \sin factor that occurs due to the change of domain from the unit circle to the unit interval.

Transformation between the samples at Chebyshev points and the polynomial coefficients can be achieved using a DCT.

There are two sets of Chebyshev points, one for roots of T_n one for the extrema.

The Cosine Series works with an implicit even periodic extension of the function you want to approximate. This is why it can work better than approximating a non periodic function with the Fourier series, where the implicit periodic version will have discontinuities.

Root finding becomes possible.

scipy has some solid chebyshev capabilities.

It surprises me that Haskell does not have an appropriate translation of chebfun as far as I can find. Chebfun is such a functional kind of thing, a beautiful programming pattern.

This library has some basic chebyshev capabilities

I’d like to take moment here to comment how important the vector type is in Haskell. It is not mainstream in the sense that any Haskell tutorial mentions it as I recall, but it gives you access into the world of c-like arrays (which is ugly and unhaskelly I guess).




Some Resources on Automatic Differentiation & Power Series in Haskell

Ed Kmett’s ad package. Probably everything you could ever want if you understand it.

Conal Elliott – Beautiful Differentiation – There is an implementation in vector-space

Functional Differentiation of Computer Programs J Karczmarczuk

GR raytracing uses ad package in branch

general backpropagation