In this note, I will collect some snippets and pseudocode that illustrate the operation of einsum and its close cousin tensordot.

(The section on tensordot is still a little rambling, will be updating it with better examples sometime)

To get up to speed, two great articles that distill the einsum operation are worth reading as an in-depth introduction.

mean using einsum

Unless you’re doing very low-level matrix manipulations, like implementing your own layers, you’re better off using a combination of existing functions like mean, sum, etc.

Suppose that you have N channels in a WxH image. You need to find the average of each channel. The for loop based calculation is:

M = Tensor(H, W, N) # Input image tensor with N channels (pseudocode)
out_vec = np.zeros(N)
for k in range(0, N):
        sum_of_channel_values = 0
        for y in range(0, H):
                for x in range(0, W):
                        sum_of_channel_values += M[y, x, k]

        out_vec[k] = sum_of_channel_values / (W * H) # Mean of channel k

This can be translated to einsum notation (refer to the linked article for details on how to convert an einsum into a series of nested for loops, with the innermost loops doing a linear combination).

The equivalent einsum notation is np.einsum("ijk->k", M) / (W * H).

The equivalent numpy mean call is np.mean(M, axis=(0, 1)).
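
A quick sanity check that the loop, the einsum, and the mean call all agree. The shapes and the random M below are made up just for the comparison:

import numpy as np

H, W, N = 4, 6, 3                # Arbitrary image dimensions and channel count
M = np.random.rand(H, W, N)      # Stand-in for the input image tensor

# The for-loop version from above, run against a concrete array
out_vec = np.zeros(N)
for k in range(0, N):
    sum_of_channel_values = 0
    for y in range(0, H):
        for x in range(0, W):
            sum_of_channel_values += M[y, x, k]
    out_vec[k] = sum_of_channel_values / (W * H)

via_einsum = np.einsum("ijk->k", M) / (W * H)
via_mean = np.mean(M, axis=(0, 1))

print(np.allclose(out_vec, via_einsum))  # True
print(np.allclose(out_vec, via_mean))    # True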

Another use-case (RGB to grayscale)

Another example where using einsum might make some sense.

Suppose that you have an RGB image and you want to convert it to grayscale. The formula to obtain the luminance value is something like $y = 0.299 ⋅ r + 0.587 ⋅ g + 0.114 ⋅ b$, where the 3-tuple $⟨r, g, b⟩$ represents a single pixel. So you have an image tensor of shape $⟨H, W, 3⟩$ and on converting to grayscale you’ll obtain an image tensor of shape $⟨H, W⟩$. Let’s first implement a scalar version using for loops, in pseudocode:

A = Tensor(H, W, C) # Here C = 3, the number of channels, but this is just pseudocode
O = zeros(H, W)
weights = [0.299, 0.587, 0.114] # Luminance weights for R, G, B

for y in range(H):
        for x in range(W):
                s = 0
                for k in range(C):
                        s += A[y, x, k] * weights[k]
        
                O[y, x] = s

This can be written as an einsum like so: np.einsum("yxk,k->yx", A, weights).

What would an implementation without einsum look like? Observe that we’re multiplying the $r$, $g$, $b$ channels by the weights $w_1$, $w_2$, $w_3$ respectively. So we can have a weight vector w = [w_1, w_2, w_3] and then reshape it to have the shape $⟨1, 1, 3⟩$. You can now broadcast-multiply this weight vector to the image tensor (which has shape $⟨H, W, 3⟩$). Then we can reduce-sum along the channel axis to obtain an image of shape $⟨H, W⟩$. So, in short, it would be something like (w.reshape(1, 1, 3) * image).sum(axis=2) (axis=2 denotes that we’re reducing over the channel axis).
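
Putting the two versions side by side on a random image (the H, W values and the random image below are made up just for the check) shows they produce the same result:

import numpy as np

H, W = 4, 6                              # Arbitrary image dimensions
image = np.random.rand(H, W, 3)          # Stand-in RGB image
w = np.array([0.299, 0.587, 0.114])      # Luminance weights

gray_einsum = np.einsum("yxk,k->yx", image, w)
gray_broadcast = (w.reshape(1, 1, 3) * image).sum(axis=2)

print(gray_einsum.shape)                         # (4, 6)
print(np.allclose(gray_einsum, gray_broadcast))  # True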

tensordot is just dot-product/linear-combination over arbitrary-rank tensors

The numpy documentation for tensordot gives the general idea regarding how to translate the call to a series of nested for loops. This is another example with some pseudocode.

You have a batch of tensors, A, of shape (BatchSize, Nx, Ny). Each item tensor is a rank-2 tensor of shape (Nx, Ny). You want to apply a linear transform to each item (i.e. a matrix multiplication) and obtain an output tensor of shape (BatchSize, Ox). So basically you are treating each item tensor as if it were flattened to shape (Nx * Ny,).

What should the shape of the multiplicand be?

Let’s tackle a simpler problem first. The input A has shape (BatchSize, Nx) and the output has shape (BatchSize, Ox). As you can probably tell, we can have a tensor W (the “weights”) of shape (Nx, Ox) and do a matmul. Does a tensordot produce the same output?

We do a tensordot along the columns of A and the rows of W: np.tensordot(A, W, axes=[(1,), (0,)]). The for-loop version would look like the following:

assert A.shape == (BatchSize, Nx)
assert W.shape == (Nx, Ox)
O = np.zeros((BatchSize, Ox))

for b in range(BatchSize):
  for ox in range(Ox):
    for ix in range(Nx):
      O[b, ox] += A[b, ix] * W[ix, ox]

As you can see, the tail of the input tensor’s shape, i.e. (Nx,), matches the head of the weight tensor’s shape, (Nx,). And each sub-tensor of shape (Nx,) is linearly combined (aka weighted-summed) using the weights to produce a 1D tensor of shape (Ox,).
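
As a quick check (with sizes chosen arbitrarily), the tensordot above really is just the familiar matmul for this rank-2 case:

import numpy as np

BatchSize, Nx, Ox = 16, 5, 7
A = np.random.randn(BatchSize, Nx)
W = np.random.randn(Nx, Ox)

via_tensordot = np.tensordot(A, W, axes=[(1,), (0,)])
via_matmul = A @ W

print(via_tensordot.shape)                     # (16, 7)
print(np.allclose(via_tensordot, via_matmul))  # True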

We can generalize this idea to map sub-tensors of shape (Nx, Ny) to 1D tensors of shape (Ox,). Note that we will still produce an output of shape (BatchSize, Ox).

Which brings us back to the original problem. We have a rank-3 input, and the sub-tensors, i.e. the tail shape of the tensor, have shape (Nx, Ny). This time, our weights tensor needs to be of shape (Nx, Ny, Ox), and we can call np.tensordot(A, W, axes=[(1, 2), (0, 1)]).

assert A.shape == (BatchSize, Nx, Ny)
assert W.shape == (Nx, Ny, Ox)
O = np.zeros((BatchSize, Ox))

for b in range(BatchSize):
  for ox in range(Ox):
    for ix in range(Nx):
      for iy in range(Ny):
        O[b, ox] += A[b, ix, iy] * W[ix, iy, ox]
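
And a check (again with arbitrary sizes) that the nested loops above compute the same thing as the tensordot call. np.allclose is used because the loop sums in a different order than tensordot does:

import numpy as np

BatchSize, Nx, Ny, Ox = 16, 5, 4, 7
A = np.random.randn(BatchSize, Nx, Ny)
W = np.random.randn(Nx, Ny, Ox)

O = np.zeros((BatchSize, Ox))
for b in range(BatchSize):
    for ox in range(Ox):
        for ix in range(Nx):
            for iy in range(Ny):
                O[b, ox] += A[b, ix, iy] * W[ix, iy, ox]

print(np.allclose(O, np.tensordot(A, W, axes=[(1, 2), (0, 1)])))  # True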

An IPython session demonstrating that tensordot is just a more general dot product that takes the dot product along the contracted axes - which can be of any rank, not just 1D vectors.

import numpy as np
bs = 16
nx = 5
ny = 4
ox = 7
A = np.random.randn(bs, nx, ny)
W = np.random.randn(nx, ny, ox)
D = np.tensordot(A, W, axes=[(1, 2), (0, 1)]) # Same as the default axes=2 here
A.shape, W.shape, D.shape # ((16, 5, 4), (5, 4, 7), (16, 7))
Ar = A.reshape((16, 20))  # Flattened each "item tensor"
Wr = W.reshape((20, 7))   # Flattened each "weight tensor"
np.matmul(Ar, Wr).shape   # (16, 7) as expected
Dr = np.matmul(Ar, Wr)    # Let's see the result of matmul on flattened tensors
np.allclose(D, Dr)        # True - same result as if we had flattened by hand