# What are matrices? (and why matrix multiplication is defined the way it is)

Let $\mathbb{R}^4$ denote the set of all column vectors $\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}$ as $a, b, c, d$ range over the real numbers (e.g. $\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}, \begin{bmatrix} \pi \\ \sqrt{2} \\ 3000 \\ -5 \end{bmatrix}, \begin{bmatrix} 10^{10} \\ 1/1000 \\ e \\ 0 \end{bmatrix}$ all belong to $\mathbb{R}^4$). Of course this concept is easily generalized to define $\mathbb{R}^n$ for any whole number $n$.

On $\mathbb{R}^n$ there are two simple things we can always do: add two vectors, and multiply a vector by a real number (scalar multiplication). Both operations are carried out term by term. For instance, in $\mathbb{R}^4$, $\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} + \begin{bmatrix} -3 \\ 2 \\ 5 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 - 3 \\ 2+2 \\ 3+5 \\ 4+1 \end{bmatrix} = \begin{bmatrix} -2 \\ 4 \\ 8 \\ 5 \end{bmatrix}$ and $\pi \begin{bmatrix} -3 \\ 2 \\ 5 \\ 1 \end{bmatrix} = \begin{bmatrix} -3\pi \\ 2\pi \\ 5\pi \\ \pi \end{bmatrix}$.
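Term by term really does mean term by term. Here is a tiny Python sketch of these two operations (the helper names `add` and `scale` are mine, not any standard API):

```python
def add(u, v):
    # Add two vectors component by component.
    return [a + b for a, b in zip(u, v)]

def scale(c, u):
    # Multiply every component by the scalar c.
    return [c * a for a in u]

print(add([1, 2, 3, 4], [-3, 2, 5, 1]))  # [-2, 4, 8, 5]
print(scale(2, [-3, 2, 5, 1]))           # [-6, 4, 10, 2]
```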

Suppose we have a function $f$ from $\mathbb{R}^2$ to $\mathbb{R}^3$. A natural question to ask is: if I have two vectors, say $\begin{bmatrix} 1 \\ 2 \end{bmatrix}$ and $\begin{bmatrix} 3 \\ 4 \end{bmatrix}$, and I have computed $f \left(\begin{bmatrix} 1 \\ 2 \end{bmatrix}\right)$ and $f \left(\begin{bmatrix} 3 \\ 4 \end{bmatrix}\right)$, can I compute $f \left(\begin{bmatrix} \pi \\ 2\pi \end{bmatrix}\right)$ simply by multiplying $f \left(\begin{bmatrix} 1 \\ 2 \end{bmatrix}\right)$ by $\pi$? Similarly, can I compute $f \left(\begin{bmatrix} 4 \\ 6 \end{bmatrix}\right)$ just by adding $f \left(\begin{bmatrix} 1 \\ 2 \end{bmatrix}\right)$ and $f \left(\begin{bmatrix} 3 \\ 4 \end{bmatrix}\right)$? There is no reason to believe this is always true. For example, it can be checked that it is not the case for the function $f\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} x^2 \\ y^2 \\ xy \end{bmatrix}$. That is, for such an $f$, adding two vectors and then applying the function is not the same as applying the function to each vector and then adding the results. Order matters. The same can be said about the order of scalar multiplication and applying the function.
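This failure is easy to verify directly. A quick Python check (nothing here is a library call):

```python
def f(v):
    # The non-linear example from above: (x, y) -> (x^2, y^2, xy).
    x, y = v
    return [x * x, y * y, x * y]

u, w = [1, 2], [3, 4]
lhs = f([u[0] + w[0], u[1] + w[1]])        # apply f to the sum
rhs = [a + b for a, b in zip(f(u), f(w))]  # sum the two applications
print(lhs, rhs)  # [16, 36, 24] vs. [10, 20, 14] -- not equal
```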

The class of functions from $\mathbb{R}^n$ to $\mathbb{R}^m$ for which the order of scalar multiplication, addition, and applying the function doesn't matter is thus special. We call such functions linear. And although they are special, they are everywhere. Any differentiable function from $\mathbb{R}^n$ to $\mathbb{R}^m$ can be approximated (locally) by linear functions. In fact, before computers could help solve non-linear problems, the main method for dealing with non-linearity was to work with its linear approximation instead.
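In symbols: a function $f$ is linear precisely when, for all vectors $u, v$ in its domain and every real number $c$,

$f(u + v) = f(u) + f(v) \quad \text{and} \quad f(c\,u) = c\,f(u).$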

Now suppose we have a linear function $f$, say from $\mathbb{R}^2$ to $\mathbb{R}^3$. Also suppose that we know the value of $f\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right)$ to be $\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$ and $f\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right)$ to be $\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$. Where will $f$ send $\begin{bmatrix} 3 \\ 5 \end{bmatrix}$? Easy: we know that $\begin{bmatrix} 3 \\ 5 \end{bmatrix} = 3\begin{bmatrix} 1 \\ 0 \end{bmatrix} + 5\begin{bmatrix} 0 \\ 1 \end{bmatrix}$. Since the order of scalar multiplication, addition, and applying the function doesn't matter, we can apply the function to $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$, multiply the results by 3 and 5 respectively, and add them together (i.e. $f\left(\begin{bmatrix} 3 \\ 5 \end{bmatrix}\right) = 3f\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) + 5f\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} 3 \\ 6 \\ 14 \end{bmatrix}$). Since every vector in $\mathbb{R}^2$ can be written as some number times $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ plus some number times $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$, knowing where a linear function sends $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ tells us where $f$ sends every vector in $\mathbb{R}^2$.
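The arithmetic above can be reproduced in a few lines of Python (a throwaway sketch, not any particular library's API):

```python
# Images of the basis vectors under f (given above).
f_e1 = [1, 2, 3]   # f([1, 0])
f_e2 = [0, 0, 1]   # f([0, 1])

# f([3, 5]) = 3 f([1, 0]) + 5 f([0, 1]), computed term by term.
f_35 = [3 * a + 5 * b for a, b in zip(f_e1, f_e2)]
print(f_35)  # [3, 6, 14]
```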

It is easy to generalize: if $f$ is a linear function from $\mathbb{R}^3$ to $\mathbb{R}^m$ and we know where $f$ sends $\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$, $\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$, and $\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$, then we know everything there is to know about $f$. Why stop at 3? The same can be said if the domain is $\mathbb{R}^4$, $\mathbb{R}^5$, and so on.

So let us agree on a better notation. If I want to tell you about a linear function from $\mathbb{R}^2$ to $\mathbb{R}^3$, I do not want to write $f\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$ and $f\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$ each time. Instead, I will just write a table whose first column is the image of $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ under $f$ and whose second column is the image of $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ under $f$. The table looks like

$\begin{bmatrix} 1 & 0 \\ 2 & 0 \\ 3 & 1 \end{bmatrix}$

and we call this table (unsurprisingly) the matrix representing $f$. We will use this matrix again, so let us name it $[f]$. As always, there is no reason to stop at $\mathbb{R}^2$; I will leave the generalization to the reader. From this discussion we see that for each linear function there is a matrix representing it, and it is easy to argue the other way: every matrix represents some linear function. So now we know what a matrix is: a representation of a linear function from $\mathbb{R}^n$ to $\mathbb{R}^m$ as an array listing where the function sends $\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \ldots, \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$.
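To make the "table" concrete, here is a small Python sketch (helper names are my own) that stores $[f]$ as a list of rows and applies it to a vector by taking the corresponding combination of its columns:

```python
# [f] stored row by row; its columns are the images of [1,0] and [0,1].
F = [[1, 0],
     [2, 0],
     [3, 1]]

def apply(M, v):
    # M applied to v is sum_j v[j] * (column j of M),
    # which works out to the usual row-times-vector formula.
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

print(apply(F, [1, 0]))  # [1, 2, 3], the first column
print(apply(F, [3, 5]))  # [3, 6, 14], as computed earlier
```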

Now that we know what matrices are, let us talk about how they are multiplied. Say we have another linear function $g$, this time from $\mathbb{R}^3$ to $\mathbb{R}^2$, represented by

$[g] = \begin{bmatrix} 1 & 0 & 1 \\ 2 & 1 & 0 \end{bmatrix}.$

We can compose the two functions and get $g \circ f$, from $\mathbb{R}^2$ to $\mathbb{R}^2$. It is easy to argue that $g \circ f$ is also a linear function. So a natural question is: what is the matrix representing $g \circ f$? Well, we only need to know where $g \circ f$ sends $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ and record the results in a table. $f$ sends $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ to $\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$, and $g$ sends $\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$ to $1\begin{bmatrix} 1 \\ 2 \end{bmatrix} + 2\begin{bmatrix} 0 \\ 1 \end{bmatrix} + 3\begin{bmatrix} 1 \\ 0 \end{bmatrix}$. Similarly, $g \circ f$ sends $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ to $0\begin{bmatrix} 1 \\ 2 \end{bmatrix} + 0\begin{bmatrix} 0 \\ 1 \end{bmatrix} + 1\begin{bmatrix} 1 \\ 0 \end{bmatrix}$. The end result is the matrix we define to be $[g] \times [f]$, recovering the formula for matrix multiplication we all know and love:

$\begin{bmatrix} 1 & 0 & 1 \\ 2 & 1 & 0 \end{bmatrix} \times \begin{bmatrix} 1 & 0 \\ 2 & 0 \\ 3 & 1 \end{bmatrix} = \left[ \begin{array}{c|c} 1\begin{pmatrix} 1 \\ 2 \end{pmatrix} +2\begin{pmatrix} 0 \\ 1 \end{pmatrix} +3\begin{pmatrix} 1 \\ 0 \end{pmatrix} & 0\begin{pmatrix} 1 \\ 2 \end{pmatrix} +0\begin{pmatrix} 0 \\ 1 \end{pmatrix} +1\begin{pmatrix} 1 \\ 0 \end{pmatrix} \end{array} \right] = \begin{bmatrix} 4 & 1 \\ 4 & 0 \end{bmatrix}$.
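The column-by-column recipe above translates directly into code. A minimal Python sketch (function names are my own):

```python
G = [[1, 0, 1],
     [2, 1, 0]]   # [g]
F = [[1, 0],
     [2, 0],
     [3, 1]]      # [f]

def apply(M, v):
    # M applied to a vector: the v[j]-weighted sum of M's columns.
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def matmul(A, B):
    # Column j of A x B is A applied to column j of B ...
    cols = [apply(A, [row[j] for row in B]) for j in range(len(B[0]))]
    # ... then reassemble the columns into rows.
    return [[col[i] for col in cols] for i in range(len(cols[0]))]

print(matmul(G, F))  # [[4, 1], [4, 0]]
```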

This also explains why an $m \times n$ matrix multiplied by an $n \times k$ matrix is an $m \times k$ matrix. The $n \times k$ matrix represents a linear function from $\mathbb{R}^k$ to $\mathbb{R}^n$ and the $m \times n$ matrix represents a linear function from $\mathbb{R}^n$ to $\mathbb{R}^m$. Their product represents the composition, which is a linear function from $\mathbb{R}^k$ to $\mathbb{R}^m$, and hence must be an $m \times k$ matrix.
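The shape bookkeeping can be checked mechanically. A small sketch with made-up matrices (the `matmul` helper is mine):

```python
def matmul(A, B):
    # Standard formula: A is m x n, B is n x k, result is m x k.
    m, n, k = len(A), len(B), len(B[0])
    return [[sum(A[i][j] * B[j][t] for j in range(n)) for t in range(k)]
            for i in range(m)]

A = [[1, 0, 1],
     [2, 1, 0]]        # 2 x 3: a linear map R^3 -> R^2
B = [[1, 0, 2, 0],
     [0, 1, 0, 0],
     [1, 1, 0, 3]]     # 3 x 4: a linear map R^4 -> R^3
C = matmul(A, B)
print(len(C), len(C[0]))  # 2 4  (the composition maps R^4 -> R^2)
```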