Let $\mathbb{R}^2$ denote the set of all column vectors $\begin{pmatrix} x \\ y \end{pmatrix}$ as $x$ and $y$ range over all real numbers (i.e. $x$ and $y$ both belong in $\mathbb{R}$). Of course this concept is easily generalized to define $\mathbb{R}^n$ for any whole number $n$.

On $\mathbb{R}^n$ there are two simple things we can always do: add two vectors, and multiply a vector by some real number (scalar multiplication). These operations are executed term-by-term. For instance in $\mathbb{R}^2$, $\begin{pmatrix} 1 \\ 2 \end{pmatrix} + \begin{pmatrix} 3 \\ 4 \end{pmatrix}$ will result in $\begin{pmatrix} 4 \\ 6 \end{pmatrix}$ and $2 \begin{pmatrix} 1 \\ 2 \end{pmatrix}$ in $\begin{pmatrix} 2 \\ 4 \end{pmatrix}$.
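These term-by-term operations take only a few lines of Python to sketch; the helper names `add` and `scale` are made up for illustration, not from any library:

```python
# Vector addition and scalar multiplication on R^n, done term-by-term.
# A vector is modeled as a plain list of numbers.

def add(u, v):
    """Add two vectors component by component."""
    return [ui + vi for ui, vi in zip(u, v)]

def scale(c, u):
    """Multiply every component of u by the scalar c."""
    return [c * ui for ui in u]

print(add([1, 2], [3, 4]))   # [4, 6]
print(scale(2, [1, 2]))      # [2, 4]
```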

Suppose we have a function $f$ from $\mathbb{R}^2$ to $\mathbb{R}^2$. A natural question to ask is: if I have two vectors, say $u$ and $v$, and I have computed $f(u)$ and $f(v)$, can I compute $f(cu)$ for a real number $c$ simply by multiplying $f(u)$ by $c$? Similarly, can I compute $f(u + v)$ just by adding $f(u)$ and $f(v)$? There is no reason to believe that this is always true. For example, it can be checked that it is not the case for the function $f\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x^2 \\ y^2 \end{pmatrix}$. That is, for such an $f$, adding two vectors and then applying the function is not the same as applying the function to each vector and then adding the results. Order matters. The same can be said about the order of scalar multiplication and applying the function.
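As a concrete check, take the nonlinear function $f(x, y) = (x^2, y^2)$. A short Python experiment (with a made-up helper `add`) shows that the two orders disagree:

```python
# f squares each coordinate; squaring does not commute with addition,
# so f(u + v) differs from f(u) + f(v).

def f(v):
    x, y = v
    return [x * x, y * y]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

u, v = [1, 2], [3, 4]
print(f(add(u, v)))          # f(u + v) = [16, 36]
print(add(f(u), f(v)))       # f(u) + f(v) = [10, 20] -- not the same
```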

The class of functions from $\mathbb{R}^n$ to $\mathbb{R}^m$ for which the order of scalar multiplication, addition, and applying the function doesn't matter is thus special. We call such functions linear. And although they are special, they are everywhere. Any differentiable function from $\mathbb{R}^n$ to $\mathbb{R}^m$ can be approximated (locally) by a linear function. In fact, before computers could help solve non-linear problems, the main method for dealing with non-linearity was to work with its linear approximation instead.
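To make the "local approximation" claim concrete, here is a one-dimensional sketch: near a point $a$, a differentiable function such as $\sin$ is well approximated by the (affine) linear function built from its derivative. The choice $a = 0.5$ is arbitrary, just for illustration:

```python
import math

# Near x = a, sin(x) is approximated by sin(a) + cos(a) * (x - a):
# a constant offset plus a linear function of the displacement x - a.

a = 0.5

def linear_approx(x):
    return math.sin(a) + math.cos(a) * (x - a)

x = 0.51
print(math.sin(x))        # actual value
print(linear_approx(x))   # very close to the actual value
```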

Now suppose we have a linear function $f$, say from $\mathbb{R}^2$ to $\mathbb{R}^2$. Also suppose that we know the value of $f\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ to be $\begin{pmatrix} a \\ c \end{pmatrix}$ and $f\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ to be $\begin{pmatrix} b \\ d \end{pmatrix}$. Where will $f$ send $\begin{pmatrix} 3 \\ 5 \end{pmatrix}$? Easy: we know that $\begin{pmatrix} 3 \\ 5 \end{pmatrix} = 3 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + 5 \begin{pmatrix} 0 \\ 1 \end{pmatrix}$. Since the order of scalar multiplication, addition, and applying the function doesn't matter, we can apply the function to $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$, multiply the results by 3 and 5 respectively, and add them together (i.e. $f\begin{pmatrix} 3 \\ 5 \end{pmatrix} = 3 \begin{pmatrix} a \\ c \end{pmatrix} + 5 \begin{pmatrix} b \\ d \end{pmatrix}$). Since every vector in $\mathbb{R}^2$ can be written as some number times $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ plus some number times $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$, knowing where a linear function sends $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ tells us where it sends every vector in $\mathbb{R}^2$.
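This computation can be written out in a few lines of Python; the two image vectors below, $(2, 1)$ and $(-1, 3)$, are made-up sample values standing in for the known images of $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$:

```python
# A linear f is pinned down by where it sends (1, 0) and (0, 1).
# These images are sample values chosen for illustration.

f_e1 = [2, 1]    # f applied to (1, 0)
f_e2 = [-1, 3]   # f applied to (0, 1)

def apply_f(v):
    """Compute f(v) as v[0] * f(1,0) + v[1] * f(0,1), using linearity."""
    x, y = v
    return [x * f_e1[i] + y * f_e2[i] for i in range(2)]

print(apply_f([3, 5]))   # 3*(2,1) + 5*(-1,3) = [1, 18]
```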

It is easy to generalize: if $f$ is a linear function from $\mathbb{R}^3$ to $\mathbb{R}^3$ and we know where $f$ sends $\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}$, $\begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}$, and $\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$, then we know everything there is to know about $f$. Why stop at 3? The same can be said if the domain is $\mathbb{R}^4$, $\mathbb{R}^5$, and so on.
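The same recipe works in any dimension. A sketch for $n = 3$, with made-up sample values for the three images:

```python
# Given the images of the n standard vectors (here n = 3, sample values),
# linearity determines f on every vector.

images = [[1, 0, 2],   # f of (1, 0, 0)
          [0, 1, 0],   # f of (0, 1, 0)
          [3, 0, 1]]   # f of (0, 0, 1)

def apply_f(v):
    """Combine the known images with the coordinates of v as weights."""
    n = len(v)
    return [sum(v[k] * images[k][i] for k in range(n)) for i in range(n)]

print(apply_f([1, 0, 0]))   # [1, 0, 2], the first known image
print(apply_f([2, 1, 1]))   # 2*(1,0,2) + 1*(0,1,0) + 1*(3,0,1) = [5, 1, 5]
```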

So let us agree on a better notation. If I want to tell you about a linear function $f$ from $\mathbb{R}^2$ to $\mathbb{R}^2$, I do not want to write $f\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $f\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ each time. Instead, I will just write a table, where the first column is the image of $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ under $f$ and the second column is the image of $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ under $f$. The table shall look like

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

and we call this table (unsurprisingly) the matrix representing $f$. We will use this matrix again, so let us name it $A$. As always, there is no reason to stop at $\mathbb{R}^2$; I will leave the generalization to the reader. From this discussion, we see that for each linear function there is a matrix representing it, but it is easy to argue the other way too: every matrix represents some linear function. So now we know what a matrix is: a representation of a linear function from $\mathbb{R}^n$ to $\mathbb{R}^m$ as an array listing where the function sends $\begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \dots, \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix}$.
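In code, applying the linear function a matrix represents amounts to combining its columns, weighted by the coordinates of the input vector. The helper `mat_vec` and the entries of `A` below are made up for illustration:

```python
# The columns of A are the images of (1, 0) and (0, 1), so applying A
# to (x, y) gives x * (first column) + y * (second column).

A = [[2, -1],
     [1,  3]]   # stored as a list of rows; sample entries

def mat_vec(A, v):
    """Apply the linear function represented by A to the vector v."""
    return [sum(A[i][j] * v[j] for j in range(len(v)))
            for i in range(len(A))]

print(mat_vec(A, [1, 0]))   # the first column: [2, 1]
print(mat_vec(A, [3, 5]))   # [3*2 + 5*(-1), 3*1 + 5*3] = [1, 18]
```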

Now that we know what matrices are, let us talk about how they are multiplied together. Say we have another linear function $g$ from $\mathbb{R}^2$ to $\mathbb{R}^2$, represented by

$$B = \begin{pmatrix} p & q \\ r & s \end{pmatrix}.$$

We can compose the two functions and get $f \circ g$, a function from $\mathbb{R}^2$ to $\mathbb{R}^2$ (first apply $g$, then apply $f$). It is easy to argue that $f \circ g$ is also a linear function. So a natural question would be "what is the matrix representing $f \circ g$?". Well, we only need to know where $f \circ g$ sends $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ and put the results in a table. $g$ sends $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ to $\begin{pmatrix} p \\ r \end{pmatrix}$, and $f$ sends $\begin{pmatrix} p \\ r \end{pmatrix}$ to $p \begin{pmatrix} a \\ c \end{pmatrix} + r \begin{pmatrix} b \\ d \end{pmatrix} = \begin{pmatrix} ap + br \\ cp + dr \end{pmatrix}$. Similarly, $f \circ g$ sends $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ to $\begin{pmatrix} aq + bs \\ cq + ds \end{pmatrix}$. The end result is the matrix we define as $AB$, recovering the formula for matrix multiplication we all know and love:

$$AB = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} p & q \\ r & s \end{pmatrix} = \begin{pmatrix} ap + br & aq + bs \\ cp + dr & cq + ds \end{pmatrix}.$$
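A quick numerical sanity check of this formula: applying $g$ then $f$ to a vector gives the same answer as applying the single matrix $AB$. The matrices and helper names below are made up for illustration:

```python
# Matrix multiplication as composition: (A*B) applied to v equals
# A applied to (B applied to v).

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v)))
            for i in range(len(M))]

def mat_mul(M, N):
    """Entry (i, j) of M*N is the dot product of row i of M with column j of N."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))]
            for i in range(len(M))]

v = [1, 1]
print(mat_vec(A, mat_vec(B, v)))    # apply g, then f: [41, 93]
print(mat_vec(mat_mul(A, B), v))    # apply A*B directly: [41, 93]
```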

This also explains why an $m \times n$ matrix multiplied by an $n \times p$ matrix is an $m \times p$ matrix. The $m \times n$ matrix represents a linear function from $\mathbb{R}^n$ to $\mathbb{R}^m$ and the $n \times p$ matrix represents a linear function from $\mathbb{R}^p$ to $\mathbb{R}^n$. Their product represents the composition, which is a linear function from $\mathbb{R}^p$ to $\mathbb{R}^m$ and hence must be represented by an $m \times p$ matrix.
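The shape bookkeeping can be checked mechanically. Here is a sketch (with a made-up `mat_mul` helper that asserts the inner dimensions match) for a $3 \times 2$ matrix times a $2 \times 4$ matrix:

```python
# A 3x2 matrix maps R^2 -> R^3 and a 2x4 matrix maps R^4 -> R^2,
# so their product maps R^4 -> R^3 and must be 3x4.

def mat_mul(M, N):
    assert len(M[0]) == len(N), "inner dimensions must match"
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))]
            for i in range(len(M))]

M = [[1, 0], [0, 1], [1, 1]]         # 3x2, sample entries
N = [[1, 2, 3, 4], [5, 6, 7, 8]]     # 2x4, sample entries
P = mat_mul(M, N)
print(len(P), len(P[0]))             # 3 4
```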