.. _sec_geometry-linear-algebraic-ops:
Geometry and Linear Algebraic Operations
========================================
In :numref:`sec_linear-algebra`, we encountered the basics of linear
algebra and saw how it could be used to express common operations for
transforming our data. Linear algebra is one of the key mathematical
pillars underlying much of the work that we do in deep learning and in
machine learning more broadly. While :numref:`sec_linear-algebra`
contained enough machinery to communicate the mechanics of modern deep
learning models, there is a lot more to the subject. In this section, we
will go deeper, highlighting some geometric interpretations of linear
algebra operations, and introducing a few fundamental concepts,
including eigenvalues and eigenvectors.
Geometry of Vectors
-------------------
First, we need to discuss the two common geometric interpretations of
vectors, as either points or directions in space. Fundamentally, a
vector is a list of numbers such as the Python list below.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
v = [1, 7, 0, 1]
Mathematicians most often write this as either a *column* or *row*
vector, which is to say either as
.. math::
\mathbf{x} = \begin{bmatrix}1\\7\\0\\1\end{bmatrix},
or
.. math::
\mathbf{x}^\top = \begin{bmatrix}1 & 7 & 0 & 1\end{bmatrix}.
These often have different interpretations, where data examples are
column vectors and weights used to form weighted sums are row vectors.
However, it can be beneficial to be flexible. As we have described in
:numref:`sec_linear-algebra`, though a single vector’s default
orientation is a column vector, for any matrix representing a tabular
dataset, treating each data example as a row vector in the matrix is
more conventional.
Given a vector, the first interpretation that we should give it is as a
point in space. In two or three dimensions, we can visualize these
points by using the components of the vectors to define the location of
the points in space compared to a fixed reference called the *origin*.
This can be seen in :numref:`fig_grid`.
.. _fig_grid:
.. figure:: ../img/grid-points.svg
An illustration of visualizing vectors as points in the plane. The
first component of the vector gives the
:math:`\mathit{x}`-coordinate, the second component gives the
:math:`\mathit{y}`-coordinate. Higher dimensions are analogous,
although much harder to visualize.
This geometric point of view allows us to consider the problem on a more
abstract level. No longer faced with some insurmountable seeming problem
like classifying pictures as either cats or dogs, we can start
considering tasks abstractly as collections of points in space and
picturing the task as discovering how to separate two distinct clusters
of points.
In parallel, there is a second point of view that people often take of
vectors: as directions in space. Not only can we think of the vector
:math:`\mathbf{v} = [3,2]^\top` as the location :math:`3` units to the
right and :math:`2` units up from the origin, we can also think of it as
the direction itself to take :math:`3` steps to the right and :math:`2`
steps up. In this way, we consider all the vectors in
:numref:`fig_arrow` the same.
.. _fig_arrow:
.. figure:: ../img/par-vec.svg
Any vector can be visualized as an arrow in the plane. In this case,
every vector drawn is a representation of the vector
:math:`(3,2)^\top`.
One of the benefits of this shift is that we can make visual sense of
the act of vector addition. In particular, we follow the directions
given by one vector, and then follow the directions given by the other,
as is seen in :numref:`fig_add-vec`.
.. _fig_add-vec:
.. figure:: ../img/vec-add.svg
We can visualize vector addition by first following one vector, and
then another.
Vector subtraction has a similar interpretation. By considering the
identity that :math:`\mathbf{u} = \mathbf{v} + (\mathbf{u}-\mathbf{v})`,
we see that the vector :math:`\mathbf{u}-\mathbf{v}` is the direction
that takes us from the point :math:`\mathbf{v}` to the point
:math:`\mathbf{u}`.
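We can check this identity numerically. Below is a minimal sketch (shown in PyTorch with arbitrarily chosen vectors; the other frameworks are analogous) confirming that starting at :math:`\mathbf{v}` and following the direction :math:`\mathbf{u}-\mathbf{v}` lands us at :math:`\mathbf{u}`.
.. code:: python
import torch
u = torch.tensor([1.0, 3.0])
v = torch.tensor([3.0, 2.0])
# Following v and then the direction u - v lands us exactly at u
print(v + (u - v))                  # tensor([1., 3.])
print(torch.equal(v + (u - v), u))  # True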
Dot Products and Angles
-----------------------
As we saw in :numref:`sec_linear-algebra`, if we take two column
vectors :math:`\mathbf{u}` and :math:`\mathbf{v}`, we can form their dot
product by computing:
.. math:: \mathbf{u}^\top\mathbf{v} = \sum_i u_i\cdot v_i.
:label: eq_dot_def
Because :eq:`eq_dot_def` is symmetric, we will mirror the notation
of classical multiplication and write
.. math::
\mathbf{u}\cdot\mathbf{v} = \mathbf{u}^\top\mathbf{v} = \mathbf{v}^\top\mathbf{u},
to highlight the fact that exchanging the order of the vectors will
yield the same answer.
The dot product :eq:`eq_dot_def` also admits a geometric
interpretation: it is closely related to the angle between two vectors.
Consider the angle shown in :numref:`fig_angle`.
.. _fig_angle:
.. figure:: ../img/vec-angle.svg
Between any two vectors in the plane there is a well defined angle
:math:`\theta`. We will see this angle is intimately tied to the dot
product.
To start, let’s consider two specific vectors:
.. math::
\mathbf{v} = (r,0) \; \textrm{and} \; \mathbf{w} = (s\cos(\theta), s \sin(\theta)).
The vector :math:`\mathbf{v}` is length :math:`r` and runs parallel to
the :math:`x`-axis, and the vector :math:`\mathbf{w}` is of length
:math:`s` and at angle :math:`\theta` with the :math:`x`-axis. If we
compute the dot product of these two vectors, we see that
.. math::
\mathbf{v}\cdot\mathbf{w} = rs\cos(\theta) = \|\mathbf{v}\|\|\mathbf{w}\|\cos(\theta).
With some simple algebraic manipulation, we can rearrange terms to
obtain
.. math::
\theta = \arccos\left(\frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}\right).
In short, for these two specific vectors, the dot product combined with
the norms tells us the angle between the two vectors. This same fact is
true in general. We will not derive the expression here; however, if we
consider writing :math:`\|\mathbf{v} - \mathbf{w}\|^2` in two ways: one
with the dot product, and the other geometrically using the law of
cosines, we can obtain the full relationship. Indeed, for any two
vectors :math:`\mathbf{v}` and :math:`\mathbf{w}`, the angle between the
two vectors is
.. math:: \theta = \arccos\left(\frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}\right).
:label: eq_angle_forumla
This is a nice result since nothing in the computation references
two dimensions. Indeed, we can use this in three or three million
dimensions without issue.
As a simple example, let’s see how to compute the angle between a pair
of vectors:
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
import torch
import torchvision
from IPython import display
from torchvision import transforms
from d2l import torch as d2l
def angle(v, w):
    return torch.acos(v.dot(w) / (torch.norm(v) * torch.norm(w)))
angle(torch.tensor([0, 1, 2], dtype=torch.float32), torch.tensor([2.0, 3, 4]))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor(0.4190)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
from IPython import display
from mxnet import gluon, np, npx
from d2l import mxnet as d2l
npx.set_np()
def angle(v, w):
    return np.arccos(v.dot(w) / (np.linalg.norm(v) * np.linalg.norm(w)))
angle(np.array([0, 1, 2]), np.array([2, 3, 4]))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array(0.41899002)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
import tensorflow as tf
from IPython import display
from d2l import tensorflow as d2l
def angle(v, w):
    return tf.acos(tf.tensordot(v, w, axes=1) / (tf.norm(v) * tf.norm(w)))
angle(tf.constant([0, 1, 2], dtype=tf.float32), tf.constant([2.0, 3, 4]))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. raw:: html
.. raw:: html
We will not use it right now, but it is useful to know that we will
refer to vectors for which the angle is :math:`\pi/2` (or equivalently
:math:`90^{\circ}`) as being *orthogonal*. By examining the equation
above, we see that this happens when :math:`\theta = \pi/2`, which is
the same thing as :math:`\cos(\theta) = 0`. The only way this can happen
is if the dot product itself is zero, and two vectors are orthogonal if
and only if :math:`\mathbf{v}\cdot\mathbf{w} = 0`. This will prove to be
a helpful formula when understanding objects geometrically.
It is reasonable to ask: why is computing the angle useful? The answer
comes in the kind of invariance we expect data to have. Consider an
image, and a duplicate image where every pixel value is the same but at
:math:`10\%` of the brightness. The values of the individual pixels are in
general far from the original values. Thus, if one computed the distance
between the original image and the darker one, the distance can be
large. However, for most ML applications, the *content* is the same—it
is still an image of a cat as far as a cat/dog classifier is concerned.
However, if we consider the angle, it is not hard to see that for any
vector :math:`\mathbf{v}`, the angle between :math:`\mathbf{v}` and
:math:`0.1\cdot\mathbf{v}` is zero. This corresponds to the fact that
scaling vectors keeps the same direction and just changes the length.
The angle considers the darker image identical.
Examples like this are everywhere. In text, we might want the topic
being discussed to not change if we write a document twice as long that
says the same thing. For some encoding (such as counting the number of
occurrences of words in some vocabulary), this corresponds to a doubling
of the vector encoding the document, so again we can use the angle.
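To make this concrete, here is a small sketch in PyTorch; the vectors are arbitrary stand-ins for a dimmed image and a doubled word-count document, and the cosine is clamped to guard against floating point overshoot.
.. code:: python
import torch
def angle(v, w):
    # Clamp to [-1, 1] to guard against tiny floating point overshoot
    cos = v.dot(w) / (torch.norm(v) * torch.norm(w))
    return torch.acos(torch.clamp(cos, -1.0, 1.0))
v = torch.tensor([1.0, 2.0, 3.0])         # stand-in for an image's pixels
doc = torch.tensor([4.0, 0.0, 1.0, 2.0])  # stand-in for word counts
print(angle(v, 0.1 * v))    # ~0: dimming does not change the direction
print(angle(doc, 2 * doc))  # ~0: doubling the document keeps the direction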
Cosine Similarity
~~~~~~~~~~~~~~~~~
In ML contexts where the angle is employed to measure the closeness of
two vectors, practitioners adopt the term *cosine similarity* to refer
to the portion
.. math::
\cos(\theta) = \frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}.
The cosine takes a maximum value of :math:`1` when the two vectors point
in the same direction, a minimum value of :math:`-1` when they point in
opposite directions, and a value of :math:`0` when the two vectors are
orthogonal. Note that if the components of high-dimensional vectors are
sampled randomly with mean :math:`0`, their cosine will nearly always be
close to :math:`0`.
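A quick sketch (PyTorch, with randomly drawn vectors) illustrates both the definition and the remark about high-dimensional random vectors.
.. code:: python
import torch
def cos_sim(v, w):
    return v.dot(w) / (torch.norm(v) * torch.norm(w))
torch.manual_seed(0)
v, w = torch.randn(10000), torch.randn(10000)
print(cos_sim(v, w))  # close to 0 for random high-dimensional vectors
print(cos_sim(torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])))  # exactly 0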
Hyperplanes
-----------
In addition to working with vectors, another key object that you must
understand to go far in linear algebra is the *hyperplane*, a
generalization to higher dimensions of a line (two dimensions) or of a
plane (three dimensions). In a :math:`d`-dimensional vector space, a
hyperplane has :math:`d-1` dimensions and divides the space into two
half-spaces.
Let’s start with an example. Suppose that we have a column vector
:math:`\mathbf{w}=[2,1]^\top`. We want to know, “what are the points
:math:`\mathbf{v}` with :math:`\mathbf{w}\cdot\mathbf{v} = 1`?” By
recalling the connection between dot products and angles above
:eq:`eq_angle_forumla`, we can see that this is equivalent to
.. math::
\|\mathbf{v}\|\|\mathbf{w}\|\cos(\theta) = 1 \; \iff \; \|\mathbf{v}\|\cos(\theta) = \frac{1}{\|\mathbf{w}\|} = \frac{1}{\sqrt{5}}.
.. _fig_vector-project:
.. figure:: ../img/proj-vec.svg
Recalling trigonometry, we see the formula
:math:`\|\mathbf{v}\|\cos(\theta)` is the length of the projection of
the vector :math:`\mathbf{v}` onto the direction of
:math:`\mathbf{w}`.
If we consider the geometric meaning of this expression, we see that
this is equivalent to saying that the length of the projection of
:math:`\mathbf{v}` onto the direction of :math:`\mathbf{w}` is exactly
:math:`1/\|\mathbf{w}\|`, as is shown in :numref:`fig_vector-project`.
The set of all points where this is true is a line at right angles to
the vector :math:`\mathbf{w}`. If we wanted, we could find the equation
for this line and see that it is :math:`2x + y = 1` or equivalently
:math:`y = 1 - 2x`.
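We can spot-check this with a few arbitrarily chosen points on the line :math:`y = 1 - 2x` (a minimal PyTorch sketch): every one of them has dot product :math:`1` with :math:`\mathbf{w}`, and hence the same projection length :math:`1/\sqrt{5}`.
.. code:: python
import torch
w = torch.tensor([2.0, 1.0])
xs = torch.tensor([-1.0, 0.0, 0.5, 3.0])
points = torch.stack([xs, 1 - 2 * xs], dim=1)  # points on the line y = 1 - 2x
print(points @ w)                    # every dot product equals 1
print((points @ w) / torch.norm(w))  # every projection length is 1/sqrt(5)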
If we now look at what happens when we ask about the set of points with
:math:`\mathbf{w}\cdot\mathbf{v} > 1` or
:math:`\mathbf{w}\cdot\mathbf{v} < 1`, we can see that these are cases
where the projections are longer or shorter than
:math:`1/\|\mathbf{w}\|`, respectively. Thus, those two inequalities
define either side of the line. In this way, we have found a way to cut
our space into two halves, where all the points on one side have dot
product below a threshold, and the other side above as we see in
:numref:`fig_space-division`.
.. _fig_space-division:
.. figure:: ../img/space-division.svg
If we now consider the inequality version of the expression, we see
that our hyperplane (in this case: just a line) separates the space
into two halves.
The story in higher dimensions is much the same. If we now take
:math:`\mathbf{w} = [1,2,3]^\top` and ask about the points in three
dimensions with :math:`\mathbf{w}\cdot\mathbf{v} = 1`, we obtain a plane
at right angles to the given vector :math:`\mathbf{w}`. The two
inequalities again define the two sides of the plane as is shown in
:numref:`fig_higher-division`.
.. _fig_higher-division:
.. figure:: ../img/space-division-3d.svg
Hyperplanes in any dimension separate the space into two halves.
While our ability to visualize runs out at this point, nothing stops us
from doing this in tens, hundreds, or billions of dimensions. This
occurs often when thinking about machine learned models. For instance,
we can understand linear classification models like those from
:numref:`sec_softmax`, as methods to find hyperplanes that separate
the different target classes. In this context, such hyperplanes are
often referred to as *decision planes*. The majority of deep learned
classification models end with a linear layer fed into a softmax, so one
can interpret the role of the deep neural network to be to find a
non-linear embedding such that the target classes can be separated
cleanly by hyperplanes.
To give a hand-built example, notice that we can produce a reasonable
model to classify tiny images of t-shirts and trousers from the
Fashion-MNIST dataset (seen in :numref:`sec_fashion_mnist`) by just
taking the vector between their means to define the decision plane and
eyeballing a crude threshold. First we will load the data and compute the
averages.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Load in the dataset
trans = []
trans.append(transforms.ToTensor())
trans = transforms.Compose(trans)
train = torchvision.datasets.FashionMNIST(root="../data", transform=trans,
train=True, download=True)
test = torchvision.datasets.FashionMNIST(root="../data", transform=trans,
train=False, download=True)
X_train_0 = torch.stack(
[x[0] * 256 for x in train if x[1] == 0]).type(torch.float32)
X_train_1 = torch.stack(
[x[0] * 256 for x in train if x[1] == 1]).type(torch.float32)
X_test = torch.stack(
[x[0] * 256 for x in test if x[1] == 0 or x[1] == 1]).type(torch.float32)
y_test = torch.stack([torch.tensor(x[1]) for x in test
if x[1] == 0 or x[1] == 1]).type(torch.float32)
# Compute averages
ave_0 = torch.mean(X_train_0, axis=0)
ave_1 = torch.mean(X_train_1, axis=0)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Load in the dataset
train = gluon.data.vision.FashionMNIST(train=True)
test = gluon.data.vision.FashionMNIST(train=False)
X_train_0 = np.stack([x[0] for x in train if x[1] == 0]).astype(float)
X_train_1 = np.stack([x[0] for x in train if x[1] == 1]).astype(float)
X_test = np.stack(
[x[0] for x in test if x[1] == 0 or x[1] == 1]).astype(float)
y_test = np.stack(
[x[1] for x in test if x[1] == 0 or x[1] == 1]).astype(float)
# Compute averages
ave_0 = np.mean(X_train_0, axis=0)
ave_1 = np.mean(X_train_1, axis=0)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Load in the dataset
((train_images, train_labels), (
    test_images, test_labels)) = tf.keras.datasets.fashion_mnist.load_data()
X_train_0 = tf.cast(tf.stack(train_images[[i for i, label in enumerate(
    train_labels) if label == 0]]), dtype=tf.float32)
X_train_1 = tf.cast(tf.stack(train_images[[i for i, label in enumerate(
    train_labels) if label == 1]]), dtype=tf.float32)
X_test = tf.cast(tf.stack(test_images[[i for i, label in enumerate(
    test_labels) if label == 0 or label == 1]]), dtype=tf.float32)
y_test = tf.cast(tf.stack(test_labels[[i for i, label in enumerate(
    test_labels) if label == 0 or label == 1]]), dtype=tf.float32)
# Compute averages
ave_0 = tf.reduce_mean(X_train_0, axis=0)
ave_1 = tf.reduce_mean(X_train_1, axis=0)
.. raw:: html
.. raw:: html
It can be informative to examine these averages in detail, so let’s plot
what they look like. In this case, we see that the average indeed
resembles a blurry image of a t-shirt.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(ave_0.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
.. figure:: output_geometry-linear-algebraic-ops_80b1c3_39_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(ave_0.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
.. figure:: output_geometry-linear-algebraic-ops_80b1c3_42_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(tf.reshape(ave_0, (28, 28)), cmap='Greys')
d2l.plt.show()
.. figure:: output_geometry-linear-algebraic-ops_80b1c3_45_0.svg
.. raw:: html
.. raw:: html
In the second case, we again see that the average resembles a blurry
image of trousers.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Plot average trousers
d2l.plt.imshow(ave_1.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
.. figure:: output_geometry-linear-algebraic-ops_80b1c3_51_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Plot average trousers
d2l.plt.imshow(ave_1.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
.. figure:: output_geometry-linear-algebraic-ops_80b1c3_54_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Plot average trousers
d2l.plt.imshow(tf.reshape(ave_1, (28, 28)), cmap='Greys')
d2l.plt.show()
.. figure:: output_geometry-linear-algebraic-ops_80b1c3_57_0.svg
.. raw:: html
.. raw:: html
In a fully machine learned solution, we would learn the threshold from
the dataset. In this case, we simply eyeballed a threshold that looked
good on the training data by hand.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Print test set accuracy with eyeballed threshold
w = (ave_1 - ave_0).T
# '@' is Matrix Multiplication operator in pytorch.
predictions = X_test.reshape(2000, -1) @ (w.flatten()) > -1500000
# Accuracy
torch.mean((predictions.type(y_test.dtype) == y_test).float(), dtype=torch.float64)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor(0.7870, dtype=torch.float64)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Print test set accuracy with eyeballed threshold
w = (ave_1 - ave_0).T
predictions = X_test.reshape(2000, -1).dot(w.flatten()) > -1500000
# Accuracy
np.mean(predictions.astype(y_test.dtype) == y_test, dtype=np.float64)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array(0.801, dtype=float64)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Print test set accuracy with eyeballed threshold
w = tf.reshape(ave_1 - ave_0, (-1, 1))
predictions = tf.reshape(X_test, (len(y_test), -1)) @ w > -1500000
# Accuracy
tf.reduce_mean(tf.cast(
    tf.cast(tf.reshape(predictions, (-1,)), y_test.dtype) == y_test, tf.float32))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. raw:: html
.. raw:: html
Geometry of Linear Transformations
----------------------------------
Through :numref:`sec_linear-algebra` and the above discussions, we
have a solid understanding of the geometry of vectors, lengths, and
angles. However, there is one important object we have omitted
discussing, and that is a geometric understanding of linear
transformations represented by matrices. Fully internalizing what
matrices can do to transform data between two potentially different high
dimensional spaces takes significant practice, and is beyond the scope
of this appendix. However, we can start building up intuition in two
dimensions.
Suppose that we have some matrix:
.. math::
\mathbf{A} = \begin{bmatrix}
a & b \\ c & d
\end{bmatrix}.
If we want to apply this to an arbitrary vector
:math:`\mathbf{v} = [x, y]^\top`, we multiply and see that
.. math::
\begin{aligned}
\mathbf{A}\mathbf{v} & = \begin{bmatrix}a & b \\ c & d\end{bmatrix}\begin{bmatrix}x \\ y\end{bmatrix} \\
& = \begin{bmatrix}ax+by\\ cx+dy\end{bmatrix} \\
& = x\begin{bmatrix}a \\ c\end{bmatrix} + y\begin{bmatrix}b \\d\end{bmatrix} \\
& = x\left\{\mathbf{A}\begin{bmatrix}1\\0\end{bmatrix}\right\} + y\left\{\mathbf{A}\begin{bmatrix}0\\1\end{bmatrix}\right\}.
\end{aligned}
This may seem like an odd computation, where something clear became
somewhat impenetrable. However, it tells us that we can write the way
that a matrix transforms *any* vector in terms of how it transforms *two
specific vectors*: :math:`[1,0]^\top` and :math:`[0,1]^\top`. This is
worth considering for a moment. We have essentially reduced an infinite
problem (what happens to any pair of real numbers) to a finite one (what
happens to these specific vectors). These vectors are an example of a
*basis*, where we can write any vector in our space as a weighted sum of
these *basis vectors*.
Let’s draw what happens when we use the specific matrix
.. math::
\mathbf{A} = \begin{bmatrix}
1 & 2 \\
-1 & 3
\end{bmatrix}.
If we look at the specific vector :math:`\mathbf{v} = [2, -1]^\top`, we
see this is :math:`2\cdot[1,0]^\top - 1\cdot[0,1]^\top`, and thus we
know that the matrix :math:`\mathbf{A}` will send this to
:math:`2(\mathbf{A}[1,0]^\top) - (\mathbf{A}[0,1]^\top) = 2[1, -1]^\top - [2,3]^\top = [0, -5]^\top`.
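We can verify this small computation directly. The sketch below (PyTorch) applies :math:`\mathbf{A}` to :math:`\mathbf{v}` and rebuilds the same answer as a weighted sum of the transformed basis vectors.
.. code:: python
import torch
A = torch.tensor([[1.0, 2.0], [-1.0, 3.0]])
e1, e2 = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])
v = torch.tensor([2.0, -1.0])
x, y = v
print(A @ v)                        # tensor([ 0., -5.])
print(x * (A @ e1) + y * (A @ e2))  # the same weighted sum of A e1 and A e2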
If we follow this logic through carefully, say by considering the grid
of all integer pairs of points, we see that what happens is that the
matrix multiplication can skew, rotate, and scale the grid, but the grid
structure must remain as you see in :numref:`fig_grid-transform`.
.. _fig_grid-transform:
.. figure:: ../img/grid-transform.svg
The matrix :math:`\mathbf{A}` acting on the given basis vectors.
Notice how the entire grid is transported along with it.
This is the most important intuitive point to internalize about linear
transformations represented by matrices. Matrices are incapable of
distorting some parts of space differently than others. All they can do
is take the original coordinates on our space and skew, rotate, and
scale them.
Some distortions can be severe. For instance the matrix
.. math::
\mathbf{B} = \begin{bmatrix}
2 & -1 \\ 4 & -2
\end{bmatrix},
compresses the entire two-dimensional plane down to a single line.
Identifying and working with such transformations is the topic of a
later section, but geometrically we can see that this is fundamentally
different from the types of transformations we saw above. For instance,
the result from matrix :math:`\mathbf{A}` can be “bent back” to the
original grid. The results from matrix :math:`\mathbf{B}` cannot because
we will never know where the vector :math:`[1,2]^\top` came from—was it
:math:`[1,1]^\top` or :math:`[0, -1]^\top`?
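A two-line check (PyTorch) makes the ambiguity explicit: two different inputs are sent to the same output, so no inverse operation can exist.
.. code:: python
import torch
B = torch.tensor([[2.0, -1.0], [4.0, -2.0]])
print(B @ torch.tensor([1.0, 1.0]))   # tensor([1., 2.])
print(B @ torch.tensor([0.0, -1.0]))  # also tensor([1., 2.])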
While this picture was for a :math:`2\times2` matrix, nothing prevents
us from taking the lessons learned into higher dimensions. If we take
similar basis vectors like :math:`[1,0, \ldots,0]` and see where our
matrix sends them, we can start to get a feeling for how the matrix
multiplication distorts the entire space in whatever dimension space we
are dealing with.
Linear Dependence
-----------------
Consider again the matrix
.. math::
\mathbf{B} = \begin{bmatrix}
2 & -1 \\ 4 & -2
\end{bmatrix}.
This compresses the entire plane down to live on the single line
:math:`y = 2x`. The question now arises: is there some way we can detect
this just looking at the matrix itself? The answer is that indeed we
can. Let’s take :math:`\mathbf{b}_1 = [2,4]^\top` and
:math:`\mathbf{b}_2 = [-1, -2]^\top` be the two columns of
:math:`\mathbf{B}`. Remember that we can write everything transformed by
the matrix :math:`\mathbf{B}` as a weighted sum of the columns of the
matrix: like :math:`a_1\mathbf{b}_1 + a_2\mathbf{b}_2`. We call this a
*linear combination*. The fact that
:math:`\mathbf{b}_1 = -2\cdot\mathbf{b}_2` means that we can write any
linear combination of those two columns entirely in terms of say
:math:`\mathbf{b}_2` since
.. math::
a_1\mathbf{b}_1 + a_2\mathbf{b}_2 = -2a_1\mathbf{b}_2 + a_2\mathbf{b}_2 = (a_2-2a_1)\mathbf{b}_2.
This means that one of the columns is, in a sense, redundant because it
does not define a unique direction in space. This should not surprise us
too much since we already saw that this matrix collapses the entire
plane down into a single line. Moreover, we see that the linear
dependence :math:`\mathbf{b}_1 = -2\cdot\mathbf{b}_2` captures this. To
make this more symmetrical between the two vectors, we will write this
as
.. math::
\mathbf{b}_1 + 2\cdot\mathbf{b}_2 = 0.
In general, we will say that a collection of vectors
:math:`\mathbf{v}_1, \ldots, \mathbf{v}_k` are *linearly dependent* if
there exist coefficients :math:`a_1, \ldots, a_k` *not all equal to
zero* so that
.. math::
\sum_{i=1}^k a_i\mathbf{v_i} = 0.
In this case, we can solve for one of the vectors in terms of some
combination of the others, and effectively render it redundant. Thus, a
linear dependence in the columns of a matrix is a witness to the fact
that our matrix is compressing the space down to some lower dimension.
If there is no linear dependence we say the vectors are *linearly
independent*. If the columns of a matrix are linearly independent, no
compression occurs and the operation can be undone.
Rank
----
If we have a general :math:`n\times m` matrix, it is reasonable to ask
what dimension space the matrix maps into. A concept known as the *rank*
will be our answer. In the previous section, we noted that a linear
dependence bears witness to compression of space into a lower dimension
and so we will be able to use this to define the notion of rank. In
particular, the rank of a matrix :math:`\mathbf{A}` is the largest
number of linearly independent columns amongst all subsets of columns.
For example, the matrix
.. math::
\mathbf{B} = \begin{bmatrix}
2 & 4 \\ -1 & -2
\end{bmatrix},
has :math:`\textrm{rank}(B)=1`, since the two columns are linearly
dependent, but either column on its own is linearly independent. For a
more challenging example, we can consider
.. math::
\mathbf{C} = \begin{bmatrix}
1& 3 & 0 & -1 & 0 \\
-1 & 0 & 1 & 1 & -1 \\
0 & 3 & 1 & 0 & -1 \\
2 & 3 & -1 & -2 & 1
\end{bmatrix},
and show that :math:`\mathbf{C}` has rank two since, for instance, the
first two columns are linearly independent; however, any collection of
three columns is linearly dependent.
This procedure, as described, is very inefficient. It requires looking
at every subset of the columns of our given matrix, and thus is
potentially exponential in the number of columns. Later we will see a
more computationally efficient way to compute the rank of a matrix, but
for now, this is sufficient to see that the concept is well defined and
understand the meaning.
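In the meantime, numerical libraries already provide such routines. For instance, the sketch below uses PyTorch's ``torch.linalg.matrix_rank`` to confirm the ranks claimed above.
.. code:: python
import torch
B = torch.tensor([[2.0, 4.0], [-1.0, -2.0]])
C = torch.tensor([[1.0, 3.0, 0.0, -1.0, 0.0],
                  [-1.0, 0.0, 1.0, 1.0, -1.0],
                  [0.0, 3.0, 1.0, 0.0, -1.0],
                  [2.0, 3.0, -1.0, -2.0, 1.0]])
print(torch.linalg.matrix_rank(B))  # tensor(1)
print(torch.linalg.matrix_rank(C))  # tensor(2)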
Invertibility
-------------
We have seen above that multiplication by a matrix with linearly
dependent columns cannot be undone, i.e., there is no inverse operation
that can always recover the input. However, if we multiply by a
full-rank matrix (i.e., some :math:`n \times n` matrix :math:`\mathbf{A}`
with rank :math:`n`), we should always be able to undo the
multiplication. Consider the matrix
.. math::
\mathbf{I} = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix},
which is the matrix with ones along the diagonal, and zeros elsewhere.
We call this the *identity* matrix. It is the matrix which leaves our
data unchanged when applied. To find a matrix which undoes what our
matrix :math:`\mathbf{A}` has done, we want to find a matrix
:math:`\mathbf{A}^{-1}` such that
.. math::
\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}.
If we look at this as a system, we have :math:`n \times n` unknowns (the
entries of :math:`\mathbf{A}^{-1}`) and :math:`n \times n` equations
(the equality that needs to hold between every entry of the product
:math:`\mathbf{A}^{-1}\mathbf{A}` and every entry of :math:`\mathbf{I}`)
so we should generically expect a solution to exist. Indeed, in the next
section we will see a quantity called the *determinant*, which has the
property that as long as the determinant is not zero, we can find a
solution. We call such a matrix :math:`\mathbf{A}^{-1}` the *inverse*
matrix. As an example, if :math:`\mathbf{A}` is the general
:math:`2 \times 2` matrix
.. math::
\mathbf{A} = \begin{bmatrix}
a & b \\
c & d
\end{bmatrix},
then we can see that the inverse is
.. math::
\frac{1}{ad-bc} \begin{bmatrix}
d & -b \\
-c & a
\end{bmatrix}.
We can check this by verifying that multiplying by the inverse given
by the formula above works in practice.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
M = torch.tensor([[1, 2], [1, 4]], dtype=torch.float32)
M_inv = torch.tensor([[2, -1], [-0.5, 0.5]])
M_inv @ M
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor([[1., 0.],
[0., 1.]])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
M = np.array([[1, 2], [1, 4]])
M_inv = np.array([[2, -1], [-0.5, 0.5]])
M_inv.dot(M)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array([[1., 0.],
[0., 1.]])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
M = tf.constant([[1, 2], [1, 4]], dtype=tf.float32)
M_inv = tf.constant([[2, -1], [-0.5, 0.5]])
tf.matmul(M_inv, M)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. raw:: html
.. raw:: html
Numerical Issues
~~~~~~~~~~~~~~~~
While the inverse of a matrix is useful in theory, we must say that most
of the time we do not wish to *use* the matrix inverse to solve a
problem in practice. In general, there are far more numerically stable
algorithms for solving linear equations like
.. math::
\mathbf{A}\mathbf{x} = \mathbf{b},
than computing the inverse and multiplying to get
.. math::
\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}.
Just as division by a small number can lead to numerical instability, so
can inversion of a matrix which is close to having low rank.
Moreover, it is common that the matrix :math:`\mathbf{A}` is *sparse*,
which is to say that it contains only a small number of non-zero values.
If we were to explore examples, we would see that this does not mean the
inverse is sparse. Even if :math:`\mathbf{A}` was a :math:`1` million by
:math:`1` million matrix with only :math:`5` million non-zero entries
(and thus we need only store those :math:`5` million), the inverse will
typically have almost every entry non-zero, requiring us to store
all :math:`1\textrm{M}^2` entries—that is :math:`1` trillion entries!
While we do not have time to dive all the way into the thorny numerical
issues frequently encountered when working with linear algebra, we want
to provide you with some intuition about when to proceed with caution,
and generally avoiding inversion in practice is a good rule of thumb.
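As a small illustration of this rule of thumb, the sketch below (PyTorch, with a random and therefore almost surely invertible system chosen purely for convenience) solves :math:`\mathbf{A}\mathbf{x} = \mathbf{b}` both directly and via the explicit inverse; the direct solve typically leaves a smaller residual.
.. code:: python
import torch
torch.manual_seed(0)
A = torch.randn(100, 100)
b = torch.randn(100)
x_solve = torch.linalg.solve(A, b)  # preferred: solve the system directly
x_inv = torch.inverse(A) @ b        # discouraged: form the inverse explicitly
print(torch.max(torch.abs(A @ x_solve - b)))  # residual of the direct solve
print(torch.max(torch.abs(A @ x_inv - b)))    # typically a larger residual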
Determinant
-----------
The geometric view of linear algebra gives an intuitive way to interpret
a fundamental quantity known as the *determinant*. Consider the grid
image from before, but now with a highlighted region
(:numref:`fig_grid-filled`).
.. _fig_grid-filled:
.. figure:: ../img/grid-transform-filled.svg
The matrix :math:`\mathbf{A}` again distorting the grid. This time, we
want to draw particular attention to what happens to the highlighted
square.
Look at the highlighted square. This is a square with edges given by
:math:`(0, 1)` and :math:`(1, 0)` and thus it has area one. After
:math:`\mathbf{A}` transforms this square, we see that it becomes a
parallelogram. There is no reason this parallelogram should have the
same area that we started with, and indeed in the specific case shown
here of
.. math::
\mathbf{A} = \begin{bmatrix}
1 & 2 \\
-1 & 3
\end{bmatrix},
it is an exercise in coordinate geometry to compute the area of this
parallelogram and obtain that the area is :math:`5`.
In general, if we have a matrix
.. math::
\mathbf{A} = \begin{bmatrix}
a & b \\
c & d
\end{bmatrix},
we can see with some computation that the area of the resulting
parallelogram is :math:`ad-bc`. This area is referred to as the
*determinant*.
Let’s check this quickly with some example code.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
torch.det(torch.tensor([[1, -1], [2, 3]], dtype=torch.float32))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor(5.)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import numpy as np
np.linalg.det(np.array([[1, -1], [2, 3]]))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
5.000000000000001
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
tf.linalg.det(tf.constant([[1, -1], [2, 3]], dtype=tf.float32))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. raw:: html
.. raw:: html
The eagle-eyed amongst us will notice that this expression can be zero
or even negative. That the value can be negative is a matter of convention
generally adopted in mathematics: if the matrix flips the figure, we say
the area is negated. Let’s see now that when the determinant is zero, we
learn more.
Let’s consider
.. math::
\mathbf{B} = \begin{bmatrix}
2 & 4 \\ -1 & -2
\end{bmatrix}.
If we compute the determinant of this matrix, we get
:math:`2\cdot(-2 ) - 4\cdot(-1) = 0`. Given our understanding above,
this makes sense. :math:`\mathbf{B}` compresses the square from the
original image down to a line segment, which has zero area. And indeed,
being compressed into a lower dimensional space is the only way to have
zero area after the transformation. Thus we see the following result is
true: a matrix :math:`A` is invertible if and only if the determinant is
not equal to zero.
As a final comment, imagine that we have any figure drawn on the plane.
Thinking like computer scientists, we can decompose that figure into a
collection of little squares so that the area of the figure is in
essence just the number of squares in the decomposition. If we now
transform that figure by a matrix, we send each of these squares to
parallelograms, each one of which has area given by the determinant. We
see that for any figure, the determinant gives the (signed) factor by
which a matrix scales the area of that figure.
Computing determinants for larger matrices can be laborious, but the
intuition is the same. The determinant remains the (signed) factor by
which :math:`n\times n` matrices scale :math:`n`-dimensional volumes.
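For instance, here is a minimal three-dimensional sketch (PyTorch): a diagonal matrix scales the unit cube's volume by the product of its stretches, and composing with a shear of determinant one leaves volumes unchanged.
.. code:: python
import torch
# Stretch the axes by 2, 3, and 1/2: the unit cube becomes a box of volume 3
A = torch.diag(torch.tensor([2.0, 3.0, 0.5]))
print(torch.det(A))  # tensor(3.)
# A shear has determinant 1, so composing with it does not change volumes
B = torch.tensor([[1.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
print(torch.det(A @ B), torch.det(A) * torch.det(B))  # both equal 3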
Tensors and Common Linear Algebra Operations
--------------------------------------------
In :numref:`sec_linear-algebra` the concept of tensors was introduced.
In this section, we will dive more deeply into tensor contractions (the
tensor equivalent of matrix multiplication), and see how they can provide
a unified view on a number of matrix and vector operations.
With matrices and vectors we knew how to multiply them to transform
data. We need to have a similar definition for tensors if they are to be
useful to us. Think about matrix multiplication:
.. math::
\mathbf{C} = \mathbf{A}\mathbf{B},
or equivalently
.. math:: c_{i, j} = \sum_{k} a_{i, k}b_{k, j}.
This pattern is one we can repeat for tensors. For tensors, there is no
one case of what to sum over that can be universally chosen, so we need
to specify exactly which indices we want to sum over. For instance, we could
consider
.. math::
y_{il} = \sum_{jk} x_{ijkl}a_{jk}.
Such a transformation is called a *tensor contraction*. It can represent
a far more flexible family of transformations than matrix multiplication
alone.
As an often-used notational simplification, we can notice that the sum is
over exactly those indices that occur more than once in the expression.
Thus, people often work with *Einstein notation*, where the summation is
implicitly taken over all repeated indices. This gives the compact
expression:
.. math::
y_{il} = x_{ijkl}a_{jk}.
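To see that this notation is nothing more than bookkeeping for nested sums, we can compare an ``einsum`` call against explicit loops on small random tensors (a PyTorch sketch with arbitrarily chosen shapes).
.. code:: python
import torch
torch.manual_seed(0)
x = torch.randn(2, 3, 4, 5)
a = torch.randn(3, 4)
# The repeated indices j and k are summed over
y = torch.einsum('ijkl,jk->il', x, a)
# The same contraction written as explicit sums
y_loops = torch.zeros(2, 5)
for i in range(2):
    for l in range(5):
        for j in range(3):
            for k in range(4):
                y_loops[i, l] += x[i, j, k, l] * a[j, k]
print(torch.allclose(y, y_loops))  # True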
Common Examples from Linear Algebra
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let’s see how many of the linear algebraic definitions we have seen
before can be expressed in this compressed tensor notation:
- :math:`\mathbf{v} \cdot \mathbf{w} = \sum_i v_iw_i`
- :math:`\|\mathbf{v}\|_2^{2} = \sum_i v_iv_i`
- :math:`(\mathbf{A}\mathbf{v})_i = \sum_j a_{ij}v_j`
- :math:`(\mathbf{A}\mathbf{B})_{ik} = \sum_j a_{ij}b_{jk}`
- :math:`\textrm{tr}(\mathbf{A}) = \sum_i a_{ii}`
In this way, we can replace a myriad of specialized notations with short
tensor expressions.
Expressing in Code
~~~~~~~~~~~~~~~~~~
Tensors may flexibly be operated on in code as well. As seen in
:numref:`sec_linear-algebra`, we can create tensors as is shown below.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Define tensors
B = torch.tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = torch.tensor([[1, 2], [3, 4]])
v = torch.tensor([1, 2])
# Print out the shapes
A.shape, B.shape, v.shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(torch.Size([2, 2]), torch.Size([2, 2, 3]), torch.Size([2]))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Define tensors
B = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = np.array([[1, 2], [3, 4]])
v = np.array([1, 2])
# Print out the shapes
A.shape, B.shape, v.shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
((2, 2), (2, 2, 3), (2,))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Define tensors
B = tf.constant([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = tf.constant([[1, 2], [3, 4]])
v = tf.constant([1, 2])
# Print out the shapes
A.shape, B.shape, v.shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(TensorShape([2, 2]), TensorShape([2, 2, 3]), TensorShape([2]))
.. raw:: html
.. raw:: html
Einstein summation has been implemented directly. The indices that
occur in the Einstein summation can be passed as a string, followed by
the tensors that are being acted upon. For instance, to implement matrix
multiplication, we can consider the Einstein summation seen above
(:math:`(\mathbf{A}\mathbf{v})_i = a_{ij}v_j`) and strip out the indices
themselves to get the implementation:
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Reimplement matrix multiplication
torch.einsum("ij, j -> i", A, v), A@v
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(tensor([ 5, 11]), tensor([ 5, 11]))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Reimplement matrix multiplication
np.einsum("ij, j -> i", A, v), A.dot(v)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(array([ 5, 11]), array([ 5, 11]))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Reimplement matrix multiplication
tf.einsum("ij, j -> i", A, v), tf.matmul(A, tf.reshape(v, (2, 1)))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. raw:: html
.. raw:: html
This is a highly flexible notation. For instance, if we want to compute
what would be traditionally written as
.. math::
c_{kl} = \sum_{ij} \mathbf{b}_{ijk}\mathbf{a}_{il}v_j,
it can be implemented via Einstein summation as:
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
torch.einsum("ijk, il, j -> kl", B, A, v)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor([[ 90, 126],
[102, 144],
[114, 162]])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
np.einsum("ijk, il, j -> kl", B, A, v)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array([[ 90, 126],
[102, 144],
[114, 162]])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
tf.einsum("ijk, il, j -> kl", B, A, v)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. raw:: html
.. raw:: html
This notation is readable and efficient for humans, but bulky if for
whatever reason we need to generate a tensor contraction
programmatically. For this reason, ``einsum`` provides an alternative
notation by providing integer indices for each tensor. For example, the
same tensor contraction can also be written as:
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# PyTorch does not support this type of notation.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
np.einsum(B, [0, 1, 2], A, [0, 3], v, [1], [2, 3])
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array([[ 90, 126],
[102, 144],
[114, 162]])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# TensorFlow does not support this type of notation.
.. raw:: html
.. raw:: html
Either notation allows for concise and efficient representation of
tensor contractions in code.
Summary
-------
- Vectors can be interpreted geometrically as either points or
directions in space.
- Dot products define the notion of angle to arbitrarily
high-dimensional spaces.
- Hyperplanes are high-dimensional generalizations of lines and planes.
They can be used to define decision planes that are often used as the
last step in a classification task.
- Matrix multiplication can be geometrically interpreted as uniform
distortions of the underlying coordinates. They represent a very
restricted, but mathematically clean, way to transform vectors.
- Linear dependence is a way to tell when a collection of vectors are
in a lower dimensional space than we would expect (say you have
:math:`3` vectors living in a :math:`2`-dimensional space). The rank
of a matrix is the size of the largest subset of its columns that are
linearly independent.
- When a matrix’s inverse is defined, matrix inversion allows us to
find another matrix that undoes the action of the first. Matrix
inversion is useful in theory, but requires care in practice owing to
numerical instability.
- Determinants allow us to measure how much a matrix expands or
contracts a space. A nonzero determinant implies an invertible
(non-singular) matrix and a zero-valued determinant means that the
matrix is non-invertible (singular).
- Tensor contractions and Einstein summation provide for a neat and
clean notation for expressing many of the computations that are seen
in machine learning.
Exercises
---------
1. What is the angle between
.. math::
\vec v_1 = \begin{bmatrix}
1 \\ 0 \\ -1 \\ 2
\end{bmatrix}, \qquad \vec v_2 = \begin{bmatrix}
3 \\ 1 \\ 0 \\ 1
\end{bmatrix}?
2. True or false: :math:`\begin{bmatrix}1 & 2\\0&1\end{bmatrix}` and
:math:`\begin{bmatrix}1 & -2\\0&1\end{bmatrix}` are inverses of one
another?
3. Suppose that we draw a shape in the plane with area
:math:`100\textrm{m}^2`. What is the area after transforming the
figure by the matrix
.. math::
\begin{bmatrix}
2 & 3\\
1 & 2
\end{bmatrix}.
4. Which of the following sets of vectors are linearly independent?
- :math:`\left\{\begin{pmatrix}1\\0\\-1\end{pmatrix}, \begin{pmatrix}2\\1\\-1\end{pmatrix}, \begin{pmatrix}3\\1\\1\end{pmatrix}\right\}`
- :math:`\left\{\begin{pmatrix}3\\1\\1\end{pmatrix}, \begin{pmatrix}1\\1\\1\end{pmatrix}, \begin{pmatrix}0\\0\\0\end{pmatrix}\right\}`
- :math:`\left\{\begin{pmatrix}1\\1\\0\end{pmatrix}, \begin{pmatrix}0\\1\\-1\end{pmatrix}, \begin{pmatrix}1\\0\\1\end{pmatrix}\right\}`
5. Suppose that you have a matrix written as
:math:`A = \begin{bmatrix}c\\d\end{bmatrix}\cdot\begin{bmatrix}a & b\end{bmatrix}`
for some choice of values :math:`a, b, c`, and :math:`d`. True or
false: the determinant of such a matrix is always :math:`0`?
6. The vectors :math:`e_1 = \begin{bmatrix}1\\0\end{bmatrix}` and
:math:`e_2 = \begin{bmatrix}0\\1\end{bmatrix}` are orthogonal. What
is the condition on a matrix :math:`A` so that :math:`Ae_1` and
:math:`Ae_2` are orthogonal?
7. How can you write :math:`\textrm{tr}(\mathbf{A}^4)` in Einstein
notation for an arbitrary matrix :math:`A`?