I came across this with embeddings. It’s a way to tell whether two vectors in multi-dimensional space point in nearly the same direction. In embedding terms, this means they are conceptually similar, because the dimensions carry some form of encoded semantic meaning.
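For reference, the standard form of the formula for two vectors A and B with n dimensions each:

```latex
\text{cosine similarity}(A, B)
  = \frac{A \cdot B}{\|A\| \, \|B\|}
  = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
```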
The top part of the formula is called the ‘dot product’ and the bottom part is the product of the two ‘magnitudes’ (the vectors’ lengths). The top part says “How similar are these?” and the bottom part corrects for total length.
In code:
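import math

def cosine_similarity(a, b):
    # Dot product: sum of the element-wise products
    dot = sum(x * y for x, y in zip(a, b))
    # Magnitudes: Euclidean length of each vector
    mag_a = math.sqrt(sum(x ** 2 for x in a))
    mag_b = math.sqrt(sum(x ** 2 for x in b))
    print(dot, mag_a, mag_b)  # show the intermediate values
    return dot / (mag_a * mag_b)

Example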
A simple example is a pair of 2-dimensional vectors:
A = [2, 2] and B = [2, 4]
The dot product would be 2*2 + 2*4 = 12
and the magnitudes multiply to sqrt(2^2 + 2^2) * sqrt(2^2 + 4^2) = sqrt(8) * sqrt(20) = 2.828 * 4.472 = 12.649
and 12/12.649 = 0.9487 similarity. This is pretty close to 1, so pretty similar.
xychart-beta
    x-axis [0, 1, 2]
    y-axis 0 --> 4
    line [0, 1, 2]
    line [0, 2, 4]
If our vectors are A = [2, 2] and B = [2, 2] (identical), the dot product and the product of the magnitudes are both 8, so we’d end up with 8/8 = 1.
xychart-beta
    x-axis [0, 1, 2]
    y-axis 0 --> 4
    line [0, 1, 2]
    line [0, 1, 2]
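To double-check the arithmetic in both examples, here’s a minimal sketch that recomputes each piece separately. The helper names `dot` and `magnitude` are my own for illustration, not from any library.

```python
import math

def dot(a, b):
    # Sum of element-wise products (the top of the formula).
    return sum(x * y for x, y in zip(a, b))

def magnitude(v):
    # Euclidean length of the vector.
    return math.sqrt(dot(v, v))

def cosine_similarity(a, b):
    return dot(a, b) / (magnitude(a) * magnitude(b))

A, B = [2, 2], [2, 4]

print(dot(A, B))                               # 12
print(round(magnitude(A) * magnitude(B), 3))   # 12.649
print(round(cosine_similarity(A, B), 4))       # 0.9487
print(round(cosine_similarity(A, A), 4))       # 1.0 (identical vectors)
```

The rounding is only there to keep the printed values readable; floating-point math can leave the identical-vector case a hair under 1 before rounding.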
Normalization via magnitudes
One of the things about this that didn’t quite hit for me at the start: take a vector pair like A=[3,0] and B=[2,0], and another pair A=[3,0] and C=[2,2]. Both pairs have the same dot product (3*2 + 0*0 = 6 and 3*2 + 0*2 = 6), but they aren’t equally aligned. A and B point in exactly the same direction, just with different lengths. A and C are 45 degrees apart, because C also goes two units along the other axis. Dividing by the magnitudes is what tells them apart: AB comes out at 6/(3*2) = 1, while AC comes out at 6/(3*2.828) = 0.707.
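A quick sketch of that comparison, using the same plain-Python approach as the earlier code. The `dot` helper is my own naming, and the functions are redefined here so the snippet runs on its own.

```python
import math

def dot(a, b):
    # Sum of element-wise products.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product divided by the product of the two magnitudes.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

A, B, C = [3, 0], [2, 0], [2, 2]

print(dot(A, B), dot(A, C))                # 6 6
print(round(cosine_similarity(A, B), 4))   # 1.0
print(round(cosine_similarity(A, C), 4))   # 0.7071
```

The dot products match, but only the cosine similarity reflects the difference in direction.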
This looks like: