Vector Difference Functions

January 15, 2008 at 7:52 PM

Problem Formulation

Given vectors a and b of identical (but arbitrary) dimension, what's the best way to tell how "different" the vectors are?

Simple distance function

There are two different ways one can approach this problem. The trivial approach is to calculate an absolute distance between the two vectors. This works especially well in smaller dimensions:

a = [0, 5]
b = [5, 0]

If we treat these 2-dimensional vectors as points on a plane, there's a simple Euclidean two-dimensional distance function that describes the difference between a and b. This distance function scales to an arbitrary number of dimensions. Our distance for a and b is:

√((0 - 5)² + (5 - 0)²) = √50 ≈ 7.07 (or 5 * √2)

The same formula applies in a Euclidean space of any dimension. For many applications of vector math, however, a pure distance calculation becomes less interesting as the dimension of the space grows. Consider the following 5-dimensional vectors:

c = [1, 1, 1, 1, 1]
d = [1, 0, 0, 0, 0]
e = [9, 9, 9, 9, 9]

This gives us the following distances:

cd = 2
ce = 4 * √20 = 17.889
de = 2 * √97 = 19.698
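
The distances above can be reproduced with a few lines of Ruby. This is only a minimal sketch, not the Vector class mentioned at the end of this post: the method name euclidean_distance is made up for illustration, and it assumes Ruby 2.4 or later (for Array#sum) and plain arrays of numbers.

```ruby
# Euclidean distance between two equal-length arrays of numbers.
# (Hypothetical helper, not part of the author's Vector class.)
def euclidean_distance(x, y)
  raise ArgumentError, "dimension mismatch" unless x.size == y.size
  # Sum the squared component-wise differences, then take the square root.
  Math.sqrt(x.zip(y).sum { |xi, yi| (xi - yi)**2 })
end

a = [0, 5]
b = [5, 0]
c = [1, 1, 1, 1, 1]
d = [1, 0, 0, 0, 0]
e = [9, 9, 9, 9, 9]

puts euclidean_distance(a, b).round(3)  # 7.071
puts euclidean_distance(c, d).round(3)  # 2.0
puts euclidean_distance(c, e).round(3)  # 17.889
puts euclidean_distance(d, e).round(3)  # 19.698
```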

Angular difference function

In lower dimensions, the direction of a vector is a fairly simple construct. In two dimensions, each component points either left or right, or up or down. One can certainly differentiate between vectors by direction alone, but in low dimensions that's often not as meaningful as a regular distance calculation. Referring back to vectors a and b, we can quickly calculate that (assuming [0, 0] as the origin) they are 90° apart; in a word, orthogonal.

We can calculate an angle between two N-dimensional vectors as follows (for more detail, read about the structure of Euclidean spaces):

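θ = cos⁻¹((x · y) / (||x|| * ||y||))

Here x · y is the dot product of the two vectors, and ||x|| is the Euclidean length of x (its distance from the origin).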

For our 5-dimensional vectors, this gives us the following angular relationships:

a(c,d) = 63.435°
a(c,e) = 0°
a(d,e) = 63.435°
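
Here is the angle formula as a rough Ruby sketch. The helper names (dot, norm, angle_degrees) are invented for this example and are not taken from the Vector class mentioned at the end of this post; Ruby 2.4 or later is assumed for Array#sum and clamp. The clamp is there because floating-point rounding can push the cosine a hair outside [-1, 1], which would make Math.acos raise an error.

```ruby
# Dot product of two equal-length arrays of numbers.
def dot(x, y)
  x.zip(y).sum { |xi, yi| xi * yi }
end

# Euclidean length of a vector.
def norm(x)
  Math.sqrt(dot(x, x))
end

# Angle between two vectors, in degrees.
def angle_degrees(x, y)
  raise ArgumentError, "dimension mismatch" unless x.size == y.size
  lengths = norm(x) * norm(y)
  raise ArgumentError, "angle is undefined for a zero vector" if lengths.zero?
  cosine = dot(x, y) / lengths
  # Guard against rounding error before calling acos.
  Math.acos(cosine.clamp(-1.0, 1.0)) * 180.0 / Math::PI
end

c = [1, 1, 1, 1, 1]
d = [1, 0, 0, 0, 0]
e = [9, 9, 9, 9, 9]

puts angle_degrees(c, d).round(3)  # 63.435
puts angle_degrees(c, e).round(3)  # 0.0
puts angle_degrees(d, e).round(3)  # 63.435
```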

Which function to choose?

The angular and distance relationships differ greatly with this set of vectors. Although the distance between c and e is almost as large as that between d and e, the angle between the former pair is 0. In other words, c and e point in exactly the same direction, and differ only in magnitude. Thus, depending on the real-world relationships that one might be modeling with these vectors, c and e might be considered identical. d is the oddball from the angular perspective.

"What's a practical use of this?", you ask. Imagine evaluating candidates in a political race based on how they appropriated money in the past. Take as many differentiating spending categories as can be imagined; each category represents a dimension in our vector, and the amount a given candidate spent in that category is the candidate's component along that dimension.

So, we can envision the following political arena (cost in millions; stereotypes in spades):

Candidate        | Education | Defense | Health Care | Raising Own Pay | Republic of Cascadia
Joe Greedy       |         0 |       0 |           0 |           1,000 |                    0
Leftist Len      |       450 |     100 |         590 |             100 |                   10
Ron Rightwing    |       100 |     666 |         200 |             100 |                   50
Cascadian Callie |         1 |       1 |           1 |               0 |                    5

Practically speaking, how similar are these candidates to each other? Using our two methods for calculating vector differences, we get two very different answers:

Difference between candidates based on Euclidean distance
                 | Joe Greedy | Leftist Len | Ron Rightwing | Cascadian Callie
Joe Greedy       |          0 |      1170.8 |        1142.8 |           1000.0
Leftist Len      |     1170.8 |           0 |        772.37 |           753.89
Ron Rightwing    |     1142.8 |      772.37 |             0 |           709.68
Cascadian Callie |     1000.0 |      753.89 |        709.68 |                0

Difference between candidates based on angular difference (in degrees)
                 | Joe Greedy | Leftist Len | Ron Rightwing | Cascadian Callie
Joe Greedy       |          0 |      82.393 |        81.919 |           90.000
Leftist Len      |     82.393 |           0 |        63.463 |           72.681
Ron Rightwing    |     81.919 |      63.463 |             0 |           71.153
Cascadian Callie |     90.000 |      72.681 |        71.153 |                0
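
Both tables can be regenerated from the spending figures. The sketch below is one possible way to do it in Ruby; the helper names (distance, angle_degrees, pairwise) are invented for illustration, the vectors are typed in by hand from the spending table, and Ruby 2.4 or later is assumed.

```ruby
# Spending per candidate, in the column order of the spending table:
# Education, Defense, Health Care, Raising Own Pay, Republic of Cascadia.
candidates = {
  "Joe Greedy"       => [0, 0, 0, 1000, 0],
  "Leftist Len"      => [450, 100, 590, 100, 10],
  "Ron Rightwing"    => [100, 666, 200, 100, 50],
  "Cascadian Callie" => [1, 1, 1, 0, 5]
}

def distance(x, y)
  Math.sqrt(x.zip(y).sum { |xi, yi| (xi - yi)**2 })
end

def angle_degrees(x, y)
  dot = x.zip(y).sum { |xi, yi| xi * yi }
  lengths = Math.sqrt(x.sum { |xi| xi**2 }) * Math.sqrt(y.sum { |yi| yi**2 })
  # Clamp to absorb floating-point rounding before acos.
  Math.acos((dot / lengths).clamp(-1.0, 1.0)) * 180.0 / Math::PI
end

# Build a nested hash: result[name_a][name_b] = metric(vector_a, vector_b).
# The metric is supplied as a block.
def pairwise(vectors)
  vectors.transform_values do |v|
    vectors.transform_values { |w| yield(v, w) }
  end
end

distances = pairwise(candidates) { |v, w| distance(v, w) }
angles    = pairwise(candidates) { |v, w| angle_degrees(v, w) }

# Print one row per candidate, in the same column order as the tables.
distances.each do |name, row|
  puts "#{name}: #{row.values.map { |x| x.round(2) }.join('  ')}"
end
angles.each do |name, row|
  puts "#{name}: #{row.values.map { |x| x.round(3) }.join('  ')}"
end
```

Passing the metric as a block keeps the pairing logic in one place, so swapping in a different difference function only changes the block.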

Analysis

What can we glean from these tables? Let's first take a look at one of the edge cases. Joe Greedy, who I hope will be beaten soundly in this election, gives money to no one but himself. According to the angular difference, he and Cascadian Callie have nothing whatsoever in common. This is because Callie is a wholly virtuous individual, giving no money to herself whatsoever. Their vectors are completely orthogonal. However, Callie appears to be the nearest candidate to Joe (i.e. of Joe's neighbors, she has the smallest difference) when considered with a distance function.

What!? That's right: since Callie is an under-funded independent candidate, her vector is small in magnitude when compared to the other candidates' vectors. She's the "closest" of Joe's neighbors, even though the two are ideological opposites. Joe's vector lies along a single axis, and is closer to the origin (represented by [0,0,0,0,0]) than to any of the other candidates. This is because every other candidate has moved away from the origin along dimensions that Joe ignores. Callie's moves, however, are the smallest in magnitude, so she has ended up the least distance away from Joe despite being the only one who's completely orthogonal in direction (i.e. ideology).
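
A quick way to see the effect of magnitude is to print each candidate's distance from the origin. This is a small sketch with made-up helper names (norm, distance), using the figures from the spending table and assuming Ruby 2.4 or later.

```ruby
candidates = {
  "Joe Greedy"       => [0, 0, 0, 1000, 0],
  "Leftist Len"      => [450, 100, 590, 100, 10],
  "Ron Rightwing"    => [100, 666, 200, 100, 50],
  "Cascadian Callie" => [1, 1, 1, 0, 5]
}

# Euclidean length of a vector, i.e. its distance from the origin.
def norm(x)
  Math.sqrt(x.sum { |xi| xi**2 })
end

def distance(x, y)
  Math.sqrt(x.zip(y).sum { |xi, yi| (xi - yi)**2 })
end

magnitudes = candidates.transform_values { |v| norm(v) }
magnitudes.each { |name, m| puts "#{name}: #{m.round(2)}" }
# Joe Greedy: 1000.0
# Leftist Len: 755.45
# Ron Rightwing: 711.38
# Cascadian Callie: 5.29

# Callie sits so close to the origin that Joe's distance to her is
# barely larger than Joe's own distance from the origin.
joe_to_callie = distance(candidates["Joe Greedy"], candidates["Cascadian Callie"])
puts joe_to_callie.round(3)  # 1000.014
```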

Note also the difference between Ron Rightwing's and Leftist Len's numbers as generated by the different functions. The distance function puts both closer to Callie than to each other, even though neither really shows a ton of material support for the Cascadia movement (4.5% and 0.8% of all money, respectively). The angular function, however, puts them as each other's nearest neighbor, followed by Callie. It recognizes that they both support Education, Defense, and Health Care to varying degrees. It also takes into account the fact that Callie spends a little bit on those three all-important issues but that Joe does not; this pushes Joe into last place relative to all three (where he belongs, that jerk).
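
The nearest-neighbor orderings discussed here can be derived mechanically by sorting each candidate's neighbors under each function. The following is only a sketch: neighbor_rankings and the other helper names are hypothetical, the data comes from the spending table, and Ruby 2.4 or later is assumed.

```ruby
candidates = {
  "Joe Greedy"       => [0, 0, 0, 1000, 0],
  "Leftist Len"      => [450, 100, 590, 100, 10],
  "Ron Rightwing"    => [100, 666, 200, 100, 50],
  "Cascadian Callie" => [1, 1, 1, 0, 5]
}

def distance(x, y)
  Math.sqrt(x.zip(y).sum { |xi, yi| (xi - yi)**2 })
end

def angle_degrees(x, y)
  dot = x.zip(y).sum { |xi, yi| xi * yi }
  lengths = Math.sqrt(x.sum { |xi| xi**2 }) * Math.sqrt(y.sum { |yi| yi**2 })
  Math.acos((dot / lengths).clamp(-1.0, 1.0)) * 180.0 / Math::PI
end

# For each name, list the other names from most to least similar,
# using the difference function supplied as a block.
def neighbor_rankings(vectors)
  vectors.map do |name, v|
    others = vectors.reject { |other, _| other == name }
    [name, others.sort_by { |_, w| yield(v, w) }.map(&:first)]
  end.to_h
end

by_distance = neighbor_rankings(candidates) { |v, w| distance(v, w) }
by_angle    = neighbor_rankings(candidates) { |v, w| angle_degrees(v, w) }

p by_distance["Leftist Len"]
# ["Cascadian Callie", "Ron Rightwing", "Joe Greedy"]
p by_angle["Leftist Len"]
# ["Ron Rightwing", "Cascadian Callie", "Joe Greedy"]
```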

Summary

The traditional distance function is always useful, but angular difference is often a better-suited tool for the job. The political candidate similarity chart is one such application. I'd love to create an angular similarity graph of Federal politicians based on a large number of political dimensions. It would be very useful to see, outside all of the jockeying and positioning, what things politicians actually vote for, and which people vote most similarly. I believe the results would be surprising.

Source Code

I developed a really simplistic Vector class (in Ruby) to help me do these calculations, and I'm releasing it under the MIT license. It's definitely not production-ready, is unfit for any use, et cetera, et cetera.