Many machine learning methods depend heavily on distance measurements. A distance measurement is a calculation of how far apart two data points are. This distance measure is typically used to determine the similarity between data points in both supervised and unsupervised learning.
Whether used for clustering or classification tasks, a well-chosen distance metric improves the performance of our ML model.
Consider a problem where we want to group or label data points, using K-Means clustering or the K-Nearest Neighbors (KNN) algorithm. If two points have similar features, they will lie close to each other.
So it is enough to measure the distances between points to judge whether they are similar. The big question is: how do we define this distance, and what are the various distance metrics in use?
Of all the different distance measures, Euclidean distance is the most widely used, but it has a drawback: as dimensionality increases, it becomes less useful. For higher-dimensional data, we use Manhattan distance.
In this article, we will discuss Manhattan distance and why it is used for higher-dimensional data.
What is Manhattan Distance?
The Manhattan distance, also known as the taxicab distance or city block distance, is calculated between two real-valued vectors. It is especially useful for vectors that describe objects on a uniform grid, such as a chessboard or city blocks.
The taxicab name reflects the intuition behind the measure: a taxi in a gridded city cannot drive in a straight line to its destination but must follow the street grid. In integer feature spaces, where movement is not along straight lines, the Manhattan distance can be a more natural choice than the Euclidean distance. Manhattan distance is calculated as the sum of the absolute differences between the two vectors.
More explicitly, in an n-dimensional real vector space with a fixed Cartesian coordinate system, the taxicab distance d between two vectors x and y is the sum of the lengths of the projections of the line segment between the points onto the coordinate axes.
The Manhattan distance formula is represented as follows:

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|

where d is the distance function and x = (x_1, ..., x_n) and y = (y_1, ..., y_n) are n-dimensional vectors.
If we consider a Cartesian coordinate system with two dimensions (a plane), the taxicab or Manhattan distance between two points p = (p_1, p_2) and q = (q_1, q_2) is:

d(p, q) = |p_1 - q_1| + |p_2 - q_2|
Now that we have seen what the Manhattan distance is, let's see how to calculate it programmatically in Python.
The code below calculates the Manhattan distance between two points:
def manhattan_distance(x, y):
    # Sum of the absolute differences between corresponding components
    return sum(abs(a - b) for a, b in zip(x, y))

x = [20, 30, 40, 50, 60]
y = [12, 26, 16, 9, 8]

d = manhattan_distance(x, y)
print(d)
In the above code, we defined a function that calculates the Manhattan distance, defined our data points, and then called the function to calculate the distance.
The output of the above code is as shown below:
129
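If SciPy is available, the same result can also be obtained from its built-in implementation, where the Manhattan distance goes by the name cityblock:

from scipy.spatial import distance

x = [20, 30, 40, 50, 60]
y = [12, 26, 16, 9, 8]

# cityblock is SciPy's name for the Manhattan (taxicab) distance
print(distance.cityblock(x, y))  # prints 129, matching our function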
Thus, in taxicab geometry, the taxicab distance is the sum of the absolute differences of the points' Cartesian coordinates. But what is it used for in machine learning?
Manhattan Distance in Machine Learning
Due to the curse of dimensionality, Euclidean distance becomes a poor choice as the number of dimensions rises. With the increase in dimensions, the problems of correlation and non-linearity grow, and Euclidean distance, which measures the shortest straight-line path between two points, loses its power to tell points apart.
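To make this concrete, here is a minimal NumPy sketch (an illustration of ours, not one of the article's original examples): it draws random points in spaces of increasing dimensionality and prints the relative contrast between the nearest and farthest point from a random query. The exact numbers depend on the random seed, but the trend is that the contrast shrinks as dimensions grow.

import numpy as np

rng = np.random.default_rng(0)

for dims in (2, 10, 1000):
    points = rng.random((100, dims))  # 100 random points in `dims` dimensions
    query = rng.random(dims)          # a random query point
    dists = np.sqrt(((points - query) ** 2).sum(axis=1))  # Euclidean distances
    contrast = (dists.max() - dists.min()) / dists.min()  # relative spread
    print(dims, round(contrast, 3))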
In higher-dimensional regression analysis, the Manhattan distance is used to fit a straight line to a given set of points by minimizing the sum of absolute errors (least absolute deviations). It is also used to compare discrete frequency distributions. Manhattan distance is also known as the L1 metric.
In machine learning and data science, Manhattan distance is used in speech processing and image processing, including face recognition and image segmentation.
Now that we have seen what Manhattan distance is, let's look at other distance metrics: Minkowski distance, Euclidean distance, Hamming distance, and cosine distance.
What is Minkowski Distance?
Minkowski distance is named after the German mathematician Hermann Minkowski. It is used as a metric on a normed vector space, where the distance between two vectors is the length (norm) of their difference.
For an order p ≥ 1, the Minkowski distance between the vectors x and y is defined as follows:

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
Now we will see how to calculate the Minkowski distance in Python.
def minkowski_distance(x, y, p):
    # Generalized distance of order p (p=1: Manhattan, p=2: Euclidean)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x = [10, 20, 30, 40, 50]
y = [50, 40, 30, 20, 10]

d = minkowski_distance(x, y, 1)
print(d)
In the above code snippet, we first defined a function that calculates the Minkowski distance as per the formula, then defined the x and y vectors, and finally called the function to calculate the distance. The output of the above code is
120.0
If p = 1 we get the Manhattan distance, and if p = 2 we get the Euclidean distance.
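We can verify this with the vectors used above by calling the function with p = 1 and p = 2 and comparing the p = 2 result against a direct Euclidean calculation (a small self-contained check; the function definition is repeated so the snippet runs on its own):

from math import sqrt

def minkowski_distance(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x = [10, 20, 30, 40, 50]
y = [50, 40, 30, 20, 10]

print(minkowski_distance(x, y, 1))  # 120.0 -> the Manhattan distance
print(minkowski_distance(x, y, 2))  # 63.245... -> the Euclidean distance
print(sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))  # same value as p=2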
What is Euclidean Distance?
Euclidean distance is the shortest straight-line distance between two points. In the Euclidean plane, if the Cartesian coordinates of the two points are p = (p_1, p_2) and q = (q_1, q_2), the Euclidean distance is defined as follows:

d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}

More generally, for n-dimensional vectors x and y:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
We calculate the Euclidean distance programmatically as follows:
from math import sqrt

def euclidean_distance(x, y):
    # Square root of the sum of squared component differences
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [50, 40, 30, 20, 10]
y = [10, 20, 30, 40, 50]

d = euclidean_distance(x, y)
print(d)
In the above code, we created a function that calculates the distance between two vectors as per the above formula, supplied the values, and called the function to calculate the distance. The output of the code is as follows:
63.245553203367585
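For the Euclidean distance specifically, Python's standard library (3.8 and later) also provides math.dist, which returns the same value:

from math import dist

x = [50, 40, 30, 20, 10]
y = [10, 20, 30, 40, 50]

# math.dist computes the Euclidean distance between two points
print(dist(x, y))  # 63.245553203367585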
What is Hamming Distance?
The Hamming distance is the distance between two binary vectors, also known as binary strings or bitstrings. You will often encounter bitstrings when you one-hot encode a categorical feature. For example, a feature with the categories 'red' and 'green' can be encoded as a bitstring with a single bit for every category.
The distance between red and green can then be determined by summing the bit differences between the two bitstrings (or averaging them, if a value normalized by length is wanted). That sum of differing bits is the Hamming distance, as the sketch below shows.
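As a quick illustration (the specific encodings red = [1, 0] and green = [0, 1] are an assumed one-hot scheme for this sketch):

# One-hot encodings for the two categories (illustrative choice)
red = [1, 0]
green = [0, 1]

# Hamming distance: count the positions where the bits differ
distance = sum(a != b for a, b in zip(red, green))
print(distance)  # 2 -> the encodings differ in both positions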
More generally, a Python program to calculate the Hamming distance is as follows:
def hamming_distance(x, y):
    # Count of positions where the bits differ
    return sum(abs(a - b) for a, b in zip(x, y))

x = [1, 1, 0, 0, 1]
y = [1, 1, 1, 0, 0]

z = hamming_distance(x, y)
print(z)
We created a function that calculates the Hamming distance, defined the values, called the function, and finally printed the result. The output of the above code is as follows:
2
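Note that SciPy also ships a Hamming distance, but it returns the average rather than the sum of the bit differences, i.e. the proportion of positions that differ (2 out of 5 here):

from scipy.spatial import distance

x = [1, 1, 0, 0, 1]
y = [1, 1, 1, 0, 0]

# SciPy normalizes by the vector length: 2 differing bits out of 5
print(distance.hamming(x, y))  # 0.4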
What is Cosine Distance?
To understand cosine distance, we first need to understand cosine similarity, which measures the similarity between two vectors via the angle between them. The cosine similarity between two vectors x and y is represented as follows:

\cos(\theta) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}
From the above formula we can see that the range of cosine similarity lies between -1 and 1.
Cosine distance is simply the complement of cosine similarity. The underlying meaning is that if two vectors are similar, the distance between them is zero because the similarity is one.
Cosine distance between two points or two vectors is formally defined as:

d_{\cos}(x, y) = 1 - \cos(\theta) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|}
We can see that the higher the similarity, the smaller the distance. These metrics are used for text analysis and document similarity. We can calculate the cosine distance between two vectors directly through the SciPy library as follows:
from scipy import spatial

x = [1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10]

# SciPy's cosine() returns 1 minus the cosine similarity
cosine_distance = spatial.distance.cosine(x, y)
print(cosine_distance)
We defined two vectors x and y, and using SciPy we calculated the cosine distance between them. The output is as follows:
0.03504949526723289
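To confirm what SciPy computed, here is a minimal from-scratch version of the same formula; it reproduces the value above:

from math import sqrt

def cosine_distance(x, y):
    # 1 minus the cosine of the angle between the two vectors
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)

print(cosine_distance([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))  # 0.035049...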
Conclusion
Several distance metrics exist to solve different problems. Though Euclidean distance is the most widely used, it struggles when a dataset suffers from high dimensionality, so choosing the right distance metric for the problem at hand is important. Manhattan distance is most useful when your dataset suffers from the curse of dimensionality, and the other metrics are valuable in their own use cases.