Takeaway

Locality-Sensitive Hashing (LSH) is conceptually similar to typical hashing in that both map data to a hash value. However, LSH differs significantly in its mathematical approach and purpose. Unlike typical hashing functions, which aim to create a unique and evenly distributed hash for each unique input, LSH is designed to create hashes that reflect similarities between data points.

Key Mathematical Differences Between LSH and Typical Hashing

  1. Objective:
    • Typical Hashing: Uses hash functions to produce unique outputs that minimize collisions. The goal is to spread data uniformly across a hash space to avoid clustering and make exact retrieval fast.
    • LSH: Specifically designed to increase the likelihood of collisions for similar inputs. The mathematical structure encourages “similar” points in high-dimensional space to hash to the same or nearby buckets, facilitating similarity-based searches.
  2. Types of Hash Functions:
    • Typical Hashing: Uses deterministic, uniform hash functions (e.g., MD5, SHA-256) that are meant to produce an even distribution with high randomness. These functions are insensitive to the relationships between inputs.
    • LSH: Uses locality-sensitive hash functions, which are mathematically crafted to reflect a specific similarity measure. Different types of LSH functions correspond to different similarity measures:
      • Cosine Similarity: Hash functions that map vectors in a way that nearby vectors (with small angles between them) are more likely to end up in the same bucket.
      • Euclidean Distance: Hash functions that divide space into regions such that points close in Euclidean space map to the same hash.
      • Jaccard Similarity: For sets, hash functions that map similar sets to the same bucket based on the overlap of elements.
  3. Mathematical Mechanism:
    • Typical Hashing: A typical hash function h(x) is designed to maximize entropy and minimize exploitable patterns, so each output value appears random and independent of the input's structure.
    • LSH: LSH hash functions are designed to maintain spatial locality, where the function h(x) or family of hash functions h_1(x), h_2(x), ... is selected to reflect a particular similarity measure. For example:
      • Random Hyperplanes: For cosine similarity, LSH draws random vectors and maps each point to a bit based on the sign of its dot product with each random vector (equivalently, which side of the corresponding hyperplane the point falls on). Vectors with small angles between them are likely to produce the same bits.
      • Quantized Random Projections: For Euclidean distance, LSH projects each point onto a random line whose coefficients are drawn from a Gaussian (a 2-stable distribution), adds a random offset, and cuts the line into buckets of fixed width w, i.e., h(x) = floor((a · x + b) / w). Points close together in Euclidean space are likely to land in the same bucket.
  4. Collisions and Probability:
    • Typical Hashing: Collisions (where two distinct inputs hash to the same output) are avoided, and hash functions aim to make them infrequent.
    • LSH: Uses probabilistic collisions by design. The mathematical functions behind LSH are chosen to make it more likely that similar points will collide in the same bucket, while dissimilar points have a lower probability of colliding.
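The random-hyperplane scheme for cosine similarity described above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the dimensions, bit counts, and variable names are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def simhash_bits(v, hyperplanes):
    """One bit per random hyperplane: 1 if v lies on the positive side.
    Two vectors agree on any given bit with probability 1 - (angle / pi)."""
    return (hyperplanes @ v) >= 0

dim, n_bits = 64, 256                             # illustrative sizes
hyperplanes = rng.standard_normal((n_bits, dim))  # random normal vectors

a = rng.standard_normal(dim)
b = a + 0.05 * rng.standard_normal(dim)  # nearly parallel to a
c = rng.standard_normal(dim)             # unrelated to a

agree_ab = int(np.sum(simhash_bits(a, hyperplanes) == simhash_bits(b, hyperplanes)))
agree_ac = int(np.sum(simhash_bits(a, hyperplanes) == simhash_bits(c, hyperplanes)))
# The near-duplicate pair agrees on far more bits than the unrelated pair.
```

Grouping runs of these bits into bucket keys turns bit agreement into the probabilistic collisions described in point 4.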
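The quantized-projection family for Euclidean distance, h(x) = floor((a · x + b) / w), can be sketched the same way. Again a toy example with illustrative parameter values (bucket width, dimension, and number of hash functions are arbitrary choices, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(1)

dim, w = 32, 4.0      # dimension and bucket width (illustrative)
n_funcs = 200         # number of independent hash functions
A = rng.standard_normal((n_funcs, dim))  # Gaussian (2-stable) projection directions
B = rng.uniform(0, w, size=n_funcs)      # random offsets in [0, w)

def euclidean_buckets(x):
    """h_i(x) = floor((a_i . x + b_i) / w): project onto each random line,
    shift, and cut the line into buckets of width w."""
    return np.floor((A @ x + B) / w).astype(int)

p = rng.standard_normal(dim)
q = p + 0.01 * rng.standard_normal(dim)  # very close to p
r = p + 10.0 * rng.standard_normal(dim)  # far from p

close_rate = float(np.mean(euclidean_buckets(p) == euclidean_buckets(q)))
far_rate = float(np.mean(euclidean_buckets(p) == euclidean_buckets(r)))
# close_rate is near 1, while far_rate is much smaller.
```

Smaller w makes each hash more selective; real systems tune w and combine several hashes per bucket key to trade precision against recall.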
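For Jaccard similarity on sets, the standard locality-sensitive family is MinHash: the probability that two sets share a given MinHash value equals their Jaccard similarity. A minimal sketch, using Python's built-in `hash` with per-function salts as a stand-in for a proper universal hash family (a simplification, not how production libraries do it):

```python
import random

def minhash_signature(items, num_hashes=256, seed=0):
    """One value per hash function: the minimum salted hash over the set.
    P(two sets share a given MinHash value) = |A & B| / |A | B|."""
    rnd = random.Random(seed)
    salts = [rnd.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {"the", "quick", "brown", "fox"}
b = {"the", "quick", "brown", "dog"}  # true Jaccard: 3 shared / 5 total = 0.6
c = {"lorem", "ipsum", "dolor"}       # disjoint from a: Jaccard 0

est_ab = estimate_jaccard(minhash_signature(a), minhash_signature(b))
est_ac = estimate_jaccard(minhash_signature(a), minhash_signature(c))
# est_ab lands near 0.6; est_ac lands near 0.
```

Banding the signature (grouping consecutive values into bucket keys) then yields the similar-sets-collide behavior described in point 4.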

Summary

While both typical hashing and LSH share the general concept of mapping data to hashes, the math and purpose behind LSH are tailored to similarity preservation. LSH hash functions are specifically engineered to reflect input similarity through probabilistic, locality-sensitive mappings, making them quite distinct from typical hash functions, which are designed for uniqueness and uniform distribution.