Implementing a Hash Map in Python: A Clean and Simple Approach
Recently, I started tackling LeetCode problems for fun and came across one called "Design HashMap," which is marked as easy. The solution seemed straightforward at first, but as I delved into the implementation, I realized how much I needed to refresh my knowledge of the topic.
A HashMap is a powerful data structure that stores key-value pairs and enables fast access, insertion, and deletion of elements based on their keys. It achieves this efficiency by computing a hash of the key, which determines the index in an underlying array (often referred to as the "bucket" array) where the corresponding value is stored. On average, this lets each operation run in constant time, making the HashMap a highly efficient structure for many applications.
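To make the idea concrete, here is a minimal sketch of how a key is turned into a bucket index (the bucket count of 10 is an arbitrary choice for illustration). Note that Python randomizes string hashes per process, so the exact indices vary between runs:

```python
# Map a few keys to indices in a 10-slot bucket array.
# abs() guards against negative hash values; % keeps the index in range.
bucket_count = 10

for key in ["apple", "banana", 42, -7]:
    index = abs(hash(key)) % bucket_count
    assert 0 <= index < bucket_count
    print(f"{key!r} -> bucket {index}")
```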
In this article, we'll build a simple hash map implementation in Python from scratch. We'll discuss the underlying principles such as hashing, handling collisions, resizing, and maintaining performance as the data structure grows.
Key Features of Our Hash Map:
Hashing: Each key is hashed to determine its index in the underlying bucket array.
Collision Handling: We use separate chaining to handle hash collisions (multiple keys that hash to the same index).
Resizing: The hash map automatically resizes when it reaches a load factor threshold (0.7 in this case) to ensure efficient performance.
```python
from typing import Literal


class MyHashMap:
    def __init__(self):
        self.load_factor = 0.7
        self.downsize_factor = 0.25  # Downsize when the load falls below this threshold
        self.bucket = [list() for _ in range(10)]  # Start with 10 empty buckets
        self.current_load = 0  # Number of key-value pairs currently stored
        self.bucket_length = len(self.bucket)  # Current number of buckets

    def put(self, key: int, value: int) -> None:
        # If the current load exceeds the load factor, resize the hash map
        if self.current_load / self.bucket_length > self.load_factor:
            self._resize()
        self._put(key, value)

    def _put(self, key: int, value: int, change_load=True):
        # Find the appropriate bucket for the key using the hash function
        array = self.bucket[self._hash(key)]
        # If the key already exists in the bucket, update its value
        for i, (k, v) in enumerate(array):
            if k == key:
                array[i] = (key, value)
                return
        # Otherwise, append a new key-value pair to the bucket
        array.append((key, value))
        # Count the new element (skipped while rehashing)
        if change_load:
            self.current_load += 1

    def get(self, key: int) -> int:
        # Retrieve the value associated with the given key
        array = self.bucket[self._hash(key)]
        result = self._find_by_key(array, key)
        if result == -1:
            return -1  # Key not found
        return result[1]  # Return the value of the found key-value pair

    def _find_by_key(self, array: list, key) -> tuple | Literal[-1]:
        # Search for a key in the bucket
        for k, v in array:
            if k == key:
                return (k, v)
        return -1  # Key not found

    def remove(self, key: int) -> None:
        # Find the appropriate bucket for the key
        array = self.bucket[self._hash(key)]
        tuple_ = self._find_by_key(array, key)
        if tuple_ != -1:
            # Remove the key-value pair and decrease the current load
            array.remove(tuple_)
            self.current_load -= 1
            # If the load factor is too low, downsize the bucket array
            if self.current_load / self.bucket_length < self.downsize_factor:
                self._downsize()

    def _hash(self, key):
        # Map the key to an index in the bucket array
        return abs(hash(key)) % self.bucket_length

    def _resize(self):
        # Double the bucket array to keep operations fast
        old_bucket = self.bucket
        self.bucket = [[] for _ in range(2 * self.bucket_length)]
        self.bucket_length = len(self.bucket)
        # Rehash all key-value pairs into the new array
        for old_array in old_bucket:
            for k, v in old_array:
                self._put(k, v, change_load=False)

    def _downsize(self):
        # Halve the bucket array to save memory,
        # but never shrink below the initial size (avoids a zero-length array)
        if self.bucket_length <= 10:
            return
        old_bucket = self.bucket
        self.bucket = [[] for _ in range(self.bucket_length // 2)]
        self.bucket_length = len(self.bucket)
        # Rehash all key-value pairs into the new array
        for old_array in old_bucket:
            for k, v in old_array:
                self._put(k, v, change_load=False)
```
Explanation of the Code
Initialization (__init__): The hash map starts with a small bucket array of size 10 and a load factor of 0.7. bucket is a list of empty lists (buckets); each bucket holds key-value pairs as tuples. current_load tracks the number of key-value pairs in the map, and bucket_length holds the current number of buckets.
put(key, value): Adds a key-value pair to the hash map. If the current load exceeds the load factor (the number of elements relative to the number of buckets), the map resizes itself to maintain efficient performance. The key is hashed to determine the appropriate bucket; if the key already exists there, its value is updated, otherwise a new key-value pair is added.
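The update-or-append logic inside a single bucket can be sketched on its own, using a plain list of tuples as in the class above (`put_in_bucket` is a hypothetical helper for illustration):

```python
def put_in_bucket(bucket: list, key, value):
    """Update the value if the key exists, otherwise append a new pair."""
    for i, (k, _) in enumerate(bucket):
        if k == key:
            bucket[i] = (key, value)  # key found: overwrite in place
            return
    bucket.append((key, value))       # key not found: add a new entry

bucket = []
put_in_bucket(bucket, 1, "a")
put_in_bucket(bucket, 2, "b")
put_in_bucket(bucket, 1, "c")  # overwrites (1, "a")
print(bucket)  # [(1, 'c'), (2, 'b')]
```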
get(key): Retrieves the value associated with the given key. If the key is found, its value is returned; otherwise the method returns -1.
remove(key): Removes a key-value pair from the hash map. The appropriate bucket is found using the hash of the key, and the key-value pair is removed if it exists.
_hash(key): The hash function that calculates the index for storing a key. It uses Python's built-in hash() function and takes the result modulo the bucket array's length so the index always fits within the array.
_resize(): The hash map automatically resizes when the number of elements grows too large relative to the number of buckets. Resizing creates a new bucket array of double the size and rehashes all existing key-value pairs into it. _downsize() is the mirror operation: when the load factor drops below 0.25, the bucket array is halved and all pairs are rehashed.
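Why rehash at all? Because the index depends on the modulus, a pair's bucket generally changes when the array grows. A small sketch with integer keys (in CPython, hash(n) == n for small non-negative integers):

```python
old_size, new_size = 10, 20

for key in [3, 13, 27]:
    old_index = abs(hash(key)) % old_size
    new_index = abs(hash(key)) % new_size
    print(f"key {key}: bucket {old_index} -> bucket {new_index}")

# key 13 moves from bucket 3 to bucket 13, so simply copying the old
# buckets over would leave it unreachable -- every pair must be rehashed.
```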
Key Concepts
Hash Function: The hash function is crucial for distributing the keys evenly across the bucket array. A good hash function minimizes collisions (when two different keys map to the same index) and ensures that the hash map performs efficiently.
Collision Handling: In this implementation, we use separate chaining to handle collisions. When two keys hash to the same index, the bucket at that index will hold a list of key-value pairs, allowing for multiple entries at the same index.
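A quick sketch of separate chaining with integer keys: in CPython, hash(n) == n for small non-negative integers, so 5 and 15 collide modulo 10 and end up chained in the same bucket:

```python
bucket_count = 10
buckets = [[] for _ in range(bucket_count)]

for key, value in [(5, "five"), (15, "fifteen")]:
    index = abs(hash(key)) % bucket_count
    buckets[index].append((key, value))

# Both pairs land in bucket 5; a lookup scans that bucket's list.
print(buckets[5])  # [(5, 'five'), (15, 'fifteen')]
```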
Resizing: As the number of elements in the hash map grows, the bucket array may become too small. To ensure efficient operations, the hash map automatically resizes by doubling the bucket size when the load factor exceeds a certain threshold.
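The load-factor arithmetic itself is simple. With the thresholds used above (0.7 to grow, 0.25 to shrink), the decision for a hypothetical 10-bucket map looks like this (`should_grow`/`should_shrink` are illustrative helpers, not part of the class):

```python
def should_grow(load, buckets, threshold=0.7):
    # Grow when the load strictly exceeds the threshold
    return load / buckets > threshold

def should_shrink(load, buckets, threshold=0.25):
    # Shrink when the load falls strictly below the threshold
    return load / buckets < threshold

print(should_grow(7, 10))    # False: 0.7 is not strictly greater than 0.7
print(should_grow(8, 10))    # True: the 8th insert triggers a resize
print(should_shrink(2, 10))  # True: 0.2 < 0.25, time to halve
```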
Time Complexity:
Insert (put): O(1) on average; a resize makes an individual insert O(n), but that cost amortizes over many insertions.
Search (get): O(1) on average, since the hash function provides direct access to the right bucket.
Delete (remove): O(1) on average, though finding the key requires traversing the bucket's list, so it can be O(n) in the worst case.
Conclusion
We’ve built a simple but effective hash map in Python, focusing on core concepts like hashing, collision resolution, and resizing. We also highlighted how these techniques work together to ensure the hash map remains efficient as it scales. There are multiple strategies for handling collisions, such as chaining and open addressing, each with its own advantages depending on the situation.