Hashing and Maps

Hashing and Maps
> Click or hit Control-Enter to run Example.main above

Imagine I told you that there was a function with the following properties:

Determinism: it can convert an arbitrary amount of data into a single limited-size value. If we repeat the computation on the same data, we get the same value.
Uniformity: over many inputs, each output value is equally likely.
Efficiency: it is efficient to compute.

A hash function is any function that can be used to map data of arbitrary size to data of fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes.

If a hash function produces the same hash for two different inputs this is called a collision.

In some cases, particularly if the size of the hash is small, collisions are expected and we plan to deal with them.
If the size of the hash is large enough and the hash function is uniform, collisions should never happen and the world will end if they do. (Or at least git will stop working and my world will end.)

In a room with 100 students, what is the probability that two will share the same birthday 1? 99.9999%

Does anybody know how many you need to get a 50% chance? Only 23!
This is bad for our hash functions… collisions are more likely than we might think!

How many documents do I have to hash before I find two with the same hash with 50% probability?

It depends on how large the hash is!
For 16 bits, 300. (The MP6 starter code had 80 files in it.)
For 32 bits, 77,000 (My computer has 2.5 million files on it.)
For 64 bits, 5 billion (GitHub.com has 1 billion files.)
For 128 bits, 14,000,000,000,000,000,000. (Now we’re getting warmer.)
(Git actually uses a 160-bit hash function.)
For 512 bits, 1.4 * 10^77 (The universe only has ~10^78 atoms, so this is probably enough.)

But the best is yet to come!

int[] numbers = new int[] { 5, 6, 7 };
System.out.println(numbers[0]);
numbers[1] = 8;

Arrays map an index (0, 1, 2, array.length to a value).
The value can be anything, but the indices had to be be integers.
No longer!

A Java Map allows us to use any object like an array index.

import java.util.Map;
import java.util.HashMap;

Map<String, Integer> stringValues = new HashMap<>();
stringLengths.put("test", 5);
System.out.println(stringLengths.get("test")); // Prints 5
stringLengths.put("test", 7);
System.out.println(stringLengths.get("test")); // Prints 7

Maps are one of the two data structures you meet in heaven. (Along with lists.) Every language has them:

Python calls them dictionaries: dict["key"] = "value"
JavaScript calls them anonymous objects dict["key"] = "value"
C++ calls them maps: dict.insert(std::make_pair("key", "value"));
Go calls them maps: dict["key"] = "value"
Even Perl had them: $dict{'key'} = "value"!
Sometimes we call them key-value stores, since each key maps to a single value

> Click or hit Control-Enter to run Example.main above

Let’s say that I want to process a large corpus of text and then be able to quickly answer queries about how many times particular words appear.

> Click or hit Control-Enter to run Example.main above

So how do we implement a Map?

Use a hashCode to retrieve a hash code for each object.
Use that value—or a smaller part of it—as an index into an array.
But what about collisions?

$hash table$

public class HashMap {
  private static final int TABLE_SIZE = 16;
  class Item {
    public Object key;
    public Object value;
    public Item next;
    Item(Object setKey, Object setValue, Item setNext) {
      key = setKey;
      value = setValue;
      next = setNext;
    }
  }
  private Item[] items = new Item[TABLE_SIZE];
  public int itemCount = 0;
  private int hash(Object key) {
    int hashValue = key.hashCode() % TABLE_SIZE;
    if (hashValue < 0) {
      hashValue += TABLE_SIZE;
    }
    return hashValue;
  }
  public void put(Object key, Object value) {
    int bucket = hash(key);
    Item current = items[bucket];
    for (; current != null; current = current.next) {
      if (current.key.equals(key)) {
        current.value = value;
        return;
      }
    }
    Item newItem = new Item(key, value, items[bucket]);
    items[bucket] = newItem;
    itemCount++;
    return;
  }
}
public class Example {
  public static void main(String[] unused) {
    HashMap ourHashMap = new HashMap();
    ourHashMap.put("test", "me");
    System.out.println(ourHashMap.itemCount);
    ourHashMap.put("test", "another");
    System.out.println(ourHashMap.itemCount);
  }
}

> Click or hit Control-Enter to run Example.main above

Let’s consider the performance of our simple HashMap in two cases. First, if the array is very small relative to the number of items:

put: O(n) with n being the number of items
get: O(n) with n being the number of items
At this point the HashMap is acting like a linked list.

Let’s consider the performance of our simple HashMap in two cases. Second, if the array is very large relative to the number of items:

put: O(1)
get: O(1)
At this point the HashMap is acting like an array.
What’s the problem? It requires a lot of space.

In reality we want our HashMap to blend the good features of an array and a linked list.

Usually implementations will enlarge the array part of a HashMap once it gets filled past a certain point (called the load factor).

This is cool stuff!

Hash functions already provide:

Determinism: it can convert an arbitrary amount of data into a single limited-size value. If we repeat the computation on the same data, we get the same value.
Uniformity: over many inputs, each output value is equally likely.
Efficiency: it is efficient to compute.

But what if there were hash functions with the following new properties:

Given the hash, it is infeasible to determine the original input
A small change to the input produces a large change in the output
The function is difficult to compute, not easy

A hash function that satisfies these properties is known as a cryptographic hash function, largely because they are ubiquitous in modern cryptography.

I need to be able to check your password, but I don’t want to save it. Is that possible?

Yes!
Save the cryptographic hash of your password, not the password itself.
When you submit your password, I hash it and compare it with the saved hash.
If someone steals my database, they can’t recover the original passwords.
- Given the hashes, you can’t recover the original passwords
- Hash values reveal nothing about how close you are to the actual password
- Hashing inputs to test them is expensive

$ $ cat example.txt
The quick brown fox jumps over the lazy dog
$ md5 < example.txt
37c4b87edffc5d198ff5a185cee7ee09
$ cat example.txt
The quick brown fox jumps over the lazy doG
$ md5 < example.txt
75559fc9857fe9bebf65f97760e3f67d

The third and final midterm starts on Saturday.

The focus is data structures and algorithms, but completing the programming questions will require both imperative and object-oriented programming
There are four programming problems together worth 60 points. Most have partial credit.
Our final midterm represents your more sophisticated set of programming tasks yet.
(My goal is to irritate you before you complete the course evaluations…)

As always, the best way to review for the midterm is to review the practice homework problems.

Midterm 2 consists of four programming tasks:

One question on lists that is very similar to a homework problem
One question on trees that is very similar to a homework problem
One question on sorting that is directly drawn from the homework
A final question that I’m not going to say more about

Your first final project checkpoint is in lab this week.
I have office hours MWF from 10AM–12PM in Siebel 2227. Please stop by!
Remember to provide feedback on the course using the anonymous feedback form.
I’ve started to respond to existing feedback on the forum.

Review: Hash Functions

Review: Hash Functions

Hash Collisions

The Birthday Paradox

Birthday Hashing Paradox

So Hashes Seem Useful…

Remember Arrays?

Java Maps

A Map By Any Other Name

Map Usage Example

Brief `Map` Implementation

`Map` As Array + Linked List

`HashMap` Performance

`HashMap` Performance

Realistic `HashMap` Performance

Looking forward to CS 225 yet?

And cooler still…

Cryptographic Hash Functions

A Simple Example

Small Input Change, Big Output Change

Midterm 2

Midterm 2 Problems

Questions About Midterm 2?

Announcements

Review: Hash Functions

Review: Hash Functions

Hash Collisions

The Birthday Paradox

Birthday Hashing Paradox

So Hashes Seem Useful…​

Remember Arrays?

Java Maps

A Map By Any Other Name

Map Usage Example

Brief Map Implementation

Map As Array + Linked List

HashMap Performance

HashMap Performance

Realistic HashMap Performance

Looking forward to CS 225 yet?

And cooler still…​

Cryptographic Hash Functions

A Simple Example

Small Input Change, Big Output Change

Midterm 2

Midterm 2 Problems

Questions About Midterm 2?

Announcements

So Hashes Seem Useful…

Brief `Map` Implementation

`Map` As Array + Linked List

`HashMap` Performance

`HashMap` Performance

Realistic `HashMap` Performance

And cooler still…