Hashing and Maps
> Click or hit Control-Enter to run Example.main above
Hash Functions
Imagine I told you that there was a function with the following properties:
-
Determinism: it can convert an arbitrary amount of data into a single limited-size value. If we repeat the computation on the same data, we get the same value.
-
Uniformity: over many inputs, each output value is equally likely.
-
Efficiency: it is efficient to compute.
Hash Functions
A hash function is any function that can be used to map data of arbitrary size to data of fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes.
What Could We Do With Such A Function?
It may not seem obvious at first, but hash functions have many, many uses.
Example: Download Verification
Imagine the following scenario:
-
You need to download a 120GB file to install a particular piece of software.
-
It’s possible that, along the way, some data gets corrupted—either by the network or by your disk, who knows.
-
So before you install the software you want to be sure that you downloaded the file correctly.
Without A Hash Function
Without a hash function, what do we have to do?
-
Download the 120GB file.
-
Download it again. (Slow.)
-
Compare the two to make sure that they are the same. (Also slow.)
But…
Remember, I have a function with the following properties:
-
Determinism: it can convert an arbitrary amount of data into a single limited-size value. If we repeat the computation on the same data, we get the same value.
-
Uniformity: over many inputs, each output value is equally likely.
-
Efficiency: it is efficient to compute.
With A Hash Function
With a hash function, what do we do?
-
You compute the hash of your copy of the file.
-
Download a hash of the file: maybe only a few bytes.
-
Compute the hash of the file locally and make sure that it matches.
Example Download With md5sum
md5
is a popular
hash function
that produces a 128-bit value.
We’re expecting an md5
hash value of d95bacb4ccd59657a5ac2bf66b35ebcc
:
$ md5 mactex-20170524.pkg
MD5 (mactex-20170524.pkg) = d95bacb4ccd59657a5ac2bf66b35ebcc
$
Example: Fingerprinting Content
Imagine the following scenario.
-
You sent me
foo.docx
at some point. -
(I deleted it because it was a
.docx
file, so in reality scenario over.) -
But let’s pretend that you can’t remember if you sent me the latest version.
Without a Hash Function
Without a hash function, what do we do?
-
You send me the file again.
-
(And I delete it again.)
But…
Remember, I have a function with the following properties:
-
Determinism: it can convert an arbitrary amount of data into a single limited-size value. If we repeat the computation on the same data, we get the same value.
-
Uniformity: over many inputs, each output value is equally likely.
-
Efficiency: it is efficient to compute.
With a Hash Function
With a hash function, what do we to do?
-
You compute the hash of your file.
-
I compute the hash of my file.
-
If they are the same, we’re done.
-
Otherwise you send me your copy.
Example Content Hash with git
git
uses hashes (the
SHA-1 algorithm)
to fingerprint files and commits:
Example git push
More or less, here’s what happens when you push to GitHub.com:
-
Your computer says: "Hi GitHub.com, I have the following files:
a6efc501d57b88df337fe904483d25732bb3e45e
,4e292499a1194d0493bd5350408fe3254d2335d3
,20da0fbbf8e8c279bb1edbbe0ac5ae40349edceb
, …" -
Server, "OK, I’ve got
4e292499a1194d0493bd5350408fe3254d2335d3
anda6efc501d57b88df337fe904483d25732bb3e45e
but I need20da0fbbf8e8c279bb1edbbe0ac5ae40349edceb
and …". -
Your computer: "OK, sending those now…"
Hash Collisions
If a hash function produces the same hash for two different inputs this is called a collision.
-
In some cases, particularly if the size of the hash is small, collisions are expected and we plan to deal with them.
-
If the size of the hash is large enough and the hash function is uniform, collisions should never happen and the world will end if they do. (Or at least
git
will stop working and my world will end.)
The Birthday Paradox
In a room with 100 students, what is the probability that two will share the same birthday 1? 99.9999%
-
Does anybody know how many you need to get a 50% chance? Only 23!
-
This is bad for our hash functions… collisions are more likely than we might think!
Birthday Hashing Paradox
How many documents do I have to hash before I find two with the same hash with 50% probability?
-
It depends on how large the hash is!
-
For 16 bits, 300. (The MP6 starter code had 80 files in it.)
-
For 32 bits, 77,000 (My computer has 2.5 million files on it.)
-
For 64 bits, 5 billion (GitHub.com has 1 billion files.)
-
For 128 bits, 14,000,000,000,000,000,000. (Now we’re getting warmer.)
-
(Git actually uses a 160-bit hash function.)
-
For 512 bits, 1.4 * 10^77 (The universe only has ~10^78 atoms, so this is probably enough.)
So Hashes Seem Useful…
But the best is yet to come!
Remember Arrays?
int[] numbers = new int[] {5, 6, 7};
System.out.println(numbers[0]);
numbers[1] = 8;
-
Arrays map an index (0, 1, 2,
array.length
to a value). -
The value can be anything, but the indices had to be be integers.
-
No longer!
Java Maps
A Java Map
allows us to use any object like an array index.
import java.util.Map;
import java.util.HashMap;
Map<String, Integer> stringValues = new HashMap<>();
stringLengths.put("test", 5);
System.out.println(stringLengths.get("test")); // Prints 5
stringLengths.put("test", 7);
System.out.println(stringLengths.get("test")); // Prints 7
A Map By Any Other Name
Maps are one of the two data structures you meet in heaven. (Along with lists.) Every language has them:
-
Python calls them dictionaries:
dict["key"] = "value"
-
JavaScript calls them anonymous objects
dict["key"] = "value"
-
C++ calls them maps:
dict.insert(std::make_pair("key", "value"));
-
Go calls them maps:
dict["key"] = "value"
-
Even Perl had them:
$dict{'key'} = "value"
! -
Sometimes we call them key-value stores, since each key maps to a single value
> Click or hit Control-Enter to run Example.main above
Map Usage Example
Let’s say that I want to process a large corpus of text and then be able to quickly answer queries about how many times particular words appear.
> Click or hit Control-Enter to run Example.main above
Brief Map
Implementation
So how do we implement a Map
?
-
Use a
hashCode
to retrieve a hash code for each object. -
Use that value—or a smaller part of it—as an index into an array.
-
But what about collisions?
Map
As Array + Linked List
> Click or hit Control-Enter to run Example.main above
HashMap
Performance
Let’s consider the performance of our simple HashMap
in two cases. First, if
the array is very small relative to the number of items:
-
put
: O(n) with n being the number of items -
get
: O(n) with n being the number of items -
At this point the
HashMap
is acting like a linked list.
HashMap
Performance
Let’s consider the performance of our simple HashMap
in two cases. Second, if
the array is very large relative to the number of items:
-
put
: O(1) -
get
: O(1) -
At this point the
HashMap
is acting like an array. -
What’s the problem? It requires a lot of space.
Realistic HashMap
Performance
In reality we want our HashMap
to blend the good features of an array and a
linked list.
-
Usually implementations will enlarge the array part of a
HashMap
once it gets filled past a certain point (called the load factor).
Looking forward to CS 225 yet?
This is cool stuff!
Announcements
-
Material from today’s class is covered in the quiz that starts today. Good luck!
-
The first final project checkpoint is tomorrow in lab. And the project fair is just around the corner!