Today’s Musical Selections
"If I don’t get out, out of this town—I just might be the one to burn it down." I’m sure maybe a few of us are feeling that way right about now, except substitute "house" for "town".
I’m not surprised that I like the new song by The Killers. I was, however, surprised by how good they live from their bathroom on Jimmy Kimmel.
Hashing and Maps
> Click or hit Control-Enter to run Example.main above
Hash Functions
Imagine I told you that there was a function with the following properties:
-
Determinism: it can convert an arbitrary amount of data into a single limited-size value. If we repeat the computation on the same data, we get the same value.
-
Uniformity: over many inputs, each output value is equally likely.
-
Efficiency: it is efficient to compute.
Hash Functions
A hash function is any function that can be used to map data of arbitrary size to data of fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes.
What Could We Do With Such A Function?
It may not seem obvious at first, but hash functions have many, many uses.
Example: Download Verification
Imagine the following scenario:
-
You need to download a 120GB file to install a particular piece of software.
-
It’s possible that, along the way, some data gets corrupted—either by the network or by your disk, who knows.
-
So before you install the software you want to be sure that you downloaded the file correctly.
Without A Hash Function
Without a hash function, what do we have to do?
-
Download the 120GB file.
-
Download it again. (Slow.)
-
Compare the two to make sure that they are the same. (Also slow.)
But…
Remember, I have a function with the following properties:
-
Determinism: it can convert an arbitrary amount of data into a single limited-size value. If we repeat the computation on the same data, we get the same value.
-
Uniformity: over many inputs, each output value is equally likely.
-
Efficiency: it is efficient to compute.
With A Hash Function
With a hash function, what do we do?
-
You compute the hash of your copy of the file.
-
Download a hash of the file: maybe only a few bytes.
-
Compute the hash of the file locally and make sure that it matches.
Example Download With md5sum
md5 is a popular
hash function
that produces a 128-bit value.
We’re expecting an md5 hash value of d95bacb4ccd59657a5ac2bf66b35ebcc:
$ md5 mactex-20170524.pkg
MD5 (mactex-20170524.pkg) = d95bacb4ccd59657a5ac2bf66b35ebcc
$
Example: Fingerprinting Content
Imagine the following scenario.
-
You sent me
foo.docxat some point. -
(I deleted it because it was a
.docxfile, so in reality scenario over.) -
But let’s pretend that you can’t remember if you sent me the latest version.
Without a Hash Function
Without a hash function, what do we do?
-
You send me the file again.
-
(And I delete it again.)
But…
Remember, I have a function with the following properties:
-
Determinism: it can convert an arbitrary amount of data into a single limited-size value. If we repeat the computation on the same data, we get the same value.
-
Uniformity: over many inputs, each output value is equally likely.
-
Efficiency: it is efficient to compute.
With a Hash Function
With a hash function, what do we to do?
-
You compute the hash of your file.
-
I compute the hash of my file.
-
If they are the same, we’re done.
-
Otherwise you send me your copy.
Example Content Hash with git
git uses hashes (the
SHA-1 algorithm)
to fingerprint files and commits:
Example git push
More or less, here’s what happens when you push to GitHub.com:
-
Your computer says: "Hi GitHub.com, I have the following files:
a6efc501d57b88df337fe904483d25732bb3e45e,4e292499a1194d0493bd5350408fe3254d2335d3,20da0fbbf8e8c279bb1edbbe0ac5ae40349edceb, …" -
Server, "OK, I’ve got
4e292499a1194d0493bd5350408fe3254d2335d3anda6efc501d57b88df337fe904483d25732bb3e45ebut I need20da0fbbf8e8c279bb1edbbe0ac5ae40349edceband …". -
Your computer: "OK, sending those now…"
Hash Collisions
If a hash function produces the same hash for two different inputs this is called a collision.
-
In some cases, particularly if the size of the hash is small, collisions are expected and we plan to deal with them.
-
If the size of the hash is large enough and the hash function is uniform, collisions should never happen and the world will end if they do. (Or at least
gitwill stop working and my world will end.)
The Birthday Paradox
In a room with 100 students, what is the probability that two will share the same birthday 1? 99.9999%
-
Does anybody know how many you need to get a 50% chance? Only 23!
-
This is bad for our hash functions… collisions are more likely than we might think!
Birthday Hashing Paradox
How many documents do I have to hash before I find two with the same hash with 50% probability?
-
It depends on how large the hash is!
-
For 16 bits, 300. (The MP6 starter code had 80 files in it.)
-
For 32 bits, 77,000 (My computer has 2.5 million files on it.)
-
For 64 bits, 5 billion (GitHub.com has 1 billion files.)
-
For 128 bits, 14,000,000,000,000,000,000. (Now we’re getting warmer.)
-
(Git actually uses a 160-bit hash function.)
-
For 512 bits, 1.4 * 10^77 (The universe only has ~10^78 atoms, so this is probably enough.)
So Hashes Seem Useful…
But the best is yet to come!
Remember Arrays?
int[] numbers = new int[] {5, 6, 7};
System.out.println(numbers[0]);
numbers[1] = 8;
-
Arrays map an index (0, 1, 2,
array.lengthto a value). -
The value can be anything, but the indices had to be be integers.
-
No longer!
Java Maps
A Java Map allows us to use any object like an array index.
import java.util.Map;
import java.util.HashMap;
Map<String, Integer> stringValues = new HashMap<>();
stringLengths.put("test", 5);
System.out.println(stringLengths.get("test")); // Prints 5
stringLengths.put("test", 7);
System.out.println(stringLengths.get("test")); // Prints 7
A Map By Any Other Name
Maps are one of the two data structures you meet in heaven. (Along with lists.) Every language has them:
-
Python calls them dictionaries:
dict["key"] = "value" -
JavaScript calls them anonymous objects
dict["key"] = "value" -
C++ calls them maps:
dict.insert(std::make_pair("key", "value")); -
Go calls them maps:
dict["key"] = "value" -
Even Perl had them:
$dict{'key'} = "value"! -
Sometimes we call them key-value stores, since each key maps to a single value
> Click or hit Control-Enter to run Example.main above
Map Usage Example
Let’s say that I want to process a large corpus of text and then be able to quickly answer queries about how many times particular words appear.
> Click or hit Control-Enter to run Example.main above
Brief Map Implementation
So how do we implement a Map?
-
Use a
hashCodeto retrieve a hash code for each object. -
Use that value—or a smaller part of it—as an index into an array.
-
But what about collisions?
Map As Array + Linked List
> Click or hit Control-Enter to run Example.main above
HashMap Performance
Let’s consider the performance of our simple HashMap in two cases. First, if
the array is very small relative to the number of items:
-
put: O(n) with n being the number of items -
get: O(n) with n being the number of items -
At this point the
HashMapis acting like a linked list.
HashMap Performance
Let’s consider the performance of our simple HashMap in two cases. Second, if
the array is very large relative to the number of items:
-
put: O(1) -
get: O(1) -
At this point the
HashMapis acting like an array. -
What’s the problem? It requires a lot of space.
Realistic HashMap Performance
In reality we want our HashMap to blend the good features of an array and a
linked list.
-
Usually implementations will enlarge the array part of a
HashMaponce it gets filled past a certain point (called the load factor).
Looking forward to CS 225 yet?
This is cool stuff!
Announcements
-
Material from today’s class is covered on Wednesday’s quiz.
-
The first final project checkpoint is tomorrow in lab.