Hashing and Maps
> Click or hit Control-Enter to run the code above
Q10 Review and Warmup
> Click or hit Control-Enter to run the code above
Let’s Imagine…
Imagine I told you that there was a function with the following properties:
-
Determinism: it can convert an arbitrary amount of data into a single limited-size value. If we repeat the computation on the same data, we get the same value.
-
Uniformity: over many inputs, each output value is equally likely.
-
Efficiency: it is efficient to compute.
> Click or hit Control-Enter to run the code above
Hash Functions
A hash function is any function that can be used to map data of arbitrary size to data of fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes.
What Could We Do With Such A Function?
It may not seem obvious at first, but hash functions have many, many uses.
Example: Download Verification
Imagine the following scenario:
-
You need to download a 120GB file to install a particular piece of software.
-
It’s possible that, along the way, some data gets corrupted—either by the network or by your disk, who knows.
-
So before you install the software you want to be sure that you downloaded the file correctly.
Without A Hash Function
Without a hash function, what do we have to do?
-
Download the 120GB file.
-
Download it again. (Slow.)
-
Compare the two to make sure that they are the same. (Also slow.)
But…
Remember, I have a function with the following properties:
-
Determinism: it can convert an arbitrary amount of data into a single limited-size value. If we repeat the computation on the same data, we get the same value.
-
Uniformity: over many inputs, each output value is equally likely.
-
Efficiency: it is efficient to compute.
With A Hash Function
With a hash function, what do we do?
-
You compute the hash of your copy of the file.
-
Download a hash of the file: maybe only a few bytes.
-
Compute the hash of the file locally and make sure that it matches.
Example Download With md5sum
md5
is a popular
hash function
that produces a 128-bit value.
We’re expecting an md5
hash value of a04828f1ffacc0bf9b48879b57794c2b
:
$ md5 mactex-20170524.pkg
MD5 (mactex-20170524.pkg) = a04828f1ffacc0bf9b48879b57794c2b
$
Example: Fingerprinting Content
Imagine the following scenario.
-
You sent me
foo.docx
at some point. -
(I deleted it because it was a
.docx
file, so in reality scenario over.) -
But let’s pretend that you can’t remember if you sent me the latest version.
Without a Hash Function
Without a hash function, what do we do?
-
You send me the file again.
-
(And I delete it again.)
But…
Remember, I have a function with the following properties:
-
Determinism: it can convert an arbitrary amount of data into a single limited-size value. If we repeat the computation on the same data, we get the same value.
-
Uniformity: over many inputs, each output value is equally likely.
-
Efficiency: it is efficient to compute.
With a Hash Function
With a hash function, what do we to do?
-
You compute the hash of your file.
-
I compute the hash of my file.
-
If they are the same, we’re done.
-
Otherwise you send me your copy.
Example Content Hash with git
git
uses hashes (the
SHA-1 algorithm)
to fingerprint files and commits:
Example git push
More or less, here’s what happens when you push to GitHub.com:
-
Your computer says: "Hi GitHub.com, I have the following files:
a6efc501d57b88df337fe904483d25732bb3e45e
,4e292499a1194d0493bd5350408fe3254d2335d3
,20da0fbbf8e8c279bb1edbbe0ac5ae40349edceb
, …" -
Server, "OK, I’ve got
4e292499a1194d0493bd5350408fe3254d2335d3
anda6efc501d57b88df337fe904483d25732bb3e45e
but I need20da0fbbf8e8c279bb1edbbe0ac5ae40349edceb
and …". -
Your computer: "OK, sending those now…"
Hash Collisions
If a hash function produces the same hash for two different inputs this is called a collision.
-
In some cases, particularly if the size of the hash is small, collisions are expected and we plan to deal with them.
-
If the size of the hash is large enough and the hash function is uniform, collisions should never happen and the world will end if they do. (Or at least
git
will stop working and my world will end.)
> Click or hit Control-Enter to run the code above
The Birthday Paradox
In a room with 100 students, what is the probability that two will share the same birthday 1? 99.9999%
-
Does anybody know how many you need to get a 50% chance? Only 23!
-
This is bad for our hash functions… collisions are more likely than we might think!
Birthday Hashing Paradox
How many documents do I have to hash before I find two with the same hash with 50% probability?
-
It depends on how large the hash is!
-
For 16 bits, 300. (The MP6 starter code had 80 files in it.)
-
For 32 bits, 77,000 (My computer has 2.5 million files on it.)
-
For 64 bits, 5 billion (GitHub.com has 1 billion files.)
-
For 128 bits, 14,000,000,000,000,000,000. (Now we’re getting warmer.)
-
(Git actually uses a 160-bit hash function.)
-
For 512 bits, 1.4 * 10^77 (The universe only has ~10^78 atoms, so this is probably enough.)
So Hashes Seem Useful…
But the best is yet to come!
Remember Arrays?
int[] numbers = new int[] { 5, 6, 7 };
System.out.println(numbers[0]);
numbers[1] = 8;
-
Arrays map an index (0, 1, 2,
array.length
to a value). -
The value can be anything, but the indices had to be be integers.
-
No longer!
Java Maps
A Java Map
allows us to use any object like an array index.
import java.util.Map;
import java.util.HashMap;
Map stringValues = new HashMap();
stringLengths.put("test", 5);
System.out.println(stringLengths.get("test")); // Prints 5
stringLengths.put("test", 7);
System.out.println(stringLengths.get("test")); // Prints 7
A Map By Any Other Name
Maps are one of the two data structures you meet in heaven. (Along with lists.) Every language has them:
-
Python calls them dictionaries:
dict["key"] = "value"
-
JavaScript calls them anonymous objects
dict["key"] = "value"
-
C++ calls them maps:
dict.insert(std::make_pair("key", "value"));
-
Go calls them maps:
dict["key"] = "value"
-
Even Perl had them:
$dict{'key'} = "value"
! -
Sometimes we call them key-value stores, since each key maps to a single value
> Click or hit Control-Enter to run the code above
Map Usage Example
Let’s say that I want to process a large corpus of text and then be able to quickly answer queries about how many times particular words appear.
> Click or hit Control-Enter to run Example.main above
Announcements
-
I’m giving a talk on my research in mobile systems, tomorrow (Thursday) at 10AM in Siebel 2405. Feel free to attend.
-
New exam practice problems are available.
-
MP7 (the final project) is out. Please get started!
-
The anonymous feedback form remains available on the course website. Use it to give us feedback!
-
My office hours continue today at 11AM in the lounge outside of Siebel 0226.