Hash functions are miraculous, helpful tools we have no right to expect to exist. They seem far too convenient! They have many uses, including determining if two files are the same, creating unique (shorter) names, and encryption and digital security

A **hash** function \(H\) maps data of arbitrary size to a fixed size such that

- \(H(x)\) is an easy to compute, deterministic function
- If \(x\not= y\) then \(H(x)\not=H(y)\) with high probability
- \(H(x)\) appears random over its range as \(x\) varies

**Examples.** The “IT” hash function: first five letters of last name + first letter first name, which has the J. Smith problem. A person’s phone, zip, or social security number act similarly to hash functions.

In practice, we want the *high probability* in (2) to mean a collision probability \(\approx 10^{-40}\), not one in one hundred^{1}.

A hash functions is called **cryptographic** if it also satisfies

- Given \(y\) it is very hard to find \(x\) with \(H(x)=y\).

SHA256 is a common cryptographic hash function. It is available in Python.

```
import hashlib
'The quick brown fox jumps over the lazy dog'.encode('utf-8')).digest().hex()
hashlib.sha256(# 'd7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592'
b'The quick brown fox jumps over the lazy dog.').hexdigest()
hashlib.sha256(# 'ef537f25c895bfa782526529a9b63d97aa631564d5d789c2b765448c8635fb6c'
```

Adding a period to the end of the string completely changes the hash, illustrating (3).

The output of hash can be interpreted as an integer. For SHA256, it is a **very** large integer, between 0 and \(2^{256}\approx 10^{77}\). Note that you must carefully specify input and output formats, especially when dealing the Unicode strings. Since the hash is so large, the probability of a hash collision is extremely low. How low?

The Birthday Problem asks how many people you need so there is at least a 50 percent chance two share a birthday. The answer is only 23 people. Hash collisions are analogous to the Birthday Problem: two documents (people) with the same hash (birthday).

Approximately \(\sqrt{2Np}\) documents are needed before there is a \(p\) probability of collision given a hash space size of \(N\)^{2}. For SHA256, \(N=2^{256}=10^{77}\) and so a \(10^{-3}\) collision probability requires about \(1.5\times 10^{37}\) documents. That’s enough for every person on earth to compute 1 billion hashes per second for five times the age of the universe!

The (in)famous Bitcoin mining process involves finding a *nonce* (number used once) to append to a coded string so that the result has a small hash, i.e., one with several leading zeros. The following Python code illustrates.

```
import hashlib
= 2**256
running_min = []
ans = b'The quick brown fox jumps over the lazy dog'
base
for nonce in range(2000000000): # 2 billion
= hashlib.sha256(nonce.to_bytes(4, byteorder='big') + base).hexdigest()
h = int(h, 16)
n if n < running_min:
= n
running_min
ans.append([nonce, n])print(f'{nonce:12,d} {n:077}')
# Nonce Hash
# 0 29115579230639891023898657467946481563928575965694753738500728003067276450760
# 3 21833633896494697913657452817095065461049276120751755746016193921330837964982
# 8 17391853960576662285627567225372501697536440120814058733709287576654299269058
# 9 00 491741673371171570027367996335736784622791320015893772572199978008540614786
# 817 00 207113148484537618144604663416437589440289273319027116671254033065643419132
# 827 000 35029650895291714754047120679492927968250654901817817434081241936987361735
# 3,292 000 30590294895123458493702891527069975442971551875566805022772671084264919745
# 6,362 000 23157006908555232018903879877754051315219896322661305099606253143774488785
# 7,634 000 11843095073522994422561274720857316931066719486382550615573171404879921966
# 22,034 0000 6045160764465103256154815045992679930360222615550766779824452388654984639
# 32,737 0000 3218718010716516807246023638919032202673987969434384430166215105132280583
# 43,078 0000 3066940367111277087798394765784480513227788830972580117541505418890948712
# 50,740 0000 0 344804005194498392473362848134761831134304453202875173759130216105619080
# 260,109 0000 0 149043122808237032345561872905133216060467384369910593113997965062602336
# 610,827 0000 00 25441204939268765420155917698735840343496809686969451042687651132777655
# 3,553,698 0000 00 12372585984995238023081534031026808791454761919139475665549030259593011
# 16,603,005 0000 000 4682308792444739613119316155033986067282587356863979013510780284611482
# 45,767,445 0000 000 4295135810439807939037487563409966578108755229939605598485594694500274
# 56,389,936 0000 000 1219890553970511010693160459086914039690075265862677724048817741406404
# 186,599,009 0000 0000 741733398915175814111679160159562329641666849535152212310255158283708
# 187,060,155 0000 0000 129027976973068678554136418237268320708790839626316760173444080235551
# 209,437,773 0000 0000 0 46418792192972977622708878642780226280538977482131916077098153688658
# 554,751,705 0000 0000 0 38492057003517052607600918969310106371482316138230835578404460555913
# 1,724,412,865 0000 0000 0 20951411954830677538112338658105096359813168232452740277675602777590
```

My machine took about 6 seconds to find six leading zeros.

Currently, the Bitcoin network hash rate^{3} is \(2\times 10^{20}\) (in November 2018 it was \(4\times 10^{19}\)) hashes per second. It is estimated that miner’s electricity consumption approximates that of Austria or Finland. How small are the hashes being found? Recent block hashes start:

```
Nov 2018 0x 0000 0000 0000 0000 0051 a841 86ab c5df ........
Feb 2022 0x 0000 0000 0000 0000 0003 b82d d7af bfb4 ........
```

Our simple code found nine leading zeros. Currently, Bitcoin finds 19. Remember, the difficulty increases exponentially. The problem is analogous to simulating random numbers in a spreadsheet and hitting `F9`

until one starting 0.0xxx appears. You’ll wait ten times as long for 0.0**0**xxx, etc.

The logarithm is the inverse to the exponential function: if \(a=b^c\) then \(c\) is the logarithm of \(a\) to base \(b\). With regular numbers, the exponential function \(x\mapsto x^c\) is a nice smooth function and it is easy to compute its inverse using numerical methods. The picture changes using modular arithmetic (remainders).

**Setup.** Let \(p\) be a large prime number with a large prime \(q\) dividing \(p-1\)

- E.g. \(p=23\), \(p-1=22\) has factor \(q=11\)
- Non-e.g. if \(p=29\) then \(p-1=28\) has many factors
- Condition on \(p\) is needed to ensure the discrete logarithm problem is hard
- Find \(g<q\) with \(g, g^2, \dots,g^{q-1}, g^q\equiv 1\) all distinct mod \(p\)
- E.g. \(g=3\) and work
**mod 23**=**remainder**after dividing by 23 - \((3) \rightarrow (9 = 9) \rightarrow (27 = 4) \rightarrow (12 = 12) \rightarrow (36 = 13) \rightarrow (39 = 16) \rightarrow (48 = 2) \rightarrow (6 = 6) \rightarrow (18 = 18) \rightarrow (54 = 8) \rightarrow (24 = 1)\)

- E.g. \(g=3\) and work

The **discrete logarithm problem** asks

given \(n\), find \(a\) such that \(g^a\equiv n\pmod{p}\).

Remember, \(\log_g(n)=a\) means \(g^a=g^{\log_g(n)}=n\), hence name logarithm. Unlike the case for regular numbers, the remainders of powers of \(g\) bounce around randomly (see figures). We can’t use a Newton-Raphson method to home-in on the answer. And in practice you work with **really big** \(p\) and \(q\). Given \(g\), \(a\), and \(p\), computing \(A=g^a\mod p\) is easy. But recovering \(a\) from \(A\) is hard. Exponentiation is an example of a **one-way function.** These ideas are illustrated in the next figure.

Many techniques in cryptography and digital signatures are based on the ingenious use of basic modular arithmetic. Internet security relies on addition, multiplication and division, and the difficulty of the discrete logarithm problem. The next three sections explain how two actors can create a **shared secret**, send a message using **Public Key Encryption**, and digitally sign a document.

Digital signatures are improve over traditional signatures in three important ways.

- Signer
**authentication**: the verifier is assured that signature has been created only by sender who possess the corresponding secret private key, \(a\). - Message
**integrity**: if the message is modified then the signature fails, making the signature tamper evident. **Non-repudiation**: the existence of a signature proves it came from the sender. The sender cannot later repudiate signing.

In contrast, a wet-ink signature can be forged, or the document can be altered, or the signature can be denied.

See Birthday Attack.↩︎

E.g. for birthday problem \(p=1/2\), \(N=365\) and \(\sqrt{2Np}=19\). Approximation relies on \(p\approx -\log(1-p)\), only true for smaller \(p\). Using \((-2N\log(1-p))^{1/2}=22.49\) is very close to correct answer, 23.↩︎

Number of hashes computed per second by all miners.↩︎

posted 2022-02-08 | tags: hash, encryption, digital signature