Hash Magic: Encryption and Digital Signatures

Hash functions are miraculous, helpful tools we have no right to expect to exist. They seem far too convenient! They have many uses, including determining if two files are the same, creating unique (shorter) names, and encryption and digital security

Hash functions

A hash function \(H\) maps data of arbitrary size to a fixed size such that

  1. \(H(x)\) is an easy to compute, deterministic function
  2. If \(x\not= y\) then \(H(x)\not=H(y)\) with high probability
  3. \(H(x)\) appears random over its range as \(x\) varies

Examples. The “IT” hash function: first five letters of last name + first letter first name, which has the J. Smith problem. A person’s phone, zip, or social security number act similarly to hash functions.

In practice, we want the high probability in (2) to mean a collision probability \(\approx 10^{-40}\), not one in one hundred1.

A hash functions is called cryptographic if it also satisfies

  1. Given \(y\) it is very hard to find \(x\) with \(H(x)=y\).

SHA256 is a common cryptographic hash function. It is available in Python.

import hashlib
hashlib.sha256('The quick brown fox jumps over the lazy dog'.encode('utf-8')).digest().hex()
# 'd7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592'

hashlib.sha256(b'The quick brown fox jumps over the lazy dog.').hexdigest()
# 'ef537f25c895bfa782526529a9b63d97aa631564d5d789c2b765448c8635fb6c'

Adding a period to the end of the string completely changes the hash, illustrating (3).

The output of hash can be interpreted as an integer. For SHA256, it is a very large integer, between 0 and \(2^{256}\approx 10^{77}\). Note that you must carefully specify input and output formats, especially when dealing the Unicode strings. Since the hash is so large, the probability of a hash collision is extremely low. How low?

Aside: The Birthday Problem and hash collisions

The Birthday Problem asks how many people you need so there is at least a 50 percent chance two share a birthday. The answer is only 23 people. Hash collisions are analogous to the Birthday Problem: two documents (people) with the same hash (birthday).

Approximately \(\sqrt{2Np}\) documents are needed before there is a \(p\) probability of collision given a hash space size of \(N\)2. For SHA256, \(N=2^{256}=10^{77}\) and so a \(10^{-3}\) collision probability requires about \(1.5\times 10^{37}\) documents. That’s enough for every person on earth to compute 1 billion hashes per second for five times the age of the universe!

Aside: Proof of work and Bitcoin mining = compute hashes

The (in)famous Bitcoin mining process involves finding a nonce (number used once) to append to a coded string so that the result has a small hash, i.e., one with several leading zeros. The following Python code illustrates.

import hashlib

running_min = 2**256
ans = []
base = b'The quick brown fox jumps over the lazy dog'

for nonce in range(2000000000):  # 2 billion
    h = hashlib.sha256(nonce.to_bytes(4, byteorder='big') + base).hexdigest()
    n = int(h, 16)
    if n <  running_min:
        running_min = n
        ans.append([nonce, n])
        print(f'{nonce:12,d}   {n:077}')

#         Nonce   Hash
#             0    29115579230639891023898657467946481563928575965694753738500728003067276450760
#             3    21833633896494697913657452817095065461049276120751755746016193921330837964982
#             8    17391853960576662285627567225372501697536440120814058733709287576654299269058
#             9   00 491741673371171570027367996335736784622791320015893772572199978008540614786
#           817   00 207113148484537618144604663416437589440289273319027116671254033065643419132
#           827   000 35029650895291714754047120679492927968250654901817817434081241936987361735
#         3,292   000 30590294895123458493702891527069975442971551875566805022772671084264919745
#         6,362   000 23157006908555232018903879877754051315219896322661305099606253143774488785
#         7,634   000 11843095073522994422561274720857316931066719486382550615573171404879921966
#        22,034   0000 6045160764465103256154815045992679930360222615550766779824452388654984639
#        32,737   0000 3218718010716516807246023638919032202673987969434384430166215105132280583
#        43,078   0000 3066940367111277087798394765784480513227788830972580117541505418890948712
#        50,740   0000 0 344804005194498392473362848134761831134304453202875173759130216105619080
#       260,109   0000 0 149043122808237032345561872905133216060467384369910593113997965062602336
#       610,827   0000 00 25441204939268765420155917698735840343496809686969451042687651132777655
#     3,553,698   0000 00 12372585984995238023081534031026808791454761919139475665549030259593011
#    16,603,005   0000 000 4682308792444739613119316155033986067282587356863979013510780284611482
#    45,767,445   0000 000 4295135810439807939037487563409966578108755229939605598485594694500274
#    56,389,936   0000 000 1219890553970511010693160459086914039690075265862677724048817741406404
#   186,599,009   0000 0000 741733398915175814111679160159562329641666849535152212310255158283708
#   187,060,155   0000 0000 129027976973068678554136418237268320708790839626316760173444080235551
#   209,437,773   0000 0000 0 46418792192972977622708878642780226280538977482131916077098153688658
#   554,751,705   0000 0000 0 38492057003517052607600918969310106371482316138230835578404460555913
# 1,724,412,865   0000 0000 0 20951411954830677538112338658105096359813168232452740277675602777590

My machine took about 6 seconds to find six leading zeros.

Currently, the Bitcoin network hash rate3 is \(2\times 10^{20}\) (in November 2018 it was \(4\times 10^{19}\)) hashes per second. It is estimated that miner’s electricity consumption approximates that of Austria or Finland. How small are the hashes being found? Recent block hashes start:

Nov 2018 0x 0000 0000 0000 0000 0051 a841 86ab c5df ........
Feb 2022 0x 0000 0000 0000 0000 0003 b82d d7af bfb4 ........

Our simple code found nine leading zeros. Currently, Bitcoin finds 19. Remember, the difficulty increases exponentially. The problem is analogous to simulating random numbers in a spreadsheet and hitting F9 until one starting 0.0xxx appears. You’ll wait ten times as long for 0.00xxx, etc.

The discrete logarithm problem

The logarithm is the inverse to the exponential function: if \(a=b^c\) then \(c\) is the logarithm of \(a\) to base \(b\). With regular numbers, the exponential function \(x\mapsto x^c\) is a nice smooth function and it is easy to compute its inverse using numerical methods. The picture changes using modular arithmetic (remainders).

Setup. Let \(p\) be a large prime number with a large prime \(q\) dividing \(p-1\)

  • E.g. \(p=23\), \(p-1=22\) has factor \(q=11\)
  • Non-e.g. if \(p=29\) then \(p-1=28\) has many factors
  • Condition on \(p\) is needed to ensure the discrete logarithm problem is hard
  • Find \(g<q\) with \(g, g^2, \dots,g^{q-1}, g^q\equiv 1\) all distinct mod \(p\)
    • E.g. \(g=3\) and work mod 23 = remainder after dividing by 23
    • \((3) \rightarrow (9 = 9) \rightarrow (27 = 4) \rightarrow (12 = 12) \rightarrow (36 = 13) \rightarrow (39 = 16) \rightarrow (48 = 2) \rightarrow (6 = 6) \rightarrow (18 = 18) \rightarrow (54 = 8) \rightarrow (24 = 1)\)

The discrete logarithm problem asks

given \(n\), find \(a\) such that \(g^a\equiv n\pmod{p}\).

Remember, \(\log_g(n)=a\) means \(g^a=g^{\log_g(n)}=n\), hence name logarithm. Unlike the case for regular numbers, the remainders of powers of \(g\) bounce around randomly (see figures). We can’t use a Newton-Raphson method to home-in on the answer. And in practice you work with really big \(p\) and \(q\). Given \(g\), \(a\), and \(p\), computing \(A=g^a\mod p\) is easy. But recovering \(a\) from \(A\) is hard. Exponentiation is an example of a one-way function. These ideas are illustrated in the next figure.

Powers of 3 modulo 100043; 100042 = 2 x 50021 is twice a prime. The “graph” of the exponential function looks completely random, making it very difficult to compute the inverse and solve the discrete logarithm problem.

Cryptography and Digital Signatures

Many techniques in cryptography and digital signatures are based on the ingenious use of basic modular arithmetic. Internet security relies on addition, multiplication and division, and the difficulty of the discrete logarithm problem. The next three sections explain how two actors can create a shared secret, send a message using Public Key Encryption, and digitally sign a document.

Creating a shared secret using hash functions

Figure: Alice picks a, computes A=g^a \pmod p and sends it to Bob. Bob picks b, computes B=g^b\pmod p and sends it to Alice. Both Alice and Bob can compute K=B^a=A^b=g^{ab}\pmod p, but no one else can, and it was never shared electronically. It is a shared secret between Alice and Bob.

ElGamel Public Key Encryption

Figure: Bob wants to send Alice a message m. Alice picks a and publishes a public key A=g^a\pmod p. Bob picks a nonce k and computes B=g^k\pmod p, the shared secret K=A^k\pmod p, and encodes the message as Km\pmod p. Bob sends the pair (B, Km) to Alice. Knowing B, Alice can compute K=B^a=g^{ak} and then decode m=(K^{-1}Km). The shared secret K is never communicated publicly.

Digital Signatures

Figure: To sign m, Alice picks a nonce k, computes R=g^k\pmod p to hide it, solves kS=m+Ra for k, and communicates the pair (R,S) to Bob as the signature. Bob validates the signature by computing g^mA^R=g^{m+Ra} and R^S=g^{kS} and checking they are equal. If an Imposter, who does not know Alice’s secret a tries to sign the message, they must solve kS=m+Ra, one equation with two unknowns. They do know A=g^a, and they can pick k and R, so they try to solve g^{kS}=R^S=g^{m+Ra}=g^mA^R. But, solving R^S=g^mA^R for R and S is (supposed to be) as hard as the discrete logarithm problem. Thus, an Imposter cannot sign the message.

Digital signatures are improve over traditional signatures in three important ways.

  1. Signer authentication: the verifier is assured that signature has been created only by sender who possess the corresponding secret private key, \(a\).
  2. Message integrity: if the message is modified then the signature fails, making the signature tamper evident.
  3. Non-repudiation: the existence of a signature proves it came from the sender. The sender cannot later repudiate signing.

In contrast, a wet-ink signature can be forged, or the document can be altered, or the signature can be denied.


  1. See Birthday Attack.↩︎

  2. E.g. for birthday problem \(p=1/2\), \(N=365\) and \(\sqrt{2Np}=19\). Approximation relies on \(p\approx -\log(1-p)\), only true for smaller \(p\). Using \((-2N\log(1-p))^{1/2}=22.49\) is very close to correct answer, 23.↩︎

  3. Number of hashes computed per second by all miners.↩︎

posted 2022-02-08 | tags: hash, encryption, digital signature

Share on