How The Heck Does Shazam Work?

How audio fingerprinting and a connect-the-dots trick lets Shazam identify a song in seconds.

Shri Khalpada


Part of How The Heck?, a series of interactive explanations of everyday technology.

You're at a coffee shop. A song comes on. It's right on the tip of your tongue. You pull out your phone, tap a button, and it tells you what it is in a few seconds.

How does a phone listen to a few seconds of music through a noisy room and instantly match it against millions of songs?

Your first instinct might be that the phone is listening to the melody or recognizing the lyrics. It's neither of those. What it's actually doing is far more clever.

Reverse Engineering Sound

TL;DR
Your phone captures sound as a waveform using a very thin membrane. The raw waveform isn't useful for song identification, so an algorithm called the Fast Fourier Transform converts it into a map of frequencies over time called a spectrogram.

Your phone's microphone has a diaphragm: a membrane thinner than an average human hair that vibrates when sound waves hit it. Those vibrations are converted into an electrical signal, which is then digitized into a waveform: a sequence of numbers representing air pressure at each instant in time. This is essentially how your own ear works: your eardrum catches those same pressure waves, but while your brain turns them into sound, your phone turns them into a sequence of numbers.

But the raw waveform is nearly useless for identification. A song played louder produces a completely different-looking waveform even though it's the same song. Two different songs can produce very similar waveforms, and the same song played in different environments can produce very different waveforms.

The trick is to transform the waveform into something more useful for a computer. Your phone runs a mathematical operation called a Fast Fourier Transform (FFT) on small slices of the waveform. Each slice gets decomposed from a single complex wave into a list of the individual frequencies present at that moment.

Stack all those slices side by side and you get a spectrogram, a picture that encodes three dimensions at once: time runs along the horizontal axis, frequency runs along the vertical axis, and the brightness of each point represents amplitude (how loud that frequency is at that moment).

What does the FFT actually do?

Any waveform, no matter how jagged, can be described as a sum of smooth sine waves at different frequencies, amplitudes, and phases. The Fast Fourier Transform is an efficient algorithm for decomposing a chunk of audio samples into exactly that list. Feed it 1,024 samples of raw audio (about 23 milliseconds at CD quality), and it returns a spectrum telling us how much energy is present at each frequency. The core formula is the Discrete Fourier Transform:

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}$$

For each frequency bin $k$, you multiply every sample $x_n$ by a sinusoid at that frequency and add them up. If the signal contains that frequency, the sum is large. If not, it cancels out.

The "fast" part matters. A naive decomposition would take roughly $N^2$ operations per chunk, about a million for a 1,024-sample slice. The FFT exploits symmetry in the math to do it in roughly $N \log N$ operations instead (where $N$ is the number of samples in the chunk). This is fast enough to run hundreds of times per second on a phone. Your device slides this window across the audio, runs the FFT on each slice, and stacks the resulting spectra side by side. That stack is the spectrogram.
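As a concrete sketch, here's what one of those FFT calls looks like in NumPy. This is a toy example, not Shazam's code: one 1,024-sample slice of a pure 440 Hz tone goes in, and a spectrum of energy per frequency bin comes out.

```python
import numpy as np

SAMPLE_RATE = 44_100   # CD-quality sampling rate
CHUNK = 1_024          # one slice, about 23 ms of audio

# Synthesize one chunk of a pure 440 Hz tone (concert A).
t = np.arange(CHUNK) / SAMPLE_RATE
chunk = np.sin(2 * np.pi * 440 * t)

# Decompose the chunk into its frequency content.
spectrum = np.abs(np.fft.rfft(chunk))              # energy per frequency bin
freqs = np.fft.rfftfreq(CHUNK, d=1 / SAMPLE_RATE)  # the bin frequencies in Hz

# The loudest bin lands near 440 Hz (bins are ~43 Hz wide at this size).
peak = freqs[np.argmax(spectrum)]
print(round(peak))
```

Note that with a 1,024-sample chunk the frequency resolution is about 43 Hz per bin, so the peak lands in the bin nearest 440 Hz rather than exactly on it; longer chunks buy finer frequency resolution at the cost of coarser time resolution.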

Try playing some notes below. The top shows the raw waveform. The bottom shows the same audio as a spectrogram: a map of which frequencies are present at each moment. A single note produces one clean horizontal band. A chord produces several. Switch between chords and watch the bands shift in real time.

Those are simple synthetic tones, each one a single pure frequency. Real music is far more complex: vocals, drums, guitars, and reverb all layered on top of each other. Use your microphone to see sound transformed into a spectrogram in real time.

The phone samples incoming sound tens of thousands of times per second (typically 44,100, the same rate used in CDs). Each tiny slice of those samples gets fed through the FFT. What comes out the other side is a format that the system can reason about.
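A minimal sliding-window spectrogram can be hand-rolled in a few lines. This is a bare-bones sketch: real implementations (such as `scipy.signal.spectrogram`) also apply a window function to each slice to reduce spectral leakage.

```python
import numpy as np

def spectrogram(samples, chunk=1024, hop=512):
    """Slide a fixed-size chunk across the audio, FFT each slice,
    and stack the spectra into a 2-D array (time x frequency)."""
    slices = []
    for start in range(0, len(samples) - chunk + 1, hop):
        spectrum = np.abs(np.fft.rfft(samples[start:start + chunk]))
        slices.append(spectrum)
    return np.array(slices)   # rows = time slices, columns = frequency bins

rate = 44_100
t = np.arange(rate) / rate                 # one second of audio
audio = np.sin(2 * np.pi * 440 * t)        # a single pure tone
spec = spectrogram(audio)
print(spec.shape)                          # (time slices, frequency bins)
```

The `hop` of 512 samples means consecutive slices overlap by half, a common compromise between time resolution and redundancy.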

Less Is More

TL;DR
The algorithm deliberately discards most of the spectrogram, keeping only the loudest peaks: a sparse "constellation map".

Even for a computer, storing and searching all of that spectrogram data would be impossibly slow, so the algorithm does something counterintuitive: it throws almost all of it away.

Drag the threshold slider below and watch what happens. As you raise the threshold, fainter signals disappear while only the loudest peaks survive. What's left is a sparse constellation of dots representing the most acoustically relevant moments in the song.

This is what makes the system robust to noise. Background noise adds low-level energy across the spectrogram, but it rarely creates the single loudest peak in any given region. The landmarks are the frequencies that were so dominant they punched through the noise.
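One common way to extract those landmarks is to keep only cells that are both above a loudness threshold and the maximum of their local neighborhood. A rough sketch of that idea (the production criteria are more elaborate, balancing peak density across the spectrogram):

```python
import numpy as np

def constellation(spec, neighborhood=10, threshold=5.0):
    """Return (time, frequency) coordinates of local spectrogram peaks."""
    peaks = []
    times, freqs = spec.shape
    for t in range(times):
        for f in range(freqs):
            value = spec[t, f]
            if value < threshold:
                continue                       # too quiet to be a landmark
            # Local window around (t, f), clipped at the edges.
            t0, t1 = max(0, t - neighborhood), t + neighborhood + 1
            f0, f1 = max(0, f - neighborhood), f + neighborhood + 1
            if value >= spec[t0:t1, f0:f1].max():
                peaks.append((t, f))           # loudest in its neighborhood
    return peaks

# Toy spectrogram: quiet background with two loud cells.
spec = np.ones((20, 20))
spec[3, 3] = 50.0
spec[16, 16] = 80.0
print(constellation(spec))   # only the two loud cells survive
```

Everything below the threshold, which is where most background noise lives, is simply discarded.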

On the flip side, this fingerprint approach is also why Shazam works poorly if you just sing into it: you're likely to generate different hashes than the original recording, even if you're a very good singer!

Connecting the Dots

TL;DR
A single peak doesn't tell us much, but pairs of peaks are much less random. The algorithm pairs nearby peaks to create unique fingerprint hashes.

A single dot in the constellation isn't very useful on its own. A frequency of 1,200 Hz at some moment in time could appear in thousands of songs. But a pair of dots, say 1,200 Hz followed by 2,400 Hz exactly 0.3 seconds later, is far more specific.

The algorithm gives every peak a turn as an anchor. For each one, it defines a target zone to its right (a window of time and frequency) and pairs the anchor with every peak inside that zone. Each pair generates a compact hash from three numbers: the two frequencies and the time difference between them.

You can think of a hash as a short string of characters that acts like an address: the same three inputs will always produce the same hash, but even a tiny change in any input produces a completely different one.
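In code, the anchor-and-target-zone pairing might look like the sketch below. The zone bounds and the bit-packing layout here are illustrative choices, not Shazam's actual values.

```python
def fingerprints(peaks, max_dt=64, max_df=128):
    """Pair each anchor peak with later peaks in its target zone.
    `peaks` is a list of (time slice, frequency bin), sorted by time."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt <= 0 or dt > max_dt:
                continue                      # outside the zone in time
            if abs(f2 - f1) > max_df:
                continue                      # outside the zone in frequency
            # Pack (anchor freq, target freq, time delta) into one integer.
            h = (f1 << 20) | (f2 << 8) | dt
            hashes.append((h, t1))            # keep the anchor's time too
    return hashes

peaks = [(0, 100), (10, 200), (30, 150)]      # three constellation dots
print(fingerprints(peaks))                    # three (hash, anchor time) pairs
```

Storing the anchor's time alongside each hash matters later: it's what lets the matching step check that the time gaps line up.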

You can try it below by clicking any dot to select an anchor, then clicking one in the target zone.

A single 3-minute song might generate thousands of these fingerprint hashes, and the database stores them all. Now the phone has a handful of hashes from its 5-second clip, and the database has millions of hashes from almost every popular song ever recorded. How does it find the match?

Finding The Perfect Match

TL;DR
Each hash is an address. The system looks up every hash from your clip in a massive table and instantly finds which songs share them.

A Naive Approach: The Song-First View

When we think about music, we intuitively think in terms of songs. To find a match with this mental model, you'd have to search every song, one by one, checking whether its hashes overlap with the ones from your clip. This operates in $O(N)$ time, which is a computer scientist's way of saying it gets linearly slower the more songs the world creates.
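Here's what that song-first search looks like as a sketch, with each song's fingerprint reduced to a set of hashes:

```python
def naive_match(clip_hashes, songs):
    """Scan every song in the catalog: time grows linearly with its size."""
    best, best_overlap = None, 0
    for title, song_hashes in songs.items():   # one pass per song
        overlap = len(clip_hashes & song_hashes)
        if overlap > best_overlap:
            best, best_overlap = title, overlap
    return best

songs = {
    "Song A": {1, 2, 3, 4},
    "Song B": {3, 5, 6},
    "Song C": {7, 8},
}
print(naive_match({2, 3, 4}, songs))   # Song A shares the most hashes
```

Fine for three songs; hopeless for a hundred million.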

Try the example below, which represents a simplified version of the problem with a few songs and hashes.

An Inverted Index: The Hash-First View

Computers can flip the script. Instead of searching through a song library, the phone thinks in terms of the sound and tracks which songs that sound has shown up in. You can think of it as an index in a book, but instead of listing the pages a word appears in, it lists the songs a hash appears in.

This makes the lookup operation $O(1)$, meaning it takes roughly the same amount of time whether you have 100 songs or 100 million. More precisely, the phone goes straight to each hash's address rather than scanning through songs, and the number of possible hashes is large enough that each address only contains a handful of entries, even across millions of songs.

Finding shared hashes isn't enough, though. A popular drum pattern might produce the same hash in hundreds of songs. The final test is timing. If your clip has 17403C and 19A998 1.2 seconds apart, the matching song must also have them 1.2 seconds apart. If the time gaps between all the matching hashes agree, and there are enough matches, the system has found the song.
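Putting the inverted index and the timing test together, a toy matcher might look like the sketch below. The key idea is offset voting: for a true match, song time minus clip time is the same for every shared hash, so the right song shows up as one tall spike of votes at a single offset.

```python
from collections import defaultdict

index = defaultdict(list)                    # hash -> [(song, time), ...]

def add_song(song, hashes):
    """Register a song's (hash, time) fingerprints in the inverted index."""
    for h, t in hashes:
        index[h].append((song, t))

def match(clip_hashes):
    """Vote on (song, time offset) pairs; the true match dominates one offset."""
    votes = defaultdict(int)
    for h, clip_t in clip_hashes:
        for song, song_t in index.get(h, []):   # direct lookup, no scanning
            votes[(song, song_t - clip_t)] += 1
    (song, _offset), count = max(votes.items(), key=lambda kv: kv[1])
    return song, count

add_song("Song A", [(11, 0), (22, 5), (33, 9)])
add_song("Song B", [(22, 2), (44, 7)])

# A clip starting 5 slices into Song A: same hashes, shifted in time.
# Both shared hashes agree on offset 5, so Song A wins with 2 votes.
print(match([(22, 0), (33, 4)]))
```

Note that Song B also contains hash 22, but its vote lands at a different offset, so the coincidental collision never accumulates.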

The system is designed around operations that computers are very good at: comparing numbers and looking up addresses. The whole lookup happens in fractions of a second across millions of songs.

More Modern Approaches

Most song-identification services, like Shazam, send your audio clip to a server where a massive database of fingerprints lives. The server does the matching and sends back the result. This works because the database is enormous (hundreds of millions of songs) and searching it requires serious compute.

There are newer approaches as well. Apple's on-device recognition and Google's Pixel "Now Playing" feature run locally on your phone. They use smaller, curated databases and optimized models that trade exhaustive coverage for speed and privacy, and more sophisticated machine learning models that are even more robust to noise and variations in the audio.

These on-device databases are typically slower to update with new songs, and have to pull new data when your location changes: the hit songs in Japan are likely to be different from the hit songs in the US, and vice versa. In exchange, these devices can work passively, without you having to do anything, and are much better in terms of battery usage and privacy.

Regardless of where the matching happens, the core trick is the same. Shazam is solving a high-speed game of connect the dots, and the dots are specifically chosen to give us the best odds of finding the right song. I think it's a great example of technology and math doing something that feels magical.

Much of this piece is based on Avery Wang's original 2003 paper, An Industrial-Strength Audio Search Algorithm. If you want to go deeper into the signal processing and system design behind Shazam, it's worth your time.

Thank you!

If you like this type of content, you can follow me on BlueSky. If you wanted to support me further, buying me a coffee would be much appreciated. It helps me keep the lights on and the servers running! ☕
