

by ViTo • July 26, 2008 1 comment

Going against the current, I have to admit that I am no big fan of the iPhone. I’ll give it to you: it’s sleek, it has plenty of features, and it’s… cool; however, I’m not willing to pay the price Apple set for it (I don’t see an iPod or a MacBook in my life any time soon either 😉 ).

In any case, I got to play with my friend’s iPhone, and what got my attention the most was this application called Shazam. Why is it special? Basically, you play any song, let Shazam listen to it for about 5-10 seconds, and it will tell you the title, artist, album, etc.

Here is a video I found that shows it in action:

(more detailed video review)

I am not gonna argue whether it is needed/useful or not, although I think online music shops such as iTunes could easily use the technology to increase their sales. As far as I’ve seen, the current iPhone implementation of the application is free, but their website refers to a previous service that charged on a per-recognition basis: no match, no pay. So they must be pretty confident in their recognition abilities…

What got my attention is that the application succeeds in recognizing the song every single time after listening to only about 8 seconds. It doesn’t matter which part of the song Shazam hears, how repetitive it is, or how unrepresentative of the rest that segment is: it gets it right! Of course, I was able to make it fail by feeding it traditional Catalan songs, but in that case I’m almost certain it failed not because the algorithm was unsuccessful, but because it didn’t have those songs in its database.

Having a background in electrical engineering and computer science, I am somewhat familiar with digital signal sampling, filtering, etc., and I find this product amazing. I googled around but could find no details at all (even high-level ones) about its underlying algorithm.

It’s obvious the procedure goes something like this:

listen to fragment –> digital processing –> song signature –> database match

Given the low-quality reception of the song, they have to deal with plenty of noise but still manage to extract some fundamental audio parameters. If they were only dealing with voice, those would be something like pitch, tone, etc., the parameters that make your voice unique; that’s mostly how speaker recognition works, after all.

In the case of a full song, that’s not enough because of its high variance: the beginning might be completely different from the end, so the signature extracted by the algorithm will differ depending on where you start listening. The only way I see to do it is by extracting much shorter audio signatures, maybe one second long. I think it is safe to assume a song’s signature remains roughly constant within a one-second timespan.
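Just to make the idea concrete, here is a toy sketch of what a one-second signature could look like. This is purely my own illustration, not Shazam’s actual algorithm: it hashes the ranking of coarse frequency bands by energy, which at least has the nice property of surviving volume changes.

```python
import cmath

def spectral_signature(samples, n_bands=8):
    """Toy 1-second fingerprint: hash the ranking of coarse frequency
    bands by energy. Illustrative only -- not Shazam's real method."""
    n = len(samples)
    band_energy = [0.0] * n_bands
    # Naive DFT magnitudes, binned into coarse bands (fine for a toy).
    for k in range(n // 2):
        coeff = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        band_energy[k * n_bands // (n // 2)] += abs(coeff) ** 2
    # Keep only the rank ordering of bands: robust to volume changes.
    ranking = tuple(sorted(range(n_bands), key=lambda b: -band_energy[b]))
    return hash(ranking)
```

Two identical windows give the same signature, while windows dominated by different frequencies give different ones, which is all a lookup key needs.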

Of course, such short signatures don’t have to be unique, so at this point there can be more than one match. But then it’s just a matter of intersecting the result sets for each of them, and if the match is still not unique, the order in which the signatures appear could narrow results down further.
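The intersection step above is simple enough to sketch. Assume (hypothetically) an index mapping each signature to the set of songs containing it; intersecting the candidate sets for consecutive windows narrows the match:

```python
def match_song(window_signatures, index):
    """Narrow candidates by intersecting, for each consecutive window
    signature, the set of songs that contain it. `index` maps
    signature -> set of song ids (a stand-in for the real database)."""
    candidates = None
    for sig in window_signatures:
        songs = index.get(sig, set())
        candidates = songs if candidates is None else candidates & songs
        if len(candidates) <= 1:
            break  # unique match (or dead end) -- no point continuing
    return candidates or set()
```

Note this sketch only intersects sets; checking that the signatures also occur in the right order within each candidate song would be the extra step mentioned above.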

I read on their website that the database currently holds 4 million songs. Based on one-second signatures, and assuming the average song is 4 minutes long (that is, 240 seconds), that would give us a database of roughly 1 billion (10^9) entries. Fortunately, a signature should be something tiny, similar to a hash value.
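The back-of-envelope arithmetic checks out; the per-entry sizes below are my own assumptions, just to see the order of magnitude:

```python
# Back-of-envelope check of the numbers above (assumed figures).
songs = 4_000_000          # catalogue size quoted on their site
seconds_per_song = 240     # assumed 4-minute average
entries = songs * seconds_per_song   # 960_000_000, i.e. ~1 billion

# If each signature is hash-sized, say 8 bytes, plus an 8-byte song id:
bytes_total = entries * 16
print(bytes_total / 1e9)   # on the order of 15 GB
```

So even with generous per-entry overhead, the whole index fits comfortably on a single disk.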

At this point, the issues are hardware resources and scalability, but given current technology with terabyte hard drives and multi-core CPUs, this approach is clearly feasible. Isn’t that amazing? Maybe you don’t think so after all 😉 , but at least it makes me wonder where the limit is and where we can be in just a few years. Are we soon gonna see devices able to recognize a human voice from a 1-second sample, in a way unique enough to tell it apart from every other human’s? Or maybe it can already be done…

By the way, does anybody have any details about the actual implementation?


  1. Sergio

    Well, when I was working in Germany 😉 I used to do that in a much easier way, without any signal processing whatsoever. All you need is an internet connection and for the song’s metadata to be stored in a database (previously entered by someone with too much free time). It does not matter whether the song is being played or not; the metadata (i.e., song name, album, etc.) can be retrieved anyway. The thing is, each song has a fingerprint that is sent to the database, a match is found (hopefully the right one), and you get back the artist, album, song name, etc. (in MP3 files, for instance, this metadata is stored as ID3 tags; I don’t know about Apple’s proprietary formats).

    In the case of the iPhone, I don’t know whether the database is stored on the phone, accessed via a mobile comms link, or whatever, but the principle is the same:

    connect to remote database –> send song’s fingerprint –> search match in database –> get metadata back to your program and show it on the screen to make the user go crazy!!
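A minimal sketch of that lookup, with made-up fingerprints and metadata purely for illustration (the real CDDB/freedb protocols are more involved):

```python
# Toy sketch of a CDDB/freedb-style lookup: the client sends a
# fingerprint, the server returns whatever metadata was stored for it.
# Fingerprints and metadata below are invented for illustration.

METADATA_DB = {
    "disc-fp-123": {"artist": "Some Artist",
                    "album": "Some Album",
                    "title": "Some Song"},
}

def lookup(fingerprint):
    """Return the stored metadata for a fingerprint, or None if unknown."""
    return METADATA_DB.get(fingerprint)
```

The whole trick is that someone already typed the metadata in; the client just needs a stable fingerprint to use as the key.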

    Examples of such databases are freedb and CDDB. Check them out, and their links and everything, if you want to learn more.

    It is not really difficult to get it done, but cool nonetheless!