Voice Recognition Exploit Exposed

A recent article on TechCrunch warns that hackers can send silent commands to speech recognition systems using ultrasound. These sorts of attacks are interesting because they are sometimes plausible, sometimes outrageous, and often teach us about security concerns we didn't previously know to worry about.

DolphinAttack exploits the way the hardware records sound and processes it into speech signals. Essentially, it's a hack cleverly built on harmonics. The full signal processing background is more than I wanted to cover in this post, but if readers are interested, let me know.
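
For those who want a taste of it anyway, here is a minimal sketch of the general principle as I understand it, not the researchers' exact method: a command is amplitude-modulated onto an ultrasonic carrier, and a small nonlinearity in the microphone path (modeled below as a simple squaring term) drops a copy of the command back into the audible band. The frequencies and the nonlinearity model are my own illustrative choices.

```python
import numpy as np

fs = 192_000                        # high sample rate so ultrasound is representable
t = np.arange(19_200) / fs          # 0.1 seconds of samples

command = np.sin(2 * np.pi * 400 * t)        # stand-in for a voice command at 400 Hz
carrier = np.sin(2 * np.pi * 30_000 * t)     # 30 kHz carrier, above human hearing
transmitted = (1 + 0.8 * command) * carrier  # amplitude modulation; inaudible as sent

# A perfectly linear microphone would hear only ultrasound. Real hardware
# has slight nonlinearity; model it here as x + 0.1 * x^2.
received = transmitted + 0.1 * transmitted ** 2

# Squaring mixes the carrier with itself, so a copy of `command`
# lands back in the audible band around 400 Hz.
spectrum = np.abs(np.fft.rfft(received))
freqs = np.fft.rfftfreq(len(received), 1 / fs)
audible = (freqs > 20) & (freqs < 20_000)    # ignore DC, look below 20 kHz
peak = freqs[audible][np.argmax(spectrum[audible])]
print(f"strongest audible component: {peak:.0f} Hz")  # ~400 Hz reappears
```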

By carefully constructing sounds and exploiting harmonics, the attack creates audio that the human ear is very unlikely to hear, yet is picked up just fine by the machine's microphone. So why doesn't the microphone hear it only faintly, too, and dismiss it as background noise? It comes down to the way machines sample and transform the compression waves they record. The most plain-spoken way I have to describe it is that these are like phantom sounds, or artifacts.
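
To make the phantom-sound idea concrete, here's a toy aliasing demo: a tone above the Nyquist frequency, sampled without a perfect anti-aliasing filter, folds down into the band a speech recognizer actually analyzes. The 16 kHz sample rate and the 15 kHz tone are illustrative assumptions, not figures from the attack.

```python
import numpy as np

fs = 16_000                    # typical speech-recognition sample rate
t = np.arange(fs) / fs         # one second of samples
tone = np.sin(2 * np.pi * 15_000 * t)  # 15 kHz tone, above Nyquist (8 kHz)

# Sampled at 16 kHz, the 15 kHz tone is indistinguishable from a
# 1 kHz tone -- a "phantom sound" that was never actually played.
spectrum = np.abs(np.fft.rfft(tone))
freqs = np.fft.rfftfreq(len(tone), 1 / fs)
print(f"tone appears at: {freqs[np.argmax(spectrum)]:.0f} Hz")  # 1000 Hz alias
```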

This type of work is not only interesting, it should be celebrated for exposing a potential attack that few people had considered. The demonstration may require strict conditions to work, but it's still a strong demonstration of what's possible when microphones are constantly listening for their wake words. Perhaps the eventual outcome is that we do away with wake words entirely. Perhaps we'll carry cheap, convenient clickers that indicate our desire to be listened to, with a second click to revoke it. To be clear, those clickers would connect over a network, perhaps with some biometrics gathered from contact with your finger.

We should also note that it's very doubtful the speech recognition engineers were much concerned with this use case. If this is a kind of overfitting to their data, we absolutely forgive them for bringing these exceptional systems to market. In general, the systems are imperfect, but they're very close. With this now exposed as a concern, I suspect we may see hardware and software evolve to optimize not only for accuracy and energy, but perhaps also for a match to human biology.

Why even have sensors that can listen outside the human hearing range? Oh yeah, because that might be useful! I'm reminded of early telephone systems, whose networks were flooded by devices called blue boxes, red boxes, and a rainbow of other exploits. The ability for devices to communicate with one another audibly, yet without disturbing the humans nearby, is an interesting use case. Maybe it's not necessary. Maybe they can just sync over Wi-Fi. However, if you consider proximity, to the best of my knowledge Wi-Fi is rather pitiful at pinpointing precise device locations. Maybe the natural sound of the environment could have some theoretical advantage for a swarm of devices all talking to one another. This is more speculative than I usually like to be, but I think the audible spectrum is going to be thought about much more in the next 5 to 10 years than perhaps it ever was before.
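
In that speculative spirit, here's what the simplest form of data-over-sound might look like: bits encoded as two near-ultrasonic tones (binary FSK) that most adults can't hear but ordinary audio hardware can record. The frequencies, bit rate, and FFT-based decoder are all my own illustrative choices, not any real product's protocol.

```python
import numpy as np

fs = 48_000                  # common audio hardware sample rate
bit_dur = 0.05               # 50 ms per bit -> 20 bits per second
f0, f1 = 18_000, 19_000      # tones at the edge of most adults' hearing

def encode(bits):
    """One pure tone per bit: f0 for 0, f1 for 1."""
    t = np.arange(int(fs * bit_dur)) / fs
    return np.concatenate(
        [np.sin(2 * np.pi * (f1 if b else f0) * t) for b in bits])

def decode(signal):
    """Recover bits by finding the dominant frequency in each bit slot."""
    n = int(fs * bit_dur)
    freqs = np.fft.rfftfreq(n, 1 / fs)
    bits = []
    for i in range(0, len(signal), n):
        peak = freqs[np.argmax(np.abs(np.fft.rfft(signal[i:i + n])))]
        bits.append(1 if abs(peak - f1) < abs(peak - f0) else 0)
    return bits

message = [1, 0, 1, 1, 0, 0, 1, 0]
print(decode(encode(message)))  # -> [1, 0, 1, 1, 0, 0, 1, 0]
```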