Ghost in the machine: the deep features of Yanming Guo
In the 1960s at MIT, cognitive scientist Marvin Minsky told a couple of graduate students to program a computer to perform the simple task of recognising objects in pictures, thinking it would be a nice summer project. Scientists from Leiden and the rest of the world are still working on it today.
More than colours and shapes
'To truly understand what's in a photograph', says Leiden PhD student Yanming Guo, 'neural networks need to see more than colours and shapes. They must understand the image, to see depth, scale and context. The first step toward that goal is to develop powerful image features.'
Guo trains deep convolutional and recurrent neural networks on massive, large-scale image datasets, teaching them to develop those powerful features - to identify the most basic and deepest features of images as a first step toward deeper understanding. The algorithms he has written give computers the power to interpret visual input like never before.
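What a "feature" means at the lowest layer of a convolutional network can be shown with a toy example. The sketch below is illustrative only - it is not Guo's code, and the hand-set filter values would in practice be learned from data - but it shows how sliding a small filter over an image turns raw pixels into a feature map that responds to a simple pattern, here a vertical edge.

```python
# A minimal sketch (not Guo's actual code) of the lowest layer of a
# convolutional network: sliding a small filter over an image produces
# a feature map. This filter is hand-set to respond to vertical edges;
# in deep learning, the filter values are learned from the data.

def convolve2d(image, kernel):
    """Valid 2D convolution of a grayscale image with a square kernel."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(k) for dj in range(k))
            row.append(s)
        out.append(row)
    return out

# A tiny image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# A vertical-edge filter: responds where brightness changes left-to-right.
edge_filter = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, edge_filter)
print(feature_map)  # → [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The middle column of the feature map lights up exactly where dark meets bright. Deep networks stack many such layers, so that later filters respond to combinations of edges - textures, parts, and eventually whole objects.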
Guo uses what is known as deep learning. Computers are given no specific instructions on how to learn about what they see, and no pixel-by-pixel explanation of what they are doing. Instead, they are given sets of algorithms that enable them to find patterns in the data they work on, and so gain a deeper understanding.
The information is stored in deep neural networks, computing systems that work in a way not dissimilar to human brains. These networks compare whatever they're looking at to everything else they've seen and know. 'They change, adapting to the data they're fed, changing from an initial general state to a specialised state focused on the type of data they are exposed to,' explains Guo.
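That progression from a general to a specialised state can be illustrated with the smallest possible network: a single artificial neuron. Everything in this sketch is illustrative - the tiny dataset and feature names are invented, not taken from Guo's research - but it shows the mechanism: the neuron starts with no preferences at all, and repeated exposure to labelled examples gradually specialises its weights.

```python
import math

# A toy sketch of a network "adapting to the data it is fed": a single
# neuron starts in a generic state (weights at zero, no preference) and
# gradient descent specialises it to the examples it sees. The dataset
# and feature names below are invented for illustration.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny labelled dataset: x = (brightness, roundness), y = 1 if "ball".
data = [((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.1, 0.2), 0), ((0.2, 0.1), 0)]

w = [0.0, 0.0]   # generic initial state: no feature matters yet
b = 0.0
lr = 1.0

for _ in range(200):                      # repeated exposure to the data
    for (x1, x2), y in data:
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)
        err = p - y                       # gradient of the log loss
        w[0] -= lr * err * x1
        w[1] -= lr * err * x2
        b -= lr * err

# After training, the neuron is specialised: "ball"-like inputs score
# near 1, unlike inputs near 0.
print(sigmoid(w[0] * 0.85 + w[1] * 0.85 + b))
print(sigmoid(w[0] * 0.15 + w[1] * 0.15 + b))
```

Deep networks do the same thing with millions of weights across many layers, which is why they need the large-scale datasets Guo works with.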
'For my doctoral research, I have done three things', says Guo. 'First and most fundamentally, I have defined new rules to extract those powerful features, using otherwise traditional ways to extract them. Second, I have created a hierarchy of specificity - for instance progressing from “animal” to “dog” to “golden retriever” - in order to create an understanding of the entire image. Finally, I have used cross-modal synthesis to translate the data in the computer into something we can understand - words and sentences, with grammar and syntax.'
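The second idea, a hierarchy of specificity, can be sketched as a simple data structure. The labels below are examples in the spirit of Guo's "animal - dog - golden retriever" illustration, not his actual taxonomy: each fine-grained label points to a more general parent, so one prediction can be read at any level of detail.

```python
# A hand-made sketch of a specificity hierarchy: each fine-grained label
# points to its more general parent. The labels are illustrative examples,
# not Guo's actual taxonomy.

parent = {
    "golden retriever": "dog",
    "poodle": "dog",
    "dog": "animal",
    "tabby": "cat",
    "cat": "animal",
}

def lineage(label):
    """Walk from a specific label up to the most general one."""
    chain = [label]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

print(lineage("golden retriever"))  # → ['golden retriever', 'dog', 'animal']
```

A classifier that is unsure at the finest level can still commit confidently at a coarser one - it may not know the breed, but it knows it is looking at a dog.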
This is where Guo unites the science of visual recognition with that of human language, using deep learning to teach computers to speak English. The computer associates a shape it sees with the label 'dog', identifies the background as 'grass', and recognises the object in the dog's mouth as 'frisbee'. Now it needs to tell us that, as clearly as it can. Thanks to Yanming Guo's work, a computer can now describe this 'dog, grass, frisbee' scene as 'a dog is standing in the grass with a frisbee', proving that it is doing far more than guessing at shapes.
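The shape of that final translation step - labels in, sentence out - can be shown with a deliberately naive sketch. Guo's system learns this mapping with recurrent neural networks; the hard-coded template below only illustrates what the input and output look like, not how his method produces them.

```python
# A toy sketch of the labels-to-sentence step. Guo's system learns this
# mapping with recurrent networks; this hard-coded template is only an
# illustration of the input and output.

def caption(subject, scene, obj):
    article = "an" if subject[0] in "aeiou" else "a"
    return f"{article} {subject} is standing in the {scene} with a {obj}"

print(caption("dog", "grass", "frisbee"))
# → a dog is standing in the grass with a frisbee
```

The hard part, of course, is everything the template hides: choosing the verb, the prepositions, and the word order from the image itself rather than from a fixed pattern - which is exactly what the learned approach provides.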
Not another brick in the wall
Guo has written algorithms to extract deep features, to build hierarchies, and to translate the information into complex language. His thesis seems an exception to the 'brick in the wall' school of publishing. Rather than contributing a small set of conclusions to the great edifice of his field, he has taken on a broad set of tasks, improving and innovating within all of them.
Computers beat us at calculus a long time ago; by now they've beaten us at chess, the ancient game of Go, and they may even be better drivers than us. 'Neural networks are now even better at recognising human faces than we are', says Guo.