In the last video, we briefly talked about
the history of neural network research,
and we discussed how in the early 2010s
neural network architecture started
to win machine learning competitions.
In particular, there was
the ImageNet competition
which involved classifying images
in which the neural network entry
did significantly better
than any non-neural network entry.
And, interestingly, this was so even though
the non-neural-network entries
used hand-coded features
and hand-tuned algorithms.
This reminds us a little bit
of the development
of machine learning in go-playing
where AlphaGo actually succeeded
and improved by reducing
the impact of hand-coded features,
replacing them with features
learned entirely automatically
from large amounts of data.
Now, what is it that made
the neural network entry
in ImageNet so successful?
Today, we would call
the kind of architecture
that was entered into
ImageNet and won it
a deep neural network
or a deep learning architecture.
And, even though
this is a bit of a buzzword,
there are a few typical characteristics
that define such neural networks
and distinguish them from
previous uses of neural networks.
So, what happened?
What was different?
First, the very design and architecture
of the neural network was deeper
and more structured.
And I'll define what I mean
by both of those terms in a minute.
Second, there was
a huge data set provided.
The data sets used in these competitions
grew bigger and bigger every year,
and it turns out that neural networks
seem to do really well
when there's a lot of data provided.
Lastly, the team that entered
this neural network
used graphics cards, which were developed
essentially for faster gaming
and better graphics in games,
but which also do an operation
called matrix multiplication
that's used extensively
in neural networks,
and they used these to train
the neural network
several orders of magnitude faster than
would be possible with regular processors.
So this allowed them to train the network
for a long time on a large amount of data,
and it turns out that just using
more data and more training
makes a big difference with neural networks.
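As an aside, the matrix multiplication the team exploited is easy to see in code: each layer of a neural network computes one weighted sum per neuron and then applies a non-linearity, and doing all those weighted sums at once is exactly a matrix multiply. A minimal pure-Python sketch (the sizes, weights, and the choice of tanh are made up for illustration):

```python
import math

def layer(inputs, weights, biases):
    """One layer: a weighted sum per neuron, then a non-linearity (tanh here).
    Computed for every neuron at once, this is a matrix multiplication,
    the operation graphics cards were built to do very fast."""
    outputs = []
    for w_row, b in zip(weights, biases):
        s = sum(w * x for w, x in zip(w_row, inputs)) + b  # weighted sum
        outputs.append(math.tanh(s))                       # non-linearity
    return outputs

x = [0.5, -1.0, 2.0]                     # a 3-value input (hypothetical)
W = [[0.1, 0.2, 0.3], [-0.3, 0.0, 0.4]]  # 2 neurons, 3 weights each
b = [0.0, 0.1]
print(layer(x, W, b))                    # 2 output activations
```

On a GPU, libraries batch many such weighted sums into one large matrix multiply over whole groups of inputs at once, which is where the orders-of-magnitude speedup came from.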
So let me now go back to my first point
that these deep networks are deeper,
obviously, and more structured.
Remember, I talked about
multi-layer neural networks
where, instead of just having an input
that goes directly to the final output,
there's intermediate weighted sums
and non-linearities.
And each layer of those weighted sums
and non-linearities,
in between the input and the final output,
is called a hidden layer.
In deep neural networks, there are typically
many, many hidden layers,
many more than were used
in the 80s or 90s,
when you might have had one or two
hidden layers in a typical neural network.
Nowadays, it's possible to have a few dozen
or even hundreds of hidden layers
in modern deep architectures.
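To make the depth idea concrete, here is a rough sketch, not from the lecture, of a forward pass through a stack of thirty hidden layers; the layer sizes and the random, untrained weights are purely illustrative:

```python
import math
import random

random.seed(0)

def dense(inputs, weights, biases):
    """One hidden layer: weighted sums followed by a tanh non-linearity."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def make_layer(n_in, n_out):
    """Random (untrained) weights, one row per output neuron."""
    return ([[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

# In the 80s or 90s a typical network had one or two hidden layers;
# here we chain thirty of them, as modern deep architectures do.
sizes = [8] + [16] * 30 + [4]
layers = [make_layer(n_in, n_out) for n_in, n_out in zip(sizes, sizes[1:])]

h = [0.1] * 8            # the input
for W, b in layers:      # each layer's output feeds the next layer
    h = dense(h, W, b)
print(h)                 # 4 final output values
```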
Now, I should note two things about this.
Prior to the rise of deep architectures,
training networks with
many, many hidden layers
ran into various kinds of
technical difficulties.
However, by tweaking
the non-linear function,
it turned out that it actually is possible
to train neural networks
with many hidden layers and resolve
some of these computational difficulties.
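The lecture doesn't say which non-linearity was tweaked, but a well-known example is replacing the sigmoid with the ReLU. The sketch below, my own illustration rather than the lecture's, shows one of the difficulties this addresses: the sigmoid's slope is at most 0.25, so the slope factors that backpropagation multiplies together shrink toward zero across many layers, while an active ReLU unit has slope exactly 1:

```python
import math

def sigmoid_slope(x):
    """Derivative of the sigmoid; its maximum value is 0.25 (at x = 0)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_slope(x):
    """Derivative of the ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

layers = 30
grad_sigmoid = 1.0
grad_relu = 1.0
for _ in range(layers):                 # one slope factor per layer
    grad_sigmoid *= sigmoid_slope(0.0)  # even the sigmoid's best case
    grad_relu *= relu_slope(1.0)        # an active ReLU unit

print(grad_sigmoid)  # 0.25 ** 30, vanishingly small
print(grad_relu)     # 1.0, the gradient survives
```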
The other thing I'd like to note is that
we don't have a very good sense
about why having more hidden layers
helps improve performance.
We have some ideas.
Generally, if you look at neural networks
with many hidden layers;
for example, I've diagrammed
this prototypical example neural network
that takes in images,
so it takes in pixel information and outputs,
for example, the name of the person;
you might see this being used
in Facebook, for example,
when it recognizes
your friends from pictures.
In a network like this, when
there are many, many hidden layers,
and we look at the kinds of patterns
that seem to activate the neurons
in the different layers,
they seem to become, in some sense,
more and more abstract and conceptual.
So, at the earliest layers,
what really turns the neurons on,
are things like edges
or high contrast spots.
At intermediate layers, the neurons seem
to be activated by things like
noses, ears, mouths, parts of the face.
And towards the final layers,
it almost seems to be that
the neurons are responding
to what might be called
prototypical faces
or some kind of underlying variation
in the types of faces and
expressions that people have.
So we see that by adding more layers,
it might be that we're able to capture
higher and higher-level concepts
and more abstract concepts that
are then recombined in useful ways.
So essentially, we can think of
deep neural networks
as encoding some assumption that
the kinds of data we're interested in
is frequently hierarchical.
It has many scales and it reuses
some of the lower scale components
in various ways in the higher scales.
The other thing that was different between
more recent deep network architectures
and more traditional approaches
in neural networks,
was that many of the deep learning
architectures have a lot more structure.
So here on the screen,
on the left-hand side,
you see a more traditional neural network.
Even though it has many hidden layers,
you see essentially all the neurons
in one layer are connected
to all the neurons in the next layer.
For example, take the winning neural network
in ImageNet that we discussed previously:
we show the topology of that network
on the right-hand side
using a kind of block diagram.
Each of the cubes represents
a whole group of neurons.
Here, you can see that
there's a lot of structure there.
There's kind of two streams,
the sizes of blocks are changing,
some of them are densely interconnected,
some are not interconnected.
So there's a lot of knowledge
and design put into
how the neurons are
interconnected to each other.
I should also add, especially
for image tasks including ImageNet,
what's often used are
so-called convolutional layers.
Convolutional layers have very structured
repetitive weight patterns.
And so they also impose
a kind of constraint on the neurons
and impose a certain kind of structure
on the connectivity pattern
that's possible for the neural network.
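To illustrate what a structured, repetitive weight pattern looks like, here is a minimal 1-D convolution, my own sketch with a made-up kernel: the same three weights slide across every position of the input, instead of every output neuron having its own full set of weights as in a densely connected layer:

```python
def conv1d(signal, kernel):
    """Slide the same small kernel across the signal; every output
    position reuses exactly the same weights (weight sharing)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [0, 0, 1, 1, 1, 0, 0]  # a simple step pattern
edge_kernel = [-1, 0, 1]        # three shared weights; responds to edges

print(conv1d(signal, edge_kernel))  # -> [1, 1, 0, -1, -1]
```

The 2-D version of the same idea, with small learned kernels, is what the convolutional layers in image networks use: the kernel detects the same local pattern, such as an edge, wherever it occurs in the image.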
So we see that, unlike more
traditional neural networks,
deep nets are often very structured.
They don't just have everything
connected to everything else
as was assumed to be acceptable before.
As I mentioned, designing such architectures
requires quite a bit of domain knowledge,
and it's actually more
of an art than a science.
People don't really
understand how it works,
but it seems to make a big difference
on the performance of the neural networks.
But interestingly, there's been
some recent work showing
that we can actually train
machine learning algorithms
to themselves design the topology
of the neural networks
which are then trained on big data sets.
And this is very interesting because
it's a kind of meta-learning
or meta-design: machine learning
algorithms designing
other machine learning algorithms
and doing just as well
or even better than people can.
So, probably, this is
the beginning of the singularity.
Now, given this recipe that I mentioned
of large amounts of data,
lots of computing power and training
on graphics processors,
and structured architectures of
the connectivity between neurons,
deep networks are coming to dominate
almost all domains of machine learning
or at least many, many of them.
We already talked about image recognition,
classifying images according
to the object inside of them.
Now, voice recognition.
So many people noticed that,
for example, Siri on the iPhone
or the voice recognition on Android
got much, much better all of a sudden.
They could suddenly recognize
what people were saying
with very high accuracy.
A lot of this was due to neural networks
and deep neural networks
being used in this application.
Translation is another aspect.
So machine translation, translating from
one human language to another,
is traditionally an extremely
difficult task for artificial intelligence,
and it was thought that statistical models
like neural networks
and many other kinds of
machine learning algorithms
would never really do
very well at such tasks
because language is too structured.
There's too much syntax
and too many rules to follow.
It turns out that, given enough data,
deep neural networks
actually do great at this task.
And, if you've used Google Translate,
they moved from a system
that used hand-designed features,
designed over many decades by linguists,
to essentially training
a huge, deep neural net
on large bodies of text from
the internet,
and it does better translating
than the hand-designed algorithms.
And finally,
we already mentioned things like
video games and board games
being supervised learning problems.
Brendan talked about the development
of go-playing algorithms,
and, actually, a big chunk of
the machine learning
that was used in AlphaGo
was a deep neural net.
And so, in combination
with other techniques,
we saw with AlphaGo that deep neural nets
actually solved an AI task
that was thought to be intractable
for many, many, many years.
And there's many other examples
of deep learning doing very well
at tasks that were thought
to be very difficult.