In the last video, we briefly talked about
the history of neural network research,
and we discussed how in the early 2010s
neural network architecture started
to win machine learning competitions.
In particular, there was
the ImageNet competition
which involved classifying images
in which the neural network entry
did significantly better
than any non-neural network entry.
And, interestingly, this was so even though
the non-neural-network entries
used hand-coded features
and hand-tuned algorithms.
This reminds us a little bit
of the development
of machine learning in go-playing
where AlphaGo actually succeeded
and improved by reducing
the impact of hand-coded features,
replacing them with features
learned entirely automatically
from large amounts of data.
Now, what is it that made
the neural network entry
in ImageNet so successful?
Today, we would call
the kind of architecture
that was entered into
ImageNet and won it
a deep neural network
or a deep learning architecture.
And, even though
this is a bit of a buzzword,
there are a few typical characteristics
that define such neural networks
and distinguish them from
previous uses of neural networks.
So, what happened?
What was different?
First, the very design and architecture
of the neural network was deeper
and more structured.
And I'll define what I mean
by both of those terms in a minute.
Second, there was
a huge data set provided.
The data sets used in these competitions
grew bigger and bigger every year,
and it turns out that neural networks
seem to do really well
when there's a lot of data provided.
Lastly, the team that entered
this neural network
used graphics cards, which were developed
essentially for faster gaming
and better graphics in games,
but which also do an operation
called matrix multiplication
that's used extensively
in neural networks,
and they used these to train
the neural network
several orders of magnitude faster than
would be possible with regular processors.
So this allowed them to train the network
for a long time on a large amount of data,
and it turns out that just using
more data and more training
makes a big difference with neural networks.
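As an aside, the matrix multiplication the team exploited is easy to see in code: each layer of a neural network computes one weighted sum per neuron and then applies a non-linearity, and doing all those weighted sums at once is exactly a matrix multiply. A minimal pure-Python sketch (the sizes, weights, and the choice of tanh are made up for illustration):

```python
import math

def layer(inputs, weights, biases):
    """One layer: a weighted sum per neuron, then a non-linearity (tanh here).
    Computed for every neuron at once, this is a matrix multiplication,
    the operation graphics cards were built to do very fast."""
    outputs = []
    for w_row, b in zip(weights, biases):
        s = sum(w * x for w, x in zip(w_row, inputs)) + b  # weighted sum
        outputs.append(math.tanh(s))                       # non-linearity
    return outputs

x = [0.5, -1.0, 2.0]                     # a 3-value input (hypothetical)
W = [[0.1, 0.2, 0.3], [-0.3, 0.0, 0.4]]  # 2 neurons, 3 weights each
b = [0.0, 0.1]
print(layer(x, W, b))                    # 2 output activations
```

On a GPU, libraries batch many such weighted sums into one large matrix multiply over whole groups of inputs at once, which is where the orders-of-magnitude speedup came from.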
So let me now go back to my first point
that these deep networks are deeper,
obviously, and more structured.
Remember, I talked about
multi-layer neural networks
where, instead of just having an input
that goes directly to the final output,
there's intermediate weighted sums
and non-linearities.
And each layer of those weighted sums
and non-linearities,
in between the input and the final output,
is called a hidden layer.
In deep neural networks, there are typically
many, many hidden layers,
many more than were used
in the 80s or 90s,
when you might have had one or two
hidden layers in a typical neural network.
Nowadays, it's possible to have a few dozen
or even hundreds of hidden layers
in modern deep architectures.
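To make the depth idea concrete, here is a rough sketch, not from the lecture, of a forward pass through a stack of thirty hidden layers; the layer sizes and the random, untrained weights are purely illustrative:

```python
import math
import random

random.seed(0)

def dense(inputs, weights, biases):
    """One hidden layer: weighted sums followed by a tanh non-linearity."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def make_layer(n_in, n_out):
    """Random (untrained) weights, one row per output neuron."""
    return ([[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

# In the 80s or 90s a typical network had one or two hidden layers;
# here we chain thirty of them, as modern deep architectures do.
sizes = [8] + [16] * 30 + [4]
layers = [make_layer(n_in, n_out) for n_in, n_out in zip(sizes, sizes[1:])]

h = [0.1] * 8            # the input
for W, b in layers:      # each layer's output feeds the next layer
    h = dense(h, W, b)
print(h)                 # 4 final output values
```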
Now, I should note two things about this.
Prior to the rise of deep architectures,
training networks with
many, many hidden layers
ran into various kinds of
technical difficulties.
However, by tweaking
the non-linear function,
it turned out that it actually is possible
to train neural networks
with many hidden layers and resolve
some of these computational difficulties.
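The lecture doesn't say which non-linearity was tweaked, but a well-known example is replacing the sigmoid with the ReLU. The sketch below, my own illustration rather than the lecture's, shows one of the difficulties this addresses: the sigmoid's slope is at most 0.25, so the slope factors that backpropagation multiplies together shrink toward zero across many layers, while an active ReLU unit has slope exactly 1:

```python
import math

def sigmoid_slope(x):
    """Derivative of the sigmoid; its maximum value is 0.25 (at x = 0)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_slope(x):
    """Derivative of the ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

layers = 30
grad_sigmoid = 1.0
grad_relu = 1.0
for _ in range(layers):                 # one slope factor per layer
    grad_sigmoid *= sigmoid_slope(0.0)  # even the sigmoid's best case
    grad_relu *= relu_slope(1.0)        # an active ReLU unit

print(grad_sigmoid)  # 0.25 ** 30, vanishingly small
print(grad_relu)     # 1.0, the gradient survives
```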
The other thing I'd like to note is that
we don't have a very good sense
about why having more hidden layers
helps improve performance.
We have some ideas.
Generally, if you look at neural networks
with many hidden layers;
for example, I've diagrammed
this prototypical example neural network
that takes in images,
so it takes in pixel information and outputs,
for example, the name of the person;
you might see this being used
in Facebook, for example,
when it recognizes
your friends from pictures.
In a network like this, when
there are many, many hidden layers,
and we look at the kinds of patterns
that seem to activate the neurons
in the different layers,
they seem to become, in some sense,
more and more abstract and conceptual.
So, at the earliest layers,
what really turns the neurons on,
are things like edges
or high contrast spots.
At intermediate layers, the neurons seem
to be activated by things like
noses, ears, mouths, parts of the face.
And towards the final layers,
it almost seems to be that
the neurons are responding
to what might be called
prototypical faces
or some kind of underlying variation
in the types of faces and
expressions that people have.
So we see that by adding more layers,
it might be that we're able to capture
higher and higher-level concepts
and more abstract concepts that
are then recombined in useful ways.
So essentially, we can think of
deep neural networks
as encoding some assumption that
the kinds of data we're interested in
is frequently hierarchical.
It has many scales and it reuses
some of the lower scale components
in various ways in the higher scales.
The other thing that was different between
more recent deep network architectures
and more traditional approaches
in neural networks,
was that many of the deep learning
architectures have a lot more structure.
So here on the screen,
on the left-hand side,
you see a more traditional neural network.
Even though it has many hidden layers,
you see essentially all the neurons
in one layer are connected
to all the neurons in the next layer.
For example, take the winning neural network
in ImageNet that we discussed previously:
we show the topology of that network
on the right-hand side
using a kind of block diagram.
Each of the cubes represents
a whole group of neurons.
Here, you can see that
there's a lot of structure there.
There's kind of two streams,
the sizes of blocks are changing,
some of them are densely interconnected,
some are not interconnected.
So there's a lot of knowledge
and design put into
how the neurons are
interconnected to each other.
I should also add, especially
for image tasks including ImageNet,
what's often used are
so-called convolutional layers.
Convolutional layers have very structured
repetitive weight patterns.
And so they also impose
a kind of constraint on the neurons
and impose a certain kind of structure
on the connectivity pattern
that's possible for the neural network.
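To illustrate what a structured, repetitive weight pattern looks like, here is a minimal 1-D convolution, my own sketch with a made-up kernel: the same three weights slide across every position of the input, instead of every output neuron having its own full set of weights as in a densely connected layer:

```python
def conv1d(signal, kernel):
    """Slide the same small kernel across the signal; every output
    position reuses exactly the same weights (weight sharing)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [0, 0, 1, 1, 1, 0, 0]  # a simple step pattern
edge_kernel = [-1, 0, 1]        # three shared weights; responds to edges

print(conv1d(signal, edge_kernel))  # -> [1, 1, 0, -1, -1]
```

The 2-D version of the same idea, with small learned kernels, is what the convolutional layers in image networks use: the kernel detects the same local pattern, such as an edge, wherever it occurs in the image.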
So we see that, unlike more
traditional neural networks,
deep nets are often very structured.
They don't just have everything
connected to everything else
as was assumed to be acceptable before.
As I mentioned, designing such architectures
requires quite a bit of domain knowledge,
and it's actually more
of an art than a science.
People don't really
understand how it works,
but it seems to make a big difference
on the performance of the neural networks.
But interestingly, there's been
some recent work showing
that we can actually train
machine learning algorithms
to themselves design the topology
of the neural networks
which are then trained on big data sets.
And this is very interesting because
it's a kind of meta-learning
or meta-design: machine learning
algorithms designing
other machine learning algorithms
and doing just as well
or even better than people can.
So, probably, this is
the beginning of the singularity.
Now, given this recipe that I mentioned
of large amounts of data,
lots of computing power and training
on graphics processors,
and structured architectures of
the connectivity between neurons,
deep networks are coming to dominate
almost all domains of machine learning
or at least many, many of them.
We already talked about image recognition,
classifying images according
to the object inside of them.
Now, voice recognition.
So many people noticed that,
for example, Siri on the iPhone
or the voice recognition on Android
got much, much better all of a sudden.
They could suddenly recognize
what people were saying
with very high accuracy.
A lot of this was due to neural networks
and deep neural networks
being used in this application.
Translation is another aspect.
So machine translation, translating from
one human language to another,
is traditionally an extremely
difficult task for artificial intelligence,
and it was thought that statistical models
like neural networks
and many other kinds of
machine learning algorithms
would never really do
very well at such tasks
because language is too structured.
There's too much syntax
and too many rules to follow.
It turns out that, given enough data,
deep neural networks
actually do great at this task.
And, if you've used Google Translate,
they moved from a system
that used hand-designed features,
designed over many decades by linguists,
to essentially training
a huge, deep neural net
on large bodies of text from
the internet,
and it does better translating
than the hand-designed algorithms.
And finally,
we already mentioned things like
video games and board games
being supervised learning problems.
Brendan talked about the development
of go-playing algorithms,
and, actually, a big chunk of
the machine learning
that was used in AlphaGo
was a deep neural net.
And so, in combination
with other techniques,
we saw with AlphaGo that deep neural nets
actually solved an AI task
that was thought to be intractable
for many, many, many years.
And there's many other examples
of deep learning doing very well
at tasks that were thought
to be very difficult.