In this lesson we will see
some applications
of algorithmic information dynamics,
perturbation causal calculus,
and reprogrammability
to molecular biology and genetics -
in particular to biological networks
and cell dynamics.
Before getting into details,
it is important to understand
what a gene ontology is.
As you may know,
even if we know very little
about the specific functions of many genes,
a good amount of knowledge
has been accumulated and published
as part of the biological literature
in journals over the last decades.
So, a gene ontology is a database
that contains that knowledge
from the literature in a computable form
that can be used for other purposes -
in particular, for validation
of other new experiments.
So, in general, these databases include
some categories for each gene or protein,
associated with the processes in which they
are known or believed to be involved.
By "believed" I mean that
there is good statistical evidence
that this is actually the case.
And those categories can be
the cellular components
with which they are associated,
and the types of molecular functions
they are known for.
For example, some genes may be known
to be related to energy production,
to building a specific tissue, and so on.
This is what a category in a gene
ontology database may look like.
Different genes and proteins
may be associated with each of these nodes
and they can be associated
with more than one.
And, if a gene is associated
with a smaller component,
it is also associated
with the larger component
that contains the smaller one.
And that is why
it is called an "ontology" -
because it is basically a hierarchical
representation of knowledge.
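As a rough sketch of that hierarchy, here is a toy Python illustration of how an annotation propagates upward to every containing term. The terms and their nesting below are made up for illustration, not taken from a real gene ontology database:

```python
# Child term -> parent term: a tiny, hypothetical fragment of an ontology.
# If a gene is annotated with a term, it is implicitly annotated with
# every ancestor term that contains it.
parents = {
    "mitochondrial membrane": "mitochondrion",
    "mitochondrion": "cytoplasm",
    "cytoplasm": "cell",
}

def all_annotations(term):
    """Return the term plus all of its ancestors, from smallest to largest."""
    terms = [term]
    while term in parents:
        term = parents[term]
        terms.append(term)
    return terms

print(all_annotations("mitochondrial membrane"))
# ['mitochondrial membrane', 'mitochondrion', 'cytoplasm', 'cell']
```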
This is an example of a gene product
and its different categories
taken from
a gene ontology database.
So, this cytochrome c
is associated - for example -
with a process related to cell death
and is related to the mitochondria.
This is another example
and representation,
more in the format
of what is called a "gene browser" -
the name that is given to browsers
in which genes and proteins
can be queried to find associations
in gene ontology databases.
This is a breakdown of biological terms
and how many genes and proteins
are in each category and subcategories -
and how many of them are associated
with different categories.
But, now that you know
what a gene ontology is,
let's get back to the application
of algorithmic complexity to biology.
A landmark paper
in the area of application
of algorithmic information theory
to molecular biology and genetics
is this one
by Cilibrasi and Vitányi -
two very important researchers
in the field of Kolmogorov complexity.
They defined a measure
called "normalized information distance,"
and a related measure
that allows numerical approximation,
called
"normalized compression distance,"
which can be implemented using popular
lossless compression algorithms.
And, they applied this measure
to full genomes of different animals,
extracted precisely
from one of those popular gene browsers.
And, what they did was to compress
those full genomes
and compare them to each other
according to the compression measure.
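The normalized compression distance replaces the uncomputable algorithmic complexity K with the size C of a compressed file: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). A minimal sketch in Python, using zlib as the lossless compressor and toy strings standing in for genomes:

```python
import random
import zlib

def C(s: bytes) -> int:
    """Approximate the algorithmic complexity of s by its compressed size."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two strings."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy sequences standing in for full genomes (for illustration only):
# g1 and g2 share most of their structure, g3 is a random sequence.
g1 = b"ACGT" * 200
g2 = b"ACGT" * 190 + b"TTGGCCAA" * 5
g3 = bytes(random.Random(0).choices(b"ACGT", k=800))

print(ncd(g1, g2))  # lower: the sequences share most of their structure
print(ncd(g1, g3))  # higher: little shared structure with the random one
```

Species whose genomes share more structure get smaller distances, which is what lets a clustering of the distance matrix recover a tree.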
And, what they found is that
they were able to reconstruct
a phylogenetic tree
that closely corresponds
to the evolutionary history
and genetic relationship of each species.
Of course, this was an important result
but it should be put in context.
One major problem and challenge
in science
is that we often don't talk
to other scientists in different areas -
and I think this case
illustrates it very well.
Because, it turns out
that there is something
in the genome of all species
that is called "the GC content,"
which is basically the number
of Gs and Cs - that is, the nucleotides -
that are in a DNA sequence
or in a full genome.
Well, it turns out that each species
has a very specific GC content
that is about the same for two species
that are evolutionarily related,
because by definition,
their genomic alignment is very high.
In other words, the gold standard
of the phylogenetic tree reconstruction
is built by aligning genomic sequences.
But, looking only at the GC content,
one can reconstruct it
with very high accuracy and precision.
GC content is nothing but a function
of the Shannon entropy of a DNA sequence.
So, it is also not difficult to see
how Shannon entropy alone
would reconstruct
the same phylogenetic tree.
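The connection is easy to check in code. Below is a minimal sketch computing the GC content and the Shannon entropy of a short, made-up DNA string; note that if one only distinguishes G/C from A/T, the entropy of that binary split is exactly a function of the GC fraction p, namely -p log2(p) - (1-p) log2(1-p):

```python
import math
from collections import Counter

def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in a DNA sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def shannon_entropy(seq: str) -> float:
    """Shannon entropy, in bits per symbol, of the nucleotide frequencies."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

seq = "ATGCGCGCTAATCGGCTA"  # a made-up sequence, for illustration
print(gc_content(seq))
print(shannon_entropy(seq))

# The entropy of the binary G/C-vs-A/T split depends on GC content alone:
p = gc_content(seq)
binary_entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(binary_entropy)
```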
So, we are in the situation
in which even the most trivial measures
can reconstruct the same phylogenetic tree
that Cilibrasi and Vitányi reconstructed
with quite a convoluted measure,
using a black box algorithm,
which is a lossless compression algorithm.
This is why we try to be so involved
with experts in other fields,
and why we have also pushed ourselves
to learn as much as we can
about the areas in which we are trying
to make a contribution,
especially molecular biology
and genetics.
So, let's get back
to algorithmic information dynamics
and how it can contribute to the study
of these types of biological systems
perhaps in a very different way.
Just as a reminder, you may recall
that one can identify
the elements of a network
by applying our perturbation
causal calculus,
to evaluate how each element
of a network or a system in general
contributes to the original system
or network.
In this case, for example,
we delete every node
and see how much
the algorithmic complexity
of the mutated network changed
with respect to
the algorithmic complexity
of the original network -
in this case, Watts-Strogatz's
small-world network
with different rewiring probabilities.
You know, you start
from a regular cycle graph
and then randomly reconnect a few edges
according to that rewiring
probability value.
By performing our perturbation calculus,
we are able to detect which nodes
were involved in the rewiring,
because they increased the algorithmic
complexity of the mutated network
that came from the original regular cycle.
Therefore, we can measure by how much,
and color each node
according to its perturbability magnitude,
providing a disruptability index
in some way, so to speak.
We did the same
with one of the most curated
and validated
biological networks available -
the transcription factor network
of E. coli.
A transcription factor network
is a network that does not include
all the genes of an organism,
such as this bacterium,
but only the few genes
that regulate other genes.
These are called
"transcription factors"
because they regulate the transcription
of those other genes -
that is, whether
they are expressed or not.
So, basically,
turning genes on and off.
Here, we colored in red
those genes that,
if deleted, would send the network
towards randomness,
while nodes further away from red,
along a regular color spectrum,
send the network away from randomness -
so effectively ranking all the genes
in this network.
These are the information signatures
from the application
of the perturbation analysis.
We were able to cluster genes
by whether they have about the same
algorithmic information value
or contribution to the original network.
This is how the network looks
when separating nodes
by their contribution,
either to move the E. coli network
towards randomness
or away from randomness according to BDM.
So, because a lot is known
about the E. coli and all its genes,
and - in particular -
the transcription factors,
we extracted the gene ontology terms
for each gene in each cluster
and checked whether genes in the same cluster
were associated with similar processes.
It turns out that they were.
These very small p-values tell us
that, according to BDM,
genes are not clustered by chance
but by gene function, with high accuracy,
because certain categories are
over-represented in each cluster.
This type of analysis
is called "enrichment analysis,"
because you are enriching your results
with all the known literature
about those genes,
and then testing
whether your experiment found something
that such literature also tells you.
And, that was actually the case.
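The statistical core of such an enrichment analysis is typically a hypergeometric over-representation test: given how many genes in the background carry a category, how surprising is the number carrying it inside one cluster? A minimal sketch with hypothetical gene counts (not the actual E. coli figures):

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """Hypergeometric upper-tail p-value: the probability of drawing at
    least k genes from a category of size K when sampling a cluster of
    n genes from a background of N genes."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# Hypothetical numbers: 1500 genes in total, 40 annotated with some
# category, and a cluster of 30 genes of which 8 carry that category.
p = enrichment_pvalue(N=1500, K=40, n=30, k=8)
print(p)  # very small: the category is over-represented in the cluster
```

The expected count here would be 30 * 40 / 1500 = 0.8 genes, so observing 8 yields a tiny p-value; that is what "over-represented" means quantitatively.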
The same happened when testing
with the three most important
gene ontology databases.
Here is, for example, KEGG.
And, here is EcoCyc,
strengthening our finding
that algorithmic information dynamics
is actually capturing function
when clustering genes
by algorithmic causal
perturbation analysis.
However, when using other measures
exactly in the same way as we used BDM,
but replaced by measures
such as Shannon entropy,
no significant groups were found
after the gene ontology
enrichment analysis.
They produced three clusters
that we call "above the baseline,"
"baseline" and "below the baseline."
So, entropy...
proved to be less sensitive.
This is how entropy colors the network,
and you see how poorly it performs.
Only two clusters could be identified
using lossless compression,
in this case with a standard
compression algorithm, dividing cases
into above baseline and baseline only.
Above baseline nodes were enriched
for transcription factors,
and no significant groups were found
after the enrichment analysis.
This is how compression was able
to color the network -
quite insensitive to small changes
as one would expect,
because - as we have shown before -
small changes are traditionally
not captured by compression.
So, compression is quite useless
when it comes to this kind of
perturbation analysis.
Nothing particularly interesting came up
when using lossless
compression algorithms
after enrichment analysis either.
Moreover, unlike graph-theoretic measures,
which can be described
as single or composed functions
of other graph-theoretic measures,
BDM was not found to be correlated
with any of the most popular measures,
such as lossless compression
and Shannon entropy,
nor with node degree, in-degree,
out-degree, or betweenness centrality.
This diagram summarizes the results
for E. coli
using the three most popular
gene ontology databases,
the three of them producing
the same results
when classifying transcription factors
in the most studied
biological network of this bacteria
that kills so many people
around the world.
There are many questions yet to answer.
For example, whether such a perturbation
analysis can tell us which genes
are more fundamental than others.