In this lesson we will see
some applications
of algorithmic information dynamics,
perturbation causal calculus,
and reprogrammability
to molecular biology and genetics -
in particular to biological networks
and cell dynamics.
Before getting into details,
it is important to understand
what a gene ontology is.
As you may know,
even if we know very little
about the specific functions of many genes,
a good amount of knowledge
has been accumulated and published
as part of the biological literature
in journals over the last decades.
So, a gene ontology is a database
that contains that knowledge
from the literature in a computable form
that can be used for other purposes -
in particular, for validation
of other new experiments.
So, in general, these databases include
some categories for each gene or protein,
associated with the processes in which they
are known or believed to be involved.
By "believed" I mean that
there is good statistical evidence
that this is actually the case.
And those categories can be
the cellular components
with which they are associated,
and the types of molecular functions
they are known for.
For example, some genes may be known
to be related to energy production,
to building a specific tissue, and so on.
This is what a category in a gene
ontology database may look like.
Different genes and proteins
may be associated with each of these nodes
and they can be associated
with more than one.
And, if a gene is associated
with a smaller component,
it is also associated
with the larger component
that contains the smaller one.
And that is why
it is called an "ontology" -
because it is basically a hierarchical
representation of knowledge.
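As a rough sketch of that hierarchy, here is a toy Python illustration of how an annotation propagates upward to every containing term. The terms and their nesting below are made up for illustration, not taken from a real gene ontology database:

```python
# Child term -> parent term: a tiny, hypothetical fragment of an ontology.
# If a gene is annotated with a term, it is implicitly annotated with
# every ancestor term that contains it.
parents = {
    "mitochondrial membrane": "mitochondrion",
    "mitochondrion": "cytoplasm",
    "cytoplasm": "cell",
}

def all_annotations(term):
    """Return the term plus all of its ancestors, from smallest to largest."""
    terms = [term]
    while term in parents:
        term = parents[term]
        terms.append(term)
    return terms

print(all_annotations("mitochondrial membrane"))
# ['mitochondrial membrane', 'mitochondrion', 'cytoplasm', 'cell']
```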
This is an example of a gene product
and its different categories
taken from
a gene ontology database.
So, this cytochrome c
is associated - for example -
with a process related to cell death
and is related to the mitochondria.
This is another example
and representation,
more in the format
of what is called a "gene browser" -
the name that is given to browsers
in which genes and proteins
can be queried to find associations
in gene ontology databases.
This is a breakdown of biological terms
and how many genes and proteins
are in each category and subcategories -
and how many of them are associated
with different categories.
But, now that you know
what a gene ontology is,
let's get back to the application
of algorithmic complexity to biology.
A landmark paper
in the area of application
of algorithmic information theory
to molecular biology and genetics
is this one
by Cilibrasi and Vitányi -
two very important researchers
in the field of Kolmogorov complexity.
They defined a measure
called "normalized information distance,"
and a related measure
that allows numerical approximation,
called
"normalized compression distance,"
which can be implemented using popular
lossless compression algorithms.
And, they applied this measure
to full genomes of different animals,
extracted precisely
from one of those popular gene browsers.
And, what they did was to compress
those full genomes
and compare them to each other
according to the compression measure.
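The normalized compression distance replaces the uncomputable algorithmic complexity K with the size C of a compressed file: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). A minimal sketch in Python, using zlib as the lossless compressor and toy strings standing in for genomes:

```python
import random
import zlib

def C(s: bytes) -> int:
    """Approximate the algorithmic complexity of s by its compressed size."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two strings."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy sequences standing in for full genomes (for illustration only):
# g1 and g2 share most of their structure, g3 is a random sequence.
g1 = b"ACGT" * 200
g2 = b"ACGT" * 190 + b"TTGGCCAA" * 5
g3 = bytes(random.Random(0).choices(b"ACGT", k=800))

print(ncd(g1, g2))  # lower: the sequences share most of their structure
print(ncd(g1, g3))  # higher: little shared structure with the random one
```

Species whose genomes share more structure get smaller distances, which is what lets a clustering of the distance matrix recover a tree.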
And, what they found is that
they were able to reconstruct
a phylogenetic tree
that closely corresponds
to the evolutionary history
and genetic relationship of each species.
Of course, this was an important result
but it should be put in context.
One major problem and challenge
in science
is that we often don't talk
to other scientists in different areas -
and I think this case
illustrates it very well.
Because, it turns out
that there is something
in the genome of all species
that is called "the GC content,"
which is basically the number
of Gs and Cs - that is, the nucleotides -
that are in a DNA sequence
or in a full genome.
Well, it turns out that each species
has a very specific GC content
that is about the same for two species
that are evolutionarily related,
because by definition,
their genomic alignment is very high.
In other words, the gold standard
of the phylogenetic tree reconstruction
is built by aligning genomic sequences.
But, looking only at the GC content,
one can reconstruct it
with very high accuracy and precision.
GC content is nothing but a function
of the Shannon entropy of a DNA sequence.
So, it is also not difficult to see
how Shannon entropy alone
would reconstruct
the same phylogenetic tree.
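The connection is easy to check in code. Below is a minimal sketch computing the GC content and the Shannon entropy of a short, made-up DNA string; note that if one only distinguishes G/C from A/T, the entropy of that binary split is exactly a function of the GC fraction p, namely -p log2(p) - (1-p) log2(1-p):

```python
import math
from collections import Counter

def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in a DNA sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def shannon_entropy(seq: str) -> float:
    """Shannon entropy, in bits per symbol, of the nucleotide frequencies."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

seq = "ATGCGCGCTAATCGGCTA"  # a made-up sequence, for illustration
print(gc_content(seq))
print(shannon_entropy(seq))

# The entropy of the binary G/C-vs-A/T split depends on GC content alone:
p = gc_content(seq)
binary_entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(binary_entropy)
```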
So, we are in the situation
in which even the most trivial measures
can reconstruct the same phylogenetic tree
that Cilibrasi and Vitányi reconstructed
with quite a convoluted measure,
using a black box algorithm,
which is a lossless compression algorithm.
This is why we try to be so involved
with experts in other fields,
and why we have also pushed ourselves
to learn as much as we can
about the areas in which we are trying
to make a contribution,
especially molecular biology
and genetics.
So, let's get back
to algorithmic information dynamics
and how it can contribute to the study
of these types of biological systems
perhaps in a very different way.
Just as a reminder, you may recall
that one can identify
the elements of a network
by applying our perturbation
causal calculus,
to evaluate how each element
of a network or a system in general
contributes to the original system
or network.
In this case, for example,
we delete every node
and see how much
the algorithmic complexity
of the mutated network changed
with respect to
the algorithmic complexity
of the original network -
in this case, Watts-Strogatz's
small-world network
with different rewiring probabilities.
You know, you start
from a regular cycle graph
and then randomly reconnect a few edges
according to that rewiring
probability value.
By performing our perturbation calculus,
we are able to detect which nodes
were involved in the rewiring,
because they increased the algorithmic
complexity of the mutated network
that came from the original regular cycle.
Therefore, we can measure by how much,
and color each node
according to its perturbability magnitude,
providing a disruptability index
in some way, so to speak.
We did the same
with one of the most curated
and validated
biological networks available -
the transcription factor network
of E. coli.
A transcription factor network
is a network that does not include
all the genes of an organism,
such as this bacterium,
but only the few genes
that regulate other genes.
These are called
"transcription factors"
because they regulate the transcription
of those other genes -
that is, whether
they are expressed or not.
So, basically,
turning genes on and off.
Here, we colored in red
those genes that,
if deleted, would send the network
towards randomness,
while nodes further away from red,
along a regular color spectrum,
send the network away from randomness -
so effectively ranking all the genes
in this network.
These are the information signatures
from the application
of the perturbation analysis.
We were able to cluster genes
by whether they have about the same
algorithmic information value
or contribution to the original network.
This is how the network looks
when separating nodes
by their contribution,
either to move the E. coli network
towards randomness
or away from randomness according to BDM.
So, because a lot is known
about the E. coli and all its genes,
and - in particular -
the transcription factors,
we extracted the gene ontology terms
for each gene in each cluster
and checked whether genes in the same cluster
were associated with similar processes.
It turns out that they were.
These very small p-values tell us
that, according to BDM,
genes are not clustered by chance
but by gene function, with high accuracy,
because certain categories are
over-represented in each cluster.
This type of analysis
is called "enrichment analysis,"
because you are enriching your results
with all the known literature
about those genes,
and then testing
whether your experiment found something
that such literature also tells you.
And, that was actually the case.
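The statistical core of such an enrichment analysis is typically a hypergeometric over-representation test: given how many genes in the background carry a category, how surprising is the number carrying it inside one cluster? A minimal sketch with hypothetical gene counts (not the actual E. coli figures):

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """Hypergeometric upper-tail p-value: the probability of drawing at
    least k genes from a category of size K when sampling a cluster of
    n genes from a background of N genes."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# Hypothetical numbers: 1500 genes in total, 40 annotated with some
# category, and a cluster of 30 genes of which 8 carry that category.
p = enrichment_pvalue(N=1500, K=40, n=30, k=8)
print(p)  # very small: the category is over-represented in the cluster
```

The expected count here would be 30 * 40 / 1500 = 0.8 genes, so observing 8 yields a tiny p-value; that is what "over-represented" means quantitatively.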
The same happened when testing
with the three most important
gene ontology databases.
Here is, for example, KEGG.
And, here is EcoCyc,
strengthening our finding
that algorithmic information dynamics
is actually capturing function
when clustering genes
by algorithmic causal
perturbation analysis.
However, when using other measures
exactly in the same way as we used BDM,
but replaced by measures
such as Shannon entropy,
no significant groups were found
after the gene ontology
enrichment analysis.
They produced three clusters
that we call "above the baseline,"
"baseline" and "below the baseline."
So, entropy...
proved to be less sensitive.
This is how entropy colors the network,
and you see how poorly it performs.
Only two clusters could be identified
using lossless compression,
in this case with a standard
compression algorithm, dividing cases
into above baseline and baseline only.
Above baseline nodes were enriched
for transcription factors,
and no significant groups were found
after the enrichment analysis.
This is how compression was able
to color the network -
quite insensitive to small changes
as one would expect,
because - as we have shown before -
small changes are traditionally
not captured by compression.
So, compression is quite useless
when it comes to this kind of
perturbation analysis.
Nothing particularly interesting came up
when using lossless
compression algorithms
after enrichment analysis either.
Moreover, unlike graph-theoretic measures,
which can be described
as single or composed functions
of other graph-theoretic measures,
BDM was not found to be correlated
with any of the most popular measures,
such as lossless compression
and Shannon entropy,
nor with node degree, in-degree,
out-degree, or betweenness centrality.
This diagram summarizes the results
for E. coli
using the three most popular
gene ontology databases,
the three of them producing
the same results
when classifying transcription factors
in the most studied
biological network of this bacteria
that kills so many people
around the world.
There are many questions yet to answer.
For example, whether such a perturbation
analysis can tell us which genes
are more fundamental than others.