orthomap: Step 4 - TEI calculation

This notebook will demonstrate how to to add a transcriptome evolutionary index (short: TEI) to scRNA data.

Notebook file

Notebook file can be obtained here:

https://raw.githubusercontent.com/kullrich/orthomap/main/docs/notebooks/add_tei.ipynb

Import libraries

[1]:

import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt
from statannot import add_stat_annotation
# increase dpi
%matplotlib inline
#plt.rcParams['figure.dpi'] = 300
#plt.rcParams['savefig.dpi'] = 300
plt.rcParams['figure.figsize'] = [6, 4.5]
#plt.rcParams['figure.figsize'] = [4.4, 3.3]

Import orthomap python package submodules

[2]:

# import submodules
from orthomap import qlin, gtf2t2g, of2orthomap, orthomap2tei, datasets

Step 0, Step 1, Step 2 and Step 3

In order to come to Step 4, TEI calculation, one needs to have the results from Step 0, Step 1, Step 2 and Step 3.

The query species in this part is: Danio rerio (zebrafish).

Please have a look at the documentation of Step 0 - run OrthoFinder to get to know what information and files are mandatory to extract gene age classes from OrthoFinder results.

In Step 1 - get taxonomic information you have already been introduced how to extract query lineage information with orthomap and the qlin.get_qlin() function.

In Step 2 - gene age class assignment you have already been introduced how to extract an orthomap (gene age class) from OrthoFinder results with orthomap and the of2orthomap.get_orthomap() function or how to import pre-calculated orthomaps with the orthomap2tei.read_orthomap() function.

In Step 3 - map gene/transcript IDs you have already been introduced how to extract gene IDs from GTF file with orthoamp and the gtf2t2g.parse_gtf() function. You have also been introduced how to use the orthomap2tei.geneset_overlap() function to check the overlap between the gene IDs and have learned how to use the orthomap2tei.replace_by() function to e.g. reduce isoform gene IDs to gene IDs.

Step 0 - run OrthoFinder

For this documentation part all mandatory OrthoFinder (Emms and Kelly, 2019) results have been pre-calculated.

Please have a look at the documentation of Step 0 - run OrthoFinder to get further insides.

The results are available here:

https://doi.org/10.5281/zenodo.7242264

or can be accessed with the dataset submodule of orthomap

datasets.ensembl105(datapath='data') (download folder set to 'data').

Step 1 - get taxonomic information

Please have a look at the documentation of Step 1 - get taxonomic information to get further insides.

[3]:

# get query species taxonomic lineage information
query_lineage = qlin.get_qlin(q='Danio rerio')

query name: Danio rerio
query taxID: 7955
query kingdom: Eukaryota
query lineage names:
['root(1)', 'cellular organisms(131567)', 'Eukaryota(2759)', 'Opisthokonta(33154)', 'Metazoa(33208)', 'Eumetazoa(6072)', 'Bilateria(33213)', 'Deuterostomia(33511)', 'Chordata(7711)', 'Craniata(89593)', 'Vertebrata(7742)', 'Gnathostomata(7776)', 'Teleostomi(117570)', 'Euteleostomi(117571)', 'Actinopterygii(7898)', 'Actinopteri(186623)', 'Neopterygii(41665)', 'Teleostei(32443)', 'Osteoglossocephalai(1489341)', 'Clupeocephala(186625)', 'Otomorpha(186634)', 'Ostariophysi(32519)', 'Otophysi(186626)', 'Cypriniphysae(186627)', 'Cypriniformes(7952)', 'Cyprinoidei(30727)', 'Danionidae(2743709)', 'Danioninae(2743711)', 'Danio(7954)', 'Danio rerio(7955)']
query lineage:
[1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742, 7776, 117570, 117571, 7898, 186623, 41665, 32443, 1489341, 186625, 186634, 32519, 186626, 186627, 7952, 30727, 2743709, 2743711, 7954, 7955]

Step 2 - gene age class assignment

Here, orthomap use the query species information and OrthoFinder results to extract the oldest common tree node per orthogroup along a species tree and to assign this node as the gene age to the corresponding genes.

Please have a look at the documentation of Step 2 - gene age class assignment to get further insides.

Step 3 - map gene/transcript IDs

To be able to link gene ages assignments from an orthomap and gene or transcript of scRNA dataset, one needs to check the overlap of the annotated gene names. With the gtf2t2g submodule of orthomap and the gtf2t2g.parse_gtf() function, one can extract gene and transcript names from a given gene feature file (GTF).

Please have a look at the documentation of Step 3 - map gene/transcript IDs to get further insides.

Here, the pre-calculated orthomap from Danio rerio (zebrafish) obtained via Step 0, Step 1, Step 2 and Step 3 is loaded as follows:

[4]:

datasets.zebrafish_ensembl105_orthomap(datapath='data')

100% [..........................................................................] 1920563 / 1920563

[4]:

'data/zebrafish_ensembl_105_orthomap.tsv'

[5]:

# get query species orthomap

# download zebrafish orthomap results here: https://doi.org/10.5281/zenodo.7242264
# or download with datasets.zebrafish_ensembl_105_orthomap('data')

query_orthomap = orthomap2tei.read_orthomap('data/zebrafish_ensembl_105_orthomap.tsv')
query_orthomap

[5]:

	seqID	Orthogroup	PSnum	PStaxID	PSname	PScontinuity	geneID
0	ENSDART00000127643.3	OG0000000	6	33213	Bilateria	0.846154	ENSDARG00000087544
1	ENSDART00000171750.2	OG0000000	6	33213	Bilateria	0.846154	ENSDARG00000095745
2	ENSDART00000190648.1	OG0000000	6	33213	Bilateria	0.846154	ENSDARG00000097551
3	ENSDART00000130167.3	OG0000001	10	7742	Vertebrata	0.909091	ENSDARG00000086420
4	ENSDART00000150909.2	OG0000001	10	7742	Vertebrata	0.909091	ENSDARG00000086613
...	...	...	...	...	...	...	...
25167	ENSDART00000180796.1	OG0029510	19	186625	Clupeocephala	0.400000	ENSDARG00000110427
25168	ENSDART00000145618.2	OG0029511	19	186625	Clupeocephala	0.400000	ENSDARG00000093188
25169	ENSDART00000143229.2	OG0029512	29	7955	Danio rerio	1.000000	ENSDARG00000069978
25170	ENSDART00000143837.3	OG0029512	29	7955	Danio rerio	1.000000	ENSDARG00000078193
25171	ENSDART00000180573.1	OG0029513	13	117571	Euteleostomi	0.222222	ENSDARG00000109747

25172 rows × 7 columns

Import now, the scRNA dataset of the query species

Here, data is used, like in the publication (Farrell et al., 2018; Wagner et al., 2018; Qiu et al., 2022).

scRNA data was downloaded from http://tome.gs.washington.edu/ as R rds files, combined into a single Seurat object and converted into loom and AnnData (h5ad) files to be able to analyse with e.g. python scanpy or orthomap package and is available here:

https://doi.org/10.5281/zenodo.7243602

or can be accessed with the dataset submodule of orthomap:

datasets.qiu22_zebrafish(datapath='data') (download folder set to 'data').

[6]:

# load scRNA data

# download zebrafish scRNA data here: https://doi.org/10.5281/zenodo.7243602
# or download with datasets.qui22_zebrafish(datapath='data')

#zebrafish_data = datasets.qiu22_zebrafish(datapath='data')
zebrafish_data = sc.read('data/zebrafish_data.h5ad')

Step 4 - Get TEI values and add them to scRNA dataset

Since now the gene names correspond to each other in the orthomap and the scRNA adata object, one can calculate the transcriptome evolutionary index (TEI) and add them to the scRNA dataset (adata object).

The TEI measure represents the weighted arithmetic mean (expression levels as weights for the phylostratum value) over all evolutionary age categories denoted as phylostra.

\(TEI_s = \sum (e_{is} * ps_i) / \sum e_{is}\)

, where \(TEI_s\) denotes the TEI value in developmental stage \({s, e_{is}}\) denotes the gene expression level of gene \({i}\) in stage \({s}\), and \({ps_i}\) denotes the corresponding phylostratum of gene \({i, i = 1,...,N}\) and \({N = total\ number\ of\ genes}\).

Note: If e.g. two different isoforms would fall into two different gene age classes, their gene ages might differ based on the oldest ortholog found in their corresponding orthologous groups. However, both isoforms share the same gene name and their gene ages would clash. In this case one can decide either to use the keep='min' or keep='max' gene age to be kept by the get_tei function, which defaults to keep in this cases the keep='min' or in other words the ‘older’ gene age.

To be able to re-use the original count data, they are added as a new layer to the adata object. This is useful because later on the count data can be used to extract either the relative expression per gene age class or re-calculate other metrics.

This can be done either on un-normalized counts, on normalized and log-transformed data.

[7]:

zebrafish_data.layers['counts'] = zebrafish_data.X

add TEI to adata object

Using the submodule orthomap2tei from orthomap and the orthomap2tei.get_tei() function, transcriptome evolutionary index (TEI) values are calculated and directyl added to the existing adata object (add_obs=True).

There are other options to e.g. not start from the adata.X counts but from another layer from the adata object, the default is to use the adata.X (layer=None). By default, the values are pre-processed by the normalize_total option and the log1p option.

Note: Please have a look at Liu & Robinson-Rechavi, 2018 to get more insides how expression transformation can influence TEI calculation.

If add_obs=True the resulting TEI values are added to the existing adata object as a new observation with the name set with the obs_name option.

If add_var=True the gene age values are added to the existing adata object as a new variable with the name set with the var_name option.

Note: Genes not assigned to any gene class will get a missing assignment.

If one wants to calculate bootstrap TEI values per cell, the boot option can be set to boot=True and gene age classes will be randomly chosen prior calculating TEI values bt=10 times.

[8]:

# add TEI values to existing adata object
orthomap2tei.get_tei(adata=zebrafish_data,
    gene_id=query_orthomap['geneID'],
    gene_age=query_orthomap['PSnum'],
    keep='min',
    layer=None,
    add_var=True,
    var_name='Phylostrata',
    add_obs=True,
    obs_name='tei',
    boot=False,
    bt=10,
    normalize_total=True,
    log1p=True,
    target_sum=1e6)

[8]:

	tei
hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC	5.531953
hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT	5.550115
hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC	5.501283
hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN	5.521703
hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT	5.614314
...	...
hpf24_DEW057_TGACACAACAG_GCCACATC	5.136634
hpf24_DEW057_CTTACGGG_AACCTGAC	5.322264
hpf24_DEW057_TGAACATCTAT_GACGATGG	5.121005
hpf24_DEW057_TGAGGTTTCTC_CTCAGAAT	4.963845
hpf24_DEW057_ACGTGCTAG_CAAGTCAT	5.270514

71203 rows × 1 columns

In the adata object, for each cell there is now a new variable called tei, which can be used in downstream analysis.

Once the gene age data has been added to the scRNA dataset, one can e.g. plot the corresponding transcriptome evolutionary index (TEI) values by any given observation pre-defined in the scRNA dataset.

Boxplot gene age class per embryo stage

[9]:

sc.pl.violin(adata=zebrafish_data,
             keys=['tei'],
             groupby='stage',
             rotation=90,
             palette='Paired',
             stripplot=False,
             inner='box')

Please have a look at the documentation of Downstream analysis to get further insides.