Methods

Experimental methods

Experimental details for each separate experiment can be found in links from the experiment details page.

Bioinformatics methods

All total RNAseq and miRNAseq data across all species were re-analyzed from the original published raw data. The full table of GEO and SRA samples that were processed can be found by following the details links from the experiment details page.

This page describes the methods for this uniform processing.

Processed single-cell RNAseq data were downloaded from the supplementary files of "Single-cell RNA sequencing of the mammalian pineal gland identifies two pinealocyte subtypes and cell type-specific daily patterns of gene expression". Gene expression was normalized to 10,000 transcripts per cell and averaged across all cells in each cell type.
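The normalization and averaging described above can be sketched as follows. This is a minimal illustration, not the code used for the site: the function names, the toy counts, and the cell type labels are assumptions, and every cell is assumed to report the same set of genes.

```python
from collections import defaultdict

def normalize_cell(counts, target=10_000):
    """Scale one cell's raw counts so that they sum to `target`."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: value * target / total for gene, value in counts.items()}

def average_by_cell_type(cells, cell_types, target=10_000):
    """Average normalized expression over all cells of each cell type.

    cells:      {cell_id: {gene: raw_count}}
    cell_types: {cell_id: cell_type_label}
    """
    sums = defaultdict(lambda: defaultdict(float))
    n_cells = defaultdict(int)
    for cell_id, counts in cells.items():
        label = cell_types[cell_id]
        n_cells[label] += 1
        for gene, value in normalize_cell(counts, target).items():
            sums[label][gene] += value
    return {
        label: {gene: total / n_cells[label] for gene, total in genes.items()}
        for label, genes in sums.items()
    }

# Toy data: hypothetical counts and labels, not taken from the dataset.
cells = {
    "cell1": {"geneA": 30, "geneB": 70},
    "cell2": {"geneA": 10, "geneB": 10},
    "cell3": {"geneA": 0, "geneB": 5},
}
cell_types = {"cell1": "typeX", "cell2": "typeX", "cell3": "typeY"}
averaged = average_by_cell_type(cells, cell_types)
```

Because each cell is scaled before averaging, cells with very different sequencing depth contribute equally to the cell type average.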

Workflow

A Snakemake workflow was designed in such a way that the same code could be applied to different experiments and across different species, with only configuration files needing to be changed. The entry point to each workflow was a sample table containing SRA accessions and other metadata, and a config file describing the URLs of the fasta files for assembly and transcriptome.
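As a rough illustration of this design, the sketch below reads a sample table and derives the per-sample outputs a workflow run would be asked to build. The column names, config keys, URLs, accessions, and output paths are all placeholders chosen for the example; the actual layout used by the workflow may differ.

```python
import csv
import io

# Hypothetical config: keys and URLs are placeholders, not the real settings.
config = {
    "species": "rat",
    "genome_fasta_url": "https://example.org/genome.fa.gz",
    "transcriptome_fasta_url": "https://example.org/transcriptome.fa.gz",
}

def load_samples(handle):
    """Read a tab-separated sample table into one dict per sample."""
    samples = list(csv.DictReader(handle, delimiter="\t"))
    for sample in samples:
        if not sample.get("accession"):
            raise ValueError(f"sample without an SRA accession: {sample}")
    return samples

def quantification_targets(samples, config):
    """List the per-sample output files a workflow run would be asked to build."""
    return [f"{config['species']}/{s['accession']}/abundance.tsv" for s in samples]

# Placeholder accessions for illustration only.
sample_table = io.StringIO(
    "accession\tcondition\n"
    "SRR0000001\tday\n"
    "SRR0000002\tnight\n"
)
samples = load_samples(sample_table)
targets = quantification_targets(samples, config)
```

Switching to another species or experiment then only requires a different sample table and config, while the functions stay unchanged.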

Raw data

Raw FASTQ files were downloaded directly from SRA using fastq-dump from NCBI's SRA toolkit.

QC assessment

The quality of all FASTQ files was assessed with FastQC v0.11.9 and MultiQC v1.9, and all samples demonstrated high-quality sequencing. Adapters were removed and light quality trimming was performed with cutadapt v3.1 using the additional arguments -q 20 --minimum-length 25 for mRNA processing, and -q 20 --minimum-length 17 for miRNA processing.
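The two trimming configurations can be expressed as a small command builder. This is a sketch under stated assumptions: single-end reads are assumed, the function name is hypothetical, and the adapter sequence in the example call is a commonly used Illumina adapter prefix that is not confirmed by the text above.

```python
import shlex

# Extra arguments stated in the text for each library type.
EXTRA_ARGS = {
    "mRNA": ["-q", "20", "--minimum-length", "25"],
    "miRNA": ["-q", "20", "--minimum-length", "17"],
}

def cutadapt_command(fastq_in, fastq_out, library_type, adapter):
    """Build a single-end cutadapt command line as a list of arguments."""
    return [
        "cutadapt",
        "-a", adapter,                 # 3' adapter to remove (assumed by the caller)
        *EXTRA_ARGS[library_type],     # quality and length thresholds from the text
        "-o", fastq_out,
        fastq_in,
    ]

# Example call; the adapter and file names are placeholders.
cmd = cutadapt_command("sample.fastq.gz", "sample.trimmed.fastq.gz", "miRNA", "AGATCGGAAGAGC")
printable = shlex.join(cmd)
```

The shorter minimum length for miRNA libraries keeps mature miRNA reads, which would be discarded by the 25 nt threshold used for mRNA.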

Transcript quantification

The trimmed reads were provided to Kallisto v0.46.2 for transcript quantification using an index built as described below for each species' transcriptome, and run using the additional arguments --bias --bootstrap-samples 100. For each gene, the per-transcript values reported by Kallisto were summed to provide a gene-level expression estimate in units of TPM. These are the values reported in the tables and plots of each gene page.
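The gene-level summation can be sketched as below, assuming the standard column layout of Kallisto's abundance.tsv output (target_id, length, eff_length, est_counts, tpm). The transcript-to-gene mapping, the function name, and the example values are hypothetical.

```python
import csv
import io
from collections import defaultdict

def gene_level_tpm(abundance_file, tx2gene):
    """Sum per-transcript TPM values into one TPM value per gene."""
    totals = defaultdict(float)
    for row in csv.DictReader(abundance_file, delimiter="\t"):
        gene = tx2gene.get(row["target_id"])
        if gene is None:
            continue  # transcript without a known gene; skipped in this sketch
        totals[gene] += float(row["tpm"])
    return dict(totals)

# Toy abundance table with made-up values.
example = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t1000\t800\t50\t12.5\n"
    "tx2\t1500\t1300\t20\t2.5\n"
    "tx3\t900\t700\t10\t4.0\n"
)
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
gene_tpm = gene_level_tpm(example, tx2gene)
```

Summing is valid here because TPM values are already normalized for transcript length and library size, so the TPM of a gene is the sum of the TPMs of its transcripts.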

Excel files

To generate the final Excel tables that include gene symbols and genomic coordinates, this extra information was parsed from an annotation file in GTF format (from Ensembl) and attached to Kallisto output.
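A minimal sketch of this annotation step is shown below. It assumes the nine-column GTF layout and Ensembl-style gene_id and gene_name attributes; the function name and the example record are made up for illustration.

```python
import re

# Matches GTF attribute pairs of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')

def parse_gene_records(gtf_lines):
    """Collect symbol and coordinates for every 'gene' feature in a GTF."""
    genes = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        chrom, _, _, start, end, _, strand, _, attributes = fields
        attrs = dict(ATTRIBUTE.findall(attributes))
        genes[attrs["gene_id"]] = {
            "symbol": attrs.get("gene_name", ""),
            "chrom": chrom,
            "start": int(start),
            "end": int(end),
            "strand": strand,
        }
    return genes

# Made-up record in Ensembl GTF style.
gtf = [
    "#!genome-build example\n",
    '1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "GENE0001"; gene_version "1"; gene_name "ExampleGene";\n',
    '1\tensembl\ttranscript\t100\t900\t.\t+\t.\tgene_id "GENE0001"; transcript_id "TX0001";\n',
]
annotation = parse_gene_records(gtf)
```

The resulting dictionary can be joined to the gene-level expression table on the gene identifier before export.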

Genome browser visualization

For genomic visualization in the UCSC Genome Browser, trimmed reads were aligned with HISAT2 v2.2.1 to the respective genome indicated below. From these aligned reads, normalized bigWig files were created with the bamCoverage tool from deepTools v3.5.0, using the additional options --minMappingQuality 20 --smoothLength 10 --normalizeUsing BPM --binSize 1 so that multi-mappers were ignored. The tool was run twice, once with --filterRNAstrand forward and once with --filterRNAstrand reverse, to get separate tracks for each strand. The resulting bigWig files were combined into a UCSC track hub using the trackhub Python package. The genomic signal tracks are scaled to effectively display reads per million mapped reads. This allows accurate comparison between tracks, but can result in misinterpretation of low signals. Signal intensity should always be interpreted relative to the track's overall signal range.
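The two strand-specific bamCoverage runs can be sketched as a command builder. The options are those listed above; the function name, the input file name, and the output naming scheme are assumptions made for the example.

```python
def bamcoverage_commands(bam, prefix):
    """Build one bamCoverage command per strand, sharing all other options."""
    shared = [
        "bamCoverage",
        "--bam", bam,
        "--minMappingQuality", "20",   # drops low-MAPQ alignments such as multi-mappers
        "--smoothLength", "10",
        "--normalizeUsing", "BPM",
        "--binSize", "1",
    ]
    return {
        strand: shared + [
            "--filterRNAstrand", strand,
            "--outFileName", f"{prefix}.{strand}.bigwig",  # hypothetical naming scheme
        ]
        for strand in ("forward", "reverse")
    }

commands = bamcoverage_commands("sample.bam", "sample")
```

Each command could then be run with subprocess.run(command, check=True) on a machine where deepTools is installed.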

Note on UCSC genome browser profiles vs. transcript quantification: Kallisto, which has been shown to be among the most accurate methods for quantification, aligns reads to transcripts only, so its alignments cannot be visualized in a genome browser. Because HISAT2 uses a different algorithm and aligns reads to the genome, small discrepancies in read assignment between the transcript quantification and the genome browser tracks are to be expected.

Chromosome nomenclature harmonization

A note on chromosome nomenclature: UCSC (used for visualization) and Ensembl (used for its comprehensive annotations) are not consistent in their chromosome nomenclature. To facilitate linking from gene-level transcription estimates on this website (which typically use Ensembl annotations) to genomic signal at UCSC, we converted chromosome names from Ensembl to their UCSC equivalents by matching the md5sums of each chromosome across the Ensembl and UCSC assembly fasta files (see GitHub repository).
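The md5-based matching can be sketched as follows. This is an illustration rather than the code in the repository: the function names are hypothetical, and sequences are uppercased before hashing on the assumption that the two sources may differ in soft-masking.

```python
import hashlib
import io

def sequence_md5s(fasta):
    """Map the md5 digest of each sequence to its name (first word of the header)."""
    digests = {}
    name, md5 = None, None
    for line in fasta:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                digests[md5.hexdigest()] = name
            name = line[1:].split()[0]
            md5 = hashlib.md5()
        elif name is not None and line:
            # Assumption: ignore case so soft-masked and unmasked copies match.
            md5.update(line.upper().encode("ascii"))
    if name is not None:
        digests[md5.hexdigest()] = name
    return digests

def ensembl_to_ucsc(ensembl_fasta, ucsc_fasta):
    """Pair Ensembl and UCSC sequence names whose sequences have identical md5s."""
    ucsc = sequence_md5s(ucsc_fasta)
    return {
        name: ucsc[digest]
        for digest, name in sequence_md5s(ensembl_fasta).items()
        if digest in ucsc
    }

# Tiny made-up assemblies.
ensembl = io.StringIO(">1 dna:chromosome\nACGT\nacgt\n>MT\nGGCC\n")
ucsc = io.StringIO(">chr1\nACGTACGT\n>chrM\nGGCC\n>chrUn_1\nTTTT\n")
mapping = ensembl_to_ucsc(ensembl, ucsc)
```

One limitation of this sketch is that two contigs with identical sequence produce the same digest, so only one of them would be kept in the lookup.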

Website

This website uses Flask, and is built so that a config file drives the generation of the site, including metadata, sample information, plot coloring, and labeling. Plots are generated on the fly using the Highcharts.js library. Combined with the workflow described above, new species and new experiments can be added in a straightforward manner.

Details on genome and transcriptome assemblies

For each species, the genomic assembly indicated was used for visualization in the UCSC Genome Browser, while the transcriptome was used to calculate TPM expression estimates to display in plots on individual gene pages.

The maturity and availability of assemblies and transcriptomes vary across species, so this section describes what was used in each case.

Human

Genome: GENCODE GRCh38 primary assembly. Unassembled contigs renamed to match UCSC hg38 and all *_alt contigs removed.
Transcriptome: Transcriptome created by providing the GENCODE annotation GTF and GENCODE assembly to the gffread package to generate the transcriptome.

Rhesus

Genome: Ensembl Mmul 10. Chromosomes renamed to match UCSC's rheMac10.
Transcriptome: Transcriptome created by providing the Ensembl annotation GTF and Ensembl assembly to the gffread package to generate the transcriptome.

Mouse

Genome: GENCODE GRCm38 primary assembly. Unassembled contigs renamed to match UCSC mm10.
Transcriptome: Transcriptome created by providing the GENCODE annotation GTF and GENCODE assembly to the gffread package to generate the transcriptome.

Rat

Genome: Ensembl Rnor 6.0.94. Chromosomes renamed to match UCSC rn6. Removed the duplicate contigs AABR07022993.1 and AABR07022518.1.
Transcriptome: Annotation created by merging Ensembl Rnor 6 annotation with miRBase rno 6 annotation. Transcriptome created by providing the merged annotation GTF and Ensembl assembly to the gffread package to generate the transcriptome.

Rat miRNA

Genome: Ensembl Rnor 6.0.94. Chromosomes renamed to match UCSC rn6. Removed the duplicate contigs AABR07022993.1 and AABR07022518.1.
Transcriptome: miRNA annotation created by filtering the Ensembl Rnor 6 annotation for gene_biotype=miRNA and merging with miRBase rno 6 annotation. Transcriptome created by providing the merged annotation GTF and Ensembl assembly to the gffread package to generate the transcriptome. The salmon index was built using a k-mer size of 19 to accommodate the shorter read fragments.

Chicken

Genome: UCSC galGal5.
Transcriptome: Transcriptome created by providing the UCSC annotation GTF and UCSC assembly to the gffread package to generate the transcriptome.

Zebrafish

Genome: UCSC danRer11. Chromosomes renamed to UCSC's danRer11 and all _alt contigs removed.
Transcriptome: Transcriptome created by providing the UCSC annotation GTF and UCSC assembly to the gffread package to generate the transcriptome.