Genome and Annotations

Annotation summary table

| Type | Number of genes | Source |
| --- | --- | --- |
| miRNA | 1917 | miRBase hairpin (Version 22) |
| piRNA | 23431 | piRNABank |
| lncRNA | 15778 | GENCODE V27 and Mitranscriptome |
| rRNA | 37 | NCBI RefSeq 109 |
| mRNA | 19836 | GENCODE V27 |
| snoRNA | 943 | GENCODE V27 |
| snRNA | 1900 | GENCODE V27 |
| srpRNA | 680 | GENCODE V27 |
| tRNA | 649 | GENCODE V27 |
| tucpRNA | 3734 | GENCODE V27 |
| Y_RNA | 756 | GENCODE V27 |
| circRNA | 140527 | circBase |
| repeats | - | UCSC Genome Browser (rmsk) |
| promoter | - | ChromHMM tracks from 9 cell lines from UCSC Genome Browser |
| enhancer | - | ChromHMM tracks from 9 cell lines from UCSC Genome Browser |

Genome and annotation files

| File | Description |
| --- | --- |
| `fasta/genome.fa` | genome sequence |
| `fasta/circRNA.fa` | junction sequences in circBase |
| `fasta/rRNA.fa` | rRNA sequences in NCBI RefSeq |
| `fasta/miRNA.fa` | miRNA hairpin (precursor) sequences in miRBase |
| `fasta/piRNA.fa` | piRNA sequences in piRNABank |
| `fasta/${rna_type}.fa` | longest isoform for each gene extracted from GENCODE annotations |
| `gtf_by_biotype/${rna_type}.gtf` | separate GTF files for each RNA type |
| `gtf/gencode.gtf` | GENCODE GTF file |
| `gtf/mitranscriptome.gtf` | Mitranscriptome GTF file |
| `gtf/long_RNA.gtf` | GTF file of long RNA (GENCODE + Mitranscriptome - miRNA) |
| `gtf/piRNABank.gtf` | piRNA GTF file from piRNABank |
| `gtf/gencode_tRNA.gtf` | GTF file of tRNA from GENCODE |
| `transcript_table/all.txt` | table of transcript information (gene_id, transcript_id) |
| `rsem_index/bowtie2/${rna_type}` | RSEM index files for each RNA type (built using the longest transcripts) |
| `rsem_index/bowtie2/${rna_type}.transcripts.fa` | sequences for each RNA type (longest transcripts) |
| `gtf_longest_transcript/${rna_type}.gtf` | GTF files for the longest isoforms from GENCODE and Mitranscriptome |
| `bed/*.bed` | transcripts in BED12 format extracted from GTF files in `gtf/*.gtf` |
| `index/bowtie2/${rna_type}` | Bowtie2 index for transcripts |
| `index/star/${rna_type}` | STAR index for transcripts |
| `long_index/star/` | STAR index including splice junctions of long RNA |
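Several of the files above are built from the longest isoform of each gene. The selection step could be sketched roughly as follows. This is an illustration, not the pipeline's own script: the function name `longest_isoforms` is hypothetical, transcript length is assumed to be the sum of exon lengths, and ties are assumed to go to the first transcript encountered.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: gene_id "ENSG00000000001";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def longest_isoforms(gtf_lines):
    """Return {gene_id: (transcript_id, length)} for the longest isoform per gene.

    Length is the summed exon length. GTF coordinates are 1-based and
    inclusive, so one exon covers end - start + 1 bases.
    """
    lengths = defaultdict(int)  # (gene_id, transcript_id) -> summed exon length
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        key = (attrs["gene_id"], attrs["transcript_id"])
        lengths[key] += int(fields[4]) - int(fields[3]) + 1

    best = {}
    for (gene_id, transcript_id), length in lengths.items():
        # Strict ">" keeps the first transcript seen when lengths tie (assumption).
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (transcript_id, length)
    return best
```

The function accepts any iterable of GTF lines, so it can be fed an open file handle for `gtf/gencode.gtf` or a list of strings.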

Generate the genome and annotation files

Create genome directory
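A minimal sketch of this step, assuming the paths in the file table are relative to a root such as `genome/hg38` (the root that appears in the RepeatMasker download path later on this page). The helper name `create_genome_dirs` and the exact subdirectory list are illustrative assumptions.

```python
from pathlib import Path

# Subdirectories taken from the file table, plus "source" for raw downloads.
# The exact set is an assumption and may differ from the real layout.
SUBDIRS = [
    "source",
    "fasta",
    "gtf",
    "gtf_by_biotype",
    "gtf_longest_transcript",
    "transcript_table",
    "bed",
    "rsem_index/bowtie2",
    "index/bowtie2",
    "index/star",
    "long_index/star",
]


def create_genome_dirs(root):
    """Create the directory skeleton under `root` and return it as a Path."""
    root = Path(root)
    for sub in SUBDIRS:
        # parents=True builds nested paths; exist_ok=True makes reruns harmless.
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

Calling `create_genome_dirs("genome/hg38")` would create the skeleton in the current working directory, and calling it again leaves existing directories untouched.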

Chromosome ID conversion table

  • Column 1: UCSC chromosome ID

  • Column 2: RefSeq chromosome ID
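As an illustration of how such a two-column table might be used, the sketch below loads it and rewrites the first column of a tab-separated record from RefSeq to UCSC IDs. The function names are hypothetical, and the RefSeq-to-UCSC direction is an assumption (it suits annotations downloaded from NCBI when the genome uses UCSC names).

```python
def load_chrom_map(lines):
    """Build {refseq_id: ucsc_id} from tab-separated lines.

    Column 1 is the UCSC chromosome ID and column 2 the RefSeq chromosome ID,
    as described above. Blank lines and '#' comments are skipped.
    """
    refseq_to_ucsc = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        ucsc_id, refseq_id = line.split("\t")[:2]
        refseq_to_ucsc[refseq_id] = ucsc_id
    return refseq_to_ucsc


def convert_chrom(record, mapping):
    """Replace the first tab-separated field of `record` using `mapping`.

    IDs missing from the mapping are left unchanged.
    """
    fields = record.split("\t")
    fields[0] = mapping.get(fields[0], fields[0])
    return "\t".join(fields)


# Example rows: chr1 and chrM with their GRCh38 RefSeq accessions.
table = ["chr1\tNC_000001.11\n", "chrM\tNC_012920.1\n"]
mapping = load_chrom_map(table)
converted = convert_chrom("NC_000001.11\t100\t200", mapping)
```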

Download gene annotation (NCBI)

Download chain files for CrossMap

Genome assembly (UCSC hg38)

ENCODE annotations

Mitranscriptome

Extract lncRNA and TUCP RNA to separate GTF files:
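One possible way to do the split is sketched below. It assumes that each Mitranscriptome GTF record carries a transcript category attribute named `tcat` with values such as `lncrna` and `tucp`; the attribute name and values should be checked against the downloaded file. The function name is hypothetical.

```python
import re


def split_by_category(gtf_lines, categories=("lncrna", "tucp"), key="tcat"):
    """Group GTF lines by the value of one attribute.

    `key` and `categories` are assumptions about the Mitranscriptome GTF;
    lines whose category is not listed are dropped.
    """
    groups = {category: [] for category in categories}
    pattern = re.compile(r"\b" + re.escape(key) + r' "([^"]*)"')
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = pattern.search(fields[8])
        if match and match.group(1) in groups:
            groups[match.group(1)].append(line)
    return groups
```

Each list in the returned dictionary can then be written to its own GTF file with `writelines`.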

NONCODE

lncRNAs identified in HCC (Nature Communications 2017)

Merge lncRNA (GENCODE and Mitranscriptome)

piRBase (v1.0)

piRBase (v2.0)

Long RNA (GENCODE + Mitranscriptome - miRNA)

gene_length/long_RNA

  • Tab-delimited text file

  • First row: header

  • Column 1 (gene): gene_id

  • Column 2 (mean): mean length of isoforms

  • Column 3 (median): median length of isoforms

  • Column 4 (longest_isoform): length of the longest isoform

  • Column 5 (merged): merged length of isoforms
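The five columns could be computed along these lines. This is a sketch under two assumptions: isoform length is the sum of exon lengths, and "merged" means the number of bases covered by the union of all exons of the gene. The function names are illustrative.

```python
from statistics import mean, median

HEADER = ("gene", "mean", "median", "longest_isoform", "merged")


def merged_length(intervals):
    """Bases covered by the union of 1-based, inclusive (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def gene_length_row(gene_id, isoforms):
    """Return one table row; `isoforms` is a list of exon lists per isoform."""
    iso_lengths = [sum(end - start + 1 for start, end in exons) for exons in isoforms]
    all_exons = [exon for exons in isoforms for exon in exons]
    return (
        gene_id,
        mean(iso_lengths),
        median(iso_lengths),
        max(iso_lengths),
        merged_length(all_exons),
    )


# Two isoforms of a hypothetical gene G1 with partly overlapping exons.
row = gene_length_row("G1", [[(1, 100), (201, 300)], [(51, 150)]])
line = "\t".join(str(value) for value in row)
```

Here the two isoforms are 200 and 100 bases long, and their exons cover positions 1-150 and 201-300, so the merged length is 250.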

piRNABank (NCBI36)

miRBase (Version 22)

Spike-in

UniVec

Intron

Promoter/enhancer from ChromHMM (hg19)

Repeats

UCSC Genome Browser -> Tools -> Table Browser

  • assembly: GRCh38/hg38

  • group: repeats

  • track: RepeatMasker

  • table: rmsk

Download to: genome/hg38/source/rmsk.bed.gz
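After the download, a quick sanity check of the gzipped BED file might look like the sketch below, which counts repeat intervals per chromosome. The function name is hypothetical, and the file is assumed to be a plain tab-separated BED, possibly with `track` or `browser` header lines.

```python
import gzip
from collections import Counter


def count_repeats_per_chrom(path):
    """Count BED records per chromosome in a gzip-compressed BED file."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and optional UCSC header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom = line.split("\t", 1)[0]
            counts[chrom] += 1
    return counts
```

For example, `count_repeats_per_chrom("genome/hg38/source/rmsk.bed.gz")` returns a `Counter` keyed by chromosome name; an empty result would suggest a failed or truncated download.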

circRNA database (circBase)

Create pseudo-genome for IGV

Merge transcript table
