1 Overview

Here we adopted a mouse sperm scRNA-seq dataset to evaluate the performance of scAPAtrap and compared the results with other two tools, scAPA(Shulman et al, 2019) and Sierra(Patrick, et al., 2020). Majority of the comparison analyses were performed with the movAPA toolkit. For detailed usage of movAPA toolkit, please refer to link

We have used scAPAtrap, scAPA, and Sierra to identify poly(A) sites from the mouse sperm scRNA-seq data, respectively. The identified poly(A) sites can be downloaded from this link.

We focused on three stages of differentiation process as in the previous study (Shulman et al, 2019), including early stage (spermatocytes, SC), intermediate stage (round spermatids, RS), and late stage (elongating spermatids, ES). For detailed information, please refer to the original publication Lukassen, et al., 2018.

1.1 Preparations

In this section, we import the required R packages and some required data. Please download the required data from the above link first before running the code. ### Load R package

library(ggplot2)
library(reshape2)
library(ggsci)
library(movAPA)

1.1.1 Load Data

Load the mouse reference genome, reference genome annotation, and annotated poly(A) sites from two latest known poly(A) site databases from bulk 3?? seq, polyASite 2.0 and PolyA_DB 3.

Please replace the file path in the following commands.

## Load mouse reference genome
library(BSgenome.Mmusculus.UCSC.mm10)
bsgenome <- BSgenome.Mmusculus.UCSC.mm10

## Load reference genome annotation
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdbmm10 <- parseGenomeAnnotation(TxDb.Mmusculus.UCSC.mm10.knownGene)

## Load annotated poly(A) sites from polyAsite 2.0.
load('C:/Users/admin/Desktop/refdata/mm10polyAsite2.rda')

## Load annotated poly(A) sites from PolyA_DB 3.
load('C:/Users/admin/Desktop/refdata/mm10polyAdb3.rda')

Load poly(A) sites identified with the three tools.

## scAPA
load("C:/Users/admin/Desktop/scPACdsData/scAPAds.rda")

## scAPAtrap
load("C:/Users/admin/Desktop/scPACdsData/scAPAtrapds.rda")

## sierra
load("C:/Users/admin/Desktop/scPACdsData/sierrads.rda")

1.2 Basic statistics

Here we refer to poly(A) sites in above loaded datasets as poly(A) site clusters (PACs), as nearby cleavage sites have been already grouped into one poly(A) site.

In this section, we will do some basic statistics on PACs identified by the three methods.

1.2.1 Number of PACs

## Total number of PACs
sum(scAPAtrapds@counts)
## [1] 52592642
sum(scAPAds@counts)
## [1] 53185459
sum(sierrads@counts)
## [1] 31069317
## Average number of PACs per cell
sum(scAPAtrapds@counts)/ncol(scAPAtrapds@counts)
## [1] 25755.46
sum(scAPAds@counts)/ncol(scAPAtrapds@counts)
## [1] 26045.77
sum(sierrads@counts)/ncol(sierrads@counts)
## [1] 15215.14

1.2.2 Distribution of PACs in different genomic regions

## scAPAtrap
table(scAPAtrapds@anno$ftr)
## 
##       3UTR       5UTR        CDS       exon intergenic     intron 
##      17998        139       3660        919      43698      21692
## scAPA
table(scAPAds@anno$ftr)
## 
##       3UTR       5UTR        CDS       exon intergenic     intron 
##      17044          3        344        137        624        785
## sierra
table(sierrads@anno$ftr)
## 
##       3UTR       5UTR        CDS       exon intergenic     intron 
##      11150        218       3355        722       1669       5505

1.2.3 Single nucleotide profiles around PACs

## scAPAtrap
dsPlot = scAPAtrapds[scAPAtrapds@anno$ftr %in% c("3UTR", "CDS", "intron","intergenic")]
fafiles1 = faFromPACds(dsPlot, bsgenome, what = "updn", fapre = "scAPAtrap.200nt", 
                       up = -100, dn = 100, byGrp = "ftr", chrCheck = FALSE)
## 43698 >>> scAPAtrap.200nt.intergenic.fa 
## 17998 >>> scAPAtrap.200nt.3UTR.fa 
## 3660 >>> scAPAtrap.200nt.CDS.fa 
## 21692 >>> scAPAtrap.200nt.intron.fa
plotATCGforFAfile(fafiles1, ofreq = TRUE, opdf = FALSE, refPos = 101, 
                  mergePlots = T, filepre = "scAPAtrap.200nt")
## >>> scAPAtrap.200nt.intergenic.scAPAtrap.200nt.freq 
## >>> scAPAtrap.200nt.3UTR.scAPAtrap.200nt.freq 
## >>> scAPAtrap.200nt.CDS.scAPAtrap.200nt.freq 
## >>> scAPAtrap.200nt.intron.scAPAtrap.200nt.freq

## sierra
dsPlot = sierrads[sierrads@anno$ftr %in% c("3UTR", "CDS", "intron","intergenic")]
fafiles2 = faFromPACds(dsPlot, bsgenome, what = "updn", fapre = "sierra.200nt", 
                       up = -100, dn = 100, byGrp = "ftr", chrCheck = FALSE)
## 11150 >>> sierra.200nt.3UTR.fa 
## 5505 >>> sierra.200nt.intron.fa 
## 3355 >>> sierra.200nt.CDS.fa 
## 1669 >>> sierra.200nt.intergenic.fa
plotATCGforFAfile(fafiles2, ofreq = TRUE, opdf = FALSE, refPos = 101, 
                  mergePlots = T, filepre = "sierra.200nt")
## >>> sierra.200nt.3UTR.sierra.200nt.freq 
## >>> sierra.200nt.intron.sierra.200nt.freq 
## >>> sierra.200nt.CDS.sierra.200nt.freq 
## >>> sierra.200nt.intergenic.sierra.200nt.freq

## scAPA
dsPlot = scAPAds[scAPAds@anno$ftr %in% c("3UTR", "CDS", "intron","intergenic")]
fafiles3 = faFromPACds(dsPlot, bsgenome, what = "updn", fapre = "scAPA.200nt", 
                       up = -100, dn = 100, byGrp = "ftr", chrCheck = FALSE)
## 17044 >>> scAPA.200nt.3UTR.fa 
## 785 >>> scAPA.200nt.intron.fa 
## 344 >>> scAPA.200nt.CDS.fa 
## 624 >>> scAPA.200nt.intergenic.fa
plotATCGforFAfile(fafiles3, ofreq = TRUE, opdf = FALSE, refPos = 101, 
                  mergePlots = T, filepre = "scAPA.200nt")
## >>> scAPA.200nt.3UTR.scAPA.200nt.freq 
## >>> scAPA.200nt.intron.scAPA.200nt.freq 
## >>> scAPA.200nt.CDS.scAPA.200nt.freq 
## >>> scAPA.200nt.intergenic.scAPA.200nt.freq

## ploysite2.0
mm10polyAsite2<- annotatePAC(mm10polyAsite2,aGFF = txdbmm10)
mm10polyAsite2 <- ext3UTRPACds(mm10polyAsite2, ext3UTRlen=1000)
## 19027 PACs in extended 3UTR (ftr=intergenic >> ftr=3UTR)
## Get 3UTR length (anno@toStop) for 3UTR/extended 3UTR PACs
table(mm10polyAsite2@anno$ftr)
## 
##       3UTR       5UTR        CDS       exon intergenic     intron 
##      96703       1173      16251       7408     112697      66527
## polyAsite2.0
dsPlot = mm10polyAsite2[mm10polyAsite2@anno$ftr %in% c("3UTR", "CDS", "intron","intergenic")]
fafiles4 = faFromPACds(dsPlot, bsgenome, what = "updn", fapre = "mm10polyAsite2.200nt", 
                       up = -100, dn = 100, byGrp = "ftr", chrCheck = FALSE)
## 112697 >>> mm10polyAsite2.200nt.intergenic.fa 
## 66527 >>> mm10polyAsite2.200nt.intron.fa 
## 96703 >>> mm10polyAsite2.200nt.3UTR.fa 
## 16251 >>> mm10polyAsite2.200nt.CDS.fa
plotATCGforFAfile(fafiles4, ofreq = TRUE, opdf = FALSE, refPos = 101, 
                  mergePlots = T, filepre = "mm10polyAsite2.200nt")
## >>> mm10polyAsite2.200nt.intergenic.mm10polyAsite2.200nt.freq 
## >>> mm10polyAsite2.200nt.intron.mm10polyAsite2.200nt.freq 
## >>> mm10polyAsite2.200nt.3UTR.mm10polyAsite2.200nt.freq 
## >>> mm10polyAsite2.200nt.CDS.mm10polyAsite2.200nt.freq

## ployAdb3
mm10polyAdb3<- annotatePAC(mm10polyAdb3, aGFF = txdbmm10)
mm10polyAdb3 <- ext3UTRPACds(mm10polyAdb3, ext3UTRlen=1000)
## 26966 PACs in extended 3UTR (ftr=intergenic >> ftr=3UTR)
## Get 3UTR length (anno@toStop) for 3UTR/extended 3UTR PACs
table(mm10polyAdb3@anno$ftr)
## 
##       3UTR       5UTR        CDS       exon intergenic     intron 
##     123208        720      12776      11039     146249      90327
## polyAdb3 
dsPlot = mm10polyAdb3[mm10polyAdb3@anno$ftr %in% c("3UTR", "CDS", "intron","intergenic")]
fafiles = faFromPACds(dsPlot, bsgenome, what = "updn", fapre = "mm10polyAdb3.200nt", 
                      up = -100, dn = 100, byGrp = "ftr",  chrCheck = FALSE)
## 146249 >>> mm10polyAdb3.200nt.intergenic.fa 
## 123208 >>> mm10polyAdb3.200nt.3UTR.fa 
## 90327 >>> mm10polyAdb3.200nt.intron.fa 
## 12776 >>> mm10polyAdb3.200nt.CDS.fa
plotATCGforFAfile(fafiles, ofreq = TRUE, opdf = FALSE, refPos = 101, 
                  mergePlots = T, filepre = "mm10polyAdb3.200nt")
## >>> mm10polyAdb3.200nt.intergenic.mm10polyAdb3.200nt.freq 
## >>> mm10polyAdb3.200nt.3UTR.mm10polyAdb3.200nt.freq 
## >>> mm10polyAdb3.200nt.intron.mm10polyAdb3.200nt.freq 
## >>> mm10polyAdb3.200nt.CDS.mm10polyAdb3.200nt.freq

1.3 Comparison of 3’ UTR PACs

In this section, we extract PACs in the 3’ UTR and extended 3’ UTRs and compare the performance of the three methods from the two aspects – poly(A) signal and the overlap with annotated PACs.

scAPAtrapds3UTR <- scAPAtrapds[scAPAtrapds@anno$ftr=='3UTR',]
scAPAds3UTR <- scAPAds[scAPAds@anno$ftr=='3UTR',]
sierrads3UTR <- sierrads[sierrads@anno$ftr == '3UTR',]

1.3.1 Basic statistics

## Show numbers of 3'UTR PACs by barplot
PAC <- c(nrow(scAPAtrapds3UTR@counts),nrow(scAPAds3UTR@counts),nrow(sierrads3UTR@counts))
method <- factor(c('scAPAtrap','scAPA','sierra'),levels = c('scAPAtrap','scAPA','sierra'))

temp <- data.frame(PAC = PAC,method = method)

ggplot(temp) + aes(x = method, fill = method, colour = method,weight = PAC) + geom_bar() + scale_fill_nejm() + scale_color_nejm() + theme()

## Number of genes with 3'UTR PACs
length(unique(scAPAtrapds3UTR@anno$gene))
## [1] 12387
length(unique(scAPAds3UTR@anno$gene))
## [1] 11811
length(unique(sierrads3UTR@anno$gene))
## [1] 7938
# Number of 3'UTR-APA genes
scAPAtrapds3UTRh2 <- get3UTRAPAds(scAPAtrapds3UTR, sortPA = TRUE, choose2PA = NULL)
scAPAds3UTRh2 <- get3UTRAPAds(scAPAds3UTR, sortPA = TRUE, choose2PA = NULL)
sierrads3UTRh2 <- get3UTRAPAds(sierrads3UTR, sortPA = TRUE, choose2PA = NULL)

length(unique(scAPAtrapds3UTRh2@anno$gene))
## [1] 3484
length(unique(scAPAds3UTRh2@anno$gene))
## [1] 3789
length(unique(sierrads3UTRh2@anno$gene))
## [1] 2246

1.3.2 Overlap with annotated PACs

Here we calculate the precision of poly(A) site prediction of different tools using PACs from PolyA_DB 3 or PolyASite 2.0 as reference. Cutoffs ranging from 10 bp to 100 bp in a 10 bp increment were used to determine whether a predicted poly(A) site is supported by annotated PACs.

## PolyASite2.0
scAPAtrap.map <- c()
scAPA.map <- c()
Sierra.map <- c()
for (d in 2:20){
  scAPAtrapds3UTR <- annotateByKnownPAC(scAPAtrapds3UTR, mm10polyAsite2, labels="Mm", d=5*d)
  scAPAtrap.map <- c(scAPAtrap.map,sum(scAPAtrapds3UTR@anno$Mm_ovp == 1))
  scAPAds3UTR <- annotateByKnownPAC(scAPAds3UTR, mm10polyAsite2, labels="Mm", d=5*d)
  scAPA.map <- c(scAPA.map,sum(scAPAds3UTR@anno$Mm_ovp == 1))
  sierrads3UTR <- annotateByKnownPAC(sierrads3UTR, mm10polyAsite2, labels="Mm", d=5*d)
  Sierra.map <- c(Sierra.map,sum(sierrads3UTR@anno$Mm_ovp == 1))
}
scAPAtrap.map <- round(scAPAtrap.map/length(scAPAtrapds3UTR),2)
scAPA.map <- round(scAPA.map/length(scAPAds3UTR),2)
Sierra.map <- round(Sierra.map/length(sierrads3UTR),2)
d <- c(2:20)*5
all.map <- data.frame(scAPAtrap = scAPAtrap.map,scAPA = scAPA.map, Sierra = Sierra.map,d=d)

all.map <- reshape2::melt(all.map, id = c("d"))

ggplot(all.map) + aes(x = d, y = value, fill = variable, colour = variable, shape = variable) + 
  geom_point(size = 3L) + 
  geom_smooth(span = 0.33) + 
  scale_fill_nejm() + scale_color_nejm() + 
  annotate("text", x = 60, y = 0.9, label = "polyA site2.0", 
    family = "Arial", size = 10L) + 
  labs(x = "Cutoff(distance)", y = "Relative to known PA(%)") + 
  theme()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## PolyA_DB 3
scAPAtrap.map <- c()
scAPA.map <- c()
Sierra.map <- c()
for (d in 2:20){
  scAPAtrapds3UTR <- annotateByKnownPAC(scAPAtrapds3UTR, mm10polyAdb3, labels="Mm", d=5*d)
  scAPAtrap.map <- c(scAPAtrap.map,sum(scAPAtrapds3UTR@anno$Mm_ovp == 1))
  scAPAds3UTR <- annotateByKnownPAC(scAPAds3UTR, mm10polyAdb3, labels="Mm", d=5*d)
  scAPA.map <- c(scAPA.map,sum(scAPAds3UTR@anno$Mm_ovp == 1))
  sierrads3UTR <- annotateByKnownPAC(sierrads3UTR, mm10polyAdb3, labels="Mm", d=5*d)
  Sierra.map <- c(Sierra.map,sum(sierrads3UTR@anno$Mm_ovp == 1))
}
scAPAtrap.map <- round(scAPAtrap.map/length(scAPAtrapds3UTR),2)
scAPA.map <- round(scAPA.map/length(scAPAds3UTR),2)
Sierra.map <- round(Sierra.map/length(sierrads3UTR),2)
d <- c(2:20)*5
all.map <- data.frame(scAPAtrap = scAPAtrap.map,scAPA = scAPA.map, Sierra = Sierra.map,d=d)

all.map <- reshape2::melt(all.map, id = c("d"))

ggplot(all.map) + aes(x = d, y = value, fill = variable, colour = variable, shape = variable) + 
  geom_point(size = 3L) +
  geom_smooth(span = 0.33) + 
  scale_fill_nejm() + scale_color_nejm() + 
  annotate("text", x = 60, y = 0.9, label = "polyA DB3",
           family = "Arial", size = 10L) + 
  labs(x = "Cutoff(distance)", y = "Relative to known PA(%)") + 
  theme()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

1.3.3 Distance from predicted PACs to annotated PACs

In this part, we associate PACs identified by the three methods to PACs in polyAsite 2.0 with a distance cutoff of 100 bp.

scAPAtrapds3UTR <- annotateByKnownPAC(scAPAtrapds3UTR, mm10polyAsite2, labels = "Mm", d = 100)
## query: 17998 
## known: 300759 ( Mm )
## 14986 query PACs overlapping known PACs
scAPAds3UTR <- annotateByKnownPAC(scAPAds3UTR, mm10polyAsite2, labels = "Mm", d = 100)
## query: 17044 
## known: 300759 ( Mm )
## 13979 query PACs overlapping known PACs
sierrads3UTR <- annotateByKnownPAC(sierrads3UTR , mm10polyAsite2, labels = "Mm", d = 100)
## query: 11150 
## known: 300759 ( Mm )
## 9112 query PACs overlapping known PACs

When d=100, the proportion of annotation sites on the association.

sum(na.omit(scAPAtrapds3UTR@anno$Mm_dist))/sum(!is.na(scAPAtrapds3UTR@anno$Mm_dist));
## [1] 31.44955
sum(na.omit(scAPAds3UTR@anno$Mm_dist))/sum(!is.na(scAPAds3UTR@anno$Mm_dist));
## [1] 41.5308
sum(na.omit(sierrads3UTR@anno$Mm_dist))/sum(!is.na(sierrads3UTR@anno$Mm_dist))
## [1] 45.04653

Plot the distribution of distances from predicted PACs to annotated ones.

scAPA.dist <- as.data.frame(table(na.omit(scAPAds3UTR@anno$Mm_dist)))
scAPAtrap.dist <- as.data.frame(table(na.omit(scAPAtrapds3UTR@anno$Mm_dist)))
sierra.dist <- as.data.frame(table(na.omit(sierrads3UTR@anno$Mm_dist)))
scAPA.dist$Freq <- scAPA.dist$Freq/sum(scAPA.dist$Freq)
scAPAtrap.dist$Freq <- scAPAtrap.dist$Freq/sum(scAPAtrap.dist$Freq)
sierra.dist$Freq <- sierra.dist$Freq/sum(sierra.dist$Freq)
scAPA.dist$method <- rep("scAPA", length(scAPA.dist$Freq))
scAPAtrap.dist$method <- rep("scAPAtrap", length(scAPAtrap.dist$Freq))
sierra.dist$method <- rep("Sierra", length(sierra.dist$Freq))

all_dist <- rbind(scAPA.dist, scAPAtrap.dist, sierra.dist)
all_dist$Var1 <- as.numeric(all_dist$Var1)
all_dist$Var1 <- all_dist$Var1
all_dist$method <- factor(all_dist$method, levels = c("scAPAtrap", "Sierra", "scAPA"))
ggplot(all_dist) + aes(x = Var1, y = Freq, fill = method, colour = method) + 
  geom_line(size = 1L) + 
  scale_fill_nejm() + 
  scale_color_nejm() + 
  labs(x = "Cutoff(distance)", y = "% of PA", fill = "method", color = "method") + 
  theme()

1.3.4 Calculation of poly(A) signals

## Import mouse poly(A) signal hexamers
load('C:/Users/admin/Desktop/refdata/mm10Signal.rda')
## Set priority: AATAAA >> ATTAAA >> other hexamers
priority <- c(1, 2, rep(3, 16))
## scAPAtrap
scAPAtrapds3UTR <- annotateByPAS(pacds = scAPAtrapds3UTR, bsgenome, 
                                 grams = mm10Signal, priority = priority, 
                                 from = -60, to = 10,
                                 label = "Mm", chrCheck = TRUE)
table(scAPAtrapds3UTR@anno$Mm_gram)
## 
## AACAAA AACAAG AAGAAA AATAAA AATAAG AATAAT AATACA AATAGA AATATA AATGAA ACTAAA AGTAAA ATTAAA ATTACA ATTATA CATAAA GATAAA 
##    291    114    399   8591    106    196    312    137    291    257    163    414   2400    127    137    260    206 
## TATAAA 
##    378
sum(!is.na(scAPAtrapds3UTR@anno$Mm_gram))/length(scAPAtrapds3UTR)  
## [1] 0.8211468
## scAPA
scAPAds3UTR <- annotateByPAS(pacds = scAPAds3UTR, bsgenome, 
                             grams = mm10Signal, priority = priority, 
                             from = -60, to = 10, 
                             label = "Mm", chrCheck = TRUE)
table(scAPAds3UTR@anno$Mm_gram)
## 
## AACAAA AACAAG AAGAAA AATAAA AATAAG AATAAT AATACA AATAGA AATATA AATGAA ACTAAA AGTAAA ATTAAA ATTACA ATTATA CATAAA GATAAA 
##    329    143    417   5122    138    204    274    145    306    300    165    350   1621    153    200    264    204 
## TATAAA 
##    443
sum(!is.na(scAPAds3UTR@anno$Mm_gram))/length(scAPAds3UTR)
## [1] 0.6323633
## Sierra
sierrads3UTR <- annotateByPAS(pacds = sierrads3UTR, bsgenome, 
                              grams = mm10Signal, priority = priority, 
                              from = -60, to = 10, 
                              label = "Mm", chrCheck = TRUE)
table(sierrads3UTR@anno$Mm_gram)
## 
## AACAAA AACAAG AAGAAA AATAAA AATAAG AATAAT AATACA AATAGA AATATA AATGAA ACTAAA AGTAAA ATTAAA ATTACA ATTATA CATAAA GATAAA 
##    202    118    271   3573     92    112    152     97    154    168     87    217    915     81    100    142     91 
## TATAAA 
##    227
sum(!is.na(sierrads3UTR@anno$Mm_gram))/length(sierrads3UTR)
## [1] 0.6097758

Compare proportions of poly(A) signals identified by the three methods.

scAPA.signal <- scAPAds3UTR@anno$Mm_gram[!is.na(scAPAds3UTR@anno$Mm_gram)]
scAPA.signal[!scAPA.signal %in% c("AATAAA", "ATTAAA")] <- "1nt-variants"
scAPA.signal <- as.data.frame(table(scAPA.signal))
scAPA.signal$Freq <- scAPA.signal$Freq/length(scAPAds3UTR@anno$Mm_gram)
scAPA.signal$methed <- rep("scAPA", length(scAPA.signal$Freq))
colnames(scAPA.signal) <- c("signal", "freq", "method")

scAPAtrap.signal <- scAPAtrapds3UTR@anno$Mm_gram[!is.na(scAPAtrapds3UTR@anno$Mm_gram)]
scAPAtrap.signal[!scAPAtrap.signal %in% c("AATAAA", "ATTAAA")] <- "1nt-variants"
scAPAtrap.signal <- as.data.frame(table(scAPAtrap.signal))
scAPAtrap.signal$Freq <- scAPAtrap.signal$Freq/length(scAPAtrapds3UTR@anno$Mm_gram)
scAPAtrap.signal$methed <- rep("scAPAtrap", length(scAPAtrap.signal$Freq))
colnames(scAPAtrap.signal) <- c("signal", "freq", "method")

sierra.signal <- sierrads3UTR@anno$Mm_gram[!is.na(sierrads3UTR@anno$Mm_gram)]
sierra.signal[!sierra.signal %in% c("AATAAA", "ATTAAA")] <- "1nt-variants"
sierra.signal <- as.data.frame(table(sierra.signal))
sierra.signal$Freq <- sierra.signal$Freq/length(sierrads3UTR@anno$Mm_gram)
sierra.signal$methed <- rep("Sierra", length(sierra.signal$Freq))
colnames(sierra.signal) <- c("signal", "freq", "method")

all_signal <- rbind(scAPAtrap.signal, scAPA.signal, sierra.signal)
all_signal$signal = factor(all_signal$signal, levels = c("1nt-variants", "ATTAAA", "AATAAA"))
all_signal$method <- factor(all_signal$method, levels = c("scAPAtrap", "scAPA", "Sierra",'Bulk'))
ggplot(all_signal) + 
  aes(x = method, fill = signal, colour = signal, weight = freq) + 
  geom_bar(position = "stack") + scale_fill_nejm() + 
  scale_color_nejm() + labs(y = "% of signal", fill = "method", color = "method") + 
  theme()

1.3.5 Overlap of PACs among the three methods

scAPAtrapds3UTR.G<- with(scAPAtrapds3UTR@anno, 
                         GRanges(seqnames = chr, 
                                 ranges = IRanges(start = start, end = end), 
                                 strand = strand))
scAPAds3UTR.G <- with(scAPAds3UTR@anno, 
                      GRanges(seqnames = chr, 
                              ranges = IRanges(start = start, end = end), 
                              strand = strand))
sierrads3UTR.G <- with(sierrads3UTR@anno, 
                       GRanges(seqnames = chr, 
                               ranges = IRanges(start = start, end = end), 
                               strand = strand))

ov <- ChIPpeakAnno::findOverlapsOfPeaks(scAPAtrapds3UTR.G, scAPAds3UTR.G, 
                                        sierrads3UTR.G , ignore.strand = FALSE)
ChIPpeakAnno::makeVennDiagram(ov, fill = c("#FF69B4", "#32CD32", "#FFFF00"))

1.4 Identification of lowly expressed PACs

Here we inspected lowly expressed poly(A) sites detected by the three methods from the mouse sperm data. We considered PACs that were expressed in at least two cells but supported by less than 30 reads as lowly expressed.

1.4.1 Basic statistics

cutoff <- 30
scAPAtrapds3UTR.low <- scAPAtrapds3UTR[rowSums(scAPAtrapds3UTR@counts) <= cutoff, ]
scAPAds3UTR.low <- scAPAds3UTR[rowSums(scAPAds3UTR@counts) <= cutoff, ]
sierrads3UTR.low <- sierrads3UTR[rowSums(sierrads3UTR@counts) <= cutoff, ]

Number of lowly expressed PACs by each method.

length(scAPAtrapds3UTR.low)
## [1] 4210
length(scAPAds3UTR.low)
## [1] 2714
length(sierrads3UTR.low)
## [1] 193

Number of lowly expressed PACs with poly(A) signals

sum(!is.na(scAPAtrapds3UTR.low@anno$Mm_gram))
## [1] 3318
sum(!is.na(scAPAds3UTR.low@anno$Mm_gram))
## [1] 877
sum(!is.na(sierrads3UTR.low@anno$Mm_gram))
## [1] 137

Ratio of lowly expressed PACs with poly(A) signals

sum(!is.na(scAPAtrapds3UTR.low@anno$Mm_gram))/length(scAPAtrapds3UTR.low)
## [1] 0.7881235
sum(!is.na(scAPAds3UTR.low@anno$Mm_gram))/length(scAPAds3UTR.low)
## [1] 0.3231393
sum(!is.na(sierrads3UTR.low@anno$Mm_gram))/length(sierrads3UTR.low)
## [1] 0.7098446

Number of lowly expressed PACs supported by annotated PACs

sum(scAPAtrapds3UTR.low@anno$Mm_ovp)
## [1] 3137
sum(scAPAds3UTR.low@anno$Mm_ovp)
## [1] 1530
sum(sierrads3UTR.low@anno$Mm_ovp)
## [1] 177

Ratio of lowly expressed PACs supported by annotated PACs

sum(scAPAtrapds3UTR.low@anno$Mm_ovp)/length(scAPAtrapds3UTR.low)
## [1] 0.7451306
sum(scAPAds3UTR.low@anno$Mm_ovp)/length(scAPAds3UTR.low)
## [1] 0.5637436
sum(sierrads3UTR.low@anno$Mm_ovp)/length(sierrads3UTR.low)
## [1] 0.9170984

1.4.2 Overlap of lowly expressed PACs among the three methods

scAPAtrap.g_low <- with(scAPAtrapds3UTR.low@anno, 
                        GRanges(seqnames = chr, 
                                ranges = IRanges(start = start, end = end), strand = strand))
scAPA.g_low <- with(scAPAds3UTR.low@anno, 
                    GRanges(seqnames = chr, 
                            ranges = IRanges(start = start, end = end), strand = strand))
sierra.g_low <- with(sierrads3UTR.low@anno, 
                     GRanges(seqnames = chr, 
                             ranges = IRanges(start = start, end = end), strand = strand))

ov1 <- ChIPpeakAnno::findOverlapsOfPeaks(scAPAtrap.g_low, 
                                         scAPA.g_low,sierra.g_low, ignore.strand = FALSE)
ChIPpeakAnno::makeVennDiagram(ov1, fill = c("#FF69B4", "#32CD32", "#FFFF00"))

1.4.3 Poly(A) signals and single nucleotide profiles around lowly expressed PACs

## Single nucleotide profiles
fafiles = faFromPACds(scAPAtrapds3UTR.low, bsgenome, what = "updn", 
                      fapre = "scAPAtrap3UTR.low", up = -300, dn = 100, chrCheck = FALSE)
## 4210 >>> scAPAtrap3UTR.low.fa
plotATCGforFAfile(fafiles, ofreq = TRUE, opdf = F, 
                  refPos = 301, mergePlots = F, filepre = "scAPAtrap3UTR.low")
## >>> scAPAtrap3UTR.low.scAPAtrap3UTR.low.freq

## motif
pas = scAPAtrapds3UTR.low@anno$Mm_gram[!is.na(scAPAtrapds3UTR.low@anno$Mm_gram)]
plotSeqLogo(pas)

1.4.4 Effect of lowly expressed PACs on APA extent

Changes in the number of genes with different number of PACs after inclusion of lowly expressed PACs.

scAPAtrapds3UTR.upper <- scAPAtrapds3UTR[rowSums(scAPAtrapds3UTR@counts) > cutoff, ]
temp1 <- dplyr::group_by(scAPAtrapds3UTR.upper@anno, gene) %>% summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
temp2 <- dplyr::group_by(scAPAtrapds3UTR@anno, gene) %>% summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
temp1_num <- c()
temp2_num <- c()
for (i in c(1:5)) {
    if (i < 5) {
        temp1_num <- c(temp1_num, sum(temp1$count == i))
        temp2_num <- c(temp2_num, sum(temp2$count == i))
    } else {
        temp1_num <- c(temp1_num, sum(temp1$count >= i))
        temp2_num <- c(temp2_num, sum(temp2$count >= i))
    }
}

nPAC <- paste0(c("=", "=", "=", "=", ">="), c(1:5))
lowplot <- data.frame(before = temp1_num, after = temp2_num, nPAC = nPAC)
lowplot
##   before after nPAC
## 1   7633  8896   =1
## 2   1859  2534   =2
## 3    445   712   =3
## 4     92   166   =4
## 5     27    79  >=5
lowplot <- reshape2::melt(lowplot, id = c("nPAC"))
ggplot(lowplot) + aes(x = nPAC, fill = variable, colour = variable, weight = value) + 
  geom_bar(position = "dodge") + 
  scale_fill_nejm() + 
  scale_color_nejm() + 
  labs(y = "The number of genes") + 
  theme()

2 Session Information

The session information records the versions of all the packages used in the generation of the present document.

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936   
## [3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C                              
## [5] LC_TIME=Chinese (Simplified)_China.936    
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] shiny_1.5.0                               TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
##  [3] BSgenome.Mmusculus.UCSC.mm10_1.4.0        motifStack_1.30.0                        
##  [5] ade4_1.7-15                               MotIV_1.42.0                             
##  [7] grImport2_0.2-0                           gdtools_0.2.2                            
##  [9] eoffice_0.1.9                             ggsci_2.9                                
## [11] BiocStyle_2.14.4                          knitr_1.29                               
## [13] movAPA_0.1.0                              DEXSeq_1.32.0                            
## [15] DESeq2_1.26.0                             SummarizedExperiment_1.16.1              
## [17] DelayedArray_0.12.3                       BiocParallel_1.20.1                      
## [19] matrixStats_0.56.0                        GenomicFeatures_1.38.2                   
## [21] AnnotationDbi_1.48.0                      Biobase_2.46.0                           
## [23] ggbio_1.34.0                              ggplot2_3.3.2                            
## [25] data.table_1.12.8                         RColorBrewer_1.1-2                       
## [27] reshape2_1.4.4                            dplyr_1.0.0                              
## [29] BSgenome_1.54.0                           rtracklayer_1.46.0                       
## [31] Biostrings_2.54.0                         XVector_0.26.0                           
## [33] GenomicRanges_1.38.0                      GenomeInfoDb_1.22.1                      
## [35] IRanges_2.20.2                            S4Vectors_0.24.4                         
## [37] BiocGenerics_0.32.0                      
## 
## loaded via a namespace (and not attached):
##   [1] esquisse_0.3.0           R.utils_2.9.2            tidyselect_1.1.0         RSQLite_2.2.0           
##   [5] htmlwidgets_1.5.1        munsell_0.5.0            statmod_1.4.34           miniUI_0.1.1.1          
##   [9] withr_2.2.0              colorspace_1.4-1         OrganismDbi_1.28.0       uuid_0.1-4              
##  [13] rstudioapi_0.11          officer_0.3.12           shinyWidgets_0.5.3       labeling_0.3            
##  [17] GenomeInfoDbData_1.2.2   hwriter_1.3.2            bit64_0.9-7              farver_2.0.3            
##  [21] vctrs_0.3.1              ChIPpeakAnno_3.20.1      generics_0.0.2           lambda.r_1.2.4          
##  [25] xfun_0.15                ggthemes_4.2.0           biovizBase_1.34.1        BiocFileCache_1.10.2    
##  [29] regioneR_1.18.1          R6_2.4.1                 idr_1.2                  locfit_1.5-9.4          
##  [33] AnnotationFilter_1.10.0  bitops_1.0-6             reshape_0.8.8            gridGraphics_0.5-0      
##  [37] assertthat_0.2.1         promises_1.1.1           scales_1.1.1             nnet_7.3-14             
##  [41] gtable_0.3.0             ensembldb_2.10.2         seqLogo_1.52.0           rlang_0.4.6             
##  [45] genefilter_1.68.0        systemfonts_0.2.3        splines_3.6.3            lazyeval_0.2.2          
##  [49] acepack_1.4.1            dichromat_2.0-0          broom_0.5.6              checkmate_2.0.0         
##  [53] BiocManager_1.30.10      yaml_2.2.1               backports_1.1.7          httpuv_1.5.4            
##  [57] Hmisc_4.4-0              RBGL_1.62.1              tools_3.6.3              bookdown_0.20           
##  [61] ggplotify_0.0.5          ellipsis_0.3.1           devEMF_3.7               Rcpp_1.0.5              
##  [65] plyr_1.8.6               base64enc_0.1-3          progress_1.2.2           zlibbioc_1.32.0         
##  [69] purrr_0.3.4              RCurl_1.98-1.2           prettyunits_1.1.1        rpart_4.1-15            
##  [73] openssl_1.4.1            cluster_2.1.0            magrittr_1.5             futile.options_1.0.1    
##  [77] magick_2.3               flextable_0.5.10         ProtGenerics_1.18.0      mime_0.9                
##  [81] hms_0.5.3                evaluate_0.14            xtable_1.8-4             XML_3.99-0.3            
##  [85] VennDiagram_1.6.20       jpeg_0.1-8.1             gridExtra_2.3            compiler_3.6.3          
##  [89] biomaRt_2.42.1           tibble_3.0.1             crayon_1.3.4             R.oo_1.23.0             
##  [93] htmltools_0.5.0          mgcv_1.8-31              later_1.1.0.1            Formula_1.2-3           
##  [97] tidyr_1.1.0              geneplotter_1.64.0       DBI_1.1.0                formatR_1.7             
## [101] dbplyr_1.4.4             MASS_7.3-51.6            rappdirs_0.3.1           rGADEM_2.34.1           
## [105] Matrix_1.2-18            R.methodsS3_1.8.0        pkgconfig_2.0.3          rvcheck_0.1.8           
## [109] GenomicAlignments_1.22.1 foreign_0.8-75           plotly_4.9.2.1           xml2_1.3.2              
## [113] annotate_1.64.0          multtest_2.42.0          rvg_0.2.4                stringr_1.4.0           
## [117] VariantAnnotation_1.32.0 digest_0.6.25            graph_1.64.0             rmarkdown_2.3           
## [121] htmlTable_2.0.1          curl_4.3                 Rsamtools_2.2.3          lifecycle_0.2.0         
## [125] nlme_3.1-148             jsonlite_1.6.1           seqinr_3.6-1             futile.logger_1.4.3     
## [129] viridisLite_0.3.0        askpass_1.1              limma_3.42.2             pillar_1.4.4            
## [133] lattice_0.20-41          GGally_2.0.0             R.devices_2.16.1         fastmap_1.0.1           
## [137] httr_1.4.1               survival_3.2-3           GO.db_3.10.0             glue_1.4.1              
## [141] zip_2.0.4                png_0.1-7                bit_1.1-15.2             stringi_1.4.6           
## [145] blob_1.2.1               latticeExtra_0.6-29      memoise_1.1.0

3 References

[1] Shulman, E.D. and Elkon, R. Cell-type-specific analysis of alternative polyadenylation using single-cell transcriptomics data. Nucleic Acids Res 2019;47(19):10027-10039.

[2] movAPA: Modeling and visualization of dynamics of alternative polyadenylation across biological samples (under review).

[3] Lukassen, S., Bosch, E., Ekici, A.B., et al. Single-cell RNA sequencing of adult mouse testes. Scientific Data 2018;5(1):180192.

[4] Patrick, R., Humphreys, D.T., Janbandhu, V., et al. Sierra: discovery of differential transcript usage from polyA-captured single-cell RNA-seq data. Genome Biol 2020;21(1):167.