Here we adopted a mouse sperm scRNA-seq dataset to evaluate the performance of scAPAtrap and compared the results with other two tools, scAPA(Shulman et al, 2019) and Sierra(Patrick, et al., 2020). Majority of the comparison analyses were performed with the movAPA toolkit. For detailed usage of movAPA toolkit, please refer to link
We have used scAPAtrap, scAPA, and Sierra to identify poly(A) sites from the mouse sperm scRNA-seq data, respectively. The identified poly(A) sites can be downloaded from this link.
We focused on three stages of differentiation process as in the previous study (Shulman et al, 2019), including early stage (spermatocytes, SC), intermediate stage (round spermatids, RS), and late stage (elongating spermatids, ES). For detailed information, please refer to the original publication Lukassen, et al., 2018.
In this section, we import the required R packages and some required data. Please download the required data from the above link first before running the code. ### Load R package
library(ggplot2)
library(reshape2)
library(ggsci)
library(movAPA)
Load the mouse reference genome, reference genome annotation, and annotated poly(A) sites from two latest known poly(A) site databases from bulk 3?? seq, polyASite 2.0 and PolyA_DB 3.
Please replace the file path in the following commands.
## Load mouse reference genome
library(BSgenome.Mmusculus.UCSC.mm10)
bsgenome <- BSgenome.Mmusculus.UCSC.mm10
## Load reference genome annotation
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdbmm10 <- parseGenomeAnnotation(TxDb.Mmusculus.UCSC.mm10.knownGene)
## Load annotated poly(A) sites from polyAsite 2.0.
load('C:/Users/admin/Desktop/refdata/mm10polyAsite2.rda')
## Load annotated poly(A) sites from PolyA_DB 3.
load('C:/Users/admin/Desktop/refdata/mm10polyAdb3.rda')
Load poly(A) sites identified with the three tools.
## scAPA
load("C:/Users/admin/Desktop/scPACdsData/scAPAds.rda")
## scAPAtrap
load("C:/Users/admin/Desktop/scPACdsData/scAPAtrapds.rda")
## sierra
load("C:/Users/admin/Desktop/scPACdsData/sierrads.rda")
Here we refer to poly(A) sites in above loaded datasets as poly(A) site clusters (PACs), as nearby cleavage sites have been already grouped into one poly(A) site.
In this section, we will do some basic statistics on PACs identified by the three methods.
## Total number of PACs
sum(scAPAtrapds@counts)
## [1] 52592642
sum(scAPAds@counts)
## [1] 53185459
sum(sierrads@counts)
## [1] 31069317
## Average number of PACs per cell
sum(scAPAtrapds@counts)/ncol(scAPAtrapds@counts)
## [1] 25755.46
sum(scAPAds@counts)/ncol(scAPAtrapds@counts)
## [1] 26045.77
sum(sierrads@counts)/ncol(sierrads@counts)
## [1] 15215.14
## scAPAtrap
table(scAPAtrapds@anno$ftr)
##
## 3UTR 5UTR CDS exon intergenic intron
## 17998 139 3660 919 43698 21692
## scAPA
table(scAPAds@anno$ftr)
##
## 3UTR 5UTR CDS exon intergenic intron
## 17044 3 344 137 624 785
## sierra
table(sierrads@anno$ftr)
##
## 3UTR 5UTR CDS exon intergenic intron
## 11150 218 3355 722 1669 5505
## scAPAtrap
dsPlot = scAPAtrapds[scAPAtrapds@anno$ftr %in% c("3UTR", "CDS", "intron","intergenic")]
fafiles1 = faFromPACds(dsPlot, bsgenome, what = "updn", fapre = "scAPAtrap.200nt",
up = -100, dn = 100, byGrp = "ftr", chrCheck = FALSE)
## 43698 >>> scAPAtrap.200nt.intergenic.fa
## 17998 >>> scAPAtrap.200nt.3UTR.fa
## 3660 >>> scAPAtrap.200nt.CDS.fa
## 21692 >>> scAPAtrap.200nt.intron.fa
plotATCGforFAfile(fafiles1, ofreq = TRUE, opdf = FALSE, refPos = 101,
mergePlots = T, filepre = "scAPAtrap.200nt")
## >>> scAPAtrap.200nt.intergenic.scAPAtrap.200nt.freq
## >>> scAPAtrap.200nt.3UTR.scAPAtrap.200nt.freq
## >>> scAPAtrap.200nt.CDS.scAPAtrap.200nt.freq
## >>> scAPAtrap.200nt.intron.scAPAtrap.200nt.freq
## sierra
dsPlot = sierrads[sierrads@anno$ftr %in% c("3UTR", "CDS", "intron","intergenic")]
fafiles2 = faFromPACds(dsPlot, bsgenome, what = "updn", fapre = "sierra.200nt",
up = -100, dn = 100, byGrp = "ftr", chrCheck = FALSE)
## 11150 >>> sierra.200nt.3UTR.fa
## 5505 >>> sierra.200nt.intron.fa
## 3355 >>> sierra.200nt.CDS.fa
## 1669 >>> sierra.200nt.intergenic.fa
plotATCGforFAfile(fafiles2, ofreq = TRUE, opdf = FALSE, refPos = 101,
mergePlots = T, filepre = "sierra.200nt")
## >>> sierra.200nt.3UTR.sierra.200nt.freq
## >>> sierra.200nt.intron.sierra.200nt.freq
## >>> sierra.200nt.CDS.sierra.200nt.freq
## >>> sierra.200nt.intergenic.sierra.200nt.freq
## scAPA
dsPlot = scAPAds[scAPAds@anno$ftr %in% c("3UTR", "CDS", "intron","intergenic")]
fafiles3 = faFromPACds(dsPlot, bsgenome, what = "updn", fapre = "scAPA.200nt",
up = -100, dn = 100, byGrp = "ftr", chrCheck = FALSE)
## 17044 >>> scAPA.200nt.3UTR.fa
## 785 >>> scAPA.200nt.intron.fa
## 344 >>> scAPA.200nt.CDS.fa
## 624 >>> scAPA.200nt.intergenic.fa
plotATCGforFAfile(fafiles3, ofreq = TRUE, opdf = FALSE, refPos = 101,
mergePlots = T, filepre = "scAPA.200nt")
## >>> scAPA.200nt.3UTR.scAPA.200nt.freq
## >>> scAPA.200nt.intron.scAPA.200nt.freq
## >>> scAPA.200nt.CDS.scAPA.200nt.freq
## >>> scAPA.200nt.intergenic.scAPA.200nt.freq
## ploysite2.0
mm10polyAsite2<- annotatePAC(mm10polyAsite2,aGFF = txdbmm10)
mm10polyAsite2 <- ext3UTRPACds(mm10polyAsite2, ext3UTRlen=1000)
## 19027 PACs in extended 3UTR (ftr=intergenic >> ftr=3UTR)
## Get 3UTR length (anno@toStop) for 3UTR/extended 3UTR PACs
table(mm10polyAsite2@anno$ftr)
##
## 3UTR 5UTR CDS exon intergenic intron
## 96703 1173 16251 7408 112697 66527
## polyAsite2.0
dsPlot = mm10polyAsite2[mm10polyAsite2@anno$ftr %in% c("3UTR", "CDS", "intron","intergenic")]
fafiles4 = faFromPACds(dsPlot, bsgenome, what = "updn", fapre = "mm10polyAsite2.200nt",
up = -100, dn = 100, byGrp = "ftr", chrCheck = FALSE)
## 112697 >>> mm10polyAsite2.200nt.intergenic.fa
## 66527 >>> mm10polyAsite2.200nt.intron.fa
## 96703 >>> mm10polyAsite2.200nt.3UTR.fa
## 16251 >>> mm10polyAsite2.200nt.CDS.fa
plotATCGforFAfile(fafiles4, ofreq = TRUE, opdf = FALSE, refPos = 101,
mergePlots = T, filepre = "mm10polyAsite2.200nt")
## >>> mm10polyAsite2.200nt.intergenic.mm10polyAsite2.200nt.freq
## >>> mm10polyAsite2.200nt.intron.mm10polyAsite2.200nt.freq
## >>> mm10polyAsite2.200nt.3UTR.mm10polyAsite2.200nt.freq
## >>> mm10polyAsite2.200nt.CDS.mm10polyAsite2.200nt.freq
## ployAdb3
mm10polyAdb3<- annotatePAC(mm10polyAdb3, aGFF = txdbmm10)
mm10polyAdb3 <- ext3UTRPACds(mm10polyAdb3, ext3UTRlen=1000)
## 26966 PACs in extended 3UTR (ftr=intergenic >> ftr=3UTR)
## Get 3UTR length (anno@toStop) for 3UTR/extended 3UTR PACs
table(mm10polyAdb3@anno$ftr)
##
## 3UTR 5UTR CDS exon intergenic intron
## 123208 720 12776 11039 146249 90327
## polyAdb3
dsPlot = mm10polyAdb3[mm10polyAdb3@anno$ftr %in% c("3UTR", "CDS", "intron","intergenic")]
fafiles = faFromPACds(dsPlot, bsgenome, what = "updn", fapre = "mm10polyAdb3.200nt",
up = -100, dn = 100, byGrp = "ftr", chrCheck = FALSE)
## 146249 >>> mm10polyAdb3.200nt.intergenic.fa
## 123208 >>> mm10polyAdb3.200nt.3UTR.fa
## 90327 >>> mm10polyAdb3.200nt.intron.fa
## 12776 >>> mm10polyAdb3.200nt.CDS.fa
plotATCGforFAfile(fafiles, ofreq = TRUE, opdf = FALSE, refPos = 101,
mergePlots = T, filepre = "mm10polyAdb3.200nt")
## >>> mm10polyAdb3.200nt.intergenic.mm10polyAdb3.200nt.freq
## >>> mm10polyAdb3.200nt.3UTR.mm10polyAdb3.200nt.freq
## >>> mm10polyAdb3.200nt.intron.mm10polyAdb3.200nt.freq
## >>> mm10polyAdb3.200nt.CDS.mm10polyAdb3.200nt.freq
In this section, we extract PACs in the 3’ UTR and extended 3’ UTRs and compare the performance of the three methods from the two aspects – poly(A) signal and the overlap with annotated PACs.
scAPAtrapds3UTR <- scAPAtrapds[scAPAtrapds@anno$ftr=='3UTR',]
scAPAds3UTR <- scAPAds[scAPAds@anno$ftr=='3UTR',]
sierrads3UTR <- sierrads[sierrads@anno$ftr == '3UTR',]
## Show numbers of 3'UTR PACs by barplot
PAC <- c(nrow(scAPAtrapds3UTR@counts),nrow(scAPAds3UTR@counts),nrow(sierrads3UTR@counts))
method <- factor(c('scAPAtrap','scAPA','sierra'),levels = c('scAPAtrap','scAPA','sierra'))
temp <- data.frame(PAC = PAC,method = method)
ggplot(temp) + aes(x = method, fill = method, colour = method,weight = PAC) + geom_bar() + scale_fill_nejm() + scale_color_nejm() + theme()
## Number of genes with 3'UTR PACs
length(unique(scAPAtrapds3UTR@anno$gene))
## [1] 12387
length(unique(scAPAds3UTR@anno$gene))
## [1] 11811
length(unique(sierrads3UTR@anno$gene))
## [1] 7938
# Number of 3'UTR-APA genes
scAPAtrapds3UTRh2 <- get3UTRAPAds(scAPAtrapds3UTR, sortPA = TRUE, choose2PA = NULL)
scAPAds3UTRh2 <- get3UTRAPAds(scAPAds3UTR, sortPA = TRUE, choose2PA = NULL)
sierrads3UTRh2 <- get3UTRAPAds(sierrads3UTR, sortPA = TRUE, choose2PA = NULL)
length(unique(scAPAtrapds3UTRh2@anno$gene))
## [1] 3484
length(unique(scAPAds3UTRh2@anno$gene))
## [1] 3789
length(unique(sierrads3UTRh2@anno$gene))
## [1] 2246
Here we calculate the precision of poly(A) site prediction of different tools using PACs from PolyA_DB 3 or PolyASite 2.0 as reference. Cutoffs ranging from 10 bp to 100 bp in a 10 bp increment were used to determine whether a predicted poly(A) site is supported by annotated PACs.
## PolyASite2.0
scAPAtrap.map <- c()
scAPA.map <- c()
Sierra.map <- c()
for (d in 2:20){
scAPAtrapds3UTR <- annotateByKnownPAC(scAPAtrapds3UTR, mm10polyAsite2, labels="Mm", d=5*d)
scAPAtrap.map <- c(scAPAtrap.map,sum(scAPAtrapds3UTR@anno$Mm_ovp == 1))
scAPAds3UTR <- annotateByKnownPAC(scAPAds3UTR, mm10polyAsite2, labels="Mm", d=5*d)
scAPA.map <- c(scAPA.map,sum(scAPAds3UTR@anno$Mm_ovp == 1))
sierrads3UTR <- annotateByKnownPAC(sierrads3UTR, mm10polyAsite2, labels="Mm", d=5*d)
Sierra.map <- c(Sierra.map,sum(sierrads3UTR@anno$Mm_ovp == 1))
}
scAPAtrap.map <- round(scAPAtrap.map/length(scAPAtrapds3UTR),2)
scAPA.map <- round(scAPA.map/length(scAPAds3UTR),2)
Sierra.map <- round(Sierra.map/length(sierrads3UTR),2)
d <- c(2:20)*5
all.map <- data.frame(scAPAtrap = scAPAtrap.map,scAPA = scAPA.map, Sierra = Sierra.map,d=d)
all.map <- reshape2::melt(all.map, id = c("d"))
ggplot(all.map) + aes(x = d, y = value, fill = variable, colour = variable, shape = variable) +
geom_point(size = 3L) +
geom_smooth(span = 0.33) +
scale_fill_nejm() + scale_color_nejm() +
annotate("text", x = 60, y = 0.9, label = "polyA site2.0",
family = "Arial", size = 10L) +
labs(x = "Cutoff(distance)", y = "Relative to known PA(%)") +
theme()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## PolyA_DB 3
scAPAtrap.map <- c()
scAPA.map <- c()
Sierra.map <- c()
for (d in 2:20){
scAPAtrapds3UTR <- annotateByKnownPAC(scAPAtrapds3UTR, mm10polyAdb3, labels="Mm", d=5*d)
scAPAtrap.map <- c(scAPAtrap.map,sum(scAPAtrapds3UTR@anno$Mm_ovp == 1))
scAPAds3UTR <- annotateByKnownPAC(scAPAds3UTR, mm10polyAdb3, labels="Mm", d=5*d)
scAPA.map <- c(scAPA.map,sum(scAPAds3UTR@anno$Mm_ovp == 1))
sierrads3UTR <- annotateByKnownPAC(sierrads3UTR, mm10polyAdb3, labels="Mm", d=5*d)
Sierra.map <- c(Sierra.map,sum(sierrads3UTR@anno$Mm_ovp == 1))
}
scAPAtrap.map <- round(scAPAtrap.map/length(scAPAtrapds3UTR),2)
scAPA.map <- round(scAPA.map/length(scAPAds3UTR),2)
Sierra.map <- round(Sierra.map/length(sierrads3UTR),2)
d <- c(2:20)*5
all.map <- data.frame(scAPAtrap = scAPAtrap.map,scAPA = scAPA.map, Sierra = Sierra.map,d=d)
all.map <- reshape2::melt(all.map, id = c("d"))
ggplot(all.map) + aes(x = d, y = value, fill = variable, colour = variable, shape = variable) +
geom_point(size = 3L) +
geom_smooth(span = 0.33) +
scale_fill_nejm() + scale_color_nejm() +
annotate("text", x = 60, y = 0.9, label = "polyA DB3",
family = "Arial", size = 10L) +
labs(x = "Cutoff(distance)", y = "Relative to known PA(%)") +
theme()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
In this part, we associate PACs identified by the three methods to PACs in polyAsite 2.0 with a distance cutoff of 100 bp.
scAPAtrapds3UTR <- annotateByKnownPAC(scAPAtrapds3UTR, mm10polyAsite2, labels = "Mm", d = 100)
## query: 17998
## known: 300759 ( Mm )
## 14986 query PACs overlapping known PACs
scAPAds3UTR <- annotateByKnownPAC(scAPAds3UTR, mm10polyAsite2, labels = "Mm", d = 100)
## query: 17044
## known: 300759 ( Mm )
## 13979 query PACs overlapping known PACs
sierrads3UTR <- annotateByKnownPAC(sierrads3UTR , mm10polyAsite2, labels = "Mm", d = 100)
## query: 11150
## known: 300759 ( Mm )
## 9112 query PACs overlapping known PACs
When d=100, the proportion of annotation sites on the association.
sum(na.omit(scAPAtrapds3UTR@anno$Mm_dist))/sum(!is.na(scAPAtrapds3UTR@anno$Mm_dist));
## [1] 31.44955
sum(na.omit(scAPAds3UTR@anno$Mm_dist))/sum(!is.na(scAPAds3UTR@anno$Mm_dist));
## [1] 41.5308
sum(na.omit(sierrads3UTR@anno$Mm_dist))/sum(!is.na(sierrads3UTR@anno$Mm_dist))
## [1] 45.04653
Plot the distribution of distances from predicted PACs to annotated ones.
scAPA.dist <- as.data.frame(table(na.omit(scAPAds3UTR@anno$Mm_dist)))
scAPAtrap.dist <- as.data.frame(table(na.omit(scAPAtrapds3UTR@anno$Mm_dist)))
sierra.dist <- as.data.frame(table(na.omit(sierrads3UTR@anno$Mm_dist)))
scAPA.dist$Freq <- scAPA.dist$Freq/sum(scAPA.dist$Freq)
scAPAtrap.dist$Freq <- scAPAtrap.dist$Freq/sum(scAPAtrap.dist$Freq)
sierra.dist$Freq <- sierra.dist$Freq/sum(sierra.dist$Freq)
scAPA.dist$method <- rep("scAPA", length(scAPA.dist$Freq))
scAPAtrap.dist$method <- rep("scAPAtrap", length(scAPAtrap.dist$Freq))
sierra.dist$method <- rep("Sierra", length(sierra.dist$Freq))
all_dist <- rbind(scAPA.dist, scAPAtrap.dist, sierra.dist)
all_dist$Var1 <- as.numeric(all_dist$Var1)
all_dist$Var1 <- all_dist$Var1
all_dist$method <- factor(all_dist$method, levels = c("scAPAtrap", "Sierra", "scAPA"))
ggplot(all_dist) + aes(x = Var1, y = Freq, fill = method, colour = method) +
geom_line(size = 1L) +
scale_fill_nejm() +
scale_color_nejm() +
labs(x = "Cutoff(distance)", y = "% of PA", fill = "method", color = "method") +
theme()
## Import mouse poly(A) signal hexamers
load('C:/Users/admin/Desktop/refdata/mm10Signal.rda')
## Set priority: AATAAA >> ATTAAA >> other hexamers
priority <- c(1, 2, rep(3, 16))
## scAPAtrap
scAPAtrapds3UTR <- annotateByPAS(pacds = scAPAtrapds3UTR, bsgenome,
grams = mm10Signal, priority = priority,
from = -60, to = 10,
label = "Mm", chrCheck = TRUE)
table(scAPAtrapds3UTR@anno$Mm_gram)
##
## AACAAA AACAAG AAGAAA AATAAA AATAAG AATAAT AATACA AATAGA AATATA AATGAA ACTAAA AGTAAA ATTAAA ATTACA ATTATA CATAAA GATAAA
## 291 114 399 8591 106 196 312 137 291 257 163 414 2400 127 137 260 206
## TATAAA
## 378
sum(!is.na(scAPAtrapds3UTR@anno$Mm_gram))/length(scAPAtrapds3UTR)
## [1] 0.8211468
## scAPA
scAPAds3UTR <- annotateByPAS(pacds = scAPAds3UTR, bsgenome,
grams = mm10Signal, priority = priority,
from = -60, to = 10,
label = "Mm", chrCheck = TRUE)
table(scAPAds3UTR@anno$Mm_gram)
##
## AACAAA AACAAG AAGAAA AATAAA AATAAG AATAAT AATACA AATAGA AATATA AATGAA ACTAAA AGTAAA ATTAAA ATTACA ATTATA CATAAA GATAAA
## 329 143 417 5122 138 204 274 145 306 300 165 350 1621 153 200 264 204
## TATAAA
## 443
sum(!is.na(scAPAds3UTR@anno$Mm_gram))/length(scAPAds3UTR)
## [1] 0.6323633
## Sierra
sierrads3UTR <- annotateByPAS(pacds = sierrads3UTR, bsgenome,
grams = mm10Signal, priority = priority,
from = -60, to = 10,
label = "Mm", chrCheck = TRUE)
table(sierrads3UTR@anno$Mm_gram)
##
## AACAAA AACAAG AAGAAA AATAAA AATAAG AATAAT AATACA AATAGA AATATA AATGAA ACTAAA AGTAAA ATTAAA ATTACA ATTATA CATAAA GATAAA
## 202 118 271 3573 92 112 152 97 154 168 87 217 915 81 100 142 91
## TATAAA
## 227
sum(!is.na(sierrads3UTR@anno$Mm_gram))/length(sierrads3UTR)
## [1] 0.6097758
Compare proportions of poly(A) signals identified by the three methods.
scAPA.signal <- scAPAds3UTR@anno$Mm_gram[!is.na(scAPAds3UTR@anno$Mm_gram)]
scAPA.signal[!scAPA.signal %in% c("AATAAA", "ATTAAA")] <- "1nt-variants"
scAPA.signal <- as.data.frame(table(scAPA.signal))
scAPA.signal$Freq <- scAPA.signal$Freq/length(scAPAds3UTR@anno$Mm_gram)
scAPA.signal$methed <- rep("scAPA", length(scAPA.signal$Freq))
colnames(scAPA.signal) <- c("signal", "freq", "method")
scAPAtrap.signal <- scAPAtrapds3UTR@anno$Mm_gram[!is.na(scAPAtrapds3UTR@anno$Mm_gram)]
scAPAtrap.signal[!scAPAtrap.signal %in% c("AATAAA", "ATTAAA")] <- "1nt-variants"
scAPAtrap.signal <- as.data.frame(table(scAPAtrap.signal))
scAPAtrap.signal$Freq <- scAPAtrap.signal$Freq/length(scAPAtrapds3UTR@anno$Mm_gram)
scAPAtrap.signal$methed <- rep("scAPAtrap", length(scAPAtrap.signal$Freq))
colnames(scAPAtrap.signal) <- c("signal", "freq", "method")
sierra.signal <- sierrads3UTR@anno$Mm_gram[!is.na(sierrads3UTR@anno$Mm_gram)]
sierra.signal[!sierra.signal %in% c("AATAAA", "ATTAAA")] <- "1nt-variants"
sierra.signal <- as.data.frame(table(sierra.signal))
sierra.signal$Freq <- sierra.signal$Freq/length(sierrads3UTR@anno$Mm_gram)
sierra.signal$methed <- rep("Sierra", length(sierra.signal$Freq))
colnames(sierra.signal) <- c("signal", "freq", "method")
all_signal <- rbind(scAPAtrap.signal, scAPA.signal, sierra.signal)
all_signal$signal = factor(all_signal$signal, levels = c("1nt-variants", "ATTAAA", "AATAAA"))
all_signal$method <- factor(all_signal$method, levels = c("scAPAtrap", "scAPA", "Sierra",'Bulk'))
ggplot(all_signal) +
aes(x = method, fill = signal, colour = signal, weight = freq) +
geom_bar(position = "stack") + scale_fill_nejm() +
scale_color_nejm() + labs(y = "% of signal", fill = "method", color = "method") +
theme()
scAPAtrapds3UTR.G<- with(scAPAtrapds3UTR@anno,
GRanges(seqnames = chr,
ranges = IRanges(start = start, end = end),
strand = strand))
scAPAds3UTR.G <- with(scAPAds3UTR@anno,
GRanges(seqnames = chr,
ranges = IRanges(start = start, end = end),
strand = strand))
sierrads3UTR.G <- with(sierrads3UTR@anno,
GRanges(seqnames = chr,
ranges = IRanges(start = start, end = end),
strand = strand))
ov <- ChIPpeakAnno::findOverlapsOfPeaks(scAPAtrapds3UTR.G, scAPAds3UTR.G,
sierrads3UTR.G , ignore.strand = FALSE)
ChIPpeakAnno::makeVennDiagram(ov, fill = c("#FF69B4", "#32CD32", "#FFFF00"))
Here we inspected lowly expressed poly(A) sites detected by the three methods from the mouse sperm data. We considered PACs that were expressed in at least two cells but supported by less than 30 reads as lowly expressed.
cutoff <- 30
scAPAtrapds3UTR.low <- scAPAtrapds3UTR[rowSums(scAPAtrapds3UTR@counts) <= cutoff, ]
scAPAds3UTR.low <- scAPAds3UTR[rowSums(scAPAds3UTR@counts) <= cutoff, ]
sierrads3UTR.low <- sierrads3UTR[rowSums(sierrads3UTR@counts) <= cutoff, ]
Number of lowly expressed PACs by each method.
length(scAPAtrapds3UTR.low)
## [1] 4210
length(scAPAds3UTR.low)
## [1] 2714
length(sierrads3UTR.low)
## [1] 193
Number of lowly expressed PACs with poly(A) signals
sum(!is.na(scAPAtrapds3UTR.low@anno$Mm_gram))
## [1] 3318
sum(!is.na(scAPAds3UTR.low@anno$Mm_gram))
## [1] 877
sum(!is.na(sierrads3UTR.low@anno$Mm_gram))
## [1] 137
Ratio of lowly expressed PACs with poly(A) signals
sum(!is.na(scAPAtrapds3UTR.low@anno$Mm_gram))/length(scAPAtrapds3UTR.low)
## [1] 0.7881235
sum(!is.na(scAPAds3UTR.low@anno$Mm_gram))/length(scAPAds3UTR.low)
## [1] 0.3231393
sum(!is.na(sierrads3UTR.low@anno$Mm_gram))/length(sierrads3UTR.low)
## [1] 0.7098446
Number of lowly expressed PACs supported by annotated PACs
sum(scAPAtrapds3UTR.low@anno$Mm_ovp)
## [1] 3137
sum(scAPAds3UTR.low@anno$Mm_ovp)
## [1] 1530
sum(sierrads3UTR.low@anno$Mm_ovp)
## [1] 177
Ratio of lowly expressed PACs supported by annotated PACs
sum(scAPAtrapds3UTR.low@anno$Mm_ovp)/length(scAPAtrapds3UTR.low)
## [1] 0.7451306
sum(scAPAds3UTR.low@anno$Mm_ovp)/length(scAPAds3UTR.low)
## [1] 0.5637436
sum(sierrads3UTR.low@anno$Mm_ovp)/length(sierrads3UTR.low)
## [1] 0.9170984
scAPAtrap.g_low <- with(scAPAtrapds3UTR.low@anno,
GRanges(seqnames = chr,
ranges = IRanges(start = start, end = end), strand = strand))
scAPA.g_low <- with(scAPAds3UTR.low@anno,
GRanges(seqnames = chr,
ranges = IRanges(start = start, end = end), strand = strand))
sierra.g_low <- with(sierrads3UTR.low@anno,
GRanges(seqnames = chr,
ranges = IRanges(start = start, end = end), strand = strand))
ov1 <- ChIPpeakAnno::findOverlapsOfPeaks(scAPAtrap.g_low,
scAPA.g_low,sierra.g_low, ignore.strand = FALSE)
ChIPpeakAnno::makeVennDiagram(ov1, fill = c("#FF69B4", "#32CD32", "#FFFF00"))
## Single nucleotide profiles
fafiles = faFromPACds(scAPAtrapds3UTR.low, bsgenome, what = "updn",
fapre = "scAPAtrap3UTR.low", up = -300, dn = 100, chrCheck = FALSE)
## 4210 >>> scAPAtrap3UTR.low.fa
plotATCGforFAfile(fafiles, ofreq = TRUE, opdf = F,
refPos = 301, mergePlots = F, filepre = "scAPAtrap3UTR.low")
## >>> scAPAtrap3UTR.low.scAPAtrap3UTR.low.freq
## motif
pas = scAPAtrapds3UTR.low@anno$Mm_gram[!is.na(scAPAtrapds3UTR.low@anno$Mm_gram)]
plotSeqLogo(pas)
Changes in the number of genes with different number of PACs after inclusion of lowly expressed PACs.
scAPAtrapds3UTR.upper <- scAPAtrapds3UTR[rowSums(scAPAtrapds3UTR@counts) > cutoff, ]
temp1 <- dplyr::group_by(scAPAtrapds3UTR.upper@anno, gene) %>% summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
temp2 <- dplyr::group_by(scAPAtrapds3UTR@anno, gene) %>% summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
temp1_num <- c()
temp2_num <- c()
for (i in c(1:5)) {
if (i < 5) {
temp1_num <- c(temp1_num, sum(temp1$count == i))
temp2_num <- c(temp2_num, sum(temp2$count == i))
} else {
temp1_num <- c(temp1_num, sum(temp1$count >= i))
temp2_num <- c(temp2_num, sum(temp2$count >= i))
}
}
nPAC <- paste0(c("=", "=", "=", "=", ">="), c(1:5))
lowplot <- data.frame(before = temp1_num, after = temp2_num, nPAC = nPAC)
lowplot
## before after nPAC
## 1 7633 8896 =1
## 2 1859 2534 =2
## 3 445 712 =3
## 4 92 166 =4
## 5 27 79 >=5
lowplot <- reshape2::melt(lowplot, id = c("nPAC"))
ggplot(lowplot) + aes(x = nPAC, fill = variable, colour = variable, weight = value) +
geom_bar(position = "dodge") +
scale_fill_nejm() +
scale_color_nejm() +
labs(y = "The number of genes") +
theme()
The session information records the versions of all the packages used in the generation of the present document.
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.936 LC_CTYPE=Chinese (Simplified)_China.936
## [3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.936
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] shiny_1.5.0 TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
## [3] BSgenome.Mmusculus.UCSC.mm10_1.4.0 motifStack_1.30.0
## [5] ade4_1.7-15 MotIV_1.42.0
## [7] grImport2_0.2-0 gdtools_0.2.2
## [9] eoffice_0.1.9 ggsci_2.9
## [11] BiocStyle_2.14.4 knitr_1.29
## [13] movAPA_0.1.0 DEXSeq_1.32.0
## [15] DESeq2_1.26.0 SummarizedExperiment_1.16.1
## [17] DelayedArray_0.12.3 BiocParallel_1.20.1
## [19] matrixStats_0.56.0 GenomicFeatures_1.38.2
## [21] AnnotationDbi_1.48.0 Biobase_2.46.0
## [23] ggbio_1.34.0 ggplot2_3.3.2
## [25] data.table_1.12.8 RColorBrewer_1.1-2
## [27] reshape2_1.4.4 dplyr_1.0.0
## [29] BSgenome_1.54.0 rtracklayer_1.46.0
## [31] Biostrings_2.54.0 XVector_0.26.0
## [33] GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
## [35] IRanges_2.20.2 S4Vectors_0.24.4
## [37] BiocGenerics_0.32.0
##
## loaded via a namespace (and not attached):
## [1] esquisse_0.3.0 R.utils_2.9.2 tidyselect_1.1.0 RSQLite_2.2.0
## [5] htmlwidgets_1.5.1 munsell_0.5.0 statmod_1.4.34 miniUI_0.1.1.1
## [9] withr_2.2.0 colorspace_1.4-1 OrganismDbi_1.28.0 uuid_0.1-4
## [13] rstudioapi_0.11 officer_0.3.12 shinyWidgets_0.5.3 labeling_0.3
## [17] GenomeInfoDbData_1.2.2 hwriter_1.3.2 bit64_0.9-7 farver_2.0.3
## [21] vctrs_0.3.1 ChIPpeakAnno_3.20.1 generics_0.0.2 lambda.r_1.2.4
## [25] xfun_0.15 ggthemes_4.2.0 biovizBase_1.34.1 BiocFileCache_1.10.2
## [29] regioneR_1.18.1 R6_2.4.1 idr_1.2 locfit_1.5-9.4
## [33] AnnotationFilter_1.10.0 bitops_1.0-6 reshape_0.8.8 gridGraphics_0.5-0
## [37] assertthat_0.2.1 promises_1.1.1 scales_1.1.1 nnet_7.3-14
## [41] gtable_0.3.0 ensembldb_2.10.2 seqLogo_1.52.0 rlang_0.4.6
## [45] genefilter_1.68.0 systemfonts_0.2.3 splines_3.6.3 lazyeval_0.2.2
## [49] acepack_1.4.1 dichromat_2.0-0 broom_0.5.6 checkmate_2.0.0
## [53] BiocManager_1.30.10 yaml_2.2.1 backports_1.1.7 httpuv_1.5.4
## [57] Hmisc_4.4-0 RBGL_1.62.1 tools_3.6.3 bookdown_0.20
## [61] ggplotify_0.0.5 ellipsis_0.3.1 devEMF_3.7 Rcpp_1.0.5
## [65] plyr_1.8.6 base64enc_0.1-3 progress_1.2.2 zlibbioc_1.32.0
## [69] purrr_0.3.4 RCurl_1.98-1.2 prettyunits_1.1.1 rpart_4.1-15
## [73] openssl_1.4.1 cluster_2.1.0 magrittr_1.5 futile.options_1.0.1
## [77] magick_2.3 flextable_0.5.10 ProtGenerics_1.18.0 mime_0.9
## [81] hms_0.5.3 evaluate_0.14 xtable_1.8-4 XML_3.99-0.3
## [85] VennDiagram_1.6.20 jpeg_0.1-8.1 gridExtra_2.3 compiler_3.6.3
## [89] biomaRt_2.42.1 tibble_3.0.1 crayon_1.3.4 R.oo_1.23.0
## [93] htmltools_0.5.0 mgcv_1.8-31 later_1.1.0.1 Formula_1.2-3
## [97] tidyr_1.1.0 geneplotter_1.64.0 DBI_1.1.0 formatR_1.7
## [101] dbplyr_1.4.4 MASS_7.3-51.6 rappdirs_0.3.1 rGADEM_2.34.1
## [105] Matrix_1.2-18 R.methodsS3_1.8.0 pkgconfig_2.0.3 rvcheck_0.1.8
## [109] GenomicAlignments_1.22.1 foreign_0.8-75 plotly_4.9.2.1 xml2_1.3.2
## [113] annotate_1.64.0 multtest_2.42.0 rvg_0.2.4 stringr_1.4.0
## [117] VariantAnnotation_1.32.0 digest_0.6.25 graph_1.64.0 rmarkdown_2.3
## [121] htmlTable_2.0.1 curl_4.3 Rsamtools_2.2.3 lifecycle_0.2.0
## [125] nlme_3.1-148 jsonlite_1.6.1 seqinr_3.6-1 futile.logger_1.4.3
## [129] viridisLite_0.3.0 askpass_1.1 limma_3.42.2 pillar_1.4.4
## [133] lattice_0.20-41 GGally_2.0.0 R.devices_2.16.1 fastmap_1.0.1
## [137] httr_1.4.1 survival_3.2-3 GO.db_3.10.0 glue_1.4.1
## [141] zip_2.0.4 png_0.1-7 bit_1.1-15.2 stringi_1.4.6
## [145] blob_1.2.1 latticeExtra_0.6-29 memoise_1.1.0
[1] Shulman, E.D. and Elkon, R. Cell-type-specific analysis of alternative polyadenylation using single-cell transcriptomics data. Nucleic Acids Res 2019;47(19):10027-10039.
[2] movAPA: Modeling and visualization of dynamics of alternative polyadenylation across biological samples (under review).
[3] Lukassen, S., Bosch, E., Ekici, A.B., et al. Single-cell RNA sequencing of adult mouse testes. Scientific Data 2018;5(1):180192.
[4] Patrick, R., Humphreys, D.T., Janbandhu, V., et al. Sierra: discovery of differential transcript usage from polyA-captured single-cell RNA-seq data. Genome Biol 2020;21(1):167.