Contents

1 Introduction

Alternative polyadenylation (APA) has been indicated to play an important role in regulating mRNA stability, translation and localization. Diverse scRNA-seq protocols, such as Drop-seq, CEL-seq, and 10x Genomics, utilizing 3’ selection/enrichment in library construction, provide opportunities to extend bioinformatic analysis for studying APA at single cell resolution. We proposed a tool called scAPAtrap for the identification and quantification of APA sites in each individual cells by leveraging the resolution and huge abundance of scRNA-seq data generated by various 3’ tag-based protocols. scAPAtrap incorporates peak identification and poly(A) read anchoring, which is capable of identifying and pinpointing poly(A) sites even for those with low read coverage. scAPAtrap can also quantify expression levels of all identified APA sites, considering duplicates resulted from both IVT and PCR cycles.

2 scAPAtrap Installation

scAPAtrap can be installed from GitHub.

install.packages("devtools")
devtools::install_github("BMILAB/scAPAtrap", build = TRUE, build_vignettes = TRUE)

library(scAPAtrap)

Please also install the following tools before running scAPAtrap: samtools, subread, umi_tools, star

You can also install the above tools using conda.

conda install samtools -c bioconda
conda install subread -c bioconda
conda install umi_tools -c bioconda
conda install star -c bioconda

3 Demo data download

Please visit this link to download the demo data used in this vignette. The demo data is a BAM file containing aligned reads of chromosome 1 from dataset GSE104556. Instead, users can also download the entire GSE104556 from GEO to run pipeline, which may take more time.

4 Pipeline

The scAPAtrap toolkit is applicable to majority of poly(A)-captured scRNA-seq protocols. Currently scAPAtrap supports the following library construction technologies: 10x, Drop-seq, InDrops, CEL-seq, CEL-seq2. Due to the different data format of 10x from other protocols, we provide two procedures to show the use of scAPAtrap on 10x and other scRNA-seq protocols (hereinafter called non-10x), respectively.

In this section, we demonstrate how to obtain poly(A) sites from a BAM file of 10x data. For raw fastq file, please refer to the cellranger tool on the official website of 10x. For 3’-enriched scRNA-seq technologies other than 10x, cellranger is not applicable to the processing of raw fastq file, users may use STAR for mapping instead.

First, we need configure the following four paths of relevant tools including samtools, umitools, and featureCounts. Please replace path in the following commands with the path in your own computer.

samtools.path <- 'mypath/bin/samtools'
umitools.path <- 'mypath/bin/umi_tools'
featureCounts.path <- "mypath/bin/featureCounts"
star.path <- 'mypath/bin/STAR'

4.1 Preprocessing

4.1.1 Step 1: findUniqueMap

The BAM file obtained from the 10x tool cellranger normally contains reads mapped at multiple positions, which would affect the detection especially the quantification of poly(A) sites. Therefore, first we obtain uniquely mapped reads from the input BAM file.

The index and sort parameter in the findUniqueMap function indicate whether to sort and build the index of the BAM file, which is equivalent to the samtools sort and samtools index commands. It is recommended to sort and build the index by the findUniqueMap function, because in the following steps, both the BAM file and its corresponding index file are required.

Notably, this step is not required for all BAM files. If the BAM file contains multiple alignment results, such as the BAM file obtained by cellranger, you need to use the findUniqueMap function to perform filtering. Otherwise, you can skip this step.

demo.bam <- 'mypath/demo.bam'
nextinput <- findUniqueMap(samtools.path, input = demo.bam, thread = 24, index = T, sort = T)
nextinput

4.1.2 Step 2: dedupByPos

This step removes duplicates and saves only one read per UMI for each position.

The input of dedupBypos is a BAM file, but the BAM file needs to have a corresponding index. The output is the path where dedupBypos generates the BAM file. The TenX parameter indicates whether the current input BAM file is obtained from the 10x data. For non-10x protocols, such as CEL-seq and Drop-seq, please set TenX to FALSE.

nextinput <- dedupByPos(umitools.path, nextinput, TenX = T)
nextinput

4.1.3 Step 3: separateBamBystrand

The input file of separateBamBystrand is the BAM file generated from the above two steps. The output files are mybam.forward.bam and mybam.reverse.bam, storing reads mapped to the forward or reverse strand of the reference genome, respectively.

nextinput <- separateBamBystrand(samtools.path, nextinput, 24)
nextinput

4.2 findPeaks

After preprocessing, we perform peak calling on mybam.forward.bam and mybam.reverse.bam, respectively. Then we will get a file named peaks.saf in the folder “outputdir” which stores the identified peaks.

First, use the loadBpCoverages function to import the entire BAM file and calculate the coverage for each base of the chromosome.

Then, use findPeaks to identify peaks according to the calculated coverage, where the maxwidth parameter represents the maximum width of the peak. If the identified peak is larger than maxwidth, then the peak will be splitted iteratively until the width of each sub-peak is less than 1000 bp.

Finally, use generateSAF to generate a SAF file that records the information of identified peaks.

chrs <- c(as.character(1:19),'X','Y')
maxwidth <- 1000
readlength <- 98
outputdir <- './result'
fullcovF <- loadBpCoverages(nextinput[1],chrs)
fullcovR <- loadBpCoverages(nextinput[2],chrs)

forwardPeaks <-findPeaks(fullcovF, '+', readlength, maxwidth)
reversePeaks <-findPeaks(fullcovR, '-', readlength, maxwidth)

head(forwardPeaks)
head(reversePeaks)

peaksfile <- generateSAF(forwardPeaks, reversePeaks, outputdir)
peaksfile

4.3 countPeaks

Based on peak coordinates obtained from the peak calling step, next we count the read number for each peak using the BAM file and the peak file. This step may take a few minutes depending on the size of the BAM file. Finally, we will get a file named counts.tsv.gz in “outputdir”.

final.bam <- generateFinalBam(featureCounts.path,samtools.path,demo.bam,peaksfile,24)
final.bam
counts.tsv <- countPeaks(umitools.path,final.bam,outputdir,TenX=T)
counts.tsv

4.4 findTails

The peak calling step identifies potential regions where poly(A) sites are located. We can then arbitrarily set the 3’ end of each peak as the coordinate of the respective poly(A) site, however, this is not the best solution to determine the precise position of poly(A) site. Here we incorporate reads with A/T stretches to pinpoint poly(A) sites, which was implemented as function findTails in the scAPAtrap package. If the BAM file is large, please use function findChrTails instead.

tails <- findTails(bamfile = demo.bam)

4.5 generatescExpMa

scAPAtrap allows detecting poly(A) sites with the peak calling step and/or the poly(A) read anchoring step. The two steps, although independent, can be jointly considered for detecting poly(A) sites with higher confidence and resolution by taking into account information provided by both poly(A) reads and other reads aligned to the same peak zone.

We can combine the results of findTails and findPeaks to get a refined poly(A) site list. We can also output the poly(A) site data as a PACdataset object which can be used as the input file of the movAPA package for downstream analysis.

For quantifying poly(A) sites in each cell, scAPAtrap requires a list of valid cell barcodes for the input BAM file. If the BAM file was obtained using CellRanger, then we can simply use the barcodes.tsv file generated in the ‘filtered_gene_bc_matrices_mex’ folder.

Please make sure the chromosome names in the input files of generatescExpMa be the same.

barcodefile <- './barcode.tsv'
barcode <- read.delim2(barcodefile,header = F)
barcode <- gsub('-[0-9]','',barcode$V1)
expma<- generatescExpMa(countsfile, peaksfile, barcode, tails, min.cells = 2,min.count = 0)

devtools::install_github("BMILAB/movAPA")
coldata <- data.frame(group = colnames(expma)[7:ncol(expma)], row.names = colnames(expma)[7:ncol(expma)])
scPACds <- movAPA::readPACds(pacfile = expma, coldata = coldata)

5 Process raw fastq files from other 3’-enriched scRNA-seq

For 3’-enriched scRNA-seq technologies other than 10x, cellranger is not applicable to the processing of raw fastq file. We need some additional preprocessing and map reads to the genome by read mappers such as STAR.

Please visit this link to download the demo.fastq file

5.1 Step 1: Extract barcodes and UMIs

Run umi_tools to extract barcodes and UMIs, and add them to read names. For detailed usage of umi_tools, please refer to its help document

input.seq <- c('./demo_1.fastq','./demo_2.fastq')
pattern <- 'CCCCCCCCNNNN'
whitelist <- './whitelist.txt'
extractBcAndUb(umitools.path, pattern, input.seq, whitelist)

5.2 Step 2: Read mapping

You can use the STAR tool to build a reference genome index and align reads.

genome.fasta <- 'mypath/Mus_musculus.GRCm38.dna.primary_assembly.fa'
indexdir <- './mm10_index/'
generateRefIndex(star.path, genome.fasta, indexdir,24)
input.seq <- 'demo_2.extracted.fastq'
generateAlignBam(star.path, indexdir, input.seq, './BAM/rep1',12)

6 Session Information———————————–

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936   
## [3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C                              
## [5] LC_TIME=Chinese (Simplified)_China.936    
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] regioneR_1.18.1             derfinder_1.20.0            dplyr_1.0.0                
##  [4] GenomicAlignments_1.22.1    Rsamtools_2.2.3             Biostrings_2.54.0          
##  [7] XVector_0.26.0              SummarizedExperiment_1.16.1 DelayedArray_0.12.3        
## [10] BiocParallel_1.20.1         matrixStats_0.56.0          Biobase_2.46.0             
## [13] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1         IRanges_2.20.2             
## [16] S4Vectors_0.24.4            BiocGenerics_0.32.0         knitr_1.29                 
## [19] BiocStyle_2.14.4           
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.4-1         ellipsis_0.3.1           qvalue_2.18.0            htmlTable_2.0.1         
##  [5] base64enc_0.1-3          rstudioapi_0.11          bit64_0.9-7              AnnotationDbi_1.48.0    
##  [9] codetools_0.2-16         splines_3.6.3            Formula_1.2-3            cluster_2.1.0           
## [13] dbplyr_1.4.4             png_0.1-7                BiocManager_1.30.10      compiler_3.6.3          
## [17] httr_1.4.1               backports_1.1.7          assertthat_0.2.1         Matrix_1.2-18           
## [21] acepack_1.4.1            htmltools_0.5.0          prettyunits_1.1.1        tools_3.6.3             
## [25] gtable_0.3.0             glue_1.4.1               GenomeInfoDbData_1.2.2   reshape2_1.4.4          
## [29] rappdirs_0.3.1           doRNG_1.8.2              Rcpp_1.0.5               bumphunter_1.28.0       
## [33] vctrs_0.3.1              rtracklayer_1.46.0       iterators_1.0.12         xfun_0.15               
## [37] stringr_1.4.0            lifecycle_0.2.0          rngtools_1.5             XML_3.99-0.3            
## [41] zlibbioc_1.32.0          scales_1.1.1             BSgenome_1.54.0          VariantAnnotation_1.32.0
## [45] hms_0.5.3                derfinderHelper_1.20.0   RColorBrewer_1.1-2       yaml_2.2.1              
## [49] curl_4.3                 memoise_1.1.0            gridExtra_2.3            ggplot2_3.3.2           
## [53] biomaRt_2.42.1           rpart_4.1-15             latticeExtra_0.6-29      stringi_1.4.6           
## [57] RSQLite_2.2.0            foreach_1.5.0            checkmate_2.0.0          GenomicFeatures_1.38.2  
## [61] rlang_0.4.6              pkgconfig_2.0.3          GenomicFiles_1.22.0      bitops_1.0-6            
## [65] evaluate_0.14            lattice_0.20-41          purrr_0.3.4              htmlwidgets_1.5.1       
## [69] bit_1.1-15.2             tidyselect_1.1.0         plyr_1.8.6               magrittr_1.5            
## [73] bookdown_0.20            R6_2.4.1                 generics_0.0.2           Hmisc_4.4-0             
## [77] DBI_1.1.0                pillar_1.4.4             foreign_0.8-75           survival_3.2-3          
## [81] RCurl_1.98-1.2           nnet_7.3-14              tibble_3.0.1             crayon_1.3.4            
## [85] BiocFileCache_1.10.2     rmarkdown_2.3            jpeg_0.1-8.1             progress_1.2.2          
## [89] locfit_1.5-9.4           grid_3.6.3               data.table_1.12.8        blob_1.2.1              
## [93] digest_0.6.25            tidyr_1.1.0              openssl_1.4.1            munsell_0.5.0           
## [97] askpass_1.1