PlantAPAdb

About
APA catalogue
APA conservation
APA sequence
APA signal
APA metric
APA switching
APA browser
Bulk download
Pipeline

During gene expression in eukaryotes, almost all newly transcribed pre-mRNAs are subjected to polyadenylation, a process in which the pre-mRNA is cleaved at 3' end and a poly(A) tail is added. Because the poly(A) tail marks the end of a mature mRNA, the choice of alternative poly(A) sites in the same genes, also known as alternative polyadenylation (APA), may generate different mRNAs with distinct sequence contents and functions (Tian and Manley, 2017). APA is increasingly recognized as an important regulator for eukaryotic gene expression. Recent large-scale studies using 3' end sequencing (3' seq) revealed that extensive APA exists in different systems from yeast to humans (Derti, et al., 2012; Li, et al., 2012; Smibert, et al., 2012; Ulitsky, et al., 2012; Hoque, et al., 2013). In plants and algae, current genome-wide data analyses suggest that about 50-70% of the genes show evidence of APA (Shen, et al., 2008; Shen, et al., 2011; Wu, et al., 2011; Sherstnev, et al., 2012; Fu, et al., 2016; Wang, et al., 2017; Chakrabarti, et al., 2018; Zhou, et al., 2019). APA has been demonstrated in the regulation of genes that are involved in flowering time control, amino acid biosynthesis, plant incompatibility, oxidative stress responses, and among others (Sherstnev, et al., 2012; Thomas, et al., 2012; Hong, et al., 2017; Lin, et al., 2017; Fu, et al., 2019; Zhou, et al., 2019).

PlantAPAdb provides a comprehensive and manually curated catalog of APA sites in plants based on a large volume of data from diverse biological samples generated by 3' seq. Currently, PlantAPAdb contains APA sites in seven plant organisms, including Oryza sativa L. (japonica and indica), Arabidopsis thaliana, Chlamydomonas reinhardtii, Medicago truncatula, Trifolium pratense, Populus trichocarpa and Phyllostachys edulis.
PlantAPAdb catalogues APA sites in different tissues in various physiological and pathological conditions, which were seuqenced from diverse 3' seq protocols and biological samples. A one-click search is integrated in PlantAPAdb, which enables the query of the whole database without limiting any search field. We designed a uniform and flexible processing pipeline for identifying high-confidence poly(A) sites from 3' seq, which allows the incorporation of all publically available or emerging sequencing data sets. We also developed a standard procedure for parsing genome annotation file obtained from a unified portal (Ensembl Plants), facilitating the annotation of poly(A) sites in a highly flexible way. PlantAPAdb provides rich information of the whole genome poly(A) sites, including genomic locations, heterogeneous cleavage sites, expression levels, related poly(A) signals, sample information, conservation information, etc. APA sites can be visualized in their genomic context via the Jbrowse genome browser. PlantAPAdb also provides full lists of poly(A) signals for poly(A) sites in different genomic regions and users are also free to download sequences surrounding poly(A) sites of selected groups. Moreover, PlantAPAdb contains comprehensive lists of APA sites and genes involving in 3' UTR shortening/lengthening between two biological samples with different methods and filtration conditions. PlantAPAdb also provides quantification of APA sites using several metrics, such as tissue specificity, poly(A) site usage, relative usage of distal poly(A) site (RUD), average 3' UTR length. More importantly, additional information about conservation of poly(A) sites across plants is also available in PlantAPAdb, providing insights into the polyadenylation configuration between species. Data from pooled samples and individual samples in PlantAPAdb are allowed for bulk download as flat files. Pipeline and scripts used for APA analyses in PlantAPAdb can be downloaded here. As a user-friendly database, PlantAPAdb is a large and extendable resource for improving genome annotation and elucidating APA mechanism, APA conservation and APA-mediated gene expression regulation.

To ensure the smooth access of PlantAPAdb, we have built two mirror websites:
1) http://www.bmibig.cn/plantAPAdb (stable)
2) http://bmi.xmu.edu.cn/plantAPAdb (fast)

If you are using PlantAPAdb, please cite:
Zhu S^#, Ye W^#, Ye L^*, Fu H, Ye C, Xiao X, Ji Y, Lin W, Ji G^*, Wu X^*: PlantAPAdb: A Comprehensive Database for Alternative Polyadenylation Sites in Plants. Plant Physiol 2020, 182(1):228-242.

If you have any questions or comments, please email to xhuister@xmu.edu.cn (Dr. Xiaohui Wu).

Reference

Shen, Y., et al. (2008) Genome level analysis of rice mRNA 3'-end processing signals and alternative polyadenylation, Nucleic Acids Res., 36, 3150-3161.
Fu, Y., et al. (2011) Differential genome-wide profiling of tandem 3' UTRs among human breast cancer and normal cells by high-throughput sequencing, Genome Res., 21, 741-747.
Shen, Y., et al. (2011) Transcriptome dynamics through alternative polyadenylation in developmental and environmental responses in plants revealed by deep sequencing, Genome Res., 21, 1478-1486.
Wu, X., et al. (2011) Genome-wide landscape of polyadenylation in Arabidopsis provides evidence for extensive alternative polyadenylation, Proc. Natl. Acad. Sci. USA, 108, 12533-12538.
Derti, A., et al. (2012) A quantitative atlas of polyadenylation in five mammals, Genome Res., 22, 1173-1183.
Li, Y., et al. (2012) Dynamic landscape of tandem 3 ' UTRs during zebrafish development, Genome Res., 22, 1899-1906.
Sherstnev, A., et al. (2012) Direct sequencing of Arabidopsis thaliana RNA reveals patterns of cleavage and polyadenylation, Nat. Struct. Mol. Biol., 19, 845-852.
Smibert, P., et al. (2012) Global Patterns of Tissue-Specific Alternative Polyadenylation in Drosophila, Cell Reports, 1, 277-289.
Thomas, P.E., et al. (2012) Genome-Wide Control of Polyadenylation Site Choice by CPSF30 in Arabidopsis, Plant Cell, 24, 4376-4388.
Ulitsky, I., et al. (2012) Extensive alternative polyadenylation during zebrafish development, Genome Res., 22, 2054-2066.
Hoque, M., et al. (2013) Analysis of alternative cleavage and polyadenylation by 3 ' region extraction and deep sequencing, Nat. Methods, 10, 133-139.
Fu, H., et al. (2016) Genome-wide dynamics of alternative polyadenylation in rice, Genome Res., 26, 1753-1760.
Hong, L., et al. (2017) Alternative polyadenylation is involved in auxin-based plant growth and development, The Plant Journal, n/a-n/a.
Lin, J., et al. (2017) Role of Cleavage and Polyadenylation Specificity Factor 100: Anchoring Poly(A) Sites and Modulating Transcription Termination, The Plant Journal.
Tian, B. and Manley, J.L. (2017) Alternative polyadenylation of mRNA precursors, Nature reviews. Molecular cell biology, 18, 18-30.
Wang, T., et al. (2017) Comprehensive profiling of rhizome-associated alternative splicing and alternative polyadenylation in moso bamboo (Phyllostachys edulis), Plant J, 91, 684-699.
Arefeen, A., et al. (2018) TAPAS: Tool for Alternative Polyadenylation Site Analysis, Bioinformatics, bty110-bty110.
Chakrabarti, M., Dinkins, R.D. and Hunt, A.G. (2018) Genome-wide atlas of alternative polyadenylation in the forage legume red clover, Scientific Reports, 8, 14.
Ye, C., et al. (2018) APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data, Bioinformatics, 34, 1841-1849.
Fu, H., et al. (2019) Distinct genome-wide alternative polyadenylation during the response to silicon availability in the marine diatom Thalassiosira pseudonana, The Plant Journal, 99, 67-80.
Zhou, Q., et al. (2019) Differential alternative polyadenylation contributes to the developmental divergence between two rice subspecies Japonica and Indica, The Plant Journal, 98, 260-276.

PlantAPAdb catalogues all PAC (poly(A) site cluster) datasets, which are well organized by different categories such as sequencing protocols, biological samples, relevant authors and studies, and published year. Each dataset is displayed by a dataset card. By clicking the card of a dataset, users can view detailed statistical information of the dataset, such as the relevant study and literature, mapping information of the raw data, statistics of PAC and read distribution in different genomic regions. Lists of all potential PACs and high confidence PACs in simple bed format or with full genome annotation are provided for download. The list of heterogeneous cleavage sites before grouping into PACs is also provided for download, which is especially useful for analyzing the micro-heterogeneity phenomenon of polyadenylation or testing the impact of clustering procedure.

Each catalogue (dataset card) contains PACs from one biological sample (e.g., one tissue with multiple replicates). Raw RNA-seq data from this sample were pre-processed using the standard pipeline. It should be noted that the coordinates of PACs in this single sample are not necessarily all in (or the same as) the coordinates of all PACs from the whole species, because PACs in a single sample are not extracted from the pooled sample.

3' seq protocols

Protocols	Description	Studies	Cite
PAT-seq	A method to study the integration of 3' UTR dynamics with gene expression in the eukaryotic transcriptome	Buffer Control (SRP089899)Seedling Control (SRP089899)Cordycepin 2h (SRP089899)FLAG-RPL18 Control (SRP089899)FLAG-RPL18 Hypoxia 2h (SRP089899)Hypoxia 2h (SRP089899)Flower (SRP070055)Root (SRP070055)Auxin (SRP092240)Mock Control (SRP092240)Amp311 (SRP093950)Esp5 (SRP093950)CPSF30 (SRP050424)CPSF30m (SRP050424)Root Oxt6 Mutant (SRP050424)Root Wild Type (SRP050424)Leaf Oxt6 Mutant (SRP009685)Leaf Wild Type (SRP009685)AHBD2 (SRP005137)Fip1 Mutant (SRP187778)Salt-treated Fip1 Mutant (SRP187778)Wild Type Control (SRP187778)Salt-treated Wild Type (SRP187778)High Salt Acetate (SRP068667)High Salt (SRP068667)Tris Phosphate Acetate (SRP068667)Tris Phosphate (SRP068667)AHBG3 Test1 (SRP041219)AHBG3 Test2 (SRP041219)20 Days Leaf (SRP136202)5 Days Root (SRP136202)60 Days Leaf (SRP136202)60 Days Root (SRP136202)60 Days Stem (SRP136202)Anther (SRP136202)Dry Seed (SRP136202)Embryo (SRP136202)Endosperm (SRP136202)Husk (SRP136202)Imbibed Seed (SRP136202)Mature Pollen (SRP136202)Pistil (SRP136202)Seeding Shoot (SRP136202)20 Days Leaf (SRP073467)5 Days Root (SRP073467)60 Days Leaf (SRP073467)60 Days Root (SRP073467)60 Days Stem (SRP073467)Anther (SRP073467)Dry Seed (SRP073467)Embryo (SRP073467)Endosperm (SRP073467)Husk (SRP073467)Imbibed Seed (SRP073467)Mature Pollen (SRP073467)Pistil (SRP073467)Seedling Shoot (SRP073467)Flower (SRP186946)Leaf (SRP186946)Root (SRP186946)	Harrison P F, Powell D R, Clancy J L, et al. PAT-seq: a method to study the integration of 3' UTR dynamics with gene expression in the eukaryotic transcriptome[J]. Rna, 2015, 21(8): 1502-1510.
PAS-seq	A deep sequencing-based method called Poly(A) Site Sequencing (PAS-Seq) for quantitatively profiling RNA polyadenylation at the transcriptome level	Hlp1 Mutant (SRP013996)Seedling Wild Type Control (SRP013996)Latera Bud (SRP093919)New Shoot Tip (SRP093919)Rhizome Tip (SRP093919)	Shepard P J, Choi E A, Lu J, et al. Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq[J]. Rna, 2011, 17(4): 761-772.
DRS	Direct RNA sequencing	DRS fpa (ERP003245)Seeding Wild Type DRS (ERP003245)Seeding Wild Type DRS1 (ERP001018)Seeding Wild Type DRS2 (ERP001018)	Ozsolak F, Platt A R, Jones D R, et al. Direct RNA sequencing[J]. Nature, 2009, 461(7265): 814.
EXPRSS	An Illumina based high-throughput expression-profiling method to reveal transcriptional dynamics	Leaf slh1 00hr (SRP033221)Leaf slh1 06hr (SRP033221)Leaf slh1 09hr (SRP033221)Leaf slh1 24hr (SRP033221)Leaf Wild Type 00hr (SRP033221)Leaf Wild Type 06hr (SRP033221)Leaf Wild Type 09hr (SRP033221)Leaf Wild Type 24hr (SRP033221)	Rallapalli G, Kemen E M, Robert-Seilaniantz A, et al. EXPRSS: an Illumina based high-throughput expression-profiling method to reveal transcriptional dynamics[J]. BMC genomics, 2014, 15(1): 341.

Studies

Studies	Species	Cite
Wu X, 2011	Arabidopsis thaliana	Wu X, Liu M, Downie B, et al. Genome-wide landscape of polyadenylation in Arabidopsis provides evidence for extensive alternative polyadenylation[J]. Proceedings of the National Academy of Sciences, 2011, 108(30): 12533-12538.
Sherstnev A, 2012	Arabidopsis thaliana	Sherstnev A, Duc C, Cole C, et al. Direct sequencing of Arabidopsis thaliana RNA reveals patterns of cleavage and polyadenylation[J]. Nature structural & molecular biology, 2012, 19(8): 845.
Thomas P E, 2012	Arabidopsis thaliana	Thomas P E, Wu X, Liu M, et al. Genome-wide control of polyadenylation site choice by CPSF30 in Arabidopsis[J]. The Plant Cell, 2012, 24(11): 4376-4388.
Duc C, 2013	Arabidopsis thaliana	Duc C, Sherstnev A, Cole C, et al. Transcription termination and chimeric RNA formation controlled by Arabidopsis thaliana FPA[J]. PLoS genetics, 2013, 9(10): e1003867.
Liu M, 2014	Arabidopsis thaliana	Liu M, Xu R, Merrill C, et al. Integration of developmental and environmental signals via a polyadenylation factor in Arabidopsis[J]. PloS one, 2014, 9(12): e115779.
Rallapalli G, 2014	Arabidopsis thaliana	Rallapalli G, Kemen E M, Robert-Seilaniantz A, et al. EXPRSS: an Illumina based high-throughput expression-profiling method to reveal transcriptional dynamics[J]. BMC genomics, 2014, 15(1): 341.
Wu X, 2014	Medicago truncatula	Wu X, Gaffney B, Hunt A G, et al. Genome-wide determination of poly (A) sites in Medicago truncatula: evolutionary conservation of alternative poly (A) site choice[J]. BMC genomics, 2014, 15(1): 615.
Zhang Y, 2015	Arabidopsis thaliana	Zhang Y, Gu L, Hou Y, et al. Integrative genome-wide analysis reveals HLP1, a novel RNA-binding protein, regulates plant flowering by targeting alternative polyadenylation[J]. Cell research, 2015, 25(7): 864.
Bell S A, 2016	Chlamydomonas reinhardtii	Bell S A, Shen C, Brown A, et al. Experimental genome-wide determination of RNA polyadenylation in Chlamydomonas reinhardtii[J]. PloS one, 2016, 11(1): e0146107.
Fu H, 2016	Oryza sativa Japonica Group	Fu H, Yang D, Su W, et al. Genome-wide dynamics of alternative polyadenylation in rice[J]. Genome research, 2016, 26(12): 1753-1760.
Guo C, 2016	Arabidopsis thaliana	Guo C, Spinelli M, Liu M, et al. A genome-wide study of “non-3UTR” polyadenylation sites in Arabidopsis thaliana[J]. Scientific reports, 2016, 6: 28060.
de Lorenzo L, 2017	Arabidopsis thaliana	de Lorenzo L, Sorenson R, Bailey-Serres J, et al. Noncanonical alternative polyadenylation contributes to gene regulation in response to hypoxia[J]. The Plant Cell, 2017, 29(6): 1262-1277.
Hong L, 2017	Arabidopsis thaliana	Hong L, Ye C, Lin J, et al. Alternative polyadenylation is involved in auxin‐based plant growth and development[J]. The Plant Journal, 2018, 93(2): 246-258.
Lin J, 2017	Arabidopsis thaliana	Lin J, Xu R, Wu X, et al. Role of cleavage and polyadenylation specificity factor 100: anchoring poly (A) sites and modulating transcription termination[J]. The Plant Journal, 2017, 91(5): 829-839.
Chakrabarti M, 2018	Trifolium pratense	Chakrabarti M, Dinkins R D, Hunt A G. Genome-wide atlas of alternative polyadenylation in the forage legume red clover[J]. Scientific reports, 2018, 8(1): 11379.
Wang T, 2018	Phyllostachys edulis	Wang T, Wang H, Cai D, Gao Y, Zhang H, Wang Y, et al. Comprehensive profiling of rhizome-associated alternative splicing and alternative polyadenylation in moso bamboo (Phyllostachys edulis)[J]. Plant J, 2017, 91(4), 684-699. doi: 10.1111/tpj.13597.
Telléz‐Robledo B, 2019	Arabidopsis thaliana	Tellez-Robledo, B., et al. The polyadenylation factor FIP1 is important for plant development and root responses to abiotic stresses [J]. The Plant Journal, 2019, 99(6):1203-1219.
Zhou Q, 2019	Oryza sativa Indica Group	Zhou Q, Fu H, Yang D, et al. Differential alternative polyadenylation contributes to the developmental divergence between two rice subspecies, japonica and indica[J]. The Plant Journal, 2019, 98(2): 260-276.

Sample list Download

Organism	SRP group label	Environment condition	Plant type	Procotol	Tissue	Reference
arabidopsis_thaliana	Buffer Control (SRP089899)	Control	Wild type	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	Seedling Control (SRP089899)	Control	Wild type	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	Cordycepin 2h (SRP089899)	Cordycepin	Wild type	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	FLAG-RPL18 Control (SRP089899)	Control	FLAG-RPL18	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	FLAG-RPL18 Hypoxia 2h (SRP089899)	Hypoxia	FLAG-RPL18	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	Hypoxia 2h (SRP089899)	Hypoxia	Wild type	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	DRS fpa (ERP003245)	NA	fpa	DRS	Seed	Duc et al. PLoS genetics, 2013.
arabidopsis_thaliana	Seeding Wild Type DRS (ERP003245)	NA	Wild type	DRS	Seed	Duc et al. PLoS genetics, 2013.
arabidopsis_thaliana	Flower (SRP070055)	NA	Wild type	PAT-seq	Flower	Guo et al. Scientific reports, 2016.
arabidopsis_thaliana	Root (SRP070055)	NA	Wild type	PAT-seq	Root	Guo et al. Scientific reports, 2016.
arabidopsis_thaliana	Auxin (SRP092240)	Auxin	Wild type	PAT-seq	Seed	Hong et al. The Plant Journal, 2018.
arabidopsis_thaliana	Mock Control (SRP092240)	Control	Wild type	PAT-seq	Seed	Hong et al. The Plant Journal, 2018.
arabidopsis_thaliana	Amp311 (SRP093950)	NA	Amp311	PAT-seq	Leaf	Lin et al. The Plant Journal, 2017.
arabidopsis_thaliana	Esp5 (SRP093950)	NA	Esp5	PAT-seq	Leaf	Lin et al. The Plant Journal, 2017.
arabidopsis_thaliana	CPSF30 (SRP050424)	NA	CPSF30	PAT-seq	Root	Liu et al. PloS one, 2014.
arabidopsis_thaliana	CPSF30m (SRP050424)	NA	CPSF30m	PAT-seq	Root	Liu et al. PloS one, 2014.
arabidopsis_thaliana	Root Oxt6 Mutant (SRP050424)	NA	Oxt6	PAT-seq	Root	Liu et al. PloS one, 2014.
arabidopsis_thaliana	Root Wild Type (SRP050424)	Control	Wild type	PAT-seq	Root	Liu et al. PloS one, 2014.
arabidopsis_thaliana	Leaf slh1 00hr (SRP033221)	NA	slh1	EXPRSS	Leaf	Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana	Leaf slh1 06hr (SRP033221)	NA	slh1	EXPRSS	Leaf	Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana	Leaf slh1 09hr (SRP033221)	NA	slh1	EXPRSS	Leaf	Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana	Leaf slh1 24hr (SRP033221)	NA	slh1	EXPRSS	Leaf	Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana	Leaf Wild Type 00hr (SRP033221)	Control	Wild type	EXPRSS	Leaf	Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana	Leaf Wild Type 06hr (SRP033221)	Control	Wild type	EXPRSS	Leaf	Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana	Leaf Wild Type 09hr (SRP033221)	Control	Wild type	EXPRSS	Leaf	Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana	Leaf Wild Type 24hr (SRP033221)	Control	Wild type	EXPRSS	Leaf	Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana	Seeding Wild Type DRS1 (ERP001018)	NA	Wild type	DRS	Seed	Sherstnev et al. Nature structural & molecular biology, 2012.
arabidopsis_thaliana	Seeding Wild Type DRS2 (ERP001018)	NA	Wild type	DRS	Seed	Sherstnev et al. Nature structural & molecular biology, 2012.
arabidopsis_thaliana	Leaf Oxt6 Mutant (SRP009685)	NA	Oxt6	PAT-seq	Leaf	Thomas et al. The Plant Cell, 2012.
arabidopsis_thaliana	Leaf Wild Type (SRP009685)	Control	Wild type	PAT-seq	Leaf	Thomas et al. The Plant Cell, 2012.
arabidopsis_thaliana	AHBD2 (SRP005137)	NA	Wild type	PAT-seq	Seed	Wu X et al. PNAS, 2011.
arabidopsis_thaliana	Hlp1 Mutant (SRP013996)	NA	Hlp1	PAS-seq	Seedling	Zhang et al. Cell research, 2015.
arabidopsis_thaliana	Seedling Wild Type Control (SRP013996)	Control	Wild type	PAS-seq	Seedling	Zhang et al. Cell research, 2015.
arabidopsis_thaliana	Fip1 Mutant (SRP187778)	NA	Fip1	PAT-seq	Seed	Tellez-Robledo et al. The Plant Journal, 2019.
arabidopsis_thaliana	Salt-treated Fip1 Mutant (SRP187778)	Salt-treated	Fip1	PAT-seq	Seed	Tellez-Robledo et al. The Plant Journal, 2019.
arabidopsis_thaliana	Wild Type Control (SRP187778)	Control	Wild type	PAT-seq	Seed	Tellez-Robledo et al. The Plant Journal, 2019.
arabidopsis_thaliana	Salt-treated Wild Type (SRP187778)	Salt-treated	Wild type	PAT-seq	Seed	Tellez-Robledo et al. The Plant Journal, 2019.
bamboo	Latera Bud (SRP093919)	NA	Wild type	PAS-seq	Root	Wang et al. Plant J, 2017.
bamboo	New Shoot Tip (SRP093919)	NA	Wild type	PAS-seq	Root	Wang et al. Plant J, 2017.
bamboo	Rhizome Tip (SRP093919)	NA	Wild type	PAS-seq	Root	Wang et al. Plant J, 2017.
chlamydomonas_reinhardtii	High Salt Acetate (SRP068667)	High Salt Acetate	Wild type	PAT-seq	Media_grown	Bell et al. PloS one, 2016.
chlamydomonas_reinhardtii	High Salt (SRP068667)	High Salt	Wild type	PAT-seq	Media_grown	Bell et al. PloS one, 2016.
chlamydomonas_reinhardtii	Tris Phosphate Acetate (SRP068667)	Tris Phosphate Acetate	Wild type	PAT-seq	Media_grown	Bell et al. PloS one, 2016.
chlamydomonas_reinhardtii	Tris Phosphate (SRP068667)	Tris Phosphate	Wild type	PAT-seq	Media_grown	Bell et al. PloS one, 2016.
medicago_truncatula	AHBG3 Test1 (SRP041219)	NA	Wild type	PAT-seq	Mix	Wu X et al. BMC genomics, 2014.
medicago_truncatula	AHBG3 Test2 (SRP041219)	NA	Wild type	PAT-seq	Mix	Wu X et al. BMC genomics, 2014.
oryza_sativa_indica_group	20 Days Leaf (SRP136202)	NA	Wild type	PAT-seq	Leaf	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	5 Days Root (SRP136202)	NA	Wild type	PAT-seq	Root	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	60 Days Leaf (SRP136202)	NA	Wild type	PAT-seq	Leaf	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	60 Days Root (SRP136202)	NA	Wild type	PAT-seq	Root	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	60 Days Stem (SRP136202)	NA	Wild type	PAT-seq	Stem	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Anther (SRP136202)	NA	Wild type	PAT-seq	Anther	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Dry Seed (SRP136202)	NA	Wild type	PAT-seq	Seed	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Embryo (SRP136202)	NA	Wild type	PAT-seq	Embryo	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Endosperm (SRP136202)	NA	Wild type	PAT-seq	Endosperm	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Husk (SRP136202)	NA	Wild type	PAT-seq	Husk	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Imbibed Seed (SRP136202)	NA	Wild type	PAT-seq	Seed	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Mature Pollen (SRP136202)	NA	Wild type	PAT-seq	Pollen	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Pistil (SRP136202)	NA	Wild type	PAT-seq	Pistil	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Seeding Shoot (SRP136202)	NA	Wild type	PAT-seq	Seedling	Zhou et al. The Plant Journal, 2019.
oryza_sativa_japonica_group	20 Days Leaf (SRP073467)	NA	Wild type	PAT-seq	Leaf	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	5 Days Root (SRP073467)	NA	Wild type	PAT-seq	Root	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	60 Days Leaf (SRP073467)	NA	Wild type	PAT-seq	Leaf	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	60 Days Root (SRP073467)	NA	Wild type	PAT-seq	Root	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	60 Days Stem (SRP073467)	NA	Wild type	PAT-seq	Stem	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Anther (SRP073467)	NA	Wild type	PAT-seq	Anther	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Dry Seed (SRP073467)	NA	Wild type	PAT-seq	Seed	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Embryo (SRP073467)	NA	Wild type	PAT-seq	Embryo	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Endosperm (SRP073467)	NA	Wild type	PAT-seq	Endosperm	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Husk (SRP073467)	NA	Wild type	PAT-seq	Husk	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Imbibed Seed (SRP073467)	NA	Wild type	PAT-seq	Seed	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Mature Pollen (SRP073467)	NA	Wild type	PAT-seq	Pollen	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Pistil (SRP073467)	NA	Wild type	PAT-seq	Pistil	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Seedling Shoot (SRP073467)	NA	Wild type	PAT-seq	Seedling	Fu et al. Genome research, 2016.
trifolium_pratense	Flower (SRP186946)	NA	Wild type	PAT-seq	Flower	Chakrabarti et al. Scientific reports, 2018.
trifolium_pratense	Leaf (SRP186946)	NA	Wild type	PAT-seq	Leaf	Chakrabarti et al. Scientific reports, 2018.
trifolium_pratense	Root (SRP186946)	NA	Wild type	PAT-seq	Root	Chakrabarti et al. Scientific reports, 2018.

We used the Arabidopsis genome as the reference and obtained pair-wise genome alignment chain files from the Plant Ensembl to obtain synthetic regions between other genomes and the reference genome. Then coordinates of PACs of all other species were converted to coordinates of the reference genome. We adopted the reciprocal best match method (Wang, et al., 2018) to determine conserved PACs. Briefly, two PACs from two species was considered as orthologous if their distance was smaller than 25 nt based on the whole genome alignment. For each PAC in each species, the information of conservation is recorded, including the species where the conserved sites are found and the corresponding PACs.

Figure 1. The reciprocal best match method used to determine conserved PACs.

Description of the APA conservation table

id	poly(A) site id
chr	Chromosome of the PAC
start	Start coordinate of the PAC.
end	End coordinate of the PAC.
Strand	Strand
coord	Coordinate of the PAC, which is the coordinate of the most dominate cleavage site in a PAC
ftr	Genomic region the PAC located, e.g., 3UTR, 5UTR, exon, intron, intergenic
gene_id	Respective gene ID
arabidopsis_thaliana	Conserved PAC ID in the respective species
oryza_indica_group	Conserved PAC ID in the respective species
oryza_japonica_group	Conserved PAC ID in the respective species
medicago_truncatula	Conserved PAC ID in the respective species
trifolium_pratense	Conserved PAC ID in the respective species
chlamydomonas_reinhardtii	Conserved PAC ID in the respective species

Reference

Wang, R., et al. (2018) PolyA_DB 3 catalogs cleavage and polyadenylation sites identified by deep sequencing in multiple genomes, Nucleic Acids Res, 46, D315-d319.

Upstream and downstream sequences around poly(A) sites located in different genomic regions (3' UTR, 5' UTR, CDS, intron, intergenic) can be downloaded.
Each sequence is of length 400 nt, with the upstream 300 nt and downstream 100 nt sequence of the respective poly(A) site cluster (PAC). The poly(A) site is at the 301^st position. You can browse all poly(A) site categories from APA CATALOGUE, or download poly(A) site atlas from BULK DOWNLOAD.

Poly(A) sites of each species were classified into four groups based on their genomic locations (3' UTR, CDS, intron, and 5' UTR) and then 50 top-ranked signal patterns were obtained by RSAT for each signal element (e.g., NUE, FUE, CE, DE) in each group of poly(A) sites.

Based on the single nucleotide profiles surrounding poly(A) sites in different species, we divided the eight species into two groups, one is Chlamydomonas, the other contains the rest species. Poly(A) signals of 3' UTR poly(A) sites have been investigated (Xing and Li, 2011).

Figure 2. Polyadenylation signals of nuclear mRNA in plants and representative algae. The algal signal information was mostly from three species with sequenced genomes: the green alga Chlamydomonas reinhardtii and two diatoms, Thalassiosira pseudonana and Phaeodactylum tricornutum. The specific signals of C. reinhardtii are denoted with *. The signal strength information is for both plants and algae based on classical genetics (plants) and bioinformatics (algae) analyses. The percentage data were estimated from the results of bioinformatics analysis. The question mark after FUE implies that some of the species might not have this signal. CDS, protein coding sequence; UTR, untranslated region; FUE, far upstream element; NUE, near upstream element; CE, cleavage element; PAS, poly(A) site; YA, predominant dinucleotide located at the poly(A) or cleavage site where Y = U or C. The 'A' is the last nucleotide before poly(A) tail; > or >= indicates that one nucleotide appears more than another. Not drawn to scale. (This figure is obtained from (Xing and Li, 2011)).

In Chlamydomonas, four signal elements were reported previously, including FUE (-150 to -25, Pentamers), NUE (-25 to -5, Pentamers), CE (-5 to +5, Heptamers), and DE (+5 to +30, Hexamers) (Shen, et al., 2008). In the other six species with similar single nucleotide profiles surrounding poly(A) sites, we used the signal model of Arabidopsis (Loke, et al., 2005), which contains FUE (-200 to -35, Hexamers), NUE (-35 to -10, Hexamers) and CE (-10 to +15, Hexamers). The choice of signal region range and the respective length of signal patterns are based on the observation in previous studies (Loke, et al., 2005; Shen, et al., 2008).

Figure 3. Polyadenylation signal regions defined in PlantAPAdb for Chlamydomonas and other plants.

To identify statistically significant signal patterns and sequence logos in a given poly(A) signal region, we applied an oligo analyzer called regulatory sequence analysis tools, or RSAT (Thomas-Chollier, et al., 2008). We classified poly(A) sites of each species into four groups based on their genomic locations (3' UTR, CDS, intron, and 5' UTR) and then obtained 50 top-ranked signal patterns by RSAT for each signal element in each group of poly(A) sites.

Description of poly(A) signals identified by RSAT

rank	Order of the motif
seq	oligomer sequence
id	oligomer identifier
exp_freq	expected relative frequency
occ	observed occurrences
exp_occ	expected occurrences
occ_P	occurrence probability (binomial)
occ_E	E-value for occurrences (binomial)
occ_sig	occurrence significance (binomial)
ovl_occ	number of overlapping occurrences (discarded from the count)
forb_occ	forbidden positions (to avoid self-overlap)

Reference

Loke, J.C., et al. (2005) Compilation of mRNA polyadenylation signals in Arabidopsis revealed a new signal element and potential secondary structures, Plant Physiol, 138, 1457-1468.
Shen, Y., et al. (2008) Unique features of nuclear mRNA poly(A) signals and alternative polyadenylation in Chlamydomonas reinhardtii, Genetics, 179, 167-176.
Thomas-Chollier, M., et al. (2008) RSAT: regulatory sequence analysis tools, Nucleic Acids Res., 36, W119-W127.
Xing, D. and Li, Q.Q. (2011) Alternative polyadenylation and gene expression regulation in plants, Wiley Interdisciplinary Reviews: RNA, 2, 445-458.

We adopted four PAC-level metrics and two gene-level metrics to quantify the dynamics of APA across samples.

The PAC-level metrics include the following four metrics.
NSE (Number of Samples Expressed): the NSE of a PAC is calculated as the number of samples in which the PAC is expressed.
PSE (Percentage of Samples Expressed): the PSE of a PAC is calculated as the ratio of NSE to the total number of samples.
Ratio: the relative usage of a PAC in a gene, which is calculated as the ratio of the expression level of a PAC to the total expression level of the respective gene.
Sample Specificity (Ni et al., 2013; Weng et al., 2016; Hu et al., 2017; Ji et al., 2018): Shannon entropy score was calculated for each PAC to quantify its overall sample specificity:

where n is the number of samples and p_s is ratio of the expression level of the PAC in sample s to the total expression level of this PAC in all samples. Then the specificity of a PAC for sample s can be calculated as: Q = H - log₂(P_s). A lower H or Q score means higher sample specificity.

The gene-level metrics include the following two metrics.
RUD (Relative Usage of Distal PAC) (Ji et al., 2009; Ji & Tian, 2009): the RUD of a gene in a sample s is calculated as the ratio of the number of 3' reads of the distal PAC in sample s to the number of total reads of proximal and distal PACs in sample s. Here only genes with at least two 3' UTR PACs were used. Proximal and distal PACs are defined as the two most abundant 3' UTR PACs or the two most distant 3' UTR PACs. The RUD score represents the relative 3' UTR length for a gene in a sample, with higher RUD indicating longer 3' UTR.
WUL (Weighted 3' UTR Length) (Ulitsky et al., 2012){Fu, 2016 #3619}: the WUL of a gene in a sample s is calculated as the average 3' UTR length of all 3' UTR PACs in this gene weighted by the number of supported 3' reads of each PAC.

Heatmap Description

The heatmap for PAC metric of tissue specificity shows the distribution of H values of each PAC across samples. Here PACs for the plot are first filtered by the following filtering conditions: total number of reads in all samples >= N1; the percentage of expressed samples >=N2. Then PACs are ranked by their minimum H values and only the top N3 PACs are used for the plot.
Similarly, for the heatmap for PAC metric of ratio, PACs are ranked by the variability (standard deviation) of ratio values across samples.
1) For the heatmap for gene metric of WUL, genes are ranked by the variability (coefficient of variation) of WUL scores across samples.
2) For the heatmap for gene metric of RUD, genes are ranked by the variability (standard deviation) of RUD scores across samples.

Output Table

In the output table of the gene-level metrics, each column is one sample, each row is one gene. The value in the table denotes the RUD or WUL score.

The metrics of NSE and PSE are given in the output table of metric Ratio or Sample Specificity. Description of output table of metric Ratio. Each row is one PAC named as "GeneID:PAC coordinate".

Sample1~N	Ratio of each sample, here replicates from the same sample are averaged first.
gene	Gene ID
chr	Chromosome
strand	Strand
coord	Coordinate of the PAC
ftr	Genomic region the PAC located
total	Total number of reads across all samples of the PAC
NSE	Number of samples expressed, here a replicate is counted as one sample.
PSE	Percentage of samples expressed, here a replicate is counted as one sample.

Description of output table of metric "Sample Specificity". Columns are the same as those of metric Ratio except for the following columns.

H	H score of Shannon entropy, reflecting the overall sample specificity of a PAC. Lower value means higher sample specificity.
Q_min	Minimum Q score across all samples.
Q_min_cond	Sample name(s) with minimum Q score (or the highest sample specificity).
Sample1~N	Q score of each sample, reflecting the sample specificity of a PAC in the respective sample. Lower value means higher sample specificity.

Reference

Ji Z, Lee JY, Pan Z, Jiang B, Tian B (2009) Progressive lengthening of 3' untranslated regions of mRNAs by alternative polyadenylation during mouse embryonic development. Proceedings of the National Academy of Sciences, USA 106, 7028-33.
Ji Z, Tian B (2009) Reprogramming of 3' untranslated regions of mRNAs by alternative polyadenylation in generation of pluripotent stem cells from different cell types. PloS one 4, e8419.
Ulitsky I, Shkumatava A, Jan CH, et al. (2012) Extensive alternative polyadenylation during zebrafish development. Genome Research 22, 2054-66.
Ni T, Yang Y, Hafez D, et al. (2013) Distinct polyadenylation landscapes of diverse human tissues revealed by a modified PA-seq strategy. BMC Genomics 14, 615.
Fu H, Yang D, Su W, et al. (2016) Genome-wide dynamics of alternative polyadenylation in rice. Genome Research 26, 1753-60.
Weng L, Li Y, Xie X, Shi Y (2016) Poly(A) code analyses reveal key determinants for tissue-specific mRNA alternative polyadenylation. RNA 19, 19.
Hu W, Li S, Park JY, et al. (2017) Dynamic landscape of alternative polyadenylation during retinal development. Cell Mol Life Sci 74, 1721-39.
Ji G, Chen M, Ye W, et al. (2018) TSAPA: identification of tissue-specific alternative polyadenylation sites in plants. Bioinformatics 34, 2123-5.

There are two methods for detecting 3' UTR shortening/lengthening events. The "linear trend" method is based on the chi-squared test for trend in proportions (Fu, et al., 2011; Fu, et al., 2016; Ye, et al., 2018; Zhou, et al., 2019). The methods based on DESeq2 extend the differential expression results from DESeq2 to identify genes with 3' UTR shortening/lengthening events (Arefeen, et al., 2018).

Two strategies were adopted for detecting 3' UTR shortening/lengthening events from samples with or without replicates.

Figure 4. Two strategies were adopted for detecting 3' UTR switching.

The first strategy is based on the chi-squared test for trend in proportions, which is applicable for samples without replicates (replicates were averaged first). The linear trend method has the advantage to considers both abundance and 3' UTR length of all 3' UTR PACs in a gene, which was adopted in several previous APA studies (Fu, et al., 2011; Fu, et al., 2016; Ye, et al., 2018; Zhou, et al., 2019). Briefly, PACs in a gene are sorted by the respective 3' UTR length (denoted as score). A contingency table of read count is then created with each row representing the indexes of samples and each column denoting the scores. Next the chi-squared test for trend in proportions is performed (R function prop.trend.test), and the Pearson correlation r is obtained using the read count in the table as the value and the score as the coordinate. The correlation r ranges from -1 to 1, with larger absolute value indicating higher extent of 3' UTR shortening/lengthening. Finally, an adjusted p-value was obtained by the Benjamin method and genes with the adjusted p-value smaller than a given cutoff (e.g., 0.05) are considered as genes with significant 3' UTR shortening/lengthening.

The second strategy is applicable for samples with replicates, which extends the differential expression (DE) results from DESeq2 to identify genes with 3' UTR shortening/lengthening events (Arefeen, et al., 2018). To detect DE PACs, genes with only one PAC were discarded. Then DEXSeq2, initially developed for detecting differential genes between conditions from RNA-seq, was used for DE PAC identification. For each PAC in APA genes, both p-value and adjusted p-value were obtained with DEXSeq2, and PACs with adjusted p-value below a given cutoff were considered as DE PACs. DESeq2 has been adopted in previous studies for detecting differentially expressed poly(A) sites (Lianoglou, et al., 2013; Fu, et al., 2016; Arefeen, et al., 2018; Zhou, et al., 2019). To detect 3' UTR shortening/lengthening events, 3' UTR APA genes with at least one DE PAC were filtered. Then the relative change (RC) for this pair of PACs is calculated. A Fisher's exact test was also performed to test the significance of differential usage of two PACs between two samples, and a p-value was obtained. Genes with |RC| larger than a given threshold (e.g., 1) and a p-value below a given cutoff (e.g., 0.05) were considered as genes with 3' UTR shortening/lengthening events.

To identify statistically significant signal patterns and sequence logos in a given poly(A) signal region, we applied an oligo analyzer called regulatory sequence analysis tools, or RSAT (Thomas-Chollier, et al., 2008). We classified poly(A) sites of each species into four groups based on their genomic locations (3' UTR, CDS, intron, and 5' UTR) and then obtained 50 top-ranked signal patterns by RSAT for each signal element in each group of poly(A) sites.

Note: The 3' UTR switching results using different poly(A) site data (normalized or not), different strategies, or different filtering criteria may vary greatly, we strongly recommend that the 3' UTR switching result provided in PlantAPAdb be used as a guide only.

Sample list Download

Organism	Switch group label	Environment condition	Plant type	Procotol	Tissue	Reference
arabidopsis_thaliana	Buffer Control	Control	Wild type	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	Seedling Control	Control	Wild type	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	Cordycepin 2h	Cordycepin	Wild type	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	FLAG-RPL18 Control	Control	FLAG-RPL18	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	FLAG-RPL18 Hypoxia 2h	Hypoxia	FLAG-RPL18	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	Hypoxia 2h	Hypoxia	Wild type	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	Seeding fpa DRS 2012	NA	fpa	DRS	Seed	Duc et al. PLoS genetics, 2013.
arabidopsis_thaliana	Seeding Wild Type DRS 2012	NA	Wild type	DRS	Seed	Duc et al. PLoS genetics, 2013.
arabidopsis_thaliana	Auxin	Auxin	Wild type	PAT-seq	Seed	Hong et al. The Plant Journal, 2018.
arabidopsis_thaliana	Mock Control	Control	Wild type	PAT-seq	Seed	Hong et al. The Plant Journal, 2018.
arabidopsis_thaliana	Amp311	NA	Amp311	PAT-seq	Leaf	Lin et al. The Plant Journal, 2017.
arabidopsis_thaliana	Esp5	NA	Esp5	PAT-seq	Leaf	Lin et al. The Plant Journal, 2017.
arabidopsis_thaliana	CPSF30	NA	CPSF30	PAT-seq	Root	Liu et al. PloS one, 2014.
arabidopsis_thaliana	CPSF30m	NA	CPSF30m	PAT-seq	Root	Liu et al. PloS one, 2014.
arabidopsis_thaliana	Root Oxt6 Mutant	NA	Oxt6	PAT-seq	Root	Liu et al. PloS one, 2014.
arabidopsis_thaliana	Root Wild Type	Control	Wild type	PAT-seq	Root	Liu et al. PloS one, 2014.
arabidopsis_thaliana	Leaf Oxt6 Mutant	NA	Oxt6	PAT-seq	Leaf	Thomas et al. The Plant Cell, 2012.
arabidopsis_thaliana	Leaf Wild Type	Control	Wild type	PAT-seq	Leaf	Thomas et al. The Plant Cell, 2012.
arabidopsis_thaliana	Hlp1 Mutant	NA	Hlp1	PAS-seq	Seedling	Zhang et al. Cell research, 2015.
arabidopsis_thaliana	Seedling Wild Type Control	Control	Wild type	PAS-seq	Seedling	Zhang et al. Cell research, 2015.
arabidopsis_thaliana	Fip1 Mutant	NA	Fip1	PAT-seq	Seed	Tellez-Robledo et al. The Plant Journal, 2019.
arabidopsis_thaliana	Salt-treated Fip1 Mutant	Salt-treated	Fip1	PAT-seq	Seed	Tellez-Robledo et al. The Plant Journal, 2019.
arabidopsis_thaliana	Wild Type Control	Control	Wild type	PAT-seq	Seed	Tellez-Robledo et al. The Plant Journal, 2019.
arabidopsis_thaliana	Salt-treated Wild Type	Salt-treated	Wild type	PAT-seq	Seed	Tellez-Robledo et al. The Plant Journal, 2019.
bamboo	Latera Bud	NA	Wild type	PAS-seq	Root	Wang et al. Plant J, 2017.
bamboo	New Shoot Tip	NA	Wild type	PAS-seq	Root	Wang et al. Plant J, 2017.
bamboo	Rhizome Tip	NA	Wild type	PAS-seq	Root	Wang et al. Plant J, 2017.
chlamydomonas_reinhardtii	High Salt Acetate	High Salt Acetate	Wild type	PAT-seq	Media_grown	Bell et al. PloS one, 2016.
chlamydomonas_reinhardtii	High Salt	High Salt	Wild type	PAT-seq	Media_grown	Bell et al. PloS one, 2016.
chlamydomonas_reinhardtii	Tris Phosphate Acetate	Tris Phosphate Acetate	Wild type	PAT-seq	Media_grown	Bell et al. PloS one, 2016.
chlamydomonas_reinhardtii	Tris Phosphate	Tris Phosphate	Wild type	PAT-seq	Media_grown	Bell et al. PloS one, 2016.
oryza_sativa_indica_group	20 Days Leaf	NA	Wild type	PAT-seq	Leaf	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	5 Days Root	NA	Wild type	PAT-seq	Root	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	60 Days Leaf	NA	Wild type	PAT-seq	Leaf	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	60 Days Root	NA	Wild type	PAT-seq	Root	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	60 Days Stem	NA	Wild type	PAT-seq	Stem	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Anther	NA	Wild type	PAT-seq	Anther	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Dry Seed	NA	Wild type	PAT-seq	Seed	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Embryo	NA	Wild type	PAT-seq	Embryo	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Endosperm	NA	Wild type	PAT-seq	Endosperm	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Husk	NA	Wild type	PAT-seq	Husk	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Imbibed Seed	NA	Wild type	PAT-seq	Seed	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Mature Pollen	NA	Wild type	PAT-seq	Pollen	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Pistil	NA	Wild type	PAT-seq	Pistil	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Seeding Shoot	NA	Wild type	PAT-seq	Seedling	Zhou et al. The Plant Journal, 2019.
oryza_sativa_japonica_group	20 Days Leaf	NA	Wild type	PAT-seq	Leaf	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	5 Days Root	NA	Wild type	PAT-seq	Root	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	60 Days Leaf	NA	Wild type	PAT-seq	Leaf	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	60 Days Root	NA	Wild type	PAT-seq	Root	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	60 Days Stem	NA	Wild type	PAT-seq	Stem	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Anther	NA	Wild type	PAT-seq	Anther	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Dry Seed	NA	Wild type	PAT-seq	Seed	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Embryo	NA	Wild type	PAT-seq	Embryo	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Endosperm	NA	Wild type	PAT-seq	Endosperm	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Husk	NA	Wild type	PAT-seq	Husk	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Imbibed Seed	NA	Wild type	PAT-seq	Seed	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Mature Pollen	NA	Wild type	PAT-seq	Pollen	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Pistil	NA	Wild type	PAT-seq	Pistil	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Seedling Shoot	NA	Wild type	PAT-seq	Seedling	Fu et al. Genome research, 2016.
trifolium_pratense	Flower	NA	Wild type	PAT-seq	Flower	Chakrabarti et al. Scientific reports, 2018.
trifolium_pratense	Leaf	NA	Wild type	PAT-seq	Leaf	Chakrabarti et al. Scientific reports, 2018.
trifolium_pratense	Root	NA	Wild type	PAT-seq	Root	Chakrabarti et al. Scientific reports, 2018.

Note: Samples without at least two replicates were not used for APA switching analysis.

Parameters

Regulation: shorter/longer means only the switching events that with shorter/longer 3' UTR in group 2 are filtered.
Fisher's exact test p-value: the p-value cutoff to filter significant switching events (for methods of DESeq2).
Log fold change: the cutoff of log fold change of the expression levels of poly(A) sites between the two samples (for methods of DESeq2).
Adjusted p-value: the cutoff of the adjusted p-value of the chi-squared test for trend in proportions (for linear trend method).

Description of the APA switching table

gene	Gene id
nPAC	Number of poly(A) sites in the gene
geneTag1	Number of total reads of the gene in the first sample
geneTag2	Number of total reads of the gene in the second sample
avgUTRlen1	Average 3' UTR length of the gene in the first sample
avgUTRlen2	Average 3' UTR length of the gene in the second sample
fisherPV	Fisher's test p-value (for the method of DESeq2)
logFC	Log fold change of the expression levels of poly(A) sites between the two samples (for methods of DESeq2)
padj	Adjusted p-value (for linear trend method)
cor	Correlation value (for linear trend method)
logRatio	Log ratio of the expression levels of poly(A) sites between the two samples (for linear trend method)
Change	Longer means 3' UTR is longer in the second sample, Shorter means 3' UTR is shorter in the second sample
PAs1	The expression levels of poly(A) sites in the first sample
PAs2	The expression levels of poly(A) sites in the second sample

Reference

Fu, Y., et al. (2011) Differential genome-wide profiling of tandem 3' UTRs among human breast cancer and normal cells by high-throughput sequencing, Genome Res., 21, 741-747.
Lianoglou, S., et al. (2013) Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression, Genes Dev., 27, 2380-2396.
Fu, H., et al. (2016) Genome-wide dynamics of alternative polyadenylation in rice, Genome Res., 26, 1753-1760.
Arefeen, A., et al. (2018) TAPAS: Tool for Alternative Polyadenylation Site Analysis, Bioinformatics, bty110-bty110.
Ye, C., et al. (2018) APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data, Bioinformatics, 34, 1841-1849.
Zhou, Q., et al. (2019) Differential alternative polyadenylation contributes to the developmental divergence between two rice subspecies Japonica and Indica, The Plant Journal, 98, 260-276.

Users can have a quick access to the PAC browser by clicking the "PAC browse" tab in the main menu or the “View” link in a PAC list.

Figure 5. Web page of the browser.

One or more data sets from each plant species can be quickly loaded and graphically browsed online, by selecting the checkboxes of data sets in the ‘Available Tracks' panel. Users can conduct a search with a gene or chromosome fragment to zoom in on particular PAC regions. Data tracks of PACs from different cells, tissues or conditions can be displayed in sync with tracks of PATs, offering a more intuitive way to explore and compare the usage of PACs among different samples. Users can download the data of one or more tracks onto their local computers.

Figure 6. Right-click context menu on a gene model or PAC

In addition to cataloging PAC from individual samples, PlantAPAdb also provides the PAC list of pooled samples for bulk download. For the pooled data, different files are available for download to meet the users' need. The file in bed format records simple information such as chromosome, strand, coordinate, and total number of reads for each PAC. The file in text format tabulates full information of each PAC, including the total read count, the raw and TPM normalized read count in each individual sample and replicate, the respective gene, genomic location (CDS, intron, 3' UTR, 5' UTR, intergenic, etc.), distance to neighbor genes (if the PAC is located in intergenic region). Moreover, files of heterogeneous cleavage sites in bed format were also provided, which allows users to inspect the polyadenylation in higher resolution.

1) All poly(A) site clusters in BED format
This file contains all poly(A) site clusters (PACs) with the information of chromosome, strand, coordinate, and score (raw read count from the pooled samples) for each PAC. Internal priming artifacts were removed. Cleavage sites within 24 nt of each other were grouped into one PAC.

2) All poly(A) site clusters with annotation (raw count)
This file contains full information of all PACs. The expression level (number of supported reads) for each PAC in each experiment is given. The annotation including the gene, gene type, genomic region, number of cleavage sites etc. for each PAC is also given.

ftr	Genomic region of the PAC, including three prime utr (3'UTR), five prime utr (5'UTR), cds, intron, exon, and intergenic.
gene_id	Gene id
biotype	Gene type, such as protein_coding, long non-conding RNA (lncRNA), non-coding RNA (ncRNA), tRNA.
ftr_start	Start coordinate of the genomic region ("ftr" column)
ftr_end	End coordinate of the genomic region ("ftr" column).
upstream_id	5' gene id of a PAC, only valid when the PAC is located in intergenic region.
upstream_start	Start coordinate of the 5' gene, only valid when the PAC is located in intergenic region.
upstream_end	End coordinate of the 5' gene, only valid when the PAC is located in intergenic region.
downstream_id	Same as upstream_id, except for the 3' gene.
downstream_start	Same as upstream_start, except for the 3' gene.
downstream_end	Same as upstream_end, except for the 3' gene.

3) All poly(A) site clusters with annotation (TPM count)
This file contains full information of all PACs and the expression levels were normalized by Tag Per Million.
Given a PAC table pac_sample, the TPM normalization is performed in R as follows:

library_size <- colSums(pac_sample)
pac_sample_nor <- t(t(pac_sample)/library_size*10^6)
pac_sample_nor <- as.data.frame(pac_sample_nor )

4) High confidence poly(A) site clusters in BED format
This file contains only high confidence PACs which are expressed (TPM>=1) in at least two experiments.
5) High confidence poly(A) site clusters with annotation (raw count)
This file contains only high confidence PACs with full annotation.
6) All cleavage sites
This file contains all cleavage sites, which has four columns: chromosome, strand, coordinate, and number of supported reads. This file was used for generating poly(A) site clusters.
7) Full sample list
This file contains the information of the full sample list in PlantAPAdb, including the SRR id, sample name, environmental condition, plant type, tissue, sequencing protocol, reference, read statistics.

Pipeline: Genome-wide identification of polyadenylation sites from 3'-end sequencing data in plants.

1 Overview

Our current PlantAPAdb database for identifying poly(A) sites relies on bioinformatics tools. We developed a framework for genome-wide identification of poly(A) sites from 3'-end sequencing data (3'-seq) in plants. Scripts for APA analyses in PlantAPAdb can be downloaded here.

2 Before You Start

2.1 Software Installation

Tools version:

fastq-dump : 2.9.6
FastQC v0.11.8
multiqc, version 1.7
Trimmomatic-0.38
STAR_2.6.0a
bedtools v2.27.1
bedmap version:  2.4.35 (typical)
samtools 1.8

Install the following tools:
A alignment software, such as STAR (recommend), Bowtie/Bowtie2, and TopHat.

# Get latest STAR source from releases
wget https://github.com/alexdobin/STAR/archive/2.7.1a.tar.gz
tar -xzf 2.7.1a.tar.gz
cd STAR-2.7.1a

# Alternatively, get STAR source using git
git clone https://github.com/alexdobin/STAR.git

Make sure Perl and related modules (details in ourPerl scripts) are installed on your computer

#The Perl version can be checked with the command
perl –version
#install Perl modules on Linux/Unix/Mac OS
#upgrade cpan
perl -MCPAN -e shell
#Install or upgrade Module::Build, and make it your preferred installer
cpan>install Module::Build
cpan>o conf prefer_installer MB
cpan>o conf commit
#install Bio::SeqIO module
cpan>install Bio::SeqIO

Make sure bedtools and bedmap are installed on your computer

#install bedtools on Linux/Unix/Mac OS
apt-get install bedtools
#install bedmap
wget –c https://github.com/bedops/bedops/releases/download/v2.4.36/bedops_linux_x86_64-v2.4.36.tar.bz2
tar jxvf bedops_linux_x86_64-vx.y.z.tar.bz2
cp bin/* /usr/local/bin

[Optional] You can install SRA Toolkit to download raw sequence data, and then use Trimmomatic or FASTX for quality trimming and adapter removal.

#Install SRA Toolkit
wget –c https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.6-1/sratoolkit.2.9.6-1-ubuntu64.tar.gz
tar zxvf sratoolkit.2.9.6-1-ubuntu64.tar.gz
cp sratoolkit/bin/* /usr/local/bin
#Install Trimmomatic
wget –c http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
unzip Trimmomatic-0.39.zip

2.2 Preparation of Sequence Data

Prepare the following sequence data
A reference genome. You can download the reference genome (fasta format) from public databases, such as Ensembl, NCBI and UCSC. In addition, the annotation file (gtf/gff3 format) can also be downloaded. For example, the reference genome of Arabidopsis thaliana (TAIR10) is obtained from Ensembl.

#download reference genome
wget –c ftp://ftp.ensemblgenomes.org/pub/plants/release-43/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
#download annotation gff3 file in gff3 format
wget –c ftp://ftp.ensemblgenomes.org/pub/plants/release-43/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.43.gff3.gz

3'-seq data. We can download 3'-seq data (fastq format) from public sequence repositories. For example, the data (SRR5055885/ SRR5055884/ SRR5055883）of Arabidopsis thaliana is downloaded from the Sequence Read Archive by SRA Toolkit.

#download sra
prefetch SRR5055885 SRR5055884 SRR5055883
#turn convert into fastq format
fastq-dump –O  /output-path/  SRR4243494.sra

3 Analysis Workflow

To illustrate the analysis workflow, the PAT-seq data (SRR5055885/SRR5055884/ SRR5055883) from Arabidopsis thaliana are taken as an example, which include three replicates and the reads are with poly(A) tail.

3.1 Data Processing

MAP_findTailAT.pl is an in-house Perl script for identifying reads with valid poly(A) tail and trimming poly(A) tail.
Usage

perl MAP_findTailAT.pl -h

For example,

perl MAP_findTailAT.pl -in ./SRR5055885.fastq -poly T -ml 25 -mp 6 -mg 5 -mm 2 -mr 2  -mtail 6 -debug F -bar 8 -odir "/output-path/" -suf "SRP093950"
#######################################################################
#-in=input fa or fq file
#-poly=A/T/A&T/A|T. -poly=A if the sequence with As tail, -poly=T if the sequence with Ts tail
#-ml=min length after poly(A)trimming
#-mp=min length of succesive poly (default=8)
#-mg=margin from the start (poly=T) or to the end (poly=A) (=5)
#-mm=mismatch between TNNTTT or ANNAAA (default=2)
#-mr=minT in reg (default=3)
#-mtail=min length of trimmed tail (default=8)
#-bar=the length of barcode if there is barcode
#-odir=output path (default is the same as input)
#-suf=suffix, default: xx.suf.T/A.fq
#-debug=T/F (default=T)

Then we can use Trimmomatic to remove adapter sequences and low quality reads.

java –jar  /path/Trimmomatic-0.38/trimmomatic-0.38.jar SE -threads 16  -phred 33  ./SRR5055885.SRP093950.T.fq  ./clean.SRR5055885.fq /path/ILLUMINACLIP:TruSeq3-SE:2:30:10
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:10 MINLEN:25

3.2 Sequence alignment with STAR

STAR is one of the most popular tools for sequence alignment. First, we need to generate genome indices.

STAR --runMode genomeGenerate \
   --runThreadN 16  \
   --genomeFastaFiles ./Arabidopsis_thaliana.TAIR10.dna.toplevel.fa \
   --sjdbGTFfile ./Arabidopsis_thaliana.TAIR10.42.gff3 \
   --sjdbGTFtagExonParentTranscript Parent \
   --genomeDir    ./index

Then we can map clean PAT-seq data to the reference genome.

STAR --runThreadN 20 \
     --readFilesIn ./clean.SRR5055885.fq
     --genomeDir ./index \
     --outFileNamePrefix   ./leaf-esf-rep1\
     --outMultimapperOrder Random \
     --outFilterMultimapNmax 1

The output file "leaf-esf-rep1Aligned.out.sam" is the alignment result (sam format). For the data SRR5055884 and SRR5055883, we will get the alignment result "leaf-esf-rep2Aligned.out.sam" and "leaf-esf-ref3Aligned.out.sam", respectively.

3.3 Identification of Polyadenylation Sites

PAT2PA2PAC.sh is a shell script that identifies poly(A) sites by using in-house Perl scripts, bedtools and bedmap. This script mainly consists of two steps:
1) Identifying poly(A) sites by filtering out internal-priming reads;
2) grouping poly(A) tags into poly(A) site clusters (PAC).
Usage:

bash  /PAT2PA2PAC.sh  genome   /input-path/ /output-path/   distance  input-file

For example,

bash  /PAT2PA2PAC.sh  Arabidopsis_thaliana.TAIR10.dna.toplevel.fa /input-sam-file-path/  /output-path/  24  "*.out.sam"

bash  /PAT2PA2PAC.sh  Arabidopsis_thaliana.TAIR10.dna.toplevel.fa /input-sam-file-path/  /output-path/  24  "leaf-esf-rep1Aligned.out.sam  leaf-esf-rep2Aligned.out.sam   leaf-esf-rep3Aligned.out.sam "

4 Output

The above workflow will produce multiple output files.
a) all.PAC.header – is the column name of output file "all.PAC.PATcount".
"chr" - is name of the chromosome or scaffold;
"UPA_start" - is the start position of the PAC, with sequence numbering starting at 1.
"UPA_end" - is the end position of the PAC, with sequence numbering starting at 1.
"strand" - is the direction of PAC, defined as + (forward) or – (reverse)
"PAnum" - is the number of poly(A) site in the PAC region.
"tot_tagum" - is the total expression level of all samples in the PAC region (count).
"coord" - represents reference poly(A) sites in the PAC region.
"refPAnum" - is the number of poly(A) site in this "coord"
"sampleName" - is the PAC expression level of each sample (count).

>cat all.PAC.header
chr	UPA_start	UPA_end	strand	PAnum	tot_tagnum	coord	refPAnum leaf-esf-rep1Aligned.out.sam  leaf-esf-rep2Aligned.out.sam   leaf-esf-rep3Aligned.out.sam

b) all.PAC.PATcount – is the PAC result, and the file "all.PAC.header" provides the corresponding column names

>head all.PAC.PATcount
1	227	 227	+	1	1	227	1	0	0	1
1	5743	 5750	+	4	7	5744	3	1	6	0
1	5850	 5852	+	3	4	5850	2	1	1	2
1	5888	 5916	+	5	14	5895	8	1	9	4

c) all.PAC.info – is the part of "all.PAC.PATcount" with first 8 columns, including "chr", "UPA_start", "UPA_end", "strand", "PAnum", "tot_tagnum", "coord", "refPAnum".

>head  all.PAC.info
1	227	227	+	1	1	227	1
1	5743	5750	+	4	7	5744	3
1	5850	5852	+	3	4	5850	2
1	5888	5916	+	5	14	5895	8

d) all.PAC.PAcount – is the number of poly(A) sites for each sample in the PAC region, corresponding columns are "chr", "UPA_start", "UPA_end", "strand", "PAnum", "tot_tagnum", "coord", "refPAnum", "samplePAnum".

1	227	227	+	1	1	227	1	0	0	1
1	5743	5750	+	4	7	5744	3	1	3	0
1	5850	5852	+	3	4	5850	2	1	1	2
1	5888	5916	+	5	14	5895	8	1	4	2

e) all.PA.uniq.bed – is poly(A) sites information in standard bed format.

1	226	227	.	 1	+
1	1911	1912	.	 1	-
1	4647	4648	.	 1	-
1	5716	5717	.	 1	+

5 Annotation of poly(A) sites

Poly(A) sites were annotated with respective genes, genomic locations, etc. based on the latest genome annotations. An R script based on the GenomicFeatures R package was implemented to uniformly process the genome annotation file in GFF3 format. Annotation for both protein coding genes and non-coding genes were parsed. PACs were then annotated based on their genomic locations, i.e., 3' UTR, coding sequence (CDS) for protein coding genes, intron and exon for non-coding genes, and intergenic region. To resolve annotation ambiguity owing to multiple transcripts from the same gene, we annotated a PAC based on the following priority: 3' UTR, CDS, intron. Particularly, if a PAC is located in intergenic region, we recorded the neighboring genes and its distance from the 3' end of the nearby 5′ gene and the distance from the 5′ end of the nearby 3' gene. This strategy to annotate intergenic PACs allows annotation of PACs with higher flexibility to recruit PACs falling within extended 3' UTR regions.

Extended 3' UTR is defined as the downstream region of the annotated 3' UTR or the gene end (if there is no annotated 3' UTR). In PlantAPAdb, the length of extended 3' UTR region is defined as twice of the average length of annotated 3' UTRs. According to genome annotations used in PlantAPAdb, the average 3' UTR lengths are: arabidopsis_thaliana = 276, chlamydomonas_reinhardtii = 850, medicago_truncatula = 401, oryza_sativa_japonica_group = 322, oryza_sativa_indica_group = 322, trifolium_pratense = 207, bamboo = 498.

For APA switching analyses, PACs located in extended 3' UTR were also considered as 3' UTR PACs for identifying 3' UTR lengthening or shortening.

About & Help