About & Help

  • About
  • APA catalogue
  • APA conservation
  • APA sequence
  • APA signal
  • APA metric
  • APA switching
  • APA browser
  • Bulk download
  • Pipeline
During gene expression in eukaryotes, almost all newly transcribed pre-mRNAs undergo polyadenylation, a process in which the pre-mRNA is cleaved at the 3' end and a poly(A) tail is added. Because the poly(A) tail marks the end of a mature mRNA, the choice among alternative poly(A) sites within the same gene, known as alternative polyadenylation (APA), may generate different mRNAs with distinct sequence contents and functions (Tian and Manley, 2017). APA is increasingly recognized as an important regulator of eukaryotic gene expression. Recent large-scale studies using 3' end sequencing (3' seq) revealed that extensive APA exists in different systems from yeast to humans (Derti, et al., 2012; Li, et al., 2012; Smibert, et al., 2012; Ulitsky, et al., 2012; Hoque, et al., 2013). In plants and algae, current genome-wide data analyses suggest that about 50-70% of genes show evidence of APA (Shen, et al., 2008; Shen, et al., 2011; Wu, et al., 2011; Sherstnev, et al., 2012; Fu, et al., 2016; Wang, et al., 2017; Chakrabarti, et al., 2018; Zhou, et al., 2019). APA has been shown to regulate genes involved in flowering time control, amino acid biosynthesis, plant incompatibility, and oxidative stress responses, among others (Sherstnev, et al., 2012; Thomas, et al., 2012; Hong, et al., 2017; Lin, et al., 2017; Fu, et al., 2019; Zhou, et al., 2019).

PlantAPAdb provides a comprehensive and manually curated catalog of APA sites in plants based on a large volume of data from diverse biological samples generated by 3' seq. Currently, PlantAPAdb contains APA sites in seven plant organisms, including Oryza sativa L. (japonica and indica), Arabidopsis thaliana, Chlamydomonas reinhardtii, Medicago truncatula, Trifolium pratense, Populus trichocarpa and Phyllostachys edulis.
PlantAPAdb catalogues APA sites in different tissues under various physiological and pathological conditions, which were sequenced using diverse 3' seq protocols and biological samples. A one-click search is integrated in PlantAPAdb, which enables querying the whole database without restricting the search to any field. We designed a uniform and flexible processing pipeline for identifying high-confidence poly(A) sites from 3' seq data, which allows the incorporation of all publicly available or emerging sequencing data sets. We also developed a standard procedure for parsing genome annotation files obtained from a unified portal (Ensembl Plants), facilitating the annotation of poly(A) sites in a highly flexible way. PlantAPAdb provides rich information on genome-wide poly(A) sites, including genomic locations, heterogeneous cleavage sites, expression levels, related poly(A) signals, sample information, and conservation information. APA sites can be visualized in their genomic context via the JBrowse genome browser. PlantAPAdb also provides full lists of poly(A) signals for poly(A) sites in different genomic regions, and users can download sequences surrounding poly(A) sites of selected groups. Moreover, PlantAPAdb contains comprehensive lists of APA sites and genes involved in 3' UTR shortening/lengthening between two biological samples, identified with different methods and filtering conditions. PlantAPAdb also quantifies APA sites using several metrics, such as tissue specificity, poly(A) site usage, relative usage of distal poly(A) site (RUD), and average 3' UTR length. In addition, information about the conservation of poly(A) sites across plants is available in PlantAPAdb, providing insights into the polyadenylation configuration between species. Data from pooled samples and individual samples in PlantAPAdb are available for bulk download as flat files. Pipeline and scripts used for APA analyses in PlantAPAdb can be downloaded here.
As a user-friendly database, PlantAPAdb is a large and extendable resource for improving genome annotation and elucidating APA mechanism, APA conservation and APA-mediated gene expression regulation.

To ensure smooth access to PlantAPAdb, we have built two mirror websites:
1) http://www.bmibig.cn/plantAPAdb (stable)
2) http://bmi.xmu.edu.cn/plantAPAdb (fast)

If you are using PlantAPAdb, please cite:
Zhu S#, Ye W#, Ye L*, Fu H, Ye C, Xiao X, Ji Y, Lin W, Ji G*, Wu X*: PlantAPAdb: A Comprehensive Database for Alternative Polyadenylation Sites in Plants. Plant Physiol 2020, 182(1):228-242.

If you have any questions or comments, please email xhuister@xmu.edu.cn (Dr. Xiaohui Wu).
Reference
  • Shen, Y., et al. (2008) Genome level analysis of rice mRNA 3'-end processing signals and alternative polyadenylation, Nucleic Acids Res., 36, 3150-3161.
  • Fu, Y., et al. (2011) Differential genome-wide profiling of tandem 3' UTRs among human breast cancer and normal cells by high-throughput sequencing, Genome Res., 21, 741-747.
  • Shen, Y., et al. (2011) Transcriptome dynamics through alternative polyadenylation in developmental and environmental responses in plants revealed by deep sequencing, Genome Res., 21, 1478-1486.
  • Wu, X., et al. (2011) Genome-wide landscape of polyadenylation in Arabidopsis provides evidence for extensive alternative polyadenylation, Proc. Natl. Acad. Sci. USA, 108, 12533-12538.
  • Derti, A., et al. (2012) A quantitative atlas of polyadenylation in five mammals, Genome Res., 22, 1173-1183.
  • Li, Y., et al. (2012) Dynamic landscape of tandem 3' UTRs during zebrafish development, Genome Res., 22, 1899-1906.
  • Sherstnev, A., et al. (2012) Direct sequencing of Arabidopsis thaliana RNA reveals patterns of cleavage and polyadenylation, Nat. Struct. Mol. Biol., 19, 845-852.
  • Smibert, P., et al. (2012) Global Patterns of Tissue-Specific Alternative Polyadenylation in Drosophila, Cell Reports, 1, 277-289.
  • Thomas, P.E., et al. (2012) Genome-Wide Control of Polyadenylation Site Choice by CPSF30 in Arabidopsis, Plant Cell, 24, 4376-4388.
  • Ulitsky, I., et al. (2012) Extensive alternative polyadenylation during zebrafish development, Genome Res., 22, 2054-2066.
  • Hoque, M., et al. (2013) Analysis of alternative cleavage and polyadenylation by 3' region extraction and deep sequencing, Nat. Methods, 10, 133-139.
  • Fu, H., et al. (2016) Genome-wide dynamics of alternative polyadenylation in rice, Genome Res., 26, 1753-1760.
  • Hong, L., et al. (2017) Alternative polyadenylation is involved in auxin-based plant growth and development, The Plant Journal, 93, 246-258.
  • Lin, J., et al. (2017) Role of Cleavage and Polyadenylation Specificity Factor 100: Anchoring Poly(A) Sites and Modulating Transcription Termination, The Plant Journal, 91, 829-839.
  • Tian, B. and Manley, J.L. (2017) Alternative polyadenylation of mRNA precursors, Nature reviews. Molecular cell biology, 18, 18-30.
  • Wang, T., et al. (2017) Comprehensive profiling of rhizome-associated alternative splicing and alternative polyadenylation in moso bamboo (Phyllostachys edulis), Plant J, 91, 684-699.
  • Arefeen, A., et al. (2018) TAPAS: Tool for Alternative Polyadenylation Site Analysis, Bioinformatics, bty110-bty110.
  • Chakrabarti, M., Dinkins, R.D. and Hunt, A.G. (2018) Genome-wide atlas of alternative polyadenylation in the forage legume red clover, Scientific Reports, 8, 11379.
  • Ye, C., et al. (2018) APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data, Bioinformatics, 34, 1841-1849.
  • Fu, H., et al. (2019) Distinct genome-wide alternative polyadenylation during the response to silicon availability in the marine diatom Thalassiosira pseudonana, The Plant Journal, 99, 67-80.
  • Zhou, Q., et al. (2019) Differential alternative polyadenylation contributes to the developmental divergence between two rice subspecies Japonica and Indica, The Plant Journal, 98, 260-276.
PlantAPAdb catalogues all PAC (poly(A) site cluster) datasets, organized by categories such as sequencing protocol, biological sample, authors and study, and publication year. Each dataset is displayed as a dataset card. By clicking the card of a dataset, users can view detailed statistics of the dataset, such as the relevant study and literature, mapping information of the raw data, and statistics of PAC and read distribution in different genomic regions. Lists of all potential PACs and high-confidence PACs are provided for download, either in simple BED format or with full genome annotation. The list of heterogeneous cleavage sites before grouping into PACs is also provided for download, which is especially useful for analyzing the micro-heterogeneity of polyadenylation or testing the impact of the clustering procedure.

Each catalogue (dataset card) contains PACs from one biological sample (e.g., one tissue with multiple replicates). Raw RNA-seq data from this sample were pre-processed using the standard pipeline. Note that the PAC coordinates in a single sample do not necessarily match, or form a subset of, the PAC coordinates for the whole species, because the PACs of a single sample are not extracted from the pooled sample.
3' seq protocols
Protocols | Description | Studies | Cite
PAT-seq | A method to study the integration of 3' UTR dynamics with gene expression in the eukaryotic transcriptome | Buffer Control (SRP089899); Seedling Control (SRP089899); Cordycepin 2h (SRP089899); FLAG-RPL18 Control (SRP089899); FLAG-RPL18 Hypoxia 2h (SRP089899); Hypoxia 2h (SRP089899); Flower (SRP070055); Root (SRP070055); Auxin (SRP092240); Mock Control (SRP092240); Amp311 (SRP093950); Esp5 (SRP093950); CPSF30 (SRP050424); CPSF30m (SRP050424); Root Oxt6 Mutant (SRP050424); Root Wild Type (SRP050424); Leaf Oxt6 Mutant (SRP009685); Leaf Wild Type (SRP009685); AHBD2 (SRP005137); Fip1 Mutant (SRP187778); Salt-treated Fip1 Mutant (SRP187778); Wild Type Control (SRP187778); Salt-treated Wild Type (SRP187778); High Salt Acetate (SRP068667); High Salt (SRP068667); Tris Phosphate Acetate (SRP068667); Tris Phosphate (SRP068667); AHBG3 Test1 (SRP041219); AHBG3 Test2 (SRP041219); 20 Days Leaf (SRP136202); 5 Days Root (SRP136202); 60 Days Leaf (SRP136202); 60 Days Root (SRP136202); 60 Days Stem (SRP136202); Anther (SRP136202); Dry Seed (SRP136202); Embryo (SRP136202); Endosperm (SRP136202); Husk (SRP136202); Imbibed Seed (SRP136202); Mature Pollen (SRP136202); Pistil (SRP136202); Seeding Shoot (SRP136202); 20 Days Leaf (SRP073467); 5 Days Root (SRP073467); 60 Days Leaf (SRP073467); 60 Days Root (SRP073467); 60 Days Stem (SRP073467); Anther (SRP073467); Dry Seed (SRP073467); Embryo (SRP073467); Endosperm (SRP073467); Husk (SRP073467); Imbibed Seed (SRP073467); Mature Pollen (SRP073467); Pistil (SRP073467); Seedling Shoot (SRP073467); Flower (SRP186946); Leaf (SRP186946); Root (SRP186946) | Harrison P F, Powell D R, Clancy J L, et al. PAT-seq: a method to study the integration of 3' UTR dynamics with gene expression in the eukaryotic transcriptome[J]. RNA, 2015, 21(8): 1502-1510.
PAS-seq | A deep sequencing-based method called Poly(A) Site Sequencing (PAS-Seq) for quantitatively profiling RNA polyadenylation at the transcriptome level | Hlp1 Mutant (SRP013996); Seedling Wild Type Control (SRP013996); Latera Bud (SRP093919); New Shoot Tip (SRP093919); Rhizome Tip (SRP093919) | Shepard P J, Choi E A, Lu J, et al. Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq[J]. RNA, 2011, 17(4): 761-772.
DRS | Direct RNA sequencing | DRS fpa (ERP003245); Seeding Wild Type DRS (ERP003245); Seeding Wild Type DRS1 (ERP001018); Seeding Wild Type DRS2 (ERP001018) | Ozsolak F, Platt A R, Jones D R, et al. Direct RNA sequencing[J]. Nature, 2009, 461(7265): 814.
EXPRSS | An Illumina based high-throughput expression-profiling method to reveal transcriptional dynamics | Leaf slh1 00hr (SRP033221); Leaf slh1 06hr (SRP033221); Leaf slh1 09hr (SRP033221); Leaf slh1 24hr (SRP033221); Leaf Wild Type 00hr (SRP033221); Leaf Wild Type 06hr (SRP033221); Leaf Wild Type 09hr (SRP033221); Leaf Wild Type 24hr (SRP033221) | Rallapalli G, Kemen E M, Robert-Seilaniantz A, et al. EXPRSS: an Illumina based high-throughput expression-profiling method to reveal transcriptional dynamics[J]. BMC genomics, 2014, 15(1): 341.
Studies
Studies | Species | Cite
Wu X, 2011 | Arabidopsis thaliana | Wu X, Liu M, Downie B, et al. Genome-wide landscape of polyadenylation in Arabidopsis provides evidence for extensive alternative polyadenylation[J]. Proceedings of the National Academy of Sciences, 2011, 108(30): 12533-12538.
Sherstnev A, 2012 | Arabidopsis thaliana | Sherstnev A, Duc C, Cole C, et al. Direct sequencing of Arabidopsis thaliana RNA reveals patterns of cleavage and polyadenylation[J]. Nature structural & molecular biology, 2012, 19(8): 845.
Thomas P E, 2012 | Arabidopsis thaliana | Thomas P E, Wu X, Liu M, et al. Genome-wide control of polyadenylation site choice by CPSF30 in Arabidopsis[J]. The Plant Cell, 2012, 24(11): 4376-4388.
Duc C, 2013 | Arabidopsis thaliana | Duc C, Sherstnev A, Cole C, et al. Transcription termination and chimeric RNA formation controlled by Arabidopsis thaliana FPA[J]. PLoS genetics, 2013, 9(10): e1003867.
Liu M, 2014 | Arabidopsis thaliana | Liu M, Xu R, Merrill C, et al. Integration of developmental and environmental signals via a polyadenylation factor in Arabidopsis[J]. PloS one, 2014, 9(12): e115779.
Rallapalli G, 2014 | Arabidopsis thaliana | Rallapalli G, Kemen E M, Robert-Seilaniantz A, et al. EXPRSS: an Illumina based high-throughput expression-profiling method to reveal transcriptional dynamics[J]. BMC genomics, 2014, 15(1): 341.
Wu X, 2014 | Medicago truncatula | Wu X, Gaffney B, Hunt A G, et al. Genome-wide determination of poly (A) sites in Medicago truncatula: evolutionary conservation of alternative poly (A) site choice[J]. BMC genomics, 2014, 15(1): 615.
Zhang Y, 2015 | Arabidopsis thaliana | Zhang Y, Gu L, Hou Y, et al. Integrative genome-wide analysis reveals HLP1, a novel RNA-binding protein, regulates plant flowering by targeting alternative polyadenylation[J]. Cell research, 2015, 25(7): 864.
Bell S A, 2016 | Chlamydomonas reinhardtii | Bell S A, Shen C, Brown A, et al. Experimental genome-wide determination of RNA polyadenylation in Chlamydomonas reinhardtii[J]. PloS one, 2016, 11(1): e0146107.
Fu H, 2016 | Oryza sativa Japonica Group | Fu H, Yang D, Su W, et al. Genome-wide dynamics of alternative polyadenylation in rice[J]. Genome research, 2016, 26(12): 1753-1760.
Guo C, 2016 | Arabidopsis thaliana | Guo C, Spinelli M, Liu M, et al. A genome-wide study of “non-3UTR” polyadenylation sites in Arabidopsis thaliana[J]. Scientific reports, 2016, 6: 28060.
de Lorenzo L, 2017 | Arabidopsis thaliana | de Lorenzo L, Sorenson R, Bailey-Serres J, et al. Noncanonical alternative polyadenylation contributes to gene regulation in response to hypoxia[J]. The Plant Cell, 2017, 29(6): 1262-1277.
Hong L, 2017 | Arabidopsis thaliana | Hong L, Ye C, Lin J, et al. Alternative polyadenylation is involved in auxin-based plant growth and development[J]. The Plant Journal, 2018, 93(2): 246-258.
Lin J, 2017 | Arabidopsis thaliana | Lin J, Xu R, Wu X, et al. Role of cleavage and polyadenylation specificity factor 100: anchoring poly (A) sites and modulating transcription termination[J]. The Plant Journal, 2017, 91(5): 829-839.
Chakrabarti M, 2018 | Trifolium pratense | Chakrabarti M, Dinkins R D, Hunt A G. Genome-wide atlas of alternative polyadenylation in the forage legume red clover[J]. Scientific reports, 2018, 8(1): 11379.
Wang T, 2018 | Phyllostachys edulis | Wang T, Wang H, Cai D, Gao Y, Zhang H, Wang Y, et al. Comprehensive profiling of rhizome-associated alternative splicing and alternative polyadenylation in moso bamboo (Phyllostachys edulis)[J]. Plant J, 2017, 91(4): 684-699. doi: 10.1111/tpj.13597.
Telléz-Robledo B, 2019 | Arabidopsis thaliana | Tellez-Robledo B, et al. The polyadenylation factor FIP1 is important for plant development and root responses to abiotic stresses[J]. The Plant Journal, 2019, 99(6): 1203-1219.
Zhou Q, 2019 | Oryza sativa Indica Group | Zhou Q, Fu H, Yang D, et al. Differential alternative polyadenylation contributes to the developmental divergence between two rice subspecies, japonica and indica[J]. The Plant Journal, 2019, 98(2): 260-276.
Sample list
Organism | SRP group label | Environment condition | Plant type | Protocol | Tissue | Reference
arabidopsis_thaliana | Buffer Control (SRP089899) | Control | Wild type | PAT-seq | Seedling | de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana | Seedling Control (SRP089899) | Control | Wild type | PAT-seq | Seedling | de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana | Cordycepin 2h (SRP089899) | Cordycepin | Wild type | PAT-seq | Seedling | de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana | FLAG-RPL18 Control (SRP089899) | Control | FLAG-RPL18 | PAT-seq | Seedling | de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana | FLAG-RPL18 Hypoxia 2h (SRP089899) | Hypoxia | FLAG-RPL18 | PAT-seq | Seedling | de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana | Hypoxia 2h (SRP089899) | Hypoxia | Wild type | PAT-seq | Seedling | de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana | DRS fpa (ERP003245) | NA | fpa | DRS | Seed | Duc et al. PLoS genetics, 2013.
arabidopsis_thaliana | Seeding Wild Type DRS (ERP003245) | NA | Wild type | DRS | Seed | Duc et al. PLoS genetics, 2013.
arabidopsis_thaliana | Flower (SRP070055) | NA | Wild type | PAT-seq | Flower | Guo et al. Scientific reports, 2016.
arabidopsis_thaliana | Root (SRP070055) | NA | Wild type | PAT-seq | Root | Guo et al. Scientific reports, 2016.
arabidopsis_thaliana | Auxin (SRP092240) | Auxin | Wild type | PAT-seq | Seed | Hong et al. The Plant Journal, 2018.
arabidopsis_thaliana | Mock Control (SRP092240) | Control | Wild type | PAT-seq | Seed | Hong et al. The Plant Journal, 2018.
arabidopsis_thaliana | Amp311 (SRP093950) | NA | Amp311 | PAT-seq | Leaf | Lin et al. The Plant Journal, 2017.
arabidopsis_thaliana | Esp5 (SRP093950) | NA | Esp5 | PAT-seq | Leaf | Lin et al. The Plant Journal, 2017.
arabidopsis_thaliana | CPSF30 (SRP050424) | NA | CPSF30 | PAT-seq | Root | Liu et al. PloS one, 2014.
arabidopsis_thaliana | CPSF30m (SRP050424) | NA | CPSF30m | PAT-seq | Root | Liu et al. PloS one, 2014.
arabidopsis_thaliana | Root Oxt6 Mutant (SRP050424) | NA | Oxt6 | PAT-seq | Root | Liu et al. PloS one, 2014.
arabidopsis_thaliana | Root Wild Type (SRP050424) | Control | Wild type | PAT-seq | Root | Liu et al. PloS one, 2014.
arabidopsis_thaliana | Leaf slh1 00hr (SRP033221) | NA | slh1 | EXPRSS | Leaf | Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana | Leaf slh1 06hr (SRP033221) | NA | slh1 | EXPRSS | Leaf | Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana | Leaf slh1 09hr (SRP033221) | NA | slh1 | EXPRSS | Leaf | Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana | Leaf slh1 24hr (SRP033221) | NA | slh1 | EXPRSS | Leaf | Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana | Leaf Wild Type 00hr (SRP033221) | Control | Wild type | EXPRSS | Leaf | Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana | Leaf Wild Type 06hr (SRP033221) | Control | Wild type | EXPRSS | Leaf | Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana | Leaf Wild Type 09hr (SRP033221) | Control | Wild type | EXPRSS | Leaf | Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana | Leaf Wild Type 24hr (SRP033221) | Control | Wild type | EXPRSS | Leaf | Rallapalli et al. BMC genomics, 2014.
arabidopsis_thaliana | Seeding Wild Type DRS1 (ERP001018) | NA | Wild type | DRS | Seed | Sherstnev et al. Nature structural & molecular biology, 2012.
arabidopsis_thaliana | Seeding Wild Type DRS2 (ERP001018) | NA | Wild type | DRS | Seed | Sherstnev et al. Nature structural & molecular biology, 2012.
arabidopsis_thaliana | Leaf Oxt6 Mutant (SRP009685) | NA | Oxt6 | PAT-seq | Leaf | Thomas et al. The Plant Cell, 2012.
arabidopsis_thaliana | Leaf Wild Type (SRP009685) | Control | Wild type | PAT-seq | Leaf | Thomas et al. The Plant Cell, 2012.
arabidopsis_thaliana | AHBD2 (SRP005137) | NA | Wild type | PAT-seq | Seed | Wu X et al. PNAS, 2011.
arabidopsis_thaliana | Hlp1 Mutant (SRP013996) | NA | Hlp1 | PAS-seq | Seedling | Zhang et al. Cell research, 2015.
arabidopsis_thaliana | Seedling Wild Type Control (SRP013996) | Control | Wild type | PAS-seq | Seedling | Zhang et al. Cell research, 2015.
arabidopsis_thaliana | Fip1 Mutant (SRP187778) | NA | Fip1 | PAT-seq | Seed | Tellez-Robledo et al. The Plant Journal, 2019.
arabidopsis_thaliana | Salt-treated Fip1 Mutant (SRP187778) | Salt-treated | Fip1 | PAT-seq | Seed | Tellez-Robledo et al. The Plant Journal, 2019.
arabidopsis_thaliana | Wild Type Control (SRP187778) | Control | Wild type | PAT-seq | Seed | Tellez-Robledo et al. The Plant Journal, 2019.
arabidopsis_thaliana | Salt-treated Wild Type (SRP187778) | Salt-treated | Wild type | PAT-seq | Seed | Tellez-Robledo et al. The Plant Journal, 2019.
bamboo | Latera Bud (SRP093919) | NA | Wild type | PAS-seq | Root | Wang et al. Plant J, 2017.
bamboo | New Shoot Tip (SRP093919) | NA | Wild type | PAS-seq | Root | Wang et al. Plant J, 2017.
bamboo | Rhizome Tip (SRP093919) | NA | Wild type | PAS-seq | Root | Wang et al. Plant J, 2017.
chlamydomonas_reinhardtii | High Salt Acetate (SRP068667) | High Salt Acetate | Wild type | PAT-seq | Media_grown | Bell et al. PloS one, 2016.
chlamydomonas_reinhardtii | High Salt (SRP068667) | High Salt | Wild type | PAT-seq | Media_grown | Bell et al. PloS one, 2016.
chlamydomonas_reinhardtii | Tris Phosphate Acetate (SRP068667) | Tris Phosphate Acetate | Wild type | PAT-seq | Media_grown | Bell et al. PloS one, 2016.
chlamydomonas_reinhardtii | Tris Phosphate (SRP068667) | Tris Phosphate | Wild type | PAT-seq | Media_grown | Bell et al. PloS one, 2016.
medicago_truncatula | AHBG3 Test1 (SRP041219) | NA | Wild type | PAT-seq | Mix | Wu X et al. BMC genomics, 2014.
medicago_truncatula | AHBG3 Test2 (SRP041219) | NA | Wild type | PAT-seq | Mix | Wu X et al. BMC genomics, 2014.
oryza_sativa_indica_group | 20 Days Leaf (SRP136202) | NA | Wild type | PAT-seq | Leaf | Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group | 5 Days Root (SRP136202) | NA | Wild type | PAT-seq | Root | Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group | 60 Days Leaf (SRP136202) | NA | Wild type | PAT-seq | Leaf | Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group | 60 Days Root (SRP136202) | NA | Wild type | PAT-seq | Root | Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group | 60 Days Stem (SRP136202) | NA | Wild type | PAT-seq | Stem | Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group | Anther (SRP136202) | NA | Wild type | PAT-seq | Anther | Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group | Dry Seed (SRP136202) | NA | Wild type | PAT-seq | Seed | Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group | Embryo (SRP136202) | NA | Wild type | PAT-seq | Embryo | Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group | Endosperm (SRP136202) | NA | Wild type | PAT-seq | Endosperm | Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group | Husk (SRP136202) | NA | Wild type | PAT-seq | Husk | Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group | Imbibed Seed (SRP136202) | NA | Wild type | PAT-seq | Seed | Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group | Mature Pollen (SRP136202) | NA | Wild type | PAT-seq | Pollen | Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group | Pistil (SRP136202) | NA | Wild type | PAT-seq | Pistil | Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group | Seeding Shoot (SRP136202) | NA | Wild type | PAT-seq | Seedling | Zhou et al. The Plant Journal, 2019.
oryza_sativa_japonica_group | 20 Days Leaf (SRP073467) | NA | Wild type | PAT-seq | Leaf | Fu et al. Genome research, 2016.
oryza_sativa_japonica_group | 5 Days Root (SRP073467) | NA | Wild type | PAT-seq | Root | Fu et al. Genome research, 2016.
oryza_sativa_japonica_group | 60 Days Leaf (SRP073467) | NA | Wild type | PAT-seq | Leaf | Fu et al. Genome research, 2016.
oryza_sativa_japonica_group | 60 Days Root (SRP073467) | NA | Wild type | PAT-seq | Root | Fu et al. Genome research, 2016.
oryza_sativa_japonica_group | 60 Days Stem (SRP073467) | NA | Wild type | PAT-seq | Stem | Fu et al. Genome research, 2016.
oryza_sativa_japonica_group | Anther (SRP073467) | NA | Wild type | PAT-seq | Anther | Fu et al. Genome research, 2016.
oryza_sativa_japonica_group | Dry Seed (SRP073467) | NA | Wild type | PAT-seq | Seed | Fu et al. Genome research, 2016.
oryza_sativa_japonica_group | Embryo (SRP073467) | NA | Wild type | PAT-seq | Embryo | Fu et al. Genome research, 2016.
oryza_sativa_japonica_group | Endosperm (SRP073467) | NA | Wild type | PAT-seq | Endosperm | Fu et al. Genome research, 2016.
oryza_sativa_japonica_group | Husk (SRP073467) | NA | Wild type | PAT-seq | Husk | Fu et al. Genome research, 2016.
oryza_sativa_japonica_group | Imbibed Seed (SRP073467) | NA | Wild type | PAT-seq | Seed | Fu et al. Genome research, 2016.
oryza_sativa_japonica_group | Mature Pollen (SRP073467) | NA | Wild type | PAT-seq | Pollen | Fu et al. Genome research, 2016.
oryza_sativa_japonica_group | Pistil (SRP073467) | NA | Wild type | PAT-seq | Pistil | Fu et al. Genome research, 2016.
oryza_sativa_japonica_group | Seedling Shoot (SRP073467) | NA | Wild type | PAT-seq | Seedling | Fu et al. Genome research, 2016.
trifolium_pratense | Flower (SRP186946) | NA | Wild type | PAT-seq | Flower | Chakrabarti et al. Scientific reports, 2018.
trifolium_pratense | Leaf (SRP186946) | NA | Wild type | PAT-seq | Leaf | Chakrabarti et al. Scientific reports, 2018.
trifolium_pratense | Root (SRP186946) | NA | Wild type | PAT-seq | Root | Chakrabarti et al. Scientific reports, 2018.
We used the Arabidopsis genome as the reference and obtained pair-wise genome alignment chain files from Ensembl Plants to identify syntenic regions between each of the other genomes and the reference genome. The coordinates of PACs of all other species were then converted to coordinates of the reference genome. We adopted the reciprocal best match method (Wang, et al., 2018) to determine conserved PACs. Briefly, two PACs from two species were considered orthologous if their distance was smaller than 25 nt based on the whole genome alignment. For each PAC in each species, the conservation information is recorded, including the species in which conserved sites are found and the corresponding PACs.

Figure 1. The reciprocal best match method used to determine conserved PACs.
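
The matching step can be illustrated with a minimal Python sketch. It assumes that PAC coordinates from both species have already been converted to reference-genome coordinates, and it reads "reciprocal best match" as: the nearest PAC of species B to a PAC of species A must in turn have that A PAC as its own nearest neighbour, with the two less than 25 nt apart. The record layout (id, chr, strand, coord) and the function names are hypothetical and are not taken from the PlantAPAdb pipeline.

```python
from collections import defaultdict

MAX_DIST = 25  # nt; PACs closer than this are treated as orthologous in the text


def nearest(pac, candidates):
    """Return (candidate, distance) for the closest candidate, or (None, None)."""
    best, best_dist = None, None
    for cand in candidates:
        dist = abs(cand["coord"] - pac["coord"])
        if best_dist is None or dist < best_dist:
            best, best_dist = cand, dist
    return best, best_dist


def group_by_locus(pacs):
    """Group PAC records by (chromosome, strand) so only comparable PACs are matched."""
    groups = defaultdict(list)
    for pac in pacs:
        groups[(pac["chr"], pac["strand"])].append(pac)
    return groups


def conserved_pairs(pacs_a, pacs_b, max_dist=MAX_DIST):
    """Return (id_a, id_b, distance) for reciprocal best matches within max_dist."""
    groups_a, groups_b = group_by_locus(pacs_a), group_by_locus(pacs_b)
    pairs = []
    for key, a_list in groups_a.items():
        b_list = groups_b.get(key, [])
        for a in a_list:
            b, dist = nearest(a, b_list)
            if b is None or dist >= max_dist:
                continue
            back, _ = nearest(b, a_list)  # reciprocal check from B back to A
            if back is a:
                pairs.append((a["id"], b["id"], dist))
    return pairs


pacs_a = [
    {"id": "A1", "chr": "1", "strand": "+", "coord": 1000},
    {"id": "A2", "chr": "1", "strand": "+", "coord": 1010},
    {"id": "A3", "chr": "2", "strand": "-", "coord": 500},
]
pacs_b = [
    {"id": "B1", "chr": "1", "strand": "+", "coord": 1012},
    {"id": "B2", "chr": "2", "strand": "-", "coord": 540},
]
print(conserved_pairs(pacs_a, pacs_b))
```

In this toy input, A1 and A2 both have B1 as their nearest neighbour, but only A2 is B1's own nearest neighbour, so only that pair is reported; A3 and B2 are 40 nt apart and are not matched.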

Description of the APA conservation table
id poly(A) site id
chr Chromosome of the PAC
start Start coordinate of the PAC.
end End coordinate of the PAC.
strand Strand
coord Coordinate of the PAC, which is the coordinate of the most dominant cleavage site in a PAC
ftr Genomic region in which the PAC is located, e.g., 3UTR, 5UTR, exon, intron, intergenic
gene_id Respective gene ID
arabidopsis_thaliana Conserved PAC ID in the respective species
oryza_indica_group Conserved PAC ID in the respective species
oryza_japonica_group Conserved PAC ID in the respective species
medicago_truncatula Conserved PAC ID in the respective species
trifolium_pratense Conserved PAC ID in the respective species
chlamydomonas_reinhardtii Conserved PAC ID in the respective species
Reference
  • Wang, R., et al. (2018) PolyA_DB 3 catalogs cleavage and polyadenylation sites identified by deep sequencing in multiple genomes, Nucleic Acids Res, 46, D315-D319.
Upstream and downstream sequences around poly(A) sites located in different genomic regions (3' UTR, 5' UTR, CDS, intron, intergenic) can be downloaded.
Each sequence is of length 400 nt, with the upstream 300 nt and downstream 100 nt sequence of the respective poly(A) site cluster (PAC). The poly(A) site is at the 301st position. You can browse all poly(A) site categories from APA CATALOGUE, or download poly(A) site atlas from BULK DOWNLOAD.
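
The window layout described above can be sketched in Python as follows. This is an illustration only: it assumes 1-based PAC coordinates, assumes that the poly(A) site itself is counted as the first of the 100 downstream positions (so that it lands at position 301), and assumes that minus-strand windows are reverse-complemented. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def pac_window(chrom_seq, coord, strand, upstream=300, downstream=100):
    """Return the 400-nt window around a poly(A) site, or None near chromosome ends.

    coord is 1-based. On the returned sequence the poly(A) site sits at
    position upstream + 1 (the 301st position with the defaults).
    """
    if strand == "+":
        start, end = coord - upstream, coord + downstream - 1  # 1-based, inclusive
    elif strand == "-":
        start, end = coord - downstream + 1, coord + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    if start < 1 or end > len(chrom_seq):
        return None
    window = chrom_seq[start - 1:end]
    return window if strand == "+" else reverse_complement(window)


# Toy chromosome: the site is the single G at 1-based position 301.
chrom = "A" * 300 + "G" + "C" * 99 + "T" * 50
window = pac_window(chrom, 301, "+")
print(len(window), window[300])
```

Returning None for windows that run off the chromosome is one possible choice; padding with N would be another.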

Poly(A) sites of each species were classified into four groups based on their genomic locations (3' UTR, CDS, intron, and 5' UTR) and then 50 top-ranked signal patterns were obtained by RSAT for each signal element (e.g., NUE, FUE, CE, DE) in each group of poly(A) sites.

Based on the single nucleotide profiles surrounding poly(A) sites in different species, we divided the eight species into two groups: Chlamydomonas forms one group and the remaining species form the other. Poly(A) signals of 3' UTR poly(A) sites have been investigated previously (Xing and Li, 2011).

Figure 2. Polyadenylation signals of nuclear mRNA in plants and representative algae. The algal signal information was mostly from three species with sequenced genomes: the green alga Chlamydomonas reinhardtii and two diatoms, Thalassiosira pseudonana and Phaeodactylum tricornutum. The specific signals of C. reinhardtii are denoted with *. The signal strength information is for both plants and algae based on classical genetics (plants) and bioinformatics (algae) analyses. The percentage data were estimated from the results of bioinformatics analysis. The question mark after FUE implies that some of the species might not have this signal. CDS, protein coding sequence; UTR, untranslated region; FUE, far upstream element; NUE, near upstream element; CE, cleavage element; PAS, poly(A) site; YA, predominant dinucleotide located at the poly(A) or cleavage site where Y = U or C. The 'A' is the last nucleotide before poly(A) tail; > or >= indicates that one nucleotide appears more than another. Not drawn to scale. (This figure is obtained from (Xing and Li, 2011)).


In Chlamydomonas, four signal elements were reported previously, including FUE (-150 to -25, Pentamers), NUE (-25 to -5, Pentamers), CE (-5 to +5, Heptamers), and DE (+5 to +30, Hexamers) (Shen, et al., 2008). In the other six species with similar single nucleotide profiles surrounding poly(A) sites, we used the signal model of Arabidopsis (Loke, et al., 2005), which contains FUE (-200 to -35, Hexamers), NUE (-35 to -10, Hexamers) and CE (-10 to +15, Hexamers). The choice of signal region range and the respective length of signal patterns are based on the observation in previous studies (Loke, et al., 2005; Shen, et al., 2008).

Figure 3. Polyadenylation signal regions defined in PlantAPAdb for Chlamydomonas and other plants.
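
As an illustration of how these regions map onto the downloadable 400-nt sequences, the following Python sketch slices each signal region and counts k-mers of the stated length. It assumes that the poly(A) site is at offset 0 (the 301st position, index 300 in 0-based terms) and that each range is half-open, so a boundary offset shared by two regions is assigned to the downstream region; both are assumptions made for the example. It only counts oligomers and does not reproduce the statistical scoring performed by RSAT.

```python
from collections import Counter

SITE_INDEX = 300  # 0-based index of the poly(A) site in a 400-nt sequence

# (start offset, end offset, k-mer length) per signal element, from the text above
SIGNAL_MODELS = {
    "arabidopsis_like": {
        "FUE": (-200, -35, 6),
        "NUE": (-35, -10, 6),
        "CE": (-10, 15, 6),
    },
    "chlamydomonas": {
        "FUE": (-150, -25, 5),
        "NUE": (-25, -5, 5),
        "CE": (-5, 5, 7),
        "DE": (5, 30, 6),
    },
}


def region(seq, start, end, site_index=SITE_INDEX):
    """Slice the half-open offset range [start, end) relative to the poly(A) site."""
    return seq[site_index + start:site_index + end]


def count_kmers(seqs, start, end, k):
    """Count all overlapping k-mers inside one signal region across sequences."""
    counts = Counter()
    for seq in seqs:
        sub = region(seq, start, end)
        for i in range(len(sub) - k + 1):
            counts[sub[i:i + k]] += 1
    return counts


def top_patterns(seqs, model, n=50):
    """Return the n most frequent k-mers for every signal element of a model."""
    return {
        name: count_kmers(seqs, start, end, k).most_common(n)
        for name, (start, end, k) in SIGNAL_MODELS[model].items()
    }


# Toy sequence with AATAAA placed 20 nt upstream of the poly(A) site.
seq = "C" * 280 + "AATAAA" + "C" * 114
print(top_patterns([seq], "arabidopsis_like", n=3)["NUE"])
```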


To identify statistically significant signal patterns and sequence logos in a given poly(A) signal region, we applied an oligo analyzer called regulatory sequence analysis tools, or RSAT (Thomas-Chollier, et al., 2008). We classified poly(A) sites of each species into four groups based on their genomic locations (3' UTR, CDS, intron, and 5' UTR) and then obtained 50 top-ranked signal patterns by RSAT for each signal element in each group of poly(A) sites.
Description of poly(A) signals identified by RSAT
rank Order of the motif
seq oligomer sequence
id oligomer identifier
exp_freq expected relative frequency
occ observed occurrences
exp_occ expected occurrences
occ_P occurrence probability (binomial)
occ_E E-value for occurrences (binomial)
occ_sig occurrence significance (binomial)
ovl_occ number of overlapping occurrences (discarded from the count)
forb_occ forbidden positions (to avoid self-overlap)
Reference
  • Loke, J.C., et al. (2005) Compilation of mRNA polyadenylation signals in Arabidopsis revealed a new signal element and potential secondary structures, Plant Physiol, 138, 1457-1468.
  • Shen, Y., et al. (2008) Unique features of nuclear mRNA poly(A) signals and alternative polyadenylation in Chlamydomonas reinhardtii, Genetics, 179, 167-176.
  • Thomas-Chollier, M., et al. (2008) RSAT: regulatory sequence analysis tools, Nucleic Acids Res., 36, W119-W127.
  • Xing, D. and Li, Q.Q. (2011) Alternative polyadenylation and gene expression regulation in plants, Wiley Interdisciplinary Reviews: RNA, 2, 445-458.
We adopted four PAC-level metrics and two gene-level metrics to quantify the dynamics of APA across samples.

The PAC-level metrics include the following four metrics.
NSE (Number of Samples Expressed): the NSE of a PAC is calculated as the number of samples in which the PAC is expressed.
PSE (Percentage of Samples Expressed): the PSE of a PAC is calculated as the ratio of NSE to the total number of samples.
Ratio: the relative usage of a PAC in a gene, which is calculated as the ratio of the expression level of a PAC to the total expression level of the respective gene.
Sample Specificity (Ni et al., 2013; Weng et al., 2016; Hu et al., 2017; Ji et al., 2018): a Shannon entropy score is calculated for each PAC to quantify its overall sample specificity:

H = -\sum_{s=1}^{n} p_s \log_2(p_s)

where n is the number of samples and p_s is the ratio of the expression level of the PAC in sample s to the total expression level of this PAC in all samples. The specificity of a PAC for sample s is then calculated as Q = H - \log_2(p_s). A lower H or Q score means higher sample specificity.
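
The four PAC-level metrics can be sketched in Python as below. The sketch assumes that a PAC counts as "expressed" in a sample when its read count is greater than zero, and that H is the standard Shannon entropy of the per-sample proportions p_s; samples with p_s = 0 contribute nothing to H and receive an infinite Q. The function names are hypothetical and are not taken from the PlantAPAdb scripts.

```python
import math


def nse(counts):
    """Number of samples in which the PAC is expressed (read count > 0)."""
    return sum(1 for c in counts if c > 0)


def pse(counts):
    """Fraction of samples in which the PAC is expressed."""
    return nse(counts) / len(counts)


def ratio(pac_count, gene_total):
    """Relative usage of a PAC within its gene; None if the gene has no reads."""
    return pac_count / gene_total if gene_total else None


def entropy_h(counts):
    """Shannon entropy H of a PAC's expression across samples."""
    total = sum(counts)
    if total == 0:
        return float("nan")
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def specificity_q(counts, s):
    """Q = H - log2(p_s) for sample index s; infinite when p_s is 0."""
    p = counts[s] / sum(counts)
    if p == 0:
        return math.inf
    return entropy_h(counts) - math.log2(p)


uniform = [10, 10, 10, 10]   # equally expressed in four samples
specific = [40, 0, 0, 0]     # expressed in one sample only
print(entropy_h(uniform), specificity_q(uniform, 0))
print(entropy_h(specific), specificity_q(specific, 0))
```

With four samples, the uniformly expressed PAC gives H = 2 and Q = 4, whereas the PAC expressed in a single sample gives H = 0 and Q = 0 for that sample, matching the statement that lower scores mean higher specificity.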

The gene-level metrics include the following two metrics.
RUD (Relative Usage of Distal PAC) (Ji et al., 2009; Ji & Tian, 2009): the RUD of a gene in a sample s is calculated as the ratio of the number of 3' reads of the distal PAC in sample s to the number of total reads of proximal and distal PACs in sample s. Here only genes with at least two 3' UTR PACs were used. Proximal and distal PACs are defined as the two most abundant 3' UTR PACs or the two most distant 3' UTR PACs. The RUD score represents the relative 3' UTR length for a gene in a sample, with higher RUD indicating longer 3' UTR.
WUL (Weighted 3' UTR Length) (Ulitsky et al., 2012; Fu et al., 2016): the WUL of a gene in a sample s is calculated as the average 3' UTR length of all 3' UTR PACs in this gene, weighted by the number of supporting 3' reads of each PAC.
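The two gene-level metrics reduce to simple arithmetic on read counts. The sketch below assumes that the proximal and distal PACs of a gene have already been chosen; the function names rud and wul are hypothetical and used only for illustration.

```python
def rud(proximal_reads, distal_reads):
    """Relative usage of the distal PAC in one sample."""
    total = proximal_reads + distal_reads
    return distal_reads / total if total else float("nan")

def wul(utr_lengths, reads):
    """3' UTR length averaged over all 3' UTR PACs, weighted by reads."""
    total = sum(reads)
    if total == 0:
        return float("nan")
    return sum(length * r for length, r in zip(utr_lengths, reads)) / total
```

For example, a gene with 30 proximal and 70 distal reads has RUD = 0.7, and a gene whose two 3' UTR PACs give 3' UTR lengths of 100 nt and 300 nt with 1 and 3 reads has WUL = 250 nt.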
Heatmap Description
The heatmap for PAC metric of tissue specificity shows the distribution of H values of each PAC across samples. Here PACs for the plot are first filtered by the following filtering conditions: total number of reads in all samples >= N1; the percentage of expressed samples >=N2. Then PACs are ranked by their minimum H values and only the top N3 PACs are used for the plot.
Similarly, for the heatmap for PAC metric of ratio, PACs are ranked by the variability (standard deviation) of ratio values across samples.
For the heatmap for gene metric of WUL, genes are ranked by the variability (coefficient of variation) of WUL scores across samples.
For the heatmap for gene metric of RUD, genes are ranked by the variability (standard deviation) of RUD scores across samples.
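The filter-then-rank selection used for the variability-based heatmaps can be sketched as follows. The row layout (a dict with id, total, pse and values keys) and the function name rank_for_heatmap are assumptions for this example; N1, N2 and N3 correspond to the filtering conditions described above.

```python
import statistics

def rank_for_heatmap(rows, n1, n2, n3, measure="sd"):
    """Keep rows passing the read/PSE filters, return the n3 most variable ids.

    rows: list of dicts with keys "id", "total" (reads over all samples),
          "pse" (fraction of samples expressed) and "values" (per-sample metric).
    measure: "sd" (standard deviation) or "cv" (coefficient of variation).
    """
    kept = [r for r in rows if r["total"] >= n1 and r["pse"] >= n2]

    def variability(row):
        sd = statistics.stdev(row["values"])
        if measure == "cv":
            mean = statistics.mean(row["values"])
            return sd / mean if mean else 0.0
        return sd

    kept.sort(key=variability, reverse=True)
    return [r["id"] for r in kept[:n3]]
```

Using measure="sd" mirrors the ranking described for the ratio and RUD heatmaps, and measure="cv" mirrors the one described for WUL.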
Output Table
In the output table of the gene-level metrics, each column is one sample, each row is one gene. The value in the table denotes the RUD or WUL score.

The metrics of NSE and PSE are given in the output table of metric Ratio or Sample Specificity. Description of output table of metric Ratio. Each row is one PAC named as "GeneID:PAC coordinate".
Sample1~N Ratio of each sample, here replicates from the same sample are averaged first.
gene Gene ID
chr Chromosome
strand Strand
coord Coordinate of the PAC
ftr Genomic region in which the PAC is located
total Total number of reads of the PAC across all samples
NSE Number of samples expressed; here a replicate is counted as one sample.
PSE Percentage of samples expressed; here a replicate is counted as one sample.
Description of output table of metric "Sample Specificity". Columns are the same as those of metric Ratio except for the following columns.
H H score of Shannon entropy, reflecting the overall sample specificity of a PAC. Lower value means higher sample specificity.
Q_min Minimum Q score across all samples.
Q_min_cond Sample name(s) with minimum Q score (or the highest sample specificity).
Sample1~N Q score of each sample, reflecting the sample specificity of a PAC in the respective sample. Lower value means higher sample specificity.
Reference
  • Ji Z, Lee JY, Pan Z, Jiang B, Tian B (2009) Progressive lengthening of 3' untranslated regions of mRNAs by alternative polyadenylation during mouse embryonic development. Proceedings of the National Academy of Sciences, USA 106, 7028-33.
  • Ji Z, Tian B (2009) Reprogramming of 3' untranslated regions of mRNAs by alternative polyadenylation in generation of pluripotent stem cells from different cell types. PloS one 4, e8419.
  • Ulitsky I, Shkumatava A, Jan CH, et al. (2012) Extensive alternative polyadenylation during zebrafish development. Genome Research 22, 2054-66.
  • Ni T, Yang Y, Hafez D, et al. (2013) Distinct polyadenylation landscapes of diverse human tissues revealed by a modified PA-seq strategy. BMC Genomics 14, 615.
  • Fu H, Yang D, Su W, et al. (2016) Genome-wide dynamics of alternative polyadenylation in rice. Genome Research 26, 1753-60.
  • Weng L, Li Y, Xie X, Shi Y (2016) Poly(A) code analyses reveal key determinants for tissue-specific mRNA alternative polyadenylation. RNA 19, 19.
  • Hu W, Li S, Park JY, et al. (2017) Dynamic landscape of alternative polyadenylation during retinal development. Cell Mol Life Sci 74, 1721-39.
  • Ji G, Chen M, Ye W, et al. (2018) TSAPA: identification of tissue-specific alternative polyadenylation sites in plants. Bioinformatics 34, 2123-5.
There are two methods for detecting 3' UTR shortening/lengthening events. The "linear trend" method is based on the chi-squared test for trend in proportions (Fu, et al., 2011; Fu, et al., 2016; Ye, et al., 2018; Zhou, et al., 2019). The methods based on DESeq2 extend the differential expression results from DESeq2 to identify genes with 3' UTR shortening/lengthening events (Arefeen, et al., 2018).

Two strategies were adopted for detecting 3' UTR shortening/lengthening events from samples with or without replicates.

Figure 4. Two strategies were adopted for detecting 3' UTR switching.


The first strategy is based on the chi-squared test for trend in proportions, which is applicable to samples without replicates (replicates are averaged first). The linear trend method has the advantage of considering both the abundance and the 3' UTR length of all 3' UTR PACs in a gene, and it was adopted in several previous APA studies (Fu, et al., 2011; Fu, et al., 2016; Ye, et al., 2018; Zhou, et al., 2019). Briefly, PACs in a gene are sorted by their respective 3' UTR lengths (denoted as scores). A contingency table of read counts is then created, with each row representing a sample and each column a score. Next, the chi-squared test for trend in proportions is performed (R function prop.trend.test), and the Pearson correlation r is obtained using the read counts in the table as the values and the scores as the coordinates. The correlation r ranges from -1 to 1, with a larger absolute value indicating a greater extent of 3' UTR shortening/lengthening. Finally, an adjusted p-value is obtained by the Benjamini-Hochberg method, and genes with an adjusted p-value smaller than a given cutoff (e.g., 0.05) are considered genes with significant 3' UTR shortening/lengthening.
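The linear trend test for a single gene and two samples can be sketched with the Python standard library. The first function reproduces the statistic of R's prop.trend.test in closed form (1 degree of freedom). The second function is an assumption about how the test is applied here: reads of the second sample are treated as the "events", and r is computed as the read-weighted Pearson correlation between sample index (0 or 1) and 3' UTR length. The names prop_trend_test and utr_trend are hypothetical.

```python
import math

def prop_trend_test(x, n, scores):
    """Chi-squared test for trend in proportions (1 df).

    x: events per group, n: trials per group, scores: group scores.
    """
    total = sum(n)
    p = sum(x) / total
    s_bar = sum(ni * si for ni, si in zip(n, scores)) / total
    num = sum(xi * (si - s_bar) for xi, si in zip(x, scores)) ** 2
    den = p * (1 - p) * sum(ni * (si - s_bar) ** 2 for ni, si in zip(n, scores))
    chisq = num / den
    # Upper tail of the chi-squared distribution with 1 df.
    pvalue = math.erfc(math.sqrt(chisq / 2))
    return chisq, pvalue

def utr_trend(counts1, counts2, utr_lengths):
    """Trend test and correlation for one gene (assumed two-sample layout).

    counts1/counts2: reads per PAC in sample 1 / sample 2,
    utr_lengths: 3' UTR length (score) of each PAC, in the same order.
    """
    n = [a + b for a, b in zip(counts1, counts2)]
    chisq, pvalue = prop_trend_test(counts2, n, utr_lengths)
    total = sum(n)
    p2 = sum(counts2) / total
    s_bar = sum(ni * s for ni, s in zip(n, utr_lengths)) / total
    cov = sum(c2 * (s - s_bar) for c2, s in zip(counts2, utr_lengths))
    var_sample = total * p2 * (1 - p2)
    var_score = sum(ni * (s - s_bar) ** 2 for ni, s in zip(n, utr_lengths))
    r = cov / math.sqrt(var_sample * var_score)
    return chisq, pvalue, r

# r > 0: sample 2 favors PACs with longer 3' UTRs (lengthening).
chisq, pvalue, r = utr_trend([80, 15, 5], [20, 30, 50], [100, 300, 600])
```

With this layout the statistic equals the total read count times r squared, so the sign of r gives the direction of the switch and the p-value its significance. Adjusting p-values across genes for multiple testing is a separate step that is not shown.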

The second strategy is applicable to samples with replicates and extends the differential expression (DE) results from DESeq2 to identify genes with 3' UTR shortening/lengthening events (Arefeen, et al., 2018). To detect DE PACs, genes with only one PAC were discarded. Then DESeq2, initially developed for detecting differentially expressed genes between conditions from RNA-seq data, was used for DE PAC identification. For each PAC in APA genes, both a p-value and an adjusted p-value were obtained with DESeq2, and PACs with an adjusted p-value below a given cutoff were considered DE PACs. DESeq2 has been adopted in previous studies for detecting differentially expressed poly(A) sites (Lianoglou, et al., 2013; Fu, et al., 2016; Arefeen, et al., 2018; Zhou, et al., 2019). To detect 3' UTR shortening/lengthening events, 3' UTR APA genes with at least one DE PAC were selected. Then the relative change (RC) for the pair of PACs was calculated. Fisher's exact test was also performed to test the significance of differential usage of the two PACs between two samples, yielding a p-value. Genes with |RC| larger than a given threshold (e.g., 1) and a p-value below a given cutoff (e.g., 0.05) were considered genes with 3' UTR shortening/lengthening events.
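The final filtering step of the second strategy can be illustrated as below. Fisher's exact test is implemented directly from the hypergeometric distribution. The text does not give the formula for RC, so the relative_change function is only an assumption (log2 of the distal/proximal ratio in sample 2 over that in sample 1, with a pseudocount); both function names are hypothetical.

```python
import math

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout: rows are samples, columns are proximal and distal reads.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return math.comb(row1, k) * math.comb(row2, col1 - k) / math.comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def relative_change(prox1, dist1, prox2, dist2, pseudo=1):
    """ASSUMED definition of RC: log2 fold change of the distal/proximal ratio."""
    ratio1 = (dist1 + pseudo) / (prox1 + pseudo)
    ratio2 = (dist2 + pseudo) / (prox2 + pseudo)
    return math.log2(ratio2 / ratio1)

# A gene passes when |RC| and the p-value both meet the chosen cutoffs.
rc = relative_change(80, 20, 20, 80)
pv = fisher_exact_two_sided(80, 20, 20, 80)
is_switching = abs(rc) > 1 and pv < 0.05
```

Under this assumed definition, a positive RC would indicate a shift toward the distal PAC (a longer 3' UTR) in the second sample.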

To identify statistically significant signal patterns and sequence logos in a given poly(A) signal region, we applied an oligo analyzer called regulatory sequence analysis tools, or RSAT (Thomas-Chollier, et al., 2008). We classified poly(A) sites of each species into four groups based on their genomic locations (3' UTR, CDS, intron, and 5' UTR) and then obtained 50 top-ranked signal patterns by RSAT for each signal element in each group of poly(A) sites.

Note: The 3' UTR switching results may vary greatly with different poly(A) site data (normalized or not), different strategies, or different filtering criteria. We strongly recommend that the 3' UTR switching results provided in PlantAPAdb be used as a guide only.
Sample list  Download
Organism	Switch group label	Environment condition	Plant type	Protocol	Tissue	Reference
arabidopsis_thaliana	Buffer Control	Control	Wild type	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	Seedling Control	Control	Wild type	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	Cordycepin 2h	Cordycepin	Wild type	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	FLAG-RPL18 Control	Control	FLAG-RPL18	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	FLAG-RPL18 Hypoxia 2h	Hypoxia	FLAG-RPL18	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	Hypoxia 2h	Hypoxia	Wild type	PAT-seq	Seedling	de Lorenzo et al. The Plant Cell, 2017.
arabidopsis_thaliana	Seeding fpa DRS 2012	NA	fpa	DRS	Seed	Duc et al. PLoS genetics, 2013.
arabidopsis_thaliana	Seeding Wild Type DRS 2012	NA	Wild type	DRS	Seed	Duc et al. PLoS genetics, 2013.
arabidopsis_thaliana	Auxin	Auxin	Wild type	PAT-seq	Seed	Hong et al. The Plant Journal, 2018.
arabidopsis_thaliana	Mock Control	Control	Wild type	PAT-seq	Seed	Hong et al. The Plant Journal, 2018.
arabidopsis_thaliana	Amp311	NA	Amp311	PAT-seq	Leaf	Lin et al. The Plant Journal, 2017.
arabidopsis_thaliana	Esp5	NA	Esp5	PAT-seq	Leaf	Lin et al. The Plant Journal, 2017.
arabidopsis_thaliana	CPSF30	NA	CPSF30	PAT-seq	Root	Liu et al. PloS one, 2014.
arabidopsis_thaliana	CPSF30m	NA	CPSF30m	PAT-seq	Root	Liu et al. PloS one, 2014.
arabidopsis_thaliana	Root Oxt6 Mutant	NA	Oxt6	PAT-seq	Root	Liu et al. PloS one, 2014.
arabidopsis_thaliana	Root Wild Type	Control	Wild type	PAT-seq	Root	Liu et al. PloS one, 2014.
arabidopsis_thaliana	Leaf Oxt6 Mutant	NA	Oxt6	PAT-seq	Leaf	Thomas et al. The Plant Cell, 2012.
arabidopsis_thaliana	Leaf Wild Type	Control	Wild type	PAT-seq	Leaf	Thomas et al. The Plant Cell, 2012.
arabidopsis_thaliana	Hlp1 Mutant	NA	Hlp1	PAS-seq	Seedling	Zhang et al. Cell research, 2015.
arabidopsis_thaliana	Seedling Wild Type Control	Control	Wild type	PAS-seq	Seedling	Zhang et al. Cell research, 2015.
arabidopsis_thaliana	Fip1 Mutant	NA	Fip1	PAT-seq	Seed	Tellez-Robledo et al. The Plant Journal, 2019.
arabidopsis_thaliana	Salt-treated Fip1 Mutant	Salt-treated	Fip1	PAT-seq	Seed	Tellez-Robledo et al. The Plant Journal, 2019.
arabidopsis_thaliana	Wild Type Control	Control	Wild type	PAT-seq	Seed	Tellez-Robledo et al. The Plant Journal, 2019.
arabidopsis_thaliana	Salt-treated Wild Type	Salt-treated	Wild type	PAT-seq	Seed	Tellez-Robledo et al. The Plant Journal, 2019.
bamboo	Latera Bud	NA	Wild type	PAS-seq	Root	Wang et al. Plant J, 2017.
bamboo	New Shoot Tip	NA	Wild type	PAS-seq	Root	Wang et al. Plant J, 2017.
bamboo	Rhizome Tip	NA	Wild type	PAS-seq	Root	Wang et al. Plant J, 2017.
chlamydomonas_reinhardtii	High Salt Acetate	High Salt Acetate	Wild type	PAT-seq	Media_grown	Bell et al. PloS one, 2016.
chlamydomonas_reinhardtii	High Salt	High Salt	Wild type	PAT-seq	Media_grown	Bell et al. PloS one, 2016.
chlamydomonas_reinhardtii	Tris Phosphate Acetate	Tris Phosphate Acetate	Wild type	PAT-seq	Media_grown	Bell et al. PloS one, 2016.
chlamydomonas_reinhardtii	Tris Phosphate	Tris Phosphate	Wild type	PAT-seq	Media_grown	Bell et al. PloS one, 2016.
oryza_sativa_indica_group	20 Days Leaf	NA	Wild type	PAT-seq	Leaf	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	5 Days Root	NA	Wild type	PAT-seq	Root	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	60 Days Leaf	NA	Wild type	PAT-seq	Leaf	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	60 Days Root	NA	Wild type	PAT-seq	Root	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	60 Days Stem	NA	Wild type	PAT-seq	Stem	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Anther	NA	Wild type	PAT-seq	Anther	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Dry Seed	NA	Wild type	PAT-seq	Seed	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Embryo	NA	Wild type	PAT-seq	Embryo	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Endosperm	NA	Wild type	PAT-seq	Endosperm	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Husk	NA	Wild type	PAT-seq	Husk	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Imbibed Seed	NA	Wild type	PAT-seq	Seed	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Mature Pollen	NA	Wild type	PAT-seq	Pollen	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Pistil	NA	Wild type	PAT-seq	Pistil	Zhou et al. The Plant Journal, 2019.
oryza_sativa_indica_group	Seeding Shoot	NA	Wild type	PAT-seq	Seedling	Zhou et al. The Plant Journal, 2019.
oryza_sativa_japonica_group	20 Days Leaf	NA	Wild type	PAT-seq	Leaf	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	5 Days Root	NA	Wild type	PAT-seq	Root	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	60 Days Leaf	NA	Wild type	PAT-seq	Leaf	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	60 Days Root	NA	Wild type	PAT-seq	Root	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	60 Days Stem	NA	Wild type	PAT-seq	Stem	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Anther	NA	Wild type	PAT-seq	Anther	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Dry Seed	NA	Wild type	PAT-seq	Seed	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Embryo	NA	Wild type	PAT-seq	Embryo	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Endosperm	NA	Wild type	PAT-seq	Endosperm	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Husk	NA	Wild type	PAT-seq	Husk	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Imbibed Seed	NA	Wild type	PAT-seq	Seed	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Mature Pollen	NA	Wild type	PAT-seq	Pollen	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Pistil	NA	Wild type	PAT-seq	Pistil	Fu et al. Genome research, 2016.
oryza_sativa_japonica_group	Seedling Shoot	NA	Wild type	PAT-seq	Seedling	Fu et al. Genome research, 2016.
trifolium_pratense	Flower	NA	Wild type	PAT-seq	Flower	Chakrabarti et al. Scientific reports, 2018.
trifolium_pratense	Leaf	NA	Wild type	PAT-seq	Leaf	Chakrabarti et al. Scientific reports, 2018.
trifolium_pratense	Root	NA	Wild type	PAT-seq	Root	Chakrabarti et al. Scientific reports, 2018.
Note: Samples without at least two replicates were not used for APA switching analysis.

Parameters
Regulation: shorter/longer means that only switching events with a shorter/longer 3' UTR in group 2 are retained.
Fisher's exact test p-value: the p-value cutoff to filter significant switching events (for methods of DESeq2).
Log fold change: the cutoff of log fold change of the expression levels of poly(A) sites between the two samples (for methods of DESeq2).
Adjusted p-value: the cutoff of the adjusted p-value of the chi-squared test for trend in proportions (for linear trend method).
Description of the APA switching table
gene Gene id
nPAC Number of poly(A) sites in the gene
geneTag1 Number of total reads of the gene in the first sample
geneTag2 Number of total reads of the gene in the second sample
avgUTRlen1 Average 3' UTR length of the gene in the first sample
avgUTRlen2 Average 3' UTR length of the gene in the second sample
fisherPV Fisher's exact test p-value (for methods of DESeq2)
logFC Log fold change of the expression levels of poly(A) sites between the two samples (for methods of DESeq2)
padj Adjusted p-value (for linear trend method)
cor Correlation value (for linear trend method)
logRatio Log ratio of the expression levels of poly(A) sites between the two samples (for linear trend method)
Change Longer means the 3' UTR is longer in the second sample; Shorter means the 3' UTR is shorter in the second sample
PAs1 The expression levels of poly(A) sites in the first sample
PAs2 The expression levels of poly(A) sites in the second sample
Reference
  • Fu, Y., et al. (2011) Differential genome-wide profiling of tandem 3' UTRs among human breast cancer and normal cells by high-throughput sequencing, Genome Res., 21, 741-747.
  • Lianoglou, S., et al. (2013) Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression, Genes Dev., 27, 2380-2396.
  • Fu, H., et al. (2016) Genome-wide dynamics of alternative polyadenylation in rice, Genome Res., 26, 1753-1760.
  • Arefeen, A., et al. (2018) TAPAS: Tool for Alternative Polyadenylation Site Analysis, Bioinformatics, bty110-bty110.
  • Ye, C., et al. (2018) APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data, Bioinformatics, 34, 1841-1849.
  • Zhou, Q., et al. (2019) Differential alternative polyadenylation contributes to the developmental divergence between two rice subspecies Japonica and Indica, The Plant Journal, 98, 260-276.
Users can quickly access the PAC browser by clicking the "PAC browse" tab in the main menu or the "View" link in a PAC list.

Figure 5. Web page of the browser.


One or more data sets from each plant species can be quickly loaded and graphically browsed online by selecting the checkboxes of the data sets in the "Available Tracks" panel. Users can search for a gene or chromosome fragment to zoom in on particular PAC regions. Data tracks of PACs from different cells, tissues or conditions can be displayed in sync with tracks of PATs, offering a more intuitive way to explore and compare the usage of PACs among different samples. Users can download the data of one or more tracks to their local computers.

Figure 6. Right-click context menu on a gene model or PAC

In addition to cataloging PACs from individual samples, PlantAPAdb also provides the PAC lists of pooled samples for bulk download. For the pooled data, different files are available to meet users' needs. The file in BED format records basic information such as chromosome, strand, coordinate, and total number of reads for each PAC. The file in text format tabulates the full information of each PAC, including the total read count, the raw and TPM-normalized read counts in each individual sample and replicate, the respective gene, the genomic location (CDS, intron, 3' UTR, 5' UTR, intergenic, etc.), and the distance to neighboring genes (if the PAC is located in an intergenic region). Moreover, files of heterogeneous cleavage sites in BED format are also provided, which allow users to inspect polyadenylation at higher resolution.

1) All poly(A) site clusters in BED format
This file contains all poly(A) site clusters (PACs) with the information of chromosome, strand, coordinate, and score (raw read count from the pooled samples) for each PAC. Internal priming artifacts were removed. Cleavage sites within 24 nt of each other were grouped into one PAC.

2) All poly(A) site clusters with annotation (raw count)
This file contains full information of all PACs. The expression level (number of supported reads) for each PAC in each experiment is given. The annotation including the gene, gene type, genomic region, number of cleavage sites etc. for each PAC is also given.
ftr Genomic region of the PAC, including three prime utr (3'UTR), five prime utr (5'UTR), cds, intron, exon, and intergenic.
gene_id Gene id
biotype Gene type, such as protein_coding, long non-coding RNA (lncRNA), non-coding RNA (ncRNA), tRNA.
ftr_start Start coordinate of the genomic region ("ftr" column).
ftr_end End coordinate of the genomic region ("ftr" column).
upstream_id 5' gene id of a PAC, only valid when the PAC is located in intergenic region.
upstream_start Start coordinate of the 5' gene, only valid when the PAC is located in intergenic region.
upstream_end End coordinate of the 5' gene, only valid when the PAC is located in intergenic region.
downstream_id Same as upstream_id, except for the 3' gene.
downstream_start Same as upstream_start, except for the 3' gene.
downstream_end Same as upstream_end, except for the 3' gene.

3) All poly(A) site clusters with annotation (TPM count)
This file contains full information of all PACs, with expression levels normalized to tags per million (TPM).
Given a PAC table pac_sample, the TPM normalization is performed in R as follows:
library_size <- colSums(pac_sample)
pac_sample_nor <- t(t(pac_sample)/library_size*10^6)
pac_sample_nor <- as.data.frame(pac_sample_nor )
4) High confidence poly(A) site clusters in BED format
This file contains only high confidence PACs which are expressed (TPM>=1) in at least two experiments.
5) High confidence poly(A) site clusters with annotation (raw count)
This file contains only high confidence PACs with full annotation.
6) All cleavage sites
This file contains all cleavage sites, which has four columns: chromosome, strand, coordinate, and number of supported reads. This file was used for generating poly(A) site clusters.
7) Full sample list
This file contains the information of the full sample list in PlantAPAdb, including the SRR id, sample name, environmental condition, plant type, tissue, sequencing protocol, reference, read statistics.
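The relation between the TPM-normalized file and the high confidence files listed above can be sketched as follows. The dict-of-lists layout and the function names are assumptions for illustration; the thresholds (TPM>=1 in at least two experiments) are taken from the description of the high confidence PACs.

```python
def tpm_normalize(counts):
    """Scale each experiment (column) to one million tags.

    counts: dict mapping PAC id -> list of raw read counts per experiment.
    """
    n_exp = len(next(iter(counts.values())))
    library_size = [sum(row[j] for row in counts.values()) for j in range(n_exp)]
    return {
        pac: [c * 1e6 / library_size[j] if library_size[j] else 0.0
              for j, c in enumerate(row)]
        for pac, row in counts.items()
    }

def high_confidence(tpm, min_tpm=1.0, min_experiments=2):
    """Ids of PACs expressed (TPM >= min_tpm) in at least min_experiments."""
    return [pac for pac, row in tpm.items()
            if sum(1 for v in row if v >= min_tpm) >= min_experiments]
```

In this sketch a PAC with a single read in a library of one million tags has TPM = 1 in that experiment, so it counts as expressed there but still needs a second experiment to qualify as high confidence.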
Pipeline: Genome-wide identification of polyadenylation sites from 3'-end sequencing data in plants.
1 Overview
PlantAPAdb identifies poly(A) sites with a set of bioinformatics tools. We developed a framework for genome-wide identification of poly(A) sites from 3'-end sequencing (3'-seq) data in plants. Scripts for APA analyses in PlantAPAdb can be downloaded here.
2 Before You Start
2.1 Software Installation
Tools version:
fastq-dump : 2.9.6
FastQC v0.11.8
multiqc, version 1.7
Trimmomatic-0.38
STAR_2.6.0a
bedtools v2.27.1
bedmap version:  2.4.35 (typical)
samtools 1.8
Install the following tools:
An alignment tool, such as STAR (recommended), Bowtie/Bowtie2, or TopHat.
# Get latest STAR source from releases
wget https://github.com/alexdobin/STAR/archive/2.7.1a.tar.gz
tar -xzf 2.7.1a.tar.gz
cd STAR-2.7.1a

# Alternatively, get STAR source using git
git clone https://github.com/alexdobin/STAR.git
Make sure Perl and the related modules (details in our Perl scripts) are installed on your computer
#The Perl version can be checked with the command
perl --version
#install Perl modules on Linux/Unix/Mac OS
#upgrade cpan
perl -MCPAN -e shell
#Install or upgrade Module::Build, and make it your preferred installer
cpan>install Module::Build
cpan>o conf prefer_installer MB
cpan>o conf commit
#install Bio::SeqIO module
cpan>install Bio::SeqIO
Make sure bedtools and bedmap are installed on your computer
#install bedtools on Debian/Ubuntu
apt-get install bedtools
#install bedmap (part of BEDOPS)
wget -c https://github.com/bedops/bedops/releases/download/v2.4.36/bedops_linux_x86_64-v2.4.36.tar.bz2
tar jxvf bedops_linux_x86_64-v2.4.36.tar.bz2
cp bin/* /usr/local/bin
                                    
[Optional] You can install SRA Toolkit to download raw sequence data, and then use Trimmomatic or FASTX for quality trimming and adapter removal.
#Install SRA Toolkit
wget -c https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.6-1/sratoolkit.2.9.6-1-ubuntu64.tar.gz
tar zxvf sratoolkit.2.9.6-1-ubuntu64.tar.gz
cp sratoolkit.2.9.6-1-ubuntu64/bin/* /usr/local/bin
#Install Trimmomatic
wget -c http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
unzip Trimmomatic-0.39.zip
2.2 Preparation of Sequence Data
Prepare the following sequence data
A reference genome. You can download the reference genome (fasta format) from public databases, such as Ensembl, NCBI and UCSC. In addition, the annotation file (gtf/gff3 format) can also be downloaded. For example, the reference genome of Arabidopsis thaliana (TAIR10) is obtained from Ensembl.
#download reference genome
wget -c ftp://ftp.ensemblgenomes.org/pub/plants/release-43/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
#download annotation file in gff3 format
wget -c ftp://ftp.ensemblgenomes.org/pub/plants/release-43/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.43.gff3.gz
#decompress both files before building the genome index
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz Arabidopsis_thaliana.TAIR10.43.gff3.gz

3'-seq data. 3'-seq data (fastq format) can be downloaded from public sequence repositories. For example, the Arabidopsis thaliana data (SRR5055885/SRR5055884/SRR5055883) are downloaded from the Sequence Read Archive with the SRA Toolkit.
#download sra
prefetch SRR5055885 SRR5055884 SRR5055883
#convert to fastq format
fastq-dump -O /output-path/ SRR5055885 SRR5055884 SRR5055883
3 Analysis Workflow
To illustrate the analysis workflow, the PAT-seq data (SRR5055885/SRR5055884/SRR5055883) from Arabidopsis thaliana are taken as an example. The data include three replicates, and the reads contain poly(A) tails.
3.1 Data Processing
MAP_findTailAT.pl is an in-house Perl script for identifying reads with valid poly(A) tail and trimming poly(A) tail.
Usage
perl MAP_findTailAT.pl -h
For example,
perl MAP_findTailAT.pl -in ./SRR5055885.fastq -poly T -ml 25 -mp 6 -mg 5 -mm 2 -mr 2  -mtail 6 -debug F -bar 8 -odir "/output-path/" -suf "SRP093950"
#######################################################################
#-in=input fa or fq file
#-poly=A/T/A&T/A|T. -poly=A if the sequence with As tail, -poly=T if the sequence with Ts tail
#-ml=min length after poly(A)trimming
#-mp=min length of successive poly (default=8)
#-mg=margin from the start (poly=T) or to the end (poly=A) (=5)
#-mm=mismatch between TNNTTT or ANNAAA (default=2)
#-mr=minT in reg (default=3)
#-mtail=min length of trimmed tail (default=8)
#-bar=the length of barcode if there is barcode
#-odir=output path (default is the same as input)
#-suf=suffix, default: xx.suf.T/A.fq
#-debug=T/F (default=T)
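To give an idea of what the main options control, the following Python sketch mimics a strongly simplified version of the tail-finding step for reads that start with a poly(T) stretch (-poly T). It is not a port of MAP_findTailAT.pl: mismatch handling (-mm, -mr) and the -mtail check are omitted, and the function name trim_poly_t and its argument names are hypothetical stand-ins for -bar, -mp, -mg and -ml.

```python
import re

def trim_poly_t(read, barcode_len=8, min_poly=6, margin=5, min_len=25):
    """Return the read without barcode and leading poly(T), or None if rejected.

    barcode_len ~ -bar, min_poly ~ -mp, margin ~ -mg, min_len ~ -ml.
    """
    seq = read[barcode_len:]  # drop the barcode first
    match = re.search("T{%d,}" % min_poly, seq)
    # The T stretch must start within `margin` bases of the read start.
    if match is None or match.start() > margin:
        return None
    trimmed = seq[match.end():]
    # Discard reads that are too short after trimming.
    return trimmed if len(trimmed) >= min_len else None

read = "ACGTACGT" + "T" * 10 + "GATTACA" * 5
print(trim_poly_t(read))
```

Reads without a qualifying T stretch, or with too little sequence left after trimming, return None in this sketch, which corresponds to discarding them.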
Then we can use Trimmomatic to remove adapter sequences and low quality reads.
java -jar /path/Trimmomatic-0.38/trimmomatic-0.38.jar SE -threads 16 -phred33 ./SRR5055885.SRP093950.T.fq ./clean.SRR5055885.fq ILLUMINACLIP:/path/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:10 MINLEN:25
3.2 Sequence alignment with STAR
STAR is one of the most popular tools for sequence alignment. First, we need to generate genome indices.
STAR --runMode genomeGenerate \
   --runThreadN 16  \
   --genomeFastaFiles ./Arabidopsis_thaliana.TAIR10.dna.toplevel.fa \
   --sjdbGTFfile ./Arabidopsis_thaliana.TAIR10.43.gff3 \
   --sjdbGTFtagExonParentTranscript Parent \
   --genomeDir    ./index
Then we can map clean PAT-seq data to the reference genome.
STAR --runThreadN 20 \
     --readFilesIn ./clean.SRR5055885.fq \
     --genomeDir ./index \
     --outFileNamePrefix ./leaf-esf-rep1 \
     --outMultimapperOrder Random \
     --outFilterMultimapNmax 1
The output file "leaf-esf-rep1Aligned.out.sam" is the alignment result (SAM format). For the data SRR5055884 and SRR5055883, we will get the alignment results "leaf-esf-rep2Aligned.out.sam" and "leaf-esf-rep3Aligned.out.sam", respectively.
3.3 Identification of Polyadenylation Sites
PAT2PA2PAC.sh is a shell script that identifies poly(A) sites by using in-house Perl scripts, bedtools and bedmap. This script mainly consists of two steps:
1) Identifying poly(A) sites by filtering out internal-priming reads;
2) grouping poly(A) tags into poly(A) site clusters (PAC).
Usage:
bash  /PAT2PA2PAC.sh  genome   /input-path/ /output-path/   distance  input-file
For example,
bash  /PAT2PA2PAC.sh  Arabidopsis_thaliana.TAIR10.dna.toplevel.fa /input-sam-file-path/  /output-path/  24  "*.out.sam"
or
bash  /PAT2PA2PAC.sh  Arabidopsis_thaliana.TAIR10.dna.toplevel.fa /input-sam-file-path/  /output-path/  24  "leaf-esf-rep1Aligned.out.sam  leaf-esf-rep2Aligned.out.sam   leaf-esf-rep3Aligned.out.sam "
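The grouping step (step 2, with the distance argument set to 24 in the examples above) can be illustrated with a short Python sketch. It is not the code used by PAT2PA2PAC.sh: the function name cluster_sites is hypothetical, "within 24 nt" is interpreted as a gap of at most 24 nt between neighboring sites, and taking the most supported site as the representative coordinate is an assumption.

```python
def cluster_sites(sites, max_dist=24):
    """Group cleavage sites of one chromosome and strand into PACs.

    sites: list of (coordinate, read_count) tuples.
    A site joins the current cluster when it lies at most max_dist nt
    downstream of the previous site in coordinate order.
    """
    clusters = []
    for coord, reads in sorted(sites):
        if clusters and coord - clusters[-1][-1][0] <= max_dist:
            clusters[-1].append((coord, reads))
        else:
            clusters.append([(coord, reads)])
    pacs = []
    for members in clusters:
        # Assumption: the most supported site represents the PAC.
        ref_coord, ref_reads = max(members, key=lambda site: site[1])
        pacs.append({
            "start": members[0][0],
            "end": members[-1][0],
            "n_sites": len(members),
            "reads": sum(r for _, r in members),
            "coord": ref_coord,
            "ref_reads": ref_reads,
        })
    return pacs

pacs = cluster_sites([(5743, 1), (5744, 3), (5747, 2), (5750, 1), (5850, 2)])
```

Because sites are chained, a PAC can span more than 24 nt as long as every gap between neighboring sites stays within the distance threshold.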
4 Output
The above workflow will produce multiple output files.
a) all.PAC.header – contains the column names of the output file "all.PAC.PATcount".
"chr" - is the name of the chromosome or scaffold;
"UPA_start" - is the start position of the PAC, with sequence numbering starting at 1.
"UPA_end" - is the end position of the PAC, with sequence numbering starting at 1.
"strand" - is the direction of the PAC, defined as + (forward) or - (reverse)
"PAnum" - is the number of poly(A) sites in the PAC region.
"tot_tagnum" - is the total expression level of all samples in the PAC region (count).
"coord" - represents the reference poly(A) site in the PAC region.
"refPAnum" - is the number of poly(A) sites at this "coord"
"sampleName" - is the PAC expression level of each sample (count).
>cat all.PAC.header
chr	UPA_start	UPA_end	strand	PAnum	tot_tagnum	coord	refPAnum leaf-esf-rep1Aligned.out.sam  leaf-esf-rep2Aligned.out.sam   leaf-esf-rep3Aligned.out.sam
b) all.PAC.PATcount – is the PAC result, and the file "all.PAC.header" provides the corresponding column names
>head all.PAC.PATcount
1	227	 227	+	1	1	227	1	0	0	1
1	5743	 5750	+	4	7	5744	3	1	6	0
1	5850	 5852	+	3	4	5850	2	1	1	2
1	5888	 5916	+	5	14	5895	8	1	9	4
c) all.PAC.info – is the part of "all.PAC.PATcount" with first 8 columns, including "chr", "UPA_start", "UPA_end", "strand", "PAnum", "tot_tagnum", "coord", "refPAnum".
>head  all.PAC.info
1	227	227	+	1	1	227	1
1	5743	5750	+	4	7	5744	3
1	5850	5852	+	3	4	5850	2
1	5888	5916	+	5	14	5895	8
d) all.PAC.PAcount – is the number of poly(A) sites for each sample in the PAC region, corresponding columns are "chr", "UPA_start", "UPA_end", "strand", "PAnum", "tot_tagnum", "coord", "refPAnum", "samplePAnum".
1	227	227	+	1	1	227	1	0	0	1
1	5743	5750	+	4	7	5744	3	1	3	0
1	5850	5852	+	3	4	5850	2	1	1	2
1	5888	5916	+	5	14	5895	8	1	4	2
e) all.PA.uniq.bed – is poly(A) sites information in standard bed format.
1	226	227	.	 1	+
1	1911	1912	.	 1	-
1	4647	4648	.	 1	-
1	5716	5717	.	 1	+
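Because the column names are stored separately in all.PAC.header, the two files have to be combined when loading the PAC table. The following Python sketch shows one way to do this on the example rows above; the function name read_pac_table is hypothetical, and splitting on arbitrary whitespace is an assumption that holds as long as chromosome and sample names contain no spaces.

```python
def read_pac_table(header_line, body_text):
    """Combine the header line with the rows of all.PAC.PATcount.

    Returns (sample_names, rows); each row is a dict keyed by column name.
    """
    columns = header_line.split()
    text_columns = {"chr", "strand"}  # everything else is numeric
    rows = []
    for line in body_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != len(columns):
            raise ValueError("column count does not match header: " + line)
        rows.append({name: value if name in text_columns else int(value)
                     for name, value in zip(columns, fields)})
    # The first 8 columns describe the PAC; the rest are per-sample counts.
    return columns[8:], rows

header = ("chr UPA_start UPA_end strand PAnum tot_tagnum coord refPAnum "
          "leaf-esf-rep1Aligned.out.sam leaf-esf-rep2Aligned.out.sam "
          "leaf-esf-rep3Aligned.out.sam")
body = "1\t5743\t5750\t+\t4\t7\t5744\t3\t1\t6\t0\n1\t5888\t5916\t+\t5\t14\t5895\t8\t1\t9\t4\n"
samples, rows = read_pac_table(header, body)
```

In the example rows, the per-sample counts of each PAC add up to its tot_tagnum value, which can serve as a quick sanity check after loading.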
5 Annotation of poly(A) sites
Poly(A) sites were annotated with their respective genes, genomic locations, etc. based on the latest genome annotations. An R script based on the GenomicFeatures R package was implemented to uniformly process the genome annotation file in GFF3 format. Annotations for both protein coding genes and non-coding genes were parsed. PACs were then annotated based on their genomic locations, i.e., 3' UTR, coding sequence (CDS) for protein coding genes, intron and exon for non-coding genes, and intergenic region. To resolve annotation ambiguity owing to multiple transcripts from the same gene, we annotated a PAC based on the following priority: 3' UTR, CDS, intron. In particular, if a PAC is located in an intergenic region, we recorded the neighboring genes, its distance from the 3' end of the nearby 5' gene, and its distance from the 5' end of the nearby 3' gene. This strategy for annotating intergenic PACs provides the flexibility to recruit PACs falling within extended 3' UTR regions.

Extended 3' UTR is defined as the region downstream of the annotated 3' UTR or of the gene end (if there is no annotated 3' UTR). In PlantAPAdb, the length of the extended 3' UTR region is defined as twice the average length of annotated 3' UTRs. According to the genome annotations used in PlantAPAdb, the average 3' UTR lengths are: arabidopsis_thaliana = 276, chlamydomonas_reinhardtii = 850, medicago_truncatula = 401, oryza_sativa_japonica_group = 322, oryza_sativa_indica_group = 322, trifolium_pratense = 207, bamboo = 498.

For APA switching analyses, PACs located in extended 3' UTR were also considered as 3' UTR PACs for identifying 3' UTR lengthening or shortening.
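The annotation priority and the extended 3' UTR rule described in this section can be summarized in a small Python sketch. The region labels ("3UTR", "CDS", "intron"), the function names, and the fallback for regions outside the priority list are assumptions for illustration; the average 3' UTR lengths are the values listed above.

```python
AVERAGE_UTR_LENGTH = {
    "arabidopsis_thaliana": 276,
    "chlamydomonas_reinhardtii": 850,
    "medicago_truncatula": 401,
    "oryza_sativa_japonica_group": 322,
    "oryza_sativa_indica_group": 322,
    "trifolium_pratense": 207,
    "bamboo": 498,
}
PRIORITY = ("3UTR", "CDS", "intron")  # hypothetical region labels

def resolve_region(regions):
    """Pick one region when transcripts of a gene disagree about a PAC."""
    for region in PRIORITY:
        if region in regions:
            return region
    # Fallback (assumption): any other single label, else intergenic.
    return next(iter(sorted(regions)), "intergenic")

def in_extended_utr(distance_to_upstream_gene_end, species):
    """True if an intergenic PAC lies within twice the average 3' UTR length."""
    limit = 2 * AVERAGE_UTR_LENGTH[species]
    return 0 < distance_to_upstream_gene_end <= limit
```

For Arabidopsis thaliana, for instance, the extended region is 2 x 276 = 552 nt, so an intergenic PAC 500 nt downstream of a gene end would be treated as a 3' UTR PAC in the switching analyses, while one 600 nt downstream would not.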