Methods
Origin of Data
The data of the Sponge Genetree Server (SGS) are taken at regular intervals from public databases such as NCBI Genbank, EMBL-EBI ENA and the Sponge Barcoding Project.
Additionally, the phylogenetic information of selected
Sponge Barcoding Project sequences may be visualized here prior to their submission to Genbank.
Taxon Set
The current version focusses particularly on Demospongiae; however, selected gene trees of the other main lineages (Calcarea, Homoscleromorpha and Hexactinellida) are also provided. The demosponge tree is rooted with sequences from homoscleromorph or calcarean sponges. The data analysed for the Sponge Genetree Server are derived from a master alignment for every gene. From this master alignment, individual partitions are extracted and analysed separately. For concatenation, sequences from identical specimens were identified as far as possible, based on voucher numbers submitted to Genbank.
Taxon Labels
The taxon names are given as in Genbank, including occasional haplotype designations. Names are followed by the Genbank accession number. To save calculation time and to reduce the size of trees, sequences with identical species names AND identical nucleotide sequences are merged into a single taxon whose label lists all accession numbers, separated by underscores. To prevent overlong taxon names, the accession numbers of merged taxa are abbreviated to their last variable digits. Example: "Aiolochroia_crassa_EF519553_4_6_7" consists of "Aiolochroia crassa"-labelled sequences EF519553, EF519554, EF519556 and EF519557. "Scopalina_ruetzleri_EF519669_72" consists of "Scopalina ruetzleri"-labelled sequences EF519669 and EF519672. The letter "t" in the accession number stands for "to" and indicates a range of accession numbers. Example: "Aplysina_cauliformis_EF519564t6_8" consists of "Aplysina cauliformis"-labelled sequences EF519564, EF519565, EF519566 and EF519568. Sequences that differ only by unresolved nucleotides have been regarded as different.
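The abbreviation scheme for merged labels can be decoded mechanically. The following Python sketch is one possible reading of the convention and is not code used by the SGS. The function name expand_merged_label, the assumed accession shape (one or two capital letters followed by five or six digits) and the rule that each abbreviated suffix replaces the trailing digits of the accession number preceding it are all assumptions made for illustration.

```python
import re

# Assumed accession shape: one or two capital letters, five or six digits,
# optionally followed by "t" and the abbreviated end of a range.
ACCESSION = re.compile(r"([A-Z]{1,2})(\d{5,6})(?:t(\d+))?")
SUFFIX = re.compile(r"(\d+)(?:t(\d+))?")


def fill(base, suffix):
    """Replace the trailing digits of base with the abbreviated suffix."""
    return base[: len(base) - len(suffix)] + suffix


def expand_merged_label(label):
    """Return (species name, full accession numbers) for a merged label."""
    tokens = label.split("_")
    # The first token shaped like a full accession ends the species name.
    start = next(i for i, tok in enumerate(tokens) if ACCESSION.fullmatch(tok))
    species = " ".join(tokens[:start])
    prefix, first, range_end = ACCESSION.fullmatch(tokens[start]).groups()
    width = len(first)
    accessions = []
    rest = iter(tokens[start + 1:])
    while True:
        # A "t" suffix marks the end of a range; otherwise the range is one number.
        last = fill(first, range_end) if range_end else first
        for number in range(int(first), int(last) + 1):
            accessions.append(f"{prefix}{number:0{width}d}")
        tok = next(rest, None)
        if tok is None:
            return species, accessions
        match = SUFFIX.fullmatch(tok)
        if match is None:
            raise ValueError(f"unexpected token {tok!r} in {label!r}")
        # Abbreviated digits are completed from the previous accession number.
        first = fill(last, match.group(1))
        range_end = match.group(2)
```

Called with the three merged labels quoted in the examples, the function returns the accession lists given in the text.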
Concatenated sequences are labelled with an "a" (for "and") between their relevant accession numbers. The accession numbers are arranged in 5' > 3' order of the sequences with respect to their position in the complete gene. Example: "Spirastrella_sp_AY626294aAY626335" is a concatenation of AY626294 followed by AY626335 from (apparently) the same Spirastrella specimen.
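A label of a concatenated sequence can be split back into its parts in a similar way. This is a minimal sketch, assuming that accession numbers consist of one or two capital letters followed by five or six digits, so that a lowercase "a" can only be the separator. The function name split_concatenated_label is hypothetical.

```python
import re

# Assumed accession shape: one or two capital letters plus five or six digits.
ACCESSION = re.compile(r"[A-Z]{1,2}\d{5,6}")


def split_concatenated_label(label):
    """Return (taxon name, accession numbers in 5' > 3' order)."""
    name, _, accession_part = label.rpartition("_")
    # A lowercase "a" cannot occur inside an accession of the assumed shape,
    # so splitting on it separates the concatenated parts.
    accessions = accession_part.split("a")
    if len(accessions) < 2 or not all(ACCESSION.fullmatch(a) for a in accessions):
        raise ValueError(f"not a concatenated label: {label!r}")
    return name.replace("_", " "), accessions
```

For "Spirastrella_sp_AY626294aAY626335" this yields the taxon name "Spirastrella sp" and the accession numbers AY626294 and AY626335, in that order.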
Since 2010, new sequences are no longer merged, owing to advances in tree reconstruction algorithms. Nor are sequences concatenated any longer; they are analysed "as is". This has the advantage that species with many sequences can be easily identified.
Character Sets
The current version comprises CO1 data of both the "DNA-Barcoding" fragment and the I3M11 region [1]. LSU-rDNA data (28S) of different fragments, reconstructed under secondary-structure-specific models, are included, as are SSU-rDNA data (18S). Additional genes will be included when considerable taxon sets are available. Gaps between concatenated sequences are filled with question marks.
Ribosomal data are only included if a considerable number of characters is present. Numerous ITS studies include a few dozen base pairs of LSU and SSU sequence as anchoring regions for their primers. Sequences of less than 300 bp were therefore omitted from the analyses.
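The length criterion can be illustrated with a short Python sketch. Only the 300 bp threshold comes from the text. The function names, the example records and the decision to ignore gap ("-") and missing-data ("?") characters when counting are assumptions for illustration.

```python
MIN_LENGTH_BP = 300  # threshold stated in the text


def sequence_length(sequence):
    """Count characters that are neither gaps nor missing-data marks."""
    return sum(1 for ch in sequence if ch not in "-?")


def filter_short(records, min_length=MIN_LENGTH_BP):
    """Keep (label, sequence) pairs with at least min_length characters."""
    return [(label, seq) for label, seq in records
            if sequence_length(seq) >= min_length]


# Hypothetical records: a short primer-anchor fragment and a longer partition.
records = [
    ("hypothetical_LSU_anchor", "ACGT" * 20),
    ("hypothetical_LSU_partition", "ACGT" * 100 + "--??"),
]
kept = filter_short(records)
```

With these records only the longer partition is kept, because the anchor fragment falls below the threshold.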
In the course of this project some demosponge LSU sequences have been identified as pseudogenic. As pseudogenes may not be subject to concerted evolution, they are not suitable for phylogenetic inference and have therefore not been included in the analysis. Such sequences have been identified by their lack of a large range of conserved domains in comparison to congeneric sequences, which possess the full array of domains. However, the absence of one or a few helices in some sponge taxa, particularly Haplosclerida, is not considered suspicious. For further information on those potential pseudogenic sequences please contact the authors.
Hardware
All analyses are carried out on the 2x32 node Unix cluster of the Molecular Geo- & Palaeobiology Labs Munich.
Reconstruction Methods
Only likelihood-based methods are used to reconstruct SGS trees. The following software and parameters have been used for the current version:
CO1 nucleotide analyses:
RAxML 7.0.0 [4]. Rapid bootstrap analysis and search for the best-scoring ML tree in one single run (-f a). GTRMIXI model. 1,000 runs (-# 1000).
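As a rough illustration, the CO1 settings above could be assembled into a RAxML call as in the following Python sketch. The options -f a and -# 1000 and the GTRMIXI model are taken from the text. The options -m, -x, -s and -n are standard RAxML options, but the seed value, the file and run names and the executable name raxmlHPC are assumptions. Newer RAxML releases additionally expect a parsimony seed via -p.

```python
import shlex


def raxml_co1_command(alignment, run_name, seed=12345, executable="raxmlHPC"):
    """Assemble the argument list for a rapid-bootstrap plus ML-search run."""
    return [
        executable,
        "-f", "a",        # rapid bootstrap and best-scoring ML tree in one run
        "-m", "GTRMIXI",  # substitution model named in the text
        "-#", "1000",     # number of bootstrap runs
        "-x", str(seed),  # seed for rapid bootstrapping (assumed value)
        "-s", alignment,  # input alignment file (hypothetical name)
        "-n", run_name,   # suffix of the output files (hypothetical name)
    ]


command = raxml_co1_command("co1_alignment.phy", "co1_sgs")
print(shlex.join(command))  # shell-quoted command line, not executed here
```

The list form can be passed to subprocess.run directly, which avoids shell quoting issues with the "-#" option.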
LSU (28S) and SSU (18S) analyses:
Sequences are aligned in SEAVIEW [7] following published secondary structure models, e.g. [8,9]. Non-alignable regions are omitted from the analyses. The alignment, with the corresponding helix sites defined, is translated into a PHYLIP format readable by PHASE (www.bioinf.manchester.ac.uk/resources/phase/index.html) using 2ANALYSIS [9]. In PHASE, the helix regions are analysed under the RNA7D model [10] and the unpaired regions under the REV model [3]. Analyses are performed with 5,000,000 iterations, of which every 100th tree is sampled. The first 20% of the trees are discarded as burn-in. Following the output options of PHASE, the phylogram and the corresponding cladogram with support values are published separately.
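The sampling scheme implies the following tree counts. This short Python calculation assumes that the 5,000,000 iterations include the burn-in phase and that the 20% refers to the sampled trees; both are readings of the text rather than documented PHASE settings.

```python
iterations = 5_000_000   # from the text
sampling_interval = 100  # every 100th tree is sampled
burnin_percent = 20      # first 20% of the sampled trees are discarded

sampled_trees = iterations // sampling_interval
burnin_trees = sampled_trees * burnin_percent // 100
retained_trees = sampled_trees - burnin_trees

print(sampled_trees, burnin_trees, retained_trees)  # 50000 10000 40000
```

Under these assumptions, support values are summarized from 40,000 trees.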
Post-2010, all analyses are performed with RAxML 7.3.2, using the GTR model for loop sites and the S16 model for helices.
Tree Visualization
The Sponge Genetree Server displays the trees using iTOL (Letunic and Bork 2016). Scale bars refer to substitutions per site.
References:
[1] Erpenbeck, D., Hooper, J.N.A., and Wörheide, G. (2006). CO1 phylogenies in diploblasts and the 'Barcoding of Life' - are we sequencing a suboptimal partition? Molecular Ecology Notes 6, 550-553.
[2] Ronquist, F., and Huelsenbeck, J.P. (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19, 1572-1574.
[3] Tavare, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci 17, 57-86.
[4] Stamatakis, A. (2006). RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688-2690.
[5] Adachi, J., Waddell, P.J., Martin, W., and Hasegawa, M. (2000). Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. Journal of Molecular Evolution 50, 348-358.
[6] Lartillot, N., and Philippe, H. (2004). A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Molecular Biology and Evolution 21, 1095-1109.
[7] Galtier, N., Gouy, M., and Gautier, C. (1996). SEAVIEW and PHYLO_WIN: Two graphic tools for sequence alignment and molecular phylogeny. Comput Appl Biosci 12, 543-548.
[8] Schnare, M.N., Damberger, S.H., Gray, M.W., and Gutell, R.R. (1996). Comprehensive comparison of structural characteristics in eukaryotic cytoplasmic large subunit (23 S-like) ribosomal RNA. Journal of Molecular Biology 256, 701-719.
[9] Voigt, O., Erpenbeck, D., and Wörheide, G. (2008). Molecular evolution of rDNA in early diverging Metazoa: First comparative analysis and phylogenetic application of complete SSU rRNA secondary structures in Porifera. BMC Evol Biol 8, 69.
[10] Tillier, E.R.M., and Collins, R. (1998). High apparent rate of simultaneous compensatory basepair substitutions in ribosomal RNA. Genetics 148, 1993-2002.
[11] Jordan, G.E., and Piel, W.H. (2008). PhyloWidget: web-based visualizations for the tree of life. Bioinformatics 24, 1641-1642.