Research Article
Corresponding author: Alejandra Lorena Goncalves (alejandragoncalves@fceqyn.unam.edu.ar)
Academic editor: Katharina Budde
© 2025 Alejandra Lorena Goncalves, María Victoria García, Emilie Chancerel, Olivier Lepais, Myriam Heuertz.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Goncalves AL, García MV, Chancerel E, Lepais O, Heuertz M (2025) High-throughput sequence-based microsatellite genotyping for the non-model Neotropical tree species Anadenanthera colubrina (Leguminosae). Plant Ecology and Evolution 158(1): 43-52. https://doi.org/10.5091/plecevo.138834
Background and aims – Anadenanthera colubrina is a Neotropical native forest tree species with significant ecological importance in Seasonally Dry Tropical Forests. Developing genetic markers for this species is relevant for conservation, breeding, and evolutionary studies. Previously available genetic markers for A. colubrina consisted of a few microsatellites. Next-generation sequencing (NGS) strategies allow simple and cost-effective development of new SSR loci from low-coverage whole genome shotgun sequencing. The main aim was to develop microsatellite markers for sequence-based high-throughput genotyping (SSRseq) in the species and to characterize their information content against traditional capillary electrophoresis-based microsatellite data by estimating the amount of molecularly accessible size homoplasy of each locus. Additionally, the reliability of these markers for population genetic analysis was assessed by genotyping two age classes (reproductively mature trees and seedlings) in a typical location in Argentina.
Key results – Sixty primer pairs targeting microsatellites were designed, of which 25 were validated with allelic error rates < 3% and genotype missingness < 20%. A significantly higher number of alleles per locus and heterozygosity was detected for SSRseq considering sequence polymorphisms compared to analysing the same data based on sequence size (length) only. Size homoplasy, calculated as the proportion of mismatches between datasets relative to the number of alleles differing in length, averaged 97.85% over all SSR loci. High levels of population genetic diversity were detected in adults and seedlings from Paranaense forests, exceeding those reported in previous studies of A. colubrina using traditional SSRs. The generated datasets increase the resolution of capillary-based microsatellite genotyping, allowing for more accurate inference of eco-evolutionary processes in non-model tree species.
genotyping, multiplex PCR, next-generation sequencing, nuclear microsatellites, size homoplasy, SSRseq
Anadenanthera colubrina (Vell.) Brenan (Leguminosae, Caesalpinioideae) is a non-model Neotropical forest tree species inhabiting Seasonally Dry Tropical Forests (SDTF). This biome, characterized by high plant biodiversity, has a fragmented distribution across Latin America and the Caribbean (
Microsatellites or SSRs (Simple Sequence Repeats) remain one of the most widely used molecular marker types; their codominance and high polymorphism characterize them as robust tools for population genetics analyses of plant species (
Sequence-based microsatellite genotyping (SSRseq) is a new high-throughput, accurate, and rapid technique based on next-generation sequencing (NGS) that detects higher levels of variation than traditional fragment size scoring (
The main aim of this study was to develop new SSRseq loci and generate sequence-based microsatellite data for A. colubrina, for which the only prior genomic data consist of a few dozen SSR flanking regions and some sequenced fragments of the nuclear, chloroplast, and mitochondrial genomes. We characterized the advantages of the SSRseq method over traditional capillary electrophoresis-based microsatellite genotyping by considering nucleotide polymorphisms, and we estimated the amount of molecularly accessible size homoplasy at each locus to support the use of sequencing rather than length polymorphism alone for genotyping. The methodological relevance of these markers was also tested in a biological framework by comparing reproductively mature trees and seedlings from the same population in a typical location of A. colubrina (Paranaense forest, Argentina), a comparison that is especially relevant for fragmented landscapes.
Young leaves from 107 individuals of A. colubrina (adults and seedlings) were collected from four different Argentinean ecoregions: Paranaense forest (n = 95), Yungas (n = 6), Humid Chaco (n = 2), and the Delta and Islands of the Paraná River (n = 4). In the southern region of the Paranaense forests (Santa Ana, Misiones; -27.43372198, -55.579419), where A. colubrina characterizes forest patches within grassland landscapes, two life stages were sampled: 31 reproductively mature trees and 64 seedlings resulting from the germination of seeds collected from the fruits of four mother trees. The seedlings thus represent different sample sizes of four half-sib families and may contain full sibs since a previous study of an A. colubrina population in the same region suggested a selfing rate of 51–56% estimated from the inbreeding coefficient s = 2FIS/(1 + FIS) (
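As a numeric illustration of the relation quoted above, the following minimal sketch converts between the inbreeding coefficient and the equilibrium selfing rate using s = 2FIS/(1 + FIS). The function names are hypothetical, and the FIS values it prints are back-calculated from the 51–56% range for illustration; they are not values taken from the cited study.

```python
def selfing_rate(f_is: float) -> float:
    """Equilibrium selfing rate s = 2*FIS / (1 + FIS)."""
    if f_is <= -1.0:
        raise ValueError("FIS must be greater than -1")
    return 2.0 * f_is / (1.0 + f_is)


def fis_from_selfing(s: float) -> float:
    """Inverse of the relation above: FIS = s / (2 - s)."""
    if s >= 2.0:
        raise ValueError("s must be smaller than 2")
    return s / (2.0 - s)


if __name__ == "__main__":
    # Back-calculate the FIS implied by the selfing rates quoted in the text.
    for s in (0.51, 0.56):
        f_is = fis_from_selfing(s)
        print(f"s = {s:.2f} corresponds to FIS = {f_is:.3f}")
```

Under this relation, a selfing rate slightly above one half corresponds to an inbreeding coefficient of roughly one third, which is why full sibs are plausible within the half-sib families.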
Total genomic DNA was extracted for each individual using the modified cetyl-trimethylammonium bromide (CTAB) method (
SSRseq markers were developed from low-coverage shotgun sequencing of a single library, prepared using the Qiaseq FX DNA library kit (Qiagen, Hilden, Germany). This library was generated from four pooled samples, each representing a different ecoregion (YS113, F134, IT185, SB5). Sequencing was conducted using Illumina MiSeq v.3 (Illumina, San Diego, USA) 2 × 300 bp paired-end sequencing, generating 4,497,218 read pairs. Overlapping forward and reverse reads were merged using BBMerge v.38.87 (
Size homoplasy was calculated for the 25 high-quality validated loci as the difference between the number of alleles distinguished by sequence and the number of alleles distinguished by length, divided by the number of alleles distinguished by length. Data analyses were conducted per locus on the entire set of individuals (n = 107) using two datasets, in which alleles were coded either by amplicon length or by sequence identity. The genetic variability of each dataset was characterized per locus by the number of alleles (NA), the effective number of alleles (NE), the allelic richness (R), the observed heterozygosity (HO), and the expected heterozygosity (HE). Rarefied allelic richness was calculated for a random subsample of k = 166 gene copies, corresponding to the minimum sample size per locus (n = 83). The genetic diversity estimates were computed in SPAGeDi v.1.5a (
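The size homoplasy ratio defined above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name and the allele counts are hypothetical, and the check that sequence-based alleles are at least as numerous as length-based alleles is an assumption that follows from sequence identity subdividing length classes.

```python
def size_homoplasy(n_alleles_seq: int, n_alleles_length: int) -> float:
    """(alleles by sequence - alleles by length) / alleles by length."""
    if n_alleles_length <= 0:
        raise ValueError("at least one length allele is required")
    if n_alleles_seq < n_alleles_length:
        raise ValueError("sequence alleles cannot be fewer than length alleles")
    return (n_alleles_seq - n_alleles_length) / n_alleles_length


if __name__ == "__main__":
    # Hypothetical (sequence, length) allele counts for three loci.
    counts = {"locus_1": (12, 6), "locus_2": (6, 6), "locus_3": (11, 10)}
    per_locus = {name: size_homoplasy(*c) for name, c in counts.items()}
    for name, value in per_locus.items():
        print(f"{name}: {100 * value:.1f}%")
    mean = sum(per_locus.values()) / len(per_locus)
    print(f"mean over loci: {100 * mean:.1f}%")
```

A value of 100% means that sequencing doubled the number of alleles relative to length scoring, while 0% means that every length class contained a single sequence variant.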
Biologically informed statistical analyses were performed on population samples to determine the reliability of the SSRseq data for subsequent population genetic analyses. Genotyping errors and null alleles were assessed using Micro-Checker v.2.2.3 (
Forty-six developed loci out of 60 were successfully amplified in the simplex PCR test and 25 were validated as good-quality loci after amplification of the 46 loci in a single multiplex and sequencing (Table
Characterization of the 25 validated SSRseq loci developed for Anadenanthera colubrina and genetic diversity indices per locus for the two datasets, based on allele length (SSR-length) and on sequence identity (SSR-seq).
| Locus | Forward primer (5’–3’) | Reverse primer (5’–3’) | Missing rate | Genotyping error rate | N (SSR-length) | NA (SSR-length) | NE (SSR-length) | R (SSR-length) | HO (SSR-length) | HE (SSR-length) | N (SSR-seq) | NA (SSR-seq) | NE (SSR-seq) | R (SSR-seq) | HO (SSR-seq) | HE (SSR-seq) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSRseq.A06 | CGATCCACATGGTTCGCTGGGTCAT | GGCATTTCAACATATAGGACCACCA | 7.48% | 1.20% | 99 | 7 | 4.107 | 6.93 | 0.455 | 0.760 | 99 | 28 | 9.599 | 24.67 | 0.505 | 0.900 |
| SSRseq.A14 | TGCACTTAATGTGGAAGGGATTGCA | TTGCCAACAAAGATTTCCTGGAGTT | 0.93% | 0.00% | 106 | 4 | 2.103 | 3.68 | 0.509 | 0.527 | 106 | 7 | 2.969 | 6.36 | 0.698 | 0.666 |
| SSRseq.A17 | GTGGTTGAACGCCGCCCATTCTCAA | TCTCGAGATAGATGATTTGTCCAGGA | 0.00% | 0.00% | 107 | 6 | 1.874 | 5.86 | 0.477 | 0.469 | 107 | 8 | 1.992 | 7.85 | 0.523 | 0.500 |
| SSRseq.A19 | GAAACTTGAAAGCAATTAGGCGGGT | CGGAGAGCCTTCTGTGCCTTGAGCA | 14.95% | 1.27% | 91 | 15 | 8.407 | 14.33 | 0.857 | 0.886 | 91 | 18 | 11.183 | 17.32 | 0.901 | 0.916 |
| SSRseq.A20 | TCCCGACTAACCCTGACTTGCCACT | ACGGATCCACGTTCTGCAATGTATG | 13.08% | 1.23% | 93 | 11 | 4.955 | 10.85 | 0.505 | 0.802 | 93 | 18 | 5.715 | 16.72 | 0.527 | 0.829 |
| SSRseq.A21 | ACTGCAAAGATAATGCCAACATGTC | TGCCTAAGTGGTCAGGTCTCATTCA | 0.93% | 0.00% | 106 | 6 | 1.787 | 5.65 | 0.358 | 0.443 | 106 | 11 | 2.412 | 9.67 | 0.491 | 0.588 |
| SSRseq.A27 | ACCTCACATTTAACACCACAAGCCC | TGGCTAGTGAGGAAGACGAAGACGA | 0.00% | 0.00% | 107 | 5 | 1.995 | 4.35 | 0.467 | 0.501 | 107 | 11 | 2.452 | 9.54 | 0.617 | 0.595 |
| SSRseq.A29 | AGCTCAGCTCTTTCTTTCATACGCA | CCGAGTTTGTGTTGCACCCAGCTCA | 0.93% | 0.00% | 106 | 9 | 2.826 | 8.47 | 0.604 | 0.649 | 106 | 12 | 3.664 | 10.83 | 0.651 | 0.730 |
| SSRseq.A30 | AATCATCTAACACGCAGCCTCACTT | AGGCCGGATACATGGTTTGCTGACC | 1.87% | 0.55% | 105 | 6 | 2.069 | 5.8 | 0.571 | 0.519 | 105 | 14 | 5.087 | 12.86 | 0.829 | 0.807 |
| SSRseq.A33 | AGCTCTGCTTCAATGGCGGAACTGA | CGGTGTTGGTTACTGGCAACCCACC | 0.93% | 0.56% | 106 | 10 | 5.139 | 9.57 | 0.840 | 0.809 | 106 | 11 | 5.219 | 10.25 | 0.84 | 0.812 |
| SSRseq.A34 | GCCGCCTACTATACCAAGCCATGCA | ACTGCGTCTAACCCATATGTGATGGT | 2.80% | 1.69% | 104 | 13 | 3.613 | 12.22 | 0.721 | 0.727 | 104 | 27 | 10.594 | 24.40 | 0.846 | 0.910 |
| SSRseq.A37 | GCCAATATTTAAGACCGGCGTGACCA | ACCTAATCTGGAGACGACCGTCCGA | 0.93% | 0.00% | 106 | 9 | 2.823 | 7.93 | 0.613 | 0.649 | 106 | 24 | 6.658 | 20.95 | 0.858 | 0.854 |
| SSRseq.A38 | GGTTAACGACCCAAGAGCAATAAGA | GGGATTGAGTGGTGAAGTGTAGAAA | 14.02% | 1.95% | 92 | 7 | 2.851 | 6.78 | 0.522 | 0.653 | 92 | 9 | 2.881 | 8.69 | 0.522 | 0.656 |
| SSRseq.A39 | TTCCTCTCCTTCTCCGCCACTTGCC | ACGCGCCGTTTCATCTGGTTGATGC | 0.00% | 0.00% | 107 | 6 | 3.140 | 5.66 | 0.701 | 0.685 | 107 | 20 | 6.947 | 18.14 | 0.869 | 0.860 |
| SSRseq.A40 | TGCAGAGGTATTTGAAATTAGGGCT | ATGATCAGTGGACCCATTGACCTGA | 1.87% | 0.00% | 105 | 12 | 4.377 | 11.62 | 0.790 | 0.775 | 105 | 21 | 6.417 | 19.62 | 0.867 | 0.848 |
| SSRseq.A43 | AGGAATCATTGCACACCCAAAGATGA | GGCCGTCAATCGCTAGTGGCAGAAG | 22.43% | 2.67% | 83 | 11 | 4.097 | 10.93 | 0.217 | 0.760 | 83 | 12 | 4.108 | 11.93 | 0.217 | 0.761 |
| SSRseq.A44 | GCTAGGCCACTCCACAACATTGCAGG | TCGAGGAGATTAGGTGGTGACTTGT | 3.74% | 0.55% | 103 | 13 | 5.945 | 12.29 | 0.864 | 0.836 | 103 | 30 | 8.354 | 26.00 | 0.903 | 0.885 |
| SSRseq.A45 | TGCTTCCACGACGTTATTCTCTAGCA | CCGAGATGCAGGCTATCTGTTCAAC | 8.41% | 2.75% | 98 | 11 | 4.103 | 10.64 | 0.653 | 0.760 | 98 | 13 | 4.315 | 12.34 | 0.653 | 0.772 |
| SSRseq.A47 | TTTCCGTCTCTGTCTTCCTGCTATA | TGCCTTCCTCCATGCTGTTATCTGC | 2.80% | 0.57% | 104 | 10 | 4.27 | 9.36 | 0.798 | 0.770 | 104 | 15 | 6.516 | 13.86 | 0.846 | 0.851 |
| SSRseq.A48 | CGCGAACTTCACTTTGGCGTAGGTG | GCGAGCTGTTGCAATGCCGGAATTG | 2.80% | 1.12% | 104 | 10 | 5.478 | 9.66 | 0.865 | 0.821 | 104 | 13 | 5.854 | 12.23 | 0.865 | 0.833 |
| SSRseq.A51 | CCCTTTGCAGTTTATGGTCCCAGCA | GGACTTATGGGATTGGGCCGAGAG | 1.87% | 0.00% | 105 | 4 | 2.265 | 3.90 | 0.600 | 0.561 | 105 | 8 | 2.409 | 7.03 | 0.61 | 0.588 |
| SSRseq.A54 | AAAGCTCTCGCCGTTCAAACCTGCC | TGACGATTAGGAGGGCGAGCTCTGA | 0.93% | 0.00% | 106 | 6 | 1.629 | 5.87 | 0.396 | 0.388 | 106 | 6 | 1.629 | 5.87 | 0.396 | 0.388 |
| SSRseq.A55 | GGGAACAGAAGCGGGAATCTTGAAG | TGCATCAGCCTGCCACTTGCATGAT | 2.80% | 2.75% | 104 | 8 | 1.735 | 7.40 | 0.433 | 0.426 | 104 | 9 | 1.738 | 8.10 | 0.442 | 0.427 |
| SSRseq.A59 | ACATGAAGCAGCTGATTGAGGAAAGT | CACAATCCTGCCTTGTGGGTCCAACA | 0.93% | 0.00% | 106 | 12 | 4.586 | 11.14 | 0.717 | 0.786 | 106 | 34 | 8.496 | 28.60 | 0.84 | 0.886 |
| SSRseq.A60 | TGAACAGGAACTTGTTGGCGGAGGG | CGGCCTCTTTGTCCACCTTCCCAGT | 3.74% | 1.10% | 103 | 8 | 1.709 | 6.99 | 0.369 | 0.417 | 103 | 12 | 3.722 | 10.99 | 0.66 | 0.735 |
| Mean | | | 4.45% | 0.80% | 102 | 9 | 3.515 | 8.32 | 0.596 | 0.655 | 102 | 16 | 5.237 | 14.19 | 0.679 | 0.744 |
| SE | | | 0.012 | 0.002 | 1.237 | 0.606 | 0.334 | 0.585 | 0.036 | 0.031 | 1.237 | 1.553 | 0.561 | 1.323 | 0.038 | 0.031 |
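To make the per-locus indices in the table concrete, the sketch below computes N, NA, NE, HO, HE, and FIS from a list of diploid genotypes. It is an illustration under stated assumptions rather than a reimplementation of SPAGeDi: NE is taken as the inverse of the sum of squared allele frequencies, HE as the sample-size-corrected gene diversity, and FIS as 1 − HO/HE. The corrected form appears consistent with the tabulated values (for example, NE = 1.629 with N = 106 gives HE ≈ 0.388, as reported for SSRseq.A54), but the exact estimators used by the software are not restated in the text. The function names and toy genotypes are hypothetical.

```python
from collections import Counter


def locus_diversity(genotypes):
    """Diversity indices for one locus.

    genotypes: list of (allele, allele) tuples; individuals with
    missing data are assumed to have been removed beforehand.
    """
    n = len(genotypes)
    if n < 1:
        raise ValueError("at least one genotype is required")
    counts = Counter(allele for g in genotypes for allele in g)
    total = 2 * n  # number of gene copies
    sum_p2 = sum((c / total) ** 2 for c in counts.values())
    ho = sum(1 for a, b in genotypes if a != b) / n
    he = total / (total - 1) * (1.0 - sum_p2)  # sample-size corrected
    fis = 1.0 - ho / he if he > 0 else float("nan")
    return {"N": n, "NA": len(counts), "NE": 1.0 / sum_p2,
            "HO": ho, "HE": he, "FIS": fis}


def he_from_ne(ne: float, n: int) -> float:
    """Corrected HE recovered from NE and the number of individuals."""
    total = 2 * n
    return total / (total - 1) * (1.0 - 1.0 / ne)


if __name__ == "__main__":
    # Toy genotypes; allele labels could be lengths or sequence identifiers.
    toy = [("A", "A"), ("A", "B"), ("B", "C"), ("A", "C")]
    print(locus_diversity(toy))
    print(round(he_from_ne(1.629, 106), 3))
```

Running the same function on length-coded and sequence-coded alleles for the same individuals is one way to reproduce the kind of side-by-side comparison shown in the table.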
Most microsatellite loci were highly polymorphic in both sequence and length (Table
The SSRseq dataset for adult trees from the Paranaense forest showed no evidence of scoring errors due to stuttering or large allele dropout at any of the 25 loci in the genotyping error analysis. However, null alleles were detected at three loci (SSRseq.A06, SSRseq.A20, SSRseq.A43), potentially causing an excess of homozygosity. Consequently, these loci were excluded from subsequent population analyses. No statistically significant deviations from Hardy-Weinberg equilibrium were detected in either of the two life stages, despite expectations of higher relatedness due to family structure in the seedlings. No evidence of linkage disequilibrium was observed between pairs of loci.
High population genetic diversity was detected in adults and seedlings from the Paranaense forest (based on 22 loci). The mean and effective numbers of alleles per locus (NA, NE), observed and expected heterozygosity (HO, HE), and the inbreeding coefficient (FIS) did not show significant differences between life stages. However, rarefied allelic richness (R) values were significantly higher in adults than in seedlings (p = 0.023) (Table
Genetic diversity indices per life stage for the dataset based on 22 SSRseq loci of Anadenanthera colubrina from a forest site of the Paranaense region.
| Life stage | Statistic | NA | NE | R | HO | HE | FIS |
|---|---|---|---|---|---|---|---|
| Adults (N = 31) | Mean | 10.955 | 6.226 | 10.488 | 0.751 | 0.773 | 0.023 |
| | SE | 0.913 | 0.831 | 0.851 | 0.031 | 0.029 | 0.026 |
| Seedlings (N = 64) | Mean | 9.136 | 4.116 | 7.651 | 0.696 | 0.679 | -0.018 |
| | SE | 0.836 | 0.454 | 0.652 | 0.043 | 0.037 | 0.021 |
| Total | Mean | 12.500 | 4.919 | 9.175 | 0.714 | 0.720 | 0.009 |
| | SE | 1.095 | 0.605 | 0.743 | 0.036 | 0.034 | 0.018 |
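Because adults and seedlings were sampled in unequal numbers, allelic richness is compared after rarefaction to a common number of gene copies. The sketch below shows the standard rarefaction formula, in which the expected number of alleles in a subsample of k gene copies is the sum over alleles of 1 − C(N − Ni, k)/C(N, k). It is assumed, not confirmed by the text, that the software used follows this formula; the function name and allele counts are hypothetical.

```python
from math import comb


def rarefied_allelic_richness(allele_counts, k):
    """Expected number of distinct alleles among k sampled gene copies.

    allele_counts: number of gene copies observed for each allele.
    k: subsample size in gene copies (k <= total gene copies).
    """
    n = sum(allele_counts)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the total gene copies")
    denom = comb(n, k)
    # comb(n - c, k) is 0 when fewer than k copies remain without allele i,
    # so such an allele is certain to appear in the subsample.
    return sum(1.0 - comb(n - c, k) / denom for c in allele_counts)


if __name__ == "__main__":
    counts = [4, 2, 2]  # hypothetical counts for three alleles, 8 gene copies
    for k in (1, 2, 8):
        print(k, round(rarefied_allelic_richness(counts, k), 4))
```

With k equal to the full sample the function returns the observed number of alleles, and with k = 1 it returns one, which gives two quick sanity checks for any implementation.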
High-throughput sequencing allowed the de novo development of an NGS-based multiplex marker panel for Anadenanthera colubrina. The effectiveness of SSRseq strengthens the advantage of using modern NGS platforms for qualitatively and quantitatively increasing data in large-scale population genetic studies.
The quality of DNA extraction is a crucial starting point for genetic studies. For Leguminosae forest tree species, the CTAB method is highly recommended because it is an inexpensive and rapid protocol that continues to be widely used in several species of the family (
Microsatellites have been used as highly polymorphic markers in previous population genetic studies of Anadenanthera colubrina, although the number of nuclear and plastid SSRs was limited (
The 25 new SSRseq markers provide an additional advantage over traditional fragment-length genotyping. Here, the NGS-based method allowed us to detect a high percentage of size homoplasy and to resolve almost twice as many alleles as fragment length alone, the only information accessible by capillary electrophoresis-based SSR genotyping. The generated datasets increase the resolution for more accurate eco-evolutionary inference in A. colubrina populations. The increase in the number of alleles detected through sequence analysis was as high as 97.85% in A. colubrina, while in other tree species, it was 36% in chestnut (Castanea sativa; C. crenata;
Beyond detecting size homoplasy, the sequencing approach expands the scope of SSRs due to the use of compound marker systems integrating linked polymorphisms with different mutational dynamics, such as a microsatellite and its flanking sequences, allowing improvements in the estimation of population structure and inferences of demographic history (
High levels of population genetic diversity were detected in adults and seedlings from a Paranaense forest. The mean number of alleles per locus (NA = 12.50) and the allelic richness after rarefaction (R = 10.49 in adults) were particularly high. These estimates exceed those reported in previous studies of A. colubrina at various spatial scales, which employed traditional nuclear microsatellites. For example, a survey of two populations located in northeastern São Paulo state, Brazil, detected a mean NA = 7 (
Forest tree species generally exhibit a high genetic diversity, primarily distributed within populations, a pattern often attributed to life history traits such as longevity and outcrossing mating systems (
The low inbreeding coefficients detected contrast with those reported in a previous study on fine-scale population genetic structure in A. colubrina, where both adults and saplings exhibited high FIS coefficients (
Recent advances in molecular and computational techniques, combined with transdisciplinary research that integrates ecology, evolution, and genetics, are crucial for understanding and conserving the processes that support plant genetic diversity in a changing world. Overall, SSRseq is a powerful tool for genetic analysis that can identify genetic variation and diversity within and among populations, which can inform sustainable management and conservation policy.
The primers designed for SSR high-throughput genotyping-by-sequencing are promising genetic tools useful in many population genetics applications such as genetic characterization of entire populations with less sequencing effort. The developed SSRseq markers may be particularly advantageous for efficient analysis of genetic diversity, providing renewed opportunities to explore ecological and evolutionary processes that shape population genetic structure in non-model species such as A. colubrina.
The comparison between allele length and SSR sequence identity revealed that SSRseq markers are more informative than traditional SSRs because sequencing reveals variation in both repeat number and flanking sequences. SSRseq development follows clear criteria and, after proper laboratory testing, allows multiplexing and high-throughput genotyping, enabling efficient and cost-effective analysis of multiple markers simultaneously. Therefore, this study lays the basis for new approaches that use the information provided by NGS technologies to develop molecular markers for large-scale genotyping, mainly in population genetic and genomic studies.
Raw data from the shotgun whole genome sequencing are available in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information under BioProject PRJNA1033700, with SRA number SRR26587918. Genotype data for every individual and microsatellite locus are available on Zenodo: https://doi.org/10.5281/zenodo.11106586.
Technical developments and sequencing were performed at the PGTB (https://doi.org/10.15454/1.5572396583599417E12) with the help of Z. Compagnie and E. Guichoux. The authors thank C. Lalanne for technical assistance. ALG also thanks the “Consejo Nacional de Investigaciones Científicas y Técnicas” (CONICET) for a postdoctoral fellowship supporting a short research stay at INRAE. This research benefited from an “Investissement d’Avenir” grant of the French National Research Agency (CEBA:ANR-10-LABX-25–01) to MH and ALG, and was partially supported by a multiannual research project (PIP N° 112–2015001-00860CO) from CONICET to MVG.
Characterization of the 60 SSRseq markers generated from low-coverage shotgun sequencing of a single Anadenanthera colubrina library.