Chapter 2: Markers and Sequencing

Thomas Lübberstedt; Madan Bhattacharyya; and Walter Suza

Mutations are the ultimate source of all genetic variation. Mutations can occur at all levels of genetic organization, classified as gene, chromosome or genome mutations. Gene and chromosome mutations are discussed in this lesson. Gene mutations involve nucleotides or short DNA segments (one or few nucleotides substituted, inserted, or deleted). Chromosome mutations are large-scale chromosome alterations including deletions, insertions, inversions, and translocations. Genome mutations – involving changes in number of whole chromosomes or sets of chromosomes.

Genetic variation – dissimilarity between individuals attributable to differences in genotype – that is generated by mutations is acted upon by various evolutionary forces. Evolutionary processes that alter species and populations include selection, gene flow (migration), and genetic drift – whether plants are cultivated or wild. Evolution can be defined as a change in gene frequency over time. The way that plants evolve is dependent on both genetic characteristics and the environment that they face.


Learning Objectives
  • Understand the importance of DNA sequence variation within species
  • Learn the principles of sequencing
  • Understand next generation (nextgen) sequencing
  • Understand nextgen sequencing bioinformatics
  • Understand prerequisites, prospects, and limitations in using sequencing for genotyping
  • Understand the concept of imputation
  • Understand RFLP, SSR, and AFLP as examples of classical markers
  • Understand SNP and INDEL marker basics
  • Understand two basic applications of markers: fingerprinting and gene-tagging
  • Understand underlying technologies of non-DNA marker systems and strengths and weaknesses of non-DNA versus DNA markers

Genetic Variation

Photo of a woman in a lab preparing specimens.
Fig. 1 A U.S. Food and Drug Administration microbiologist prepares DNA samples for gel electrophoresis analysis at the FDA lab in Atlanta. Photo by the U.S. Food and Drug Administration.

Genetic variation results from differences in DNA sequences and, within a population, occurs when there is more than one allele present at a given locus. Changes in gene frequencies within populations caused by natural selection can lead to enhanced adaptation, while changes caused by human-directed selection can facilitate development of useful genetic variability and selection of superior genotypes. Selection is the differential reproduction of the products of recombination — both within and between chromosomes.

The basic tools used to characterize genetic variation within and between populations are called genetic markers. Markers can be visible traits, proteins, genes, DNA sequences, or RNA sequences, and can be genetically mapped to a particular chromosomal location. They can be used to track the inheritance of nearby genes to which they are closely linked. A marker may be part of a gene itself or more commonly in a chromosome segment close by a gene of interest. Markers are characteristically locus-specific and polymorphic (i.e., segregating) in the population under study, and also have easily observable phenotypes. Markers allow determination of alleles present in individuals or populations.

Principles of Sequencing

Historically, plant breeders seeking sources of variability were constrained in choice of parental materials or plant genetic resources that were interfertile through the process of recombining locally adapted plant materials via sexual reproduction within closely related gene pools or just by evaluation and selection of particular desirable, existing genotypes to produce improved plant varieties. But a range of new techniques such as mutation induction, genetic engineering (transgenic or transformed plants), and in vitro methods (somatic cell hybridization, tissue culture, doubled haploids, induced polyploids) expand the source and scope of variability that can be used in crop improvement.

Our expanding understanding of the molecular basis of genetics has provided insights and technologies that further not only our basic understanding of genes and their regulation, but also provide additional tools for crop improvement. Molecular techniques enable breeders to generate genetic variability, transfer genes between unrelated species, move synthetic genes into crops, and make selections at the molecular, cellular, or tissue levels. Combining these laboratory techniques with conventional field approaches can shorten the time and reduce the costs for developing improved cultivars. The importance and application of molecular technologies are rapidly increasing.

Sequencing Efforts

Sequencing is the determination of the order of the nucleotides on a DNA molecule. A major milestone in plant biology was reached when the genome of Arabidopsis thaliana was published (The Arabidopsis Genome Initiative, 2000). Thereafter, the scientific community pursued the genomes of several crop plants used for feed and food. This effort attempts to keep a current list of sequenced plant genomes.


Photo close-up of small white flowers.
Fig. 2 Arabidopsis thaliana in bloom. Photo by Alberto Salguero; licensed under CC-BY-SA 3.0 via Wikimedia Commons.
Table 1 List of plants whose genome sequences are available.
Common name Scientific name Year
Potato Solanum tuberosum 2011
Grape Vitis vinifera 2007
Cucumber Cucumis sativus 2009
Popular Populus trichocarpa 2006
Strawberry Fragaria vesca 2010
Castor bean Ricinus communis 2010
Apple Malus x domestica 2010
Cannabis Cannabis sativa 2011
Lotus Lotus japonicus 2008
Soybean Glycine max 2010
Pigeon pea Cajanus cajan 2011
Chocolate Theobroma cacao 2010
Papaya Carica papaya 2008
Arabidopsis Arabidopsis thaliana 2000
Arabidopsis Arabidopsis lyrata 2011
Various Brassica rapa 2011
Thellungiella Thellungiella parvula 2011
Date palm Phoenix dactylifera 2011
Rice Oryza sativa L.  2002
Brachy Brachypodium distachyon 2010
Maize Zea mays 2009
Sorghum Sorghum bicolor 2009
Moss Physcomitrella patens 2008
Selaginella Selaginella moellendorffii 2011

Sanger’s Dideoxy DNA-Sequencing Procedure

This procedure was developed by Fred Sanger in the 1970s. Sanger, along with Walter Gilbert, won the Nobel Prize in chemistry in 1980 for their sequencing developments. The method uses enzymatic reactions to incorporate specific terminators of DNA chain elongation called 2′,3′-dideoxynucleoside triphosphates (ddNTPs). The ddNTP molecules can be incorporated into the growing DNA chain through their 5′ triphosphate groups. However, because they lack a hydroxyl (OH) groups on the 3′-C of the sugar moiety, they cannot form a phosphodiester bond with deoxynucleotide triphosphates (dNTPs) during the sequencing reaction, resulting in termination of DNA chain elongation.

Sequenced Plant Species

Table 2 Examples of plant species, whose genomes have been sequenced and published.
Plant name References Plant name References
Arabidopsis Arabidopsis sequence in Nature Brachypodium Brachypodium sequence in Nature
Rice Rice sequence in Nature Sugar beet Sugar beet sequence in Nature
Maize Maize sequence in Science Flax Flax genome in Plant Journal
Sorghum Sorghum sequence in Nature Cassava Cassava sequence in Nature Biotechnology
Soybean Soybean sequence in Nature Peach Peach sequence in Nature Genetics
Potato Potato sequence in Nature Common bean Common bean sequence in Nature Genetics
Cucumber Cucumber sequence in Nature Genetics Cacao Cacao sequence in Nature Genetics
Apple Apple sequence in Nature Genetics Sweet orange Sweet orange sequence in Nature Genetics
Papaya Papaya sequence in Nature Sunflower Sunflower sequence in Nature
Medicago Medicago sequence in Nature Wheat Wheat sequence in Nature
Grape Grape sequence in Nature Barley Barley sequence in Nature
Poplar Poplar sequence in Science Watermelon Watermelon sequence in Nature Genetics
Castor bean Castor bean sequence in Nature Biotechnology Amborella Amborella sequence in Science
Pigeonpea Pigeonpea sequence in Nature Biotechnology Tomato Tomato genome in Nature
Strawberry Strawberry sequence in Nature Genetics Peanut Peanut sequence in Nature Genetics
Date Palm Date Palm sequence in Nature Biotechnology

Additional Resources

Try This: Dideoxy Sequencing

The figure below shows gel electrophoresis sequencing results based on the dideoxy method. Examine the banding pattern and describe the order of the nucleotides of the original template strand.

Electrophoresis visualization with lines

Maxam & Gilbert Procedure

This procedure was developed by Allan Maxam and Walter Gilbert in 1977. The Maxam & Gilbert procedure is based on chemical degradation of DNA chains. In this procedure, a segment of DNA is labeled at one end with a radioactive label (32P ATP). A solution containing the labeled DNA is distributed into four different tubes. A chemical that specifically destroys one or two of the four bases (G, A+G, C, C+T) in the DNA is added into each tube. Addition of the chemical piperidine to the DNA results in cleavage of the strand at the position of the modified base. The length of the cleaved fragments depends on the distance between the modified base and the labeled end of the DNA segment. The cleaved products of each of the four reactions (G, A+G, C, and C+T) will be evaluated by autoradiography and the banding pattern on film is scored to determine the DNA sequence.

Try This: Maxam & Gilbert Sequencing

Electrophoresis visualization: from top to bottom, G, A+G, T+C, and C are the columns.

Next Generation Sequencing


Next generation sequencing is defined as a high-throughput sequencing method that combines parallel processes to produce millions of sequences at once. Several nextgen technologies are currently in use. The lesson focuses on the following technologies: pyrosequencing, Illumina, SOLiD, single molecule real time, and ion torrent sequencing.

Sequencing system device product photo.
Fig. 3 An Ion Torrent Sequencing system. Photo by ThermoFisher Scientific.

Pyrosequencing or 454 Sequencing

454 sequencing was developed by Roche and uses a procedure known as sequencing by synthesis, or pyrosequencing.


The SOLiD system was developed by Life Technologies and is based on a technique of oligonucleotide ligation and detection. Like pyrosequencing, SOLiD sequencing begins with fragmented DNA on an agarose bead.


The Illumina system uses a terminator-based method to detect single bases as they are incorporated into a growing DNA strand.

SMRT Sequencing

Single-Molecule Real-Time (SMRT) system was developed by Pacific Biosciences and uses a polymerase based approach to sequence single DNA molecules in real-time. The technique works similarly to 454 pyrosequencing, but it uses a luminous dye attached to the phosphate chain of each nucleotide.

Third-Generation Sequencing

Ion Torrent sequencing

Ion Torrent sequencing works similarly to NGS technologies like pyrosequencing. However, instead of recording a visible light signal that results from a cascade of chemical reactions, Ion Torrent technology senses a hydrogen ion that is released naturally when a base is added to a DNA strand. An ion-sensitive layer detects these ions and records them as voltage spikes, which can be decoded into the original bases.

Assembling Aspects of Nextgen Sequencing

Sequence Alignment

In general, nextgen technologies result in very large numbers of reads that are shorter than those produced using capillary electrophoresis. Therefore, nextgen sequencing requires more robust algorithms to assemble the large quantity of data generated. Along with the increased output, the challenge is to manage and track sequence runs, and automating downstream data analyses. Several academic and private institutions provide core services for nextgen sequencing. Among them are Beckman Coulter Genomics, Massachusetts General Hospital, and the Office of Biotechnology at Iowa State University.

General Bioinformatics Workflow

DNA and RNA sequencing analysis processes, visualized in a flow chart. Both go through image aquisition, quality metric assignment, filtering, and data reporting. The specific steps in the middle vary in order and amount.

Genotyping by Sequencing


Next-generation sequencing has made it possible to sequence entire plant genomes in much shorter time and at a lower cost than using the approaches based on Sanger dideoxy sequencing (Glenn, 2011). Sequencing of multiple related genomes using NGS technologies can be done to sample genetic diversity within and between germplasm. However, even with NGS technologies, species with large complex genomes are a challenge to sequence. To address this challenge, genotyping-by-sequencing (GBS) was developed as a tool for association studies and genomics-assisted breeding for various crops species, including those with complex genomes.

Library Construction

GBS uses restriction enzymes in combination with multiplex sequencing to reduce genome complexity sequencing cost (Fig. 4).

Library construction process illustrated.
Fig. 4 Genotyping-by-sequencing in plants, Library Construction involves plating the DNA and adapter pair, digestion with a restriction enzyme (or two), and ligation of adapters to the ends of DNA fragments. Samples are pooled and cleaned up before PCR. The PCR products are also cleaned up and evaluated for quality. Adapted from Elshire et al. (2011).

Sequence Barcoding

NGS technologies can produce more than 1 billion base pairs in a single sequencing run. A challenge is to use this enormous capacity for multiple DNA samples, for which only a fraction of the 1 billion bp sequence information is required. Barcoding enables to label sequences originating from a particular sample, and to pool barcoded DNA in a single sequencing run (Fig. 5). Barcodes in the context of DNA sequencing are short, unique sequences of DNA added to samples to be pooled, then processed and sequenced in parallel (Fig. 6). The sequence produced from the barcoded samples contains information to determine its origin. By barcoding the DNA, base-by-base error rate and array-to-array or day-to-day variability are reduced.

Adapter and Sequencing Primer Design

Simple graphic of genotyping by sequencing, described in caption.
Fig. 5 Genotyping-by-sequencing in plants. A barcode adapter and a common adapter flank the DNA insert to be sequenced. Primers 1 and 2 bind specific sequences on the 3′ ends of the barcode and common adapters, respectively. Adapted from Elshire et al. (2011).

Multiplexed Sequencing Process

Sequence Barcoding

Image description in caption.
Fig. 6 Preparation of barcoded libraries. Specific regions of genomes from multiple individuals are amplified by PCR. The PCR amplicons are pooled together and end-modified with barcoded adapters. The barcoded amplicons (referred to as an indexed library) are pooled together and sequenced by NGS. Adapted from Craig et al. (2008).

Haplotype Maps

Development of Haplotype Maps

Genome-wide association studies (GWAS) require both phenotypic and genotypic data from multiple individuals. The concept of GWAS will be covered in the module “Cluster Analysis, Association and QTL Mapping”. Thus, GBS can be used to develop genotypic data for the construction of haplotype maps (HapMap) for GWAS. An example of the use of GBS for GWAS is from the work of Huang et al. (2010) describing the sequencing of 517 rice genomes using the Ilumina technology to generate about one-fold sequence coverage per genotype. The data generated by Huang et al. (2010) were used to construct a HapMap for GWAS for several agronomic traits in rice.

Data Imputation

Photo of rice grains growing in a field.
Fig. 7 Rice plants. Photo by IRRI Images. Licensed under Creative Commons Attribution 2.0 via Wikimedia Commons.

The process of DNA sequencing is not free of error. Also, depending on the NGS system used, the length of base pair reads will be variable. Errors in sequencing and the length of the reads obtained by NGS may result in missing genotypes, thus affecting the quality of data.

The concern of missing data arises in almost all statistical analyses. This is actually what Huang and coworkers (Huang et al. 2010) encountered. Importantly, Huang and coworkers understood that linkage disequilibrium (LD) and the nonrandom correlation among allelic variants is extensive in rice. This meant that they could infer missing genotypes (missing data) with high confidence using data imputation (Marchini et al. 2007). Imputation is a statistical term describing substitution of a value for missing data.

The next section summarizes the approach used by Huang and coworkers to assign values to the missing genotypes.

Step 1: SNP Identification and Annotation

Single-base pair genotypes of 520 individuals obtained by Illumina sequencing were integrated to screen for single-nucleotide polymorphisms (SNPs) across the genome. Candidate SNPs were identified by comparing Illumina sequence data with the rice reference genome.

Step 2: Data Imputation To Assign Genotypes

Step 2 Animation
Play the video to see data animation.

Sliding Window Approach

The sliding window (Fig. 8) is a multi-loci mapping algorithm commonly used in association mapping. It involves three steps: Local haplotypes are inferred from contiguous SNP loci Genotypes are grouped according to inferred haplotypes Statistics (F-test) for the genotype-phenotype associationand p-values are computed.

Table, graph, adn windows showing chromosomal regions in various displays.
Fig. 8 The sliding window approach for data imputation. Defined chromosomal regions are based on the number of SNPs in a chromosomal region, i.e. defining a window size of w SNPs, and allowing the window to vary according to the size of chromosomal regions showing strong LD. During this process, the window slides along the chromosome at an interval of one SNP until the missing data are inferred for the entire chromosome. Adapted from Huang et al. (2010).

Map Construction by NGS Sequencing

In Figure 9, variation in agronomic traits (e.g., heading date and tiller branch number) among lines are shown in the upper part of the figure. NGS sequences are aligned to the reference genome for genome-wide genotyping. Aligned reads (gray boxes) facilitate SNP (bases in upper case) identification among lines. Consistent patterns of mismatch between NGS sequence and the reference genome distinguish genetic variability from random sequence variation.

A simple graphic with five plants and ref sequences showing their sequencing.
Fig. 9 Construction of a HapMap by NGS sequencing.  Imputation is used to “fill in” missing genotypes (bases in lower case) in areas not covered by sequencing. Boxes with dashed lines are referred to as paired-end reads, and are used to facilitate proper read alignment. Adapted from Clark (2010).

Bin Maps

Bin boundaries for maize chromosomes shown, with various lengths of chromosomes aligned by their center.
Fig. 10 Bin boundaries for maize chromosomes. The chromosome partitions (white horizontal lines) are based on the concept of bins. Adapted from MaizeGDB.

A bin (Gardiner et al. 1993) is a chromosome segment of about 20 cM flanked by two fixed core markers (a locus or probe that defines a bin boundary). A bin contains all loci within a left fixed core marker to the right fixed core marker. Assigning a locus to a bin is highly dependent on the precision of the mapping data, and increases in likelihood as the number of markers or mapping population increases in size. Bin maps contain coordinates named by the chromosome number, followed by a decimal, and a numeric identifier. 1.00 is the most distal (left or top-most, see arrow in Fig. 10) bin on the short arm of the chromosome. At right is the representation of bin boundaries for the 10 maize chromosomes.

Bin Map Example

An example of application of GBS to develop a bin map for a crop with complex genome is provided from work by Poland et al. (2012). In this example, GBS was evaluated in wheat and barley, and a de novo genetic map was constructed using SNP markers from the GBS data (Fig. 11).

Three charts, described in the caption.
Fig. 11 Distribution of SNPs discovered by GBS in bin map of barley (only chromosomes 3, 4, and 5 are shown). Histograms represent the number of SNPs from GBS that map to each bin. The number of SNPs mapping to a single bin is represented by the blue bars. SNPs that did not match a particular bin are represented by grey bars. Red triangles below the plots represent bins that failed to match any SNP marker from GBS data. Adapted from Poland et al. (2012).

Tag SNPs

A tag SNP is a polymorphism in a region of the chromosome with high LD and can be used as a marker for genetic variation without genotyping an entire chromosome.

Chart plotting SNPs. Longer description in caption.
Fig. 12 LD plot of SNPs with top-ranked BFs in CHB of 1000 Genome Phase I. by Weihua Shou,Dazhi Wang, Kaiyue Zhang, Beilan Wang, Zhimin Wang. Licensed under Creative Commons Attribution 3.0 via Wikimedia Commons.
Study Question 1

In the below figure, suppose SNPs in two alleles (major and minor) reveal a particular pattern among haplotypes.

Charting an unknown haplotype against haplotype patterns (P1 through P4) and SNP loci (S1 through S12). Minor and major alleles are highlighted.

DNA Markers


The surge in the development of new tools for molecular genetics starting in the 1980s made it possible to identify genetic variation at the molecular level based on DNA changes and their impact on the phenotype. Such DNA changes (polymorphisms) can be exploited, e.g., as markers for a particular trait of interest, by plant breeders. Availability of an increasing amount of sequence data from sequencing projects together with new technologies such as next generation sequencing, and bioinformatic tools have reduced the cost of marker discovery and application.

Molecular Markers

Molecular or DNA markers reveal sites of variation in DNA. Variability in DNA facilitates the development of markers for mapping and detection or traits. Any DNA sequence can be genetically mapped, like genes leading to plant phenotypes. Prerequisite is, that there must be a polymorphism available for the sequence to be mapped, i.e., two or more different alleles. This can basically be a single nucleotide polymorphism (SNP), a single nucleotide variant at a particular position within the target sequence, or an insertion / deletion (INDEL) polymorphism. Target sequences can be amplified by various methods, including Polymerase chain reaction (PCR), and subsequently be visualized to generate “molecular phenotypes” comparable to visual phenotypes, that can be observed by using appropriate equipment. The main use of those SNPs and INDEL polymorphisms is as molecular markers. By genetic mapping as described above, linkage between genes affecting agronomic traits or morphological characters, and DNA-based SNP or INDEL markers can be established. It can be more effective in the context of plant breeding, to select indirectly for markers (DNA or non-DNA), than directly for target traits. Reasons can be: lower costs for marker analyses, the ability to run multiple such assays (for DNA markers) in parallel, the ability to select early and to discard undesirable genotypes or to perform selection before flowering, codominant inheritance of markers, among others. Below is a discussion on various types of markers used in plant breeding.

General Properties

DNA markers are readily detectable DNA sequences, whose inheritance can be monitored. The advantage of DNA-based markers is that they are independent of environmental factors. An ideal DNA marker (system) should possess the following properties:

  1. Be highly polymorphic
  2. Display co-dominant inheritance (to discriminate homozygotes from heterozygotes)
  3. Occur at high frequency in the genome
  4. Display selective neutral behavior
  5. Provide easy access
  6. Be simple to evaluate by available set of tools
  7. Display high reproducibility, and
  8. Facilitate easy exchange of data between laboratories

Historically, DNA markers can be grouped into three main categories: (1) hybridization-based markers, e.g. restriction fragment length polymorphism (RFLP) markers; (2) PCR-based markers, e.g., amplified fragment length polymorphism (AFLP), and simple sequence repeat (SSR); and (3) sequence or chip-based markers, e.g., some procedures for detecting single nucleotide polymorphism (SNP) markers. Examples of molecular markers belonging to the above three categories are further discussed.


Restriction fragment length polymorphism (RFLP) markers involve cutting DNA into fragments and comparing patterns of variability in fragment size, or polymorphisms. RFLP patterns are analyzed by scoring an autoradiograph of a Southern blot. More information about RFLP is found in Crop Genetics eModule8.

Strengths of RFLP

  • Co-dominance
  • No sequence information is required
  • Simplicity, not requiring costly instrumentation
  • RFLP probe sequences can be used to develop additional markers e.g. Indel
  • Transferability across related species

Weaknesses of RFLP

  • Analysis requires large amounts of high quality DNA
  • Low genotypic throughput (few loci detected per assay)
  • Difficult to automate
  • Use of radioactive probes restricts the analysis to specific laboratories
  • Probes must be physically maintained not allowing sharing between laboratories
  • Expensive


Simple Sequence Repeat (SSR) markers are widely used markers based upon the high rate of variation in microsatellite loci. SSRs represent a few to hundreds highly variable tandem copies of DNA repeats. Such tandem repeats of usually one to four bases are widespread in higher organisms. Many different microsatellite loci (>100,000) can be present in any plant species. SSRs are a result of slippage during DNA replication or unequal crossover during meiosis.

Variation in SSRs is observed by developing locus-specific primers that anneal to sequences flanking the repeat region; Polymerase Chain Reaction (PCR) is subsequently used to amplify the target region. Alleles (fragments) are visualized as bands with different migration pattern on a gel after electrophoresis. More recently, capillary electrophoresis is used, which also allows to multiplex up to about 16 SSRs per capillary.

The activity in the next screen will allow you to use database web tools to search for SSRs and design primers to detect SSR by PCR.

Try This: Activity
  1. Develop PCR primers to detect SSRs for maize teosinte branched1.
  2. Obtain Zea mays tb1 sequence from NCBI.
  3. Use the tb1 sequence you obtained from NCBI search for the tb1 locus in MaizeGDB.
  4. In your BLAST results view the locations of tb1 in the maize genome.
  5. Select a hit with an E-value of 0.

NCBI database screenshot.

Maize GDB screenshot of BLAST tutorial.

Screenshot of Maize GDB BLAST results.

Screenshot of maize GDB BLAST results, with visual alignment for chromosomes expanded.

  1. Select View at Maize GDB
  2. Select Maize B73 RefGen_v2
  3. Select Download Decorated FASTA File
  4. Select Show 100 kbp to allow you to extract about 100 kbp containing the tb1 locus
  5. Select Go

Screenshot of chromosome alignment results from query in text above.

Screenshot of refgen results with several dropdowns highlighted.

After performing the actions described in Step 5, a new window with the entire 100 kbp sequence you extracted will be displayed. Copy the sequence and proceed to step 6.

Screenshot of plain text chromosome sequence.

Several internet-based programs are available for searching microsatellites. For this exercise, you will use a web-based software called WebSat. To access WebSat, go to

  1. You have the option of pasting your copied FASTA sequence or uploading a FASTA file of your sequence. Also, you can change parameters related to the repeat.
  2. After entering the 100 kbp tb1 sequence, select Submit It!

Screenshot of Websat interface with raw DNA sequence pasted in.

After submitting your sequence WebSat will search for SSRs within the sequence and return results with the option to design PCR primers. Design PCR primers around a repeat in the tb1 sequence by selecting a stretch of CTs found between positions 49,141 and 49,351.

Screenshot of Websat interface with sections of the DNA sequence highlighted in yellow, green, and blue.

WebSat selects primers that fit the parameters of your choice to flank the SSR of your interest. In this case the forward and reverse primers around a stretch of CTs you selected will result in a PCR fragment of about 377 bp.

Screenshot of Websat interface various filters and parameters set.

SSR – Strengths & Weaknesses

Strength of SSR markers:

  • Hypervariable, multiple alleles (high PIC)
  • In silico development straightforward

Weakness of SSR markers:

  • Capability for multiplexing limited (max. 10-15)
  • Affects costs/datapoint
  • Few intragenic SSRs

More information online:

Transferability of molecular markers can help increase resolution of genomes that are not well characterized.

SSR markers located within genes can be used for direct selection of an allele.


Amplified Fragment Length Polymorphism (AFLP) markers combine RFLP and PCR. In AFLP genomic DNA is digested with restriction enzymes followed a ligation step where adapters are added to both ends of the restriction fragments. PCR is carried out on the adapter-ligated mixture, using primers that target the adapter, but that vary in the base(s) at the 3’ end of the primer. Figures 21 and 22 in the text (see example of AFLP) describe the steps to detect and analyze AFLP markers.

Strength of AFLP markers:

  • High marker index
  • Amenable for automation
  • Robust
  • No prior sequence information required
  • Special applications: Gene family profiling; Methylation assay
  • Established service company: KeyGene

Weakness of AFLP markers:

  • Random loci, might differ between populations
  • Dominant marker system


Single nucleotide polymorphisms (SNPs) are the most abundant kind of polymorphisms in eukaryotic genomes. SNPs are single nucleotide differences (transition or transversion) between allelic sequences. SNPs might cause polymorphisms detectable as RFLP or AFLP markers, if they occur in restriction enzyme recognition sites. Some principles exploited in SNP detection are shown in Figures 13, 14 , 15 and 16.

A to G SNP example. two strands of DNA are presented, with A and T and G and C matched up.
Fig. 13 An ‘A’ to ‘G’ transition is described, including the various methods that can be used to detect the A- and the G-alleles.
Simple diagram of two DNA sequences. the first is split with cleavage, so A is left hanging on one side and t is left hanging as an indent on the other. G and C are aligned with no cleavage.
Fig. 14 Restriction enzymes may be used for allele-specific cleavage of the target DNA when a SNP changes the restriction site for the enzyme.
Simple diagram of two DNA sequences. A and T match stable, and G is normal, but C has a break in the line and the label mismatch unstable.
Fig. 15 Two short probes are used to detect the polymorphism by hybridization. Only the probe that perfectly matches the target will be stable, and a mismatch would be unstable. The probes are usually labeled with a fluorescing dye or radioisotope for the detection by a laser scanner, or auto- radiography (e.g., Southern blot analysis).
Simple diagram of allele-specific primers, T and C, affecting the two sets of DNA. A and T match with an extension, but G and C mismatch, and have no extension.
Fig. 16 A method called primer extension is described. Primer extension uses two allele-specific primers that anneal to the target sequences adjacent to the SNP and have a nucleotide that is complimentary to the SNP at the 3’ end. Only primers that match perfectly to the target sequence will be extended by the DNA polymerase.

SNP Explanation

Overall, many different genotyping approaches are available ranging from low to high throughput. Some platforms permit users to pick custom SNPs but the highest throughput assays are available only in fixed contents. Not all custom SNPs will work for every format and multiple SNPs may be required to carry out most projects targeting specific SNPs. However, there are still trade-offs for throughput, that is, samples versus SNPs to be analyzed. Ultimately, cost will dictate how a SNP project is designed. Regardless of the study, design, quality control and tracking are critical to the success of the project. Laboratory Information Management Systems (LIMS) are important in every study design. The following are examples of SNP genotyping systems that are commonly used by plant breeders.

SNP: TaqMan Assays

TaqMan SNP assays (Fig. 17) are based on PCR using four oligonucleotide primers: (1) A set of forward and reverse primers that are designed and tested for each SNP, and (2) Two hydrolysis (Taqman) assay probes conjugated with fluorescent dyes and quenchers. Taqman probes are designed to anneal within a region of the PCR fragment resulting from the forward and reverse primers. The quencher ensures that a dye does not fluoresce before Taqman probes have annealed to their target during PCR. The PCR reaction is catalyzed by a polymerase enzyme with 5′ to 3′ exonuclease activity. The 5′ to 3′ exonuclease activity is required to cleave the quencher from the dye allowing fluorescence to be produced during PCR amplification.

An A to G transition is shown within the target DNA (DNA template).

Simple graphic of a TaqMan assay, where probes are aligned to sections of unzipped DNA strands.
Fig. 17 Two Taqman probes are designed to recognize either A or G. The probes are linked with dyes and quenchers.

The 5′ to 3′ exonuclease activity of the polymerase degrades the probe that has annealed to the template. This releases the dye allowing fluorescence to occur.

SNP: Sequenom MassArray System

The Sequenom MassArray system (Fig. 18) uses highly multiplexed PCR reactions to screen multiple mutation sites simultaneously by primer extension combined with Matrix-Assisted Laser Desorption/Ionization-Timeof Flight mass spectrometry (MALDI-TOF-MS). Thesystem provides rapid and quantitative readout allowing detection of mutations, gene copy number, methylation status, and level of expression of allelic variants. Up to about 20 SNPs multiplied by about 400 samples can be analyzed at a time.

A photo of a black, flat chip with many holes in the top.
Fig. 18 A Sequenom genotyping chip with room for 384 samples.Photo by Magnus Manske; licensed under CC-SA 4.0 International via Wikimedia Commons.

Key Steps in Sequenom MassArray System

SNP: Sequenom Massarray System

The three key steps in SNP analysis using the Sequenom system are (1) Target amplification (2) Primer extension and (3) Signal detection and ratio analysis. These steps are further described in Fig. 19.


Chart of SNP detection. SNP alleles are denatures, have primers added, are polymerased, and analyzed by mass spectrometry.
Fig. 19 SNP detection with DNA polymerase-assisted single single nucleotide primer extension.

SNP: GoldenGate Assay

The GoldenGate assay involves the addition of biotin to genomic DNA to immobilize the DNA on avidin-coated particles which bind biotin. The assay uses three oligonucleotide primers, two of these (P1 and P2) are specific for the two SNP alleles, and the third (P3) is a locus-specific primer that is tagged with a sequence for capture on solid support. In the reaction, the allele- and locus-specific primers anneal with the genomic DNA followed by extension using a DNA polymerase. After extension, the products are ligated to the tag sequence by a ligase. PCR primers containing fluorescence labels recognize the P1, P2, P3 sequences. The extension products containing the fluorescence labels are captured on the BeadArray containing complementary tag sequences for fluorescence detection.


Insertions and deletions (Indels) cause changes in DNA sequence by deletion or insertion. Indels can range in size from one or few bases to multiple megabases. Small deletions from a few base pairs to kilobases in length most often arise from unequal crossover during meiosis.

The Arabidopsis INDEL array (Salathia et al. 2007) is a microarray-based system (Fig. 20) that can be used to assess up to 240 polymorphic markers by hybridization. The array is based on 70-mer oligonucleotides of indels present in two Arabidopsis ecotypes, Columbia-0 (Col-0) and Landsberg erecta (Ler). PCR primers are also available for validation of array-based data. Groups of 16 lines can be genotyped together in a single experiment.

In Detail: Dye Swaps

As shown in Figure 20, Salathia et al. (2007) swapped the dye labels for Col and Ler. What is the significance of dye swaps?

Simple graphic of genome sequence analyzed through dye swaps to identify markers and controls.
Fig. 20 Arabidopsis INDEL array technology. The array surface is coated with 70 bp indel oligonucleotides unique to the Columbia ecotype. The DNA from Col and Ler is labeled with fluorescent dyes. The labeled DNA is hybridized to the array. Adapted from Salathia et al. (2007).

Basic Marker Applications

Marker Applications – Genetic Fingerprinting

Genetic fingerprinting is a method that employs the uniqueness of DNA to classify individuals into distinct or similar groups. Based on the fact that genomes of different individuals will contain polymorphisms, a particular DNA profile can be established for a particular organism. This profile is specific to that individual and as unique as a fingerprint (Fig. 21).


Electrophoresis results, a series of black lines against a neutral background.
Fig. 21 PCR gel electrophoresis results. Image by Rkalendar. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons.
Try This! Genetic Fingerprinting


Genetic Fingerprinting by RFLP

The concept of fingerprinting is increasingly being applied to determine the ancestry of plants and animals. Genetic fingerprinting can be used in the breeding of endangered species or commercially important crops because it can help guarantee the authenticity of the plants. With the ability of obtaining highly specific DNA profiles, genetic fingerprinting can be used to protect from illegal use of patented or otherwise registered varieties. For commercially important crops that are difficult to characterize phenotypically, genetic fingerprinting is an important tool to identify genetic diversity within breeding populations. One of the earliest methods of genetic fingerprinting used hybridization-based RLFP markers. An example of genetic fingerprinting data is provided in Figure 22.

Visualization of a genetic fingerprint, several black sections, in lines and chunks, against a light background.
Fig. 22 An example of a genetic fingerprint based on AFLP analysis.
Study Question 2 – genetic fingerprint

Visualization of a genetic fingerprint, several black sections, in lines and chunks, against a light background.

Genetic Fingerprinting Process

Genetic fingerprinting can be applied at all phases of cultivar development. The phases are described at right.

Phase 1: Identifying genetic variation

  • In parent selection
  • In recurrent selection
  • In assigning individuals to heterotic pools
  • In choosing genetic resources

Phase 2: Developing variety parents or testing hybrids

  • To measure heterozygosity to predict hybrid performance
  • In conducting backcrossing

Phase 3: Seed multiplication and variety protection

  • To ensure purity of hybrids and blends
  • For variety approval
  • To identify “essentially derived varieties” (EDV)

Gene Tagging

If markers flank a gene of interest, the likelihood of a recombination event occurring between markers and gene of interest depends on the genetic distance between them. Thus, the closer the marker is to the gene controlling a trait of interest, the higher chance that there will be no recombination between gene and marker. Absolutely linked markers co-segregate with the trait of interest.

A marker linked to a gene controlling a gene of interest can serve as a “tag” for that gene/trait (Fig. 23).

Tomato gene examples, with genes tagged. Large tomatoes have either two dominant or one dominant allele. Small tomato has two recessive alleles.
Fig. 23 An example of gene tagging with a molecular marker completely linked to a trait of interest. A hypothetical gene “A” controls fruit size in tomato and is dominant over “a”. A co-dominant marker is available to identify individuals carrying either of the fruit size alleles. The marker “M” is linked to the dominant allele, and “m” to the recessive allele. The detection of markers M and m by PCR produces fragments that can be separated by gel electrophoresis.
Study Question 3

Use of Linked Markers: Step 1

Use of Linked Markers and Fingerprinting to Assist Backcrossing

Marker-assisted backcrossing involves three steps (Figs. 24-26):

Step 1. Selection of donor allele at the markers linked to target gene to reduce loss of target allele due to recombination. In this step, markers are useful if the trait is controlled by a recessive allele, or when multiple resistance genes are to be obtained from the donor. Also, markers are useful for environmentally-sensitive genes and for expensive phenotypes, for example, grain quality.

Simple chart of donor crossed with recurrent maize parents and resulting offspring. Described in caption.
Fig. 24 Development of male-sterility by marker-assisted backcrossing in maize. A male sterile donor is crossed with a fertile recurrent parent. Charts depict proportions of donor and recurrent parent genomes; bars depict chromosome segments of donor and recurrent parent.

Step 2. Selection of recurrent parent allele at other linked markers. This helps reduce linkage drag when introgressing wild or exotic germplasm.

Simple graphic of gene segregation in BC1, from a quarter of a pie graph to several sections of lines with small areas highlighted.

AFLP genetic fingerprint image, with arrow pointing down segregation in BC1 to highlighted offspring with recurrent parent.
Fig. 25 Progeny containing largest proportion of recurrent parent genome can be detected as early as in the BC1 generation using molecular markers and genetic fingerprinting. Overall, the use of markers helps speed production of new male sterile lines.

Step 3. Selection of recurrent parent allele at unlinked markers throughout the genome (background selection). It is all a matter of probability to identify backross progeny that are similar to the recurrent parent (Fig. 26). Therefore, markers inherent to the recurrent parent (background markers) can help identify progeny most similar to the recurrent parent.

Use of Linked Markers: Step 2

Step 2. Selection of recurrent parent allele at other linked markers. This helps reduce linkage drag when introgressing wild or exotic germplasm.

Use of Linked Markers: Step 3

Step 3. Selection of recurrent parent allele at unlinked markers throughout the genome (background selection). It is all a matter of probability to identify backross progeny that are similar to the recurrent parent (Fig. 27). Therefore, markers inherent to the recurrent parent (background markers) can help identify progeny most similar to the recurrent parent.

Process to identify recurrent parent in backcross process, visualized in simple graphics.
Fig. 26 Probability is used to identify BC progeny that are similar to the recurrent parent [(N)MsMs].

Markers and Selection

Use of Linked Markers and Fingerprinting to Assist Backcrossing

Markers are useful for foreground selection of lines having the donor allele in heterozygous condition. An example of the use of markers for foreground selection is described in Figure 27. Without a marker it would not be possible to distinguish progeny heterozygous for the male sterility trait (Msms) from homozygous (MsMs) genotypes because both scenarios result in fertile plants. The use of a co-dominant marker linked to Ms/ms, heterozygotes helps identify heterozygotes and eliminates the need to expend time and resources for selfing and scoring individuals based on pollen production.

Backcross process visualized with highlighted section noting that a market linked to MS/ms can help distinguish heterozygotes from homozygotes.
Fig. 27 The use of molecular markers for foreground selection. Backcross of (S)Msms to (N)MsMs produces fertile plants, but of different genotypes (Msms or MsMs). Selfing the MsMs BC1 progeny will produce all MsMs fertile plants. Selfing of BC1 Msms progeny will produce fertile and sterile plants in the ratio of 3:1. The use of a linked marker will help eliminate additional work to self and phenotypic screening of the plants.

DNA Versus Non-DNA Markers

Groups of Non-DNA Markers

Genetic markers are broadly classified into two groups. (1) DNA markers: those based on detection of DNA. (2) Non-DNA markers: those based on visually distinguishable traits, also referred to as morphological markers (e.g. flower color or seed shape); and those based on gene products, referred to as biochemical markers (e.g. RNA, protein, and other cellular metabolites).

The advantage of DNA markers is that they are not affected by environmental factors. However, presence of a particular DNA sequence may not always lead to the expected expression for a trait of interest. This is, because the expression of a particular allele depends on environmental conditions, and also interaction with other genes. Thus, even though an allele with a known effect on a particular trait is present, it might not result in the expected phenotype. Therefore, DNA markers are considered to be a measure of the genetic potential of an individual. The equivalent in human genetics is the risk concept. Based on DNA information, it is possible to predict the risk of a patient for showing a particular condition (e.g., 30% to get pancreatic cancer at a certain age). However, whether this condition is expressed, depends on other circumstances. In contrast, if RNA- or metabolite-based biomarkers for this cancer type are available, onset of this condition can be predicted with high accuracy. Thus, non-DNA markers are indicative of the realized potential of an individual.

Visible Markers

The advantage of morphological markers, also called visible markers, is that they are in general easy to score. However, morphological markers are affected by environmental conditions, making their use less reliable across environments. Also, morphological markers are limited in number compared to the abundance of DNA markers. Biochemical markers are affected by the developmental stage of the plant, and the cell type from which they are isolated. This has to be carefully selected. This is a major difference compared to DNA-markers, which are stable and valid, independent of the tissue from which the respective DNA has been isolated. The World Health Organization defines a biomarker as any parameter that can be used to measure an interaction between a biological system and an environmental agent, which may be chemical, physical or biological. Therefore, diagnosis of presence of disease condition and possible treatment requires use of biomarkers. In conclusion, the term biomarker is broadly defined and may include DNA- and non-DNA markers. However, sometimes the term biomarker is used in a more narrow sense for biochemichal non-DNA markers.

Genetic Pathways from DNA to RNA

To understand the relationship between DNA markers and non-DNA markers, review the pathways by which genetic information in deoxyribonucleic acid (DNA) is transferred to ribonucleic acid (RNA) molecules (called transcription), and then transferred from RNA to a protein (termed translation) by a code that specifies the amino acid sequence. Epigenetic mechanisms including DNA methylation/ demethylation and histone acetylation/deacetylation may also impact gene expression (Fig. 28).


Scheme of genetic (and epigenetic) information pathways from DNA to RNA to protein.
Fig. 28 DNA, RNA, and proteins can be used as markers. If the sequence of the protein is known, it may be used to track the DNA (the gene) from which it was encoded. Variation in DNA sequence will result in variation in RNA and protein sequences. If change in the amino acid sequence of an enzyme results in change in its function, the observable phenotype (morphological or biochemical) can be used as a marker. Non-coding RNAs, e.g. miRNA are also important. Many miRNA genes are expressed at specific tissues and developmental stages to regulate expression of specific genes by affecting mRNA stability and translation.

Figure 28 illustrates two steps in gene expression, transcription (production of mRNA from DNA), and translation (production of proteins from mRNA). Not all genes produce mRNA that can be translated into proteins. Certain genes are transcribed into non-coding RNAs (e.g. micro-RNAs – miRNAs) or short-interfering RNAs – siRNAs) that serve a regulatory function during plant growth and development. A gene can be either “on” or “off” depending on the cell-type, stage of development, and environmental signals, meaning that at any moment each cell makes coding and non-coding RNA from only a proportion of its genome.

Microarray Technologies

Development of a microarray starts with the synthesis of probes. Probes can be either (a) cDNA sequences derived from expressed sequence tags (EST) clones or small fragments from PCR or (b) synthetic oligonucleotides, short sequences designed to complement genomic targets of interest. The oligonucleotides may be long (60-mer) or short (25-mer) depending on the purpose of the experiment. Longer probes bind their targets with higher specificity than shorter probes. However, shorter probes may be spotted at a higher density on an array than longer probes, thus reducing the cost of array production.

Microarray technologies allow parallel assessment of thousands of genes in a single experiment to generate data for gene function, or trait characterization. Microarray analysis involves hybridization of target sequences with gene-specific probes spotted on an array (Fig. 29) For the development of RNA-based markers, target sequences are prepared from total RNA or mRNA.


Flowchart of micro array process. The micro array consists of DNA probes fixed to a solid support. RNA is extracted from cells and reverse transcription produced cDNA molecules with a fluorescent tag. The tagged cDNA will pair with probes, and hybridization will result in colored dots indicating the relative amount of mRNA.
Fig. 29 Microarrays allow the detection of the expression of thousands of genes.

Microarray Analysis (1)

Micro RNAs (miRNAs) are small non-coding RNAs which play key roles in regulating the translation and degradation of mRNAs. Genetic or epigentic alterations may affect miRNA expression, thereby leading to aberrant target gene(s) expression in diseases such as cancer. Thus, miRNAs may also provide useful biomarkers for diseases diagnosis. For example, a study by Yanaihara et al. (2006) identified 43 miRNAs that are uniquely expressed in affected lung tissue. Recent studies indicate that miRNAs expression in plants is affected by stress conditions such as drought (Li et al. 2011).


Graphic representing caner cells and noncancer cells being identified through micro arrays.
Fig. 30 Detection of variation in gene expression by microarrays to predict the occurrence of cancer.

The differences in mRNA profiles between genotypes can be exploited to develop biomarkers for predicting future performance of an individual. In humans, for example, variation in gene expression can be used to predict the occurrence of cancer (Fig. 30).

  • Cancer and noncancer cells are removed from patients with breast cancer.
  • Messenger RNA from the cells is converted to cDNA and labeled with red (cancer cells) or green (noncancer cells) fluorescent nucleotides.
  • The cDNAs are mixed and hybridized to DNA probes on a chip.
  • The chip is scanned spot by spot.

A gene sequence in green and red blocks against a black background. A yellow line bisects the image, and represents an equal expression of the gene in both types of cells.

  • Each spot represents the expression of one gene in one patient’s tumor compared with the expression of that gene in her noncancer cells.
  • Tumors above the yellow line came primarily from patients who remained cancer-free for at least 5 years.
  • Tumors below the yellow line came primarily from patients in whom the cancer spread within 5 years.

Microarray Analysis (2)

1. On-slide synthesized arrays Such arrays are prepared by chemical synthesis of probes on the array surface, e.g., Affymetrix arrays (Fig. 31).


A promotional image for the Affymetrix GeneChip, showing a photo of a data chip and a graphic of multiple DNA strands representing the strand of DNA presented on the chip.
Fig. 31 On-slide synthesized arrays on the Affymetrix GeneChip. The actual size of the chip is 1.28 cm x 1.28 cm and costs about $400. Probe spots on each cell are 10 µm. Oligo probes (25-mers) are synthesized by a chemical process known as photolithographic synthesis. Eleven to twenty “match” probes and 11-20 “mismatch” probes per each gene are spotted on the array surface. There is only one target per each array, and arrays are not reused.

Microarray Analysis (3)

  1. On-slide synthesized arrays Such arrays are prepared by chemical synthesis of probes on the array surface, e.g., Affymetrix arrays.

Microarray Analysis (4)

  1. Spotted cDNA arrays This type of arrays is prepared by spotting purified PCR products from a cDNA library on glass using a robotic arrayer.
  2. Spotted gene-specific sequence tag arrays Similar to spotted cDNA arrays, PCR products are spotted on glass by a robotic arrayer. However, in contrast to spotted cDNA arrays, spotted gene-specific sequence tags are developed by PCR using primers targeting unique segments of genes or BAC clones.
  3. Spotted long oligonucleotide arrays These arrays constitute oligos ranging from 50-70 base chemically synthesized to match a particular region of a gene of interest. The 50-70mer oligos are spotted on glass slides robotically.
A small, nondescript mechanical device.
Fig. 32 The GenePix 4000B Microarray Scanner is used to scan Nimblegen and other arrays spotted in a 1” x 3” format at 5 µm -100 µm resolution (16-bit dynamic range).

Protein-Based Markers

Common protein markers are isozymes. Isozymes are enzymes with similar function derived from more than one locus. Isozymes are encoded by gene families resulting from duplication events. Isozymes are different from allozymes in that allozymes represent one enzyme derived from a single locus.

Isozymes are analyzed by a procedure called electrophoresis. Electrophoresis is a technique for separating macromolecules on a gel by means of an electric field and specific chemical staining (Fig. 33). Therefore, to be useful as markers, isozymes must be electrophoretically resolvable (i.e., bands can be clearly separated for visualization on a gel), and detectable by various in-gel assay methods.


Electrophoresis machine, with sections labeled.
Fig. 33 Electrophoresis is a laboratory technique used to evaluate isozymes. Image from NIH-NHGRI.


The advantage of isozymes is that they are robust and highly reproducible. Also, isozymes have codominant expression, meaning that both homozygotes can be distinguished from the heterozygote and neither allele is recessive. However isozymes are gene products, so they reveal only a small subset of the actual variation in DNA sequences between individuals and do not reveal variation in the non-coding regions of the genome. Other limitations of isozymes as markers include: (i) data complexity as a result of dimers or multimers of the enzymes; (ii) multi-allelic and multi-locus systems can make interpretation of the banding patterns difficult; (iii) the system is limited to those enzymes that can be detected in situ, resulting in a narrow coverage of the genome; (iv) relatively few biochemical assays are available to detect isozymes; and (v) the assay is based on a phenotype, and thus sensitive to the environment.

Currently, isozymes are used mainly for germplasm identification and population genetics studies. Other examples of application of proteomic approaches are listed below.

  1. Two-dimensional polyacrylamide gel electrophoresis was used to detect polymorphic protein markers in several plant species (Vienne et al. 1996).
  2. A proteomic approach was used to identify protein markers in lung cancer (Mehan et al. 2012).
  3. Isozymes were useful in developing the linkage map for tomato (Bernatzky and Tanksley, 1986).
  4. Maquet et al. (1997) used allozyme markers to study the genetic structure of Lima bean (Phaseolus lunatus L.) base collection.
  5. Ibáñez et al. (1999) evaluated isozyme uniformity in a wild extinct insular plant [Lysimachia minoricensis J.J. Rodr. (Primulaceae)].
  6. Rouamba et al. (2001) assessed allozyme variation of onion (Allium cepa L.) populations from West Africa.

Metabolite-Based Biomarkers

As described earlier, in human health, changes from healthy states to disease state and conditions can be described in terms of important metabolites of cells. A similar approach can be used to determine biomarker metabolites in plants during growth and development. For example, a study by Tarpley et al. (2005) established a biomarker metabolite set for rice during development.

The advantage of metabolite-based markers is that their levels are more closely associated with phenotypes than DNA markers. Therefore, establishing a set of metabolite biomarkers for a plant may be useful in predicting agronomic performance under different environments (Sulpice et al. 2009, 2010; Steinfath et al. 2010).

Techniques for Profiling

Two techniques used largely for profiling metabolite biomarkers are: (1) mass spectrometry (MS); and (2) nuclear magnetic resonance (NMR). Examples of methods for establishing metabolite biomarkers in plants include gas chromatography- mass spectrometry (GC-MS), liquid chromatography mass-spectrometry (LC-MS), and NMR. The application of some of these methods is described in the work by Skogerson et al. (2009) to establish metabolite profiles for various wines as biomarkers for wine sensory properties. The advantage of such wine biomarkers is that they may be used to replace expensive and laborious sensory panels (Skogerson et al. 2009). Also, such biomarkers may be useful for various regulatory purposes, for example, detection of adulterations.

In evaluating biomarkers, there is a trade-off between metabolic coverage and the quality of the metabolite data. As shown in Figure 34, analysis of a single metabolite or a metabolite class yields data of higher quality than broad analysis for several chemical classes.

Graph of complexity of analyses, described in caption.
Fig. 34 Relationship between number of metabolites analyzed by MS and NMR techniques and data quality. Analysis of a single compound will result in data of higher quality than analysis of several metabolites in a biochemical pathway, or an entire organism. The current methods are unable to fully cover all metabolites in a cell (metabolomes). Higher plants produce tens of thousands of different metabolites making the analysis challenging. HPLC high performance liquid chromatography, TOF time of flight, FTICR Fourier-transform-ion-cyclotron resonance. Adapted from Fernie et al. (2004).

Plant Phenomics

Plant phenomics is the study of how genetic makeup of an individual influences its physical and biochemical characteristics in a particular environment (Furbank and Tester, 2011). High-throughput phenomics facilities are using automated plant imaging for the repeated, non-destructive acquisition of high-dimensional phenotypic data on a whole-plant scale.

For example, The Australian Plant Phenomics Facility uses the Plant Accelerator. The LemnaTec phenomics systems can handle small plants (Arabidopis) and large plants (corn) to measure various parameters including, leaf area, chlorophyll content, stem diameter, height, biomass, color, and leaf tracking over time. The LemnaTec technology can also be used to measure responses to salt, and drought stress. The application of phenomics is important for studying complex stress traits such as drought because non-destructive imaging methods allow temporal resolution and monitoring of the same plants during the experiment (Berger et al. 2010).

Ultimately, phenomic data (e.g., canopy reflectance) can be used as indirect trait for agronomic traits of interest. An example is the measurement of canopy “greenness” to describe the nitrogen use efficiency (NUE) of plant genotypes.


Akhunov, E., C. Nicolet, and J. Dvorak. 2009. Single nucleotide polymorphism genotyping in polyploidy wheat with the Illumina GoldenGate assay. Theor Appl Genet 119:507-517.

Baskin et al. 2009.

Clark, R.M. 2010. Genome-wide association studies coming of age in rice. Nature Genetics 42:11, 926-927.

Craig, D.W., et al. 2008. Identification of genetic variants using bar-coded multiplexed sequencing. Nature Methods 5, 887-893. doi:10.1038/nmeth.1251

DiGuistini et al. 2009. De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data. Genome Biology 10:R94.

Eid et al. 2009. Real-Time DNA sequencing from single polymerase molecules. Science 323: 133-138.

Elshire et al. 2011. PLoS ONE 6. doi:10.1371/journal.pone.0019379

Fernie et al. 2004. Metabolite profiling: from diagnostics to systems biology. Nature Reviews Molecular Cell Biology 5, 763-769 (September 2004). doi:10.1038/nrm1451

Flicek, P., and E. Birney. 2009. Sense from sequence reads: methods for alignment and assembly. Nature methods. 6:S6-S12.

Hawkins et al. 2010. Next-generation genomics: an integrative approach. Nature Reviews Genetics. 11:476-486.

Huang et al. 2009. High-throughput genotyping by whole-genome resequencing. Genome Res. 2009 Jun;19(6):1068-76. doi: 10.1101/gr.089516.108

Illumina, Inc. 2006. GoldenGate Assay workflow.

Li, H., and N. Homer. 2010. A survey of sequence alignment algorithms for next-generation sequenc-ing. Briefings in Bioinformatics. 11: 473-483.

MaizeGDB. Maize bin viewer.

Maxam, A M., and W. Gilbert. 1977. A new method for sequencing DNA. Proc. Natl. Acad, Sci. USA. 74:560-564.

Metzker. 2010. Sequencing technologies — the next generation. Nature Reviews Genetics. 11:31-46

Munroe, D., and T. J. R. Harris. 2010. Third-generation sequencing fireworks at Marco Island. Nat. Biotechnol. 28:426-428.

Perkel, J. 2008. SNP genotyping: six technologies that keyed a revolution. Nature Methods 5:447-454.

Poland, J.A., P.J. Brown, M.E. Sorrells, and J-L Jannink. 2012. Development of High-Density Genetic Maps for Barley and Wheat Using a Novel Two-Enzyme Genotyping-by-Sequencing Approach. PLoS ONE 7(2): e32253. doi:10.1371/journal.pone.0032253

Rothberg et al. 2011. An integrated semiconductor device enabling non-optical genome sequencing. Nature. 475:348-352.

Salathia, N., H. N. Lee, T. A. Sangster, et al. 2007. Indel arrays: an affordable alternative for genotyping. Plant J. 51: 727-737.

Schneeberger et al. 2009. SHOREmap: simultaneous mapping and mutation identification by deep sequencing. Nat Methods. 6:550-551.

Schneeberger, K., and D. Weigel. 2011. Fast-forward genetics enabled by new sequencing technologies. Trends Plant Sci. 16:282-8.

Shendure, J., and H. Ji. 2008. Next-generation DNA sequencing. Nature Biotechnology 26:1135-1145.

Syvänen, A. 2001. Accessing genetic variation: genotyping single nucleotide polymorphisms. Nat Rev Genet. 2:930-942.

Syvänen, A. 2005. Toward genome-wide SNP genotyping. Nat Genet. 37: S5-S10.

ThermoFisher Scientific. Contract Research Organizations To Adopt Ion Torrent Next-Generation Sequencing Platform.–0

Xu, Y. Molecular Plant Breeding. CABI, Wallingford, Oxon.


How to cite this module: Lübberstedt, T., M. Bhattacharyya, and W. Suza. (2023). Markers and Sequencing. In W. P. Suza, & K. R. Lamkey (Eds.), Molecular Plant Breeding. Iowa State University Digital Press.


Icon for the Creative Commons Attribution-NonCommercial 4.0 International License

Chapter 2: Markers and Sequencing Copyright © 2023 by Thomas Lübberstedt; Madan Bhattacharyya; and Walter Suza is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.