Chapter 13: Introduction to Bioinformatics

Ursula Frei; Walter Suza; Thomas Lübberstedt; and Madan Bhattacharyya

A biological sequence database is a collection of molecular data organized in a manner that allows easy access, management, and updating of the data. Biological sequence databases play an important role in providing the research community with access to sequence information. The databases contain molecular information from multiple organisms and are constantly being updated and redesigned to allow more robust data query and analysis. Examples of biological databases include the European Molecular Biology Laboratory (EMBL), GenBank, the National Center for Biotechnology Information (NCBI), and the DNA Data Bank of Japan (DDBJ). Every sequence submitted to a database is assigned a unique identifier called the accession number. Even if the same gene has been submitted several times by different investigators, each submission will have a different accession number. The chapter includes practical examples of using database tools. It is recommended that you use the “Try This” activities to become familiar with sequence databases.

Learning Objectives
  • Become familiar with some of the most commonly used databases in molecular plant breeding
  • Learn the tools for accessing and manipulating biological databases
  • Develop proficiency in the use of biological databases

Database Types

Databases can be classified into primary (archival), secondary (curated), and composite databases.

  • A primary database (e.g. EMBL/DDBJ/GenBank for nucleic acids) contains information of the sequence or structure alone, for example, DNA, RNA, or protein sequences.
  • A secondary database (e.g. eMOTIF at Stanford University, PROSITE of the Swiss Institute of Bioinformatics) contains information derived from the primary databases and represents sequences that are the consensus of a population, for example, conserved features and motifs of a sequence.
  • A composite database contains a variety of different primary databases and provides multiple options for database search (e.g. NCBI, MaizeGDB). New tools are continuously developed to make both submission and access to sequence databases more efficient.

Access and Use of Sequence Databases

Once a new sequence has been determined, a common step in its analysis is to compare it with related genes that have already been sequenced, often from other organisms. A few things to keep in mind about database searches and sequence databases in general:

  1. Do not assume that if a sequence is in the database it must be correct. Databases are full of errors!
  2. Similarity with a known protein or gene does not necessarily mean the query is the same gene as the one it has similarity with.
  3. Two nucleotide sequences may have low similarity yet code for proteins that are functionally related.
  4. Protein sequences may also have low similarity yet still be functionally or structurally related.

About NCBI

NCBI was created in 1988 as a division of the US National Library of Medicine at the National Institutes of Health. The role of NCBI is to create automated systems for storing and analyzing sequence information.

  1. To access various resources available through NCBI select Resources.
  2. We recommend that you set up an account with NCBI so that you can save your results. Click the Sign in link to do so.
  3. Video tutorials are available under the Training & Tutorials link to enhance learning.

A screenshot of the NCBI website with labels on the Resources dropdown (1), Sign in (2), and Training and Tutorials (3).

Sign Up for NCBI

1. Click Register to set up a new NCBI account.

Screenshot of the sign in page for NCBI, with the Register for an account link highlighted.

NCBI Training

Training and tutorials page from NCBI.

Information Retrieval from NCBI

One of the most widely used interfaces for the retrieval of sequence information from biological databases is the NCBI Entrez system. Entrez relies on preexisting, logical relationships between the individual sequences (data points) available in various public databases.

  1. Searching all databases is often a good starting point to get an overview of the state of your research field.
  2. Searches are based on keywords.
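
The same keyword searches can also be scripted. The sketch below builds a search URL for the `esearch` endpoint of NCBI's E-utilities, the programmatic counterpart of the Entrez search box. It only constructs the URL and does not contact NCBI. The helper name `build_esearch_url` and the values `db="gene"` and `retmax=20` are illustrative assumptions, not part of this chapter's exercises.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service.
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="gene", retmax=20):
    """Return an esearch URL for a keyword query (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode escapes spaces and brackets so the term is URL-safe.
    return ESEARCH_BASE + "?" + urlencode(params)


url = build_esearch_url("adh1")
print(url)
```

Pasting the printed URL into a browser should return an XML list of matching record identifiers, which is one way to check a query before automating it.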

Screenshot of NCBI search bar, with the All databases dropdown highlighted.

Searching NCBI by Keywords

Searches can be restricted to a single database or expanded to include all other databases. The simplest way to query is through the use of individual search terms combined with Boolean operators such as AND, OR, or NOT. A Boolean operator is a logical connector that specifies how search terms are combined, so that each record either satisfies the query (true) or does not (false).

  1. Select individual databases, or search them all.
  2. AND: To ‘AND’ two search terms together instructs Entrez to find all documents that contain BOTH terms.
    OR: To ‘OR’ two search terms together instructs Entrez to find all documents that contain EITHER term.
    NOT: To ‘NOT’ two search terms together instructs Entrez to find all documents that contain search term 1 BUT NOT search term 2.
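
The three operators behave like set operations on the records that match each term. The following sketch illustrates this with a toy collection of records; the record IDs and keywords are invented for illustration and do not correspond to real database entries.

```python
# Toy "database": record IDs mapped to the keywords they contain.
# All IDs and keywords are made up for illustration.
records = {
    "rec1": {"adh1", "zea"},
    "rec2": {"adh1", "sorghum"},
    "rec3": {"tb1", "zea"},
    "rec4": {"adh1", "arabidopsis"},
}


def hits(term):
    """Return the set of record IDs that contain the given term."""
    return {rid for rid, words in records.items() if term in words}


both = hits("adh1") & hits("zea")      # adh1 AND zea
either = hits("adh1") | hits("zea")    # adh1 OR zea
excluded = hits("adh1") - hits("zea")  # adh1 NOT zea

print(sorted(both))
print(sorted(either))
print(sorted(excluded))
```

AND narrows a search, OR broadens it, and NOT removes records that match the second term.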

Screenshot of NCBI search bar, with the All databases dropdown highlighted.

Try This: Access and Use of Sequence Databases

This activity consists of the following steps:

Step 1: Compare the sequences

Compare the sequences for the adh1 gene in maize and sorghum. Navigate to the NCBI site.

  1. Enter adh1 in the “search across databases” window. How many adh1 candidates did your search find?

Screenshot of NCBI search bar, with "adh1" being searched.

Step 2: Results of a search

Results of a search for “adh1” across all databases:

Search results with the word "Gene" highlighted.

Step 3: Compare the results

Compare the sequences for the adh1 gene in maize and sorghum:

  1. Enter adh1 AND Zea in the search window.

Screenshot of NCBI search with the search terms "adh1 AND Zea." AND is in all capital letters.

Compare the results in the Gene category.

  1. Boolean operators can be used to restrict a search and allow users to obtain specific information about their organism of interest.

Search results for adh1 and Zea. There are two gene results.

Step 4: Operators

Now try these operators.

  1. Enter adh1 AND Zea[orgn] OR Sorghum[orgn] in the search window.
  2. Results

A new search example, of adh1 And Zea with orgn in brackets followed by Or Sorghum with orgn in brackets. There are over 33,000 results for genes.

Step 5: Operators

Now try these operators.

  1. Enter adh1 AND Zea[orgn] OR Sorghum[orgn] in the search window.

Screenshot of search in NCBI databases as described above.

What stands out when you compare the results of the search terms “adh1 AND Zea[orgn] OR Sorghum[orgn]” and “adh1 AND (Zea[orgn] OR Sorghum[orgn])”? Can you identify any differences among the results obtained from the following sets of search terms?

  • “adh1 AND Zea[orgn]” and “adh1 AND Zea[orgn] OR Sorghum[orgn]”
  • “adh1 AND Zea[orgn]” and “adh1 AND (Zea[orgn] OR Sorghum[orgn])”

  1. Enter adh1 AND (Zea[orgn] OR Sorghum[orgn]) in the search window.
  2. Compare the results in the Gene category.
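
One way to reason about these comparisons: according to NCBI's Entrez help, Boolean operators are applied from left to right unless parentheses group them. The sketch below mimics both groupings on a toy set of records. The record IDs are invented, so the counts are illustrative only and will not match the numbers you see at NCBI.

```python
# Invented record IDs standing in for database entries.
adh1 = {"zm_adh1", "sb_adh1", "at_adh1"}              # records mentioning adh1
zea = {"zm_adh1", "zm_tb1", "zm_bz1"}                 # records from Zea
sorghum = {"sb_adh1", "sb_tan1", "sb_dw3", "sb_ma1"}  # records from Sorghum

# adh1 AND Zea[orgn] OR Sorghum[orgn], read left to right:
# (adh1 AND Zea) OR Sorghum
left_to_right = (adh1 & zea) | sorghum

# adh1 AND (Zea[orgn] OR Sorghum[orgn])
grouped = adh1 & (zea | sorghum)

print(len(left_to_right), sorted(left_to_right))
print(len(grouped), sorted(grouped))
```

Without parentheses, every Sorghum record is returned whether or not it mentions adh1, which is why the ungrouped query yields far more hits.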

Screenshot of search in NCBI databases as described above.

Step 6: Gene-centered Info

  1. Click on “Gene” to get gene-centered information on the output in the last screen results (also shown here).
  2. Click on the first “adh1”.
  3. Review the output window.

Screenshot of search in NCBI databases as described above.

Screenshot of search in NCBI databases as described above.

Summary of one result from your search, for adh1, alcohol dehydrogenase 1.

Step 7: adh1

What is the function of adh1?

  1. Answer found below.

Result for adh1 with the gene description line highlighted.

Step 8: Nucleotide results

Let’s examine the Nucleotide results.

  1. Click this pull-down menu for more information about this gene and select Nucleotide.
  2. Click Search. Your search will return hundreds of hits.

Screenshot of search in NCBI databases as described above.

Step 9: Nucleotide results

Let’s examine the Nucleotide results.

  1. After selecting the nucleotide option as in previous screen, click on the adh1 mRNA as indicated. You may have to scroll down to find it.

Screenshot of search results in NCBI databases, with one result highlighted, for adh1.

Let’s examine the Nucleotide results.

  1. Click the FASTA link.
  2. Reference sequences (RefSeq) are accessed through GenBank to provide non-redundant curated data derived from experimental knowledge of known genes.

Additional information about RefSeq is available on the NCBI website.
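
RefSeq accession numbers carry a prefix that indicates the molecule type, which helps when scanning a long list of hits. The sketch below classifies a few common prefixes. The helper name and the example accession strings are placeholders made up for illustration, and the table is deliberately incomplete.

```python
# A few common RefSeq accession prefixes (not an exhaustive list).
REFSEQ_PREFIXES = {
    "NC_": "complete genomic molecule (e.g. chromosome)",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA model",
    "XP_": "predicted protein model",
}


def describe_accession(accession):
    """Return the molecule type implied by a RefSeq-style prefix."""
    return REFSEQ_PREFIXES.get(accession[:3], "no RefSeq prefix in this table")


# Placeholder accession strings, not real records.
for acc in ["NM_000000.1", "NP_000000.1", "AB000000"]:
    print(acc, "->", describe_accession(acc))
```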

Screenshot of search in NCBI databases, with the link for FASTA highlighted.

Screenshot, example of FASTA results.

Long gene sequence shown in plain text from NCBI screenshot.

Exercise

After clicking the FASTA link, what kind of information do you get?

Does the entire mRNA sequence for adh1 you obtained code for a protein product? If not, how would you identify the coding sequence?
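
One way to approach the second question is to look for an open reading frame (ORF): a stretch that begins with the start codon ATG and continues in the same reading frame to the first stop codon (TAA, TAG, or TGA). The sketch below scans the three forward frames of a short made-up sequence and reports the longest ORF. It is a simplification under these assumptions; the CDS feature annotated in the GenBank record remains the authoritative answer.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    Returns None if no complete ORF is found.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best


mrna = "GGCATGGCTTCTAAGTGACCATGAAATAGTT"  # made-up sequence, not adh1
orf = longest_orf(mrna)
print(orf, mrna[orf[0]:orf[1]])
```

The bases before the start codon and after the stop codon correspond to the untranslated regions of an mRNA.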

NCBI BLAST

NCBI Basic Local Alignment Search Tool (BLAST)

Keywords are not the only way to search sequence databases. A sequence itself can be used as the query in a BLAST search, which makes BLAST probably the most important tool in any sequence database. BLAST compares sequence data using an algorithm developed by Altschul et al. (1990). The algorithm attempts to detect high-scoring segment pairs, which are pairs of sequence segments that can be aligned with one another and, when aligned, meet certain scoring and statistical criteria.
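
To make the idea of a high-scoring segment pair concrete, the sketch below scores an ungapped, position-by-position comparison of two equal-length sequences and finds the best-scoring contiguous segment. This is not the BLAST algorithm itself, which relies on word seeding, extension, and statistical evaluation. The scores of +1 for a match and -2 for a mismatch, and both sequences, are assumptions chosen for illustration.

```python
def best_segment(a, b, match=1, mismatch=-2):
    """Return (score, start, end) of the highest-scoring ungapped segment.

    a and b must have equal length; positions are compared pairwise.
    start is 0-based and end is exclusive.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    best = (0, 0, 0)
    running, seg_start = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        running += match if x == y else mismatch
        if running <= 0:
            # A non-positive prefix can never help; restart after it.
            running, seg_start = 0, i + 1
        elif running > best[0]:
            best = (running, seg_start, i + 1)
    return best


query = "ACGTTGACCTGA"
subject = "TCGTTGACGTCA"
score, start, end = best_segment(query, subject)
print(score, query[start:end], subject[start:end])
```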

Screenshot of NCBI database, with Resources dropdown expanded to DNA and RNA and then BLAST.

BLAST Interface

On the BLAST interface, the user can restrict searches to a specific species and to the assembled reference sequences for that species. For a plant researcher, it may not be necessary to restrict a search except for those working with rice and Arabidopsis. For all other plant species, reference sequences are not fully developed.

Screenshot of BLAST Interface homepage.

BLAST Features

  1. Basic BLAST features include blastn, blastp, blastx, tblastn, and tblastx.
  2. Specialized features include “Global Align” for sequence alignment.
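
The five basic programs differ in the type of query they accept and the type of database they search. The lookup table below summarizes these combinations; the helper function `programs_for` is a hypothetical convenience added for illustration.

```python
# Query and database types for the five basic BLAST programs.
# "translated" means the nucleotide sequence is translated in all six
# reading frames before the comparison is made at the protein level.
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide"),
    "blastp": ("protein", "protein"),
    "blastx": ("translated nucleotide", "protein"),
    "tblastn": ("protein", "translated nucleotide"),
    "tblastx": ("translated nucleotide", "translated nucleotide"),
}


def programs_for(query_type):
    """List programs that accept a 'nucleotide' or 'protein' query."""
    return sorted(
        name
        for name, (query, _database) in BLAST_PROGRAMS.items()
        if query.endswith(query_type)
    )


print(programs_for("nucleotide"))
print(programs_for("protein"))
```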

Elongated screenshot of BLAST interface homepage.

Try This: Using NCBI BLAST

  1. Within the Basic BLAST window, click on Nucleotide BLAST. A new window appears asking you to setup your search options.
  2. This is where your query sequence will go.
  3. This selects the Database you want to search.
  4. Other parameters you may want to set different from the standard settings.

Screenshot of Blast homepage with Nucleotide Blast button highlighted.

Screenshot of BLAST Search interface under the blastn tab, with the "enter query sequence" box highlighted.

Try This: Using NCBI BLAST

You have two options for entering your query sequence: copying and pasting it, or uploading a saved sequence from your computer.

Your query sequence must be in FASTA format. FASTA is a text-based format consisting of a definition line followed by the sequence data in single-letter code. The definition line starts with the character “>”, followed by a sequence name, and ends with a return or newline. Everything that follows until the next “>” is treated as sequence data. It is possible to save multiple sequences in one FASTA file.

  1. In the screenshot below,
    the definition line starts with the “>” character,
    gi stands for GenInfo Identifier and is followed by the GI number,
    ref stands for reference sequence and is followed by the accession number.
    Both the GI number and the accession number can be used to enter a query sequence into BLAST.
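
The FASTA rules described above are simple enough to express in a few lines of code. The following is a minimal sketch of a parser that splits FASTA text into definition lines and sequences; the two records in `example` are placeholders, not real database entries.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (definition, sequence) pairs."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


# Two made-up records; the names and sequences are placeholders.
example = """>seq1 hypothetical example record
ACGTACGTAC
GGTTAACC
>seq2 another placeholder
TTGACA
"""

for definition, sequence in parse_fasta(example):
    print(definition.split()[0], len(sequence))
```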

Screenshot of nucleotide description on NCBI, with genbank identification number followed by gene sequence in plain text.

Try This: Using NCBI BLAST

You may enter your query (adh1) as a sequence in FASTA format.

  1. To do that, copy the entire adh1 sequence.
  2. Paste it in the Enter accession number(s), gi(s), or FASTA sequence(s) window.
  3. Note that the Job Title is filled in automatically.

Screenshot of gene sequence pasted into blastn search interface.

Try This: Using NCBI BLAST

  1. Alternatively, you can query your sequence using the Run BLAST command. Click “Run BLAST” to query the sequence from the FASTA display screen.

Screenshot of gene reference sequence page with the link "run BLAST" highlighted in the righthand side of the page.

Try This: Using NCBI BLAST

Clicking on the Run BLAST command will lead you to this window.

  1. The accession number of adh1 will fill in automatically.
  2. The Job Title should fill in automatically. If it does not, click in the Job Title field and it should appear.
  3. Optimize your search for highly similar sequences (megablast).
  4. Finally, select the BLAST button.

Screenshot of blastn query using accession number.

Try This: Using NCBI BLAST

  1. Graphic Summary: BLAST results that are summarized in a graphic form.

Screenshot of BLAST results page with graphic summary expanded, showing a distribution of the top 102 blast hits on 100 subject sequences.

Try This: Using NCBI BLAST

  1. Alignments: BLAST results that contain sequence alignment information.

Screenshot of BLAST results page with alignments expanded, showing pairs of A, C, T, and G

Try This: Using NCBI BLAST

  1. Descriptions: Accession number and source organism information is provided for sequences producing high alignment scores.

Screenshot of BLAST results page with descriptions expanded, showing a table with links to sequences producing significant alignments.

Try This: Using NCBI BLAST

Step 4: Locating adh1 on a chromosome

  1. From the NCBI home page, select Genome.

Screenshot of NCBI homepage with link to Genome highlighted.

Try This: Using NCBI BLAST

Locating adh1 on a chromosome

  1. From the genome page, select Genome Data Viewer (previously known as Map Viewer).

Screenshot of Genome page on NCBI.

Try This: Using NCBI BLAST

Locating adh1 on a chromosome

  1. Within Genome Data Viewer home you can select your organism or species.

Screenshot of Genome data viewer page with organism selection box highlighted.

Try This: Using NCBI BLAST

Locating adh1 on a chromosome

  1. Try searching the Zea mays genome for the adh1 gene.

Screenshot of Genome data viewer page with "Search in genome" box highlighted and adh1 entered.

Try This: Using NCBI BLAST

The Genome Data Viewer search for adh1 in the maize genome produces these results. The “Ideogram view” highlights chromosome 1, showing that the adh1 gene is located on that chromosome.

Screenshot of Genome data viewer page with results for previous search. Multiple graphs and visualizations are provided of genes.

Plant Species Sequence Databases

The advent of genomics has resulted in a number of plant species-specific sequence databases. For this lesson, the Maize Genetics and Genomics Database (MaizeGDB) will be the focus.

Decorative image.

MaizeGDB

MaizeGDB was first released in 1991 (as MaizeDB) and has transitioned from a focus on curation of genetic maps and stocks to the handling of reference maize genome sequence, multiple maize genomes, and sequence-based gene expression data. MaizeGDB relies on the research community for data and on expertise distributed across the USA. We recommend the use of an internet browser other than Internet Explorer (e.g. Google Chrome) to access the MaizeGDB site.

  1. Tutorials are available under the About menu under “Outreach.”

Screenshot of the maizeGDB homepage with the About menu extended to show FAQs, tutorials, and other helpful links.

 

MaizeGDB: Tutorials

Useful MaizeGDB tutorials are available to help the user become familiar with the tool.

Screenshot of maizegdb tutorials page with listed videos.

Try This: Using MaizeGDB

Step 1: Perform a Basic Search

Similar to NCBI, MaizeGDB is a composite database that allows you to search broadly among databases or to restrict your query to a single database.

  1. Open your web browser and go to https://www.maizegdb.org.
  2. Enter adh1 into the search box.
  3. Press Enter or click the Search icon to search within all available data.

Screenshot of the maizegdb website with the URL highlighted.

Step 2: Explore the Search Results

This search will lead you to a window containing various options.

  1. Click on Locus Lookup (1) in the left-hand menu.
  2. Click on Gene Models (15) in the left-hand menu.

Explore the other data available to you by clicking the links in the green box.

Screenshot of the maize gdb data search with Locus lookup highlighted.

Screenshot of the maize gdb data search with gene models highlighted.

Step 3: Access the Genome Browser

Access the genome browser to obtain information about maize adh1.

  1. Type adh1 into the Search bar.
  2. Select loci from the Search options.
  3. Click the Search icon.

This time when the results load, you will see only the loci associated with adh1.

  1. Click the link for adh1 alcohol dehydrogenase1 to take a closer look at the gene.

Screenshot of the maize gdb data search with the search box and a gene (adh1) highlighted in the search results.

Step 4: Explore the Gene Record

The locus record screen provides detailed information on the adh1 gene. Explore the genetic information for adh1 alcohol dehydrogenase1.

  1. Click on Chromosome Coordinates when ready to proceed.

Screenshot of the maize gdb data search with genetic information and chromosome coordinates highlighted in the search results.

Step 5: See Details in Locus Lookup

The page will scroll down to the Locus Lookup section.

  1. Click on Show details to expand this section of the results.

Screenshot of the maize gdb locus lookup page and "see details" highlighted in the search results.

Step 6: Expanded Details in Locus Lookup

When the details have loaded, explore the available information.

Note the position of adh1 based on “AGIs B73 RefGen_v2 sequence”: adh1 is located between positions 273,983,286 and 273,986,641 on chromosome 1.

  1. Click on the map image to launch the MaizeGDB genome browser.
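
The reported coordinates also give the size of the region that adh1 occupies. The short calculation below assumes that both coordinates are inclusive, which is the usual convention in genome browsers but is an assumption here.

```python
# Coordinates reported for adh1 on chromosome 1 (B73 RefGen_v2).
start = 273_983_286
end = 273_986_641

# Both end positions are part of the interval, hence the +1.
length_bp = end - start + 1
print(f"adh1 spans {length_bp:,} bp")
```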

Screenshot of the maize gdb data search locus lookup results. 10 maize chromosomes are highlighted.

Step 7: View Datasets in the MaizeGDB Genome Browser

The MaizeGDB Genome Browser is displayed.

  1. Here you can use the other datasets available in MaizeGDB, including B73 RefGen_v1 sequence, B73 RefGen_v3 sequence, B73 RefGen_v4 sequence, and BAC-based genome assembly.

Screenshot of the maize MaizeGDB Genome Browser.

 

 

Try This: Navigate to BLAST

Next, we will conduct a BLAST search for adh1 in MaizeGDB using the adh1 mRNA from GenBank.

  1. In the navigation bar, hover over Tools.
  2. Then click the BLAST button.

Screenshot of the MaizeGDB Genome Browser. Tools and BLAST are highlighted

Input the BLAST Parameters

  1. Enter the adh1 mRNA sequence in FASTA format in the box.
  2. Use the default parameters to search for adh1 and click the BLAST button.

Screenshot of the MaizeGDB Genome Browser BLAST search page. Steps to input sequences, select datasets, select BLAST parameters, and selected output type are highlighted.

View the BLAST Results

The table of BLAST results includes information on chromosomes, probability values, sequence identity, and the number of likely candidates (hits). Also, you can view a representation of the entire genome in the context of where adh1 may be located.

  1. Click the arrow next to “Whole genome view” to see the entire genome in context of where adh1 may be located.

Screenshot of MaizeGDB Genome Browser BLAST results from above steps. Whole genome view is highlighted.

Explore the Whole Genome View

The whole genome view allows visualization of the 10 chromosomes of maize, including the predicted positions that match the adh1 sequence.

  1. Click on “Chr1” corresponding to the red box on Chromosome 1 (E-value = 0).
  2. Now, click on “View at MaizeGDB”, next to the hit on Chr1.

Screenshot of MaizeGDB Genome Browser whole genome view results from above.

Screenshot of a long double-stranded DNA sequence

Change the Data Source

The selection you made on the last screen will open a new window containing information on the position of adh1 and data sources.

  1. Click on the pull-down menu of “data source” (arrow) to explore other data sets.

Screenshot of MaizeGDB Genome Browser. Maize B73 RefGen_v2 is highlighted.

 

Study Questions: BLAST

Possible Explanation for BLAST Study Question Results

One reason for discrepancies might be that this genomic region contains several copies of the gene (possibly ancient duplications that are no longer actively transcribed because of mutations). Depending on the origin of the query sequence you use to find the gene, these copies might produce different hit scores. As for the version 2 pseudomolecule, the location seems to be quite similar…

Fig. 1 Screenshot of the BLAST search output page.

Multiple Sequence Alignment

Some of the key steps in building a multiple alignment include:

  1. Obtain the sequences to align by database searching.
  2. Run the multiple alignment program.
  3. Identify the residues that differ or are conserved among the sequences (finding polymorphisms).
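
The third step can be sketched in code. Once sequences are aligned, each column of the alignment is either conserved (all sequences share the same character) or variable (a candidate polymorphism). The alignment below is made up for illustration and is not tb1 data.

```python
def variable_columns(aligned):
    """Return 0-based column indices where the aligned sequences differ.

    aligned is a list of equal-length strings; '-' marks a gap and is
    treated as just another character.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return [i for i, column in enumerate(zip(*aligned)) if len(set(column)) > 1]


# Made-up alignment of three short sequences.
alignment = [
    "ACGTTGCA-T",
    "ACGATGCAGT",
    "ACGTTGCAGT",
]
print(variable_columns(alignment))
```

Here one variable column reflects a base substitution and the other an insertion or deletion, the two kinds of polymorphism you will look for in the activities that follow.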

Enter the NCBI site and use the following steps to guide your activity.

Try This: Multiple Sequence Alignment

Search the NCBI Website for the Allelic Sequences

Find the allelic sequences for a maize gene. Here we will use the teosinte branched1 (tb1) gene from maize as an example.

  1. Open your web browser and go to https://www.ncbi.nlm.nih.gov/.
  2. Enter tb1 AND Zea[orgn] into the search box.
  3. Press Enter or click Search.

Screenshot of NCBI page

Narrow the Search Results

  1. Select PopSet (population data sets).

Screenshot of U.S. National Library of Medicine. Search results for tb1 gene include 18 PopSet

Choose the Specific Search Result

  1. Select the result containing 17 aligned sequences of tb1 partial cds from a population study. (UID 209362237)

Screenshot of results from selecting 18 PopSet from above

Explore the Alignment

Scroll down to see the alignment of the 17 tb1 partial cds.

  1. Click the + sign until you can see the nucleotides.
  2. Click the arrow to pan right in the sequences until you can see the region between 1500 and 1530.

Screenshot of results from selecting the first PopSet from above

 

Study Questions: Multiple Sequence Alignment 

Review your output from this activity.

Screenshot of part of the results from above

Finding Polymorphisms

Using Clustal Omega

Detecting polymorphisms in a set of candidate genes requires a program that aligns multiple sequences. Clustal Omega is one of the most commonly used programs for this purpose. It is a hierarchical multiple alignment program that combines a robust alignment method with a user-friendly interface. Several web servers provide access to Clustal Omega; for this lesson we will use the European Bioinformatics Institute (EMBL-EBI) web server. Clustal Omega can also be downloaded to a personal computer for more routine use. The following is an example of how to use Clustal Omega to detect polymorphisms.
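
For readers who install Clustal Omega locally, the sketch below assembles a command line for the `clustalo` program without running it. The option names follow the program's built-in help as we understand it for version 1.2, so check `clustalo --help` on your installation before relying on them. The file names are placeholders.

```python
import shutil


def clustalo_command(infile, outfile):
    """Build an argument list for a local Clustal Omega run (not executed here)."""
    return [
        "clustalo",
        "-i", infile,     # input sequences in FASTA format
        "-o", outfile,    # where to write the alignment
        "--seqtype=DNA",  # treat the input as DNA rather than protein
        "--outfmt=clu",   # Clustal-style output
    ]


cmd = clustalo_command("tb1_popset.fasta", "tb1_popset.aln")  # placeholder names
print(" ".join(cmd))
if shutil.which("clustalo") is None:
    print("clustalo not found on PATH; install it before running the command")
```

The argument list can be passed to `subprocess.run` once the program is installed; the web server used in this lesson requires no installation.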

Try This: Using Clustal Omega

Search the NCBI Website

  1. Go to the NCBI website and search for tb1 AND Zea[orgn].
  2. Click Search

Screenshot of a NCBI page. All databases query for tb1 is highlighted.

Explore the Search Results

  1. Select PopSet (population data sets).

Screenshot of results from above

Select the Population Set

  1. Click on the population set we studied earlier (UID 209362237)

Screenshot of PopSet list from above

Create a FASTA File

Create a FASTA file of the 17 tb1 sequences.

  1. Click the pull-down menu Send to: at the top right of the screen
  2. In the menu that appears, select File for the destination
  3. Select the FASTA format
  4. Finally, click Create File.

Screenshot of steps to create FASTA files from all PopSets from above

Access Clustal Omega

Access the Clustal Omega program through EMBL-EBI.

  1. Click the Services link
  2. Under Browse by type, click DNA & RNA

Screenshot of the EMBL-EBI web page

Perform Alignment

Perform alignment of the tb1 partial cds using Clustal Omega. Within the Clustal Omega window you have the option of pasting sequences or uploading a file containing your sequences in FASTA format. We will upload the FASTA file you created earlier. As you may notice in this window, the default sequence type is set to “PROTEIN.” Since you wish to align tb1 DNA sequences, you must change this parameter before you submit.

  1. Click Clustal Omega
  2. Select DNA from the dropdown
  3. Click Choose File to browse for the file you created.
  4. Click Submit

Screenshot EMBL-EBI tools and data resources.

Screenshot of DNA analysis tool in EMBL-EBI

Explore the Output

It will take a moment before you obtain a report of your job request. You can click and save the “Your Job Output” URL to view your results for up to seven days.

  1. Click the Job ID link
  2. You can click the Download Alignment File but that is not necessary for this activity
  3. Click Result Summary

Screenshot of EMBL-EBI response after submitting the query from above.

Screenshot of results from above

View Result with Jalview

  1. Click View result with Jalview
  2. Once Jalview opens, click Colour then Nucleotide
  3. Use the scroll bar to navigate to the alignment.
  4. Scroll to align the region from 1680 to 1740.

Screenshot of results from above. View result with Jalview is highlighted.

Screenshot of Jalview results from above

Screenshot of sequence alignments results from Jalview query.

Compare Results from JalView and NCBI BLAST

Analyze region 1680 to 1740 of your JalView results (below).

What is unique about this region?

How does it compare with the region between positions 1500 and 1530 in the NCBI BLAST?

  1. JalView
  2. NCBI BLAST

Screenshot of results comparing NCBI-BLAST and Jalview sequence alignment results.

Developing Marker Assays

Recall that in Module 2 you learned how SSRs and SNPs can be analyzed by PCR and restriction enzymes. In lesson 8 of this course, you will learn additional strategies to detect DNA polymorphisms for marker development.

Summary

Biological sequence databases play an important role in providing the research community with access to sequence information. Searches can be restricted to a single database or expanded to include all other databases. Whole genomes can be explored to predict positions that match a specific sequence. To detect polymorphisms in a set of candidate genes, a program that aligns multiple sequences is required. The detected polymorphisms can be used to develop markers to assist in selection.

 

How to cite this module: Frei, U., W. Suza, T. Lübberstedt, and M. Bhattacharyya. (2023). Introduction to Bioinformatics. In W. P. Suza, & K. R. Lamkey (Eds.), Molecular Plant Breeding. Iowa State University Digital Press.

License

Chapter 13: Introduction to Bioinformatics Copyright © 2023 by Ursula Frei; Walter Suza; Thomas Lübberstedt; and Madan Bhattacharyya is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.