
In humans, the genome consists of information encoded in DNA (deoxyribonucleic acid) molecules. The information in DNA is stored in a code made up of four chemical bases, or "letters": adenine (A), guanine (G), cytosine (C), and thymine (T). The order, or sequence, of these bases determines the information that is used to drive biological processes. As the source of all hereditary information, DNA has a built-in mechanism that allows it to be replicated and transmitted faithfully: bases always pair up with the same partner, A with T and C with G, to form units called base pairs. Each base is attached to a sugar molecule and a phosphate molecule; together, the base, sugar, and phosphate are known as a nucleotide. Nucleotides are arranged in two long strands that form the familiar spiral ladder shape known as the double helix (see illustration).
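The pairing rule described above (A with T, C with G) can be written as a simple lookup. The following is a minimal sketch in Python, not a bioinformatics tool: the function name `complement` and the example sequence are made up for illustration, and the code ignores strand direction.

```python
# Pairing rule from the text: A pairs with T, and C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the pairing partner of each base in a DNA strand, position by position."""
    return "".join(PAIRS[base] for base in strand.upper())

example = "ATGCGT"              # hypothetical sequence, for illustration only
partner = complement(example)   # bases on the opposite strand
restored = complement(partner)  # pairing again recovers the original strand

print(example, partner, restored)
```

Because every base has exactly one partner, applying the rule twice returns the original sequence. This is why either strand of the double helix carries enough information to rebuild the other.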
There are about 3 billion base pairs in the human genome, and nearly every cell in the human body contains the complete DNA sequence. In cells, the genome is stored on organized structures known as chromosomes. Humans usually have 23 pairs of chromosomes, for a total of 46 chromosomes and about 6 billion base pairs of DNA per cell. Chromosomes contain genes, the functional units of DNA, which encode the instructions for making molecules called proteins. While genes provide the instructions, proteins are the workhorses of the cell and are responsible for nearly all aspects of cellular activity. A gene has both coding regions called exons, which are used to make proteins, and non-coding regions called introns, which are removed before a protein is made. The areas between genes are known as intergenic regions and are believed to have regulatory, structural, and protective functions. Although estimates vary, there are approximately 20,000-25,000 genes in humans, and they range in size from a few hundred bases to more than 2 million bases.
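As a rough back-of-the-envelope check, the counts above can be worked through in a few lines of Python. The sketch uses the rounded figures from the text and adds one assumption: each base pair is counted as 2 bits, since a position can hold one of four letters. Under that assumption, a cell carrying both sets of chromosomes holds about 6 billion base pairs, or roughly 12 billion bits.

```python
HAPLOID_BASE_PAIRS = 3_000_000_000  # one set of chromosomes, rounded figure from the text
CHROMOSOME_PAIRS = 23
BITS_PER_BASE_PAIR = 2              # assumption: 4 possible letters, log2(4) = 2 bits

total_chromosomes = CHROMOSOME_PAIRS * 2                 # one chromosome of each pair per parent
diploid_base_pairs = HAPLOID_BASE_PAIRS * 2              # both sets together
total_bits = diploid_base_pairs * BITS_PER_BASE_PAIR
total_gigabytes = total_bits / 8 / 1e9                   # 8 bits per byte, 1e9 bytes per GB

print(total_chromosomes, diploid_base_pairs, total_bits, total_gigabytes)
```

These are order-of-magnitude estimates only; the real genome is not an exact multiple of a billion base pairs.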
The collection of all exons in the genome is called the exome. Every person has two copies of each gene, one inherited from each parent. It was previously believed that each gene contained instructions for only one protein, but it now appears that a single gene can code for multiple proteins through a process known as alternative splicing, which allows for greater diversity of proteins. Despite the relatively small number of protein-coding genes, the human body is estimated to contain over a million unique proteins.
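To see how one gene could give rise to several proteins, the sketch below enumerates exon combinations for a made-up gene. It is a deliberately simplified model, not a description of real splicing: it assumes the first and last exons are always kept, any internal exon may be skipped, and exon order never changes. The function name `splice_variants` and the exon labels are hypothetical.

```python
from itertools import combinations

def splice_variants(exons):
    """Enumerate exon-skipping variants, keeping the first and last exon and the original order."""
    if len(exons) < 2:
        return [tuple(exons)]
    first, *internal, last = exons
    variants = []
    # Keep 0, 1, 2, ... of the internal exons; combinations() preserves their order.
    for k in range(len(internal) + 1):
        for kept in combinations(internal, k):
            variants.append((first, *kept, last))
    return variants

gene = ["E1", "E2", "E3", "E4"]  # hypothetical four-exon gene
variants = splice_variants(gene)

for v in variants:
    print("-".join(v))
```

In this toy model a gene with n exons yields 2^(n-2) possible combinations, so the count grows quickly with the number of exons. Real cells produce only a subset of such combinations.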