Chapter 2: Molecular Basis of Human Genetics



What distinguishes the human from other mammalian species? It is not the nucleotides within our DNA sequence, the packaging of DNA into chromosomes, the mechanism of DNA replication, or the translation of genetic information into protein. These components of heredity are similar in all species, from human beings and chimpanzees to axolotls and Mongolian gerbils. For decades it was thought that human complexity was attributable to a greater number of genes, once estimated at 100,000 to 150,000.

The Human Genome Project, however, taught us differently: humans have 20,000 to 22,000 protein-coding genes and in this way do not differ from other higher forms of life. It is not the number of genes that makes human beings different but rather their intelligent interaction, or the regulation of their expression. The recent discovery of more than 8,000 noncoding RNA genes and the manifold functions of noncoding RNA made us realize that the complexity of these mechanisms had been largely underestimated.


Deoxyribonucleic acid (DNA) carries the genetic code of humans. It is a strandlike macromolecule of numerous nucleotides (nt), each composed of a base, a sugar, and a phosphate group. The sugar and phosphate groups give the macromolecule its structure, whereas the bases are the actual carriers of genetic information, the letters of the genetic text.

The sugar moiety of the nucleotide in the DNA is deoxyribose. The prefix “deoxy-” indicates that this sugar has one fewer oxygen atom than ribose, from which it originates. Ribose constitutes the sugar moiety of the nucleotide of RNA (ribonucleic acid).

The nitrogen-containing bases of DNA are derivatives of purines or pyrimidines. Purine bases in DNA are adenine (A) and guanine (G); pyrimidine bases are thymine (T) and cytosine (C). RNA also contains adenine, guanine, and cytosine, but the pyrimidine base uracil (U) takes the place of thymine (Fig. 2.1).

Figure 2.1. Nucleotide Bases.

DNA is a double strand of two complementary nucleotide chains. The sequence of the nucleotide bases of the one strand (in the 5′ to 3′ direction) complements the sequence of nucleotide bases on the other strand (in the 3′ to 5′ direction). The two nucleotide chains run antiparallel. They are coiled around a common axis, connected with one another by hydrogen bonds between the bases A and T (two hydrogen bonds) and between the bases G and C (three hydrogen bonds) and in this way form a double helix.
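The pairing rules just described can be sketched in a few lines of Python. This is a minimal illustration only; the function names and the example sequence are hypothetical and chosen for this sketch.

```python
# Watson-Crick pairing as described in the text: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Hydrogen bonds contributed by the base pair that each base takes part in.
BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def complement_strand(strand_5_to_3: str) -> str:
    """Return the partner strand, also written in the 5' to 3' direction."""
    # The strands run antiparallel, so the partner is read in reverse order.
    return "".join(PAIRS[base] for base in reversed(strand_5_to_3.upper()))


def hydrogen_bonds(strand: str) -> int:
    """Count the hydrogen bonds holding the duplex of this strand together."""
    return sum(BONDS[base] for base in strand.upper())


example = "ATGCGT"  # hypothetical sequence, not taken from a real gene
print(complement_strand(example))  # ACGCAT
print(hydrogen_bonds(example))     # 15
```

Applying `complement_strand` twice returns the original sequence, which reflects the exact complementarity of the two chains.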

A-T and G-C are called complementary base pairs. The ratio of A to T and of G to C is 1:1. From the proportion of a single nucleotide base it is therefore possible to determine the proportions of all other bases (Fig. 2.2).

Figure 2.2. Complementary Structure of the Double-Stranded DNA.

(From Barsh et al., 2002.)


In humans, adenine accounts for 29% of the nucleotide bases. Thymine has the same share (i.e., 29%). Since (A + T) + (G + C) = 100%, (G + C) = 100% − (29% + 29%) = 42%. G and C each contribute 21% of the nucleotide bases of human DNA.
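The same arithmetic can be written as a small Python function. It is a sketch that assumes only the 1:1 pairing ratios stated above; the function name is hypothetical.

```python
def base_composition(adenine_fraction: float) -> dict:
    """Derive all four base fractions of double-stranded DNA from the A fraction.

    Relies only on the pairing ratios A:T = 1:1 and G:C = 1:1.
    """
    if not 0.0 <= adenine_fraction <= 0.5:
        raise ValueError("the adenine fraction must lie between 0 and 0.5")
    # A and T together take up twice the adenine share; G and C split the rest.
    gc_each = (1.0 - 2.0 * adenine_fraction) / 2.0
    return {
        "A": adenine_fraction,
        "T": adenine_fraction,
        "G": gc_each,
        "C": gc_each,
    }


# Value from the text: adenine makes up 29% of the bases in human DNA.
for base, share in base_composition(0.29).items():
    print(f"{base}: {share:.0%}")
```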

The purine and pyrimidine bases face the inside of the double helix, whereas the sugar and phosphate moieties are on the outside. Because the spatial relationship of the opposing bases is fixed, the two chains of the double helix are exactly complementary. The diameter of the helix is 2 nm. Neighboring bases along the axis of the helix are 0.34 nm apart. One full turn of the DNA double helix corresponds to 10 consecutive base pairs (i.e., 3.4 nm).
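These dimensions allow a rough estimate of the physical length of any stretch of double-stranded DNA. The following sketch uses only the two figures given above (0.34 nm per base pair, 10 base pairs per turn); the function names are hypothetical.

```python
RISE_PER_BP_NM = 0.34  # distance between neighboring bases along the helix axis
BP_PER_TURN = 10       # base pairs per full turn of the helix


def helix_length_nm(n_base_pairs: int) -> float:
    """Length along the helix axis, in nanometers."""
    return n_base_pairs * RISE_PER_BP_NM


def helix_turns(n_base_pairs: int) -> float:
    """Number of full helical turns spanned by the segment."""
    return n_base_pairs / BP_PER_TURN


segment = 1_000  # a hypothetical segment of 1,000 base pairs
print(f"{helix_length_nm(segment):.0f} nm, {helix_turns(segment):.0f} turns")
```

Under these assumptions, a segment of 1,000 base pairs is about 340 nm long and makes 100 turns.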

Because the sugar and phosphate groups are placed on the outside, the double helix is not symmetrical. There are two forms: a right-turning form (B-form) and a left-turning form (Z-form). The DNA of the cell’s nucleus is mainly of the (more stable) right-turning B-form (Fig. 2.3).

Figure 2.3. DNA Double Helix.

The length of DNA is measured by the number of base pairs (bp) or nucleotides (nt). Larger segments are measured in kilobases (kb = 1,000 bp) or megabases (Mb = 1 million bp). Because RNA is single-stranded, its length is always given in nucleotides (nt).
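As a small illustration of these units, the following Python sketch converts a length in base pairs into the largest fitting unit. The function name and the choice of thresholds are assumptions made for this example.

```python
def format_dna_length(base_pairs: int) -> str:
    """Express a DNA length in bp, kb (1,000 bp), or Mb (1 million bp)."""
    if base_pairs < 0:
        raise ValueError("a length cannot be negative")
    if base_pairs >= 1_000_000:
        return f"{base_pairs / 1_000_000:g} Mb"
    if base_pairs >= 1_000:
        return f"{base_pairs / 1_000:g} kb"
    return f"{base_pairs} bp"


print(format_dna_length(150))        # 150 bp
print(format_dna_length(1_500))      # 1.5 kb
print(format_dna_length(3_000_000))  # 3 Mb
```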

Replication of DNA

In 1953 James Watson and Francis Crick recognized the three-dimensional structure of DNA from which they could deduce the mechanism of its replication that allows the reliable transmission of genetic information from one generation to the next.

During each cell division, replication provides for the creation of two identical copies of the cell’s DNA molecules. It starts in parallel at more than 10,000 origins, which are recognized by specific replication initiation proteins; these origins separate the so-called replication units or replicons of the DNA. Replication proceeds with the aid of a multienzyme complex containing helicases, topoisomerases, various DNA polymerases, and other proteins. In this way, a single strand of DNA is prepared over a length of approximately 2,000 bp as a template for a new, complementary chain and in effect a new copy of the double helix.


Helicases:

Enzymes that start replication by unwinding the DNA bidirectionally, breaking the hydrogen bonds and, in effect, separating the two chains. This results in the creation of two so-called replication forks.


Topoisomerases:

Enzymes that prevent supercoiling of the DNA helix during unwinding and separation because they are capable of cutting individual strands, permitting them to unwind.


Ligases:

Enzymes that join cut DNA pieces (e.g., during replication).


Polymerases:

Enzymes that synthesize DNA or RNA strands.

Replication proceeds from the starting point in both directions of the replicons, up to the point where the two approaching replication bubbles meet. It begins with a small complementary RNA primer, which is formed by polymerase α and later cut out and replaced by DNA. The main synthesis of the nucleotide strand occurs through polymerase δ. New DNA can only be synthesized in the 5′ to 3′ direction, for only at the 3′ end of the growing chain can the next nucleotide be attached.

This means that only one of the unwinding DNA parent strands (the so-called leading strand) allows continuous synthesis in the 5′ to 3′ direction. The other strand, denoted the lagging strand, is also replicated in the 5′ to 3′ direction, but discontinuously (quasi-backwards), in small fragments of about 200 bp called Okazaki fragments. Each such fragment needs a new RNA primer, and after replication, adjacent fragments are linked by a DNA ligase. During replication the proofreading function of DNA polymerase identifies errors, cuts out faulty bases, and replaces them with the correct bases (Figs. 2.4 and 2.5).
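The bookkeeping of lagging-strand synthesis can be sketched as follows. The example assumes, as a simplification, that every Okazaki fragment is exactly 200 nt long and that each pair of adjacent fragments requires one ligation; the function name is hypothetical.

```python
import math

OKAZAKI_FRAGMENT_NT = 200  # approximate fragment length given in the text


def lagging_strand_summary(template_nt: int,
                           fragment_nt: int = OKAZAKI_FRAGMENT_NT) -> dict:
    """Estimate fragments, RNA primers, and ligations for a lagging strand."""
    if template_nt <= 0 or fragment_nt <= 0:
        raise ValueError("lengths must be positive")
    fragments = math.ceil(template_nt / fragment_nt)
    return {
        "fragments": fragments,
        "rna_primers": fragments,    # each fragment needs its own primer
        "ligations": fragments - 1,  # one join between each adjacent pair
    }


# A single-stranded template of about 2,000 nt, as mentioned earlier in the text.
print(lagging_strand_summary(2_000))
```

For a 2,000-nt template this gives 10 fragments, 10 primers, and 9 ligation steps, whereas the leading strand would be synthesized in one continuous piece.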

Figure 2.4. Replication of DNA.

The so-called “parent strands” are shown in red, while black indicates the newly synthesized strands.

Figure 2.5. Flow of Genetic Information.


Replication provides for the formation of identical DNA molecules within a cell. Transcription is the copying of DNA into a complementary RNA molecule. Subsequently, during translation, the language of the nucleic acids is translated into the language of a polypeptide sequence.
(From Barsh et al., 2002.)

The result of replication is two daughter DNA molecules, each consisting of one newly synthesized strand and one strand derived from the parent DNA. This semiconservative replication, aided by sophisticated repair mechanisms, enables the genome to copy itself with remarkable precision through millions of cell divisions over a lifetime.
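The semiconservative principle can be made concrete with a toy simulation. In this sketch a double helix is represented as a pair of strand labels; the labels and function names are hypothetical, and the model ignores everything except which strand is inherited.

```python
def replicate(duplexes: list) -> list:
    """Replicate every duplex once, semiconservatively."""
    daughters = []
    for strand_a, strand_b in duplexes:
        # Each old strand serves as a template and is paired with a new strand.
        daughters.append((strand_a, "new"))
        daughters.append((strand_b, "new"))
    return daughters


def after_generations(n: int) -> list:
    """Start from one duplex of two original strands and replicate n times."""
    population = [("original", "original")]
    for _ in range(n):
        population = replicate(population)
    return population


for generation in range(1, 4):
    population = after_generations(generation)
    with_original = sum("original" in duplex for duplex in population)
    print(generation, len(population), with_original)
```

After one round both daughter molecules contain one original strand. In later rounds the number of molecules doubles each time, but only two of them still carry an original strand.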


The term gene as a name for a hereditary factor was first coined in 1909 by Wilhelm Johannsen. It derived from the Greek terms genos (“clan”) and genesis (“origin”). Since then the definition has undergone several evolutions, many of which had to be modified or were dropped as new research became available. An example would be the “one gene one protein” hypothesis still familiar to students today but only partially applicable as it does not explain the occurrence of splice site variants, nor the diversity of RNA.

The more we have learned about the structure and function of the human genome, the more cautious we have become in our definition of the term gene. For practical purposes, we can define a gene as a functional unit in the genome that contains the genetic information for one or more gene products; however, we should realize that this definition may be modified in the future. A typical protein-coding gene has three components: the coding sequence, regulatory sequences, and seemingly useless (some of them probably regulatory) sequences.

Within the genomic DNA, the orientation of a gene is defined by its direction of transcription (5′ to 3′), and a gene may be located on either strand of a chromosome. Different genes are not necessarily physically separated: some genes are located within other genes or contain regulatory sequences in genes far away; sometimes the complementary strands at a single locus contain two different genes.


Gene:

A functional unit in the genome that contains the genetic information for one or more gene products.


Pseudogene:

A DNA sequence that has all the characteristics of a potential coding transcription unit but encodes no functional product.

The DNA strand that corresponds to the RNA sequence (except that it contains T where the RNA contains U) is called the sense strand; its complement (which serves as a template for RNA biosynthesis) is the antisense strand. The DNA before the 5′ start of the gene (the transcribed region) lies upstream, while the DNA beyond the 3′ end lies downstream. At the start of the gene lies the promoter region, which serves as a docking station for various specific transcription factors and an RNA polymerase, which together form the transcription initiation complex.
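The relationship between sense strand, antisense strand, and RNA can be checked with a short Python sketch. The function names and the example sequence are hypothetical; the sketch assumes all sequences are written in the 5′ to 3′ direction.

```python
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Bases added to the RNA opposite each base of the DNA template.
RNA_PAIRS = {"A": "U", "T": "A", "G": "C", "C": "G"}


def antisense_strand(sense_5_to_3: str) -> str:
    """Return the antisense (template) strand, written 5' to 3'."""
    return "".join(DNA_PAIRS[b] for b in reversed(sense_5_to_3.upper()))


def rna_from_sense(sense_5_to_3: str) -> str:
    """Shortcut: the RNA equals the sense strand with U in place of T."""
    return sense_5_to_3.upper().replace("T", "U")


def rna_from_template(antisense_5_to_3: str) -> str:
    """Build the RNA by pairing against the template, read 3' to 5'."""
    return "".join(RNA_PAIRS[b] for b in reversed(antisense_5_to_3.upper()))


sense = "ATGGCC"  # hypothetical sense-strand sequence
template = antisense_strand(sense)
print(template)                     # GGCCAT
print(rna_from_template(template))  # AUGGCC
print(rna_from_sense(sense))        # AUGGCC
```

Both routes yield the same RNA, which is why sequences of genes are conventionally reported as the sense strand.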

The human genome contains different kinds of promoters, many of which are highly conserved between species and contain various specific short (4 to 8 nt) sequence motifs. A classic promoter sequence is the “TATA box” (TATAAA or its variants). It is situated 25 nucleotides upstream of the transcription start site. Different types of promoters produce different regulatory characteristics, resulting in varying patterns of expression for the genes they direct in the course of the organism’s development.
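A search for such a motif can be sketched in Python as follows. The example looks only for the exact sequence TATAAA (not its variants) in the region upstream of a known transcription start site; the function name, the index convention, and the test sequence are assumptions made for this illustration.

```python
def tata_box_offsets(sequence: str, tss_index: int,
                     motif: str = "TATAAA") -> list:
    """Return how many nucleotides upstream of the start site each motif begins.

    `sequence` is the sense strand written 5' to 3'; `tss_index` is the
    0-based position of the transcription start site within it.
    """
    upstream = sequence[:tss_index].upper()
    offsets = []
    start = upstream.find(motif)
    while start != -1:
        offsets.append(tss_index - start)
        start = upstream.find(motif, start + 1)
    return offsets


# Hypothetical promoter: 10 nt of filler, the TATA box, 19 nt of filler,
# then the transcription start site at index 35.
promoter = "GC" * 5 + "TATAAA" + "GC" * 9 + "G" + "ATGC"
print(tata_box_offsets(promoter, tss_index=35))  # [25]
```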

Enhancers, Silencers, Insulators

Regulatory elements that should not be confused with promoters are enhancers or silencers. They are DNA segments that strengthen or weaken the transcription of a gene through a direct interaction with the transcription initiation complex (RNA polymerase II or transcription factors). While the promoters are always located upstream (5′) of the gene, enhancers may be at varying distances away from the gene whose transcription they regulate.

Some of the enhancers are situated within the intron of a gene whose expression they regulate. The same is true for the silencers, the inhibiting counterparts of the enhancers. In addition, there are insulator sequences that limit the action of the enhancers.

Various segments of the transcribed sequence of a gene are distinguished according to their fate during RNA processing and translation. The sequences before the start codon and after the stop codon of the gene are called untranslated regions (UTRs); there is a 5′ UTR and a 3′ UTR. The sequence at which transcription terminates contains the polyadenylation signal AAUAAA (see Section 2.4). Transcripts of human (and most eukaryotic) genes usually contain segments that are removed during messenger RNA (mRNA) processing and are not translated into protein.
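The division of a mature mRNA into 5′ UTR, coding region, and 3′ UTR can be sketched as follows. The example assumes the standard start codon AUG and the stop codons UAA, UAG, and UGA, which are not spelled out in this section, and it simply takes the first AUG as the start codon. The function name and the example sequence are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}  # assumption: standard genetic code
POLY_A_SIGNAL = "AAUAAA"             # polyadenylation signal named in the text


def split_mrna(mrna: str) -> tuple:
    """Split an mRNA into (5' UTR, coding region, 3' UTR)."""
    mrna = mrna.upper()
    start = mrna.find("AUG")  # simplification: first AUG is the start codon
    if start == -1:
        raise ValueError("no start codon found")
    # Walk codon by codon in the reading frame set by the start codon.
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3
            return mrna[:start], mrna[start:end], mrna[end:]
    raise ValueError("no in-frame stop codon found")


# Hypothetical transcript, far shorter than any real mRNA.
five_utr, coding, three_utr = split_mrna("GGCACCAUGGCUUAACCAAUAAAGG")
print(five_utr)                    # GGCACC
print(coding)                      # AUGGCUUAA
print(three_utr)                   # CCAAUAAAGG
print(POLY_A_SIGNAL in three_utr)  # True
```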

These seemingly useless sequences are called introns, whereas the other DNA segments are called exons. The first exon(s) contain the 5′ UTR and the last exon the 3′ UTR. Human exons are usually short, with an average length of about 150 nt, whereas introns are usually longer than 10,000 nt (10 kb). Individual exons often correspond to structural and/or functional domains of the resulting protein.

Exons and introns are numbered consecutively in the 5′ to 3′ direction of the transcript; exon 1 is followed by intron 1. Introns almost always begin with the nucleotides GT (or GU in RNA) and end with AG. The 5′ start of an exon is the splice acceptor site; the 3′ end, the splice donor site (Fig. 2.6).
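The GT-AG rule and the removal of introns can be illustrated with a toy splicing function. This sketch works on DNA notation and rejects any intron that does not follow the rule, which is stricter than nature, since the text notes that introns begin with GT and end with AG almost always rather than always. The data layout and names are hypothetical.

```python
def splice(segments: list) -> str:
    """Join the exons of a gene after checking each intron for GT...AG.

    `segments` lists (kind, sequence) pairs in 5' to 3' order, where kind
    is either "exon" or "intron".
    """
    mature = []
    for kind, sequence in segments:
        sequence = sequence.upper()
        if kind == "exon":
            mature.append(sequence)
        elif kind == "intron":
            if not (sequence.startswith("GT") and sequence.endswith("AG")):
                raise ValueError(f"intron does not follow the GT-AG rule: {sequence}")
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return "".join(mature)


# Hypothetical miniature gene: exon 1, intron 1, exon 2.
gene = [
    ("exon", "ATGGCC"),
    ("intron", "GTAAGTCCTTAG"),
    ("exon", "GAATAA"),
]
print(splice(gene))  # ATGGCCGAATAA
```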

Figure 2.6. Structure of a Eukaryotic Gene.

Coding region:

The part of the gene that is translated into protein.


Exons:

Coding sequences in the pre-mRNA that are separated by noncoding introns.


Introns:

Noncoding sequences in a gene that are positioned between coding sequences (exons) and are removed by splicing from the pre-mRNA transcript.

Untranslated region (UTR):

The part of the gene that is transcribed and included in the mature mRNA but is not translated into protein. The 5′ UTR denotes the untranslated sequences prior to the start codon, while the 3′ UTR denotes the untranslated sequences after the stop codon.

While still in the nucleus, introns are removed from the pre-mRNA by RNA splicing. Thus, introns do not contribute to the polypeptide product of a gene; however, they can have regulatory effects as noncoding RNAs (ncRNAs). The extent of this function (i.e., the full relevance of introns in the regulation of gene expression) remains to be clarified.


INtrons remain IN the cell’s nucleus. EXons leave the nucleus of the cell and are EXpressed.


Pseudogenes are DNA sequences that have all the characteristics of a potential encoding transcription unit (promoter, encoding region, splice acceptor points, etc.) but do not encode for a functional product. Many pseudogenes were derived through gene duplication and subsequent mutation.

A well-known example is a pseudogene that belongs to the family of ? globins (“??”). It has all the characteristics of a functional globin gene, yet one single-point mutation in the coding region prevents its expression as a complete globin. Sometimes a duplicated gene contains a residual function and may partly compensate for the deficiency of the “real” gene; examples are SMN1 and SMN2 in spinal muscular atrophy (Chapter 31.1.3).

Pseudogenes can also be created through reverse transcription and subsequent integration. If the mRNA of a real gene is accidentally copied by the enzyme reverse transcriptase into a complementary DNA (cDNA), this cDNA can be incorporated into the genomic DNA. In such cases the genomic DNA includes a gene without introns and with a poly(A) tail (“processed pseudogene”).