DNA
Also known as: deoxyribonucleic acid
The double-stranded helical polymer of nucleotides that encodes genetic information in all known living organisms, serving as the fundamental substrate for reading, writing, and engineering biological systems.
DNA (deoxyribonucleic acid) is the molecular blueprint of life: a double-stranded helical polymer built from four nucleotide bases (adenine, thymine, guanine, cytosine) that encodes the instructions for building and operating every known living organism. Its structure, first described by Watson and Crick in 1953 [1], revealed how genetic information could be stored, copied, and transmitted across generations.
Structure
DNA consists of two antiparallel strands wound into a right-handed double helix:
- Sugar-phosphate backbone: Alternating deoxyribose sugars and phosphate groups form the structural frame of each strand
- Base pairing: Adenine pairs with thymine (A-T, two hydrogen bonds) and guanine pairs with cytosine (G-C, three hydrogen bonds), which accounts for Chargaff’s rules (in double-stranded DNA, the amount of A equals T and the amount of G equals C)
- Major and minor grooves: The helical twist creates grooves of different widths, which serve as binding sites for regulatory proteins and transcription factors
- Directionality: Each strand has a 5’ end and a 3’ end, and the two strands run in opposite directions (antiparallel), a property essential for replication and transcription
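The pairing and directionality rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming an unambiguous uppercase or lowercase A/C/G/T alphabet; the function names are hypothetical and not taken from any library.

```python
# Watson-Crick partners and hydrogen bonds per base pair, as listed above.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(seq: str) -> str:
    """Return the partner strand, written 5' to 3'.

    Because the strands are antiparallel, the partner is read in the
    opposite direction, so the sequence is reversed before complementing.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def hydrogen_bond_count(seq: str) -> int:
    """Total hydrogen bonds in the duplex formed by seq and its partner."""
    return sum(HYDROGEN_BONDS[base] for base in seq.upper())


strand = "ATGCGT"
print(reverse_complement(strand))   # ACGCAT
print(hydrogen_bond_count(strand))  # 15
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check on any implementation of this kind.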
The haploid human genome contains approximately 3.1 billion base pairs across 23 chromosomes; most human cells are diploid and carry 23 chromosome pairs. The Telomere-to-Telomere (T2T) Consortium completed the first truly gapless human genome sequence in 2022 [2].
Central Dogma and Information Flow
DNA participates in three fundamental processes:
- Replication: DNA polymerase copies the entire genome before cell division, with an overall error rate of roughly 1 per 10^9 bases per replication cycle once proofreading and mismatch repair are taken into account
- Transcription: RNA polymerase reads a DNA template strand to produce messenger RNA (mRNA), which carries instructions to the ribosome
- Reverse transcription: Retroviruses and retrotransposons use reverse transcriptase to convert RNA back into DNA — a process exploited in cDNA library construction and RNA-seq workflows
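As a rough sketch of the information flow described above, the following Python treats transcription and reverse transcription as string operations and turns the quoted error rate into an expected error count. The function names are hypothetical, sequences are assumed to be written 5' to 3', and real transcription involves promoters, termination, and processing that are not modeled here.

```python
# RNA partner for each DNA base, and DNA partner for each RNA base.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def transcribe(template: str) -> str:
    """mRNA (5'->3') read from a DNA template strand given 5'->3'."""
    return "".join(DNA_TO_RNA[b] for b in reversed(template.upper()))


def reverse_transcribe(mrna: str) -> str:
    """First-strand cDNA (5'->3') complementary to an mRNA given 5'->3'."""
    return "".join(RNA_TO_DNA[b] for b in reversed(mrna.upper()))


def expected_replication_errors(genome_bp: float, error_rate: float = 1e-9) -> float:
    """Expected number of errors when copying genome_bp bases once."""
    return genome_bp * error_rate


template = "ACGCAT"
mrna = transcribe(template)
print(mrna)                      # AUGCGU
print(reverse_transcribe(mrna))  # ACGCAT
# About 3.1 billion bases at roughly 1 error per 10^9 bases:
print(f"{expected_replication_errors(3.1e9):.1f}")  # 3.1
```

The last line shows why the error rate matters at genome scale: even at one error per billion bases, copying about 3.1 billion bases is expected to introduce a few changes per replication.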
Computational Considerations
DNA is the primary data type in computational biology:
- Sequence alignment: Algorithms like BLAST, BWA, and minimap2 align DNA reads against reference genomes to identify variants, structural rearrangements, and evolutionary relationships
- Genome assembly: De novo assemblers (hifiasm, Flye) reconstruct contiguous genome sequences from raw sequencing reads, a computationally intensive graph-traversal problem
- Codon optimization: When designing synthetic genes, algorithms optimize nucleotide sequences for expression in a target host organism while avoiding secondary structures, repeat elements, and rare codons [3]
- Foundation models: Large language models trained on DNA sequences (Nucleotide Transformer, DNABERT, Evo) can predict gene expression levels, chromatin accessibility, and the effects of mutations from sequence alone
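To make the alignment item above concrete, here is a minimal sketch of Smith-Waterman local alignment scoring, the textbook dynamic-programming formulation of the problem. It is not how BLAST, BWA, or minimap2 are implemented; production tools rely on indexing and seeding heuristics to avoid filling a full matrix for every read. The scoring values (match 2, mismatch -1, gap -2) are illustrative assumptions.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))          # 8
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))  # 8
print(smith_waterman_score("AAAA", "TTTT"))          # 0
```

Keeping only two rows of the matrix is enough to compute the score; recovering the alignment itself would require storing the full matrix or traceback pointers.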
Applications in Synthetic Biology
DNA is both the substrate and the product of synthetic biology:
- Gene synthesis: Chemical synthesis of custom DNA sequences enables the construction of genes, pathways, and entire genomes from scratch [3]
- DNA data storage: The information density of DNA (a demonstrated ~215 petabytes per gram, with higher theoretical limits) makes it a candidate for archival data storage
- Genetic parts libraries: Standardized DNA parts (promoters, terminators, ribosome binding sites) can be composed into genetic circuits using assembly standards like BioBrick and MoClo
- Directed evolution: Combinatorial DNA libraries generated through error-prone PCR or DNA shuffling enable screening billions of variants for desired function
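The data storage idea above rests on a simple observation: with four bases, each nucleotide can carry up to two bits. The sketch below shows a naive byte-to-base mapping under that assumption. The specific mapping (00 to A, 01 to C, 10 to G, 11 to T) and the function names are hypothetical; practical schemes also add error correction and avoid sequences that are hard to synthesize or read, which this example ignores.

```python
BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def encode_bytes(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def decode_bases(seq: str) -> bytes:
    """Invert encode_bytes; the length must be a multiple of four bases."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


encoded = encode_bytes(b"Hi")
print(encoded)                # CAGACGGC
print(decode_bases(encoded))  # b'Hi'
```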
Limitations
- Synthesis length constraints: Current chemical synthesis is limited to ~200-300 nucleotides per oligo; longer constructs require assembly of multiple fragments
- Repetitive sequences: Highly repetitive regions (centromeres, telomeres) remain difficult to sequence, assemble, and synthesize accurately
- Epigenetic information: DNA methylation, histone modifications, and 3D chromatin organization carry regulatory information not captured in the nucleotide sequence alone
- Off-target effects: In genome engineering, unintended modifications at sites with partial sequence homology remain a safety concern
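The synthesis length constraint above is why long constructs are built from fragments. As a simplified sketch, the following Python splits a target sequence into overlapping oligos no longer than a chosen limit and checks that they can be stitched back together. The 200-nucleotide limit comes from the range quoted above; the 20-nucleotide overlap and the function names are assumptions for illustration. Real assembly design also considers strand orientation, melting temperature, and repeat content.

```python
def split_into_oligos(seq: str, max_len: int = 200, overlap: int = 20) -> list[str]:
    """Split seq into oligos of at most max_len that overlap their neighbors."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    oligos = []
    start = 0
    while True:
        oligos.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            break  # the last oligo reaches the end of the sequence
        start += step
    return oligos


def reassemble(oligos: list[str], overlap: int = 20) -> str:
    """Rebuild the target by dropping each later oligo's overlapping prefix."""
    return oligos[0] + "".join(o[overlap:] for o in oligos[1:])


# A 500-nt toy sequence (arbitrary, generated only for the example).
target = "".join("ACGT"[(i * i + i // 3) % 4] for i in range(500))
parts = split_into_oligos(target)
print(len(parts))                   # 3
print(reassemble(parts) == target)  # True
```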
DNA sequence analysis underpins nearly every computational biology workflow, from genome assembly and variant calling to codon optimization for synthetic gene design. Machine learning models trained on DNA sequence data now predict gene expression, chromatin accessibility, and regulatory function with increasing accuracy.
References
- [1] Watson JD, Crick FHC. Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid. Nature (1953).
- [2] Nurk S, Koren S, Rhie A, et al. The complete sequence of a human genome. Science (2022).
- [3] Kosuri S, Church GM. Large-scale de novo DNA synthesis: technologies and applications. Nature Methods (2014).