A large international team led by the US National Institutes of Health has completed the first gapless DNA sequence of a human genome, about two decades after the Human Genome Project released its first draft.

The Human Genome Project, whose first draft was posted online in 2000 by the University of California-Santa Cruz, could cover only about 92 per cent of our genome; the rest remained unsequenced and unstudied.

Derided by some as "junk DNA" with no clear function, roughly 151 million base pairs of sequence data scattered throughout the genome were still a black box.

But now the scientists, in a paper published in the journal Science, have revealed the long-missing pieces of our genome: the final eight per cent.

Within the new data are mysterious pockets of noncoding DNA that do not make protein, but still play crucial roles in many cellular functions and may lie at the heart of conditions in which cell division runs amok, such as cancer.

"You would think that, with 92 per cent of the genome completed long ago, another eight per cent wouldn't contribute much. But from that missing eight per cent, we're now gaining an entirely new understanding of how cells divide, allowing us to study a number of diseases we had not been able to get at before," said Erich D. Jarvis of the Rockefeller University.

To fill in the missing pieces, a group of researchers formed a Telomere-to-Telomere (T2T) consortium. The new reference genome, called T2T-CHM13, adds nearly 200 million base pairs of novel DNA sequences, including 99 genes likely to code for proteins and nearly 2,000 candidate genes that need further study.

The team tested the full genome's ability to support the sequencing of DNA from thousands of people.

The researchers found that it corrected tens of thousands of errors produced by the previous rendition of the genome and was better for the analysis of more than 200 genes of medical relevance.

The gaps now filled by the new sequence include the entire short arms of five human chromosomes and cover some of the most complex regions of the genome. These include highly repetitive DNA sequences found in and around important chromosomal structures such as the telomeres at the ends of chromosomes and the centromeres that coordinate the separation of replicated chromosomes during cell division.

The new sequence also reveals previously undetected segmental duplications, long stretches of DNA that are duplicated in the genome and are known to play important roles in evolution and disease.

The findings suggest that the T2T genome could greatly advance research into genetic disorders and that, further in the future, patients might benefit from more reliable diagnoses.

"We are finally digging into what we once called junk DNA, because we could not understand it or look at it accurately," said Giulio Formenti, a postdoc in Jarvis' lab at Rockefeller.

"We now know that many diseases are linked to structural repeats in the centromere and, now that these sequences are no longer missing from the human reference genome, we can begin to map the origins of these diseases."