How is digital data stored on DNA?

A brief introduction to how data is encoded

Introduction

We all know of it, DNA 🧬, the source code of life. Every one of us has it in us. All the plants and animals as well. It has existed since the beginning of life! It defines everything organic that we see by the data that is stored in it.

Digital data also defines everything we see and experience on our computers. So why not merge these two worlds? What if we could store digital data on the biological storage unit DNA? We can! As recently as 2019, all 16 GB of the English-language Wikipedia's text were stored on synthetic DNA.[0]

But why would one want to store data on DNA? Why not keep it on the hard drives we already have? These are good questions. Hard drives last for a reasonably long time, about 7 to 10 years. If we want to archive data for the future, however, that is almost nothing. The DNA found in Ötzi[1] is still readable and gave us a lot of information about the past, and he lived around 5,000 years ago! DNA is very durable and needs hardly any maintenance. Compared with current storage media, which often need to be connected to a power supply, that is a significant advantage for keeping data over a long time.

Another aspect is the amount of data that we can store on DNA. Each DNA helix contains all the information that is needed to create your body. This information is roughly 725 MB of data.[2] Think of how each cell in our body contains a complete DNA helix and how small a cell is. It is estimated that one gram of DNA can hold up to 215 petabytes,[3] which is around 215,000,000 GB, in less than two cubic centimeters (two milliliters). All this data would fit on a teaspoon.

So, storing data on DNA is a very efficient way of keeping a lot of information for a very long time. But how does this work? How does data get saved on DNA?

On this little website, we will learn how digital data gets stored on DNA. Step by step, we will go through small examples of codes and storage types. Each topic has an interactive visualization, so don’t hesitate to play around with them!

Code

What is code?

To first understand how digital data is stored and how DNA works, we need to understand code in general. Here is a description of “code” from a dictionary:

Code is a system of words, letters, figures, or symbols used to represent others, especially for the purposes of secrecy. — Oxford Languages

What does this mean? Code is anything that the spectator needs to know before it can be understood. A language, for example, is a form of code. One needs to know the language to be able to understand it. Let's take French, for example. This language is based on the Latin alphabet. So first of all, one needs to understand the tiny pictograms that we know as letters. It takes us humans a while to learn what each character stands for. So to realize that an A is an “A”, we first need to understand this code, the key.

Having learned that, we can then chain these characters to form words, then sentences, then paragraphs, and texts. Each is a code that builds upon the other. But we also need to understand grammar and vocabulary to understand the language. And that is only one language! The Chinese writing system looks completely different, and we would need to start learning the characters all over again.

Ok, this has been quite an elaborate example of code. Let's take a couple of steps back and look at other models.

Traffic Light

Let's think of a form of code that only requires two values. On or off. A classic example of this would be a traffic light. Green means go, while red means stop. A simple code that we learned early on to know when it's safe to cross a street.

Try it out and let the X-Wing[4] fly or force it to stop. Click on the traffic lights 🚦 to change the signal.

Click on the traffic lights to stop the car or let it drive.

Great! A simple code with only two values! That was easy. But the principles mentioned above apply. We need to understand that “red” means “stop” and “green” means “go.” If we don’t know this code, it is hard to read the lights.

Can we create even more complex information just using two values? Of course! And I am sure you have heard of the following example as well.

Morse Code

Morse code has been an essential method of communication. It has been used to send messages over long distances. The benefit of this code is that one does not need multiple inputs. We can simply use a lamp or something that generates a sound to create two values: on and off. “On” is represented by “sound” or “light on,” “off” by “no sound” or “light off.”

The “on” signal comes in two variants: long and short. One short is read as an “E,” one long as a “T.” Combining short and long signals can create all the letters of the basic Latin alphabet and the Arabic numerals. If you simply add pauses between the characters, you can create whole sentences, just like writing a text with pen and paper. The word “data” would then look like this:

-.. .- - .-

This simplicity allowed communication over very long distances before the telephone existed, e.g., when ships needed to exchange messages with the harbor.[5]
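The letter-by-letter encoding of “data” can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the widget on this page: the table `MORSE` and the helper names `to_morse` and `from_morse` are made up for this example, and the table holds just the four letters mentioned in this section.

```python
# Partial Morse table: only the letters used in this section (an assumption
# made to keep the sketch short; the full alphabet works the same way).
MORSE = {
    "a": ".-",
    "d": "-..",
    "e": ".",
    "t": "-",
}


def to_morse(word: str) -> str:
    """Encode a word letter by letter, separating letters with a space (the pause)."""
    return " ".join(MORSE[letter] for letter in word.lower())


def from_morse(signal: str) -> str:
    """Decode by looking up each space-separated group in the reversed table."""
    reverse = {code: letter for letter, code in MORSE.items()}
    return "".join(reverse[group] for group in signal.split())


print(to_morse("data"))           # -.. .- - .-
print(from_morse("-.. .- - .-"))  # data
```

Note that decoding only works because of the pauses between letters: without them, “-..” could just as well be read as “t”, “e”, “e”.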

But enough of reading about it, let’s try it out! Try to write your name in Morse code below![6]

Here is a legend of all the Morse codes and characters if you want to cheat 😉
Click on "long" or "short" to create a morse signal. If you found a matching letter, it will be shown above. If it's not matching, a red cross will appear. Click "clear" or wait for 4 seconds to try another one.

Phew, great! Isn’t it incredible what one can already store with these few elements? What might be possible if we have more building parts to our system? What codes would then be possible? Let’s have a look at the following sections to find out.


Music

Similar to texts, music is a way of coding information. By chaining different notes together, we can create music that is happy, sad, angry, etc. We always use the same components but, to keep it very simple, we only adjust the timing and order.

With five notes, one can already create a couple of different songs! The games “The Legend of Zelda: Ocarina of Time” and “The Legend of Zelda: Majora's Mask”[7] do that. Try to play some of them!

Here is a list of all the Zelda songs from OoT and MM 🎶 But maybe look at it later!
Press the note buttons to create a song. You can make up to 8 notes and play it. Experiment a bit to see how many different songs you can create just using five notes! You might even find a particular track 🔎. Sounds source

Isn’t it great to see how many kinds of code you have already been using? So many different ways of chaining parts together to create something bigger! But apart from this website itself, nothing about digital code or DNA code has been included so far… So let’s continue with that.


How a computer uses code: Binary code

By now, you should have got an understanding of simple codes and what sequencing can do. Before we dig a little deeper into how data is stored on DNA, we first need to understand how digital information is stored on a computer.

You probably have heard of the term "binary code" already. So in the next section, we try to understand the basic capabilities of this code.

Numbers on the base of 2

Base 10 is the number system we mainly use. We meet it every day, whether in the price of a product in the supermarket, a house number, or the minutes on a clock. Each digit has ten options, from 0 up to 9. When a digit would go past 9, it resets to 0 and the digit to its left increases by one. So if we count up from 9, we set the current digit back to 0 and raise the digit to its left to 1: we get 10.

If we split the numbers into a table, each column stands for the next higher power of ten.[8] So there is one column for the 1s, one for the 10s, one for the 100s, and so on. Each cell can hold one of the ten digits we are used to: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. If the value goes above this limit, the next column counts one up. The table would look like this:

100   10   1   Total
  0    0   4       4
  2    3   1     231

To get the result of a row, we multiply each cell by the value of its column header and add up the products. Reading the last row of the table above would look like this:

( 2 × 100 ) + ( 3 × 10 ) + ( 1 × 1 ) = 200 + 30 + 1 = 231

A number system with base 2 behaves similarly. But instead of each column being a power of 10, each column is a power of 2. Hence the column values go like this: 1, 2, 4, 8, 16, etc. As we are in base 2, each cell has only two options: 0 or 1. The table would look like this:

4   2   1   Total
0   0   1       1
0   1   0       2
0   1   1       3

Now we calculate in the same way. We multiply each cell by the value in the table head and add up the results:

( 0 × 4 ) + ( 1 × 2 ) + ( 1 × 1 ) = 0 + 2 + 1 = 3
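The same column arithmetic works for any base, which a short Python sketch makes visible. The function name `positional_value` is made up for this illustration; it takes the digits of one table row (left to right) and the base, and adds up digit times column value.

```python
def positional_value(digits: list[int], base: int) -> int:
    """Add up digit * column value, where the columns are powers of the base."""
    total = 0
    # The rightmost digit sits in column base**0, the next one in base**1, etc.
    for position, digit in enumerate(reversed(digits)):
        if not 0 <= digit < base:
            raise ValueError(f"digit {digit} does not exist in base {base}")
        total += digit * base ** position
    return total


print(positional_value([2, 3, 1], 10))  # 231
print(positional_value([0, 1, 1], 2))   # 3
```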

Another way to think about counting in binary is to always start at the rightmost digit. If it is 0, it becomes 1 and we are done. If it is already 1, it resets to 0 and we carry one over to the digit to its left, where the same rule applies again. This is just like carrying in base 10, except that a digit overflows when it would reach 2 instead of 10.
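Counting up by carrying can be written down as a small Python sketch. It is only meant to illustrate the rule: the function name `increment` and the choice to represent a binary number as a list of 0s and 1s are assumptions of this example.

```python
def increment(bits: list[int]) -> list[int]:
    """Add one to a binary number given as a list of bits, leftmost digit first."""
    result = bits.copy()
    i = len(result) - 1
    # Walk from the right: every 1 we meet overflows back to 0 and carries on.
    while i >= 0 and result[i] == 1:
        result[i] = 0
        i -= 1
    if i >= 0:
        result[i] = 1        # the first 0 we meet absorbs the carry
    else:
        result.insert(0, 1)  # all digits overflowed, so we need a new column
    return result


print(increment([0, 0, 1]))  # [0, 1, 0]
print(increment([0, 1, 1]))  # [1, 0, 0]
```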

This might sound confusing, but try it out yourself below. Try to increase the numbers slowly. Type in the numbers you want to transform, or play around to see if you can generate the numbers you want in binary!

Change the number and see how the binary values change, or click on the binary values to see how the number changes. Try to increase the number using your arrow keys and see how the number changes steadily.

In computing, a single binary digit is the smallest unit of information there is. One such digit is called a “bit” (b). I assume you have heard this term before. In the following section, we will try to make something else out of these bits. So let’s have a look!

Generating Characters

Eight of these bits are called a “byte” (B). You have probably heard of this one as well. It is the usual unit when we talk about file size and data storage. One byte is not a lot of information. That’s why we usually refer to kilobytes (KB), megabytes (MB), gigabytes (GB), etc. In the binary convention, one KB contains 1024 B, one MB is equal to 1024 KB, and so on. (The decimal convention counts in steps of 1000 instead.)

But what is stored in one byte? As one byte contains eight bits, it can have 256 different states.[9] If one byte is one character, these eight binary values can represent one of 256 characters. Hence 100 B could be a text with 100 characters. In simple encodings such as ASCII, this is how text is stored on your computer.

Now we only need the key to understand which of these states is assigned to which character. This information is stored in our computers to read the data and translate it into a human-readable format.
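As a small illustration of such a key, the following Python sketch uses the character codes built into Python (`ord` and `chr`) and shows them as eight bits. The helper names `char_to_bits` and `bits_to_char` are made up here, and the sketch assumes simple characters whose code fits into one byte.

```python
def char_to_bits(char: str) -> str:
    """Look up the character's code and write it as eight binary digits."""
    code = ord(char)
    if code > 255:
        raise ValueError("this sketch only handles one-byte characters")
    return format(code, "08b")


def bits_to_char(bits: str) -> str:
    """Read eight binary digits as a number and look up its character."""
    return chr(int(bits, 2))


print(char_to_bits("A"))         # 01000001
print(bits_to_char("01100001"))  # a
print(2 ** 8)                    # 256 possible states per byte
```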

It sounds very abstract, and it is, but we use this example only to understand the concept of how data is encoded. Play around below to test what letter is using which code in your computer. Or the other way around!

Change the character and see how the binary values change or click on the binary values to see which character it represents.

Okay, now we have encoded text in binary, but what about all the other files and elements we use on our computers? Let’s go briefly over it.

The data on our digital devices

These have been two very straightforward examples. But just like languages, this can get arbitrarily complex. 1s and 0s are stored to form characters and numbers, which in turn form sentences, and so on. In many file formats, the first bytes specify the type of file, so the computer can tell whether it is an image, a text document, or something completely different. It then knows how to interpret the following bytes based on the codes it understands.
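To give an idea of what “the first bytes specify the type of file” means, here is a hedged Python sketch that checks a few well-known file signatures. The table `SIGNATURES` and the function `guess_file_type` are invented for this illustration and cover only three formats; real tools know many more and look at more than the first bytes.

```python
# A few well-known signatures ("magic bytes") that files of these types start with.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"%PDF": "PDF document",
}


def guess_file_type(data: bytes) -> str:
    """Compare the start of the data with each known signature."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return "unknown"


print(guess_file_type(b"\x89PNG\r\n\x1a\n" + b"rest of the file"))  # PNG image
print(guess_file_type(b"Hello, world!"))                            # unknown
```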

Reversing this process means that everything on your computer is ultimately stored in 1s and 0s. Think of a very long decision tree that offers only two options at each step. It might take a lot of decisions, but eventually you arrive at the outcome you intended. In a very, very abstract way, this is how data is stored on your computer.

So now we have got a glimpse of the concept of digital data. Let’s get to our last part: how is this data translated into DNA?


DNA

How does our DNA work?

DNA is the building plan of life on Earth. It is this little strand that contains all the information that is needed to build plants 🪴, animals 🦆, and humans 🤸. But how does it do that?

Let's start with the components DNA is built of: nucleobases. There are five so-called primary nucleobases. Four of them form a DNA strand: adenine (A), cytosine (C), guanine (G), and thymine (T). These four are put in a chain and store the instructions to build every cell of a living organism. Enzymes read the strand and transcribe it into RNA, which ribosomes then read to build proteins.[10]

This is a very rough explanation of how DNA is used, just to give a glimpse. The most important part for us is the four bases. But are just four bases sufficient to form every living cell that exists on Earth? Yes! Remember how digital data is stored using only 1 and 0? Our DNA behaves similarly but stores data in base 4 instead of base 2. Let’s dive in a little deeper.

Numbers on the base of 4

Just like in the base-2 example, each column holds one digit. But instead of just 0 or 1, we can now enter four different options: 0, 1, 2, and 3. As before, each column stands for one power of 4, which we can count up to three times. The table would look like this:

16   4   1   Total
 0   0   1       1
 1   1   1      21
 2   3   1      45

As in the previous examples, we multiply each cell value by the head value and add the results up along the row. The last row would be read like this:

( 2 × 16 ) + ( 3 × 4 ) + ( 1 × 1 ) = 32 + 12 + 1 = 45
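Going the other way, from a number to its base-4 digits, can be sketched with repeated division. The function name `to_base` is made up for this illustration; it works for base 2 and base 10 just as well.

```python
def to_base(number: int, base: int) -> list[int]:
    """Split a non-negative number into its digits in the given base."""
    if number == 0:
        return [0]
    digits = []
    while number > 0:
        # The remainder is the digit for the current column,
        # the quotient is what is left for the columns to the left.
        number, remainder = divmod(number, base)
        digits.append(remainder)
    return digits[::-1]  # we collected the rightmost digit first


print(to_base(45, 4))  # [2, 3, 1]
print(to_base(21, 4))  # [1, 1, 1]
```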

Ok, now that we understood this, let’s test it out below!

Change the number and see how the base-4 values change, or click on the base-4 values to see how the number changes. Try to increase the number using your arrow keys and see how the number changes steadily.

Ok, now we know how to write numbers in base 4. Still no DNA. Let’s do that in the final example!

Storing simple data on DNA

Ok, we know we have four nucleobases: A, C, G, and T. If we assign each base a value between 0 and 3, we get the same base-4 storage system. Each base-4 digit can then be translated into two binary digits, taking us back to the binary system of a computer. This binary chain can then be read like all digital data, such as the character example above. The key for this translation looks like this:[11]

Nucleobase   Base-4 digit   Binary
A            0              00
C            1              01
G            2              10
T            3              11

So depending on how we chain the nucleobases in our synthetic DNA, we can translate the information back into binary, digital files, which can then be translated into their intended format. Just like characters build up to words, sentences, paragraphs, etc.
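The mapping from the table can be tried out in a short Python sketch. It is a simplified illustration under a few assumptions that are not taken from this page: the text is first turned into bytes with UTF-8, each byte is split into four two-bit pairs starting from the left, and the names `BASES`, `text_to_dna`, and `dna_to_text` are made up. The widget on this page may work differently.

```python
BASES = "ACGT"  # position in the string = base-4 digit: A=0, C=1, G=2, T=3


def text_to_dna(text: str) -> str:
    """Turn text into bytes, then each byte into four nucleobases."""
    strand = []
    for byte in text.encode("utf-8"):     # assumption: UTF-8 as the text encoding
        for shift in (6, 4, 2, 0):        # leftmost pair of bits first
            strand.append(BASES[(byte >> shift) & 0b11])
    return "".join(strand)


def dna_to_text(strand: str) -> str:
    """Reverse the process: four nucleobases become one byte again."""
    data = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        data.append(byte)
    return data.decode("utf-8")


print(text_to_dna("A"))         # CAAC  (01 00 00 01)
print(text_to_dna("Hi"))        # CAGACGGC
print(dna_to_text("CAGACGGC"))  # Hi
```

Since one nucleobase carries two bits, every byte needs exactly four bases, so a 100-character text of simple letters becomes a strand of 400 bases.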

To make this a little more tangible, type a sentence into the text field and see what the DNA strand would look like:[12]

Type a text into the field above. A potential DNA strand is generated, displaying your text in nucleobases.

You did it! You translated your text into a DNA strand! This is a simulation of how data would be stored on DNA.

Outro

After reading all this, I hope you understood how codes work, how digital data is stored, and finally, how this would be stored on a DNA strand. Even if this website simplifies the topic a lot, I hope it still helps you understand it better.

We are still at the beginning of this technology, but it could have a huge impact on our future. Big data storage facilities could be replaced by something that consumes far less energy and space. It might be built into our day-to-day devices and increase their storage capacity tremendously. Or it could simply be used to preserve data from our past, saving it for future generations. Whether the technology will be mature enough in a few years or only in decades remains to be seen, as does how its impact will show in our everyday lives.

Thank you, and stay excited!

Footnotes
  0. You don’t believe me? Have a look here!
  1. Ötzi, also called “the Iceman,” was found in the Austrian “Ötztal” Alps in 1991. He is a natural mummy who lived between 3400 and 3100 BCE. Learn more about him on Wikipedia.
  2. This info comes from a Wikipedia article. So please take it with a grain of salt.
  3. Here you can read more at the Wyss Institute for Biologically Inspired Engineering.
  4. The X-Wing is one of the starfighters from “Star Wars.” Find more here. If you haven’t watched Star Wars, I highly suggest you do so now!
  5. If you want to learn more about Morse code, have a look at its Wikipedia page.
  6. Thank you to @mohayonao for hosting this gist!
  7. The Legend of Zelda: Ocarina of Time and Majora's Mask are games for the N64 system. They were released in 1998 (OoT) and 2000 (MM). In my (biased) opinion, two of the best games ever made! Learn more about them: OoT and MM.
  8. If you haven’t watched the “Powers of Ten” video by Ray and Charles Eames, then I highly recommend watching it! It displays beautifully what times 10 can do. You can find it here.
  9. 2 to the power of 8 is 256. This is the number of possible values in one byte, as we have eight digits with two options each. So 2×2×2×2×2×2×2×2.
  10. Here is a super brief video that explains what DNA is and how it gets read.
  11. This is one possible option to encode the data, taken from this TED talk. Of course, every other mapping would be possible as well. It would only need to be agreed upon in a convention.
  12. Adenine always matches with thymine and guanine with cytosine when two DNA strands form a double helix. This is usually indicated with matching endings.