Combining 23andMe and AncestryDNA Raw Data Files – Mac / Linux

This method of combining the 23andMe and AncestryDNA raw data files involves using a Mac or Linux terminal window.

These directions assume you have some basic experience navigating in a terminal window.

 

First, be sure you have downloaded both the 23andMe and AncestryDNA raw data files. Know where the .txt files are on your hard drive, and make sure they are both in the same folder.

 

Open a terminal window and navigate to the folder where your raw data files are located.

 

1) Strip out the header information on both text files:

Before we get to combining the two files, let’s clean up the extra information at the top of each file.

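Open each file in TextEdit (or your default .txt file editor) and delete the lines at the top that explain how the file was created.

These header lines start with a # character. The AncestryDNA file also has a row of column names (rsid, chromosome, position, allele1, allele2) just above the data; delete that row too.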

 

Save the files and close the text editor.
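If you would rather not edit such large files by hand, the same cleanup can be done from the terminal. The following is a minimal sketch, not part of the original directions. It assumes that every header line starts with # and that the column-name row starts with the word rsid. The function name strip_header and the file names in the usage comment are placeholders.

```shell
# strip_header INPUT OUTPUT
# Copies INPUT to OUTPUT, leaving out comment lines (starting with '#')
# and a column-name row (starting with "rsid"). Assumes no data row
# begins with either of those strings.
strip_header() {
  grep -v -e '^#' -e '^rsid' "$1" > "$2"
}

# Usage (placeholder file names):
#   strip_header AncestryDNA.txt AncestryDNA_clean.txt
#   strip_header 23andMe.txt 23andMe_clean.txt
```

Writing to a new file keeps the original download untouched in case something goes wrong.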

 

2) Combine allele 1 and 2 on AncestryDNA file:

The data in the AncestryDNA file is given in five columns – rs id, chromosome, position, allele 1, and allele 2.

The data from 23andMe is given in four columns – rs id, chromosome, position, and combined genotype (allele 1 + allele 2).

Thus, we need to make the AncestryDNA file (5 columns) match with the 23andMe data (4 columns).

To do this, we will run an awk command in the terminal window that joins the 4th and 5th columns of the AncestryDNA data file and writes the result to a new file.

Be sure to change the filename to whatever you named your AncestryDNA file.

awk 'BEGIN {FS="\t"} {print $1 "\t" $2 "\t" $3 "\t" $4 $5}' AncestryDNA.txt > AncestryCombined.txt

 

Now your AncestryDNA file will be the same format as the 23andMe raw data file (4 columns, genotype combined).
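One way to confirm that every row of the new file has four columns is to tally how many tab-separated fields each line contains. This is a small sketch under the assumption that the file is tab-delimited. The function name count_columns is hypothetical and not part of the original directions.

```shell
# count_columns FILE
# Prints one line per distinct field count, in the form "<fields> <rows>".
# A clean four-column file should produce a single line starting with 4.
count_columns() {
  awk -F'\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$1"
}

# Usage (placeholder file name):
#   count_columns AncestryCombined.txt
```

If more than one line is printed, some rows have a different number of columns, which usually means a header line was left in the file.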

 

3) Combine the 23andMe and the edited AncestryDNA files, remove duplicates:

We will use a couple more commands to combine the two files and remove the duplicate rs ids.

Again, be sure to change the file names to whatever your file names are.

cat 23andMe.txt AncestryCombined.txt | awk '!seen[$1]++' > ComboRawGenome.txt

 

If you open up your newly created file (ComboRawGenome.txt), you will see all of your genetic data combined.

The original files for 23andMe v4, v5, and AncestryDNA each contain between 600,000 and 700,000 rows (depending on when you did the tests).

You can check to see how many rows are in the combined files using the following awk command:

awk 'END{print NR}' ComboRawGenome.txt

The number it outputs is the number of rows. In my example, I was using 23andMe v4 and AncestryDNA v2 data with the mitochondrial and Y chromosome data stripped out, which gave me a combined total of 997,940 rows (which can be imported into Excel).
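Along with the row count, it can be useful to check whether any rs id still appears more than once in the combined file. The sketch below assumes the rs id is the first tab-separated column. The function name count_duplicate_rsids is hypothetical.

```shell
# count_duplicate_rsids FILE
# Prints how many distinct rs ids (column 1) occur on more than one row.
# Each repeated rs id is counted once, no matter how often it repeats.
count_duplicate_rsids() {
  awk -F'\t' 'seen[$1]++ == 1 { dup++ } END { print dup + 0 }' "$1"
}

# Usage (placeholder file name):
#   count_duplicate_rsids ComboRawGenome.txt
```

A result of 0 means every rs id in the file is unique.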

 

That’s it! Now you have a file with the combined 23andMe and AncestryDNA data with duplicates removed. Note that in the example above, I put the 23andMe data file first. In the final file, the first instance of each rs id (in this case, from the 23andMe data) is kept, and the second instance (from AncestryDNA) is removed.
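To get a sense of how many AncestryDNA rows are set aside in favor of the 23andMe rows, you can count the rs ids that appear in both input files. This is a rough sketch that assumes both files are tab-delimited with the rs id in the first column. The function name count_shared_rsids is hypothetical.

```shell
# count_shared_rsids FIRST SECOND
# Prints the number of distinct rs ids (column 1) found in both files.
# Assumes FIRST is not empty, since NR == FNR is used to detect the first file.
count_shared_rsids() {
  awk -F'\t' '
    NR == FNR { first[$1] = 1; next }             # remember rs ids from FIRST
    ($1 in first) && !counted[$1]++ { shared++ }  # count each shared rs id once
    END { print shared + 0 }
  ' "$1" "$2"
}

# Usage (placeholder file names):
#   count_shared_rsids 23andMe.txt AncestryCombined.txt
```

Swapping the order of the two files in the cat command would keep the AncestryDNA rows for these shared rs ids instead.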

 

If you are a Genetic Lifehacks member, you can connect your combined data file and use that information when looking at all the reports and articles.



Author Information:   Debbie Moon
Debbie Moon is the founder of Genetic Lifehacks. She holds a Master of Science in Biological Sciences from Clemson University and an undergraduate degree in engineering. Debbie is a science communicator who is passionate about explaining evidence-based health information. Her goal with Genetic Lifehacks is to bridge the gap between the research hidden in scientific journals and everyone's ability to use that information. To contact Debbie, visit the contact page.