Good day, everyone. I hope everyone is doing well! In this document, the process of phasing will be discussed and explained. Phasing is the newest craze in genetic genealogy. Right now there aren't many tools out there that can perform phasing. What is phasing? What is it all about? Let's take a look at the new kid on the block.
Introduction To Phasing
What if you wanted to DNA test one of your parents but can't? Let's assume one of your parents is unavailable and you would like to recover DNA data for that parent. Can you do this? The answer is yes, at least in part. This is where phasing comes in. Phasing is the attempt to reconstruct a parent's DNA data from a single child and the other contributing parent. The idea behind phasing is: Child DNA - Parent 1 DNA -> Parent 2 DNA. The result of phasing is a pseudo DNA data file that contains SNPs of the untested parent. To phase one parent's DNA, you need the DNA data files of BOTH a tested child and the other, tested parent. Let's take a look at how phasing works.
If you remember, an individual has two of every known SNP like this -> AG. The letters "A" and "G" are DNA bases found at a SNP (snip). SNP stands for single nucleotide polymorphism, a location in the DNA where people can differ by a single base. Strictly speaking, the SNP is the location and the letters found there are alleles, but this document uses the two terms interchangeably. The reason you have two of every known SNP is that you receive one from each parent. Let's say a child has the following SNPs or alleles -> AG. Now let's assume that a tested parent (the mother) has the following SNPs -> AT. Can we figure out which SNP the child received from which parent? The answer is yes. Since the only SNP the child and mother have in common is "A" (Mom -> AT, Child -> AG), the child must have inherited the "A" from the mother and the "G" from the father (Mom -> AT, Child -> AG, Dad -> ?G). The result is a phased DNA data file that contains the single paternal SNP -> "G". Normally your DNA data file from either Family Finder or Relative Finder has two of every SNP or allele. A pseudo phased DNA data file, however, contains only one (half) of every known SNP or allele.
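The deduction at a single location can be sketched in a few lines of Python. This is only an illustration under simple assumptions: genotypes are two-letter strings such as "AG", the data contains no testing errors, and the function name `infer_other_parent_allele` is made up for this example rather than taken from any real phasing tool.

```python
from typing import Optional


def infer_other_parent_allele(child: str, tested_parent: str) -> Optional[str]:
    """Return the letter the child received from the untested parent.

    Genotypes are two-letter strings such as "AG" (an assumption of this
    sketch). Returns None when the letter cannot be determined.
    """
    candidates = set()
    for i in (0, 1):
        from_tested = child[i]      # suppose this letter came from the tested parent
        from_other = child[1 - i]   # then the other letter came from the untested parent
        if from_tested in tested_parent:
            candidates.add(from_other)
    # Exactly one possibility: the letter is determined.
    # Zero (inconsistent data) or two (ambiguous): give up.
    if len(candidates) == 1:
        return candidates.pop()
    return None


# Mom -> AT, Child -> AG, so Dad -> ?G
print(infer_other_parent_allele("AG", "AT"))  # G
```

Returning None for the undecidable cases, instead of guessing, mirrors the idea that a phased file should only contain letters that can actually be attributed to the untested parent.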
Remember that an autosomal DNA test produces matches. When a person is a match to you, that person matches half of the SNPs that are in your normal DNA data file. In other words, a match is related to one side of your family. Because a phased pseudo DNA data file only contains SNPs from a single parent, only matches from one side of your family are revealed. In fact, this is the reason behind phasing.
Phasing: The Reason Behind It
Phasing is useful when you don't have a DNA data file from a parent, for example when that parent is deceased. Recently I phased SOME of my deceased paternal grandfather's DNA data this way. However, phasing really shines in "lining up" your matches. Remember that an autosomal DNA test produces matches on both sides of your family. More importantly, an autosomal DNA test cannot tell you which side of your family a match is on. There are two main ways to determine which side of your family a match is on:
- Simply test both of your parents and see where the matches line up. If a match appears in the DNA match list of a particular parent, then you know which side the match is on.
- Simply test a single parent and observe whether the match appears in the DNA match list of the tested parent. If the match doesn't appear there, then the match can be assumed to be on the other parent's side. This type of exclusion can only be relied on for close relatives (parent through and including 2nd cousins). Starting at the 3rd cousin level, exclusion becomes a matter of probability: a non-match to a parent doesn't necessarily mean no relation. In other words, a person can still be related to the tested parent even though that parent didn't show a match. This is because the "masking" effects of recombination begin to appear at the 3rd cousin level.
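The single-parent exclusion rule from the second bullet can be written down as a small Python sketch. Everything here is hypothetical: the kit IDs, the idea of storing an estimated cousin level per match, and the function name `assign_sides` are assumptions made for illustration, not features of any matching service.

```python
def assign_sides(child_matches, parent_matches, max_reliable_level=2):
    """Label each of the child's matches by family side.

    child_matches: dict mapping a kit ID to an estimated cousin level
                   (1 = 1st cousin or closer, 2 = 2nd cousin, 3 = 3rd cousin, ...).
    parent_matches: set of kit IDs found in the tested parent's match list.
    """
    sides = {}
    for kit, level in child_matches.items():
        if kit in parent_matches:
            sides[kit] = "tested parent's side"
        elif level <= max_reliable_level:
            # Close relative missing from the parent's list: exclusion applies.
            sides[kit] = "other parent's side"
        else:
            # From the 3rd cousin level on, exclusion is only a probability.
            sides[kit] = "undetermined"
    return sides


# Hypothetical kit IDs and cousin levels
child_matches = {"K1": 1, "K2": 2, "K3": 4}
parent_matches = {"K1"}
print(assign_sides(child_matches, parent_matches))
```

The cutoff of 2 follows the text's "through and including 2nd cousins"; it is a parameter so the threshold can be tightened or relaxed.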
Phasing: Limitations and Catches
The biggest catch to phasing is that your pseudo DNA data file will only contain, at maximum, half the SNPs (alleles) that a tested person's DNA data file will contain. Remember with phasing, you are creating a virtual DNA data file without actually testing an individual. From a practical perspective, a phased DNA data file will actually contain much less than half the SNPs a normal DNA data file will contain. There are two reasons for this: Random No Calls and Random On Homozygous. Let's take a look.
Phasing: Random No Calls
Let's say the child is -> AG and the mother is also -> AG at the same location on their chromosomes. In this case, the child's two SNPs are different from each other, but both are identical to the parent's. Can we determine which of the child's SNPs, "A" or "G", came from which parent? The answer is NO. As you can see, either the "A" or the "G" could have come from the mother. Therefore, there is no way to deduce which SNP came from the mother and which from the father. This is what is referred to as a random no call. Random-no-call SNPs are the reason why linked DNA segments are used in an autosomal DNA test to identify common ancestry. A DNA data file that's generated from a tested individual, by default, contains random-no-call SNPs. A well-designed matching algorithm would simply ignore all random-no-call SNPs as it detects them.
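Here is a rough Python sketch of how random no calls could be detected and skipped. It assumes each DNA data file has already been loaded into a dictionary mapping a SNP name to a two-letter genotype; the SNP names, the data, and both function names are invented for this illustration.

```python
def is_random_no_call(child: str, parent: str) -> bool:
    """True when the child's two letters differ and the parent has the same two letters."""
    return child[0] != child[1] and set(child) == set(parent)


def drop_random_no_calls(child_file, parent_file):
    """Return the SNP names, present in both files, that are not random no calls."""
    return [
        name for name in child_file
        if name in parent_file
        and not is_random_no_call(child_file[name], parent_file[name])
    ]


# Hypothetical SNP names and genotypes
child_file = {"rs1": "AG", "rs2": "AG", "rs3": "CT", "rs4": "CC"}
mother_file = {"rs1": "AT", "rs2": "AG", "rs3": "TC", "rs4": "CT"}

kept = drop_random_no_calls(child_file, mother_file)
print(kept)  # ['rs1', 'rs4']
```

Comparing the genotypes as sets means "CT" and "TC" count as the same pair of letters, since the order in which a file lists the two letters carries no information about which parent each came from.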
In a phased DNA data file, random-no-call SNPs are not inserted into the file. This is one of the reasons why a phased DNA data file is much smaller than a normal DNA data file. To give you an idea, here is a picture of the phased output of my paternal grandfather - William E. Handy Sr.
I recently used the new GedMatch Phasing Utility to create a pseudo phased DNA data file of my deceased paternal grandfather - William E. Handy Sr. The kit number is PF208196P1.
If you click the picture shown above, you will see a phased listing of my paternal grandfather's DNA matches. The top match (F208196) is his son, who is my father, Steve Handy Sr. A parent and child normally share between 3300 cM and 3400 cM of DNA. As one can see, my dad and his father only share 410 cM of DNA in this phased reading. The low cM amount is due to the size of the phased DNA data file. Moving on to myself (F200507), my grandfather and I only share 217.3 cM of DNA. We are supposed to share between 1700 cM and 1900 cM of DNA, since William Handy Sr. is my grandfather.
One way around this is to simply phase all of your full siblings against the same parent. That way, you can build a bigger list of matches that fall on the side of the phased parent. Each sibling will likely generate a different phased virtual DNA data file against the same parent, so each sibling has the potential to filter and reveal more matches that fall on the side of the phased parent.
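Building that bigger list amounts to pooling the match lists produced by each sibling's phased file. A minimal Python sketch, assuming each list is simply a collection of kit IDs (the IDs and the function name below are made up):

```python
def combine_sibling_matches(*phased_match_lists):
    """Pool the match lists from several siblings' phased files, without duplicates."""
    combined = set()
    for matches in phased_match_lists:
        combined |= set(matches)
    return sorted(combined)


# Hypothetical match lists from two siblings phased against the same parent
sibling_a = ["K1", "K2"]
sibling_b = ["K2", "K3"]
print(combine_sibling_matches(sibling_a, sibling_b))  # ['K1', 'K2', 'K3']
```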
The important concept, though, is that the matches shown at the above URL link are ALL on my paternal grandfather's side. In other words, they all "line up" on my paternal grandfather's side. There is no need to worry about matches on my father's maternal side because none are shown in this phased output. The phased DNA data file has completely filtered out all of my father's maternal matches.
Phasing: Random On Homozygous
Let's say the child is -> AA and the mother is also -> AA at the same location on their chromosomes. Both of the child's SNPs have the same value (homozygous) and are identical to the mother's. Can we determine which SNP came from which parent? The answer is yes. The father and mother both contributed a SNP with the value of "A". However, this will not help in an autosomal DNA test. Because both parents are identical to the child at that location, there is no way to determine which parent a match is related to.
For example, suppose mom is -> AAAAA and dad is -> AAAAA. If a match is -> AAAAA, then how can you know which parent the match is related to? This is called Random On Homozygous, or ROH. (Elsewhere in genetics, ROH is more commonly expanded as "runs of homozygosity".) In normal tested cases, this presents the same problem as in a phased scenario. People who are descended from, or are a part of, an endogamous population are affected by ROH. A good example would be people of Ashkenazi Jewish ancestry. Many first cousins married each other and produced offspring. As a result of the inbreeding, the SNP or allele pool can become highly homogeneous over time. Phasing would not help much in this case.
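The five-letter example above can be played out in Python. This is a deliberately simplified sketch: it treats the letters each parent passed down as a plain string and calls a match "consistent" with a side only when the strings are identical. The function name is invented for the illustration and real matching is far more involved.

```python
def sides_matching(match: str, from_mom: str, from_dad: str):
    """List the sides whose letters are identical to the match's letters."""
    sides = []
    if match == from_mom:
        sides.append("maternal")
    if match == from_dad:
        sides.append("paternal")
    return sides


# Both parents are AAAAA: the match fits either side, so nothing is learned.
print(sides_matching("AAAAA", "AAAAA", "AAAAA"))  # ['maternal', 'paternal']

# When the parents differ, the same comparison points to one side only.
print(sides_matching("AGAAT", "AGAAT", "AAAAA"))  # ['maternal']
```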
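On a final note, half siblings in effect already have phased data: because they share only one parent, the DNA they have in common comes from that parent's side.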
Well, that's it for phasing. Hopefully you now have a basic and clear understanding of phasing. Currently there are two tools out there that do phasing - GedMatch and David Pike's Tool. To see the actual comparison in my case, use kit number PF208196P1 in the second link - Compare Kits. To generate a phased file, use the first or third link shown below.
- GedMatch Phasing - http://ww2.gedmatch.com:8006/autosomal/phase1.php
- Compare Kits - http://ww2.gedmatch.com:8006/autosomal/r-list1.php
- David Pike - http://www.math.mun.ca/~dapike/FF23utils/