Sunday, November 4, 2012

Understanding Mitochondrial DNA Testing


     Good Day Everyone. In this document, I am going to provide an introduction to the basis of a mitochondrial DNA (mtDNA) test. This document should remove any confusion people may have concerning the test. As it stands right now, Family Tree DNA (FTDNA) is the premier company that performs mtDNA testing. The company known as 23andMe currently does NOT perform an mtDNA test. 23andMe only provides a haplogroup assignment, which is an extra piece of information added to its test. Let's begin with two important and basic principles that DNA tests are built on.


Basic Principles
     The first principle is that when two or more people share or match segments (regions) of DNA, they share a common ancestor in their past. It is from that ancestor that the shared DNA segments are inherited. In this case, the common ancestor was a woman.

   The second principle is that the more DNA you share with someone, the closer you are to that person. This means your shared common ancestor lived in a more recent time. As we are going to see, this principle is extremely important when considering mtDNA given its slow rate of change.

Now let's look at the mtDNA basics. 

Short Science Part - mtDNA Basics
     The mitochondrion is a structure that sits inside the human cell. Its job is to provide energy to the cell. There are multiple copies of it that lie outside the nucleus. Inside each mitochondrion is a circular piece of DNA called mtDNA, which has 16,569 DNA base pairs. The mtDNA is composed of three DNA regions - HVR1, HVR2, and CR (the Coding Region, which contains genes). FTDNA has three mtDNA tests based on these three regions.
  1. Low resolution (HVR1) test
  2. High resolution (HVR1 + HVR2) test
  3. Full Genome Sequence test (HVR1 + HVR2 + CR) which looks at the entire mtDNA.
Three important points
  1. Only women pass their mtDNA on to their children, both sons and daughters. Men cannot pass along their mtDNA. This means that the inheritance of the mtDNA is child -> mother -> mother's mother -> mother's mother's mother -> etc. In other words, an mtDNA test looks strictly at the maternal line.
  2. The word "match" in this context means having an identical mtDNA region (HVR1, HVR2, or CR) as someone else. Not a single base pair may differ. For example, the HVR1 region contains 400 DNA base pairs. An HVR1 low resolution match means you and someone else share exactly the same 400 base pairs. A single base mismatch can mean a difference of, say, 1,000 years between you and someone else.
  3. The mtDNA changes very, very slowly over time. Because of this, the mtDNA test is mainly used for deep, distant ancestry. For example, if you have an HVR1 match, you are very distantly related to that person. In other words, your last common maternal ancestor could have lived thousands of years ago. The more mtDNA regions you match with someone (there are only three regions), the more closely you are related to that person. Ideally and from a practical perspective, you really want to match someone in all three mtDNA regions, as a mother and daughter would. This means your last common maternal ancestor likely lived recently - say within the last 5 generations, which is approximately the last 125 years.
An mtDNA test also provides a separate piece of information known as a haplogroup. Let's take a look.

Short Science Part - Haplogroups
       A haplogroup is a population of people who are all descendants of a single man or woman who lived in the distant past. In this case - we are talking about mtDNA haplogroups. Each mtDNA haplogroup has a unique set of mtDNA markers that define that haplogroup. Every member of a single haplogroup bears a unique set of mtDNA markers that sets them apart from being a member in a different haplogroup. 

     There are currently 26 known mtDNA haplogroups. All 7 billion humans that currently live on the planet fall into a mtDNA haplogroup. Letters of the alphabet are assigned to a mtDNA haplogroup. An example of a mtDNA haplogroup is L3e. Essentially L3e represents a single woman that lived in the very very distant past. As science studies more populations, more mtDNA haplogroups will be added.

    IMPORTANT: Your haplogroup maternal common ancestor (L3e for example) and your last common maternal ancestor are two completely different women. Let's now look at how to get an estimate of when your last common maternal ancestor lived.

 Statistics
  Unfortunately DNA doesn't have a sign on it that tells you exactly in time when your last common maternal ancestor lived. Because of this, we have to use statistics to get a probability of when your last common maternal ancestor lived. Family Tree DNA currently uses the following accepted criteria to determine a time period.  
  1.   Matching on HVR1 (low resolution match) means that you have a 50% chance of sharing a common maternal ancestor within the last fifty-two generations. That is about 1,300 years.
  2.   Matching on HVR1 and HVR2 (high resolution match) means that you have a 50% chance of sharing a common maternal ancestor within the last twenty-eight generations. That is about 700 years.
  3.   Matching on the Mitochondrial DNA Full Genomic Sequence test (full resolution match) brings your matches into more recent times. It means that you have a 50% chance of sharing a common maternal ancestor within the last 5 generations. That is about 125 years.
      As you can see, these time ranges can be quite large. Remember that these are probabilities: the ancestor could have lived in either of two intervals, inside the stated time range or before it. For example, an HVR1 match means there is a 50 percent chance that your last common maternal ancestor lived within the last 1,300 years. This also means that there is still a 50 percent chance that the maternal ancestor lived more than 1,300 years ago!
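The arithmetic behind these time ranges can be sketched in a few lines of Python. This is only an illustration: the 25 years per generation figure is an assumption inferred from the numbers above (1,300 / 52 = 25), and the names used here are made up for the example rather than anything FTDNA publishes.

```python
# Assumed average generation length, inferred from the figures above
# (52 generations is about 1,300 years). Not an official FTDNA constant.
YEARS_PER_GENERATION = 25

# Number of generations within which there is a 50% chance of sharing a
# common maternal ancestor, for each match level quoted above.
GENERATIONS_AT_50_PERCENT = {
    "HVR1": 52,
    "HVR1+HVR2": 28,
    "FGS": 5,
}


def years_at_50_percent(match_level):
    """Return the approximate time span in years for a match level."""
    return GENERATIONS_AT_50_PERCENT[match_level] * YEARS_PER_GENERATION


for level in GENERATIONS_AT_50_PERCENT:
    print(level, "->", years_at_50_percent(level), "years")
```

Keep in mind that each number marks only the 50% point, so the ancestor is just as likely to have lived before that time span as within it.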

   As you can see, from a practical standpoint, you really want to match someone at the Full Genomic Matching level. In other words, if you take a mtDNA test, you should probably order the FGS test and hopefully match to someone at that level. At the FGS level, your last common maternal ancestor is likely to have lived within the last 5 generations which is a genealogical time frame of about 125 years.

     Well that's it!!!  In short, mtDNA testing involves finding matches that reveal a shared common maternal ancestor. As you can see, the mtDNA changes very slowly, which means it's mainly used for distant ancestry, but it can be used for recent ancestry as well.

Hope that helps. Please let me know if you have questions. As always, it's a pleasure!!!!

Thanks
Steve Handy

Saturday, November 3, 2012

Understanding Y-DNA Genealogical Testing


Good Day Everyone,

   How is everyone doing? In this document, the Y-DNA genealogical test will be explained. Some people are confused as to exactly what a Y-DNA test is. This document will serve to remove the confusion that surrounds a Y-DNA test. Currently, Family Tree DNA (FTDNA) is the premier DNA testing company that performs a Y-DNA genealogical test. This is mainly due to FTDNA's large STR marker system. (I will explain STRs shortly.) The company known as 23andMe currently doesn't perform a Y-DNA genealogical test. 23andMe provides only a Y-DNA haplogroup assignment, which is an add-on. Let's begin.

Basic Principles
     The first principle is that when two or more people share or match regions of DNA, they share a common ancestor in their past. It is from that common ancestor that the shared DNA segments or regions are inherited. In this case, the common ancestor was a male.

     The second principle is that the more DNA you share with someone, the closer you are to that person. This means your shared common ancestor lived in a more recent time. For example, a brother and sister's last common ancestor is their mother. On the other hand, two first cousins' last common ancestor would be their grandmother. As we are going to see, this principle is extremely important when considering the Y-chromosome given its moderate rate of change.

Y-DNA Basics
     So first off - ladies please don't be upset LOL. But a Y-DNA test is strictly for men. Here is why!! Humans have 46 chromosomes. In men, the last chromosome (the 46th chromosome) is known as the Y-chromosome. The Y-chromosome is sometimes called the Y-DNA. The Y-DNA has a gene on it called the SRY gene. This master switch gene (which switches on a bunch of genes) converts a human embryo into a male. Therefore, by definition, only a male has a Y-chromosome (Y-DNA). The Y-DNA has an area of DNA that a Y-DNA genealogical test looks at. This Y-DNA area contains a type of DNA called STRs. STR stands for short tandem repeat. An STR is a repeat of a DNA sequence. I will explain.

     DNA has four bases called A, T, C, and G. For example, a DNA sequence would be -> "GCATCATG". The DNA sequence "CAT" is an STR marker. As you can see, the STR is repeated 2 times (CATCAT). Researchers have an STR naming convention called the DYS system. DYS stands for DNA Y-chromosome Segment. For example, a commonly studied DYS marker is DYS393. DYS393 has the STR sequence known as "AGAT". If you see a statement that says "DYS393 = 3", then it means that the DNA sequence "AGAT" is repeated 3 times like this -> AGATAGATAGAT.
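To make the idea of counting repeats concrete, here is a minimal Python sketch. It simply looks for the longest run of back-to-back copies of a repeat sequence in a string of bases. The function name is made up for illustration, and real laboratory STR analysis is far more involved than this.

```python
def max_tandem_repeats(sequence, motif):
    """Return the longest run of back-to-back copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    # Try every starting position and count consecutive copies from there.
    for start in range(len(sequence)):
        run = 0
        pos = start
        while sequence.startswith(motif, pos):
            run += 1
            pos += len(motif)
        best = max(best, run)
    return best


print(max_tandem_repeats("GCATCATG", "CAT"))       # 2
print(max_tandem_repeats("AGATAGATAGAT", "AGAT"))  # 3
```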

   A Y-DNA genealogical test looks at the DYS markers that currently all modern human men have along their Y-chromosome. Let's take a look. 

Y-DNA Genealogical Test
     A Y-DNA test compares the DYS markers along the Y-chromosome between any two men. All modern human males have the same set of DYS markers, which are situated in the same order along their Y-chromosome. For example, along the Y-chromosome you will see DYS393-DYS390-DYS19 in this order from left to right. The reason for this is that all human men have a common distant paternal ancestor who is known as Y-Chromosome Adam.

    A Y-DNA test will look at a set of studied DYS markers and values between any two men. If there are enough matching DYS marker values, then a common paternal ancestor has been revealed between two or more men. This common male ancestor may have lived within a genealogical time frame (last 100 to 200 years). STR markers can change between generations. For example, Male A may have DYS393=10. Male B may have DYS393=12. The difference is 12-10, which is 2. This difference is known as genetic distance. Genetic distance is a property of a Y-DNA genealogical test. It's used to get a degree of the relatedness between two or more men.
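Here is a small Python sketch of the genetic distance idea. The DYS393 values come from the example above. The other marker values are made up, and adding up the differences across several markers is a simplifying assumption on my part, not necessarily the exact rule a testing company applies to every marker.

```python
def genetic_distance(haplotype_a, haplotype_b):
    """Sum the absolute differences in repeat counts over shared markers."""
    shared_markers = haplotype_a.keys() & haplotype_b.keys()
    return sum(abs(haplotype_a[m] - haplotype_b[m]) for m in shared_markers)


# Hypothetical haplotypes; only the DYS393 values come from the text.
male_a = {"DYS393": 10, "DYS390": 24, "DYS19": 14}
male_b = {"DYS393": 12, "DYS390": 24, "DYS19": 14}

print(genetic_distance(male_a, male_b))  # 2
```

A genetic distance of 0 means the two men match exactly on every marker they were both tested on.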

     The Y-DNA has a strict inheritance pattern. The pattern is son -> father -> father's father -> father's father's father -> etc. Like the Y-DNA, which is passed from father to son, your surname (last name) is typically inherited in a similar fashion. Therefore, a Y-DNA test is typically used to see if a group of men who have the same last name are related. For example, the last name of Williams is fairly common. If you want to know whether a group of, say, male Williamses are related, then a Y-DNA genealogical test would be used. This is commonly used today in cases of adoption, name changes, etc. Currently, FTDNA has a panel of 111 DYS markers, which makes their Y-DNA test a very popular option.

     Companies such as Family Tree DNA (FTDNA) typically market, package, and sell their Y-DNA tests based on a set number of studied DYS markers. For example - a 12 marker Y-DNA test means a set consisting of 12 popular DYS markers will be analyzed by the DNA testing company. Your specific set of DYS markers and values is known as a Y-DNA haplotype. For example, shown below is a picture of my personal 12 marker Y-DNA haplotype:

If you click this picture shown toward the left, it will show my set of 12 DYS markers and their values. For example - my DYS393 marker has a value of 15. This means I have a DNA base sequence or STR of "AGAT" that is repeated 15 times along the length of my Y-DNA. 

If another male has the exact set of DYS marker values that I have, then we would be considered a match. This means that the other gentleman and I share a paternal common ancestor.  The more DYS markers you share with someone, the more likely you are closely related to that person. I stress the word "likely" because a Y-DNA test is not always a clear cut test in terms of measuring the relatedness of two or more men. Here is what is meant.

The current thinking is that a male should match another male on at least 37 DYS markers to be considered related within a genealogical time frame (last 100 to 200 years). This is logical and reasonable thinking. However, DYS markers can change between generations.

For example - ideally a father and son, who are closely related, should match on all known DYS markers. This is true since a son inherits a copy of his father's Y-DNA. However, it's possible that even a father and son may differ in DYS marker values.

To make things more interesting - sometimes the opposite is true. Two or more men can be an exact match on all of their shared DYS marker values and yet be distantly related. There are known cases where two or more men have been an exact match at 111 DYS marker values and yet turned out to be very distantly related (10th cousins). Cases such as this can happen in Y-DNA testing so one should be aware of this.   

While a Y-DNA test is typically used for recent ancestry, a Y-DNA test can be used to reveal deep distant ancestry. This is where haplogroups come into the picture.

Y-DNA Haplogroups
     A Haplogroup is a population of people who are all descendants of a single man or woman who lived in the distant past. In this case - we are talking about Y-DNA haplogroups. Each Y-DNA haplogroup has a unique set of markers that define that haplogroup. Every member of a single haplogroup bears the same unique set of Y-DNA markers, which sets them apart from members of a different haplogroup. These unique markers arose in a single individual, the haplogroup ancestor, a long time ago. Letters of the alphabet are given to the different Y-DNA haplogroups. A popular Y-DNA haplogroup is E1b1a. Every male has a Y-DNA haplogroup. In essence, a Y-DNA haplogroup, such as E1b1a, represents a single male that lived in the very very distant past!!!

     The DNA markers used for haplogroup assignment are known as SNPs (pronounced "snips"). A SNP is a DNA base that has changed. For example, suppose a DNA sequence changes from CATG -> CATA. In this case, "G" changed to "A". The base "A" would be considered a SNP. SNPs change very slowly, which is why they are used for haplogroup assignment.
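The CATG -> CATA example can be sketched in Python as a simple position-by-position comparison of two sequences of equal length. The function name is made up for illustration, and real SNP calling works on much larger data with far more checks.

```python
def find_snps(original, changed):
    """Return (position, old_base, new_base) for every base that differs."""
    if len(original) != len(changed):
        raise ValueError("sequences must be the same length")
    return [
        (position, old, new)
        for position, (old, new) in enumerate(zip(original, changed))
        if old != new
    ]


# Positions are counted from 0, so the fourth base is position 3.
print(find_snps("CATG", "CATA"))  # [(3, 'G', 'A')]
```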

    There are approximately 29 known Y-DNA haplogroups. By definition, all modern human men fit into the African Y-DNA Haplogroup known as A. Haplogroup A is then split into the two major Haplogroups, B and CT respectively. From the Y-DNA Haplogroup known as CT, the remaining African and Non-African Y-DNA haplogroups (DE, F, etc) are descended.

    Because of people's different religious, marital, and social practices/histories, certain people tend to be strongly associated with certain haplogroups. For example, the Y-DNA haplogroup known as E1b1a is very strongly associated with African-American males. The Y-DNA haplogroup known as Q1a3a1 is strongly associated with Native American males.

    It's important to know that your last common paternal ancestor and a haplogroup paternal common ancestor are two different men. Your Y-DNA haplogroup ancestor lived thousands of years ago, whereas your last common paternal ancestor (father or grandfather, etc) lived recently within a genealogical time frame.  All men are related distantly but not all men are related recently.

  Well that's it!!!! As always, it has been a pleasure. If anyone has any questions, please feel free to ask.

Thanks
Steve

Understanding BGA Testing


     In this document, I am going to explain BGA Testing. These days there are a number of companies which claim that your ancestry can be determined from your DNA. DNA stores information such as the color of our eyes and hair. DNA keeps a record of our past ancestors and who we are related to. A person's ethnic composition, religion, language, and name are types of information that are NOT stored or defined by DNA. From the results of a BGA DNA test, people tend to infer socially defined concepts such as a person's religion. Such inferences can be wrong because DNA doesn't store such information.

Humans tend to categorize things based on observed patterns. Those patterns cannot be defined by DNA. So please keep that in mind.

     Now let's please turn our attention to the basics of BGA or Admixture Testing. The current position of the scientific community is that the jury is still out on BGA Testing. As we are going to see, there is very good reason for this!!!

BGA Basics And Science
     BGA stands for biogeographical ancestry. BGA tests are sometimes called Admixture Tests. A BGA test basically tries to use your DNA to determine or pinpoint the part of the world where your ancestor(s) originated. Using your DNA to show if two people have a common ancestor is valid. DNA contains information such as whether or not two people are related.

     However using your DNA to pinpoint where an ancestor was born, lived, or came from, is entirely different.  Here is the idea behind a BGA test.

  Suppose we have a population called the Handy Clan. The Handy Clan has 1000 people and is located on a remote island. Now let's say everyone in the Handy Clan population has a rare DNA marker which we will call -> M. In other words, the frequency of this DNA marker is 100% because everyone (1000 people) has the DNA marker M. Also, let's assume that no one outside of the Handy Clan, which is on this remote island, has the DNA marker M.

Now Laurie lives in the US in Oakland, California which is located outside the remote island and outside of the Handy Clan population. Let's suppose we discover Laurie has this same rare DNA marker M. 

    Can we say Laurie is from or has ancestry from the Handy Clan population?

     Under simple circumstances, yes!!!  We can confidently say that. If no other population in the world has this rare genetic marker M, then we can say yes. Laurie is either from, or has had an ancestor, that originated from the Handy Clan population. That's what a BGA does. It compares your DNA markers to a studied population. Since all one thousand people have the same DNA marker M, then Laurie must either have been born in that Handy Clan population or Laurie had an ancestor from that population.

However reality is not as simple as that!!!!!  Let's see a more realistic scenario.

A More Realistic Scenario 
     Now suppose we have three separate populations, the Handy Clan, Williams Clan, and Henderson Clan. Each population is located in a different part of the world. Each population or clan has 1000 people in it. Every person in each of the populations has the genetic marker M.  In other words, the frequency of the DNA marker M is 100% in each population.

     Now we discover again that Laurie, who lives in Oakland, which is outside each population, has the genetic marker M. 

Question: Does Laurie have ancestry from the Handy Clan population?

Now things have changed. The question is now harder to answer. The fact that Laurie has a DNA marker M that is found in multiple populations doesn't necessarily mean Laurie has ancestry from the Handy Clan population.  Laurie could have had an ancestor that lived or was born in any of those populations.
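The two scenarios can be sketched in Python. The clan names and the 100% frequencies come from the examples above. The function and variable names are made up for illustration, and a real BGA tool compares many thousands of markers, not just one.

```python
def candidate_populations(marker, marker_frequencies):
    """Return every reference population in which the marker occurs."""
    return sorted(
        population
        for population, frequencies in marker_frequencies.items()
        if frequencies.get(marker, 0.0) > 0
    )


# Simple scenario: only the Handy Clan carries marker M.
simple = {"Handy Clan": {"M": 1.0}}

# More realistic scenario: three populations all carry marker M.
realistic = {
    "Handy Clan": {"M": 1.0},
    "Williams Clan": {"M": 1.0},
    "Henderson Clan": {"M": 1.0},
}

print(candidate_populations("M", simple))     # one candidate
print(candidate_populations("M", realistic))  # three candidates
```

With one candidate the inference is straightforward. With three candidates, the marker alone cannot tell us which population Laurie's ancestor came from.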

That's the problem with a BGA DNA test. As we can see, the truth is not so clear cut in tests of this nature. The truth is based on a probability.  Any newly introduced population can change things dramatically. Therefore, when interpreting the results from a BGA or Admixture test, please keep in mind that your results may differ or change tomorrow. Laurie would need a paper trail or some definitive piece of evidence to confirm the inference drawn from the BGA results. The BGA data numbers alone don't necessarily prove anything.

The reason is that a BGA test is attempting to infer information from DNA that DNA doesn't define. An ancestor's original location can be anywhere. DNA simply doesn't reflect or store that type of information. From the frequency (or concentration) of those DNA markers in each population, we are making an inference which could be right or wrong. If a child is born in, say, Atlanta, Georgia, that geographical location and information will not be stored in the child's DNA.

  One of the biggest misconceptions out there is that a BGA or Admixture Test can pinpoint the exact tribe or small population someone is from. As one can clearly see, this is not necessarily true. DNA alone simply cannot do this the way it's advertised. This is one of the reasons the scientific community as a whole has not embraced BGA tests.

Now let's look at the basic BGA concepts.

BGA Concepts
In BGA terms, the DNA marker M is called an ancestry informative marker or AIM. Each population is called a reference population. An example of a reference population is the Yoruba. The Yoruba are a West African ethnic group that is studied by population geneticists. Many African-Americans have DNA markers that match the Yoruba group.

Now that we have the BGA basics, let's look at the BGA process and engine which is known as PCA.

BGA Process and PCA 
     The engine or workhorse of most BGA Analysis is PCA. PCA stands for Principal Component Analysis. PCA is a complex mathematical process that separates a bunch of data into its components. As a loose analogy, let's say we have a bag of 100 jelly beans that are of different colors. After separating the jelly beans by color, we see this -> blue=25, red=25, purple=25, and yellow=25. This means that each of the four colors makes up 25% (25/100) of the jelly beans. PCA essentially separates the data into groups in a similar way.

     The BGA process starts off with about 300,000 AIMs or SNPs. These SNPs are found across the first 44 chromosomes in humans. The SNPs are matched to a number of reference populations. The results are percentages that represent the concentration of the SNPs in each reference population. The engine running the show is PCA, which runs in the background of an algorithm.

Now let's look at a few BGA tests.

BGA Tests: Population Finder, Ancestry Painting, McDonald
There are a number of BGA tests out there. Family Tree DNA's BGA test is Population Finder. 23andMe's is called Ancestry Painting. Population Finder is a BETA test, so it's a work in progress. Population Finder uses continental groups in addition to reference groups.

Here is an example of Population Finder (PF) results:

Continent (Subcontinent)      Population          Percentage    Margin of Error
Europe (Western European)     French, Orcadian    28.53%        ±0.48%
Africa (West African)         Yoruba, Mandenka    71.47%        ±0.48%

There are four reference populations -> French, Orcadian, Yoruba, Mandenka. This person basically has DNA markers that match those reference populations. It's likely this person has ancestry from some of those populations, but not necessarily all of them. A paper trail would be needed to confirm ancestry.

Because the Population Finder is a beta test and has limited reference populations (same for 23andME's Ancestry Painting), many people turn to an Extended BGA Analysis. This is where Dr Doug McDonald comes in.

McDonald's Extended BGA 
Dr Douglas McDonald is a chemist at the University of Illinois in Urbana, Illinois. In fact, he actually created the Population Finder for Family Tree DNA. McDonald has access to more studied reference populations than Family Tree DNA or 23andMe currently have. Because of this, you can get a more "fleshed out" or "extended" BGA Analysis.

McDonald gives his results in the form of an email with four graphs. Here are McDonald's results of my cousin Lonette Lanier's extended BGA test as shown in quotes below:

"LonetteFayLanier216745-autosomal-o-results.csv
Most likely fit is 27.9% (+-  0.1%) Europe (various subcontinents) and 72.1% (+-  0.1%) Africa (all West African).

The following are possible population sets and their fractions, most likely at the top

French= 0.279 Mandenka= 0.721
Hungary= 0.280 Mandenka= 0.720
English= 0.277 Mandenka= 0.723

There is also about 0.4% Native American that is strong and likely real, as well as other little bits on the chromosomes but they are weak and probably unimportant."

Each line, "French= 0.279 Mandenka= 0.721", is a population set. There are three population sets. Each population set gives a likely or probable ancestry for my cousin Lonette. Each population set is a combination that gives the best fit for Lonette's data. It doesn't mean Lonette necessarily has ancestry from say, the French. But she does have DNA markers that match the French reference population. The multiple population sets are the result of Lonette's DNA markers that are spread across multiple populations. This is why it's difficult to pinpoint a person's ancestral origin to a specific tribe or single population via your DNA alone.
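To show how a population set line can be read, here is a short Python sketch. It splits a line such as the ones quoted above into population names and fractions, and checks that the fractions add up to the whole person. The function name is made up for illustration, and the line format is assumed from the three lines in the email.

```python
import re


def parse_population_set(line):
    """Turn 'French= 0.279 Mandenka= 0.721' into a name -> fraction dict."""
    pairs = re.findall(r"(\w+)=\s*([0-9.]+)", line)
    return {name: float(fraction) for name, fraction in pairs}


population_set = parse_population_set("French= 0.279 Mandenka= 0.721")
print(population_set)

# The fractions in one population set should add up to about 1 (100%).
print(abs(sum(population_set.values()) - 1.0) < 1e-9)
```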

It's important to always back up DNA evidence with documents or other pieces of evidence to validate a claim. The numbers alone don't always or necessarily identify the truth.

Now let's look at the issues the scientific community has with BGA Testing.

Issues With BGA Or Admixture Testing
The scientific community as a whole hasn't really embraced BGA or Admixture Testing. Using your DNA to establish whether two or more people are related via a common ancestor is valid. However, using your DNA to locate where your ancestor(s) originated is quite a different task. An ancestor could have been born or lived in any part of the world. More important - DNA simply doesn't define or contain information such as an ancestor's geographical location or point of origin. That type of information is NOT an attribute of a genetic mutation. Therefore BGA or Admixture tests don't have a basis in genetics. That's the scientific community's main objection to BGA or Admixture tests. The results from a BGA or Admixture test are used to make inferences from observed correlations. A correlation can be dangerous in science because it can lead to an incorrect inference from an observed set of data.

There is a very big difference between a correlation (a loose association) and a direct, cause-and-effect relationship between two variables.

This doesn't mean BGA tests aren't valuable. A BGA test can lead one to insight into their past. However, you must understand that the results from a BGA test aren't final. The results from a BGA test are tentative and can easily change tomorrow.

There are at least three main current hurdles with a BGA Analysis:

1) Populations can change location and identity. They are not static. What we know about a population's history is limited. Modern humans have been here for approximately 200,000 years. No one can know the entire history of any population. We can have approximate knowledge, but NOT complete knowledge.

2) We simply don't yet have a complete set of reference populations to make any final judgment calls. (I will explain this shortly.)

3) Different algorithms can produce different results.

For example suppose Dr McDonald gives me the following simple BGA results:

Finnish=.100 and Yoruba=.900.

This is based on the fact that the scientific community has studied the Yoruba, the Finnish, etc. This would lead one to believe that one has a large Yoruba ancestry. The Yoruba ancestry may be confirmed with a paper trail.

Now suppose the scientific community has studied and approved a new reference population, C, in say a few years. Now a rerun of Dr McDonald's results yields the following:

Population C=.450, Finnish=.100, and Yoruba=.450

Now as you can see, things have changed. My ancestor now could have lived in the Yoruba, or could have lived in the new reference population C. This scenario could happen. As you can see, none of these results are absolute or final in the sense that they can't change.

     In addition, different algorithms can produce different results. An algorithm is simply a method or set of steps to solve a problem. The algorithm is very important. It's what produces your DNA results. Right now there are a number of tools out there that claim the ability to produce valid BGA results. Each of these tools may run under different algorithms.

For example - I have taken three BGA tests: Ancestry Painting, Population Finder, and McDonald. Each has produced different results. The analysis from 23andMe stated I had 7 percent Asian ancestry. Now this could be significant or it could be noise. Neither FTDNA's Population Finder nor McDonald's findings showed that 7 percent Asian ancestry. The bigger question is which one is correct? Population Finder is a BETA test, so I can assume that its findings are approximate. Can the same be said for 23andMe's Ancestry Painting results or Dr McDonald's BGA findings? The truth is that at this time - it's impossible to tell which one is correct or incorrect.

     The most important point to take from this tutorial is that a BGA test can yield valuable information, but not necessarily definitive information. Technically, the only fact-based information that can be produced from a BGA test is that a person has DNA markers (AIMs) that match a reference population.

Well that's it for BGA Analysis. If anyone has questions, please feel free to ask.

Thanks
Steve

Autosomal DNA Testing: Phasing


     Good Day Everyone. Hope everyone is doing well!!!!  In this document, the process of Phasing will be discussed and explained. Phasing is the newest craze in Genetic Genealogy. Right now there aren't that many tools out there to perform phasing features. What is phasing? What is it all about? Let's take a look at the new kid on the block.

Introduction To Phasing    
     What if you wanted to DNA test one of your parents but couldn't? Let's assume one of your parents is unavailable and you would like to gather DNA from that parent. Can you do this? The answer is yes. This is where phasing comes in. Phasing is the attempt to reconstruct a parent's DNA data from a single child and the other contributing parent. The idea behind phasing is: Child DNA - Parent 1 DNA -> Parent 2 DNA. The result from phasing is a pseudo DNA data file that contains SNPs of the untested parent. In order to phase a single parent's DNA, you need the DNA data files of BOTH a tested child and a tested parent. Let's take a look at how phasing works.

    If you remember, an individual has two of every known SNP like this -> AG. The letters "A" and "G" are DNA bases called SNPs (snips). SNP stands for single nucleotide polymorphism. SNPs are sometimes erroneously referred to as alleles. The reason you have two of every known SNP is that you receive one from each parent. Let's say a child has the following SNPs or alleles -> AG. Now let's assume that a tested parent (mother) has the following SNPs -> AT. Can we figure out which SNP the child received from which parent? The answer is yes. Since both child and mother share the common SNP -> "A" (Mom -> AT, Child -> AG), the child must have inherited the "A" from the mother and the "G" from the father (Mom -> AT, Child -> AG, Dad -> ?G). The result then will be a phased DNA data file that contains the single paternal SNP -> "G". Normally your DNA data file from either Family Finder or Relative Finder has two of every SNP or allele. However, a pseudo phased DNA data file will contain only one (half) of every known SNP or allele.
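The Mom -> AT, Child -> AG example can be sketched in Python for a single SNP. This is only an illustration of the subtraction idea, not the method used by any particular phasing tool. The function name and the "--" placeholder for a missing reading are assumptions made for the example. The sketch returns None whenever the data do not single out one paternal SNP, for instance when the child and mother both read AG.

```python
NO_CALL = "--"  # assumed placeholder for a SNP that could not be read


def paternal_allele(child, mother):
    """Return the SNP the child must have received from the father.

    Returns None when the paternal SNP cannot be determined.
    """
    if NO_CALL in (child, mother) or len(child) != 2 or len(mother) != 2:
        return None
    candidates = set()
    for i in (0, 1):
        # If this letter could have come from the mother,
        # then the other letter came from the father.
        if child[i] in mother:
            candidates.add(child[1 - i])
    return candidates.pop() if len(candidates) == 1 else None


print(paternal_allele("AG", "AT"))  # G
print(paternal_allele("AG", "AG"))  # None
```

Running this over every SNP in the child's and mother's data files would build up the pseudo DNA data file for the father, one SNP at a time.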

   Remember that an autosomal DNA test produces matches. When a person is a match to you, that person matches half of the SNPs that are in your normal DNA data file. In other words, a match is related to one side of your family. Because a phased pseudo DNA data file only contains SNPs from a single parent, only matches from one side of your family are revealed. In fact, this is the reason behind phasing.

Phasing: The Reason Behind It
   Phasing is good in cases where you don't have a DNA data file from a parent. This works well in cases where one has, say, a deceased parent. For example, recently I phased SOME of my deceased paternal grandfather's DNA data. However, phasing really shines in "lining up" your matches. Remember that an autosomal DNA test produces matches on both sides of your family. More importantly, an autosomal DNA test cannot tell you which side of your family a match is on. There are two main ways to determine which side of your family a match is on:
  1. Simply test both of your parents and see where the matches line up. If a match appears in the DNA match list of a particular parent, then you know which side the match is on.
  2. Simply test a single parent and observe whether the match appears in the DNA match list of the tested parent. If the match doesn't appear in the tested parent's list, then the match can be assumed to be on the other parent's side. This type of exclusion can only be done when considering close relatives (parent through and including 2nd cousins). Starting at the 3rd cousin level, exclusion is based entirely on probability, and a non-match to a parent doesn't necessarily mean no relation. In other words, a person can still be related to the tested parent even though that parent didn't match. This is because the "masking" effects of recombination begin to appear at the 3rd cousin level.
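The exclusion logic of the second method can be sketched as follows. This is a rough illustration only: the match names are invented, the function assign_sides is hypothetical, and the word "probably" is there because exclusion becomes a matter of probability at the 3rd cousin level and beyond.

```python
def assign_sides(my_matches, parent_matches, tested_side="maternal"):
    """Label each of my matches by comparing against one tested parent's list."""
    other_side = "paternal" if tested_side == "maternal" else "maternal"
    sides = {}
    for match in my_matches:
        if match in parent_matches:
            sides[match] = tested_side
        else:
            # Not in the tested parent's list: assumed to be the other side,
            # which is only safe for close relatives.
            sides[match] = "probably " + other_side
    return sides


my_matches = ["Cousin 1", "Cousin 2", "Cousin 3"]   # made-up names
mother_matches = {"Cousin 1", "Cousin 3", "Cousin 9"}
print(assign_sides(my_matches, mother_matches))
```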
The third way is phasing. Phasing will automatically reveal which side of your family your matches will fall on. Phasing is considered to have much promise. However there are limitations and catches to phasing. As in most cases, it's never that simple. Let's read on to find out. 

Phasing: Limitations and Catches
    The biggest catch to phasing is that your pseudo DNA data file will only contain, at maximum, half the SNPs (alleles) that a tested person's DNA data file will contain. Remember with phasing, you are creating a virtual DNA data file without actually testing an individual. From a practical perspective, a phased DNA data file will actually contain much less than half the SNPs a normal DNA data file will contain. There are two reasons for this: Random No Calls and Random On Homozygous. Let's take a look.

Phasing: Random No Calls 
     Let's say both the child and the mother are -> AG at the same location on their chromosomes. In this case, the child's two SNPs are different, but they are both identical to the parent's. Can we determine, between the child's "A" and "G", which SNP came from which parent? The answer is NO. As you can see, either the "A" or the "G" could have come from the mother. Therefore, there is no way to deduce which SNP came from the mother and which from the father. This is what is referred to as a random no call. Random-no-call SNPs are the reason why linked DNA segments are used in an autosomal DNA test to identify common ancestry. A DNA data file that's generated from a tested individual, by default, contains random-no-call SNPs. A well designed matching algorithm would simply ignore all random-no-call SNPs as it detects them.
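To see how random no calls shrink a phased file, here is a small sketch that phases several locations at once and leaves out the ones that cannot be decided. The position names, the data, and both function names are invented for illustration, and this is not the actual Gedmatch algorithm.

```python
def untested_parent_allele(child, tested_parent):
    """Return the letter the untested parent gave, or None for a random no call."""
    first, second = child
    if first == second:
        # Child has the same letter twice, so each parent gave that letter.
        return first if first in tested_parent else None
    first_shared = first in tested_parent
    second_shared = second in tested_parent
    if first_shared and not second_shared:
        return second
    if second_shared and not first_shared:
        return first
    return None  # e.g. Child -> AG, Mom -> AG: a random no call


def phase_file(child_snps, tested_parent_snps):
    """Build the pseudo file of the untested parent, skipping random no calls."""
    phased = {}
    for position, child in child_snps.items():
        parent = tested_parent_snps.get(position)
        if parent is None:
            continue  # location not tested in the parent
        allele = untested_parent_allele(child, parent)
        if allele is not None:
            phased[position] = allele
    return phased


child = {"pos1": "AG", "pos2": "AG", "pos3": "CT", "pos4": "AG"}
mother = {"pos1": "AT", "pos2": "AG", "pos3": "CC", "pos4": "AG"}
phased_father = phase_file(child, mother)
print(phased_father)                                          # {'pos1': 'G', 'pos3': 'T'}
print(len(child) - len(phased_father), "positions dropped")   # 2 positions dropped
```

In this toy example, half of the locations are random no calls and never make it into the phased file.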

   In a phased DNA data file, random-no-call SNPs are not inserted into the file. This is one of the reasons why a phased DNA data file is much smaller than a normal DNA data file. To give you an idea, here is a picture of the phased output of my paternal grandfather - William E. Handy Sr.


I recently used the new Gedmatch Phasing Utility to create a pseudo phased DNA data file of my deceased paternal grandfather - William E. Handy Sr. The kit number is PF208196P1.

     If you click the picture shown above, you will see a phased listing of my paternal grandfather's DNA matches. The top match (F208196) is his son, who is my father, Steve Handy Sr. A parent and child normally share between 3300cMs - 3400cMs of DNA. As one can see, my dad and his father only share 410cMs of DNA in this phased reading. The low cM amount is due to the size of the phased DNA data file. Moving on to myself (F200507), my grandfather and I only share 217.3cMs of DNA. We are supposed to share between 1700cMs - 1900cMs of DNA since William Handy Sr is my grandfather.

     One way around this is to simply phase all of your full siblings against the same parent. That way, you can build a bigger list of matches which fall on the side of the phased parent. Each sibling will likely generate a different phased virtual DNA data file against the same parent. However each sibling has the potential to filter and reveal more matches that fall on the side of the phased parent.

     The important concept though is that all of the matches shown at the above URL link are ALL on my paternal grandfather's side. In other words, the matches shown at the above URL, all "line up" on my paternal grandfather's side. There is no need to worry about matches shown on my father's maternal side because none are shown in this phased output. The phased DNA data file has completely filtered out all of my father's maternal matches.

Phasing: Random On Homozygous   
     Let's say both the child and the mother are -> AA at the same location on their chromosomes. Both of the child's SNPs are the same value (homozygous) and identical to the mother's. Can we determine which SNP came from which parent? The answer is yes. The father and mother both contributed a SNP with the value of "A". However, this will not help in an autosomal DNA test. The reason is that because both parents are identical to the child at that location, there is no way to determine which parent a match is related to.  

   For example, suppose mom is -> AAAAA and dad is -> AAAAA. If a match is -> AAAAA, then how can you know which parent the match is related to? This is called Random On Homozygous. In normal tested cases, this presents the same problem as in a phased scenario. People who are descended from or are a part of an endogamous population are affected by random on homozygous, or ROH. A good example would be people of Ashkenazi Jewish ancestry. Many first cousins married each other and produced offspring. As a result of the inbreeding, the SNP or allele pool can become highly homogeneous over time. Phasing would not help much in this case.

On a final note, half siblings already have phased data.

     Well that's it for phasing. Hopefully you now have a basic and clear understanding of phasing. Currently there are two tools out there that do phasing - GedMatch and David Pike's Tool. To see the actual comparison in my case, use kit number PF208196P1 in the second link - Compare Kits. To generate a phased file, use the first or third link shown below.

Thanks
Steve 

Understanding Haplogroups


     A haplogroup is a population of individuals who share a unique set of genetic markers that were derived from a common ancestor. All members of a haplogroup are descendants of a single man or woman who lived in the very distant past. The Y-chromosome (Y-DNA) and mtDNA are the genetic structures that are associated with haplogroups. (The autosomal chromosomes are NOT associated with haplogroups.) Each mtDNA (maternal) and Y-DNA (paternal) haplogroup has a distinct set of genetic markers that define, distinguish and separate the different mtDNA and Y-DNA haplogroups. Your haplogroup (L3e2a, for example, is a mtDNA subhaplogroup) is defined by a unique set of DNA markers that's present on either the mtDNA or Y-DNA. Only members of the same haplogroup that you belong to have those distinct DNA markers. Every person has a mtDNA haplogroup, and every man also has a Y-DNA haplogroup. In essence, a mtDNA haplogroup such as L3e2a represents a single woman that you are descended from who lived many thousands of years ago. 

    There are three basic mtDNA haplogroups - L, M, and N. The L haplogroup represents Mitochondrial Eve - an ancient and distant African woman.  Mitochondrial Eve is the most recent common maternal ancestor of all currently living humans. By definition, all of modern humanity fits into the L Haplogroup. As Mitochondrial Eve produced daughters, grand-daughters, etc, her original mtDNA sequence was copied and changed. This eventually produced the modern haplogroups that we see today. The L haplogroup is divided into seven subhaplogroups: L0, L1, L2, L3, L4, L5, and L6. Six of these subhaplogroups (L0, L1, L2, L4, L5, L6) are found almost exclusively in Africa. This supports the notion that modern humanity began in Africa. Eventually modern humans traveled outside of Africa. The L3 mtDNA subhaplogroup represents that transition out of Africa. The M and N haplogroups are both descended from the L3 mtDNA haplogroup. Essentially all of the other mtDNA haplogroups which are found outside of Africa (A, B, C, R, T, etc) are descended from either the M or N mtDNA haplogroups. Haplogroups can be further subdivided into more subhaplogroups based on newly discovered DNA markers.

    Similarly, there are Y-DNA haplogroups, each also labeled with a letter of the alphabet. Y-Chromosomal Adam represents the most recent common paternal ancestor of all living humans. By definition, all modern human men fit into the African Y-DNA Haplogroup known as the A Haplogroup. Haplogroup A is then split into the two major Haplogroups, B and CT, respectively. The remaining African and Non-African Y-DNA haplogroups (DE, F, etc) are descended from the Y-DNA Haplogroup known as CT.

    Because of people's different religious histories, marital, and social practices, certain people tend to be strongly associated with certain haplogroups. For example, the A,B,C,D, and X mtDNA haplogroups are strongly associated with Native Americans.  The R mtDNA haplogroup is present among 89% of Europeans. The Y-DNA haplogroup known as E1B1A is very strongly associated with African-American males.

    It's important to understand that two or more people within the same haplogroup are related within an anthropological time frame (thousands of years), not necessarily a genealogical time frame (one to three hundred years). Two or more people may not share a recent common ancestor, but they still may share a distant haplogroup common ancestor. Your last (recent) common ancestor and your distant (haplogroup) common ancestor are two separate individuals. For example, my mother would be my last (not distant) common ancestor. 

    By the same token, two or more people can be in the same haplogroup and be related within a genealogical time frame. An example of this is seen among Native Americans. Having crossed the Bering Straits within the last 12,000 years, many Native American groups and circles have remained fairly small and are genetically similar enough to share BOTH matching haplogroups and recent common ancestors.

    On the other hand, two or more people may be in two different haplogroups, but still be related in a genealogical time frame. For example, my dad and I are in two different mtDNA haplogroups, but my dad is still my most recent common ancestor. This can be easily understood if one remembers that a haplogroup represents a single and separate distant side of a person's ancestry. A mtDNA haplogroup represents a single distant strict maternal ancestry (child -> mother -> mother's mother, etc). A Y-DNA haplogroup represents a single distant strict paternal ancestry (son -> father -> father's dad, etc).

Inferences From Haplogroup Assignments
A common practice is to make an inference about a person's ethnic composition, religion, and other socially defined information based on a person's haplogroup assignment. These inferences or conclusions are based on the frequencies of haplogroups. Basically, the frequencies of certain haplogroups differ across certain populations. These frequencies exist because of the social practices, marital histories, and other social norms that many cultures exhibit.

For example - the Y-DNA haplogroup known as E1B1A7A has a very high distribution across Western Africa. E1B1A7A also has a very high frequency and presence among African-American males. This leads to a conclusion that a male who is E1B1A7A must be African-American.  This of course is not necessarily true. E1B1A7A also has a distribution in South America.

As another example - the Y-DNA haplogroup known as R1B1A has a strong presence and high distribution among European men. Yet my 2nd cousin - Lewis Lamar, who is considered African-American, has the R1B1A haplogroup. The E1B1A7A haplogroup is absent in his paternal lineage.

Such inferences drawn from a haplogroup assignment are very dangerous to make. The important concept to grasp and understand is that DNA does NOT define nor store information such as a person's ethnic composition or religion. Those are socially defined concepts which are NOT defined by a genetic mutation. A person cannot be said to be 100% African or European from the DNA data alone.

Other forms of corroborating evidence would be required to confirm an individual's ethnic composition, religion, or any other socially defined concept.

  If anyone has any questions, please feel free to ask.

Thanks
Steve

Autosomal DNA Testing: Recombination


     Good Day Everyone. How is everyone doing???  Fine I hope. In this document, we are going to review the natural and biological process of recombination. Recombination is an important concept that one should be familiar with in DNA Genealogy. This is especially true in Autosomal DNA testing. The other DNA tests, Y-DNA and mtDNA, are immune to the effects of recombination. The same cannot be said for an autosomal DNA test. An autosomal DNA test is constrained and affected by genetic recombination. In this document, we are going to see why. We will start with a high level look at recombination. Then we will move to a more low level view of recombination and see how the tests are affected. Let's begin.

High Level View: Basics Of Genetic Recombination 
Genetic recombination is defined as a biological process where genetic material is broken and joined to form new genetic material. Much of the diversity of life that surrounds you is due to recombination. Recombination occurs in each parent when egg and sperm cells are formed, before a child is conceived. Humans have 46 chromosomes. When people have children, only 23 chromosomes (half) are passed to the child from each parent. The key is that 23 new chromosomes from each parent are created and passed to the child. Here is what is meant by new.

Basic Mechanism of Recombination In A Parent
Start                              Touch                      New Chromosomes
C(Blue)--><--C(Red)     C(Blue)C(Red)         C(Blue/Red)<--->C(Red/Blue)

Let's start off with two chromosomes which we'll call -> C. One chromosome is blue -> C(Blue) and the other chromosome is red -> C(Red). In each parent, the 46 chromosomes are arranged in 23 pairs. Each pair consists of two chromosomes like the ones shown above. When egg or sperm cells are being formed, both chromosomes within a pair physically touch each other and separate. When the two chromosomes touch each other, they exchange genetic information. The result is that after separation, two new chromosomes are created. 23 new chromosomes are then placed into a sperm cell and 23 new chromosomes are placed into an egg cell -> sperm[23,C(Blue/Red)] and egg[23,C(Red/Blue)].

When a sperm and egg cell combine to form a new child, the child will have 46 new chromosomes. This is why siblings (brothers and sisters) look different from each other and from the parents. The exception to this is identical twins. In a nutshell, that's basically what recombination entails. Without recombination, this would be a boring world as everyone would look the same.

Now let's look at recombination from a low level view.

Low Level View: Exchange of Genetic Material
DNA is composed of four bases -> A, T, C, and G. A DNA segment would look like this -> ATTTTCGC. Let's take a look at recombination at the DNA segment level.

Start                                                                   Touch
Chr1-AAAAAA->  <-TTTTTT-Chr2                     Chr1-AAAAAATTTTTT-Chr2  

New Segments
Chr1-AAATTT    AAATTT-Chr2

Shown above is an example of recombination at the DNA level. We start off with two chromosomes -> Chr1 and Chr2. Basically recombination has created two new DNA segments -> AAATTT and AAATTT. The two DNA segments have exchanged DNA or genetic material with each other. Of course there are many possible combinations that can be generated from recombination. We could have gotten ATAAAA, TTAATA, etc.
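For readers who like to see it spelled out, here is a toy sketch of a single exchange between the two segments above. It assumes one break point in the middle and the function name crossover is made up for this example. Real recombination is far more complex. Note that when both new segments are read left to right, the second one comes out as TTTAAA, which is the AAATTT segment in the diagram read starting from its Chr2 end.

```python
def crossover(chrom_1, chrom_2, point):
    """Swap everything after the break point between two equal-length segments."""
    new_1 = chrom_1[:point] + chrom_2[point:]
    new_2 = chrom_2[:point] + chrom_1[point:]
    return new_1, new_2


# Chr1 -> AAAAAA and Chr2 -> TTTTTT, with the break after the third base
print(crossover("AAAAAA", "TTTTTT", 3))  # ('AAATTT', 'TTTAAA')
```

Moving the break point gives different new segments, which is one reason so many combinations are possible.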

Genetic recombination is responsible for another process, the loss of genetic material.

Low Level View: Loss of DNA
Each of us has 46 chromosomes. This is a basic fact. 23 of our 46 chromosomes we inherit from mom, and the other 23 we get from dad. In other words, 50 percent (23/46), or half of your DNA, is from each parent. If 50 percent, or half of your DNA, is from your mom, then it stands to reason that half of that 50 percent, or 25 percent, is from each of your mother's parents. In other words, we each get about 25 percent of our DNA from each grandparent, on average.

This is what basically happens over time

50% parent -> 25% grandparent -> 12.5% great-grandparent -> 6.25% great-great-grandparent -> etc.

As you can see, the amount or percentage of DNA that's inherited from an ancestor gets smaller as you go back further into the past. Each generation, the percentage of your previous ancestor's contribution to your DNA is halved on average. The reason for this is recombination. What this means is that a DNA segment from an ancestor gets smaller and smaller as the DNA segment is passed down through the generations.
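The halving can be written as a simple formula: the expected share from one ancestor who is n generations back is 100 / 2^n percent. Here is a short sketch of that arithmetic. The function name expected_share is just for illustration, and these are averages, not what any one person actually inherits.

```python
def expected_share(generations_back):
    """Average percent of DNA from a single ancestor n generations back."""
    return 100 / 2 ** generations_back


labels = ["parent", "grandparent", "great-grandparent", "great-great-grandparent"]
for generations_back, label in enumerate(labels, start=1):
    print(f"{label}: {expected_share(generations_back):.2f}%")
# parent: 50.00%
# grandparent: 25.00%
# great-grandparent: 12.50%
# great-great-grandparent: 6.25%
```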

Ancestor-AAAATTTTGGGGCC --> Child-AAAATTTTGGGG --> Grandchild-AAAA --> Great Grandchild-AAA --> etc

Shown above is an example of recombination at work over the generations. We start off with a DNA segment in an ancestor that has 14 DNA bases. Over time, this DNA segment has been reduced as it's passed down through the generations. Notice that as the DNA segment was passed from the child to the grandchild, a large chunk of that DNA segment was removed!!!!! On the other hand, notice from grandchild to great-grandchild, only a single DNA base was removed. This highlights an important theme when dealing with recombination.

IMPORTANT: The amount of DNA that is lost between generations is random.

This is by far the most important concept to grasp and understand. Recombination is an unbiased and haphazard process. Recombination doesn't care. It removes DNA in a completely random, unbiased, and unpredictable pattern and fashion. The amount removed can be quite large, or quite small. It makes no difference.

Now let's look at autosomal DNA tests in light of recombination.


Recombination: Autosomal DNA Testing 
Now that you are armed with the basics of recombination, we can look at how an autosomal DNA test operates. Behind the scenes of an autosomal DNA test such as Family Finder or Relative Finder is a matching algorithm that performs the work. The matching algorithm identifies linked DNA segments that would be present in two or more people who are descended from a common ancestor. These linked DNA segments are composed of DNA bases called SNPs (snips). If there are enough matching DNA segments between two or more people, a "match" is declared.

     If two or more people are descended from a common ancestor, they should share linked DNA segments from that common ancestor. Simple enough right??? However there is one player in the game that needs to be recognized -> recombination. As those segments are passed down across the generations, recombination will shorten those linked DNA segments. In some cases, recombination may even completely "erase" some of those linked DNA segments. That's why an autosomal DNA test can generally only go back 5 to 7 generations. After that many generations, recombination has the potential to completely erase linked DNA segments from a shared common ancestor.

Let's take a look
Line 1: Ancestor-AAAATTTTGGGGCC --> Child-AAAATTTTGGGG -->  Grandchild-AAAATTT --> Great Grandchild-AAAATT ---> person A (match)

Line 2: Ancestor-AAAATTTTGGGGCC --> Child-AAAATTTTGG -->   Grandchild-AAAATTTTG --> Great Grandchild-AAAATTG --> person B (match)

Line 3: Ancestor-AAAATTTTGGGGCC --> Child-AAAAT -->   Grandchild-AAA --> Great Grandchild-AA -->  person C  (non-match)

Shown above are three lines of descent from a shared common ancestor to three people A, B, and C. Let's assume we know beforehand that these people are related and are descended from the same ancestor. The matching algorithm doesn't know the three people are related. The matching algorithm only knows that its job is to look for matching DNA segments. Let's say in this example that the matching algorithm will declare a match if the linked DNA segments are identical and have at least six matching SNPs. In this case, the algorithm will declare both person A-->AAAATT and person B-->AAAATTG as a match.

Person C will be left out. As we can see, recombination has randomly removed enough SNPs from Person C's DNA segment such that the matching algorithm will not declare a match. This highlights an important theme in autosomal DNA testing.

IMPORTANT -> Just because two or more people don't match, doesn't necessarily mean they aren't related in genealogical time.
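The made-up rule in this example (at least six identical SNPs in a row) can be sketched like this. This is a toy version only. The function names are hypothetical, and the real matching algorithms at the testing companies work on far more data and use centiMorgans, not a simple count of letters.

```python
def longest_shared_run(seg_x, seg_y):
    """Length of the longest identical stretch found in both segments."""
    best = 0
    # previous[j] = length of the shared run ending at seg_x[i-2] and seg_y[j-1]
    previous = [0] * (len(seg_y) + 1)
    for i in range(1, len(seg_x) + 1):
        current = [0] * (len(seg_y) + 1)
        for j in range(1, len(seg_y) + 1):
            if seg_x[i - 1] == seg_y[j - 1]:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def is_match(seg_x, seg_y, min_snps=6):
    """Declare a match when the shared stretch has at least min_snps SNPs."""
    return longest_shared_run(seg_x, seg_y) >= min_snps


people = {"A": "AAAATT", "B": "AAAATTG", "C": "AA"}
print(is_match(people["A"], people["B"]))  # True
print(is_match(people["A"], people["C"]))  # False
print(is_match(people["B"], people["C"]))  # False
```

Person C is descended from the same ancestor, yet falls below the threshold against both A and B.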

Now let's look at some examples of recombination and its effects.

Example 1: Loss Of Large cM Amounts
Here is an example from my personal family. Remember that the centiMorgan (cM) is the unit of measurement that is relevant in an autosomal DNA test. The centiMorgan contains various DNA testing properties such as DNA segment length, number of SNPs, etc, all rolled into one. The centiMorgan gives us a consistent way to compare apples to apples and make judgements as to whether two or more people are related.

Technically, the centiMorgan gives a probability or propensity of recombination. But let's keep things simple.

Family Case 1: Match A
Juliette Turner (grandmother) & Match A
Shared cM -> 65.13
Longest segment -> 42.96cM

Steve Handy Sr (father) & Match A
Shared cM -> 62.02
Longest Segment -> 42.96cM

Steve Handy Jr (myself via 23andME) & Match A
Shared cM -> 12
Longest Segment -> 12cM

As one can see, the effects of recombination are truly random and unbiased. In the generation between my grandmother and dad, only 3.11cMs of DNA were lost (65.13cM - 62.02cM). However in the generation between my dad and myself, over 50cMs were lost!!! (62.02cM - 12cM). This is how recombination operates. It operates in a haphazard and unpredictable manner. In essence, a matching algorithm deals with the results after recombination has done its part.
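The subtraction can be laid out generation by generation using the shared cM figures from Family Case 1. This is just the arithmetic from the text in Python; the function name cm_lost is made up for this sketch.

```python
# Shared cM with Match A, oldest generation first (figures from Family Case 1)
shared_with_match_a = [
    ("Juliette Turner (grandmother)", 65.13),
    ("Steve Handy Sr (father)", 62.02),
    ("Steve Handy Jr (myself)", 12.0),
]


def cm_lost(chain):
    """cM lost between each generation and the next one down."""
    return [round(older[1] - younger[1], 2) for older, younger in zip(chain, chain[1:])]


print(cm_lost(shared_with_match_a))  # [3.11, 50.02]
```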

Now let's look at a 2nd example of the effects of recombination.


Example 2: Siblings Matching Differently 
Sometimes a distant cousin will match two siblings at different cM levels and on different chromosomes.

Family Case 2: Match B
Michael Mitchell (sibling) & Match B
Shared cM -> 47.69
Longest Segment -> 25.55
Chromosome 3 -> 25.55cM
Chromosome 6 -> 22.14cM

Muriel Mitchell (sibling) & Match B
Shared cM -> 10
Longest Segment -> 10
Chromosome 3 -> 10cM

Steve Handy Jr (myself) & Match B
Shared cM -> 0cM
Longest Segment -> 0cM
No ancestral cMs on any of my chromosomes

     Shown above is an example of Match B on my maternal side. Muriel Mitchell is my mother.  Michael Mitchell is my maternal uncle. Match B shares 47.69cMs with my maternal uncle and 10cMs with my mother. Michael and Muriel Mitchell are siblings. The key is the last recombinational event at the sibling level. The last recombinational event occurred separately between my maternal grandmother and each of her children. Genetic recombination randomly removed a large chunk of ancestral cMs in my mother's line. From maternal grandmother to the ancestor, the line of descent is one and the same. However in the last generation from my maternal grandmother to my mother, a single recombinational event removed a large chunk of ancestral cMs.

     On the other hand, from my maternal grandmother to my maternal uncle (mom's brother), 47.69cMs remained which was enough for the matching algorithm to detect it.

    FTDNA's matching algorithm doesn't report anything less than 20cM. Therefore Match B actually didn't show up in my mother's Family Finder match list. I had to look at the FTDNA chromosome browser to see the amount of overlap, which was 10cM. As one can see, siblings can match a distant cousin at different cM levels. By the time I arrived on the scene, recombination had completely erased all traces of the ancestor's matching DNA from my chromosomes.
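Here is a small sketch of how a reporting threshold hides a real but small overlap. It uses the 20cM cutoff as described above and the shared cM figures from Family Case 2. The variable and function names are invented for illustration, and this is not FTDNA's actual code.

```python
REPORT_THRESHOLD_CM = 20.0  # cutoff as described in the text

# Shared cM with Match B (figures from Family Case 2)
shared_with_match_b = {
    "Michael Mitchell": 47.69,
    "Muriel Mitchell": 10.0,
    "Steve Handy Jr": 0.0,
}


def shows_in_match_list(shared_cm, threshold=REPORT_THRESHOLD_CM):
    """Names whose shared cM is large enough to be reported as a match."""
    return [name for name, cm in shared_cm.items() if cm >= threshold]


print(shows_in_match_list(shared_with_match_b))  # ['Michael Mitchell']
```

My mother's 10cM overlap is real, but it sits below the cutoff, so it only shows up in the chromosome browser.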

    One of the biggest misconceptions out there is that an autosomal DNA test can work against a group of people due to their ethnic background or history. As one can see, this is simply not the case and is actually quite impossible to do. 

There are two important biological events to remember in this tutorial.

  1. Genetic recombination indiscriminately and unbiasedly "chops" up DNA in each successive generation as a child is being conceived. 
  2. Each parent then randomly passes on 50% of their DNA to a child, while the parent retains the other 50% of their own DNA.
The two biological processes that were just mentioned are simply impossible to control or predict.

     The only time a population's ethnic background, history, or any other social factor is relevant in an autosomal DNA test is if there is a history of inbreeding among close relatives. For example, people who are of Ashkenazi Jewish ancestry are descendants of ancestors where many first cousins married each other and produced children. This also occurred in early colonial America. This was, and still is, a social norm in many cultures. In examples of this nature, an autosomal DNA test will show distant cousins that share an unexpectedly large cM amount of DNA. The reason for this is that the population's gene pool has become highly homogeneous under such inbreeding conditions over time.  Recombination will simply dice up more similar DNA segments and more offspring will receive those similar DNA segments.

     Even in light of such historical inbreeding, genetic recombination operates in the same fashion as it always does. DNA is randomly chopped and dispersed to descendants in an unbiased, random, and uncontrollable fashion. In the end, the matching algorithm is left to deal with the "scraps" and make a decision. The autosomal DNA matching algorithm always operates within an area of uncertainty. It merely looks at the resultant DNA scraps and leftovers from recombination, declares if there is evidence of common ancestry between parties, and gives a prediction of the actual relationship between two or more parties.

Well that's it for recombination. Hopefully that will clear up any misconceptions.

As always, it was a pleasure to serve everyone.

Thanks
Steve