Saturday, November 3, 2012

Understanding BGA Testing

     In this document, I am going to explain BGA Testing. These days a number of companies claim that your ancestry can be determined from your DNA. DNA stores information such as the color of our eyes and hair. DNA also keeps a record of our ancestors and of who we are related to. A person's ethnic composition, religion, language, and name are types of information that are NOT stored in or defined by DNA. From the results of a BGA DNA test, people tend to infer socially defined concepts such as a person's religion. Such inferences can be wrong because DNA doesn't store such information.

Humans tend to categorize things based on observed patterns. Those patterns cannot be defined by DNA. So please keep that in mind.

     Now let's turn our attention to the basics of BGA or Admixture Testing. The current position of the scientific community is that the jury is still out on BGA Testing. As we are going to see, there is very good reason for this!!!

BGA Basics And Science
     BGA stands for biogeographical ancestry. BGA tests are sometimes called Admixture Tests. A BGA test basically tries to use your DNA to determine or pinpoint what part of the world your ancestor(s) came from. Using your DNA to show whether two people have a common ancestor is valid. DNA contains information such as whether or not two people are related.

     However, using your DNA to pinpoint where an ancestor was born, lived, or came from is entirely different. Here is the idea behind a BGA test.

  Suppose we have a population called the Handy Clan. The Handy Clan has 1000 people and is located on a remote island. Now let's say everyone in the Handy Clan population has a rare DNA marker which we will call -> M. In other words, the frequency of this DNA marker is 100% because everyone (1000 people) has the DNA marker M. Also, let's assume that no one outside of the Handy Clan, which is on this remote island, has the DNA marker M.

Now Laurie lives in the US in Oakland, California which is located outside the remote island and outside of the Handy Clan population. Let's suppose we discover Laurie has this same rare DNA marker M. 

    Can we say Laurie is from or has ancestry from the Handy Clan population?

     Under simple circumstances, yes!!!  We can confidently say that. If no other population in the world has this rare genetic marker M, then Laurie either is from the Handy Clan population or had an ancestor who came from it. That's what a BGA test does. It compares your DNA markers to those of a studied population. Since all one thousand people in the Handy Clan have the DNA marker M, and no one else does, Laurie must either have been born in that population or have had an ancestor from it.

However reality is not as simple as that!!!!!  Let's see a more realistic scenario.

A More Realistic Scenario 
     Now suppose we have three separate populations, the Handy Clan, Williams Clan, and Henderson Clan. Each population is located in a different part of the world. Each population or clan has 1000 people in it. Every person in each of the populations has the genetic marker M.  In other words, the frequency of the DNA marker M is 100% in each population.

     Now we discover again that Laurie, who lives in Oakland, which is outside each population, has the genetic marker M. 

Question: Does Laurie have ancestry from the Handy Clan population?

Now things have changed. The question is now harder to answer. The fact that Laurie has a DNA marker M that is found in multiple populations doesn't necessarily mean Laurie has ancestry from the Handy Clan population. Laurie could have had an ancestor who lived or was born in any of those populations.

That's the problem with a BGA DNA test. As we can see, the truth is not so clear cut in tests of this nature. The truth is based on a probability.  Any newly introduced population can change things dramatically. Therefore, when interpreting the results from a BGA or Admixture test, please keep in mind that your results may differ or change tomorrow. Laurie would need a paper trail or some definitive piece of evidence to confirm the inference drawn from the BGA results. The BGA data numbers alone don't necessarily prove anything.
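To make "the truth is based on a probability" concrete, here is a minimal sketch in Python of the two Laurie scenarios using Bayes' rule. It is only an illustration, not how any company's BGA test actually works. The function name and the equal starting weight given to every population are my own assumptions.

```python
from fractions import Fraction

def posterior_by_population(marker_freq, prior=None):
    """Chance that a carrier of marker M traces to each population (Bayes' rule)."""
    names = list(marker_freq)
    if prior is None:
        # Assumption: with no other evidence, every population is equally likely.
        prior = {name: Fraction(1, len(names)) for name in names}
    weights = {name: prior[name] * marker_freq[name] for name in names}
    total = sum(weights.values())
    return {name: weights[name] / total for name in names}

# Simple scenario: only the Handy Clan carries M.
simple = posterior_by_population(
    {"Handy": Fraction(1), "Everyone else": Fraction(0)}
)

# Realistic scenario: three clans all carry M at 100% frequency.
realistic = posterior_by_population(
    {"Handy": Fraction(1), "Williams": Fraction(1), "Henderson": Fraction(1)}
)

print(simple["Handy"])     # 1
print(realistic["Handy"])  # 1/3
```

With one matching population the answer is certain. With three populations that all carry M, the marker alone leaves only a one-in-three chance for the Handy Clan, and every newly studied population that carries M lowers it further.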

The reason is that a BGA test is attempting to infer information from DNA that DNA doesn't define. An ancestor's original location can be anywhere. DNA simply doesn't reflect or store that type of information. From the frequency (or concentration) of those DNA markers in each population, we are making an inference which could be right or wrong. If a child is born in, say, Atlanta, Georgia, that geographical location will not be stored in the child's DNA.

  One of the biggest misconceptions out there is that a BGA or Admixture Test can pinpoint the exact tribe or small population someone is from. As one can clearly see, this is not necessarily true. DNA alone simply cannot do this the way it's advertised. This is one of the reasons the scientific community as a whole has not embraced BGA tests.

Now let's look at the basic BGA concepts.

BGA Concepts
In BGA terms, the DNA marker M is called an ancestry informative marker or AIM. Each population is called a reference population. An example of a reference population is the Yoruba. The Yoruba are a West African ethnic group that is studied by population geneticists. Many African-Americans have DNA markers that match the Yoruba group.

Now that we have the BGA basics, let's look at the BGA process and engine which is known as PCA.

BGA Process and PCA 
     The engine or workhorse of most BGA Analysis is PCA. PCA stands for Principal Component Analysis. PCA is a complex mathematical process that separates a bunch of data into its components. For example, let's say we have a bag of 100 jelly beans that are of different colors. After separating the jelly beans by color, we see this -> blue=25, red=25, purple=25, and yellow=25. This means that each of the four colors makes up 25% (25/100) of the jelly beans. Loosely speaking, PCA separates genetic data in a similar way.

     The BGA process starts off with about 300,000 AIMs or SNPs. These SNPs are found across the 22 pairs of autosomes (44 chromosomes) in humans, that is, every chromosome except the sex chromosomes. The SNPs are matched to a number of reference populations. The results are percentages that represent the concentration of the SNPs in each reference population. The engine running the show is PCA, which runs in the background of an algorithm.
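For readers who want to see PCA do something, here is a small sketch in plain Python. Everything in it is an assumption for illustration: the six people, the five SNPs, and the two made-up groups are toy data, and real tests work with hundreds of thousands of SNPs and their own software. The first principal component is found with power iteration, which is just one simple way to compute it.

```python
def first_principal_component(rows, iterations=200):
    """Return (axis, scores) for the top principal component via power iteration."""
    n, m = len(rows), len(rows[0])
    # Center each SNP column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Covariance matrix of the SNP columns (m x m).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    # Start from the first SNP's direction and repeatedly apply the covariance.
    axis = [1.0] + [0.0] * (m - 1)
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * axis[j] for j in range(m)) for i in range(m)]
        norm = sum(v * v for v in nxt) ** 0.5
        axis = [v / norm for v in nxt]
    # Project every person onto that axis.
    scores = [sum(c[j] * axis[j] for j in range(m)) for c in centered]
    return axis, scores

# Hypothetical toy data: rows are people, columns are SNPs,
# values are the number of copies (0, 1, or 2) of one allele.
genotypes = [
    [2, 2, 0, 0, 1],  # person 0, made-up group A
    [2, 1, 0, 0, 1],  # person 1, group A
    [2, 2, 1, 0, 1],  # person 2, group A
    [0, 0, 2, 2, 1],  # person 3, made-up group B
    [0, 1, 2, 2, 1],  # person 4, group B
    [0, 0, 2, 1, 1],  # person 5, group B
]

axis, scores = first_principal_component(genotypes)
for person, score in enumerate(scores):
    print(f"person {person}: PC1 = {score:+.2f}")
```

The three group A people land on one side of zero and the three group B people on the other. Notice that PCA only shows who clusters with whom. Turning those clusters into ancestry percentages is a further step that depends on the reference populations and the algorithm used.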

Now let's look at a few BGA tests.

BGA Tests: Population Finder, Ancestry Painting, McDonald
There are a number of BGA tests out there. Family Tree DNA's BGA test is Population Finder. 23andMe's is called Ancestry Painting. Population Finder is a BETA test, so it's a work in progress. Population Finder uses continental groups in addition to reference groups.

Here is an example of Population Finder (PF) results:

Continent (Subcontinent)      Population          Percentage    Margin of Error
Europe (Western European)     French, Orcadian    28.53%        ±0.48%
Africa (West African)         Yoruba, Mandenka    71.47%        ±0.48%

There are four reference populations -> French, Orcadian, Yoruba, Mandenka. This person basically has DNA markers that match those reference populations. It's likely this person has ancestry from some of those populations, but not necessarily all of them. A paper trail would be needed to confirm ancestry.

Because Population Finder is a beta test and has limited reference populations (the same goes for 23andMe's Ancestry Painting), many people turn to an Extended BGA Analysis. This is where Dr. Doug McDonald comes in.

McDonald's Extended BGA 
Dr. Douglas McDonald is a chemist at the University of Illinois in Urbana, Illinois. In fact, he actually created Population Finder for Family Tree DNA. McDonald has access to more studied reference populations than Family Tree DNA or 23andMe currently have. Because of this, you can get a more "fleshed out" or "extended" BGA Analysis.

McDonald gives his results in the form of an email with four graphs. Here are McDonald's results of my cousin Lonette Lanier's extended BGA test as shown in quotes below:

"Most likely fit is 27.9% (+-  0.1%) Europe (various subcontinents) and 72.1% (+-  0.1%) Africa (all West African).

The following are possible population sets and their fractions, most likely at the top

French= 0.279 Mandenka= 0.721
Hungary= 0.280 Mandenka= 0.720
English= 0.277 Mandenka= 0.723

There is also about 0.4% Native American that is strong and likely real, as well as other little bits on the chromosomes but they are weak and probably unimportant."

Each line, such as "French= 0.279 Mandenka= 0.721", is a population set. There are three population sets. Each population set gives a likely or probable ancestry for my cousin Lonette. Each population set is a combination that gives a good fit for Lonette's data, with the best fit listed first. It doesn't mean Lonette necessarily has ancestry from, say, the French. But she does have DNA markers that match the French reference population. The multiple population sets arise because Lonette's DNA markers are spread across multiple populations. This is why it's difficult to pinpoint a person's ancestral origin to a specific tribe or single population from DNA alone.
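Here is a rough Python sketch of why several population sets can fit the same person almost equally well. I don't know the details of Dr. McDonald's method, so this is NOT his algorithm. It is a plain least-squares fit of a two-way mix, and the marker frequencies, the person's values, and the reference names are all invented for illustration.

```python
def fit_two_way_mix(person, ref_a, ref_b):
    """Least-squares fraction f of ref_a when person ~= f*ref_a + (1-f)*ref_b."""
    diff = [a - b for a, b in zip(ref_a, ref_b)]
    num = sum((x - b) * d for x, b, d in zip(person, ref_b, diff))
    den = sum(d * d for d in diff)
    f = min(1.0, max(0.0, num / den))  # keep the fraction between 0 and 1
    fitted = [f * a + (1 - f) * b for a, b in zip(ref_a, ref_b)]
    error = sum((x - y) ** 2 for x, y in zip(person, fitted))
    return f, error

# Invented marker frequencies for four markers (hypothetical references).
european_like_1 = [0.80, 0.10, 0.60, 0.30]
european_like_2 = [0.78, 0.12, 0.62, 0.30]  # nearly the same as european_like_1
west_african_like = [0.10, 0.90, 0.20, 0.70]

# Invented marker values for the person being tested.
person = [0.30, 0.67, 0.31, 0.59]

results = {
    "european_like_1 + west_african_like":
        fit_two_way_mix(person, european_like_1, west_african_like),
    "european_like_2 + west_african_like":
        fit_two_way_mix(person, european_like_2, west_african_like),
}

# List the population sets with the best fit (smallest error) first.
for name, (fraction, error) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{name}: fraction={fraction:.3f}, error={error:.5f}")
```

Both sets give a fraction of roughly 0.28 with a tiny error, because the two European-like references look almost alike on these markers. The numbers can rank the sets, but they cannot say which population an ancestor actually belonged to.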

It's important to always back up DNA evidence with documents or other pieces of evidence to validate a claim. The numbers alone don't always or necessarily identify the truth.

Now let's look at the issues the scientific community has with BGA Testing.

Issues With BGA Or Admixture Testing
The scientific community as a whole hasn't really embraced BGA or Admixture Testing. Using your DNA to establish whether two or more people are related via a common ancestor is valid. However, using your DNA to locate where your ancestor(s) originated is quite a different task. An ancestor could have been born or lived in any part of the world. More important - DNA simply doesn't define or contain information such as an ancestor's geographical location or point of origin. That type of information is NOT an attribute of a genetic mutation. Therefore BGA or Admixture tests don't have a basis in genetics. That's the scientific community's main objection to BGA or Admixture tests. The results from a BGA or Admixture test are used to make inferences from observed correlations. A correlation can be dangerous in science because it can lead to an incorrect inference from an observed set of data.

There is a very big difference between a correlation and a direct (causal) relationship between two variables.

This doesn't mean BGA tests aren't valuable. A BGA test can help one gain insight into one's past. However, you must understand that the results from a BGA test aren't final. The results from a BGA test are tentative and can easily change tomorrow.

There are at least three main current hurdles with a BGA Analysis:

1) Populations can change location and identity. They are not static. What we know about a population's history is limited. Modern humans have been here for approximately 200,000 years. No one can know the entire history of any population. We can have approximate knowledge, but NOT complete knowledge.

2) We simply don't have a complete set of reference populations at this time, so we can't make any final judgment calls yet. (I will explain this shortly)

3) Different algorithms can produce different results.

For example, suppose Dr. McDonald gives me the following simple BGA results:

Finnish=.100 and Yoruba=.900.

These results depend on which populations the scientific community has studied so far, such as the Yoruba and the Finnish. They would lead one to believe that one has a large amount of Yoruba ancestry. The Yoruba ancestry may well be real, but it would take a paper trail to confirm it.

Now suppose the scientific community has studied and approved a new reference population, C, in say a few years. Now a rerun of Dr McDonald's results yields the following:

Population C=.450, Finnish=.100, and Yoruba=.450

Now as you can see, things have changed. My ancestor could have belonged to the Yoruba population, or could have belonged to the new reference population C. This scenario could happen. As you can see, none of these results are absolute or final in the sense that they can't change.
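The Population C scenario can be sketched in a few lines of Python. This is an extreme, made-up case: I assume the new population C has exactly the same marker frequencies as the Yoruba on the markers tested, and all the frequencies are invented. Real populations would be similar rather than identical, but the point carries over.

```python
def mix(fractions, references):
    """Expected marker frequencies for a weighted blend of reference populations."""
    n_markers = len(next(iter(references.values())))
    return [sum(fractions[name] * references[name][i] for name in fractions)
            for i in range(n_markers)]

def squared_error(observed, expected):
    """Sum of squared differences between two lists of marker frequencies."""
    return sum((o - e) ** 2 for o, e in zip(observed, expected))

# Invented marker frequencies; "C" is assumed to match Yoruba on these markers.
references = {
    "Finnish": [0.80, 0.10, 0.60],
    "Yoruba":  [0.10, 0.90, 0.20],
    "C":       [0.10, 0.90, 0.20],
}

before = {"Finnish": 0.10, "Yoruba": 0.90}
after = {"Finnish": 0.10, "Yoruba": 0.45, "C": 0.45}

# Pretend my DNA looks exactly like the first answer.
observed = mix(before, references)

error_before = squared_error(observed, mix(before, references))
error_after = squared_error(observed, mix(after, references))
print(error_before, error_after)
```

Both errors are zero, or so close to zero that the difference is only rounding. The data cannot tell the two answers apart, so which one gets reported depends on which reference populations happen to be available and on how the algorithm breaks the tie.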

     In addition, different algorithms can produce different results. An algorithm is simply a method or set of steps to solve a problem. The algorithm is very important. It's what produces your DNA results. Right now there are a number of tools out there that claim the ability to produce valid BGA results. Each of these tools may run under different algorithms.

For example - I have taken three BGA tests: Ancestry Painting, Population Finder, and McDonald's. Each has produced different results. The analysis from 23andMe stated I had 7 percent Asian ancestry. Now this could be significant or it could be noise. Neither FTDNA's Population Finder nor McDonald's findings showed 7 percent Asian ancestry. The bigger question is: which one is correct? Population Finder is a BETA test, so I can assume that its findings are approximate. Can the same be said for 23andMe's Ancestry Painting results or Dr. McDonald's BGA findings? The truth is that at this time it's impossible to tell which one is correct or incorrect.

     The most important point to take from this tutorial is that a BGA test can yield valuable information, but not necessarily definitive information. Technically, the only factual information that a BGA test can produce is that a person has DNA markers (AIMs) that match a reference population.

Well, that's it for BGA Analysis. If anyone has questions, please feel free to ask.



  1. Thanks for your post. Thanks to you, I now understand the principle behind admixture tests much better.
    Can you show us an example of a different algorithm?

    1. Ronen - I don't have an example of a different algorithm. My apologies. I just wanted to break down the basics of how a BGA test works.

  2. One more question: do admixture tests calculate the frequency for each SNP separately or for each entire gene?
    If it is the former, then how is this supposed to be accurate, since we don't inherit SNPs from our parents separately but inherit the whole gene? A gene has its own origin, unlike SNPs. We can't say that the gene CTCCTC, for example, is part African (first SNP, C), part East Asian (second SNP, T), and part European (third SNP, C). That would be ridiculous.
    If it's the latter, then how did my results add up to 100%? It doesn't sound realistic that each part of my genes can be found only in these tiny reference populations. In other words, it doesn't make sense that I share identical segments of my entire genome with all these reference populations together.

  3. Hi Ronen. Thanks for your comments. I have posted results from FTDNA's Population Finder and McDonald. Those are two different BGA tests that use different algorithms and reference populations. Also, BGA tests aren't looking at genes. BGA tests look at SNP frequencies and concentrations. Basically, a BGA test produces percentages that reflect how your SNPs match different reference populations.

    Hope that helps

  4. If Dr. McDonald gives you "0.4% Native American" and tells you it's "likely real", I doubt it. Not even Dr. McDonald has an adequate reference population sampling for West Africans beyond Yoruba and Mandinka.

  5. Maybe when more West African references are added, like Tikaro or Igbo or Bubi, African Americans will get clearer results. African Americans, even those with absolutely no interracial ancestry, are a blend of different ethnic groups from Senegal/Gambia down to Angola and even from Mozambique. Not all of that is going to fit with just Yoruba and Mandinka, so African Americans will get something like a "0.3%" "East Asian and Native American" result when in reality they have not a single trace of either Native American or East Asian ancestry.

  6. I took the Nat Geo genome test and found I was in the T2b3e category. 2% showed Native American? Can you elaborate? I am wondering, if I took the Nat Geo results to Family Tree DNA, would it really go into a further analysis?