Calculate outliers

Author: Charles Brown
Date Of Creation: 8 February 2021
Update Date: 1 July 2024
Anonim
Statistics - How to find outliers
Video: Statistics - How to find outliers

Content

A outlier or outlier in statistics is a data point that differs significantly from the other data points in a sample. Often, outliers point to discrepancies or errors in the measurements to statisticians, after which they can remove the outlier from the data set. If they actually choose to remove the outliers from the data set, it could produce significant changes in the conclusions drawn from the study. This is why it is important to calculate and determine outliers if you want to interpret statistical data properly.

To step

  1. Learn how to spot possible outliers. Before we can decide whether to remove anomalous values ​​from a particular data set, we must of course first identify the possible outliers in the data set. In general, outliers are the data points that deviate significantly from the trend that form the other values ​​in the set - in other words, they shoot out of the other values. It is usually easy to recognize this in tables and (especially) in graphs. If the data set is graphed visually, the outliers will be "far" from the other values. For example, if most points in a data set form a straight line, outliers will not conform to this line.
    • Let's take a look at a data set showing the temperatures of 12 different objects in a room. If the temperature of 11 of the objects fluctuates by a few degrees at most around 21 ° C, while one object, an oven, has a temperature of 150 ° C, you can see at a glance that the oven is probably an outlier.
  2. Sort all data points from lowest to highest. The first step in calculating outliers is to find the median value (or the middle value) of the data set. This task becomes much easier if the values ​​in the set are in order from lowest to highest. So before continuing, sort the values ​​in your dataset like this.
    • Let's continue with the example above. Here is our data set showing the temperatures in degrees Fahrenheit of various objects in a room: {71, 70, 73, 70, 70, 69, 70, 72, 71, 300, 71, 69}. If we sort the values ​​in the set from lowest to highest, this becomes our new set: {69, 69, 70, 70, 70, 70, 71, 71, 71, 72, 73, 300}.
  3. Calculate the median of the data set. The median of a dataset is the data point where half of the data is above it, and half of the data is below it - it is, in effect, the "center" of the dataset. If the data set contains an odd number of points, the median is easy to find - the median is the point with as many points above as below. If there is an even number of points, since there is not one center point, you have to take the average of the two center points to find the median. When calculating outliers, the median is usually referred to by the variable Q2 - because it lies between Q1 and Q3, the first and third quartiles. We will determine these variables later.
    • Don't be confused by datasets with an even number of points - the average of the two middle points is often a number that is not in the dataset itself - this is okay. However, if the two middle points are the same, the mean will of course also be this number - also this is Okay.
    • In our example we have 12 points. The middle two terms are points 6 and 7 - 70 and 71, respectively. So the median of our data set is the mean of these two points: ((70 + 71) / 2) =70,5.
  4. Calculate the first quartile. This point, which we denote with the variable Q1, is the data point below which 25 percent (or a quarter) of the observations lie. In other words, this is the center of all points in your data set below the median. If there is an even number of values ​​below the median, you must again take the average of the two middle values ​​to find Q1, as you might have done to determine the median yourself.
    • In our example, six points are above the median and six points below it. So to find the first quartile we have to take the average of the two middle points in the bottom six points. Points 3 and 4 of the bottom six are both 70, so their mean is ((70 + 70) / 2) =70. So our value for Q1 is 70.
  5. Calculate the third quartile. This point, which we denote with the variable Q3, is the data point above which 25 percent of the data lies. Finding Q3 is practically the same as finding Q1, except we're looking at the points in this case above the median.
    • Continuing with the example above, we see that the two middle points of the six points above the median are 71 and 72. The mean of these two points is ((71 + 72) / 2) =71,5. So our value for Q3 is 71.5.
  6. Find the interquartile range. Now that we have determined Q1 and Q3 we have to calculate the distance between these two variables. You can find the distance between Q1 and Q3 by subtracting Q1 from Q3. The value you get for the interquartile range is crucial for determining the boundaries for non-deviating points in your data set.
    • In our example, the values ​​for Q1 and Q3 are 70 and 71.5, respectively. To find the interquartile range, we calculate Q3 - Q1: 71.5 - 70 =1,5.
    • This works even if Q1, Q3, or both numbers are negative. For example, if our value for Q1 were -70, the interquartile range would be 71.5 - (-70) = 141.5, which is correct.
  7. Find the "Inner Limits" of the dataset. You can recognize outliers by determining whether they fall within a number of numerical limits; the so-called "inner limits" and "outer limits". A point that falls outside the inner limits of the dataset is classified as one mild outlier, and a point outside the outer limits is classified as one extreme outlier. To find the inner boundaries of your data set, first multiply the interquartile range by 1.5. Add the result to Q3 and subtract it from Q1. The two results are the inner limits of your data set.
    • In our example, the interquartile range is (71.5 - 70), or 1.5. Multiply this by 1.5 to get 2.25. We add this number to Q3 and subtract it from Q1 to find the inner boundaries as follows:
      • 71,5 + 2,25=73,75
      • 70 - 2,25=67,75
      • So the internal borders are 67.75 and 73.75.
    • In our data set, only the oven temperature - 300 degrees Fahrenheit - is outside this range. So this may be a mild outlier. However, we have yet to determine whether this temperature is an extreme outlier, so let's not jump to conclusions just yet.
  8. Find the "outer limits" of the dataset. You do this in the same way as with the inner limits, with the only difference that you multiply the interquartile distance by 3 instead of 1.5. You then add the result to Q3 and subtract from Q1 to find the outer limit values.
    • In our example, we multiply the interquartile distance by 3 to get (1.5 * 3) or 4.5. We can now find the outer limits in the same way as the inner limits:
      • 71,5 + 4,5=76
      • 70 - 4,5=65,5
      • So the outer limits are 65.5 and 76.
    • Data points that lie outside the external boundaries are considered extreme outliers. In our example, the oven temperature, 300 degrees Fahrenheit, is well beyond the outside limits. So the oven temperature is certainly an extreme outlier.
  9. Use a qualitative assessment to determine if you should "throw out" the outliers. With the above method you can determine whether certain points are mild outliers, extreme outliers or no outliers at all. But make no mistake - recognizing a point as an outlier makes it just one candidate to be removed from the dataset, and not immediately a point that is removed must turn into. The reason why an outlier differs from the rest of the points in the set is crucial in determining whether the outlier should be removed. Generally, outliers caused by some error - an error in the measurements, in the recordings or in the experimental design, for example - are removed. In contrast, outliers that are not caused by errors and that reveal new, unpredicted information or trends usually become not deleted.
    • Another criterion to consider is whether the outliers affect the mean of a data set in a way that is skewed or misleading. This is especially important if you plan to draw conclusions from the mean of your data set.
    • Let's judge our example. Since the highest It is unlikely that the furnace reached a temperature of 300 ° F due to some unforeseen force of nature, in our example we can conclude with almost 100% certainty that the furnace was accidentally turned on, causing an abnormally high temperature reading. Additionally, if we don't remove the outlier, the mean of our data set comes out to (69 + 69 + 70 + 70 + 70 + 70 + 71 + 71 + 71 + 72 + 73 + 300) / 12 = 89.67 ° F, while the mean without the outlier comes out to (69 + 69 + 70 + 70 + 70 + 70 + 71 + 71 + 71 + 72 + 73) / 11 = 70.55 ° F.
      • Since the outlier was caused by human error, and because it is incorrect to say that the average room temperature was close to 32 ° C, we must choose to use our outlier. remove.
  10. Understand the importance of (sometimes) retaining outliers. While some outliers should be removed from a dataset because they are the result of errors or because they skew the results in a misleading way, other outliers should be preserved. For example, if an outlier has been obtained correctly (and is therefore not the result of an error) and / or if the outlier offers a new insight into the phenomenon to be measured, it should not be removed immediately. Scientific experiments are particularly sensitive situations when it comes to dealing with outliers - mistakenly removing an outlier can mean throwing away important information about a new trend or discovery.
    • For example, imagine we are designing a new drug to make fish in a fish farm grow bigger. Let's use our old data set ({71, 70, 73, 70, 70, 69, 70, 72, 71, 300, 71, 69}), except that each point now represents the mass of a fish (in grams) after treatment with another experimental drug from birth. In other words, the first drug gave one fish a mass of 71 grams, the second gave another fish a mass of 70 grams, and so on. In this situation, 300 still a huge outlier, but we shouldn't remove it now. Because, if we assume that the outlier is not the result of an error, it represents a great success in our experiment. The drug that produced a 300 gram fish worked better than any other drug, so this is it most important data point in our set, instead of the least important data point.

Tips

  • If you find outliers, try to explain them before removing them from the dataset; they can indicate measurement errors or deviations in the distribution.

Necessities

  • Calculator