Descriptive statistics is the branch of statistics that summarizes and describes the basic features of a dataset. It uses measures of central tendency and measures of dispersion to provide a quantitative summary of the data. These measures include the mean, median, mode, quartiles, variance, standard deviation and range, among others. This article explains each of these measures in detail, with worked examples that show how they are calculated.
Mean
The mean is the most widely used measure of central tendency in statistics. It is the arithmetic average of a set of values: add up all the values in the dataset and divide by the number of values.
μ = (Σ xi) / n, where xi is each individual value and n is the number of values
Example
Suppose we have a dataset consisting of the following test scores:
85, 90, 75, 95, 80, 85, 90, 95, 85, 90
To calculate the mean, we add up all the values and divide by the total number of values:
(85 + 90 + 75 + 95 + 80 + 85 + 90 + 95 + 85 + 90) / 10 = 87
The mean test score in this dataset is 87.
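The calculation above can be checked with a few lines of Python. This is a minimal sketch that uses only the standard library; the variable names are illustrative and not part of the original example.

```python
from statistics import mean

# Test scores from the example above.
scores = [85, 90, 75, 95, 80, 85, 90, 95, 85, 90]

# By hand: add up all the values and divide by how many there are.
manual_mean = sum(scores) / len(scores)

# The standard library's mean() performs the same calculation.
library_mean = mean(scores)

print(manual_mean, library_mean)
```

Both approaches give 87 for this dataset.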
Median
The median is another measure of central tendency. It is the middle value in a dataset when the values are arranged in order of magnitude. When the number of values is odd, the median is the single middle value; when it is even, the median is the average of the two middle values.
Example
Suppose we have a dataset of salaries listed in no particular order:
$60,000, $75,000, $65,000, $55,000, $50,000, $70,000
To find the median salary, we first arrange the salaries in order:
$50,000, $55,000, $60,000, $65,000, $70,000, $75,000
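Because there is an even number of salaries (six), the median is the average of the two middle values:
Median = ($60,000 + $65,000) / 2 = $62,500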
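The sort-then-pick-the-middle procedure can be sketched in Python as follows. The helper name median_by_hand is hypothetical and introduced only for illustration; statistics.median from the standard library is shown alongside it as a cross-check.

```python
from statistics import median

# Salaries from the example above, in their original unsorted order.
salaries = [60_000, 75_000, 65_000, 55_000, 50_000, 70_000]

def median_by_hand(values):
    """Sort the values and return the middle one, or the average of the
    two middle ones when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

manual_median = median_by_hand(salaries)
library_median = median(salaries)

print(manual_median, library_median)
```

Both values come out to 62,500, the average of the two middle salaries.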
Mode
The mode is the value that appears most frequently in a dataset, so it describes the most common value. A dataset can have more than one mode when several values share the highest frequency.
Example
Suppose we have a dataset consisting of the following test scores:
85, 90, 75, 95, 80, 85, 90, 95, 85, 90
In this dataset, the values 85 and 90 each occur three times, more often than any other score, so the dataset has two modes (it is bimodal): 85 and 90.
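Counting occurrences by eye is error-prone, so it can help to let code do the tally. The sketch below counts how often each score occurs and reports every value that reaches the highest count. It assumes Python 3.8 or later, where statistics.multimode is available; the variable names are illustrative.

```python
from collections import Counter
from statistics import multimode

# Test scores from the example above.
scores = [85, 90, 75, 95, 80, 85, 90, 95, 85, 90]

# Tally how many times each score appears.
counts = Counter(scores)

# Keep every score whose count equals the highest count.
highest_count = max(counts.values())
modes = sorted(value for value, count in counts.items() if count == highest_count)

# multimode() returns all modes, in the order they first appear in the data.
library_modes = multimode(scores)

print(counts)
print(modes, library_modes)
```

For these scores, 85 and 90 each appear three times, so both are reported as modes.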
Quartiles
Quartiles divide a dataset into four equal parts, with each part representing 25% of the data. The first quartile (Q1) represents the 25th percentile of the data, the second quartile (Q2) represents the 50th percentile (which is the same as the median), and the third quartile (Q3) represents the 75th percentile of the data. The difference between the third and first quartiles (Q3-Q1) is known as the interquartile range (IQR), which is another measure of variability.
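To calculate the quartiles of a dataset, first sort the data in ascending order. Each quartile is then the value at the following position in the sorted list:
- Q1 = value at position (n+1)/4
- Q2 = value at position (n+1)/2 (same as the median)
- Q3 = value at position 3(n+1)/4
Here n is the total number of data points in the dataset. When a position is not a whole number, the quartile lies between the two neighboring values, and a common choice is to interpolate linearly between them. Other conventions for computing quartiles exist and can give slightly different results.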
In addition to providing information about the central tendency of the data, quartiles and the interquartile range can help to identify potential outliers and can provide additional insights into the distribution of the data.
Example
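Consider the same salary dataset, already sorted, with n = 6:
$50,000, $55,000, $60,000, $65,000, $70,000, $75,000
Q1 is at position (6+1)/4 = 1.75, three quarters of the way from the 1st value to the 2nd: Q1 = $50,000 + 0.75 × ($55,000 - $50,000) = $53,750
Q2 is at position (6+1)/2 = 3.5, halfway between the 3rd and 4th values: Q2 = ($60,000 + $65,000) / 2 = $62,500
Q3 is at position 3(6+1)/4 = 5.25, one quarter of the way from the 5th value to the 6th: Q3 = $70,000 + 0.25 × ($75,000 - $70,000) = $71,250
The interquartile range (IQR) can be calculated as Q3 - Q1:
IQR = $71,250 - $53,750 = $17,500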
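The (n+1)-based position formulas can be sketched in Python as follows. The helper value_at_position is hypothetical, and linear interpolation between neighboring values for fractional positions is an assumption, since the formulas themselves do not say how to handle them. Quartile conventions vary between textbooks and software, so values from this rule can differ from those produced by other methods. The default "exclusive" method of statistics.quantiles (Python 3.8+) uses the same (n+1)-based positions and is included as a cross-check.

```python
from statistics import quantiles

# Sorted salaries from the example above.
salaries = [50_000, 55_000, 60_000, 65_000, 70_000, 75_000]

def value_at_position(ordered, position):
    """Return the value at a 1-based position in a sorted list.

    Assumption: a fractional position is resolved by linear interpolation
    between the two neighboring values.
    """
    lower = int(position)        # whole part of the position
    fraction = position - lower  # fractional part of the position
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    below = ordered[lower - 1]
    above = ordered[lower]
    return below + fraction * (above - below)

ordered = sorted(salaries)
n = len(ordered)

q1 = value_at_position(ordered, (n + 1) / 4)
q2 = value_at_position(ordered, (n + 1) / 2)
q3 = value_at_position(ordered, 3 * (n + 1) / 4)
iqr = q3 - q1

# Cross-check against the standard library (default method="exclusive").
library_quartiles = quantiles(salaries, n=4)

print(q1, q2, q3, iqr)
print(library_quartiles)
```

Other tools may default to a different convention (for example, one based on n - 1 intervals), so it is worth checking which method a library uses before comparing its quartiles with a hand calculation.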
Variance and Standard Deviation
The variance is a measure of the spread or variability of a dataset. It is the average of the squared deviations of the values from the mean. A high variance indicates that the values in the dataset are widely spread out, while a low variance indicates that the values are clustered closely around the mean.
The standard deviation is another measure of the spread or variability of a dataset. It is the square root of the variance, which expresses the spread in the same units as the original data and makes it easier to interpret than the variance.
Formula
Population variance = Σ (xi – μ)^2 / n
Sample variance = Σ (xi – μ)^2 / (n – 1)
Standard deviation = sqrt(variance)
The following table works through the variance and standard deviation calculation for the dataset {4, 6, 8, 10, 12}, whose mean is 8:
| Value (xi) | Mean (μ) | Deviation (xi – μ) | (xi – μ)^2 |
|---|---|---|---|
| 4 | 8 | -4 | 16 |
| 6 | 8 | -2 | 4 |
| 8 | 8 | 0 | 0 |
| 10 | 8 | 2 | 4 |
| 12 | 8 | 4 | 16 |
Now, follow these steps:
- Sum the squared deviations: 16 + 4 + 0 + 4 + 16 = 40
- Divide the sum of squared deviations by the total number of data points (n) for population variance or (n-1) for sample variance.
For population variance: 40 / 5 = 8
For sample variance: 40 / (5-1) = 40 / 4 = 10
Population standard deviation = sqrt(8) ≈ 2.83
Sample standard deviation = sqrt(10) ≈ 3.16
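The steps above can be reproduced in Python. The sketch below first follows the manual procedure (deviations, squared deviations, sum, divide) and then compares the results with the standard library's pvariance, variance, pstdev and stdev functions. The variable names are illustrative.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

# Dataset from the worked example above.
data = [4, 6, 8, 10, 12]
n = len(data)

# Step 1: the mean, then the squared deviation of each value from it.
mu = sum(data) / n
squared_deviations = [(x - mu) ** 2 for x in data]

# Step 2: sum the squared deviations.
sum_of_squares = sum(squared_deviations)

# Step 3: divide by n (population) or n - 1 (sample).
population_variance = sum_of_squares / n
sample_variance = sum_of_squares / (n - 1)

# Step 4: the standard deviation is the square root of the variance.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(population_variance, sample_variance)
print(round(population_sd, 2), round(sample_sd, 2))

# Cross-check with the standard library.
print(pvariance(data), variance(data))
print(round(pstdev(data), 2), round(stdev(data), 2))
```

Dividing by n - 1 rather than n always gives a slightly larger value, which is why the sample figures exceed the population figures here.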
Range
The range of a dataset is the difference between its highest and lowest values. It is a simple measure of spread, but because it depends only on the two most extreme values, a single outlier can make it unrepresentative of how the rest of the data is spread.
Formula
range = highest value – lowest value
Example
Suppose we have a dataset consisting of the following test scores:
85, 90, 75, 95, 80, 85, 90, 95, 85, 90
The highest value in this dataset is 95 and the lowest value is 75, so the range is:
95 - 75 = 20
The range of this dataset is 20.
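A short Python sketch makes the sensitivity to extreme values concrete. The extra score of 20 is a hypothetical value added purely for illustration and is not part of the original dataset.

```python
# Test scores from the example above.
scores = [85, 90, 75, 95, 80, 85, 90, 95, 85, 90]

# Range = highest value - lowest value.
score_range = max(scores) - min(scores)

# Hypothetical: add one unusually low score to see how the range reacts.
scores_with_outlier = scores + [20]
range_with_outlier = max(scores_with_outlier) - min(scores_with_outlier)

print(score_range)
print(range_with_outlier)
```

One added value moves the range from 20 to 75, even though the other ten scores are unchanged.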
Conclusion
Descriptive statistics provides a way to summarize and describe the basic features of a dataset. The measures discussed in this article are fundamental to understanding data and are used extensively in a variety of fields, including science, economics and finance. By understanding these measures, you can gain insights into your own data and make informed decisions based on your analysis.