Descriptive Statistics
Descriptive statistics involves preparing graphs, calculating representative numbers, and calculating measures of consistency. The methods of descriptive statistics are designed to summarize collections of numbers so that any key patterns, or trends, become more obvious.
Even if you never create your own descriptive statistics, being able to understand averages and charts, and their limitations, is an essential skill in the modern world.
Charts
Charts and graphs are constructed to summarize and illustrate the essence of collected data. Bar graphs are useful for illustrating changes in data from one period to the next. Pie charts are useful for emphasizing how a total quantity is divided up – such as an annual budget, voter preferences, or the contents of the universe. Line graphs are used to illustrate trends that occur as variables are changed; a typical example is a chart of the price variations of a company’s stock over time.
The type of graph and the range of the graph’s scales are normally chosen to best illustrate the key features of the data. A note of caution – just as a digital photograph can be compressed or stretched to make the same person look skinny or fat, the scales on a graph can be selected to make trends in the same data look large or small.
A frequency distribution graph is particularly useful in descriptive statistics. The various possible values of a parameter are divided into categories and are shown on the horizontal axis. The frequency with which items are found in each category is shown on the vertical axis. For example, the manager of a fast food restaurant might make a frequency distribution graph for the average number of donuts sold on each weekday. He could then use that information to decide how many donuts to bake for Friday’s customers. If he bakes too many, there will be leftovers; if he does not bake enough, he may lose customers.
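As a rough sketch of how such a graph is built, the Python snippet below sorts a list of sales figures into categories and counts how many fall into each one. The sales numbers, the category width of 10, and the function name `frequency_distribution` are all invented for illustration; they do not come from any real restaurant.

```python
from collections import Counter

# Hypothetical donut sales for 12 Fridays (invented numbers, for illustration only).
friday_sales = [112, 98, 105, 121, 109, 101, 117, 95, 108, 103, 114, 106]

BIN_WIDTH = 10  # assumed category width


def frequency_distribution(values, bin_width):
    """Count how many values fall in each category [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


distribution = frequency_distribution(friday_sales, BIN_WIDTH)

# Print a sideways text version of the graph: one '#' per Friday in the category.
for start, count in distribution.items():
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * count}")
```

With these made-up figures, the tallest row is the 100-109 category, which is the "top of the hill" for this small data set.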
When there are many categories, a frequency distribution graph typically takes the form of a hill. The highest point on the hill represents the most frequent value. A tall narrow hill indicates a strong central tendency in the data. A low broad hill indicates that the data is more evenly distributed. A normal distribution, whose graph is a symmetrical, bell-shaped hill, has many applications in statistics.
Representative numbers
Sometimes you want to reduce a whole set of data to a single number – the average annual income, the average temperature in July, or the percent of the population that is over age sixty-five.
Finding the average value, or mean, is the simplest and most frequently used computation in descriptive statistics. You just add up all the individual values and then divide by the number of values. If three different workers earn $10, $12 and $17 per hour, then their average wage is $13 per hour.
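The wage example can be checked in a couple of lines of Python. This is only a minimal sketch; the variable names are chosen for illustration.

```python
from statistics import mean

# Hourly wages of the three workers in the example above, in dollars.
wages = [10, 12, 17]

# Add up the values and divide by how many there are.
average_by_hand = sum(wages) / len(wages)

# The standard library's statistics.mean does the same computation.
average_wage = mean(wages)

print(average_by_hand, average_wage)
```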
Sometimes not every value has the same importance and a ‘weighted mean’ is more appropriate. Suppose you scored 15/20, 45/100, and 33/50 on three math tests. What is your average math score? If you assume the tests have equal value, then you could convert each score to a percent: 75%, 45%, and 66%, and then average those scores to obtain 62%. If you assume that the tests should be weighted according to the number of marks available, i.e. a ‘big test’ should be worth more than a ‘quiz’, then you could add up all the marks earned (93) and divide by the total marks available (170) to obtain a weighted average of about 55%.
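The two ways of averaging the test scores can be compared side by side. The sketch below uses the three scores from the example; the variable names are illustrative only.

```python
# Each test as (marks earned, marks available), taken from the example above.
scores = [(15, 20), (45, 100), (33, 50)]

# Equal weighting: convert every test to a percent, then take the plain mean.
percents = [100 * earned / available for earned, available in scores]
equal_weight_average = sum(percents) / len(percents)

# Weighting by marks available: total marks earned over total marks available.
total_earned = sum(earned for earned, _ in scores)
total_available = sum(available for _, available in scores)
weighted_average = 100 * total_earned / total_available

print(f"Equal weighting:   {equal_weight_average:.0f}%")
print(f"Weighted by marks: {weighted_average:.0f}%")
```

The weighted figure is lower because the 100-mark test, which had the weakest score, counts for more than half of the 170 marks available.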
There are two other approaches that can be used to represent a whole set of data with a single number.
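When all your data is arranged in order, from the smallest to the largest, the number in the middle of the list is the ‘median’. There are as many numbers smaller than the median as there are larger numbers. The median of the numbers: 23, 44, 47, 72, 81 is 47. If the list has an even number of values, there is no single middle number, and the median is taken as the average of the two middle values.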
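As a small illustration, Python's standard `statistics.median` function sorts the data and picks the middle value. The second list below is a made-up variation with an even number of values; in that case the function returns the average of the two middle values.

```python
from statistics import median

# The five numbers from the example above (order does not matter to median()).
odd_count = [23, 44, 47, 72, 81]
middle_value = median(odd_count)

# Hypothetical list with an even number of values: no single middle number,
# so the two middle values (44 and 47) are averaged.
even_count = [23, 44, 47, 72]
averaged_middle = median(even_count)

print(middle_value, averaged_middle)
```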
When a set of data is arranged in order, it is easy to see if some values are repeated. If there is a most frequent value, it is called the ‘mode’. The mode corresponds to the most probable result if one data item is selected at random from a whole set of values.
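A short sketch of finding the mode with Python's standard library follows. The shoe sizes are invented for illustration. `statistics.multimode` (available from Python 3.8) is used for the second list because it reports every value that ties for most frequent, which covers the case where there is no single mode.

```python
from statistics import mode, multimode

# Hypothetical shoe sizes sold in one morning (invented data).
shoe_sizes = [7, 8, 8, 9, 9, 9, 10]

# 9 appears three times, more often than any other size.
most_frequent = mode(shoe_sizes)

# Hypothetical data in which two values tie for most frequent.
tied_values = [1, 1, 2, 2, 3]
all_modes = multimode(tied_values)

print(most_frequent, all_modes)
```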
Average values are used in a variety of applications. Insurance companies use data on the average ages of death to construct mortality tables and set appropriate insurance rates. Clothing designers use data on average body dimensions to determine the range of clothing sizes to manufacture. Chemists use average values to describe the properties of chemical compounds.
Statistical consistency
The spread of values in a set of data is an important property of that data. If the values are fairly close together, then the data can be thought of as consistent. If the values are scattered over a wide range, then the data can be thought of as inconsistent. Consider the following two cases:
1. Fred earns $10 per hour and Martha earns $90 per hour. Their average wage is $50 per hour, with a wide range of $80.
2. Bob earns $49 per hour and Natasha earns $51 per hour. Their average wage is also $50 per hour, but the range is just $2.
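A simple technique for indicating the spread of the original values involves listing the differences between the original values and the average, ignoring whether each difference is positive or negative, and then averaging those differences to find the ‘mean deviation’. (If the signs were kept, the positive and negative differences would cancel and the result would always be zero.) The average wage in the first case is then $50 ± $40 ($50 plus-or-minus $40), and in the second case is $50 ± $1.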
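The two wage cases can be worked through with a short function. This is a minimal sketch; the name `mean_deviation` is chosen here for illustration, and the function uses the size of each difference (its absolute value) so that differences above and below the average do not cancel out.

```python
def mean_deviation(values):
    """Average distance of the values from their mean, ignoring sign."""
    average = sum(values) / len(values)
    return sum(abs(v - average) for v in values) / len(values)


# Hourly wages from the two cases above, in dollars.
fred_and_martha = [10, 90]
bob_and_natasha = [49, 51]

for wages in (fred_and_martha, bob_and_natasha):
    average = sum(wages) / len(wages)
    print(f"${average:.0f} ± ${mean_deviation(wages):.0f}")
```

Both pairs have the same average, so only the second number shows how differently the two sets of wages are spread.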