Stats Glossary
Section 1.1
- Data – numbers collected in a
particular context.
For
example, if you asked everyone in our class how many brothers and/or sisters they had, the numbers that your class
members gave as responses would represent data.
- Variable – any characteristic of a
person or thing that can be assigned a number or a category.
For
example, in the scenario described above, the variable would be the number of brothers and/or sisters.
Students
sometimes have trouble determining whether or not a statement represents a variable. Suppose that the statement was “name of your
math teacher.” If the observational units were students in
our class, would this vary from
student to student? Pretend to ask each
student, “What is the name of your math
teacher?” Aren’t they all going to say
the same name? Since the students’ answers would not vary, this cannot be a
variable.
- Observational unit – the person or
thing to which the number or category is assigned; also called the case.
For example,
each class member that you asked for the number of brothers and/or sisters is an observational unit or case.
________________________________________________________________________
Section 1.2
- Quantitative variable – a variable
that measures a numerical characteristic; also called a measurement
variable.
For
example, since the response to how many brothers and/or sisters a person has is a number, this variable is a
quantitative variable.
Count variable – a type of quantitative
variable; answers the question, “How many?”
- Categorical variable – a variable
that records a category designation; also called a qualitative variable.
For
example, if you were to record the gender of your class members, gender would be a categorical variable because the
class members are either in the female category
or in the male category.
Here is
another example. Suppose you asked your
class members for their favorite
flavor of ice cream and you allowed them to choose from the following: vanilla, chocolate, strawberry, or
other. Ice cream flavor would be a
categorical variable because the
class responses would fall into one of the four categories.
- Binary variable – a special
categorical variable for which only two possible categories exist.
For
example, the variable gender would be a binary variable. The variable ice cream flavor would not be a binary variable because it has more than
two categories for the responses.
- Displays for a Categorical Variable
1. Frequency
Table
2. Picture
Graph
3. Bar
Graph
4. Segmented
Bar Graph
5. Circle
Graph
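A frequency table is the starting point for the other displays in this list. Here is a minimal sketch of building one in Python, assuming the ice cream example above. The ten responses are invented for illustration, and the names `responses`, `frequency_table`, and `relative_frequency` are not from the course materials.

```python
from collections import Counter

# Hypothetical responses (invented for illustration, not real class data).
responses = ["vanilla", "chocolate", "vanilla", "other", "strawberry",
             "chocolate", "vanilla", "chocolate", "vanilla", "other"]

# A frequency table counts how many responses fall into each category.
frequency_table = Counter(responses)

# Relative frequencies (proportions) are what a circle graph or a
# segmented bar graph would display.
n = len(responses)
relative_frequency = {flavor: count / n
                      for flavor, count in frequency_table.items()}

# Print the table, most common category first.
for flavor, count in frequency_table.most_common():
    print(f"{flavor:<12}{count:>3}{relative_frequency[flavor]:>7.2f}")
```

The proportions always add up to 1, which is why a circle graph can show them as slices of one whole.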
________________________________________________________________________
Section 1.3
- Read Question – requires the
respondent to read information from the table to determine a solution;
important but low-level.
- Derive Question – requires some
type of computation involving information read from a table.
- Interpret Question – requires an
extension, prediction, or inference to read beyond the data; higher-level
thinking.
________________________________________________________________________
Section 1.4
- Displays for Quantitative Variables
- Dot
Plot
- Stem-Leaf
Plot
- Grouped
Frequency Table
- Histogram
- Symmetric – a distribution is
symmetric if one half is roughly a mirror image of the other.
- Skewed to the right – a
distribution is skewed to the right if it tails off toward larger values.
- Skewed to the left – a
distribution is skewed to the left if it tails off toward smaller values.
- Outliers – observations that
differ markedly from the pattern established by the vast majority.
- Granularity – a distribution has
this characteristic if it has values occurring only at fixed intervals.
- Six Features of Data Distribution that
are typically of interest –
- center
– to be discussed in section 2.1
- variability
or spread – to be discussed in sections 2.2 and 2.3
- shape
– while the shape may vary, many times the shape may be identified as
symmetric, skewed to the right, or skewed to the left
- cluster/peaks
– peaks or clusters indicate that the data fall into natural subgroups
- outliers
– if outliers are present, they warrant close examination
- granularity
- Side-by-side Stemplot
- A
common set of stems is used in the middle of the display with leaves for
each category branching out in either direction
- Order
the leaves from the middle out toward either side
- Statistical Tendency
o Pertains
to average or typical cases but not necessarily to individual cases
o Ex:
Men tend to be taller than women. This
does not mean that all men are taller than all women.
________________________________________________________________________
Section 1.5 Part 1
Response Variable vs. Explanatory Variable
- Many
times we would like to offer an explanation as to why a person gives a
particular response.
- Example:
Do you believe that a person who is 50 years old is “old”? A person’s response to this question can
most likely be explained by that person’s age. That is, someone who is 20 might believe
50 is old. However, someone who is
49 or 60 might not consider 50 as old.
- In the
above example, there are two variables of interest, namely age and the “do
you believe 50 is old” variable.
Since we are thinking that a person’s age might predict the
response to the statement, the variable age is called the explanatory
variable. The “do you believe 50 is
old” variable is the response variable.
- The
response variable is affected or predicted by the explanatory variable.
Two-way Table
- This is a table that classifies each observational unit by two categorical variables at once.
- Continuing
the above example, suppose the following data were collected:
| Age   | 5 | 10 | 20 | 25 | 30 | 35 | 45 | 50 | 60 | 65 |
| Agree | Y | Y  | Y  | Y  | Y  | N  | N  | N  | N  | N  |
Here is a two-way table for this
data. (Notice: The ages are placed in
categories so as to create a categorical variable.)
|          | 0-25 | 26-50 | 51-75 |
| Agree    | 4    | 1     | 0     |
| Disagree | 0    | 3     | 2     |
- The explanatory variable should be in columns and the response variable in rows.
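The counting behind the two-way table can be sketched in Python. This is only an illustration, assuming the ten observations listed above and the age groups 0-25, 26-50, and 51-75. The helper `age_group` and the other names are made up for this sketch.

```python
from collections import Counter

# Raw data from the example: each person's age and whether they agree
# that 50 is "old".
ages = [5, 10, 20, 25, 30, 35, 45, 50, 60, 65]
agree = ["Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N"]

def age_group(age):
    """Place an age into one of the three age categories."""
    if age <= 25:
        return "0-25"
    if age <= 50:
        return "26-50"
    return "51-75"

# Count each (response, age group) combination.
counts = Counter((answer, age_group(age)) for age, answer in zip(ages, agree))

# Explanatory variable (age group) in columns, response variable in rows.
groups = ["0-25", "26-50", "51-75"]
two_way = {
    "Agree": [counts[("Y", g)] for g in groups],
    "Disagree": [counts[("N", g)] for g in groups],
}

for response, row in two_way.items():
    print(f"{response:<10}", row)
```

Turning age into a category first is what makes the table possible, since a two-way table needs two categorical variables.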
Marginal Distribution
- Calculated by finding the proportion of responses in each category
- Example: Continuing the above example, the marginal distribution for the age variable is 4/10 = .4 (4 of the 10 people are in the 0-25 age category), 4/10 = .4 (4 of the 10 people are in the 26-50 age category), and 2/10 = .2 (2 of the 10 people are in the 51-75 age category).
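The same arithmetic can be written as a short Python sketch, assuming the counts from the two-way table above. The variable names are illustrative only.

```python
# Counts from the two-way table (rows: response, columns: age group).
groups = ["0-25", "26-50", "51-75"]
two_way = {"Agree": [4, 1, 0], "Disagree": [0, 3, 2]}

# Grand total: everyone in the table.
total = sum(sum(row) for row in two_way.values())

# Marginal distribution of age: each column total divided by the grand total.
age_marginal = {
    group: sum(row[i] for row in two_way.values()) / total
    for i, group in enumerate(groups)
}

# Marginal distribution of the response: each row total divided by the
# grand total.
response_marginal = {response: sum(row) / total
                     for response, row in two_way.items()}

print(age_marginal)
print(response_marginal)
```

A marginal distribution looks at one variable at a time and ignores the other, so the age proportions do not depend on who agreed.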
Conditional Distribution
- Distribution of one variable for given categories of the other variable.
- From the above example, the proportion of respondents in the 26-50 (“middle-aged”) group who agree is 1/4 = .25 (one agrees out of the total of 4 people in that age group).
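A conditional distribution divides by the total of one column rather than by the grand total. Here is a small Python sketch of that idea, assuming the counts from the two-way table above. The function name `conditional_distribution` is made up for this illustration.

```python
# Counts from the two-way table (rows: response, columns: age group).
groups = ["0-25", "26-50", "51-75"]
two_way = {"Agree": [4, 1, 0], "Disagree": [0, 3, 2]}

def conditional_distribution(table, column):
    """Distribution of the response within one column (one age group)."""
    column_total = sum(row[column] for row in table.values())
    return {response: row[column] / column_total
            for response, row in table.items()}

# Responses among the 26-50 group only: 1 of 4 agree, 3 of 4 disagree.
middle_aged = conditional_distribution(two_way, groups.index("26-50"))
print(middle_aged)
```

Each age group gets its own conditional distribution, and the proportions within a group add up to 1.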
Segmented Bar Graphs
- Visual
display for conditional distributions.
- Each
rectangle has a height of 100%.
- Each
rectangle is divided into segments whose lengths correspond to the
conditional proportions.
________________________________________________________________________
Section 2.1
Three Measures of Center
- Mean – the arithmetic average. The mean is found by adding up the values of the observations and dividing by
the number of observations.
Example:
Let 5, 10, 8, 7, 4 be the data set. To find the mean, add these numbers (5 + 10 + 8 + 7 + 4 = 34) and divide by how many numbers there are in the set (34/5 = 6.8).
The mean for this data set is 6.8.
The
mean can be thought of as the “balance point” of the distribution. Also, the mean
can be calculated only on quantitative variables.
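As a quick check of the example, here is a minimal Python sketch. It also illustrates the “balance point” idea: the distances of the values above the mean and below the mean cancel out. The variable names are illustrative only.

```python
data = [5, 10, 8, 7, 4]

# Add up the observations and divide by how many there are.
mean = sum(data) / len(data)   # 34 / 5
print(mean)

# Balance point: the deviations from the mean add up to zero
# (apart from tiny floating-point rounding).
deviations = [x - mean for x in data]
print(sum(deviations))
```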
- Median – the middle observation
when the observations are listed in order.
To
find the median:
- Arrange
the values in order
- If
there are an odd number of values, the median is in the (n + 1)/2
position.
- If
there are an even number of values, the median is the average of the
values in the n/2 and (n/2) + 1 positions.
Example: Let 5, 10, 8, 7, 4 be the
data set. To find the median, we must
first list these numbers in order—4, 5, 7, 8, 10. Since there are 5 numbers in the set (an odd
number) the median is the middle number, in this case 7. Sometimes a set is very large so it is easier
to figure out which numbered position the median is in. If so, use the formula (n + 1)/2 to find the
position number. For this example, n
would be 5. Using the formula, (5 + 1)/2
gives us 3. If you look in the third
position, the median is 7.
Example: Let 5, 10, 8, 7, 4, 12 be
the data set. To find the median we must
first list these numbers in order—4, 5, 7, 8, 10, 12. Since there are 6 numbers in the set (an even
number) the median is the average of the two middle numbers, in this case the
average of 7 and 8 is 7.5. If a data set
is very large, it may be beneficial to use the formulas n/2 and (n/2) + 1 to
find the two numbers that you must average to get the median. In this example, n is 6. Using the formulas we get 6/2 = 3 and (6/2) +
1 = 4. We need to average the numbers in
the third position (7) and in the fourth position (8). If you average 7 and 8 you get 7.5.
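The position rules for the median can be sketched in Python as follows. This is one possible way to write it, not the only one. Note that Python counts list positions from 0, so position 3 in the text is index 2 in the code.

```python
def median(values):
    """Median by the position rules: (n + 1)/2 for odd n,
    average of positions n/2 and (n/2) + 1 for even n."""
    ordered = sorted(values)            # step 1: arrange the values in order
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2         # 1-based position of the middle value
        return ordered[position - 1]    # convert to a 0-based index
    lower = ordered[n // 2 - 1]         # value in position n/2
    upper = ordered[n // 2]             # value in position (n/2) + 1
    return (lower + upper) / 2

print(median([5, 10, 8, 7, 4]))         # odd number of values
print(median([5, 10, 8, 7, 4, 12]))     # even number of values
```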
- Mode – the most common value; the
value that occurs most frequently.
Example:
Let 5, 7, 3, 4, 4, 1 be the data set.
The mode is 4 simply because it occurs
twice and the other values occur only once.
Suppose the data set had been orange,
blue, orange, blue, orange, blue, red, black, green, red. The mode here would be both orange and blue since each of these occurred the
most (three times each).
The mode
applies to all categorical variables but is only useful with some quantitative variables.
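Both mode examples can be checked with `multimode` from Python's standard `statistics` module (available in Python 3.8 and later), which returns every value tied for most frequent. This is just a sketch using the two data sets above.

```python
from statistics import multimode

numbers = [5, 7, 3, 4, 4, 1]
colors = ["orange", "blue", "orange", "blue", "orange",
          "blue", "red", "black", "green", "red"]

# multimode returns a list, so ties are reported rather than hidden.
number_modes = multimode(numbers)   # 4 occurs twice
color_modes = multimode(colors)     # orange and blue occur three times each

print(number_modes)
print(color_modes)
```

The second data set shows why the mode works for categorical variables: no arithmetic is needed, only counting.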
Sample Size – the
number of observations in the data set; the variable n usually denotes the
sample size.
The relationship of
the mean and the median –
- Symmetric
distribution – the mean is close to the median
- Skewed
right distribution – the mean is greater than the median
- Skewed
left distribution – the mean is less than the median
Resistant – a
measure whose value is relatively unaffected by the presence of outliers.
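To see resistance in action, here is a small Python sketch. It assumes the data set 4, 5, 7, 8, 10 from the median example and a hypothetical change of the largest value from 10 to 100. The median does not move, but the mean is pulled toward the outlier, which is also why the mean exceeds the median in a right-skewed distribution.

```python
from statistics import mean, median

original = [4, 5, 7, 8, 10]
with_outlier = [4, 5, 7, 8, 100]   # hypothetical: 10 replaced by 100

# The mean changes a lot; the median stays put.
mean_before, mean_after = mean(original), mean(with_outlier)
median_before, median_after = median(original), median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```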
Note: Measures of center are often important, but they do
not summarize all aspects of a distribution.
________________________________________________________________________
Section 2.2
Range
- A
measure of variability
- Simple but not very useful, because it depends only on the two most extreme values
- Maximum
value minus the minimum value
Inter-quartile Range (IQR)
- A
measure of variability
- It
is the upper quartile minus the lower quartile
- The
range of the middle 50% of the data
Lower Quartile
- 25th
percentile
- The
value such that 25% of the observations fall below that value and 75% of
the observations fall above the value
- To
find the lower quartile
- Find
the median for the entire data set.
(This number divides the set into two halves.)
- Find
the median for the portion of the data set that falls below the actual
median (which was found in step 1).
This is your Lower Quartile.
(By dividing the bottom half of the data set in half, you have
found the quarters of the entire data set.)
- Note:
If there are an odd number of observations in the original data set, the
actual median is not included in the bottom half when finding the lower
quartile.
Upper Quartile
- 75th
percentile
- The
value such that 75% of the observations fall below that value and 25% of
the observations fall above the value.
- To
find the upper quartile
- Find
the median for the entire data set.
(This number divides the set into two halves.)
- Find
the median for the portion of the data set that falls above the actual
median (which was found in step 1).
This is your Upper Quartile.
(By dividing the upper half of the data set in half, you have
found the quarters of the entire data set.)
- Note:
If there are an odd number of observations in the original data set, the
actual median is not included in the upper half when finding the upper
quartile.
Five-number summary
- Provides
a quick, convenient description of where the four quarters of the data
fall
- Includes
the minimum value, the lower quartile, the median, the upper quartile, and
the maximum value.
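The quartile steps above can be sketched in Python. This follows the “median of each half” method described in these notes. Statistical software may use slightly different quartile rules, so results from a calculator or another package can differ a little. The function name `five_number_summary` is made up for this sketch, and it assumes at least two observations.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # With an odd number of observations, the median itself is left out
    # of both halves.
    upper_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

# Data set from the median example in Section 2.1.
minimum, q1, med, q3, maximum = five_number_summary([5, 10, 8, 7, 4, 12])

data_range = maximum - minimum   # range: maximum minus minimum
iqr = q3 - q1                    # IQR: upper quartile minus lower quartile

print(minimum, q1, med, q3, maximum)
print("range =", data_range, " IQR =", iqr)
```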
Boxplot
- A
visual display which is based on the 5-number summary.
- Draw
a box between the quartiles. This
box demonstrates where the middle 50% of the data fall.
- Draw
horizontal lines (or whiskers) that extend from the left and right sides
of the box to the minimum and maximum, respectively.
- Mark
the median with a vertical line inside the box.
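- One weakness of boxplots: because the whiskers run all the way to the minimum and maximum, a single outlier stretches a whisker and is not shown as unusual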
Modified Boxplots
- Outliers
are marked with symbols.
- “Whiskers”
extend to the most extreme, nonoutlying value.
- Rule
for identifying outliers: outliers are observations lying more than 1.5
times the IQR away from the nearer quartile.
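Here is a minimal Python sketch of the 1.5 × IQR rule. The data set is invented for illustration, and its quartiles (5 and 13) were found by the median-of-halves method from Section 2.2. The function name `find_outliers` is hypothetical.

```python
def find_outliers(values, q1, q3):
    """Apply the 1.5 x IQR rule; return the outliers and whisker ends."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr     # anything below this is an outlier
    high_fence = q3 + 1.5 * iqr    # anything above this is an outlier
    outliers = [x for x in values if x < low_fence or x > high_fence]
    inside = [x for x in values if low_fence <= x <= high_fence]
    # Whiskers stop at the most extreme values that are not outliers.
    whiskers = (min(inside), max(inside))
    return outliers, whiskers

# Hypothetical data: lower half 2, 4, 5, 7, 8 (median 5) and
# upper half 10, 12, 13, 15, 40 (median 13).
data = [2, 4, 5, 7, 8, 10, 12, 13, 15, 40]
outliers, whiskers = find_outliers(data, q1=5, q3=13)

print("outliers:", outliers)
print("whiskers run from", whiskers[0], "to", whiskers[1])
```

Here the IQR is 8, so the cutoffs are 5 - 12 = -7 and 13 + 12 = 25. Only 40 lies beyond them.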
________________________________________________________________________
Section 2.3
Standard Deviation
- A
widely used measure of variability.
- To
compute:
- Calculate
the difference between each observation and the mean.
- Square
each of these differences.
- Add
these squares.
- Divide
this sum by n-1.
- Take
the square root.
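The five steps can be followed line by line in Python. This sketch assumes the data set 5, 10, 8, 7, 4 from Section 2.1 (mean 6.8). The variable names are illustrative, and the last line compares the answer with `stdev` from Python's `statistics` module, which also divides by n - 1.

```python
import math
from statistics import stdev

data = [5, 10, 8, 7, 4]
n = len(data)
mean = sum(data) / n

# Step 1: difference between each observation and the mean.
differences = [x - mean for x in data]
# Step 2: square each of these differences.
squares = [d ** 2 for d in differences]
# Step 3: add these squares.
sum_of_squares = sum(squares)
# Step 4: divide this sum by n - 1.
variance = sum_of_squares / (n - 1)
# Step 5: take the square root.
standard_deviation = math.sqrt(variance)

print(round(standard_deviation, 3))
print(round(stdev(data), 3))   # library result for comparison
```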
Empirical Rule
- With
mound-shaped data
- About
68% of the observations fall within 1 standard deviation of the mean.
- About
95% of the observations fall within 2 standard deviations of the mean.
- Virtually
all observations fall within 3 standard deviations of the mean
- This
is not necessarily true for distributions of other shapes.
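One way to see the empirical rule is to simulate mound-shaped data and count. The sketch below uses invented data (10,000 random values from a normal distribution with mean 100 and standard deviation 15, chosen only for illustration). The proportions it prints should land near 68%, 95%, and almost 100%, though not exactly on them.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

# Simulated mound-shaped data; the parameters 100 and 15 are arbitrary.
data = [random.gauss(100, 15) for _ in range(10_000)]

center = mean(data)
spread = stdev(data)

def share_within(k):
    """Proportion of observations within k standard deviations of the mean."""
    return sum(abs(x - center) <= k * spread for x in data) / len(data)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.3f}")
```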
z-score or standardized score
- Useful
for comparing individual scores from different distributions.
- To
calculate a z-score
- Subtract
the mean from the value of interest.
- Divide
by the standard deviation.
- The
z-score indicates how many standard deviations above or below the mean a
particular value falls.
- It
should only be used when working with mound-shaped distributions.
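Here is a short Python sketch of the z-score calculation. The two exams and all of their numbers are hypothetical, made up to show how z-scores let you compare scores from different distributions.

```python
def z_score(value, mean, standard_deviation):
    """Number of standard deviations the value lies above (+) or below (-)
    the mean."""
    return (value - mean) / standard_deviation

# Hypothetical exams: which score is more impressive?
z_exam_a = z_score(85, mean=75, standard_deviation=5)    # score of 85
z_exam_b = z_score(90, mean=80, standard_deviation=10)   # score of 90

print(z_exam_a, z_exam_b)
```

The 85 is two standard deviations above its class mean, while the 90 is only one standard deviation above its class mean, so the 85 is the stronger score relative to its own distribution.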
Note: A common
misconception about variability is to believe that a “bumpier” histogram
indicates a more variable distribution, but this is not the case. Similarly, the number of distinct values
represented in a histogram does not necessarily indicate greater variability.
________________________________________________________________________
Section 1.5 Part 2
Scatterplot
- A
scatterplot is similar to a dot plot except that it displays two
quantitative variables simultaneously.
- The
vertical axis represents one variable and the horizontal axis represents
the other.
- A dot
represents an observational pair.
- Generally,
the response variable is on the vertical axis and the explanatory variable
is on the horizontal axis.
- For
example, I believe that if I know your foot length, then I can tell you
your height. The variable foot length
is predicting the variable height.
Foot length is the explanatory variable and should be on the
horizontal axis. Height is the
response variable and should be on the vertical axis.
Positive Association
- Two
variables are positively associated
if larger values of one variable tend to occur with larger values of the
other variable.
- For
example, consider the variables “number of hours worked” and “money
earned.” One would assume that if a
large number of hours are worked, then a large amount of money is
earned. Therefore, these two
variables are positively associated.
Negative Association
- Two
variables are negatively associated
if larger values of one variable tend to occur with smaller values of the
other.
- For
example, consider the variables “the number of days absent from class” and
“grade in class.” Generally,
someone with a high number of absences will have a lower grade in the
class. These two variables are
negatively associated.
Correlation Coefficient
- The
letter r is used to denote the
correlation coefficient.
- The correlation coefficient is a
measure of the degree to which two variables are associated.
- The
value of the correlation coefficient ranges from -1 to +1.
- If
the correlation coefficient equals +1 or -1, then the observations form a
perfectly straight line.
- The
sign of the correlation coefficient reflects the direction of the
association. That is, if r is
positive then the two variables are positively associated. If r is negative, then the two variables
are negatively associated.
- Values of r that are closer to +1 or -1 indicate stronger associations. Therefore, the magnitude of the correlation coefficient indicates the strength of the association.
- The
correlation coefficient only measures linear relationships between two
variables. Therefore, it is always
important to look at the scatterplot when interpreting r.
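These notes do not give a formula for r, so the sketch below uses the standard Pearson formula as an assumption. All three data sets are invented for illustration. The last one shows why the scatterplot matters: y depends completely on x (it is x squared), yet r is 0 because the relationship is not linear.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical: pay of 10 per hour, so the points fall on a straight line.
r_pay = correlation([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])

# Hypothetical: more absences tend to go with lower grades.
r_grades = correlation([0, 1, 2, 3, 4], [95, 90, 80, 75, 60])

# Hypothetical curved pattern: y = x squared.
r_curve = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_pay, round(r_grades, 3), r_curve)
```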
Association vs. Causation
- Two
variables may be strongly associated without a cause-and-effect
relationship.
- Often, if two variables are associated but a cause-and-effect relationship is not
apparent, then it is likely that the two variables are related to a third
variable that is not being measured.
This third variable is called a lurking variable or a confounding
variable.