statistics

Correlation coefficients cannot be mathed. A +.80 coefficient isn’t twice as strong as +.40.

They can be thought of through:

  1. Consistency: the relative degree of consistency with which Ys are paired with Xs. At ±1, everyone who has a given X has the same Y, and that holds for every X.
  2. Variability: at ±1 there is no variability among the Y scores within any given X; the weaker the correlation, the more spread there is.
  3. The scatterplot: how closely the data points cluster around the regression line; the stronger the correlation, the tighter the cluster.
  4. Predictions: communicates the relative accuracy of predicting Y from a given X.
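One standard way to see why coefficients can’t be compared arithmetically (not spelled out in these notes, so treat it as an aside): squaring r gives the proportion of variance in Y accounted for by X, and it is r² that scales meaningfully.

\begin{math} r = +.40 \Rightarrow r^2 = .16 \qquad\qquad r = +.80 \Rightarrow r^2 = .64 \end{math}

So a +.80 correlation accounts for four times as much variance as a +.40 correlation, not twice as much.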

Definitions

variability : the spread of the Y scores paired with a given X; in this sense it is the opposite of correlation, since more variability at each X means a weaker correlation.

correlation coefficient : the numerical representation of the strength and direction of a relationship between two variables. It’s always rounded to 2 decimals. In real experiments, coefficients near 0 are considered weak and coefficients approaching ±1 are considered extremely strong.

regression line : the straight line through a scatterplot that best summarizes the relationship; roughly, it marks the average Y at each X.

scatterplot : a graph of individual data points from a set of x-y pairs. The X value in a scatterplot is the “given”. e.g. “Given cups of coffee consumed, how nervous are people?”

strength of a relationship : (aka degree of association) how consistently one value of Y is associated with one and only one value of X. Strength ranges from 0 to 1 (max = 1) if the sign is ignored, or from -1 to +1 if the sign is kept.

perfect correlation : a correlation coefficient of -1 or +1.

pearson correlation coefficient : represented as r. Describes the linear relationship between two interval or ratio variables.

restricted range : when the X or Y scores in the data cover a narrower range than they otherwise would; a restricted range artificially shrinks r, so a reasonable breadth of scores is required.

sampling distribution of r : the distribution you would get by taking an infinite number of samples from the population and computing r for each one. It is approximately normal.
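A minimal simulation sketch of that idea, assuming numpy is available and a true population correlation of 0 (everything here is illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
samples, n_pairs = 10_000, 10

# Draw many samples of X-Y pairs from a population where the true correlation is 0,
# and compute r for each sample.
rs = [np.corrcoef(rng.normal(size=n_pairs), rng.normal(size=n_pairs))[0, 1]
      for _ in range(samples)]

# The r values pile up near 0 and thin out toward +/-1 -- an approximation of the
# sampling distribution of r when the null hypothesis is true.
print(round(float(np.mean(rs)), 3), round(float(np.std(rs)), 3))
```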

Correlational Analysis

The four main differences as compared to an experiment.

  1. In an experiment you examine the X scores and the Y scores separately (summarizing the Ys at each condition of X); in correlational analysis you look at the X-Y pairs only.
  2. b/c we look at all the pairs, correlational analysis is single-sample: N = the number of pairs.
  3. X is determined by the question. “Given an amount of coffee, how nervous are people?” would mean that X = amount of coffee.
  4. Data is graphed on a scatterplot.

Types of relationships

“As x changes, the y’s…”

  • linear: it follows a single straight line. It can go up (positive linear relationship) or down (negative linear relationship)
  • nonlinear: aka curvilinear; e.g. “given age, how fast are you?” follows an inverted-U shape (speed rises, peaks, then falls with age).
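A quick sketch (my own illustration, assuming numpy) of why this distinction matters for the Pearson coefficient below: a strong curvilinear relationship can still produce an r near 0, because r only captures the linear part.

```python
import numpy as np

# Hypothetical "given age, how fast are you?" data: speed rises, peaks, then falls.
age = np.arange(5, 85, 5)
speed = -(age - 42.5) ** 2         # inverted-U centered on the middle of the age range

r = np.corrcoef(age, speed)[0, 1]
print(round(float(r), 2))          # essentially 0, despite the obvious relationship
```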

Pearson correlation coefficient

This determines the “average” amount that the paired X and Y scores correspond. We translate each X and each Y into its z-score (z_X and z_Y) and compare them.

Requires:

  1. X and Y scores each form an approximately normal distribution
  2. Avoid a restricted range of X or Y scores (see the sketch after this list).
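A small sketch of the restricted-range problem (assuming numpy; the data are made up): the same underlying relationship looks weaker when only a narrow slice of the X scores is kept.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = x + rng.normal(scale=0.5, size=500)          # a strong positive linear relationship

r_full = np.corrcoef(x, y)[0, 1]

narrow = x > 1                                   # restrict the range: keep only high X scores
r_restricted = np.corrcoef(x[narrow], y[narrow])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))  # the restricted r is noticeably smaller
```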

At a high level, we compare the paired z-scores directly. That’s what the janky “defining formula” below does, but it yields too many rounding errors in practice:

\begin{math} r = \frac{\Sigma(z_x z_y)}{N} \end{math}

To compute r, we instead use:

\begin{math} r = \frac{N(\Sigma XY) - (\Sigma X)(\Sigma Y)}{\sqrt{[N(\Sigma X^2)-(\Sigma X)^2] [N(\Sigma Y^2)-(\Sigma Y)^2]}} \end{math}

It’s worth noting that ΣXY is the “sum of the cross products”, i.e. the sum of the individual (X·Y) products.
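A minimal Python sketch of the computational formula (the function name and variable names are mine, not from the notes):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r via the computational (raw-score) formula above."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # the "sum of the cross products"

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator
```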

Example

X = Glasses of Juice per Day; Y = Doctor Visits per Year

| Participant | X       | X^2       | Y       | Y^2        | XY       |
|-------------|---------|-----------|---------|------------|----------|
| 1           | 0       | 0         | 8       | 64         | 0        |
| 2           | 0       | 0         | 7       | 49         | 0        |
| 3           | 1       | 1         | 7       | 49         | 7        |
| 4           | 1       | 1         | 6       | 36         | 6        |
| 5           | 1       | 1         | 5       | 25         | 5        |
| 6           | 2       | 4         | 4       | 16         | 8        |
| 7           | 2       | 4         | 4       | 16         | 8        |
| 8           | 3       | 9         | 4       | 16         | 12       |
| 9           | 3       | 9         | 2       | 4          | 6        |
| 10          | 4       | 16        | 0       | 0          | 0        |
| N = 10      | ΣX = 17 | ΣX^2 = 45 | ΣY = 47 | ΣY^2 = 275 | ΣXY = 52 |

(ΣX)^2 = 289, (ΣY)^2 = 2209

\begin{align} r &= \frac{N(\Sigma XY) - (\Sigma X)(\Sigma Y)}{\sqrt{[N(\Sigma X^2)-(\Sigma X)^2] [N(\Sigma Y^2)-(\Sigma Y)^2]}} \\ &= \frac{10(52) - (17)(47)}{\sqrt{[10(45)-289] [10(275)-2209]}} \\ &= \frac{520 - 799}{\sqrt{[450-289] [2750-2209]}} \\ &= \frac{-279}{\sqrt{[161] [541]}} \\ &= \frac{-279}{\sqrt{87101}} \\ &= \frac{-279}{295.129} \\ &= -.95 \end{align}
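As a sanity check, plugging the juice/doctor-visit data into the pearson_r sketch above reproduces the hand calculation:

```python
juice  = [0, 0, 1, 1, 1, 2, 2, 3, 3, 4]   # X: glasses of juice per day
visits = [8, 7, 7, 6, 5, 4, 4, 4, 2, 0]   # Y: doctor visits per year

print(round(pearson_r(juice, visits), 2))  # -0.95
```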

Significance testing

df = N - 2 (where N is the number of X-Y pairs).

Steps:

  1. Compute the obtained coefficient, r_obt.
  2. Define H0 and Ha and decide whether the test is one- or two-tailed.
  3. Find the critical value r_crit in the r-tables, using df.
  4. If r_obt is beyond r_crit, the correlation is significant.
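In practice a library call does steps 1-4 in one shot. A sketch assuming scipy is available (scipy.stats.pearsonr returns r and a two-tailed p-value):

```python
from scipy import stats

juice  = [0, 0, 1, 1, 1, 2, 2, 3, 3, 4]
visits = [8, 7, 7, 6, 5, 4, 4, 4, 2, 0]

r, p = stats.pearsonr(juice, visits)   # df = N - 2 = 8 behind the scenes
print(round(r, 2), p < .05)            # -0.95 True -> significant at alpha = .05
```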