In this section we discuss correlation analysis, which is a technique used to quantify the association between two continuous variables. For example, we might want to quantify the association between body mass index and systolic blood pressure, or between hours of exercise per week and percent body fat. Regression analysis is a related technique to assess the relationship between an outcome variable and one or more risk factors or confounding variables (confounding is discussed later). The outcome variable is also called the **response** or **dependent variable**, and the risk factors and confounders are called the **predictors**, or **explanatory** or **independent variables**. In regression analysis, the dependent variable is denoted "Y" and the independent variables are denoted by "X".


> **NOTE:** The term "predictor" can be misleading if it is interpreted as the ability to predict even beyond the limits of the data. Also, the term "explanatory variable" might give an impression of a causal effect in a situation in which inferences should be limited to identifying associations. The terms "independent" and "dependent" variable are less subject to these interpretations as they do not strongly imply cause and effect.


In correlation analysis, we estimate a **sample correlation coefficient**, more specifically the **Pearson Product Moment correlation coefficient**. The sample correlation coefficient, denoted r, ranges between -1 and +1 and quantifies the direction and strength of the linear association between the two variables. The correlation between two variables can be positive (i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e., higher levels of one variable are associated with lower levels of the other).

The sign of the correlation coefficient indicates the direction of the association. The magnitude of the correlation coefficient indicates the strength of the association.

For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation close to zero suggests no linear association between two continuous variables.

It is important to note that there may be a non-linear association between two continuous variables, but computation of a correlation coefficient does not detect this. Therefore, it is always important to evaluate the data carefully before computing a correlation coefficient. Graphical displays are particularly useful to explore associations between variables.
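The point about non-linear associations can be illustrated with a small sketch. The data here are hypothetical (they are not from the module), and `pearson_r` is an illustrative helper name: y is completely determined by x through a U-shaped curve, yet the correlation coefficient comes out at essentially zero.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y depends on x exactly, but through a U-shaped curve.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print(f"r for y = x^2 on a symmetric range: {r_curved:.2f}")
```

A scatter plot of these points would show the relationship immediately, which is why graphical inspection is worth doing before relying on r.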

The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the X-axis and the other along the Y-axis.

- Scenario 1 depicts a strong positive association (r=0.9), similar to what we might see for the correlation between infant birth weight and birth length.
- Scenario 2 depicts a weaker association (r=0.2) that we might expect to see between age and body mass index (which tends to increase with age).
- Scenario 3 might depict the lack of association (r approximately = 0) between the extent of media exposure in adolescence and age at which adolescents initiate sexual activity.
- Scenario 4 might depict the strong negative association (r= -0.9) generally observed between the number of hours of aerobic exercise per week and percent body fat.

## Example - Correlation of Gestational Age and Birth Weight

A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.

| Infant ID # | Gestational Age (weeks) | Birth Weight (grams) |
|---|---|---|
| 1 | 34.7 | 1895 |
| 2 | 36.0 | 2030 |
| 3 | 29.3 | 1440 |
| 4 | 40.1 | 2835 |
| 5 | 35.7 | 3090 |
| 6 | 42.4 | 3827 |
| 7 | 40.3 | 3260 |
| 8 | 37.3 | 2690 |
| 9 | 40.9 | 3285 |
| 10 | 38.3 | 2920 |
| 11 | 38.5 | 3430 |
| 12 | 41.4 | 3657 |
| 13 | 39.7 | 3685 |
| 14 | 39.7 | 3345 |
| 15 | 41.1 | 3260 |
| 16 | 38.0 | 2680 |
| 17 | 38.7 | 2005 |

We wish to estimate the association between gestational age and infant birth weight. In this example, birth weight is the dependent variable and gestational age is the independent variable. Thus y=birth weight and x=gestational age. The data are displayed in a scatter diagram in the figure below.

Each point represents an (x,y) pair (in this case the gestational age, measured in weeks, and the birth weight, measured in grams). Note that the independent variable (gestational age) is on the horizontal axis (or X-axis), and the dependent variable (birth weight) is on the vertical axis (or Y-axis). The scatter plot shows a positive or direct association between gestational age and birth weight. Infants with shorter gestational ages are more likely to be born with lower weights and infants with longer gestational ages are more likely to be born with higher weights.

### Computing the Correlation Coefficient

The formula for the sample correlation coefficient is:

$$r = \frac{Cov(x,y)}{\sqrt{s_x^2 \, s_y^2}}$$

where Cov(x,y) is the covariance of x and y defined as

$$Cov(x,y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n-1}$$

and $s_x^2$ and $s_y^2$ are the sample variances of x and y, defined as follows:

$$s_x^2 = \frac{\sum (x - \bar{x})^2}{n-1} \quad \text{and} \quad s_y^2 = \frac{\sum (y - \bar{y})^2}{n-1}$$

The variances of x and y measure the variability of the x scores and y scores around their respective sample means ($\bar{x}$ and $\bar{y}$), considered separately. The covariance measures the variability of the (x,y) pairs around the mean of x and mean of y, considered simultaneously.
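The three quantities just defined translate directly into code. The following is a minimal sketch using only the Python standard library; the function names and the toy data are illustrative assumptions, not part of the module.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Covariance divided by the square root of the product of the variances."""
    return sample_covariance(xs, ys) / math.sqrt(
        sample_variance(xs) * sample_variance(ys)
    )

# Hypothetical toy data: one perfectly increasing and one perfectly decreasing line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_up = [2.0, 4.0, 6.0, 8.0, 10.0]
ys_down = [10.0, 8.0, 6.0, 4.0, 2.0]

r_up = sample_correlation(xs, ys_up)
r_down = sample_correlation(xs, ys_down)
print(r_up, r_down)
```

The two toy series sit at the extremes of the range of r: a perfect increasing line and a perfect decreasing line.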

To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight, and also the covariance of gestational age and birth weight.

We first summarize the gestational age data. The mean gestational age is:

$$\bar{x} = \frac{\sum x}{n} = \frac{652.1}{17} = 38.4$$

To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age and the mean gestational age. The computations are summarized below.

| Infant ID # | Gestational Age (weeks) | Deviation from mean, $(x - \bar{x})$ | Squared deviation, $(x - \bar{x})^2$ |
|---|---|---|---|
| 1 | 34.7 | -3.7 | 13.69 |
| 2 | 36.0 | -2.4 | 5.76 |
| 3 | 29.3 | -9.1 | 82.81 |
| 4 | 40.1 | 1.7 | 2.89 |
| 5 | 35.7 | -2.7 | 7.29 |
| 6 | 42.4 | 4.0 | 16.0 |
| 7 | 40.3 | 1.9 | 3.61 |
| 8 | 37.3 | -1.1 | 1.21 |
| 9 | 40.9 | 2.5 | 6.25 |
| 10 | 38.3 | -0.1 | 0.01 |
| 11 | 38.5 | 0.1 | 0.01 |
| 12 | 41.4 | 3.0 | 9.0 |
| 13 | 39.7 | 1.3 | 1.69 |
| 14 | 39.7 | 1.3 | 1.69 |
| 15 | 41.1 | 2.7 | 7.29 |
| 16 | 38.0 | -0.4 | 0.16 |
| 17 | 38.7 | 0.3 | 0.09 |

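The variance of gestational age is:

$$s_x^2 = \frac{\sum (x - \bar{x})^2}{n-1} = \frac{159.45}{16} = 10.0$$

Next, we summarize the birth weight data. The mean birth weight is:

$$\bar{y} = \frac{\sum y}{n} = \frac{49{,}334}{17} = 2{,}902$$

The variance of birth weight is computed just as we did for gestational age, as shown in the table below.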

| Infant ID # | Birth Weight (grams) | Deviation from mean, $(y - \bar{y})$ | Squared deviation, $(y - \bar{y})^2$ |
|---|---|---|---|
| 1 | 1895 | -1007 | 1,014,049 |
| 2 | 2030 | -872 | 760,384 |
| 3 | 1440 | -1462 | 2,137,444 |
| 4 | 2835 | -67 | 4,489 |
| 5 | 3090 | 188 | 35,344 |
| 6 | 3827 | 925 | 855,625 |
| 7 | 3260 | 358 | 128,164 |
| 8 | 2690 | -212 | 44,944 |
| 9 | 3285 | 383 | 146,689 |
| 10 | 2920 | 18 | 324 |
| 11 | 3430 | 528 | 278,784 |
| 12 | 3657 | 755 | 570,025 |
| 13 | 3685 | 783 | 613,089 |
| 14 | 3345 | 443 | 196,249 |
| 15 | 3260 | 358 | 128,164 |
| 16 | 2680 | -222 | 49,284 |
| 17 | 2005 | -897 | 804,609 |

The variance of birth weight is:

$$s_y^2 = \frac{\sum (y - \bar{y})^2}{n-1} = \frac{7{,}767{,}660}{16} = 485{,}478.8$$

Next we compute the covariance. To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant, sum the products, and divide by n-1, that is:

$$Cov(x,y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n-1}$$

The computations are summarized below. Notice that we simply copy the deviations from the mean gestational age and birth weight from the two tables above into the table below and multiply.

| Infant ID # | $(x - \bar{x})$ | $(y - \bar{y})$ | $(x - \bar{x})(y - \bar{y})$ |
|---|---|---|---|
| 1 | -3.7 | -1007 | 3725.9 |
| 2 | -2.4 | -872 | 2092.8 |
| 3 | -9.1 | -1462 | 13,304.2 |
| 4 | 1.7 | -67 | -113.9 |
| 5 | -2.7 | 188 | -507.6 |
| 6 | 4.0 | 925 | 3700.0 |
| 7 | 1.9 | 358 | 680.2 |
| 8 | -1.1 | -212 | 233.2 |
| 9 | 2.5 | 383 | 957.5 |
| 10 | -0.1 | 18 | -1.8 |
| 11 | 0.1 | 528 | 52.8 |
| 12 | 3.0 | 755 | 2265.0 |
| 13 | 1.3 | 783 | 1017.9 |
| 14 | 1.3 | 443 | 575.9 |
| 15 | 2.7 | 358 | 966.6 |
| 16 | -0.4 | -222 | 88.8 |
| 17 | 0.3 | -897 | -269.1 |
| | | | Total = 28,768.4 |

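The covariance of gestational age and birth weight is:

$$Cov(x,y) = \frac{28{,}768.4}{16} = 1798.0$$

Finally, we can now compute the sample correlation coefficient:

$$r = \frac{Cov(x,y)}{\sqrt{s_x^2 \, s_y^2}} = \frac{1798.0}{\sqrt{10.0 \times 485{,}478.8}} = \frac{1798.0}{2203.4} = 0.82$$

Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.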
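As a check on the hand computation, the whole example can be reproduced in a few lines. This is a sketch using only the standard library; the variable names are illustrative. Because the code keeps the unrounded mean gestational age (about 38.36 rather than 38.4), its intermediate values differ slightly from the rounded figures in the tables, but the correlation still rounds to 0.82.

```python
import math

# Data for the 17 infants in the example above.
gestational_age = [34.7, 36.0, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
                   38.3, 38.5, 41.4, 39.7, 39.7, 41.1, 38.0, 38.7]
birth_weight = [1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
                2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005]

n = len(gestational_age)
mean_x = sum(gestational_age) / n
mean_y = sum(birth_weight) / n

# Sample variances and covariance, each with n - 1 in the denominator.
var_x = sum((x - mean_x) ** 2 for x in gestational_age) / (n - 1)
var_y = sum((y - mean_y) ** 2 for y in birth_weight) / (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(gestational_age, birth_weight)) / (n - 1)

r = cov_xy / math.sqrt(var_x * var_y)

print(f"mean gestational age = {mean_x:.1f} weeks")
print(f"mean birth weight    = {mean_y:.0f} grams")
print(f"covariance           = {cov_xy:.1f}")
print(f"r                    = {r:.2f}")
```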

As we noted, sample correlation coefficients range from -1 to +1. In practice, meaningful correlations (i.e., correlations that are clinically or practically important) can be as small as 0.4 (or -0.4) for positive (or negative) associations. There are also statistical tests to determine whether an observed correlation is statistically significant or not (i.e., statistically significantly different from zero). Procedures to test whether an observed sample correlation is suggestive of a statistically significant correlation are described in detail in Kleinbaum, Kupper and Muller [1].
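One commonly used test of this kind converts r into a t statistic with n - 2 degrees of freedom, t = r·sqrt((n - 2) / (1 - r²)). The sketch below applies it to the gestational age example, taking r as approximately 0.82 and n = 17. It is offered as an illustration under stated assumptions and is not necessarily the exact procedure given in the cited text; the function name is hypothetical, and the critical value 2.131 is the two-sided 5% value for 15 degrees of freedom from a standard t table.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: population correlation = 0, on n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = 0.82  # sample correlation in the gestational age example (rounded)
n = 17    # number of infants

t_stat = correlation_t_statistic(r, n)

# Two-sided 5% critical value of t with 15 degrees of freedom (standard t table).
T_CRITICAL_15_DF = 2.131
is_significant = abs(t_stat) > T_CRITICAL_15_DF

print(f"t = {t_stat:.2f}, significant at the 5% level: {is_significant}")
```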

## Regression Analysis

Regression analysis is a widely used technique which is useful for many applications. We introduce the technique here and expand on its uses in subsequent modules.

## Simple Linear Regression

Simple linear regression is a technique that is appropriate to understand the association between one independent (or predictor) variable and one continuous dependent (or outcome) variable. For example, suppose we want to assess the association between total cholesterol (in milligrams per deciliter, mg/dL) and body mass index (BMI, measured as the ratio of weight in kilograms to height in meters squared), where total cholesterol is the dependent variable and BMI is the independent variable. In regression analysis, the dependent variable is denoted Y and the independent variable is denoted X. So, in this case, Y=total cholesterol and X=BMI.

When there is a single continuous dependent variable and a single independent variable, the analysis is called a simple linear regression analysis. This analysis assumes that there is a linear association between the two variables. (If a different relationship is hypothesized, such as a curvilinear or exponential relationship, alternative regression analyses are performed.)

The figure below is a scatter diagram illustrating the relationship between BMI and total cholesterol. Each point represents the observed (x, y) pair, in this case, BMI and the corresponding total cholesterol measured in each participant. Note that the independent variable (BMI) is on the horizontal axis and the dependent variable (Total Serum Cholesterol) is on the vertical axis.

**BMI and Total Cholesterol**

The graph shows that there is a positive or direct association between BMI and total cholesterol; participants with lower BMI are more likely to have lower total cholesterol levels and participants with higher BMI are more likely to have higher total cholesterol levels. Next, suppose we examine the association between BMI and HDL cholesterol.

In contrast, the graph listed below depicts the relationship in between BMI and **HDL cholesterol** in the same sample the n=20 participants.

**BMI and HDL Cholesterol**

This graph shows a negative or inverse association between BMI and HDL cholesterol, i.e., those with lower BMI are more likely to have higher HDL cholesterol levels and those with higher BMI are more likely to have lower HDL cholesterol levels.

For either of these relationships we could use simple linear regression analysis to estimate the equation of the line that best describes the association between the independent variable and the dependent variable. The simple linear regression equation is as follows:

$$\hat{Y} = b_0 + b_1 X$$

where $\hat{Y}$ is the predicted or expected value of the outcome, **X** is the predictor, **b0** is the estimated Y-intercept, and **b1** is the estimated slope. The Y-intercept and slope are estimated from the sample data, and they are the values that minimize the sum of the squared differences between the observed and the predicted values of the outcome, i.e., the estimates minimize:

$$\sum (Y - \hat{Y})^2$$

These differences between observed and predicted values of the outcome are called **residuals**. The estimates of the Y-intercept and slope minimize the sum of the squared residuals, and are called the **least squares estimates** [1].

**Residuals:** Conceptually, if the values of X provided a perfect prediction of Y then the sum of the squared differences between observed and predicted values of Y would be 0. That would mean that variability in Y could be completely explained by differences in X. However, if the differences between observed and predicted values are not 0, then we are unable to entirely account for differences in Y based on X, and there are residual errors in the prediction. The residual error could result from inaccurate measurements of X or Y, or there could be other variables besides X that affect the value of Y.
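To make residuals and the least squares idea concrete, the sketch below fits a line to the gestational age and birth weight data from the correlation example (the BMI data are not listed in this module, so they cannot be used here). The helper names are illustrative. The slope is computed as the sum of cross-products over the sum of squared x deviations, which is algebraically the same as r multiplied by the ratio of the standard deviations of y and x. The code then shows that nudging either coefficient away from the fitted value makes the sum of squared residuals larger.

```python
# Gestational age (weeks) and birth weight (grams) for the 17 infants.
gestational_age = [34.7, 36.0, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
                   38.3, 38.5, 41.4, 39.7, 39.7, 41.1, 38.0, 38.7]
birth_weight = [1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
                2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005]

def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def residuals(xs, ys, intercept, slope):
    """Observed minus predicted outcome for each (x, y) pair."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def sum_squared_residuals(xs, ys, intercept, slope):
    return sum(e ** 2 for e in residuals(xs, ys, intercept, slope))

b0, b1 = fit_least_squares(gestational_age, birth_weight)
ssr_best = sum_squared_residuals(gestational_age, birth_weight, b0, b1)

# Any other line does worse: perturb the slope, then the intercept.
ssr_steeper = sum_squared_residuals(gestational_age, birth_weight, b0, b1 + 10.0)
ssr_shifted = sum_squared_residuals(gestational_age, birth_weight, b0 + 50.0, b1)

print(f"fitted slope: {b1:.1f} grams per week of gestation")
print(f"SSR at the fit: {ssr_best:.0f}")
print(f"SSR with steeper slope: {ssr_steeper:.0f}")
print(f"SSR with shifted intercept: {ssr_shifted:.0f}")
```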

Based on the observed data, the best estimate of a linear relationship will be obtained from an equation for the line that minimizes the differences between observed and predicted values of the outcome. The **Y-intercept** of this line is the value of the dependent variable (Y) when the independent variable (X) is zero. The **slope** of the line is the change in the dependent variable (Y) relative to a one unit change in the independent variable (X). The least squares estimates of the Y-intercept and slope are computed as follows:

$$b_1 = r \, \frac{s_y}{s_x}$$

and

$$b_0 = \bar{Y} - b_1 \bar{X}$$

where r is the sample correlation coefficient, the sample means are $\bar{X}$ and $\bar{Y}$, and $s_x$ and $s_y$ are the standard deviations of the independent variable x and the dependent variable y, respectively.

### BMI and Total Cholesterol

The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and total cholesterol are b0 = 28.07 and b1 = 6.49. These are computed from the sample correlation coefficient, the sample means, and the sample standard deviations of BMI and total cholesterol, as described above.

The estimate of the Y-intercept (b0 = 28.07) represents the estimated total cholesterol level when BMI is zero. Because a BMI of zero is meaningless, the Y-intercept is not informative. The estimate of the slope (b1 = 6.49) represents the change in total cholesterol relative to a one unit change in BMI. For example, if we compare two participants whose BMIs differ by 1 unit, we would expect their total cholesterols to differ by approximately 6.49 units (with the person with the higher BMI having the higher total cholesterol).

The equation of the regression line is as follows:

$$\hat{Y} = 28.07 + 6.49 \, (\text{BMI})$$

The graph below shows the estimated regression line superimposed on the scatter diagram.

The regression equation can be used to estimate a participant's total cholesterol as a function of his/her BMI. For example, suppose a participant has a BMI of 25. We would estimate their total cholesterol to be 28.07 + 6.49(25) = 190.32. The equation can also be used to estimate total cholesterol for other values of BMI. However, the equation should only be used to estimate cholesterol levels for persons whose BMIs are in the range of the data used to generate the regression equation. In our sample, BMI ranges from 20 to 32, thus the equation should only be used to generate estimates of total cholesterol for persons with BMI in that range.
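The prediction rule and its range restriction can be sketched as a small function. The coefficients and the BMI range of 20 to 32 come from the text; the function and constant names are hypothetical, and raising an error outside the observed range is one possible design choice among several (a warning would be another).

```python
# Coefficients and observed BMI range reported for the total cholesterol example.
B0 = 28.07
B1 = 6.49
BMI_MIN, BMI_MAX = 20.0, 32.0

def predict_total_cholesterol(bmi):
    """Estimate total cholesterol (mg/dL) from BMI, within the observed range only."""
    if not BMI_MIN <= bmi <= BMI_MAX:
        raise ValueError(
            f"BMI {bmi} is outside the range of the data ({BMI_MIN} to {BMI_MAX}); "
            "the regression equation should not be extrapolated."
        )
    return B0 + B1 * bmi

estimate = predict_total_cholesterol(25)
print(f"Estimated total cholesterol at BMI 25: {estimate:.2f}")
```

Calling `predict_total_cholesterol(40)` raises a `ValueError` rather than returning an extrapolated estimate.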

There are statistical tests that can be performed to assess whether the estimated regression coefficients (b0 and b1) are statistically significantly different from zero. The test of most interest is usually H0: β1=0 versus H1: β1≠0, where β1 is the population slope (estimated by b1). If the population slope is significantly different from zero, we conclude that there is a statistically significant association between the independent and dependent variables.

### BMI and HDL Cholesterol

The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and HDL cholesterol are as follows: b0 = 111.77 and b1 = -2.35. These are computed in the same way, from the sample correlation coefficient, the sample means, and the sample standard deviations of BMI and HDL cholesterol.

Again, the Y-intercept is uninformative because a BMI of zero is meaningless. The estimate of the slope (b1 = -2.35) represents the change in HDL cholesterol relative to a one unit change in BMI. If we compare two participants whose BMIs differ by 1 unit, we would expect their HDL cholesterols to differ by approximately 2.35 units (with the person with the higher BMI having the lower HDL cholesterol). The figure below shows the regression line superimposed on the scatter diagram for BMI and HDL cholesterol.

Linear regression analysis rests on the assumption that the dependent variable is continuous and that the distribution of the dependent variable (Y) at each value of the independent variable (X) is approximately normally distributed. Note, however, that the independent variable can be continuous (e.g., BMI) or can be dichotomous (see below).

### Comparing Mean HDL Levels With Regression Analysis

Consider a clinical trial to evaluate the efficacy of a new drug to increase HDL cholesterol. We could compare the mean HDL levels between treatment groups statistically using a two independent samples t test. Here we consider an alternate approach. Summary data for the trial are shown below:

| | Sample Size | Mean HDL | Standard Deviation of HDL |
|---|---|---|---|
| New Drug | 50 | 40.16 | 4.46 |
| Placebo | 50 | 39.21 | 3.91 |

HDL cholesterol is the continuous dependent variable and treatment assignment (new drug versus placebo) is the independent variable. Suppose the data on n=100 participants are entered into a statistical computing package. The outcome (Y) is HDL cholesterol in mg/dL and the independent variable (X) is treatment assignment. For this analysis, X is coded as 1 for participants who received the new drug and as 0 for participants who received the placebo. A simple linear regression equation is estimated as follows:

$$\hat{Y} = 39.21 + 0.95 X$$

where $\hat{Y}$ is the estimated HDL level and X is a dichotomous variable (also called an indicator variable, in this case indicating whether the participant was assigned to the new drug or to placebo). The estimate of the Y-intercept is b0=39.21. The Y-intercept is the value of Y (HDL cholesterol) when X is zero. In this example, X=0 indicates assignment to the placebo group. Thus, the Y-intercept is exactly equal to the mean HDL level in the placebo group. The slope is estimated as b1=0.95. The slope represents the estimated change in Y (HDL cholesterol) relative to a one unit change in X. A one unit change in X represents a difference in treatment assignment (placebo versus new drug). The slope therefore represents the difference in mean HDL levels between the treatment groups. Thus, the mean HDL for participants receiving the new drug is:

$$\hat{Y} = 39.21 + 0.95(1) = 40.16$$
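The claim that the intercept equals the placebo mean and the slope equals the difference in group means holds for any least squares fit on a 0/1 indicator. The sketch below demonstrates it on a small set of hypothetical HDL values (the trial's individual-level data are not given in this module, so these numbers are invented purely for illustration), and then checks the arithmetic for the reported coefficients.

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical HDL values in mg/dL; NOT the trial's data.
placebo = [38.0, 40.0, 39.0, 41.0]
new_drug = [40.0, 42.0, 39.0, 43.0]

# Indicator coding: 0 = placebo, 1 = new drug.
xs = [0] * len(placebo) + [1] * len(new_drug)
ys = placebo + new_drug

b0, b1 = fit_least_squares(xs, ys)
mean_placebo = sum(placebo) / len(placebo)
mean_drug = sum(new_drug) / len(new_drug)

print(f"intercept = {b0:.2f}, placebo mean = {mean_placebo:.2f}")
print(f"slope = {b1:.2f}, difference in means = {mean_drug - mean_placebo:.2f}")

# With the coefficients reported in the text, X = 1 recovers the new drug mean.
reported_drug_mean = 39.21 + 0.95 * 1
print(f"reported new drug mean = {reported_drug_mean:.2f}")
```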

-----

A study was conducted to assess the association between a person's intelligence and the size of their brain. Participants completed a standardized IQ test and researchers used Magnetic Resonance Imaging (MRI) to determine brain size. Demographic information, including the patient's gender, was also recorded.

## The Controversy Over Environmental Tobacco Smoke Exposure

There is convincing evidence that active smoking is a *cause* of lung cancer and heart disease. Many studies done in a wide variety of circumstances have consistently demonstrated a strong association and also indicate that the risk of lung cancer and cardiovascular disease (i.e., heart attacks) increases in a dose-related way. These studies have led to the conclusion that active smoking is causally related to lung cancer and cardiovascular disease. Studies in active smokers have had the advantage that the lifetime exposure to tobacco smoke can be quantified with reasonable accuracy, since the unit dose is consistent (one cigarette) and the habitual nature of tobacco smoking makes it possible for most smokers to provide a reasonable estimate of their total lifetime exposure quantified in terms of cigarettes per day or packs per day. Frequently, average daily exposure (cigarettes or packs) is combined with duration of use in years in order to quantify exposure as "pack-years".

It has been much more difficult to establish whether environmental tobacco smoke (ETS) exposure is causally related to chronic diseases like heart disease and lung cancer, because the total lifetime exposure dosage is lower, and it is much more difficult to accurately estimate total lifetime exposure. In addition, quantifying these risks is complicated by confounding factors. For example, ETS exposure is usually classified based on parental or spousal smoking, but these studies are unable to quantify other environmental exposures to tobacco smoke, and the inability to quantify and adjust for other environmental exposures such as air pollution makes it difficult to demonstrate an association even if one existed. As a result, there continues to be controversy over the risk imposed by environmental tobacco smoke (ETS). Some have gone so far as to claim that even very brief exposure to ETS can cause a myocardial infarction (heart attack), but a very large prospective cohort study by Enstrom and Kabat was unable to demonstrate significant associations between exposure to spousal ETS and coronary heart disease, chronic obstructive pulmonary disease, or lung cancer. (It should be noted, however, that the report by Enstrom and Kabat has been widely criticized for methodological problems, and these authors also had financial ties to the tobacco industry.)

Correlation analysis provides a useful tool for thinking about this controversy. Consider data from the British Doctors Cohort. They reported the annual mortality for a variety of diseases at four levels of cigarette smoking per day: never smoked, 1-14/day, 15-24/day, and 25+/day. In order to perform a correlation analysis, I rounded the exposure levels to 0, 10, 20, and 30 respectively.

| Cigarettes Smoked Per Day | CVD Mortality Per 100,000 Men Per Year | Lung Cancer Mortality Per 100,000 Men Per Year |
|---|---|---|
| 0 | 572 | 14 |
| 10 (actually 1-14) | 802 | 105 |
| 20 (actually 15-24) | 892 | 208 |
| 30 (actually >24) | 1025 | 355 |

The figures below show the two estimated regression lines superimposed on the scatter diagram. The correlation with amount of smoking was strong for both CVD mortality (r = 0.98) and for lung cancer (r = 0.99). Note also that the Y-intercept is a meaningful number here; it represents the predicted annual death rate from these diseases in individuals who never smoked. The Y-intercept for prediction of CVD is slightly higher than the observed rate in never smokers, while the Y-intercept for lung cancer is lower than the observed rate in never smokers.
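The correlations and regression lines for the four exposure levels can be reproduced directly from the table. The sketch below uses the rounded exposure values (0, 10, 20, 30) described above; the function and variable names are illustrative.

```python
import math

# Rounded exposure levels and annual mortality per 100,000 men, from the table.
cigarettes_per_day = [0, 10, 20, 30]
cvd_mortality = [572, 802, 892, 1025]
lung_cancer_mortality = [14, 105, 208, 355]

def fit_and_correlate(xs, ys):
    """Return (intercept, slope, r) for the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

cvd_b0, cvd_b1, cvd_r = fit_and_correlate(cigarettes_per_day, cvd_mortality)
lung_b0, lung_b1, lung_r = fit_and_correlate(cigarettes_per_day, lung_cancer_mortality)

print(f"CVD:         intercept = {cvd_b0:.1f}, slope = {cvd_b1:.2f}, r = {cvd_r:.2f}")
print(f"Lung cancer: intercept = {lung_b0:.1f}, slope = {lung_b1:.2f}, r = {lung_r:.2f}")
```

The fitted intercepts can be compared with the observed rates in never smokers (572 for CVD and 14 for lung cancer), and the slopes give the estimated additional deaths per 100,000 men per year for each additional cigarette smoked per day.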

The linearity of these relationships suggests that there is an incremental risk with each additional cigarette smoked per day, and the additional risk is estimated by the slopes. This perhaps helps us think about the consequences of ETS exposure. For example, the risk of lung cancer in never smokers is quite low, but there is a finite risk; various reports suggest a risk of 10-15 lung cancers/100,000 per year. If an individual who never smoked actively was exposed to the equivalent of one cigarette's smoke in the form of ETS, then the regression suggests that their risk would increase by 11.26 lung cancer deaths per 100,000 per year. However, the risk is clearly dose-related. Therefore, if a non-smoker was employed by a tavern with heavy levels of ETS, the risk might be substantially greater.

Finally, it should be noted that some findings suggest that the association between smoking and heart disease is non-linear at the very lowest exposure levels, meaning that non-smokers have a disproportionate increase in risk when exposed to ETS due to an increase in platelet aggregation.

## Summary

Correlation and linear regression analysis are statistical techniques to quantify associations between an independent variable (X), sometimes called a predictor, and a continuous dependent outcome variable (Y). For correlation analysis, the independent variable (X) can be continuous (e.g., gestational age) or ordinal (e.g., increasing categories of cigarettes per day). Regression analysis can also accommodate dichotomous independent variables.


The procedures described here assume that the association between the independent and dependent variables is **linear**. With some adjustments, regression analysis can also be used to estimate associations that follow another functional form (e.g., curvilinear, quadratic). Here we consider associations between one independent variable and one continuous dependent variable. The regression analysis is called simple linear regression - simple in this case refers to the fact that there is a **single independent variable**. In the next module, we consider regression analysis with several independent variables, or predictors, considered simultaneously.