Research XII) The correlation coefficient and the Chi-squared

In this research we’ll discuss the concept of dependence, which will lead us to the correlation coefficient. Before talking about the coefficient itself, though, an explanation of dependence is needed.

In statistics, dependence describes any statistical relationship between two random variables or between bivariate data.
Correlation is any such statistical relationship involving dependence; more precisely, it often refers to how close two variables are to having a linear relationship with each other.

Intuitively, random variables are dependent if they do not satisfy the mathematical property of probabilistic independence. In this broad sense, correlation is synonymous with dependence.

To measure the degree of correlation we need what is called a correlation coefficient. Several correlation coefficients exist, but the most widely used is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables.
It is so common that it is also known simply as “the correlation coefficient” or the bivariate correlation.

It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. Pearson’s correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. In other words, the correlation coefficient ρX,Y between two random variables X and Y with expected values μX and μY and standard deviations σX and σY is defined as

\rho_{X,Y} = \mathrm{corr}(X,Y) = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y},

where E is the expected value, cov is the covariance, and corr is an alternative notation for the correlation coefficient. The correlation coefficient is also symmetric: corr(X,Y) = corr(Y,X).
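As a quick numerical check of this definition (a minimal sketch of my own, not from the original post), the covariance and the standard deviations can be estimated on simulated data with NumPy; the variable names and the simulated relationship are illustrative assumptions:

import numpy as np

# Minimal sketch: estimate rho = cov(X, Y) / (sigma_X * sigma_Y) on simulated data.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.7 * x + rng.normal(scale=0.5, size=10_000)   # Y is linearly related to X

cov_xy = np.cov(x, y)[0, 1]                         # off-diagonal entry of the covariance matrix
rho_hat = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(rho_hat)                                      # close to np.corrcoef(x, y)[0, 1]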

The correlation coefficient is usually denoted by r when it is computed from samples, in which case it may be referred to as the sample correlation coefficient. The sample correlation coefficient r measures the strength and direction of a linear relationship between two variables on a scatterplot.

(figure: examples of scatterplots)

Here too, a correlation coefficient close to 1 indicates that the variables are positively linearly related and the scatter plot falls almost along a straight line with positive slope. A value close to −1 indicates that the variables are negatively linearly related and the scatter plot falls almost along a straight line with negative slope. A value close to zero indicates a weak linear relationship between the variables.
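A small sketch (my own illustration, with made-up data) shows these three situations: data generated along a line with positive slope, along a line with negative slope, and with no linear relation at all:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
noise = rng.normal(size=500)

y_pos = 2.0 * x + 0.3 * noise    # points fall almost along a line with positive slope -> r near +1
y_neg = -2.0 * x + 0.3 * noise   # negative slope -> r near -1
y_none = noise                   # no linear relation with x -> r near 0

for label, y in [("positive", y_pos), ("negative", y_neg), ("none", y_none)]:
    print(label, np.corrcoef(x, y)[0, 1])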

To obtain a formula for r, we only need to substitute sample-based estimates of the covariance and the variances into the formula above. For example, if we have one dataset {x1,…,xn} containing n values and another dataset {y1,…,yn} also containing n values, then the formula for r is:

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}},
where n is the sample size and x_i, y_i are the individual samples indexed by i, while the sample mean is defined as:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i (and the same holds for \bar{y}).

This expresses r directly in terms of sums. The same coefficient can equivalently be written through the corrected sample standard deviations:

r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x s_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}},

where \bar{y} is still the sample mean of Y (and likewise \bar{x} for X), while s_x and s_y are the corrected sample standard deviations of X and Y.
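Both expressions give exactly the same number; a short sketch (the function names and sample data are my own) implements each form and compares them with NumPy's built-in estimate:

import numpy as np

def pearson_r_sums(x, y):
    # r from the summation formula above
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

def pearson_r_std(x, y):
    # equivalent form using the corrected sample standard deviations s_x and s_y
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    num = np.sum((x - x.mean()) * (y - y.mean()))
    return num / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(pearson_r_sums(x, y), pearson_r_std(x, y), np.corrcoef(x, y)[0, 1])  # all three agree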

The coefficient of variation and the correlation coefficient are commonly used to assess the reliability or reproducibility of interval-scale measurements, so there are tests designed to compare dependent reliability or reproducibility parameters.

One of the most used is a likelihood-based test, for which it is interesting to notice how the correlation coefficient is used. We show an experiment described in this paper below:

(figure from the paper: likelihood-based test)

This is one of the possible methods for testing the null hypothesis, the general statement or default position in inferential statistics stating that there is no relationship between two measured phenomena, or no association among groups. Concluding that there are grounds for believing that a relationship between two phenomena exists is a central task in the modern practice of science.

A statistical hypothesis is a hypothesis that is testable on the basis of observing a process modeled via a set of random variables. A statistical hypothesis test, also known as confirmatory data analysis, is a method of statistical inference.
The comparison is deemed statistically significant if the relationship between the data sets would be an unlikely realization of the null hypothesis according to the significance level, which is essentially a threshold probability.

Hypothesis tests are used to determine which outcomes of a study would lead to a rejection of the null hypothesis at a pre-specified level of significance.
The significance tests for the chi-squared statistic and for the correlation coefficient are not exactly identical, but they very often lead to the same statistical conclusion: chi-squared tests are based on the normal distribution, while the significance test for correlation uses the t-distribution.

So, for large sample sizes N, the t and the normal distributions are the same, or at least extremely close.
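As a concrete illustration of this last point (my own sketch, not taken from the cited article): under the null hypothesis of no correlation, the statistic t = r * sqrt((n - 2) / (1 - r^2)) follows a t-distribution with n - 2 degrees of freedom, and SciPy's pearsonr reports the same r and p-value:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

r = np.corrcoef(x, y)[0, 1]
t_stat = r * np.sqrt((n - 2) / (1.0 - r**2))        # t statistic with n - 2 degrees of freedom
p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value

print(t_stat, p_value)
print(stats.pearsonr(x, y))                         # same r and p-value from SciPy's built-in test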

Finally, here is another interesting article to read about the link between the correlation coefficient and the t-distribution.
