When we look at a scatter plot of two variables, if we could
draw a straight line with a ruler close to most of the data, we would be able to
predict the value of one variable from a value of the other. What does a line
of "best fit" mean? Most statistical packages use the method of least squares.
This method uses the data to find the line of best fit by minimizing the sum of
the squares of the vertical distances from the data points to the line. The
distances are squared because the signed distance is negative for a point below
the line, whereas its square is never negative, so distances above and below
the line cannot cancel each other out. Examples of the distance from a point to
the line are indicated by the red lines below. The equation of this line of
best fit can be found from the data.
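
The least squares idea described above can be sketched in a few lines of
Python. This is a minimal illustration, not the routine any particular
statistical package uses: the data values and the function names
least_squares_line and sum_squared_distances are made up for the example. It
uses the standard closed-form formulas for the slope and intercept of the line
that minimizes the sum of squared vertical distances.

```python
# Hypothetical sample data, invented for illustration (not from the text).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_distances(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


slope, intercept = least_squares_line(xs, ys)
print("slope:", slope)
print("intercept:", intercept)
print("sum of squares:", sum_squared_distances(xs, ys, slope, intercept))
```

Nudging the slope or the intercept away from the fitted values and calling
sum_squared_distances again gives a larger total, which is what "best fit"
means under this method.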
This procedure yields an estimate of the equation of the line. An estimate is
not meant to be exactly correct. It provides what some people call a "ballpark
figure": something that is close to the true value and could be exact, but is
not expected to be.
Compare the above graph to the one below. Although the line below goes through
two of the data points, all the rest of the data lie above the line, whereas
for the line above, the numbers of points above and below the line are nearly
equal.
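
The comparison between the two graphs can be reproduced numerically. The sketch
below uses invented data (the graphs' actual values are not given in the text)
and hypothetical helper names. It compares a line drawn through two of the data
points with the least squares line, counting how many points fall above and
below each and computing each line's sum of squared vertical distances.

```python
# Hypothetical data, chosen so that a line through the first and last
# points leaves every other point above it.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 4.0, 5.0, 3.0]


def fit_line(xs, ys):
    """Least squares slope and intercept (standard closed-form formulas)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys, slope, intercept):
    """Signed vertical distances: positive when the point is above the line."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def count_sides(res, tol=1e-9):
    """Return (points above the line, points below the line)."""
    above = sum(1 for r in res if r > tol)
    below = sum(1 for r in res if r < -tol)
    return above, below


def sum_of_squares(res):
    return sum(r * r for r in res)


# Line through the two data points (1, 1) and (5, 3).
two_point_slope = (3.0 - 1.0) / (5.0 - 1.0)
two_point_intercept = 1.0 - two_point_slope * 1.0
two_point_res = residuals(xs, ys, two_point_slope, two_point_intercept)

ls_slope, ls_intercept = fit_line(xs, ys)
ls_res = residuals(xs, ys, ls_slope, ls_intercept)

print("two-point line:", count_sides(two_point_res), sum_of_squares(two_point_res))
print("least squares: ", count_sides(ls_res), sum_of_squares(ls_res))
```

With this data the two-point line has no points below it and a larger sum of
squares, while the least squares line splits the points more evenly.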
If most of the data points are close to the line
of best fit, the two variables are said to be highly correlated; otherwise, they
are weakly correlated. If the slope of the line of best fit is positive (going
up from left to right), the two variables plotted are said to be positively
correlated. If the slope of the line of best fit is negative (going down from
left to right), the two variables being plotted are said to be negatively
correlated. If the data show no linear trend, so that no line describes them
meaningfully, the two variables are said to have no correlation. In the example
above, quality and price are positively correlated.
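
One common way to put a number on these ideas, which the text above does not
name, is Pearson's correlation coefficient r. Its sign matches the sign of the
slope of the least squares line, and its magnitude is close to 1 when the
points lie close to that line and close to 0 when there is no linear trend. The
sketch below is illustrative only: the quality and price values are invented,
and pearson_r is a hypothetical helper name.

```python
import math

# Hypothetical quality scores and prices, invented for illustration.
quality = [1.0, 2.0, 3.0, 4.0, 5.0]
price = [10.0, 14.0, 15.0, 19.0, 22.0]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def direction(r):
    """Describe the sign of the correlation in words."""
    if r > 0:
        return "positively correlated"
    if r < 0:
        return "negatively correlated"
    return "no correlation"


r = pearson_r(quality, price)
print(r, direction(r))

# Reversing the prices flips the direction of the relationship.
r_reversed = pearson_r(quality, list(reversed(price)))
print(r_reversed, direction(r_reversed))
```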
