sample versus population variance

The TimeLord

2005-06-30 04:43:44 UTC

Post by Nevets Steprock
I'm writing a program that will, among other things, calculate the
correlation coefficient between two stocks.

Well, the easiest is the most straightforward:

In data sets {X[i]|1<=i<=N} and {Y[i]|1<=i<=N}, first calculate the average
Xbar and Ybar. Then the correlation coefficient is

rho = (Sum from i=1 to N of ((X[i]-Xbar)*(Y[i]-Ybar)))/
Sqrt[(Sum from i=1 to N of (X[i]-Xbar)^2)*
(Sum from i=1 to N of (Y[i]-Ybar)^2)]
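
For a program, the formula above translates almost line for line. The following is only a minimal sketch in Python; the function name and the price lists are made up for illustration and are not from the original question.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient 'rho' computed directly from the sums above."""
    if len(xs) != len(ys):
        raise ValueError("data sets must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Numerator: sum of products of deviations from the averages.
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations.
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical prices: stock_b is exactly twice stock_a.
stock_a = [10.0, 11.0, 12.0, 13.0]
stock_b = [20.0, 22.0, 24.0, 26.0]
rho = correlation(stock_a, stock_b)
```

Note that no variance divisor (N or N-1) appears anywhere in this version; only the raw sums are used.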

Post by Nevets Steprock
I'm trying to wrap my head around whether to use sample variance or
population variance in my equation.

If the data set is a sample from a wider data set, then it would be the
sample variance, since you are trying to estimate the population. However,
it sounds like you only want some kind of relation between two data sets
without trying to do anything more advanced. So as long as you are
consistent, it doesn't matter. Use "rho" from above.
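
The reason consistency is all that matters: if the covariance and both variances use the same divisor, that divisor cancels out of the ratio. Here is a small sketch that checks this numerically. The function name, the ddof parameter, and the data are assumptions chosen for illustration.

```python
import math

def rho_with_divisor(xs, ys, ddof):
    """Correlation from a covariance and variances that all divide by N - ddof.

    ddof=0 gives the population versions, ddof=1 the sample versions.
    """
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    d = n - ddof
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / d
    var_x = sum((x - xbar) ** 2 for x in xs) / d
    var_y = sum((y - ybar) ** 2 for y in ys) / d
    return cov / math.sqrt(var_x * var_y)

# Made-up data, not from the original post.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 3.0, 3.5, 9.0]
rho_population = rho_with_divisor(xs, ys, ddof=0)
rho_sample = rho_with_divisor(xs, ys, ddof=1)
# The two agree up to floating-point rounding.
```

Mixing the two (say, a covariance divided by N with variances divided by N-1) scales the result by (N-1)/N, so even perfectly linear data would come out below 1.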

Post by Nevets Steprock
I will only be working within certain dates (i.e. between 1998 and 1999) of
the total data, which suggests using sample variance.
I'm concerned about using sample variance when the data is too short. For

[...]

Well, there isn't anything inherently wrong with small data sets, but you do
need to be aware of the effect of "degrees of freedom". Basically, if you
have N samples in your data set, then you have N degrees of freedom for
estimating the mean with the average and N-1 degrees of freedom for
estimating the population variance with the sample variance. The effect of
reduced degrees of freedom is that the act of estimating becomes less
reliable. To find out more, you'll need to look in the statistics books
under "hypothesis testing".
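
To see the N versus N-1 divisor on a short data set, here is a small sketch using Python's standard statistics module, where pvariance divides by N and variance divides by N-1. The three data points are made up.

```python
import statistics

# A deliberately tiny, made-up data set (N = 3).
data = [2.0, 4.0, 6.0]

pop_var = statistics.pvariance(data)    # divides by N
sample_var = statistics.variance(data)  # divides by N - 1

# The ratio sample_var / pop_var is N / (N - 1), which is 1.5 here
# and approaches 1 as N grows.
ratio = sample_var / pop_var
```

With only three points the two variances differ by 50%; with a year of daily prices the difference would be well under 1%.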

[...]

Post by Nevets Steprock
I've searched and searched the internet, but all the pages say the same
thing about using sample variance when the population mean is not known -
no real explanation. Can someone please give me guidance as to the

All this has to do with the theoretical underpinnings of hypothesis testing and
the Central Limit Theorem. Unless you are using the data to test a
hypothesis about the population, don't worry about it.

One situation where this becomes important is Richard Hoagland's analysis
of the face on Mars. Basically, the population data is the
object itself. Then Viking takes a picture, which is a sample of that
object. It turns out that the image with the face was part of a much larger
image. In fact, the face picture was less than 10% likely to be a true
image of an actual face on Mars because of the small number of data points
(pixels) involved. Hoagland further processed the image, thus reducing his
degrees of freedom on the data set (the pixels), with the result that
his processed image was less than 1% likely to be the image of a true
object. Further processing to reveal teeth in the face reduced the
degrees of freedom so far that the likelihood of the resulting image
reflecting anything real was so close to 0% that it is a wonder
anyone believed Richard Hoagland. No wonder that later probes revealed no
face on Mars. This is the kind of situation where hypothesis testing should
have been considered.

Post by Nevets Steprock
preferred variance to use. I would hugely appreciate some justification
(if it's not too much to ask) for the decision. If I use sample variance, I
will eventually be asked by my boss why the example above isn't
'perfectly' correlated

In that case, your boss is just trying to "act smart". "rho" above
gives the correlation between the data sets. However, if your boss wants to
do more, then he should specify and we can proceed from there.

--
// The TimeLord says:
// Pogo 2.0 = We have met the aliens and they are us!