'One forms provisional theories and waits for time or fuller knowledge to explode them.' — The Sussex Vampire, Sir Arthur Conan Doyle

3. RECURSIVE BAYESIAN ESTIMATION: ONE DEPENDENT AND ONE EXPLANATORY VARIABLE

3.1 A SIMPLE MODEL

Below, an estimation method using Bayes' theorem will be developed for the case of one dependent and one explanatory variable. We shall start with a very simple model:

Yt = B.Xt + vt

3.2 THE RECURSIVE BAYESIAN ESTIMATION SYSTEM

The results described below are given in different notation in Harrison and Stevens (1976). First, we define some notation. For any variable Z:

ZEt will denote an estimate of Zt
ZOt will denote an observation of Zt
ZFt will denote a forecast of Zt
Prec(ZEt) will denote the precision of ZEt (where precision is defined as the reciprocal of the variance); Prec(ZFt) and Prec(ZOt) are the precisions of the forecast and the observation.

If the variance of X is s2, the variance of AX (where A is a constant) is A2.s2:

Var(AX) = A2.Var(X)   and so   Prec(AX) = A-2.Prec(X)

Suppose at some time t-1 we have an estimate of B, BEt-1, and its precision Prec(BEt-1). First we forecast Bt; this is easy, as our assumption is that B is constant:

BFt = BEt-1                  equation 3.2.1.
Prec(BFt) = Prec(BEt-1)      equation 3.2.2.

Note that these two equations give us two opportunities. The first is to relax the assumption of constant B (for example, an assumption of declining B could be made). The second is to relax the assumption that the precision with which we know BFt is the same as the precision of BEt-1. This assumption is equivalent to assuming that information about last year's elasticity is just as good when used for this year's elasticity. If the elasticity is actually a variable, such that

Bt = Bt-1 + random

then this assumption would be unwarranted, and Prec(BFt) is less than Prec(BEt-1). These two assumptions will indeed be relaxed when practical applications are studied.
Given our forecast for B, we can forecast Y:

YFt = BFt.Xt                    eq. 3.2.3.
Prec(YFt) = Xt-2.Prec(BFt)      eq. 3.2.4.

But we also have an observation of Y, YOt, and an estimate of the precision with which we know it. This estimate can come from a consideration of how the observation was made, or from estimating YOt using several different methods. So we can use the relationship Y = BX to calculate what could be called an observation of B, defined by:

BOt = YOt/Xt                    eq. 3.2.5.
Prec(BOt) = Xt.Xt.Prec(YOt)     eq. 3.2.6.

Now we can combine the forecast of B with the observation of B to give a new estimate of B, using the usual result for the combination of two pieces of uncertain information. The estimate is a weighted average of the forecast and the observation of B, using the precisions as weights:

BEt = (Prec(BFt).BFt + Prec(BOt).BOt) / (Prec(BFt) + Prec(BOt))     equation 3.2.7.
Prec(BEt) = Prec(BFt) + Prec(BOt)                                   equation 3.2.8.

This completes the recursive estimation procedure for B: BEt and Prec(BEt) are used in the next recursion of the procedure, in equations 3.2.1 and 3.2.2. We can also refine our estimate of Y, by combining the forecast with the observation:

YEt = (Prec(YFt).YFt + Prec(YOt).YOt) / (Prec(YFt) + Prec(YOt))     equation 3.2.9.
Prec(YEt) = Prec(YFt) + Prec(YOt)                                   equation 3.2.10.

The first estimate of B, BE0, may come from prior information about the elasticity (from other research, from elasticities for other countries, or from a subjective estimate). If there is no prior information on B, then Prec(BE0) is zero, and anything can be used for BE0, as the weight given to it in equation 3.2.7 is zero. Then the precision of the first forecast of Y, Prec(YF0), is also zero (by equations 3.2.2 and 3.2.4).

Strictly speaking, a recursive procedure is one that invokes itself in the course of execution, such as N! = N.(N-1)!
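The recursion in equations 3.2.1 to 3.2.8 can be sketched in a few lines of code. This is an illustrative implementation, not from the paper; the function and variable names (update_b, prec_b and so on) and the numerical data are assumptions.

```python
# A minimal sketch of the recursive Bayesian update for the model
# Yt = B.Xt + vt, following equations 3.2.1-3.2.8. Names and data are
# illustrative, not from the paper.

def update_b(b_est, prec_b, x, y_obs, prec_y_obs):
    """One recursion: (BEt-1, Prec(BEt-1)) -> (BEt, Prec(BEt))."""
    # Equations 3.2.1 and 3.2.2: B is assumed constant, so the
    # forecast is just the previous estimate, at the same precision.
    b_fcst, prec_b_fcst = b_est, prec_b
    # Equations 3.2.5 and 3.2.6: an "observation" of B from the data.
    b_obs = y_obs / x
    prec_b_obs = x * x * prec_y_obs
    # Equations 3.2.7 and 3.2.8: precision-weighted combination.
    prec_new = prec_b_fcst + prec_b_obs
    b_new = (prec_b_fcst * b_fcst + prec_b_obs * b_obs) / prec_new
    return b_new, prec_new

# With no prior information, Prec(BE0) = 0 and BE0 is arbitrary.
b, p = 0.0, 0.0
for x, y in [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]:
    b, p = update_b(b, p, x, y, prec_y_obs=1.0)
```

Note that with a zero prior precision, the first pass simply returns the first observation of B, exactly as the text says: the weight given to BE0 in equation 3.2.7 is zero.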
The procedure described above is not recursive in that sense, in that it does not invoke itself, and should perhaps be called iterative. However, if the procedure for calculating BEt from BEt-1 is denoted by BEt = f(BEt-1), then BEt-1 = f(BEt-2), and so on, and the procedure can truly be called recursive. The word "recursive" has become generally adopted for procedures such as this (e.g. Young, 1974), and so will be used in this paper.

3.3 RECURSIVE LEAST SQUARES ESTIMATION: ONE DEPENDENT AND ONE EXPLANATORY VARIABLE

Consider the model

Yt = B.Xt + vt

We want to find an estimate of B, BE, such that the sum of the squared errors is minimized. This estimate is called the least squares estimate. The sum of squared errors is:

S = SUMt((Yt - BE.Xt)2)

This is minimized by setting dS/dBE = 0:

SUMt((Yt - BE.Xt).Xt) = 0
SUMt(Xt.Yt) = BE.SUMt(Xt2)
BE = SUMt(Xt.Yt)/SUMt(Xt2)

and the precision of BE is Prec(BE) = SUMt(Xt2)/s2, where s2 is the variance of the residuals.

Suppose these calculations were done using n-1 observations of X and Y. Using the obvious notation,

BEn-1 = SUMtn-1(Xt.Yt)/SUMtn-1(Xt2)     Prec(BEn-1) = SUMtn-1(Xt2)/s2     equation 3.3.1

Now, suppose another observation Xn, Yn becomes available. Then

BEn = SUMtn(Xt.Yt)/SUMtn(Xt2)     Prec(BEn) = SUMtn(Xt2)/s2

So

Prec(BEn) = Prec(BEn-1) + Xn2/s2     equation 3.3.2

BEn = (SUMtn-1(Xt.Yt) + Xn.Yn) / (SUMtn-1(Xt2) + Xn.Xn)

But SUMtn-1(Xt.Yt) = BEn-1.SUMtn-1(Xt2) = BEn-1.s2.Prec(BEn-1) (from equation 3.3.1), so

BEn = (BEn-1.s2.Prec(BEn-1) + Xn.Yn) / (s2.Prec(BEn-1) + Xn.Xn)
    = (BEn-1.Prec(BEn-1) + Xn.Yn/s2) / (Prec(BEn-1) + Xn.Xn/s2)
    = (BEn-1.Prec(BEn-1) + (Xn.Xn/s2).(Yn/Xn)) / (Prec(BEn-1) + Xn.Xn/s2)     equation 3.3.3

This completes the recursive formula for updating BE.
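The algebra above can be checked numerically: updating BE one observation at a time through equations 3.3.2 and 3.3.3 must reproduce the batch least squares estimate. The data and the residual variance below are assumed values for illustration only.

```python
# A quick numerical check that the recursive formulae 3.3.2 and 3.3.3
# reproduce the batch estimate BE = SUM(Xt.Yt)/SUM(Xt2). All values
# are illustrative assumptions.

s2 = 0.5  # residual variance (assumed)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

# Batch least squares estimate.
be_batch = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Recursive estimate, started from zero prior precision.
be, prec = 0.0, 0.0
for x, y in zip(xs, ys):
    prec_new = prec + x * x / s2                           # equation 3.3.2
    be = (be * prec + (x * x / s2) * (y / x)) / prec_new   # equation 3.3.3
    prec = prec_new

assert abs(be - be_batch) < 1e-12
```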
3.4 THE RELATIONSHIP BETWEEN RECURSIVE BAYESIAN ESTIMATION AND RECURSIVE LEAST SQUARES

It is easily seen that equation 3.3.3 is the same as equation 3.2.7, and equation 3.3.2 the same as equation 3.2.8, with:

BOt replaced by Yn/Xn
Prec(BOt) by Xn.Xn/s2
BFt (= BEt-1) by BEn-1
Prec(BFt) (= Prec(BEt-1)) by Prec(BEn-1)

The equivalence between recursive Bayesian estimation and least squares estimation is now clear. This is not an obvious result. Least squares is only one of many possible ways of estimating parameters; other methods include minimizing the sum of absolute errors, regression of Y on X or of X on Y, and minimizing some other function of the errors. Least squares has the advantage of computational convenience, and the least squares estimator is also the best linear unbiased estimator, BLUE (i.e. it has the least variance of all the unbiased linear estimators; Maddala, 1977, p. 75). Maddala (1977, p. 83) shows that the least squares estimator of B is the same as the maximum likelihood estimator. Another consequence is that the estimators of Z that have been denoted by ZE can now be seen to be maximum likelihood (ML) estimators (Maddala, 1977, p. 83).

3.5 FORECASTING

The model Yt = B.Xt can be used for forecasting if there are forecasts for B and for Xt. The variable Xt is the explanatory variable for Yt; if no forecasts for Xt are available then Yt cannot be forecast. It follows that if the purpose of model-building is to prepare forecasts (not the only possible purpose), then the researcher would be unwise to estimate a model Y = BX if there were no forecasts of X available, or expected to become available.

B is forecast very simply: in this model it is assumed constant, so the forecast for B at all times subsequent to the last year of data is the last estimate of B. The precision of the forecast for Yt is Xt-2.Prec(B) (by equation 3.2.4).
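The forecasting step in 3.5 is mechanically simple, as a short sketch shows. The numbers, and the function name forecast_y, are illustrative assumptions; the precision formula is equation 3.2.4.

```python
# A sketch of forecasting with the fitted model (section 3.5): a forecast
# of Y for a future period needs a forecast of X, and its precision
# follows equation 3.2.4. All numbers are illustrative.

b_est = 1.5    # last estimate of B (assumed)
prec_b = 40.0  # Prec(BE) from the last recursion (assumed)

def forecast_y(x_future):
    y_fcst = b_est * x_future                # YF = BF.X, with BF = last BE
    prec_y = prec_b / (x_future * x_future)  # Prec(YF) = X^-2 . Prec(BF)
    return y_fcst, prec_y

y_hat, p_hat = forecast_y(4.0)
```

Note how the forecast precision falls as X grows: a given uncertainty in B is magnified by a large explanatory variable.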
3.6 HOW THE FIVE PROBLEMS CAN BE TREATED

The five problems described in 1.1.1 to 1.1.5 will now be examined in 3.6.1 to 3.6.5, to check whether recursive Bayesian estimation offers any solution.

3.6.1 PRELIMINARY DATA

Five problems relating to preliminary data were listed in 1.1.1. These were:

(a) Should preliminary data be used when estimating the model?
(b) How should the lower accuracy of preliminary data be taken account of by the estimation method?
(c) Should the preliminary data be used when preparing the forecasts?
(d) If the forecast is expected to be better than the preliminary data, should the preliminary data be ignored?
(e) How should the preliminary data be used in forecasting?

An important difference between preliminary data and final data is their precision. It is assumed here that the preliminary data are an unbiased estimate of the final data; if they are not, then the bias should be estimated and corrected for, for example by a simple model describing the relationship between final and preliminary data, such as:

Yfinal = A + B.Ypreliminary + e

The precision of the data is taken account of in equations 3.2.6 (used in the estimation of the model) and 3.2.9 (used in calculating the maximum likelihood estimate of the dependent variable, and for forecasting). So the answers to the five preliminary data problems are:

(a) Yes.
(b) By feeding the lower estimate of the precision into equations 3.2.6, 3.2.9 and 3.2.10.
(c) Yes.
(d) No.
(e) In the same way as final data, but taking the lower precision into account in equations 3.2.9 and 3.2.10.

Thus, no special arrangements need to be made for the treatment of preliminary data, apart from correcting for bias (if it exists) and telling the recursive Bayesian estimation procedure the lower precision of the preliminary data.

3.6.2 IMPRECISE DATA

There is an obvious way to treat imprecise data, or even data of variable precision, if the recursive Bayesian estimation procedure is used.
This is simply to introduce into the recursive Bayesian estimation procedure the various precisions of the data, in equations 3.2.6, 3.2.9 and 3.2.10.

3.6.3 INCORPORATION OF OTHER INFORMATION

Information about the parameter B from other sources (whether cross-sectional analysis, other objective analysis or subjective data) can be incorporated into the estimation very easily, by specifying the first estimate of B, BE0, as the estimate from the other source, and by specifying its precision, Prec(BE0).

3.6.4 THE IRRELEVANCE OF OLD DATA

Equation 3.2.2 says that the parameter B, estimated with data up to and including time t-1, suffers no loss of relevance when applied to time t: Prec(BFt) = Prec(BEt-1). The real world is not always like that. Sometimes we may believe that the parameter B is changing by

Bt+1 = Bt + wt

where wt is a normal random variable, N(0, Wt). Thus, even if we knew B with absolute precision at time t, we could not forecast it with complete accuracy at time t+1. In this case the precision of B at time t is infinite, but the precision at time t+1 can be calculated by

Var(Bt+1) = Var(Bt + wt) = Var(Bt) + Var(wt) = 0 + Wt = Wt

and so the precision of Bt+1 is 1/Wt. Similarly, the precision of Bt+2 is 1/(Wt + Wt+1), and the precision continues to fall as time passes unless the Wt are zero. This is similar (but not identical) to the Hildreth-Houck model described in Hildreth and Houck (1968). The difference is that the Hildreth-Houck model is

Bt+1 = Bmean + wt

whereas in the model above B describes a random walk. Young (1974, p. 213) suggests that random-walking coefficients could be useful, and (for example) Borooah and Chakravarty (1978) use such a model, and report a much improved fit compared to OLS estimation with constant coefficients. Harrison and Stevens (1976) also suggest this model. Using the notation above, Wt would govern the rate at which old information becomes irrelevant.
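The precision decay described above can be sketched directly, since variances of independent increments add. This is an illustrative fragment with assumed values of W, not part of the paper's procedure.

```python
# A sketch (illustrative values) of how precision decays when B follows
# a random walk B(t+1) = B(t) + w(t), Var(w(t)) = W(t): the forecast
# variance is the current variance plus the accumulated drift variances.

def forecast_precision(prec_now, ws):
    """Precision of the parameter forecast after drift variances ws."""
    var_now = 0.0 if prec_now == float("inf") else 1.0 / prec_now
    return 1.0 / (var_now + sum(ws))

# B known exactly at time t (infinite precision), W = 0.25 per period:
p1 = forecast_precision(float("inf"), [0.25])        # 1/W one step ahead
p2 = forecast_precision(float("inf"), [0.25, 0.25])  # 1/(2W) two steps ahead
```

A shock such as the one discussed next would simply be a large W for that period, dropping the forecast precision sharply.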
Note that there is no need for W to be constant (as it is in discounted least squares). When the researcher is aware of a shock to the system that may disturb the parameter B (such as the 1973 oil crisis), he may wish to increase Wt to reflect the lower relevance of pre-shock information to the present day.

3.6.5 RESIDUALS

The forecasting procedure given in 3.5 above will cope with the residuals problem described in 1.1.5. The forecasts will be somewhere between the two extreme possibilities, and will take account of the most recent residuals to the extent that their precisions and relevance warrant. This subject will be covered in more detail in part 11.3 of this paper.

3.7 NON-NORMAL DISTRIBUTIONS

The prior distributions and likelihoods have, up till now, been assumed to be normal. Other likelihood functions and conjugate prior distributions exist. One example of this is the binomial likelihood function and its conjugate prior distribution, the beta. The beta distribution Beta(a,b) is defined by the probability density function

p(T=t) = t(a-1).(1-t)(b-1).GAMMA(a+b)/(GAMMA(a).GAMMA(b))     for t between 0 and 1

Thus, t cannot be greater than 1 or less than 0. Such a constraint (t between 0 and 1) would be appropriate for observations of market share, for model parameters such as those of the Cobb-Douglas function, or for any other parameter which, by definition, lies between 0 and 1.

The procedure will be developed along the lines of that in 3.2 above, and the equation numbers will correspond. Consider a model of the form Y = B.X, and suppose that the structure of the model is such that B necessarily lies between 0 and 1. Suppose that at time t-1 an estimate for B exists, BEt-1, which is distributed as the beta distribution Beta(mt-1, nt-1). The mean of this distribution (and so the maximum likelihood estimate for B at time t-1) is mt-1/(mt-1+nt-1). The variance is mt-1.nt-1/((mt-1+nt-1)2.(mt-1+nt-1+1)).
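The Beta(m, n) moments quoted above are easy to verify numerically. The values below are assumed for illustration.

```python
# A quick numeric sketch (values assumed) of the Beta(m, n) moments
# quoted in the text: mean m/(m+n) and
# variance m.n/((m+n)^2 . (m+n+1)).

def beta_moments(m, n):
    mean = m / (m + n)
    var = m * n / ((m + n) ** 2 * (m + n + 1))
    return mean, var

mean, var = beta_moments(3.0, 2.0)
```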
First we forecast Bt; this will be the beta-binomial analogue of equations 3.2.1 and 3.2.2. BEt-1 is distributed as Beta(mt-1, nt-1), and so the forecast BFt will be:

BFt is Beta(mt-1, nt-1)     eqs. 3.7.1 and 3.7.2.

The same comments as were made at this point in section 3.2 can be made here: the assumption of constant B could be relaxed, and/or the assumption that the precision of BFt is the same as the precision of BEt-1 could be relaxed. If BFt is Beta(k.mt-1, k.nt-1), where k is between 0 and 1, then the mean of BFt equals the mean of BEt-1, but the precision is less.

Given the forecast for Bt, we can forecast Y:

YFt = Xt.mt-1/(mt-1+nt-1)       eq. 3.7.3.
YFt is Xt.Beta(mt-1, nt-1)      eq. 3.7.4.

The observation of Yt gives us an observation of Bt. If the observation of Yt has mean YOt and variance 1/Prec(YOt), then, since Bt = Yt/Xt, the observation of Bt has mean YOt/Xt and variance 1/(Xt2.Prec(YOt)). If BOt is distributed as the binomial distribution binomial(a,b), then, by equating the mean and variance,

a.b = YOt/Xt
a.b.(1-b) = 1/(Xt2.Prec(YOt))

and, solving for a and b,

a = (YOt/Xt) . 1/(1 - 1/(Xt.YOt.Prec(YOt)))     eq. 3.7.5.
b = 1 - 1/(Xt.YOt.Prec(YOt))                    eq. 3.7.6.

Using the result of 2.3.1,

BEt is Beta(mt, nt)     eqs. 3.7.7 and 3.7.8.

where mt = mt-1 + BOt and nt = nt-1 + a - BOt, with a defined as above. The estimate of Yt, YEt, can be refined as in equations 3.2.9 and 3.2.10.
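The conjugate update underlying equations 3.7.7 and 3.7.8 can be sketched in its standard form: a Beta(m, n) prior combined with s successes in a trials gives a Beta(m+s, n+a-s) posterior (in the text, s corresponds to BOt). This fragment and its numbers are illustrative, not the paper's worked example.

```python
# A minimal sketch of the conjugate Beta-Binomial update behind
# equations 3.7.7-3.7.8: prior Beta(m, n) plus s successes in a trials
# gives posterior Beta(m + s, n + a - s). Values are illustrative.

def beta_update(m, n, successes, trials):
    return m + successes, n + trials - successes

def beta_mean(m, n):
    # The mean of Beta(m, n), used as the point estimate in the text.
    return m / (m + n)

m, n = beta_update(2.0, 3.0, successes=4, trials=5)
```

The posterior mean lies between the prior mean and the observed success rate, exactly as the precision-weighted average of equation 3.2.7 did in the normal case.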