Evaluating heterogeneous forecasts for vintages of macroeconomic variables

There are various reasons why professional forecasters may disagree in their quotes for macroeconomic variables. One reason is that they target different vintages of the data. We propose a novel method to test for forecast bias in the presence of such unobserved heterogeneity. The method is based on so-called Symbolic Regression, where the variables of interest become interval variables. We associate the interval containing the vintages of the data with the interval spanned by the forecasts. An illustration using 18 years of forecasts for annual USA real GDP growth, given by the Consensus Economics forecasters, shows the relevance of the method.


Introduction and motivation
This paper is about the well-known Mincer and Zarnowitz (1969) (MZ) auxiliary regression, which is often used to examine (the absence of) bias in forecasts 1 . This regression, in general terms, reads as

$y_t = \beta_0 + \beta_1 \hat{y}_t + \varepsilon_t$

Usually, the statistical test of interest concerns $\beta_0 = 0$ and $\beta_1 = 1$, jointly.
The setting in this paper concerns macroeconomic variables. Many such variables experience revisions. For a variable like the growth rate of real Gross Domestic Product (GDP), after the first release there can be at least five revisions for various OECD countries 2 .
The second feature of our setting is that forecasts are often created by a range of professional forecasters. In the present paper, for example, we consider the forecasters collected in Consensus Economics 3 . To evaluate the quality of the forecasts from these forecasters, one often takes the average quote (the consensus) or the median quote, and sometimes measures of dispersion like the standard deviation or the variance are also considered. The latter measures give an indication of the extent to which the forecasters disagree. Recent relevant studies are Capistran and Timmermann (2009), Dovern, Fritsche, and Slacalek (2012), Lahiri and Sheng (2010), Laster, Bennett, and Geoum (1999), and Legerstee and Franses (2015).
One reason for disagreement could be heterogeneity across forecasters, caused by differences in how they create their forecasts. 1

1 Bias in forecasts can come from including inappropriate information in the creation of the forecasts. Professional forecasters may rely on econometric models with a range of potentially relevant variables, but they may also decide not to incorporate econometric models at all and base their forecasts on intuition, or they may decide to manually adjust econometric model forecasts. The results summarized in Franses (2014) show that such manual adjustment or fully ignoring an econometric model can lead to substantial bias in forecasts.
Recently, Clements (2019) suggested that there might be another reason why forecasters disagree, and that is, that they may target at different vintages of the macroeconomic data.
Some forecasters may be concerned with the first (flash) quote, while others may have the final (say, after 5 years) value in mind. The problem, however, is that the analyst does not know who is doing what.
It is not easy to learn from the actual forecasts how they were created, nor is it easy to learn how forecast revisions are created. Clements (2019) proposes a few assumptions, and with these he documents for a few variables that data revisions can be predictable. Aruoba (2008) also documents that data revisions can sometimes be viewed as noise rather than news, meaning that they can be predicted.
The question then becomes how one should deal with the MZ regression. Of course, one can run the regression of each vintage on the mean of the forecasts. But even then, without knowing who is targeting what, it will be difficult to interpret the estimated parameters in the MZ regression. At the same time, why would one want to reduce or remove heterogeneity by only looking at the mean? It could be that the range spanned by the vintages widens over time, but it could also be otherwise. We do not assume that the targets of the forecasters interact with the range spanned by the vintages.
To alleviate these issues, in this paper we propose to keep intact the heterogeneity of the realized values of the macroeconomic variables as well as the unobserved heterogeneity across the quotes of the professional forecasters. Our proposal relies on moving away from scalar measurements to interval measurements. Such data are typically called symbolic data, see for example Bertrand and Goupil (1999) and Billard and Diday (2007). The MZ regression for such symbolic data thus becomes a so-called Symbolic Regression.
The outline of our paper is as follows. In the next section we provide more details about the setting of interest. For ease of reading, we will regularly refer to our illustration for annual USA real growth rates, but the material in this section can be translated to a much wider range of applications. The following section deals with the estimation methodology for the Symbolic Regression. We also run various simulation experiments to examine the reliability of the methods. Next, we apply the novel Symbolic MZ Regression to the USA growth rates data and compare the outcomes with what one would have obtained if specific vintages had been considered. It appears that the Symbolic MZ Regression is informative. The final section concludes and discusses limitations and further research issues.

Setting
Consider the $I$ vintages of data for a macroeconomic variable, denoted $y_{i,t}$, where $i = 1, 2, \ldots, I$ and $t = 1, 2, \ldots, T$. In our illustration below we will have $I = 7$ and $t = 1996, 1997, \ldots, 2013$, so $T = 18$. Forecasts are created in months $m = 1, 2, \ldots, 24$, running from January of year $t-1$ up to and including December of year $t$. The number of forecasters can change per month and per forecast target, hence we write $N_{m,t}$.
In Table 1, which concerns 2013, this number is 29; in our notation, $N_{m,2013} = 29$. A key issue to bear in mind for later, and as indicated in the previous section, is that we do not observe which vintage each quote $\hat{y}_{j,t|m}$, with $j = 1, 2, \ldots, N_{m,t}$, is aiming for; that is, we do not know which of the forecasters is targeting which vintage of the data.
To run a Mincer Zarnowitz (MZ) regression, the forecasts per month are usually summarized by taking the median, by using a variance measure, or by the mean ("the consensus"), that is, by considering

$\bar{\hat{y}}_{t|m} = \frac{1}{N_{m,t}} \sum_{j=1}^{N_{m,t}} \hat{y}_{j,t|m}$

The MZ regression then considered in practice is

$y_{i,t} = \beta_0 + \beta_1 \bar{\hat{y}}_{t|m} + \varepsilon_t$

for $t = 1, 2, \ldots, T$, and this regression can be run for each $i = 1, 2, \ldots, I$. Under the usual assumptions, parameter estimation can be done by Ordinary Least Squares. Next, one computes the Wald test for the joint null hypothesis $\beta_0 = 0, \beta_1 = 1$.
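For concreteness, the standard practice just described can be sketched in a few lines of Python. This is our own illustrative sketch, not the paper's code (which is written in Matlab and R); the function name `mz_wald` is hypothetical.

```python
import numpy as np

def mz_wald(realized, consensus):
    """OLS estimation of y = beta0 + beta1 * f + eps, plus the Wald
    statistic for the joint null beta0 = 0, beta1 = 1 (chi-squared
    with 2 degrees of freedom under the usual assumptions)."""
    y = np.asarray(realized, dtype=float)
    f = np.asarray(consensus, dtype=float)
    X = np.column_stack([np.ones_like(f), f])   # intercept and consensus forecast
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)       # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)       # OLS covariance matrix
    diff = beta - np.array([0.0, 1.0])          # deviation from (0, 1)
    wald = diff @ np.linalg.solve(cov, diff)
    return beta, wald
```

Running such a regression for each vintage $i$ and each forecast month reproduces the exercise described above.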
Now, one can run this MZ test for each vintage of the data, but then still it is unknown what the estimated parameters in the MZ regression actually reflect. Therefore, we propose an alternative approach. We propose to consider, for $t = 1, 2, \ldots, T$, the interval $(\min_i y_{i,t};\ \max_i y_{i,t})$ as the dependent variable, instead of $y_{i,t}$, and to consider $(\min_j \hat{y}_{j,t|m};\ \max_j \hat{y}_{j,t|m})$ as the explanatory variable, instead of the consensus forecast. These two new variables are intervals, and often they are called symbolic variables. The MZ regression thus also becomes a so-called Symbolic Regression, see Bertrand and Goupil (1999) and Billard and Diday (2000, 2003, 2007). Table 2 presents an exemplary dataset for May in year t, so m = 17. Figure 1 visualizes the same data in a scatter diagram. Clearly, instead of points as in the simple regression case, the data can now be represented as rectangles.
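The construction of the two interval variables can be illustrated with a toy example. The numbers below are made up for illustration only (they are not the Table 2 data):

```python
import numpy as np

# Hypothetical toy data: one row per year t.
vintages = np.array([[2.4, 2.6, 2.7],     # I = 3 vintages of the realization
                     [1.1, 0.9, 1.0],
                     [3.0, 3.3, 3.2]])
quotes = np.array([[2.1, 2.8, 2.5, 2.3],  # N = 4 forecaster quotes in month m
                   [0.7, 1.2, 1.0, 1.4],
                   [2.9, 3.5, 3.1, 3.0]])

# The symbolic (interval) dependent and explanatory variables:
y_interval = np.column_stack([vintages.min(axis=1), vintages.max(axis=1)])
x_interval = np.column_stack([quotes.min(axis=1), quotes.max(axis=1)])
# Each row now describes a rectangle (x-interval by y-interval),
# as in the scatter diagram of Figure 1.
```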

How does Symbolic Regression work?
When we denote the dependent interval variable for short as $y_t$ and the explanatory interval variable as $x_t$, we can compute the components of the Symbolic MZ Regression as follows. Under the assumption that the data are uniformly distributed within the intervals 4 , Billard and Diday (2000) derive the following results. First, the averages are

$\bar{y} = \frac{1}{2T} \sum_{t=1}^{T} (\max y_t + \min y_t)$ and $\bar{x} = \frac{1}{2T} \sum_{t=1}^{T} (\max x_t + \min x_t)$

4 Even when there are clusters of forecasters who target specific vintages, the data can be uniformly distributed. Or at least, it will be hard to reject such a uniform distribution in practice. An interesting area for further research is the potentially plausible occurrence of outlying observations, that is, the case where all forecasters behave similarly and just one forecaster takes a position at the far other end of the spectrum. For the data that we consider in this paper we do not observe such behavior, but for other variables this may occur.
The covariance is computed as

$S_{xy} = \frac{1}{4T} \sum_{t=1}^{T} (\max x_t + \min x_t)(\max y_t + \min y_t) - \bar{x}\bar{y}$

Finally, the variance is computed as

$S_x^2 = \frac{1}{3T} \sum_{t=1}^{T} \left( (\max x_t)^2 + (\max x_t)(\min x_t) + (\min x_t)^2 \right) - \bar{x}^2$

These expressions complete the relevant components to estimate the parameters, that is, $\hat{\beta}_1 = S_{xy}/S_x^2$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.
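The moment expressions above translate directly into code. The following Python sketch is our own illustration (the function name is hypothetical, and the formulas follow our reading of Billard and Diday, 2000):

```python
import numpy as np

def symbolic_ols(y_lo, y_hi, x_lo, x_hi):
    """Interval-data regression estimates (beta0_hat, beta1_hat) from the
    interval means, variance, and covariance under within-interval uniformity."""
    y_lo, y_hi = np.asarray(y_lo, float), np.asarray(y_hi, float)
    x_lo, x_hi = np.asarray(x_lo, float), np.asarray(x_hi, float)
    T = len(y_lo)
    x_bar = (x_lo + x_hi).sum() / (2 * T)
    y_bar = (y_lo + y_hi).sum() / (2 * T)
    # interval-data variance of x
    s_xx = (x_lo**2 + x_lo * x_hi + x_hi**2).sum() / (3 * T) - x_bar**2
    # interval-data covariance of x and y
    s_xy = ((x_lo + x_hi) * (y_lo + y_hi)).sum() / (4 * T) - x_bar * y_bar
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1
```

Note that with degenerate intervals (min equal to max) these expressions collapse to the standard OLS estimator for point data.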

Standard errors
To compute standard errors around the thus obtained parameter estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, we resort to the bootstrap. By collecting T random draws of pairs of intervals, with replacement, and by repeating this B times, we compute the bootstrapped standard errors. Together, these are used to compute the joint Wald test for the null hypothesis that $\beta_0 = 0, \beta_1 = 1$.
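A sketch of this bootstrap step in Python, under our reading of the procedure (names are our own, and the interval-based estimator is defined inside the snippet so that it is self-contained):

```python
import numpy as np

def symbolic_ols(y_lo, y_hi, x_lo, x_hi):
    """Interval-data regression estimates (beta0_hat, beta1_hat)."""
    y_lo, y_hi, x_lo, x_hi = (np.asarray(a, float) for a in (y_lo, y_hi, x_lo, x_hi))
    T = len(y_lo)
    x_bar = (x_lo + x_hi).sum() / (2 * T)
    y_bar = (y_lo + y_hi).sum() / (2 * T)
    s_xx = (x_lo**2 + x_lo * x_hi + x_hi**2).sum() / (3 * T) - x_bar**2
    s_xy = ((x_lo + x_hi) * (y_lo + y_hi)).sum() / (4 * T) - x_bar * y_bar
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def bootstrap_wald(y_lo, y_hi, x_lo, x_hi, B=2000, seed=0):
    """Resample the T pairs of intervals with replacement B times, compute
    bootstrapped standard errors, and form the Wald statistic for the
    joint null beta0 = 0, beta1 = 1."""
    rng = np.random.default_rng(seed)
    data = np.column_stack([y_lo, y_hi, x_lo, x_hi]).astype(float)
    T = data.shape[0]
    draws = np.empty((B, 2))
    for b in range(B):
        s = data[rng.integers(0, T, size=T)]   # one bootstrap resample of rows
        draws[b] = symbolic_ols(s[:, 0], s[:, 1], s[:, 2], s[:, 3])
    est = np.array(symbolic_ols(y_lo, y_hi, x_lo, x_hi))
    cov = np.cov(draws, rowvar=False)          # bootstrap covariance of estimates
    diff = est - np.array([0.0, 1.0])
    wald = diff @ np.linalg.solve(cov, diff)   # compare with chi2(2), 5.99 at 5%
    return est, np.sqrt(np.diag(cov)), wald
```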

Simulations
To learn how Symbolic Regression and the bootstrapping of standard errors work, we run some simulation experiments. To save notation, we take as the Data Generating Process (DGP)

$y_t = \alpha + \beta x_t + \varepsilon_t$ for $t = 1, 2, \ldots, N$

We set $x_t \sim N(0,1)$ and $\varepsilon_t \sim N(0, \sigma^2)$. Next, we translate the thus generated $y_t$ and $x_t$ into intervals by creating

$(y_t - |u_{1,t}|;\ y_t + |u_{2,t}|)$ and $(x_t - |v_{1,t}|;\ x_t + |v_{2,t}|)$

where $u_{k,t} \sim N(0, \sigma_u^2)$, $k = 1, 2$, and $v_{k,t} \sim N(0, \sigma_v^2)$, $k = 1, 2$. We set the number of simulation runs at 1000, and the number of bootstrap runs at B = 2000 (as suggested to be a reasonable number in Efron and Tibshirani, 1993). Experimentation with larger values of B did not show markedly different outcomes. The code is written in Matlab and R. We set N at 20 and 100, while $\alpha = 0$ or 5, and $\beta = -2$, 0, or 2. The results are in Tables 3 to 6. Table 3 shows, when we compare the cases $\sigma_v^2 = 0.5$ versus $\sigma_v^2 = 2.0$, that a larger interval for the explanatory variable creates more bias than a larger interval for the dependent variable ($\sigma_u^2 = 0.5$ versus $\sigma_u^2 = 2.0$). Also, the bootstrapped standard errors get larger when the intervals of the data get wider, as expected. Table 4 is the same as Table 3, but now $\sigma^2 = 0.5$ is replaced by $\sigma^2 = 2.0$. Overall this means that $\hat{\beta}$ deviates more from $\beta$ when the variance $\sigma^2$ increases. The differences across the deviations of $\hat{\alpha}$ versus $\alpha$ are relatively small. Table 5 is the same as Table 3, but now $N = 20$ is replaced by $N = 100$. Clearly, a larger sample size entails less bias in the estimates, and also much smaller bootstrapped standard errors. But still, we see that $\hat{\alpha}$ is closer to $\alpha$ than $\hat{\beta}$ is to $\beta$. Table 6 is similar to Table 4, but now for $N = 100$. A larger sample can offset the effects of the increased variance $\sigma^2$, as the standard errors are reasonably small.
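A small-scale Monte Carlo in this spirit can be sketched as follows. This is a Python illustration under our reading of the DGP (the paper's own experiments are coded in Matlab and R, and the exact interval construction is our assumption); it also reproduces, qualitatively, the attenuation of the slope estimate caused by interval noise in the explanatory variable:

```python
import numpy as np

def symbolic_fit(y_lo, y_hi, x_lo, x_hi):
    """Interval-data estimates (beta0_hat, beta1_hat), as in the text."""
    T = len(y_lo)
    x_bar = (x_lo + x_hi).sum() / (2 * T)
    y_bar = (y_lo + y_hi).sum() / (2 * T)
    s_xx = (x_lo**2 + x_lo * x_hi + x_hi**2).sum() / (3 * T) - x_bar**2
    s_xy = ((x_lo + x_hi) * (y_lo + y_hi)).sum() / (4 * T) - x_bar * y_bar
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def simulate_once(N, alpha, beta, sigma2, sigma2_u, sigma2_v, rng):
    """One replication of the DGP y = alpha + beta * x + eps; the points are
    widened into intervals with half-normal perturbations of the bounds
    (our assumption about the interval construction)."""
    x = rng.normal(0.0, 1.0, N)
    y = alpha + beta * x + rng.normal(0.0, np.sqrt(sigma2), N)
    u = np.abs(rng.normal(0.0, np.sqrt(sigma2_u), (2, N)))
    v = np.abs(rng.normal(0.0, np.sqrt(sigma2_v), (2, N)))
    return y - u[0], y + u[1], x - v[0], x + v[1]

rng = np.random.default_rng(0)
est = np.array([symbolic_fit(*simulate_once(100, 0.0, 2.0, 0.5, 0.5, 0.5, rng))
                for _ in range(200)])
b0_mean, b1_mean = est.mean(axis=0)
# Wider x-intervals (larger sigma2_v) pull b1_mean further below beta,
# mirroring the bias pattern reported in Tables 3 to 6.
```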
In Table 7 we report on simulations in which we assume that there is autocorrelation in the forecast revisions. We now consider errors that follow the first-order autoregression $\varepsilon_t = \rho \varepsilon_{t-1} + \eta_t$, with the convention $\varepsilon_0 = 0$. We again set the number of simulation runs at 1000 and the number of bootstrap runs at B = 2000. We set N at 100, while $\alpha = 0$ and $\beta = -2$, 0, or 2, and we choose $\rho = 0.2$ or 0.5. The results in Table 7 show that the method performs well, also when there is autocorrelation in the revisions.

Analysis of forecasts
We now turn to an illustration of the Symbolic MZ Regression. We consider the forecasts for annual growth rates of real GDP in the USA, for the years 1996 up to and including 2013. This makes $T = 18$. Our data source 5 gives annualized growth rates per quarter 6 . As there are no vintages of true annual growth data available, we further consider, for each year, the average of the four quarterly growth rates. The data intervals are presented in Table 2. The right-hand side columns of Table 2 concern the forecasts created in May of year t, which is the case where m = 17. In total, we can consider 24 Symbolic MZ Regressions, one for each of the 24 months. Table 8 presents the estimation results, the bootstrapped standard errors, and the p value of the Wald test for the null hypothesis that $\beta_0 = 0, \beta_1 = 1$. We see from the last column that a p value > 0.05 first appears for the forecasts quoted in May of year t-1, and that after that the p value stays in excess of 0.05. However, if we look at the individual parameter estimates, we see that they approach 0 and 1 only gradually as the forecast origin moves closer to the target year. Let us now turn to the MZ regression in its standard format, that is, the explanatory variable is the mean of the forecasts and the variable to be explained is one of the vintages of the data. Table 9 presents the results for the first (flash) release of real GDP annual growth rates, whereas Table 10 presents the results for the currently available vintage. We also have the results for all vintages in between, but these do not add much to the conclusions that can be drawn from Tables 9 and 10.
First, we see that the standard errors in Tables 9 and 10 are much smaller than the bootstrapped standard errors for the Symbolic MZ Regression. This of course does not come as a surprise, as we have point data instead of intervals. For the first vintage of data in Table 9, we see from the p values for the Wald test in the last column that only since March of year t the null hypothesis of no bias cannot be rejected (p value is 0.485). One month earlier, the p value is 0.071, but for that month we see that $\beta_1 = 1$ is not in the 95% confidence interval (the estimate is 0.787 with a standard error of 0.098). Note, by the way, that the forecasts created in the very last month of the current year (December, year t) are biased (p value of 0.012), at least for the first release data. For the currently available vintage in Table 10, the p values also rise towards the end of year t, but note that $\beta_1 = 1$ is not in the 95% confidence interval for 23 of the 24 months. Only for the forecasts in December of year t the forecasts do not seem biased (p value of 0.115, and $\beta_1 = 1$ is in the 95% confidence interval, with an estimate of 0.820 and a standard error of 0.088).
In sum, individual MZ regressions for separate vintages of the data deliver confusing outcomes that are hard to interpret, the more so because we effectively do not know which of the forecasters is targeting which vintage. The outcomes of the Symbolic MZ Regression, in contrast, are much more coherent and straightforward to interpret. Of course, due to the very nature of the data, that is, intervals versus points, statistical precision in the Symbolic Regression is smaller, but the results seem to have much more face value and interpretability than those of the standard MZ regressions.
The power of our approach of course suffers from the fact that we look at annual data. We do not think that the power loss is due to bootstrapping. In fact, for the first four months in Table 8, we do reject the null hypothesis. Also, as time proceeds, the standard errors get smaller quite rapidly. The Symbolic Regression method incorporates the heterogeneity that is fully ignored by Ordinary Least Squares. So, we are tempted to argue that the bootstrapped standard errors reflect the uncertainty more realistically than the OLS based standard errors do. At the same time, the parameters in the Symbolic MZ Regression approach 0 and 1, respectively, as time proceeds, which is also what one would expect. This does not happen in Table 10.

Conclusion and discussion
Forecasts created by professional forecasters can show substantial dispersion. Such dispersion can change over time, but can also vary with the forecast horizon. The relevant literature has suggested various sources of this dispersion. A recent contribution to this literature, Clements (2019), adds another potential source of heterogeneity, namely that forecasters may target different vintages of the macroeconomic data. Naturally, the link between targets and forecasts is unknown to the analyst.
To alleviate this problem, we proposed an alternative version of the Mincer Zarnowitz (MZ) regression to examine forecast bias. This version adopts the notion that the vintages of the macroeconomic data can perhaps best be interpreted as interval data, where at the same time, the forecasts also have upper and lower bounds. Taking the data as intervals makes the standard MZ regression a so-called Symbolic MZ Regression. Simulations showed that reliable inference can be drawn from this auxiliary regression. An illustration for annual USA GDP growth rates showed its merits.
A first limitation of the interval-based data analysis is the sample size; in our case it is equal to only 18 years. When more data become available, the method will become more reliable. A second limitation is the assumption that the data are uniformly distributed within the intervals. In our empirical exercise, we have a small number of observations in each interval, so basically this assumption is an axiom; it will not be reliable to formally test its appropriateness. Further research with alternative distributional assumptions would be relevant. At present, our application considers only two variables, and it would be of interest to study the Symbolic Regression for more variables, as is also done in some of the relevant literature.
Further applications of the new regression should shed light on its practical usefulness. The method does have conceptual and face validity, but more experience with data and forecasts for more variables and more countries should provide more credibility.

Figure 1: The intervals of Table 2.