# SOLVING THE PROBLEM OF MISSING DATA

After deciding which indicators to include in an index, we collect the data. At this point we become aware of how much data is missing and whether we will be able to cover the selected group of indicators. If a lot of data is missing and cannot be imputed meaningfully, we have to exclude some indicators from the analysis, i.e. drop the variables or instances for which the data is missing. Case deletion is thus the first and most radical way of dealing with this problem. It is used when, for example, values are missing for the majority of the observed countries over the larger part of the observed period, so that the missing values cannot be estimated and imputed. The well-known Human Development Index deals with the missing data problem in this way: it is restricted to a small set of indicators for which complete data across the domain of countries is relatively easy to obtain. On the one hand, this is good since it provides objective and more reliable results; on the other hand, we cannot include all important aspects of a phenomenon and have to restrict ourselves to the basic ones. The case deletion approach is, of course, appealing because of its simplicity. However, it is not applicable when missing values cover a lot of instances or when their presence in essential attributes is large (Little & Rubin, 1987).

Generally, if this radical exclusion of indicators is not needed or wanted, composite indices can deal with the problem of missing data in three ways (Foa & Tanner, 2012). The first and simplest solution is to drop any country for which complete data does not exist; the second is to impute missing values using different methods; and the third is to use only existing data in the estimation of the index, supplemented with an estimated margin of error.
The first solution, reducing the sample, is not always acceptable, especially if we want to make global cross-country comparisons. However, there are indices, such as the Doing Business Indicators, whose authors want to avoid the methodological problems and obtain an objective result (Foa & Tanner, 2012). This is why their country domain is smaller than that of, e.g., the ArCo index of technological capability of countries (Archibugi & Coco, 2004), which is measured for 162 countries. So, in order to enlarge the sample, we need to find the right way of imputing the missing values, always bearing in mind that the choice of method influences the results. We discuss this in detail in Section 2.1.

The third solution uses only the existing data in the estimation of the index, but supplements the results with an estimated margin of error based on, among other criteria, the number of missing items. This approach is used in a number of recent indices such as the Corruption Perceptions Index (CPI), where the confidence range indicates the reliability of the country scores. It indicates that, allowing for a margin of error, we can be 90 percent sure that the true score of a country lies within the given range.

## 2.1. Missing data imputation techniques – an overview

The literature on the analysis of missing data is extensive and developing rapidly. OECD & EC-JRC (2008) published the Handbook on Constructing Composite Indicators – Methodology and User Guide, which provides help in creating indices. The Handbook aims to contribute to a better understanding of the complexity of composite indicators and to an improvement in the techniques currently used to build them. Among other topics, the Handbook deals with the problem of missing data imputation and suggests single and multiple imputation as possible solutions.
As defined in the Handbook, “imputations are means or draws from a predictive distribution of missing values.” The predictive distribution must be generated by using the observed data. Single imputation covers both implicit and explicit modelling.

The implicit techniques are simple. Hot deck imputation fills in the blank cells with individual data drawn from a unit that has similar characteristics (for example, if we observe units according to four indicators and one unit lacks a value for indicator x, we fill in that missing value with the value of indicator x for the unit that is most similar to the observed one according to the other three indicators). Substitution means the replacement of non-responding units with unselected units in the sample, while cold deck imputation is the replacement of missing values with values from an external source (for example, from the previous realization of the same survey, or the value of an indicator from the previous year when assessing the performance of countries). Additionally, we propose simple mean imputation, which considers only one instance and its own data (separately from the sample) and imputes the missing value as the average of the values immediately before and after it (it is important to bear in mind that in assessing the performance of countries we deal with time-series datasets).

Explicit modelling is more complex and demands more detailed explanation. Unconditional mean imputation means that we impute the missing values with the sample mean (median, mode) of the observed indicator. The limitation of mean-based imputation and its variations is the focus on a specific variable without taking into account the overall similarities between instances (Ayuyev et al., 2009). This is the easiest form of explicit modelling, but it is not always precise enough. Therefore, two more sophisticated techniques can be used: regression imputation and expectation maximization imputation.
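
The simple techniques described above can be illustrated with a minimal Python sketch. It is not taken from the Handbook: the function names are hypothetical, missing values are represented by `None`, and the hot deck example assumes that the other indicators are complete and comparably scaled, so that plain Euclidean distance is a reasonable similarity measure.

```python
from math import dist  # Python 3.8+
from statistics import mean


def unconditional_mean_impute(values):
    """Replace each None with the mean of the observed values of one indicator."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def neighbour_mean_impute(series):
    """Simple mean imputation for the time series of a single unit.

    An isolated None is replaced by the average of the previous and the
    next observation. Gaps at the edges, or next to another gap, stay None.
    """
    out = list(series)
    for t in range(1, len(series) - 1):
        prev, nxt = series[t - 1], series[t + 1]
        if series[t] is None and prev is not None and nxt is not None:
            out[t] = (prev + nxt) / 2
    return out


def hot_deck_impute(units, target):
    """Fill column `target` from the most similar unit that has a value there.

    `units` is a list of rows (lists of indicator values). Similarity is the
    Euclidean distance over all other columns (an assumption of this sketch).
    """
    def others(row):
        return [x for j, x in enumerate(row) if j != target]

    donors = [row for row in units if row[target] is not None]
    filled = []
    for row in units:
        if row[target] is None:
            donor = min(donors, key=lambda d: dist(others(row), others(d)))
            row = row[:target] + [donor[target]] + row[target + 1:]
        filled.append(row)
    return filled
```

For instance, `neighbour_mean_impute([10.0, None, 14.0])` fills the gap with `12.0`, while `hot_deck_impute` copies the value of the fourth indicator from whichever complete unit lies closest on the other three.
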
The first is regression imputation, in which missing values are imputed with the predicted values obtained by regression. Here we distinguish a dependent variable and one or more independent variables. The dependent variable is the indicator for which some values are missing, and the independent variables are the individual indicators that show a strong relationship (usually a high correlation) with the dependent variable. The second, expectation maximization imputation, focuses on the interdependence between the parameters of the model and the missing values. It is an iterative process. First, the missing values are predicted based on initial estimates of the model parameters. These predictions are then used to update the parameter values, and the process is repeated. The sequence of parameters converges to maximum-likelihood estimates, and the time to convergence depends on the proportion of missing data and the flatness of the likelihood function. For a more detailed mathematical explanation of explicit modelling, see OECD & EC-JRC (2008, pp. 55-58).

Multiple imputation is considered to be one of the most powerful approaches to missing value estimation (Ayuyev et al., 2009). It is a general approach that does not require the specification of a parameterised likelihood for all data. The missing data is imputed with a random process that reflects uncertainty. Imputation is done N times, to create N “complete” datasets. The parameters of interest are estimated on each dataset, together with their standard errors. The estimates from the N sets are then combined by averaging (mean or median), and the between- and within-imputation variances are calculated. Although any imputation method can be used in multiple imputation (applied repeatedly to obtain N values), one of the most general models is the Markov Chain Monte Carlo (MCMC) method. It relies on a Markov chain, i.e. a sequence of random variables in which the distribution of each element depends on the value of the previous one.
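
Regression imputation and the pooling step of multiple imputation can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the Handbook: a single independent indicator `x` is used, the imputation model is stochastic regression (prediction plus Gaussian noise) rather than MCMC, and the uncertainty of the regression parameters themselves is ignored, which a full multiple imputation would also reflect. The estimates are combined with Rubin's rules, where the total variance is $W + (1 + 1/N)B$ for within-imputation variance $W$ and between-imputation variance $B$. All function names are hypothetical.

```python
import random
from statistics import mean, variance


def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd).

    Needs at least three observed pairs so the residual variance is defined.
    """
    mx, my = mean(x), mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    a = my - b * mx
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, (rss / (len(x) - 2)) ** 0.5


def regression_impute(x, y, noise_sd=0.0, rng=random):
    """Fill None entries of y with predictions from x, optionally with noise."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b, _ = fit_line([p[0] for p in obs], [p[1] for p in obs])
    out = []
    for xi, yi in zip(x, y):
        if yi is None:
            noise = rng.gauss(0.0, noise_sd) if noise_sd > 0 else 0.0
            yi = a + b * xi + noise
        out.append(yi)
    return out


def pooled_mean(x, y, n_imputations=5, seed=0):
    """Mean of y from N stochastic imputations (N >= 2), pooled by Rubin's rules.

    Returns (pooled estimate, total variance of the estimate).
    """
    rng = random.Random(seed)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    _, _, sd = fit_line([p[0] for p in obs], [p[1] for p in obs])
    estimates, within = [], []
    for _ in range(n_imputations):
        completed = regression_impute(x, y, noise_sd=sd, rng=rng)
        estimates.append(mean(completed))
        # Squared standard error of the mean in this completed dataset.
        within.append(variance(completed) / len(completed))
    w = mean(within)          # within-imputation variance
    b = variance(estimates)   # between-imputation variance
    return mean(estimates), w + (1 + 1 / n_imputations) * b
```

Calling `regression_impute(x, y)` with the default `noise_sd=0.0` gives deterministic single imputation, whereas `pooled_mean` repeats the noisy version N times so that the reported variance also reflects the uncertainty introduced by the imputed values.
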
The MCMC method assumes that data are drawn from a multivariate normal distribution and requires the data to be missing at random (MAR) or missing completely at random (MCAR). For a more detailed explanation see OECD & EC-JRC (2008, pp. 58-61). For example, the Environmental Sustainability Index uses the MCMC technique to substitute the missing values (Srebotnjak, 2001).

Based on the number of operations performed, Zhang (2011) presents the following categorisation of imputation techniques: single, multiple, fractional and iterative. Fractional imputation represents a compromise between the single and multiple imputation methods, while iterative imputation techniques primarily use a generate-and-test mechanism, taking into account useful information (including incomplete cases).

Fujikawa and Ho (2002) consider a clustering-based approach to missing data imputation, where the premise is that units can be grouped such that all the imputations in the identified groups are independent of other groups. Distance-based clustering focuses mainly on the development of supervised clustering methods and mean/mode-based imputation within these clusters (De Mántaras, 1991). These methods are based on a strict separation of objects into clusters, so it is assumed that instances in one cluster have no influence on the imputation process in other clusters.

Ayuyev et al. (2009) suggest the improved dynamic clustering-based imputation (DCI) of missing values in mixed type data. They consider the appropriate choice of imputation method especially important when the fraction of missing values is large and the data are of mixed type. The proposed DCI algorithm relies on similarity information from shared neighbours, where mixed type variables are considered together. Around each instance with a missing value they deterministically construct an independent cluster of similar instances with no missing values for a particular attribute.
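
For comparison, the strictly separated variant mentioned earlier (mean imputation inside disjoint clusters) can be sketched in a few lines. This is only an illustration and not the DCI algorithm: the cluster labels are assumed to be given in advance, the function name is hypothetical, and only a continuous indicator is handled.

```python
from collections import defaultdict
from statistics import mean


def cluster_mean_impute(values, labels):
    """Fill None with the mean of the observed values in the same cluster.

    Clusters are disjoint, so values in one cluster never influence the
    imputation in another. A cluster without observed values stays None.
    """
    observed = defaultdict(list)
    for v, c in zip(values, labels):
        if v is not None:
            observed[c].append(v)
    fills = {c: mean(vs) for c, vs in observed.items()}
    return [fills.get(c) if v is None else v for v, c in zip(values, labels)]
```

Each unit belongs to exactly one cluster here, so a missing value can only borrow information from the members of its own group.
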
In contrast to a typical clustering method, the DCI approach allows cluster intersections, meaning that the same unit may be included in many clusters. The algorithm relies on a distance measure that considers both categorical and continuous variables and is applicable to the estimation of missing values in high-dimensional mixed type data.

Different authors propose and analyze other complex algorithms for missing data imputation. For example, Abdella & Marwala (2005) introduced a new method for imputing missing values which uses a combination of genetic algorithms and neural networks to approximate the missing data. Nelwamondo et al. (2007) compare neural network and expectation maximization techniques, while Lobato