The HCR is a poverty indicator that measures the frequency of households below the poverty line. SAE analysis requires two types of variables: the variable of interest and the auxiliary variables. In this study, the variable of interest, for which small area estimates are required, is drawn from the Household Consumer Expenditure Survey —12 of the NSSO for rural areas of the State of Bihar in India.
The sampling design used in the NSSO data is stratified multi-stage random sampling, with districts as strata, villages as first-stage units and households as second-stage units. A total of households were surveyed from the 38 districts of Bihar. The district-wise sample size varied from a minimum of 64 to a maximum with an average of 87 (Table 1). From Table 1, it is evident that the district level sample sizes are very small, with a very low average sampling fraction of 0.
Therefore, it is difficult to produce reliable estimates of the poverty incidence and their standard errors at the district level. Hence, an SAE technique is a natural choice for obtaining district level estimates of poverty incidence. The SAE technique is expected to provide reliable estimates for districts with small sample sizes [3-5].
The target variable used for the study is the poverty status of a household. The poverty line has been used to identify whether a given household is poor or not. The auxiliary variables (covariates) used in this analysis are drawn from the Population Census. These auxiliary variables are only available as counts at district level, and approximately 50 such covariates are available for use in SAE analysis. We therefore carried out a preliminary data analysis to define appropriate covariates for SAE modelling, using Principal Component Analysis (PCA) to derive composite scores for selected groups of variables.
The reader is referred to [10-11] for a more detailed discussion of PCA. The PCA variables i. We carried out PCA separately on three groups of variables, all measured at district level and identified as X1, X2 and X3 respectively.
The first group, X1, consisted of literacy rates by gender and proportions of the worker population by gender. The second group, X2, consisted of the proportions of main workers by gender, proportions of main cultivators by gender and proportions of main agricultural labourers by gender.
Finally, the third group, X3, consisted of proportions of marginal cultivators by gender and proportions of marginal agricultural labourers by gender.
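As a rough illustration of how a composite score can be derived from one group of district level variables, the sketch below standardises each column and extracts the first principal component by power iteration on the correlation matrix. The function name, the toy data and the use of power iteration are assumptions made for illustration only; the text does not state how the PCA was computed, and the analysis retained two component scores per group rather than one.

```python
import math


def first_pc_scores(rows, n_iter=500):
    """Return first-principal-component scores for a list of equal-length rows.

    Columns are standardised (mean 0, sd 1), so the PCA is carried out on the
    correlation matrix. Hypothetical helper, not the authors' implementation.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(p)
    ]
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

    # Correlation matrix of the standardised columns.
    corr = [
        [sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
        for a in range(p)
    ]

    # Power iteration: multiply and renormalise to approach the leading eigenvector.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Score of each district: projection of its standardised row on the eigenvector.
    return [sum(zr[j] * v[j] for j in range(p)) for zr in z]


# Hypothetical district level proportions (rows = districts, columns = variables).
toy = [
    [0.62, 0.41, 0.30],
    [0.55, 0.37, 0.28],
    [0.70, 0.48, 0.35],
    [0.48, 0.30, 0.22],
    [0.66, 0.45, 0.31],
]
scores = first_pc_scores(toy)
```

Because the columns are centred, the scores sum to zero across districts; a district with uniformly high values on positively correlated variables receives the largest score.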
We then fitted a generalised linear model using the direct survey estimates of the proportions of poor households as the response variable and the six principal component scores X11, X12, X21, X22, X31 and X32 as potential covariates. The final selected model included the three covariates X11, X21 and X31. This final model was then used to produce district-wise estimates of poverty incidence, i.
In this Section we illustrate the theoretical framework used to produce small area estimates of the poverty incidence and their measures of precision.
The details presented here follow [12-13]. Let us assume that a sample s of size n is drawn from a finite population U of size N with a given survey design. The subscripts s and r denote quantities related to the sample and non-sample parts of the population, respectively.
So that n_d and N_d represent the sample and population i.
Let s_d denote the part of the sample from area d such that and Let y_di denote the value of the target variable of interest y for unit i in small area d. Let us assume that the variable of interest y is binary and that the target is the estimation of population counts or population proportions in area d. The direct estimator of the proportion of poor households is defined as , where w_di is the survey weight associated with household i in area d. Let us denote by y_sd and y_rd the sample and non-sample counts of poor households in area (district) d.
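The expression for the direct estimator is not reproduced in this text. A minimal sketch, assuming the usual weighted ratio form (the weighted count of poor households divided by the sum of the weights in the area), is given below; the function name and the toy data are hypothetical.

```python
def direct_proportion(y, w):
    """Weighted direct estimate of the proportion of poor households in one area.

    y: 0/1 poverty indicators for the sampled households (1 = poor).
    w: survey weights for the same households.
    Assumes a weighted ratio estimator; the source formula is not shown.
    """
    if len(y) != len(w) or not y:
        raise ValueError("y and w must be non-empty and of equal length")
    return sum(wi * yi for yi, wi in zip(y, w)) / sum(w)


# Hypothetical sample of five households in one district.
y_d = [1, 0, 0, 1, 0]
w_d = [120.0, 80.0, 100.0, 60.0, 140.0]
p_direct = direct_proportion(y_d, w_d)  # (120 + 60) / 500
```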
Further, y_sd and y_rd are assumed to be independent Binomial variables with p_d being a common success probability. Here we assume that only aggregated level data are available for the small area modelling. For example, y_sd is available from the survey data and x_d, the p-vector of covariates, is available from secondary data sources (i.e. Census and administrative records etc.) for area d. Following [12-13], the model linking the probabilities of success p_d with the covariates x_d is the logistic linear mixed model given by (1) with. Here, we observe that equation (1) relates the area or district level proportions (direct estimates from the survey data) to the area or district level covariates.
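Equation (1) is not reproduced in this text. Under the assumptions stated above (binomial sample counts, a logit link and a normally distributed area effect), an area level logistic linear mixed model is commonly written as follows. The symbols for the fixed effects, the area effect u_d and its variance are notational assumptions and may differ from those used in [12-13].

```latex
y_{sd} \mid u_d \sim \mathrm{Bin}(n_d,\, p_d), \qquad
\operatorname{logit}(p_d) = \ln\!\left(\frac{p_d}{1 - p_d}\right)
  = \mathbf{x}_d^{\top}\boldsymbol{\beta} + u_d, \qquad
u_d \sim N(0,\, \sigma_u^{2}), \quad d = 1, \dots, D.
```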
The area level model was originally proposed by Fay and Herriot [8] for the prediction of mean per-capita income (PCI) in small geographical areas (less than persons) within counties in the United States. The Fay-Herriot model [8] is a widely used area level model for the estimation of small area quantities. In many small area applications, when the data are non-linear on the original scale, the Fay-Herriot model is fitted on a transformed scale. For example, some function of the small area direct survey estimates is linearly related to the area aggregates of the auxiliary variables.
Similarly, in the Chilean poverty estimation methodology, the Fay-Herriot model is fitted to poverty rate estimates transformed using the arcsine transformation [16]. In such cases, model parameters are estimated under the Fay-Herriot model fitted on the transformed scale. This is followed by back transformation to obtain the estimates of the small area quantities on the original scale. However, back transformation leads to biased estimates of small area quantities on the original scale [15, 17]. This approach to poverty estimation, based on the Fay-Herriot method using transformed direct estimates, is often criticised.
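To see why back transformation can introduce bias, consider the arcsine square-root transformation z = arcsin(sqrt(p)). Because the transformation is non-linear, back-transforming an average computed on the transformed scale does not, in general, recover the average on the original scale. The short sketch below illustrates this with hypothetical proportions; the function names and numbers are illustrative assumptions and do not reproduce the methodology of [16].

```python
import math


def arcsine(p):
    """Arcsine square-root transformation of a proportion."""
    return math.asin(math.sqrt(p))


def back_transform(z):
    """Inverse of the arcsine square-root transformation."""
    return math.sin(z) ** 2


# Hypothetical proportions standing in for noisy realisations around a mean.
props = [0.05, 0.10, 0.60]

mean_original = sum(props) / len(props)

# Average on the transformed scale, then back-transform.
mean_transformed = sum(arcsine(p) for p in props) / len(props)
mean_back = back_transform(mean_transformed)

# The gap between the two is the kind of bias a naive back transformation incurs.
gap = mean_original - mean_back
```

The transformation itself is exactly invertible for a single value; the discrepancy arises only when an expectation is taken on the transformed scale.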
The Fay-Herriot method for SAE is based on an area level linear mixed model and is applicable to a continuous variable. This model is not applicable to non-normal data. Equation (1), on the other hand, is a special case of a generalized linear mixed model (GLMM) with logit link function and is suitable for modelling discrete data, particularly binary variables. Here, This leads to and. A detailed description of the approach can be found in [18-20].
Let us write the total counts, i. An estimate of the proportion in area d is then obtained as. For areas with zero sample size i. From equation (1), for non-sampled areas, the synthetic type predictor of the total count for area d is , where x_d,out denotes the vector of covariates associated with non-sampled area d. An alternative to predictor (3) has been proposed by [21].
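The predictors referred to above are not reproduced in this text. A minimal sketch of a plug-in predictor under a logit-link area level model is given below, assuming that the total count is predicted as the observed sample count plus the expected count among non-sampled households, and that setting the area effect to zero gives the synthetic predictor for a non-sampled area. All names and numbers are hypothetical.

```python
import math


def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))


def plug_in_proportion(y_s, n, N, x, beta, u_hat=0.0):
    """Plug-in predictor of the area proportion under a logit-link area model.

    y_s: observed sample count of poor households (0 if the area is unsampled).
    n, N: sample and population numbers of households in the area.
    x, beta: covariate vector and estimated fixed effects (same length).
    u_hat: predicted area random effect; 0 gives the synthetic predictor.
    """
    eta = sum(xj * bj for xj, bj in zip(x, beta)) + u_hat
    predicted_non_sample = (N - n) * expit(eta)  # expected count outside the sample
    return (y_s + predicted_non_sample) / N


# Sampled area: 20 poor among 80 sampled households, 100 households in total.
p_sampled = plug_in_proportion(20, 80, 100, x=[1.0], beta=[0.0])

# Non-sampled area: no survey data, so the prediction relies on covariates only.
p_unsampled = plug_in_proportion(0, 0, 5000, x=[1.0], beta=[0.0])
```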
Unfortunately, this predictor does not have a closed form and can only be computed via numerical approximation. This is generally not straightforward, and so many users tend to favour computation of a plug-in empirical predictor like (3). There are several alternative approaches for estimating small area counts. For example, Bayesian approaches for modelling counts, using a negative binomial distribution or a hierarchical Poisson-gamma model, are popular in the disease mapping and ecological regression literature; see, for example, [22-26] and references therein.
The mean squared error (MSE) estimates are computed to assess the reliability of the estimates and also to construct confidence intervals for the estimates. Following [12, 13, 19, 20], the MSE estimate of the small area predictor (3) is given by (4). In equation (4), the first two components m1 and m2 constitute the largest part of the overall MSE estimate.
For simplicity and ease of implementation, we define some notation to express the different components of the MSE estimate given in equation (4). We denote by and , the diagonal matrices defined by the corresponding variances of the sample and non-sample parts respectively. We then define , and , where I_D is an identity matrix of order D. We further write and. Using this notation, the various components of the MSE estimate are: Here is the asymptotic covariance matrix of the estimate of the variance component , which can be evaluated as the inverse of the appropriate Fisher information matrix for.
We use the REML estimate for , then with and. Finally, and , where A_i is the i-th row of the matrix A. The empirical results reported in the next Sections are obtained using R software.
We now discuss the results i. In this analysis, we use survey data from the Household Consumer Expenditure Survey —12 of the NSSO and the Population Census, and assume a binomial specification for the observed district level sample counts.
Model specification for this application was discussed in the previous Section and resulted in the identification of three PCA-based covariates, labelled X11, X21 and X31. In SAE applications, two types of diagnostic measures are generally suggested and employed: model diagnostics and diagnostics for the small area estimates; see [12, 27].
The model diagnostics are applied to verify the assumptions of the underlying model.
Fig 1 shows the normal q-q plot of the district level residuals, which provides evidence in support of the normality assumption for the district level residuals. Besides this visual method for checking normality, we also perform the Shapiro-Wilk (SW) test of normality i. The p-value from the SW test is the probability of obtaining a test statistic at least as extreme as the one observed if the sample comes from a normal distribution.
Typically, a value of 0. We use shapiro. These results clearly indicate that the normality assumption is satisfied reasonably well for the data.
Other diagnostics are used to examine the reliability and validity of the model-based small area estimates. Such diagnostics are suggested in [27].
The model-based small area estimates should be consistent with the unbiased direct survey estimates, be more precise than the direct survey estimates, and provide reasonable results to users. The values of the model-based small area estimates derived from the fitted model should be consistent with the unbiased direct survey estimates, wherever these are available, i. The model-based small area estimates should have mean squared errors significantly lower than the variances of the corresponding direct survey estimates.
For this purpose, we consider three commonly used measures: a bias diagnostic, a percent coefficient of variation (CV) diagnostic, and a 95 percent confidence interval diagnostic. See, for example, [12, 13, 27] for more information. The bias diagnostic is used to examine whether the model-based small area estimates are less extreme than the direct survey estimates, where these are available. If the direct survey estimates are unbiased, their regression on the true values should be linear and correspond to the identity line.
Further, if the model-based small area estimates are close to the true values, the regression of the direct survey estimates on these model-based estimates should be similar. In particular, the diagnostic is a simple test of whether the straight line found by regressing the direct estimates against the model-based estimates provides an adequate fit to the small area estimates.
The bias scatter plot of the district level direct survey estimates against the corresponding model-based small area estimates is given in Fig 2, with the fitted least squares regression line (dotted line) and the line of equality (solid line) superimposed. The bias diagnostic plot in Fig 2 indicates that the district level model-based estimates generated by EBP are less extreme than the direct survey estimates, demonstrating the typical SAE outcome of shrinking more extreme values towards the average.
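The regression underlying the bias diagnostic can be sketched as an ordinary least squares fit of the direct estimates on the model-based estimates, whose intercept and slope are then compared with the line of equality (intercept 0, slope 1). The helper name and the toy estimates below are hypothetical and are not the district values shown in Fig 2.

```python
def ols_line(x, y):
    """Least squares intercept and slope for the regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical estimates for four areas.
model_based = [0.20, 0.30, 0.40, 0.50]
direct = [0.15, 0.30, 0.45, 0.60]

intercept, slope = ols_line(model_based, direct)
```

In this toy example the fitted slope exceeds 1, meaning the direct estimates are more spread out than the model-based ones, which is the shrinkage pattern described in the text.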
This is expected, since the EBP estimates are realisations of random variables and so the regression of the direct estimates on the EBP estimates is unbiased for a test of common expected values. Such a test is provided by the Goodness of Fit (GoF) diagnostic. This diagnostic tests whether the direct estimates and the model-based estimates generated by EBP are statistically different. The null hypothesis is that the direct estimates and the model-based estimates are statistically equivalent. The alternative is that the direct estimates and the model-based estimates are statistically different.
The GoF diagnostic is computed using the following Wald statistic for the EBP estimates: The value of the test statistic is compared against the value from a chi-square distribution with D degrees of freedom. A smaller value less than The diagnostic results clearly show that the model-based small area estimates are consistent with the direct survey estimates.
In small area applications, aggregation or benchmarking of small area estimates at a higher level is always desirable to users.
At a higher level of aggregation, the direct estimates are considered to be reliable, and therefore the model-based small area estimates are expected to be close to the direct estimates when they are aggregated. We checked the aggregation of the model-based small area estimates at state level. We computed the state level incidence of poverty by aggregating the direct estimates and the model-based small area estimates i. The state level estimates of incidence of poverty computed by aggregation of the direct and EBP estimates are 0. As one expects, the model-based estimates aggregate well to the state level direct estimate.
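The goodness of fit and aggregation checks described above can be sketched as follows. The Wald statistic is not reproduced in this text; the sketch assumes a commonly used form, the sum over areas of the squared difference between the direct and model-based estimates divided by the sum of the direct variance estimate and the model-based MSE estimate. The aggregation assumes weights proportional to area population sizes. Function names and numbers are hypothetical.

```python
def wald_gof(direct, model, var_direct, mse_model):
    """Wald-type statistic: squared differences scaled by their estimated variances."""
    return sum(
        (d - m) ** 2 / (vd + vm)
        for d, m, vd, vm in zip(direct, model, var_direct, mse_model)
    )


def aggregate(estimates, pop_sizes):
    """Population-weighted average of area estimates (e.g. districts to state)."""
    return sum(N * p for p, N in zip(estimates, pop_sizes)) / sum(pop_sizes)


# Hypothetical values for D = 3 areas.
direct = [0.30, 0.50, 0.20]
model = [0.32, 0.46, 0.24]
var_direct = [0.0030, 0.0050, 0.0020]
mse_model = [0.0010, 0.0030, 0.0020]
pop_sizes = [1000, 3000, 1000]

W = wald_gof(direct, model, var_direct, mse_model)

# 5% critical value of a chi-square distribution with 3 degrees of freedom.
CHI2_95_DF3 = 7.815
consistent = W < CHI2_95_DF3  # True here: no evidence that the two sets differ

state_direct = aggregate(direct, pop_sizes)
state_model = aggregate(model, pop_sizes)
```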
We use the percent CV to assess the improved precision of the model-based small area estimates (EBP) relative to the direct survey estimates. The CV shows the sampling variability as a percentage of the estimate. Estimates with large CVs are considered unreliable i. In general, there are no internationally accepted tables available that allow us to judge what is "too large".
Different organizations use different CV cut-offs for releasing their estimates for public use. The results in Table 1 and the district-wise values in Fig 3 clearly show that the direct estimates of the proportion of poor households within each district are unstable, with CVs varying from The estimate of poverty incidence for Madhepura district is zero because the sample count of poor households is zero.
As a result, the standard error of the direct estimate is zero and hence the CV cannot be computed for this district. This is one of the drawbacks of direct estimation.
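The percent CV used in this comparison is the estimated standard error expressed as a percentage of the estimate. A minimal sketch is given below, with the zero-estimate case handled explicitly; the function name and numbers are hypothetical.

```python
import math


def percent_cv(estimate, variance):
    """Percent coefficient of variation; None when the estimate is zero."""
    if estimate == 0:
        return None  # undefined, as for an area with no sampled poor households
    return 100.0 * math.sqrt(variance) / estimate


# Hypothetical district: direct estimate with its variance, EBP with its MSE.
cv_direct = percent_cv(0.25, 0.0025)
cv_model = percent_cv(0.26, 0.000676)
cv_zero = percent_cv(0.0, 0.0)
```

For the model-based estimate the MSE estimate takes the place of the variance, so the two CVs are comparable measures of relative precision.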
Except in two districts (i.e. Samastipur and Jamui), the model-based estimates are less variable i. In these two districts (Samastipur and Jamui), the CVs of the direct and model-based estimates are on par. Overall, using SAE improves the precision of the small area estimates. This ignores the effects of differential weighting and clustering within districts that would further inflate the true standard errors of the direct estimates.
This map shows the spatial inequality in the distribution of poverty incidence, i. This map is very useful for identifying the districts and regions with low and high levels of poverty incidence in the state. It clearly shows that districts bordering eastern Uttar Pradesh have higher poverty incidence.
The district level estimates, as well as the spatial map of poverty rates, are expected to provide invaluable information to policy analysts and decision-makers for identifying the regions and districts requiring more attention in the state.
This is an example of a "poverty map" showing reliable estimates of poverty incidence across a region of interest.
The theory of SAE methods for the estimation of proportions is well developed; however, applications in the agricultural and social sciences are not so common. Although the need for small area statistics has been felt by different agencies and organizations in India, not much initiative has been taken.
In India, the Census is usually limited in scope; it focuses mainly on basic social and demographic information, and only at decennial intervals. On the other hand, the NSSO conducts regular surveys on a number of socioeconomic indicators, but their utility is restricted to generating national and state level estimates, not estimates for administrative units below the state, because of the small sample sizes for such units. This paper demonstrates that SAE can be used as a cost-effective and efficient approach for generating fairly accurate disaggregate level estimates of poverty incidence from existing survey data, using auxiliary information from different published data sources.