Confidence bands for a distribution function with merged data from multiple sources


Statistics in Transition New Series

Polish Statistical Association

Central Statistical Office of Poland

Subject: Economics, Statistics & Probability


ISSN: 1234-7655
eISSN: 2450-0291





Volume 21, Issue 4 (August 2020)

Special Issue

Takumi Saegusa

Keywords: confidence band, data integration, Gaussian process

Citation Information: Statistics in Transition New Series. Volume 21, Issue 4, Pages 144-158, DOI:

License: (CC BY-NC-ND 4.0)

Received Date: 31-January-2020 / Accepted: 30-June-2020 / Published Online: 15-September-2020



We consider nonparametric estimation of a distribution function when data are collected from multiple overlapping data sources. The main statistical challenges are (1) heterogeneity of the data sets, (2) unidentified duplicated records across data sets, and (3) dependence due to sampling without replacement from each data source. The proposed estimator can be computed without identifying duplicated records, yet it corrects the bias they introduce. We show that the proposed estimator is uniformly consistent over the real line and converges weakly to a Gaussian process. Based on these asymptotic properties, we propose a simulation-based confidence band with asymptotically correct coverage probability. Finite-sample performance is evaluated in a simulation study, and an example using Wilms tumor data is provided.
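To make the idea of a simulation-based confidence band concrete, the following is a minimal sketch for the classical setting only: a single i.i.d. sample from a continuous distribution, with no overlapping sources or duplicated records. It is not the estimator or the band proposed in the paper; the limiting Gaussian process for merged data has a different covariance, which this abstract does not state. In the i.i.d. case the limit is a Brownian bridge, so the sketch simulates bridge paths on the grid i/n and uses the (1 - alpha) quantile of their suprema as the critical value. The function names `simulate_sup_quantile` and `confidence_band` are hypothetical.

```python
import bisect
import math
import random


def simulate_sup_quantile(n, alpha=0.05, n_sim=2000, seed=0):
    """Monte Carlo estimate of the (1 - alpha) quantile of max_i |B(i/n)|,
    where B is a Brownian bridge (the i.i.d. limit; an assumption here)."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 / n)
    sups = []
    for _ in range(n_sim):
        # Brownian motion W at i/n as cumulative sums of N(0, 1/n) increments.
        w = []
        total = 0.0
        for _ in range(n):
            total += rng.gauss(0.0, sd)
            w.append(total)
        w_one = w[-1]
        # Brownian bridge B(t) = W(t) - t * W(1) on the grid t = i/n.
        sups.append(max(abs(w[i] - (i + 1) / n * w_one) for i in range(n)))
    sups.sort()
    # Empirical quantile of the simulated suprema.
    index = min(n_sim - 1, math.ceil((1 - alpha) * n_sim) - 1)
    return sups[index]


def confidence_band(sample, alpha=0.05, n_sim=2000, seed=0):
    """Return a function x -> (lower, estimate, upper) built from the
    empirical distribution function and the simulated critical value."""
    xs = sorted(sample)
    n = len(xs)
    half_width = simulate_sup_quantile(n, alpha, n_sim, seed) / math.sqrt(n)

    def band(x):
        f_hat = bisect.bisect_right(xs, x) / n  # empirical CDF at x
        return max(0.0, f_hat - half_width), f_hat, min(1.0, f_hat + half_width)

    return band


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.gauss(0.0, 1.0) for _ in range(200)]
    band = confidence_band(data, alpha=0.05)
    print(band(0.0))
```

The band has constant half-width over the real line and is clipped to [0, 1]. Adapting this sketch to merged data would require replacing both the empirical distribution function and the simulated process with their counterparts from the paper.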



