Applied Statistics


For possible and future Master's theses in Mathematics or in Biostatistics, see here. We often have shorter projects as well, see here.

See also our Declaration of Reproducibility Policy.

varrank: a variable selection approach

Gilles Kratzer

A common challenge encountered when working with high dimensional datasets is that of variable selection. All relevant confounders must be taken into account to allow for unbiased estimation of model parameters, while balancing with the need for parsimony and producing interpretable models. This task is known to be one of the most controversial and difficult tasks in epidemiological analysis. 

Variable selection approaches can be categorized into three broad classes: filter-based methods, wrapper-based methods, and embedded methods. They differ in how they combine the selection step with model inference. An appealing filter approach is the minimum redundancy maximum relevance (mRMRe) algorithm. This heuristic selects the most relevant variables from a set while penalising each candidate according to the amount of redundancy it shares with previously selected variables. In epidemiology, the most frequently used model-based approaches to variable selection rely on goodness-of-fit metrics. The underlying paradigm is that the important variables for modeling are those that are causally connected, and that predictive power is a proxy for causal links. The mRMRe algorithm, on the other hand, measures the importance of variables through a relevance measure penalized by redundancy, which makes it appealing for epidemiological modeling.

varrank provides a flexible implementation of the mRMRe algorithm, which performs variable ranking based on mutual information. The package is particularly suitable for exploring multivariate datasets requiring a holistic analysis. The two main problems that can be addressed by this package are the selection of the most representative variables for modeling a collection of variables of interest, i.e., dimension reduction, and variable ranking with respect to a set of variables of interest.
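The greedy ranking idea described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the varrank implementation: it assumes discrete variables, estimates mutual information from empirical frequencies, and uses one possible score (relevance minus mean redundancy). The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # sum over observed pairs of p(a,b) * log(p(a,b) / (p(a) * p(b)))
    return sum((c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def mrmr_rank(features, target):
    """Greedy forward ranking: relevance to the target minus mean redundancy
    with the variables already ranked (an assumed score, one of several possible)."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    remaining = list(features)
    ranked = []

    def score(name):
        if not ranked:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in ranked
        ) / len(ranked)
        return relevance[name] - redundancy

    while remaining:
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


# Hypothetical toy data: "b" duplicates "a", "c" carries different information.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 0],
    "b": [0, 0, 0, 1, 1, 1, 1, 0],
    "c": [0, 0, 1, 0, 1, 0, 1, 1],
}
ranking = mrmr_rank(features, target)
print(ranking)
```

In this toy example the duplicated variable "b" is ranked last, because its redundancy with "a" outweighs its relevance to the target.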


Spatial fusion modeling

Craig Wang, in collaboration with Milo Puhan, Epidemiology, Biostatistics and Prevention Institute, UZH.

The availability of data has increased dramatically in past years. Multivariate remote sensing data and highly detailed socio-economic data are readily available for analysis to address different research interests. Moreover, links between diverse datasets can be established more easily thanks to the trend towards openness among database hosts and organizations.

We construct spatial fusion models within the Bayesian framework to jointly analyze both individual point data and area data. A single source of data may be incomplete or not suitable for parameter inference in statistical models. The cost of data collection, especially in large population studies, may result in useful variables being omitted, hence limiting the scope of research interests. In addition, appropriate statistical models can be complex, hence requiring a large amount of data to make precise inference on weakly identified parameters. In those situations it therefore becomes crucial to utilize multiple data sources, in order to reduce bias, widen research possibilities and apply appropriate statistical models.

Area data (left), point data (middle) and output of fusion model (right) from a simulation.

Time series extension to Additive Bayesian Network

Gilles Kratzer.

In recent years, Additive Bayesian Network (ABN) analysis has been successfully used in many fields, from sociology to veterinary epidemiology. This approach has been shown to be very efficient in embracing the correlated multivariate nature of high dimensional datasets and in producing data-driven models instead of expert-based models. ABN is a multidimensional regression model analogous to generalised linear modelling but with all variables as potential predictors. The final goal is to construct the Bayesian network that best supports the data. When applying ABN to time series datasets, it is of high importance to cope with the autocorrelation structure of the variance-covariance matrix, as the structural learning process relies on estimating the relative quality of the model for the given dataset. This is done using common model selection scores such as AIC or BIC. We adapt the ABN framework such that it can handle time series and longitudinal datasets, and thereby generalize time series regression for a given set of data. We implement an iterative Cochrane-Orcutt procedure in the fitting algorithm to deal with serially correlated errors and to cope with the between- and within-cluster effects when regressing centred responses on centred covariates. tsabn is distributed as an R package.
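To make the Cochrane-Orcutt idea concrete, here is a minimal Python sketch for the simplest case: a linear regression with one covariate and AR(1) errors. It is not the tsabn fitting algorithm; the function names, the simulated data and the true parameter values (intercept 1, slope 2, autocorrelation 0.6) are assumptions chosen for illustration.

```python
import random


def ols(x, y):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def cochrane_orcutt(x, y, tol=1e-8, max_iter=100):
    """Iterative Cochrane-Orcutt for y_t = a + b * x_t + e_t, e_t = rho * e_{t-1} + u_t."""
    a, b = ols(x, y)  # start from ordinary least squares
    rho = 0.0
    for _ in range(max_iter):
        # 1. estimate rho from the lag-1 regression of the residuals
        e = [yi - a - b * xi for xi, yi in zip(x, y)]
        new_rho = sum(e[t] * e[t - 1] for t in range(1, len(e))) / sum(
            v * v for v in e[:-1]
        )
        # 2. quasi-difference the data and refit
        xs = [x[t] - new_rho * x[t - 1] for t in range(1, len(x))]
        ys = [y[t] - new_rho * y[t - 1] for t in range(1, len(y))]
        a_star, b = ols(xs, ys)
        a = a_star / (1.0 - new_rho)  # recover the original intercept
        converged = abs(new_rho - rho) < tol
        rho = new_rho
        if converged:
            break
    return a, b, rho


# Simulated example with serially correlated errors (hypothetical values).
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(500)]
e, y = 0.0, []
for xi in x:
    e = 0.6 * e + rng.gauss(0, 1)
    y.append(1.0 + 2.0 * xi + e)

a, b, rho = cochrane_orcutt(x, y)
print(a, b, rho)
```

The estimated slope and autocorrelation should land close to the simulated values; the intercept is typically estimated less precisely when the errors are strongly autocorrelated.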


64-bit sparse matrices in R

Florian Gerber

Software packages for spatial data often implement a hybrid approach of interpreted and compiled programming languages. The compiled parts are usually written in C, C++, or Fortran, and are efficient in terms of computational speed and memory usage. Conversely, the interpreted part serves as a convenient user interface and calls the compiled code for computationally demanding operations. The price paid for the user friendliness of the interpreted component is—besides performance—the limited access to low level and optimized code. An example of such a restriction is the 64-bit vector support of the widely used statistical language R. On the R side, users do not need to change existing code and may not even notice the extension. On the other hand, interfacing 64-bit compiled code efficiently is challenging. Since many R packages for spatial data could benefit from 64-bit vectors, we investigated how to simply extend existing R packages using the foreign function interface to seamlessly support 64-bit vectors. This extension is shown with the sparse matrix algebra R package spam. The new capabilities are illustrated with an example of GIMMS NDVI3g data featuring a parametric modeling approach for a non-stationary covariance matrix.
A key part of the 64-bit extension is the R package dotCall64, which provides an enhanced foreign function interface to call compiled code from R. The interface provides functionality to do the required double to 64-bit integer type conversions. In addition, options to control copying of R objects are available.
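The need for double to 64-bit integer conversions comes from the fact that R's integer type is 32-bit, so larger whole numbers are held as doubles, which represent integers exactly only up to 2^53. The following Python sketch illustrates that constraint; it is not the dotCall64 interface, and the function name is hypothetical.

```python
import struct

INT32_MAX = 2**31 - 1   # largest value of a 32-bit signed integer
MAX_EXACT = 2**53       # doubles represent every integer up to this magnitude exactly


def double_to_int64(value):
    """Convert a double holding a whole number into a signed 64-bit integer."""
    value = float(value)
    if not value.is_integer():
        raise ValueError("not a whole number")
    if abs(value) > MAX_EXACT:
        raise ValueError("beyond the range where doubles are exact")
    as_int = int(value)
    struct.pack("<q", as_int)  # raises struct.error if it does not fit in 64 bits
    return as_int


n = double_to_int64(3e9)
print(n > INT32_MAX)                      # True: too large for a 32-bit integer
print(float(2**53 + 1) == float(2**53))   # True: precision is lost above 2**53
```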

Developing Bayesian Networks as a tool for Epidemiological Systems Analysis

Gilles Kratzer, in collaboration with the Section of Epidemiology, VetSuisse Faculty, UZH.

The study of the causes and effects of health and disease conditions is a cornerstone of epidemiology. Classical approaches, such as regression techniques, have been successfully used to model the impact of health determinants over the whole population. However, there is a growing recognition that biological and behavioural factors at multiple levels can impact health conditions. These epidemiological data are, by nature, highly complex and correlated. Classical regression frameworks have shown limited ability to embrace the correlated multivariate nature of high dimensional epidemiological variables. On the other hand, models driven by expert knowledge often fail to efficiently manage the complexity and correlation of the epidemiological data. Additive Bayesian Networks (ABN) address these challenges by producing a data-selected set of multivariate models presented as Directed Acyclic Graphs (DAGs). ABN is a machine learning approach for empirically identifying associations in complex and high dimensional datasets. It is currently distributed as an R package available on CRAN.
A natural extension of the abn R package is to implement a frequentist approach based on classical GLMs, together with classical scores such as AIC and BIC. This extension could have many side benefits: one could imagine boosting different scores to find the best supported BN, data separation is easier to deal with in a GLM setting, multiple levels of clustering can be tackled in a mixed model setting, and highly efficient estimation methods exist for fitting GLMs. More generally, if the main interest lies in the score and not in the shape of the posterior density, then a frequentist approach can be a good alternative. Surprisingly, few resources are available to display and analyse epidemiological data in an ABN framework, and there is a need for a comprehensive approach to displaying abn outputs. Indeed, as the ABN framework is aimed at non-statisticians who analyse complex data, one major challenge is to provide simple graphical tools for analysing epidemiological data. Besides that, there is a lack of resources addressing which classes of problems can be tackled using the ABN method, in terms of sample size, number of variables, and expected density of the learned network.


see also publications of this project

Bayesian hierarchical modeling of anthelmintic resistance

Craig Wang, in collaboration with Paul Torgerson, Section of Epidemiology, VetSuisse Faculty, UZH.

The prevalence of anthelmintic resistance has increased in recent years, as a result of the extensive use of anthelmintic drugs to reduce the infection of parasitic worms in livestock. In order to detect resistance, the number of parasite eggs in animal faeces is counted. The widely used faecal egg count reduction test (FECRT) was established in the early 1990s. We develop Bayesian hierarchical models to estimate the reduction in faecal egg counts. Our models provide lower estimation bias and accurate posterior credible intervals. An R package is available on CRAN, and we have also implemented a user-friendly interface in R Shiny.
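For orientation, the quantity being estimated can be illustrated with the classical point estimate of the reduction, one minus the ratio of mean post-treatment to mean pre-treatment counts. The Python sketch below shows only this simple estimator, not the Bayesian hierarchical models or the R package mentioned above; the function name and the counts are hypothetical.

```python
def fecr(pre_counts, post_counts):
    """Classical faecal egg count reduction: 1 - mean(post) / mean(pre)."""
    mean_pre = sum(pre_counts) / len(pre_counts)
    mean_post = sum(post_counts) / len(post_counts)
    if mean_pre == 0:
        raise ValueError("pre-treatment mean is zero; reduction is undefined")
    return 1.0 - mean_post / mean_pre


# Hypothetical egg counts for three animals before and after treatment.
pre = [100, 200, 300]
post = [10, 20, 30]
reduction = fecr(pre, post)
print(f"{100 * reduction:.1f}% reduction")
```

This ratio of means ignores the sampling variability in the counts, which is the kind of uncertainty a hierarchical model can account for.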


Predicting missing values in spatio-temporal satellite data

Florian Gerber, in collaboration with Rogier de Jong, Michael E. Schaepman, Gabriela Schaepman-Strub

Remotely sensed data are sparse, which means that data have missing values, for instance due to cloud cover. This is problematic for applications and signal processing algorithms that require complete data sets. To address the sparse data issue, we worked on a new gap-fill algorithm. The proposed method predicts each missing value separately based on data points in a spatio-temporal neighborhood around the missing data point. The computational workload can be distributed among several computers, making the method suitable for large datasets. The prediction of the missing values and the estimation of the corresponding prediction uncertainties are based on sorting procedures and quantile regression. The algorithm was applied to MODIS NDVI data from Alaska and tested with realistic cloud cover scenarios featuring up to 50% missing data. Validation against established software showed that the proposed method performs well in terms of the root mean squared prediction error. We demonstrate the software performance with a real data example and show how it can be tailored to specific data. Due to the flexible software design, users can control and redesign major parts of the procedure with little effort. This makes it an interesting tool for gap-filling satellite data and for the future development of gap-fill procedures.
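The neighborhood idea can be sketched as follows in Python. This is a deliberately simplified stand-in: it replaces the sorting procedures and quantile regression of the proposed method with a plain median over the observed values in a spatio-temporal window, and it produces no uncertainty estimate. The function name, the data layout (a nested list indexed as time, row, column, with None marking missing values) and the toy values are assumptions.

```python
from statistics import median


def gap_fill(cube, radius=1):
    """Return a copy of cube[t][i][j] in which each None is replaced by the median
    of the observed values within `radius` steps in time and space."""
    nt, ni, nj = len(cube), len(cube[0]), len(cube[0][0])
    filled = [[row[:] for row in image] for image in cube]
    for t in range(nt):
        for i in range(ni):
            for j in range(nj):
                if cube[t][i][j] is not None:
                    continue
                # Each gap is predicted from the original data only, so the
                # predictions are independent and could be computed in parallel.
                neighbors = [
                    cube[tt][ii][jj]
                    for tt in range(max(0, t - radius), min(nt, t + radius + 1))
                    for ii in range(max(0, i - radius), min(ni, i + radius + 1))
                    for jj in range(max(0, j - radius), min(nj, j + radius + 1))
                    if cube[tt][ii][jj] is not None
                ]
                if neighbors:  # leave the gap if the window holds no data
                    filled[t][i][j] = median(neighbors)
    return filled


# Two hypothetical 2 x 2 images; one value in the first image is missing.
cube = [
    [[1.0, 2.0], [3.0, None]],
    [[2.0, 3.0], [4.0, 5.0]],
]
filled = gap_fill(cube)
print(filled[0][1][1])
```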