# Possible Master's Thesis

Are you interested in writing a Master's thesis within the broad topic of Applied Statistics? On this website we describe what writing a Master's thesis in our group involves and list some project opportunities.

## Course of action

Contact us a few months before you plan to start your Master's thesis. We will then discuss the topic and technicalities with you. Before the work begins, we will provide you with a starting repository in Git containing additional information about available resources, a template for the report, guidelines, etc. You will fill out a thesis plan agreement in which you record the goals of the thesis. Besides the supervision of Reinhard Furrer, you will also be assigned a PhD student who will provide additional help. You will work quite independently, but will have meetings with one of the supervisors approximately every two or three weeks to discuss your questions. There will be an intermediate and a final presentation. The intermediate presentation should take place approximately halfway through your project. It serves to check whether you are on the right track and have understood the problem, data, and hypotheses, and to practice for the final presentation. The final presentation will take place after you have handed in the thesis.

## Duration

The work of the thesis is tailored to 6 months (corresponding to the 30 ECTS awarded). In case of unforeseen circumstances (TA activities, sickness, ...) it might be possible to get an extension.

## Report

The report has to be written in LaTeX with knitr. We will provide you with a template and support you in acquiring the necessary skills if you have never used this report-generating system.

## Reproducibility

All work has to be done in a reproducible framework. We adhere to these guidelines.

## Prerequisites

We strongly recommend the modules "STA402 Likelihood Inference" and "STA121 Statistical Modeling" or similar as prerequisites. In addition, we also recommend that you take the module "STA472 Good Statistical Practice", which teaches very useful technical and communication skills for the Master's thesis.

# List of topics

The list below gives an idea of prototypical MSc theses. Depending on your background and degree, the topics may be further tailored.

## Topics in Applied Statistics

Visit this page often as we try to keep the list up to date. Moreover, there are often "now or never" opportunities; we might have additional projects ready when you visit us.

### Sample size estimation for mixed models

In animal experimentation, measurements are often taken at several time points in the same animal during the study. Typical examples of such longitudinal data include the antibody titer in a vaccine trial, hemodynamic parameters during a clinical trial with different anesthetics, tumor growth at distinct time points in different treatment groups, and cross-over studies. Further complications for sample size estimation might occur due to the presence of additional covariates, the necessity for a baseline adjustment, and heteroscedasticity.

This type of data requires a statistical analysis which considers the potential clustering within animal, i.e. mixed effects models. A number of R packages have been developed for sample size estimation in the case of mixed models, but information on how to choose the most appropriate approach in specific situations is lacking.

The aim of this Master's thesis is to provide guidance on the most appropriate methods for sample size estimation for different types of longitudinal data. Data sets from published studies as well as simulated data sets will be used for the assessment.
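As a rough point of reference for such comparisons, the effect of within-animal clustering on the required sample size can be sketched with a closed-form normal approximation. The sketch below is only an illustration under strong assumptions that are not part of the project description: a balanced two-group design, a random intercept per animal, a treatment comparison based on the animal-level means over time, and known variances. The function name `animals_per_group` and all of its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def animals_per_group(delta, sd_between, sd_within, n_timepoints,
                      alpha=0.05, power=0.8):
    """Approximate number of animals per group (normal approximation).

    Assumes a random-intercept model: each animal mean over
    `n_timepoints` measurements has variance
    sd_between^2 + sd_within^2 / n_timepoints.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = z.inv_cdf(power)            # target power
    var_animal_mean = sd_between**2 + sd_within**2 / n_timepoints
    n = 2 * (z_alpha + z_beta)**2 * var_animal_mean / delta**2
    return math.ceil(n)


# More time points reduce only the within-animal part of the variance.
for m in (1, 4, 16):
    print(m, animals_per_group(delta=1.0, sd_between=1.0,
                               sd_within=1.0, n_timepoints=m))
```

Under these assumptions, adding time points cannot push the required number of animals below the level determined by the between-animal variance alone. Covariates, baseline adjustment, and heteroscedasticity are not covered by this formula, which is why simulation-based approaches are typically needed for realistic designs.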

### Bayesian modeling: model selection in the eggCounts model family

The prevalence of anthelmintic resistance has increased in recent years due to the extensive use of anthelmintic drugs to reduce the infection of parasitic worms in livestock. In order to detect the resistance, the number of parasite eggs in animal feces is counted. Typically, a subsample of the diluted feces is examined, and the mean egg counts from both untreated and treated animals are compared. In the past, conventional methods have been extended by rather complex Bayesian hierarchical models that take into account the variability introduced by the counting process, different infection levels across animals, or extra zero counts, which arise from unexposed animals in an infected population. Current practice in Germany for horses relies on only very few animals, and the recently introduced methods may not have sufficient statistical power or may not be suitable for sequential test procedures.

The goal of this Master's thesis is to study and implement model selection guidelines for the eggCounts family.

### Stan implementation of a parametric bootstrapping procedure for additive Bayesian network analysis

Studying the causes and effects of health and disease conditions is the cornerstone of epidemiology. Classical approaches, such as regression techniques, have been successfully used to model the impact of health determinants on the population. However, there is growing recognition that biological and behavioral factors at multiple levels can impact health conditions. Such epidemiological data are, by nature, highly complex and correlated. Classical regression techniques have shown a limited ability to capture the correlated multivariate nature of high-dimensional epidemiological variables. Models driven by expert knowledge often fail to efficiently manage the complexity and correlation of epidemiological data. Additive Bayesian Networks (ABNs) address these challenges by producing a data-selected set of multivariate models. Overfitting is a known limitation of an ABN analysis. Currently, there is well-developed code that implements a bootstrapping procedure dedicated to ABN in JAGS. However, the computation time is typically very long. JAGS and Stan have slightly different strengths and limitations, but Stan is known to be slightly faster.

The goal of this Master's thesis is to implement a bootstrapping procedure in Stan, to compare the computing performance of a JAGS and a Stan implementation on simulated examples, and to implement the code as an R function that can be added to the existing ABN R package.

### Functional data ANOVA

The concept of an Analysis of Variance (ANOVA) for single observations is straightforward and quite intuitive from a statistical or geometrical point of view. However, if we observe an entire function instead, the concept needs to be extended.

The thesis studies and discusses existing approaches and illustrates them using different datasets we have used in the past.

## Topics in Spatial Statistics

We have a pretty long list of topics in spatial (and spatio-temporal) statistics. Most of these can be further tailored to fit the student's background. We recommend the module "STA330 Modeling Dependent Data" for these projects.

A non-exhaustive list is as follows:

### Cholesky factorization of sparse matrices

In the setting of multivariate Gaussian random vectors, generating samples or evaluating the likelihood is computationally demanding, especially in large dimensions. An algorithm of choice is the Cholesky factorization of the underlying covariance matrix such that the determinant and the quadratic form of the density can be evaluated relatively fast.

The main objective is a (numerical) complexity analysis of the Cholesky factorization implemented in the R package *spam* and a comparison of its performance with alternative factorizations. Ideally, an existing approximate minimum degree algorithm will be embedded in the *spam* environment.
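To make the role of the Cholesky factor concrete, the following minimal sketch evaluates a multivariate Gaussian log-density from the factor: the log-determinant is twice the sum of the logarithms of the diagonal entries of the factor, and the quadratic form reduces to one triangular solve. This is a dense, pure-Python illustration with hypothetical function names. It is not the sparse algorithm used in *spam* and performs no fill-reducing reordering.

```python
import math


def cholesky(a):
    """Lower-triangular L with a = L L^T for a symmetric positive definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l


def forward_solve(l, b):
    """Solve L z = b for lower-triangular L by forward substitution."""
    z = [0.0] * len(b)
    for i in range(len(b)):
        z[i] = (b[i] - sum(l[i][k] * z[k] for k in range(i))) / l[i][i]
    return z


def gaussian_logdensity(x, mean, cov):
    """Log-density of N(mean, cov) at x, evaluated via the Cholesky factor."""
    n = len(x)
    l = cholesky(cov)
    logdet = 2.0 * sum(math.log(l[i][i]) for i in range(n))   # log det(cov)
    z = forward_solve(l, [xi - mi for xi, mi in zip(x, mean)])
    quad = sum(zi * zi for zi in z)                           # (x-mean)^T cov^{-1} (x-mean)
    return -0.5 * (n * math.log(2.0 * math.pi) + logdet + quad)


cov = [[4.0, 2.0], [2.0, 3.0]]
print(gaussian_logdensity([1.0, 0.5], [0.0, 0.0], cov))
```

For a dense matrix of dimension n the factorization costs on the order of n^3 operations, which is the cost that sparse factorizations with suitable orderings aim to reduce.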

### Gaussian equivalent measures

In the context of a bounded and fixed domain, it is impossible to differentiate between specific covariance models based on data only. More specifically, two different Gaussian measures can be equivalent. This fact is used to work with computationally simpler models. The theoretical assumptions for equivalence are explicit, but practical recommendations often still need to be elucidated. The thesis elaborates on the assumptions for specific parameter settings and incorporates a simulation study illustrating the practical aspects of the approximation.

### Computational and statistical efficiency of a 32-bit Cholesky factorization

When working with Gaussian likelihoods, one is confronted with calculating the log-determinant of the covariance matrix and solving a linear system based on the covariance matrix. In the case of sparse matrices, both steps can be efficiently accomplished after a Cholesky factorization (based on dedicated algorithms).