All Past Events

2024

Fast Bayesian Functional Principal Components Analysis

Wednesday, December 11, 2024

Speaker: Joe Sartini (Johns Hopkins University)

Functional Principal Components Analysis (FPCA) is one of the most successful and widely used analytic tools for functional data exploration and dimension reduction. Standard implementations of FPCA estimate the principal components from the data but ignore their sampling variability in subsequent inferences. To address this problem, we propose the Fast Bayesian Functional Principal Components Analysis (Fast BayesFPCA), which treats principal components as parameters on the Stiefel manifold. To ensure efficiency, stability, and scalability we introduce three innovations: (1) project all eigenfunctions onto an orthonormal spline basis, reducing modeling considerations to a smaller-dimensional Stiefel manifold; (2) induce a uniform prior on the Stiefel manifold of the principal component spline coefficients via the polar representation of a matrix with entries following independent standard Normal priors; and (3) constrain sampling leveraging the FPCA structure to improve stability. We demonstrate the improved credible interval coverage and computational efficiency of Fast BayesFPCA in comparison to existing software solutions. We then apply Fast BayesFPCA to actigraphy data from NHANES 2011-2014, a modeling task which could not be accomplished with existing MCMC-based Bayesian approaches.
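
As a rough illustration of innovation (2), the sketch below draws a matrix with orthonormal columns by taking the polar factor Q = Z (Z^T Z)^{-1/2} of a matrix Z with independent standard Normal entries. It is not the speaker's implementation: the function names are hypothetical, and the number of columns is fixed at two only so that the inverse matrix square root has a closed form in plain Python.

```python
import math
import random


def polar_stiefel_sample(n, rng):
    """Return an n x 2 matrix with orthonormal columns (n >= 2).

    Z has independent standard Normal entries and Q = Z (Z^T Z)^{-1/2}
    is its polar factor. Two columns are an assumption made for brevity.
    """
    z = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]

    # Entries of the symmetric 2 x 2 Gram matrix A = Z^T Z = [[a, b], [b, d]].
    a = sum(row[0] * row[0] for row in z)
    b = sum(row[0] * row[1] for row in z)
    d = sum(row[1] * row[1] for row in z)

    # Closed-form square root of a 2 x 2 positive definite matrix:
    # sqrt(A) = (A + s I) / t, with s = sqrt(det A) and t = sqrt(trace A + 2 s).
    s = math.sqrt(a * d - b * b)
    t = math.sqrt(a + d + 2.0 * s)

    # Entries of sqrt(A)^{-1}; the determinant of sqrt(A) equals s.
    r00 = (d + s) / (t * s)
    r01 = -b / (t * s)
    r11 = (a + s) / (t * s)

    # Q = Z sqrt(A)^{-1}
    return [[row[0] * r00 + row[1] * r01, row[0] * r01 + row[1] * r11] for row in z]


def gram(q):
    """Return Q^T Q as a 2 x 2 nested list, for checking orthonormality."""
    g00 = sum(row[0] * row[0] for row in q)
    g01 = sum(row[0] * row[1] for row in q)
    g11 = sum(row[1] * row[1] for row in q)
    return [[g00, g01], [g01, g11]]
```

Calling `gram(polar_stiefel_sample(6, random.Random(0)))` should return the 2 x 2 identity up to floating-point error, which confirms that the draw lies on the Stiefel manifold.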

Air Pollution Monitoring

Thursday, December 5, 2024

Speaker: Dr. Chris Heaney, Matthew Aubourg, Bonita Salmerón (Johns Hopkins University)

Data-related challenges in air pollution monitoring and health impacts, focused on South Baltimore and in partnership with South Baltimore Community Land Trust (represented by Greg Galen). Joint seminar with the JHU Causal Inference Working Group.

Backwards sequential Monte Carlo for efficient Bayesian optimal experimental design

Wednesday, November 13, 2024

Speaker: Andrew Chin (Johns Hopkins University)

The expected information gain (EIG) is a crucial quantity in Bayesian optimal experimental design (OED), quantifying how useful an experiment is by the amount we expect the posterior to differ from the prior. However, evaluating the EIG can be computationally expensive since it requires the posterior normalizing constant. A rich literature exists for estimation of this normalizing constant, with sequential Monte Carlo (SMC) approaches being one of the gold standards. In this work, we leverage two idiosyncrasies of OED to improve efficiency of EIG estimation via SMC. The first is that, in OED, we simulate the data and thus know the true underlying parameters. The second is that we ultimately care about the EIG, not the individual normalizing constants. This lets us create an EIG-specific SMC method that starts with a sample from the posterior and tempers backwards towards the prior. The key lies in the observation that, in certain cases, the Monte Carlo variance of SMC for the normalizing constant of a single dataset is significantly lower than the variance of the normalizing constants themselves across datasets. This suggests the potential to slightly increase variance while drastically decreasing computation time by reducing the SMC population, and taking this idea to the extreme gives rise to our method. We demonstrate our method on a simulated coupled spring-mass system where we observe order-of-magnitude performance improvements.
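
For readers unfamiliar with the quantity, one standard way to write the EIG is given below. The notation is an assumption, not taken from the talk: d denotes the design, θ the parameters, and y the simulated data.

```latex
\[
\mathrm{EIG}(d)
  = \mathbb{E}_{p(\theta)\,p(y \mid \theta, d)}
    \bigl[ \log p(y \mid \theta, d) - \log p(y \mid d) \bigr],
\qquad
p(y \mid d) = \int p(y \mid \theta, d)\, p(\theta)\, \mathrm{d}\theta .
\]
```

The term p(y | d) is the posterior normalizing constant mentioned above. It must be estimated separately for every simulated dataset y, which is where the computational cost arises.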

Neural Networks for Geospatial Data

Wednesday, October 16, 2024

Speaker: Wentao Zhan (Johns Hopkins University)

Geospatial data analysis has traditionally been model-based, with a mean model, customarily specified as a linear regression on the covariates, and a Gaussian process covariance model, encoding the spatial dependence. While non-linear machine learning algorithms like neural networks are increasingly used for spatial analysis, current approaches depart from the model-based setup and cannot explicitly incorporate spatial covariance.

In this talk, we will first go through a brief introduction to geospatial modeling, and several machine-learning-style extensions, followed by a focused discussion on NN-GLS — a novel neural network architecture recently proposed by us.

NN-GLS falls within the traditional Gaussian process (GP) geostatistical model. It accommodates non-linear mean functions while retaining all other advantages of GP, like explicit modeling of the spatial covariance and predicting at new locations via kriging. NN-GLS admits a representation as a special type of graph neural network (GNN). This connection facilitates the use of standard neural network computational techniques for irregular geospatial data, enabling novel and scalable mini-batching, backpropagation, and kriging schemes.

In addition, we provide a methodology to obtain uncertainty bounds for estimation and predictions from NN-GLS. Theoretically, we show that NN-GLS will be consistent for irregularly observed spatially correlated data processes. We also provide finite sample concentration rates that quantify the need to accurately model the spatial covariance in neural networks for dependent data. Simulations and an application to air pollution modeling will be presented to demonstrate the methodology.

Challenges and opportunities in the analysis of joint models of longitudinal and survival data

Wednesday, May 15, 2024

Speaker: Dr. Eleni-Rosalina Andrinopoulou (Erasmus University Medical Center)

The increasing availability of clinical measures, such as electronic medical records, has enabled the collection of diverse information including multiple longitudinal measurements and survival outcomes. Consequently, there is a need to utilize methods that can examine the associations between exposures and longitudinal measurement outcomes simultaneously. This statistical approach is known as joint modeling of longitudinal and survival data, which typically involves integrating linear mixed effects models for longitudinal measurements with Cox models for censored survival outcomes.

This method is motivated by various clinical scenarios. For instance, patients with Cystic Fibrosis, a genetic lung disorder, face risks like exacerbation, lung transplantation, or mortality, and are regularly monitored with multiple biomarkers. Similarly, patients recovering from stroke undergo longitudinal assessments to track their progress over time. Although these outcomes are biologically interconnected, they are often analyzed separately in practice.

Analyzing such complex data presents several challenges. One key challenge is accurately characterizing patients’ longitudinal profiles that influence survival outcomes. It’s commonly assumed that the underlying longitudinal values are associated with survival outcomes, but sometimes specific aspects of these profiles, like the rate of biomarker progression, affect the hazard differently. Choosing the right functional form for this relationship is crucial and requires careful investigation due to its potential impact on results.

Another challenge arises from the high dimensionality of some datasets, such as registry data. Analyzing such comprehensive datasets using complex methodologies can be computationally expensive. Therefore, there’s a demand for algorithms capable of distributed analyses that can concurrently and impartially explore multiple correlated outcomes.

In this presentation, we will explore strategies to tackle these challenges effectively.

Bayesian extension of Multilevel Functional Principal Components Analysis with application to Continuous Glucose Monitoring Data

Wednesday, May 1, 2024

Speaker: Joe Sartini (Johns Hopkins University)

Multilevel functional principal components analysis (MFPCA) facilitates estimation of hierarchical covariance structures for functional data produced by wearable sensors, including continuous glucose monitors (CGM), all while accounting for covariate effects. There are several existing methods to efficiently fit these types of models, including the eminent fast covariance estimation and the recently proposed procedure of fitting appropriate localized mixed effects models and smoothing (Xiao et al. 2016, Leroux et al. 2023). However, these methods do not inherently account for uncertainty in the eigenfunctions during the estimation procedure. Most rely on bootstrap or asymptotic analytic results to perform inference after estimation. Towards this end, we fit MFPCA within a fully-Bayesian framework using MCMC, treating the orthogonal eigenfunctions as random. A model constructed in this way automatically accounts for variability in eigenfunction estimation and its interplay with both features of the data and the assumed hierarchical structure. The flexibility of this method also makes it well-suited to exploring the imposition of additional constraints on the eigenfunctions, such as mutual orthogonality across levels. We assess the convergence of our model using Grassmannian distances between the spaces spanned by sampled eigenfunctions at each level. After performing validation using a variety of simulated functional data, we compare the results of our model to the prominent existing approaches using 4-hour windows of CGM data for persons with diabetes centered around known mealtimes.

2023

Estimation and false discovery control for the analysis of environmental mixtures

Wednesday, November 15, 2023

Speaker: Dr. Srijata Samanta (Bristol Myers Squibb)

The analysis of environmental mixtures is of growing importance in environmental epidemiology, and one of the key goals in such analyses is to identify exposures and their interactions that are associated with adverse health outcomes. Typical approaches utilize flexible regression models combined with variable selection to identify important exposures and estimate a potentially nonlinear relationship with the outcome of interest. Despite this surge in interest, no approaches to date can identify exposures and interactions while controlling any form of error rates with respect to exposure selection. We propose two novel approaches to estimating the health effects of environmental mixtures that simultaneously 1) estimate and provide valid inference for the overall mixture effect, and 2) identify important exposures and interactions while controlling the false discovery rate. We show that this can lead to substantial power gains to detect weak effects of environmental exposures. We apply our approaches to a study of persistent organic pollutants and find that controlling the false discovery rate leads to substantially different conclusions.

The Modified Ziggurat Algorithm for Skewed Shrinkage Prior

Wednesday, October 18, 2023

Speaker: Yihao Gu (Fudan University)

Consortiums of health databases utilize standardized vocabularies to facilitate multi-institutional studies based upon their constituent data. However, synthesizing this heterogeneous clinical data is hampered by variation between ostensibly unified terminologies, with each constituent dataset providing a different set of clinical covariates. Notably, we observe ontological relationships among these covariates, and related covariates likely contribute similarly to treatment decisions and health outcomes. Here, we extend the Bayesian hierarchical model framework by encoding ontological relations among covariates in the form of correlations in corresponding parameters. Additionally, to deal with the large number of covariates in the observational health databases, we introduce the skew-shrinkage technique. This technique directs parameter estimates toward either the null value or an informed value, based on the evidence supported by the data. We developed a modified ziggurat algorithm to address the computational challenges in updating the local-scale parameters under the skewed horseshoe priors. We demonstrate our approach in a transfer learning task, using a causal model trained on a larger database to improve the treatment effect estimate in a smaller database.

BLAST trainee presentations

Tuesday, October 3, 2023

Speaker: Andrew Chin, Yuzheng Dun, Claire Heffernan, Dr. Sandipan Pramanik (Johns Hopkins University)

A series of four talks given by current PhD students and postdocs in the BLAST working group.

Spatial predictions on physically constrained domains: Applications to Arctic sea salinity data

Wednesday, September 27, 2023

Speaker: Dr. Bora Jin (Johns Hopkins University)

In this paper, we predict sea surface salinity (SSS) in the Arctic Ocean based on satellite measurements. SSS is a crucial indicator for ongoing changes in the Arctic Ocean and can offer important insights about climate change. We particularly focus on areas of water mistakenly flagged as ice by satellite algorithms. To remove bias in the retrieval of salinity near sea ice, the algorithms use conservative ice masks, which result in considerable loss of data. We aim to produce realistic SSS values for such regions to obtain a more complete understanding of the SSS surface over the Arctic Ocean and benefit future applications that may require SSS measurements near edges of sea ice or coasts. We propose a class of scalable nonstationary processes that can handle large data from satellite products and complex geometries of the Arctic Ocean. Barrier Overlap-Removal Acyclic directed graph GP (BORA-GP) constructs sparse directed acyclic graphs (DAGs) with neighbors conforming to barriers and boundaries, enabling characterization of dependence in constrained domains. The BORA-GP models produce more sensible SSS values in regions without satellite measurements and show improved performance in various constrained domains in simulation studies compared to state-of-the-art alternatives. An R package is available on GitHub.

Proximal MCMC for Bayesian Inference of Constrained and Regularized Estimation

Wednesday, May 17, 2023

Speaker: Dr. Xinkai Zhou (Johns Hopkins University)

This paper advocates proximal Markov Chain Monte Carlo (ProxMCMC) as a flexible and general Bayesian inference framework for constrained or regularized estimation. Originally introduced in the Bayesian imaging literature, ProxMCMC employs the Moreau-Yosida envelope for a smooth approximation of the total-variation regularization term, fixes nuisance and regularization parameters as constants, and relies on the Langevin algorithm for the posterior sampling. We extend ProxMCMC to the full Bayesian framework with modeling and data-adaptive estimation of all parameters including the regularization strength parameter. More efficient sampling algorithms such as the Hamiltonian Monte Carlo are employed to scale ProxMCMC to high-dimensional problems. Analogous to the proximal algorithms in optimization, ProxMCMC offers a versatile and modularized procedure for the inference of constrained and non-smooth problems. The power of ProxMCMC is illustrated on various statistical estimation and machine learning tasks. The inference in these problems is traditionally considered difficult from both frequentist and Bayesian perspectives.
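
To make the Moreau-Yosida envelope concrete, the sketch below uses the absolute value g(x) = |x| as a simple stand-in for the total-variation term. This is an assumption for illustration; the function names and the parameter `lam` are hypothetical and do not come from the paper. The envelope is smooth, and its gradient is available through the proximal map, which is what gradient-based samplers such as the Langevin algorithm need.

```python
import math


def prox_abs(x, lam):
    """Proximal map of g(u) = |u|: argmin_u |u| + (x - u)^2 / (2 * lam).

    For the absolute value this is soft-thresholding at level lam.
    """
    return math.copysign(max(abs(x) - lam, 0.0), x)


def moreau_envelope_abs(x, lam):
    """Moreau-Yosida envelope of |.|: min_u |u| + (x - u)^2 / (2 * lam)."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2.0 * lam)


def grad_moreau_envelope_abs(x, lam):
    """Gradient of the envelope, (x - prox(x)) / lam; defined for every x."""
    return (x - prox_abs(x, lam)) / lam
```

With `lam = 1.0`, the envelope equals x^2 / 2 for |x| <= 1 and |x| - 1/2 otherwise, so it removes the kink of |x| at zero while staying close to it elsewhere.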

Polygenic Risk Prediction via Bayesian Bridge Prior

Wednesday, April 5, 2023

Speaker: Yuzheng Dun (Johns Hopkins University)

Polygenic Risk Scores (PRS) have shown great promise in predicting the genetic variation in complex human traits and diseases. Although a series of methods have been proposed to construct PRS, their relative performance varies considerably with not only the genetic architectures of the traits/diseases, but also the training GWAS sample size and LD reference panel. There is still a strong need for novel methods that have more flexible assumptions for effect size distribution and are less sensitive to the sample size of the LD reference panel. We introduce PRSBridge, a Bayesian method for developing PRS based on GWAS summary-level association statistics and an external reference panel for estimating linkage disequilibrium (LD). PRSBridge places a continuous shrinkage prior, the Bridge prior, on the SNP effect size distribution to accommodate varying genetic architectures. A prior-preconditioning conjugate gradient method is implemented to provide an MCMC algorithm with substantial computational advantages. Low rank approximation of the LD matrix makes our method robust to the LD reference panel, especially when the reference sample size is small. Our analyses on six continuous traits in UK Biobank further demonstrate the improvement in prediction power of PRSBridge, on average, over the two most commonly implemented methods, LDpred2 and PRS-CS.

A dynamic spatial filtering approach to mitigate underestimation bias in field calibrated low-cost sensor air pollution data

Wednesday, April 5, 2023

Speaker: Claire Heffernan (Johns Hopkins University)

Low-cost air pollution sensors, offering hyper-local characterization of pollutant concentrations, are becoming increasingly prevalent in environmental and public health research. However, low-cost air pollution data can be noisy, biased by environmental conditions, and usually need to be field-calibrated by collocating low-cost sensors with reference-grade instruments. We show that the common procedure of regression-based calibration using collocated data systematically underestimates high air pollution concentrations, which are critical to diagnose from a health perspective. Current calibration practices also often fail to utilize the spatial correlation in pollutant concentrations. We propose a novel spatial filtering approach to collocation-based calibration of low-cost networks that mitigates the underestimation issue by using an inverse regression. The inverse-regression also allows for incorporating spatial correlations by a second-stage model for the true pollutant concentrations using a conditional Gaussian Process. Our approach works with one or more collocated sites in the network and is dynamic, leveraging spatial correlation with the latest available reference data. The uncertainty in estimating the spatial correlations is propagated into the uncertainty of our concentration estimates through a Bayesian implementation of the method. Through extensive simulations, we demonstrate how the spatial filtering substantially improves estimation of pollutant concentrations, and measures peak concentrations with greater accuracy. We apply the methodology for calibration of a low-cost PM2.5 network in Baltimore, Maryland, and diagnose air pollution peaks that are missed by the regression-calibration.

Scalable Bayesian inference using non-reversible parallel tempering

Wednesday, March 8, 2023

Speaker: Dr. Saifuddin Syed (University of Oxford)

Markov chain Monte Carlo (MCMC) methods are the most widely used tools in Bayesian statistics for making inferences from complex posterior distributions. MCMC works by constructing a Markov chain stationary with respect to the posterior and averaging the statistics over its trajectory. In practice, for challenging problems where the posterior is high-dimensional with well-separated modes, MCMC algorithms can get trapped exploring local regions of high probability and fail to converge reliably in a finite time.

Physicists and statisticians independently introduced parallel tempering (PT) algorithms to tackle this issue. PT delegates the task of exploration to additional annealed chains running in parallel with better-mixing properties. They then communicate with the target chain of interest and help it discover new unexplored regions of the sample space. Since their introduction in the ’90s, PT algorithms have been used extensively to improve mixing in challenging sampling problems in statistics, physics, computational chemistry, phylogenetics, and machine learning.

The classical approach to designing PT algorithms was developed using a reversible paradigm that is difficult to tune and deteriorates performance when too many parallel chains are introduced. This talk will introduce a new non-reversible paradigm for PT that dominates its reversible counterpart while avoiding the performance collapse endemic to reversible methods. We will then establish near-optimal tuning guidelines and efficient black-box methodology scalable to GPUs. Our work out-performs state-of-the-art PT methods and has been used at scale by researchers to study the evolutionary structure of cancer and by the Event Horizon Telescope collaboration to discover magnetic polarization in the photograph of the supermassive black hole M87 and, most recently, to image Sagittarius A*, the supermassive black hole at the center of the Milky Way.

JHU Infectious Disease Dynamics Modeling Group

Wednesday, February 22, 2023

Speaker: Dr. Amy Wesolowski (Johns Hopkins University)

An overview of the Infectious Disease Dynamics (IDD) modeling group’s research portfolio, with short talks highlighting ongoing work and discussion of possible collaborations. The IDD is made up of faculty, post-docs, graduate and undergraduate students who are interested in the dynamics of a wide span of infectious diseases, from dengue to influenza to chikungunya, based at the Johns Hopkins Bloomberg School of Public Health. The group uses a combination of theoretical and empirical approaches to study transmission dynamics.

Hamiltonianizing a piecewise deterministic Markov process: a bouncy particle sampler with "inertia"

Wednesday, February 8, 2023

Speaker: Andrew Chin (Johns Hopkins University)

The Bouncy Particle Sampler is among the most prominent examples of piecewise deterministic Markov process samplers, a state-of-the-art paradigm in Bayesian computation. Inspired by recent connections to the Hamiltonian Monte Carlo paradigm, we present a Monte Carlo algorithm intimately related to the Bouncy Particle Sampler but relying on Hamiltonian-like dynamics which generate a piecewise linear trajectory similar to the Bouncy Particle Sampler’s. However, changes in its velocity occur deterministically in the manner of Hamiltonian dynamics, dictated by the auxiliary “inertia” parameter we introduce. We show that the proposed dynamics, while technically non-Hamiltonian, are reversible and volume-preserving and thus constitute a valid Metropolis proposal mechanism. They can be simulated exactly on log-concave target distributions, easily accommodate parameter constraints, and require minimal tuning. We further establish that the dynamics converge to the Bouncy Particle Sampler in the limit of increasingly frequent inertia refreshment. In this talk we first introduce Hamiltonian Monte Carlo and the Bouncy Particle Sampler. Then we describe our algorithm, which we call the Hamiltonian Bouncy Particle Sampler, and demonstrate its performance on real-data applications in observational health data analytics and phylogenetics.

2022

Spectral approaches to speed up Bayesian inference for large stationary time series data

Wednesday, December 7, 2022

Speaker: Dr. Matias Quiroz (Stockholm University)

This talk will discuss some recent approaches to speed up MCMC for large stationary time series data via data subsampling. We discuss the Whittle log-likelihood for univariate time series and some properties that allow estimating the log-likelihood via data subsampling. We also consider an extension to multivariate time series via the multivariate Whittle log-likelihood and propose a novel model that parsimoniously models semi-long memory properties of multivariate time series.
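
The sketch below illustrates why the Whittle log-likelihood lends itself to subsampling: it is a sum of one term per Fourier frequency, so a random subset of terms, rescaled, gives an unbiased estimate of the full sum. The AR(1) spectral density, the function names, and the simple random sampling scheme are assumptions for illustration, not the estimators presented in the talk.

```python
import cmath
import math
import random


def periodogram(x):
    """Return (frequency, periodogram ordinate) pairs at Fourier frequencies."""
    n = len(x)
    m = (n - 1) // 2
    out = []
    for k in range(1, m + 1):
        w = 2.0 * math.pi * k / n
        dft = sum(x[t] * cmath.exp(-1j * w * t) for t in range(n))
        out.append((w, abs(dft) ** 2 / (2.0 * math.pi * n)))
    return out


def ar1_spectral_density(w, phi, sigma2):
    """Spectral density of a stationary AR(1) process (assumed model)."""
    return sigma2 / (2.0 * math.pi * (1.0 - 2.0 * phi * math.cos(w) + phi * phi))


def whittle_loglik(pgram, phi, sigma2):
    """Whittle log-likelihood, up to a constant: -sum(log f + I / f)."""
    total = 0.0
    for w, ordinate in pgram:
        f = ar1_spectral_density(w, phi, sigma2)
        total += math.log(f) + ordinate / f
    return -total


def subsampled_whittle_loglik(pgram, phi, sigma2, size, rng):
    """Unbiased estimate of the full sum from `size` randomly chosen terms."""
    subset = rng.sample(pgram, size)
    return len(pgram) / size * whittle_loglik(subset, phi, sigma2)


def simulate_ar1(n, phi, rng):
    """Simulate an AR(1) series with unit innovation variance."""
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x
```

The periodogram is computed once; each subsequent likelihood evaluation inside an MCMC loop then touches only `size` frequencies instead of all of them.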

Bayesian penalized monotone regression for quantifying Alzheimer disease progression with biomarkers

Wednesday, November 9, 2022

Speaker: Mingyuan Li (Johns Hopkins University)

Several biomarkers are hypothesized to indicate early stages of Alzheimer’s disease, well before the cognitive symptoms manifest. Their precise relations to the disease progression, however, are poorly understood. This limits our ability to diagnose the disease and intervene at early stages. To better quantify how the disease and biomarkers progress, we propose a joint model in which biomarkers are modeled as increasing functions of a latent disease progression parameter. In estimating these functions, we deploy monotone regression splines with smoothness penalty to flexibly model increasing functions. Besides their monotonic property, the biomarkers are expected to “flatten out” before the onset of and at the end stage of the disease. We incorporate this scientifically-motivated shape-constraint through a “window function” that controls the prior variances on splines near the two ends of the disease progression. We fit this joint, monotone regression model under the Bayesian framework to avoid having to tune the large number of hyper-parameters and to allow for a future hierarchical extension to multi-database settings. The model fit to the BIOCARD data recovers biomarker progressions that are generally consistent with the existing scientific hypotheses.

Learning and Predicting from Dynamic Models for COVID-19 Patient Monitoring

Wednesday, November 9, 2022

Speaker: Zitong Wang (Johns Hopkins University)

COVID-19 has challenged health systems to learn how to learn. This talk describes the context and methods for learning from EHR data at one academic health center. We use Bayesian hierarchical regression to jointly model 1) major survival outcomes including discharge, ventilation, and death, and 2) multivariate biomarker processes that describe a patient’s disease trajectory. We focus on dynamic models using Bayesian machinery in which both the predictors and survival outcomes vary over time. We contrast prospective longitudinal models in common use with retrospective analogues that are complementary in the COVID-19 context. We apply the method to a cohort of 1,678 patients who were hospitalized with COVID-19 during the early months of the pandemic. The Bayesian dynamics model facilitates physician learning and clinical decision making through graphical tools.

A Second Look at Spatial Confounding in Spatial Linear Mixed Models

Tuesday, October 25, 2022

Speaker: Dr. Kori Khan (Iowa State University)

In the last two decades, considerable research has been devoted to a phenomenon known as spatial confounding. Spatial confounding is thought to occur when there is collinearity between a covariate and the random effect in a spatial regression model. This collinearity is often considered highly problematic when the inferential goal is estimating regression coefficients, and various methodologies have been proposed to “alleviate” it. Recently, it has become apparent that many of these methodologies are flawed, yet the field continues to expand. In this talk, we synthesize work in the field of spatial confounding. We propose that there are at least two distinct phenomena currently conflated with the term spatial confounding. We refer to these as the analysis model and the data generation types of spatial confounding. In the context of spatial linear mixed models, we show that these two issues can lead to contradicting conclusions about whether spatial confounding exists and whether methods to alleviate it will improve inference. Our results also illustrate that in many cases, traditional spatial models do help to improve inference of regression coefficients. Drawing on the insights gained, we offer a path forward for research in spatial confounding.

Efficient Alternatives for Bayesian Hypothesis Tests in Psychology

Wednesday, October 12, 2022

Speaker: Dr. Sandipan Pramanik (Johns Hopkins University)

Bayesian hypothesis testing procedures have gained increased acceptance in recent years. A key advantage of Bayesian tests over classical testing procedures is their potential to quantify information supporting true null hypotheses. Ironically, default implementations of Bayesian tests prevent the accumulation of strong evidence in favor of true null hypotheses because associated default alternative hypotheses assign a high probability to data that are most consistent with a null effect. We propose the use of “non-local” alternative hypotheses to resolve this paradox. The resulting class of Bayesian hypothesis tests permits a more rapid accumulation of evidence in favor of both true null hypotheses and alternative hypotheses that are compatible with standardized effect sizes of most interest in psychology.

Bypassing Markov Chains for Bayesian Generalized Linear Mixed Effects Models

Wednesday, May 4, 2022

Speaker: Dr. Jonathan R. Bradley (Florida State University)

Markov chain Monte Carlo (MCMC) is an all-purpose tool that allows one to generate dependent replicates from a posterior distribution for effectively any Bayesian hierarchical model. As such, MCMC has become a standard in Bayesian statistics. However, convergence issues, tuning, and the effective sample size of the MCMC are nontrivial considerations that are often overlooked or can be difficult to assess. Moreover, these practical issues can produce a significant computational burden. This motivates us to consider finding closed-form expressions of the posterior distribution that are computationally straightforward to sample from directly. We focus on a broad class of Bayesian generalized linear mixed-effects models (GLMM) that allows one to jointly model data of different types (e.g., Gaussian, Poisson, and binomial distributed observations). Exact sampling from the posterior distribution for Bayesian GLMMs is such a difficult problem that it is now arguably overlooked as a possible problem to solve. To solve this problem, we derive a new class of distributions that gives one the flexibility to specify the prior on fixed and random effects to be any conjugate multivariate distribution. We refer to this new distribution as the generalized conjugate multivariate (GCM) distribution. The expression of the exact posterior distribution is given along with the steps to obtain direct independent simulations from the posterior distribution. These direct simulations have an efficient projection/regression form, and hence, we refer to our method as Exact Posterior Regression (EPR). Several illustrations are provided.

nnSVG: scalable identification of spatially variable genes using nearest-neighbor Gaussian processes

Wednesday, April 20, 2022

Speaker: Dr. Lukas Weber (Johns Hopkins University)

Spatially-resolved transcriptomics enables the measurement of transcriptome-wide gene expression along with spatial coordinates of the measurements within tissue samples. Depending on the technological platform, this is achieved either by tagging messenger RNA (mRNA) molecules with spatial barcodes followed by sequencing, or through fluorescence imaging-based in-situ transcriptomics techniques where mRNA molecules are detected along with their spatial coordinates using sequential rounds of fluorescent barcoding. During computational analyses of spatially-resolved transcriptomics data, an important initial analysis step is to apply feature selection methods to identify a set of genes that vary in expression across the tissue sample of interest. These genes are referred to as ‘spatially variable genes’ (SVGs). These SVGs can then be further investigated individually as potential markers of biological processes, or used as the input for further downstream analyses such as spatially-aware unsupervised clustering of cell populations. Here, we propose ‘nnSVG’, a new scalable approach to identify SVGs based on nearest-neighbor Gaussian processes (NNGPs), which applies NNGPs in the context of spatially-resolved transcriptomics data. Our method identifies SVGs with flexible length scales per gene, optionally within spatial domains (subregions of the tissue slide), and scales linearly with the number of spatial locations. The linear computational scalability ensures that the method can be applied to the latest technological platforms with thousands or more spatial locations per tissue slide. We demonstrate the performance of our method using experimental data from several technological platforms and simulations, and show that it outperforms existing approaches. A software implementation is available from Bioconductor and GitHub.

Better with Bayes: surface-based spatial Bayesian modeling in functional MRI

Wednesday, April 6, 2022

Speaker: Dr. Mandy Mejia (Indiana University)

Functional magnetic resonance imaging (fMRI) is a non-invasive indirect measure of neural activity, which is commonly used to study the function, organization and connectivity of the brain. Given its high dimensionality and complex spatiotemporal structure, fMRI data is often analyzed in a “massive univariate” framework wherein a separate model is fit at every location (e.g. voxel or vertex) of the brain. This approach ignores spatial dependencies, leading to inefficient estimates and a lack of power to detect effects, particularly in individual subjects. A statistically principled alternative is spatial Bayesian models, which impose spatial priors on the latent signal. For computational feasibility, stationary and isotropic Gaussian Markov random field (GMRF) spatial priors are a common choice. The underlying signal in fMRI data is primarily localized to the gray matter of the cortical surface and subcortical/cerebellar structures. In its original volumetric form, the spatial fields of this signal exhibit clear deviations from the assumptions of stationarity and isotropy due to cortical folding and the presence of nuisance tissue classes (white matter and cerebrospinal fluid). It is therefore preferable to build spatial models directly on the cortical surface, a 2-dimensional manifold, and subcortical/cerebellar gray matter regions (collectively referred to as “grayordinates”). In this talk, I will discuss my group’s work developing spatial Bayesian models for common types of fMRI analysis. I will also discuss the software we have developed to facilitate the adoption of grayordinates neuroimaging data and spatial Bayesian modeling for fMRI. Finally, I will present an application using a task fMRI study of amyotrophic lateral sclerosis (ALS), in which we uncovered new features of disease progression.

Spatial meshing for general Bayesian multivariate models

Wednesday, March 9, 2022

Speaker: Dr. Michele Peruzzi (Duke University)

Quantifying spatial associations in multivariate geolocated data of different types is achievable via random effects in a Bayesian hierarchical model, but severe computational bottlenecks arise when spatial dependence is encoded as a latent Gaussian process (GP) in the increasingly common large scale data settings on which we focus. The scenario worsens in non-Gaussian models because the reduced analytical tractability leads to additional hurdles to computational efficiency. We introduce methodologies for efficiently computing multivariate Bayesian models of spatially referenced non-Gaussian data. First, we outline spatial meshing as a tool for building scalable processes using patterned directed acyclic graphs. Then, we introduce a novel Langevin method which achieves superior sampling performance with non-Gaussian multivariate data that are common in studying species’ communities. We proceed with outlining strategies for improving Markov chain Monte Carlo performance in the settings on which we focus. We conclude with extensions and applications showcasing the flexibility of the proposed methodologies and the publicly available software packages.

Recent Experiences Conducting Trials using Bayesian Response Adaptive Randomization

Wednesday, February 23, 2022

Speaker: Dr. Thomas Murray (University of Minnesota)

I will discuss my experiences coordinating two recent clinical trials in out-of-hospital cardiac arrest that used Bayesian response adaptive randomization designs, and present some methodological innovations to improve implementation and understand the potential benefit these designs offer. I will discuss the ACCESS trial (Clinicaltrials.gov ID: NCT03119571), which sought to compare the efficacy of two standards of care: direct admission to the cardiac catheter laboratory versus the ICU; and the ARREST trial (NCT03880565), which sought to evaluate the efficacy of in-transit ECMO-facilitated resuscitation versus standard Advanced Cardiac Life Support (ACLS) resuscitation. My experiences with these trials motivated methodological research into alternative prior choices and randomization techniques that improve type I error control and reduce the risk of enrolling a substantial proportion of participants to an inferior treatment arm. Building upon this research, we proposed and investigated comparing a set of potential designs in terms of the expected number of failures among the cohort most acutely affected by the choice of design; namely, persons who would participate in the trial if it were open to enrollment at the time they become eligible. I plan to discuss the details and take-aways from this line of research.
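
As a rough illustration of the general idea behind Bayesian response adaptive randomization (not the ACCESS or ARREST designs, whose details are not given here), the sketch below assumes a two-arm trial with a binary outcome, independent Beta(1, 1) priors, and an allocation probability equal to the posterior probability that arm A is better, clipped by a floor so that neither arm is starved. The function names, the priors, and the floor value are all hypothetical choices made for this example.

```python
import random


def prob_a_better(succ_a, fail_a, succ_b, fail_b, n_draws=20000, seed=1):
    """Monte Carlo estimate of P(theta_A > theta_B | data).

    Assumes independent Beta(1, 1) priors on each arm's success
    probability (an illustrative choice, not taken from the talk).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        theta_a = rng.betavariate(1 + succ_a, 1 + fail_a)
        theta_b = rng.betavariate(1 + succ_b, 1 + fail_b)
        wins += theta_a > theta_b
    return wins / n_draws


def allocation_prob(p_better, floor=0.1):
    """Probability of assigning the next participant to arm A.

    The floor (hypothetical value) keeps the allocation within
    [floor, 1 - floor] so the apparently inferior arm still enrolls.
    """
    return min(1.0 - floor, max(floor, p_better))


if __name__ == "__main__":
    # Interim data: arm A has 15/20 successes, arm B has 5/20.
    p = prob_a_better(15, 5, 5, 15)
    print("P(A better):", p, "allocation to A:", allocation_prob(p))
```

The floor is one simple way to limit extreme allocations; the prior choices and randomization techniques mentioned in the abstract may differ substantially from this sketch.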

A Bayesian predictive platform design for proof of concept and dose finding using early and late endpoints

Wednesday, February 9, 2022

Speaker: Dr. Ruitao Lin (University of Texas)

Evaluating long-term benefits of potential new treatments for chronic diseases can be very time-consuming and costly. We propose a Bayesian predictive platform design that provides a unified framework for evaluating multiple investigational agents in a multistage, randomized controlled trial. The design expedites the drug evaluation process and reduces development costs by including dose finding, futility and superiority monitoring, and enrichment, while avoiding over-allocating patients to a shared placebo or active control arm. To facilitate making real-time interim group sequential decisions, unobserved long-term responses are treated as missing values and imputed from longitudinal biomarker measurements. Design parameters as well as the maximum sample size are calibrated to obtain good frequentist properties. The proposed design is illustrated by a trial of three targeted agents for systemic lupus erythematosus, evaluated by their 24-week response rates. Extensive simulations show that the proposed design compares favorably to several conventional platform designs.

2021

Variational Methods for Latent Variable Problems: Part III

Wednesday, November 17, 2021

Speaker: Dr. Ryan Giordano (Massachusetts Institute of Technology)

Many practical problems in statistical inference involve “latent” variables, by which I will mean high-dimensional, unobserved nuisance parameters or missing data which must be accounted for when performing inference on some lower-dimensional quantity of primary interest. Common examples include random effects models (the random effects are the latent variables) and mixture models (where the component indicators are the latent variables). I will introduce and discuss variational inference (VI) methods for latent variable problems, drawing connections both with Bayesian approaches (Markov Chain Monte Carlo and the maximum a-posteriori estimator) and frequentist approaches (maximum likelihood estimators and the EM algorithm). I will focus particularly on providing intuition on when VI is or is not helpful, how it can go wrong, and briefly survey some modern approaches to alleviating its shortcomings.

Variational Methods for Latent Variable Problems: Part II

Wednesday, November 10, 2021

Speaker: Dr. Ryan Giordano (Massachusetts Institute of Technology)

Many practical problems in statistical inference involve “latent” variables, by which I will mean high-dimensional, unobserved nuisance parameters or missing data which must be accounted for when performing inference on some lower-dimensional quantity of primary interest. Common examples include random effects models (the random effects are the latent variables) and mixture models (where the component indicators are the latent variables). I will introduce and discuss variational inference (VI) methods for latent variable problems, drawing connections both with Bayesian approaches (Markov Chain Monte Carlo and the maximum a-posteriori estimator) and frequentist approaches (maximum likelihood estimators and the EM algorithm). I will focus particularly on providing intuition on when VI is or is not helpful, how it can go wrong, and briefly survey some modern approaches to alleviating its shortcomings.

Variational Methods for Latent Variable Problems: Part I

Wednesday, October 27, 2021

Speaker: Dr. Ryan Giordano (Massachusetts Institute of Technology)

Many practical problems in statistical inference involve “latent” variables, by which I will mean high-dimensional, unobserved nuisance parameters or missing data which must be accounted for when performing inference on some lower-dimensional quantity of primary interest. Common examples include random effects models (the random effects are the latent variables) and mixture models (where the component indicators are the latent variables). I will introduce and discuss variational inference (VI) methods for latent variable problems, drawing connections both with Bayesian approaches (Markov Chain Monte Carlo and the maximum a-posteriori estimator) and frequentist approaches (maximum likelihood estimators and the EM algorithm). I will focus particularly on providing intuition on when VI is or is not helpful, how it can go wrong, and briefly survey some modern approaches to alleviating its shortcomings.

Latent Gaussian Model Boosting

Wednesday, October 13, 2021

Speaker: Dr. Fabio Sigrist (Lucerne University)

Latent Gaussian models and boosting are widely used techniques in statistics and machine learning. Tree-boosting shows excellent predictive accuracy on many data sets, but potential drawbacks are that it assumes conditional independence of samples, produces discontinuous predictions for, e.g., spatial data, and it can have difficulty with high-cardinality categorical variables. Latent Gaussian models, such as Gaussian process and grouped random effects models, are flexible prior models that allow for making probabilistic predictions. However, existing latent Gaussian models usually assume either a zero or a linear prior mean function which can be an unrealistic assumption. We introduce a novel approach that combines boosting and latent Gaussian models in order to remedy the above-mentioned drawbacks and to leverage the advantages of both techniques. We obtain increased predictive accuracy compared to existing approaches in both simulated and real-world data experiments.

Generalized Additive Neutral to the Right Regression for Survival Analysis

Wednesday, May 12, 2021

Speaker: Dr. Alan Riva-Palacio (Universidad Nacional Autónoma de México)

We present a novel Bayesian nonparametric model for regression in survival analysis. The model builds on the neutral to the right model of Doksum (1974) and on the Cox proportional hazards model of Kim and Lee (2003). The use of a vector of dependent Bayesian nonparametric priors allows us to efficiently model the hazard as a function of covariates whilst allowing non-proportionality. Properties of the model and inference schemes will be discussed. The method will be illustrated using simulated and real data. (Joint work with Jim Griffin, University College London, U.K., and Fabrizio Leisen, University of Nottingham, U.K.)

Reversible Hamiltonian zigzag sampler outperforms its non-reversible competitors to learn correlation among mixed-type biological traits

Wednesday, April 28, 2021

Speaker: Zhenyu Zhang (University of California, Los Angeles)

Inferring correlation among multiple continuous and discrete biological traits along an evolutionary history remains an important yet challenging problem. We jointly model these mixed-type traits through data augmentation and a phylogenetic multivariate probit model. With large sample sizes, posterior computation under this model is problematic, as it requires repeated sampling from a high-dimensional truncated Gaussian distribution with strong correlation. For this task, we propose the Hamiltonian zigzag sampler based on Laplace momentum, a state-of-the-art Markov chain Monte Carlo method. The reversible Hamiltonian zigzag sampler achieves better efficiency than its non-reversible competitors, including the Markovian zigzag sampler and the bouncy particle sampler, which is the best current approach for sampling latent parameters in the phylogenetic probit model. In an application with 535 HIV viruses and 24 traits that necessitates sampling from a 12,840-dimensional truncated normal, our method makes it possible to estimate the across-trait correlation and detect association between immune escape mutations and the pathogen’s capacity to cause disease.

Disease Risk Modeling and Visualization using R-INLA

Wednesday, March 31, 2021

Speaker: Dr. Paula Moraga (King Abdullah University of Science and Technology)

Disease risk models are essential to inform public health and policy. These models can be used to quantify disease burden, understand geographic and temporal patterns, identify risk factors, and measure inequalities. In this tutorial we will learn how to estimate disease risk and quantify risk factors using spatial data. We will also create interactive maps of disease risk and risk factors, and introduce presentation options such as interactive dashboards. The tutorial examples will focus on health applications, but the approaches covered are also applicable to other fields that use georeferenced data including ecology, demography or the environment. The workshop materials are drawn from the book ‘Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny’ by Paula Moraga (2019, Chapman & Hall/CRC).

Modeling cell-free DNA fragmentation in human cancers

Wednesday, March 17, 2021

Speaker: Dr. Rob Scharpf (Johns Hopkins University)

Cell-free DNA in the blood provides a non-invasive diagnostic avenue for patients with cancer. However, characteristics of the origins and molecular features of cell-free DNA are poorly understood. Here we developed an approach to evaluate fragmentation patterns of cell-free DNA across the genome, and found that profiles of healthy individuals reflected nucleosomal patterns of white blood cells, whereas patients with cancer had altered fragmentation profiles. We used this method to analyse the fragmentation profiles of 236 patients with breast, colorectal, lung, ovarian, pancreatic, gastric or bile duct cancer and 245 healthy individuals. A machine learning model that incorporated genome-wide fragmentation features had sensitivities of detection ranging from 57% to more than 99% among the seven cancer types at 98% specificity, with an overall area under the curve value of 0.94. Fragmentation profiles could be used to identify the tissue of origin of the cancers to a limited number of sites in 75% of cases. Combining our approach with mutation-based cell-free DNA analyses detected 91% of patients with cancer. The results of these analyses highlight important properties of cell-free DNA and provide a proof-of-principle approach for the screening, early detection and monitoring of human cancer.

Two approaches to unmeasured spatial confounding

Wednesday, March 3, 2021

Speaker: Dr. Georgia Papadogeorgou (University of Florida)

Spatial confounding has different interpretations in the spatial and causal inference literatures. I will begin this talk by clarifying these two interpretations. Then, seeing spatial confounding through the causal inference lens, I discuss two approaches to account for unmeasured variables that are spatially structured when we are interested in estimating causal effects. The first approach is based on the propensity score. We introduce the distance adjusted propensity scores (DAPS) that combine spatial distance and propensity score difference of treated and control units in a single quantity. Treated units are then matched to control units if their corresponding DAPS is low. We can show that this approach is consistent, and we propose a way to choose how much matching weight should be given to unmeasured spatial variables. In the second approach, we aim to bridge the spatial and causal inference literatures by estimating causal effects in the presence of unmeasured spatial variables using outcome modeling tools that are popular in spatial statistics. Motivated by the bias term of commonly-used estimators in spatial statistics, we propose an affine estimator that addresses this deficiency. I will explain that estimation of causal parameters in the presence of unmeasured spatial confounding can only be achieved under an untestable set of assumptions. We provide one such set of assumptions which describe how the exposure and outcome of interest relate to the unmeasured variables.
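
To make the matching idea concrete, here is a minimal sketch of how a score combining propensity score difference and spatial distance might be used. It assumes, for illustration only, that the combined score is a convex combination with weight `w`, that distances are already standardized to [0, 1], and that matching is greedy, without replacement, and subject to a caliper. The function names, the caliper, and the greedy strategy are hypothetical and may not match the speaker's procedure.

```python
def daps(ps_treated, ps_control, dist, w):
    """Combined score: weight w on the propensity score difference,
    weight (1 - w) on the standardized spatial distance.
    (Assumed functional form, for illustration.)
    """
    return w * abs(ps_treated - ps_control) + (1.0 - w) * dist


def match_units(ps_treated, ps_control, dist, w=0.5, caliper=0.1):
    """Greedy 1:1 matching without replacement.

    ps_treated, ps_control: lists of estimated propensity scores.
    dist[i][j]: standardized distance between treated i and control j.
    A treated unit is matched to the unused control with the lowest
    score, provided that score is below the caliper.
    """
    pairs = []
    used = set()
    for i, ps_t in enumerate(ps_treated):
        best_j, best_score = None, caliper
        for j, ps_c in enumerate(ps_control):
            if j in used:
                continue
            score = daps(ps_t, ps_c, dist[i][j], w)
            if score < best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


if __name__ == "__main__":
    ps_t = [0.60, 0.30]
    ps_c = [0.58, 0.90, 0.32]
    d = [[0.05, 0.10, 0.90],
         [0.80, 0.20, 0.04]]
    print(match_units(ps_t, ps_c, d, w=0.5, caliper=0.1))
```

Setting `w = 1` reduces the score to ordinary propensity score matching, while smaller `w` gives more weight to spatial proximity, which is the trade-off the abstract refers to when it mentions choosing the matching weight.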

Introduction to Hamiltonian Monte Carlo

Wednesday, February 17, 2021

Speaker: Dr. Aki Nishimura (Johns Hopkins University)

Hamiltonian Monte Carlo (HMC) is a state-of-the-art Markov chain Monte Carlo algorithm for Bayesian computation, forming a backbone of popular Bayesian inference software packages such as Stan and PyMC. Even though these packages automate posterior computation for users, a basic understanding of HMC remains essential to obtaining the best computational performance from these packages. For example, HMC, and hence software based on it, is highly sensitive to model parametrization. In this tutorial, I will explain the inner workings of HMC and the aspects of its implementations that are most relevant to its practical performance.
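
For readers unfamiliar with the algorithm, the following is a minimal textbook-style sketch of HMC with a leapfrog integrator and a Metropolis correction, applied to a one-dimensional standard normal target. It is not how Stan or PyMC implement sampling (those packages add adaptive tuning and more elaborate trajectory selection); the function names, the fixed step size, and the number of leapfrog steps are assumptions made for this example.

```python
import math
import random


def leapfrog(q, p, grad_U, step_size, n_steps):
    """Approximate Hamiltonian dynamics for H(q, p) = U(q) + p^2 / 2."""
    p = p - 0.5 * step_size * grad_U(q)        # initial half step for momentum
    for i in range(n_steps):
        q = q + step_size * p                  # full step for position
        if i < n_steps - 1:
            p = p - step_size * grad_U(q)      # full step for momentum
    p = p - 0.5 * step_size * grad_U(q)        # final half step for momentum
    return q, p


def hmc_sample(U, grad_U, q0, n_samples, step_size=0.2, n_steps=20, seed=0):
    """Draw samples from the density proportional to exp(-U(q))."""
    rng = random.Random(seed)
    q = q0
    samples = []
    for _ in range(n_samples):
        p0 = rng.gauss(0.0, 1.0)               # refresh momentum
        q_new, p_new = leapfrog(q, p0, grad_U, step_size, n_steps)
        h_old = U(q) + 0.5 * p0 * p0
        h_new = U(q_new) + 0.5 * p_new * p_new
        # Metropolis step corrects for the integrator's discretization error.
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples


if __name__ == "__main__":
    # Standard normal target: U(q) = q^2 / 2, so grad_U(q) = q.
    draws = hmc_sample(lambda q: 0.5 * q * q, lambda q: q, q0=3.0, n_samples=5000)
    mean = sum(draws) / len(draws)
    var = sum((x - mean) ** 2 for x in draws) / len(draws)
    print("mean:", mean, "variance:", var)
```

The step size and trajectory length are fixed by hand here; how well such fixed settings work depends on the scale and geometry of the target, which is one way to see why parametrization matters in practice.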

Personalized Dynamic Treatment Regimes in Continuous Time: A Bayesian Joint Model for Optimizing Clinical Decisions with Timing

Wednesday, February 3, 2021

Speaker: Dr. Yanxun Xu (Johns Hopkins University)

Accurate models of clinical actions and their impacts on disease progression are critical for estimating personalized optimal dynamic treatment regimes (DTRs) in medical/health research, especially in managing chronic conditions. Traditional statistical methods for DTRs usually focus on estimating the optimal treatment or dosage at each given medical intervention, but overlook the important question of “when this intervention should happen.” We fill this gap by building a generative model for a sequence of medical interventions (which are discrete events in continuous time) with a marked temporal point process (MTPP) where the mark is the assigned treatment or dosage. This clinical action model is then embedded into a Bayesian joint framework where the other components model clinical observations including longitudinal medical measurements and time-to-event data. Moreover, we propose a policy gradient method to learn the personalized optimal clinical decision that maximizes patient survival by interacting the MTPP with the model on clinical observations while accounting for uncertainties in clinical observations. A signature application of the proposed approach is to schedule follow-up visitations and assign a dosage at each visitation for patients after kidney transplantation. We evaluate our approach in comparison with alternative methods on both simulated and real-world datasets. In our experiments, the personalized decisions made by our method turn out to be clinically useful: they are interpretable and successfully help improve patient survival. The R package doct (short for “Decisions Optimized in Continuous Time”) implementing the proposed model and algorithm is publicly available.

2020

Spatial Factor Modeling: A Bayesian Matrix-Normal Approach for Misaligned Data

Wednesday, December 2, 2020

Speaker: Dr. Lu Zhang (Columbia University)

Multivariate spatially-oriented data sets are prevalent in the environmental and physical sciences. Investigators aim to model multiple variables, each indexed by a spatial location, jointly to capture spatial association for each variable as well as associations among the variables. We prefer multivariate latent spatial processes to drive the inference and allow better predictive inference at arbitrary locations. High-dimensional multivariate spatial data, the theme of this work, arise when the number of spatial locations or the number of spatially dependent variables is very large. We propose frameworks to extend scalable modeling strategies for a single process into multivariate process cases. We pursue Bayesian inference which is attractive for full uncertainty quantification of the latent spatial process. Our approach exploits distribution theory for the Matrix-Normal distribution, which we use to build scalable versions of a hierarchical linear model of coregionalization (LMC) and spatial factor models that deliver inference over a high-dimensional parameter space including the latent spatial process. We illustrate the computational and inferential benefits of our algorithms over competing methods using simulation studies and real data analyses for a vegetation index dataset with millions of observed locations.

Getting started with Bayesian modeling in Stan

Wednesday, November 18, 2020

Speaker: Charles Margossian (Columbia University)

Stan is a probabilistic programming language, designed primarily for Bayesian inference. Its main algorithm is an adaptive Hamiltonian Monte Carlo sampler, supported by a state-of-the-art automatic differentiation library. Beyond model fitting, the Stan framework is designed to support a comprehensive modeling workflow. Using a simple example, I’ll demonstrate how to code and run a model in Stan; how to analyze samples from our posterior distribution using various diagnostics such as posterior predictive checks; and, based on these diagnoses, how to improve our model. This talk includes live-coding in R and Stan, and participants are welcome to code along.

A Case Study Competition among Methods for Analyzing Large Spatial Data

Wednesday, November 4, 2020

Speaker: Dr. Matt Heaton (Brigham Young University)

The Gaussian process is an indispensable tool for spatial data analysts. The onset of the “big data” era, however, has led to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given locations, and each implementation was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics.

Bayes in the time of Big Data

Wednesday, October 28, 2020

Speaker: Dr. Andrew Holbrook (University of California, Los Angeles)

Big Bayes is the computationally intensive co-application of big data and large, expressive Bayesian models for the analysis of complex phenomena in scientific inference and statistical learning. Standing as an example, Bayesian multidimensional scaling (MDS) can help scientists learn viral trajectories through space and time, but its computational burden prevents its wider use. Crucial MDS model calculations scale quadratically in the number of observations. We mitigate this limitation through massive parallelization using multi-core central processing units, instruction-level vectorization and graphics processing units (GPUs). Fitting the MDS model using Hamiltonian Monte Carlo, GPUs can deliver more than 100-fold speedups over serial calculations and thus extend Bayesian MDS to a big data setting. To illustrate, we employ Bayesian MDS to infer the rate at which different seasonal influenza virus subtypes use worldwide air traffic to spread around the globe. We examine 5392 viral sequences and their associated 14 million pairwise distances arising from the number of commercial airline seats per year between viral sampling locations. To adjust for shared evolutionary history of the viruses, we implement a phylogenetic extension to the MDS model and learn that subtype H3N2 spreads most effectively, consistent with its epidemic success relative to other seasonal influenza subtypes.
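
The quadratic scaling mentioned above can be checked with a short calculation: n observations give n(n − 1)/2 distinct pairwise distances, which for the 5392 viral sequences is roughly the 14 million quoted in the abstract. The helper name below is hypothetical and the snippet is only a back-of-the-envelope illustration.

```python
def n_pairs(n):
    """Number of distinct unordered pairs among n observations."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    n = 5392
    print(n_pairs(n))                   # 14534136, i.e. about 14 million
    # Doubling the number of observations roughly quadruples the work.
    print(n_pairs(2 * n) / n_pairs(n))
```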