Keynote and invited speakers
Keynote Speakers
Sharon Machlis
Sharon Machlis is Director of Editorial Data & Analytics at IDG Communications, a global technology media company that publishes Computerworld, CIO, PCWorld, and Macworld, among other brands. In her current role, she hosts InfoWorld’s “Do More With R” video series, writes R how-to articles, helps colleagues with data journalism projects, codes tools for colleagues around the world, and analyzes editorial web metrics. Sharon is the author of “Practical R for Mass Communication & Journalism” and the Computerworld Beginner’s Guide to R. She has also written a number of R packages, including SunsetTSA.
Kelly O'Briant
Kelly O’Briant is a Solutions Engineer at RStudio interested in configuration and workflow management with a passion for R administration. She has a background in data science, software engineering and cloud computing, with degrees in bioinformatics and computational sciences. In 2016, Kelly founded the Washington DC chapter of R-Ladies, a global organization dedicated to increasing gender diversity in the R community.
Tomas Kalibera
Tomas Kalibera is a researcher at the Czech Technical University. As an active R Core developer, Tomas has been contributing improvements and bug fixes to multiple parts of the R runtime, including the byte-code compiler and interpreter, source reference handling, Windows-specific code, package installation, memory management, object dispatch, and function invocation. He implemented and maintains rchk, a tool for finding PROTECT bugs, which he uses regularly to check CRAN packages. He has been helping the CRAN team with checking and debugging packages, and is a member of the R Foundation.
Francesco Bartolucci
Francesco Bartolucci is a Professor of Statistics in the Department of Economics of the University of Perugia, where he also coordinated the Doctorate program in “Mathematical and Statistical Methods for the Economic and Social Sciences”. His main research interests are longitudinal and panel data, latent variable and mixture models, marginal models for categorical data, and optimization and Markov chain Monte Carlo algorithms. He co-authored the R packages LMest, MultiCIRT, MLCIRTwithin, cquad and extRC.
Jared Lander
Jared Lander is Chief Data Scientist at Lander Analytics, and Adjunct Professor at Columbia Business School, where he teaches the Introduction to Programming in R course. He is the organizer of the New York Open Statistical Programming Meetup and a Series Editor for Pearson. He is the author of the R packages coefplot, useful, resumer and RepoGenerator, as well as the book “R for Everyone: Advanced Analytics and Graphics”. His work with the Minnesota Vikings in the 2015 NFL Draft has been featured in the Wall Street Journal.
Stephanie Hicks
Stephanie Hicks is an Assistant Professor in the Department of Biostatistics at Johns Hopkins Bloomberg School of Public Health. She is also a faculty member of the Johns Hopkins Data Science Lab, co-host of The Corresponding Author podcast discussing data science in academia, and co-founder of R-Ladies Baltimore. Her research interests are at the intersection of statistics, genomics, and data science. She actively contributes software packages to Bioconductor and is involved in teaching courses for data science and the analysis of genomics data.
Invited Speakers
brquasi: Improved quasi-likelihood estimation
Reduction of bias in the estimation of generalized linear models has seen extensive applied usage. The main reasons for this are the availability of comprehensive R packages for its application (e.g. brglm2; https://cran.r-project.org/package=brglm2) and the desirable side-effects that bias reduction has in models for categorical responses (see, e.g. http://arxiv.org/abs/1812.01938). Nevertheless, a significant limitation of the methods behind software like brglm2 is that they rely on the full specification of the model, and do not apply, for example, to the modelling of overdispersed data using quasi-likelihoods.
In this talk, we introduce the brquasi R package that allows users, for the first time, to reduce bias in the quasi-likelihood estimation of regression models without the need to resort to resampling (as methods such as the jackknife and bootstrap typically require). The only requirement is one extra smoothness assumption, which holds for the widely used quasi-likelihood models implemented in R.
We illustrate the properties of the new estimators, and the inference and prediction capabilities in brquasi, using examples from the modelling of overdispersed data and relative risk regression.
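For readers new to bias reduction in fully specified GLMs, here is a minimal sketch using brglm2, the package cited above (the data are toy values; brquasi’s own interface is not shown here):

    # install.packages("brglm2")
    library(brglm2)

    # toy binary-response data, for illustration only
    set.seed(1)
    x <- rnorm(100)
    y <- rbinom(100, size = 1, prob = plogis(0.5 + 1.2 * x))

    fit_ml <- glm(y ~ x, family = binomial())                     # maximum likelihood
    fit_br <- glm(y ~ x, family = binomial(), method = brglmFit)  # bias-reduced fit
    summary(fit_br)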
And you thought CRAN was harsh
Many R workflows revolve around packages and git. Typically, they use some form of continuous integration, such as Travis or GitLab CI. The general idea is that R developers are notified if a commit causes the package to fail some checks. This talk will describe the additional rigorous steps that we apply to our checks via the inteRgrate package. Using this package allows us to standardise code style, catch errors more quickly, and produce more readable commits.
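As a hedged illustration of how such checks might be invoked in a CI script (the function names below are assumptions drawn from the inteRgrate documentation; verify them against the current README):

    # remotes::install_github("jumpingrivers/inteRgrate")
    library(inteRgrate)

    check_pkg()              # assumed: runs R CMD check with stricter defaults
    check_lintr()            # assumed: fails the build on lintr style violations
    check_tidy_description() # assumed: enforces a tidy DESCRIPTION file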
How can you write a good R package?
Read a starter guide, the “R packages” book, “Writing R Extensions”, and then?
This talk will feature ways to take your R package, and your R development skills, to the next level!
We shall present tools that equip you with helpful flags and metrics, from R CMD check locally and on the cloud, to lintr::lint_package() and covr::package_coverage().
We shall also mention tools that automagically improve your package or its docs: the styler and pkgdown packages.
Furthermore, we shall explain how rOpenSci Software Peer Review helps package authors receive human and humane feedback.
Finally, we shall advocate for learning from others’ experience, be it by directly reading code, or by reading blogs such as the R-hub blog and developers’ forums such as the R-package-devel mailing list.
We don’t aim to go through a catalogue: we hope you’ll come away from this talk with one or a few life-changing habits for your daily R development work.
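As a minimal sketch of the flags and metrics mentioned above, run from the root of a package source tree (output and thresholds will vary by package):

    library(lintr)
    library(covr)
    library(styler)

    lintr::lint_package()            # static analysis: style issues and common errors
    cov <- covr::package_coverage()  # measure test coverage across R/ and tests/
    covr::report(cov)                # browse an interactive coverage report
    styler::style_pkg()              # restyle the package to the tidyverse style guide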
Building Agile data products leveraging the R package structure
Building a complex data science product can span many months or even years. This poses at least two risks to the successful completion of the project. The data scientist might work too long in isolation, losing touch with stakeholders and the intended users of the product. And as the product grows in complexity, reproducibility may deteriorate.
During the first half of the talk we will discuss the benefits of the Agile approach to data science. What are the pros of early deployment and continuous delivery? How can one build a Minimum Viable Product? How to deal with the uncertainty in data science? This part will be generic to data science rather than specific to R users. In the second part, however, it will be apparent that the R package structure is ideal for an Agile approach. We can clearly separate the product code (R folder) from the exploration code (Vignettes folder). All the best practices for high-quality software development will be enforced upon the product by using R packages instead of individual scripts. Meanwhile, hypothesis testing can be done quickly without polluting the product. Finally, I propose to create a data pipeline from the product code, using drake.
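A minimal sketch of such a pipeline, assuming hypothetical package functions clean_data() and fit_model() living in the R folder:

    library(drake)

    plan <- drake_plan(
      raw     = read.csv(file_in("data/raw.csv")),   # hypothetical input file
      cleaned = clean_data(raw),                     # product code from the R folder
      model   = fit_model(cleaned),
      report  = rmarkdown::render(
        knitr_in("vignettes/report.Rmd"),
        output_file = file_out("report.html")
      )
    )

    make(plan)  # rebuilds only the targets whose dependencies have changed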
Better than Deep Learning: Gradient Boosting Machines (GBM) – with 2020 updates
With all the hype about deep learning and “AI”, it is not well publicized that for structured/tabular data widely encountered in business applications it is actually another machine learning algorithm, the gradient boosting machine (GBM), that most often achieves the highest accuracy in supervised learning/prediction tasks. In this talk we’ll review some of the main open source GBM implementations such as xgboost, h2o, lightgbm, catboost, Spark MLlib (all of them available from R) and we’ll discuss some of their main features and characteristics (such as training speed, memory footprint, scalability to multiple CPU cores and in a distributed setting, GPU implementations, prediction speed etc). If you have seen an earlier version of my talk with the same title (for example at eRum 2018, or the video recording from several other conferences/meetups), this talk will have plenty of updates that will make it worth hearing (for example more details on the GPU implementations, new results on catboost, or exciting updates on Spark MLlib).
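As a minimal, illustrative sketch of one of the implementations discussed, a GBM fit with xgboost on its bundled example data (the parameters are arbitrary defaults, not tuned values):

    library(xgboost)

    data(agaricus.train, package = "xgboost")
    dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

    bst <- xgb.train(
      params  = list(objective = "binary:logistic", max_depth = 6, eta = 0.1),
      data    = dtrain,
      nrounds = 100
    )

    pred <- predict(bst, agaricus.train$data)  # predicted class probabilities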
Multi-Omics Factor Analysis Plus: A probabilistic framework for comprehensive and scalable integration of multi-modal data
Technological advances have led to a growing number of studies that profile multiple molecular layers simultaneously on large cohorts of samples (e.g., patient tissues) and, more recently, at single cell resolution. While this promises a more comprehensive understanding, computational strategies for an integrative analysis are essential to obtain valuable insights into the underlying biological processes. Furthermore, such methods need to handle complex experimental designs that include multiple data modalities and multiple groups of samples as well as scale to the growing dimensions of the data. We present Multi-Omics Factor Analysis (MOFA) and its extension MOFA+ that provide a statistical framework for the comprehensive and scalable integration of such multi-modal data. MOFA+ reconstructs a low-dimensional representation of the data using a fast stochastic variational inference scheme and employs flexible sparsity constraints that enable joint modelling of variation across multiple sample groups and data modalities. It enables decomposing variation into factors present in all, some, or single data modalities or sample groups and promotes interpretable factors that can directly be linked to molecular drivers. Once learnt, the factors enable a variety of downstream analyses, including the inference of differentiation trajectories, denoising, feature selection or the identification of sample subgroups.
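A hedged sketch of the typical MOFA+ workflow via the MOFA2 Bioconductor package, assuming ‘data’ is a list of feature-by-sample matrices, one per modality (the options left implicit here are defaults, not recommendations):

    library(MOFA2)

    mofa <- create_mofa(data)    # 'data': list of feature x sample matrices
    mofa <- prepare_mofa(mofa)   # set data, model and training options
    mofa <- run_mofa(mofa)       # fit via stochastic variational inference

    factors <- get_factors(mofa) # low-dimensional factors for downstream analyses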
Testing Shiny: What, why, and how.
It’s now a common saying that in software engineering, everything that can be tested should be tested. And R code is no exception: the more you test, the more you’re likely to catch errors at an early stage of your project. And there are a lot of tools available out there to do exactly that: test your R code to protect your project against bugs.
But Shiny is a little bit different: some parts of the app need a Shiny runtime, some parts are pure back-end functions, and others are pure web elements. So, how can we test Shiny efficiently? What do we test, and why?
In this talk, Colin will cover some of the lesser known tools that Shiny developers can integrate in their workflow.
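As one minimal example of the kind of tooling involved, shiny::testServer() (available since shiny 1.5.0) lets you drive a server function from a test; the tiny server below is hypothetical:

    library(shiny)
    library(testthat)

    server <- function(input, output, session) {
      doubled <- reactive(input$x * 2)
      output$out <- renderText(doubled())
    }

    testServer(server, {
      session$setInputs(x = 21)    # simulate user input
      expect_equal(doubled(), 42)  # assert on reactive values directly
    })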
From writing code to infoRming policy: a case study of reproducible research in transport planning
R provides unparalleled support for reproducible research. Its command-line interface and scriptable nature are revolutionary for people who previously relied on explaining a long series of steps in a graphical user interface to enable others to reproduce their work. Furthermore, R has many tools to enable the efficient replication of results in everything ranging from minimal examples (e.g. via the function dput() and the package reprex) to large projects (e.g. via Makefiles and workflow management packages such as drake).
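A minimal sketch of the two helpers named above:

    # dput() serialises an object as runnable R code that others can paste:
    dput(head(mtcars, 2))

    # reprex() renders a small, self-contained example (code plus output)
    # ready to share on forums or issue trackers:
    reprex::reprex({
      x <- c(1, 5, 9)
      mean(x)
    })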
Reproducibility (and its corollary, falsifiability) has been recognised as a cornerstone of science since the time of Karl Popper, but few have considered the implications for policy. This presentation will outline ways in which research design decisions can maximise the chances of informing evidence-based policies. This includes the choice of software and the way in which code underlying research is written, maintained and disseminated. Case studies from my work on the Propensity to Cycle Tool (the results of which are freely available at www.pct.bike), which has informed government transport policies, and the stats19 package for accessing road traffic casualty data will illustrate these points. The talk will conclude with concrete steps that everyone can take to maximise the reproducibility of not only their code but also the key results of research, to encourage scientific debate and evidence-based decisions.
dplyr 1.0.0
dplyr, the data manipulation package of the tidyverse, recently reached the 1.0.0 milestone. This version brings new functions, enhances error messages throughout the package, consolidates the foundations of the package thanks to vctrs, embraces selection capabilities thanks to tidyselect, makes the interface nicer with across(), and is blessed with an amazing new logo. Several blog posts before and after the release have covered the changes. In this talk, we’ll focus on summarise(), one of the main dplyr functions, which has received interesting new superpowers.
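A minimal sketch of those new summarise() capabilities, using across() and summaries that return more than one value per group:

    library(dplyr)

    mtcars %>%
      group_by(cyl) %>%
      summarise(
        across(c(mpg, hp), mean),  # summarise several columns at once
        rng = range(disp)          # summaries may now return multiple rows
      )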
CRANalerts: A Shinyapp-as-a-Service for Impatient R Users
You’re super excited for your favorite package to get updated soon. It has to be soon, you’ve been waiting so long for those cool new features! You keep checking CRAN and GitHub and Twitter every few days to see news about the new release, but you’re always afraid you may miss it. There’s got to be a better way… CRANalerts to the rescue! In this talk I’ll discuss my experience building a Shiny app to solve this problem, along with the issues that come with it.
Invited CovidR Winners
COVID-19 Data Hub
Built with R, available in any language, the Data Hub provides a worldwide, fine-grained, unified dataset helpful for a better understanding of COVID-19. The user can instantly download up-to-date, structured, historical daily data across several official sources. The data are crunched hourly by the R package COVID19 and made available in CSV format in cloud storage, so as to be easily accessible from Excel, R, Python… and any other software. We welcome external contributors to join and extend the number of supported data sources. All sources are properly documented, along with their citation. COVID-19 Data Hub can spot misalignments between data sources and automatically inform authorities of possible errors. All logs are available at https://covid19datahub.io. The package, by Emanuele Guidotti and David Ardia, is available on CRAN.
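A minimal sketch of pulling the data from R (the column selection shown is illustrative):

    # install.packages("COVID19")
    library(COVID19)

    # country-level (level = 1) historical daily data; level = 2 and
    # level = 3 give state- and city-level granularity where supported
    x <- covid19(country = "Italy", level = 1)
    head(x[, c("date", "confirmed", "deaths", "recovered")])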
COVOID: Modelling COVID-19 Transmission and Interventions
COVOID is an evolving but fully functional R package and accompanying Shiny app for simulation modelling of both COVID-19 transmission and the interventions intended to reduce that transmission, using deterministic compartmental models (DCMs). The package contains an expanding API for simulating and estimating homogeneous and age-structured SIR, SEIR and extended models. In particular, COVOID allows the simultaneous simulation of setting-specific (e.g. school closures) and general interventions over varying time intervals. This is informed through the incorporation of publicly available data on population demographics from the United Nations, age- and setting-specific contact rates from previously published research, and COVID-19 incidence counts from the European CDC. The built-in Shiny app enables ease of use and demonstration of key concepts to those without R programming backgrounds. Coauthors: Oisin Fitzgerald and Tim Churches.
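COVOID’s own API is not reproduced here; as a generic illustration of the deterministic compartmental models it builds on, a minimal SIR model with deSolve:

    library(deSolve)

    sir <- function(t, state, pars) {
      with(as.list(c(state, pars)), {
        dS <- -beta * S * I / N
        dI <-  beta * S * I / N - gamma * I
        dR <-  gamma * I
        list(c(dS, dI, dR))
      })
    }

    out <- ode(
      y     = c(S = 999, I = 1, R = 0),            # initial compartment sizes
      times = seq(0, 180, by = 1),                 # days
      func  = sir,
      parms = c(beta = 0.3, gamma = 0.1, N = 1000) # illustrative rates
    )
    head(out)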