News
Entertainment
Science & Technology
Life
Culture & Art
Hobbies
News
Entertainment
Science & Technology
Culture & Art
Hobbies
8 | Follower
Regular readers know that I maintain gssr and gssrdoc, two packages for R. The former makes the General Social Survey’s annual, cumulative and panel datasets available in a way that’s easy to use in R. The latter makes the survey’s codebook available in R’s integrated help system in a way that documents every GSS variable as if it were a function or object in R, so you can query them in exactly the same way as any function from the R console or in the IDE of your choice. As a bonus, because I use pkgdown to document the packages, I get a website as a side-effect. In the case of gssrdoc this means a browsable index of all the GSS variables. The GSS is the Hubble Space Telescope of American social science; our longest-running representative view of many aspects of the character and opinions of American households. The data is freely available from NORC, but they distribute it in SPSS, SAS, and STATA formats. I wrote these packages in an effort to make it more easily available in R. If you want to know the relationship between these various platforms, I have you covered. But the important thing is that R is a free and open-source project, and the others are not. This week I spent a little time updating gssrdoc a bit to clean up how the help pages looked and make some other improvements. Inside R, you can say, e.g., ?govdook at the console and have this pop up in the help: Yeah govdook is short for ‘Gov Do OK’, not ‘Dook’. The package also includes gss_doc, a data frame containing all of the information that the help pages are built from. I included it because it can be useful to work with directly, as when you might want to extract summary information about a subset of variables. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 library(tibble) library(gssrdoc) gss_doc #> # A tibble: 6,694 × 10 #> variable description question value_labels var_yrtab yrballot_df module_df subject_df norc_id norc_url #> #> 1 year GSS year for this respondent "GSS year" "[NA(d)] do… 1 https:/… #> 2 id Respondent id number "Respondent id … "" 2 https:/… #> 3 wrkstat labor force status "Last week were… "[1] workin… 3 https:/… #> 4 hrs1 number of hours worked last week "Last week were… "[89] 89+ h… 4 https:/… #> 5 hrs2 number of hours usually work a week "Last week were… "[89] 89+ h… 5 https:/… #> 6 evwork ever work as long as one year "Last week were… "[1] yes / … 6 https:/… #> 7 occ R's census occupation code (1970) "A. What kind o… "[NA(d)] do… 7 https:/… #> 8 prestige r's occupational prestige score(1970) "A. What kind o… "[NA(d)] do… 8 https:/… #> 9 wrkslf r self-emp or works for somebody "A. What kind o… "[1] self-e… 9 https:/… #> 10 wrkgovt govt or private employee "A. What kind o… "[1] govern… 10 https:/… #> # ℹ 6,684 more rows The gss_doc object has regular columns but also a series of list-columns to (insert meme here, you know the one) put data frames inside your data frames. (They’re labeled as “tibbles” here; basically the same thing). Why a list-column? Why a list? Well, a list is one of the fundamental ways to store data of any sort. Lists are useful because they can contain heterogeneous elements: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 items $todo_home #> [1] "Laundry" "Clean bathroom" "Feed cat" "Bring out rubbish bins" #> #> $important_dates #> [1] "1776-07-04" "1788-06-21" "2025-01-18" #> #> $keycode #> [1] 8675309 #> #> $storage_tiers #> [1] 128 256 512 1024 One thing to notice about a list like this is that it doesn’t really make sense to represent it as a table. This is partly because the elements of the list are of different lengths, but really it’s because if we did represent it as a table, it would not mean anything to read across the rows: 1 2 3 4 5 6 7 8 items_df #> # A tibble: 4 × 4 #> todo_home important_dates keycode storage_tiers #> #> 1 Laundry 1776-07-04 8675309 128 #> 2 Clean bathroom 1788-06-21 - 256 #> 3 Feed cat 2025-01-18 - 512 #> 4 Bring out rubbish bins - - 1024 The rows don’t form “cases” of anything. We just have four unrelated categories with various pieces of information in them. Lists are also useful because they lend themselves easily to being nested: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 items $todo_home #> $todo_home$tasks #> [1] "Laundry" "Clean bathroom" "Feed cat" "Bring out rubbish bins" #> #> $todo_home$tobuy #> [1] "Cat Food" "Burritos" #> #> $todo_home$wifi_password #> [1] "p@ssw0rd!" #> #> #> $important_dates #> [1] "1776-07-04" "1788-06-21" "2025-01-18" #> #> $keycode #> [1] 8675309 #> #> $storage_tiers #> $storage_tiers$ssd #> [1] 128 256 512 1024 #> #> $storage_tiers$ram #> [1] 1 4 8 In its bones, R is a LISP/Scheme-like list-processing language fused with features of classic array languages like APL. This is because, in the world of data analysis, what we deal with all the time are rectangular tables, or arrays, where rows are cases and columns are different sorts of variables. The wrinkle is that, unlike a beautiful array of pure numbers, each column might measure something (a date, a True/False answer, a location, a score, a nationality) that we’d prefer not to represent directly as a number. Sure, underneath in the computer everything is all just ones and zeros. (Or rather, electromagnetic patterns in some physical substrate that we can interpret as meaning ones and zeros.) And if we want to do any sort of data analysis that involves treating our table as a matrix then we’ll want numeric representations of all the columns. But for many uses we’d like to see “France” or “Strongly Agree” instead of “33” or “5”. Just a table of rows and columns, where different things can be represented across columns, but any particular column is all the same kind of thing. A rectangular table like that is called a data frame. One way to think of a data frame is just as a special case of a list. A data frame is a list where you can put all the list elements side by side and treat them as columns, and where all these elements are made of vectors of the same length. Beyond that, it’s a list where the nth element of each vector refers to some property of the same underlying entity, i.e. the thing that’s in the row, or case; the thing the columns are showing you measurements or properties of. You can have empty entries if needed, as when some bit of data is missing. The important thing is that each column has as many slots as there are cases, and you fill in the values for each case in the same slot in each column. Whenever you look at any table of data, one of your first questions should always be “What is a row in this table?” In this case, each row is a variable in the full GSS dataset, and each column describes some property of that variable. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 library(tibble) library(gssrdoc) gss_doc #> # A tibble: 6,694 × 10 #> variable description question value_labels var_yrtab yrballot_df module_df subject_df norc_id norc_url #> #> 1 year GSS year for this respondent "GSS year" "[NA(d)] do… 1 https:/… #> 2 id Respondent id number "Respondent id … "" 2 https:/… #> 3 wrkstat labor force status "Last week were… "[1] workin… 3 https:/… #> 4 hrs1 number of hours worked last week "Last week were… "[89] 89+ h… 4 https:/… #> 5 hrs2 number of hours usually work a week "Last week were… "[89] 89+ h… 5 https:/… #> 6 evwork ever work as long as one year "Last week were… "[1] yes / … 6 https:/… #> 7 occ R's census occupation code (1970) "A. What kind o… "[NA(d)] do… 7 https:/… #> 8 prestige r's occupational prestige score(1970) "A. What kind o… "[NA(d)] do… 8 https:/… #> 9 wrkslf r self-emp or works for somebody "A. What kind o… "[1] self-e… 9 https:/… #> 10 wrkgovt govt or private employee "A. What kind o… "[1] govern… 10 https:/… #> # ℹ 6,684 more rows Because R was designed by statisticians—R is a descendant of S, which like everything else in computing traces its origins to Bell Labs—it has this concept of a data frame built-in to its core instead of being bolted-on afterwards, which is extremely handy. Normally data frames are just ordinary rectangles, but there’s no reason why any particular column can’t itself be thought of as a list of something else. That’s what we have here. The yr_vartab column contains data frames of crosstabs of the answers to each question by year. Except where it doesn’t (e.g. for id), and this is fine because lists don’t have to be internally homogeneous. Similarly yrballot_df has a little table of which ballots, or internal portions of the survey, a question was asked on for each year it was asked. The upshot is that having assembled the gss_doc object we can use it to emit, like, seven thousand pages of documentation on the GSS’s many, many questions over the years. We can build them as standardized R help pages, as above. On the website that pgkdown builds for us, we get this: Website view. The cross-referencing to other relevant variables in the “See Also” section is new in this version. It comes courtesy of the GSS’s own information about survey modules and an ad hoc topic index they keep for the variables. I just use a subset of possible cross-references as we don’t want, e.g., every single question in the GSS core to be cross-referenced to every other core question on any particular help page. On the website, I gather these into a single page: Topic index page. The GSS has its own handy data explorer which is very useful for quickly checking on particular trends and getting a quick graph of what the data look like, or a summary view of the content of particular variables. Each help page in gssrdoc now links to the GSS Data Explorer page for that variable, in case you want to hop over and take a look there. Of course, the gssrdoc package doesn’t and isn’t meant to replace the Data Explorer; it’s just a different view of the same information, with a different use-case in mind.
Dear fellow developers and data scientists: If everyone reading this gave just the price of a coffee, I could focus fully on open source work for our community. But not everyone can or will contribute, and that’s okay. For years, I’ve built and m...
So this is a blog post about making and finding an error in modelling strategy, in an apparently simple situation. I saw this post (toot?) on Mastodon from Daniel Lakens mentioning in passing that “On the other hand, the homicide rate across the world...
This tutorial breaks down the development of an R Shiny application titled S&P 500 Monitoring Dashboard for the 2025 Posit Table Dashboard. This app effectively combines interactive financial data visualization (plotly), beautiful data tables (gt, gtExtras), web scraping (rvest), and external API integration (riingo, ellmer/Gemini AI) within a custom, sleek dark theme. You can access the app through this link
The future package celebrates ten years on CRAN as of June 19, 2025. I got a bit stalled over the holidays and while attending the fantastic useR! 2025 conference, but, as promised, here is the fourth in a series of blog posts highlighting recent...
This post contains a riddle/puzzle related to T. Moudiki's situation, including files and code snippets in R, Python, and bash. The ZIP file with a lot of details is available for download. An update with hints is provided.
The paper The Ten-Year Cycle in Numbers of the Lynx in Canada by Charles Elton and Mary Nicholson is a foundational work in population ecology and wildlife dynamics. They documented the regular, roughly ten-year population cycle of the Canadian Lynx (Lynx canadensis), observed in fur trade records from the Hudson’s Bay Company (HBC).
Hidden Markov models (HMMs) are powerful models to capture the complex behaviours of psychological processes that switch between different latent states. Examples include manic and depressive states in bipolar disorder, recovery and relapse states as s...
Join our workshop on Building and Customising Statistical Models with Stan and R: An Introduction to Bayesian Inference, which is a part of our workshops for Ukraine series! Here’s some more info: Title: Building and Customising Statistical Models with Stan and R: An Introduction to Bayesian Inference Date: Thursday, November 13th, 18:00 – 20:00 CEST … Continue reading Building and Customising Statistical Models with Stan and R: An Introduction to Bayesian Inference workshopBuilding and Customising Statistical Models with Stan and R: An Introduction to Bayesian Inference workshop was first posted on October 16, 2025 at 10:57 am.
Movement tracking is a novel process-tracing method that promises unique access to the temporal dynamics of psychological processes. The method involves high-resolution tracking of a hand or handheld device (e.g., a computer mouse) while it is used to ...
Python 3.14 was released on 7th October 2025. Here we summarise some of the more interesting changes and some trends in Python development and data-science over the past year. We will highlight the following: the colourful Python command-line interface; project-management tool uv; free-threading; and a brief summary of other developments. The Python 3.14 release notes also describe the changes to base Python. Colourful REPL At Jumping Rivers we have taught a lot of people to program in Python. Throughout a programming career you get used to making, and learning from, mistakes. The most common mistakes made in introductory programming lessons may still trip you up in 10 years time: unmatched parentheses, typos, missing quote symbols, unimported dependencies. Our Python training courses are presented using Jupyter. Jupyter notebooks have syntax highlighting that makes it easy to identify an unfinished string, or a mis-spelled keyword. But, most Python learners don’t use Jupyter (or other high-level programming tools) on day one - they experiment with Python at the command line. You can type “python” into your shell/terminal window and start programming into the “REPL” (read-evaluate-print loop). Any effort to make the REPL easier to work with will be beneficial to beginning programmers. So the introduction of syntax highlighting in the Python 3.14 REPL is really beneficial. Whether you want to start from scratch, or improve your skills, Jumping Rivers has a training course for you. uv and package development One of the big trends in Python development within 2025, is the rise of the project management tool uv. This is a Rust-based command-line tool and can be used to initialise a package / project structure, to specify the development and runtime environment of a project, and to publish a package to PyPI. At Jumping Rivers, we have used poetry for many of the jobs that uv excels at. Python is used for the data preparation tasks for diffify.com, and we use poetry to ensure that our developers each use precisely the same package versions when working on that project (See our current blog series on Poetry). But, poetry doesn’t prevent developers using different versions of Python. For that, we need a second tool like pyenv (which allows switching between different Python versions) or for each developer to have the same Python version installed on their machine. uv goes a step further than poetry and allows us to pin Python versions for a project. Let’s use uv to install Python 3.14, so that we can test out features in the new release. First follow the instructions for installing uv. Then at the command line, we will use uv to create a new project where we’ll use Python 3.14. # [bash] cd ~/temp mkdir blog-py3.14 cd blog-py3.14 # Which versions of Python 3.14 are available via uv? uv python list | grep 3.14 # cpython-3.14.0rc2-linux-x86_64-gnu # cpython-3.14.0rc2+freethreaded-linux-x86_64-gnu You’ll see something similar regardless of the operating system that you use. That lists two versions of Python 3.14 - one with an optional system called “Free Threading” (see later). We’ll install both versions of Python: uv python install cpython-3.14.0rc2-linux-x86_64-gnu uv python install cpython-3.14.0rc2+freethreaded-linux-x86_64-gnu Users of pyenv will be able to install Python 3.14 in a similar manner. We can select between the two different Python versions at the command line. First using the version that does not have free threading: uv run --python=3.14 python # Python 3.14.0rc2 (main, Aug 18 2025, 19:19:22) [Clang 20.1.4 ] on linux # ... >>> import sys >>> sys._is_gil_enabled() # True Then using the version with free threading (note the t suffix) uv run --python=3.14t python # ... # Python 3.14.0rc2 free-threading build (main, Aug 18 2025, 19:19:12) [Clang 20.1.4 ] on linux # ... >>> import sys >>> sys._is_gil_enabled() # False Project creation and management with uv uv is capable of much more than allowing us to switch between different versions of Python. The following commands initialise a Python project with uv: # From ~/temp/blog-py3.14 # Indicate the default python version for the project uv python pin 3.14 # Initialise a project in the current directory uv init . # Check the Python version uv run python --version # Python 3.14.0rc2 This adds some files for project metadata (pyproject.toml, README.md) and version control: tree -a -L 1 # . # ├── .git # ├── .gitignore # ├── main.py # ├── pyproject.toml # ├── .python-version # ├── README.md # ├── uv.lock # └── .venv # # 2 directories, 6 files Now we can add package dependencies using uv add and other standard project-management tasks. But one thing I wanted to highlight is that uv allows us to start a Jupyter notebook, using the project’s Python interpreter, without either adding jupyter as a dependency or explicitly defining a kernel for jupyter: uv run --with jupyter jupyter lab Creating a new notebook using the default Python 3 kernel in the JupyterLab session that starts, should ensure you are using the currently active Python 3.14 environment. Threading Python 3.13 introduced an experimental feature, ‘Free-threading’, that is now officially supported as of 3.14. First though, what is a ’thread’? When a program runs on your computer, there are lots of different tasks going on. Some of those tasks could run independently of each other. You, as the programmer, may need to explain to the computer which tasks can run independently. A thread is a way of cordoning-off one of those tasks; it’s a way of telling the computer that your software is running on, that this task here can run separately from those tasks there, and the logic for running this task too. (Basically). Python has allowed developers to define threads for a while. If you have a few tasks that are largely independent of each other, each of these tasks can run in a separate thread. Threads can access the same memory space, meaning that they can access and modify shared variables in a Python session. In general, this also means that a computation in one thread could update a value that is used by another thread, or that two different threads could make conflicting updates to the same variable. This freedom can lead to bugs. The CPython interpreter was originally written with a locking mechanism (the Global Interpreter Lock, GIL) that prevented different threads from running at the same time (even when multiple processors were available) and limited the reach of these bugs. Traditionally, you would have used threads for “non-CPU-bound tasks” in Python. These are the kinds of tasks that would be unaffected by having more, or faster, processors available to the Python instance: network traffic, file access, waiting for user input. For CPU-bound tasks, like calculations and data-processing, you could use Python’s ‘multiprocessing’ library (although some libraries like ‘numpy’ have their own low-level mechanisms for splitting work across cores). This starts multiple Python instances, each doing a portion of the processing, and allows a workload to be partitioned across multiple processors. The main other differences between threading and multiprocessing in Python are in memory and data management. With threading, you have one Python instance, with each thread having access to the same memory space. With multiprocessing, you have multiple Python instances that work independently: the instances do not share memory, so to partition a workload using multiprocessing, Python has to send copies of (subsets of) your data to the new instances. This could mean that you need to store two or more copies of a large dataset in memory when using multiprocessing upon it. Simultaneous processing across threads that share memory-space is now possible using the free-threaded build of Python. Many third-party packages have been rewritten to accommodate this new build and you can learn more about free-threading and the progress of the changes in the “Python Free-Threading Guide”. As a simple-ish example, lets consider natural language processing. There is a wonderful blog post about parallel processing with the nltk package on the “WZB Data Science Blog”. We will extend that example to use free-threading. ntlk provides access to some of the Project Gutenberg books, and we can access this data as follows: # main.py import nltk def setup(): nltk.download("gutenberg") nltk.download("punkt_tab") nltk.download('averaged_perceptron_tagger_eng') corpus = { f_id: nltk.corpus.gutenberg.raw(f_id) for f_id in nltk.corpus.gutenberg.fileids() } return corpus corpus = setup() The key-value pairs in corpus are the abbreviated book-title and contents for 18 books. For example: corpus["austen-emma.txt"] # [Emma by Jane Austen 1816] # # VOLUME I # # CHAPTER I # # # Emma Woodhouse, handsome, clever, and rich, with a comfortable home ... A standard part of a text-processing workflow is to tokenise and tag the “parts-of-speech” (POS) in a document. We can do this using two nltk functions: # main.py ... continued def tokenise_and_pos_tag(doc): return nltk.pos_tag(nltk.word_tokenize(doc)) A function to sequentially tokenise and POS-tag the contents of a corpus of books can be written: # main.py ... continued def tokenise_seq(corpus): tokens = { f_id: tokenise_and_pos_tag(doc) for f_id, doc in corpus.items() } return tokens You need to install or build Python in a particular way to make use of “Free-threaded” Python. In the above, we installed Python “3.14t” using uv, so we can compare the speed of free-threaded and sequential, single-core, processing. We will use the timeit package to analyse processing speed, from the command line. # Activate the threaded version of Python 3.14 uv python pin 3.14t # Install the dependencies for our main.py script uv add timeit nltk # Time the `tokenise_seq()` function # -- but do not time any setup code... PYTHON_GIL=0 \ uv run python -m timeit \ --setup "import main; corpus = main.setup()" \ "main.tokenise_seq(corpus)" # [lots of output messages] # 1 loop, best of 5: 53.1 sec per loop After some initial steps where the nltk datasets were downloaded and the corpus object was created (neither of which were timed, because these steps were part of the timeit --setup block), tokenise_seq(corpus) was run multiple times and the fastest speed was around 53 seconds. A small note: we have used the environment variable PYTHON_GIL=0 here. This makes it explicit that we are using free-threading (turning off the GIL). This wouldn’t normally be necessary to take advantage of free-threading (in Python “3.14t”), but was needed because one of the dependencies of nltk hasn’t been validated for the free-threaded build yet. To write a threaded-version of the same, we introduce two functions. The first is a helper that takes (filename, document-content) pairs and returns (filename, processed-document) pairs: def tupled_tokeniser(pair): file_id, doc = pair return file_id, tokenise_and_pos_tag(doc) The second function creates a Thread-pool, taking advantage of as many CPUs as there are available on my machine (16, counted by multiprocessing.cpu_count()). Each document is processed as a separate thread and we wait for all of the documents to be processed before returning results to the caller: import multiprocessing as mp from concurrent.futures import ThreadPoolExecutor, wait # ... def tokenise_threaded(corpus): with ThreadPoolExecutor(max_workers=mp.cpu_count()) as tpe: try: futures = [ tpe.submit(tupled_tokeniser, pair) for pair in corpus.items() ] wait(futures) finally: # output is a list of (file-id, data) pairs tokens = [f.result() for f in futures] return tokens # Time the `tokenise_threaded()` function # -- but do not time any setup code... PYTHON_GIL=0 \ uv run python -m timeit \ --setup "import main; corpus = main.setup()" \ "main.tokenise_threaded(corpus)" # [lots of output messages] # 1 loop, best of 5: 32.5 sec per loop I could see that every core was used when processing the documents, using the htop tool on Ubuntu. At points during the run, each of the 16 CPUs was at near to 100% use (whereas only one or two CPUs were busy at any time during the sequential run): But, despite using 16x as many CPUs, the multithreaded version of the processing script was only about 40% faster. There was only 18 books in the dataset and some disparity between the book lengths (the bible, containing millions of words was processed much slower than the others). Maybe the speed up would be greater with a larger or more balanced dataset. In the post on the WZB Data Science blog, there is a multiprocessing implementation of the above. Running their multiprocessing code with 16 CPUs gave a similar speed up to multithreading (minimum time 31.2 seconds). Indeed, if I was writing this code for a real project, multiprocessing would remain my choice, because the analysis for one book can proceed independently of that for any other book and data volumes aren’t that big. Other News Python 3.14 has also introduced some improvements to exception-handling, a new approach to string templating and improvements to the use of concurrent interpreters. See the Python 3.14 release notes for further details. In the wider Python Data Science ecosystem, a few other developments have occurred or are due before the end of 2025: The first stable release of the Positron IDE was made in August; Pandas 3.0 is due before the end of the year, and will introduce strings as a data-type, copy-on-write behaviour, and implicit access to columns in DataFrame-modification code; Tools that ingest DataFrames are becoming agnostic to DataFrame library through the Narwahls project. See the Plotly write-up on this subject. Python data science progresses at such a speed that we can only really scratch the surface here. Have we missed anything in the wider Python ecosystem (2025 edition) that will make a huge difference to your data work? Let us know on LinkedIn or Bluesky. For updates and revisions to this article, see the original post
I am pleased to announce that clitable, my new R package for printing tables in the terminal, just made its way to the CRAN!! It lets you print pretty colorful tables directly in the R console or terminal. I am also quite proud that it was accep...
Join our workshop on Linear algebra using Armadillo via armadillo4r , which is a part of our workshops for Ukraine series! Here’s some more info: Title: Linear algebra using Armadillo via armadillo4r Date: Thursday, November 6th, 18:00 – 20:00 CEST (Rome, Berlin, Paris timezone) Speaker: Mauricio ‘Pachá’ Vargas Sepúlveda has a Master of Arts in … Continue reading Linear algebra using Armadillo via armadillo4r workshopLinear algebra using Armadillo via armadillo4r workshop was first posted on October 15, 2025 at 12:53 pm.
New online book AI Assistants for Scientific Coding I’ve released a new online book, AI Assistants for Scientific Coding. It’s a practical guide to using language models to support scientific computing and analysis. The book focuses on helping people...
Bitcoin hit an all-time high of $125,664 on October 5. This increase was fueled by a historic net inflow of $3.24 billion into spot Bitcoin ETFs and rising public demand. In this article, we will predict the trend of two blockchain ETFs using nested forecasting with the Spark backend. I couldn’t use lagged and smoothing […]
Resource for using Large Language Model tools in R If you’re interested in using Large Language Models (LLMs) with R, you should check out Luis D. Verde Arregoitia’s new online resource: Large Language Model tools for R. Its available in English and S...
Read it in: Español.Read it in: Français. Our own dev guide states Be generous with attributions Recognizing the diverse forms of contributions to our mission is very important to us: we like thanking package reviewers and more generally all package ...
I typically work in quantitative ecology and molecular epidemiology, where we use statistical models to predict species distributions or disease transmission patterns. Recently though, I had an interesting conversation with a data science PhD student who mentioned they were applying GAMs to predict Customer Lifetime Value at a SaaS startup. This caught my attention because CLV prediction, as it turns out, faces remarkably similar statistical challenges to ecological forecasting: nonlinear relationships that saturate at biological or business limits, hierarchical structures where groups behave differently, and the need to balance model flexibility with interpretability for stakeholders who need to understand why the model makes certain predictions. After diving into the business analytics literature, I discovered that companies take wildly different approaches to this problem. Many rely on simple heuristics like average order value × frequency × lifespan that ignore individual customer trajectories and temporal dynamics. Others implement sophisticated probabilistic models (the Pareto/NBD and BG/NBD families are particularly elegant), though these often require specialized knowledge to tune and interpret. Black-box machine learning models can capture complex patterns but sacrifice the interpretability that executives need for strategic decisions. Meanwhile, standard regression approaches often predict impossible values like negative revenue or infinite growth. Generalized Additive Models provide a compelling middle ground that I think deserves more attention in this space. GAMs capture complex nonlinear relationships while maintaining full interpretability through visualizable smooth functions. They handle the heteroscedastic variance inherent in revenue data, automatically adapt to saturation effects, and provide uncertainty quantification that’s crucial for risk assessment. This post demonstrates how to build CLV models that respect business constraints, capturing channel decay patterns, tier-specific behaviors, and feature adoption dynamics, while remaining straightforward enough to deploy without extensive infrastructure or specialized optimization expertise. Understanding SaaS business dynamics Before we dive into the modeling (I promise we’ll get to the GAMs soon), it’s worth understanding why Software as a Service (SaaS) businesses present such interesting statistical challenges. Unlike traditional software vendors who sell one-time licenses, or e-commerce platforms with discrete transactions, SaaS companies operate on subscription-based recurring revenue. Customers pay monthly or annually for continued access to cloud-hosted software, which fundamentally changes the economics and, more importantly for us, the statistical patterns we need to model. Here’s the key challenge: customer acquisition costs are typically massive relative to monthly revenue. Think about it, a customer paying $99/month with a $500 acquisition cost needs at least six months just to break even. This is where accurate CLV prediction becomes critical. You need to know which customer segments will stick around long enough to justify that upfront investment, and traditional models often miss the mark completely. What makes this particularly interesting from a modeling perspective is how customer value evolves. Feature adoption doesn’t just correlate with retention, it actively drives expansion revenue through tier upgrades and add-ons. I’ve seen datasets where highly engaged Basic tier customers generate more lifetime value than barely-engaged Enterprise customers, despite the 6x price difference. Meanwhile, customer success teams intervene based on usage patterns, creating feedback loops where our predictions influence outcomes. Add in the fact that churned customers can be reactivated (unlike, say, a dead coral reef in my usual work), and you’ve got a complex dynamic system. The statistical requirements here are non-trivial: we need models that handle nonlinear feature effects (adoption plateaus around 80-90% for most products), tier-specific behaviors (Enterprise customers behave fundamentally differently), temporal decay patterns (paid acquisition channels show diminishing returns), and heteroscedastic variance (high-value customers are inherently more variable). Linear models with their constant effects and variance assumptions? They’ll predict negative revenue or infinite growth. Simple multipliers? They miss all the interesting dynamics. This is exactly where GAMs shine. In this post, I’ll walk through building a production-ready CLV prediction model using GAMs, starting from simulated data that captures realistic SaaS dynamics through to extracting specific business insights like upgrade thresholds and feature investment ROI. By the end, you’ll have a complete workflow that you can adapt to your own customer data, whether you’re in SaaS, e-commerce, or any business with recurring customer relationships. Environment setup Before we dive into the modeling, let’s load our packages. Nothing exotic here, just the usual suspects for GAM fitting and visualization: # Load libraries library(mgcv) # Fit flexible GAMs with smooth terms library(tidyverse) # Data manipulation and visualization library(marginaleffects) # Extract interpretable predictions from GAMs library(gratia) # Beautiful plots and posterior sampling for GAMs library(scales) # Format axes for currency and percentages # Set a clean plotting theme theme_set(theme_bw(base_size = 15, base_family = 'serif') + theme(panel.grid = element_blank())) Simulating realistic SaaS customer dynamics Let’s build some synthetic data that captures real B2B SaaS patterns. I’m a big believer in simulation for understanding model behavior, it’s something I use constantly in my ecological work, and it translates perfectly here. By controlling the data generating process, we know exactly what patterns our GAM should recover. I’ll simulate 100 customers tracked over 5 months, using realistic SaaS pricing tiers: Basic ($99/month), Pro ($299/month), and Enterprise ($599/month). The target variable is 6-month CLV, a common planning horizon in SaaS businesses. Now here’s where it gets interesting, and where we build in patterns that would break traditional models: First, feature adoption naturally plateaus. Nobody uses 100% of features (trust me, I still discover new RStudio shortcuts after a decade). Second, paid acquisition channels show decay, customers attracted by promotions gradually revert to their natural engagement levels. Third, and this is crucial, Enterprise customers don’t just have higher baseline value, they exhibit completely different response curves to adoption. These nonlinearities are exactly what GAMs handle naturally, no feature engineering required. # Set seed for reproducibility set.seed(55) # Generate customer-month observations train_data % dplyr::mutate( # Assign customers to pricing tiers with typical SaaS distribution contract_tier = sample(c("Basic", "Pro", "Enterprise"), 100, replace = TRUE, prob = c(0.5, 0.3, 0.2))[customer_id], # Customer acquisition channels affect initial value and retention acquisition_channel = sample(c("Organic", "Paid", "Partner"), 100, replace = TRUE)[customer_id], # Feature adoption grows over time but naturally caps around 95% feature_adoption_pct = pmin(95, pmax(5, 20 + month * 12 + rnorm(dplyr::n(), 0, 10))), # Monthly recurring revenue depends on tier base_mrr = dplyr::case_when( contract_tier == "Basic" ~ 99, contract_tier == "Pro" ~ 299, contract_tier == "Enterprise" ~ 599, TRUE ~ 99 ), # Paid acquisition provides early boost that decays over time marketing_boost = ifelse(acquisition_channel == "Paid", 0.3 * exp(-0.2 * month), 0), # CLV with stronger saturation effects for Enterprise adoption_multiplier = dplyr::case_when( contract_tier == "Enterprise" ~ 1 + 2 * (1 - exp(-0.05 * feature_adoption_pct)), contract_tier == "Pro" ~ 1 + 0.5 * (feature_adoption_pct/100), TRUE ~ 1 + 0.2 * (feature_adoption_pct/100) ), # Calculate CLV with clear asymptotic behavior clv_6m = base_mrr * 6 * adoption_multiplier * (1 + marketing_boost) * exp(rnorm(n(), 0, 0.15)) ) %>% # Convert to factors after numeric calculations dplyr::mutate( contract_tier = factor(contract_tier, levels = c("Basic", "Pro", "Enterprise")), acquisition_channel = factor(acquisition_channel) ) # Let's visualize the data we just created to understand the patterns # we're asking our GAM to capture ggplot(train_data, aes(x = feature_adoption_pct, y = clv_6m, color = contract_tier)) + # Add individual customer data points geom_point(alpha = 0.6, size = 1.5) + # Use a quadratic term to show nonlinear relationships geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE, alpha = 0.2) + # Format y-axis as currency scale_y_continuous(labels = scales::dollar) + # Use consistent color scheme across tiers scale_color_manual(values = c("darkblue", "darkred", "black")) + # Separate panels by acquisition channel facet_wrap(~ acquisition_channel) + # Add informative axis labels labs(x = "Feature Adoption Rate (%)", y = "6-Month Customer Lifetime Value ($)", color = "Contract Tier") Notice how the relationship between adoption and CLV isn’t just nonlinear, it’s differently nonlinear for each tier. Enterprise customers hit a ceiling hard (classic diminishing returns), while Basic tier customers show nearly linear growth. The Paid acquisition channel generates higher initial CLV, but look closely at high adoption rates, that premium fades. This suggests organic customers eventually match paid ones if they stick around and engage with the product. But we will see a clearer plot of these relationships when we estimate them using our GAM below. # Now examine temporal patterns - how does CLV evolve over time? train_data %>% ggplot(aes(x = month, y = clv_6m, color = acquisition_channel)) + # Add LOESS smoothing with confidence bands for each channel geom_smooth(aes(fill = acquisition_channel), method = "loess", se = TRUE, alpha = 0.3, linewidth = 1.2) + # Format y-axis as currency scale_y_continuous(labels = scales::dollar) + # Apply consistent color scheme for channels scale_color_manual(values = c("darkblue", "darkred", "black")) + scale_fill_manual(values = c("darkblue", "darkred", "black")) + # Create separate panels for each tier with independent y-scales facet_wrap(~ contract_tier, scales = "free_y", ncol = 2) + # Add informative axis labels labs(x = "Months Since Customer Acquisition", y = "6-Month Customer Lifetime Value ($)", color = "Channel", fill = "Channel") The temporal patterns here are interesting. Paid customers maintain higher CLV consistently, but the effect varies by tier. In Enterprise, we’re talking nearly $2,000 premium for Paid acquisition, while in Basic it’s much smaller. Partner channels are the wild card, they match Organic in most tiers but start to outperform in Pro. If I were building acquisition strategy from these patterns, we might want to focus paid spending on Enterprise and partner relationships on Pro customers. Initial GAM: Gaussian with log link But rather than trying to make interpretations directly from the observed data, we would do better by fitting some models. So let’s fit our first GAM. I’m starting with a Gaussian distribution and log link, which at least ensures positive predictions (unlike linear models that cheerfully predict negative revenue): # Fit our first GAM with hierarchical smooths clv_model |t|) ## (Intercept) 6.55559 0.08326 78.74
Developing a Python package when coming from R: experience feedback I’ve learned a lot by developing and contributing to various R packages over the years. Without having (yet!) a personal package on CRAN, I use package development on a daily b...
Introduction In the previous posts in this series (Using R in Excel) I have demonstrated some basic use-cases where using R in Excel is useful. Specifically we have looked at descriptive statistics, linear regression, forecasting, and calling Python. I...
If this post is useful to you I kindly ask a minimal donation on Buy Me a Coffee. It shall be used to continue my Open Source efforts. You can send me questions for the blog using this form and subscribe to receive an email when there is a new po...
Zhenguo Zhang's Blog /2025/10/08/r-setting-up-an-interactive-r-shiny-plotting-app/ -I have an ieda of creating a shiny app to create all kinds of plots interactively, so that users can choose their color, symbols, etc, which is very useful for data exploration and make a publication-quality figure. Today, I make it and published it at https://fortune9.shinyapps.io/interactive_plot/. The source code is at github https://github.com/fortune9/interactive_plot. This guide focuses on the infrastructure setup for creating and deploying an interactive plotting application using R Shiny. For detailed features and functionality, please refer to the README.md in the GitHub repository. Overview The application allows users to create customizable plots (scatter, bar, box) from both uploaded data and built-in R datasets, with extensive customization options and export capabilities. All source code is available at https://github.com/fortune9/interactive_plot. Project Setup Create the project structure: 1 2 3 4 5 6 interactive_plot/ ├── app.R ├── ui.R ├── server.R ├── www/ └── data/ Install required packages: 1 install.packages(c("shiny", "ggplot2", "tidyverse")) Create the main app file (app.R): 1 2 3 4 5 6 7 8 library(shiny) library(ggplot2) library(tidyverse) source("ui.R") source("server.R") shinyApp(ui = ui, server = server) Deployment 1. Prepare for Deployment Install the rsconnect package: 1 install.packages("rsconnect") Create a shinyapps.io account at https://www.shinyapps.io Get your account tokens from: Dashboard → Account → Tokens Set up authentication in R: 1 2 3 4 5 rsconnect::setAccountInfo( name='your-account-name', token='your-token', secret='your-secret' ) 2. Deploy Using GitHub Actions Add GitHub Secrets: SHINYAPPS_USER SHINYAPPS_TOKEN SHINYAPPS_SECRET Create GitHub Actions workflow (.github/workflows/deploy.yml): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 name: Deploy to shinyapps.io on: push: branches: - main jobs: deploy: runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - uses: r-lib/actions/setup-r@v2 - name: Deploy to shinyapps.io run: | Rscript -e 'install.packages(c("shiny", "rsconnect"))' Rscript -e 'rsconnect::deployApp()' env: SHINYAPPS_USER: ${{ secrets.SHINYAPPS_USER }} SHINYAPPS_TOKEN: ${{ secrets.SHINYAPPS_TOKEN }} SHINYAPPS_SECRET: ${{ secrets.SHINYAPPS_SECRET }} Development Tips Local Testing Run the app locally using shiny::runApp() Test with different browsers and screen sizes Verify all features before deployment Deployment Checks Ensure all required packages are listed in the deployment workflow Test deployment locally using rsconnect before pushing to GitHub Monitor GitHub Actions logs for deployment issues Maintenance Keep your shinyapps.io tokens secure Update package versions regularly Monitor app usage on shinyapps.io dashboard Conclusion This guide covers the essential setup and deployment steps for your Shiny app. For implementation details and features, refer to the source code and documentation in the GitHub repository at https://github.com/fortune9/interactive_plot. For questions or improvements, please open an issue in the GitHub repository. Do you know? Actually most of the code was written using Github Copilot, to which I feed the file development plan.md to start. Very cool 😄! - /2025/10/08/r-setting-up-an-interactive-r-shiny-plotting-app/ -
In this article, we executed a successful integration of a non-standard Python forecasting model into the R Tidyverse/Tidymodels framework, primarily leveraging the reticulate package. 1. The Objective (The Challenge) The goal was to utilize a powerful, custom Python model (nnetsauce‘s MTS wrapper around a cyb$BoosterRegressor) and integrate its outputs—predictions and prediction intervals—into the R ecosystem, […]
Recently, a post on LinkedIn highlighted a Google Scholar profile of an apparently just starting PhD researcher who suddenly started accumulating unbelievable counts of citations to his questionably numerous papers. Usually, such profiles are a ...
Four years ago, I described a simple framework for organizing simulations to conduct power analyses or explore the operating characteristics of modeling approaches. In that framework, I introduced a small function scenario_list that generated a list...
This post contains a riddle/puzzle related to T. Moudiki's situation, including files and code snippets in R, Python, and bash. The ZIP file with a lot of details is available for download.
We want to introduce a new, work-in-progress book on spatial data visualization in R using the tmap package. The current version of the book, titled Spatial Data Visualization with tmap: A Practical Guide to Thematic Mapping in R, is available on...
Ever wondered what M184V, K65R actually mean? I learnt it from rebuilding Stanford’s HIV resistance algorithm from scratch to find out. Spoiler: it took tons of code to match their 3-line tool. But the lesson was worth it Motivations: We’ve all learnt and memorize what those letters and numbers mean when it comes to antiretroviral resistance. Since we’ve been exploring genomics lately, let’s take another look at HIV genotype. Stanford University HIVdb is an amazing resource! I’ve always been confused with all these letters and have difficulty understanding how to even check for genotype resistance because all these numbers and letters are quite confusing and intimidating. Let’s put on our bioconductor hat and revisit this topic and see if we can at least get a better understanding on what these letters and numbers mean. Better yet, use this opportunity to try to reproduce the algorithm that tells us the susceptibility of the ART! Hang tight on this one, it’s going to be a bumpy road! Objectives Find HIV reference gene Find Resistance Workflow Identify Translate Align Stanford’s Sierrapy Trial 1 Trial 2 Trial 3 Trial 4 Trial 5 Final Thoughts Opportunities for improvement Lessons Learnt Find HIV Reference Gene From most of the searching, the reference gene appears to be HXB2. We know that pol gene encodes protease (PR), reverse transcriptase (RT) and integrase (IN). Below is directly from NCBI. source. You can click on it and hover over those regions to see what they are. Take note that all these genes reside in specific position. So the trick is to use a reference, in our case HXB2, locate these 3 genes (RT, PR, IN), then extract them and make them into a database. Some reference is pretty good at letting you know where these positions are, some don’t! As for our reference, it didn’t really say either! We’ll have to look around reference and see if we can get those position. library(Biostrings) library(DECIPHER) library(tidyverse) ## We downloaded a bunch of HIV genomes, including the reference known as K03455 (aka HXB2) (all_hiv_genome
Date: Thursday, 23rd October 2025 Time: 13:05 (UK Time) Duration: 55 minutes Cost: Absolutely free! Reserve your spot now: https://jumpingrivers.typeform.com/to/UmdyNbAs Ready to get more out of your Posit tools and understand how they can drive ...
The future package celebrates ten years on CRAN as of June 19, 2025. I got a bit stalled over the holidays and going to the fantastic useR! 2025 conference, but, as promised, here is the third in a series of blog posts highlighting recent improveme...