If this post is useful to you I kindly ask a minimal donation on Buy Me a Coffee. It shall be used to continue my Open Source efforts. The full explanation is here: A Personal Message from an Open Source Contributor. You can send me questions for...
At rOpenSci, our Code of Conduct (CoC) committee works to support a healthy, welcoming, and inclusive community. A big part of this work is making sure that the processes we follow are transparent, consistent, and fair. Over the years, we’ve developed...
Deutsche Bank Research Institute stated in its published report that Bitcoin has undergone a process similar to what gold experienced over the past 100 years. According to the report, Bitcoin’s increasing adoption and reduced volatility may transform it into a reserve asset that central banks could hold by 2030. The uncertainty graph below confirms the […]
Unfortunately, Piwik Pro has discontinued their free plan as of the end of 2025. If I wanted to keep using Piwik Pro, I would have to pay at least 420,00 € per year. As the author of the CRAN-hosted R package piwikproR, I have to decide whether I should conti...
Learned RNA-seq workflow using C. difficile data from a published study 🧬. Processed raw reads through fastp → kallisto → DESeq2 pipeline. Results matched the original paper’s findings, with clear differential expression between mucus and contro...
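For readers curious what the final step of such a pipeline roughly looks like in R, here is a minimal sketch (not the author's code) of importing kallisto quantifications and running DESeq2; the file names samples.csv, tx2gene.csv, and the abundance_dir/ layout are hypothetical placeholders.

# Minimal sketch: import kallisto quantifications and test for differential expression.
# Paths, the tx2gene table, and the sample sheet are assumed, not taken from the post.
library(tximport)
library(DESeq2)

samples <- read.csv("samples.csv")   # assumed columns: sample, condition
files   <- file.path("abundance_dir", samples$sample, "abundance.tsv")
tx2gene <- read.csv("tx2gene.csv")   # assumed transcript-to-gene mapping

txi <- tximport(files, type = "kallisto", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ condition)
dds <- DESeq(dds)
res <- results(dds)                  # log2 fold changes and adjusted p-values
head(res[order(res$padj), ])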
China aims to increase its influence in the global bullion market by directing friendly countries to store their gold reserves within its borders. This move is part of Beijing’s efforts to reduce its reliance on the dollar and promote the global use of the yuan. Goldman Sachs predicts that if just 1% of corporate bonds […]
You can send me questions for the blog using this form and subscribe to receive an email when there is a new post. Dear fellow developers and data scientists, If everyone reading this gave just the price of a coffee, I could focus fully on open s...
Ten years of blog posts. A few months ago (26 July 2025, to be precise) was the tenth anniversary of my first blog post. Over that time it turns out I’ve written about 225 blog posts, and an astonishing (to me) 350,000 words. That’s after you take out the...
Join our workshop titled Bayesian Optimization for Sequential Decisions with Multi-Arm Bandits, which is a part of our workshops for Ukraine series! Here's some more info: Title: Bayesian Optimization for Sequential Decisions with Multi-Arm Bandits. Date: Thursday, October 23rd, 18:00 – 20:00 CEST (Rome, Berlin, Paris timezone). Speaker: Jordan Nafa is a Data Scientist and … Continue reading Bayesian Optimization for Sequential Decisions with Multi-Arm Bandits, which was first posted on September 23, 2025 at 6:10 pm.
Read it in: Français. Sunny and Steffi showing off their R hex stickers! This summer I had a wonderful time attending the Society for Canadian Ornithologists meeting in Saskatoon, Canada. It was super exciting to run into Sunny Tseng, rOpenSci Cham...
In this post, you will learn what a t-test is and how to perform it in R. First, you'll see a simple function that lets you perform the test with just one line of code. Then, we will explore the intuition behind the test, building it step by step with data about the Titanic passengers. Enjoy the reading!

1. What is a t-test?

A t-test is a statistical procedure used to check whether the difference between two groups is significant or just due to chance. In this post, we'll look at data from Titanic passengers, dividing them into males and females. Suppose we want to test the hypothesis that men and women had the same average age. If our data shows that women were, on average, 2 years younger than men, we need to ask: is this a real difference, or could it have happened randomly? The t-test helps us answer this question.

2. Why is a t-test important?

A t-test is important when we want to draw conclusions about a population based on a sample. For example, imagine we are studying the demographics of ship passengers at the beginning of the twentieth century and want to use the Titanic sample to generalize findings to a broader population of passengers. Of course, such inferences may be biased, since Titanic passengers might not perfectly represent all ship passengers of that era. Nevertheless, the sample can still provide valuable insights, as long as the context of both the sample and the population is carefully considered and clearly explained.

3. The Titanic passengers

We are going to use the titanic R library to access data about Titanic passengers. Specifically, we will work with a subset of passengers contained in the titanic_train dataset. Below, you will find the code to load the data, calculate the mean and standard deviation of age for males and females, and show how many passengers are men and women.

library(titanic)
library(dplyr)
library(ggplot2)

data("titanic_train")

df <- titanic_train %>%
  select(Sex, Age) %>%
  na.omit()

df %>%
  group_by(Sex) %>%
  summarize(mean(Age), sd(Age), n())

Sex      mean(Age)  sd(Age)    n
female        27.9     14.1  261
male          30.7     14.7  453

We can see that there is a difference of 2.8 years between the average age of men and women on the Titanic. Below, you can also check the distribution of ages.

ggplot() +
  geom_density(aes(x = df$Age, color = df$Sex), size = 0.7) +
  scale_color_discrete("") +
  xlab("Age") +
  ylab("Density")

It seems that the two distributions are indeed very similar. In this case, our best option is to carry out a t-test to see whether they really are.

4. T-test in R

A t-test can be performed in R in a very easy way. There is a function called t.test whose first argument is a formula; in our case, we would like to know how age varies across genders. Thomas Leeper wrote a very clear explanation about formulas on this page. Important for us is that the formula is composed of a dependent variable on the left (Age), followed by "~", and one or more independent variables on the right (Sex). The second argument is simply the dataframe with the data we want to test. This test assumes the two samples are independent and that age is approximately normally distributed, which the density plot above suggests is reasonable.

t.test(Age ~ Sex, data = df)

How should we interpret these results? The p-value of 0.0118 means that if there were truly no difference in the average age between male and female passengers (i.e., if the null hypothesis were true), there would be only a 1.18% chance of observing a difference as large as the one we found or larger.

Since this p-value is less than 0.05, we reject the null hypothesis at the 95% confidence level, suggesting that a real difference exists. However, if we had chosen a 99% confidence level, we would not reject the null hypothesis, because the p-value is greater than 0.01. Our confidence interval tells us that if we took many samples like the one we have, 95% of the time we would obtain a difference between the averages of between -5 and -0.62. This confidence interval does not include 0, and therefore we reject the null hypothesis and accept the hypothesis that there is a difference between the average age of men and women.

5. T-test with bootstrap

A t-test with bootstrap is a good way of understanding the concepts needed to interpret the results of the t-test above. Everything relies on the Central Limit Theorem, according to which, if I draw many samples of a population and calculate the mean of each sample, then the distribution of all these means will: (i) follow a normal distribution; (ii) have a mean that approximates the population mean; and (iii) have a standard deviation that is called the standard error.

In our example, we have one sample of passengers. Imagine we could collect many such samples. If we could do that, then the means of all samples would approximate the population parameter. Bootstrap is a technique to virtually create as many samples as we want from our unique sample. In our example, we have 714 ages after eliminating NAs. We could resample 714 observations from these values, allowing them to repeat. That is the basic idea behind bootstrapping.

In order to do that, we will create a function that resamples our data frame. The first line of code uses slice_sample to randomly select n rows of our dataframe, allowing the same row to be chosen more than once. Note that n is the number of rows of the dataframe. After that, we use dplyr to calculate the mean by gender. Note that we are actually interested in the difference between the male mean and the female mean. That's what the last lines of the function do.

diff_means <- function(df) {
  means <- df %>%
    slice_sample(n = nrow(df), replace = TRUE) %>%
    group_by(Sex) %>%
    summarize(mean_age = mean(Age, na.rm = TRUE))
  male_mean <- means %>% filter(Sex == "male") %>% pull(mean_age)
  female_mean <- means %>% filter(Sex == "female") %>% pull(mean_age)
  return(male_mean - female_mean)
}

Now we can use the replicate function to execute our function n times. For our purpose, 1000 times is enough. Note that replicate works like a for loop. Before we do that, however, let us make a small adjustment so that we can also calculate our p-value. The p-value assumes the null hypothesis is true. Therefore, before resampling our data, let us make the difference between means be 0. For that, let us subtract the difference observed, 2.81, from the ages of all males.

df_null <- df %>%
  mutate(Age = ifelse(Sex == "male", Age - 2.81, Age))

set.seed(1308)
diffs <- replicate(1000, diff_means(df_null))
sum(diffs >= 2.81)/1000
sum(diffs <= -2.81)/1000
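As a small complementary sketch (not part of the original post), the same diff_means() function can also give a percentile bootstrap confidence interval if we resample the observed data rather than the null-shifted data. Note that diff_means() returns the male minus female difference, so the interval comes out with the opposite sign of the t.test output above, which reports female minus male.

# Minimal sketch, reusing df and diff_means() from above:
# percentile bootstrap interval for the male - female mean age difference.
set.seed(1308)
boot_diffs <- replicate(1000, diff_means(df))
quantile(boot_diffs, c(0.025, 0.975))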
I have modeled the BIST 100 index to build predictive intervals. Because the data has daily seasonality, I preferred the modeltime::adam_reg function. I did not use the timetk::step_timeseries_signature function because the model cannot handle too many external regressors, and the algorithm captures trend and seasonality well on its own. So I did not preprocess the […]
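As a rough illustration of that modeling choice (my own sketch, not the author's code), fitting an ADAM model with {modeltime} can look like the following; the data frame name bist, its date and close columns, and the three-month assessment window are all assumptions.

# Minimal sketch, assuming a data frame `bist` with columns `date` and `close`.
library(tidymodels)
library(modeltime)
library(timetk)

splits <- time_series_split(bist, assess = "3 months", cumulative = TRUE)

model_fit_adam <- adam_reg() %>%
  set_engine("adam") %>%
  fit(close ~ date, data = training(splits))

modeltime_table(model_fit_adam) %>%
  modeltime_calibrate(testing(splits)) %>%
  modeltime_forecast(new_data = testing(splits), actual_data = bist) %>%
  plot_modeltime_forecast()

The calibration step against the hold-out set is what lets modeltime_forecast() attach prediction intervals to the forecast.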
Centre for Marine Socioecology R and AI workshops. In person, Hobart, Australia, 11th and 12th November 2025. Course page and further details. Registration. R is a powerful tool for data analysis, but it has a steep learning curve. Join us for a ...
Read it in: Español. Read it in: Français. If life gives you a bunch of Markdown files to analyse or edit, do you warm up your regex muscles and get going? How about using more specific tools instead? In this post, we shall give an overview of programm...
Figura de Astarté – Museo Arqueológico de Sevilla, Public domain, via Wikimedia Commons

I was reading Phoenician colonization from its origin to the 7th century BC (Manzano-Agugliaro et al. 2025) and thought it was an interesting dataset, but alas: it is split in four tables, behind a javascript redirect (wtf Taylor & Francis?) and with DMS coordinates (including typos and special characters)… So not easily reusable. Let's go build an accessible dataset.

Config

library(readr)
library(purrr)
library(dplyr)
library(stringr)
library(ggplot2)
library(forcats)
library(janitor)
library(sf)
library(rnaturalearth)
library(glue)
library(parzer)
library(leaflet)

sf_use_s2(FALSE)
knitr::knit_hooks$set(crop = knitr::hook_pdfcrop)

Data

We need to manually download the CSVs (parts 1, 2, 3 and 4) because there is an antiscraping mechanism… Then a little cleaning and coordinate parsing with the very nice {parzer} package lets us build a spatial object with {sf}.

sources <- list(
  c_10_bce = "data_raw/T0001-10.1080_17445647.2025.2528876.csv",
  c_09_bce = "data_raw/T0002-10.1080_17445647.2025.2528876.csv",
  c_08_bce = "data_raw/T0003-10.1080_17445647.2025.2528876.csv",
  c_07_bce = "data_raw/T0004-10.1080_17445647.2025.2528876.csv"
)

phoenician <- sources |>
  imap(\(f, c) {
    read_csv(f) |>
      mutate(century_start_bce = parse_number(c))
  }) |>
  list_rbind() |>
  clean_names() |>
  mutate(
    lon = parse_lon(str_replace(longitude_e, "−", "-")),
    lat = parse_lat(str_replace(latitude_n, ",", "."))
  ) |>
  st_as_sf(coords = c("lon", "lat"), crs = "EPSG:4326")

Maps

The resulting layer, mapped on a Natural Earth background, seems good.

# Natural Earth background clipped to a buffer around the settlements
world <- ne_countries(returnclass = "sf") |>
  st_intersection(
    phoenician |>
      st_bbox() |>
      st_as_sfc() |>
      st_buffer(4, joinStyle = "MITRE", mitreLimit = 10)
  )

phoenician |>
  ggplot() +
  geom_sf(data = world) +
  geom_sf(aes(color = fct_rev(as_factor(century_start_bce)))) +
  theme_void() +
  labs(
    title = "Phoenician colonies",
    subtitle = "10th c. BCE - 7th c. BCE",
    color = "from\n(century BCE)",
    caption = glue("data doi:10.1080/17445647.2025.2528876 https://r.iresmi.net/ {Sys.Date()}")
  ) +
  theme_minimal() +
  theme(
    plot.caption = element_text(size = 6),
    plot.background = element_rect(fill = "white")
  )

Figure 1: Phoenician colonies

You want more interactivity? Using {leaflet}…

phoenician |>
  leaflet() |>
  addTiles(attribution = r"( r.iresmi.net. data: Manzano-Agugliaro et al. 2025. doi:10.1080/17445647.2025.2528876; map: OpenStreetMap)") |>
  addCircleMarkers(
    popup = ~ glue("{settlement} from {century_start_bce}th c. BCE {if_else(!is.na(centuries_of_subsequent_permanence), paste0('to ', centuries_of_subsequent_permanence), '')}"),
    clusterOptions = markerClusterOptions()
  )

Figure 2: Phoenician colonies (interactive)

Export

We can build a clean GeoPackage (and a CSV just in case):

phoenician |>
  st_write(
    "data/phoenician_settlements.gpkg",
    layer = "phoenician_settlements",
    layer_options = c(
      "IDENTIFIER=Phoenician colonization from its origin to the 7th century BC",
      glue("DESCRIPTION=Data from: Manzano-Agugliaro, F., Marín-Buzón, C., Carpintero-Lozano, S., & López-Castro, J. L. (2025). Phoenician colonization from its origin to the 7th century BC. Journal of Maps, 21(1). https://doi.org/10.1080/17445647.2025.2528876
            Available on https://doi.org/10.5281/zenodo.17141060
            Extracted on {Sys.Date()} – https://r.iresmi.net/posts/2025/phoenician")
    ),
    delete_layer = TRUE,
    quiet = TRUE
  )

phoenician |>
  select(-c(latitude_n, longitude_e)) |>
  bind_cols(st_coordinates(phoenician)) |>
  rename(lon_wgs84 = X, lat_wgs84 = Y) |>
  st_drop_geometry() |>
  write_csv("data/phoenician_settlements.csv")

And lastly we store them in a public repository; they are now available on Zenodo and therefore even have a DOI: doi:10.5281/zenodo.17141060.

References

Manzano-Agugliaro, Francisco, Carmen Marín-Buzón, Susana Carpintero-Lozano, and José Luis López-Castro. 2025. "Phoenician Colonization from Its Origin to the 7th Century BC." Journal of Maps 21 (1): 2528876. https://doi.org/10.1080/17445647.2025.2528876.
Revisiting my 2007-2009 Master's Thesis work on SCR Equity approximation using probabilistic machine learning techniques in R and Python.
library(tidyverse)
library(patchwork)
library(gt)
# sysfonts::font_add_google(name = "fira code")
# showtext::showtext_auto()
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", dev = "ragg_png")

In celebration of the release of {ggplot2} 4.0.0 🥳, I wanted to explore the relationships between the geometric objects ("geoms") and statistical transformations ("stats") that are offered by the core {ggplot2} functions.

Geoms and What?

Within The Layered Grammar of Graphics framework, plots in {ggplot2} are built by adding layer()s. Each layer() consists of a geom (the actual thing being drawn), with different geoms having different aes()thetics that can (and should) be mapped to variables in the data. However, all layer()s also have a stat - this can be thought of as a function applied to the data, transforming it in some way before it is passed to the geom and its aesthetics. For example, geom_histogram() uses the "bin" stat by default:

geom_histogram()$stat
#> <ggproto object: Class StatBin, Stat, gg>
#>     aesthetics: function
#>     compute_group: function
#>     compute_layer: function
#>     compute_panel: function
#>     default_aes: ggplot2::mapping, uneval, gg, S7_object
#>     dropped_aes: weight
#>     extra_params: na.rm orientation
#>     finish_layer: function
#>     non_missing_aes:
#>     optional_aes:
#>     parameters: function
#>     required_aes: x|y
#>     retransform: TRUE
#>     setup_data: function
#>     setup_params: function
#>     super: <ggproto object: Class Stat, gg>

This stat takes the raw data and counts the number of observations of the x aesthetic that fall within each x-bin.

layer_data(p1) |>
  # see computed variables:
  select(x, count, density, ncount, ndensity, width)
#>           x count     density     ncount   ndensity width
#> 1  12.00000     5 0.006009615 0.07352941 0.07352941   0.9
#> 2  15.55556    50 0.060096154 0.73529412 0.73529412   0.9
#> 3  19.11111    34 0.040865385 0.50000000 0.50000000   0.9
#> 4  22.66667    29 0.034855769 0.42647059 0.42647059   0.9
#> 5  26.22222    68 0.081730769 1.00000000 1.00000000   0.9
#> 6  29.77778    33 0.039663462 0.48529412 0.48529412   0.9
#> 7  33.33333     9 0.010817308 0.13235294 0.13235294   0.9
#> 8  36.88889     3 0.003605769 0.04411765 0.04411765   0.9
#> 9  40.44444     1 0.001201923 0.01470588 0.01470588   0.9
#> 10 44.00000     2 0.002403846 0.02941176 0.02941176   0.9

We can see that this data has been transformed, and it reflects the underlying data represented in the final plot in Figure 1: 10 columns, their x location, and their heights (count). Let's take a deeper look at the geoms and stats in {ggplot2} and how they relate to each other.

Data collection

# list all functions in ggplot2 that start with geom_ or stat_,
# and collect each function's default stat and geom into a data frame:
ggdata <- bind_rows(
  map(ggfunction, get_statgeom) |> bind_rows(),
  map(ggfunction, has_statgeom.args) |> bind_rows()
) |>
  select(-`NA.`) |>
  filter(
    !(is.na(stat) & is.na(geom)),
    !help_page %in% c("ggsf")
  )

Plot geom-stat combinations

ggdata_tidy <- ggdata |>
  mutate(
    stat = str_remove(stat, "Stat") |> str_to_lower(),
    geom = str_remove(geom, "Geom") |> str_to_lower(),
    stat = fct_reorder(factor(stat), stat, .fun = length),
    geom = fct_reorder(factor(geom), stat, .fun = length)
  )

ggplot(ggdata_tidy, aes(geom, stat)) +
  geom_point() +
  scale_x_discrete(limits = rev, guide = guide_axis(angle = -40)) +
  theme(text = element_text(family = "fira code"))

We can see that the most common stat by far is the identity stat. What's that about?

The Identity Stat and the Basic Geom Building Blocks

The identity stat is a stat that does nothing. It takes the data as-is and passes it to the geom's aesthetics. Geoms that use the identity stat can be thought of as the most basic building blocks of {ggplot2} - many types of lines, bars/tiles, points, areas, etc.
These are the geoms you'd probably use with annotate(). We can see such geoms are often used with other stats as well:

ggdata |>
  filter("StatIdentity" %in% stat, .by = geom) |>
  mutate(
    stat = str_remove(stat, "Stat") |> str_to_lower(),
    geom = str_remove(geom, "Geom") |> str_to_lower(),
    is_identity = if_else(stat == "identity", "identity", "other"),
    geom = fct_reorder(factor(geom), geom, .fun = length)
  ) |>
  ggplot(aes(stat, geom, fill = geom)) +
  facet_grid(cols = vars(is_identity), scales = "free_x", space = "free_x") +
  geom_point(
    shape = 21,
    color = "black",
    size = 3
  ) +
  ggrepel::geom_label_repel(
    aes(label = ggfunction),
    family = "fira code",
    layout = 2,
    force = 10,
    max.overlaps = Inf
  ) +
  theme(
    title = element_text(family = "fira code"),
    axis.text = element_text(family = "fira code")
  ) +
  scale_fill_viridis_d(option = "A", end = 1, begin = 0.2) +
  scale_x_discrete(guide = guide_axis(angle = -40)) +
  guides(fill = "none")

For example, the point geom is used by {ggplot2} together with 7 other non-identity stats.

Geom-Stat Pairs

{ggplot2} provides many stats that basically can only be used with a specific geom (and vice versa). These geom-stat pairs are almost exclusive to one another - think of a boxplot, for example, which is a unique geom that has little meaning without the boxplot stat.

layer_data(p2) |>
  # see computed variables:
  select(x, ymin:ymax, width, outliers)
#>   x ymin lower middle upper ymax width                       outliers
#> 1 1   23  24.0   25.0  26.0   26   0.9
#> 2 2   23  26.0   27.0  29.0   33   0.9                 35, 37, 35, 44
#> 3 3   23  26.0   27.0  29.0   32   0.9
#> 4 4   21  22.0   23.0  24.0   24   0.9                             17
#> 5 5   15  16.0   17.0  18.0   20   0.9                 12, 12, 12, 22
#> 6 6   20  24.5   26.0  30.5   36   0.9                         44, 41
#> 7 7   14  17.0   17.5  19.0   22   0.9 12, 12, 25, 24, 27, 25, 26, 23

The table below lists all the geom-stat pairs in {ggplot2}:

ggdata_pairs <- ggdata |>
  filter(
    stat != "StatIdentity",
    !help_page %in% c(
      "geom_abline", "geom_linerange", "ggsf",
      "stat_summary", "stat_summary_2d"
    ),
    !ggfunction %in% c("geom_col", "geom_freqpoly"),
    !str_detect(ggfunction, "(bin2d|density2d|binhex)")
  ) |>
  filter(n() > 1, .by = c(help_page)) |>
  arrange(help_page) |>
  mutate(
    stat = str_remove(stat, "Stat") |> str_to_lower(),
    geom = str_remove(geom, "Geom") |> str_to_lower()
  )

ggdata_pairs |>
  pivot_wider(
    names_from = type,
    values_from = ggfunction,
    id_cols = c(help_page, geom, stat),
    names_prefix = "type_"
  ) |>
  group_by(help_page) |>
  fill(starts_with("type_"), .direction = "downup") |>
  ungroup() |>
  distinct(help_page, type_geom, type_stat, .keep_all = TRUE) |>
  mutate(
    help_page = case_when(n() > 1 ~ help_page, .default = "Other topics"),
    .by = help_page
  ) |>
  gt(groupname_col = "help_page") |>
  cols_merge(
    columns = c(type_geom, stat),
    pattern = '{1}(stat = "{2}")'
  ) |>
  cols_merge(
    columns = c(type_stat, geom),
    pattern = '{1}(geom = "{2}")'
  ) |>
  opt_table_font(font = google_font("Fira Code")) |>
  tab_style(
    style = list(cell_text(weight = "bold")),
    locations = cells_column_labels()
  ) |>
  tab_style(
    style = list(cell_text(weight = "bold")),
    locations = cells_row_groups()
  ) |>
  cols_label(
    "type_geom" = "geom_",
    "type_stat" = "stat_",
    "help_page" = "Topic"
  )

Topic            geom_                                              stat_
Other topics     geom_bar(stat = "count")                           stat_count(geom = "bar")
                 geom_bin_2d(stat = "bin2d")                        stat_bin_2d(geom = "bin2d")
                 geom_boxplot(stat = "boxplot")                     stat_boxplot(geom = "boxplot")
                 geom_count(stat = "sum")                           stat_sum(geom = "point")
                 geom_density(stat = "density")                     stat_density(geom = "density")
                 geom_hex(stat = "binhex")                          stat_bin_hex(geom = "hex")
                 geom_histogram(stat = "bin")                       stat_bin(geom = "bar")
                 geom_quantile(stat = "quantile")                   stat_quantile(geom = "quantile")
                 geom_area(stat = "align")                          stat_align(geom = "area")
                 geom_smooth(stat = "smooth")                       stat_smooth(geom = "smooth")
                 geom_violin(stat = "ydensity")                     stat_ydensity(geom = "violin")
geom_contour     geom_contour(stat = "contour")                     stat_contour(geom = "contour")
                 geom_contour_filled(stat = "contourfilled")        stat_contour_filled(geom = "contourfilled")
geom_density_2d  geom_density_2d(stat = "density2d")                stat_density_2d(geom = "density2d")
                 geom_density_2d_filled(stat = "density2dfilled")   stat_density_2d_filled(geom = "density2dfilled")
geom_qq          geom_qq(stat = "qq")                               stat_qq(geom = "point")
                 geom_qq_line(stat = "qqline")                      stat_qq_line(geom = "abline")

When using the default geom= and stat= arguments, these pairs are interchangeable - meaning you can swap the stat_ function for the geom_ function from the table above. For example:
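As a minimal sketch of that interchangeability (my own illustration, using the built-in mpg dataset and an arbitrary bin count, not the post's original example), the following two calls draw the same histogram via the bin/bar pair from the table above:

library(ggplot2)

# geom_histogram() uses stat = "bin" by default...
p_geom <- ggplot(mpg, aes(hwy)) + geom_histogram(bins = 10)

# ...so the equivalent stat_ call just asks for the "bar" geom.
p_stat <- ggplot(mpg, aes(hwy)) + stat_bin(geom = "bar", bins = 10)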
We learnt how to assemble DNA fragments from a recently published N. meningitidis outbreak study using SKESA 🧬, practiced using BLAST for species identification and MLST for strain typing, and explored how serogroups are determined using specialized P...
Open science has transformed how research is conducted, shared, and reused. Yet the organisations at the heart of this transformation are often left vulnerable, underfunded, and disconnected from one another. To move from simply surviving to truly thr...
Our free monthly webinar series is back, and the first session on 21 August – “Reports that Write Themselves: Automated Reporting with Quarto” was a fantastic success! It was wonderful to see the Jumping Rivers community grow, with so many data p...