Maintaining a package can be a lonesome activity, which sometimes poses a problem if you prefer teamwork or if you encounter a problem that is especially thorny for you. Besides belonging to a supportive community of maintainers (like rOpenSci 😉), for collaborative...
Join our workshop on Introduction to Qualitative Comparative Analysis (QCA) using R, which is part of our Workshops for Ukraine series! Here’s some more info: Title: Introduction to Qualitative Comparative Analysis (QCA) using R. Date: Thursday, October 5th, 18:00 – 20:00 CEST (Rome, Berlin, Paris timezone). Speaker: Ingo Rohlfing. Ingo Rohlfing is Professor of … Continue reading. This post was first posted on September 9, 2023 at 10:02 am.
Authors: Ivan Svetunkov, John E. Boylan. Journal: International Journal of Production Economics. Abstract: Inventory decisions relating to items that are demanded intermittently are particularly challenging. Decisions relating to termination of sales of product often rely on point estimates of the mean demand, whereas replenishment decisions depend on quantiles from interval estimates. It is in this […] The post iETS: State space model for intermittent demand forecasting appeared first on Open Forecasting.
Starting from the simple sine function (recall Fourier series), the German mathematician Karl Weierstrass became the first to publish an example of a continuous, nowhere-differentiable function. The Weierstrass function (originally defined as a Fourier series) was the first instance in which…Read more ›
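For readers who want to see the function rather than just read about it, here is a minimal R sketch that plots a truncated partial sum of the Weierstrass series \(W(x) = \sum_{n \ge 0} a^n \cos(b^n \pi x)\). The parameters a = 0.5 and b = 13 are illustrative choices satisfying Weierstrass's conditions (0 < a < 1, b an odd integer, ab > 1 + 3π/2), not values taken from the post.

# Partial sum of the Weierstrass function W(x) = sum_{n >= 0} a^n * cos(b^n * pi * x)
# Illustrative parameters only: a = 0.5, b = 13 (0 < a < 1, b odd, ab > 1 + 3*pi/2)
weierstrass_partial <- function(x, a = 0.5, b = 13, n_terms = 30) {
  terms <- sapply(0:(n_terms - 1), function(n) a^n * cos(b^n * pi * x))
  rowSums(terms)
}

x <- seq(-2, 2, length.out = 4000)
plot(x, weierstrass_partial(x), type = "l",
     ylab = "W(x)", main = "Partial sum of the Weierstrass function")

Higher values of n_terms add ever finer wiggles; no matter how far you zoom in, the curve never smooths out into a tangent line.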
Intro/Overview to TidyTuesday 36: Union Membership in the United States

This week’s TidyTuesday presents data taken from the Union Membership and Coverage Database from the CPS (Unionstats.com), created by Barry T. Hirsch, David A. Macpherson, and William E. Even. This database contains data about the wages of union and non-union workers from 1973 until today. There is a companion paper: Macpherson, David A. and Barry T. Hirsch. 2023. “Five decades of CPS wages, methods, and union-nonunion wage gaps at Unionstats.com.” Industrial Relations: A Journal of Economy and Society 62: 439–452. https://doi.org/10.1111/irel.12330

I highly recommend reading the paper, as it clearly illustrates the challenges of working with real-world data collected by third parties. The source data is government survey data. For some key questions, a third of respondents didn’t answer. Imputation was performed (but not always noted) by the government agency, but in such a way that it obscured the trend in wage gaps between union and non-union labor. (Likely, the survey was designed to study something else, and the imputation method was fine for that question. Some years did not even ask about participation in labor unions.) The survey also didn’t always collect detailed information about the salary of the highest earners, simply marking them as being “above the cap.” The paper details numerous such issues and explains how the data was handled to standardize the results over the five decades the database covers. It is a beautiful example of how to handle messy data.

Today, I will make two types of treemaps using some of the demographic data from this dataset. A treemap is a way of visualizing how various parts relate to the whole. The demographic data seems all related and could be viewed in treemap form, so I’m going to pull that out. The UnionStats website also warns, “Note: CPS sample sizes are very small for some cells. Use such estimates cautiously.” So, we can use the treemap to visualize that caution for the demographic subsets.

Setting Up

Loading Libraries

library(tidyverse) # who doesn't want to be tidy?
library(gt)        # for nice tables
library(treemap)   # for treemaps
library(d3treeR)   # for interactive treemaps

Loading Data

This week’s data is not loaded in the usual way! While the TidyTuesday page was up on Monday, the TidyTuesday package insisted the data was unavailable by either week number or date. So, instead, I loaded it directly from GitHub.
demographics %>% gt()

| year | sample_size | wage | at_cap | union_wage | nonunion_wage | union_wage_premium_raw | union_wage_premium_adjusted | facet |
|------|-------------|----------|------------|------------|---------------|------------------------|-----------------------------|-------|
| 2019 | 94818 | 28.09587 | 0.05071279 | 31.70222 | 27.64805 | 0.14663510 | 0.14290078 | all wage and salary workers |
| 2019 | 48157 | 31.31306 | 0.07093566 | 34.02119 | 30.96146 | 0.09882383 | 0.17838623 | demographics: male |
| 2019 | 46661 | 24.64259 | 0.02903222 | 28.99576 | 24.12817 | 0.20173885 | 0.10255232 | demographics: female |
| 2019 | 40352 | 31.72107 | 0.07324656 | 34.46735 | 31.36370 | 0.09895673 | 0.17714998 | demographics: white male |
| 2019 | 37864 | 24.57163 | 0.02741190 | 28.91303 | 24.06818 | 0.20129693 | 0.09296709 | demographics: white female |
| 2019 | 3607 | 23.72498 | 0.02696504 | 30.11012 | 22.74304 | 0.32392681 | 0.17447826 | demographics: black male |
| 2019 | 4577 | 21.96607 | 0.02158116 | 26.77524 | 21.29592 | 0.25729427 | 0.14210089 | demographics: black female |
| 2019 | 6897 | 22.11324 | 0.02291996 | 30.18838 | 21.20103 | 0.42391109 | 0.24975224 | demographics: hispanic male |
| 2019 | 5947 | 18.81279 | 0.01247475 | 25.42001 | 18.14904 | 0.40062550 | 0.15435517 | demographics: hispanic female |

Now, I don’t need the male and female total rows except to check the numbers (the first two rows). I see the subcategories don’t add up to these upper-level demographic categories. Male + female does equal the total number of workers (“all wage and salary workers”), but the individual subgroups of male and female don’t add up to the totals (hispanic male + white male + black male > male). One explanation is that some survey participants chose to identify in multiple categories. This data isn’t the greatest choice for a treemap since some participants will appear in multiple boxes, but I will proceed with the graph. Remove the first two rows. Treemap will generate the totals using the sample size from each subcategory. wages_2019_demo3a
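The post's treemap code is cut off in this excerpt. Since it loads the treemap package above, here is a minimal sketch of the kind of call involved, assuming wages_2019_demo3a (the truncated object above) holds the demographic rows with the two total rows removed; the exact aesthetics are my assumptions, not the post's code.

library(treemap)

# Sketch: one box per demographic group, area proportional to CPS sample size,
# colored by the adjusted union wage premium
treemap(
  wages_2019_demo3a,
  index  = "facet",
  vSize  = "sample_size",
  vColor = "union_wage_premium_adjusted",
  type   = "value",
  title  = "2019 CPS sample sizes by demographic group"
)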
Hey guys, welcome back to my R-tips newsletter. In today’s R-tip, I’m sharing a super common data science task (one that saved me 20 hours per week)… You’re getting the cheat code to automating Google Sheets. Plus, I’m sharing exactly how I made this a...
Moving away from proprietary languages, Roche has made a notable decision to freeze their legacy macros library. With great enthusiasm, they now embrace R as the primary framework for evidence generation in late-stage clinical trials, and they remain open to exploring additional open-source languages in this evolving landscape. James Black, a senior member of Roche’s […] The post appeared first on appsilon.com/blog/.
The Berlin R User Group fosters a diverse and vibrant R community in Berlin. Rafael Camargo shared some insights from his experience regarding the potential of R and some anecdotes... The post Spatial Data Science Using R in Berlin, Germany appeared first on R Consortium.
Weather and climate are words often used interchangeably in casual conversations. But when a chilly summer breeze sweeps through in July, or when unexpected rains dampen our winter holidays, the age-old debate resurfaces: “Is climate change even real?” If you want to dive deep into the fascinating realm of chaotic systems to unravel this enigma … Continue reading "Is this still Weather or is it already Climate? Decoding Chaos!"
Introduction

In statistics, a correlation is used to evaluate the relationship between two variables. In a previous post, we showed how to compute a correlation and perform a correlation test in R. In this post, we illustrate how to compute the Pearson, Spearman and Kendall correlation coefficients by hand, under two different scenarios (i.e., with and without ties).

Data

To illustrate the methods with and without ties, we consider two different datasets, one with ties and one without ties.

With ties

For the illustrations of the scenarios with ties, suppose we have the following sample of size 5: As we can see, there are some ties since there are two identical observations in the variable \(x\).

Without ties

For the scenarios which require no ties, we will consider the following sample of size 3:

Correlation coefficients by hand

The three most common correlation methods are:1 Pearson, used for two quantitative continuous variables which have a linear relationship; Spearman, used for two quantitative variables if the link is partially linear, or for one qualitative ordinal variable and one quantitative variable; and Kendall, often used for two qualitative ordinal variables. Each method is presented in the next sections. Note that the aim of this post is to illustrate how to compute the three correlation coefficients by hand and under two different scenarios; we are not interested in verifying the underlying assumptions.

Pearson

Luckily, the procedure for computing the Pearson correlation coefficient is the same whether there are ties or not, so we do not distinguish the two scenarios.

With and without ties

The Pearson correlation coefficient, denoted \(r\) in the case of a sample, can be computed as follows:2

\[ r = \frac{\sum^n_{i = 1} x_i y_i - n \bar{x} \bar{y}}{(n - 1) s_x s_y} \]

where: \(n\) is the sample size; \(x_i\) and \(y_i\) are the observations; \(\bar{x}\) and \(\bar{y}\) are the sample means of \(x\) and \(y\); and \(s_x\) and \(s_y\) are the sample standard deviations of \(x\) and \(y\). We show how to compute it step by step.

Step 1. As you can see, before computing the correlation coefficient we first need to compute the mean and the standard deviation of each of the two variables. The mean of \(x\) is computed as follows: \[\bar{x} = \frac{1}{n}\sum^n_{i = 1} x_i\] The standard deviation of \(x\) is computed as follows: \[s_x = \sqrt{\frac{1}{n - 1} \sum^n_{i = 1}(x_i - \bar{x})^2}\] The formulas can be applied analogously to the variable \(y\). We start by computing the means of the two variables: \(\bar{x} = \frac{-1 + 3 + 5 + 5 + 2}{5} = 2.8\) and \(\bar{y} = \frac{-3 + 1 + 0 + 2 - 1}{5} = -0.2\).

Step 2. We now need to compute the standard deviation of each variable. To ease the computations, it is best to use a table, starting with the observations. To compute the standard deviations, we need the sum of the squared differences between each observation and its mean, that is, \(\sum^n_{i = 1}(x_i - \bar{x})^2\) and \(\sum^n_{i = 1}(y_i - \bar{y})^2\).
We start by creating two new columns in the table, denoted x-xbar and y-ybar, corresponding to \((x_i - \bar{x})\) and \((y_i - \bar{y})\). We take the square of these two new columns to have \((x_i - \bar{x})^2\) and \((y_i - \bar{y})^2\), denoted (x-xbar)^2 and (y-ybar)^2. We then sum these two columns, which gives: for \(x\), \(\sum^n_{i = 1}(x_i - \bar{x})^2 =\) 24.8; for \(y\), \(\sum^n_{i = 1}(y_i - \bar{y})^2 =\) 14.8. The standard deviations are thus \(s_x = \sqrt{\frac{24.8}{5-1}} = 2.49\) and \(s_y = \sqrt{\frac{14.8}{5-1}} = 1.92\).

Step 3. We now need \(\sum^n_{i = 1} x_i y_i\), so we add a new column in the table, corresponding to \(x_iy_i\) and denoted x*y.

Step 4. We take the sum of that last column, which gives \(\sum^n_{i = 1} x_i y_i =\) 14.

Step 5. Finally, the Pearson correlation coefficient can be computed by plugging the values found above into the initial formula:

\[ \begin{split} r &= \frac{\sum^n_{i = 1} x_i y_i - n \bar{x} \bar{y}}{(n - 1) s_x s_y} \\ &= \frac{14 - (5\times2.8\times-0.2)}{(5-1) \times 2.49\times1.92} \\ &= 0.88 \end{split} \]

Note that there are other formulas to compute the Pearson correlation coefficient. For instance,

\[ \begin{split} r &= \frac{1}{n - 1} \sum^n_{i = 1} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) \\ &= \frac{\sum^n_{i = 1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum^n_{i = 1}(x_i - \bar{x})^2} \sqrt{\sum^n_{i = 1}(y_i - \bar{y})^2}} \end{split} \]

All formulas will of course give the exact same results.

Spearman

We now present the Spearman correlation coefficient, also referred to as Spearman’s rank correlation coefficient. This coefficient is actually the same as the Pearson coefficient, except that the computations are based on the ranked values rather than on the raw observations. Again, we present how to compute it by hand step by step, but this time we distinguish two scenarios: if there are ties, and if there are no ties.

With ties

The Spearman correlation coefficient (with ties), denoted \(r_s\), is defined as follows:

\[ r_s = \frac{\sum^n_{i = 1} Rx_i Ry_i - n \overline{Rx} \overline{Ry}}{(n - 1) s_{Rx} s_{Ry}} \]

where: \(n\) is the sample size; \(Rx_i\) and \(Ry_i\) are the ranks for the two variables; \(\overline{Rx}\) and \(\overline{Ry}\) are the sample means of the ranks for the two variables; and \(s_{Rx}\) and \(s_{Ry}\) are the sample standard deviations of the ranks for the two variables. Here is how to compute it by hand.

Step 1. As mentioned before, the Spearman coefficient is based on the ranks. So we first need to add the ranks of the observations (from lowest to highest), separately for each of the two variables. For \(y\), we see that: -3 is the smallest value, so we assign it the rank 1; -1 is the second smallest value, so we assign it the rank 2; then comes 0, so we assign it the rank 3; then comes 1, so we assign it the rank 4; finally, 2 is the largest value, so we assign it the rank 5. The same goes for \(x\), except that here we have two observations with the value 5 (so there will be ties in the ranks). In this case, we take the mean rank: -1 is the smallest value, so we assign it the rank 1; 2 is the second smallest value, so we assign it the rank 2; then comes 3, so we assign it the rank 3; finally, the two largest values belong to ranks 4 and 5, so we assign both of them the rank 4.5. We include the ranks in the table, denoted Rx and Ry.

Step 2. From there, it is similar to the Pearson coefficient, except that we work on the ranks and not on the initial observations anymore.
To avoid any confusion in the remaining steps, we remove the initial observations from the table and keep only the ranks. We start with the means of the ranks: \(\overline{Rx} = \frac{1+3+4.5+4.5+2}{5} = 3\) and \(\overline{Ry} = \frac{1+4+3+5+2}{5} = 3\). For the standard deviations, we use the table as we did for the Pearson coefficient. We sum the last two columns, which gives: for \(x\), \(\sum^n_{i = 1}(Rx_i - \overline{Rx})^2 =\) 9.5; for \(y\), \(\sum^n_{i = 1}(Ry_i - \overline{Ry})^2 =\) 10. The standard deviations are thus \(s_{Rx} = \sqrt{\frac{9.5}{5-1}} = 1.54\) and \(s_{Ry} = \sqrt{\frac{10}{5-1}} = 1.58\).

Step 3. We now need \(\sum^n_{i = 1} Rx_i Ry_i\), so we add a new column in the table, corresponding to \(Rx_iRy_i\) and denoted Rx*Ry.

Step 4. We take the sum of that last column, which gives \(\sum^n_{i = 1} Rx_i Ry_i =\) 53.

Step 5. Finally, the Spearman correlation coefficient can be computed by plugging all values into the initial formula:

\[ \begin{split} r_s &= \frac{\sum^n_{i = 1} Rx_i Ry_i - n \overline{Rx} \overline{Ry}}{(n - 1) s_{Rx} s_{Ry}} \\ &= \frac{53 - (5\times3\times3)}{(5-1) \times 1.54\times1.58} \\ &= 0.82 \end{split} \]

Without ties

When all initial values are different within each variable, meaning that all ranks are distinct integers, there are no ties. In that specific case, the Spearman coefficient can be computed with the following shortened formula:

\[ r_s = 1 - \frac{6 \sum^n_{i = 1}(Rx_i - Ry_i)^2}{n (n^2 - 1)} \]

For example, consider the sample without ties introduced at the beginning of the post.

Step 1. All observations within each variable are different, so all ranks are distinct integers and there are no ties.

Step 2. We only need to compute the difference between the two ranks of each row, denoted Rx-Ry, take the square of these differences, denoted (Rx-Ry)^2, and then take the sum of that last column, which gives \(\sum^n_{i = 1}(Rx_i - Ry_i)^2 =\) 2.

Step 3. Finally, we can fill in the initial formula to find the Spearman coefficient:

\[ \begin{split} r_s &= 1 - \frac{6 \sum^n_{i = 1}(Rx_i - Ry_i)^2}{n (n^2 - 1)} \\ &= 1 - \frac{6 \times2}{3 (3^2 - 1)} \\ &= 1 - \frac{12}{24} \\ &= 0.5 \end{split} \]

Kendall

The Kendall correlation coefficient, also known as Kendall’s \(\tau\) coefficient, is similar to the Spearman coefficient, except that it is often preferred for small samples and when there are many rank ties. Here also we distinguish between two scenarios: if there are no ties, and if there are ties. Unlike for the Spearman coefficient, we first illustrate the scenario where there are no ties, and then the one where there are ties.

Without ties

When there are no ties, the Kendall coefficient, denoted \(\tau_a\), is defined as follows:

\[ \tau_a = \frac{C - D}{C + D} \]

where \(C\) is the number of concordant pairs and \(D\) is the number of discordant pairs. This coefficient, also referred to as Kendall’s tau-a, does not make any adjustment for ties. Let’s see what concordant and discordant pairs are, using the data without ties (the same data as for the Spearman correlation without ties).

Step 1. We start by computing the ranks for each variable (Rx and Ry), as we did for the Spearman coefficient.

Step 2. We then arbitrarily choose a reference variable among the two, Rx or Ry. Suppose we take Rx as the reference variable here. We sort the dataset by this reference variable. From now on, we only look at the ranks of the second variable (the one which is not the reference variable, here Ry), so to avoid any confusion in the remaining steps we keep only the Ry column.

Step 3.
We now take each row of Ry one by one and check whether the rows below it in the table are smaller or larger. In our table, the first row of Ry is 1. We see that the value just below it is 3, which is larger than 1. Since 3 is larger than 1, this is called a concordant pair, and we write it in the table. The next row, 2, is also larger than 1, so it is also a concordant pair, and we also write it in the table. We start again with the second row of Ry, which is 3. Again, we look at the row below it and check whether it is larger or smaller. Here, the row below it is 2, which is smaller than 3, so we have a discordant pair. We add this information to the table.

Step 4. We now compute the total number of concordant and discordant pairs. There are 2 concordant pairs, so \(C = 2\), and 1 discordant pair, so \(D = 1\).

Step 5. Finally, we plug the values we have just found into the initial formula:

\[ \begin{split} \tau_a &= \frac{C - D}{C + D}\\ &= \frac{2 - 1}{2 + 1}\\ &= \frac{1}{3}\\ &= 0.33 \end{split} \]

Alternatively, we can also use the following formulas:

\[ \begin{split} \tau_a &= 1 - \frac{4D}{n(n - 1)}\\ &= \frac{4C}{n(n - 1)} - 1 \end{split} \]

where \(n\) is still the sample size, and which both give the exact same results.

With ties

I must admit that the process with ties is slightly more complex than without ties. The Kendall tau-b coefficient, which makes adjustments for ties, is defined as follows:

\[ \tau_b = \frac{C - D}{\sqrt{(C^2_n - n_x)(C^2_n - n_y)}} \]

where: \(C\) and \(D\) are still the number of concordant and discordant pairs, respectively; \(C^2_n\) is the total number of possible pairs; \(n_x\) is the number of possible pairs with a tie among the \(x\) values; and \(n_y\) is the number of possible pairs with a tie among the \(y\) values. Note that here, the letter \(C\) in \(C^2_n\) denotes “combination” and not “concordant”. Let’s illustrate that scenario and the formula with the dataset used for the Pearson correlation and the Spearman correlation with ties.

Step 1. As in the scenario without ties, the process is based on the ranks, so we start by adding the ranks for each variable, denoted as usual Rx and Ry.

Step 2. One of the two variables (\(x\) or \(y\)) must be selected as the reference variable. This time, we do not choose it arbitrarily, but we choose the one which does not have any ties. In our case, variable \(x\) has ties, while variable \(y\) does not have any ties. So \(y\) will be our reference variable. We then order the dataset by the reference variable, here Ry, and keep only the necessary columns to avoid any confusion in the remaining steps.

Step 3. We check whether each pair is concordant or discordant the same way we did without ties, but this time we do not count ties. The first row in column Rx is 1. We compare all rows below it in the table with that value. All rows below 1 in the table are larger, so we write that they are concordant pairs. We repeat the same process for each row. For instance, for the second row of the column Rx, we have the value 2. Again, all rows below it in the table are larger, so we write that they are concordant pairs. For the third row of the column Rx, we have the value 4.5. The row just below it in the table (= 3) is smaller than 4.5, so we write D for a discordant pair. Now, as you can see, the last row is also 4.5.
Therefore, since it is equal to the value we are comparing to, it is neither a concordant nor a discordant pair, so we write “T” in the table for “tie”. Last, the fourth row in the column Rx is 3; the row below it is 4.5, which is larger, so it is a concordant pair.

Step 4. We then sum the number of concordant and discordant pairs: \(C = 8\) and \(D = 1\).

Step 5. We now have all the information required to compute the numerator of \(\tau_b\), but we still need to find \(C^2_n\), \(n_x\) and \(n_y\) to compute the denominator. As mentioned above, \(C^2_n\) is the total number of possible pairs, thus corresponding to the number of combinations of two values. This number of pairs can be found with the formula for combinations, that is, \(C^2_n = \frac{n!}{2!(n - 2)!}\), where \(n\) is the sample size. In our case, we have a sample size of 5, so we have:3

\[ \begin{split} C^2_n &= \frac{n!}{2!(n - 2)!}\\ &= \frac{5!}{2!(5-2)!}\\ &= 10 \end{split} \]

Furthermore, \(n_x\) and \(n_y\) are the number of possible pairs with a tie among the \(x\) and \(y\) variables, respectively. When looking at Rx and Ry, we see that there are 2 identical values for the variable \(x\) and 0 identical values for the variable \(y\). This means that the number of possible pairs with a tie is equal to:4 1 for the variable \(x\) (the only possible pair with a tie is the pair {4.5, 4.5}) and 0 for the variable \(y\) (since all ranks are distinct, there exists no pair with a tie). Therefore,

\[ \begin{split} n_x &= 1\\ n_y &= 0 \end{split} \]

And we now have all the necessary information to compute the denominator of \(\tau_b\)!

Step 6. By plugging the values found above into the initial formula, we have

\[ \begin{split} \tau_b &= \frac{C - D}{\sqrt{(C^2_n - n_x)(C^2_n - n_y)}}\\ &= \frac{8-1}{\sqrt{(10 - 1)(10 - 0)}}\\ &= \frac{7}{\sqrt{90}}\\ &= 0.74 \end{split} \]

Verification in R

For the sake of completeness, we verify our results with the help of R for each coefficient and scenario. Pearson: x
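The verification code itself is cut off in this excerpt. A minimal sketch of such a check, using the with-ties sample x = (-1, 3, 5, 5, 2) and y = (-3, 1, 0, 2, -1) as reconstructed from the mean calculations earlier, would be:

# With-ties sample reconstructed from the means computed above
x <- c(-1, 3, 5, 5, 2)
y <- c(-3, 1, 0, 2, -1)

cor(x, y, method = "pearson")   # ~0.88, matches the hand computation
cor(x, y, method = "spearman")  # ~0.82
cor(x, y, method = "kendall")   # ~0.74 (cor() returns the tie-adjusted tau-b)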
Round two is here! The R Consortium Infrastructure Steering Committee (ISC) orchestrates two rounds of proposal calls and grant awards per year to fortify the R ecosystem’s technical infrastructure. We... The post Grants For R Language Infrastructure Projects Available Now! appeared first on R Consortium.
I’m part of a local Functional Programming Meetup group which hosts talks, but also coordinates social meetings where we discuss all sorts of FP-related topics including Haskell and other languages. We’ve started running challenges where we all solve a...
Introduction Data visualization is a powerful tool for understanding the relationships between variables in a dataset. One of the most common and insightful ways to visualize correlations is through heatmaps. In this blog post, we’ll dive into t...
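The excerpt cuts off before the code, so as a placeholder here is one minimal way to draw a correlation heatmap in base R; the post's own approach and dataset may well differ.

# Correlation heatmap of the built-in mtcars data, base R only
corr <- cor(mtcars)
heatmap(corr, Rowv = NA, Colv = NA, symm = TRUE,
        col = hcl.colors(25, "RdBu", rev = TRUE),
        main = "Correlation heatmap of mtcars")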
I’m still behind on TidyTuesday, so here is TidyTuesday #34, which deals with refugee data. This data is collected by the United Nations High Commissioner for Refugees and the R package is updated twice a year. Loading Data and Libraries Librarie...
Get ready for a thrill ride in causal discovery! We’re diving into gCastle, a Python package, right in R to amp up our skills. Let’s orchestrate our prior knowledge and nail that true DAG. 🔥 As I delve into Aleksander Molak’s Causa...
Welcome to another roundup of new additions to the Big Book of R collection of almost 400 books! If you want to bypass social media algorithms and be 100% sure you’ll be notified of new additions, sign up to my newsletter. You can select to only receive Big Book of … The post 6 New books added to Big Book of R appeared first on Oscar Baruffa.
Introduction Graphs are powerful visual tools for analyzing and presenting data. In this blog post, we will explore how to plot multiple lines on a graph using base R. We will cover two methods: matplot() and lines(). These functions provide fle...
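As a quick taste of the two methods the excerpt names, here is a minimal sketch; the example data is mine, not the post's.

# Several lines at once with matplot(): each column of the y matrix becomes a line
x <- seq(0, 10, by = 0.1)
y <- cbind(sin(x), cos(x))
matplot(x, y, type = "l", lty = 1, col = c("steelblue", "tomato"),
        xlab = "x", ylab = "value")

# The same plot built incrementally with plot() and lines()
plot(x, y[, 1], type = "l", col = "steelblue", xlab = "x", ylab = "value")
lines(x, y[, 2], col = "tomato")
legend("topright", legend = c("sin(x)", "cos(x)"),
       col = c("steelblue", "tomato"), lty = 1)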
Introduction to structural causal modelling A primary goal of science is to understand causes. Structural causal modelling is a framework for developing causal hypotheses to test with data. I taught a workshop at the Australian Marine Sciences Associ...
Introduction Are you tired of dealing with irregularly spaced data points that just don’t seem to fit together? Do you find yourself struggling to interpolate or smooth your data for better analysis? Look no further! In this blog post, we’ll div...
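The excerpt stops before naming the functions it covers, so purely as an illustration: two base-R options for irregularly spaced points are linear interpolation with approx() and smoothing with smooth.spline(). The post's actual tools may differ.

# Irregularly spaced, noisy observations (simulated for illustration)
set.seed(1)
x <- sort(runif(20, 0, 10))
y <- sin(x) + rnorm(20, sd = 0.2)

interp <- approx(x, y, n = 200)   # piecewise-linear interpolation on a regular grid
fit    <- smooth.spline(x, y)     # penalized smoothing spline

plot(x, y, pch = 16)
lines(interp, col = "steelblue")
lines(predict(fit, seq(0, 10, length.out = 200)), col = "tomato")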
Introduction In the vast world of R programming, there are numerous functions that provide powerful capabilities for data visualization and analysis. One such function that often goes underappreciated is the curve() function. This neat little f...
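For readers unfamiliar with it, here is a minimal illustration of curve(); the toy functions are my own, not the post's examples.

# curve() evaluates an expression in the variable x over a range and plots it
curve(x^2 - 3 * x + 2, from = -2, to = 5,
      col = "steelblue", ylab = "f(x)", main = "f(x) = x^2 - 3x + 2")

# add = TRUE overlays a second function on the same axes
curve(sin(x), from = -2, to = 5, col = "tomato", add = TRUE)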
This is fun 🙂 It is also O(MAX) complexity. But first, some background. Since the problem is super old, we are not intending to solve it, merely to play with it. In number theory, Goldbach’s conjecture…Read more ›
Introduction As data-driven decision-making becomes more critical in various fields, the ability to extract valuable insights from datasets has never been more important. One common task is to calculate counts by group, which can shed light on t...
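As a minimal illustration of counts by group, using the built-in mtcars data (my choice of example, not the post's):

# Number of cars per cylinder count, in base R and in dplyr
library(dplyr)

table(mtcars$cyl)                       # base R contingency counts
tapply(mtcars$mpg, mtcars$cyl, length)  # counts via tapply()

mtcars |> count(cyl)                    # dplyr: shorthand for group_by() + summarise(n = n())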
Last year I benchmarked a few ways of shuffling columns in a data.table, but what about pandas? I didn’t know, so let’s revisit those tests and add a few more operations! pandas winds up being much more competitive than I expected. First, dplyr is by far the slowest. Second, pandas is (more than) competitive with the R options. In the small-size regime (vector sizes up to about 1,000), the pandas option is similar to, but faster than, most of the slower R options, and the numpy-backed solution is nearly as fast as base R assignment and data.table’s in-place option. I expected pandas to be a lot slower. More surprising, in the large-vector regime both Python solutions outperform all R options, and the in-place Python option is much faster than everything else, starting with vector sizes of about 10,000. I’m not sure how representative this benchmark is, but it’s an interesting data point. More than most frameworks, pandas feels sensitive to the way we do something: calling .apply() isn’t just a little slower than .transform(), it’s miles slower. So the simple transformations we’re doing here are pretty easy to optimize, and numpy-backed operations should be fast anyway. There also might be systematic differences between the R and Python tests: the R functions were timed with microbenchmark and the Python tests were run with timeit. New code is below.

New Python testing code

from timeit import Timer
import pandas as pd

def scramble_naive(df: pd.DataFrame, colname: str) -> pd.DataFrame:
    df[colname] = df[colname].sample(frac=1, ignore_index=True)
    return df

test_naive = "scramble_naive(df, colname='x')"
results_naive = {
    n: Timer(test_naive, setup % n).repeat(repeat=100, number=1) for n in range(21)
}

The timeit module reports its times in seconds (cf. nanoseconds for microbenchmark), so we need a conversion factor to make the times comparable.

import numpy as np
import pandas as pd

def scramble_inplace(df: pd.DataFrame, colname: str) -> pd.DataFrame:
    np.random.shuffle(df[colname].to_numpy())
    return df

test_inplace = "scramble_inplace(df, colname='x')"
results_inplace = {
    n: Timer(test_inplace, setup % n).repeat(repeat=100, number=1) for n in range(21)
}

New R testing code

scramble_base
Introduction Data analysis often requires preprocessing and transforming data to make it more suitable for analysis. In R, the scale() function is a powerful tool that allows you to standardize or normalize your data, helping you unlock deeper i...
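A minimal sketch of what scale() does, with example data of my choosing:

# scale() centers each column (subtracts the mean) and scales it (divides by the sd)
z <- scale(mtcars[, c("mpg", "hp", "wt")])
round(colMeans(z), 10)   # approximately 0 after centering
apply(z, 2, sd)          # exactly 1 after scaling

# Center only, without dividing by the standard deviation
centered <- scale(mtcars$mpg, center = TRUE, scale = FALSE)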
Join our workshop on Introduction to Qualitative Comparative Analysis (QCA) using R, which is part of our Workshops for Ukraine series! Here’s some more info: Title: Introduction to Qualitative Comparative Analysis (QCA) using R. Date: Thursday, August 31st, 18:00 – 20:00 CEST (Rome, Berlin, Paris timezone). Speaker: Ingo Rohlfing. Ingo Rohlfing is Professor of … Continue reading Introduction to Qualitative Comparative Analysis (QCA) using R workshop. This post was first posted on August 6, 2023 at 9:11 am.
What makes an ML model a black box? It is the interactions. Without any interactions, the ML model is additive and can be described exactly. Studying the interaction effects of ML models is challenging. The main XAI approaches are: … This post is mainly about the third approach. Its beauty is that we get information about all interactions. […]
One of the joys (?) of having been online for…quite some time now…is watching topics reappear every few years or so. So what’s new? In terms of English word matches: not much. Some new proteins but no new 9-letter words. The Twitter thread, above, contains an interesting reply about an approach using generative AI: Other … Continue reading Price’s Protein Puzzle: 2023 update
Introduction Are you tired of manually calculating summary statistics for your data in R? Look no further! In this blog post, we will explore two powerful ways to summarize data: using the tapply() function and the group_by() and summarize() fun...
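A minimal sketch contrasting the two approaches the excerpt names, with mtcars as my stand-in dataset:

library(dplyr)

# Mean mpg per cylinder count with tapply(): returns a named vector
tapply(mtcars$mpg, mtcars$cyl, mean)

# The same summary as a data frame with group_by() + summarize()
mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg), .groups = "drop")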
Introducing cran.dev shortlinks! On r-universe you can find package repositories from many different organizations and maintainers. But sometimes you just want to look up a particular CRAN package, without knowing the developer. The new cran.dev shortlink service lets you navigate or link directly to the r-universe homepage and docs of any established CRAN package.

The root domain https://cran.dev/{package} redirects to the primary homepage for a package: https://cran.dev/magick https://cran.dev/Matrix https://cran.dev/ggplot2

The subdomain https://docs.cran.dev/{package} redirects to the package manual page: https://docs.cran.dev/targets https://docs.cran.dev/MASS https://docs.cran.dev/Rcpp

Finally, the subdomain https://api.cran.dev/{package} does not redirect, but returns a JSON blob with links to the package versions and resources. For example: https://api.cran.dev/gert. This shows the package page and maintainer for the gert package, and information on both the current CRAN release and development versions. Right now it returns:

{
  "package": "gert",
  "maintainer": "Jeroen Ooms ",
  "home": "https://ropensci.r-universe.dev/gert",
  "release": {
    "version": "1.9.2",
    "date": "2023-06-30T08:43:26.000Z",
    "source": "https://github.com/cran/gert",
    "repository": "https://cloud.r-project.org",
    "docs": "https://cran.r-universe.dev/gert/doc/manual.html",
    "api": "https://cran.r-universe.dev/api/packages/gert"
  },
  "devel": {
    "version": "1.9000",
    "date": "2023-07-20T11:08:09.000Z",
    "source": "https://github.com/r-lib/gert",
    "repository": "https://ropensci.r-universe.dev",
    "docs": "https://ropensci.r-universe.dev/gert/doc/manual.html",
    "api": "https://ropensci.r-universe.dev/api/packages/gert"
  }
}

We can see the release and devel version of this package, with links to the respective sources, cranlike repository, manual, and package metadata. From the above we can see that the gert r-universe page is https://ropensci.r-universe.dev/gert and the r-universe "repository" from which to install the devel version of gert:

# install 'devel' gert
install.packages("gert", repos = "https://ropensci.r-universe.dev")

The "api" field shows the JSON url with all the information and metadata about this package:

# Everything there is to know about devel gert
gert
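The last code line is cut off in this excerpt. A minimal sketch of reading the api.cran.dev JSON from R (my use of jsonlite is an assumption, not necessarily the post's own code) would be:

# Read the cran.dev API response for gert into an R list (jsonlite assumed)
library(jsonlite)

info <- fromJSON("https://api.cran.dev/gert")
info$release$version    # current CRAN release version
info$devel$repository   # r-universe repository holding the development version

# Install the development version from the repository reported by the API
install.packages("gert", repos = info$devel$repository)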