1. home
  2. R-bloggers | R news and tutorials contributed by (750) R bloggers
  3. Software

R-bloggers | R news and tutorials contributed by (750) R bloggers - Software

8 | Follower

R and Python Together: Refactoring and Prompt Engineering A Previous Case Study, Using the Perplexity API | R-bloggers

I wrote a post last year looking at how to employ tools in LangChain to have GPT-3.5 Turbo access information on the web, outside of its training data. The purpose of the present post is to revisit this post, improving the poor performance I saw there through refactoring and prompt engineering. Background The motivating example is again using large language models (LLMs) to help me calculate features for my Oscar model. Specifically: How many films did the director of a Best Picture nominee direct before the nominated film? Mistakes I Made Last Time In the previous post, I found pretty poor performance. Some of this is due to the studio system in old Hollywood: Directors of films before about 1970ish could have ludicrously large bodies of work due to the studio system and due to shorts, documentaries, or films made for WWII. This underscores the importance of domain expertise in data science. Even looking at performance in the more contemporary era, the performance was still poor (I will get into this more below). So, I made a few changes before trying it again this year. Change 1: This time around, I decided to only look at films nominated after 1969. This reflects a qualitative shift in the underlying data generation mechanism—there was no more studio system, and the directors that were prolific under that system are no longer having films nominated for Best Picture by 1970. Change 2: I was using GPT-3.5 Turbo inside of a LangChain agent that was equipped with the Wikipedia API and a calculator. The issue here is that I used defaults for the Wikipedia tool, which defaults to only allowing the first 4,000 characters from the page. Given that the “filmography” section of directors’ pages occurs near the bottom of the page, this was never getting read into the LLM. Even though it appears that they have increased the max character length, I could not get it to read in more this time around. So, I ditched the LangChain approach… Change 3: …and replaced it with Perplexity AI. They have built a “conversational search engine”. The basic idea is that they use existing foundational models and give them the tools to search the web for real-time information to answer your query. This is more or less the agent I was trying to make myself with LangChain… except it performs much better, and it can access sources beyond Wikipedia. (The payload actually has a citations section where you can see the hyperlinks it used to find the information.) With there being no free lunch, however, it is more expensive. Even if you use an open source foundational model, Perplexity charges you a premium for using their agent on top of it that is searching the web. Even with all the prompting I gave it, this entire blog post still only cost me a few bucks, with the final dataset being $1.32. Change 4: Refactoring my prompt through prompt engineering. Let’s look at the prompt I made in the original blog post: ("system", """You are a helpful research assistant, and your job is to retrieve information about movies and movie directors. Think through the questions you are asked step-by-step."""), ("human", """How many feature films did the director of the {year} movie {name} direct before they made {name}? First, find who directed {name}. Then, look at their filmography. Find all the feature-length movies they directed before {name}. Only consider movies that were released. Lastly, count up how many movies this is. Think step-by-step and write them down. When you have an answer, say, 'The number is: ' and give the answer."""), In the system message, we see I tried to use chain-of-thought prompting. The idea here was I told it to think “step-by-step,” and then I explicitly gave it steps to take in the human part of the prompt. As I’ve read more about prompt engineering (such as this book and this book), I realize that what I was doing here was confusing two concepts. One of the basic principles is to break down tasks into modular parts and divide labor. I thought I was giving the model a way to reason step-by-step… but what I should have realized is I was giving it way too many tasks and intermediate steps to do in one prompt. I’ll get into how I improved this dogshit prompt below. The Current Approach As mentioned above, I’m now using the Perplexity API with a rewritten prompt. First, I use the OpenAI module in Python to access Perplexity, which will be using the llama-3.1-sonar-large-128k-online model under the hood. I define a Python function, which I will be calling from R using {reticulate}. The function is: from openai import OpenAI API_KEY="" model="llama-3.1-sonar-large-128k-online" client = OpenAI(api_key=API_KEY, base_url="https://api.perplexity.ai") def get_films(director): messages = [ { "role": "system", "content": """ You are a detailed research assistant helping the user find the names of movies. The user will give you a name of a director, and it is your job to find the name and year of every feature-length movie this person has ever directed. Finding an accurate list of these movies is important because the information will be employed by the user in a project where having an accurate list of films is vital. Do not list TV movies, as these are not feature-length movies. Do not list TV episodes, as these are not feature-length movies. Do not list mini-series, as these are not feature-length movies. Do not list music videos, as these are not feature-length movies. Do not list short films, as these are not feature-length movies. Many of these people may have also acted in or wrote or produced films—do not list these films unless they also directed the film. You should respond in a very specific format: A list where two greater-than signs (>>) begin each line. You should say the name of the film, then include two tildes (~~), then say the year it was released. Include no other text before or after the list; do not give any commentary. Only list the name and year of every feature-length movie this person has ever directed. """, }, { "role": "user", "content": "Rian Johnson" }, { "role": "assistant", "content": """ >> Brick ~~ 2005 >> The Brothers Bloom ~~ 2008 >> Looper ~~ 2012 >> Star Wars: The Last Jedi ~~ 2017 >> Knives Out ~~ 2019 >> Glass Onion ~~ 2022 """ }, { "role": "user", "content": "John Krasinski" }, { "role": "assistant", "content": """ >> Brief Interviews with Hideous Men ~~ 2009 >> The Hollars ~~ 2016 >> A Quiet Place ~~ 2018 >> A Quiet Place Part II ~~ 2020 >> IF ~~ 2024 """ }, { "role": "user", "content": director }, ] response = client.chat.completions.create( model=model, messages=messages, temperature=0 ) return(response.choices[0].message.content) Let’s look at the prompt, one piece at a time. First, I take all of the static information that will be the same from one call to another and put that in the system content. The first paragraph defines what the assistant is and why, as giving reasons behind actions can help improve performance (pg. 91). The second paragraph tells the model what not to list. Although one of the basic principles of prompting is to tell it what to do instead of what telling it NOT to do, I couldn’t find a better way of telling the model what it should exclude from the list. My hope in being repetitive here was that the tokens representing music videos, shorts, mini-series, etc., would be close enough to the word “not,” and I gave it a reason for excluding each of them. I give it a very specific format to respond with. Why? I noticed that so many special characters are used in film names. Markdown likes asterisks, and LLMs like Markdown. Problem! Movie titles can be italicized in Markdown using asterisks, and film names can have asterisks in them (e.g., M*A*S*H). Hence the double >> and the double ~~ to break up the list. I tell it to give no other commentary, and then I “sandwich” the prompt by reminding the model of its primary objective (pg. 125). I then employ few-shot learning to show examples of proper user and assistant calls to drive home the format and what to (and not to) include. I chose these two directors for very specific reasons. Rian Johnson has directed short films, TV episodes (such as a few notable ones from Breaking Bad), music videos (including “oh baby” by LCD Soundsystem), and a commercial. My example only lists feature films. I also included John Krasinski, as he is primarily known as an actor (that is, Jim from The Office). In the actual Academy Award data, there are folks like Clint Eastwood and George Clooney who are primarily known as actors but show up as directors. Lastly, I chose these two because they do not appear in the dataset of directors of films nominated for Best Picture, so there’s no ability for the model to simply copy one of the examples. Lastly, you’ll notice I don’t give the model a movie, ask it to find the director, then ask it to find their filmography, then ask it to return me the number of films before the film I gave it. That’s too much—let’s divide labor. What I am doing is now simply asking them for their full filmography. I can rely on data I already pulled down using the OMDb API to get the director for each film. I can then get the number of previously-directed films by working with the data in R. Dividing the Labor in R Retrieving the Data I set up {reticulate} using a virtual environment that is a path relative to my own Google Drive, and then I source the script with the Python function in it. I already have the director(s) for each film in a Google Sheet. I pull this down—and separate longer when there are multiple directors—to get a list of individuals who directed a film nominated for Best Picture after 1969. I then initialize an empty string variable and fill it with the result from the call to Perplexity in a loop. # prep ------------------------------------------------------------------------- library(googlesheets4) library(tidyverse) library(reticulate) use_virtualenv("../../../../") source_python("ppxai.py") gs4_auth() 2 # load director list ----------------------------------------------------------- dat % filter(year > 1969) %>% select("director") %>% separate_longer_delim(director, ", ") %>% unique() # run -------------------------------------------------------------------------- dat$films_string % str_split(fixed("~~")) %>% map(trimws) if (!all(map(tmp, length) == 2)) next films % filter(year > 1969) %>% mutate(film = tolower(film)) # hand-coded from last year check % mean(na.rm = TRUE) } For each film in the hand-checked dataset, I map() through the directors (since there could have been multiple) and count the number of entries before the year that the film of interest was released. You’ll see I do a mean() call at the end, such that if there were multiple directors, I use the mean of them. I also return NA upon any error or warning. The only time this popped up was for Adam McKay. The model was a bit too detailed and returned two lists: ">> Anchorman: The Legend of Ron Burgundy ~~ 2004\n>> Anchorman 2: The Legend Continues ~~ 2013\n>> Talladega Nights: The Ballad of Ricky Bobby ~~ 2006\n>> Step Brothers ~~ 2008\n>> The Other Guys ~~ 2010\n>> The Big Short ~~ 2015\n>> Vice ~~ 2018\n>> Don't Look Up ~~ 2021\n>> We the Economy: 20 Short Films You Can't Afford to Miss (segment \"The Unbelievably Sweet Alpacas\") ~~ 2014\n\nNote: While \"We the Economy: 20 Short Films You Can't Afford to Miss\" is a collection of short films, the specific segment directed by Adam McKay is included here as it is part of his feature-length filmography in the context of directing. However, if strictly adhering to feature-length films only, this entry could be excluded.\n\nUpdated list without the short film segment:\n\n>> Anchorman: The Legend of Ron Burgundy ~~ 2004\n>> Anchorman 2: The Legend Continues ~~ 2013\n>> Talladega Nights: The Ballad of Ricky Bobby ~~ 2006\n>> Step Brothers ~~ 2008\n>> The Other Guys ~~ 2010\n>> The Big Short ~~ 2015\n>> Vice ~~ 2018\n>> Don't Look Up ~~ 2021" It tells us some reasoning, even though we didn’t ask for it: There is a feature-length film where Adam McKay directed a short film within it. The model gave us two possible lists. We’re going to clean it by taking the second list, which excludes the collection of short films: # failures --------------------------------------------------------------------- check %>% filter(is.na(from_llm)) directors % filter(film == "vice") %>% pull(director) %>% str_split_1(fixed(", ")) read_csv("directors_films.csv") %>% filter(director == directors) %>% pull(films_string) check$from_llm[check$film == "vice"] % filter(director == directors) %>% arrange(year) %>% unique() %>% # dedupe filter(!grepl("economy", film)) %>% # get rid of short film collection filter(year < year[film == "vice"]) %>% nrow() # make sure empty check %>% filter(is.na(from_llm)) Now we’ve got a clean verification set. It is 110 movies, with columns for: Hand-checked number of films directed before the film of interest Perplexity-based estimate Estimate from previous post, using GPT-3.5 Turbo and LangChain Fifteen random entries from these data look like: ## # A tibble: 15 × 5 ## year film check last_post from_llm ## ## 1 1978 midnight express 1 3 1 ## 2 1971 nicholas and alexandra 7 2 6 ## 3 2016 manchester by the sea 2 1 2 ## 4 2018 blackkklansman 22 21 22 ## 5 1983 terms of endearment 0 2 0 ## 6 1993 in the name of the father 2 2 2 ## 7 2013 12 years a slave 2 2 2 ## 8 2015 the revenant 5 5 5 ## 9 2009 the blind side 3 2 3 ## 10 1970 mash 3 3 5 ## 11 2007 atonement 1 1 1 ## 12 1985 prizzi's honor 37 35 36 ## 13 1977 the turning point 8 5 8 ## 14 2001 the lord of the rings: the fellowship of the … 5 5 5 ## 15 2003 lost in translation 1 1 1 Let’s see how we did. Performance First, let’s compare: the correlation; mean absolute error (MAE); and how often we were correct, overcounted, or undercounted. Let’s do this first for Perplexity: # performance ------------------------------------------------------------------ cor.test( ~ check + from_llm, check) ## ## Pearson's product-moment correlation ## ## data: check and from_llm ## t = 60.394, df = 108, p-value < 2.2e-16 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.9789129 0.9900619 ## sample estimates: ## cor ## 0.9855161 check %>% summarise(mae = mean(abs(check - from_llm))) ## # A tibble: 1 × 1 ## mae ## ## 1 0.5 check %>% mutate(err_disc = case_when( (check - from_llm) > 0 ~ "Undercount", (check - from_llm) < 0 ~ "Overcount", (check - from_llm) == 0 ~ "Correct" )) %>% count(err_disc) %>% mutate(pct = n / sum(n) * 100) ## # A tibble: 3 × 3 ## err_disc n pct ## ## 1 Correct 77 70 ## 2 Overcount 18 16.4 ## 3 Undercount 15 13.6 Versus last time (GPT-3.5 Turbo with Wikipedia Tool): ## compare to last time -------------------------------------------------------- cor.test( ~ check + last_post, check) ## ## Pearson's product-moment correlation ## ## data: check and last_post ## t = 15.007, df = 108, p-value < 2.2e-16 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.7503941 0.8747150 ## sample estimates: ## cor ## 0.8221232 check %>% summarise(mae = mean(abs(check - last_post))) ## # A tibble: 1 × 1 ## mae ## ## 1 2.01 check %>% mutate(err_disc = case_when( (check - last_post) > 0 ~ "Undercount", (check - last_post) < 0 ~ "Overcount", (check - last_post) == 0 ~ "Correct" )) %>% count(err_disc) %>% mutate(pct = n / sum(n) * 100) ## # A tibble: 3 × 3 ## err_disc n pct ## ## 1 Correct 46 41.8 ## 2 Overcount 17 15.5 ## 3 Undercount 47 42.7 We’ve made tremendous improvements here: Correlation between estimate vs. truth increases from r = .82 last time to r = .99 this time. MAE decreases from 2.01 films to 0.50 films. We go from a correct count 42% of the time to a correct count 70% of the time. Undercounting (14%) and overcounting (16%) are now happening at about the same rate, while beforehand we were much more likely to undercount (43%) then to overcount (16%). Lastly, let’s plot the true, hand-checked number on the x-axis versus the estimated number on the y-axis. The purple line is a loess, the green line is OLS, and the dotted black line is a perfect relationship. I label any data point where the absolute error was five or more films off. ggplot(check, aes(x = check, y = from_llm)) + geom_abline(aes(intercept = 0, slope = 1), linetype = "dotted") + geom_point(alpha = .5) + geom_smooth(method = "loess", se = FALSE, span = .95, color = "purple") + geom_smooth(method = "lm", se = FALSE, color = "forestgreen") + theme_light() + labs( x = "Hand-Checked", y = "Perplexity AI (Llama 3.1 Sonar 70B Online)\nNew Prompt" ) + ggrepel::geom_text_repel( aes(label = ifelse(abs(check - from_llm) >= 5, film, "")) ) ggplot(check, aes(x = check, y = last_post)) + geom_abline(aes(intercept = 0, slope = 1), linetype = "dotted") + geom_point(alpha = .5) + geom_smooth(method = "loess", se = FALSE, span = .95, color = "purple") + geom_smooth(method = "lm", se = FALSE, color = "forestgreen") + theme_light() + labs( x = "Hand-Checked", y = "GPT-3.5 Turbo with Wikipedia Tool\nOld Prompt" ) + ggrepel::geom_text_repel( aes(label = ifelse(abs(check - last_post) >= 5, film, "")) ) This time, the only two notable misses were for Secrets & Lies by Mike Leigh and Unforgiven by Clint Eastwood. Looking at Mike Leigh, Perplexity picked up on stage plays he directed, which I did not mention in the prompt. As for Clint Eastwood, it picked up a few acting credits—but not all of them. Conclusion I revisited an earlier post and greatly improved my ability to extract information using an LLM by: Employing a service—Perplexity AI—that incorporates web search for me Breaking up the task into smaller parts and dividing the labor, only relying on the LLM when necessary Re-writing my prompt in line with best practices and key principles—as well as relying on few-shot learning All of this underscores the importance of prompt engineering and how understanding the architecture of an LLM can lead to querying one better. Remember that, even with more complex models that can call tools be somewhat agentic, these are all just machines that do next-token prediction. Berryman & Zeigler provide a good way of thinking while you’re writing prompts: “Don’t ask yourself how a reasonable person would ‘reply’ …but rather how a document that happens to start with the prompt might continue” (pg. 18).

Tidyverse 🪐to Polars 🐻‍❄️: My Notes | R-bloggers

I found Polars syntax is quite similar to dplyr. And the way that we can chain the functions makes it even more familiar! It was fun learning the nuances, now it’s time to put them into practice! Wish me luck! 🍀 Motivation In preparation for using more Python in 2025 and also to speak more of the same language with our datathon team, I’ve decided to practice Polars in Python thinking in R first. Below is my notes to myself, hopefully I’ll be able to refer back and improve this more as I use more of this for the next month. Wish me luck! Objectives Create A Dataframe Filter, Select, Summarize, Across Mutate, Paste Extract Case_when Join / Merge To Dummies/Pivot/Unpivot Helpful Resources Lessons Learnt Create A Dataframe Tidyverse library(tidyverse) library(reticulate) use_virtualenv('path/to/your/environment') df filter(age >= 30) |> select(1:3, starts_with("performance"), starts_with("salary")) |> summarize(across(.cols = where(is.numeric), .fns = mean, .names = "mean_{.col}")) ## # A tibble: 1 × 5 ## mean_age mean_performance_rating mean_performance_reviews_count mean_salary ## ## 1 46.6 4.4 3.4 40000 ## # ℹ 1 more variable: mean_salary_increase_percentage Polars df \ .filter(pl.col('age') >= 30) \ .select(df.columns[0:4]+['^performance.*$','^salary.*$']) \ .select(pl.col(pl.Int64).mean().name.prefix('mean_')) shape: (1, 5)mean_agemean_performance_ratingmean_performance_reviews_countmean_salarymean_salary_increase_percentagef64f64f64f64f6446.64.43.440000.010.0 For some reason, for the regex above, I have to use ^ and $ sandwiched to return those column nams that I want to include. bizzare. Mutate, Paste Test 1 make a new column called combination_of_character paste all columns with character datatype separated by ``, a space select the created column Tidyverse df |> rowwise() |> transmute(combination_of_character = paste( across(where(is.character)), collapse = " " )) |> select(combination_of_character) ## # A tibble: 6 × 1 ## # Rowwise: ## combination_of_character ## ## 1 Alice New York 123 Main St, Ontario, OH 123-456-7890 alice@example.com Engine… ## 2 Bob San Francisco 123 Main St, Calgary, AB 987-654-3210 bob@example.com Marke… ## 3 Charlie Tokyo 456-7890, Tokyo, NY 098-765-4332 charlie@example.com Finance 20… ## 4 Ken Toronto 49494 Exchange St, Toronto, ON 111-232-4141 ken@stats.org Marketi… ## 5 Steven Lima 1010 Gb st, Lima, OH 505-402-6060 stephencurry@nba.com Marketing … ## 6 Carlos Cleveland 666 Heaven dr, Cleveland, OH 909-435-1000 carlos@genius.edu … Polars df \ .with_columns( pl.concat_str( pl.col(pl.String), separator=" " ).alias('combination_of_character') ) \ .select(pl.col('combination_of_character')) shape: (6, 1)combination_of_characterstr"Alice New York 123 Main St, On…"Bob San Francisco 123 Main St,…"Charlie Tokyo 456-7890, Tokyo,…"Ken Toronto 49494 Exchange St,…"Steven Lima 1010 Gb st, Lima, …"Carlos Cleveland 666 Heaven dr… Tidyverse Test 2 make a new column called age_salary glue column age and salary together with - between select columns name and age_salary df |> mutate(age_salary = paste0(age, "-", salary)) |> select(name, age_salary) ## # A tibble: 6 × 2 ## name age_salary ## ## 1 Alice 30-50000 ## 2 Bob 25-45000 ## 3 Charlie 35-60000 ## 4 Ken 50-20000 ## 5 Steven 60-40000 ## 6 Carlos 58-30000 Polars df \ .with_columns( age_salary=pl.format('{}-{}',pl.col('age'),pl.col('salary')) ) \ .select(pl.col('name','age_salary')) shape: (6, 2)nameage_salarystrstr"Alice""30-50000""Bob""25-45000""Charlie""35-60000""Ken""50-20000""Steven""60-40000""Carlos""58-30000" If it’s just 1 column, you can use this format age_salary= to name the column, otherwise you’d have to use alias to name it if there are multple columns Extract create a new column area_code_and_salary paste street number (extract it from address) with a space and the the column salary select area_code_and_salary Tidyverse df |> mutate(area_code_and_salary = paste0(str_extract(address, "\d{0,5}"), " ", salary)) |> select(area_code_and_salary) ## # A tibble: 6 × 1 ## area_code_and_salary ## ## 1 123 50000 ## 2 123 45000 ## 3 456 60000 ## 4 49494 20000 ## 5 1010 40000 ## 6 666 30000 Polars df \ .select( pl.concat_str( pl.col('address').str.extract(r'^(\d{0,5})'), pl.lit(" "), pl.col('salary') ).alias('area_code_and_salary') ) shape: (6, 1)area_code_and_salarystr"123 50000""123 45000""456 60000""49494 20000""1010 40000""666 30000" Have to use pl.lit(' ') for any constant string Case_when Test 1 create a new column called familiarity if address contains OH, then return local if address contains NY, then return foodie otherwise return elsewhere Tidyverse df |> mutate(familiarity = case_when( str_detect(address, "OH") ~ "local", str_detect(address, "NY") ~ "foodie", TRUE ~ "elsewhere" )) ## # A tibble: 6 × 17 ## name age city address phone_number email salary department hire_date ## ## 1 Alice 30 New York 123 Ma… 123-456-7890 alic… 50000 Engineeri… 2010-01-… ## 2 Bob 25 San Fran… 123 Ma… 987-654-3210 bob@… 45000 Marketing 2012-05-… ## 3 Charlie 35 Tokyo 456-78… 098-765-4332 char… 60000 Finance 2015-10-… ## 4 Ken 50 Toronto 49494 … 111-232-4141 ken@… 20000 Marketing 2010-04-… ## 5 Steven 60 Lima 1010 G… 505-402-6060 step… 40000 Marketing 2009-10-… ## 6 Carlos 58 Cleveland 666 He… 909-435-1000 carl… 30000 Finance 2005-11-… ## # ℹ 8 more variables: status , salary_increase_percentage , ## # years_of_service , bonus_amount , performance_rating , ## # performance_reviews_count , performance_reviews_last_updated , ## # familiarity Polars df \ .with_columns([ pl.when(pl.col('address').str.contains('OH')) .then(pl.lit('local')) .when(pl.col('address').str.contains('NY')) .then(pl.lit('foodie')) .otherwise(pl.lit('elsewhere')) .alias('familiarity') ]) shape: (6, 17)nameagecityaddressphone_numberemailsalarydepartmenthire_datestatussalary_increase_percentageyears_of_servicebonus_amountperformance_ratingperformance_reviews_countperformance_reviews_last_updatedfamiliaritystri64strstrstrstri64strstrstri64i64i64i64i64strstr"Alice"30"New York""123 Main St, Ontario, OH""123-456-7890""alice@example.com"50000"Engineering""2010-01-01""Active"105200042"2022-05-01""local""Bob"25"San Francisco""123 Main St, Calgary, AB""987-654-3210""bob@example.com"45000"Marketing""2012-05-15""Inactive"53150031"2021-07-15""elsewhere""Charlie"35"Tokyo""456-7890, Tokyo, NY""098-765-4332""charlie@example.com"60000"Finance""2015-10-01""Active"157300053"2022-08-31""foodie""Ken"50"Toronto""49494 Exchange St, Toronto, ON""111-232-4141""ken@stats.org"20000"Marketing""2010-04-01""Inactive"1010500053"2024-10-30""elsewhere""Steven"60"Lima""1010 Gb st, Lima, OH""505-402-6060""stephencurry@nba.com"40000"Marketing""2009-10-30""Active"1010300044"2023-01-02""local""Carlos"58"Cleveland""666 Heaven dr, Cleveland, OH""909-435-1000""carlos@genius.edu"30000"Finance""2005-11-12""Active"512200045"2024-12-12""local" Test 2 convert name data to lowercase create new column called email_name and extract email before the @ select columns that starts with name or end with name create a new column called same? if name and email_name is the same, then return yes otherwise return no Tidyverse df |> mutate( name = tolower(name), email_name = str_extract(email, "^([\d\w]+)@", group = 1) ) |> select(starts_with("name") | ends_with("name")) |> mutate(`same?` = case_when( name == email_name ~ "yes", TRUE ~ "no")) ## # A tibble: 6 × 3 ## name email_name `same?` ## ## 1 alice alice yes ## 2 bob bob yes ## 3 charlie charlie yes ## 4 ken ken yes ## 5 steven stephencurry no ## 6 carlos carlos yes Polars df \ .with_columns( [ pl.col('name').str.to_lowercase(), pl.col('email').str.extract(r'^([\d\w]+)@', group_index = 1) .alias('email_name') ] ) \ .select([ pl.col('^name|.*name$'), pl.when( pl.col('name') == pl.col('email_name')).then(pl.lit('yes')) .otherwise(pl.lit('no')) .alias('same?') ] ) shape: (6, 3)nameemail_namesame?strstrstr"alice""alice""yes""bob""bob""yes""charlie""charlie""yes""ken""ken""yes""steven""stephencurry""no""carlos""carlos""yes" Learnt that apparently we cannot use look forward or backward in polars. Such as .*(?=@) to capture the email_name Group_by, Shift, Forward_Fill group by department column summarize by selecting name, new column salary_shift with conditions: if the department only has 1 row of salary data, do not shift salary if the department has more than 1 row of salary data, shift by -1 of salary column reason: there was a mistake in entering data for those with more than 1 row of data, apparently the actualy salary data is 1 row more then forward fill the salary_shift with the number prior in the same group Tidyverse df |> group_by(department) |> summarize( name = name, salary_shift = case_when( n() == 1 ~ salary, TRUE ~ lead(salary) ) ) |> fill(salary_shift, .direction = "down") ## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in ## dplyr 1.1.0. ## ℹ Please use `reframe()` instead. ## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` ## always returns an ungrouped data frame and adjust accordingly. ## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was ## generated. ## `summarise()` has grouped output by 'department'. You can override using the ## `.groups` argument. ## # A tibble: 6 × 3 ## # Groups: department [3] ## department name salary_shift ## ## 1 Engineering Alice 50000 ## 2 Finance Charlie 30000 ## 3 Finance Carlos 30000 ## 4 Marketing Bob 20000 ## 5 Marketing Ken 40000 ## 6 Marketing Steven 40000 Polars df \ .group_by('department') \ .agg( pl.col('name'), pl.when(pl.col('salary').len()==1).then(pl.col('salary')) .otherwise(pl.col('salary').shift(-1)) .alias('salary_shift')) \ .explode('name','salary_shift') \ .with_columns( pl.col('salary_shift').forward_fill()) shape: (6, 3)departmentnamesalary_shiftstrstri64"Engineering""Alice"50000"Finance""Charlie"30000"Finance""Carlos"30000"Marketing""Bob"20000"Marketing""Ken"40000"Marketing""Steven"40000 Apparently polars would turn the column into a nested dataframe (list) when grouped and can’t do fill when it’s in list? will have to unnest by explode before fill can be used. Unless of coure if you merge the fill in the same line when shifting, such as df \ .group_by('department') \ .agg( pl.col('name'), pl.when(pl.col('salary').len()==1).then(pl.col('salary')) .otherwise(pl.col('salary').shift(-1)) .forward_fill() .alias('salary_shift')) Is There An Easier Way to Unnest without Typing ALL of the columns in Polars? Yes! I believe pl.col, pl.select, pl.filter all take a list of conditions. First create a list of columns you want to unnest, then use pl.col to select them. dt = [pl.List(pl.Int64),pl.List(pl.String)] df \ .group_by('department', maintain_order=True) \ .agg( pl.col('name'), pl.col('salary_increase_percentage'), salary_shift = pl.when(pl.col('salary').count() == 1) .then(pl.col('salary')) .otherwise(pl.col('salary').shift(-1)) ) \ .explode(pl.col(dt)) \ .with_columns( pl.col('salary_shift').forward_fill() ) \ .with_columns( new_raise = pl.col('salary_shift') * (1+pl.col('salary_increase_percentage')/100) ) shape: (6, 5)departmentnamesalary_increase_percentagesalary_shiftnew_raisestrstri64i64f64"Engineering""Alice"105000055000.0"Marketing""Bob"52000021000.0"Marketing""Ken"104000044000.0"Marketing""Steven"104000044000.0"Finance""Charlie"153000034500.0"Finance""Carlos"53000031500.0 Merge / Join Create another mock data with dataframe called df_dept with columns department and dept_id Tidyverse df_dept left_join(df_dept, by = "department") |> select(name, dept_id) |> mutate(employee_id = map_chr(dept_id, ~paste0(.x, "-", sample(1000000:9999999, 1)))) ## # A tibble: 6 × 3 ## name dept_id employee_id ## ## 1 Alice 30 30-1694470 ## 2 Bob 25 25-1696036 ## 3 Charlie 20 20-4463080 ## 4 Ken 25 25-6942432 ## 5 Steven 25 25-3012223 ## 6 Carlos 20 20-8705991 Polars import random df \ .join(df_dept, on="department") \ .select(['name','dept_id']) \ .with_columns( employee_id = pl.format( '{}-{}', 'dept_id', pl.Series([ random.randint(100000, 999999) for _ in range(len(df)) ]) ) ) shape: (6, 3)namedept_idemployee_idstri64str"Alice"30"30-832410""Bob"25"25-883365""Charlie"20"20-484404""Ken"25"25-421175""Steven"25"25-670538""Carlos"20"20-638378" there is a function called map_elements in polars but the documentation stated that it’s inefficient, essentially using a for loop. I’m not entirely certain if list comprehension above is any more efficient. Another probably more efficent way of doing this is 2 separate process. The random number generation on another dataframe, then merge it. To Dummies, Pivot_longer / Pivot_wider create dummy variables for department using name as index or id Tidyverse df |> select(name, department) |> pivot_wider(id_cols = "name", names_from = "department", values_from = "department", values_fill = 0, values_fn = length, names_prefix = "department_") ## # A tibble: 6 × 4 ## name department_Engineering department_Marketing department_Finance ## ## 1 Alice 1 0 0 ## 2 Bob 0 1 0 ## 3 Charlie 0 0 1 ## 4 Ken 0 1 0 ## 5 Steven 0 1 0 ## 6 Carlos 0 0 1 Polars - to_dummies df \ .select(['name','department']) \ .to_dummies(columns = 'department') shape: (6, 4)namedepartment_Engineeringdepartment_Financedepartment_Marketingstru8u8u8"Alice"100"Bob"001"Charlie"010"Ken"001"Steven"001"Carlos"010 Polars - pivot df \ .select(['name','address']) \ .with_columns( state = pl.col('address').str.extract(r'([A-Z]{2})$') ) \ .select('name','state') \ .pivot(on = 'state', index = 'name', values='state', aggregate_function='len') \ .with_columns( pl.col(pl.UInt32).fill_null(0) ) shape: (6, 5)nameOHABNYONstru32u32u32u32"Alice"1000"Bob"0100"Charlie"0010"Ken"0001"Steven"1000"Carlos"1000 Essentially, pivot_wider is Polars’ pivot. Whereas pivot_longer is Polars’ unpivot Helpful Resources Polars Python API documentation Cheat sheet cookbook Polars for R Lessons Learnt: stringr::str_extract has the parameter group We cannot use look forward or backward in polars polars also has selector for looking at column names more efficiently lots of trial and error going to try doing pure Polars for a month! Wish me luck! If you like this article: please feel free to send me a comment or visit my other blogs please feel free to follow me on BlueSky, twitter, GitHub or Mastodon if you would like collaborate please feel free to contact me

Creating an animated Christmas tree in R | R-bloggers

With Christmas tomorrow we have decided to create an animated Christmas Tree using {ggplot2}, {sf} and {gganimate}. First we need a tree. To do this we have used an {sf} polygon where we pass in the coordinates of the Christmas tree as a list matrix to st_polygon. We can then use geom_sf to add this layer onto a ggplot object. library(ggplot2) library(gganimate) library(sf) tree_coords = list( matrix( c(-4, 0, -2.22, 2, -3.5, 2, -1.5, 4, -2.5, 4, -0.8, 6, -1.5, 6, 0, 8, 1.5, 6, 0.8, 6, 2.5, 4, 1.5, 4, 3.5, 2, 2.22, 2, 4, 0, -4, 0), ncol=2, byrow=T ) ) tree = st_polygon(tree_coords) gg_tree = ggplot() + geom_sf(aes(), data=tree) gg_tree Okay, so now we have a tree shape. Now we need to make it a little more Christmassy by changing: The color using: fill = "forestgreen", color = "darkgreen" Adding the trunk: geom_rect(aes(xmin = -0.75, xmax = 0.75, ymin = -2, ymax = 0), fill = "saddlebrown", color = "sienna4") Add a star on the top: geom_point(aes(x = 0, y = 8), color = "gold", shape = 8, size = 7, stroke = 3) Remove the axis with: theme_void() Set the border: coord_sf(xlim = c(-6, 6), ylim = c(-4, 10)) Add a Christmas message: annotate("text", x = 0, y = 9.5, label = "Merry Christmas \n From Jumping Rivers!", size = 6) Now our tree looks like this: gg_tree = ggplot() + geom_sf(aes(), data=tree, fill = "forestgreen", color = "darkgreen") + geom_rect(aes(xmin = -0.75, xmax = 0.75, ymin = -2, ymax = 0), fill = "saddlebrown", color = "sienna4") + geom_point(aes(x = 0, y = 8), color = "gold", shape = 8, size = 7, stroke = 3) + theme_void() + coord_sf(xlim = c(-6, 6), ylim = c(-4, 10)) + annotate("text", x = 0, y = 9.5, label = "Merry Christmas \n From Jumping Rivers!", size = 6) gg_tree Next we need to use {sf} again to make some lights for the tree then {gganimate} to make the lights flash. Placing the points within the boundaries of the tree was a trickier task than we expected until we fell upon st_sample which we can pass a polygon to and it’ll create some sample points within the boundaries. We also create a vector to colour the points. points = st_sample(tree, 75) colours = sample(c("red", "yellow", "blue"), 75, replace = TRUE) gg_tree = ggplot() + geom_sf(aes(), data=tree, fill = "forestgreen", color = "darkgreen") + geom_sf(aes(), data=points, color = colours) + geom_rect(aes(xmin = -0.75, xmax = 0.75, ymin = -2, ymax = 0), fill = "saddlebrown", color = "sienna4") + geom_point(aes(x = 0, y = 8), color = "gold", shape = 8, size = 7, stroke = 3) + theme_void() + coord_sf(xlim = c(-6, 6), ylim = c(-4, 10)) + annotate("text", x = 0, y = 9.5, label = "Merry Christmas \n From Jumping Rivers!", size = 6) gg_tree We can now animate it to make the lights sparkle using transition_time and ease_aes: gg_tree + transition_time(1:75) + ease_aes('linear') Lastly, have a great Christmas and New Year from the Jumping Rivers team! For updates and revisions to this article, see the original post