
R-bloggers | R news and tutorials contributed by (750) R bloggers - Software


The Vibe of Flanders: Part 1 | R-bloggers

What’s it like to live in Flanders these days? Flanders, the Northern, Dutch-speaking part of Belgium, conducts a regular survey of the people who live here. The survey is called De Gemeente-Stadsmonitor (The Municipality and City Monitor), and covers a great many topics, from large societal issues to housing to mobility and climate.

This post describes an analysis of the summary data (aggregated per municipality) from the most recent survey wave, giving a data-driven view of the main underlying themes of the survey. We will analyze these themes in some detail, and make a map of Flemish municipalities (gemeenten in Dutch) and provinces (West Flanders, East Flanders, Flemish Brabant, Antwerp, and Limburg) according to their relationship to the themes and to one another.

In which municipalities and regions do residents feel good about where they live, and in contrast, where do residents say they feel unsafe? In which municipalities and regions do residents use environmentally sustainable transport and where do they seem dependent on the car? Read on to find out!

The Gemeente-Stadsmonitor Survey

The Gemeente-Stadsmonitor (Municipality and City Monitor) is conducted every three years by the Agency of the Interior and Statistics Flanders, and the most recent survey wave was conducted in 2023. For the 2023 survey, a representative sample of residents between 17 and 85 years old was sent the survey, and in total 389,714 people filled it out.
According to the website, all 300 Flemish municipalities should be in the data, but in fact only 299 are.[1]

The survey contains questions on 11 broad topics, as designated by the creators of the survey:

- Armoede (Poverty)
- Cultuur en vrije tijd (Culture and leisure)
- Demografie (Demography)
- Klimaat, milieu en natuur (Climate, environment and nature)
- Lokaal bestuur (Local government)
- Mobiliteit (Mobility)
- Onderwijs en vorming (Education and training)
- Samenleven (Living together)
- Werk (Work)
- Wonen en woonomgeving (Living and living environment)
- Zorg en gezondheid (Care and health)

For more information about the survey, you can check out this website (in Dutch). The 2023 survey form containing all of the question text and answer options can be found here.

Language Use in This Post

The Gemeente-Stadsmonitor survey is conducted in Dutch; all survey questions are written in this language. I’m writing this post in English to make it more widely accessible. The question descriptions in the charts below will be displayed in Dutch. I’ll provide English translations (created with machine translation via Co-Pilot) throughout the text and tables. It should be possible to follow everything described in this post, even if you don’t know any Dutch!

Finally, though the full name of the survey is the “Gemeente-Stadsmonitor”, in the text below I will refer to the survey as the “Stadsmonitor” for simplicity. This is also the term that is used in the Flemish media when talking about the survey and its results.

The Data

The data, at the gemeente / municipality level, are freely available to the public. You can download subsets of the data, or get all of it in a 100+ tab Excel file. I downloaded the Excel file with all of the data, and spent a significant amount of effort preparing it for analysis. All of the data preparation and analysis code is available on Github here.
The present analysis considers only data from the most recent survey wave, conducted in 2023. The data for the analyses below are taken from the answer options that indicate agreement with the question topic. For example, the underlying data for the question “Zich thuis voelen bij mensen in de buurt” (“Feeling at home with people in the neighborhood”) are the percentage of respondents per municipality who agree with this statement. The dataset contains 299 rows (1 per gemeente / municipality), with the answers to each question contained in 200 columns.

Analysis Part 1: Using PCA to Uncover Latent Themes in the Stadsmonitor Data

The goal of the first set of analyses is to understand the latent themes of the survey items. Each survey question is written to assess residents’ thoughts or feelings about a specific topic (the 11 subjects outlined above, e.g. Mobiliteit / Mobility). However, survey items often have a higher-level grouping that is evidenced by the interrelationship of responses to the questions. One common technique for bringing clarity to this underlying structure is PCA (Principal Components Analysis).

Principal Components Analysis is a technique that tries to reduce a set of variables into a smaller dimensional space. In the current case, we have variables describing the gemeente / municipality-level responses to the 200 questions in the Stadsmonitor. PCA allows us to find a smaller number of uncorrelated components that describe the variation in the responses to these questions. Within this reduced-dimensional space, we are better able to understand the relationships among the questions, municipalities and regions.[2]

In this blog post, I will use the term topic to describe the designation given by the survey authors, and theme to describe the data-driven groupings of survey questions from our statistical analysis (i.e. the principal components).
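As a rough illustration of the PCA step described above, here is a minimal sketch in base R. It is not the code used for this post: the data are simulated, the column names are hypothetical stand-ins for survey questions, and `prcomp()` is swapped in for the PCA package actually used in the analysis. The only things borrowed from the text are the shape of the problem (299 municipalities, percentage-style responses) and the idea of reducing many questions to a few components.

```r
# Simulated stand-in for the municipality-level data (NOT the real Stadsmonitor data):
# 299 rows (municipalities) and 6 hypothetical questions driven by two latent factors.
set.seed(42)
n <- 299
feel_good <- rnorm(n)  # hypothetical latent "place of residence" factor
mobility  <- rnorm(n)  # hypothetical latent "transport" factor

survey <- data.frame(
  at_home     = 70 + 8 * feel_good + rnorm(n, sd = 3),
  trust       = 65 + 7 * feel_good + rnorm(n, sd = 3),
  unsafe      = 20 - 6 * feel_good + rnorm(n, sd = 3),
  bike_use    = 40 + 9 * mobility  + rnorm(n, sd = 3),
  walk_short  = 50 + 6 * mobility  + rnorm(n, sd = 3),
  car_commute = 60 - 8 * mobility  + rnorm(n, sd = 3)
)

# Center and scale every column so each question carries equal weight,
# then compute the principal components.
pca <- prcomp(survey, center = TRUE, scale. = TRUE)

# Proportion of the total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 2))

# Component scores: one row per municipality, one column per component.
scores <- pca$x
print(head(scores[, 1:2]))
```

In this toy setup the first two components absorb most of the variance because the six columns were generated from two latent factors; with real survey data the split is usually less clean.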
Revealed Themes (Principal Components)

In this post, we will focus on the top two themes uncovered by the PCA. Our analysis gives us a list of underlying themes (called Principal Components or PCs in statistical terms), ranked in terms of their importance in explaining the variation in the responses to the survey questions. Each question gets a score (called a loading in statistical terms) for each Principal Component. The loadings range between -1 and +1, and the larger a question’s loading on a principal component (either in a positive or negative direction), the more the question is reflective of the theme represented by that component.[3]

Theme 1 (PC 1)

The table below shows the questions with the highest scores (loadings) - both positive and negative - on the first principal component. Examining these questions will give us an idea of the subject matter of the first theme.

As is clear in the table, the first theme is heavily concerned with feelings about one’s place of residence, in the neighborhood or gemeente / municipality. On the positive end of this principal component, we find items that focus on feeling good about where one lives. The questions deal with feeling at home, comfortable, safe, having good contacts with neighbors, etc. On the negative end of this principal component, we find questions related to feeling insecure or unsafe where one lives. The questions with the largest negative scores concern nuisances in the neighborhood, conflict, attitudes towards diversity, feeling unsafe, and interestingly, trust in the federal government.

A Word on Interpretation

PCA is based upon the underlying correlations among the responses to the survey items at the gemeente / municipality level. We can therefore use the analysis to understand how responses to the survey questions are related to one another.
Firstly, items that have large positive loadings on PC 1 are positively correlated with one another, and items that have large negative loadings on PC 1 are also positively correlated with one another. For example, gemeenten / municipalities that have higher average scores on the item “Feeling at home with people in the neighborhood” (PC 1 loading of .88) also have higher average scores on the item “Satisfaction with contact in the neighborhood” (PC 1 loading of .87). And at the other end of PC 1, gemeenten / municipalities with higher average scores on the item “Feeling of insecurity in the neighborhood” (PC 1 loading of -.73) also have higher average scores on the item “Chat with people of non-Belgian origin” (PC 1 loading of -.72).

Furthermore, items with positive loadings on a given principal component are negatively correlated with items with negative loadings on that component. For example, gemeenten / municipalities with higher average scores on the item “Feeling at home with people in the neighborhood” (PC 1 loading of .88) have lower average scores on the item “Feeling of insecurity in the municipality” (PC 1 loading of -.74).

Note that these relationships are correlations, and not causal relationships. The table of loadings below shows that gemeenten / municipalities where residents speak more with people of non-Belgian origin (loading of -.72) also report feeling less safe in their neighborhood (loading of -.73). However, this does not mean that people feel less safe because they speak more frequently with individuals of non-Belgian origin.
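The interpretation rules above can be checked on a small example. The sketch below is an illustration on simulated data with hypothetical item names, again using base R's `prcomp()` rather than the package used for the post. It computes loadings as the correlations between each item and the component scores, and then verifies, for this toy data only, that items with same-signed PC 1 loadings correlate positively while items with opposite signs correlate negatively. With real data the pattern is only reliable for items that load strongly on the component.

```r
# Simulated items (hypothetical names): two move with a latent factor, two against it.
set.seed(1)
n <- 299
latent <- rnorm(n)
items <- data.frame(
  at_home  =  latent + rnorm(n, sd = 0.5),
  contact  =  latent + rnorm(n, sd = 0.5),
  unsafe   = -latent + rnorm(n, sd = 0.5),
  nuisance = -latent + rnorm(n, sd = 0.5)
)

pca <- prcomp(items, center = TRUE, scale. = TRUE)

# Loadings as correlations between each item and each component's scores.
loadings <- cor(items, pca$x)

# For standardized data this equals the eigenvectors scaled by the
# component standard deviations.
loadings_alt <- sweep(pca$rotation, 2, pca$sdev, `*`)
print(all.equal(loadings, loadings_alt, check.attributes = FALSE))

# Compare the sign pattern of PC 1 loadings with the item correlations.
pc1 <- loadings[, "PC1"]
same_sign <- outer(sign(pc1), sign(pc1)) > 0  # TRUE where two items share a sign
item_cor <- cor(items)
print(round(pc1, 2))
print(round(item_cor, 2))
```

Because the overall sign of a component is arbitrary, which pair of items ends up on the positive side may differ between runs or software; the pattern of agreement and opposition among the items is what carries the meaning.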
Item - Dutch | Item - English | PC 1
Sociaal weefsel in de buurt - Zich thuis voelen bij mensen in de buurt | Social fabric in the neighborhood - Feeling at home with people in the neighborhood | 0.88
Tevredenheid over contact in de buurt | Satisfaction with contact in the neighborhood | 0.87
Tevredenheid over de buurt | Satisfaction with the neighborhood | 0.85
Zich thuis voelen in de buurt | Feeling at home in the neighborhood | 0.85
Sociaal weefsel in de buurt | Social fabric in the neighborhood | 0.85
Sociaal weefsel in de buurt - Mensen in de buurt zijn te vertrouwen | Social fabric in the neighborhood - People in the neighborhood are trustworthy | 0.84
Sociaal weefsel in de buurt - Mensen in de buurt willen hun buren helpen | Social fabric in the neighborhood - People in the neighborhood want to help their neighbors | 0.84
Graag wonen in de gemeente | Like living in the municipality | 0.80
Plaats om fietsen te stallen in of bij de woning | Place to park bicycles in or at the house | 0.79
Duurzaamheid van de woning - Zonnepanelen | Sustainability of the house - Solar panels | 0.79
Private buitenruimte of garage - Garage | Private outdoor space or garage - Garage | 0.76
Netheid van het centrum | Cleanliness of the center | 0.76
Buurthinder: lastiggevallen worden op straat | Neighborhood nuisance: being harassed on the street | -0.62
Diversiteit vriendenkring - Vrienden niet-Belgische herkomst | Diversity of friends - Friends non-Belgian origin | -0.63
Vertrouwen in federale overheid | Trust in federal government | -0.63
Verplaatsingen vrije tijd - Bus, tram of metro | Leisure travel - Bus, tram or metro | -0.64
Houding tegenover diversiteit - Teveel verschillende herkomst | Attitude towards diversity - Too much different origin | -0.64
Buurthinder: vandalisme en drugsdealing - Drugsdealing | Neighborhood nuisance: vandalism and drug dealing - Drug dealing | -0.69
Intensiteit van contacten - Praatje met mensen van niet-Belgische herkomst | Intensity of contacts - Chat with people of non-Belgian origin | -0.72
Onveiligheidsgevoel in de buurt | Feeling of insecurity in the neighborhood | -0.73
Buurthinder: milieuhinder - Hondenpoep | Neighborhood nuisance: environmental nuisance - Dog poop | -0.73
Onveiligheidsgevoel in de gemeente | Feeling of insecurity in the municipality | -0.74
Buurthinder: vandalisme en drugsdealing - Vandalisme | Neighborhood nuisance: vandalism and drug dealing - Vandalism | -0.76
Buurthinder | Neighborhood nuisance | -0.82

Theme 2 / PC 2

The questions with the highest scores (both positive and negative) on the second underlying theme are shown in the table below. This theme is heavily focused on transport and mobility, important topics in a country where the near-constant traffic jams have enormous negative impacts on the economy and commuter well-being.

On the positive dimension of the principal component, the questions focus on environmentally sustainable transportation. Many of the top items concern bicycle use - whether respondents frequently use their bikes, the state of the bike infrastructure, and whether they feel safe while cycling. Mixed in with these items are questions related to walking and being physically active. At the negative end of the second theme, we find mostly items about frequent car usage.
Item - Dutch | Item - English | PC 2
Duurzaam verplaatsingsgedrag voor korte afstanden | Sustainable travel behavior for short distances | 0.68
Voldoende fietsenstallingen | Enough bicycle parking | 0.64
Milieubewust handelen | Environmentally conscious behavior | 0.64
Veilig fietsen | Safe cycling | 0.63
Voldoende autoluwe en autovrije zones | Enough low-traffic and car-free zones | 0.62
Verplaatsingen vrije tijd - Fiets / elektrische fiets | Leisure travel - Bike / electric bike | 0.62
Verplaatsingen vrije tijd - Fiets algemeen | Leisure travel - Bike in general | 0.62
Milieubewust handelen - Korte afstanden te voet | Environmentally conscious behavior - Short distances on foot | 0.61
Duurzaam verplaatsingsgedrag voor korte afstanden - Korte afstanden te voet | Sustainable travel behavior for short distances - Short distances on foot | 0.61
Tevredenheid over staat van fietsinfrastructuur | Satisfaction with the condition of cycling infrastructure | 0.60
Actief bewegen | Active movement | 0.58
Voldoende fietsinfrastructuur | Enough cycling infrastructure | 0.58
Verplaatsingen vrije tijd - Te voet | Leisure travel - On foot | 0.57
Voldoende openbaar vervoer | Enough public transport | 0.56
Buurthinder: milieuhinder - Zwerfvuil | Neighborhood nuisance: environmental nuisance - Litter | -0.54
Verplaatsingen vrije tijd - Autopassagier | Leisure travel - Car passenger | -0.57
Buurthinder: verkeershinder - Snel rijden | Neighborhood nuisance: traffic nuisance - Fast driving | -0.57
Verplaatsingen woon-werk/woon-school: dominant vervoermiddel - Auto | Commuting: dominant mode of transport - Car | -0.60
Verplaatsingen vrije tijd - Autobestuurder | Leisure travel - Car driver | -0.60

Plotting the Stadsmonitor Questions According to Their Scores on Themes 1 & 2

We can plot the individual survey items in the tables above in the two-dimensional space defined by the scores of each question on the first two principal components, mapping out where the questions fall with regard to one another within this space.
The information shown in the plot below is the same as that contained in the above tables, but the visualization provides another way of understanding the relationships among the questions and the themes of the first two principal components. The items are colored by the question topic, as determined by the agency that administered the survey. Note that we are showing just a subset of the questions in this plot; there are 200 questions, and plotting them all yields an enormous jumble that is difficult to read and to interpret.

The main concerns of the first theme (displayed on the horizontal or x-axis) - feelings about one’s place of residence - are very clear. On the positive (right-hand) side, we see the questions describing positive feelings about where one lives (e.g. “Zich thuis voelen in de buurt” / “Feeling at home in the neighborhood”). On the negative (left-hand) side, we see the items about feeling insecure or unsafe in the place where one lives (e.g. “Onveiligheidsgevoel in de buurt” / “Feeling of insecurity in the neighborhood”).

The main concerns of the second theme (displayed on the vertical or y-axis) - all about transport and mobility - are also quite clear. At the positive (upper) side, we see the items about using bikes or other environmentally sustainable transportation modes, whereas on the negative (bottom) side, we see items about using the car.

Notice that the question topics (assigned by the survey writers) are grouped into higher-level themes by the PCA. For example, the first theme / principal component - about feeling good or bad where one lives - contains survey items from the Wonen en woonomgeving (Living and living environment), Samenleven (Living together), and Cultuur en vrije tijd (Culture and leisure) topics.
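One way to keep such a plot readable is to show only the items that load most strongly on either of the two components. The sketch below illustrates that idea in base R; the selection rule, the cutoff of 25 items, the column names, and the helper `plot_top_items()` are all assumptions made for illustration, and the loadings are random placeholders rather than the real values. How the subset for the actual plot was chosen is not stated in this post.

```r
# Placeholder loadings for 200 hypothetical questions (random values, NOT real results).
set.seed(7)
n_items <- 200
loadings <- data.frame(
  item  = paste0("question_", seq_len(n_items)),
  topic = sample(c("Mobiliteit", "Samenleven", "Wonen en woonomgeving"),
                 n_items, replace = TRUE),
  pc1   = runif(n_items, -1, 1),
  pc2   = runif(n_items, -1, 1)
)

# Strength of an item in the PC 1 / PC 2 plane: its largest absolute loading.
loadings$strength <- pmax(abs(loadings$pc1), abs(loadings$pc2))

# Keep the 25 strongest items (assumed cutoff) so the labels stay legible.
top_items <- head(loadings[order(-loadings$strength), ], 25)

# Hypothetical helper: scatter plot of the retained items, colored by topic.
plot_top_items <- function(df) {
  topic_factor <- factor(df$topic)
  plot(df$pc1, df$pc2, col = as.integer(topic_factor), pch = 19,
       xlim = c(-1, 1), ylim = c(-1, 1),
       xlab = "PC 1 loading", ylab = "PC 2 loading")
  abline(h = 0, v = 0, lty = 2)  # axes through the origin
  text(df$pc1, df$pc2, labels = df$item, pos = 3, cex = 0.6)
  legend("topright", legend = levels(topic_factor),
         col = seq_along(levels(topic_factor)), pch = 19)
}
```

Calling `plot_top_items(top_items)` draws the chart on the active graphics device. Coloring by topic makes it easy to see when items from different survey topics land in the same region of the plane.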
Analysis Part 2: Plotting the Gemeenten / Municipalities and Provinces According to Their Scores on Themes 1 & 2

We can use the results of our PCA to plot each gemeente / municipality in the two-dimensional space defined by the first two principal components, coloring the points by the Flemish province in which they are located. We can also compute the average of each province along the principal components; these province averages are indicated by the larger dots on the plot below.

There is a clear ordering of the provinces according to the first theme / principal component (displayed on the horizontal axis), which concerns feelings about one’s place of residence. Gemeenten / municipalities in West Flanders have, on average, the highest scores on this dimension, indicating that residents in this province feel the most positive about where they live. West Flanders is followed closely by Limburg, while Antwerp and East Flanders fall somewhere in the middle. Flemish Brabant has the lowest average scores on the first theme / principal component, indicating that gemeenten / municipalities in this province are the least positive about their place of residence, and that on average feelings of insecurity are higher there.

There is much less variation by province along the second theme / principal component (displayed on the vertical axis). On this transport and mobility dimension, Antwerp scores the highest, indicating that gemeenten / municipalities in this province, on average, indicate greater use of environmentally sustainable transportation. West Flanders falls in the middle, while the remaining provinces are located on the frequent car usage side of this dimension.

Focusing on the Gemeenten / Municipalities at the Extremes of Themes 1 & 2

In the plot above, I’ve focused on the center of the coordinates defined by the first two principal components.
However, there are gemeenten / municipalities that fall at both positive and negative extremes along both of these dimensions. The plots below display the gemeenten / municipalities with the most extreme scores at the positive and negative ends of each theme.

Highest Scores On Theme / PC 1 - Feelings About One’s Place of Residence

The plot below shows the gemeenten / municipalities with the highest scores on theme / principal component 1. These are the places where survey respondents feel the most positive about where they live.

Lowest Scores On Theme / PC 1 - Feelings About One’s Place of Residence

The plot below shows the gemeenten / municipalities with the lowest scores on theme / principal component 1. These are the places where survey respondents report the greatest feelings of insecurity where they live.

Highest Scores On Theme / PC 2 - Transport & Mobility

The plot below shows the gemeenten / municipalities with the highest scores on theme / principal component 2. These are the places where survey respondents report the greatest use of environmentally sustainable transportation; people here regularly walk or use the bike in their daily lives.

Lowest Scores On Theme / PC 2 - Transport & Mobility

The plot below shows the gemeenten / municipalities with the lowest scores on theme / principal component 2. These are the places where survey respondents report the highest degrees of frequent car usage; people here travel more by car in their daily lives.

Summary and Conclusion

In this blog post, we used principal components analysis to analyze the summary data per gemeente / municipality from the 2023 Stadsmonitor survey. This analysis allowed us to uncover the main themes that underlie the survey responses. We focused here on the first two themes or principal components. The first theme concerned feelings about one’s place of residence. At the positive end of this dimension, we found questions related to feeling good about where one lives.
At the negative end of this dimension, we found questions related to feeling insecure or unsafe where one lives. The second theme concerned transport and mobility. At the positive end of this dimension, we found questions related to the usage of environmentally sustainable transportation, while at the negative end of this dimension, we found questions related to frequent car usage.

We then plotted the gemeenten / municipalities and provinces in the two-dimensional space defined by the first two themes / principal components. There was a clear ordering of the provinces according to the first dimension. Respondents in West Flemish municipalities feel, on average, the best about their places of residence, while residents in Flemish Brabant report the greatest feelings of insecurity about where they live. There was less variation by province according to the second dimension; the one clear difference was that gemeenten / municipalities in Antwerp were much more likely to report greater use of environmentally sustainable transportation such as biking or walking, compared to the other provinces.

Finally, we made plots of the gemeenten / municipalities that scored the highest at the positive and negative ends of the two dimensions, highlighting the places that feel the best vs. worst about where they live, and the places that make the greatest use of environmentally sustainable transport options vs. the places where residents are most likely to make use of the car in their daily lives.

Coming Up Next

The next two posts will also focus on the data from the 2023 Stadsmonitor survey. We will further explore the dimensions of the PCA described above, and understand how Flemish gemeenten and provinces differ according to the third theme / principal component.[4] The final post in this series will be a technical one, and will present (with code) the details of the analyses described above. Stay tuned!
[1] If anyone from the survey team sees this, what happened to Herstappe - NIS Code 73028?

[2] For an excellent introduction to PCA, I highly recommend these wonderfully clear videos from the Hastie and Tibshirani “Introduction to Statistical Learning” online course. François Husson, a developer of the R package we’ll use to do the PCA, has a great open course about sensographics on YouTube (in French only), and some interesting tutorials about PCA in English. Also, for a similar application of PCA in a different context, you can check out this previous post describing a market-mapping study of beverages, based on consumer responses to a survey describing how the beverages made them feel.

[3] The sign of a question’s score (i.e. positive or negative) is determined by the direction and magnitude of the variable’s contribution to the principal component and is arbitrary; the relative signs of loadings within a component, which indicate the pattern of correlations among variables, allow us to interpret the component’s meaning.

[4] Spoiler alert: the third principal component concerns lefty, green, or “crunchy” themes such as organic and fair trade product purchases, vegetarian eating, limiting plastic use, etc.

The Rebus Code: Unveiling the Secrets of Regex in R | R-bloggers

In the intricate world of data analysis, the task of text pattern recognition and extraction is akin to unlocking a secret cipher hidden within ancient manuscripts. This is the realm of regular expressions (regex), a powerful yet often underappreciated tool in the data scientist’s toolkit. Much like the cryptex from Dan Brown’s “The Da Vinci Code,” which holds the key to unraveling historical and cryptic puzzles, regular expressions unlock the patterns embedded in strings of text data.

However, the power of regex comes at a cost: its syntax is notoriously complex and can be as enigmatic as the riddles solved by Robert Langdon in his thrilling adventures. For those not versed in its arcane symbols, crafting regex patterns can feel like deciphering a code without a Rosetta Stone. This is where the rebus package in R provides a lifeline. It simplifies the creation of regex expressions, transforming them from a cryptic sequence of characters into a readable and manageable code, akin to translating a hidden message in an old relic.

In this tutorial, we embark on a journey akin to that of Langdon’s through Paris and London, but instead of ancient symbols hidden in art, we’ll navigate through the complexities of text data. We will explore the fundamental principles of regex that form the backbone of text manipulation tasks. From basic pattern matching to crafting intricate regex expressions with the rebus package, this guide will illuminate the path towards mastering regex in R, making the process as engaging as uncovering a secret passage in an ancient temple.

Just as Langdon used his knowledge of symbolism to solve mysteries, we will use rebus to demystify regex in R, making this powerful tool accessible and practical for everyday data tasks.
Whether you're a seasoned data scientist or a novice in the field, understanding how to effectively use regex is like discovering a hidden map that leads to buried treasure, providing you with the insights necessary to make informed decisions based on your data.

With our thematic setting now established, let us delve deeper into the world of regular expressions and reveal how the rebus package can transform your approach to data analysis, turning a daunting task into an intriguing puzzle-solving adventure.

In the quest for understanding regular expressions, akin to decoding a series of cryptic messages left in Leonardo Da Vinci’s artworks, we start with the very basics: the symbols and syntax that are the foundational tools of this powerful scripting language. Just as symbols held profound meanings in ancient scripts, each character in a regex pattern holds specific and significant implications.

Unveiling the Symbols

Regular expressions operate through special characters that, when combined, form patterns capable of matching and extracting text with incredible precision. Here are a few fundamental symbols to understand:

- Dot (.): Like the omnipresent eye in a Da Vinci painting, the dot matches any single character, except newline characters.
  It sees all but the end of a line.
- Asterisk (*): Mirroring the endless loops in a Fibonacci spiral, the asterisk matches the preceding element zero or more times, extending its reach across the string.
- Plus (+): This symbol requires the preceding element to appear at least once, much like insisting on the presence of a key motif in an artwork.
- Question Mark (?): It makes the preceding element optional, introducing ambiguity into the pattern, akin to an unclear symbol whose meaning might vary.
- Caret (^): Matching the start of a string, the caret sets the stage much like the opening scene in a historical mystery.
- Dollar Sign ($): This symbol matches the end of a string, providing closure and ensuring that the pattern adheres strictly to the end of the text.

Example: Simple Patterns in Action

Using the stringr library enhances readability and flexibility in handling regular expressions. Let’s apply this to find specific patterns:

library(stringr)
text_vector