1. home
  2. R-bloggers | R news and tutorials contributed by (750) R bloggers
  3. Software

R-bloggers | R news and tutorials contributed by (750) R bloggers - Software

8 | Follower

Learning The Basics of Phylogenetic Analysis | R-bloggers

🧬🔬 Explore phylogenetic analysis from genome to tree! Basic workflow with R/Bioconductor. Learnt to work with large genomic dataset. Extract 16S rRNA from 10K+ E.coli strains using dataset dehydrate, barrnap for extraction, rapidNJ for tree building & FigTree for visualization. Movitation: After that last hands-on experience on Bioconductor, we will continue our journey in phylogenetic analysis. I’ve always been intrigued in how biologists piece these phylogenetic tree together and I want to know the big idea of how this is done. We’ll again be using Bioconductor. Let’s go! Disclaimer: I am not a bioinformatician and do not work with genes directly, the articles and method presented is my attempt to get a birds eye view on how we went from different isolates to piecing them together onto a single tree. Please take this with a grain of salt. Verify the information presented. If you noted some error in this article, please let me know so that I can learn! Also, some of the analysis results were not run during rmarkdown knitting because that causes a significant delay, however, the results posted here should be reprodicuble. Please again let me know if they are not The End Goal: Get a basic workflow of how this is done Assess Many Many Ecoli strains Assess Different Genus on a single tree Looks quite doable! Along the line, we may have some deeper dive to look at the machinery behind. Ultimately, we want to visualize beuatiful trees! 🌴 Like this. Objectives: What is phylogenetic analysis? The workflow Extract 16S rRna Align Distance Calculation Tree construction Visualizing Phylogenetic tree Ecoli Large Dataset Other Genus Opportunities for improvement Lessons Learnt What is phylogenetic analysis? Phylogenetic analysis is a method used to study the evolutionary relationships between organisms. It involves comparing genetic sequences, such as DNA or RNA, to infer how species are related through common ancestry. This analysis can help identify how different organisms have evolved over time and can be particularly useful in contact tracing, outbreak assessment. Here is a quick wiki Let’s Download ALL Ecoli Fasta One of our previous opportunity improvement is to learn to use NCBI dataset CLI. We’ll be using that this time because if you try to download more than 10,000 of fasta via website or event the CLI itself, it won’t be as smooth, at least from my multiple tries. I’ve attempted maybe 5-6 times and couldn’t complete the download even when I used datasets and unable to resume downloads. However, the most stable is actually to download dehydrated then rehydrate. See here Terminal curl -o datasets 'https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/mac/datasets' chmod +x datasets dataformat datasets download genome taxon "Escherichia coli" \ --assembly-level scaffold,chromosome,complete \ --dehydrated \ --filename ecoli_high_quality_dehydrated.zip unzip ecoli_genomes_dehydrated.zip cd ncbi_dataset datasets rehydrate --directory For large genome downloads, use datasets dehydrated then rehydrate. It’s easier to resume download if connection got lost. You notice above that we left off contig and just included scaffold, chromosome, and complete? That’s because there were 300,000 sequences if we include all of them. I definitely don’t need all of those packages. With scaffold, chromosome, and complete, we got 30,000+ and the entire zip file was 58 gigs of data. 😵‍💫 Just for my education: Contig - Short for “contiguous sequence.” This is a continuous stretch of DNA sequence with no gaps. Contigs are assembled from overlapping sequencing reads, but they represent isolated pieces without known relationships to other contigs. Scaffold - A collection of contigs that have been ordered and oriented relative to each other, often with gaps of estimated size between them. Scaffolding uses additional information like paired-end reads or mate-pair libraries to determine how contigs should be arranged, even when the sequence connecting them isn’t fully resolved. Chromosome - Contigs and scaffolds have been assembled into chromosome-scale sequences that represent entire chromosomes. This typically requires additional long-range information like Hi-C data, optical mapping, or long-read sequencing to achieve proper chromosome-level organization. Complete - The highest level of assembly where the entire genome is finished with no gaps, including challenging repetitive regions, centromeres, and telomeres. This represents a truly complete, end-to-end sequence of all chromosomes. The Workflow The workflow for phylogenetic analysis typically involves several key steps: Extract 16s rRNA -> Align -> Calculate Distance -> Construct Tree Extract 16S rRNA This step involves obtaining the genetic material from the organisms of interest, such as bacteria or viruses. For example, in the case of antibiotic resistance, researchers might extract the 16S rRNA gene, which is commonly used for bacterial identification. The reason for extracting 16S rRNA in bacteria is beccause the sequence are highly conserved across different species, making it a reliable marker for phylogenetic analysis. The 16S rRNA gene is present in all bacteria and archaea, and its sequence can provide insights into the evolutionary relationships between different microbial species. Let’s dive into the code to do this. We’ll use the same 2 groups we used before, regular Ecoli and ESBL Ecoli from before library(Biostrings) library(DECIPHER) # load data path1

Learning Antimicrobial Resistance (AMR) genes with Bioconductor | R-bloggers

Instead of flashcards, we Rube Goldberg’d this with Bioconductor! Analyzed 3,280 E. coli genomes from NCBI, detecting ESBL genes in 84.4% of samples. CTX-M-15 was most common. Helped us understand gene nomenclature and sequence analysis! 📊🔬 Motivation I’ve always had a hard time learning and remembering all these genes for antimicrobial resistance (AMR). Yes, we can probably create some nice flash cards and try to memorize that way. But, why do the easy way when we can Rube Goldberg machine this! And use this as an opportunity to revisit Bioconductor and learn it! Let’s go! Thought Process But, how? There is so much to learn! Well, to make it somewhat clinically applicable, or at least bridge the gap of understanding, is answer some of my own questions: What genes control the production of extended spectrum beta lactamase (ESBL) in Escherichia coli (E. coli)? Do these genes have the same DNA sequence across species and genus? Disclaimer I am not a bioinformatician and do not work with amr, the articles and method presented is my attempt to form a better memory association towards clinical amr and the genetic terminologies that are usually used by microbiologists. Please take this with a grain of salt. Verify the information presented. If you noted some error in this article, please let me know so that I can learn! Also, some of the analysis results were not run during rmarkdown knitting because that causes a significant delay, however, the results posted here should be reprodicuble. Please again let me know if they are not Objectives How To Download E. Coli Data? How To Download Class A Beta Lactamase Genes? How To Detect Gene? Let’s Go All In and Assess ALL Available ESBL Ecoli in NCBI Answer To My Questions Opportunity for Improvements Lessons Learnt How To Download E. Coli Data ? Let’s select 2 different groups. The first group is just a search of the bacteria and download the first 10. The second group we’ll specifically filter out ESBL group. Regular E. Coli Click here to go to this page. Select the first 10 E. coli and then click Download > Download Package The selection should be default like above, then hit Download and you’ll have a zip file. Once you downloaded the zip file above. After unzipping it, go to the data folder and you’ll have quite a few folders with their NCBI RefSeq assembly like so. Each folder contains fna file (FASTA file containing nucleotide sequences) of the bacteria that contains the whole genome sequence of that particular strain of bacteria. Let’s take a look at the first assembly! (sequence filter(assembly != "none") |> group_by(file) |> mutate(n = row_number()) |> filter(n == 1) |> ungroup(file) |> select(esbl_seq) |> mutate(gene=str_extract(esbl_seq,"(?

Nonparametric serial interval estimation | R-bloggers

Motivation Epidemiological delays inform about the time between two well-defined events related to a disease. The serial interval (SI) of an infectious disease is defined as the time between symptom onset in a primary case (infector) and symptom onset in a secondary case (infectee). It is a widely used epidemiological delay quantity and plays a central role in mathematical/statistical models of disease transmission. There exists a tight link between the reproduction number (average number of secondary infections generated by an infected individual) and the serial interval. Therefore, getting accurate knowledge about the SI distribution is key to gain a clear understanding of transmission dynamics during outbreaks. Timings of symptom onset for infector-infectee pairs can be obtained from line list data and observations usually consist of calendar dates. From a mathematical perspective, it is more convenient to work with numbers than with calendar dates and the latter are typically transformed to integers for the sake of statistical analysis. The main challenge when working with SI data is censoring in the sense that exact symptom onset times are usually unobserved and only known to have occurred between two time points. If the time resolution of a reported timing of illness onset is a calendar day, for instance July 15, there is not enough information to determine the exact time of illness onset within that day. As such, symptom onset is assumed to have occurred between July 15 and July 16 and we say that serial interval data are interval-censored. The figure below illustrates the coarse structure of SI data that adds a layer of complexity to the estimation problem. Source: Gressani O, Hens N. (2025). Nonparametric serial interval estimation with uniform mixtures. PLoS Comput Biol 21(8): e1013338. A recent article by Gressani and Hens (2025) published in PLOS Computational Biology proposes a new estimator of the cumulative distribution function of the serial interval without making parametric assumptions regarding the underlying SI distribution. The estimator is based on mixtures of uniform distributions and only requires left and right bounds of serial interval windows of infector-infectee pairs as a main input (\(s_{iL}\) and \(s_{iR}\) in the above figure). Point estimates of different serial interval features are available in closed-form and the bootstrap is used to compute confidence intervals. The nonparametric methodology is relatively simple and computationally fast and stable. Moreover, a user-friendly routine is available in the EpiDelays package written in R. This post aims at giving users a simple first experience with this new nonparametric methodology for serial interval estimation. The package can be installed from GitHub (using devtools) as follows: install.packages("devtools") devtools::install_github("oswaldogressani/EpiDelays") Simulated data The estimSI() routine of the EpiDelays package can be used to compute nonparametric estimates (point estimates with standard errors and confidence intervals) of different serial interval features (e.g. the mean, median, standard deviation). The routine is simple to use and requires only two inputs: x: A data frame with \(n\) rows (corresponding to the number of transmission pairs for which illness onset data is available) and two columns containing the lower bound of the SI window \(s_{iL}\) (first column) and the upper bound of the SI window \(s_{iR}\) (second column). nboot: An integer for the bootstrap sample size (default is 2000) used to construct (\(90\%\) and \(95\%\)) confidence intervals (CIs). We start by illustrating the use of estimSI() on simulated data. The simSI() routine can be used to simulate artificial serial interval data with SI windows having a width (coarseness) of at least two days. The underlying target SI distribution is assumed to have a Gaussian distribution with mean muS and standard deviation sdS that have to be specified by the user. The code below can be used to generate \(n=15\) SI windows from a Gaussian distribution with a mean of \(3\) days and standard deviation of \(2\) days. More details regarding the data generating mechanism can be found in the article. set.seed(2025) simdata