GSoC'25 @ The R Project for Statistical Computing
Weekly updates on my Google Summer of Code 2025 project, optimizing performance testing workflows for R packages by caching built package versions between CI runs.
This summer I got selected to work on a project with The R Project for Statistical Computing as part of Google Summer of Code 2025. The project is titled Optimizing a performance testing workflow by reusing minified R package versions between CI runs, and my mentors are Anirban Chetia and Toby Dylan Hocking.
The core aim is to streamline a critical part of R developers' workflow. Many R packages (data.table being a prime example) use the atime package for performance benchmarking across different versions. This is crucial for identifying performance regressions or improvements when reviewing contributions. Currently, these CI performance tests can be quite time-consuming, often rebuilding multiple package versions repeatedly. My project implements a caching mechanism that reuses previously built package versions across different CI runs.
A key challenge in CI is securely handling pull requests from external forks, as these forks typically don't have access to repository secrets (needed to, say, comment back on the PR). I will be implementing a two-step process: the first step builds the package versions and runs the performance tests, and the second step comments the results back on the PR.
qualification tests
As part of the application process, I completed a series of qualification tests in January 2025. Through these tests, I laid the groundwork for a significant portion of the project:
- Package Minification Script (Easy Task): A script that takes any R package tarball, strips out unnecessary files (vignettes, documentation, tests), and installs this "minified" version. This directly addresses the goal of reducing package size for faster CI installations.
- GitHub Action for Minification and Artifact Upload (Medium Task): Building on the minification script, a GitHub Action that reads a package name and version, checks if a minified version already exists (a precursor to caching), minifies it if not, and uploads the result as a build artifact.
- Supporting PRs from Forks in Autocomment-atime-results (Hard Task): Modified the existing workflow to support pull requests from external forks. The solution is a two-part workflow: one part runs the performance tests and uploads the results as artifacts (which can safely run on forked PRs), and a separate trusted part downloads those artifacts and comments the results onto the pull request.
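To make the minification idea concrete, here is a minimal sketch of such a script. The function name, stripped directories, and file naming are illustrative assumptions, not the project's exact implementation:

```shell
#!/bin/sh
# Illustrative minification step: take a source tarball like pkg_1.0.0.tar.gz,
# strip directories that are not needed to install and load the package,
# and repack it as pkg_1.0.0_min.tar.gz.
minify_tarball() {
  tarball="$1"
  workdir=$(mktemp -d)
  tar -xzf "$tarball" -C "$workdir"
  pkgdir=$(ls "$workdir" | head -n 1)  # top-level directory = package name
  # Vignettes, prebuilt docs, and tests are not required at install time.
  rm -rf "$workdir/$pkgdir/vignettes" \
         "$workdir/$pkgdir/inst/doc" \
         "$workdir/$pkgdir/tests"
  minified="${tarball%.tar.gz}_min.tar.gz"
  tar -czf "$minified" -C "$workdir" "$pkgdir"
  rm -rf "$workdir"
  echo "$minified"
}

# The smaller tarball can then be installed as usual, e.g.:
#   R CMD INSTALL "$(minify_tarball mypkg_1.0.0.tar.gz)"
```

Installing from the smaller tarball is what saves time in CI, since less content is unpacked and copied on every run.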
week 0: community bonding period
In my initial plan, I thought of implementing the caching mechanism in the atime package itself. I made this PR for feedback, and it turned out I had missed a simpler approach. Toby suggested caching the package's library directory directly in the CI workflow, rather than duplicating files in a separate atime cache directory, which would be more efficient.
week 1: implementing caching in CI
I focused on implementing the caching mechanism directly in the CI workflow, starting from the GitHub Actions workflow I had modified for the Hard Task. I cached both the library directory and the built libgit2 files; rebuilding libgit2 on every run was another source of wasted CI time.
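In a GitHub Actions workflow, this kind of caching can be sketched with the actions/cache action. The paths and cache keys below are illustrative assumptions, not the project's exact configuration:

```yaml
# Illustrative caching steps (paths and keys are assumptions).
- name: Cache R library directory
  uses: actions/cache@v4
  with:
    path: ~/R/library
    key: r-library-${{ runner.os }}-${{ hashFiles('DESCRIPTION') }}

- name: Cache built libgit2
  uses: actions/cache@v4
  with:
    path: ~/libgit2
    key: libgit2-${{ runner.os }}
```

On a cache hit these steps restore the directories before the tests run, so the package versions and libgit2 do not need to be rebuilt.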
week 2: combining the two steps into one action
Anirban had previously suggested having one action that would run the performance tests and comment results back on the PR. I made it two jobs in the same action: one for running the performance tests and uploading results as artifacts, and another for downloading those artifacts and commenting them on the PR. This simplifies the workflow and makes it easier to adopt in any package that uses autocomment-atime-results.
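The two-job layout can be sketched roughly like this (job names, artifact names, and paths are illustrative):

```yaml
# Illustrative two-job layout (names and paths are assumptions).
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... build cached package versions and run the atime performance tests ...
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: atime-results
          path: results/

  comment:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Download results
        uses: actions/download-artifact@v4
        with:
          name: atime-results
      # ... post the downloaded results as a PR comment ...
```

The `needs: test` dependency ensures the commenting job only runs after the test job has uploaded its artifacts.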
week 3 & 4: delays, implementing feedback & changing approach (again)
In the third week, I was partly occupied traveling to the Warpspeed: Agentic AI Hackathon 2025, but I did finally open a PR on autocomment-atime-results. I changed the approach again: to support forks, instead of two jobs, I used the pull_request_target event, which runs the workflow in the context of the base repository, giving it access to secrets so it can comment on the PR, while ensuring the token is exposed only to the commenting step.
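A rough sketch of this pattern (step names and the commenting command are illustrative, not the exact workflow):

```yaml
# Illustrative pull_request_target workflow (details are assumptions).
on:
  pull_request_target:  # runs in the base repository's context, so secrets exist

jobs:
  atime:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}  # check out the fork's code
      # ... build package versions and run the performance tests,
      #     with no secrets exposed to these steps ...
      - name: Comment results on the PR
        env:
          # the token is passed only to this step
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: gh pr comment ${{ github.event.pull_request.number }} --body-file results.md
```

Because pull_request_target checks out and runs code from the fork, limiting the token to the final step is what keeps untrusted test code from ever seeing the secret.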
The other change was using the actions/cache action to cache the library directory instead of uploading it as an artifact. This had its own challenges: caches created during a pull request run are scoped to that PR's merge ref, so other PRs cannot restore them. The solution was to also run the action on events like push and workflow_dispatch, which create the cache on the base branch, making it accessible to all PRs.
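The trigger list for this would look something like the following (the branch name is an assumption). GitHub's cache access restrictions allow a run to restore caches created on its own branch or on the base/default branch, which is why populating the cache from push and workflow_dispatch runs makes it visible to every PR:

```yaml
# Illustrative triggers (branch name is an assumption).
on:
  push:
    branches: [main]    # populates the cache on the base branch
  pull_request_target:  # PR runs restore the base-branch cache
  workflow_dispatch:    # manual runs can also (re)build the cache
```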