18% of Github Projects Have CI
Introduction
In this post, I’ll share my analysis of the adoption of continuous integration (CI) in public Github repositories, based on usage of the following CI vendors:
- TravisCI
- CircleCI
- Github Actions
- Jenkins
- Azure Pipelines
- Google Cloud Build
- TeamCity
I’m the first to admit that this is not a comprehensive list of CI vendors.
It’s also obvious that using a CI vendor for executing your builds is not a requirement for having CI.
That being said, we can dig into public data provided by Github to get a lower-bound estimate.
Methodology
Github has provided a public dataset of code that can be queried via Google’s BigQuery database service.
I followed Google’s “Codelab” for querying this dataset using SQL, running a basic query to identify repositories in their test dataset.
To decide whether a repository uses a given CI vendor, we need one rule per vendor that identifies repositories configured for it.
For example, in modern versions of Jenkins it’s encouraged to include a Jenkinsfile in the project root of your repository.
If this file is present then Jenkins will process its contents as a manifest for how to build your software.
This type of file-based configuration is not exclusive to Jenkins, and it allows us to count repositories that are using file-based manifests for configuring CI.
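Here is a list of rules for detecting if a repository is configured for CI with the given vendor:
- TravisCI: a .travis.yml file
- CircleCI: a .circleci/config.yml or circle.yml file
- Jenkins: a Jenkinsfile
- Github Actions: any path starting with .github/workflows
- Azure Pipelines: an azure-pipelines.yml file
- TeamCity: any path under .teamcity/
- Google Cloud Build: a cloudbuild.yaml file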
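The same file-based rules can be sketched as a small Python function. This is only an illustration: the function name `detect_ci_vendor` is hypothetical, and the path patterns are copied from the SQL query used in this post rather than from any vendor's documentation.

```python
from typing import Optional


def detect_ci_vendor(path: str) -> Optional[str]:
    """Return the CI vendor a file path points to, or None if no rule matches.

    Hypothetical helper: it mirrors the CASE expression in the SQL query.
    """
    if path == ".travis.yml":
        return "TravisCI"
    if path in (".circleci/config.yml", "circle.yml"):
        return "CircleCI"
    if path == "Jenkinsfile":
        return "Jenkins"
    # Prefix match, like the SQL pattern '.github/workflows%'.
    if path.startswith(".github/workflows"):
        return "Github Actions"
    if path == "azure-pipelines.yml":
        return "Azure Pipelines"
    # Prefix match, like the SQL pattern '.teamcity/%'.
    if path.startswith(".teamcity/"):
        return "TeamCity"
    if path == "cloudbuild.yaml":
        return "Google Cloud Build"
    return None
```

Note that the exact-match rules only fire for files in the repository root, so a `Jenkinsfile` nested in a subdirectory would not be counted.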
Here is the SQL query I used to get a count of repositories that had each of these files:
-- Map each file path to the CI vendor it indicates.
-- Paths that match no rule fall into a NULL group.
SELECT
  CASE
    WHEN path = '.travis.yml' THEN 'TravisCI'
    WHEN path = '.circleci/config.yml' OR path = 'circle.yml' THEN 'CircleCI'
    WHEN path = 'Jenkinsfile' THEN 'Jenkins'
    WHEN path LIKE '.github/workflows%' THEN 'Github Actions'
    WHEN path = 'azure-pipelines.yml' THEN 'Azure Pipelines'
    WHEN path LIKE '.teamcity/%' THEN 'TeamCity'
    WHEN path = 'cloudbuild.yaml' THEN 'Google Cloud Build'
  END AS ci_vendor,
  -- Count each repository once per vendor, however many matching files it has.
  COUNT(DISTINCT repo_name) AS num_repos
FROM `bigquery-public-data.github_repos.files`
GROUP BY 1
ORDER BY 2 DESC
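To see how the grouping behaves, here is a minimal sketch that runs a query of the same shape against a toy table. Everything in it is an assumption for illustration: the rows are made up, Python's built-in `sqlite3` stands in for BigQuery, only two of the rules are included for brevity, and the aliases `ci_vendor` and `num_repos` are my own naming.

```python
import sqlite3

# Hypothetical rows standing in for the (repo_name, path) columns of the files table.
ROWS = [
    ("alice/app", ".travis.yml"),
    ("alice/app", "README.md"),
    ("bob/lib", ".travis.yml"),
    ("bob/lib", ".circleci/config.yml"),
    ("carol/site", "index.html"),
]

# Same shape as the BigQuery query, trimmed to two vendors.
QUERY = """
SELECT
  CASE
    WHEN path = '.travis.yml' THEN 'TravisCI'
    WHEN path = '.circleci/config.yml' OR path = 'circle.yml' THEN 'CircleCI'
  END AS ci_vendor,
  COUNT(DISTINCT repo_name) AS num_repos
FROM files
GROUP BY 1
ORDER BY 2 DESC
"""


def count_repos(rows):
    """Load rows into an in-memory SQLite table and return {ci_vendor: num_repos}."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (repo_name TEXT, path TEXT)")
    conn.executemany("INSERT INTO files VALUES (?, ?)", rows)
    result = dict(conn.execute(QUERY).fetchall())
    conn.close()
    return result


counts = count_repos(ROWS)
```

In this toy data, `bob/lib` is counted under both TravisCI and CircleCI, and the `None` (NULL) group only contains repositories that have at least one file matching no rule.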
Results
CI Vendor | Number of Repos | Percentage of Repos |
Total | 3345134 | 100.00% |
TravisCI | 565369 | 16.90% |
CircleCI | 28608 | 0.86% |
Github Actions | 5532 | 0.17% |
Jenkins | 5389 | 0.16% |
Azure Pipelines | 1829 | 0.05% |
Google Cloud Build | 253 | 0.01% |
TeamCity | 72 | 0.00% |
Adding up the vendor rows gives a grand total of about 18.1% of repositories in this dataset that have been set up with any of these CI vendors. A repository that uses two vendors is counted once per vendor in that sum.
TravisCI dominates: a .travis.yml file is present in 16.9% of repos.
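The percentages can be re-derived from the counts in the results table. The sketch below is just that arithmetic; the variable and function names are hypothetical, and the combined figure assumes the headline number comes from summing the vendor rows.

```python
# Counts copied from the results table.
TOTAL_REPOS = 3345134
VENDOR_COUNTS = {
    "TravisCI": 565369,
    "CircleCI": 28608,
    "Github Actions": 5532,
    "Jenkins": 5389,
    "Azure Pipelines": 1829,
    "Google Cloud Build": 253,
    "TeamCity": 72,
}


def percentage(count: int, total: int) -> float:
    """Share of `total` represented by `count`, as a percentage rounded to 2 places."""
    return round(100 * count / total, 2)


# Per-vendor share of all repositories in the dataset.
shares = {vendor: percentage(n, TOTAL_REPOS) for vendor, n in VENDOR_COUNTS.items()}

# Sum of the vendor rows; repositories using several vendors are counted more than once.
combined = percentage(sum(VENDOR_COUNTS.values()), TOTAL_REPOS)
```

Because the per-vendor counts are summed, `combined` is an upper bound on the share of repositories using at least one of these vendors.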
Here are some data visualizations to help illustrate the results:
Discussion
With only 18% adoption, that leaves 82% of these public repositories not using popular CI vendors.
This is a huge opportunity to make CI a default and easier to adopt.
Areas of future research:
- relationship between CI adoption and popularity of repository
- analysis of a repository’s dependencies & their use of CI
- analysis of CI adoption by programming language
- when in the repository’s timeline was CI introduced (at the beginning? later?)
The area I am most curious about is automatically profiling a project’s CI runtimes and the factors that introduce flakiness to its builds.
These two characteristics of software reflect:
- the mental health of software engineers working on it
- the ability to make changes quickly and confidently to it
Please email me with comments or suggestions at max.mautner@gmail.com